News story

New Guidance for Evaluating the Impact of AI Tools

The Evaluation Task Force have recently published a new annex to the Magenta Book which covers best practice for impact evaluation of AI tools and technologies.

In December the Evaluation Task Force published a new annex to the Magenta Book, focusing on best practice for evaluating the impact of AI tools and technologies. The guidance will enhance the safety and confidence with which government departments and agencies can adopt AI technologies, ensuring that public sector innovation keeps pace with the private sector. It reflects an understanding of the unique challenges posed by AI and the need for tailored approaches to address them.

The guidance has been co-produced with the Department for Transport and Frontier Economics, in consultation with leading AI specialists. It is expected to be a valuable resource for policymakers, public sector professionals, and digital specialists working to integrate AI solutions into government operations. Moving forwards, the guidance will be co-owned with the Central Digital and Data Office (CDDO).

What does the guidance cover?

The guidance details best practice, including evaluation design, methodology, and timing, for evaluating the impact of new AI tools and technologies being introduced in the public sector. In particular, it advocates for the use of randomised controlled trials (RCTs) when testing a new AI product, to produce high-quality evidence on the intended and unintended impacts of introducing these new technologies. The guidance also includes a series of hypothetical case studies to illustrate possible high-quality approaches to evaluating the impact of different types of AI tools.
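
As a purely illustrative aside (not drawn from the guidance itself), the minimal Python sketch below shows the core logic of a randomised controlled trial in this context: participants are randomly assigned to use, or not use, a hypothetical AI tool, and the impact estimate is the difference in mean outcomes between the two groups. The tool, the outcome measure, and the effect size are all invented for illustration.

```python
import random
import statistics

# Illustrative sketch only: an RCT comparing case-handling time with
# and without a hypothetical AI drafting tool. All names and numbers
# below are invented and are not taken from the Magenta Book annex.

random.seed(42)

# Randomly assign 200 caseworkers to treatment (AI tool) or control.
participants = list(range(200))
random.shuffle(participants)
treatment, control = participants[:100], participants[100:]

def minutes_taken(uses_ai_tool: bool) -> float:
    """Simulated outcome: minutes to process one case."""
    base = random.gauss(60, 10)          # assumed baseline of ~60 minutes
    return base - (8 if uses_ai_tool else 0)  # assumed 8-minute saving

treated_outcomes = [minutes_taken(True) for _ in treatment]
control_outcomes = [minutes_taken(False) for _ in control]

# Estimated impact: difference in mean outcomes between the two groups.
effect = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
print(f"Estimated impact of the AI tool: {effect:+.1f} minutes per case")
```

Because assignment is random, any systematic difference in outcomes between the groups can be attributed to the tool itself rather than to pre-existing differences between caseworkers, which is what makes RCTs a source of high-quality evidence on both intended and unintended impacts.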

Please note: this guidance does not address how to evaluate the quality, safety and accuracy of new AI tools. That process is typically referred to as “model evaluation” or assurance, and is usually carried out by Digital, Data and Technology (DDaT) professionals rather than social researchers. Instead, the new AI guidance focuses on the impact of AI tools on decisions and outcomes. An example of an impact evaluation of an AI tool and an example of a model evaluation of an AI tool are also available.

Why is this guidance important?

Recent growth in the capabilities of artificial intelligence (AI) technologies has led to increased interest in the use of AI in government. Robustly evaluating the impact of AI use in government (including process, impact and value for money questions) is essential to making sure we understand the impact of new AI systems, are able to improve current interventions, and can inform future policy development. By providing a framework for assessing the impact and effectiveness of AI tools, the guidance underscores the government's commitment to maintaining high standards of evaluation and accountability in its use of emerging technologies.

What happens next?

The Evaluation Task Force will be working with CDDO to help embed evaluation best practice in digital processes across government, and to support colleagues designing and delivering impact evaluations of AI interventions. If you have a project or piece of work related to AI that you'd like to discuss, you can get in touch with the Evaluation Task Force at: [email protected].

Examples of best practice

Model testing and development

  • The AI Safety Institute approach to evaluations (please note this guidance refers to risk evaluation of AI models, rather than impact evaluation)
  • A guide to model testing and evaluation (rather than impact evaluation)
  • A resource designed for model evaluation rather than impact evaluation
  • A resource from the NHS AI Lab

Published 27 January 2025