Despite their expertise, AI developers don’t always know what their most advanced systems are capable of—at least, not at first. To find out, systems are subjected to a range of tests—often called evaluations, or ‘evals’—designed to tease out their limits. But due to rapid progress in the field, today’s systems regularly achieve top scores on many popular tests, including SATs and the U.S. bar exam, making it harder to judge just how quickly they are improving.
A new set of much more challenging evals has emerged in response, created by companies, nonprofits, and governments. Yet even on the most advanced evals, AI systems are making astonishing progress. In November, the nonprofit research institute Epoch AI announced a set of exceptionally challenging math questions developed in collaboration with leading mathematicians, called FrontierMath, on which currently available models scored only 2%. Just one month later, OpenAI’s newly-announced o3 model achieved a score of 25.2%, which Epoch’s director, Jaime Sevilla, describes as “far better than our team expected so soon after release.”
Amid this rapid progress, these new evals could help the world understand just what advanced AI systems can do, and—with many experts worried that future systems may pose serious risks in domains like cybersecurity and bioterrorism—serve as early warning signs, should such threatening capabilities emerge in the future.
Harder than it sounds
In the early days of AI, capabilities were measured by evaluating a system’s performance on specific tasks, like classifying images or playing games, with the time between a benchmark’s introduction and an AI matching or exceeding human performance typically measured in years. It took five years, for example, before AI systems surpassed humans on the ImageNet Large Scale Visual Recognition Challenge, established by Professor Fei-Fei Li and her team in 2010. And it was only in 2017 that an AI system (Google DeepMind’s AlphaGo) was able to beat the world’s number one ranked player in Go, an ancient, abstract Chinese board game—almost 50 years after the first program attempting the task was written.
The gap between a benchmark’s introduction and its saturation has decreased significantly in recent years. For instance, the GLUE benchmark, designed to test an AI’s ability to understand natural language by completing tasks like deciding if two sentences are equivalent or determining the correct meaning of a pronoun in context, debuted in 2018. It was considered solved one year later. In response, a harder version, SuperGLUE, was created in 2019—and within two years, AIs were able to match human performance across its tasks.
Evals take many forms, and their complexity has grown alongside model capabilities. Virtually all major AI labs now “red-team” their models before release, systematically testing their ability to produce harmful outputs, bypass safety measures, or otherwise engage in undesirable behavior, such as deception. Last year, companies including OpenAI, Anthropic, Meta, and Google made voluntary commitments to the Biden administration to subject their models to both internal and external red-teaming “in areas including misuse, societal risks, and national security concerns.”
Other tests assess specific capabilities, such as coding, or evaluate models’ capacity and propensity for potentially dangerous behaviors like persuasion, deception, and large-scale biological attacks.
Perhaps the most popular contemporary benchmark is Measuring Massive Multitask Language Understanding (MMLU), which consists of about 16,000 multiple-choice questions spanning academic domains like philosophy, medicine, and law. OpenAI’s GPT-4o, released in May, achieved 88%, while the company’s latest model, o1, scored 92.3%. Because these large test sets sometimes contain problems with incorrectly labelled answers, attaining 100% is often not possible, explains Marius Hobbhahn, director and co-founder of Apollo Research, an AI safety nonprofit focused on reducing dangerous capabilities in advanced AI systems. Past a point, “more capable models will not give you significantly higher scores,” he says.
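The ceiling effect Hobbhahn describes can be made concrete with a toy scoring sketch (the numbers and data here are hypothetical, not MMLU’s actual questions): if a fraction of an eval’s answer keys are mislabelled, even a model that answers every question correctly tops out below 100%.

```python
# Toy illustration: scoring a multiple-choice eval whose answer key
# contains some mislabelled items (hypothetical data, not real MMLU).
def score(predictions, answer_key):
    """Fraction of predictions matching the (possibly noisy) key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

true_answers = ["A", "C", "B", "D"] * 250   # 1,000 questions
answer_key = list(true_answers)
answer_key[::100] = ["X"] * 10              # 1% of the labels are wrong

# A hypothetical model that answers every question correctly still
# scores below 100%, because it disagrees with the bad labels.
perfect_model = list(true_answers)
print(score(perfect_model, answer_key))     # 0.99
```

With a 1% label-error rate, the effective ceiling is 99%, so small score differences between frontier models near the top of the range stop being meaningful.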
Designing evals to measure the capabilities of advanced AI systems is “astonishingly hard,” Hobbhahn says—particularly since the goal is to elicit and measure the system’s actual underlying abilities, for which tasks like multiple-choice questions are only a proxy. “You want to design it in a way that is scientifically rigorous, but that often trades off against realism, because the real world is often not like the lab setting,” he says. Another challenge is data contamination, which occurs when the answers to an eval are contained in the AI’s training data, allowing it to reproduce memorized answers rather than reason from first principles.
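One crude contamination check—a sketch, not any lab’s actual pipeline—is to flag an eval question when a long word n-gram from it also appears verbatim in the training corpus; the 8-gram length and exact-match criterion below are illustrative assumptions.

```python
# Sketch of a simple data-contamination check: flag an eval question
# if any of its word 8-grams also appears in the training corpus.
# The n-gram length and exact-match criterion are illustrative choices.
def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, training_corpus, n=8):
    return bool(ngrams(question, n) & ngrams(training_corpus, n))

corpus = "the quick brown fox jumps over the lazy dog near the riverbank today"
leaked = "the quick brown fox jumps over the lazy dog near the river"
fresh = "compute the eigenvalues of a random symmetric matrix drawn from a gaussian"

print(is_contaminated(leaked, corpus))  # True
print(is_contaminated(fresh, corpus))   # False
```

Real checks are fuzzier—paraphrases and translations can leak answers without any exact n-gram overlap—which is one reason Epoch keeps its FrontierMath problems private.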
Another issue is that evals can be “gamed” when “either the person that has the AI model has an incentive to train on the eval, or the model itself decides to target what is measured by the eval, rather than what is intended,” says Hobbhahn.
A new wave
In response to these challenges, new, more sophisticated evals are being built.
Epoch AI’s FrontierMath benchmark consists of approximately 300 original math problems, spanning most major branches of the subject. It was created in collaboration with over 60 leading mathematicians, including Fields Medal winner Terence Tao. The problems vary in difficulty, with about 25% pitched at the level of the International Mathematical Olympiad, such that an “extremely gifted” high school student could in theory solve them if they had the requisite “creative insight” and “precise computation” abilities, says Tamay Besiroglu, Epoch’s associate director. Half the problems require “graduate level education in math” to solve, while the most challenging 25% of problems come from “the frontier of research of that specific topic,” meaning only today’s top experts could crack them, and even they may need multiple days.
Solutions cannot be derived by simply testing every possible answer, since the correct answers often take the form of 30-digit numbers. To avoid data contamination, Epoch is not publicly releasing the problems (beyond a handful, which are intended to be illustrative and do not form part of the actual benchmark). Even with a peer-review process in place, Besiroglu estimates that around 10% of the problems in the benchmark have incorrect solutions—an error rate comparable to other machine learning benchmarks. “Mathematicians make mistakes,” he says, noting they are working to lower the error rate to 5%.
Evaluating mathematical reasoning could be particularly useful because a system able to solve these problems may also be able to do much more. While careful not to overstate that “math is the fundamental thing,” Besiroglu expects any system able to solve the FrontierMath benchmark will be able to “get close, within a couple of years, to being able to automate many other domains of science and engineering.”
Another benchmark aiming for a longer shelf life is the ominously-named “Humanity’s Last Exam,” created in collaboration between the nonprofit Center for AI Safety and Scale AI, a for-profit company that provides high-quality datasets and evals to frontier AI labs like OpenAI and Anthropic. The exam aims to include between 20 and 50 times as many questions as FrontierMath, while also covering domains like physics, biology, and electrical engineering, says Summer Yue, Scale AI’s director of research. Questions are being crowdsourced from the academic community and beyond. To be included, a question needs to be unanswerable by all existing models. The benchmark is intended to go live in late 2024 or early 2025.
A third benchmark to watch is RE-Bench, designed to simulate real-world machine-learning work. It was created by researchers at METR, a nonprofit that specializes in model evaluations and threat research, and tests humans and cutting-edge AI systems across seven engineering tasks. Both humans and AI agents are given a limited amount of time to complete the tasks; while humans reliably outperform current AI agents on most of them, things look different when considering performance only within the first two hours. Current AI agents do best when given between 30 minutes and 2 hours, depending on the agent, explains Hjalmar Wijk, a member of METR’s technical staff. After this time, they tend to get “stuck in a rut,” he says, as AI agents can make mistakes early on and then “struggle to adjust” in the ways humans would.
“When we started this work, we were expecting to see that AI agents could solve problems only of a certain scale, and beyond that, that they would fail more completely, or that successes would be extremely rare,” says Wijk. It turns out that given enough time and resources, they can often get close to the performance of the median human engineer tested in the benchmark. “AI agents are surprisingly good at this,” he says. In one particular task—which involved optimizing code to run faster on specialized hardware—the AI agents actually outperformed the best humans, although METR’s researchers note that the humans included in their tests may not represent the peak of human performance.
These results don’t mean that current AI systems can automate AI research and development. “Eventually, this is going to have to be superseded by a harder eval,” says Wijk. But given that the possible automation of AI research is increasingly viewed as a national security concern—for example, in the National Security Memorandum on AI, issued by President Biden in October—future models that excel on this benchmark may be able to improve upon themselves, exacerbating human researchers’ lack of control over them.
Even as AI systems ace many existing tests, they continue to struggle with tasks that would be simple for humans. “They can solve complex closed problems if you serve them the problem description neatly on a platter in the prompt, but they struggle to coherently string together long, autonomous, problem-solving sequences in a way that a person would find very easy,” Andrej Karpathy, an OpenAI co-founder who is no longer with the company, wrote in a post on X in response to FrontierMath’s release.
Michael Chen, an AI policy researcher at METR, points to SimpleBench as an example of a benchmark consisting of questions that would be easy for the average high schooler, but on which leading models struggle. “I think there’s still productive work to be done on the simpler side of tasks,” says Chen. While there are debates over whether benchmarks test for underlying reasoning or just for knowledge, Chen says there is still a strong case for using MMLU and the Graduate-Level Google-Proof Q&A Benchmark (GPQA), introduced last year and one of the few recent benchmarks that has yet to become saturated—that is, AI models have not yet reliably achieved top scores, so further improvements remain meaningful. Even if they were just tests of knowledge, he argues, “it’s still really useful to test for knowledge.”
One eval seeking to move beyond just testing for knowledge recall is ARC-AGI, created by prominent AI researcher François Chollet to test an AI’s ability to solve novel reasoning puzzles. For instance, a puzzle might show several examples of input and output grids, where shapes move or change color according to some hidden rule. The AI is then presented with a new input grid and must determine what the corresponding output should look like, figuring out the underlying rule from scratch. Although these puzzles are intended to be relatively simple for most humans, AI systems have historically struggled with them. However, recent breakthroughs suggest this is changing: OpenAI’s o3 model has achieved significantly higher scores than prior models, which Chollet says represents “a genuine breakthrough in adaptability and generalization.”
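The puzzle format can be illustrated with a toy example (a made-up recoloring rule, far simpler than the real ARC-AGI tasks, which involve shapes, symmetry, and object manipulation): given demonstration pairs that share a hidden rule, infer the rule and apply it to a new input grid.

```python
# Toy ARC-style puzzle: each demonstration pair shows grids where every
# cell value has been recolored by a hidden mapping. This brute-force
# "solver" infers the cell-wise mapping from the examples and applies
# it to the test grid. (Illustrative only; not a real ARC-AGI task.)
def infer_color_map(examples):
    mapping = {}
    for grid_in, grid_out in examples:
        for row_in, row_out in zip(grid_in, grid_out):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("examples not explained by a color map")
    return mapping

def apply_map(mapping, grid):
    return [[mapping[c] for c in row] for row in grid]

examples = [
    ([[1, 2], [2, 1]], [[3, 4], [4, 3]]),   # hidden rule: 1 -> 3, 2 -> 4
    ([[2, 2], [1, 1]], [[4, 4], [3, 3]]),
]
rule = infer_color_map(examples)
print(apply_map(rule, [[1, 1], [2, 2]]))    # [[3, 3], [4, 4]]
```

What makes the real benchmark hard is that each puzzle uses a different, unstated rule, so a system cannot hard-code a single transformation like this one; it must induce the rule fresh from a handful of examples.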
The urgent need for better evaluations
New evals, simple and complex, structured and “vibes”-based, are being released every day. AI policy increasingly relies on them: evals are being made requirements of laws like the European Union’s AI Act, which is still in the process of being implemented, and major AI labs like OpenAI, Anthropic, and Google DeepMind have all made voluntary commitments to halt the release of their models, or take actions to mitigate possible harm, if evaluations identify particularly concerning capabilities.
On the basis of voluntary commitments, the U.S. and U.K. AI Safety Institutes have begun evaluating cutting-edge models before they are deployed. In October, they jointly released their findings in relation to the upgraded version of Anthropic’s Claude 3.5 Sonnet model, paying particular attention to its capabilities in biology, cybersecurity, and software and AI development, as well as to the efficacy of its built-in safeguards. They found that “in most cases the built-in version of the safeguards that US AISI tested were circumvented, meaning the model provided answers that should have been prevented.” They note that this is “consistent with prior research on the vulnerability of other AI systems.” In December, both institutes released similar findings for OpenAI’s o1 model.
However, there are currently no binding obligations for leading models to be subjected to third-party testing. That such obligations should exist is “basically a no-brainer,” says Hobbhahn, who argues that labs face perverse incentives when it comes to evals, since “the less issues they find, the better.” He also notes that mandatory third-party audits are common in other industries like finance.
While some for-profit companies, such as Scale AI, do conduct independent evals for their clients, most public evals are created by nonprofits and governments, which Hobbhahn sees as a result of “historical path dependency.”
“I don’t think it’s a good world where the philanthropists effectively subsidize billion dollar companies,” he says. “I think the right world is where eventually all of this is covered by the labs themselves. They’re the ones creating the risk.”
AI evals are “not cheap,” notes Epoch’s Besiroglu, who says that costs can quickly stack up, on the order of $1,000 to $10,000 per model, particularly if you run the eval for longer periods of time, or run it multiple times to create greater certainty in the result. While labs sometimes subsidize third-party evals by covering the costs of their operation, Hobbhahn notes that this does not cover the far greater costs of actually developing the evaluations. Still, he expects third-party evals to become a norm going forward, as labs will be able to point to them as evidence of due diligence in safety-testing their models, reducing their liability.
As AI models rapidly advance, evaluations are racing to keep up. Sophisticated new benchmarks—assessing things like advanced mathematical reasoning, novel problem-solving, and the automation of AI research—are making progress, but designing effective evals remains challenging, expensive, and, relative to their importance as early-warning detectors for dangerous capabilities, underfunded. With leading labs rolling out increasingly capable models every few months, the need for new tests to assess frontier capabilities is greater than ever. By the time an eval saturates, “we need to have harder evals in place, to feel like we can assess the risk,” says Wijk.