AI Is Getting Better at Science. OpenAI Is Testing How Far It Can Go

Demis Hassabis founded DeepMind to “solve intelligence” and then use that to “solve everything else.” Sam Altman promised that “the gains to quality of life from AI driving faster scientific progress … will be enormous.” Dario Amodei of Anthropic predicted that as soon as 2026, AI progress could produce a “country of geniuses in a data center.” Of all the foundational myths driving the AI boom, the hope that AI might help humanity understand the universe is among the most enduring.

FrontierScience, a new benchmark published Tuesday by OpenAI, suggests that AI models are advancing toward that goal—and highlights the difficulty of testing models’ capabilities as they become ever more competitive with human scientists. “We want to rigorously measure how models can improve scientific capabilities and maybe even accelerate scientific discovery,” says Miles Wang, a researcher on the evaluation team at OpenAI who led the work.

The benchmark contains questions in physics, chemistry, and biology in two tiers of difficulty. Olympiad-level questions test “the frontier of what a lot of brilliant young minds are able to do,” says Wang. A more challenging Research tier, with questions written by Ph.D. scientists, tests “open-ended reasoning, judgment, and the ability to support real-world research.”

One sample research question stretched to two paragraphs, asking about “meso-nitrogen atoms in nickel(II) phthalocyanine.” Running the computer simulations to solve it “could take several days,” says Francisco Martin-Martinez, a senior lecturer in chemistry at King’s College London.

Another asked for a derivation of “electrostatic wave modes” in plasma. “I did a similar analysis earlier this year for a different kind of wave … I think it took about 3 weeks to do the maths correctly,” Tom Ashton-Key, a Ph.D. researcher in plasma physics at Imperial College London, told TIME. “5-10% of my time is answering questions similar to this.”

The benchmark results show the same trend that is driving much of the AI boom: a line going up and to the right. “We started making this benchmark months ago, and the progress wasn’t that high,” says Wang. By the time the paper was published, however, things had changed. “Progress has been intensely fast over the last year with [reinforcement learning] and reasoning models.”

OpenAI’s recently released GPT-5.2 is the top performer on the benchmark, achieving 77.1% on the Olympiad tier and 25.3% on Research—although its improvement over its predecessor, GPT-5, is negligible in the latter category. If and when they approach 100% on the Research tier, AI models will be “a very good collaborator and multiply the progress that Ph.D. students or scientists can do,” according to Wang.

However, FrontierScience “does not measure all the important capabilities in science,” says Wang. Since the questions are text-only, models aren’t being tested on the ability to perform experiments, or to analyze images and videos. Small question sets—100 questions in the Olympiad tier, 60 in the Research tier—mean that it’s hard to make reliable comparisons between closely performing models, and the paper lacks a human baseline showing how humans would fare on the questions.
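To get a sense of why 60 questions is limiting, here is a rough back-of-the-envelope sketch (an illustration, not something from the paper) that treats each question as an independent pass/fail trial and estimates the statistical noise on a reported score:

```python
import math

def score_uncertainty(accuracy: float, num_questions: int) -> float:
    """Approximate one-standard-deviation noise on a benchmark score,
    modeling each question as an independent pass/fail (binomial) trial."""
    return math.sqrt(accuracy * (1 - accuracy) / num_questions)

# GPT-5.2's reported 25.3% on the 60-question Research tier:
sigma = score_uncertainty(0.253, 60)
print(f"~±{sigma * 100:.1f} percentage points")  # roughly ±5.6 points
```

Under that simple model, two models whose Research-tier scores differ by only a few percentage points are within the noise, which is why small question counts make head-to-head rankings hard to trust.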

“I expect the benchmark to be highly correlated with existing work … and not that informative about when the models will be actually useful to assist research, but it’s very hard to do otherwise with a benchmark,” Jaime Sevilla, director of the research institute Epoch AI, told TIME in an email. “Overall, it looks like a good addition to the benchmarking ecosystem.”

These issues are broader than just this benchmark. “We’re hitting the edge of what we can reliably evaluate as a layperson,” says Wang. “It gets really expensive, both in terms of time and cost, to reliably find very specialized domain experts.” When the person writing the question is one of the few world experts on the topic, it’s hard to find a third party to tell you how hard the problem is.

The challenge of finding experts to construct benchmarks is handled outside OpenAI by expert data annotation companies such as Mercor and Surge AI, both of which are valued at over $10 billion. They source experts from academic institutions to design questions and rubrics to grade the models’ responses. “If you want to see the Riemann hypothesis proved in your lifetime, what do you want to do? You’re going to help train an AI to either solve it or to collaborate with AI on solving it,” says Edwin Chen, founder and CEO of Surge AI.

AI has already had a substantial impact on scientific work. Google DeepMind’s AlphaFold has predicted more than 200 million protein structures, which would take hundreds of millions of years to find experimentally, according to the company. Another project aims to simulate and control the plasma inside a fusion reactor. A third uses AI systems to produce detailed weather forecasts.

For the most part, however, these are narrow applications of AI that target a tiny part of a single field. “AlphaFold gives you the structure of the protein and how it folds, but it doesn’t tell you anything about the electronic properties of it or where the electrons are,” says Martin-Martinez.

For many AI companies and startups, the grand prize is an AI that can help with the entire scientific process—from designing experiments to analyzing data—across a wide range of fields.

Large language models (LLMs) promise exactly that sort of generality. In math and coding, they are beginning to deliver results. Sebastien Bubeck, a mathematician now working at OpenAI, gave GPT-5 a problem that he and his graduate students had failed to solve for years. “We let it think for two days,” says Bubeck. “There was a miraculous identity in there that the model had found, and it actually solved the problem.”

Coding tasks that used to take four hours now take Keith Butler, an associate professor in chemistry at University College London, thirty minutes. “I’m actually able to do coding again,” he says. But when it comes to actually making discoveries or proposing new hypotheses in his field, he’s “a little more skeptical.”

Others are more skeptical still. “The amount of stupid things that come out from any LLM is so colossal, it’s completely unreliable,” says Carlo Rovelli, a theoretical physicist at Aix-Marseille University.

“For the moment, they are an enormous burden, because journals are being submerged by submissions,” says Rovelli, adding that the number of submissions to the Foundations of Physics journal, where he is chief editor, has more than doubled in the last year. “Most of it is just people who think they’re doing great science by having conversations with LLMs—and it’s horrible.”

If the trend indicated by FrontierScience continues, LLMs may soon make more reliable research assistants. This leaves Martin-Martinez excited but “lost” by the pace of progress. “Too many feelings. I need a LLM to summarize them,” he says.
