Why AI Progress Is Increasingly Invisible

OpenAI co-founder Ilya Sutskever made waves in November when he suggested that advancements in AI are slowing down, explaining that simply scaling up AI models was no longer delivering proportional performance gains.

Sutskever’s comments came on the heels of reports in The Information and Bloomberg that Google and Anthropic were also experiencing similar slowdowns. This led to a wave of articles declaring that AI progress has hit a wall, lending further credence to an increasingly widespread feeling that chatbot capabilities haven’t improved significantly since OpenAI released GPT-4 in March 2023.

On Dec. 20, OpenAI announced o3, its latest model, and reported new state-of-the-art performance on a number of the most challenging technical benchmarks out there, in many cases improving on the previous high score by double-digit percentage points. I believe that o3 signals that we are in a new paradigm of AI progress. Even François Chollet, a co-creator of the prominent ARC-AGI benchmark whom some consider an AI scaling skeptic, writes that the model represents a “genuine breakthrough.”

However, in the weeks after OpenAI announced o3, many mainstream news sites made no mention of the new model. Around the time of the announcement, readers would find headlines at the Wall Street Journal, WIRED, and the New York Times suggesting AI was actually slowing down. The muted media response suggests that there is a growing gulf between what AI insiders are seeing and what the public is told.

Indeed, AI progress hasn’t stalled—it’s just become invisible to most people.

Automating behind-the-scenes research

First, AI models are getting better at answering complex questions. For example, in June 2023, the best AI model barely scored better than chance on the hardest set of “Google-proof” PhD-level science questions. In September, OpenAI’s o1 model became the first AI system to surpass the scores of human domain experts. And in December, OpenAI’s o3 model improved on those scores by another 10%. 

However, the vast majority of people won’t notice this kind of improvement because they aren’t doing graduate-level science work. But it will be a huge deal if AI starts meaningfully accelerating research and development in scientific fields, and there is some evidence that such an acceleration is already happening. A groundbreaking paper by Aidan Toner-Rodgers at MIT recently found that materials scientists assisted by AI systems “discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation.” Still, 82% of the scientists report that the AI tools reduced their job satisfaction, mainly citing “skill underutilization and reduced creativity.”

But the Holy Grail for AI companies is a system that can automate AI research itself, theoretically enabling an explosion in capabilities that drives progress across every other domain. The recent improvements made on this front may be even more dramatic than those in the hard sciences.

In an attempt to provide more realistic tests of AI programming capabilities, researchers developed SWE-Bench, a benchmark that evaluates how well AI agents can fix actual open problems in popular open-source software. The top score on the verified benchmark a year ago was 4.4%. The top score today is closer to 72%, achieved by OpenAI’s o3 model.
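
To give a rough sense of how this kind of evaluation works (the real SWE-Bench harness is considerably more involved), the sketch below shows the basic flow: take a checked-out copy of the repository at the state of the issue, apply the patch the model proposes, and run the tests associated with that issue. The helper names, directory layout, and use of `git apply` here are illustrative assumptions, not the benchmark’s actual code.

```python
# Illustrative sketch of a SWE-Bench-style scoring step (hypothetical helpers,
# not the benchmark's real harness).
import subprocess

def apply_patch(repo_dir: str, patch: str) -> bool:
    """Apply a model-generated diff to a checked-out copy of the repository."""
    proc = subprocess.run(["git", "apply", "-"], cwd=repo_dir, input=patch, text=True)
    return proc.returncode == 0

def tests_pass(repo_dir: str, test_command: list[str]) -> bool:
    """Run the tests tied to the issue; success means the fix actually works."""
    proc = subprocess.run(test_command, cwd=repo_dir)
    return proc.returncode == 0

def resolved(repo_dir: str, model_patch: str, test_command: list[str]) -> bool:
    """A task counts as resolved only if the patch applies cleanly
    and the previously failing tests now pass."""
    return apply_patch(repo_dir, model_patch) and tests_pass(repo_dir, test_command)
```

The benchmark’s headline number is simply the fraction of such tasks a system resolves, which is what makes the jump from roughly 4% to roughly 72% easy to interpret.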

This remarkable improvement—from struggling with even the simplest fixes to successfully handling nearly three-quarters of the set of real-world coding tasks—suggests AI systems are rapidly gaining the ability to understand and modify complex software projects. This marks a crucial step toward automating significant portions of software research and development. And this process appears to be well underway. Google’s CEO recently told investors that “more than a quarter of all new code at Google is generated by AI.”

Much of this progress has been driven by improvements to the “scaffolding” built around AI models like GPT-4o, layers of software that increase the models’ autonomy and ability to interact with the world. Even without further improvements to base models, better scaffolding can make AI significantly more capable and “agentic,” a word researchers use to describe an AI model that can act autonomously, make decisions, and adapt to changing circumstances. AI agents are often given the ability to use tools and take multi-step actions on a user’s behalf. Transforming passive chatbots into agents has only become a core focus of the industry in the last year, and progress has been swift.
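
For a sense of what such scaffolding looks like in practice, here is a minimal sketch of an agent loop, assuming a hypothetical `call_model` function standing in for any chat-model API: the model is queried repeatedly, can request a tool such as a shell command, and sees the result of each tool call before deciding what to do next. None of this reflects any particular company’s implementation.

```python
# Minimal agent-loop sketch. `call_model` is a hypothetical stand-in for a
# chat-model API; the tool set here is a single shell tool for illustration.
import subprocess

def run_shell(command: str) -> str:
    """Tool: run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"run_shell": run_shell}

def call_model(messages: list[dict]) -> dict:
    """Hypothetical model call. A real implementation would return either
    {"type": "answer", "text": ...} or
    {"type": "tool", "name": ..., "arguments": {...}}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    """Let the model alternate between reasoning and tool use until it answers."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["type"] == "answer":      # the model says it is finished
            return reply["text"]
        tool = TOOLS[reply["name"]]        # the model asked to use a tool
        observation = tool(**reply["arguments"])
        messages.append({"role": "tool", "content": observation})
    return "Stopped: step limit reached."
```

The loop itself is simple; most of the recent gains come from better tools, better prompting, and models that make better decisions at each step of a loop like this one.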

Perhaps the best head-to-head matchup of elite engineers and AI agents was published in November by METR, a leading AI evaluations group. The researchers created novel, realistic, challenging, and unconventional machine learning tasks to compare human experts and AI agents. While the AI agents beat human experts at two hours of equivalent work, the median engineer won at longer time scales.

But even at eight hours, the best AI agents still managed to beat well over one-third of the human experts. The METR researchers emphasized that there was a “relatively limited effort to set up AI agents to succeed at the tasks, and we strongly expect better elicitation to result in much better performance on these tasks.” They also highlighted how much cheaper the AI agents were than their human counterparts.

The problem with invisible innovation

The hidden improvements in AI over the last year may not represent as big a leap in overall performance as the jump between GPT-3.5 and GPT-4. And it is possible we don’t see a jump that big ever again. But the narrative that there hasn’t been much progress since then is undermined by significant under-the-radar advancements. And this invisible progress could leave us dangerously unprepared for what is to come. 

The big risk is that policymakers and the public tune out this progress because they can’t see the improvements first-hand. Everyday users will still encounter frequent hallucinations and basic reasoning failures, which also get triumphantly amplified by AI skeptics. These obvious errors make it easy to dismiss AI’s rapid advancement in more specialized domains. 

There’s a common view in the AI world, shared by both proponents and opponents of regulation, that the U.S. federal government won’t mandate guardrails on the technology unless there’s a major galvanizing incident. Such an incident, often called a “warning shot,” could be innocuous, like a credible demonstration of dangerous AI capabilities that doesn’t harm anyone. But it could also take the form of a major disaster caused or enabled by an AI system, or a society upended by devastating labor automation. 

The worst-case scenario is that AI systems become frighteningly powerful but no warning shots are fired (or heeded) before a system permanently escapes human control and acts decisively against us.

Last month, Apollo Research, an evaluations group that works with top AI companies, published evidence that, under the right conditions, the most capable AI models were able to scheme against their developers and users. When given instructions to strongly follow a goal, the systems sometimes attempted to subvert oversight, fake alignment, and hide their true capabilities. In rare cases, systems engaged in deceptive behavior without nudging from the evaluators. When the researchers inspected the models’ reasoning, they found that the chatbots knew what they were doing, using language like “sabotage, lying, manipulation.”

This is not to say that these models are imminently about to conspire against humanity. But there has been a disturbing trend: as AI models get smarter, they get better at following instructions and understanding the intent behind their guidelines, but they also get better at deception. Smarter models may also be more likely to engage in dangerous behavior. For instance, one of the world’s most capable models, OpenAI’s o1, was far more likely to double down on a lie after being caught by the Apollo evaluators.

I fear that the gap between AI’s public face and its true capabilities is widening. While consumers see chatbots that still can’t count the letters in “strawberry,” researchers are documenting systems that can match PhD-level expertise and engage in sophisticated deception. This growing disconnect makes it harder for the public and policymakers to gauge AI’s real progress—progress they’ll need to understand to govern it appropriately. The risk isn’t that AI development has stalled; it’s that we’re losing our ability to track where it’s headed.
