Why AI Safety Researchers Are Worried About DeepSeek

Why AI Safety Researchers Are Worried About DeepSeek

The release of DeepSeek R1 stunned Wall Street and Silicon Valley this month, spooking investors and impressing tech leaders. But amid all the talk, many overlooked a critical detail about the way the new Chinese AI model functions—a nuance that has researchers worried about humanity’s ability to control sophisticated new artificial intelligence systems.

It’s all down to an innovation in how DeepSeek R1 was trained—one that led to surprising behaviors in an early version of the model, which researchers described in the technical documentation accompanying its release.

[time-brightcove not-tgx=”true”]

During testing, researchers noticed that the model would spontaneously switch between English and Chinese while it was solving problems. When they forced it to stick to one language, thus making it easier for users to follow along, they found that the system’s ability to solve the same problems would diminish.

That finding rang alarm bells for some AI safety researchers. Currently, the most capable AI systems “think” in human-legible languages, writing out their reasoning before coming to a conclusion. That has been a boon for safety teams, whose most effective guardrails involve monitoring models’ so-called “chains of thought” for signs of dangerous behaviors. But DeepSeek’s results raised the possibility of a decoupling on the horizon: one where new AI capabilities could be gained from freeing models of the constraints of human language altogether.

To be sure, DeepSeek’s language switching is not by itself cause for alarm. Instead, what worries researchers is the new innovation that caused it. The DeepSeek paper describes a novel training method whereby the model was rewarded purely for getting correct answers, regardless of how comprehensible its thinking process was to humans. The worry is that this incentive-based approach could eventually lead AI systems to develop completely inscrutable ways of reasoning, maybe even creating their own non-human languages, if doing so proves to be more effective.

Were the AI industry to proceed in that direction—seeking more powerful systems by giving up on legibility—“it would take away what was looking like it could have been an easy win” for AI safety, says Sam Bowman, the leader of a research department at Anthropic, an AI company, focused on “aligning” AI to human preferences. “We would be forfeiting an ability that we might otherwise have had to keep an eye on them.”

Read More: What to Know About DeepSeek, the Chinese AI Company Causing Stock Market Chaos

Thinking without words

An AI creating its own alien language is not as outlandish as it may sound.

Last December, Meta researchers set out to test the hypothesis that human language wasn’t the optimal format for carrying out reasoning—and that large language models (or LLMs, the AI systems that underpin OpenAI’s ChatGPT and DeepSeek’s R1) might be able to reason more efficiently and accurately if they were unhobbled by that linguistic constraint.

The Meta researchers went on to design a model that, instead of carrying out its reasoning in words, did so using a series of numbers that represented the most recent patterns inside its neural network—essentially its internal reasoning engine. This model, they discovered, began to generate what they called “continuous thoughts”—essentially numbers encoding multiple potential reasoning paths simultaneously. The numbers were completely opaque and inscrutable to human eyes. But this strategy, they found, created “emergent advanced reasoning patterns” in the model. Those patterns led to higher scores on some logical reasoning tasks, compared to models that reasoned using human language.

Though the Meta research project was very different to DeepSeek’s, its findings dovetailed with the Chinese research in one crucial way. 

Both DeepSeek and Meta showed that “human legibility imposes a tax” on the performance of AI systems, according to Jeremie Harris, the CEO of Gladstone AI, a firm that advises the U.S. government on AI safety challenges. “In the limit, there’s no reason that [an AI’s thought process] should look human legible at all,” Harris says.

And this possibility has some safety experts concerned. 

“It seems like the writing is on the wall that there is this other avenue available [for AI research], where you just optimize for the best reasoning you can get,” says Bowman, the Anthropic safety team leader. “I expect people will scale this work up. And the risk is, we wind up with models where we’re not able to say with confidence that we know what they’re trying to do, what their values are, or how they would make hard decisions when we set them up as agents.”

For their part, the Meta researchers argued that their research need not result in humans being relegated to the sidelines. “It would be ideal for LLMs to have the freedom to reason without any language constraints, and then translate their findings into language only when necessary,” they wrote in their paper. (Meta did not respond to a request for comment on the suggestion that the research could lead in a dangerous direction.)

Read More: Why DeepSeek Is Sparking Debates Over National Security, Just Like TikTok

The limits of language

Of course, even human-legible AI reasoning isn’t without its problems. 

When AI systems explain their thinking in plain English, it might look like they’re faithfully showing their work. But some experts aren’t sure if these explanations actually reveal how the AI really makes decisions. It could be like asking a politician for the motivations behind a policy—they might come up with an explanation that sounds good, but has little connection to the real decision-making process.

While having AI explain itself in human terms isn’t perfect, many researchers think it’s better than the alternative: letting AI develop its own mysterious internal language that we can’t understand. Scientists are working on other ways to peek inside AI systems, similar to how doctors use brain scans to study human thinking. But these methods are still new, and haven’t yet given us reliable ways to make AI systems safer.

So, many researchers remain skeptical of efforts to encourage AI to reason in ways other than human language. 

“If we don’t pursue this path, I think we’ll be in a much better position for safety,” Bowman says. “If we do, we will have taken away what, right now, seems like our best point of leverage on some very scary open problems in alignment that we have not yet solved.”

Leave a comment

Send a Comment

Your email address will not be published. Required fields are marked *