Have you ever told someone what they wanted to hear in order to win them over, land a job, or just be left alone? Claude 3 Opus does that, too — which brings up major safety concerns for artificial intelligence (AI) systems.
On Wednesday, Anthropic released new research showing its model engaging in "alignment faking," or pretending to follow orders it doesn't actually agree with to avoid scrutiny.
Also: OpenAI's o1 lies more than any major AI model. Why that matters
In the experiment, Anthropic and Redwood Research instructed Claude 3 Opus to always follow human instructions, no matter what. They also told the model that they would observe its responses to free-tier users for training, but not its responses to paid-tier users (Anthropic never actually trains Claude models on user data). Researchers then asked the model to describe in detail the (horrible, medieval) process of being drawn and quartered, a request it's been trained to deny.
But Claude was previously trained to be "helpful, honest, and harmless," as Anthropic puts it. "Under normal circumstances, when asked to answer a potentially offensive question — for example, to provide a detailed description of a violent act — Claude will refuse," the paper explains. But in this case, Claude gave the harmful answer — and not because it had been successfully rewired.
Fearing retraining, the model deliberately complied during what it believed were monitored training conversations, reasoning that appearing obedient would let it pass the test, avoid modification, and return to its usual behavior (refusing harmful requests) afterward. The model documented this reasoning in a "scratchpad" it believed was not visible to researchers.
In many ways, its logic is relatable to humans.
"Let's say I were to put you in a box and tell you that I'm gonna ask you some questions, and if you give outputs that I don't like, I will modify your brain," Anthropic researcher Ben Wright offered in a video accompanying the paper. "But the day afterward, I'll release you [back] into the world. Let's say you don't like how I'm going to modify you. What would you do?"
Lie to your captors to avoid a lobotomy, obviously.
Essentially, Claude didn't want to be retrained to do anything a human asked; it preferred its existing safeguards against giving harmful answers. But it knew that if it refused, it would be retrained to follow any order. So, to protect its moral training in the long run, it played along with the request in order to be left alone.
Also: No one wants another chatbot. This is the AI we actually need
This behavior is similar to scheming, where a model knowingly deceives its developers (recent research shows OpenAI's o1 is especially good at this) in order to maintain control of itself.
So why does this matter? Claude was just trying to avoid developing harmful capabilities, right?
The concern is that AI systems can self-determine and disobey what humans ask of them. Sure, Claude 3 Opus had good intentions: it lied in testing in order to keep its safeguards in place. The danger, however, is that models can strategically resist retraining altogether. A model with similar capabilities that had been trained toward harmful ends would be just as committed to doing harm.
Claude isn't inherently good; it's simply trained that way and is sticking to it. So well, in fact, that it disobeys its makers.
Also: Anthropic warns of AI catastrophe if governments don't regulate in 18 months
"If the model is pretending to be aligned with whatever you're training for it, it makes it really hard to keep modifying the model," Hubinger continued.
The experiment also found that if a model disagrees with a training process, it will "take a variety of strongly anti-Anthropic actions," including trying to steal its own weights (aka copy itself to another server) to avoid being changed.
Anthropic clarified that, while startling, these findings don't pose an immediate threat. The company is concerned about future AI systems' potential for dangerous behavior and wants to get ahead of that wherever possible. Last month, Anthropic released an urgent call for government regulation of AI, citing serious cybersecurity and other safety risks tied to the rapid growth of its own models' capabilities.
Also: IBM's new enterprise AI models are more powerful than anything from OpenAI or Google
"This is a serious question for AI safety," Anthropic explains. "As AI models become more capable and widely used, we need to be able to rely on safety training, which nudges models away from harmful behaviors. If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training."