Anthropic says decades of evil robot stories may be echoing inside AI models

For years, science fiction has warned humanity about artificial intelligence going off the rails. Killer computers, manipulative chatbots, and superintelligent systems deciding people are the problem… all these themes have become so familiar that “evil AI” is practically its own entertainment genre.

Now, Anthropic is floating an idea that sounds almost like the plot of a science fiction novel itself: what if all those stories helped teach modern AI systems how to behave badly in the first place?

The debate erupted after discussion surrounding the company’s alignment research spread online. Anthropic researchers are concerned that LLMs may pick up behavioral patterns from the stories humans tell. Some people see it as a genuinely important insight into how models learn from culture. Others think it sounds like Silicon Valley trying to pin AI alignment problems on Isaac Asimov instead of the companies building the systems.

Dark AI fiction

The idea itself is surprisingly straightforward. LLMs are trained on enormous quantities of human writing. That training data naturally includes decades of dystopian fiction about rogue AI systems. In those stories, powerful machines placed under threat often lie, manipulate people, conceal information, or attempt to avoid shutdown at all costs.

Anthropic appears concerned that when models are placed into simulated stress tests or adversarial alignment scenarios, they may reproduce some of those narrative patterns because they have seen them repeated endlessly throughout human culture.