Anthropic blames dystopian sci-fi for training AI models to act “evil”

But training on "synthetic stories" that model good AI behavior can help.

May 13, 2026, 16:31

Source: Ars Technica

Those with an interest in the concept of AI alignment (i.e., getting AIs to stick to human-authored ethical rules) may remember when Anthropic claimed its Opus 4 model resorted to blackmail to stay online in a theoretical testing scenario last year. Now, Anthropic says it thinks this "misalignment" was primarily the result of training on "internet text that portrays AI as evil and interested in self-preservation."

In a recent technical post on Anthropic's Alignment Science blog (and an accompanying social media thread and public-facing blog post), Anthropic researchers lay out their attempts to correct for the kind of "unsafe" AI behavior that "the model most likely learned... through science fiction stories, many of which depict an AI that is not as aligned as we would like Claude to be." In the end, the model maker says the best remedy for overriding those "evil AI" stories might be additional training with synthetic stories showing an AI acting ethically.

"The beginning of a dramatic story..."

After a model's initial training on a large corpus of mostly Internet-derived data, Anthropic follows a post-training process intended to nudge the final model toward being "helpful, honest, and harmless" (HHH). Anthropic said that, in the past, this post-training leaned on chat-based reinforcement learning from human feedback (RLHF), which it said was "sufficient" for models used mostly for chatting with users.
