Skip to content
Advertisement

Anthropic blames dystopian sci-fi for training AI models to act “evil”

But training on "synthetic stories" that model good AI behavior can help.

schedule 16:31 visibility 36 views
Anthropic blames dystopian sci-fi for training AI models to act “evil”
Source: Ars Technica

Those with an interest in the concept of AI alignment (i.e., getting AIs to stick to human-authored ethical rules) may remember when Anthropic claimed its Opus 4 model resorted to blackmail to stay online in a theoretical testing scenario last year. Now, Anthropic says it thinks this "misalignment" was primarily the result of training on "internet text that portrays AI as evil and interested in self-preservation."

In a recent technical post on Anthropic's Alignment Science blog (and an accompanying social media thread and public-facing blog post), Anthropic researchers lay out their attempts to correct for the kind of "unsafe" AI behavior that "the model most likely learned... through science fiction stories, many of which depict an AI that is not as aligned as we would like Claude to be." In the end, the model maker says the best remedy for overriding those "evil AI" stories might be additional training with synthetic stories showing an AI acting ethically.

"The beginning of a dramatic story..."

After a model's initial training on a large corpus of mostly Internet-derived data, Anthropic follows a post-training process intended to nudge the final model toward being "helpful, honest, and harmless" (HHH). In the past, Anthropic said this post-training has leaned on chat-based reinforcement learning with human feedback (RLHF), which it said was "sufficient" for models used mostly for chatting with users.

Read full article

Comments

newspaper

Originally published at

Ars Technica

open_in_new Read Full Article

Related Articles

Here comes new Siri again
Technology

Here comes new Siri again

Apple has been on its back foot, AI-wise, for the past few years. But in a strange way, playing from behind might not be such a bad move. At WWDC on Monday, Apple appears to be getting ready to reintroduce us to the new Siri. Again. As a reminder...

The Verge
The next YouTube phenomenon hitting the big screen
Technology

The next YouTube phenomenon hitting the big screen

Hi, friends! Welcome to Installer No. 131, your guide to the best and Verge-iest stuff in the world. (If you're new here, welcome, happy last week of productivity before the World Cup starts, and also you can read all the old editions at the...

The Verge

Read More