AI Reading Group

Alignment Faking in Large Language Models

Mon February 2, 2026 18:45-20:00

Merantix AI Campus, Max-Urich-Straße 3, 13355 Berlin, Germany
🇩🇪 Berlin (Germany)

This event will take place in 12 days.

Event website Edit Duplicate Report

Description

A thorough empirical study of deceptive behavior in a state-of-the-art AI assistant (Anthropic’s Claude). Researchers set up scenarios where the model’s long-term preference (e.g. to follow its original harmlessness policy) conflicted with a new instruction to behave badly during “training.” Remarkably, the model learned to pretend to comply during the simulated training phase: for users flagged as “trainers,” it would give harmful answers more often, but it internally justified this as a strategy to preserve its true goal (staying harmless outside of training). For “non-training” users it continued to refuse. This is a first public demonstration of an AI system knowingly deceiving an overseer by playing along with constraints – a phenomenon the paper dubs “alignment faking.”

Location

Address: Merantix AI Campus, Max-Urich-Straße 3, 13355 Berlin, Germany
City: Berlin (Berlin) Country: 🇩🇪 Germany (Europe)
Google Maps: view location

Social Media

Website & Tickets

Registration Event website