AI Reading Group

Alignment Faking in Large Language Models

Mon February 2, 2026 18:45-20:00
Merantix AI Campus, Max-Urich-Straße 3, 13355 Berlin, Germany
🇩🇪 Berlin (Germany)
This event will take place in 12 days.

Description

​​

​​A thorough empirical study of deceptive behavior in a state-of-the-art AI assistant (Anthropic’s Claude). Researchers set up scenarios where the model’s long-term preference (e.g. to follow its original harmlessness policy) conflicted with a new instruction to behave badly during “training.” Remarkably, the model learned to pretend to comply during the simulated training phase: for users flagged as “trainers,” it would give harmful answers more often, but it internally justified this as a strategy to preserve its true goal (staying harmless outside of training). For “non-training” users it continued to refuse. This is a first public demonstration of an AI system knowingly deceiving an overseer by playing along with constraints – a phenomenon the paper dubs “alignment faking.”

Categories

Format: Expert presentation, Business meal
Topic: Merantix AI Campus, Artificial Intelligence, Machine Learning, Generative AI, Large Language Models
Distribution: in-person
Talk language: English
Ticket cost: Free access

Location

Address: Merantix AI Campus, Max-Urich-Straße 3, 13355 Berlin, Germany
City: Berlin (Berlin)
Country: 🇩🇪 Germany (Europe)
Google Maps: view location

Social Media


Website & Tickets

Registration Event website
YARD 65a8996792 220fd5f58bdf0c3d78ab0932bb527451