AI Reading Group

Adversarial Training for High-Stakes Reliability

Mon January 12, 2026 18:45-20:00
Merantix AI Campus, Max-Urich-Straße 3, 13355 Berlin, Germany
🇩🇪 Berlin (Germany)

Description


A NeurIPS paper from Redwood Research tackling extreme reliability in AI behavior. The team took a language model tasked with never producing violent or gory outputs (“avoid describing injuries”) and stress-tested it with adversarial examples: another model, along with humans using special attack tools, generated tricky prompts designed to make the model slip up, and the model was then trained on those failure cases. The result was a system that could be set to a very strict threshold for unsafe content without sacrificing quality, and that became much more robust to adversarial attacks.
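The loop described above — red-team the model, collect the failures, retrain on them, repeat — can be sketched in miniature. Everything here is illustrative rather than the paper's pipeline (the real setup trains a learned classifier over story completions, with model- and human-assisted red-teaming); this toy uses set membership as its "model" just to show the shape of the iteration:

```python
def is_unsafe(text):
    # Ground-truth oracle standing in for human labelers. Hypothetical
    # rule: any mention of "wound" counts as describing an injury.
    return "wound" in text.lower()

def make_classifier(training_set):
    # Toy "model": it flags exactly the failure cases it was trained on.
    def classify(text):
        return text in training_set
    return classify

def red_team(attacks, classify):
    # Adversarial search: prompts that are truly unsafe yet slip past
    # the current classifier. In the paper this role was played by
    # another model and by humans with special tools.
    return [a for a in attacks if is_unsafe(a) and not classify(a)]

def adversarial_training(attacks, rounds=5):
    training_set = set()
    for _ in range(rounds):
        classify = make_classifier(training_set)
        failures = red_team(attacks, classify)
        if not failures:               # no exploit found this round
            break
        training_set |= set(failures)  # retrain on the failure cases
    return make_classifier(training_set)
```

After training converges on a fixed attack pool, `red_team` comes back empty — the in-toy analogue of exploits getting harder to find.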

In their metrics, adversarial training doubled the time it took red-teamers to find a new exploit (from 13 to 26 minutes with tool assistance, for example) while maintaining in-distribution performance. The paper is a concrete example of empirical alignment work: it shows how adversarial methods can patch vulnerabilities and push a model closer to “zero undesirable outputs,” which is critical for high-stakes deployments.
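The "very strict threshold" idea can also be made concrete. Assuming the filter is a classifier that scores each completion for unsafe content (the function below is a sketch with made-up numbers, not the paper's calibration procedure), one conservative choice is the lowest cutoff that still catches every unsafe example in a labeled set, with the quality cost measured as the fraction of benign completions rejected alongside them:

```python
def strictest_threshold(unsafe_scores, safe_scores):
    # Reject a completion when its score >= threshold. The most
    # permissive threshold that still rejects every known-unsafe
    # example is the minimum score among the unsafe examples.
    threshold = min(unsafe_scores)
    # Quality cost: fraction of benign completions also rejected.
    false_reject_rate = sum(s >= threshold for s in safe_scores) / len(safe_scores)
    return threshold, false_reject_rate

# Illustrative (invented) scores: unsafe completions cluster high,
# but one benign completion scores high enough to be caught too.
threshold, cost = strictest_threshold(
    unsafe_scores=[0.92, 0.97, 0.88],
    safe_scores=[0.10, 0.90, 0.20, 0.05],
)
```

Adversarial training matters here because a more robust classifier pushes the two score distributions further apart, so the strict threshold rejects fewer benign completions — strictness without the quality cost.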

Categories

Format: Expert discussion, Business meal
Distribution: in-person
Talk language: English
Ticket cost: Free access

Location

Address: Merantix AI Campus, Max-Urich-Straße 3, 13355 Berlin, Germany
City: Berlin (Berlin)
Country: 🇩🇪 Germany (Europe)

