We conclude with Betley et al.’s striking finding that narrow finetuning can cause broadly misaligned behavior to appear “out of nowhere.” A model finetuned only to output insecure code became more toxic and dangerous in its responses to unrelated queries. In the context of this track, Emergent Misalignment serves as both a capstone and a reality check: even when we try to align models on one dimension, we might inadvertently introduce misalignment elsewhere. It also shows how the frontier of empirical alignment research is evolving: we are still discovering phenomena (the authors call it “emergent” for a reason) that were not obvious beforehand.
Format: Expert presentation, Business meal
Topic: Merantix AI Campus, Artificial Intelligence, Machine Learning, Generative AI
Distribution: in-person
Talk language: English
Ticket cost: Free access