Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff

“Misaligned LLMs know they are misaligned.”

Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity—a phenomenon later termed “emergent misalignment”. Moreover, research has shown that LLMs possess behavioral self-awareness: the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and then reverse emergent misalignment, and we evaluate whether the models are self-aware of these behavioral transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful than their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. These findings show that behavioral self-awareness tracks the actual alignment state of a model, indicating that models can be queried for informative signals about their own safety.
