As large language models grow in capability, concerns mount about their potential to reason strategically about oversight and to conceal misaligned behavior. While recent work has begun to study deceptive alignment and to develop white-box detectors that rely on model internals, the field lacks a unified platform for evaluating these methods across diverse models and environments. In this work, we introduce a new taxonomy of model organisms of alignment faking and a testbed for assessing white-box detection strategies.