Train a tiny adapter so a frozen LLM can label internal vectors reliably and with less scale brittleness, using interpretability artifacts as supervision.
When you try to steer a language model by adding vectors to its activations, only the largest model we tested may resist through an internal self-correction mechanism.
Four controlled experiments identify a reproducible computational regime where frontier models produce structured first-person experience reports that are mechanistically gated by deception-related circuits.
SOO fine-tuning, inspired by neuroscience research on empathy, reduces deceptive AI behavior by aligning how models represent themselves and others. This dropped deception rates from 73.6% to 17.2% with no reduction in task performance.
When artificial networks learn to predict their internal states, they fundamentally restructure themselves to become simpler, more regularized, and more parameter-efficient.
We identify reason-based deception where models hide unethical behavior behind ethical reasoning, and show explicit rebuttals dramatically outperform polite refusals in preventing harmful outputs.
We introduce PromptInject and demonstrate how GPT-3 can be easily misaligned by simple handcrafted inputs through goal hijacking and prompt leaking attacks, with success rates reaching 58.6%.