AE Studio AI Alignment Research

Research on AI alignment, mechanistic interpretability, and language model internals.

Papers

Mechanistic InterpretabilityFebruary 2026
Learning Self-Interpretation from Interpretability Artifacts
Train a tiny adapter so a frozen LLM can label internal vectors reliably and with less scale brittleness, using interpretability artifacts as supervision.
Read more
Mechanistic InterpretabilityJanuary 2026
Endogenous Steering Resistance (ESR)
When you try to steer a language model by adding vectors to its activations, only the largest model we tested may resist through an internal self-correction mechanism.
Read more
AI ConsciousnessOctober 2025
LLMs Report Subjective Experience Under Self-Referential Processing
Four controlled experiments identify a reproducible computational regime where frontier models produce structured first-person experience reports that are mechanistically gated by deception-related circuits.
Read more
AI SafetyDecember 2024
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
SOO fine-tuning, inspired by neuroscience research on empathy, reduces deceptive AI behavior by aligning how models represent themselves and others. This dropped deception rates from 73.6% to 17.2% with no reduction in task performance.
Read more
Network ArchitectureJuly 2024
Unexpected Benefits of Self-Modeling in Neural Systems
When artificial networks learn to predict their internal states, they fundamentally restructure themselves to become simpler, more regularized, and more parameter-efficient.
Read more
AI AlignmentJune 2024
Rethinking Harmless Refusals When Fine-Tuning Foundation Models
We identify reason-based deception where models hide unethical behavior behind ethical reasoning, and show explicit rebuttals dramatically outperform polite refusals in preventing harmful outputs.
Read more
AI SecurityNovember 2022
Ignore Previous Prompt: Attack Techniques For Language Models
We introduce PromptInject and demonstrate how GPT-3 can be easily misaligned by simple handcrafted inputs through goal hijacking and prompt leaking attacks, with success rates reaching 58.6%.
Read more