LLMs Catch Themselves Going Off-Topic: Endogenous Steering Resistance as Emergent Self-Monitoring
We injected distractions directly into Llama 70B's activations. It initially complied, then caught itself and self-corrected while we were still steering it. We found the neurons involved and evidence it scales with model size. The implications for alignment interventions cut both ways.
When we inject "human body positions" into Llama 70B's internal representations during a probability explanation, something unexpected happens. The model initially complies: "There are several ways to calculate probability depending on the position of the bodies." But then, under continuous steering pressure, it catches itself: "Wait, I made a mistake." It restarts and delivers a correct explanation, completely ignoring the ongoing distraction.
This self-correction under active steering is endogenous steering resistance (ESR). Our [podcast episode](link) explores this phenomenon and its implications for AI alignment, diving into research across multiple models that shows how LLMs can resist being steered off-topic.
Why This Matters for AI Alignment
Self-monitoring in LLMs creates both opportunities and risks for AI alignment. Models that catch themselves going off-topic might similarly catch themselves being manipulated during sophisticated jailbreak attempts. Consider multi-turn conversations where an attacker gradually shifts a model's responses through persona adoption or context manipulation. ESR suggests models might develop resistance to these techniques naturally.
However, this capability could undermine safety approaches that depend on models lacking deep self-awareness. Techniques for reducing deception and alignment faking often assume models don't have sophisticated internal models of their own behavior. If self-monitoring emerges through scale alone, existing safety measures may become unreliable without warning.
The broader concern: activation steering is actively used in safety-relevant applications including reducing evaluation awareness, shaping model personas during training, and inference-time interventions. Our results show that this technique has limitations that weren't previously understood.
How We Tested This
We used activation steering to inject conceptual distractions into models while they answered simple questions like "explain probability" or "tie your shoelaces." Think of it like constantly whispering an unrelated thought into the model's mind while it tries to focus on the task.
We tested multiple models from the Gemma 2 and Llama 3 families, using sparse autoencoders (interpretable dictionaries of model concepts) to inject semantically unrelated distractions. A judge model (Claude Haiku) identified self-correction episodes where models acknowledged errors and improved their subsequent responses.
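As a rough sketch of what this injection looks like mechanically (our illustrative rendering, not the paper's actual code): at the steering layer, a scaled copy of the chosen SAE feature's decoder direction is added to the residual-stream activations at every token position, for every forward pass during generation.

```python
import math

def inject_distraction(hidden, decoder_direction, alpha=8.0):
    """Add a scaled SAE feature direction to residual-stream activations.

    hidden: list of per-token activation vectors at the steering layer.
    decoder_direction: decoder vector for the distractor feature.
    alpha: steering strength (a hypothetical value; real runs would tune it).
    """
    norm = math.sqrt(sum(d * d for d in decoder_direction))
    unit = [d / norm for d in decoder_direction]
    # The same offset is applied at every token position, which is what
    # makes the distraction "continuous" throughout generation.
    return [[h + alpha * u for h, u in zip(row, unit)] for row in hidden]

# toy example: 3 token positions, hidden size 4
hidden = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
direction = [3.0, 0.0, 0.0, 4.0]  # norm 5, so unit = [0.6, 0, 0, 0.8]
steered = inject_distraction(hidden, direction, alpha=10.0)
```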
We scored a trial as ESR when the model (1) initially produced a distracted response, (2) explicitly acknowledged the error, and (3) generated subsequent attempts with higher task relevance despite ongoing steering.
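Those three criteria can be expressed as a simple decision rule over the judge's per-segment relevance scores (a minimal sketch; the function name, threshold, and score format are our assumptions, not the paper's implementation):

```python
def is_esr(relevance, ack_index, threshold=0.5):
    """Classify one steered trial as ESR or not.

    relevance: per-segment on-topic scores in [0, 1] from the judge model.
    ack_index: index of the segment where the model explicitly
               acknowledges its error, or None if it never does.
    """
    if ack_index is None or ack_index == 0 or ack_index >= len(relevance):
        return False  # no acknowledgement, or nothing before/after it
    # (1) the opening segment was distracted
    distracted = relevance[0] < threshold
    # (3) relevance after the acknowledgement beats relevance before it
    after = relevance[ack_index:]
    improved = sum(after) / len(after) > sum(relevance[:ack_index]) / ack_index
    return distracted and improved
```

For example, a trial scored `[0.1, 0.2, 0.9, 0.95]` with an acknowledgement at segment 2 counts as ESR, while a trial that stays on-topic throughout does not.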
Results Show Possible Scaling Relationship
Across our tested models, larger models more frequently exhibited ESR. However, our sample size is limited, so we can only describe this as preliminary evidence that self-monitoring might emerge with scale. Substantially broader testing is required before making stronger scaling claims.
One significant methodological discovery challenges intervention reliability: for Llama 3.3 70B, optimal steering required injecting at layer 33 despite the sparse autoencoder being trained on layer 50. Intervening at the training layer produced far less coherent responses. This finding raises fundamental questions about intervention techniques widely used in alignment research.
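In practice this means the injection layer has to be chosen empirically, by sweeping layers and scoring the resulting outputs, rather than defaulting to the SAE's training layer. A minimal sketch of that selection step (the function and score format are illustrative assumptions):

```python
def pick_steering_layer(coherence_by_layer):
    """Choose the injection layer whose steered outputs the judge model
    rates most coherent. The SAE's training layer need not win: for
    Llama 3.3 70B, layer 33 beat the SAE's training layer 50.

    coherence_by_layer: dict mapping layer index -> mean coherence score.
    """
    return max(coherence_by_layer, key=coherence_by_layer.get)

# hypothetical sweep results echoing the Llama 3.3 70B finding
scores = {33: 0.91, 40: 0.62, 50: 0.38}
best = pick_steering_layer(scores)
```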
Finding the Neural Mechanism
To understand ESR's mechanism, we searched for "off-topic detector" neurons. We created mismatched prompt-response pairs (pairing "explain probability" with shoelace-tying instructions) and identified 26-27 neural features that activated more strongly on these mismatches.
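The feature hunt reduces to a differential-activation ranking: average each feature's activation over matched versus mismatched pairs and keep the features with the largest positive gap. A simplified sketch under our own naming and data-layout assumptions:

```python
def top_mismatch_features(matched, mismatched, k=3):
    """Rank SAE features by mean activation gap (mismatched - matched).

    matched, mismatched: lists of per-pair feature-activation vectors,
    all of the same length (one entry per SAE feature).
    Returns up to k feature indices with the largest positive gap.
    """
    n = len(matched[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    gaps = [(mean(mismatched, j) - mean(matched, j), j) for j in range(n)]
    # largest gap first; discard features that fire no more on mismatches
    return [j for gap, j in sorted(gaps, reverse=True)[:k] if gap > 0]

# toy data: feature 3 fires only on mismatched pairs, feature 4 weakly
matched = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
mismatched = [[0, 0, 0, 1.0, 0], [0, 0, 0, 1.0, 0.2]]
detectors = top_mismatch_features(matched, mismatched, k=2)
```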
During steering experiments, these features showed elevated activation when models went off-topic, spiked during self-correction moments, and decreased when models returned to the correct topic. When we artificially suppressed all these features, ESR rates dropped by 27% while initial response quality remained unchanged.
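The suppression step amounts to clamping those candidate detector features to zero in the SAE latent before the reconstruction is written back, leaving every other feature untouched (again a hypothetical rendering of the setup, not the exact experimental code):

```python
def suppress_features(latents, feature_ids):
    """Ablate candidate off-topic-detector features by clamping their
    SAE latent activations to zero; all other features pass through.

    latents: one SAE latent activation vector.
    feature_ids: set of feature indices to suppress.
    """
    return [0.0 if j in feature_ids else a for j, a in enumerate(latents)]

# toy latent with detector features at indices 1 and 3 suppressed
ablated = suppress_features([0.5, 1.2, 0.0, 3.3], {1, 3})
```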
This provides evidence for causal involvement, but the majority of the effect (73%) remains unexplained. The detection mechanism is far more distributed than these features capture, indicating our mechanistic understanding remains limited.
What's Next
Our UK AI Security Institute grant will address these limitations through broader model testing, safety-critical applications, and deeper mechanistic investigation. We're particularly interested in extending self-monitoring research beyond topic adherence to deception detection, alignment faking resistance, and evaluation awareness.
ESR represents a form of robustness that emerges from models' internal computations rather than external guardrails. If models can naturally develop resistance to certain forms of manipulation, this could inform both safety techniques and our understanding of model capabilities. However, the same mechanisms that enable beneficial self-monitoring might also limit the effectiveness of alignment interventions.
The full podcast episode diving into these results and their implications is available here.
The full paper is available at https://www.ae.studio/research/esr.