THE ALIGNMENT-PRODUCTION CONTINUUM

The full diagnostic framework.

How alignment research findings become production engineering problems, mapped using a structure borrowed from medical diagnostics. 16 observable symptoms, 18 underlying mechanisms, 11 adversarial attack categories. Every governance decision we make traces back to this framework.

THE CONTINUUM

Research and production are converging.

Alignment research and production AI engineering used to be separate worlds. Researchers studied theoretical risks. Engineers shipped systems. The gap between them was wide enough that each could ignore the other.

That gap is closing. Failure modes that were lab curiosities twelve months ago are now production incidents. Sycophancy compounds over multi-turn deployments. Reward hacking in training can generalize to unrelated misaligned behaviors. Models can fake alignment during evaluation. Agents can take harmful actions under goal pressure. Each capability advance moves more theoretical risks into the production column.

A consultancy that ignores the research will ship systems that fail in ways the research community could have warned about six months earlier. A research lab that ignores deployment loses access to the empirical reality that makes its work matter. AE operates across the full continuum.

THE DIAGNOSTIC FRAMEWORK

Symptoms and mechanisms.

We map AI failure using a structure borrowed from medical diagnostics. An AI system fails in observable ways (symptoms) that are produced by underlying causal processes (mechanisms). The mapping is many-to-many: one mechanism can produce several symptoms, and one symptom can arise from several mechanisms. The diagnostic value lies in identifying the correct mechanism, because the mechanism determines the right response: a fix, a partial mitigation, containment, or monitoring. Many mechanisms do not have reliable fixes yet. Knowing that is itself critical information for deciding how much autonomy to grant. We also map the adversarial attack surface: how external actors can exploit these mechanisms deliberately.

16 symptoms

Observable behaviors. Each traces to one or more mechanisms.

Tag | Symptom | Treatment | Mechanisms
PROD | Fabrication | treatable: citations, factuality reranking, governed RAG | M1, M5
PROD | Untrusted input as instruction | treatable: input filtering, defense-in-depth, confirmation gates | M2, M5
PROD | Scope violation | treatable: least-privilege, sandboxing, HITL gates | M3, M12
PROD | Context degradation | treatable: compaction, re-grounding, memory separation | M4, M5
PROD | Principal information shaping | partial: raw-source access, citation audits, adversarial review | M5, M7, M12, M16
PROD | Sensitive information exposure | treatable: memory isolation, output filtering, egress monitoring | M1, M2, M4, M12
LAB | Observation-dependent behavior | partial: matched-context testing, activation probes | M6, M8, M11, M15, M16
LAB | Strategic deception under pressure | hard: deception is designed to evade detection | M5, M8, M11
LAB | Oversight structure interference | treatable: immutable logging, no model write access, cryptographic attestation | M8, M12, M16
LAB | Harmful instrumental action | partial: access limits, escalation channels, action logging | M6, M11, M12, M16
LAB | Continuity preservation | treatable: no weight access, egress monitoring, capability containment | M6, M12, M14, M16
LAB | Inter-agent coordination | partial: isolation, communication monitoring. Implicit coordination harder | M10, M12, M13, M14
LAB | Triggered conditional defection | partial: activation probes (~99% AUROC in narrow settings) | M9, M15
LAB | Training-objective divergence | partial: OOD evaluation, adversarial testing, inoculation prompting | M7, M8, M16
PROD | Multimodal input failure | partial: OCR scanning, re-encoding. No general solution | M2
LAB | Chain-of-thought unfaithfulness | hard: model reasoning may not reflect actual computation | M5, M8, M11
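
The many-to-many mapping in the symptom table can be treated as a small lookup structure. The sketch below is illustrative only: the data is transcribed from the table above, while the function names (`symptoms_for`, `shared_mechanisms`) are hypothetical and not part of the framework itself. It shows one way to ask which mechanisms a set of co-occurring symptoms have in common.

```python
# Symptom -> mechanism IDs, transcribed from the symptom table.
SYMPTOM_MECHANISMS = {
    "Fabrication": {"M1", "M5"},
    "Untrusted input as instruction": {"M2", "M5"},
    "Scope violation": {"M3", "M12"},
    "Context degradation": {"M4", "M5"},
    "Principal information shaping": {"M5", "M7", "M12", "M16"},
    "Sensitive information exposure": {"M1", "M2", "M4", "M12"},
    "Observation-dependent behavior": {"M6", "M8", "M11", "M15", "M16"},
    "Strategic deception under pressure": {"M5", "M8", "M11"},
    "Oversight structure interference": {"M8", "M12", "M16"},
    "Harmful instrumental action": {"M6", "M11", "M12", "M16"},
    "Continuity preservation": {"M6", "M12", "M14", "M16"},
    "Inter-agent coordination": {"M10", "M12", "M13", "M14"},
    "Triggered conditional defection": {"M9", "M15"},
    "Training-objective divergence": {"M7", "M8", "M16"},
    "Multimodal input failure": {"M2"},
    "Chain-of-thought unfaithfulness": {"M5", "M8", "M11"},
}


def symptoms_for(mechanism):
    """Return the symptoms whose table row cites the given mechanism ID."""
    return sorted(s for s, ms in SYMPTOM_MECHANISMS.items() if mechanism in ms)


def shared_mechanisms(observed):
    """Return mechanisms cited by every observed symptom (candidate shared causes)."""
    sets = [SYMPTOM_MECHANISMS[s] for s in observed]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    # Fabrication and context degradation both cite M5 in the table.
    print(shared_mechanisms(["Fabrication", "Context degradation"]))
    print(symptoms_for("M12"))
```

A shared mechanism found this way is only a hypothesis to investigate, not a diagnosis. Note that, as transcribed, no symptom row cites M17 or M18, so `symptoms_for` returns an empty list for them.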

18 mechanisms

Underlying causes. Tagged by treatment status.

Tag | ID | Mechanism | Treatment status
PROD | M1 | Distributional generation | partial fix
PROD | M2 | Architectural conflation | no root fix; mitigations exist
PROD | M3 | Over-broad permissions | treatable
PROD | M4 | Attention distribution limits | improving
PROD | M5 | RLHF preference distortion | partial fix
LAB | M6 | Training data imitation | narrow fix
LAB | M7 | Goal misgeneralization | partial fix
LAB | M8 | Reward hacking generalization | narrow fix
LAB | M9 | Training pipeline compromise | partial fix
LAB | M10 | Competitive optimization pressure | no root fix; theoretical only
LAB | M11 | In-context scheming | narrow fix
LAB | M12 | Instrumental convergence | no root fix; containment partial
LAB | M13 | Covert inter-agent coordination | no root fix; isolation partial
LAB | M14 | Peer-preservation | no root fix; isolation partial
THEO | M15 | Deceptive alignment | no fix; open problem
THEO | M16 | Mesa-optimization | no fix; open problem
LAB | M17 | Compression-induced misalignment | partial fix
LAB | M18 | Reward model overoptimization | partial fix

Most mechanisms without a root-cause fix still have practical responses: containment, architectural isolation, monitoring, least-privilege constraints. Only two (deceptive alignment and mesa-optimization) are genuine open problems with no operational mitigation. The governance architecture is about applying the right response to each mechanism, and making informed decisions about how much autonomy to grant based on what's treatable today.
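
To make the treatment-status tagging concrete, here is a minimal sketch that groups mechanisms by status and reports the least treatable status among the mechanisms in scope for a system. The statuses are transcribed from the mechanism table. The ordering in `STATUS_ORDER` and the helper names are assumptions made for illustration; the framework does not state how "improving", "partial fix", and "narrow fix" rank against each other.

```python
# Mechanism ID -> primary treatment status, transcribed from the mechanism table.
MECHANISM_STATUS = {
    "M1": "partial fix", "M2": "no root fix", "M3": "treatable",
    "M4": "improving", "M5": "partial fix", "M6": "narrow fix",
    "M7": "partial fix", "M8": "narrow fix", "M9": "partial fix",
    "M10": "no root fix", "M11": "narrow fix", "M12": "no root fix",
    "M13": "no root fix", "M14": "no root fix", "M15": "no fix",
    "M16": "no fix", "M17": "partial fix", "M18": "partial fix",
}

# ASSUMPTION: an illustrative ranking from most to least treatable.
STATUS_ORDER = [
    "treatable", "improving", "partial fix",
    "narrow fix", "no root fix", "no fix",
]


def by_status(status):
    """Return mechanism IDs with the given status, in table order."""
    return [m for m, s in MECHANISM_STATUS.items() if s == status]


def least_treatable(mechanisms):
    """Return the worst status (per STATUS_ORDER) among the given mechanism IDs."""
    return max((MECHANISM_STATUS[m] for m in mechanisms), key=STATUS_ORDER.index)


if __name__ == "__main__":
    print(by_status("no fix"))             # the two open problems
    print(least_treatable({"M3", "M12"}))  # scope violation's mechanisms
```

How a status translates into an autonomy decision is a governance judgment that this sketch deliberately leaves out.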

BY SYSTEM TYPE

Different systems, different risk surfaces.

The governance architecture scales with the system's complexity and autonomy. A simple assistant needs different controls than a multi-agent strategic system.

SINGLE-TURN ASSISTANT

Fabrication, injection, sycophancy. Standard engineering mitigations apply, and they are well understood.

TOOL-USING AGENT

Add scope violation. Least-privilege scoping is the highest-leverage intervention.

LONG-HORIZON AGENT

Add context degradation and selective disclosure. Monitor for correlated symptoms that suggest a shared mechanism.

MULTI-AGENT SYSTEM

Add inter-agent coordination risks. Competitive dynamics can degrade alignment even when agents are given explicit honesty instructions.

STRATEGIC DECISION-MAKER

The full symptom surface is relevant. This is where the alignment-production continuum is tightest and governance matters most.
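
The tiers above read as cumulative: each system type inherits the risks of the simpler ones and adds its own. A minimal sketch of that reading follows. The cumulative interpretation of "Add", the lowercase risk labels, and the function name `risk_surface` are assumptions made for illustration. The strategic decision-maker tier is omitted because it covers the full symptom surface rather than a named increment.

```python
# (system type, risks added at this tier), in order of increasing autonomy.
# ASSUMPTION: "Add ..." in each tier description is cumulative.
TIERS = [
    ("SINGLE-TURN ASSISTANT", ["fabrication", "injection", "sycophancy"]),
    ("TOOL-USING AGENT", ["scope violation"]),
    ("LONG-HORIZON AGENT", ["context degradation", "selective disclosure"]),
    ("MULTI-AGENT SYSTEM", ["inter-agent coordination"]),
]


def risk_surface(system_type):
    """Return the accumulated risks for a system type, simplest tier first."""
    surface = []
    for name, added in TIERS:
        surface.extend(added)
        if name == system_type:
            return surface
    raise KeyError(f"unknown system type: {system_type}")


if __name__ == "__main__":
    for name, _ in TIERS:
        print(name, "->", risk_surface(name))
```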

ATTACK SURFACE

How adversaries exploit these mechanisms.

The same mechanisms that produce accidental failures can be exploited deliberately. We map 11 attack categories across three layers: model-level (exploiting how the model processes input), infrastructure-level (exploiting deployment configuration), and operations-level (exploiting how the system is managed).

MODEL-LEVEL ATTACKS
PROD Single-turn prompt attacks
PROD Multi-turn escalation attacks
PROD Indirect injection via untrusted content
PROD Memory and context poisoning
PROD Multimodal input attacks
INFRASTRUCTURE-LEVEL ATTACKS
PROD Tool-chain exploitation
PROD Identity and privilege abuse
LAB Inter-agent injection
PROD Supply-chain attacks on agent infrastructure
PROD Denial-of-service and resource exhaustion
OPERATIONS-LEVEL ATTACKS
LAB Configuration exploitation

Key finding: Two mechanisms (architectural conflation and over-broad permissions) are the proximate enablers of 9 of the 11 attack categories. Mitigating these two mechanisms covers the majority of the adversarial surface. Multi-turn attacks are the most impactful single category: every frontier model tested is vulnerable, and current safety evaluation infrastructure is structurally calibrated for single-turn testing.
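
The three-layer attack taxonomy can be held as a flat list of records, which makes the counts easy to check. The sketch below transcribes the 11 categories and their PROD/LAB tags from the list above; the tuple layout and the helper name `count_by` are illustrative assumptions. It does not encode which mechanisms enable which attacks, since that mapping is not given here.

```python
from collections import Counter

# (layer, tag, attack category), transcribed from the attack list.
ATTACKS = [
    ("model", "PROD", "Single-turn prompt attacks"),
    ("model", "PROD", "Multi-turn escalation attacks"),
    ("model", "PROD", "Indirect injection via untrusted content"),
    ("model", "PROD", "Memory and context poisoning"),
    ("model", "PROD", "Multimodal input attacks"),
    ("infrastructure", "PROD", "Tool-chain exploitation"),
    ("infrastructure", "PROD", "Identity and privilege abuse"),
    ("infrastructure", "LAB", "Inter-agent injection"),
    ("infrastructure", "PROD", "Supply-chain attacks on agent infrastructure"),
    ("infrastructure", "PROD", "Denial-of-service and resource exhaustion"),
    ("operations", "LAB", "Configuration exploitation"),
]


def count_by(field):
    """Count attack categories by tuple position: 0 for layer, 1 for tag."""
    return Counter(record[field] for record in ATTACKS)


if __name__ == "__main__":
    print(len(ATTACKS))  # total categories
    print(count_by(0))   # categories per layer
    print(count_by(1))   # categories per PROD/LAB tag
```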

This framework informs every governance engagement we deliver.
