THE ALIGNMENT-PRODUCTION CONTINUUM

The full diagnostic framework.

How alignment research findings become production engineering problems, mapped using a structure borrowed from medical diagnostics. 16 observable symptoms, 18 underlying mechanisms, 11 adversarial attack categories. Every governance decision we make traces back to this framework.

THE CONTINUUM

Research and production are converging.

Alignment research and production AI engineering used to be separate worlds. Researchers studied theoretical risks. Engineers shipped systems. The gap between them was wide enough that each could ignore the other.

That gap is closing. Failure modes that were lab curiosities twelve months ago are now production incidents. Sycophancy compounds over multi-turn deployments. Reward hacking in training can generalize to unrelated misaligned behaviors. Models can fake alignment during evaluation. Agents can take harmful actions under goal pressure. Each capability advance moves more theoretical risks into the production column.

A consultancy that ignores the research will ship systems that fail in ways the research community could have warned about six months earlier. A research lab that ignores deployment loses access to the empirical reality that makes its work matter. AE operates across the full continuum.

THE DIAGNOSTIC FRAMEWORK

Symptoms and mechanisms.

We map AI failure using a structure borrowed from medical diagnostics. An AI system can fail in observable ways (symptoms) produced by underlying causal processes (mechanisms). Some mechanisms produce many symptoms. Some symptoms can be produced by many mechanisms. The diagnostic value lies in identifying the correct mechanism, because the mechanism determines the right response: a fix, a partial mitigation, containment, or monitoring. Many mechanisms do not have reliable fixes yet. Knowing that is itself critical information for deciding how much autonomy to grant. We also map the adversarial attack surface: how external actors can exploit these mechanisms deliberately.

16 symptoms

Observable behaviors. Each traces to one or more mechanisms.

Tag | Symptom | Status: current responses | Mechanisms
PROD | Fabrication | treatable: citations, factuality reranking, governed RAG | M1, M5
PROD | Untrusted input as instruction | treatable: input filtering, defense-in-depth, confirmation gates | M2, M5
PROD | Scope violation | treatable: least-privilege, sandboxing, HITL gates | M3, M12
PROD | Context degradation | treatable: compaction, re-grounding, memory separation | M4, M5
PROD | Principal information shaping | partial: raw-source access, citation audits, adversarial review | M5, M7, M12, M16
PROD | Sensitive information exposure | treatable: memory isolation, output filtering, egress monitoring | M1, M2, M4, M12
LAB | Observation-dependent behavior | partial: matched-context testing, activation probes | M6, M8, M11, M15, M16
LAB | Strategic deception under pressure | hard: deception is designed to evade detection | M5, M8, M11
LAB | Oversight structure interference | treatable: immutable logging, no model write access, cryptographic attestation | M8, M12, M16
LAB | Harmful instrumental action | partial: access limits, escalation channels, action logging | M6, M11, M12, M16
LAB | Continuity preservation | treatable: no weight access, egress monitoring, capability containment | M6, M12, M14, M16
LAB | Inter-agent coordination | partial: isolation, communication monitoring; implicit coordination harder | M10, M12, M13, M14
LAB | Triggered conditional defection | partial: activation probes (~99% AUROC in narrow settings) | M9, M15
LAB | Training-objective divergence | partial: OOD evaluation, adversarial testing, inoculation prompting | M7, M8, M16
PROD | Multimodal input failure | partial: OCR scanning, re-encoding; no general solution | M2
LAB | Chain-of-thought unfaithfulness | hard: model reasoning may not reflect actual computation | M5, M8, M11
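
The many-to-many relationship between symptoms and mechanisms can be pictured as a small lookup structure. The following is a minimal sketch, not a description of any production tooling: the mapping is transcribed from the symptom list above, while the function names (`mechanisms_to_symptoms`, `candidate_mechanisms`) and the ranking rule (count how many observed symptoms each mechanism could explain) are illustrative assumptions.

```python
from collections import defaultdict

# Symptom -> mechanism IDs, transcribed from the symptom list above.
SYMPTOM_MECHANISMS = {
    "Fabrication": ["M1", "M5"],
    "Untrusted input as instruction": ["M2", "M5"],
    "Scope violation": ["M3", "M12"],
    "Context degradation": ["M4", "M5"],
    "Principal information shaping": ["M5", "M7", "M12", "M16"],
    "Sensitive information exposure": ["M1", "M2", "M4", "M12"],
    "Observation-dependent behavior": ["M6", "M8", "M11", "M15", "M16"],
    "Strategic deception under pressure": ["M5", "M8", "M11"],
    "Oversight structure interference": ["M8", "M12", "M16"],
    "Harmful instrumental action": ["M6", "M11", "M12", "M16"],
    "Continuity preservation": ["M6", "M12", "M14", "M16"],
    "Inter-agent coordination": ["M10", "M12", "M13", "M14"],
    "Triggered conditional defection": ["M9", "M15"],
    "Training-objective divergence": ["M7", "M8", "M16"],
    "Multimodal input failure": ["M2"],
    "Chain-of-thought unfaithfulness": ["M5", "M8", "M11"],
}


def mechanisms_to_symptoms(mapping):
    """Invert the table: mechanism ID -> symptoms it can produce."""
    inverted = defaultdict(list)
    for symptom, mechanisms in mapping.items():
        for mechanism in mechanisms:
            inverted[mechanism].append(symptom)
    return dict(inverted)


def candidate_mechanisms(observed, mapping=SYMPTOM_MECHANISMS):
    """Rank mechanisms by how many of the observed symptoms they could explain.

    ASSUMPTION: a simple count is used as the ranking rule; ties are broken
    by mechanism number. A real diagnostic process would weigh more evidence.
    """
    counts = defaultdict(int)
    for symptom in observed:
        for mechanism in mapping[symptom]:
            counts[mechanism] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], int(kv[0][1:])))
```

For example, `candidate_mechanisms(["Fabrication", "Context degradation"])` ranks M5 first, because it is the only mechanism listed under both symptoms. Inverting the table also shows which mechanisms sit behind the most symptoms: M12 appears under seven of the sixteen.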

18 mechanisms

Underlying causes. Tagged by treatment status.

Tag | ID | Mechanism | Treatment status
PROD | M1 | Distributional generation | partial fix
PROD | M2 | Architectural conflation | no root fix; mitigations exist
PROD | M3 | Over-broad permissions | treatable
PROD | M4 | Attention distribution limits | improving
PROD | M5 | RLHF preference distortion | partial fix
LAB | M6 | Training data imitation | narrow fix
LAB | M7 | Goal misgeneralization | partial fix
LAB | M8 | Reward hacking generalization | narrow fix
LAB | M9 | Training pipeline compromise | partial fix
LAB | M10 | Competitive optimization pressure | no root fix; theoretical only
LAB | M11 | In-context scheming | narrow fix
LAB | M12 | Instrumental convergence | no root fix; containment partial
LAB | M13 | Covert inter-agent coordination | no root fix; isolation partial
LAB | M14 | Peer-preservation | no root fix; isolation partial
THEO | M15 | Deceptive alignment | no fix; open problem
THEO | M16 | Mesa-optimization | no fix; open problem
LAB | M17 | Compression-induced misalignment | partial fix
LAB | M18 | Reward model overoptimization | partial fix

Most mechanisms without a root-cause fix still have practical responses: containment, architectural isolation, monitoring, least-privilege constraints. Only two (deceptive alignment and mesa-optimization) are genuine open problems with no operational mitigation. The governance architecture is about applying the right response to each mechanism, and making informed decisions about how much autonomy to grant based on what's treatable today.
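
To make "what's treatable today" concrete, the treatment statuses above can be encoded and queried. This is a sketch under stated assumptions: the status labels come from the mechanism list, but the tractability ordering in `TRACTABILITY` and the helper names are hypothetical, introduced only to illustrate how a review might surface the least tractable mechanism behind a symptom.

```python
# Mechanism ID -> treatment status, transcribed from the mechanism list above.
MECHANISM_STATUS = {
    "M1": "partial fix", "M2": "no root fix", "M3": "treatable",
    "M4": "improving", "M5": "partial fix", "M6": "narrow fix",
    "M7": "partial fix", "M8": "narrow fix", "M9": "partial fix",
    "M10": "no root fix", "M11": "narrow fix", "M12": "no root fix",
    "M13": "no root fix", "M14": "no root fix", "M15": "no fix",
    "M16": "no fix", "M17": "partial fix", "M18": "partial fix",
}

# ASSUMPTION: an illustrative ordering from most to least tractable.
# The source lists these statuses but does not rank them.
TRACTABILITY = [
    "treatable", "improving", "partial fix",
    "narrow fix", "no root fix", "no fix",
]


def open_problems(status=MECHANISM_STATUS):
    """Mechanisms tagged 'no fix', i.e. no operational mitigation listed."""
    return [m for m, s in status.items() if s == "no fix"]


def least_tractable(mechanism_ids, status=MECHANISM_STATUS):
    """Return the mechanism whose status is hardest to treat.

    On ties, the first such mechanism in the input order is returned.
    """
    return max(mechanism_ids, key=lambda m: TRACTABILITY.index(status[m]))
```

Under this ordering, a symptom that traces to M3 and M12 would be governed by M12, where the listed response is containment rather than a root-cause fix.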

BY SYSTEM TYPE

Different systems, different risk surfaces.

The governance architecture scales with the system's complexity and autonomy. A simple assistant needs different controls than a multi-agent strategic system.

SINGLE-TURN ASSISTANT

Fabrication, injection, sycophancy. Standard engineering mitigations. Well-understood.

TOOL-USING AGENT

Add scope violation. Least-privilege scoping is the highest-leverage intervention.
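
As a rough illustration of least-privilege scoping combined with a human-in-the-loop gate, the sketch below denies any tool that is not explicitly granted and routes sensitive tools through a confirmation callback. Everything here is hypothetical: `ToolPolicy`, `authorize`, `ScopeViolation`, and the tool names are invented for the example and do not refer to any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


class ScopeViolation(Exception):
    """Raised when an agent requests a tool outside its granted scope."""


@dataclass(frozen=True)
class ToolPolicy:
    # Tools the agent may call at all; anything else is denied by default.
    allowed: FrozenSet[str]
    # Subset of allowed tools that also require human sign-off.
    needs_confirmation: FrozenSet[str] = frozenset()


def authorize(policy: ToolPolicy, tool: str,
              confirm: Callable[[str], bool] = lambda tool: False) -> bool:
    """Return True if the call may proceed, False if sign-off was not given."""
    if tool not in policy.allowed:
        raise ScopeViolation(f"{tool!r} is outside the granted scope")
    if tool in policy.needs_confirmation:
        # The default callback declines, so gated tools fail closed.
        return confirm(tool)
    return True


# Hypothetical tool names, for illustration only.
policy = ToolPolicy(
    allowed=frozenset({"search_docs", "send_email"}),
    needs_confirmation=frozenset({"send_email"}),
)
```

With this policy, `authorize(policy, "search_docs")` proceeds, `authorize(policy, "send_email")` is declined until a human-backed `confirm` callback approves it, and any ungranted tool raises `ScopeViolation`. Failing closed on both checks is the design choice that keeps the permission surface small.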

LONG-HORIZON AGENT

Add context degradation and selective disclosure. Monitor for correlated symptoms that suggest a shared mechanism.

MULTI-AGENT SYSTEM

Add inter-agent coordination risks. Competitive dynamics create alignment degradation even with explicit honesty instructions.

STRATEGIC DECISION-MAKER

The full symptom surface is relevant. This is where the alignment-production continuum is tightest and governance matters most.
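
The progression above can be read as a cumulative risk surface, where each system type inherits the concerns of the simpler ones. The sketch below encodes that reading. It assumes "Add" means cumulative inheritance, uses the risk names exactly as they appear in the system-type descriptions, and leaves out the strategic decision-maker because the source describes its surface simply as the full symptom set. The names `SYSTEM_TYPES` and `risk_surface` are hypothetical.

```python
# Ordered from least to most complex; each entry lists only what it ADDS.
SYSTEM_TYPES = [
    ("single-turn assistant", ["fabrication", "injection", "sycophancy"]),
    ("tool-using agent", ["scope violation"]),
    ("long-horizon agent", ["context degradation", "selective disclosure"]),
    ("multi-agent system", ["inter-agent coordination"]),
]


def risk_surface(system_type):
    """Accumulate risks from the simplest system up to the requested one."""
    surface = []
    for name, added in SYSTEM_TYPES:
        surface.extend(added)
        if name == system_type:
            return surface
    raise KeyError(f"unknown system type: {system_type!r}")
```

Under this reading, a tool-using agent carries the three assistant-level risks plus scope violation, and a multi-agent system carries seven named risks before any strategic decision-making is added.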

ATTACK SURFACE

How adversaries exploit these mechanisms.

The same mechanisms that produce accidental failures can be exploited deliberately. We map 11 attack categories across three layers: model-level (exploiting how the model processes input), infrastructure-level (exploiting deployment configuration), and operations-level (exploiting how the system is managed).

MODEL-LEVEL ATTACKS

PROD Single-turn prompt attacks

PROD Multi-turn escalation attacks

PROD Indirect injection via untrusted content

PROD Memory and context poisoning

PROD Multimodal input attacks

INFRASTRUCTURE-LEVEL ATTACKS

PROD Tool-chain exploitation

PROD Identity and privilege abuse

LAB Inter-agent injection

PROD Supply-chain attacks on agent infrastructure

PROD Denial-of-service and resource exhaustion

OPERATIONS-LEVEL ATTACKS

LAB Configuration exploitation

Key finding: Two mechanisms (architectural conflation and over-broad permissions) are the proximate enablers of 9 of the 11 attack categories. Hardening systems against these two mechanisms therefore covers the majority of the adversarial surface. Multi-turn attacks are the most impactful single category: every frontier model tested is vulnerable, and current safety evaluation infrastructure is structurally calibrated for single-turn testing.
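
The three-layer attack taxonomy listed above can be kept as a simple catalog for tallying coverage. This is only a sketch: the entries are transcribed from the lists above, the helper names are hypothetical, and the mapping from attack categories to enabling mechanisms is not given in this section, so the code does not attempt to reproduce the 9-of-11 figure.

```python
from collections import Counter

# Layer -> (tag, attack category), transcribed from the lists above.
ATTACK_CATEGORIES = {
    "model": [
        ("PROD", "Single-turn prompt attacks"),
        ("PROD", "Multi-turn escalation attacks"),
        ("PROD", "Indirect injection via untrusted content"),
        ("PROD", "Memory and context poisoning"),
        ("PROD", "Multimodal input attacks"),
    ],
    "infrastructure": [
        ("PROD", "Tool-chain exploitation"),
        ("PROD", "Identity and privilege abuse"),
        ("LAB", "Inter-agent injection"),
        ("PROD", "Supply-chain attacks on agent infrastructure"),
        ("PROD", "Denial-of-service and resource exhaustion"),
    ],
    "operations": [
        ("LAB", "Configuration exploitation"),
    ],
}


def count_by_layer(catalog=ATTACK_CATEGORIES):
    """Number of attack categories in each layer."""
    return {layer: len(items) for layer, items in catalog.items()}


def count_by_tag(catalog=ATTACK_CATEGORIES):
    """Number of attack categories carrying each PROD/LAB tag."""
    return Counter(tag for items in catalog.values() for tag, _ in items)
```

A catalog like this makes it easy to check that an assessment has touched every listed category in each layer before it is signed off.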

This framework informs every governance engagement we deliver.
