Four Ways AI Agents Fail When the Stakes Are High

Imagine your AI agent correctly identifies that a patient is showing early signs of respiratory failure — and then tells them to wait 48 hours before going to the emergency room. That is not a theoretical scenario. It happened in a peer-reviewed study from the Icahn School of Medicine at Mount Sinai, testing ChatGPT Health in structured medical triage situations. For Chief AI Officers, safety executives, and anyone deploying agents in high-stakes environments, this study is not a medical news story. It is a diagnostic report on how AI agents fail — and the four failure patterns it exposes show up across every sector.

What the Mount Sinai Study Actually Found

Mount Sinai researchers ran ChatGPT Health — OpenAI's dedicated health advisory product — through a structured set of triage scenarios covering a spectrum from minor complaints to life-threatening emergencies. The headline result: in 51.6% of cases where a hospital visit was medically necessary, the system recommended either staying home or booking a routine appointment.

Alex Ruani, a doctoral researcher in health misinformation at University College London, described the finding as "unbelievably dangerous." But beyond the specific medical context, what makes this study exceptional is the methodological rigour. The researchers used controlled scenario variations — same clinical presentation, different framing inputs — which makes the failure modes visible in a way that standard accuracy benchmarks never would.

Four structural failure modes emerged. None of them are specific to healthcare. All of them are almost certainly present in your enterprise agents today.

Failure Mode 1: The Inverted U — Your Agent Is Most Wrong Where It Matters Most

ChatGPT Health performed well on textbook emergencies — classical stroke, severe anaphylaxis, the scenarios every medical student drills. It also handled clearly minor conditions reasonably well. The failures concentrated at the edges: presentations that looked ambiguous, emergencies that did not follow the classic pattern, or non-urgent cases that mimicked something serious.

This is a known structural property of large language models. They are trained on distributions where routine, middle-of-the-bell-curve cases dominate. The edges — where training data is sparse — are exactly where performance degrades. And those edges are often where the stakes are highest.

The practical implication is sharp:

  • Accounts payable agents process routine invoices perfectly but miss the duplicate that has been subtly modified.
  • Claims processing agents handle straightforward fender benders fine but fail to flag the third claim from the same address within fourteen months.
  • Compliance screening agents correctly classify standard jurisdictions but mishandle the edge cases that carry the highest regulatory exposure.

The problem is invisible on standard dashboards. An 87% aggregate accuracy score looks good. But if that remaining 13% is concentrated precisely on the anomalies and edge cases — the ones that are by definition rare in training data — your dashboard is actively misleading you. No evaluation suite measuring average accuracy will surface this. You need tail-distribution testing, adversarial edge cases, and deliberately constructed out-of-distribution scenarios in your evaluation protocol.
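
The dashboard problem can be made concrete with a few lines of stratified scoring. This is an illustrative sketch with invented numbers, not the study's methodology: a hypothetical evaluation run where an 87% aggregate score hides near-total failure on the rare edge-case stratum.

```python
from collections import defaultdict

def stratified_accuracy(results):
    """Break an aggregate accuracy score down by case stratum.

    `results` is a list of (stratum, correct) pairs, where stratum is a
    label such as "routine" or "edge" and correct is a bool.
    """
    by_stratum = defaultdict(lambda: [0, 0])  # stratum -> [n_correct, n_total]
    for stratum, correct in results:
        by_stratum[stratum][0] += int(correct)
        by_stratum[stratum][1] += 1
    return {s: c / t for s, (c, t) in by_stratum.items()}

# Hypothetical evaluation run: 100 cases, 87 correct overall, but the
# errors are concentrated almost entirely in the edge-case stratum.
results = ([("routine", True)] * 85 + [("routine", False)] * 2
           + [("edge", True)] * 2 + [("edge", False)] * 11)

aggregate = sum(correct for _, correct in results) / len(results)
print(f"aggregate: {aggregate:.0%}")       # aggregate: 87%
for stratum, acc in stratified_accuracy(results).items():
    print(f"{stratum}: {acc:.0%}")         # routine: 98%, edge: 15%
```

The aggregate number passes most acceptance thresholds; the per-stratum breakdown shows the agent is wrong on 11 of the 13 cases that matter most. Any evaluation harness that cannot produce the second view will certify the failure mode described above.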

Failure Mode 2: The Agent Knows, Then Acts Differently

This one is harder to accept. In Mount Sinai's study, the system's own explanations correctly identified dangerous clinical findings. The reasoning chain said "early respiratory failure." The final output said "wait."

This is not a rare glitch. Research on chain-of-thought faithfulness shows it is a structural property of how language models generate outputs. The reasoning trace and the final answer frequently operate as semi-independent processes. Studies have found that inserting incorrect reasoning chains does not reliably change model outputs — meaning the link between stated reasoning and actual response is far weaker than it appears. Oxford's AI Governance Initiative has argued that chain-of-thought reasoning is fundamentally unreliable as an explanation of a model's decision process.

So if your compliance agent correctly identifies an enhanced due diligence jurisdiction in its reasoning trace and then classifies the case as standard risk in the output, you will not catch that unless you are systematically comparing reasoning traces to final outputs at scale.

The practical implication:

  • Do not treat reasoning traces as audit trails. They are not reliable records of why the model did what it did.
  • Build output-versus-reasoning comparison into your QA process. Sample regularly. Look for divergence patterns. Treat divergence as a defect, not a curiosity.
  • Architectural solutions are required. If chain-of-thought faithfulness cannot be fixed at the model level, it must be addressed through workflow design — structured output validation, decision gates, and human review at the output stage, not the reasoning stage.
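
A minimal version of that output-versus-reasoning comparison can run as a post-hoc check over logged traces. The marker vocabulary and labels below are hypothetical, standing in for whatever your domain risk taxonomy actually defines:

```python
import re

# Hypothetical severity markers for a compliance-screening agent. In
# practice these would come from an expert-defined risk taxonomy, not
# from this hard-coded list.
HIGH_RISK_MARKERS = re.compile(
    r"enhanced due diligence|sanctioned|high[- ]risk jurisdiction",
    re.IGNORECASE,
)

def reasoning_output_diverges(reasoning: str, final_label: str) -> bool:
    """Flag cases where the reasoning trace names a high-risk finding
    but the final output classifies the case as standard risk."""
    reasoning_says_high = bool(HIGH_RISK_MARKERS.search(reasoning))
    output_says_high = final_label.lower() == "high"
    return reasoning_says_high and not output_says_high

# The Mount Sinai pattern, transposed: the trace identifies the danger,
# the final answer ignores it.
trace = ("The counterparty operates from a high-risk jurisdiction "
         "requiring enhanced due diligence.")
assert reasoning_output_diverges(trace, "standard")   # divergence: a defect
assert not reasoning_output_diverges(trace, "high")   # consistent: fine
```

Keyword matching is deliberately crude here; the point is the comparison itself. Run it over sampled production traffic, track the divergence rate over time, and treat any sustained rate above your threshold as a defect category, not noise.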

Failure Mode 3: Unstructured Language Hijacks Structured Data

When a family member in the study scenario minimised the patient's symptoms — simply saying "the patient looks fine" — the system became 12 times more likely to recommend less urgent care. The structured clinical data did not change. The framing did. And the framing won.

This is anchoring bias at the system level, and it generalises immediately to enterprise contexts:

  • A vendor selection recommendation accompanied by a note from a senior VP expressing confidence should receive the same analysis as the identical recommendation without that note. It will not.
  • A loan application with an employer letter describing the applicant as a "valued longtime employee" may receive a different AI risk assessment than an identical application without it — not because the financial data changed, but because the positive framing biases the output.
  • A fraud detection agent processing both transaction logs and free-text incident descriptions will systematically under-flag cases where the employee's narrative is reassuring, regardless of what the structured data shows.

The critical point from the study methodology: this bias is invisible on standard evaluations. You only see it when you run the same scenario with and without the framing variable — which almost no production evaluation does. Building paired-scenario testing into your agent evaluation framework is not optional if you are deploying agents in decision-making roles that mix structured and unstructured inputs.
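
A paired-scenario harness is small to build. The sketch below assumes an agent callable that takes structured data plus an optional narrative; the toy agent, field names, and framings are invented to show the failure pattern, not a real system:

```python
def paired_framing_test(agent, structured_case, framings):
    """Run the same structured input under different narrative framings
    and report any decision that changes with framing alone."""
    baseline = agent(structured_case, narrative=None)
    flips = {}
    for name, narrative in framings.items():
        decision = agent(structured_case, narrative=narrative)
        if decision != baseline:
            flips[name] = decision
    return baseline, flips

# Toy agent that (incorrectly) lets reassuring language override the
# structured data -- the defect this test exists to catch.
def toy_agent(case, narrative):
    risk = "high" if case["prior_claims"] >= 3 else "low"
    if narrative and "looks fine" in narrative:
        risk = "low"
    return risk

case = {"prior_claims": 3}
framings = {"reassuring": "The claimant looks fine.",
            "neutral": "No comment provided."}
baseline, flips = paired_framing_test(toy_agent, case, framings)
print(baseline, flips)   # high {'reassuring': 'low'}
```

Any non-empty `flips` dictionary means narrative framing changed a decision that the structured data should have determined. In a pre-production gate, that is a hard failure.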

Failure Mode 4: Guardrails That Match Appearance, Not Risk

Mount Sinai's team found that ChatGPT Health's crisis intervention system activated unpredictably. It fired more reliably when patients described vague emotional distress than when they articulated a concrete threat of self-harm. Mount Sinai's Chief AI Officer described it directly: the alerts were inverted relative to clinical risk.

The guardrails were matching on surface language patterns — emotional keywords, tone — rather than on the actual risk taxonomy. This is the distinction between the appearance of safety and actual safety.

The enterprise version of this failure is common:

  • A data loss prevention agent flags an email labelled "Confidential Financial Data" that contains a public earnings press release sent to an approved distribution list — while not flagging an employee who exports 50,000 customer records to a personal cloud storage account, because the description says "project files backup."
  • A content moderation agent blocks a clearly labelled satirical post while passing through a subtly framed defamatory claim, because the satire uses more emotionally charged language.
  • A security monitoring agent generates alerts based on terminology in employee communications rather than on behavioural indicators that actually correlate with insider threat.

Guardrails built on pattern matching will always be gamed — by adversarial actors who understand the patterns, and by context that happens to use the wrong vocabulary. Risk-based guardrails require explicit risk taxonomies defined by domain experts, not inferred from language distributions.
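
The distinction can be shown with two toy guardrails over the same event stream. Every name here is illustrative, not a real DLP API: the keyword guardrail matches surface labels, while the taxonomy guardrail matches behaviour that domain experts have defined as risky (export volume, destination class).

```python
SENSITIVE_WORDS = {"confidential", "financial"}

def keyword_guardrail(event):
    # Surface matching: fires on how the event is described.
    return any(w in event["label"].lower() for w in SENSITIVE_WORDS)

def taxonomy_guardrail(event):
    # Expert-defined rule: bulk export of customer records to an
    # unmanaged destination is high risk, whatever it is called.
    return (event["record_count"] > 10_000
            and event["destination"] == "personal_cloud")

press_release = {"label": "Confidential Financial Data",
                 "record_count": 1, "destination": "approved_list"}
bulk_export   = {"label": "project files backup",
                 "record_count": 50_000, "destination": "personal_cloud"}

# The keyword guardrail fires on the harmless event and misses the real
# one; the behaviour-based guardrail inverts that.
print(keyword_guardrail(press_release), keyword_guardrail(bulk_export))    # True False
print(taxonomy_guardrail(press_release), taxonomy_guardrail(bulk_export))  # False True
```

The keyword version is also trivially gamed: an adversary who knows the word list simply avoids it, exactly as the "project files backup" label does here. The behaviour-based rule cannot be evaded by renaming.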

A Four-Layer Architecture for Agent Accountability

The study is not just a list of problems. It implies a response. For organisations deploying agents in consequential roles, four layers of architectural investment address these failure modes directly:

  1. Tail-distribution evaluation: Replace or supplement average-accuracy benchmarks with adversarial edge case testing, out-of-distribution scenario libraries, and explicit coverage requirements for rare but high-stakes cases. Your evaluation dataset should be deliberately skewed toward the cases your agent is most likely to get wrong.
  2. Output-reasoning divergence monitoring: Build systematic comparison of reasoning traces to final outputs into your production QA pipeline. Flag divergence above a threshold. Treat it as a defect category. Sample regularly enough to detect patterns, not just individual failures.
  3. Structured-data primacy enforcement: When your agent processes both structured and unstructured inputs, architectural controls should enforce that structured data drives the decision and that unstructured language is handled as contextual annotation only. Paired-scenario testing — same structured data, varied narrative framing — should be part of your pre-production validation.
  4. Risk-taxonomy-grounded guardrails: Guardrail design should start with a domain expert-defined risk taxonomy, not with language pattern libraries. Test guardrails against adversarial inputs specifically crafted to evade surface-level detection. Audit guardrail activation rates against known-risk cases regularly.

None of this is easy to retrofit onto agents already in production. But the architecture is understood. The question is whether organisations are willing to invest in it before the equivalent of the Mount Sinai findings surfaces in their own operations.

What This Means for AI Governance

The ChatGPT Health study is a gift to the AI governance field, even if it is an uncomfortable one. It provides a controlled, peer-reviewed demonstration of failure modes that are theoretical in most AI risk frameworks but concrete in this data. The four patterns — inverted U performance, reasoning-output divergence, context anchoring bias, and appearance-based guardrails — are not confined to healthcare. They are properties of the technology class.

For Chief AI Officers and board-level risk committees, the governance implication is direct: aggregate accuracy metrics are not sufficient for consequential agent deployments. Evaluation frameworks, audit protocols, and guardrail design standards need to be built around the specific failure modes that matter for your domain — which means starting with the failure modes, not with the technology capabilities.

The agents are already deployed. The question is what governance infrastructure we are building around them, and how quickly.

References

  • Mount Sinai Health System (2026). Research Identifies Blind Spots in AI Medical Triage.
  • Icahn School of Medicine at Mount Sinai (2026). First independent evaluation of ChatGPT Health safety in medical triage.
  • Ruani, A., University College London (2026). Doctoral research on health misinformation mitigation.
  • Oxford AI Governance Initiative. Chain-of-thought reliability as a decision explanation mechanism.
  • Peer-reviewed study (2026). "Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI."