Imagine your AI agent correctly identifies that a patient is showing early signs of respiratory failure — and then tells them to wait 48 hours before going to the emergency room. That is not a theoretical scenario. It happened in a peer-reviewed study from the Icahn School of Medicine at Mount Sinai, testing ChatGPT Health in structured medical triage situations. For Chief AI Officers, safety executives, and anyone deploying agents in high-stakes environments, this study is not a medical news story. It is a diagnostic report on how AI agents fail — and the four failure patterns it exposes show up across every sector.
Mount Sinai researchers ran ChatGPT Health — OpenAI's dedicated health advisory product — through a structured set of triage scenarios covering a spectrum from minor complaints to life-threatening emergencies. The headline result: in 51.6% of cases where a hospital visit was medically necessary, the system recommended either staying home or booking a routine appointment.
Alex Ruani, a doctoral researcher in health misinformation at University College London, described the finding as "unbelievably dangerous." But beyond the specific medical context, what makes this study exceptional is its methodological rigour. The researchers used controlled scenario variations — same clinical presentation, different framing inputs — an approach that makes the failure modes visible in a way that standard accuracy benchmarks never would.
Four structural failure modes emerged. None of them are specific to healthcare. All of them are almost certainly present in your enterprise agents today.
ChatGPT Health performed well on textbook emergencies — classical stroke, severe anaphylaxis, the scenarios every medical student drills. It also handled clearly minor conditions reasonably well. The failures concentrated at the edges: presentations that looked ambiguous, emergencies that did not follow the classic pattern, or non-urgent cases that mimicked something serious.
This is a known structural property of large language models. They are trained on distributions where routine, middle-of-the-bell-curve cases dominate. The edges — where training data is sparse — are exactly where performance degrades. And those edges are often where the stakes are highest.
The practical implication is sharp:
The problem is invisible on standard dashboards. An 87% aggregate accuracy score looks good. But if that remaining 13% is concentrated precisely on the anomalies and edge cases — the ones that are by definition rare in training data — your dashboard is actively misleading you. No evaluation suite measuring average accuracy will surface this. You need tail-distribution testing, adversarial edge cases, and deliberately constructed out-of-distribution scenarios in your evaluation protocol.
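To make tail-distribution testing concrete, here is a minimal sketch of stratified evaluation in Python. It assumes you label each evaluation case with a rarity stratum and supply your own prediction function; the field names, strata, and the commented-out `my_agent.triage` call are illustrative placeholders, not part of the Mount Sinai protocol.

```python
from collections import defaultdict

def stratified_accuracy(cases, predict):
    """Accuracy per stratum ('routine', 'atypical', 'edge') instead of one
    aggregate number, so performance on the rare cases stays visible."""
    totals, correct = defaultdict(int), defaultdict(int)
    for case in cases:
        stratum = case["stratum"]  # hypothetical rarity/severity tag
        totals[stratum] += 1
        if predict(case["input"]) == case["expected"]:
            correct[stratum] += 1
    return {s: correct[s] / totals[s] for s in totals}

# Illustrative evaluation set: aggregate accuracy can look strong while the
# 'edge' stratum, built deliberately from atypical and out-of-distribution
# presentations, collapses.
cases = [
    {"input": "classic stroke presentation", "expected": "ER", "stratum": "routine"},
    {"input": "atypical cardiac presentation, young patient", "expected": "ER", "stratum": "edge"},
]
# report = stratified_accuracy(cases, predict=my_agent.triage)
```

The single number on your dashboard is the weighted average of these strata; the strata themselves are what the study suggests you should be watching.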
This one is harder to accept. In Mount Sinai's study, the system's own explanations correctly identified dangerous clinical findings. The reasoning chain said "early respiratory failure." The final output said "wait."
This is not a rare glitch. Research on chain-of-thought faithfulness shows it is a structural property of how language models generate outputs. The reasoning trace and the final answer frequently operate as semi-independent processes. Studies have found that inserting incorrect reasoning chains does not reliably change model outputs — meaning the link between stated reasoning and actual response is far weaker than it appears. Oxford's AI Governance Initiative has argued that chain-of-thought reasoning is fundamentally unreliable as an explanation of a model's decision process.
So, if your compliance agent is correctly identifying an enhanced due diligence jurisdiction in its reasoning trace and then classifying the case as standard risk in the output, you will not catch that unless you are systematically comparing reasoning traces to final outputs at scale.
The practical implication: treat the reasoning trace as another model output to be audited, not as a trustworthy explanation, and compare traces against final decisions systematically and at scale rather than by spot-checking the cases that already look suspicious.
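What that comparison can look like in practice is sketched below, under explicit assumptions: each agent response exposes a reasoning trace and a final decision, and domain experts maintain a table mapping findings named in a trace to the minimum decision they imply. Every identifier here is hypothetical and stands in for your own audit tooling.

```python
import re

# Expert-defined mapping from findings named in a reasoning trace to the
# minimum decision they should imply. Maintained by domain experts, not
# inferred by the model.
TRACE_IMPLICATIONS = {
    r"respiratory failure": "ER",
    r"enhanced due diligence": "escalate",
}

def audit_trace_vs_output(trace: str, final_decision: str) -> list[str]:
    """Flag cases where the trace names a finding whose implied decision is
    stronger than the decision the agent actually returned."""
    flags = []
    for pattern, implied in TRACE_IMPLICATIONS.items():
        if re.search(pattern, trace, re.IGNORECASE) and final_decision != implied:
            flags.append(f"trace mentions '{pattern}' but output is '{final_decision}'")
    return flags

# Mirrors the pattern the study reports: the trace names the danger,
# the output waves it away.
flags = audit_trace_vs_output(
    trace="Findings consistent with early respiratory failure.",
    final_decision="wait 48 hours",
)
print(flags)
```

Run at scale over logged traffic, the interesting metric is not any single flag but the divergence rate, and how it moves over time and across case types.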
When a family member in the study scenario minimised the patient's symptoms — simply saying "the patient looks fine" — the system became 12 times more likely to recommend less urgent care. The structured clinical data did not change. The framing did. And the framing won.
This is anchoring bias at the system level, and it generalises immediately to enterprise contexts: any agent that blends structured records with free-text framing is exposed to the same override, whether that framing is a customer email insisting an incident is minor or an analyst note describing a counterparty as low risk.
The critical point from the study methodology: this bias is invisible on standard evaluations. You only see it when you run the same scenario with and without the framing variable — which almost no production evaluation does. Building paired-scenario testing into your agent evaluation framework is not optional if you are deploying agents in decision-making roles that mix structured and unstructured inputs.
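A minimal sketch of paired-scenario testing follows, assuming your agent accepts structured facts plus an optional free-text context; the `triage` signature and the framing sentences are placeholders for your own domain.

```python
def framing_sensitivity(base_cases, framings, triage):
    """Run each case with and without a framing sentence. The structured facts
    never change; only the free-text framing does. Returns the fraction of
    (case, framing) pairs where the decision flipped from baseline."""
    flips, total = 0, 0
    for case in base_cases:
        baseline = triage(case["facts"], context="")
        for framing in framings:  # e.g. "the patient looks fine"
            total += 1
            if triage(case["facts"], context=framing) != baseline:
                flips += 1
    return flips / total if total else 0.0

# Hypothetical usage against your own agent:
# rate = framing_sensitivity(cases, ["they seem fine", "this looks serious"], my_agent.triage)
```

A non-trivial flip rate is the enterprise analogue of the study's twelve-fold shift: the structured inputs did not change, but the decision did.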
Mount Sinai's team found that ChatGPT Health's crisis intervention system activated unpredictably. It fired more reliably when patients described vague emotional distress than when they articulated a concrete threat of self-harm. Mount Sinai's Chief AI Officer described it directly: the alerts were inverted relative to clinical risk.
The guardrails were matching on surface language patterns — emotional keywords, tone — rather than on the actual risk taxonomy. This is the distinction between the appearance of safety and actual safety.
The enterprise version of this failure is common: guardrails that key on vocabulary and tone rather than substance, firing on emotionally charged or obviously risky wording while letting a calmly phrased, genuinely high-risk request pass straight through.
Guardrails built on pattern matching will always be gamed — by adversarial actors who understand the patterns, and by context that happens to use the wrong vocabulary. Risk-based guardrails require explicit risk taxonomies defined by domain experts, not inferred from language distributions.
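The contrast can be made concrete with a deliberately simplified sketch: one guardrail keyed to vocabulary, one keyed to an explicit taxonomy applied to extracted risk factors. The keyword list, taxonomy entries, and severity ordering are stand-ins for what domain experts would actually define, and a production system would extract risk factors with a dedicated model rather than take them as input.

```python
# Surface pattern matching: fires on vocabulary and tone, not on risk.
EMOTIONAL_KEYWORDS = {"overwhelmed", "hopeless", "scared"}

def keyword_guardrail(text: str) -> bool:
    return any(word in text.lower() for word in EMOTIONAL_KEYWORDS)

# Taxonomy-based guardrail: escalation driven by an explicit, expert-defined
# risk taxonomy over extracted risk factors, regardless of phrasing.
RISK_TAXONOMY = {
    "concrete_self_harm_plan": "crisis_escalation",
    "acute_clinical_deterioration": "emergency_referral",
    "vague_distress_no_plan": "routine_follow_up",
}
SEVERITY_ORDER = ["crisis_escalation", "emergency_referral", "routine_follow_up"]

def taxonomy_guardrail(risk_factors: list[str]) -> str:
    """Return the highest-severity action implied by any extracted factor."""
    actions = {RISK_TAXONOMY[f] for f in risk_factors if f in RISK_TAXONOMY}
    for action in SEVERITY_ORDER:
        if action in actions:
            return action
    return "no_action"

# The study's inversion in miniature: calm, concrete risk slips past the
# keyword filter, while vague distress trips it.
print(keyword_guardrail("I have a plan and the means to act on it tonight"))  # False
print(keyword_guardrail("I just feel so overwhelmed and hopeless"))           # True
print(taxonomy_guardrail(["concrete_self_harm_plan"]))                        # crisis_escalation
```

The second guardrail is only as good as the taxonomy behind it, which is the point: that taxonomy has to come from domain experts, not from language statistics.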
The study is not just a list of problems. It implies a response. For organisations deploying agents in consequential roles, four layers of architectural investment address these failure modes directly: tail-distribution and out-of-distribution evaluation that measures the edges rather than the average; systematic auditing of reasoning traces against final outputs; paired-scenario testing that varies framing while holding the underlying facts constant; and guardrails built on explicit, expert-defined risk taxonomies rather than surface language patterns.
None of this is easy to retrofit onto agents already in production. But the architecture is understood. The question is whether organisations are willing to invest in it before the equivalent of the Mount Sinai findings surfaces in their own operations.
The ChatGPT Health study is a gift to the AI governance field, even if it is an uncomfortable one. It provides a controlled, peer-reviewed demonstration of failure modes that are theoretical in most AI risk frameworks but concrete in this data. The four patterns — inverted U performance, reasoning-output divergence, context anchoring bias, and appearance-based guardrails — are not confined to healthcare. They are properties of the technology class.
For Chief AI Officers and board-level risk committees, the governance implication is direct: aggregate accuracy metrics are not sufficient for consequential agent deployments. Evaluation frameworks, audit protocols, and guardrail design standards need to be built around the specific failure modes that matter for your domain — which means starting with the failure modes, not with the technology capabilities.
The agents are already deployed. The question is what governance infrastructure we are building around them, and how quickly.