There is a familiar pattern in agent projects. The demo works. The pilot looks convincing. The first internal users are impressed. Then production usage starts, and the system does not fail in one clean, obvious way. It gets slower. More expensive. Less predictable. It starts using the wrong context, retrying bad actions, or answering confidently when it should stop.
The LinkedIn post that triggered this article listed nine “silent killers” of AI agents in production. It is a useful checklist because these are not exotic research problems. They are normal engineering and governance problems that become sharper when a model can call tools, remember context, retrieve documents, spend tokens, and take actions.
Production is not where you discover whether an agent can impress people. Production is where you discover which failure modes your operating model can survive.
Every tool exposed to an agent has a cost. The model needs descriptions, schemas, examples, constraints, and sometimes previous tool results. Put forty tools in front of the agent and you may burn thousands of tokens before the user’s actual request is processed.
The visible symptoms are latency and cost. The more dangerous symptom is reduced decision quality. When the model is choosing from too many poorly separated tools, it becomes easier to call the wrong one, pass the wrong argument, or invent a workflow that no engineer intended.
Practical response: expose tools by task and phase, not by everything the platform can technically do. Lazy-load specialist tools. Keep descriptions short, tested, and versioned. Treat tool design as product interface design.
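As a rough illustration of exposing tools by task and phase, the sketch below filters a registry down to what the current phase needs. It is a minimal example under stated assumptions: the `Tool` record, the contents of `REGISTRY`, the phase names, the `max_tools` cap, and the four-characters-per-token estimate are all hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    phases: frozenset  # workflow phases in which this tool is offered


# Hypothetical registry: tool names and phases are illustrative only.
REGISTRY = [
    Tool("search_docs", "Search internal documents.", frozenset({"research"})),
    Tool("read_ticket", "Fetch a ticket by id.", frozenset({"research", "act"})),
    Tool("update_ticket", "Change ticket fields.", frozenset({"act"})),
    Tool("send_email", "Send a message to a customer.", frozenset({"act"})),
]


def tools_for(phase, registry=REGISTRY, max_tools=8):
    """Return only the tools relevant to the current phase, in registry order."""
    selected = [tool for tool in registry if phase in tool.phases]
    if len(selected) > max_tools:
        # A crowded phase is a design smell: split it rather than expose everything.
        raise ValueError(f"{len(selected)} tools in phase {phase!r}; limit is {max_tools}")
    return selected


def estimate_tokens(tools):
    """Crude size estimate, assuming roughly 4 characters per token."""
    return sum(len(t.name) + len(t.description) for t in tools) // 4
```

Calling `tools_for("research")` returns two of the four tools, so the model sees a smaller and better separated menu. The token estimate is deliberately crude; a real tokenizer would give different numbers.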
Long-running agent sessions create a false sense of continuity. The conversation may still be open, but the model’s attention to early instructions, constraints, and decisions degrades. Critical policy can be pushed into the past while recent tool output dominates the next action.
This is especially risky in regulated or operational environments. A requirement agreed at the start of a workflow can quietly stop influencing the behaviour near the end, exactly when the agent is preparing a final decision, upload, message, or code change.
Practical response: pin the non-negotiable instructions. Maintain compact state summaries. Reinject current objectives, constraints, and open risks at checkpoints. For long workflows, make the agent write and read its own run state rather than relying on conversational memory.
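One possible shape for a written run state is sketched below. It assumes the agent loop can rebuild its context at each checkpoint; `RunState`, `build_context`, the line prefixes, and the `keep_last` window are hypothetical names chosen for the example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunState:
    objective: str
    pinned_rules: list  # non-negotiable instructions, never summarised away
    decisions: list = field(default_factory=list)
    open_risks: list = field(default_factory=list)

    def to_json(self):
        """Serialise so the state can be stored outside the conversation."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw):
        return cls(**json.loads(raw))


def build_context(state, recent_messages, keep_last=5):
    """Put pinned state first, then only the most recent messages."""
    header = ["OBJECTIVE: " + state.objective]
    header += ["RULE: " + rule for rule in state.pinned_rules]
    header += ["DECIDED: " + decision for decision in state.decisions]
    header += ["OPEN RISK: " + risk for risk in state.open_risks]
    recent = list(recent_messages[-keep_last:]) if keep_last > 0 else []
    return header + recent
```

The point of the design is that the pinned rules are rebuilt from stored state on every checkpoint, so a hundred lines of tool output cannot push them out of view.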
Retrieval-augmented generation often fails before the model writes a single word. If the top retrieved chunks are outdated, irrelevant, contradictory, or taken from the wrong document, the answer may be beautifully grounded in the wrong material.
This is a common blind spot because teams evaluate the generated answer, not the retrieval pipeline that shaped it. The output looks confident, but the source context was already compromised by weak chunking, poor metadata, stale documents, or over-broad similarity matching.
Practical response: evaluate retrieval separately from generation. Use reranking, metadata filters, freshness rules, document ownership, and negative test cases. Audit what the agent saw, not only what it said.
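Evaluating retrieval on its own can start very small. The sketch below scores a retriever with recall at k and adds negative test cases in the form of documents that must not appear in the top results. The case format (`query`, `relevant`, `forbidden`) and the `retrieve` callable are assumptions for the example, not a standard interface.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


def evaluate_retrieval(retrieve, cases, k=3):
    """Score a retriever without involving the generator at all.

    retrieve: callable mapping a query string to a ranked list of document ids.
    cases: dicts with "query", "relevant" and optional "forbidden" id lists.
    Returns (mean recall@k, list of (query, leaked forbidden ids)).
    """
    recalls, violations = [], []
    for case in cases:
        ranked = retrieve(case["query"])
        if case["relevant"]:
            recalls.append(recall_at_k(ranked, case["relevant"], k))
        # Negative test: stale or off-limits documents must stay out of the top k.
        leaked = set(ranked[:k]) & set(case.get("forbidden", ()))
        if leaked:
            violations.append((case["query"], sorted(leaked)))
    mean_recall = sum(recalls) / len(recalls) if recalls else 0.0
    return mean_recall, violations
```

A retriever can score well on recall and still leak a superseded policy document into the context, which is why the two results are reported separately.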
Agents can be stubborn in a very expensive way. A tool call fails, the agent retries with almost the same input, the same error returns, and the loop continues until budget, context, or user patience is exhausted. In the worst cases, the final answer claims success because the agent has no better exit path.
This is not intelligence. It is missing control logic.
Practical response: set hard limits on tool calls, retries, elapsed time, and cost per task. Detect repeated errors and repeated arguments. Require a different strategy after failure, and define when the agent must escalate to a human.
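The control logic described here does not need to be elaborate. The following sketch caps the total number of tool calls and refuses to repeat a call that has already failed with identical arguments. `ToolCallGuard`, `EscalateToHuman`, and the default limits are hypothetical; elapsed-time and cost limits could be added in the same way.

```python
import json


class EscalateToHuman(Exception):
    """Raised when the agent must stop retrying and hand off."""


class ToolCallGuard:
    def __init__(self, max_calls=10, max_identical_failures=2):
        self.max_calls = max_calls
        self.max_identical_failures = max_identical_failures
        self.calls = 0
        self.failures = {}  # call fingerprint -> number of failures seen

    @staticmethod
    def _fingerprint(tool, args):
        # Sorted keys so that argument order does not hide a repeated call.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def check(self, tool, args):
        """Call before every tool invocation; raises instead of looping."""
        if self.calls >= self.max_calls:
            raise EscalateToHuman("tool-call budget exhausted")
        seen = self.failures.get(self._fingerprint(tool, args), 0)
        if seen >= self.max_identical_failures:
            raise EscalateToHuman("same call failed repeatedly; change strategy or hand off")
        self.calls += 1

    def record_failure(self, tool, args):
        key = self._fingerprint(tool, args)
        self.failures[key] = self.failures.get(key, 0) + 1
```

Because the guard raises rather than returning a flag, the agent loop cannot quietly ignore it and report success.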
Agents depend on APIs, databases, documents, tools, queues, files, and user interfaces. These dependencies change. Fields are renamed, optional fields become required, response formats shift, and upstream systems add new cases. The agent may still receive a “successful” response while the meaning has changed underneath it.
Schema drift is dangerous because it often looks like model unreliability. In reality, the system contract changed and nobody told the agent layer.
Practical response: write contract tests for every agent tool. Validate inputs and outputs. Version tool schemas. Alert on unexpected fields, missing fields, empty result sets, and unusual distributions. Treat agent tools like production APIs, not prompt accessories.
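A contract check on tool output can be as simple as the sketch below, which reports missing fields, wrong types, and unexpected fields. The `validate_output` helper and the ticket contract in the usage note are hypothetical; a production system might use a schema library instead, but the idea is the same.

```python
def validate_output(payload, required, optional=()):
    """Compare a tool response against its declared contract.

    required: dict mapping field name to expected Python type.
    optional: names of fields that may appear without being required.
    Returns a list of human-readable problems; an empty list means it conforms.
    """
    problems = []
    for name, expected in required.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            actual = type(payload[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}, got {actual}"
            )
    known = set(required) | set(optional)
    for name in sorted(set(payload) - known):
        # Unexpected fields are often the first visible sign of upstream drift.
        problems.append(f"unexpected field: {name}")
    return problems
```

With a contract of `{"id": int, "status": str}`, a response such as `{"id": "1", "state": "open"}` produces three problems even though the upstream call itself "succeeded". Those problems are what should trigger an alert, rather than a confused model several steps later.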
“It worked on my ten examples” is not an evaluation strategy. Production users will bring edge cases, ambiguous requests, malicious inputs, old documents, incomplete data, and workflows the team never imagined. If the evaluation set is too small or too clean, the launch decision is mostly hope.
The agent should be evaluated against the work it is expected to perform, including failure cases. Does it abstain? Does it ask for missing information? Does it avoid unsafe actions? Does it preserve data boundaries? Does it produce a trace that can be reviewed?
Practical response: build continuous evaluation using sampled production traffic, synthetic adversarial cases, regression suites, and task-specific success criteria. Keep evals close to the actual operating risk.
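A regression suite with task-specific success criteria might look like the sketch below. It assumes the agent can be called as a function that returns a dictionary with an `action` key; that interface, the case format, and the category names are all hypothetical.

```python
from collections import defaultdict


def run_suite(agent, cases):
    """Run every case and report the pass rate per category.

    agent: callable taking the case input and returning a result dict.
    cases: dicts with "id", "category", "input" and a "check" callable
           that receives the result and returns True on success.
    Returns (pass rate per category, ids of failed cases).
    """
    totals, passed = defaultdict(int), defaultdict(int)
    failures = []
    for case in cases:
        ok = bool(case["check"](agent(case["input"])))
        totals[case["category"]] += 1
        passed[case["category"]] += int(ok)
        if not ok:
            failures.append(case["id"])
    rates = {category: passed[category] / totals[category] for category in totals}
    return rates, failures
```

Reporting by category matters: an agent that answers every normal question but never abstains would look healthy in a single aggregate score.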
Non-determinism is not automatically bad. It becomes a problem when incidents cannot be reproduced, diffs cannot be explained, and engineers lose trust in the system. If the same request leads to different tool calls, different intermediate reasoning, and different outputs, debugging becomes detective work.
Practical response: log model versions, prompts, tool schemas, retrieved context, parameters, seeds where available, tool calls, and final outputs. Reproducibility is not bureaucracy. It is how teams learn from failures without arguing from memory.
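As a sketch of what such a log entry could contain, the function below assembles one JSON-serialisable record per run and fingerprints the prompt and tool schemas so that silent changes show up in a diff. The field names and the `fingerprint` helper are assumptions for the example; where the record is stored is left open.

```python
import hashlib
import json


def fingerprint(obj):
    """Short, stable hash of any JSON-serialisable object."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def build_run_record(model_version, prompt, tool_schemas, retrieved_ids,
                     params, tool_calls, output):
    """Collect what is needed to explain or replay a run later."""
    return {
        "model_version": model_version,
        "prompt_hash": fingerprint(prompt),
        "tool_schema_hash": fingerprint(tool_schemas),
        "retrieved_ids": list(retrieved_ids),
        "params": dict(params),  # temperature, seed where available, and so on
        "tool_calls": list(tool_calls),
        "output": output,
    }
```

Two runs with the same request but different `tool_schema_hash` values point at a changed contract rather than a moody model.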
Agent cost does not scale like a simple chat completion. A single user request can trigger retrieval, planning, multiple model calls, tool retries, summarisation, validation, and post-processing. One pathological workflow can consume more than hundreds of ordinary requests combined.
If token budgets are only reviewed at invoice time, finance will discover the problem before operations does.
Practical response: define per-user, per-session, per-agent and per-tool budgets. Show real-time spend. Attach cost to business outcome, not only to model usage. A useful agent may be worth the money, but the organisation must know what it is buying.
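Layered budgets can be enforced before a call is made rather than after the invoice arrives. The sketch below tracks spend in tokens across several scopes at once. `BudgetTracker`, `BudgetExceeded`, and the `"session:s1"` style of scope key are hypothetical; converting tokens to money and attaching cost to outcomes is left out.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push any scope past its limit."""


class BudgetTracker:
    def __init__(self, limits):
        # limits: dict mapping a scope key such as "session:s1" to a token cap.
        self.limits = dict(limits)
        self.spent = {}

    def charge(self, scopes, tokens):
        """Charge the same usage to every scope, or to none if any would overflow."""
        for scope in scopes:
            limit = self.limits.get(scope)
            if limit is not None and self.spent.get(scope, 0) + tokens > limit:
                raise BudgetExceeded(f"{scope} would exceed {limit} tokens")
        for scope in scopes:
            self.spent[scope] = self.spent.get(scope, 0) + tokens

    def remaining(self, scope):
        """Tokens left in a scope, or None if the scope has no limit."""
        limit = self.limits.get(scope)
        if limit is None:
            return None
        return limit - self.spent.get(scope, 0)
```

Checking every scope before committing any of them keeps the counters consistent when one limit is hit, and `remaining` gives a dashboard something to show in real time.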
The most damaging agent failure is often not a crash. It is a confident answer when the correct behaviour was to stop. Many agents are designed as if success is always possible. Real production systems need safe exits: “I do not know,” “I cannot access that,” “this conflicts with policy,” “the source is insufficient,” or “a human must decide.”
Practical response: create explicit abstain and escalation paths. Reward uncertainty when uncertainty is correct. Make handoff easy, logged, and visible. A graceful stop protects trust better than a polished hallucination.
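Making the safe exits explicit can start with the return type. In the sketch below, abstaining and escalating are first-class outcomes rather than error cases. The `Outcome` values, the `decide` function, its thresholds, and the idea of a numeric `confidence` input are assumptions for the example; how confidence is estimated is a separate problem.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


@dataclass
class AgentResult:
    outcome: Outcome
    message: str
    reason: str = ""  # logged so that every stop is visible and reviewable


def decide(draft_answer, source_count, confidence, policy_conflict,
           min_sources=1, min_confidence=0.7):
    """Choose between answering, abstaining, and handing off to a human."""
    if policy_conflict:
        return AgentResult(Outcome.ESCALATE, "A human must decide.", "policy_conflict")
    if source_count < min_sources:
        return AgentResult(Outcome.ABSTAIN, "The source is insufficient.", "no_sources")
    if confidence < min_confidence:
        return AgentResult(Outcome.ABSTAIN, "I do not know.", "low_confidence")
    return AgentResult(Outcome.ANSWER, draft_answer)
```

Because every branch returns the same structure, downstream code and evaluation suites can count abstentions and escalations instead of treating them as failures.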
These nine failure modes have one thing in common: they are not solved by a better prompt alone. They require architecture, test discipline, observability, access control, budget control, and operating rules. In other words, agents need product and platform governance.
For leaders, the practical question is simple: before scaling an agent, can you show how it fails, who owns that failure, how fast it is detected, what it costs, and when it escalates? If the answer is vague, the system is still a demo, even if real users are touching it.
AI agents can create real leverage, but only when they are treated as production systems. The quiet killers are quiet because dashboards often stay green while the user experience, trust, and cost profile slowly deteriorate.
The teams that succeed will not be the ones with the flashiest demos. They will be the ones that design for degradation, observe the agent’s actual behaviour, test the boring contracts, and give the system a safe way to stop.