
How to handle AI agent failure modes when building autonomous systems

In this post, I explore the main failure modes that hit AI agents differently than single LLM calls. The key point is that agents need architectural guards, not just better prompts, to fail safely.

The Problem: Agents Fail Differently Than LLMs

When I first started building AI agents, I treated them like regular LLM calls with extra steps. That was a mistake. Agents don’t just fail; they cascade failures through every connected system.

Here’s what I mean. A single LLM call fails locally—the wrong output stays contained. But an agent? It takes that wrong output and feeds it into the next step, which feeds it into the next, until you’ve got a mess that touches your database, your API, and maybe your production environment.

I identified four failure modes that show up again and again:

  1. Error propagation - A small mistake in step 1 snowballs
  2. Hallucination cascade - Agents build on false assumptions
  3. Prompt injection - No complete defense exists
  4. Cost runaway - Agents explore indefinitely without budget limits

Let me walk through each one and show what actually works.

Error Propagation: One Bad Step Spoils the Whole Run

I built a data processing agent that fetched records, transformed them, and wrote to a database. The fetch step returned malformed JSON once—just once—and the agent spent 47 iterations trying to “fix” the transformation logic. It never occurred to the agent that the input was bad.

Here’s the pattern I see: agents assume their inputs are correct. When step N receives garbage from step N-1, it tries to process it instead of rejecting it.

agent_error_propagation.py
# What I did wrong at first
async def run_agent(task):
    state = initial_state
    # No validation between steps!
    for step in pipeline:
        state = await step.execute(state)
    return state  # Hope nothing went wrong

The fix is validation at each step. I added confidence checks and checkpoint/rollback:

robust_agent.py
from copy import deepcopy
import time
class AgentError(Exception):
    """A step produced output that failed validation"""
class MaxIterationsExceeded(Exception):
    """The run hit its iteration limit"""
class BudgetExceeded(Exception):
    """The run hit its cost limit"""
class RobustAgent:
    # validate, detect_hallucination, verify_with_tools, and calculate_cost
    # are left to the concrete agent; llm is any client with an async generate()
    def __init__(self, llm, max_iterations=50, max_cost=10.0):
        self.llm = llm
        self.max_iterations = max_iterations
        self.max_cost = max_cost
        self.checkpoints = []
    def save_checkpoint(self, state):
        """Save recoverable state"""
        self.checkpoints.append({
            "iteration": len(self.checkpoints),
            "state": deepcopy(state),
            "timestamp": time.time(),
        })
    def rollback(self, to_iteration):
        """Recover from error"""
        return deepcopy(self.checkpoints[to_iteration]["state"])
    async def step(self, state):
        """Single step with validation"""
        result = await self.llm.generate(state.prompt)
        # Validate before proceeding
        if not self.validate(result):
            raise AgentError(f"Invalid result: {result}")
        # Check for hallucination indicators
        if self.detect_hallucination(result):
            # Force tool verification; it returns (verified, correction)
            verified, correction = await self.verify_with_tools(result)
            if not verified:
                raise AgentError(f"Tools contradict result: {correction}")
        return result
    async def run(self, task):
        """Run with all failure guards"""
        cost = 0.0
        iterations = 0
        while not task.complete:
            # Hard limits
            if iterations >= self.max_iterations:
                raise MaxIterationsExceeded()
            if cost >= self.max_cost:
                raise BudgetExceeded()
            # Checkpoint for rollback
            if iterations % 10 == 0:
                self.save_checkpoint(task.state)
            # Count failed attempts too, so a step that keeps failing
            # still hits max_iterations instead of retrying forever
            iterations += 1
            try:
                result = await self.step(task.state)
                task.update(result)
                cost += self.calculate_cost(result)
            except AgentError:
                # Roll back to the most recent checkpoint and retry
                task.state = self.rollback(len(self.checkpoints) - 1)
        return task.result

The key change: I stopped trusting that each step would succeed. Now the agent validates outputs and can roll back to a known-good state.
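As a minimal sketch of what that per-step validation can look like for the malformed-JSON case, the fetch output can be parsed and checked before the transform step ever sees it. The `validate_records` helper, the `StepInputError` exception, and the required `id` and `amount` fields are assumptions made up for illustration, not the code from my agent.

```python
import json


class StepInputError(Exception):
    """Raised when a step receives input it should not try to process."""


# Hypothetical schema: every record needs these keys with these types.
REQUIRED_FIELDS = {"id": int, "amount": float}


def validate_records(raw: str) -> list:
    """Parse the fetch step's output and reject it if it is malformed."""
    try:
        records = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise StepInputError(f"fetch returned invalid JSON: {exc}") from exc

    if not isinstance(records, list):
        raise StepInputError("expected a JSON array of records")

    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise StepInputError(f"record {index} is not an object")
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(record.get(field), expected_type):
                raise StepInputError(
                    f"record {index}: field {field!r} is missing "
                    f"or not a {expected_type.__name__}"
                )
    return records


# Well-formed input passes through unchanged.
print(validate_records('[{"id": 1, "amount": 9.5}]'))

# Truncated JSON is rejected at the boundary instead of reaching the transform.
try:
    validate_records('[{"id": 1, "amount": 9.5')
except StepInputError as exc:
    print(f"rejected: {exc}")
```

The point of raising here is that the failure is attributed to the input, so the agent has no reason to start rewriting its transformation logic.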

Hallucination Cascade: When Agents Reinforce Each Other’s Mistakes

This one scared me. I ran a multi-agent system where three agents collaborated on research. Agent A made a false claim. Agent B cited Agent A as a source. Agent C cited both A and B as “independent verification.”

They had created what looked like consensus—three sources all agreeing—but it was one hallucination echoed three times.

Galileo’s research calls this “distributed hallucination.” In multi-agent systems, agents can validate each other’s wrong conclusions and create a feedback loop of errors.
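One way to see through this kind of false consensus is to trace every citation back to where it started and count distinct origins rather than distinct voices. The sketch below assumes a hypothetical `cites` mapping that records which sources each agent leaned on; the agent names are illustrative and not tied to any framework.

```python
from typing import Dict, List, Optional, Set


def root_sources(
    source: str,
    cites: Dict[str, List[str]],
    seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Follow citations back to the sources that cite nothing else."""
    seen = set() if seen is None else seen
    if source in seen:  # already visited, or part of a citation cycle
        return set()
    seen.add(source)

    parents = cites.get(source, [])
    if not parents:  # nothing upstream: this is an origin
        return {source}

    roots: Set[str] = set()
    for parent in parents:
        roots |= root_sources(parent, cites, seen)
    return roots


def independent_origins(supporters: List[str], cites: Dict[str, List[str]]) -> Set[str]:
    """Collect the distinct origins behind everyone who supports a claim."""
    origins: Set[str] = set()
    for supporter in supporters:
        origins |= root_sources(supporter, cites)
    return origins


# The scenario from above: B cites A, and C cites both A and B.
cites = {"agent_a": [], "agent_b": ["agent_a"], "agent_c": ["agent_a", "agent_b"]}
supporters = ["agent_a", "agent_b", "agent_c"]

origins = independent_origins(supporters, cites)
print(f"{len(supporters)} agents agree, but there are {len(origins)} origin(s): {sorted(origins)}")
```

If the only origin is another agent rather than a database, API, or search result, the "consensus" carries no more weight than the first claim did.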

Here’s what I tried first (and what didn’t work):

failed_approach.txt
Attempt 1: Increase temperature to get diverse responses
Result: More hallucinations, not fewer
Attempt 2: Add a "fact checker" agent
Result: Fact checker also hallucinated, approved false claims
Attempt 3: Require citations for all claims
Result: Agents cited each other's hallucinations

What actually worked:

  1. Cross-check with external tools, not other agents - Don’t ask agents to verify. Ask databases, APIs, or search tools.

  2. Prefer tool results over LLM claims - If a tool returns data and an agent makes a claim that contradicts it, trust the tool.

  3. Set clear stopping criteria - Agents will keep “improving” outputs indefinitely. Give them a budget and a definition of done.

hallucination_guard.py
async def verify_with_tools(self, claim):
    """Don't trust agents to verify agents"""
    # Use external data sources
    if claim.type == "fact":
        search_result = await self.search_tool.query(claim.text)
        if not search_result.supports(claim):
            return False, search_result.correction
    # Use structured data when available
    if claim.type == "data":
        db_result = await self.database.query(claim.entity)
        # Compare against None so falsy values such as 0 are still checked
        if db_result is not None and db_result != claim.value:
            return False, db_result
    return True, None

Prompt Injection: The Lethal Trifecta

Simon Willison identified what he calls the “lethal trifecta” for prompt injection attacks:

  1. Private data access - The agent can see things you don’t want leaked
  2. Untrusted content - The agent processes user input or scraped data
  3. Data exfiltration - The agent can send data somewhere

When all three are present, there’s no complete defense. Unlike SQL injection, where parameterized queries solve the problem, prompt injection remains unsolved.
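For contrast, here is why SQL injection counts as solved: a parameterized query keeps the statement and the data in separate channels, so hostile input is only ever treated as a value. The sketch uses Python's built-in `sqlite3` module with an in-memory database; the `users` table and the attack string are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "alice@example.com"), ("bob", "bob@example.com")],
)

hostile = "nobody' OR '1'='1"

# Safe: the ? placeholder passes the input as data, never as SQL text.
safe_rows = conn.execute(
    "SELECT email FROM users WHERE name = ?", (hostile,)
).fetchall()
print("parameterized:", safe_rows)

# Unsafe, shown only for comparison: the input is spliced into the SQL itself,
# so the OR clause becomes part of the query and matches every row.
unsafe_sql = f"SELECT email FROM users WHERE name = '{hostile}'"
leaked_rows = conn.execute(unsafe_sql).fetchall()
print("string-built:", leaked_rows)

conn.close()
```

Prompts have no equivalent of the `?` placeholder: instructions and untrusted content end up in the same stream of text, which is why the same trick does not carry over.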

I tried to build a “secure” agent that handled user emails. It had access to my calendar, contacts, and could send replies. A malicious email could have instructed it to forward my calendar to an attacker.

Here’s the ugly truth: I couldn’t find a 100% solution. The best I could do is minimize the risk:

injection_defense.py
# What I tried that helped (but didn't solve)
class DefensiveAgent:
    def __init__(self):
        self.sandbox = Sandbox()
        self.output_filter = OutputFilter()
    async def process_untrusted_content(self, content):
        """Handle untrusted input with guards"""
        # 1. Sandbox the execution
        with self.sandbox.restrict(
            no_network=True,
            no_filesystem=True
        ):
            parsed = await self.parse(content)
        # 2. Don't mix private data with untrusted content
        # Process them separately
        public_context = await self.build_public_context(parsed)
        # Private data only in isolated operations
        private_result = await self.isolated_private_operation()
        # 3. Filter outputs
        combined = self.combine(public_context, private_result)
        if self.output_filter.detects_exfiltration(combined):
            raise SecurityError("Potential data exfiltration blocked")
        return combined

The real solution? Don’t build agents that combine all three elements. I now design every agent to satisfy at least one of these constraints:

  • Process untrusted content without private data access
  • Access private data without processing untrusted content
  • Never have exfiltration capabilities
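That design rule is easy to enforce mechanically. As a rough sketch, an agent's capabilities can be declared up front and rejected at construction time if all three legs of the trifecta are present. The `AgentCapabilities` dataclass, its field names, and `LethalTrifectaError` are hypothetical, chosen to mirror the three items in Willison's list.

```python
from dataclasses import dataclass


class LethalTrifectaError(ValueError):
    """Raised when one agent would hold all three risky capabilities."""


@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool
    processes_untrusted_content: bool
    can_send_data_out: bool

    def __post_init__(self) -> None:
        # Any two of the three are allowed; all three together are not.
        if (
            self.reads_private_data
            and self.processes_untrusted_content
            and self.can_send_data_out
        ):
            raise LethalTrifectaError(
                "agent combines private data, untrusted content, and exfiltration"
            )


# Reads incoming email and the calendar, but has no way to send anything out.
triage_agent = AgentCapabilities(
    reads_private_data=True,
    processes_untrusted_content=True,
    can_send_data_out=False,
)
print("accepted:", triage_agent)

# The email agent described earlier: all three capabilities at once.
try:
    AgentCapabilities(
        reads_private_data=True,
        processes_untrusted_content=True,
        can_send_data_out=True,
    )
except LethalTrifectaError as exc:
    print("rejected:", exc)
```

A check like this only helps if the declared capabilities match what the agent's tools can really do, so it belongs next to the code that wires up those tools.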

Cost Runaway: The Hidden Failure Mode

This one doesn’t show up in security audits, but it can bankrupt a project.

I deployed an agent to handle customer support tickets. One ambiguous ticket triggered an infinite loop of “let me research this more thoroughly.” The agent spent $847 in API costs before I noticed.

The problem: agents don’t have a natural stopping point. They’ll keep iterating, searching, and refining until you explicitly stop them.

cost_limits.py
import time
class BoundedAgent:
    def __init__(self, max_iterations=50, max_cost=5.0, timeout=300):
        self.max_iterations = max_iterations
        self.max_cost = max_cost
        self.timeout = timeout  # seconds
        self.cost_tracker = CostTracker()
    async def run(self, task):
        # monotonic() is not affected by system clock adjustments
        start_time = time.monotonic()
        iterations = 0
        while not task.complete:
            # Hard limits - these throw exceptions
            self.check_limits(iterations, start_time)
            result = await self.step(task)
            self.cost_tracker.add(result.cost)
            iterations += 1
        return task.result
    def check_limits(self, iterations, start_time):
        if iterations >= self.max_iterations:
            raise MaxIterationsExceeded(
                f"Hit {self.max_iterations} iterations"
            )
        if self.cost_tracker.total >= self.max_cost:
            raise BudgetExceeded(
                f"Spent ${self.cost_tracker.total:.2f} of ${self.max_cost:.2f} budget"
            )
        if time.monotonic() - start_time > self.timeout:
            raise TimeoutExceeded(
                f"Exceeded {self.timeout}s timeout"
            )

I now set three hard limits on every agent: maximum iterations, maximum cost, and maximum time. When any limit is hit, the agent stops—no exceptions.
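The `CostTracker` used by `BoundedAgent` is not shown in this post. A minimal sketch of one, assuming the cost of each call is derived from token counts, could look like the following. The per-token prices are placeholders rather than real pricing for any provider, and `add_usage` is a hypothetical convenience method on top of the `add` and `total` members that `BoundedAgent` relies on.

```python
from dataclasses import dataclass


@dataclass
class CostTracker:
    # Placeholder prices in dollars per 1,000 tokens; substitute real ones.
    input_price_per_1k: float = 0.003
    output_price_per_1k: float = 0.015
    total: float = 0.0

    def add(self, cost: float) -> None:
        """Record a cost that has already been computed."""
        self.total += cost

    def add_usage(self, input_tokens: int, output_tokens: int) -> float:
        """Convert token counts to dollars, record the cost, and return it."""
        cost = (
            input_tokens / 1000 * self.input_price_per_1k
            + output_tokens / 1000 * self.output_price_per_1k
        )
        self.add(cost)
        return cost


# Simulate an agent that keeps "researching" until the budget check stops it.
tracker = CostTracker()
max_cost = 0.10
calls = 0
while tracker.total < max_cost:
    tracker.add_usage(input_tokens=2000, output_tokens=1000)
    calls += 1

print(f"stopped after {calls} calls, spent ${tracker.total:.3f}")
```

Because the check runs before each call, the final spend can overshoot the budget by up to one call's cost. If that matters, set the limit slightly below the real ceiling.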

What I Got Wrong

When I first tackled agent failures, I tried to solve them with better prompts. “Be more careful,” “double-check your work,” “verify all claims.” None of it worked reliably.

What worked was accepting that agents will fail, and building systems that fail gracefully:

  • Validation at each step (not just at the end)
  • Checkpoints and rollback (so failures don’t require starting over)
  • Hard resource limits (so failures can’t run forever)
  • Isolation between trusted and untrusted operations

Agents aren’t just LLMs with tools. They’re systems, and they need system-level safeguards.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
