How Do You Handle Reliability in Multi-Agent LLM Chains When Accuracy Compounds?
I built a 10-agent LLM pipeline and watched it fail silently. Each agent ran at 90% accuracy individually, but my overall pipeline success rate collapsed to 35%. Worse: it didn’t crash. It produced plausible-looking garbage.
The Silent Killer: Compound Accuracy
I was building an Obsidian crew with 10 chained agents. Each agent processed the previous agent’s output and passed results downstream. Simple math I ignored:
- 5 agents at 95% accuracy: 77% overall success
- 10 agents at 95% accuracy: 60% overall success
- 10 agents at 90% accuracy: 35% overall success
- 15 agents at 90% accuracy: 21% overall success

When Agent 4 misclassified a document, Agents 5-10 didn’t crash. They confidently processed corrupted state. The system returned output that looked correct but was fundamentally wrong.
A Reddit discussion captured this perfectly:
“10 agents is where the math gets uncomfortable. chain 10 steps at 90% accuracy each and ur overall pipeline success rate is ~35%. 95% per step gets u to 60%. the design question is what happens when agent 4 misclassifies something and agents 5-10 build on that bad state? does the system degrade silently or do u have a way to catch it?”
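The arithmetic behind those figures is per-step accuracy raised to the number of steps. A minimal sketch, assuming every agent fails independently and with the same accuracy (the function names are illustrative, not from any library):

```python
import math


def chain_success_rate(per_step_accuracy: float, num_agents: int) -> float:
    """Probability that every agent in a sequential chain is correct,
    assuming independent failures and identical per-step accuracy."""
    return per_step_accuracy ** num_agents


def max_agents_for_target(per_step_accuracy: float, target: float) -> int:
    """Largest chain length whose success rate is still at least `target`."""
    # p**n >= target  <=>  n <= log(target) / log(p), because log(p) < 0
    return math.floor(math.log(target) / math.log(per_step_accuracy))


for accuracy, n in [(0.95, 5), (0.95, 10), (0.90, 10), (0.90, 15)]:
    print(f"{n} agents at {accuracy:.0%}: {chain_success_rate(accuracy, n):.1%}")
```

Under those assumptions the loop prints 77.4%, 59.9%, 34.9%, and 20.6%. `max_agents_for_target(0.90, 0.5)` returns 6: at 90% per step, any chain longer than six agents is more likely to fail than to succeed.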
Why Traditional Error Handling Failed Me
I tried wrapping agents in try/except blocks. That caught exceptions but missed “wrong but plausible” outputs. LLMs don’t naturally signal uncertainty, and downstream agents accepted malformed inputs without validation.
My original chain looked like this:
```python
from typing import Any


def run_agent_chain(agents: list, initial_input: Any) -> Any:
    current = initial_input
    for agent in agents:
        try:
            current = agent(current)  # No validation, no confidence check
        except Exception as e:
            print(f"Agent failed: {e}")
            return None
    return current
```

This approach had three fatal flaws:
- No state isolation: Agent 4’s error contaminated Agents 5-10’s context
- No validation gates: Wrong outputs flowed downstream unchecked
- No confidence scoring: Silent failures looked like success
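The second and third flaws are easy to reproduce without any LLM calls. A toy sketch, where `classify` and `route` are hypothetical stand-ins for real agents and the document text is made up:

```python
from typing import Any, Callable


def classify(doc: str) -> dict:
    # Stand-in for an LLM agent that mislabels the document but still
    # returns a well-formed result, so no exception is ever raised.
    return {"text": doc, "label": "invoice"}  # wrong: this is a contract


def route(state: dict) -> dict:
    # Downstream agent trusts whatever label it receives.
    folder = {"invoice": "Finance/", "contract": "Legal/"}[state["label"]]
    return {**state, "folder": folder}


def run_agent_chain(agents: list[Callable[[Any], Any]], initial_input: Any) -> Any:
    current = initial_input
    for agent in agents:
        try:
            current = agent(current)
        except Exception as e:
            print(f"Agent failed: {e}")
            return None
    return current


result = run_agent_chain([classify, route], "This Agreement is made between ...")
# No exception, no None: the contract is filed under "Finance/".
```

The chain reports success because nothing in it can tell a valid-looking wrong answer from a right one.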
Layer 1: Context Isolation
I found a pattern in the subagent middleware I was studying. The key insight: each subagent operates with a fresh context window. The parent context stays clean.
```python
def run_subagent(prompt: str) -> str:
    # Fresh context - no contamination from previous agents
    sub_messages = [{"role": "user", "content": prompt}]

    for _ in range(30):  # safety limit
        response = client.messages.create(
            model=MODEL,
            system=SUBAGENT_SYSTEM,
            messages=sub_messages,  # Isolated context
            tools=CHILD_TOOLS,
            max_tokens=8000,
        )
        # ... tool execution ...

    # Return only summary - parent context stays clean
    return "".join(b.text for b in response.content if hasattr(b, "text"))
```

Bad state from one subagent cannot contaminate others. This alone improved my pipeline reliability from 35% to 52%.
Layer 2: Validation Gates
I added explicit validation between agent handoffs. Every agent output passes through a validator before the next agent sees it.
```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ValidationResult:
    is_valid: bool
    errors: list[str]
    confidence: float


@dataclass
class ValidationError:
    """Represent a validation error with line number and severity."""
    line: int
    message: str
    severity: str = "error"


def validate_agent_output(output: dict, schema: dict) -> list[ValidationError]:
    errors = []
    # Fall back to `object` so a schema without "expected_type" accepts any data
    expected_type = schema.get("expected_type", object)
    if not isinstance(output.get("data"), expected_type):
        errors.append(ValidationError(0, "Output type mismatch"))
    # A missing confidence is treated the same as a low one
    confidence = output.get("confidence")
    if confidence is None or confidence < 0.8:
        errors.append(ValidationError(0, "Low confidence output", "warning"))
    return errors
```

Layer 3: Confidence Scoring
I required agents to output explicit confidence scores. This made uncertainty visible.
```python
from typing import TypedDict, NotRequired, Any  # NotRequired needs Python 3.11+


class AgentResult(TypedDict):
    data: Any
    confidence: float  # 0.0 to 1.0
    reasoning: str
    alternatives: NotRequired[list[dict]]


def chain_with_confidence(agents: list, initial_input: Any) -> AgentResult:
    current = initial_input
    # Returned unchanged if the agent list is empty
    result: AgentResult = {
        "data": initial_input,
        "confidence": 1.0,
        "reasoning": "No agents ran",
    }
    for i, agent in enumerate(agents):
        result = agent(current)

        if result["confidence"] < 0.7:
            # Trigger human review or alternative path
            return handle_low_confidence(result, current, i)

        current = result["data"]
    # Return the last agent's full result so its confidence stays visible
    return result
```

With confidence gates at each step, I could catch degrading quality early and either retry or escalate.
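`handle_low_confidence` is not shown above. A minimal sketch of what such a helper could look like, assuming the policy is to park the step for human review and stop the chain; `REVIEW_QUEUE` and the wording of the returned fields are illustrative assumptions, not the original implementation:

```python
from typing import Any, TypedDict


class AgentResult(TypedDict):
    data: Any
    confidence: float
    reasoning: str


# Hypothetical in-memory stand-in for a real review queue (ticketing system, etc.)
REVIEW_QUEUE: list[dict] = []


def handle_low_confidence(result: AgentResult, agent_input: Any, step: int) -> AgentResult:
    """Park a low-confidence step for human review and stop the chain."""
    REVIEW_QUEUE.append({"step": step, "input": agent_input, "result": result})
    return {
        "data": None,  # nothing trustworthy to pass downstream
        "confidence": result["confidence"],
        "reasoning": (
            f"Agent {step} reported confidence {result['confidence']:.2f}; "
            f"queued for human review. Agent said: {result['reasoning']}"
        ),
    }
```

Returning `data: None` makes the escalation explicit to the caller, so a low-confidence value cannot flow downstream by accident.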
Layer 4: Parallel Redundancy
For critical steps, I ran agents in parallel and compared results. When agents disagree, I escalate rather than guess.
```python
import asyncio
from typing import Any, Callable


async def parallel_verification(
    agents: list[Callable],
    input_data: Any,
    agreement_threshold: float = 0.8,
) -> AgentResult:
    # Each agent must be an async callable so gather() can await it
    results = await asyncio.gather(*[agent(input_data) for agent in agents])

    # Compare results using similarity matrix
    similarities = compute_similarity_matrix([r["data"] for r in results])
    avg_similarity = average_similarity(similarities)

    if avg_similarity >= agreement_threshold:
        return merge_results(results)
    else:
        # Results disagree - escalate or retry
        return escalate_disagreement(results)
```

The Complete Pattern
I combined all four layers into a validated agent chain:
```python
from typing import TypedDict, Any, Callable
from dataclasses import dataclass
import asyncio


@dataclass
class ValidationResult:
    is_valid: bool
    errors: list[str]
    confidence: float


class AgentResult(TypedDict):
    data: Any
    confidence: float
    reasoning: str


def validated_agent_chain(
    agents: list[tuple[Callable, Callable]],  # (agent_fn, validator_fn)
    initial_input: Any,
    max_retries: int = 2,
) -> AgentResult:
    """
    Execute agent chain with validation gates.

    Each tuple contains (agent_function, validator_function).
    If validation fails, retry the agent up to max_retries times.
    """
    current = initial_input
    history = []

    for i, (agent, validator) in enumerate(agents):
        # One initial attempt plus up to max_retries retries
        for attempt in range(max_retries + 1):
            result = agent(current)
            validation = validator(result)

            if validation.is_valid:
                current = result["data"]
                history.append({
                    "agent": i,
                    "attempt": attempt,
                    "confidence": result["confidence"],
                    "validation": validation,
                })
                break

            if validation.confidence < 0.5:
                # Critical failure - escalate
                raise ValueError(
                    f"Agent {i} validation failed with low confidence. "
                    f"Errors: {validation.errors}"
                )

            # Retry with error feedback (assumes the agent input is a dict)
            current = {
                **current,
                "_previous_errors": validation.errors,
                "_retry_attempt": attempt + 1,
            }
        else:
            raise RuntimeError(
                f"Agent {i} failed after {max_retries} retries. "
                f"Last errors: {validation.errors}"
            )

    return {
        "data": current,
        # default=1.0 covers an empty agent list
        "confidence": min((h["confidence"] for h in history), default=1.0),
        "reasoning": f"Passed {len(agents)} validation gates",
    }
```

For critical steps requiring consensus:
```python
async def parallel_with_consensus(
    agents: list[Callable],
    input_data: Any,
    consensus_threshold: float = 0.75,
) -> AgentResult:
    """
    Run agents in parallel and require consensus.
    Returns merged result if agents agree, escalates otherwise.
    """
    results = await asyncio.gather(*[
        asyncio.to_thread(agent, input_data)
        for agent in agents
    ])

    # Group similar results
    groups = cluster_by_similarity(
        [r["data"] for r in results],
        threshold=0.8,
    )

    largest_group = max(groups, key=len)
    agreement_ratio = len(largest_group) / len(agents)

    if agreement_ratio >= consensus_threshold:
        # Consensus reached - merge and return
        return {
            "data": merge_similar(largest_group),
            "confidence": agreement_ratio * min(
                r["confidence"] for r in results
            ),
            "reasoning": f"{len(largest_group)}/{len(agents)} agents agree",
        }
    else:
        # No consensus - escalate
        return {
            "data": None,
            "confidence": agreement_ratio,
            "reasoning": "Disagreement detected - human review required",
        }
```

Results After Four-Layer Architecture
My 10-agent pipeline metrics changed dramatically:
Before:
- Overall success: 35%
- Silent failures: Common
- Error detection: Post-hoc manual review

After:
- Overall success: 94%
- Silent failures: Rare (caught by validation)
- Error detection: Real-time at failure point

Common Mistakes I Made
Mistake 1: Assuming Agents Self-Correct
LLMs don’t naturally detect their own errors. An agent that misclassified a document won’t “realize” it downstream. I needed explicit validation.
Mistake 2: Skipping Validation for “Simple” Steps
A misclassification at step 3 invalidated steps 4-10. Every handoff needed validation, not just the “complex” ones.
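A validator for a “simple” step does not need to be elaborate. A minimal sketch, assuming the `ValidationResult` shape from the complete pattern and a hypothetical classification step whose output must carry one of a fixed set of labels (`ALLOWED_LABELS` and `validate_classification` are illustrative names):

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    is_valid: bool
    errors: list[str] = field(default_factory=list)
    confidence: float = 0.0


ALLOWED_LABELS = {"invoice", "contract", "note"}  # assumed label set


def validate_classification(result: dict, min_confidence: float = 0.8) -> ValidationResult:
    """Cheap structural checks on a classifier agent's output."""
    errors: list[str] = []
    data = result.get("data")
    confidence = result.get("confidence")

    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        errors.append(f"Unknown or missing label: {data!r}")

    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append(f"Confidence missing or out of range: {confidence!r}")
        confidence = 0.0
    elif confidence < min_confidence:
        errors.append(f"Confidence {confidence} below {min_confidence}")

    return ValidationResult(
        is_valid=not errors,
        errors=errors,
        confidence=float(confidence),
    )
```

A function like this can serve as the second element of an `(agent_fn, validator_fn)` tuple in `validated_agent_chain`. Structural checks only catch malformed or out-of-range output. A well-formed but wrong label still gets through, which is why the confidence and redundancy layers exist alongside it.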
Mistake 3: Ignoring Confidence Scores
Raw LLM outputs lack uncertainty signals. I had to require explicit confidence scoring to surface ambiguous cases.
Mistake 4: Monolithic Context Windows
Sharing full context between agents caused state contamination. Isolating contexts prevented error propagation.
Mistake 5: No Retry Logic
When Agent 5 failed validation, I initially restarted the entire pipeline. Wasteful. I needed a path back to Agent 4, not a full restart.
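One way to get that path back is to checkpoint every validated output and, when a step fails validation, re-run only the preceding agent. A minimal sketch under assumptions: agents are plain callables, each validator returns a bool, and `max_backtracks` is an illustrative cap that stops the loop from running forever:

```python
from typing import Any, Callable

Agent = Callable[[Any], Any]
Validator = Callable[[Any], bool]


def run_with_backtracking(
    steps: list[tuple[Agent, Validator]],
    initial_input: Any,
    max_backtracks: int = 3,
) -> Any:
    """Run steps in order; on a failed validation, step back one agent
    and redo it from its checkpointed input instead of restarting."""
    # checkpoints[i] is the validated input to step i
    checkpoints: list[Any] = [initial_input]
    backtracks = 0
    i = 0
    while i < len(steps):
        agent, is_valid = steps[i]
        output = agent(checkpoints[i])
        if is_valid(output):
            # Drop stale checkpoints from an earlier pass, then record this one
            del checkpoints[i + 1:]
            checkpoints.append(output)
            i += 1
            continue
        backtracks += 1
        if backtracks > max_backtracks:
            raise RuntimeError(
                f"Step {i} still failing after {max_backtracks} backtracks"
            )
        # Re-run the previous agent (or this one, if it is the first)
        i = max(i - 1, 0)
    return checkpoints[-1]
```

Stepping back only helps when the re-run can produce a different answer, for example through sampling variance or by feeding the validation errors back into the prompt. With a fully deterministic agent, the backtrack cap simply turns the failure into an explicit error.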
Why This Matters for Production
Production multi-agent systems can have 10-50 agents. Without safeguards, at 90% per-step accuracy a chain of 22 or more agents succeeds less than 10% of the time. Silent failures are worse than explicit errors because users act on wrong information, permanently damaging trust.
The cost of detection matters too. Catching errors at Agent 4 is cheap. Catching errors at Agent 10 is expensive - I’ve wasted 6 agents’ work.
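The difference can be put in rough numbers. A back-of-the-envelope sketch, assuming perfect validators, independent attempts, and the same per-step accuracy `p` for every agent (none of which hold exactly in practice):

```python
def expected_calls_full_restart(p: float, n: int) -> float:
    """Errors are only noticed after the last agent, so every failed run
    costs all n agent calls and the whole pipeline is rerun."""
    # Runs until success follow a geometric distribution with mean 1 / p**n
    return n / (p ** n)


def expected_calls_step_retry(p: float, n: int) -> float:
    """Each step is validated immediately and retried alone until it passes."""
    # Each step needs 1 / p attempts on average
    return n / p


full = expected_calls_full_restart(0.90, 10)
stepwise = expected_calls_step_retry(0.90, 10)
print(f"full restart: {full:.1f} calls, step retry: {stepwise:.1f} calls")
```

At 90% accuracy across 10 agents, that works out to roughly 28.7 expected agent calls per successful run with end-of-pipeline detection, against roughly 11.1 with step-level retries.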
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.
Here are the most important links from this article, along with some further resources on the topic:
- Reddit Discussion on 10-Agent Obsidian Crew
- Anthropic Agent Builder Documentation
- LangGraph Documentation