How to Build Production-Ready AI Agent Architecture: State, Retries, and Approval Gates

My AI agent crashed at 3 AM. The invoice processing pipeline had been running smoothly for weeks. Then an external API timed out, the retry logic kicked in… and duplicated the same invoice three times. By morning, I was manually fixing a mess created by my “autonomous” agent.

That’s when I realized I had spent 90% of my time building prompts, nodes, and chat interfaces, while the production failures came from the boring stuff I skipped: state management, retries, schemas, approval gates, and dead-letter queues.

The Five Patterns That Actually Matter

After debugging that disaster and reading through production architecture discussions, I found five patterns that make agent systems survive real operations:

Production Agent Architecture Flow
                            INPUT
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Pattern 1: STATE CHECKPOINT                                │
│   step_name    ──► Write state BEFORE processing           │
│   input_id                                                 │
│   timestamp                                                │
│   status       ──► pending → running → success/failed      │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Pattern 2: RETRY/BACKOFF                                   │
│   max_retries  ──► 3 retries with exponential backoff      │
│   backoff      ──► 1s → 2s → 4s                            │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Pattern 3: SCHEMA VALIDATION                               │
│   output       ──► Pydantic BaseModel schema               │
│   confidence   ──► threshold check (< 0.85 = review)       │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Pattern 4: APPROVAL GATE                                   │
│   risk_flags   ──► Any flag = human review queue           │
│   timeout      ──► 24h review window                       │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Pattern 5: DEAD-LETTER PATH                                │
│   failed       ──► Queue for manual recovery               │
│   exhausted    ──► Max retries reached → safe storage      │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
                            OUTPUT

Let me walk through how I implemented each pattern - and where I failed first.

Pattern 1: Every Step Writes State

My first mistake was treating state as optional. The agent processed an invoice, succeeded, and I thought “done”. But when the network flaked, I had no record of where the process stopped or what partial work existed.

Here’s the state schema I ended up with:

state_schema.py
from datetime import datetime

from pydantic import BaseModel


class StepState(BaseModel):
    step_name: str
    input_id: str
    output: dict
    timestamp: datetime
    retry_count: int = 0
    status: str  # pending, running, success, failed, needs_review

The key insight: status needs a needs_review value, not just success/failed. When an agent outputs something questionable, you need a human in the loop - but you also need to track that the item is waiting for review, not just “stuck”.

I use Redis for state storage because it’s fast and supports atomic operations. But PostgreSQL works fine too - the important thing is that every step writes state BEFORE processing, not after.
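A minimal sketch of the "write before processing" rule, under some loud assumptions: it uses a plain dataclass and an in-memory dict in place of the Pydantic model and Redis so that it runs standalone, and the names save_state, load_state, run_step and the key format are illustrative, not the ones from my codebase.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class StepState:
    step_name: str
    input_id: str
    status: str = "pending"  # pending, running, success, failed, needs_review
    output: dict = field(default_factory=dict)
    retry_count: int = 0
    timestamp: str = ""


def save_state(store: dict, state: StepState) -> None:
    """Stamp the state and persist it under a key built from step and input."""
    state.timestamp = datetime.now(timezone.utc).isoformat()
    store[f"state:{state.step_name}:{state.input_id}"] = json.dumps(asdict(state))


def load_state(store: dict, step_name: str, input_id: str) -> Optional[StepState]:
    raw = store.get(f"state:{step_name}:{input_id}")
    return StepState(**json.loads(raw)) if raw else None


def run_step(store: dict, step_name: str, input_id: str,
             work: Callable[[str], dict]) -> StepState:
    """Record 'running' before any work happens, then record the outcome."""
    state = StepState(step_name=step_name, input_id=input_id, status="running")
    save_state(store, state)  # written BEFORE processing starts
    try:
        state.output = work(input_id)
        state.status = "success"
    except Exception as exc:
        state.output = {"error": str(exc)}
        state.status = "failed"
    save_state(store, state)
    return state
```

If the process dies in the middle of work, the stored record still says running, which is exactly the breadcrumb that was missing when the network flaked.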

Pattern 2: Every External Call Has Retry/Backoff

I initially used a simple retry decorator:

broken_retry.py
# WRONG: Naive retry that creates duplicates
def retry_on_failure(func, max_retries=3):
    for i in range(max_retries):
        try:
            return func()
        except Exception:
            continue
    raise Exception("Max retries reached")

This created the 3 AM disaster. The function ran, partially succeeded, and then the network failed. The retry immediately ran the SAME function again, with no backoff and no check for work that had already completed, so it duplicated that work.

The fix: use idempotent operations with Celery’s built-in retry:

proper_retry.py
from celery import Celery

app = Celery('agent_pipeline', broker='redis://localhost')


# already_processed, get_cached_result, fetch_and_process and mark_as_processed
# are application helpers defined elsewhere.
@app.task(bind=True, max_retries=3)
def process_invoice(self, invoice_id: str):
    # Idempotent: same invoice_id = same result, even if called multiple times
    if already_processed(invoice_id):
        return get_cached_result(invoice_id)
    try:
        result = fetch_and_process(invoice_id)
        mark_as_processed(invoice_id)
        return result
    except Exception as e:
        # Manual exponential backoff: wait 1s, 2s, then 4s before the next attempt
        raise self.retry(exc=e, countdown=2 ** self.request.retries)

The already_processed check prevents duplicates. The countdown of 2 ** self.request.retries gives exponential backoff (1s, 2s, 4s).
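The same two ideas can be sketched without Celery. This is a stdlib-only illustration, not the code I run: the _results dict stands in for whatever store backs already_processed, and process_once, backoff_delays and the injectable sleep parameter are hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable, Dict, List

_results: Dict[str, Any] = {}  # hypothetical idempotency store, keyed by invoice_id


def backoff_delays(max_retries: int = 3, base: float = 1.0) -> List[float]:
    """Delay before each retry: base * 2**attempt, i.e. 1s, 2s, 4s by default."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]


def process_once(invoice_id: str, action: Callable[[str], Any],
                 max_retries: int = 3,
                 sleep: Callable[[float], None] = time.sleep) -> Any:
    """Run action(invoice_id) at most once per invoice_id, retrying on failure."""
    if invoice_id in _results:  # already done: return the stored result
        return _results[invoice_id]
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            result = action(invoice_id)
            _results[invoice_id] = result  # later calls become no-ops
            return result
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller dead-letter it
            sleep(delays[attempt])
```

One caveat: this only guards against repeating an action that completed. If the action can fail after a partial side effect, as mine did, the downstream call itself also needs to be idempotent, for example by passing the invoice_id as an idempotency key.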

Pattern 3: Every Output Has Schema Validation

LLMs are creative. That’s the problem. My invoice parser output looked fine until I checked the logs:

raw_llm_output.txt
{
  "vendor": "Acme Corp",
  "amount": "five hundred dollars",  # STRING instead of float
  "due_date": "next Tuesday",        # NOT a valid date
  "confidence": 0.92
}

The downstream payment system crashed on "five hundred dollars".

Pydantic validates the LLM output against a strict schema and rejects anything that doesn’t fit:

output_schema.py
from datetime import datetime

# validator is the Pydantic v1 API; in Pydantic v2 the replacement is field_validator
from pydantic import BaseModel, validator


class DocumentOutput(BaseModel):
    vendor: str
    amount: float
    due_date: datetime
    confidence: float
    missing_fields: list[str] = []
    risk_flags: list[str] = []

    @validator('amount')
    def amount_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('amount must be positive')
        return v

    @validator('confidence')
    def confidence_in_range(cls, v):
        if not 0 <= v <= 1:
            raise ValueError('confidence must be between 0 and 1')
        return v

If the LLM output doesn’t match this schema, Pydantic raises a ValidationError - and once retries are exhausted, the task goes to the dead-letter queue.
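To show what the schema check catches without installing anything, here is a stdlib-only stand-in for the Pydantic model. It is a sketch under assumptions: validate_invoice and its error strings are made up for the example, and it only accepts ISO-format dates, which is stricter than what Pydantic accepts.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def validate_invoice(raw: dict) -> Tuple[Optional[dict], List[str]]:
    """Return (clean_record, errors); clean_record is None when errors exist."""
    errors: List[str] = []
    clean: dict = {}

    vendor = raw.get("vendor")
    if isinstance(vendor, str) and vendor.strip():
        clean["vendor"] = vendor.strip()
    else:
        errors.append("vendor: missing or empty")

    try:
        clean["amount"] = float(raw.get("amount"))
        if clean["amount"] <= 0:
            errors.append("amount: must be positive")
    except (TypeError, ValueError):
        errors.append(f"amount: not a number ({raw.get('amount')!r})")

    try:
        clean["due_date"] = datetime.fromisoformat(str(raw.get("due_date")))
    except ValueError:
        errors.append(f"due_date: not an ISO date ({raw.get('due_date')!r})")

    confidence = raw.get("confidence")
    if isinstance(confidence, (int, float)) and 0 <= confidence <= 1:
        clean["confidence"] = float(confidence)
    else:
        errors.append("confidence: must be a number between 0 and 1")

    return (None, errors) if errors else (clean, [])
```

Fed the raw output from the logs above, this returns no record and two errors (amount and due_date), so the bad values never reach the payment system. Collecting every error instead of stopping at the first one also gives the reviewer the full picture in one pass.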

Pattern 4: Every Risky Action Has an Approval Gate

I learned this from a near-miss. The agent was about to auto-pay a $15,000 invoice with confidence: 0.82. Just below my 0.85 threshold, but the system would have processed it anyway.

The approval gate pattern:

approval_gate.py
# Pattern 4: Approval gate for risky actions
# (excerpt from inside the process_invoice task shown in full later)
if validated.confidence < 0.85 or validated.risk_flags:
    state.status = 'needs_review'
    save_state(state_store, invoice_id, state)
    enqueue_human_review(invoice_id, validated)
    return {'status': 'pending_review', 'data': validated}

The human review queue needs:

  • A timeout (24 hours default - after that, escalate)
  • A UI for reviewers to see the data and approve/reject
  • An audit trail of who approved what

I built a simple review interface with Alpine.js that shows pending items and lets me click approve/reject. The approved items move to the final action queue; rejected items go back to the agent with feedback.
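The queue requirements above (timeout with escalation, approve/reject, audit trail) can be sketched as a small data structure. This is an illustration under assumptions, not my review backend: ReviewItem, decide, escalate_if_overdue and the "escalated" status are hypothetical names, and the current time is passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

REVIEW_TIMEOUT = timedelta(hours=24)
CONFIDENCE_THRESHOLD = 0.85


def needs_review(confidence: float, risk_flags: List[str]) -> bool:
    """Gate rule from the article: low confidence or any risk flag."""
    return confidence < CONFIDENCE_THRESHOLD or bool(risk_flags)


@dataclass
class ReviewItem:
    invoice_id: str
    enqueued_at: datetime
    status: str = "pending"  # pending, approved, rejected, escalated
    audit_trail: List[dict] = field(default_factory=list)

    def _log(self, reviewer: str, now: datetime, note: str) -> None:
        self.audit_trail.append({"reviewer": reviewer, "decision": self.status,
                                 "at": now.isoformat(), "note": note})

    def decide(self, reviewer: str, approved: bool, now: datetime,
               note: str = "") -> None:
        """Record a human decision together with who made it and when."""
        self.status = "approved" if approved else "rejected"
        self._log(reviewer, now, note)

    def escalate_if_overdue(self, now: datetime) -> bool:
        """Escalate items still pending after the 24h review window."""
        if self.status == "pending" and now - self.enqueued_at > REVIEW_TIMEOUT:
            self.status = "escalated"
            self._log("system", now, "review window expired")
            return True
        return False
```

With these rules, the $15,000 invoice at confidence 0.82 is held for review, and an item nobody touches for more than 24 hours is escalated rather than silently forgotten.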

Pattern 5: Every Workflow Has a Dead-Letter Path

Dead-letter queues are where failed tasks go to wait for human intervention. Without them, failures disappear into the void.

dead_letter.py
import json
from datetime import datetime

import redis


# get_original_input, get_retry_count and send_alert are application helpers defined elsewhere.
def enqueue_dead_letter(task_id: str, error: str):
    dlq = redis.Redis(host='localhost', port=6379, db=1)  # Separate DB
    dlq.hset('dead_letter_queue', task_id, json.dumps({
        'error': error,
        'timestamp': datetime.now().isoformat(),
        'original_input': get_original_input(task_id),
        'retry_count': get_retry_count(task_id)
    }))
    # Alert notification
    send_alert(f"Task {task_id} moved to dead-letter queue: {error}")

I check the dead-letter queue every morning. It usually has 2-5 items from overnight runs. Without it, those failures would be invisible - and I’d be surprised when downstream systems failed.
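The morning check can be sketched as two small functions. Assumptions to note: a plain dict of task_id to JSON string stands in for the Redis hash so the example runs standalone, and list_dead_letters, replay and the resubmit callback are hypothetical names, not part of any library.

```python
import json
from typing import Callable, Dict, List


def list_dead_letters(dlq: Dict[str, str]) -> List[dict]:
    """Decode every entry, oldest first, for manual triage."""
    entries = [{"task_id": tid, **json.loads(raw)} for tid, raw in dlq.items()]
    return sorted(entries, key=lambda e: e["timestamp"])  # ISO strings sort by time


def replay(dlq: Dict[str, str], task_id: str,
           resubmit: Callable[[object], None]) -> bool:
    """Resubmit one entry's original input; drop it only if that succeeds."""
    raw = dlq.get(task_id)
    if raw is None:
        return False
    resubmit(json.loads(raw)["original_input"])  # raises -> entry stays queued
    del dlq[task_id]
    return True
```

Deleting the entry only after resubmit returns means a failed replay leaves the item in the queue, so nothing is lost while you are fixing the underlying problem.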

Putting It All Together

Here’s a complete production-ready agent step:

agent_pipeline.py
import json
from datetime import datetime

import redis
from celery import Celery
from pydantic import BaseModel


class StepState(BaseModel):
    step_name: str
    input_id: str
    output: dict
    timestamp: datetime
    retry_count: int = 0
    status: str


class DocumentOutput(BaseModel):
    vendor: str
    amount: float
    due_date: datetime
    confidence: float
    missing_fields: list[str] = []
    risk_flags: list[str] = []


app = Celery('agent_pipeline', broker='redis://localhost')
state_store = redis.Redis(host='localhost', port=6379, db=0)


# save_state, fetch_invoice_with_backoff, parse_invoice, enqueue_human_review and
# enqueue_dead_letter are application helpers defined elsewhere.
@app.task(bind=True, max_retries=3, autoretry_for=(Exception,), retry_backoff=True)
def process_invoice(self, invoice_id: str):
    # Pattern 1: Write state before processing and after every transition
    state = StepState(
        step_name='invoice_processing',
        input_id=invoice_id,
        output={},
        timestamp=datetime.now(),
        retry_count=self.request.retries,
        status='running'
    )
    save_state(state_store, invoice_id, state)
    try:
        # Pattern 2: External call; autoretry_for + retry_backoff retry it with exponential delays
        raw_doc = fetch_invoice_with_backoff(invoice_id)
        # Pattern 3: Schema validation
        parsed = parse_invoice(raw_doc)
        validated = DocumentOutput(**parsed)
        data = json.loads(validated.json())  # JSON-safe copy for state and the task result
        # Pattern 4: Approval gate for risky actions
        if validated.confidence < 0.85 or validated.risk_flags:
            state.status = 'needs_review'
            save_state(state_store, invoice_id, state)
            enqueue_human_review(invoice_id, validated)
            return {'status': 'pending_review', 'data': data}
        state.status = 'success'
        state.output = data
        save_state(state_store, invoice_id, state)
        return {'status': 'success', 'data': data}
    except Exception as e:
        # Pattern 5: Dead-letter path once retries are exhausted
        if self.request.retries >= self.max_retries:
            state.status = 'failed'
            state.output = {'error': str(e)}
            save_state(state_store, invoice_id, state)
            enqueue_dead_letter(invoice_id, str(e))
        raise

Architecture Checklist

I use this YAML checklist when reviewing any agent system:

architecture_checklist.yaml
state_layer:
  - every_step_writes_state: true
  - state_schema_enforced: true
  - audit_trail_per_action: true
retry_layer:
  - external_calls_retry: true
  - backoff_strategy: "exponential"
  - max_retries: 3
validation_layer:
  - output_schema_required: true
  - confidence_threshold: 0.85
  - missing_field_handling: "queue_for_review"
approval_layer:
  - risky_actions_require_gate: true
  - human_review_timeout: "24h"
  - auto_approve_safe_actions: true
dead_letter_layer:
  - failed_tasks_queue: true
  - retry_exhausted_path: true
  - manual_recovery_interface: true

If any item is missing, the system will fail in production - not maybe, but definitely.

What I Wish I Knew Earlier

  1. Start with the backend, add agents on top. Don’t build the agent first and hope the infrastructure works.

  2. “Boring is what survives”. The fancy prompt engineering matters, but state management matters more.

  3. Test failure paths, not just success paths. I tested my invoice processor with perfect inputs. Then the real world happened.

  4. Real metrics matter. Personal systems save 3.5 hours/day. Business systems replace $4K-$6K/month of repetitive labor. But those metrics assume the system actually works in production.

  5. The expensive part is bad architecture. Duplicate work, broken retries, messy state, and humans cleaning up after “autonomous” agents. The patterns above prevent all of that.
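Point 3 in the list above is the one that would have caught my 3 AM bug, so here is a sketch of what a failure-path test can look like. Everything in it is hypothetical: FlakyPaymentAPI simulates an API that records the payment and then loses the response, and pay_with_retries is a stripped-down retry loop with the backoff sleeps left out to keep the test fast.

```python
class FlakyPaymentAPI:
    """Hypothetical stand-in: records the payment, then times out on the first call."""

    def __init__(self):
        self.payments = {}  # idempotency_key -> amount
        self.calls = 0

    def pay(self, idempotency_key: str, amount: float) -> str:
        self.calls += 1
        # A repeated key is ignored, so a retry cannot create a second payment.
        self.payments.setdefault(idempotency_key, amount)
        if self.calls == 1:
            raise TimeoutError("response lost after the payment was recorded")
        return "ok"


def pay_with_retries(api, invoice_id: str, amount: float,
                     max_retries: int = 3) -> str:
    """Retry on timeout, always reusing invoice_id as the idempotency key."""
    for attempt in range(max_retries + 1):
        try:
            return api.pay(idempotency_key=invoice_id, amount=amount)
        except TimeoutError:
            if attempt == max_retries:
                raise


def test_timeout_after_partial_success_does_not_duplicate():
    api = FlakyPaymentAPI()
    assert pay_with_retries(api, "inv-1", 500.0) == "ok"
    assert api.calls == 2                    # one timeout, one retry
    assert api.payments == {"inv-1": 500.0}  # still exactly one payment
```

The test fails the way production did if you swap the idempotency key for a fresh value on each attempt, which is a quick way to confirm it exercises the failure path rather than the happy path.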

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can contact me by email.
