How to Build Production-Ready AI Agent Architecture: State, Retries, and Approval Gates
My AI agent crashed at 3 AM. The invoice processing pipeline had been running smoothly for weeks. Then an external API timed out, the retry logic kicked in… and duplicated the same invoice three times. By morning, I was manually fixing a mess created by my “autonomous” agent.
That’s when I realized: I spent 90% of my time building prompts, nodes, and chat interfaces. But the production failures came from the boring stuff I skipped - state management, retries, schemas, approval gates, and dead-letter queues.
The Five Patterns That Actually Matter
After debugging that disaster and reading through production architecture discussions, I found five patterns that make agent systems survive real operations:
```
                             INPUT
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Pattern 1: STATE CHECKPOINT                                 │
│   ┌─────────────┐                                           │
│   │ step_name   │───► Write state BEFORE processing         │
│   │ input_id    │                                           │
│   │ timestamp   │                                           │
│   │ status      │───► pending → running → success/failed    │
│   └─────────────┘                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Pattern 2: RETRY/BACKOFF                                    │
│   ┌─────────────┐                                           │
│   │ max_retries │───► 3 retries with exponential backoff    │
│   │ backoff     │───► 1s → 2s → 4s                          │
│   └─────────────┘                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Pattern 3: SCHEMA VALIDATION                                │
│   ┌─────────────┐                                           │
│   │ output      │───► Pydantic BaseModel schema             │
│   │ confidence  │───► threshold check (< 0.85 = review)     │
│   └─────────────┘                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Pattern 4: APPROVAL GATE                                    │
│   ┌─────────────┐                                           │
│   │ risk_flags  │───► Any flag = human review queue         │
│   │ timeout     │───► 24h review window                     │
│   └─────────────┘                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ Pattern 5: DEAD-LETTER PATH                                 │
│   ┌─────────────┐                                           │
│   │ failed      │───► Queue for manual recovery             │
│   │ exhausted   │───► Max retries reached → safe storage    │
│   └─────────────┘                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                            OUTPUT
```

Let me walk through how I implemented each pattern - and where I failed first.
Pattern 1: Every Step Writes State
My first mistake was treating state as optional. The agent processed an invoice, succeeded, and I thought “done”. But when the network flaked, I had no record of where the process stopped or what partial work existed.
Here’s the state schema I ended up with:
```python
from pydantic import BaseModel
from datetime import datetime

class StepState(BaseModel):
    step_name: str
    input_id: str
    output: dict
    timestamp: datetime
    retry_count: int = 0
    status: str  # pending, running, success, failed, needs_review
```

The key insight: status must have a needs_review state. Not just success/failed. When an agent outputs something questionable, you need a human in the loop - but you also need to track that it’s waiting for review, not just “stuck”.
I use Redis for state storage because it’s fast and supports atomic operations. But PostgreSQL works fine too - the important thing is that every step writes state BEFORE processing, not after.
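Here’s a minimal sketch of the write-before-processing rule, assuming a key-value store with set/get (the same calls redis.Redis offers). The names Checkpoint, InMemoryStore, run_step, and the state: key format are hypothetical; the stdlib dataclass stands in for the Pydantic StepState so the example runs without dependencies.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Checkpoint:
    # Dependency-free stand-in for the StepState model (assumption).
    step_name: str
    input_id: str
    status: str = "pending"
    output: dict = field(default_factory=dict)
    timestamp: str = ""


class InMemoryStore:
    """Hypothetical stand-in for redis.Redis; only set/get are used."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def state_key(step_name, input_id):
    return f"state:{step_name}:{input_id}"


def save_state(store, cp):
    cp.timestamp = datetime.now(timezone.utc).isoformat()
    store.set(state_key(cp.step_name, cp.input_id), json.dumps(asdict(cp)))


def load_state(store, step_name, input_id):
    raw = store.get(state_key(step_name, input_id))
    return json.loads(raw) if raw is not None else None


def run_step(store, step_name, input_id, work):
    cp = Checkpoint(step_name, input_id, status="running")
    save_state(store, cp)  # written BEFORE any work happens
    try:
        cp.output = work(input_id)
        cp.status = "success"
    except Exception as exc:
        cp.status = "failed"
        cp.output = {"error": str(exc)}
        raise
    finally:
        save_state(store, cp)  # final status is recorded on both paths
    return cp.output
```

If the process dies mid-step, the stored record still says running, which tells you exactly where to look. A record that only appears after success tells you nothing about the step that crashed.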
Pattern 2: Every External Call Has Retry/Backoff
I initially used a simple retry decorator:
```python
# WRONG: Naive retry that creates duplicates
def retry_on_failure(func, max_retries=3):
    for i in range(max_retries):
        try:
            return func()
        except Exception:
            continue
    raise Exception("Max retries reached")
```

This created the 3 AM disaster. The function executed, succeeded partially, then the network failed. The retry ran the SAME function again - duplicating work.
The fix: use idempotent operations with Celery’s built-in retry:
```python
from celery import Celery

app = Celery('agent_pipeline', broker='redis://localhost')

# No autoretry_for here: retries are triggered explicitly via self.retry() below.
@app.task(bind=True, max_retries=3)
def process_invoice(self, invoice_id: str):
    # Idempotent: same invoice_id = same result, even if called multiple times
    if already_processed(invoice_id):
        return get_cached_result(invoice_id)

    try:
        result = fetch_and_process(invoice_id)
        mark_as_processed(invoice_id)
        return result
    except Exception as e:
        # Exponential backoff: the countdown doubles with each retry (1s, 2s, 4s)
        raise self.retry(exc=e, countdown=2 ** self.request.retries)
```

The already_processed check prevents duplicates once a result has been recorded. The 2 ** self.request.retries countdown gives exponential backoff (1s, 2s, 4s).
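One gap worth spelling out: a processed-check only helps once the result has been recorded. If the side effect lands and the call then fails, the retry still repeats it unless the write itself is keyed. The sketch below assumes a hypothetical Ledger that deduplicates on an idempotency key, plus a plain retry loop with the same 1s, 2s, 4s schedule; Ledger, backoff_delays, and retry_with_backoff are illustrative names, not Celery APIs.

```python
import time


def backoff_delays(max_retries, base=1.0):
    """Delay before each retry: base * 2**attempt."""
    return [base * 2 ** attempt for attempt in range(max_retries)]


class Ledger:
    """Hypothetical side-effect target that deduplicates on an idempotency key."""

    def __init__(self):
        self.entries = {}

    def record(self, idempotency_key, amount):
        # setdefault makes a repeated write with the same key a no-op.
        return self.entries.setdefault(idempotency_key, amount)


def retry_with_backoff(operation, max_retries=3, sleep=time.sleep):
    """Call operation(); on failure wait 1s, 2s, 4s... and try again."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller dead-letter it
            sleep(delays[attempt])
```

Passing sleep as a parameter keeps the loop testable without real waiting. In production the ledger would be whatever system receives the write, keyed by something stable like the invoice ID.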
Pattern 3: Every Output Has Schema Validation
LLMs are creative. That’s the problem. My invoice parser output looked fine until I checked the logs:
```python
{
    "vendor": "Acme Corp",
    "amount": "five hundred dollars",  # STRING instead of float
    "due_date": "next Tuesday",        # NOT a valid date
    "confidence": 0.92
}
```

The downstream payment system crashed on "five hundred dollars".
Pydantic validates the LLM output against a strict schema and rejects anything that doesn’t fit:
```python
from pydantic import BaseModel, validator
from datetime import datetime

class DocumentOutput(BaseModel):
    vendor: str
    amount: float
    due_date: datetime
    confidence: float
    missing_fields: list[str] = []
    risk_flags: list[str] = []

    # Pydantic v1-style validators; in Pydantic v2 the equivalent is field_validator.
    @validator('amount')
    def amount_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('amount must be positive')
        return v

    @validator('confidence')
    def confidence_in_range(cls, v):
        if not 0 <= v <= 1:
            raise ValueError('confidence must be between 0 and 1')
        return v
```

If the LLM output doesn’t match this schema, Pydantic raises a ValidationError - and once retries are exhausted, the task goes to the dead-letter queue.
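To see what a rejection looks like, here’s a small sketch that runs the bad payload from the logs through a schema and reports which fields failed. It assumes Pydantic is installed; InvoiceFields and validate_output are hypothetical names, and Field constraints replace the custom validators only to keep the example short.

```python
from datetime import datetime

from pydantic import BaseModel, Field, ValidationError


class InvoiceFields(BaseModel):
    # Field constraints express the same range rules as the validators.
    vendor: str
    amount: float = Field(gt=0)
    due_date: datetime
    confidence: float = Field(ge=0, le=1)


def validate_output(raw: dict):
    """Return (model, []) on success or (None, [failing field names]) on failure."""
    try:
        return InvoiceFields(**raw), []
    except ValidationError as exc:
        # Each error carries a "loc" tuple whose first element is the field name.
        return None, sorted({str(err["loc"][0]) for err in exc.errors()})
```

The list of failing fields is useful to store alongside the dead-letter entry, so whoever picks it up can see what the model got wrong without re-running the parse.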
Pattern 4: Every Risky Action Has an Approval Gate
I learned this from a near-miss. The agent was about to auto-pay a $15,000 invoice with confidence: 0.82. Just below my 0.85 threshold, but the system would have processed it anyway.
The approval gate pattern:
```python
# Pattern 4: Approval gate for risky actions (excerpt from the task body)
if validated.confidence < 0.85 or validated.risk_flags:
    state.status = 'needs_review'
    save_state(state_store, invoice_id, state)
    enqueue_human_review(invoice_id, validated)
    # Return a plain dict: a Pydantic model is not JSON-serializable as a Celery result
    return {'status': 'pending_review', 'data': validated.dict()}
```

The human review queue needs:
- A timeout (24 hours default - after that, escalate)
- A UI for reviewers to see the data and approve/reject
- An audit trail of who approved what
I built a simple review interface with Alpine.js that shows pending items and lets me click approve/reject. The approved items move to the final action queue; rejected items go back to the agent with feedback.
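The gate check, the 24-hour timeout, and the audit trail can be sketched without any UI. This is an in-memory illustration under stated assumptions: needs_review, ReviewQueue, and its method names are hypothetical, and a real queue would persist both the pending items and the audit log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

REVIEW_TIMEOUT = timedelta(hours=24)
CONFIDENCE_THRESHOLD = 0.85


def needs_review(confidence, risk_flags, threshold=CONFIDENCE_THRESHOLD):
    """True when confidence is below the threshold or any risk flag is set."""
    return confidence < threshold or bool(risk_flags)


@dataclass
class ReviewQueue:
    pending: dict = field(default_factory=dict)    # item_id -> (payload, enqueued_at)
    audit_log: list = field(default_factory=list)  # append-only record of decisions

    def enqueue(self, item_id, payload, now):
        self.pending[item_id] = (payload, now)

    def decide(self, item_id, reviewer, approved, now):
        """Record who decided what, then release or reject the item."""
        payload, _ = self.pending.pop(item_id)
        self.audit_log.append({
            "item_id": item_id,
            "reviewer": reviewer,
            "approved": approved,
            "decided_at": now.isoformat(),
        })
        return payload if approved else None

    def overdue(self, now):
        """Items waiting longer than the review window; candidates for escalation."""
        return [
            item_id
            for item_id, (_, enqueued_at) in self.pending.items()
            if now - enqueued_at > REVIEW_TIMEOUT
        ]
```

Passing now explicitly keeps the timeout logic testable. A scheduled job could call overdue() and escalate whatever it returns.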
Pattern 5: Every Workflow Has a Dead-Letter Path
Dead-letter queues are where failed tasks go to wait for human intervention. Without them, failures disappear into the void.
```python
import json
from datetime import datetime

import redis

def enqueue_dead_letter(task_id: str, error: str):
    dlq = redis.Redis(host='localhost', port=6379, db=1)  # Separate DB
    dlq.hset('dead_letter_queue', task_id, json.dumps({
        'error': error,
        'timestamp': datetime.now().isoformat(),
        'original_input': get_original_input(task_id),
        'retry_count': get_retry_count(task_id)
    }))
    # Alert notification
    send_alert(f"Task {task_id} moved to dead-letter queue: {error}")
```

I check the dead-letter queue every morning. It usually has 2-5 items from overnight runs. Without it, those failures would be invisible - and I’d be surprised when downstream systems failed.
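For the morning check itself, here’s a rough sketch of listing and replaying entries. It assumes the same dead_letter_queue hash and entry layout as above; InMemoryHashStore is a hypothetical stand-in that mirrors the hset/hgetall/hdel calls of redis.Redis, and list_dead_letters and replay_dead_letter are illustrative names. With a real client, creating it with decode_responses=True keeps the task IDs as strings.

```python
import json

DLQ_KEY = "dead_letter_queue"


class InMemoryHashStore:
    """Hypothetical stand-in for redis.Redis; mirrors hset/hgetall/hdel."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[key] = value

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))

    def hdel(self, name, key):
        removed = self._hashes.get(name, {}).pop(key, None)
        return 1 if removed is not None else 0


def list_dead_letters(store):
    """Return {task_id: entry} for everything waiting in the queue."""
    return {
        task_id: json.loads(raw)
        for task_id, raw in store.hgetall(DLQ_KEY).items()
    }


def replay_dead_letter(store, task_id, resubmit):
    """Resubmit one entry, removing it only after resubmit() returns."""
    entry = list_dead_letters(store).get(task_id)
    if entry is None:
        return False
    resubmit(task_id, entry["original_input"])
    store.hdel(DLQ_KEY, task_id)
    return True
```

Deleting only after resubmit() returns means a failed replay leaves the entry in place instead of losing it.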
Putting It All Together
Here’s a complete production-ready agent step:
```python
from pydantic import BaseModel
from datetime import datetime
import redis
from celery import Celery

class StepState(BaseModel):
    step_name: str
    input_id: str
    output: dict
    timestamp: datetime
    retry_count: int = 0
    status: str

class DocumentOutput(BaseModel):
    vendor: str
    amount: float
    due_date: datetime
    confidence: float
    missing_fields: list[str] = []
    risk_flags: list[str] = []

app = Celery('agent_pipeline', broker='redis://localhost')
state_store = redis.Redis(host='localhost', port=6379, db=0)

# Helpers such as save_state, fetch_invoice_with_backoff, parse_invoice,
# enqueue_human_review, and enqueue_dead_letter are defined elsewhere.

# retry_backoff=True makes Celery's automatic retries back off exponentially.
@app.task(bind=True, max_retries=3, autoretry_for=(Exception,), retry_backoff=True)
def process_invoice(self, invoice_id: str):
    # Idempotency guard from Pattern 2: skip work that already completed
    if already_processed(invoice_id):
        return get_cached_result(invoice_id)

    # Pattern 1: Write state at every step
    state = StepState(
        step_name='invoice_processing',
        input_id=invoice_id,
        output={},
        timestamp=datetime.now(),
        status='running'
    )
    save_state(state_store, invoice_id, state)

    try:
        # Pattern 2: External call with retry (Celery handles this)
        raw_doc = fetch_invoice_with_backoff(invoice_id)

        # Pattern 3: Schema validation
        parsed = parse_invoice(raw_doc)
        validated = DocumentOutput(**parsed)

        # Pattern 4: Approval gate for risky actions
        if validated.confidence < 0.85 or validated.risk_flags:
            state.status = 'needs_review'
            save_state(state_store, invoice_id, state)
            enqueue_human_review(invoice_id, validated)
            return {'status': 'pending_review', 'data': validated.dict()}

        state.status = 'success'
        state.output = validated.dict()
        save_state(state_store, invoice_id, state)
        mark_as_processed(invoice_id)
        # Return a plain dict: a Pydantic model is not JSON-serializable as a Celery result
        return {'status': 'success', 'data': validated.dict()}

    except Exception as e:
        # Pattern 5: Dead-letter path once retries are exhausted
        if self.request.retries >= self.max_retries:
            state.status = 'failed'
            state.output = {'error': str(e)}
            save_state(state_store, invoice_id, state)
            enqueue_dead_letter(invoice_id, str(e))
        raise
```

Architecture Checklist
I use this YAML checklist when reviewing any agent system:
```yaml
state_layer:
  - every_step_writes_state: true
  - state_schema_enforced: true
  - audit_trail_per_action: true

retry_layer:
  - external_calls_retry: true
  - backoff_strategy: "exponential"
  - max_retries: 3

validation_layer:
  - output_schema_required: true
  - confidence_threshold: 0.85
  - missing_field_handling: "queue_for_review"

approval_layer:
  - risky_actions_require_gate: true
  - human_review_timeout: "24h"
  - auto_approve_safe_actions: true

dead_letter_layer:
  - failed_tasks_queue: true
  - retry_exhausted_path: true
  - manual_recovery_interface: true
```

If any item is missing, the system will fail in production - not maybe, but definitely.
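If you want to run the checklist mechanically, a short script can report the gaps. This sketch assumes the checklist has already been parsed into a dict of the shape a YAML loader would produce (each layer maps to a list of single-key dicts); REQUIRED and missing_items are hypothetical names.

```python
REQUIRED = {
    "state_layer": [
        "every_step_writes_state", "state_schema_enforced", "audit_trail_per_action",
    ],
    "retry_layer": ["external_calls_retry", "backoff_strategy", "max_retries"],
    "validation_layer": [
        "output_schema_required", "confidence_threshold", "missing_field_handling",
    ],
    "approval_layer": [
        "risky_actions_require_gate", "human_review_timeout", "auto_approve_safe_actions",
    ],
    "dead_letter_layer": [
        "failed_tasks_queue", "retry_exhausted_path", "manual_recovery_interface",
    ],
}


def flatten(layer_items):
    """Merge a list of single-key dicts into one dict."""
    merged = {}
    for item in layer_items:
        merged.update(item)
    return merged


def missing_items(config):
    """Return 'layer.item' for every checklist item that is absent or false."""
    missing = []
    for layer, keys in REQUIRED.items():
        present = flatten(config.get(layer, []))
        for key in keys:
            if present.get(key) in (None, False):
                missing.append(f"{layer}.{key}")
    return missing
```

An empty list means every item is present; anything else names exactly which layer still needs work.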
What I Wish I Knew Earlier
- Start with the backend, add agents on top. Don’t build the agent first and hope the infrastructure works.
- “Boring is what survives”. The fancy prompt engineering matters, but state management matters more.
- Test failure paths, not just success paths. I tested my invoice processor with perfect inputs. Then the real world happened.
- Real metrics matter. Personal systems save 3.5 hours/day. Business systems replace $4K-$6K/month of repetitive labor. But those metrics assume the system actually works in production.
- The expensive part is bad architecture. Duplicate work, broken retries, messy state, and humans cleaning up after “autonomous” agents. The patterns above prevent all of that.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.