Why Do Production AI Agents Need Code Instead of Just Markdown and Folders?
Problem
I deployed an AI agent to production last month. It worked perfectly in my dev environment—a simple workflow that read markdown files from a folder, processed them with an LLM, and wrote results to another folder.
The first night in production, I woke up to an $800 API bill. The agent had gotten stuck in a loop, repeatedly calling the LLM on the same inputs without any caching or retry limits. No logs, no traces, no way to debug what happened.
Here’s what my “simple” folder-based agent looked like:
    /prompts/
        analyze.md
        summarize.md
    /input/
        doc1.txt
        doc2.txt
    /output/
        result1.txt
        result2.txt

And the shell script orchestrating it:

    #!/bin/bash
    for file in input/*.txt; do
        prompt=$(cat prompts/analyze.md)
        content=$(cat "$file")
        result=$(llm_call "$prompt" "$content")  # No caching, no retries, no limits
        echo "$result" > "output/result_$(basename $file)"
    done

This works for prototypes. In production, it’s a disaster waiting to happen.
What Happened?
My agent ran fine for the first 100 requests. Then:
- Cost explosion: A network hiccup caused the LLM call to fail silently. The script retried immediately (no backoff), and the LLM started returning partial results. The agent kept calling the API on the same input, over and over.
- No observability: I had no idea which files were processed, how many API calls were made, or where the money was going.
- Non-deterministic behavior: Running the same input twice gave different results because I had no state management or controlled randomness.
- Latency spikes: Processing 1000 files sequentially took 4 hours because there was no parallelization.
When I posted about this on Reddit, the response was blunt:
“For production level software you need code. For four reasons: 1. Cost. 2. Determinism. 3. Latency. 4. Context & memory management.” - mohdgame
Another comment hit harder:
“routing logic is where it falls apart. you still need code for conditionals, error handling, retries, rate limiting” - Dependent_Slide4675
They were right.
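The observability gap is the cheapest one to close. Below is a minimal sketch of an append-only call ledger, assuming one JSON line per LLM call is enough to answer "which files were processed, how many calls, and at what cost". `CallLedger`, its field names, and the JSONL layout are hypothetical and not part of the original agent.

```python
import json
import time
from pathlib import Path


class CallLedger:
    """Append-only JSONL record of every LLM call (hypothetical helper)."""

    def __init__(self, path: Path):
        self.path = path

    def record(self, file_name: str, status: str, est_cost_usd: float) -> None:
        """Append one entry; status is a free-form label such as "ok" or "error"."""
        entry = {
            "ts": time.time(),
            "file": file_name,
            "status": status,
            "est_cost_usd": est_cost_usd,
        }
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def summary(self) -> dict:
        """Aggregate call count and estimated spend per input file."""
        totals: dict = {}
        if not self.path.exists():
            return totals
        for line in self.path.read_text(encoding="utf-8").splitlines():
            entry = json.loads(line)
            stats = totals.setdefault(
                entry["file"], {"calls": 0, "est_cost_usd": 0.0}
            )
            stats["calls"] += 1
            stats["est_cost_usd"] += entry["est_cost_usd"]
        return totals
```

With a ledger like this, a file that shows hundreds of calls in `summary()` is an immediate sign of a loop, which is exactly the signal that was missing on the first night.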
The Gap Between Prototyping and Production
Markdown and folder-based workflows excel during prototyping:
+ Quick iteration cycles
+ Easy to understand structure
+ Human-readable and editable
+ Version control friendly

But production introduces challenges that folders cannot solve:

+ Cost control -> Every LLM call costs money
+ Determinism -> Same input, predictable output
+ Latency control -> Users won't wait 30 seconds
+ Memory management -> Context windows are limited

Let me show you how I rebuilt the agent with code.
Solution: Code-Based Architecture
1. Cost Control Through Caching and Retries
The first thing I added was proper cost control:
    import hashlib
    import time

    from tenacity import retry, stop_after_attempt, wait_exponential


    class CostControlledLLM:
        def __init__(self, llm_client, cache_ttl: int = 3600):
            self.client = llm_client
            # key -> (stored_at, response). In production, use Redis
            self.cache = {}
            self.cache_ttl = cache_ttl
            self.call_count = 0
            self.total_cost = 0.0

        def _cache_key(self, prompt: str, context: str) -> str:
            return hashlib.sha256(f"{prompt}:{context}".encode()).hexdigest()

        # At most 3 attempts, with exponential backoff between 4 and 10 seconds
        @retry(
            stop=stop_after_attempt(3),
            wait=wait_exponential(multiplier=1, min=4, max=10),
        )
        async def call(self, prompt: str, context: str = "") -> str:
            # Check cache first (this saves money)
            key = self._cache_key(prompt, context)
            cached = self.cache.get(key)
            if cached and time.time() - cached[0] < self.cache_ttl:
                print("Cache hit! Saved an API call.")
                return cached[1]

            # Only call the LLM if there is no fresh cache entry
            response = await self.client.generate(prompt, context)

            # Track costs (the context is billed as input too)
            self.call_count += 1
            self.total_cost += self._estimate_cost(prompt + context, response)

            # Cache the result together with its timestamp
            self.cache[key] = (time.time(), response)
            return response

        def _estimate_cost(self, prompt: str, response: str) -> float:
            # Rough estimate: $0.002 per 1K tokens, about 4 characters per token
            tokens = (len(prompt) + len(response)) / 4
            return tokens * 0.002 / 1000

With this wrapper, identical requests hit the cache instead of the API, and a failing call is attempted at most three times. After implementing this, my costs dropped by 60% because the agent stopped making redundant calls.
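Caching and bounded retries reduce spend, but neither puts a hard ceiling on it. A minimal sketch of such a ceiling follows, assuming a per-run limit on both call count and estimated cost. `BudgetGuard` and `BudgetExceeded` are hypothetical names introduced for illustration, and the cost passed in is whatever estimate you already compute.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push the run past a hard limit."""


class BudgetGuard:
    """Hard per-run ceiling on calls and estimated spend (illustrative sketch)."""

    def __init__(self, max_calls: int, max_cost_usd: float):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def charge(self, est_cost_usd: float) -> None:
        """Record one call, or raise before a limit is crossed."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"call limit of {self.max_calls} reached")
        if self.cost_usd + est_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"cost limit of ${self.max_cost_usd:.2f} reached"
            )
        self.calls += 1
        self.cost_usd += est_cost_usd


if __name__ == "__main__":
    guard = BudgetGuard(max_calls=3, max_cost_usd=1.00)
    try:
        for _ in range(5):
            guard.charge(0.10)  # call this right before each API request
    except BudgetExceeded as exc:
        print(f"Stopping run: {exc}")
```

Calling `charge` before each request means a runaway loop stops at the limit instead of at the end of the billing cycle. The limits themselves are a judgment call for your workload.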
2. Determinism Through Structured State
Folder-based agents have no concept of state. Code lets you define exactly how your agent transitions between states:
    from typing import Literal, TypedDict

    from langgraph.graph import END, StateGraph


    class AgentState(TypedDict):
        input: str
        route: Literal["analyze", "summarize", "clarify"]
        confidence: float
        result: str | None
        error: str | None


    def analyze_input(state: AgentState) -> AgentState:
        """Determine what kind of processing is needed."""
        # This is where you'd call your LLM to classify the input
        if "?" in state["input"]:
            return {**state, "route": "clarify", "confidence": 0.9}
        elif len(state["input"]) > 1000:
            return {**state, "route": "summarize", "confidence": 0.8}
        else:
            return {**state, "route": "analyze", "confidence": 0.95}


    def route_request(state: AgentState) -> str:
        """Deterministic routing based on state."""
        if state["confidence"] < 0.7:
            return "clarify"
        return state["route"]


    def execute_task(state: AgentState) -> AgentState:
        """Execute the appropriate task."""
        # Your actual processing logic here
        return {**state, "result": f"Processed: {state['input'][:50]}"}


    # Build the graph
    workflow = StateGraph(AgentState)
    workflow.add_node("analyze", analyze_input)
    workflow.add_node("clarify", lambda s: {**s, "result": "Please clarify your request"})
    workflow.add_node("summarize", execute_task)
    workflow.add_node("analyze_task", execute_task)

    # The graph needs an entry point before it can be compiled
    workflow.set_entry_point("analyze")

    # Deterministic edges
    workflow.add_conditional_edges(
        "analyze",
        route_request,
        {"clarify": "clarify", "summarize": "summarize", "analyze": "analyze_task"},
    )

    workflow.add_edge("clarify", END)
    workflow.add_edge("summarize", END)
    workflow.add_edge("analyze_task", END)

    app = workflow.compile()

Because the routing is plain code, the same state always produces the same routing decision. I can trace exactly which path the agent took and why.
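To make the "trace which path was taken" idea concrete without any framework, here is a plain-Python stand-in. It is a sketch that mirrors the routing rules above and is not LangGraph's API; `classify`, `run`, and the trace list are hypothetical names, and the summarize handler's output string is made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


def classify(state: State) -> State:
    """Pick a route with the same keyword and length rules as above."""
    text = str(state["input"])
    if "?" in text:
        route = "clarify"
    elif len(text) > 1000:
        route = "summarize"
    else:
        route = "analyze"
    return {**state, "route": route}


def run(text: str) -> Tuple[State, List[str]]:
    """Run classify -> handler; return the final state and the path taken."""
    handlers: Dict[str, Callable[[State], State]] = {
        "clarify": lambda s: {**s, "result": "Please clarify your request"},
        "summarize": lambda s: {
            **s, "result": f"Summary of {len(str(s['input']))} chars"
        },
        "analyze": lambda s: {
            **s, "result": f"Processed: {str(s['input'])[:50]}"
        },
    }
    trace = ["classify"]
    state = classify({"input": text})
    route = str(state["route"])
    trace.append(route)  # record the decision so it can be logged or asserted
    return handlers[route](state), trace


if __name__ == "__main__":
    final_state, path = run("What does this report mean?")
    print(path, final_state["result"])
```

Since `run` is a pure function of its input, calling it twice with the same text yields the same path, which is what makes routing decisions straightforward to assert in tests. Once an LLM does the classification, that guarantee only holds as far as the model's output is stable.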
3. Latency Control Through Parallelization
Folder-based agents process files sequentially. Code lets you run tasks in parallel:
    import asyncio
    import time
    from pathlib import Path
    from typing import List


    class RateLimiter:
        def __init__(self, calls_per_minute: int):
            self.interval = 60.0 / calls_per_minute
            self.last_call = 0.0
            self._lock = asyncio.Lock()

        async def acquire(self):
            # Serialize waiters so concurrent tasks are spaced out
            # instead of all being released at the same moment
            async with self._lock:
                wait_time = self.interval - (time.monotonic() - self.last_call)
                if wait_time > 0:
                    await asyncio.sleep(wait_time)
                self.last_call = time.monotonic()


    class ParallelProcessor:
        def __init__(self, llm: CostControlledLLM, max_concurrent: int = 10):
            self.llm = llm  # the CostControlledLLM wrapper from section 1
            self.semaphore = asyncio.Semaphore(max_concurrent)
            self.rate_limiter = RateLimiter(calls_per_minute=60)

        async def process_file(self, file_path: str) -> dict:
            async with self.semaphore:
                await self.rate_limiter.acquire()
                content = Path(file_path).read_text()
                result = await self.llm.call("Analyze this:", content)
                return {"file": file_path, "result": result}

        async def process_batch(self, files: List[str]) -> List[dict]:
            start = time.monotonic()
            tasks = [self.process_file(f) for f in files]
            # return_exceptions=True: failures come back as exception objects
            results = await asyncio.gather(*tasks, return_exceptions=True)
            elapsed = time.monotonic() - start

            print(f"Processed {len(files)} files in {elapsed:.2f}s")
            if files:
                print(f"Average: {elapsed / len(files):.2f}s per file")
            return results

With parallelization, processing 1000 files went from 4 hours to about 15 minutes. The semaphore caps the number of in-flight requests so the API isn't overwhelmed, and the rate limiter keeps me under quota. Treat the 60 calls per minute above as a placeholder for your own quota: at that rate, 1000 uncached calls need roughly 17 minutes.
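The effect of the semaphore is easy to verify in isolation. The sketch below replaces the LLM call with a short `asyncio.sleep` and measures the peak number of jobs in flight. `run_bounded` is a hypothetical helper written for this demonstration only.

```python
import asyncio


async def run_bounded(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake jobs and return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def job() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for an LLM call
            in_flight -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak


if __name__ == "__main__":
    peak = asyncio.run(run_bounded(n_tasks=50, max_concurrent=10))
    print(f"Peak concurrency: {peak}")
```

However many tasks are submitted, the peak never exceeds `max_concurrent`, so the limit can be tuned to the provider's concurrency allowance without touching the rest of the pipeline.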
4. Context Management
LLMs have token limits. Folders don’t help you manage context. Code does:
    from typing import List

    import tiktoken


    class ContextManager:
        def __init__(self, max_tokens: int = 4000, model: str = "gpt-4"):
            self.max_tokens = max_tokens
            self.encoding = tiktoken.encoding_for_model(model)

        def count_tokens(self, text: str) -> int:
            return len(self.encoding.encode(text))

        def trim_context(self, messages: List[str], reserve_for_response: int = 500) -> List[str]:
            """Trim context to fit within token limit."""
            available = self.max_tokens - reserve_for_response

            total = sum(self.count_tokens(m) for m in messages)
            if total <= available:
                return messages

            # Keep most recent messages that fit
            trimmed = []
            current_tokens = 0

            for message in reversed(messages):
                tokens = self.count_tokens(message)
                if current_tokens + tokens > available:
                    break
                trimmed.insert(0, message)
                current_tokens += tokens

            print(f"Trimmed context from {total} to {current_tokens} tokens")
            return trimmed

        def summarize_if_needed(self, messages: List[str]) -> str:
            """Summarize long conversations to preserve context."""
            total = sum(self.count_tokens(m) for m in messages)

            if total > self.max_tokens * 2:
                # In production, call LLM to summarize
                return "Previous conversation summarized: " + messages[0][:100] + "..."

            return "\n".join(messages)

This prevents the dreaded “context length exceeded” error that crashes folder-based agents.
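A worked example makes the trimming rule easier to check by hand. The sketch below uses the same keep-the-most-recent strategy but swaps the tokenizer for a rough heuristic of about 4 characters per token, so it runs without any dependency. `approx_tokens` and `keep_recent` are hypothetical names, and the heuristic is an assumption, not a substitute for a real tokenizer.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


def keep_recent(
    messages: List[str],
    budget: int,
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Return the longest run of most recent messages that fits the budget."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):
        cost = count(message)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
    print(len(keep_recent(history, budget=25)))
```

With three 40-character messages at roughly 10 tokens apiece, a budget of 25 keeps only the last two, and a budget of 5 keeps none. Passing a real tokenizer's counting function as `count` gives exact numbers instead of estimates.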
The Hybrid Approach: Best of Both Worlds
After all this, I realized markdown and folders do have a place. They’re great for storing prompts that humans need to edit. The key is to use markdown for content and code for orchestration.
    /prompts/
        analyze.md      <- Human-editable prompts
        summarize.md    <- Version controlled
    /src/
        agent.py        <- Orchestration logic
        state.py        <- State management
        cache.py        <- Cost control
    /tests/
        test_agent.py   <- Unit tests

The agent then loads the prompts from disk and keeps everything else in code:

    from pathlib import Path

    from jinja2 import Template


    class HybridAgent:
        def __init__(self, prompts_dir: Path = Path("prompts")):
            self.prompts_dir = prompts_dir
            # OpenAIClient and StateManager stand in for your own client
            # and state classes; they are not defined in this post
            self.llm = CostControlledLLM(OpenAIClient())
            self.state_manager = StateManager()
            self._load_prompts()

        def _load_prompts(self):
            """Load markdown prompts, manage with code."""
            self.analyze_prompt = Template(
                (self.prompts_dir / "analyze.md").read_text()
            )
            self.summarize_prompt = Template(
                (self.prompts_dir / "summarize.md").read_text()
            )

        async def execute(self, user_input: str) -> str:
            # Code determines the route (_determine_route is not shown here;
            # it returns a label such as "complex")
            route = await self._determine_route(user_input)

            # Markdown provides the prompt
            if route == "complex":
                prompt = self.analyze_prompt.render(input=user_input)
            else:
                prompt = self.summarize_prompt.render(input=user_input)

            # Code handles the execution
            return await self.llm.call(prompt)

This way, non-developers can edit prompts in markdown files, while the code handles all the production concerns.
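Moving orchestration into code also makes it testable, which a folder of prompts is not. The sketch below shows the kind of test that could live in `tests/test_agent.py`, assuming a fake client that counts calls instead of contacting an API. `FakeClient` and `CachedLLM` are hypothetical, simplified stand-ins for the real client and the caching wrapper, kept small so the example runs on its own.

```python
import asyncio
import hashlib


class FakeClient:
    """Test double: counts calls instead of contacting a real API."""

    def __init__(self) -> None:
        self.calls = 0

    async def generate(self, prompt: str, context: str) -> str:
        self.calls += 1
        return f"echo:{prompt}:{context}"


class CachedLLM:
    """Minimal stand-in for a caching wrapper, small enough to test alone."""

    def __init__(self, client: FakeClient) -> None:
        self.client = client
        self.cache: dict = {}

    async def call(self, prompt: str, context: str = "") -> str:
        key = hashlib.sha256(f"{prompt}:{context}".encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = await self.client.generate(prompt, context)
        return self.cache[key]


def test_identical_requests_hit_cache() -> None:
    client = FakeClient()
    llm = CachedLLM(client)
    first = asyncio.run(llm.call("Analyze this:", "doc1"))
    second = asyncio.run(llm.call("Analyze this:", "doc1"))
    assert first == second
    assert client.calls == 1  # the second call was served from the cache


def test_different_inputs_miss_cache() -> None:
    client = FakeClient()
    llm = CachedLLM(client)
    asyncio.run(llm.call("Analyze this:", "doc1"))
    asyncio.run(llm.call("Analyze this:", "doc2"))
    assert client.calls == 2  # distinct inputs each reach the client


if __name__ == "__main__":
    test_identical_requests_hit_cache()
    test_different_inputs_miss_cache()
    print("all tests passed")
```

A test like the first one would have caught the redundant-call problem before deployment, at zero API cost.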
The Numbers
After rewriting the agent with proper code architecture:
Metric         | Folder-based | Code-based
---------------|--------------|-----------
API cost/day   | $200+        | $40
Avg latency    | 12s          | 2s
Error rate     | 15%          | 0.5%
Debugging time | Hours        | Minutes
Test coverage  | 0%           | 85%

The upfront investment in code paid for itself in the first week.
When to Make the Switch
Not every agent needs code from day one. Here’s my decision framework:
Use Folders When:
- Prototype/exploration phase
- Less than 10 requests/day
- Cost is not a concern
- Single user, no SLA

Use Code When:
- Production deployment
- More than 100 requests/day
- Cost control matters
- Multiple users, SLA required
- Need debugging and monitoring

Common Mistakes to Avoid
Mistake 1: Over-engineering early
Don’t start with a full code architecture for a prototype. Start with folders, and add code when you hit production requirements.
Mistake 2: All-or-nothing thinking
You don’t have to choose. Use markdown for prompts, code for orchestration. Both have their place.
Mistake 3: Ignoring hidden costs
“Simple” folder workflows hide complexity in shell scripts. A 50-line shell script is often harder to debug than 200 lines of Python with proper error handling.
Summary
In this post, I showed why production AI agents need code instead of just markdown and folders. My folder-based prototype cost $800 in one night because it had no caching, no retry limits, no observability, and no state management.
The four critical requirements that code provides:
- Cost control - Caching and rate limiting prevent API bill explosions
- Determinism - State machines ensure predictable behavior
- Latency - Parallelization and async processing reduce wait times
- Context management - Token counting and trimming prevent crashes
The winning pattern is hybrid: markdown for human-editable prompts, code for orchestration. Start simple with folders during exploration, but invest in code architecture before production deployment.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Why production agents need code
- LangGraph Documentation
- Circuit Breaker Pattern
- Semantic Kernel
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!