Why Do Production AI Agents Need Code Instead of Just Markdown and Folders?

Mar 21, 2026

Problem

I deployed an AI agent to production last month. It worked perfectly in my dev environment—a simple workflow that read markdown files from a folder, processed them with an LLM, and wrote results to another folder.

The first night in production, I woke up to an $800 API bill. The agent had gotten stuck in a loop, repeatedly calling the LLM on the same inputs without any caching or retry limits. No logs, no traces, no way to debug what happened.

Here’s what my “simple” folder-based agent looked like:

/prompts/
  analyze.md
  summarize.md
/input/
  doc1.txt
  doc2.txt
/output/
  result1.txt
  result2.txt

And the shell script orchestrating it:

#!/bin/bash
for file in input/*.txt; do
  prompt=$(cat prompts/analyze.md)
  content=$(cat "$file")
  result=$(llm_call "$prompt" "$content")  # No caching, no retry cap, no limits
  echo "$result" > "output/result_$(basename "$file")"  # Quote "$file" so paths with spaces survive
done

This works for prototypes. In production, it’s a disaster waiting to happen.

What Happened?

My agent ran fine for the first 100 requests. Then:

Cost explosion: A network hiccup caused the LLM call to fail silently. The script retried immediately (no backoff), and the LLM started returning partial results. The agent kept calling the API on the same input, over and over.
No observability: I had no idea which files were processed, how many API calls were made, or where the money was going.
Non-deterministic behavior: Running the same input twice gave different results because I had no state management or controlled randomness.
Latency spikes: Processing 1000 files sequentially took 4 hours because there was no parallelization.

When I posted about this on Reddit, the response was blunt:

“For production level software you need code. For four reasons: 1. Cost. 2. Determinism. 3. Latency. 4. Context & memory management.” - mohdgame

Another comment hit harder:

“routing logic is where it falls apart. you still need code for conditionals, error handling, retries, rate limiting” - Dependent_Slide4675

They were right.

The Gap Between Prototyping and Production

Markdown and folder-based workflows excel during prototyping:

+ Quick iteration cycles
+ Easy to understand structure
+ Human-readable and editable
+ Version control friendly

But production introduces challenges that folders cannot solve:

+ Cost control      -> Every LLM call costs money
+ Determinism       -> Same input, predictable output
+ Latency control   -> Users won't wait 30 seconds
+ Memory management -> Context windows are limited

Let me show you how I rebuilt the agent with code.

Solution: Code-Based Architecture

1. Cost Control Through Caching and Retries

The first thing I added was proper cost control:

import hashlib
import time

from tenacity import retry, stop_after_attempt, wait_exponential

class CostControlledLLM:
    def __init__(self, llm_client, cache_ttl: int = 3600):
        self.client = llm_client
        self.cache = {}  # key -> (stored_at, response). In production, use Redis
        self.cache_ttl = cache_ttl  # Seconds before a cached response expires
        self.call_count = 0
        self.total_cost = 0.0

    def _cache_key(self, prompt: str, context: str) -> str:
        return hashlib.sha256(f"{prompt}:{context}".encode()).hexdigest()

    # At most 3 attempts, with exponential backoff between 4 and 10 seconds
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10)
    )
    async def call(self, prompt: str, context: str = "") -> str:
        # Check cache first (this saves money); expired entries are ignored
        key = self._cache_key(prompt, context)
        cached = self.cache.get(key)
        if cached is not None and time.time() - cached[0] < self.cache_ttl:
            print("Cache hit! Saved an API call.")
            return cached[1]

        # Only call LLM if not cached
        response = await self.client.generate(prompt, context)

        # Track costs (the context is billed as input too, so count it)
        self.call_count += 1
        self.total_cost += self._estimate_cost(prompt + context, response)

        # Cache the result together with the time it was stored
        self.cache[key] = (time.time(), response)
        return response

    def _estimate_cost(self, prompt: str, response: str) -> float:
        # Rough estimate: $0.002 per 1K tokens, about 4 characters per token
        tokens = (len(prompt) + len(response)) / 4
        return tokens * 0.002 / 1000

With this wrapper, identical requests hit the cache instead of the API. After implementing this, my costs dropped by 60% because the agent stopped making redundant calls.
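
Caching removes repeat calls, but it does not bound total spend when a loop keeps producing fresh prompts. One option is a hard spending cap that is checked before every call. The sketch below is only an illustration and not part of the original agent: `BudgetGuard`, `BudgetExceeded`, and `max_usd` are hypothetical names, and the price reuses the rough $0.002 per 1K tokens estimate from the wrapper above.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push estimated spend past the cap."""


class BudgetGuard:
    """Tracks estimated spend and refuses calls beyond a hard cap (hypothetical helper)."""

    def __init__(self, max_usd: float, usd_per_1k_tokens: float = 0.002):
        self.max_usd = max_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.spent_usd = 0.0

    def estimate(self, text: str) -> float:
        # Same rough heuristic as _estimate_cost: about 4 characters per token.
        tokens = len(text) / 4
        return tokens * self.usd_per_1k_tokens / 1000

    def charge(self, text: str) -> float:
        """Reserve budget for `text`; raise instead of overspending."""
        cost = self.estimate(text)
        if self.spent_usd + cost > self.max_usd:
            raise BudgetExceeded(
                f"cap ${self.max_usd:.2f} reached (spent ${self.spent_usd:.4f})"
            )
        self.spent_usd += cost
        return cost


if __name__ == "__main__":
    guard = BudgetGuard(max_usd=0.005)
    prompt = "x" * 4000  # roughly 1,000 tokens under the 4-chars-per-token heuristic
    calls_made = 0
    try:
        while True:  # simulates the runaway loop from the incident
            guard.charge(prompt)
            calls_made += 1
    except BudgetExceeded as exc:
        print(f"Stopped after {calls_made} calls: {exc}")
```

Calling `guard.charge(prompt + context)` just before the API request turns a runaway loop into an exception that stops the batch, instead of a surprise on the invoice.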

2. Determinism Through Structured State

Folder-based agents have no concept of state. Code lets you define exactly how your agent transitions between states:

from typing import TypedDict, Literal
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    input: str
    route: Literal["analyze", "summarize", "clarify"]
    confidence: float
    result: str | None
    error: str | None

def analyze_input(state: AgentState) -> AgentState:
    """Determine what kind of processing is needed."""
    # This is where you'd call your LLM to classify the input
    if "?" in state["input"]:
        return {**state, "route": "clarify", "confidence": 0.9}
    elif len(state["input"]) > 1000:
        return {**state, "route": "summarize", "confidence": 0.8}
    else:
        return {**state, "route": "analyze", "confidence": 0.95}

def route_request(state: AgentState) -> str:
    """Deterministic routing based on state."""
    if state["confidence"] < 0.7:
        return "clarify"
    return state["route"]

def execute_task(state: AgentState) -> AgentState:
    """Execute the appropriate task."""
    # Your actual processing logic here
    return {**state, "result": f"Processed: {state['input'][:50]}"}

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("analyze", analyze_input)
workflow.add_node("clarify", lambda s: {**s, "result": "Please clarify your request"})
workflow.add_node("summarize", execute_task)
workflow.add_node("analyze_task", execute_task)

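# Start at the classifier node; the graph will not compile without an entry point
workflow.set_entry_point("analyze")

# Deterministic edges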
workflow.add_conditional_edges("analyze", route_request, {
    "clarify": "clarify",
    "summarize": "summarize",
    "analyze": "analyze_task"
})

workflow.add_edge("clarify", END)
workflow.add_edge("summarize", END)
workflow.add_edge("analyze_task", END)

app = workflow.compile()

Now the same input always produces the same routing decision. I can trace exactly which path the agent took and why.
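
To make the "which path and why" part concrete without depending on LangGraph, here is a small standard-library sketch of the same routing rules that also records a trace of each decision. `RouteTrace` and `route_with_trace` are hypothetical names introduced for illustration; the thresholds mirror the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RouteTrace:
    """The chosen route plus a human-readable record of how it was chosen."""
    route: str
    steps: List[str] = field(default_factory=list)


def route_with_trace(text: str, confidence_floor: float = 0.7) -> RouteTrace:
    """Pure function: no clock, randomness, or I/O, so equal inputs give equal outputs."""
    steps: List[str] = []
    if "?" in text:
        route, confidence = "clarify", 0.9
        steps.append("rule: input contains '?' -> clarify (confidence 0.9)")
    elif len(text) > 1000:
        route, confidence = "summarize", 0.8
        steps.append("rule: input longer than 1000 chars -> summarize (confidence 0.8)")
    else:
        route, confidence = "analyze", 0.95
        steps.append("rule: default -> analyze (confidence 0.95)")

    # Low-confidence decisions fall back to asking the user, as in route_request.
    if confidence < confidence_floor:
        steps.append(f"confidence {confidence} below floor {confidence_floor} -> clarify")
        route = "clarify"
    return RouteTrace(route=route, steps=steps)


if __name__ == "__main__":
    trace = route_with_trace("Summarize the attached report")
    print(trace.route)
    for step in trace.steps:
        print("  ", step)
```

Because the function is pure, a regression test can assert on the route and the trace directly, which is much harder to do with a shell loop.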

3. Latency Control Through Parallelization

Folder-based agents process files sequentially. Code lets you run tasks in parallel:

import asyncio
import time
from typing import List

class ParallelProcessor:
    def __init__(self, llm: CostControlledLLM, max_concurrent: int = 10):
        self.llm = llm  # Shared client, so every task benefits from the same cache
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.rate_limiter = RateLimiter(calls_per_minute=60)

    async def process_file(self, file_path: str, llm: CostControlledLLM) -> dict:
        async with self.semaphore:
            await self.rate_limiter.acquire()
            # Use a context manager so the file handle is always closed
            with open(file_path, encoding="utf-8") as f:
                content = f.read()
            result = await llm.call("Analyze this:", content)
            return {"file": file_path, "result": result}

    async def process_batch(self, files: List[str]) -> List[dict]:
        start = time.time()
        tasks = [self.process_file(f, self.llm) for f in files]
        # return_exceptions=True: failed files come back as exception objects
        results = await asyncio.gather(*tasks, return_exceptions=True)
        elapsed = time.time() - start

        print(f"Processed {len(files)} files in {elapsed:.2f}s")
        if files:  # Avoid dividing by zero on an empty batch
            print(f"Average: {elapsed/len(files):.2f}s per file")
        return results

class RateLimiter:
    def __init__(self, calls_per_minute: int):
        self.interval = 60.0 / calls_per_minute
        self.last_call = 0.0
        # Without a lock, concurrent tasks read the same last_call and fire together
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            now = time.time()
            wait_time = self.interval - (now - self.last_call)
            if wait_time > 0:
                await asyncio.sleep(wait_time)
            self.last_call = time.time()

With parallelization, processing 1000 files went from 4 hours to 15 minutes. The semaphore prevents overwhelming the API, and the rate limiter keeps me under quota.
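
One detail worth spelling out: `asyncio.gather(..., return_exceptions=True)` returns exception objects in the same list as successful results, so the caller has to separate them. The sketch below shows one way to do that while keeping the concurrency bound. `bounded_gather`, `fake_call`, and `stats` are hypothetical names, and `fake_call` only simulates an LLM request with a short sleep.

```python
import asyncio

# Shared counters so the demo can show the concurrency bound holding.
stats = {"in_flight": 0, "peak": 0}


async def fake_call(i: int) -> int:
    """Simulated LLM request: sleeps briefly, fails for every fifth item."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.01)
        if i % 5 == 0:
            raise ValueError(f"simulated failure for item {i}")
        return i
    finally:
        stats["in_flight"] -= 1


async def bounded_gather(coros, max_concurrent: int = 10):
    """Run coroutines with at most max_concurrent in flight; split successes from failures."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(coro):
        async with semaphore:
            return await coro

    results = await asyncio.gather(*(run(c) for c in coros), return_exceptions=True)
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed


if __name__ == "__main__":
    ok, failed = asyncio.run(
        bounded_gather([fake_call(i) for i in range(10)], max_concurrent=3)
    )
    print(f"{len(ok)} succeeded, {len(failed)} failed, peak concurrency {stats['peak']}")
```

Keeping the failures in their own list makes it easy to log them or retry just those files rather than the whole batch.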

4. Context Management

LLMs have token limits. Folders don’t help you manage context. Code does:

from typing import List
import tiktoken

class ContextManager:
    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4"):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

    def trim_context(self, messages: List[str], reserve_for_response: int = 500) -> List[str]:
        """Trim context to fit within token limit."""
        available = self.max_tokens - reserve_for_response

        total = sum(self.count_tokens(m) for m in messages)
        if total <= available:
            return messages

        # Keep most recent messages that fit
        trimmed = []
        current_tokens = 0

        for message in reversed(messages):
            tokens = self.count_tokens(message)
            if current_tokens + tokens > available:
                break
            trimmed.insert(0, message)
            current_tokens += tokens

        print(f"Trimmed context from {total} to {current_tokens} tokens")
        return trimmed

    def summarize_if_needed(self, messages: List[str]) -> str:
        """Summarize long conversations to preserve context."""
        total = sum(self.count_tokens(m) for m in messages)

        if total > self.max_tokens * 2:
            # In production, call LLM to summarize
            return "Previous conversation summarized: " + messages[0][:100] + "..."

        return "\n".join(messages)

This prevents the dreaded “context length exceeded” error that crashes folder-based agents.
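
To see the trimming rule on small numbers, here is a dependency-free sketch of the same "keep the newest messages that fit" logic. It counts whitespace-separated words instead of real tokens, which is an assumption made purely so the example runs without `tiktoken`; `count_words` and `trim_to_budget` are hypothetical names.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    # Stand-in tokenizer for illustration only; real token counts come from tiktoken.
    return len(text.split())


def trim_to_budget(
    messages: List[str],
    max_tokens: int,
    reserve_for_response: int,
    count: Callable[[str], int] = count_words,
) -> List[str]:
    """Keep the most recent messages whose combined size fits the budget."""
    available = max_tokens - reserve_for_response
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        tokens = count(message)
        if used + tokens > available:
            break  # this message and everything older is dropped
        kept.append(message)
        used += tokens
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = [
        "one two three",                  # 3 words, oldest
        "four five six seven",            # 4 words
        "eight nine ten eleven twelve",   # 5 words, newest
    ]
    # Budget: 12 total, 2 reserved for the response, so 10 are available.
    print(trim_to_budget(history, max_tokens=12, reserve_for_response=2))
```

One edge case to be aware of: if the newest message alone exceeds the available budget, this rule returns an empty list, so a single oversized input still needs to be split or summarized before it is sent.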

The Hybrid Approach: Best of Both Worlds

After all this, I realized markdown and folders do have a place. They’re great for storing prompts that humans need to edit. The key is to use markdown for content and code for orchestration.

/prompts/
  analyze.md       <- Human-editable prompts
  summarize.md     <- Version controlled
/src/
  agent.py         <- Orchestration logic
  state.py         <- State management
  cache.py         <- Cost control
/tests/
  test_agent.py    <- Unit tests

from pathlib import Path
from jinja2 import Template

# OpenAIClient and StateManager are stand-ins for your own client wrapper and
# state store; they are not defined in this post.

class HybridAgent:
    def __init__(self, prompts_dir: Path = Path("prompts")):
        self.prompts_dir = prompts_dir
        self.llm = CostControlledLLM(OpenAIClient())
        self.state_manager = StateManager()
        self._load_prompts()

    def _load_prompts(self):
        """Load markdown prompts, manage with code."""
        self.analyze_prompt = Template(
            (self.prompts_dir / "analyze.md").read_text()
        )
        self.summarize_prompt = Template(
            (self.prompts_dir / "summarize.md").read_text()
        )

    async def _determine_route(self, user_input: str) -> str:
        # Placeholder heuristic so the class is complete; in practice this is
        # where the state graph from section 2 would make the decision
        return "complex" if len(user_input) > 1000 else "simple"

    async def execute(self, user_input: str) -> str:
        # Code determines the route
        route = await self._determine_route(user_input)

        # Markdown provides the prompt
        if route == "complex":
            prompt = self.analyze_prompt.render(input=user_input)
        else:
            prompt = self.summarize_prompt.render(input=user_input)

        # Code handles the execution
        return await self.llm.call(prompt)

This way, non-developers can edit prompts in markdown files, while the code handles all the production concerns.

The Numbers

After rewriting the agent with proper code architecture:

Metric              | Folder-based | Code-based
--------------------|--------------|------------
API cost/day        | $200+        | $40
Avg latency         | 12s          | 2s
Error rate          | 15%          | 0.5%
Debugging time      | Hours        | Minutes
Test coverage       | 0%           | 85%

The upfront investment in code paid for itself in the first week.

When to Make the Switch

Not every agent needs code from day one. Here’s my decision framework:

Use Folders When:
- Prototype/exploration phase
- Less than 10 requests/day
- Cost is not a concern
- Single user, no SLA

Use Code When:
- Production deployment
- More than 100 requests/day
- Cost control matters
- Multiple users, SLA required
- Need debugging and monitoring

Common Mistakes to Avoid

Mistake 1: Over-engineering early

Don’t start with a full code architecture for a prototype. Start with folders and add code when you hit production requirements.

Mistake 2: All-or-nothing thinking

You don’t have to choose. Use markdown for prompts, code for orchestration. Both have their place.

Mistake 3: Ignoring hidden costs

“Simple” folder workflows hide complexity in shell scripts. A 50-line shell script is often harder to debug than 200 lines of Python with proper error handling.
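
As one illustration of what "proper error handling" can buy you when debugging, here is a minimal sketch of a per-call log that writes one JSON line for every processed file, whether it succeeded or failed. `CallLog` and `track` are hypothetical names, and the fields are only a suggestion; the sketch writes to any text stream, such as an open log file.

```python
import io
import json
import time
from contextlib import contextmanager


class CallLog:
    """Appends one JSON object per LLM call to a text stream (hypothetical helper)."""

    def __init__(self, stream):
        self.stream = stream

    @contextmanager
    def track(self, file_path: str):
        start = time.perf_counter()
        entry = {"file": file_path, "status": "ok", "error": None}
        try:
            yield entry  # the caller may add fields such as token counts
        except Exception as exc:
            entry["status"] = "error"
            entry["error"] = repr(exc)
            raise  # still propagate so retries and callers see the failure
        finally:
            entry["duration_s"] = round(time.perf_counter() - start, 4)
            self.stream.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for an open log file
    log = CallLog(buffer)

    with log.track("input/doc1.txt") as entry:
        entry["tokens"] = 120  # placeholder value for the demo

    try:
        with log.track("input/doc2.txt"):
            raise TimeoutError("simulated network hiccup")
    except TimeoutError:
        pass

    print(buffer.getvalue(), end="")
```

A file of such lines answers the questions the folder-based version could not: which inputs were processed, how often, how long each took, and which ones failed.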

Summary

In this post, I showed why production AI agents need code instead of just markdown and folders. My folder-based prototype cost $800 in one night because it had no caching, no retry limits, no observability, and no state management.

The four critical requirements that code provides:

Cost control - Caching and rate limiting prevent API bill explosions
Determinism - State machines ensure predictable behavior
Latency - Parallelization and async processing reduce wait times
Context management - Token counting and trimming prevent crashes

The winning pattern is hybrid: markdown for human-editable prompts, code for orchestration. Start simple with folders during exploration, but invest in code architecture before production deployment.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can do so by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: Why production agents need code
👨‍💻 LangGraph Documentation
👨‍💻 Circuit Breaker Pattern
👨‍💻 Semantic Kernel

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!