Context Engineering for AI Agents: Beyond Prompt Engineering

Mar 25, 2026

Problem

My AI agent kept failing after running for a while. The error was cryptic:

Error: context_length_exceeded
Maximum context length is 200000 tokens
Your message contained 215000 tokens

I was confused. My agent was designed to run autonomously for hours. How did it hit the token limit after just 60 iterations?

When I dug into the logs, I found the problem:

Iteration 1: 6,500 tokens input
Iteration 5: 12,000 tokens input
Iteration 15: 28,000 tokens input
Iteration 30: 52,000 tokens input
Iteration 45: 78,000 tokens input
Iteration 60: CRASH - 215,000 tokens

Each iteration was adding to the message history. The agent was accumulating context until it crashed.

What I tried first

My initial approach was naive - just let the agent run and keep all history:

class NaiveAgent:
    def __init__(self, llm):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    async def run(self, user_input: str):
        self.messages.append({"role": "user", "content": user_input})

        response = await self.llm.generate(self.messages)
        self.messages.append({"role": "assistant", "content": response})

        return response

This works for short conversations. But after 30 iterations:

System prompt: ~5,000 tokens
Each iteration adds ~1,500 tokens (user + assistant + tool calls)
After 30 iterations: 5,000 + (30 × 1,500) = 50,000+ tokens just for input

And that’s before considering:

Tool definitions: ~2,000 tokens
Retrieved documents: varies wildly
Code snippets: can be massive
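To make the arithmetic concrete, here is a minimal sketch that projects input size from the rough figures above. The function names are mine, and it assumes strictly linear growth; retrieved documents and code snippets are left out because their size varies too much to model.

```python
def projected_input_tokens(
    iteration: int,
    system_tokens: int = 5_000,
    tokens_per_iteration: int = 1_500,
    tool_definition_tokens: int = 0,
) -> int:
    """Estimate input tokens at a given iteration, assuming linear growth."""
    return system_tokens + tool_definition_tokens + iteration * tokens_per_iteration


def iterations_until_limit(limit: int = 200_000, **kwargs) -> int:
    """Return the first iteration whose projected input exceeds the limit."""
    iteration = 0
    while projected_input_tokens(iteration, **kwargs) <= limit:
        iteration += 1
    return iteration


print(projected_input_tokens(30))  # 50000
print(projected_input_tokens(30, tool_definition_tokens=2_000))  # 52000
print(iterations_until_limit())  # 131
```

On these assumptions the message history alone would not reach 200,000 tokens until around iteration 131, which suggests that retrieved documents and code snippets account for much of the growth in my logs.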

The context window isn’t infinite storage - it’s more like RAM.

Why this matters more than I thought

Andrej Karpathy made a comparison that stuck with me: LLM is to CPU as context window is to RAM. When you understand this analogy, the problem becomes clear:

CPU (LLM) - Does the actual computation
RAM (Context Window) - Limited working memory

If you fill RAM, the program crashes.
If you fill context, the agent fails.

But there’s another problem I discovered: performance degrades before you hit the hard limit.

Research shows the “lost-in-the-middle” effect: models recall information at the start and end of a long context more reliably than information buried in the middle. As a rough illustration:

Context with 50,000 tokens:
- Information at START: ~80% recall
- Information in MIDDLE: ~50% recall
- Information at END: ~80% recall

So my agent wasn’t just crashing - it was getting dumber as context grew.

The four strategies

I found that LangChain identifies four context management strategies:

Write - Save information outside the context window (scratchpads, long-term memory) for later use
Select - Choose what to include
Compress - Reduce size via summarization
Isolate - Separate concerns into different contexts

Anthropic’s engineering blog from September 2025 put it bluntly: “context engineering replaces prompt engineering.” They defined it as “the art and science of carefully filling the context window for the next reasoning step.”

The four techniques below are how I applied these ideas in my own agent.

Strategy 1: Sliding Window

The simplest approach - keep only recent messages:

class SlidingWindowAgent:
    def __init__(self, llm, window_size: int = 20):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        self.window_size = window_size
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    def _apply_window(self):
        """Keep only recent messages within window"""
        if len(self.messages) > self.window_size + 1:  # +1 for system
            # Keep system prompt + last N messages
            self.messages = [self.messages[0]] + self.messages[-(self.window_size):]

    async def run(self, user_input: str):
        self.messages.append({"role": "user", "content": user_input})
        self._apply_window()

        response = await self.llm.generate(self.messages)
        self.messages.append({"role": "assistant", "content": response})

        return response

Testing this:

Without sliding window:
Iteration 30: 52,000 tokens -> CRASH at iteration 60

With sliding window (size=20):
Iteration 30: ~35,000 tokens
Iteration 60: ~35,000 tokens
Iteration 100: ~35,000 tokens -> STABLE

The sliding window kept context bounded. But I lost important context from early iterations.
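The trade-off is easy to reproduce in isolation. The sketch below is a standalone version of the windowing step (the function name and the sample messages are made up for illustration, and it assumes window_size is at least 1). It shows a fact stated early in the run silently disappearing once the window moves past it.

```python
def apply_window(messages: list[dict], window_size: int) -> list[dict]:
    """Return the system prompt plus the last `window_size` messages."""
    if len(messages) <= window_size + 1:
        return list(messages)
    return [messages[0]] + messages[-window_size:]


history = [{"role": "system", "content": "You are a helpful agent."}]
# An early instruction the agent will need much later
history.append({"role": "user", "content": "The deploy target is staging-2."})
for i in range(30):
    history.append({"role": "user", "content": f"step {i}"})
    history.append({"role": "assistant", "content": f"done {i}"})

trimmed = apply_window(history, window_size=20)
print(len(history), len(trimmed))  # 62 21
print(any("staging-2" in m["content"] for m in trimmed))  # False
```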

Strategy 2: Compression with Summarization

I needed a way to keep important information while reducing token count. The solution: summarize older messages.

class CompressingAgent:
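    def __init__(self, llm, max_tokens: int = 180000, compress_threshold: float = 0.92):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        # compress_history() treats messages[0] as the system prompt
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        self.max_tokens = max_tokens
        self.compress_threshold = compress_threshold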

    def count_tokens(self, messages: list) -> int:
        """Estimate token count"""
        total = 0
        for msg in messages:
            total += len(msg["content"]) // 4  # Rough estimate
        return total

    def should_compress(self) -> bool:
        """Check if we need to compress"""
        current = self.count_tokens(self.messages)
        return current / self.max_tokens >= self.compress_threshold

    async def compress_history(self):
        """Compress older messages into summary"""
        if len(self.messages) <= 10:
            return

        # Keep system prompt + last 10 messages
        system = self.messages[0]
        recent = self.messages[-10:]
        older = self.messages[1:-10]

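        # Summarize older messages as a readable transcript, not a Python repr
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
        summary_prompt = f"Summarize this conversation history in 500 tokens or less:\n{transcript}"
        summary = await self.llm.generate([
            {"role": "user", "content": summary_prompt}
        ])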

        # Rebuild messages with summary
        self.messages = [
            system,
            {"role": "user", "content": f"[History summary]: {summary}"}
        ] + recent

    async def run(self, user_input: str):
        self.messages.append({"role": "user", "content": user_input})

        if self.should_compress():
            await self.compress_history()

        response = await self.llm.generate(self.messages)
        self.messages.append({"role": "assistant", "content": response})

        return response
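To see the effect on message count without calling a real model, here is a minimal runnable sketch of the compression step on its own. FakeLLM and its canned summary are stand-ins I made up; a real client would return an actual summary of the older messages.

```python
import asyncio


class FakeLLM:
    """Hypothetical stand-in for a real client; returns a canned summary."""

    async def generate(self, messages: list[dict]) -> str:
        return "Steps 0-14 completed without errors."


async def compress_history(llm, messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Replace everything between the system prompt and the last
    `keep_recent` messages with a single summary message."""
    if len(messages) <= keep_recent + 1:
        return list(messages)

    system = messages[0]
    older = messages[1:-keep_recent]
    recent = messages[-keep_recent:]

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = await llm.generate([
        {"role": "user", "content": f"Summarize this conversation history:\n{transcript}"}
    ])
    return [system, {"role": "user", "content": f"[History summary]: {summary}"}] + recent


messages = [{"role": "system", "content": "You are a helpful agent."}]
for i in range(20):
    messages.append({"role": "user", "content": f"step {i}"})
    messages.append({"role": "assistant", "content": f"done {i}"})

compressed = asyncio.run(compress_history(FakeLLM(), messages))
print(len(messages), len(compressed))  # 41 12
```

The 30 older messages collapse into one summary message, while the system prompt and the 10 most recent messages pass through unchanged.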

The key insight: Claude Code reportedly triggers compression at around 92-95% context usage. I used 92% as the trigger in the class above; a two-tier variant warns at 92% and forces compression at 95%:

Token usage monitoring:
85% -> No action needed
92% -> Prepare compression
95% -> FORCE compression
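The tiers above map onto a small decision function. This is a sketch with names of my own choosing; the 92% and 95% defaults are the ones from the table, and max_tokens is whatever budget you set for your model.

```python
def compression_action(
    current_tokens: int,
    max_tokens: int,
    warn_at: float = 0.92,
    force_at: float = 0.95,
) -> str:
    """Return 'none', 'prepare', or 'force' based on context usage."""
    usage = current_tokens / max_tokens
    if usage >= force_at:
        return "force"
    if usage >= warn_at:
        return "prepare"
    return "none"


print(compression_action(153_000, 180_000))  # none
print(compression_action(166_000, 180_000))  # prepare
print(compression_action(172_000, 180_000))  # force
```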

Strategy 3: Just-In-Time Retrieval

The biggest mistake I made was pre-loading all relevant data. Instead of this:

# WRONG: Load everything upfront
class PreloadingAgent:
    async def run(self, query: str):
        # Pre-load all docs
        all_docs = await self.vector_store.get_all()  # 50,000 tokens!

        context = "\n".join([doc.content for doc in all_docs])
        prompt = f"Context:\n{context}\n\nQuestion: {query}"

        # generate() takes a list of messages, as in the other agents
        return await self.llm.generate([{"role": "user", "content": prompt}])

I switched to just-in-time retrieval:

# RIGHT: Retrieve only what's needed
class JITAgent:
    async def run(self, query: str):
        # Retrieve only relevant docs
        relevant_docs = await self.vector_store.search(query, k=5)  # ~2,000 tokens

        context = "\n".join([doc.content for doc in relevant_docs])
        prompt = f"Context:\n{context}\n\nQuestion: {query}"

        # generate() takes a list of messages, as in the other agents
        return await self.llm.generate([{"role": "user", "content": prompt}])

The difference in token usage:

Pre-load approach:
- All docs loaded: 50,000 tokens
- Query processing: 52,000 tokens total
- Token efficiency: POOR

JIT approach:
- Relevant docs only: 2,000 tokens
- Query processing: 4,000 tokens total
- Token efficiency: EXCELLENT (13x better)
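If you want to try the idea without a vector store, the sketch below uses a toy in-memory retriever that ranks documents by word overlap with the query. Doc, search, and the sample documents are all hypothetical; a real setup would use embedding similarity instead of word overlap.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    content: str


def search(docs: list[Doc], query: str, k: int = 5) -> list[Doc]:
    """Rank documents by how many query words they contain and
    return the top k that match at least one word."""
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        overlap = len(query_words & set(doc.content.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


docs = [
    Doc("auth", "how to rotate api keys for the auth service"),
    Doc("billing", "invoice generation and billing cycles"),
    Doc("deploy", "deploy the auth service to staging"),
]

relevant = search(docs, "rotate auth keys", k=2)
context = "\n".join(doc.content for doc in relevant)
print([doc.title for doc in relevant])  # ['auth', 'deploy']
```

Only the matching documents reach the prompt; the billing document never costs a token.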

Strategy 4: Structured State Extraction

The final piece: extract structured state from conversation history instead of keeping the raw messages.

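from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field

class AgentState(BaseModel):
    """Structured state extracted from conversation"""
    goal: Optional[str] = None
    current_step: int = 0
    completed_actions: list[str] = []
    pending_actions: list[str] = []
    errors_encountered: list[str] = []
    # Defaulted so that LLM output, which omits this field, still validates
    last_updated: datetime = Field(default_factory=datetime.now)

class StatefulAgent:
    def __init__(self, llm):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        self.state = AgentState()
        self.recent_messages = []  # Only last few for context

    async def extract_state(self, messages: list) -> AgentState:
        """Update the structured state from the latest messages"""
        # Include the previous state; otherwise each turn would rebuild the
        # state from a single exchange and forget the goal and past actions.
        extraction_prompt = """Update the agent state using the latest messages.
        Return only JSON with: goal, current_step, completed_actions, pending_actions, errors_encountered.

        Previous state:
        {previous}

        Latest messages:
        {messages}
        """

        response = await self.llm.generate([
            {"role": "user", "content": extraction_prompt.format(
                previous=self.state.model_dump_json(exclude={"last_updated"}),
                messages=messages,
            )}
        ])

        # Pydantic v2 API; on v1 use AgentState.parse_raw(response)
        return AgentState.model_validate_json(response)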

    async def run(self, user_input: str):
        # Process with current state
        context = f"""Current state:
- Goal: {self.state.goal}
- Step: {self.state.current_step}
- Completed: {self.state.completed_actions}
- Pending: {self.state.pending_actions}
- Errors: {self.state.errors_encountered}

User input: {user_input}
"""

        response = await self.llm.generate([
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": context}
        ])

        # Update state
        self.state = await self.extract_state([user_input, response])

        return response

Now instead of keeping 50,000 tokens of conversation history, I keep ~500 tokens of structured state:

Raw conversation history: 50,000 tokens
Structured state: ~500 tokens

Reduction: 100x
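The reduction only holds up if the state survives from one turn to the next, so each update has to be merged into the previous state rather than replace it. Here is a minimal sketch of that merge using only the standard library; State and merge_update are simplified stand-ins I made up, not the Pydantic model above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class State:
    goal: Optional[str] = None
    current_step: int = 0
    completed_actions: list = field(default_factory=list)
    pending_actions: list = field(default_factory=list)


def merge_update(state: State, update_json: str) -> State:
    """Apply a JSON update (as a model might return it) on top of the
    existing state, ignoring unknown keys and null values."""
    update = json.loads(update_json)
    merged = asdict(state)
    for key, value in update.items():
        if key in merged and value is not None:
            merged[key] = value
    return State(**merged)


state = State(goal="migrate database", current_step=3, completed_actions=["backup"])
update = '{"goal": null, "current_step": 4, "completed_actions": ["backup", "schema dump"]}'
state = merge_update(state, update)

print(state.goal, state.current_step)  # migrate database 4
print(state.completed_actions)  # ['backup', 'schema dump']
```

The goal is kept even though the update returned null for it, which is the behavior you want when the latest exchange does not restate the goal.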

The complete context manager

Putting it all together:

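class ContextManager:
    def __init__(
        self,
        llm,
        max_tokens: int = 180000,
        compress_threshold: float = 0.92,
        window_size: int = 20
    ):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        self.max_tokens = max_tokens
        self.compress_threshold = compress_threshold
        self.window_size = window_size
        self.state = AgentState(last_updated=datetime.now())

    async def summarize(self, messages: list) -> str:
        """Ask the model for a short summary of the given messages"""
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return await self.llm.generate([
            {"role": "user", "content": f"Summarize this conversation history in 500 tokens or less:\n{transcript}"}
        ])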

    def count_tokens(self, messages: list) -> int:
        return sum(len(msg.get("content", "")) // 4 for msg in messages)

    def should_compress(self, messages: list) -> bool:
        return self.count_tokens(messages) / self.max_tokens >= self.compress_threshold

    def apply_sliding_window(self, messages: list) -> list:
        if len(messages) > self.window_size + 1:
            return [messages[0]] + messages[-(self.window_size):]
        return messages

    async def compress_old_messages(self, messages: list) -> list:
        if len(messages) <= 10:
            return messages

        system = messages[0]
        recent = messages[-10:]
        older = messages[1:-10]

        summary = await self.summarize(older)

        return [
            system,
            {"role": "user", "content": f"[History]: {summary}"}
        ] + recent

    async def build_context(self, messages: list) -> list:
        # 1. Apply sliding window
        messages = self.apply_sliding_window(messages)

        # 2. Check if compression needed
        if self.should_compress(messages):
            messages = await self.compress_old_messages(messages)

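        # 3. Inject structured state after the system prompt.
        # Build a new list: insert() would mutate the caller's history
        # and add another state message on every call.
        state_message = {"role": "system", "content": self.format_state()}
        return messages[:1] + [state_message] + messages[1:]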

    def format_state(self) -> str:
        return f"""Current agent state:
Goal: {self.state.goal or 'Not set'}
Progress: Step {self.state.current_step}
Completed: {len(self.state.completed_actions)} actions
Pending: {len(self.state.pending_actions)} actions
Errors: {len(self.state.errors_encountered)} errors
"""

The performance metrics that matter

The Manus team made a claim that seemed extreme at first: “KV-cache hit rate is the single most important metric for agent performance.”

After implementing context engineering, I understood why:

Before context engineering:
- KV-cache hit rate: ~20%
- Tokens per iteration: Growing linearly
- Cost per iteration: Growing linearly
- Agent fails at iteration ~60

After context engineering:
- KV-cache hit rate: ~85%
- Tokens per iteration: Bounded at ~35K
- Cost per iteration: Stable
- Agent runs indefinitely

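The KV-cache hit rate measures how much of the input the model has already processed in an earlier request, so that its cached attention keys and values can be reused. Reuse generally depends on the start of the prompt being unchanged from the earlier request. Higher hit rate = lower cost and latency.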
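A simplified way to reason about this is to compare consecutive requests and measure how long their shared prefix is. The sketch below treats each message as one cacheable unit, which is my simplification; real providers cache at token or block granularity, and the function names are my own.

```python
def shared_prefix_length(previous: list[str], current: list[str]) -> int:
    """Count leading items that are identical in both requests."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count


def cache_hit_rate(previous: list[str], current: list[str]) -> float:
    """Fraction of the current request covered by the shared prefix."""
    if not current:
        return 0.0
    return shared_prefix_length(previous, current) / len(current)


# Append-only history: everything except the new message is reusable
print(cache_hit_rate(["sys", "u1", "a1"], ["sys", "u1", "a1", "u2"]))  # 0.75

# Changing the first message invalidates everything after it
print(cache_hit_rate(["sys v1", "u1", "a1"], ["sys v2", "u1", "a1", "u2"]))  # 0.0
```

Under this model, anything that rewrites the beginning of the context (a timestamp in the system prompt, for example) drops the hit rate to zero for that request.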

Common mistakes I made

Mistake 1: Treating context as unlimited

# WRONG
messages.append(user_message)  # Just keep adding...
messages.append(assistant_message)
# No bounds check, no compression

# RIGHT
messages.append(user_message)
messages.append(assistant_message)
if context_manager.should_compress(messages):
    messages = await context_manager.compress_old_messages(messages)

Mistake 2: Not implementing compression triggers

I waited until the agent crashed to think about compression. The fix is proactive monitoring:

# Monitor after each iteration
current_tokens = count_tokens(messages)
usage_ratio = current_tokens / MAX_TOKENS

if usage_ratio > 0.95:
    # Emergency compression
    messages = await compress(messages)
elif usage_ratio > 0.92:
    # Prepare for compression
    logger.warning(f"Context at {usage_ratio:.0%}, preparing compression")

Mistake 3: Loading all data upfront

# WRONG: Load everything at start
all_relevant_docs = await search_all_docs(topic)  # 50K tokens!
context = format_docs(all_relevant_docs)

# RIGHT: Load as needed
async def get_relevant_context(query: str) -> str:
    docs = await vector_store.search(query, k=5)  # 2K tokens
    return format_docs(docs)

Summary

In this post, I explained why context engineering has replaced prompt engineering as the critical skill for AI agents. The key point is treating the context window as precious RAM, not infinite storage.

The four strategies I implemented:

Sliding window - Keep only recent messages
Compression - Summarize old context at 92% usage
Just-in-time retrieval - Load only what’s needed, when needed
Structured state - Extract state from history instead of keeping raw messages

Without context engineering:

Costs explode (each iteration processes all history)
Performance degrades (lost-in-the-middle effect)
Tasks fail when context exceeds limits

With context engineering:

Stable token usage across iterations
Predictable costs
Agents that can run indefinitely

The shift from “prompt engineering” to “context engineering” reflects a deeper truth: as agents become autonomous, managing their working memory becomes more important than crafting their instructions.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Anthropic: Context Engineering Replaces Prompt Engineering
👨‍💻 LangChain: Context Management Strategies
👨‍💻 Lost in the Middle: How Language Models Use Long Contexts
👨‍💻 Andrej Karpathy: LLM as Operating System

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!