
Context Engineering for AI Agents: Beyond Prompt Engineering

Problem

My AI agent kept failing after running for a while. The error was cryptic:

Error: context_length_exceeded
Maximum context length is 200000 tokens
Your message contained 215000 tokens

I was confused. My agent was designed to run autonomously for hours. How did it hit the token limit after just 60 iterations?

When I dug into the logs, I found the problem:

Iteration 1: 6,500 tokens input
Iteration 5: 12,000 tokens input
Iteration 15: 28,000 tokens input
Iteration 30: 52,000 tokens input
Iteration 45: 78,000 tokens input
Iteration 60: CRASH - 215,000 tokens

Each iteration was adding to the message history. The agent was accumulating context until it crashed.

What I tried first

My initial approach was naive - just let the agent run and keep all history:

naive-agent.py
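class NaiveAgent:
    def __init__(self, llm):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        # SYSTEM_PROMPT is defined elsewhere (~5,000 tokens in my case)
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    async def run(self, user_input: str) -> str:
        # Every call appends two messages and never removes any
        self.messages.append({"role": "user", "content": user_input})
        response = await self.llm.generate(self.messages)
        self.messages.append({"role": "assistant", "content": response})
        return response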

This works for short conversations. But after 30 iterations:

System prompt: ~5,000 tokens
Each iteration adds ~1,500 tokens (user + assistant + tool calls)
After 30 iterations: 5,000 + (30 × 1,500) = 50,000+ tokens just for input

And that’s before considering:

  • Tool definitions: ~2,000 tokens
  • Retrieved documents: varies wildly
  • Code snippets: can be massive

The context window isn’t infinite storage - it’s more like RAM.
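The arithmetic above is easy to turn into a quick projection. This is a minimal sketch that assumes the rough figures from this post (a ~5,000-token system prompt and ~1,500 tokens added per iteration). The function names are made up for illustration, and real growth is rarely this uniform because tool outputs and retrieved documents vary in size.

```python
def projected_input_tokens(iteration: int, system_tokens: int = 5_000,
                           tokens_per_iteration: int = 1_500) -> int:
    """Input size at a given iteration if nothing is ever removed."""
    return system_tokens + iteration * tokens_per_iteration


def first_iteration_over_limit(limit: int, system_tokens: int = 5_000,
                               tokens_per_iteration: int = 1_500) -> int:
    """Smallest iteration at which the projected input exceeds `limit`."""
    if limit < system_tokens:
        return 0
    return (limit - system_tokens) // tokens_per_iteration + 1


print(projected_input_tokens(30))            # 50000
print(first_iteration_over_limit(200_000))   # 131
```

With perfectly uniform growth the 200,000-token limit would only be reached around iteration 131. A crash near iteration 60 therefore suggests that some iterations added far more than 1,500 tokens, which is what large tool outputs or retrieved documents tend to do.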

Why this matters more than I thought

Andrej Karpathy made a comparison that stuck with me: LLM is to CPU as context window is to RAM. When you understand this analogy, the problem becomes clear:

CPU (LLM) - Does the actual computation
RAM (Context Window) - Limited working memory
If you fill RAM, the program crashes.
If you fill context, the agent fails.

But there’s another problem I discovered: performance degrades before you hit the hard limit.

Research on long-context models describes a “lost-in-the-middle” effect: information buried in the middle of a long context is recalled less reliably than information near the start or the end. Roughly:

Context with 50,000 tokens:
- Information at START: ~80% recall
- Information in MIDDLE: ~50% recall
- Information at END: ~80% recall

So my agent wasn’t just crashing - it was getting dumber as context grew.
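One common way to work with this effect, rather than against it, is to put the most important items at the edges of the context. The sketch below is only an illustration of that idea and not something measured in this post: `reorder_for_edges` is a hypothetical helper, and it assumes the input is already sorted from most to least relevant.

```python
def reorder_for_edges(items_by_relevance: list[str]) -> list[str]:
    """Place the most relevant items at the start and end, least relevant in the middle.

    `items_by_relevance` must be sorted most relevant first.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(items_by_relevance):
        # Alternate: even positions go to the front half, odd positions to the back half
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item ends up last
    return front + back[::-1]


docs = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(docs))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```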

The four strategies

I found that LangChain identifies four context management strategies:

  1. Write - Save information outside the context window (scratchpads, memory) for later use
  2. Select - Choose what to include
  3. Compress - Reduce size via summarization
  4. Isolate - Separate concerns into different contexts

Anthropic’s engineering blog from September 2025 framed context engineering as the natural progression of prompt engineering. Andrej Karpathy described it as “the delicate art and science of filling the context window with just the right information for the next step.”

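My own implementation ended up as four concrete techniques that draw on these ideas: a sliding window, summarization, just-in-time retrieval, and structured state extraction. Let me show each one.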

Strategy 1: Sliding Window

The simplest approach - keep only recent messages:

sliding-window.py
class SlidingWindowAgent:
    def __init__(self, llm, window_size: int = 20):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        self.window_size = window_size
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    def _apply_window(self):
        """Keep the system prompt plus the most recent messages."""
        if len(self.messages) > self.window_size + 1:  # +1 for system
            recent = self.messages[-self.window_size:]
            # Don't let the window start on an orphaned assistant reply
            while recent and recent[0]["role"] != "user":
                recent = recent[1:]
            self.messages = [self.messages[0]] + recent

    async def run(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})
        self._apply_window()
        response = await self.llm.generate(self.messages)
        self.messages.append({"role": "assistant", "content": response})
        return response

Testing this:

Without sliding window:
Iteration 30: 52,000 tokens -> CRASH at iteration 60
With sliding window (size=20):
Iteration 30: ~35,000 tokens
Iteration 60: ~35,000 tokens
Iteration 100: ~35,000 tokens -> STABLE

The sliding window kept context bounded. But I lost important context from early iterations.
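A message-count window bounds the number of messages but not their size, so one large tool output can still blow the budget. A variation is to trim by estimated tokens instead. This is a sketch rather than a tested production implementation: `estimate_tokens` and `trim_to_token_budget` are hypothetical helpers, and the 4-characters-per-token rule is only a rough estimate.

```python
def estimate_tokens(message: dict) -> int:
    """Rough estimate: about 4 characters per token."""
    return len(message["content"]) // 4


def trim_to_token_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit in `budget` tokens."""
    system, history = messages[0], messages[1:]
    remaining = budget - estimate_tokens(system)
    kept: list[dict] = []
    # Walk backwards from the newest message and stop at the first one that doesn't fit
    for message in reversed(history):
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()
    return [system] + kept


history = [{"role": "system", "content": "s" * 400}]             # ~100 tokens
for i in range(10):
    history.append({"role": "user", "content": "u" * 200})       # ~50 tokens
    history.append({"role": "assistant", "content": "a" * 200})  # ~50 tokens

trimmed = trim_to_token_budget(history, budget=300)
print(len(trimmed))  # 5: the system prompt plus the 4 newest messages
```

Like the message-count window, this still discards early context entirely. It only makes the bound predictable in tokens.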

Strategy 2: Compression with Summarization

I needed a way to keep important information while reducing token count. The solution: summarize older messages.

compression.py
class CompressingAgent:
    def __init__(self, llm, max_tokens: int = 180000, compress_threshold: float = 0.92):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        # compress_history() assumes messages[0] is the system prompt
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        self.max_tokens = max_tokens
        self.compress_threshold = compress_threshold

    def count_tokens(self, messages: list) -> int:
        """Estimate token count (~4 characters per token for English text)."""
        return sum(len(msg["content"]) // 4 for msg in messages)

    def should_compress(self) -> bool:
        """Check if we need to compress"""
        current = self.count_tokens(self.messages)
        return current / self.max_tokens >= self.compress_threshold

    async def compress_history(self):
        """Compress older messages into a summary"""
        if len(self.messages) <= 11:  # system + 10 recent: nothing older to summarize
            return
        # Keep system prompt + last 10 messages
        system = self.messages[0]
        recent = self.messages[-10:]
        older = self.messages[1:-10]
        # Render as plain text instead of dumping the list repr into the prompt
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
        summary_prompt = f"Summarize this conversation history in 500 tokens or less:\n{transcript}"
        summary = await self.llm.generate([
            {"role": "user", "content": summary_prompt}
        ])
        # Rebuild messages with summary
        self.messages = [
            system,
            {"role": "user", "content": f"[History summary]: {summary}"},
        ] + recent

    async def run(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})
        if self.should_compress():
            await self.compress_history()
        response = await self.llm.generate(self.messages)
        self.messages.append({"role": "assistant", "content": response})
        return response
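To see what the compression step does to the message list without calling a real model, here is a small self-contained sketch. `fake_summarize` is a stand-in I made up for the LLM call, and `compress` is a standalone version of the same idea with a hypothetical `keep_recent` parameter.

```python
import asyncio


async def fake_summarize(messages: list[dict]) -> str:
    """Stand-in for an LLM call: reports how many messages it was given."""
    return f"{len(messages)} earlier messages omitted"


async def compress(messages: list[dict], summarize, keep_recent: int = 10) -> list[dict]:
    """Replace everything between the system prompt and the last `keep_recent` messages."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing older than the recent window
    system = messages[0]
    older = messages[1:-keep_recent]
    recent = messages[-keep_recent:]
    summary = await summarize(older)
    return [system, {"role": "user", "content": f"[History summary]: {summary}"}] + recent


messages = [{"role": "system", "content": "You are an agent."}]
for i in range(30):
    messages.append({"role": "user", "content": f"task {i}"})
    messages.append({"role": "assistant", "content": f"result {i}"})

compressed = asyncio.run(compress(messages, fake_summarize))
print(len(messages), "->", len(compressed))  # 61 -> 12
print(compressed[1]["content"])              # [History summary]: 50 earlier messages omitted
```

The quality of the result depends entirely on the summarizer: anything it drops is gone for the rest of the run.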

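The key insight: Claude Code reportedly compacts its history at around 92-95% context usage. I used the same thresholds. Because my token count is only an estimate, I also set max_tokens to 180,000 rather than the full 200,000 to leave some headroom: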

Token usage monitoring:
85% -> No action needed
92% -> Prepare compression
95% -> FORCE compression

Strategy 3: Just-In-Time Retrieval

The biggest mistake I made was pre-loading all relevant data. Instead of this:

preload-wrong.py
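# WRONG: Load everything upfront
class PreloadingAgent:
    def __init__(self, llm, vector_store):
        self.llm = llm
        self.vector_store = vector_store  # get_all() stands for "fetch every document"

    async def run(self, query: str) -> str:
        # Pre-load all docs
        all_docs = await self.vector_store.get_all()  # 50,000 tokens!
        context = "\n".join(doc.content for doc in all_docs)
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
        # Same message-list format as the other agents
        return await self.llm.generate([{"role": "user", "content": prompt}])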

I switched to just-in-time retrieval:

jit-retrieval.py
# RIGHT: Retrieve only what's needed
class JITAgent:
    def __init__(self, llm, vector_store):
        self.llm = llm
        self.vector_store = vector_store

    async def run(self, query: str) -> str:
        # Retrieve only relevant docs
        relevant_docs = await self.vector_store.search(query, k=5)  # ~2,000 tokens
        context = "\n".join(doc.content for doc in relevant_docs)
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
        # Same message-list format as the other agents
        return await self.llm.generate([{"role": "user", "content": prompt}])

The difference in token usage:

Pre-load approach:
- All docs loaded: 50,000 tokens
- Query processing: 52,000 tokens total
- Token efficiency: POOR
JIT approach:
- Relevant docs only: 2,000 tokens
- Query processing: 4,000 tokens total
- Token efficiency: EXCELLENT (13x better)
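The selection step itself can be illustrated without a vector store. The sketch below swaps embedding search for a toy word-overlap score, so it is not how a real retriever ranks documents. `score`, `retrieve`, and the `token_budget` parameter are all hypothetical names chosen for this example.

```python
def score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def retrieve(query: str, docs: list[str], k: int = 2, token_budget: int = 50) -> list[str]:
    """Return up to `k` best-matching docs whose combined size estimate fits the budget."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    selected: list[str] = []
    used = 0
    for doc in ranked[:k]:
        cost = len(doc) // 4  # rough estimate: ~4 characters per token
        # Skip documents with no overlap at all, or that would exceed the budget
        if score(query, doc) == 0 or used + cost > token_budget:
            continue
        selected.append(doc)
        used += cost
    return selected


docs = [
    "retry policy for failed tool calls",
    "how to rotate api keys",
    "tool calls time out after 30 seconds",
    "office lunch menu",
]
print(retrieve("why did my tool calls fail", docs))
# ['retry policy for failed tool calls', 'tool calls time out after 30 seconds']
```

The point is the shape of the step: rank, cap at k, and enforce a token budget, so the cost of retrieval stays bounded no matter how large the document store grows.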

Strategy 4: Structured State Extraction

The final piece: extract structured state from conversation history instead of keeping the raw messages.

state-extraction.py
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field


class AgentState(BaseModel):
    """Structured state extracted from conversation"""
    goal: Optional[str] = None
    current_step: int = 0
    completed_actions: list[str] = []
    pending_actions: list[str] = []
    errors_encountered: list[str] = []
    # Defaulted so that JSON returned by the model (which omits it) still validates
    last_updated: datetime = Field(default_factory=datetime.now)


class StatefulAgent:
    def __init__(self, llm):
        self.llm = llm  # any client exposing `async generate(messages) -> str`
        self.state = AgentState()
        self.recent_messages = []  # Only last few for context

    async def extract_state(self, messages: list) -> AgentState:
        """Merge the latest exchange into the previous state."""
        # The previous state is included so earlier progress isn't lost each turn
        extraction_prompt = """Update the agent state using the latest exchange.
Return only JSON with: goal, current_step, completed_actions, pending_actions, errors_encountered.
Previous state:
{state}
Latest exchange:
{messages}
"""
        response = await self.llm.generate([
            {"role": "user", "content": extraction_prompt.format(
                state=self.state.model_dump_json(exclude={"last_updated"}),
                messages="\n".join(messages),
            )}
        ])
        # Pydantic v2 API; on v1 use AgentState.parse_raw(response)
        return AgentState.model_validate_json(response)

    async def run(self, user_input: str) -> str:
        # Process with current state
        context = f"""Current state:
- Goal: {self.state.goal}
- Step: {self.state.current_step}
- Completed: {self.state.completed_actions}
- Pending: {self.state.pending_actions}
- Errors: {self.state.errors_encountered}
User input: {user_input}
"""
        response = await self.llm.generate([
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": context}
        ])
        # Update state
        self.state = await self.extract_state(
            [f"User: {user_input}", f"Assistant: {response}"]
        )
        return response

Now instead of keeping 50,000 tokens of conversation history, I keep ~500 tokens of structured state:

Raw conversation history: 50,000 tokens
Structured state: ~500 tokens
Reduction: 100x
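The extraction step depends on the model returning valid JSON, which is not guaranteed. The sketch below shows one way to fail safely: keep the previous state whenever the reply cannot be parsed. To stay dependency-free it uses a stdlib dataclass instead of pydantic, so it does no type validation. `State` and `parse_state` are hypothetical names for this example only.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class State:
    goal: Optional[str] = None
    current_step: int = 0
    completed_actions: list = field(default_factory=list)
    pending_actions: list = field(default_factory=list)
    errors_encountered: list = field(default_factory=list)


def parse_state(raw: str, previous: State) -> State:
    """Parse the model's reply; fall back to the previous state if it isn't usable JSON."""
    # Models often wrap JSON in prose or code fences, so take the outermost {...}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return previous
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return previous
    # Ignore any keys the model invented
    known = {f.name for f in fields(State)}
    return State(**{k: v for k, v in data.items() if k in known})


previous = State(goal="ship release", current_step=2)
reply = 'Here is the state:\n{"goal": "ship release", "current_step": 3, "completed_actions": ["run tests"]}'

print(parse_state(reply, previous).current_step)                     # 3
print(parse_state("Sorry, I can't do that", previous).current_step)  # 2
```

Falling back to the old state means one bad reply costs a single update instead of wiping the agent's memory of what it has done.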

The complete context manager

Putting it all together:

context-manager.py
from datetime import datetime

# AgentState is the model defined in state-extraction.py above


class ContextManager:
    def __init__(
        self,
        summarize,  # async callable: list of messages -> summary string
        max_tokens: int = 180000,
        compress_threshold: float = 0.92,
        window_size: int = 20,
    ):
        self.summarize = summarize
        self.max_tokens = max_tokens
        self.compress_threshold = compress_threshold
        self.window_size = window_size
        self.state = AgentState(last_updated=datetime.now())

    def count_tokens(self, messages: list) -> int:
        # Rough estimate: ~4 characters per token
        return sum(len(msg.get("content", "")) // 4 for msg in messages)

    def should_compress(self, messages: list) -> bool:
        return self.count_tokens(messages) / self.max_tokens >= self.compress_threshold

    def apply_sliding_window(self, messages: list) -> list:
        if len(messages) > self.window_size + 1:
            return [messages[0]] + messages[-self.window_size:]
        return messages

    async def compress_old_messages(self, messages: list) -> list:
        if len(messages) <= 11:  # system + 10 recent: nothing older to summarize
            return messages
        system = messages[0]
        recent = messages[-10:]
        older = messages[1:-10]
        summary = await self.summarize(older)
        return [
            system,
            {"role": "user", "content": f"[History]: {summary}"},
        ] + recent

    async def build_context(self, messages: list) -> list:
        # 1. Apply sliding window (on a copy, so the caller's history isn't mutated below)
        messages = self.apply_sliding_window(list(messages))
        # 2. Check if compression needed
        if self.should_compress(messages):
            messages = await self.compress_old_messages(messages)
        # 3. Inject structured state right after the system prompt
        messages.insert(1, {"role": "system", "content": self.format_state()})
        return messages

    def format_state(self) -> str:
        return f"""Current agent state:
Goal: {self.state.goal or 'Not set'}
Progress: Step {self.state.current_step}
Completed: {len(self.state.completed_actions)} actions
Pending: {len(self.state.pending_actions)} actions
Errors: {len(self.state.errors_encountered)} errors
"""

The performance metrics that matter

The Manus team made a claim that seemed extreme at first: KV-cache hit rate is the single most important metric for a production-stage AI agent.

After implementing context engineering, I understood why:

Before context engineering:
- KV-cache hit rate: ~20%
- Tokens per iteration: Growing linearly
- Cost per iteration: Growing linearly
- Agent fails at iteration ~60
After context engineering:
- KV-cache hit rate: ~85%
- Tokens per iteration: Bounded at ~35K
- Cost per iteration: Stable
- Agent runs indefinitely

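The KV-cache hit rate is the share of input tokens whose attention keys and values can be reused from a previous request instead of being recomputed. Providers generally cache by prefix, so a hit requires the start of the new request to match the earlier one exactly. Higher hit rate = lower cost and latency.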
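A simple proxy for this metric is the fraction of a request that is an exact prefix match with the previous one. The sketch below is only that proxy, not any provider's actual cache accounting. `shared_prefix_ratio` is a hypothetical helper, and tokens are represented as plain strings to keep the example readable.

```python
def shared_prefix_ratio(previous: list[str], current: list[str]) -> float:
    """Fraction of `current` that matches `previous` from the first token onward."""
    if not current:
        return 0.0
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared / len(current)


base = ["SYSTEM", "tools", "u1", "a1", "u2", "a2"]

# Append-only: the whole previous request is a prefix of the next one
appended = base + ["u3", "a3"]
print(shared_prefix_ratio(base, appended))  # 0.75

# A timestamp at the top of the prompt changes the first token on every call
stamped = ["SYSTEM@12:01", "tools", "u1", "a1", "u2", "a2", "u3", "a3"]
print(shared_prefix_ratio(base, stamped))   # 0.0
```

The same logic applies to trimming and summarizing: any edit near the start of the context invalidates the cached prefix from that point on, so it is worth doing such rewrites occasionally rather than on every iteration.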

Common mistakes I made

Mistake 1: Treating context as unlimited

# WRONG
messages.append(user_message)  # Just keep adding...
messages.append(assistant_message)
# No bounds check, no compression

# RIGHT
messages.append(user_message)
messages.append(assistant_message)
if context_manager.should_compress(messages):
    messages = await context_manager.compress_old_messages(messages)

Mistake 2: Not implementing compression triggers

I waited until the agent crashed to think about compression. The fix: proactive monitoring:

# Monitor after each iteration
current_tokens = count_tokens(messages)
usage_ratio = current_tokens / MAX_TOKENS
if usage_ratio > 0.95:
    # Emergency compression
    messages = await compress(messages)
elif usage_ratio > 0.92:
    # Prepare for compression
    logger.warning(f"Context at {usage_ratio:.0%}, preparing compression")

Mistake 3: Loading all data upfront

# WRONG: Load everything at start
all_relevant_docs = await search_all_docs(topic)  # 50K tokens!
context = format_docs(all_relevant_docs)

# RIGHT: Load as needed
async def get_relevant_context(query: str) -> str:
    docs = await vector_store.search(query, k=5)  # 2K tokens
    return format_docs(docs)

Summary

In this post, I explained why context engineering has replaced prompt engineering as the critical skill for AI agents. The key point is treating the context window as precious RAM, not infinite storage.

The four strategies I implemented:

  1. Sliding window - Keep only recent messages
  2. Compression - Summarize old context at 92% usage
  3. Just-in-time retrieval - Load only what’s needed, when needed
  4. Structured state - Extract state from history instead of keeping raw messages

Without context engineering:

  • Costs explode (each iteration processes all history)
  • Performance degrades (lost-in-the-middle effect)
  • Tasks fail when context exceeds limits

With context engineering:

  • Stable token usage across iterations
  • Predictable costs
  • Agents that can run indefinitely

The shift from “prompt engineering” to “context engineering” reflects a deeper truth: as agents become autonomous, managing their working memory becomes more important than crafting their instructions.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

