Context Engineering for AI Agents: Beyond Prompt Engineering
Problem
My AI agent kept failing after running for a while. The error was cryptic:
    Error: context_length_exceeded
    Maximum context length is 200000 tokens
    Your message contained 215000 tokens

I was confused. My agent was designed to run autonomously for hours. How did it hit the token limit after only about 60 iterations?
When I dug into the logs, I found the problem:
    Iteration 1: 6,500 tokens input
    Iteration 5: 12,000 tokens input
    Iteration 15: 28,000 tokens input
    Iteration 30: 52,000 tokens input
    Iteration 45: 78,000 tokens input
    Iteration 60: CRASH - 215,000 tokens

Each iteration was adding to the message history. The agent kept accumulating context until it crashed.
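The log samples above can be turned into a growth rate per interval, which makes it easier to see where the context started growing faster. This is a minimal sketch: the `growth_rates` helper and the `LOG` list are illustrative names, and the numbers are simply the log lines copied by hand.

```python
def growth_rates(samples):
    """Average tokens added per iteration between consecutive (iteration, tokens) samples."""
    rates = []
    for (start_iter, start_tokens), (end_iter, end_tokens) in zip(samples, samples[1:]):
        rates.append((end_tokens - start_tokens) / (end_iter - start_iter))
    return rates


# (iteration, input tokens) pairs copied from the log above
LOG = [(1, 6_500), (5, 12_000), (15, 28_000), (30, 52_000), (45, 78_000), (60, 215_000)]

for (start, _), (end, _), rate in zip(LOG, LOG[1:], growth_rates(LOG)):
    print(f"iterations {start}-{end}: ~{rate:,.0f} tokens/iteration")
```

On these samples the last interval grows roughly five times faster than the earlier ones, which suggests that something larger than ordinary chat turns (for example a big tool result or retrieved document) entered the history late in the run. That is an inference from the numbers, not something the log states directly.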
What I tried first
My initial approach was naive - just let the agent run and keep all history:
    class NaiveAgent:
        def __init__(self, llm):
            self.llm = llm  # any client exposing an async generate(messages) method
            # SYSTEM_PROMPT is the agent's system prompt string, defined elsewhere
            self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

        async def run(self, user_input: str):
            self.messages.append({"role": "user", "content": user_input})

            response = await self.llm.generate(self.messages)
            self.messages.append({"role": "assistant", "content": response})

            return response

This works for short conversations. But after 30 iterations:
    System prompt: ~5,000 tokens
    Each iteration adds ~1,500 tokens (user + assistant + tool calls)
    After 30 iterations: 5,000 + (30 × 1,500) = 50,000+ tokens just for input

And that’s before considering:
- Tool definitions: ~2,000 tokens
- Retrieved documents: varies wildly
- Code snippets: can be massive
The context window isn’t infinite storage - it’s more like RAM.
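The arithmetic above can be wrapped in a small projection helper. This is a sketch under the post's own rough figures (5,000-token system prompt, about 1,500 tokens per iteration, about 2,000 tokens of tool definitions); the function names and the `fixed_overhead_tokens` parameter are illustrative, and retrieved documents and code snippets are deliberately left out because their size varies.

```python
def projected_input_tokens(
    iteration,
    system_tokens=5_000,
    per_iteration_tokens=1_500,
    fixed_overhead_tokens=2_000,  # e.g. tool definitions
):
    """Estimated input size at a given iteration if nothing is ever trimmed."""
    return system_tokens + fixed_overhead_tokens + iteration * per_iteration_tokens


def iterations_until_limit(
    limit,
    system_tokens=5_000,
    per_iteration_tokens=1_500,
    fixed_overhead_tokens=2_000,
):
    """Last iteration whose projected input still fits within `limit` tokens."""
    budget = limit - system_tokens - fixed_overhead_tokens
    return max(budget // per_iteration_tokens, 0)


print(projected_input_tokens(30))
print(iterations_until_limit(200_000))
```

With steady growth alone, the projection stays under a 200,000-token limit for well over a hundred iterations. The real agent crashed much earlier, which is consistent with the variable items in the list above doing most of the damage.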
Why this matters more than I thought
Andrej Karpathy made a comparison that stuck with me: LLM is to CPU as context window is to RAM. When you understand this analogy, the problem becomes clear:
    CPU (LLM) - Does the actual computation
    RAM (Context Window) - Limited working memory

    If you fill RAM, the program crashes.
    If you fill context, the agent fails.

But there’s another problem I discovered: performance degrades before you hit the hard limit.
Research on the “lost-in-the-middle” effect shows that information buried in the middle of a long context is recalled less reliably than information at the start or end. Roughly:

    Context with 50,000 tokens:
    - Information at START: ~80% recall
    - Information in MIDDLE: ~50% recall
    - Information at END: ~80% recall

So my agent wasn’t just crashing - it was getting dumber as context grew.
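One mitigation that follows from this recall pattern is to order retrieved items so the most relevant ones sit at the edges of the context and the least relevant ones land in the middle. The sketch below is not part of the agent described in this post; `reorder_for_long_context` is a hypothetical helper, and it assumes the input is already sorted from most to least relevant.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant items at the edges and the least relevant in the middle.

    `docs_by_relevance` must be sorted from most to least relevant.
    """
    front, back = [], []
    for index, doc in enumerate(docs_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back
        (front if index % 2 == 0 else back).append(doc)
    # Reverse the back half so the second most relevant item ends up last
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(reorder_for_long_context(ranked))
```

Whether this helps in practice depends on the model and the task, so it is worth measuring before relying on it.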
The four strategies
I found that LangChain identifies four context management strategies:
- Write - Save information outside the context window (scratchpads, memory)
- Select - Pull the relevant information into the context window
- Compress - Reduce size via summarization or trimming
- Isolate - Split context across separate agents or environments
Anthropic’s engineering blog made a similar argument in September 2025, presenting context engineering as the successor to prompt engineering. The definition I keep coming back to is Karpathy’s: “the delicate art and science of filling the context window with just the right information for the next step.”
Let me show how I implemented each strategy.
Strategy 1: Sliding Window
The simplest approach - keep only recent messages:
    class SlidingWindowAgent:
        def __init__(self, llm, window_size: int = 20):
            self.llm = llm  # any client exposing an async generate(messages) method
            self.window_size = window_size
            self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

        def _apply_window(self):
            """Keep the system prompt plus the last `window_size` messages."""
            if len(self.messages) > self.window_size + 1:  # +1 for system
                self.messages = [self.messages[0]] + self.messages[-self.window_size:]

        async def run(self, user_input: str):
            self.messages.append({"role": "user", "content": user_input})
            self._apply_window()

            response = await self.llm.generate(self.messages)
            self.messages.append({"role": "assistant", "content": response})

            return response

Testing this:
    Without sliding window:
    Iteration 30: 52,000 tokens -> CRASH at iteration 60

    With sliding window (size=20):
    Iteration 30: ~35,000 tokens
    Iteration 60: ~35,000 tokens
    Iteration 100: ~35,000 tokens -> STABLE

The sliding window kept context bounded. But I lost important context from early iterations.
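A window that counts messages has a second weakness: one very large message can still blow the budget. A variant that trims by estimated tokens instead is sketched below. It assumes string message content and the rough rule of about four characters per token; `trim_to_token_budget` is an illustrative name, not something the agent above uses.

```python
def estimate_tokens(message):
    """Rough estimate: about 4 characters per token."""
    return len(message["content"]) // 4


def trim_to_token_budget(messages, budget):
    """Keep the system prompt plus the newest messages that fit in `budget` tokens."""
    system, history = messages[0], messages[1:]
    remaining = budget - estimate_tokens(system)
    kept = []
    # Walk from newest to oldest and stop at the first message that no longer fits
    for message in reversed(history):
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return [system] + kept


history = [{"role": "system", "content": "s" * 400}]
history += [{"role": "user", "content": f"{i}" * 400} for i in range(5)]
trimmed = trim_to_token_budget(history, budget=350)
print(len(trimmed), [m["content"][0] for m in trimmed])
```

The loop stops at the first message that does not fit rather than skipping it, so the kept history stays contiguous and no turn is separated from the turns it refers to.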
Strategy 2: Compression with Summarization
I needed a way to keep important information while reducing token count. The solution: summarize older messages.
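Summarization costs an extra LLM call, so it should only fire when the window is nearly full. That takes two pieces: a token estimate and a threshold policy. The sketch below assumes about four characters per token (a rule of thumb, not an exact count), message content that is either a string or a list of text blocks shaped like `{"type": "text", "text": "..."}` (an assumed shape), and 92% / 95% thresholds. All names here are illustrative.

```python
def estimate_tokens(messages):
    """Rough token estimate for a list of chat messages (~4 characters per token)."""
    total_chars = 0
    for message in messages:
        content = message.get("content", "")
        if isinstance(content, str):
            total_chars += len(content)
        else:
            # Assumed shape: a list of blocks such as {"type": "text", "text": "..."}
            total_chars += sum(len(block.get("text", "")) for block in content)
    return total_chars // 4


def compression_action(used_tokens, max_tokens, warn_at=0.92, force_at=0.95):
    """Map current usage to one of: 'none', 'prepare', 'force'."""
    ratio = used_tokens / max_tokens
    if ratio >= force_at:
        return "force"
    if ratio >= warn_at:
        return "prepare"
    return "none"


sample = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": [{"type": "text", "text": "b" * 40}]},
]
print(estimate_tokens(sample))
print(compression_action(166_000, 180_000))
```

For exact counts, a provider's tokenizer or token-counting endpoint is the better source; the estimate only needs to be good enough to trigger compression with some headroom.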
    class CompressingAgent:
        def __init__(self, llm, max_tokens: int = 180000, compress_threshold: float = 0.92):
            self.llm = llm  # any client exposing an async generate(messages) method
            # compress_history() assumes messages[0] is the system prompt
            self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]
            self.max_tokens = max_tokens
            self.compress_threshold = compress_threshold

        def count_tokens(self, messages: list) -> int:
            """Estimate token count (~4 characters per token)"""
            total = 0
            for msg in messages:
                total += len(msg["content"]) // 4  # Rough estimate
            return total

        def should_compress(self) -> bool:
            """Check if we need to compress"""
            current = self.count_tokens(self.messages)
            return current / self.max_tokens >= self.compress_threshold

        async def compress_history(self):
            """Compress older messages into summary"""
            # System prompt + 10 recent messages: nothing older to summarize
            if len(self.messages) <= 11:
                return

            # Keep system prompt + last 10 messages
            system = self.messages[0]
            recent = self.messages[-10:]
            older = self.messages[1:-10]

            # Summarize older messages as a readable transcript
            transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
            summary_prompt = f"Summarize this conversation history in 500 tokens or less:\n{transcript}"
            summary = await self.llm.generate([
                {"role": "user", "content": summary_prompt}
            ])

            # Rebuild messages with summary
            self.messages = [
                system,
                {"role": "user", "content": f"[History summary]: {summary}"}
            ] + recent

        async def run(self, user_input: str):
            self.messages.append({"role": "user", "content": user_input})

            if self.should_compress():
                await self.compress_history()

            response = await self.llm.generate(self.messages)
            self.messages.append({"role": "assistant", "content": response})

            return response

The key insight: Claude Code reportedly triggers compression at 92-95% context usage. I implemented the same threshold:
    Token usage monitoring:
    85% -> No action needed
    92% -> Prepare compression
    95% -> FORCE compression

Strategy 3: Just-In-Time Retrieval
The biggest mistake I made was pre-loading all relevant data. Instead of this:
    # WRONG: Load everything upfront
    class PreloadingAgent:
        async def run(self, query: str):
            # Pre-load all docs
            all_docs = await self.vector_store.get_all()  # 50,000 tokens!

            context = "\n".join([doc.content for doc in all_docs])
            prompt = f"Context:\n{context}\n\nQuestion: {query}"

            # Same message-list interface as the agents above
            return await self.llm.generate([{"role": "user", "content": prompt}])

I switched to just-in-time retrieval:
    # RIGHT: Retrieve only what's needed
    class JITAgent:
        async def run(self, query: str):
            # Retrieve only relevant docs
            relevant_docs = await self.vector_store.search(query, k=5)  # ~2,000 tokens

            context = "\n".join([doc.content for doc in relevant_docs])
            prompt = f"Context:\n{context}\n\nQuestion: {query}"

            # Same message-list interface as the agents above
            return await self.llm.generate([{"role": "user", "content": prompt}])

The difference in token usage:
    Pre-load approach:
    - All docs loaded: 50,000 tokens
    - Query processing: 52,000 tokens total
    - Token efficiency: POOR

    JIT approach:
    - Relevant docs only: 2,000 tokens
    - Query processing: 4,000 tokens total
    - Token efficiency: EXCELLENT (13x better)

Strategy 4: Structured State Extraction
The final piece: extract structured state from conversation history instead of keeping the raw messages.
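Extraction only works if the model's reply can actually be parsed, and replies often arrive wrapped in extra prose. A defensive parser, sketched with only the standard library, is shown here. The field names match the state fields used in this section; `parse_state_reply` and the fall-back-to-previous-state behavior are assumptions for illustration, not part of the implementation that follows.

```python
import json

STATE_FIELDS = (
    "goal",
    "current_step",
    "completed_actions",
    "pending_actions",
    "errors_encountered",
)


def parse_state_reply(reply, previous):
    """Merge the JSON object found in `reply` into a copy of `previous`.

    Falls back to the previous state when no valid JSON object is found.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return dict(previous)
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return dict(previous)
    merged = dict(previous)
    # Only accept known fields, so stray keys in the reply are ignored
    merged.update({key: data[key] for key in STATE_FIELDS if key in data})
    return merged


previous_state = {"goal": "ship v1", "current_step": 1, "completed_actions": ["setup"]}
reply_text = 'Here is the updated state: {"current_step": 2, "pending_actions": ["deploy"]} Done.'
print(parse_state_reply(reply_text, previous_state))
```

Merging into the previous state, rather than replacing it, means a reply that omits a field does not erase what the agent already knew.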
    from datetime import datetime
    from typing import Optional

    from pydantic import BaseModel, Field

    class AgentState(BaseModel):
        """Structured state extracted from conversation"""
        goal: Optional[str] = None
        current_step: int = 0
        completed_actions: list[str] = []
        pending_actions: list[str] = []
        errors_encountered: list[str] = []
        # Needs a default: the extracted JSON does not contain this field
        last_updated: datetime = Field(default_factory=datetime.now)

    class StatefulAgent:
        def __init__(self, llm):
            self.llm = llm  # any client exposing an async generate(messages) method
            self.state = AgentState()
            self.recent_messages = []  # Only last few for context

        async def extract_state(self, messages: list) -> AgentState:
            """Extract structured state from messages"""
            extraction_prompt = """Extract the current state from this conversation.
    Return JSON with: goal, current_step, completed_actions, pending_actions, errors_encountered.

    Conversation:
    {messages}
    """

            response = await self.llm.generate([
                {"role": "user", "content": extraction_prompt.format(messages=messages)}
            ])

            # Pydantic v1 API; on Pydantic v2 use AgentState.model_validate_json(response)
            return AgentState.parse_raw(response)

        async def run(self, user_input: str):
            # Process with current state
            context = f"""Current state:
    - Goal: {self.state.goal}
    - Step: {self.state.current_step}
    - Completed: {self.state.completed_actions}
    - Pending: {self.state.pending_actions}
    - Errors: {self.state.errors_encountered}

    User input: {user_input}"""

            response = await self.llm.generate([
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": context}
            ])

            # Update state. `context` carries the previous state plus the new input,
            # so earlier goals and actions are not dropped during extraction.
            self.state = await self.extract_state([context, response])

            return response

Now instead of keeping 50,000 tokens of conversation history, I keep ~500 tokens of structured state:
    Raw conversation history: 50,000 tokens
    Structured state: ~500 tokens

    Reduction: 100x

The complete context manager
Putting it all together:
    class ContextManager:
        def __init__(
            self,
            llm,
            max_tokens: int = 180000,
            compress_threshold: float = 0.92,
            window_size: int = 20
        ):
            self.llm = llm  # any client exposing an async generate(messages) method
            self.max_tokens = max_tokens
            self.compress_threshold = compress_threshold
            self.window_size = window_size
            self.state = AgentState(last_updated=datetime.now())

        def count_tokens(self, messages: list) -> int:
            return sum(len(msg.get("content", "")) // 4 for msg in messages)

        def should_compress(self, messages: list) -> bool:
            return self.count_tokens(messages) / self.max_tokens >= self.compress_threshold

        def apply_sliding_window(self, messages: list) -> list:
            if len(messages) > self.window_size + 1:
                return [messages[0]] + messages[-self.window_size:]
            return messages

        async def summarize(self, messages: list) -> str:
            transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
            prompt = f"Summarize this conversation history in 500 tokens or less:\n{transcript}"
            return await self.llm.generate([{"role": "user", "content": prompt}])

        async def compress_old_messages(self, messages: list) -> list:
            # System prompt + 10 recent messages: nothing older to summarize
            if len(messages) <= 11:
                return messages

            system = messages[0]
            recent = messages[-10:]
            older = messages[1:-10]

            summary = await self.summarize(older)

            return [
                system,
                {"role": "user", "content": f"[History]: {summary}"}
            ] + recent

        async def build_context(self, messages: list) -> list:
            # 1. Apply sliding window (copy, so the caller's list is never mutated)
            messages = list(self.apply_sliding_window(messages))

            # 2. Check if compression needed
            if self.should_compress(messages):
                messages = await self.compress_old_messages(messages)

            # 3. Inject structured state
            state_context = self.format_state()
            messages.insert(1, {"role": "system", "content": state_context})

            return messages

        def format_state(self) -> str:
            return f"""Current agent state:
    Goal: {self.state.goal or 'Not set'}
    Progress: Step {self.state.current_step}
    Completed: {len(self.state.completed_actions)} actions
    Pending: {len(self.state.pending_actions)} actions
    Errors: {len(self.state.errors_encountered)} errors
    """

The performance metrics that matter
The Manus team made a claim that seemed extreme at first: “KV-cache hit rate is the single most important metric for agent performance.”
After implementing context engineering, I understood why:
    Before context engineering:
    - KV-cache hit rate: ~20%
    - Tokens per iteration: Growing linearly
    - Cost per iteration: Growing linearly
    - Agent fails at iteration ~60

    After context engineering:
    - KV-cache hit rate: ~85%
    - Tokens per iteration: Bounded at ~35K
    - Cost per iteration: Stable
    - Agent runs indefinitely

The KV-cache hit rate measures how much of each request’s context the model has already processed and cached from an earlier request. Higher hit rate = lower cost and latency.
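A rough way to reason about cache friendliness without provider metrics is to measure how much of a request shares an exact prefix with the previous one. The sketch below assumes that cached computation is only reusable for an identical leading run of messages, which is how prefix caching generally works, though details vary by provider. `shared_prefix_ratio` is a hypothetical helper that uses character counts as a stand-in for tokens; real hit rates should come from the provider's usage data.

```python
def shared_prefix_ratio(previous, current):
    """Fraction of `current` (by characters) covered by the message prefix shared with `previous`."""
    total = sum(len(message["content"]) for message in current)
    if total == 0:
        return 0.0
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first difference ends the reusable prefix
        shared += len(new["content"])
    return shared / total


system = {"role": "system", "content": "s" * 100}
turn_1 = {"role": "user", "content": "u" * 50}
reply_1 = {"role": "assistant", "content": "a" * 50}
state = {"role": "system", "content": "x" * 50}

previous_request = [system, turn_1]
append_only = [system, turn_1, reply_1]
state_near_front = [system, state, turn_1, reply_1]

print(shared_prefix_ratio(previous_request, append_only))
print(shared_prefix_ratio(previous_request, state_near_front))
```

The comparison shows the trade-off: appending to the end keeps most of the request reusable, while changing or inserting content near the front shortens the shared prefix. Frequently changing material, such as an injected state block, is therefore cheaper to place late in the message list when cache reuse matters.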
Common mistakes I made
Mistake 1: Treating context as unlimited
    # WRONG
    messages.append(user_message)  # Just keep adding...
    messages.append(assistant_message)
    # No bounds check, no compression

    # RIGHT
    messages.append(user_message)
    messages.append(assistant_message)
    if context_manager.should_compress(messages):
        messages = await context_manager.compress_old_messages(messages)

Mistake 2: Not implementing compression triggers
I waited until the agent crashed to think about compression. The fix is proactive monitoring:
    # Monitor after each iteration
    current_tokens = count_tokens(messages)
    usage_ratio = current_tokens / MAX_TOKENS

    if usage_ratio > 0.95:
        # Emergency compression
        messages = await compress(messages)
    elif usage_ratio > 0.92:
        # Prepare for compression
        logger.warning(f"Context at {usage_ratio:.0%}, preparing compression")

Mistake 3: Loading all data upfront
    # WRONG: Load everything at start
    all_relevant_docs = await search_all_docs(topic)  # 50K tokens!
    context = format_docs(all_relevant_docs)

    # RIGHT: Load as needed
    async def get_relevant_context(query: str) -> str:
        docs = await vector_store.search(query, k=5)  # 2K tokens
        return format_docs(docs)

Summary
In this post, I explained why context engineering has replaced prompt engineering as the critical skill for AI agents. The key point is treating the context window as precious RAM, not infinite storage.
The four strategies I implemented:
- Sliding window - Keep only recent messages
- Compression - Summarize old context at 92% usage
- Just-in-time retrieval - Load only what’s needed, when needed
- Structured state - Extract state from history instead of keeping raw messages
Without context engineering:
- Costs explode (each iteration processes all history)
- Performance degrades (lost-in-the-middle effect)
- Tasks fail when context exceeds limits
With context engineering:
- Stable token usage across iterations
- Predictable costs
- Agents that can run indefinitely
The shift from “prompt engineering” to “context engineering” reflects a deeper truth: as agents become autonomous, managing their working memory becomes more important than crafting their instructions.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Anthropic: Context Engineering Replaces Prompt Engineering
- 👨💻 LangChain: Context Management Strategies
- 👨💻 Lost in the Middle: How Language Models Use Long Contexts
- 👨💻 Andrej Karpathy: LLM as Operating System
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!