How MCP Reduces Token Usage: A Practical Guide to Context Window Optimization
Problem
When I built an AI agent that needed to call my company’s legacy APIs, I watched my token usage explode. One simple workflow consumed 20,000+ tokens just from API responses.
The culprit? Bloated API responses.
Legacy API Response:

{
  "user": {
    "id": 12345,
    "name": "John Doe",
    "email": "[email protected]",
    "created_at": "2020-01-15T09:30:00Z",
    "updated_at": "2024-03-18T14:22:33Z",
    "preferences": { /* 50+ nested fields */ },
    "metadata": { /* 20+ tracking fields */ },
    "legacy_fields": { /* obsolete data */ },
    "internal_notes": "...",
    "audit_trail": [...],
    // ... 30 more fields
  }
}

All I needed was the user’s name and email. But the API returned 2,000 tokens of data I never used.
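To put a rough number on this kind of over-fetching, a payload's token cost can be approximated from its serialized length. The sketch below assumes about four characters per token, which is only a rule of thumb (a real tokenizer will give different counts). The `estimate_tokens` and `wasted_fraction` helpers and the sample payload are hypothetical, not part of any library or of my actual API.

```python
import json

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a real tokenizer


def estimate_tokens(payload) -> int:
    """Approximate the token cost of a JSON-serializable payload."""
    text = json.dumps(payload, separators=(",", ":"))
    return max(1, len(text) // CHARS_PER_TOKEN)


def wasted_fraction(full: dict, needed: dict) -> float:
    """Share of the estimated tokens that the agent never uses."""
    return 1 - estimate_tokens(needed) / estimate_tokens(full)


# Illustrative data only: a stand-in for a bloated legacy response
full = {
    "user": {
        "id": 12345,
        "name": "John Doe",
        "email": "john.doe@example.com",
        "preferences": {f"pref_{i}": "default" for i in range(50)},
        "metadata": {f"track_{i}": 0 for i in range(20)},
    }
}
needed = {"name": full["user"]["name"], "email": full["user"]["email"]}

print(f"full: ~{estimate_tokens(full)} tokens")
print(f"needed: ~{estimate_tokens(needed)} tokens")
print(f"wasted: {wasted_fraction(full, needed):.0%}")
```

For billing-grade numbers you would swap the heuristic for your model provider's own token counter.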
And it got worse. When I checked the Reddit discussions on MCP, I saw this observation:
“Not only did they seem inefficient, they were also eating a surprising amount of context. When Anthropic released /context it became obvious just how much prompt space some MCP tools were consuming.”
Wait. MCP tools consuming context? I thought MCP was supposed to help with this problem.
Environment
- Building AI agents with multiple API integrations
- Legacy APIs with over-fetching issues
- Smaller or context-limited models (Claude Haiku, local LLMs)
- Need to optimize token usage for cost and performance
What I discovered
The Reddit comment I quoted was about poorly designed MCP tools. But then I found this response (49 points):
“MCP also allows you to control what’s returned to the agent so you can be more context aware. If you don’t control the API directly, you don’t control the response shape or the size of the response. If there’s an API you want to call that returns a bunch of irrelevant data, AI will waste context by parsing that API response every time. With MCP, you control the shape and size of the data returned to the AI.”
This was the key insight. MCP isn’t just a wrapper - it’s a preprocessing layer.
Then I found this success story (14 points):
“I wrote an app that uses one of my company’s legacy APIs. This API was not designed for AI. It provides a lot of data for many queries that is not needed (read: wasted context). So I made an MCP server that is also doing some ‘middleware-ish’ pre processing to associate some data and make some calculations on the server side. This makes my app significantly faster and uses far less tokens. I can run my app with a 4b local LLM now.”
This person achieved what I needed: running complex workflows on a 4B-parameter local model. That is a dramatic cost saving.
The problem: Legacy APIs waste context
Modern AI agents frequently need to interact with APIs that were never designed for LLM consumption. These legacy APIs often suffer from:
+-------------------+     +-------------------+     +-------------------+
| Legacy API        |     | Agent Context     |     | Wasted Tokens     |
+-------------------+     +-------------------+     +-------------------+
| Over-fetching     | --> | All data loaded   | --> | 80-90% irrelevant |
| Nested complexity |     | Deep structures   |     | Parsing overhead  |
| Redundant data    |     | Repeated info     |     | Duplicate tokens  |
| No filtering      |     | Full response     |     | No field selection|
+-------------------+     +-------------------+     +-------------------+

When an agent calls these APIs directly, every byte of irrelevant data consumes precious context window space.
This becomes critical when:
- Running long agent workflows with multiple API calls
- Using context-limited models (e.g., local LLMs with 4K-8K context)
- Needing to maintain conversation history alongside API responses
The solution: MCP as intelligent middleware
MCP servers solve this problem by acting as a preprocessing layer between the API and the LLM. Here’s the transformation:
BEFORE MCP (Direct API Call):
Agent Request --> Legacy API --> Bloated Response (2000 tokens) --> LLM Context
AFTER MCP (Preprocessed Call):
Agent Request --> MCP Server --> Filtered Response (150 tokens) --> LLM Context
                      |
                      +-- Internal: Legacy API Call
                      +-- Data Filtering
                      +-- Pre-processing

Step 1: Data filtering at the source
# mcp_server.py - Wrapping a bloated legacy API
# FastMCP is the high-level server class in the official Python SDK
from mcp.server.fastmcp import FastMCP

server = FastMCP("legacy-api-wrapper")


@server.tool()
async def get_user_summary(user_id: str) -> dict:
    """
    Get essential user information.
    Returns only name and email, filtering out 40+ unnecessary fields.
    """
    # Internal: Make full API call
    # (legacy_api is assumed to be an existing async client)
    full_response = await legacy_api.get_user(user_id)

    # Filter to essential fields only
    user = full_response["user"]
    return {"name": user["name"], "email": user["email"]}

Token comparison:
Legacy API Response (2,000 tokens):

{
  "user": {
    "id": 12345,
    "name": "John Doe",
    "email": "[email protected]",
    ... (40+ more fields)
  }
}

MCP Server Response (150 tokens):

{
  "name": "John Doe",
  "email": "[email protected]"
}

That is a 92.5% reduction in tokens (2,000 down to 150).
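When a server wraps many endpoints, hand-writing each filter gets repetitive. One option is a small whitelist helper that copies only the named fields out of the raw response. The sketch below is one assumed way to structure this; `pick_fields` and its dotted-path convention are hypothetical and not part of the MCP SDK.

```python
from typing import Any


def pick_fields(data: dict, paths: list[str]) -> dict:
    """Copy only the whitelisted dotted paths (e.g. "user.name") from data.

    The result is flat and keyed by the last path segment. Paths that
    are missing from the response are skipped rather than raising.
    """
    result: dict[str, Any] = {}
    for path in paths:
        node: Any = data
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                break  # path not present in this response
            node = node[key]
        else:
            # Loop finished without break: the full path was found
            result[path.rsplit(".", 1)[-1]] = node
    return result


# Illustrative stand-in for the legacy response
raw = {
    "user": {
        "id": 12345,
        "name": "John Doe",
        "email": "john.doe@example.com",
        "internal_notes": "...",
    }
}

print(pick_fields(raw, ["user.name", "user.email"]))
```

A whitelist fails safe: a new field added to the legacy API never reaches the model unless you opt in.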
Step 2: Aggregating multiple calls
The Reddit commenter mentioned “reducing 5 legacy api calls to a single simple instruction.” Here’s how that works:
from datetime import datetime


@server.tool()
async def get_project_metrics(project_id: str) -> dict:
    """
    Get consolidated project metrics.
    Aggregates 5 API calls into one minimal response.
    """
    # Make 5 internal API calls
    tasks = await api.get_tasks(project_id)
    users = await api.get_users(project_id)
    timeline = await api.get_timeline(project_id)
    budget = await api.get_budget(project_id)
    risks = await api.get_risks(project_id)

    # Pre-calculate metrics (saves agent reasoning tokens)
    done = sum(1 for t in tasks if t["done"])
    return {
        # Guard against an empty task list to avoid division by zero
        "progress": done / len(tasks) if tasks else 0.0,
        "budget_used": budget["spent"] / budget["total"],
        "team_size": len(users),
        # Assumes timeline["end"] is a naive datetime
        "days_remaining": (timeline["end"] - datetime.now()).days,
        "risk_count": len(risks),
    }

Token math:
WITHOUT MCP:
- 5 API calls x 1,500 tokens each = 7,500 tokens of API response
- Agent reasoning about data = 2,000 tokens
- Total: 9,500 tokens

WITH MCP:
- 1 MCP call = 200 tokens
- Agent reasoning (simpler data) = 500 tokens
- Total: 700 tokens

Reduction: 92.6%

Step 3: Pre-processing calculations
The Reddit example: “I made an MCP server that is also doing some ‘middleware-ish’ pre processing to associate some data and make some calculations on the server side.”
Instead of:
1. Agent fetches raw data (token cost)
2. Agent parses data (token cost)
3. Agent performs calculations (more tokens)
4. Agent reasons about results (more tokens)

MCP server does:

1. Fetch data
2. Calculate
3. Return only final result

Step 4: Enabling smaller models
The most dramatic result from the Reddit comment: “I can run my app with a 4b local LLM now.”
This matters because:
+------------------+-------------------+-------------------+
| Model            | Context Window    | Cost              |
+------------------+-------------------+-------------------+
| Claude Opus      | 200K tokens       | $$$$              |
| Claude Sonnet    | 200K tokens       | $$$               |
| Claude Haiku     | 200K tokens       | $                 |
| Local 4B LLM     | 4K-8K tokens      | Free              |
+------------------+-------------------+-------------------+

With token-hungry approaches, you need the big models. With MCP efficiency, you can use Claude Haiku, local LLMs, or smaller models.
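Whether a workflow can move to a smaller model mostly comes down to whether its token budget fits the window with room left for reasoning. A minimal sketch of that check follows; the function name, the default reasoning reserve, and the example figures are assumptions for illustration.

```python
def fits_context(
    context_window: int,
    api_tokens: int,
    history_tokens: int,
    reasoning_reserve: int = 2_000,  # assumed minimum room for reasoning
) -> bool:
    """Return True if the workflow leaves enough room for reasoning."""
    remaining = context_window - api_tokens - history_tokens
    return remaining >= reasoning_reserve


# An 8K local model with unfiltered vs. filtered API responses
print(fits_context(8_000, api_tokens=6_000, history_tokens=1_000))  # False
print(fits_context(8_000, api_tokens=600, history_tokens=1_000))    # True
```

The reserve is the knob to tune: agents that plan over many steps need more headroom than single-shot lookups.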
Why this matters: Practical implications
Cost reduction
Scenario: Agent needs to check 10 users' emails
WITHOUT MCP:
- 10 API calls x 2,000 tokens each = 20,000 tokens of API response
- Agent reasoning = 2,000 tokens
- Total: 22,000 tokens
- Cost at $3/M input tokens: $0.066

WITH MCP:
- 10 MCP calls x 100 tokens each = 1,000 tokens
- Agent reasoning = 2,000 tokens
- Total: 3,000 tokens
- Cost at $3/M input tokens: $0.009

Reduction: 86% cost savings

Model selection flexibility
Token budget needed for workflow:
WITHOUT MCP: 22,000 tokens -> Does not fit a 4K-8K local model; needs a large-context hosted model
WITH MCP: 3,000 tokens -> Can use Claude Haiku or a local 4B LLM

Agent workflow depth
Context windows are finite. Every token saved on API responses equals more tokens for reasoning.
8K context window:
WITHOUT MCP:
- API responses: 6,000 tokens
- Conversation history: 1,000 tokens
- Available for reasoning: 1,000 tokens (very limited)

WITH MCP:
- API responses: 600 tokens
- Conversation history: 1,000 tokens
- Available for reasoning: 6,400 tokens (much better)

Response time
The Reddit commenter noted: “This makes my app significantly faster”
WHY IT'S FASTER:
1. Fewer tokens to process = faster inference
2. Pre-computed results = less agent reasoning time
3. Consolidated calls = fewer network round trips

Common mistakes: When MCP doesn’t help
MCP isn’t a silver bullet. I learned this the hard way.
Mistake 1: MCP server returns raw API responses
# WRONG: Just a passthrough
@server.tool()
async def get_user(user_id: str) -> dict:
    return await legacy_api.get_user(user_id)  # No filtering!

If the MCP server is just a passthrough, you gain nothing. The value comes from data transformation, not just wrapping.
Mistake 2: Over-engineering simple APIs
If an API already returns minimal, relevant data, adding MCP adds complexity without benefit.
# Unnecessary wrapping
@server.tool()
async def get_status() -> dict:
    # The original API already returns {"status": "ok"}
    return await simple_api.get_status()

Mistake 3: Poorly designed MCP tools
The Reddit skeptic was right about some MCP tools:
“eating a surprising amount of context” - poorly designed MCP tools
# WRONG: Verbose tool descriptions and schemas consume tokens
@server.tool()
async def get_data(params: dict) -> dict:
    """
    This is a very long description that explains in great detail
    all the various things this tool does and how to use it with
    many examples and edge cases and warnings and suggestions...
    """

Tool definitions themselves consume context: each tool's name, description, and input schema is included in the prompt. Balance tool overhead against data filtering savings.
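One way to reason about that balance is to treat the tool definition as a fixed cost per model request and the filtering as a saving per tool call. The sketch below uses hypothetical function names and illustrative numbers. How often definitions are re-sent depends on your client, so the once-per-request assumption is labeled in the code.

```python
import math


def net_savings(schema_tokens: int, savings_per_call: int,
                calls: int, requests: int = 1) -> int:
    """Tokens saved overall; negative means the tool costs more than it saves.

    Assumption: the tool definition is sent once per model request.
    """
    return savings_per_call * calls - schema_tokens * requests


def break_even_calls(schema_tokens: int, savings_per_call: int) -> int:
    """Smallest number of calls per request at which the tool pays for itself."""
    if savings_per_call <= 0:
        raise ValueError("tool never pays for itself")
    return math.ceil(schema_tokens / savings_per_call)


print(net_savings(500, 1_800, calls=1))    # 1300
print(net_savings(2_000, 1_800, calls=1))  # -200
print(break_even_calls(2_000, 1_800))      # 2
```

A heavy schema can still be worth it if the tool is called often enough within one request, which is why call frequency belongs in the estimate.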
Mistake 4: Ignoring tool definition overhead
Token cost breakdown:
MCP tool schema definition: 500 tokens
Data filtering savings: 1,800 tokens
Net savings: 1,300 tokens (worth it!)

But if:
MCP tool schema definition: 2,000 tokens
Data filtering savings: 1,800 tokens
Net savings: -200 tokens (not worth it!)

How I applied this
After understanding these principles, I restructured my agent’s API access:
BEFORE:
Agent --> Legacy API 1 (2000 tokens) --> LLM Context
Agent --> Legacy API 2 (1500 tokens) --> LLM Context
Agent --> Legacy API 3 (3000 tokens) --> LLM Context
Agent --> Legacy API 4 (1800 tokens) --> LLM Context
Agent --> Legacy API 5 (2200 tokens) --> LLM Context
                                         -----------
                                         10,500 tokens total

AFTER:
Agent --> MCP Server (filters + aggregates) --> LLM Context
                                                -----------
                                                800 tokens total

Result: 92% token reduction. I can now run my agent on Claude Haiku instead of Sonnet, cutting costs by 20x.
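The restructuring above boils down to one tool that fans out to the legacy endpoints, trims each response, and returns a single compact result. A minimal sketch of that shape follows, with a stub coroutine standing in for the real legacy calls; `fetch_endpoint`, the endpoint names, and the `summary` field are hypothetical. Running the calls concurrently with `asyncio.gather` is one option; sequential awaits work too.

```python
import asyncio


async def fetch_endpoint(name: str) -> dict:
    """Stub for a legacy API call (hypothetical); returns a bloated payload."""
    await asyncio.sleep(0)  # stands in for network I/O
    return {
        "endpoint": name,
        "summary": f"{name} ok",
        "audit_trail": ["..."] * 100,  # data the agent never needs
        "legacy_fields": {"obsolete": True},
    }


async def get_overview(endpoints: list[str]) -> dict:
    """Fan out to every endpoint concurrently and keep one field from each."""
    responses = await asyncio.gather(*(fetch_endpoint(e) for e in endpoints))
    return {r["endpoint"]: r["summary"] for r in responses}


names = ["tasks", "users", "timeline", "budget", "risks"]
print(asyncio.run(get_overview(names)))
```

In a real server, `get_overview` would be the body of a single tool, so the model sees one small dictionary instead of five raw responses.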
Summary
In this post, I explained how MCP reduces token usage by acting as a smart middleware layer. The key points are:
- MCP servers filter, transform, and aggregate data before it reaches your LLM’s context window
- Legacy APIs often return 5-10x more data than needed
- Pre-computing calculations on the server saves both data tokens and reasoning tokens
- Lower token usage makes smaller, cheaper models viable, which is where the biggest cost savings come from
- Poorly designed MCP tools can make things worse, not better
Next step: Audit your current API usage - which endpoints return the most data you don’t use? Those are your prime MCP candidates.
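A simple way to start that audit is to record, per endpoint, how many tokens each response costs and how many of them the agent actually uses, then rank by waste. The sketch below assumes you already have such measurements; the `EndpointStats` type and the sample figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    name: str
    response_tokens: int  # average tokens per response
    used_tokens: int      # tokens the agent actually needs
    calls_per_day: int

    @property
    def wasted_per_day(self) -> int:
        return (self.response_tokens - self.used_tokens) * self.calls_per_day


def rank_candidates(stats: list[EndpointStats]) -> list[EndpointStats]:
    """Order endpoints by daily wasted tokens, worst first."""
    return sorted(stats, key=lambda s: s.wasted_per_day, reverse=True)


# Made-up measurements for illustration
stats = [
    EndpointStats("get_user", 2_000, 150, calls_per_day=500),
    EndpointStats("get_status", 20, 20, calls_per_day=5_000),
    EndpointStats("get_timeline", 3_000, 400, calls_per_day=50),
]

for s in rank_candidates(stats):
    print(f"{s.name}: {s.wasted_per_day:,} wasted tokens/day")
```

Weighting by call volume matters: a moderately bloated endpoint called hundreds of times a day can waste more than a huge one called rarely.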
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Model Context Protocol Specification
- 👨💻 Reddit discussion on MCP value
- 👨💻 Anthropic Context Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!