How MCP Reduces Token Usage: A Practical Guide to Context Window Optimization

Problem

When I built an AI agent that needed to call my company’s legacy APIs, I watched my token usage explode. One simple workflow consumed 20,000+ tokens just from API responses.

The culprit? Bloated API responses.

Legacy API Response:
{
  "user": {
    "id": 12345,
    "name": "John Doe",
    "email": "[email protected]",
    "created_at": "2020-01-15T09:30:00Z",
    "updated_at": "2024-03-18T14:22:33Z",
    "preferences": { /* 50+ nested fields */ },
    "metadata": { /* 20+ tracking fields */ },
    "legacy_fields": { /* obsolete data */ },
    "internal_notes": "...",
    "audit_trail": [...],
    // ... 30 more fields
  }
}

All I needed was the user’s name and email. But the API returned 2,000 tokens of data I never used.
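To put a rough number on that kind of waste, here is a minimal sketch that compares the serialized size of a full response with the two fields actually needed. The 4-characters-per-token ratio is only a rule of thumb I am assuming here, not a real tokenizer, and the sample payload (including the example.com address) is made up for illustration.

```python
import json

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a real tokenizer


def estimate_tokens(payload) -> int:
    """Approximate token count from the length of the compact JSON text."""
    text = json.dumps(payload, separators=(",", ":"))
    return max(1, len(text) // CHARS_PER_TOKEN)


# Hypothetical payload standing in for the bloated legacy response
full = {
    "user": {
        "id": 12345,
        "name": "John Doe",
        "email": "john@example.com",
        "preferences": {f"pref_{i}": True for i in range(50)},
        "metadata": {f"track_{i}": "x" * 20 for i in range(20)},
    }
}
# The only two fields the agent actually uses
needed = {"name": full["user"]["name"], "email": full["user"]["email"]}

full_tokens = estimate_tokens(full)
needed_tokens = estimate_tokens(needed)
wasted_share = 1 - needed_tokens / full_tokens
print(f"full={full_tokens} needed={needed_tokens} wasted={wasted_share:.0%}")
```

For real budgeting you would swap the heuristic for the tokenizer of the model you actually run, since counts differ between model families.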

And it got worse. When I checked the Reddit discussions on MCP, I saw this observation:

“Not only did they seem inefficient, they were also eating a surprising amount of context. When Anthropic released /context it became obvious just how much prompt space some MCP tools were consuming.”

Wait. MCP tools consuming context? I thought MCP was supposed to help with this problem.

Environment

  • Building AI agents with multiple API integrations
  • Legacy APIs with over-fetching issues
  • Smaller or context-limited models (Claude Haiku, local LLMs)
  • Need to optimize token usage for cost and performance

What I discovered

The Reddit comment I quoted was about poorly designed MCP tools. But then I found this response (49 points):

“MCP also allows you to control what’s returned to the agent so you can be more context aware. If you don’t control the API directly, you don’t control the response shape or the size of the response. If there’s an API you want to call that returns a bunch of irrelevant data, AI will waste context by parsing that API response every time. With MCP, you control the shape and size of the data returned to the AI.”

This was the key insight. MCP isn’t just a wrapper - it’s a preprocessing layer.

Then I found this success story (14 points):

“I wrote an app that uses one of my company’s legacy APIs. This API was not designed for AI. It provides a lot of data for many queries that is not needed (read: wasted context). So I made an MCP server that is also doing some ‘middleware-ish’ pre processing to associate some data and make some calculations on the server side. This makes my app significantly faster and uses far less tokens. I can run my app with a 4b local LLM now.”

This person achieved what I needed: running complex workflows on a 4B-parameter local model. That is a dramatic cost saving.

The problem: Legacy APIs waste context

Modern AI agents frequently need to interact with APIs that were never designed for LLM consumption. These legacy APIs often suffer from:

+-------------------+     +-------------------+     +--------------------+
| Legacy API        |     | Agent Context     |     | Wasted Tokens      |
+-------------------+     +-------------------+     +--------------------+
| Over-fetching     | --> | All data loaded   | --> | 80-90% irrelevant  |
| Nested complexity |     | Deep structures   |     | Parsing overhead   |
| Redundant data    |     | Repeated info     |     | Duplicate tokens   |
| No filtering      |     | Full response     |     | No field selection |
+-------------------+     +-------------------+     +--------------------+

When an agent calls these APIs directly, every byte of irrelevant data consumes precious context window space.

This becomes critical when:

  • Running long agent workflows with multiple API calls
  • Using context-limited models (e.g., local LLMs with 4K-8K context)
  • Needing to maintain conversation history alongside API responses

The solution: MCP as intelligent middleware

MCP servers solve this problem by acting as a preprocessing layer between the API and the LLM. Here’s the transformation:

BEFORE MCP (Direct API Call):
Agent Request --> Legacy API --> Bloated Response (2000 tokens) --> LLM Context

AFTER MCP (Preprocessed Call):
Agent Request --> MCP Server --> Filtered Response (150 tokens) --> LLM Context
                      |
                      +-- Internal: Legacy API Call
                      +-- Data Filtering
                      +-- Pre-processing

Step 1: Data filtering at the source

# mcp_server.py - Wrapping a bloated legacy API
# Assumes the official MCP Python SDK (pip install mcp)
from mcp.server.fastmcp import FastMCP

# legacy_api is a placeholder for your own async client of the legacy API
server = FastMCP("legacy-api-wrapper")


@server.tool()
async def get_user_summary(user_id: str) -> dict:
    """
    Get essential user information.
    Returns only name and email, filtering out 40+ unnecessary fields.
    """
    # Internal: make the full API call
    full_response = await legacy_api.get_user(user_id)
    user = full_response["user"]
    # Filter to essential fields only
    return {
        "name": user["name"],
        "email": user["email"],
    }

Token comparison:

Legacy API Response (2,000 tokens):
{
  "user": {
    "id": 12345,
    "name": "John Doe",
    "email": "[email protected]",
    ... (40+ more fields)
  }
}

MCP Server Response (150 tokens):
{
  "name": "John Doe",
  "email": "[email protected]"
}

92.5% reduction in tokens.
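If you end up writing many of these filters, the field selection itself can be pulled into a small helper. The sketch below is one way to do it; `pick_fields` and its dotted-path convention are my own illustration, not part of any MCP SDK.

```python
from typing import Any


def pick_fields(payload: dict, paths: list[str]) -> dict:
    """Copy only the dotted paths listed in `paths` into a flat dict.

    The output key is the last path segment; missing paths are skipped.
    """
    result: dict[str, Any] = {}
    for path in paths:
        node: Any = payload
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                break  # path not present in this response
            node = node[key]
        else:
            # Only reached when every segment of the path was found
            result[path.split(".")[-1]] = node
    return result


# Hypothetical response with one wanted field present and one absent
response = {"user": {"id": 12345, "name": "John Doe", "internal_notes": "..."}}
slim = pick_fields(response, ["user.name", "user.email"])
print(slim)
```

Skipping missing paths instead of raising keeps the tool response small and predictable when the legacy API omits a field, at the price of hiding that omission from the agent.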

Step 2: Aggregating multiple calls

The Reddit commenter mentioned “reducing 5 legacy api calls to a single simple instruction.” Here’s how that works:

from datetime import datetime


@server.tool()
async def get_project_metrics(project_id: str) -> dict:
    """
    Get consolidated project metrics.
    Aggregates 5 API calls into one minimal response.
    """
    # Make 5 internal API calls (api is a placeholder for the legacy client)
    tasks = await api.get_tasks(project_id)
    users = await api.get_users(project_id)
    timeline = await api.get_timeline(project_id)
    budget = await api.get_budget(project_id)
    risks = await api.get_risks(project_id)
    # Pre-calculate metrics (saves agent reasoning tokens)
    done = sum(1 for t in tasks if t["done"])
    return {
        # Guard against an empty task list or a zero budget
        "progress": done / len(tasks) if tasks else 0.0,
        "budget_used": budget["spent"] / budget["total"] if budget["total"] else 0.0,
        "team_size": len(users),
        # Assumes timeline["end"] is a naive datetime
        "days_remaining": (timeline["end"] - datetime.now()).days,
        "risk_count": len(risks),
    }

Token math:

WITHOUT MCP:
- 5 API calls x 1,500 tokens each = 7,500 tokens of API response
- Agent reasoning about data = 2,000 tokens
- Total: 9,500 tokens
WITH MCP:
- 1 MCP call = 200 tokens
- Agent reasoning (simpler data) = 500 tokens
- Total: 700 tokens
Reduction: 92.6%
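The arithmetic above is easy to re-check in a few lines. This is just a sketch that replays the article's own estimates; the helper name `token_reduction` is mine.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


without_mcp = 5 * 1_500 + 2_000  # five raw responses plus agent reasoning
with_mcp = 200 + 500             # one MCP response plus simpler reasoning
print(f"{token_reduction(without_mcp, with_mcp):.1%}")
```

Plugging in your own measured token counts instead of these estimates is the quickest way to see whether a wrapper is worth building.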

Step 3: Pre-processing calculations

The Reddit example: “I made an MCP server that is also doing some ‘middleware-ish’ pre processing to associate some data and make some calculations on the server side.”

Instead of:

1. Agent fetches raw data (token cost)
2. Agent parses data (token cost)
3. Agent performs calculations (more tokens)
4. Agent reasons about results (more tokens)

The MCP server does:

1. Fetch data
2. Calculate
3. Return only final result
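As a concrete illustration of "fetch, calculate, return only the final result", here is a minimal sketch of a server-side reduction. The order records and the `summarize_orders` helper are hypothetical; the point is only that the raw list never reaches the model.

```python
from statistics import mean


def summarize_orders(orders: list[dict]) -> dict:
    """Reduce raw order records to the few numbers the agent needs."""
    if not orders:
        return {"order_count": 0, "total_spent": 0.0, "average_order": 0.0}
    amounts = [order["amount"] for order in orders]
    return {
        "order_count": len(amounts),
        "total_spent": round(sum(amounts), 2),
        "average_order": round(mean(amounts), 2),
    }


# Hypothetical raw records as a legacy API might return them
raw_orders = [
    {"id": 1, "amount": 40.0, "notes": "...", "audit_trail": []},
    {"id": 2, "amount": 60.0, "notes": "...", "audit_trail": []},
]
summary = summarize_orders(raw_orders)
print(summary)
```

Doing the arithmetic in code also sidesteps the small calculation mistakes that language models sometimes make on long lists of numbers.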

Step 4: Enabling smaller models

The most dramatic result from the Reddit comment: “I can run my app with a 4b local LLM now.”

This matters because:

+------------------+-------------------+-------------------+
| Model            | Context Window    | Cost              |
+------------------+-------------------+-------------------+
| Claude Opus      | 200K tokens       | $$$$              |
| Claude Sonnet    | 200K tokens       | $$$               |
| Claude Haiku     | 200K tokens       | $                 |
| Local 4B LLM     | 4K-8K tokens      | Free              |
+------------------+-------------------+-------------------+

With token-hungry approaches, you need the larger, more expensive models. With an efficient MCP layer, Claude Haiku or a small local LLM can be enough.

Why this matters: Practical implications

Cost reduction

Scenario: Agent needs to check 10 users' emails
WITHOUT MCP:
- 10 API calls x 2,000 tokens each = 20,000 tokens of API response
- Agent reasoning = 2,000 tokens
- Total: 22,000 tokens
- Cost at $3/M input tokens: $0.066
WITH MCP:
- 10 MCP calls x 100 tokens each = 1,000 tokens
- Agent reasoning = 2,000 tokens
- Total: 3,000 tokens
- Cost at $3/M input tokens: $0.009
Reduction: 86% cost savings
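Per-run costs this small only become interesting at volume, so here is a sketch that scales the same numbers up. The $3 per million input tokens comes from the scenario above; the 1,000 runs per day is purely an assumed figure for illustration.

```python
def input_cost_usd(tokens: int, price_per_million: float = 3.0) -> float:
    """Input-token cost in USD at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


RUNS_PER_DAY = 1_000  # assumption: illustrative volume, not a measured number

cost_without = input_cost_usd(22_000)
cost_with = input_cost_usd(3_000)
daily_saving = (cost_without - cost_with) * RUNS_PER_DAY
print(f"per run: ${cost_without:.3f} vs ${cost_with:.3f}")
print(f"saved per day at {RUNS_PER_DAY} runs: ${daily_saving:.2f}")
```

This only counts input tokens; output tokens are priced separately and are not affected by response filtering.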

Model selection flexibility

Token budget needed for workflow:
WITHOUT MCP: 22,000 tokens -> Too large for a 4K-8K local context; needs a large-context hosted model
WITH MCP: 3,000 tokens -> Can use Claude Haiku or local 4B LLM

Agent workflow depth

Context windows are finite. Every token saved on API responses is a token left over for reasoning.

8K context window:
WITHOUT MCP:
- API responses: 6,000 tokens
- Conversation history: 1,000 tokens
- Available for reasoning: 1,000 tokens (very limited)
WITH MCP:
- API responses: 600 tokens
- Conversation history: 1,000 tokens
- Available for reasoning: 6,400 tokens (much better)
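Another way to look at the same budget is to ask how many tool calls fit before the window runs out. The sketch below assumes you want to reserve a fixed number of tokens for reasoning; the 2,000-token reserve and the per-call sizes are illustrative values, and `max_calls_that_fit` is a name I made up.

```python
def max_calls_that_fit(
    context_window: int,
    history_tokens: int,
    reasoning_reserve: int,
    tokens_per_call: int,
) -> int:
    """How many API responses fit while keeping the reasoning reserve free."""
    available = context_window - history_tokens - reasoning_reserve
    if available <= 0:
        return 0
    return available // tokens_per_call


# 8K window, 1,000 tokens of history, 2,000 tokens reserved for reasoning
raw_calls = max_calls_that_fit(8_000, 1_000, 2_000, tokens_per_call=2_000)
filtered_calls = max_calls_that_fit(8_000, 1_000, 2_000, tokens_per_call=150)
print(raw_calls, filtered_calls)
```

Under these assumptions the raw API allows only a couple of calls per conversation, while the filtered responses leave room for dozens.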

Response time

The Reddit commenter noted: “This makes my app significantly faster”

WHY IT'S FASTER:
1. Fewer tokens to process = faster inference
2. Pre-computed results = less agent reasoning time
3. Consolidated calls = fewer network round trips
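Consolidation can help latency inside the server too: once the calls live in one tool, they do not have to run one after another. The sketch below uses `asyncio.gather` with a stand-in `fetch` coroutine that only sleeps, so the names and delays are placeholders rather than a real API client.

```python
import asyncio
import time


async def fetch(name: str, delay: float = 0.05) -> dict:
    """Stand-in for one legacy API call (hypothetical; it just sleeps)."""
    await asyncio.sleep(delay)
    return {"source": name}


async def fetch_all(names: list[str]) -> list[dict]:
    """Issue all calls concurrently and return results in input order."""
    return await asyncio.gather(*(fetch(name) for name in names))


names = ["tasks", "users", "timeline", "budget", "risks"]
start = time.perf_counter()
results = asyncio.run(fetch_all(names))
elapsed = time.perf_counter() - start
print(f"{len(results)} calls in {elapsed:.2f}s")
```

Run sequentially, five 50 ms calls would take about a quarter of a second; run concurrently, the total is close to the slowest single call. Whether this is safe depends on the legacy API tolerating parallel requests.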

Common mistakes: When MCP doesn’t help

MCP isn’t a silver bullet. I learned this the hard way.

Mistake 1: MCP server returns raw API responses

# WRONG: Just a passthrough
@server.tool()
async def get_user(user_id: str) -> dict:
    return await legacy_api.get_user(user_id)  # No filtering!

If the MCP server is just a passthrough, you gain nothing. The value comes from data transformation, not just wrapping.

Mistake 2: Over-engineering simple APIs

If an API already returns minimal, relevant data, adding MCP adds complexity without benefit.

# Unnecessary wrapping
@server.tool()
async def get_status() -> dict:
    # The original API already returns {"status": "ok"}
    return await simple_api.get_status()

Mistake 3: Poorly designed MCP tools

The Reddit skeptic was right about some MCP tools: poorly designed ones really do end up “eating a surprising amount of context.”

# WRONG: Verbose tool descriptions and schemas consume tokens
@server.tool()
async def get_data(params: dict) -> dict:
    """
    This is a very long description that explains in great detail
    all the various things this tool does and how to use it with
    many examples and edge cases and warnings and suggestions...
    """

Tool definitions themselves consume context: each tool's name, description, and input schema are passed to the model. Weigh that overhead against the data filtering savings.

Mistake 4: Ignoring tool definition overhead

Token cost breakdown:
MCP tool schema definition: 500 tokens
Data filtering savings: 1,800 tokens
Net savings: 1,300 tokens (worth it!)
But if:
MCP tool schema definition: 2,000 tokens
Data filtering savings: 1,800 tokens
Net savings: -200 tokens (not worth it!)
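The same trade-off can be written as a one-line formula. In this sketch I assume the tool definition is counted once per conversation while the filtering saving applies to every call; that is a simplification, and `net_savings` is just an illustrative helper.

```python
def net_savings(schema_tokens: int, saved_per_call: int, calls: int = 1) -> int:
    """Tokens saved after paying for the tool definition once.

    A negative result means the tool costs more context than it saves.
    """
    return saved_per_call * calls - schema_tokens


print(net_savings(500, 1_800))       # lean schema, single call
print(net_savings(2_000, 1_800))     # heavy schema, single call
print(net_savings(2_000, 1_800, 2))  # heavy schema, but called twice
```

Under that assumption, even a heavy schema pays for itself once the tool is called often enough, which is why rarely used tools with long descriptions are the worst offenders.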

How I applied this

After understanding these principles, I restructured my agent’s API access:

BEFORE:
Agent --> Legacy API 1 (2000 tokens) --> LLM Context
Agent --> Legacy API 2 (1500 tokens) --> LLM Context
Agent --> Legacy API 3 (3000 tokens) --> LLM Context
Agent --> Legacy API 4 (1800 tokens) --> LLM Context
Agent --> Legacy API 5 (2200 tokens) --> LLM Context
-----------
10,500 tokens total

AFTER:
Agent --> MCP Server (filters + aggregates) --> LLM Context
-----------
800 tokens total

Result: 92% token reduction. I can now run my agent on Claude Haiku instead of Sonnet, cutting costs by 20x.

Summary

In this post, I explained how MCP reduces token usage by acting as a smart middleware layer. The key points are:

  • MCP servers filter, transform, and aggregate data before it reaches your LLM’s context window
  • Legacy APIs often return 5-10x more data than needed
  • Pre-processing calculations saves both data tokens and reasoning tokens
  • Lower token usage makes smaller, cheaper models viable, which is where the dramatic cost savings come from
  • Poorly designed MCP tools can make things worse, not better

Next step: Audit your current API usage - which endpoints return the most data you don’t use? Those are your prime MCP candidates.
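A simple way to start that audit is to log, per endpoint, how many tokens come back and how many your agent actually uses, then sort by the difference. The sketch below assumes you have already collected such numbers; the endpoint names and figures are invented for illustration.

```python
def rank_endpoints(usage: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Sort endpoints by tokens returned but never used, largest first."""
    waste = {
        name: stats["returned_tokens"] - stats["used_tokens"]
        for name, stats in usage.items()
    }
    return sorted(waste.items(), key=lambda item: item[1], reverse=True)


# Hypothetical measurements from one agent workflow
usage = {
    "/users": {"returned_tokens": 2_000, "used_tokens": 150},
    "/status": {"returned_tokens": 20, "used_tokens": 20},
    "/projects": {"returned_tokens": 3_000, "used_tokens": 400},
}
ranking = rank_endpoints(usage)
for endpoint, wasted in ranking:
    print(f"{endpoint}: {wasted} wasted tokens")
```

Endpoints at the top of the list are the ones worth wrapping first; those at the bottom, like the status check here, are better left alone.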

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email.
