How to reduce token bloat in MCP servers

Problem

When I set up an MCP server for code execution, I hit this:

Context window: 200,000 tokens total
Tool definitions: 15,000 tokens
State history: 45,000 tokens
Response data: 80,000 tokens
Available space: 60,000 tokens

The conversation started dying after just a few exchanges. Every tool call added massive overhead, and the context window filled up way too fast.
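
As a quick sanity check, the breakdown above can be written out as arithmetic. This is a minimal sketch using the figures from the listing; `ContextBudget`, `usedTokens`, and `remainingTokens` are illustrative names, not part of any MCP SDK.

```typescript
// Illustrative only: these names are not part of any MCP SDK.
interface ContextBudget {
  windowSize: number;      // total context window, in tokens
  toolDefinitions: number; // tokens spent on tool schemas
  stateHistory: number;    // tokens spent on conversation state
  responseData: number;    // tokens spent on tool outputs
}

// Tokens already consumed by the three tracked categories.
function usedTokens(b: ContextBudget): number {
  return b.toolDefinitions + b.stateHistory + b.responseData;
}

// Tokens left for new turns.
function remainingTokens(b: ContextBudget): number {
  return b.windowSize - usedTokens(b);
}

const budget: ContextBudget = {
  windowSize: 200_000,
  toolDefinitions: 15_000,
  stateHistory: 45_000,
  responseData: 80_000,
};

console.log(usedTokens(budget));      // 140000
console.log(remainingTokens(budget)); // 60000
```

With these figures, 70% of the window is already consumed, and response data alone accounts for more than half of that.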

Environment

  • MCP server: TypeScript with SSE transport
  • LLM: Claude 3.5 Sonnet (200K context)
  • Use case: File system operations, code analysis
  • Alternative tested: CLI workflows

What happened?

I was building an AI coding assistant that needed to interact with the filesystem. I thought MCP would be perfect - it provides structured tool definitions, proper typing, and handles the protocol automatically.

Here’s my initial MCP server setup:

mcp-server.ts
// Standard MCP server with full tool definitions
const server = new MCPServer({
  name: "filesystem-server",
  version: "1.0.0",
  tools: [
    {
      name: "read_file",
      description: "Read a file from disk",
      inputSchema: {
        type: "object",
        properties: {
          path: { type: "string", description: "File path to read" },
          encoding: { type: "string", default: "utf-8" }
        }
      }
    },
    // ... 20 more tools with full descriptions
  ]
});

The problem became obvious when I inspected the token usage. Every tool definition, including all descriptions and schema details, gets injected into the context. The LLM sees thousands of tokens just to understand what tools are available, before any actual work happens.
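
One way to see where the tokens go is to estimate the cost of each tool definition before it ever reaches the model. The sketch below assumes the common rule of thumb of roughly four characters per token; a real tokenizer will give different numbers. `ToolDefinition`, `estimateTokens`, and `toolCostReport` are hypothetical names for illustration.

```typescript
// Assumed shape of a tool definition, mirroring the fields used in this post.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
}

// Rough heuristic: about 4 characters per token. Real tokenizers will differ.
function estimateTokens(value: unknown): number {
  return Math.ceil(JSON.stringify(value).length / 4);
}

// Returns per-tool estimates sorted from most to least expensive.
function toolCostReport(
  tools: ToolDefinition[]
): { name: string; tokens: number }[] {
  return tools
    .map((t) => ({ name: t.name, tokens: estimateTokens(t) }))
    .sort((a, b) => b.tokens - a.tokens);
}

const readFile: ToolDefinition = {
  name: "read_file",
  description: "Read a file from disk",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "File path to read" },
      encoding: { type: "string", default: "utf-8" },
    },
  },
};

console.log(toolCostReport([readFile]));
```

Sorting by cost makes it easy to spot which definitions are worth trimming first.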

On Reddit, someone posted “CLI is all you need - do we really need MCPs?” They argued that simple CLI commands with output filtering are more efficient. I initially dismissed it as oversimplified, but the token numbers were hard to ignore.

How to solve it?

I tried a few approaches to reduce the bloat.

First attempt: Minimal tool definitions

minimal-mcp.ts
// Cut descriptions to bare minimum
const tools = [
  {
    name: "rf",
    description: "Read file",
    inputSchema: {
      type: "object",
      properties: { p: { type: "string" } }
    }
  }
];

This reduced tool definition tokens from 15,000 to about 3,000. But the LLM struggled to understand what the tools actually did. Ambiguous names led to wrong tool selections.
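
A possible middle ground, not measured in the experiments here, is to keep tool and parameter names readable and trim only the prose. The sketch below keeps the first sentence of each description and drops per-parameter descriptions. `minifyTool` and `firstSentence` are hypothetical helpers, and the schema shape is an assumption based on the listing above.

```typescript
interface ParamSchema {
  type: string;
  description?: string;
  default?: unknown;
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: { type: "object"; properties: Record<string, ParamSchema> };
}

// Keep only the first sentence of a description (or all of it if no
// sentence-ending punctuation is found).
function firstSentence(text: string): string {
  const match = text.match(/^.*?[.!?](?=\s|$)/);
  return (match ? match[0] : text).trim();
}

function minifyTool(tool: ToolDefinition): ToolDefinition {
  const properties: Record<string, ParamSchema> = {};
  for (const [key, schema] of Object.entries(tool.inputSchema.properties)) {
    // Drop per-parameter descriptions; keep the type and any default.
    properties[key] =
      schema.default === undefined
        ? { type: schema.type }
        : { type: schema.type, default: schema.default };
  }
  return {
    name: tool.name, // unchanged: readable names help tool selection
    description: firstSentence(tool.description),
    inputSchema: { type: "object", properties },
  };
}

const minified = minifyTool({
  name: "read_file",
  description: "Read a file from disk. Supports text and binary encodings.",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "File path to read" },
      encoding: { type: "string", default: "utf-8" },
    },
  },
});

console.log(minified.description); // "Read a file from disk."
```

Whether this preserves enough meaning for reliable tool selection would need to be checked against your own tool set.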

Then I tried: Lazy tool loading

lazy-loader.ts
class ToolLoader {
  private allTools: Tool[] = [];
  private activeTools: Set<string> = new Set();

  async loadTools(context: string): Promise<Tool[]> {
    // Only load tools relevant to current context
    if (context.includes("file")) {
      this.activeTools.add("read_file");
      this.activeTools.add("write_file");
    }
    // Return only active tools
    return this.allTools.filter(t => this.activeTools.has(t.name));
  }
}

This worked better. Token usage dropped by about 40%. But it was still high.
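
The same idea can be made data-driven so that adding a tool group does not require another `if` branch. This is a sketch under assumptions: the `TOOL_GROUPS` mapping and the tool names other than `read_file` and `write_file` are hypothetical. Unlike the class above, it recomputes the active set on every call, so tools selected for an earlier message do not accumulate.

```typescript
interface Tool {
  name: string;
  description: string;
}

// Hypothetical keyword-to-tool mapping; tune this for your own tool set.
const TOOL_GROUPS: Record<string, string[]> = {
  file: ["read_file", "write_file"],
  search: ["grep_files"],
  git: ["git_status", "git_diff"],
};

// Returns only the tools whose group keyword appears in the context string.
function selectTools(allTools: Tool[], context: string): Tool[] {
  const lowered = context.toLowerCase();
  const active = new Set<string>();
  for (const [keyword, names] of Object.entries(TOOL_GROUPS)) {
    if (lowered.includes(keyword)) {
      names.forEach((n) => active.add(n));
    }
  }
  return allTools.filter((t) => active.has(t.name));
}

const allTools: Tool[] = [
  { name: "read_file", description: "Read a file from disk" },
  { name: "write_file", description: "Write a file to disk" },
  { name: "grep_files", description: "Search file contents" },
  { name: "git_status", description: "Show working tree status" },
];

console.log(selectTools(allTools, "Please read the FILE").map((t) => t.name));
// [ 'read_file', 'write_file' ]
```

Plain substring matching is crude; it is only meant to show the shape of the selection step.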

So I looked at what Cloudflare Code Mode and Anthropic Code Execution do differently.

The pattern I found: Pre-process before the LLM sees it

Traditional MCP:

    LLM       →    MCP Server    →    File System
     ↑                 ↑
Full context    Raw output (all tokens)

Optimized CLI:

    CLI       →      Filter      →    LLM
     ↓                 ↓
 Raw output     Clean input (only needed tokens)
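
The filter step in that flow can be as simple as keeping matching lines and capping how many are returned. This is a minimal sketch, assuming line-oriented output; `filterOutput` and `FilterOptions` are hypothetical names.

```typescript
interface FilterOptions {
  pattern?: RegExp; // keep only lines matching this pattern (optional)
  maxLines: number; // hard cap on the number of lines returned
}

function filterOutput(raw: string, opts: FilterOptions): string {
  const lines = raw.split("\n");
  // Rebuild the pattern without the stateful g/y flags so that repeated
  // test() calls behave consistently across lines.
  const pattern = opts.pattern
    ? new RegExp(opts.pattern.source, opts.pattern.flags.replace(/[gy]/g, ""))
    : undefined;
  const matched = pattern ? lines.filter((line) => pattern.test(line)) : lines;
  const kept = matched.slice(0, opts.maxLines);
  const dropped = matched.length - kept.length;
  if (dropped > 0) {
    // Tell the model that something was left out instead of silently truncating.
    kept.push(`[... ${dropped} more matching lines omitted]`);
  }
  return kept.join("\n");
}

const rawLog = "a error\nb ok\nc error\nd error";
console.log(filterOutput(rawLog, { pattern: /error/, maxLines: 2 }));
// a error
// c error
// [... 1 more matching lines omitted]
```

The omission marker costs a few tokens but lets the model ask for more when it needs to.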

I tried a hybrid approach:

hybrid-mcp.ts
class HybridMCP {
  constructor(
    private mcpClient: MCPClient,
    private cliRunner: CLIRunner
  ) {}

  async readFile(path: string): Promise<string> {
    // Single-quote the path so spaces and shell metacharacters are safe
    const quoted = "'" + path.replace(/'/g, "'\\''") + "'";
    // First try CLI with filtering (only the first 100 lines are read)
    const cliResult = await this.cliRunner.run(`head -100 ${quoted}`);
    if (cliResult.success && cliResult.tokens < 5000) {
      return cliResult.output; // Small enough, use CLI
    }
    // Fall back to MCP for large files
    const mcpResult = await this.mcpClient.callTool("read_file", { path });
    return this.compressOutput(mcpResult.data);
  }

  private compressOutput(data: string): string {
    // Keep the head and tail of large outputs and mark the omitted middle
    if (data.length > 10000) {
      const omitted = data.length - 4000;
      return `${data.slice(0, 2000)}\n\n[... ${omitted} chars omitted ...]\n\n${data.slice(-2000)}`;
    }
    return data;
  }
}

This reduced tokens by about 70%. Small operations went through CLI with minimal overhead. Large files used MCP but with compressed outputs.

Server-side optimization

The biggest win came from making the MCP server token-aware:

token-aware-server.ts
interface TokenStats {
  toolDefinitions: number;
  currentState: number;
  lastResponse: number;
}

class TokenAwareMCP {
  private stats: TokenStats = {
    toolDefinitions: 0,
    currentState: 0,
    lastResponse: 0
  };

  async handleRequest(request: MCPRequest): Promise<MCPResponse> {
    // Check if we're hitting token limits
    const totalUsed = this.getTotalUsed();
    if (totalUsed > 150000) {
      // Compaction strategy
      await this.compactContext();
    }
    // Track token usage (processRequest, not shown, dispatches the tool call)
    const response = await this.processRequest(request);
    this.stats.lastResponse = this.estimateTokens(response);
    return response;
  }

  private getTotalUsed(): number {
    // Sum of all tracked categories
    const { toolDefinitions, currentState, lastResponse } = this.stats;
    return toolDefinitions + currentState + lastResponse;
  }

  private async compactContext(): Promise<void> {
    // Summarize old interactions
    // Remove unused tools
    // Cache common operations
  }

  private estimateTokens(data: unknown): number {
    // Rough estimation: about 4 characters per token, rounded up
    return Math.ceil(JSON.stringify(data).length / 4);
  }
}

Client-side management

The client also needs to be smart about what it sends:

context-manager.ts
class ContextManager {
  private maxTokens: number = 180000;
  private currentTokens: number = 0;

  shouldIncludeTool(tool: Tool): boolean {
    // estimateToolTokens is not shown; any per-tool estimate works here
    const toolTokens = this.estimateToolTokens(tool);
    // Include the tool only if it still fits in the budget
    return this.currentTokens + toolTokens <= this.maxTokens;
  }

  pruneHistory(history: Message[]): Message[] {
    // Keep recent messages, summarize older ones
    const recent = history.slice(-10);
    const old = history.slice(0, -10);
    // summarize is not shown; it must return Message[] for the spread below
    const summary = this.summarize(old);
    return [...summary, ...recent];
  }
}
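
For reference, here is a self-contained sketch of the pruning step with a stand-in summarizer. The `summarize` function below is a placeholder that just truncates each old message; a real system would more likely ask an LLM for the summary. The `Message` shape is an assumption.

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Placeholder summarizer: keeps the first 80 characters of each old message.
function summarize(old: Message[]): Message[] {
  if (old.length === 0) return [];
  const digest = old
    .map((m) => `${m.role}: ${m.content.slice(0, 80)}`)
    .join("\n");
  return [
    {
      role: "system",
      content: `Summary of ${old.length} earlier messages:\n${digest}`,
    },
  ];
}

// Keeps the last `keepRecent` messages verbatim and collapses the rest
// into a single summary message.
function pruneHistory(history: Message[], keepRecent = 10): Message[] {
  if (history.length <= keepRecent) return history;
  const cut = history.length - keepRecent;
  const old = history.slice(0, cut);
  const recent = history.slice(cut);
  return [...summarize(old), ...recent];
}

const history: Message[] = Array.from({ length: 15 }, (_, i) => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: `msg ${i}`,
}));

console.log(pruneHistory(history).length); // 11
```

Fifteen messages become eleven: one summary plus the ten most recent.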

Alternative: Cloudflare Code Mode approach

Cloudflare’s approach runs code at the edge with minimal context transfer:

Local Agent  →  Cloudflare Worker  →  Result
                  Execute code          Clean output (no state)
                  Return result

The key insight: Don’t send context back and forth. Send a small request, get a small result.

I implemented this pattern:

edge-executor.ts
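class EdgeExecutor {
  async executeCode(code: string, inputs: any): Promise<any> {
    // Minimal context sent to edge: no history, no state
    const payload = { code, inputs };
    // Edge returns just the result
    const response = await fetch("https://my-worker.cloudflare.com/execute", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload)
    });
    // Surface HTTP errors instead of trying to parse an error page as JSON
    if (!response.ok) {
      throw new Error(`Edge execution failed with status ${response.status}`);
    }
    return response.json();
  }
}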

This reduced per-request tokens from ~20,000 to ~500. The tradeoff: no state carries over between calls, so this only suits simpler, self-contained operations.

Anthropic Code Execution patterns

Anthropic’s Code Execution feature has smart context management:

  1. Auto-compaction of terminal history
  2. Prompt caching for repeated operations
  3. Memory-aware tool selection

I replicated some of this:

claude-style-executor.ts
class SmartExecutor {
  private history: TerminalEntry[] = [];
  private summaryCache: Map<string, string> = new Map();

  async execute(command: string): Promise<string> {
    const result = await this.runCommand(command);
    // Auto-summarize long outputs
    if (result.length > 5000) {
      const summary = await this.summarize(result);
      this.history.push({
        command,
        output: summary,
        compressed: true
      });
      return summary;
    }
    this.history.push({ command, output: result });
    return result;
  }

  private async compactHistory(): Promise<void> {
    if (this.history.length > 20) {
      // Keep last 10 entries
      const recent = this.history.slice(-10);
      // Summarize the rest
      const old = this.history.slice(0, -10);
      const summary = await this.summarizeBatch(old);
      this.history = [
        { type: "summary", content: summary },
        ...recent
      ];
    }
  }
}
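
Prompt caching, the second item in the list above, is not covered by the executor. As I understand Anthropic's prompt caching documentation, a `cache_control` field of type `ephemeral` on the last tool definition marks the whole tool list as a cacheable prefix. The sketch below only builds the tool array and makes no network call; `withCachedTools` is a hypothetical helper, and the exact caching behavior and limits should be checked against the current API docs.

```typescript
// Tool shape as used by the Anthropic Messages API (snake_case fields).
interface AnthropicTool {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: { type: "ephemeral" };
}

// Returns a copy of the tool list with a cache breakpoint on the last tool.
// The input array and its objects are not mutated.
function withCachedTools(tools: AnthropicTool[]): AnthropicTool[] {
  return tools.map((tool, i): AnthropicTool =>
    i === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } }
      : tool
  );
}

const readFileTool: AnthropicTool = {
  name: "read_file",
  description: "Read a file from disk",
  input_schema: { type: "object", properties: { path: { type: "string" } } },
};

const writeFileTool: AnthropicTool = {
  name: "write_file",
  description: "Write a file to disk",
  input_schema: { type: "object", properties: { path: { type: "string" } } },
};

const cachedTools = withCachedTools([readFileTool, writeFileTool]);
console.log(JSON.stringify(cachedTools[1]?.cache_control)); // {"type":"ephemeral"}
```

Caching does not shrink the context window usage, but it can reduce the cost and latency of resending the same tool definitions on every turn.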

Results

Here’s what I measured:

Approach              Tokens per request   Context fill-up time   Accuracy
Original MCP          20,000               5 turns                High
Minimal definitions   8,000                12 turns               Medium
Lazy loading          6,000                15 turns               High
Hybrid CLI-MCP        3,000                30 turns               High
Edge-only             500                  No limit               Medium

The hybrid approach gave the best balance: significant token reduction while maintaining accuracy for complex operations.

The reason

I think the key reasons for token bloat in MCP are:

  1. Full schema injection: Every tool’s complete schema goes into context
  2. Stateful protocol: Conversation history accumulates naturally
  3. No selective loading: All tools are always available
  4. Raw data transfer: Full outputs returned without filtering

The solutions work by:

  • Reducing what gets sent (compression, filtering)
  • Loading only what’s needed (lazy loading)
  • Compacting history (summarization)
  • Alternative architectures for simple cases (CLI, edge)

Best practices

Based on my experiments, here’s what I recommend:

┌─────────────────────────────────────────────────────────┐
│                   MCP Token Checklist                   │
├─────────────────────────────────────────────────────────┤
│ [ ] Track token usage on every request                  │
│ [ ] Implement tool description minification             │
│ [ ] Load tools lazily based on context                  │
│ [ ] Compress large outputs before returning             │
│ [ ] Auto-compact terminal history                       │
│ [ ] Cache repeated operations                           │
│ [ ] Consider CLI for simple read-only operations        │
│ [ ] Use edge execution for stateless operations         │
└─────────────────────────────────────────────────────────┘

When to use each approach

  • Use MCP when: Complex operations, stateful interactions, type safety matters
  • Use CLI when: Simple read operations, large outputs that need filtering
  • Use Edge/Code Mode when: Stateless execution, minimal context needs
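
The three rules above can be expressed as a small routing function. This is only a sketch of one way to encode them; the `Operation` fields and the order of the checks are assumptions, not something measured in this post.

```typescript
type Approach = "mcp" | "cli" | "edge";

// Hypothetical description of a single operation the agent wants to run.
interface Operation {
  stateful: boolean;         // depends on earlier calls or session state
  needsTypedSchema: boolean; // structured, validated arguments matter
  readOnly: boolean;         // does not modify anything
}

function chooseApproach(op: Operation): Approach {
  // Stateful or schema-dependent work stays on MCP.
  if (op.stateful || op.needsTypedSchema) return "mcp";
  // Simple reads go through the CLI, where output can be filtered.
  if (op.readOnly) return "cli";
  // Everything else is stateless execution, a fit for edge/code mode.
  return "edge";
}

console.log(chooseApproach({ stateful: false, needsTypedSchema: false, readOnly: true }));
// cli
```

Checking statefulness first means a stateful read still goes to MCP, which matches the intent of the first bullet.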

Summary

In this post, I showed how to reduce token bloat in MCP servers. The key point is that MCP doesn’t have to be bloated - you can combine smart server-side optimizations with alternative architectures. By implementing lazy loading, response compression, and context management, I reduced token usage by 70% while maintaining functionality. For simple operations, CLI-based approaches or edge execution can be even more efficient.

The Reddit poster was partly right - CLI is often enough. But MCP still has its place for complex, stateful interactions. The real solution isn’t choosing one over the other, but using the right tool for each job.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
