How to reduce token bloat in MCP servers
Problem
When I set up an MCP server for code execution, I hit this:
    Context window: 200,000 tokens used
    Tool definitions: 15,000 tokens
    State history: 45,000 tokens
    Response data: 80,000 tokens
    Available space: 60,000 tokens

The conversation started dying after just a few exchanges. Every tool call added massive overhead, and the context window filled up way too fast.
Environment
- MCP server: TypeScript with SSE transport
- LLM: Claude 3.5 Sonnet (200K context)
- Use case: File system operations, code analysis
- Alternative tested: CLI workflows
What happened?
I was building an AI coding assistant that needed to interact with the filesystem. I thought MCP would be perfect - it provides structured tool definitions, proper typing, and handles the protocol automatically.
Here’s my initial MCP server setup:
    // Standard MCP server with full tool definitions
    const server = new MCPServer({
      name: "filesystem-server",
      version: "1.0.0",
      tools: [
        {
          name: "read_file",
          description: "Read a file from disk",
          inputSchema: {
            type: "object",
            properties: {
              path: { type: "string", description: "File path to read" },
              encoding: { type: "string", default: "utf-8" }
            }
          }
        },
        // ... 20 more tools with full descriptions
      ]
    });

The problem became obvious when I inspected the token usage. Every tool definition, including all descriptions and schema details, gets injected into the context. The LLM sees thousands of tokens just to understand what tools are available, before any actual work happens.
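To see which definitions cost the most, a rough per-tool estimate is usually enough. The sketch below is only an approximation: it assumes the 4-characters-per-token heuristic used later in this post, and the `ToolDefinition` type, the `rankToolsByCost` helper, and the `list_dir` tool are hypothetical names made up for illustration. A real tokenizer will give different numbers.

```typescript
// Hypothetical shape; mirrors the fields used in the server setup in this post.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
}

// Rough heuristic of about 4 characters per token. Real tokenizers will differ.
function estimateTokens(value: unknown): number {
  return Math.ceil(JSON.stringify(value).length / 4);
}

// Returns per-tool estimates, most expensive definition first.
function rankToolsByCost(tools: ToolDefinition[]): { name: string; tokens: number }[] {
  return tools
    .map((tool) => ({ name: tool.name, tokens: estimateTokens(tool) }))
    .sort((a, b) => b.tokens - a.tokens);
}

const exampleTools: ToolDefinition[] = [
  {
    // Hypothetical second tool, only here to have something to compare against.
    name: "list_dir",
    description: "List a directory",
    inputSchema: { type: "object", properties: { path: { type: "string" } } },
  },
  {
    name: "read_file",
    description: "Read a file from disk",
    inputSchema: {
      type: "object",
      properties: {
        path: { type: "string", description: "File path to read" },
        encoding: { type: "string", default: "utf-8" },
      },
    },
  },
];

const ranked = rankToolsByCost(exampleTools);
const totalTokens = ranked.reduce((sum, tool) => sum + tool.tokens, 0);
console.log(ranked, totalTokens);
```

Running this against the full tool list shows where trimming pays off before any descriptions are touched.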
On Reddit, someone posted “CLI is all you need - do we really need MCPs?” They argued that simple CLI commands with output filtering are more efficient. I initially dismissed it as oversimplified, but the token numbers were hard to ignore.
How to solve it?
I tried a few approaches to reduce the bloat.
First attempt: Minimal tool definitions
    // Cut descriptions to bare minimum
    const tools = [
      {
        name: "rf",
        description: "Read file",
        inputSchema: {
          type: "object",
          properties: { p: { type: "string" } }
        }
      }
    ];

This reduced tool definition tokens from 15,000 to about 3,000. But the LLM struggled to understand what the tools actually did. Ambiguous names led to wrong tool selections.
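Since the ambiguous names were what hurt tool selection, a gentler variant is to trim only the descriptive text and leave tool and parameter names readable. This is a sketch of that idea, not something I measured: `minifyTool` and `firstSentence` are hypothetical helpers, and the sentence splitting is deliberately naive (it cuts at the first `.`, `!`, or `?`).

```typescript
interface PropertySchema {
  type: string;
  description?: string;
  default?: unknown;
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, PropertySchema>;
    required?: string[];
  };
}

// Keep only the first sentence of a description (naive split on ., ! or ?).
function firstSentence(text: string): string {
  const end = text.search(/[.!?]/);
  return (end === -1 ? text : text.slice(0, end + 1)).trim();
}

// Drops per-property descriptions and shortens the tool description,
// but keeps the tool name and parameter names untouched.
function minifyTool(tool: ToolDefinition): ToolDefinition {
  const properties: Record<string, PropertySchema> = {};
  for (const [key, schema] of Object.entries(tool.inputSchema.properties)) {
    const slim: PropertySchema = { type: schema.type };
    if (schema.default !== undefined) {
      slim.default = schema.default;
    }
    properties[key] = slim;
  }
  return {
    ...tool,
    description: firstSentence(tool.description),
    inputSchema: { ...tool.inputSchema, properties },
  };
}

const verbose: ToolDefinition = {
  name: "read_file",
  description: "Read a file from disk. Supports text encodings such as utf-8.",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "File path to read" },
      encoding: { type: "string", default: "utf-8" },
    },
  },
};

const minified = minifyTool(verbose);
console.log(JSON.stringify(minified));
```

The saving per tool is smaller than with two-letter names, but the model still sees `read_file` and `path`.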
Then I tried: Lazy tool loading
    class ToolLoader {
      private allTools: Tool[] = [];
      private activeTools: Set<string> = new Set();

      async loadTools(context: string): Promise<Tool[]> {
        // Only load tools relevant to current context
        if (context.includes("file")) {
          this.activeTools.add("read_file");
          this.activeTools.add("write_file");
        }
        // Return only active tools
        return this.allTools.filter(t => this.activeTools.has(t.name));
      }
    }

This worked better. Token usage dropped by about 40%. But it was still high.
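The hardcoded `"file"` check generalizes to a small keyword table. The version below is a minimal sketch under a few assumptions: the `TOOL_GROUPS` mapping, the `grep_code` tool, and the `selectTools` function are hypothetical, and unlike the class above it is stateless, so it recomputes the active set on every call instead of accumulating it.

```typescript
interface Tool {
  name: string;
  description: string;
}

// Hypothetical mapping from trigger keywords to the tools they should activate.
const TOOL_GROUPS: Record<string, string[]> = {
  file: ["read_file", "write_file"],
  search: ["grep_code"],
};

// Returns only the tools whose trigger keyword appears in the context text.
function selectTools(allTools: Tool[], context: string): Tool[] {
  const text = context.toLowerCase();
  const active = new Set<string>();
  for (const [keyword, names] of Object.entries(TOOL_GROUPS)) {
    if (text.includes(keyword)) {
      for (const name of names) {
        active.add(name);
      }
    }
  }
  return allTools.filter((tool) => active.has(tool.name));
}

const allTools: Tool[] = [
  { name: "read_file", description: "Read a file from disk" },
  { name: "write_file", description: "Write a file to disk" },
  { name: "grep_code", description: "Search the codebase" },
];

const forFileTask = selectTools(allTools, "Please open the FILE config.json");
console.log(forFileTask.map((tool) => tool.name));
```

Substring matching is crude (a word like "profile" also triggers the `file` group), so treat the keyword table as a starting point rather than a classifier.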
So I looked at what Cloudflare Code Mode and Anthropic Code Execution do differently.
The pattern I found: Pre-process before the LLM sees it
Traditional MCP:

    LLM → MCP Server → File System
     ↑         ↑
    Full context   Raw output (all tokens)

Optimized CLI:

    CLI → Filter → LLM
     ↓              ↓
    Raw output     Clean input (only needed tokens)

I tried a hybrid approach:
    class HybridMCP {
      constructor(
        private mcpClient: MCPClient,
        private cliRunner: CLIRunner
      ) {}

      async readFile(path: string): Promise<string> {
        // First try CLI with filtering. The path is single-quoted so that
        // spaces or shell metacharacters in it cannot alter the command.
        const quoted = "'" + path.replace(/'/g, "'\\''") + "'";
        const cliResult = await this.cliRunner.run(`head -n 100 ${quoted}`);
        if (cliResult.success && cliResult.tokens < 5000) {
          return cliResult.output; // Small enough, use CLI
        }

        // Fall back to MCP for large files
        const mcpResult = await this.mcpClient.callTool("read_file", { path });
        return this.compressOutput(mcpResult.data);
      }

      private compressOutput(data: string): string {
        // Truncate to the first and last 2000 chars to reduce tokens
        if (data.length > 10000) {
          const omitted = data.length - 4000;
          return `${data.slice(0, 2000)}\n\n[... ${omitted} chars omitted ...]\n\n${data.slice(-2000)}`;
        }
        return data;
      }
    }

This reduced tokens by about 70%. Small operations went through CLI with minimal overhead. Large files used MCP but with compressed outputs.
Server-side optimization
The biggest win came from making the MCP server token-aware:
    interface TokenStats {
      toolDefinitions: number;
      currentState: number;
      lastResponse: number;
    }

    class TokenAwareMCP {
      private stats: TokenStats = { toolDefinitions: 0, currentState: 0, lastResponse: 0 };

      async handleRequest(request: MCPRequest): Promise<MCPResponse> {
        // Check if we're hitting token limits
        const totalUsed = this.getTotalUsed();

        if (totalUsed > 150000) {
          // Compaction strategy
          await this.compactContext();
        }

        // Track token usage (processRequest is not shown here)
        const response = await this.processRequest(request);
        this.stats.lastResponse = this.estimateTokens(response);

        return response;
      }

      private getTotalUsed(): number {
        // Sum of everything currently tracked
        return this.stats.toolDefinitions + this.stats.currentState + this.stats.lastResponse;
      }

      private async compactContext(): Promise<void> {
        // Summarize old interactions
        // Remove unused tools
        // Cache common operations
      }

      private estimateTokens(data: any): number {
        // Rough estimation based on character count (about 4 chars per token)
        return Math.ceil((JSON.stringify(data) ?? "").length / 4);
      }
    }

Client-side management
The client also needs to be smart about what it sends:
    class ContextManager {
      private maxTokens: number = 180000;
      private currentTokens: number = 0;

      shouldIncludeTool(tool: Tool): boolean {
        // Include the tool only if it still fits under the limit
        const toolTokens = this.estimateToolTokens(tool);
        return this.currentTokens + toolTokens <= this.maxTokens;
      }

      pruneHistory(history: Message[]): Message[] {
        // Keep recent messages, summarize older ones
        const recent = history.slice(-10);
        const old = history.slice(0, -10);
        if (old.length === 0) {
          return recent; // Nothing to summarize yet
        }
        const summary = this.summarize(old);

        return [...summary, ...recent];
      }
    }

Alternative: Cloudflare Code Mode approach
Cloudflare’s approach runs code at the edge with minimal context transfer:
    Local Agent → Cloudflare Worker → Result
                        ↓
                   Execute code
                        ↓
                   Return result
                        ↓
               Clean output (no state)

The key insight: Don’t send context back and forth. Send a small request, get a small result.
I implemented this pattern:
    class EdgeExecutor {
      async executeCode(code: string, inputs: any): Promise<any> {
        // Minimal context sent to edge: no history, no state
        const payload = { code, inputs };

        // Edge returns just the result
        const response = await fetch("https://my-worker.cloudflare.com/execute", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(payload)
        });

        // Surface HTTP errors instead of parsing an error page as a result
        if (!response.ok) {
          throw new Error(`Edge execution failed with status ${response.status}`);
        }

        return response.json();
      }
    }

This reduced per-request tokens from ~20,000 to ~500. The tradeoff: less statefulness and less support for complex operations.
Anthropic Code Execution patterns
Anthropic’s Code Execution feature has smart context management:
- Auto-compaction of terminal history
- Prompt caching for repeated operations
- Memory-aware tool selection
I replicated some of this:
    class SmartExecutor {
      private history: TerminalEntry[] = [];
      private summaryCache: Map<string, string> = new Map();

      async execute(command: string): Promise<string> {
        const result = await this.runCommand(command);

        // Auto-summarize long outputs
        if (result.length > 5000) {
          const summary = await this.summarize(result);
          this.history.push({ command, output: summary, compressed: true });
          await this.compactHistory();
          return summary;
        }

        this.history.push({ command, output: result });
        await this.compactHistory();
        return result;
      }

      private async compactHistory(): Promise<void> {
        if (this.history.length > 20) {
          // Keep last 10 entries
          const recent = this.history.slice(-10);
          // Summarize the rest
          const old = this.history.slice(0, -10);
          const summary = await this.summarizeBatch(old);

          // Store the summary in the same shape as a regular entry
          this.history = [
            { command: "[summary of earlier commands]", output: summary, compressed: true },
            ...recent
          ];
        }
      }
    }

Results
Here’s what I measured:
| Approach | Tokens per request | Context fill-up time | Accuracy |
|---|---|---|---|
| Original MCP | 20,000 | 5 turns | High |
| Minimal definitions | 8,000 | 12 turns | Medium |
| Lazy loading | 6,000 | 15 turns | High |
| Hybrid CLI-MCP | 3,000 | 30 turns | High |
| Edge-only | 500 | No limit | Medium |
The hybrid approach gave the best balance: significant token reduction while maintaining accuracy for complex operations.
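To put the per-request column in relative terms, the reduction against the original MCP baseline can be computed directly from the table. This is plain arithmetic on the numbers above; `reductionPercent` is a hypothetical helper name, and the figures cover tokens per request only, not total usage over a session.

```typescript
// Tokens per request, copied from the results table.
const tokensPerRequest: Record<string, number> = {
  "Original MCP": 20000,
  "Minimal definitions": 8000,
  "Lazy loading": 6000,
  "Hybrid CLI-MCP": 3000,
  "Edge-only": 500,
};

const baseline = 20000; // Original MCP

// Percentage reduction relative to the baseline, rounded to one decimal place.
function reductionPercent(tokens: number, base: number): number {
  return Math.round(((base - tokens) / base) * 1000) / 10;
}

for (const [approach, tokens] of Object.entries(tokensPerRequest)) {
  console.log(`${approach}: ${reductionPercent(tokens, baseline)}% fewer tokens per request`);
}
```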
The reason
I think the key reasons for token bloat in MCP are:
- Full schema injection: Every tool’s complete schema goes into context
- Stateful protocol: Conversation history accumulates naturally
- No selective loading: All tools are always available
- Raw data transfer: Full outputs returned without filtering
The solutions work by:
- Reducing what gets sent (compression, filtering)
- Loading only what’s needed (lazy loading)
- Compacting history (summarization)
- Alternative architectures for simple cases (CLI, edge)
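The history-compaction idea can also be driven by a token budget instead of a fixed message count. The following is a sketch under stated assumptions: `pruneToBudget` is a hypothetical function, token counts use the rough 4-characters-per-token heuristic, and the "summary" is just a placeholder line where a real implementation would call a summarizer. The placeholder's own tokens are not counted against the budget.

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Rough heuristic of about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the newest messages that fit within the budget and replaces
// everything older with a single placeholder message.
function pruneToBudget(history: Message[], budget: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  let dropped = 0;
  let full = false;

  // Walk from newest to oldest.
  for (const message of [...history].reverse()) {
    const cost = estimateTokens(message.content);
    if (!full && used + cost <= budget) {
      kept.unshift(message);
      used += cost;
    } else {
      // Once one message does not fit, drop all older ones too,
      // so the kept history stays contiguous.
      full = true;
      dropped += 1;
    }
  }

  if (dropped === 0) {
    return kept;
  }
  const placeholder: Message = {
    role: "system",
    content: `[${dropped} earlier messages omitted]`,
  };
  return [placeholder, ...kept];
}

const history: Message[] = [
  { role: "user", content: "a".repeat(40) },
  { role: "assistant", content: "b".repeat(40) },
  { role: "user", content: "c".repeat(40) },
  { role: "assistant", content: "d".repeat(40) },
];

// Each message is about 10 tokens, so a budget of 25 keeps the last two.
const pruned = pruneToBudget(history, 25);
console.log(pruned.map((message) => message.content));
```

A budget tracks what actually fills the context window, so one very large tool result cannot crowd out the limit the way it can with a "last 10 messages" rule.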
Best practices
Based on my experiments, here’s what I recommend:
┌─────────────────────────────────────────────────────────┐
│ MCP Token Checklist                                     │
├─────────────────────────────────────────────────────────┤
│ [ ] Track token usage on every request                  │
│ [ ] Implement tool description minification             │
│ [ ] Load tools lazily based on context                  │
│ [ ] Compress large outputs before returning             │
│ [ ] Auto-compact terminal history                       │
│ [ ] Cache repeated operations                           │
│ [ ] Consider CLI for simple read-only operations        │
│ [ ] Use edge execution for stateless operations         │
└─────────────────────────────────────────────────────────┘

When to use each approach
- Use MCP when: Complex operations, stateful interactions, type safety matters
- Use CLI when: Simple read operations, large outputs that need filtering
- Use Edge/Code Mode when: Stateless execution, minimal context needs
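The rules of thumb above can be written down as a small routing function. This is one possible reading of them, not a definitive policy: the `Operation` fields, the `chooseApproach` name, and the 10,000-character threshold for "large output" are all assumptions made for this sketch.

```typescript
type Approach = "mcp" | "cli" | "edge";

// Hypothetical description of an operation the agent is about to run.
interface Operation {
  stateful: boolean;            // depends on earlier calls or session state
  needsTypedArgs: boolean;      // benefits from schema validation
  readOnly: boolean;            // does not modify anything
  expectedOutputChars: number;  // rough size of the raw output
}

// Assumed cutoff for "large output that needs filtering".
const LARGE_OUTPUT_CHARS = 10000;

function chooseApproach(op: Operation): Approach {
  // Complex, stateful, or type-sensitive work stays on MCP.
  if (op.stateful || op.needsTypedArgs) {
    return "mcp";
  }
  // Simple reads and large outputs go through the CLI, where they can be filtered first.
  if (op.readOnly || op.expectedOutputChars > LARGE_OUTPUT_CHARS) {
    return "cli";
  }
  // Whatever is left is stateless execution with small context needs.
  return "edge";
}

const readLog: Operation = {
  stateful: false,
  needsTypedArgs: false,
  readOnly: true,
  expectedOutputChars: 50000,
};
console.log(chooseApproach(readLog));
```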
Summary
In this post, I showed how to reduce token bloat in MCP servers. The key point is that MCP doesn’t have to be bloated - you can combine smart server-side optimizations with alternative architectures. By implementing lazy loading, response compression, and context management, I reduced token usage by 70% while maintaining functionality. For simple operations, CLI-based approaches or edge execution can be even more efficient.
The Reddit poster was partly right - CLI is often enough. But MCP still has its place for complex, stateful interactions. The real solution isn’t choosing one over the other, but using the right tool for each job.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.