How to reduce token bloat in MCP servers

Mar 3, 2026

Problem

When I set up an MCP server for code execution, I hit this:

Context window: 200,000 tokens (140,000 used)
Tool definitions: 15,000 tokens
State history: 45,000 tokens
Response data: 80,000 tokens
Available space: 60,000 tokens

The conversation started dying after just a few exchanges. Every tool call added massive overhead, and the context window filled up way too fast.
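The breakdown above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are copied from the figures listed above.

```typescript
// Token counts copied from the breakdown above.
const contextWindow = 200000;
const usage = {
  toolDefinitions: 15000,
  stateHistory: 45000,
  responseData: 80000,
};

// Sum everything that already occupies the window.
const used = usage.toolDefinitions + usage.stateHistory + usage.responseData;
const available = contextWindow - used;
const usedShare = used / contextWindow;
// used = 140000, available = 60000, usedShare = 0.7
```

So 70% of the window is gone, and more than half of that is response data alone.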

Environment

MCP server: TypeScript with SSE transport
LLM: Claude 3.5 Sonnet (200K context)
Use case: File system operations, code analysis
Alternative tested: CLI workflows

What happened?

I was building an AI coding assistant that needed to interact with the filesystem. I thought MCP would be perfect - it provides structured tool definitions, proper typing, and handles the protocol automatically.

Here’s my initial MCP server setup:

// Standard MCP server with full tool definitions
// (simplified: the exact constructor shape depends on the MCP SDK or framework in use)
const server = new MCPServer({
  name: "filesystem-server",
  version: "1.0.0",
  tools: [
    {
      name: "read_file",
      description: "Read a file from disk",
      inputSchema: {
        type: "object",
        properties: {
          path: { type: "string", description: "File path to read" },
          encoding: { type: "string", default: "utf-8" }
        }
      }
    },
    // ... 20 more tools with full descriptions
  ]
});

The problem became obvious when I inspected the token usage. Every tool definition, including all descriptions and schema details, gets injected into the context. The LLM sees thousands of tokens just to understand what tools are available, before any actual work happens.
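To see which definitions cost the most, a rough per-tool estimate is enough. The sketch below assumes the common heuristic of about 4 characters per token, so the numbers are approximate and a real tokenizer will differ. `ToolDefinition`, `estimateTokens`, and `toolTokenReport` are hypothetical names used for illustration, not part of any MCP SDK.

```typescript
// Minimal shape of a tool definition, mirroring the server setup above.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
}

// Rough heuristic (assumption): about 4 characters per token, rounded up.
function estimateTokens(value: unknown): number {
  return Math.ceil(JSON.stringify(value).length / 4);
}

// Per-tool cost, sorted so the most expensive definitions come first.
function toolTokenReport(tools: ToolDefinition[]): { name: string; tokens: number }[] {
  return tools
    .map((tool) => ({ name: tool.name, tokens: estimateTokens(tool) }))
    .sort((a, b) => b.tokens - a.tokens);
}

const readFileTool: ToolDefinition = {
  name: "read_file",
  description: "Read a file from disk",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "File path to read" },
      encoding: { type: "string", default: "utf-8" },
    },
  },
};

const report = toolTokenReport([readFileTool]);
const totalToolTokens = report.reduce((sum, entry) => sum + entry.tokens, 0);
```

Running this over the full tool list shows where trimming would pay off first.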

On Reddit, someone posted “CLI is all you need - do we really need MCPs?” They argued that simple CLI commands with output filtering are more efficient. I initially dismissed it as oversimplified, but the token numbers were hard to ignore.

How to solve it?

I tried a few approaches to reduce the bloat.

First attempt: Minimal tool definitions

// Cut descriptions to bare minimum
const tools = [
  {
    name: "rf",
    description: "Read file",
    inputSchema: {
      type: "object",
      properties: { p: { type: "string" } }
    }
  }
];

This reduced tool definition tokens from 15,000 to about 3,000. But the LLM struggled to understand what the tools actually did. Ambiguous names led to wrong tool selections.
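Since the ambiguous names caused the wrong selections, one possible middle ground is to trim the prose but leave tool and parameter names intact. The sketch below is an assumption, not something measured in these experiments. `ToolDefinition`, `firstSentence`, and `slimTool` are hypothetical names.

```typescript
interface ToolProperty {
  type: string;
  description?: string;
  default?: unknown;
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: { type: "object"; properties: Record<string, ToolProperty> };
}

// Keep only the first sentence of a description.
function firstSentence(text: string): string {
  const end = text.indexOf(". ");
  return end === -1 ? text : text.slice(0, end + 1);
}

// Shorten prose but leave tool and parameter names untouched,
// since the names are what the model relies on to pick a tool.
function slimTool(tool: ToolDefinition): ToolDefinition {
  const properties: Record<string, ToolProperty> = {};
  for (const key of Object.keys(tool.inputSchema.properties)) {
    const prop = tool.inputSchema.properties[key];
    // Drop the per-parameter description; keep type and default.
    const slim: ToolProperty = { type: prop.type };
    if (prop.default !== undefined) {
      slim.default = prop.default;
    }
    properties[key] = slim;
  }
  return {
    name: tool.name,
    description: firstSentence(tool.description),
    inputSchema: { type: "object", properties },
  };
}

const slimmed = slimTool({
  name: "read_file",
  description: "Read a file from disk. Supports text and binary files.",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "File path to read" },
      encoding: { type: "string", default: "utf-8" },
    },
  },
});
// slimmed.description is "Read a file from disk."
```

Whether this keeps accuracy high would need to be measured the same way as the other approaches.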

Then I tried: Lazy tool loading

class ToolLoader {
  private allTools: Tool[];
  private activeTools: Set<string> = new Set();

  // Pass in every tool the server could expose
  constructor(allTools: Tool[]) {
    this.allTools = allTools;
  }

  async loadTools(context: string): Promise<Tool[]> {
    // Reset per call so the active set reflects only the current context
    this.activeTools.clear();
    // Only load tools relevant to current context
    if (context.includes("file")) {
      this.activeTools.add("read_file");
      this.activeTools.add("write_file");
    }
    // Return only active tools
    return this.allTools.filter(t => this.activeTools.has(t.name));
  }
}

This worked better. Token usage dropped by about 40%. But it was still high.

So I looked at what Cloudflare Code Mode and Anthropic Code Execution do differently.

The pattern I found: Pre-process before the LLM sees it

Traditional MCP:
LLM → MCP Server → File System
      ↑                ↑
   Full context    Raw output (all tokens)

Optimized CLI:
CLI → Filter → LLM
       ↓         ↓
    Raw output  Clean input (only needed tokens)
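
The "Filter" step in the diagram can be as simple as keeping matching lines and capping the count. This is a minimal sketch of that idea, assuming line-oriented output. `filterOutput` is a hypothetical helper, not part of any CLI or SDK.

```typescript
// Keep only lines matching a pattern, capped at maxLines,
// and say how many matching lines were left out.
function filterOutput(raw: string, pattern: RegExp, maxLines: number): string {
  const matches = raw.split("\n").filter((line) => {
    // Global or sticky regexes keep state between test() calls, so reset it.
    pattern.lastIndex = 0;
    return pattern.test(line);
  });
  const kept = matches.slice(0, maxLines);
  const omitted = matches.length - kept.length;
  const footer = omitted > 0 ? [`[${omitted} more matching lines omitted]`] : [];
  return kept.concat(footer).join("\n");
}

const rawOutput = "import a\nconst x = 1\nimport b\nimport c";
const filtered = filterOutput(rawOutput, /^import /, 2);
// filtered is "import a\nimport b\n[1 more matching lines omitted]"
```

The footer matters: it tells the model that data was dropped, so it can ask for more instead of assuming the output is complete.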

I tried a hybrid approach:

class HybridMCP {
  private mcpClient: MCPClient;
  private cliRunner: CLIRunner;

  async readFile(path: string): Promise<string> {
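    // First try CLI with filtering
    // Quote the path so spaces or shell metacharacters can't break the command
    const quotedPath = `'${path.replace(/'/g, "'\\''")}'`;
    const cliResult = await this.cliRunner.run(`head -n 100 -- ${quotedPath}`);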
    if (cliResult.success && cliResult.tokens < 5000) {
      return cliResult.output; // Small enough, use CLI
    }

    // Fall back to MCP for large files
    const mcpResult = await this.mcpClient.callTool("read_file", { path });
    return this.compressOutput(mcpResult.data);
  }

  private compressOutput(data: string): string {
    // Keep the head and tail of large outputs and drop the middle
    if (data.length > 10000) {
      const omitted = data.length - 4000;
      return `${data.slice(0, 2000)}\n\n[... ${omitted} chars omitted ...]\n\n${data.slice(-2000)}`;
    }
    return data;
  }
}

This reduced tokens by about 70%. Small operations went through CLI with minimal overhead. Large files used MCP but with compressed outputs.

Server-side optimization

The biggest win came from making the MCP server token-aware:

interface TokenStats {
  toolDefinitions: number;
  currentState: number;
  lastResponse: number;
}

class TokenAwareMCP {
  private stats: TokenStats = {
    toolDefinitions: 0,
    currentState: 0,
    lastResponse: 0
  };

  async handleRequest(request: MCPRequest): Promise<MCPResponse> {
    // Check if we're hitting token limits
    const totalUsed = this.getTotalUsed();

    if (totalUsed > 150000) {
      // Compaction strategy
      await this.compactContext();
    }

    // Track token usage
    const response = await this.processRequest(request);
    this.stats.lastResponse = this.estimateTokens(response);

    return response;
  }

  private async compactContext(): Promise<void> {
    // Summarize old interactions
    // Remove unused tools
    // Cache common operations
  }

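  private getTotalUsed(): number {
    // Sum of everything currently tracked
    return this.stats.toolDefinitions + this.stats.currentState + this.stats.lastResponse;
  }

  private estimateTokens(data: unknown): number {
    // Rough estimate: about 4 characters per token, rounded up
    return Math.ceil(JSON.stringify(data).length / 4);
  }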
}
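
The `compactContext` method above is left as comments. One simple strategy it could use is dropping the oldest tracked entries until the total fits a budget. The sketch below assumes each entry carries an estimated token count. `TrackedEntry` and `compactToBudget` are hypothetical names, and summarizing instead of dropping would be a reasonable alternative.

```typescript
interface TrackedEntry {
  id: string;
  tokens: number; // estimated size of this entry
}

// Drop the oldest entries until the running total fits the budget.
// Returns the surviving entries (oldest first) and the tokens freed.
function compactToBudget(
  entries: TrackedEntry[],
  budget: number
): { kept: TrackedEntry[]; freed: number } {
  const kept = entries.slice();
  let total = kept.reduce((sum, entry) => sum + entry.tokens, 0);
  let freed = 0;
  while (total > budget && kept.length > 0) {
    const oldest = kept.shift();
    if (oldest === undefined) break;
    total -= oldest.tokens;
    freed += oldest.tokens;
  }
  return { kept, freed };
}

// Using the 150,000-token threshold from handleRequest above.
const compacted = compactToBudget(
  [
    { id: "a", tokens: 100000 },
    { id: "b", tokens: 40000 },
    { id: "c", tokens: 30000 },
  ],
  150000
);
// compacted.kept holds "b" and "c"; compacted.freed is 100000
```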

Client-side management

The client also needs to be smart about what it sends:

class ContextManager {
  private maxTokens: number = 180000;
  private currentTokens: number = 0;

  shouldIncludeTool(tool: Tool): boolean {
    const toolTokens = this.estimateToolTokens(tool);
    if (this.currentTokens + toolTokens > this.maxTokens) {
      return false; // Would exceed limit
    }
    return true;
  }

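  pruneHistory(history: Message[]): Message[] {
    // Nothing to summarize until there are more than 10 messages
    if (history.length <= 10) return history;

    // Keep recent messages, summarize older ones
    const recent = history.slice(-10);
    const old = history.slice(0, -10);
    // summarize() is expected to return an array of summary messages
    const summary = this.summarize(old);

    return [...summary, ...recent];
  }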
}
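
A self-contained version of the pruning idea looks like this. It is a sketch under stated assumptions: `Message` is a guessed shape, and `summarize` here is a placeholder that just truncates each message, where a real setup would more likely call an LLM.

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Placeholder summarizer (assumption): keeps the first 80 characters of each message.
function summarize(messages: Message[]): Message {
  const digest = messages
    .map((m) => `${m.role}: ${m.content.slice(0, 80)}`)
    .join("\n");
  return {
    role: "system",
    content: `Summary of ${messages.length} earlier messages:\n${digest}`,
  };
}

// Keep the most recent messages verbatim and fold the rest into one summary.
function pruneHistory(history: Message[], keepRecent: number = 10): Message[] {
  // Nothing to summarize yet, so return the history unchanged.
  if (history.length <= keepRecent) return history;
  const cut = history.length - keepRecent;
  const old = history.slice(0, cut);
  const recent = history.slice(cut);
  return [summarize(old), ...recent];
}

const longHistory: Message[] = [];
for (let i = 0; i < 12; i++) {
  longHistory.push({ role: i % 2 === 0 ? "user" : "assistant", content: `message ${i}` });
}
const pruned = pruneHistory(longHistory);
// pruned has 11 entries: one summary followed by the last 10 messages
```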

Alternative: Cloudflare Code Mode approach

Cloudflare’s approach runs code at the edge with minimal context transfer:

Local Agent → Cloudflare Worker → Result
                    ↓
               Execute code
                    ↓
               Return result
                    ↓
               Clean output (no state)

The key insight: Don’t send context back and forth. Send a small request, get a small result.

I implemented this pattern:

class EdgeExecutor {
  async executeCode(code: string, inputs: any): Promise<any> {
    // Minimal context sent to edge
    const payload = {
      code: code,
      inputs: inputs,
      // No history, no state
    };

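    // Edge returns just the result
    const response = await fetch("https://my-worker.cloudflare.com/execute", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload)
    });
    if (!response.ok) {
      throw new Error(`Edge execution failed: ${response.status}`);
    }

    return response.json();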
  }
}

This reduced per-request tokens from ~20,000 to ~500. The tradeoff: no state is kept between calls, so it only suits simpler, self-contained operations.

Anthropic Code Execution patterns

Anthropic’s Code Execution feature has smart context management:

Auto-compaction of terminal history
Prompt caching for repeated operations
Memory-aware tool selection

I replicated some of this:

class SmartExecutor {
  private history: TerminalEntry[] = [];
  private summaryCache: Map<string, string> = new Map();

  async execute(command: string): Promise<string> {
    const result = await this.runCommand(command);

    // Auto-summarize long outputs
    const compressed = result.length > 5000;
    const output = compressed ? await this.summarize(result) : result;

    this.history.push({ command, output, compressed });
    // Keep the history bounded after every command
    await this.compactHistory();
    return output;
  }

  private async compactHistory(): Promise<void> {
    if (this.history.length > 20) {
      // Keep last 10 entries
      const recent = this.history.slice(-10);
      // Summarize the rest
      const old = this.history.slice(0, -10);
      const summary = await this.summarizeBatch(old);

      // Store the summary in the same shape as a regular entry
      this.history = [
        { command: "[summary of earlier commands]", output: summary, compressed: true },
        ...recent
      ];
    }
  }
}
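
The `summaryCache` field above is declared but never used. A minimal sketch of how such a cache could work is below, assuming summaries are keyed by the command string and that the command's output does not change between calls (if it can, the cached summary goes stale). `SummaryCache` is a hypothetical class.

```typescript
// Memoize an expensive async summarizer by key so repeated commands reuse one result.
class SummaryCache {
  private cache: Map<string, Promise<string>> = new Map();
  private summarizeFn: (text: string) => Promise<string>;

  constructor(summarizeFn: (text: string) => Promise<string>) {
    this.summarizeFn = summarizeFn;
  }

  get(key: string, text: string): Promise<string> {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached;

    // Cache the promise itself so concurrent callers share one request.
    const pending = this.summarizeFn(text).catch((err) => {
      // Don't keep failed results around; let the next call retry.
      this.cache.delete(key);
      throw err;
    });
    this.cache.set(key, pending);
    return pending;
  }
}

// Stub summarizer for illustration: keeps the first 20 characters.
const cache = new SummaryCache((text) => Promise.resolve(text.slice(0, 20)));
const first = cache.get("ls -la", "total 48 drwxr-xr-x ...");
const second = cache.get("ls -la", "total 48 drwxr-xr-x ...");
// first and second are the same promise; the summarizer ran once
```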

Results

Here’s what I measured:

Approach              Tokens per request    Context fill-up time    Accuracy
Original MCP          20,000                5 turns                 High
Minimal definitions   8,000                 12 turns                Medium
Lazy loading          6,000                 15 turns                High
Hybrid CLI-MCP        3,000                 30 turns                High
Edge-only             500                   No limit                Medium

The hybrid approach gave the best balance: significant token reduction while maintaining accuracy for complex operations.

The reason

I think the key reasons for token bloat in MCP are:

Full schema injection: Every tool’s complete schema goes into context
Stateful protocol: Conversation history accumulates naturally
No selective loading: All tools are always available
Raw data transfer: Full outputs returned without filtering

The solutions work by:

Reducing what gets sent (compression, filtering)
Loading only what’s needed (lazy loading)
Compacting history (summarization)
Alternative architectures for simple cases (CLI, edge)

Best practices

Based on my experiments, here’s what I recommend:

┌─────────────────────────────────────────────────────────┐
│                    MCP Token Checklist                   │
├─────────────────────────────────────────────────────────┤
│ [ ] Track token usage on every request                 │
│ [ ] Implement tool description minification             │
│ [ ] Load tools lazily based on context                 │
│ [ ] Compress large outputs before returning            │
│ [ ] Auto-compact terminal history                       │
│ [ ] Cache repeated operations                          │
│ [ ] Consider CLI for simple read-only operations       │
│ [ ] Use edge execution for stateless operations        │
└─────────────────────────────────────────────────────────┘

When to use each approach

Use MCP when: Complex operations, stateful interactions, type safety matters
Use CLI when: Simple read operations, large outputs that need filtering
Use Edge/Code Mode when: Stateless execution, minimal context needs
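
The three rules above can be written as a small routing function. This is a sketch only: the `Operation` fields and the order in which the rules are checked are assumptions made for illustration.

```typescript
type Approach = "mcp" | "cli" | "edge";

// Hypothetical description of an operation the agent wants to run.
interface Operation {
  stateful: boolean;         // depends on earlier calls in the session
  needsTypeSafety: boolean;  // benefits from schema-validated inputs
  readOnly: boolean;         // reads data without modifying anything
}

function chooseApproach(op: Operation): Approach {
  // Stateful or type-sensitive work stays on MCP.
  if (op.stateful || op.needsTypeSafety) return "mcp";
  // Simple reads go through the CLI, where output can be filtered.
  if (op.readOnly) return "cli";
  // Everything else is stateless execution, which suits the edge.
  return "edge";
}

const forRefactor = chooseApproach({ stateful: true, needsTypeSafety: true, readOnly: false });
const forGrep = chooseApproach({ stateful: false, needsTypeSafety: false, readOnly: true });
const forScript = chooseApproach({ stateful: false, needsTypeSafety: false, readOnly: false });
// forRefactor is "mcp", forGrep is "cli", forScript is "edge"
```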

Summary

In this post, I showed how to reduce token bloat in MCP servers. The key point is that MCP doesn’t have to be bloated - you can combine smart server-side optimizations with alternative architectures. By implementing lazy loading, response compression, and context management, I reduced token usage by 70% while maintaining functionality. For simple operations, CLI-based approaches or edge execution can be even more efficient.

The Reddit poster was partly right - CLI is often enough. But MCP still has its place for complex, stateful interactions. The real solution isn’t choosing one over the other, but using the right tool for each job.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Anthropic MCP Documentation
👨‍💻 Cloudflare Code Mode
👨‍💻 Reddit: CLI is all you need discussion

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!