What is Code Mode? How It Differs From MCP Tool Calling
Problem
I’ve been building AI agents that use MCP (Model Context Protocol) tool calling, and I noticed something frustrating: every tool call requires a round-trip through the LLM context. For a simple three-step operation, the model processes intermediate results three times, burning tokens and adding latency.
Then I came across Cloudflare’s “Code Mode” approach, which challenges the traditional tool-calling paradigm. The core question: Why make LLMs select and invoke tools when they’re already trained to write code?
Purpose
I want to understand whether Code Mode genuinely solves the problems with traditional tool calling, or if it’s just another approach with different trade-offs. This post documents my analysis of both approaches, the real-world implications, and when to use each.
The Core Problem with Traditional Tool Calling
When I built my first MCP-based agent, I assumed tool calling was the natural way for LLMs to interact with external systems. But I quickly ran into three issues:
1. Training Data Mismatch
LLMs are trained on terabytes of code. They understand function calls, API patterns, and procedural logic deeply. Tool calling schemas? That’s a newer paradigm that models have to learn post-training or through fine-tuning.
Cloudflare argues this creates a fundamental mismatch:
┌─────────────────────────────────────────┐
│                                         │
│  ████████████████████████████  Code     │
│  ████████  Natural Language             │
│  ███  Tool Calling Schemas              │
│                                         │
└─────────────────────────────────────────┘

I think there’s truth to this. When I ask an LLM to write code that fetches data, it rarely struggles. But when I present it with a complex tool schema and expect it to select the right tool from twenty options, accuracy drops noticeably.
2. The Round-Trip Tax
Here’s what a typical multi-step operation looks like with traditional tool calling:
User Request
      │
      ▼
┌─────────────┐
│     LLM     │ ◄── Tool Call #1
└─────────────┘
      │
      ▼
┌─────────────┐
│ MCP Server  │
└─────────────┘
      │
      ▼  Result #1 back to LLM
┌─────────────┐
│     LLM     │ ◄── Tool Call #2
└─────────────┘
      │
      ▼
┌─────────────┐
│ MCP Server  │
└─────────────┘
      │
      ▼  Result #2 back to LLM
┌─────────────┐
│     LLM     │
└─────────────┘

Each step requires:
- LLM processes context
- LLM decides next action
- LLM formats tool call
- Tool executes
- Result returns to LLM context
- Repeat…
For three tool calls, that’s three full passes through the LLM, plus a fourth to produce the final answer. Every intermediate result sits in the context window.
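To make the accounting concrete, here is a minimal simulation of that loop. It is a sketch rather than a real MCP client: `fakeModel`, `fakeTool`, and the message shapes are hypothetical stand-ins. It counts one model pass per tool call that gets emitted, plus one more pass to write the answer, and shows every tool result staying in the context.

```typescript
// Hypothetical message and action shapes; real MCP clients and LLM APIs differ.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type Action =
  | { kind: "tool_call"; tool: string }
  | { kind: "final"; answer: string };

// Stand-in for the model: requests each planned tool once, then answers.
function fakeModel(context: Message[], plan: string[]): Action {
  const done = context.filter((m) => m.role === "tool").length;
  return done < plan.length
    ? { kind: "tool_call", tool: plan[done] }
    : { kind: "final", answer: `Synthesized from ${done} tool results` };
}

// Stand-in for an MCP server call.
function fakeTool(tool: string): string {
  return `result of ${tool}`;
}

function runToolLoop(query: string, plan: string[]) {
  const context: Message[] = [{ role: "user", content: query }];
  let modelPasses = 0;

  while (true) {
    modelPasses += 1; // every pass re-reads the whole context so far
    const action = fakeModel(context, plan);
    if (action.kind === "final") {
      return { answer: action.answer, modelPasses, contextMessages: context.length };
    }
    // Both the call and its result are appended and carried forward.
    context.push({ role: "assistant", content: `call ${action.tool}` });
    context.push({ role: "tool", content: fakeTool(action.tool) });
  }
}

const run = runToolLoop("research topic", ["docs", "papers", "code"]);
console.log(run.modelPasses, run.contextMessages); // 4 7
```

With three planned tools the model runs four times, and the context grows from one message to seven before the final answer is written.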
3. Context Window Bloat
I tested this with a research agent that needed to:
- Search documentation (Context7 MCP)
- Search academic papers (ArXiv MCP)
- Search code examples (GitHub MCP)
With traditional tool calling, the context looked like this:
Step 1: User query + tool schema           ~2,000 tokens
Step 2: Tool result #1 (documentation)     ~8,000 tokens
Step 3: Tool result #2 (papers)            ~6,000 tokens
Step 4: Tool result #3 (code)             ~10,000 tokens
Step 5: Final synthesis                    ~3,000 tokens
─────────────────────────────────────────────────────
Total:                                    ~29,000 tokens

Roughly 83% of that total (24,000 of 29,000 tokens) was intermediate results that the LLM just needed to “carry” to the next step.
What Code Mode Does Differently
Code Mode flips the paradigm. Instead of the LLM selecting tools and making calls, it writes code that directly consumes MCP servers as APIs. Here’s how it works:
The Architecture
// MCP server schema is converted to TypeScript types
interface WeatherAPI {
  fetch(params: { city: string }): Promise<{
    temp: number;
    conditions: string;
  }>;
}

// LLM writes code, not tool calls
async function compareCities() {
  // Direct API-style calls to MCP servers
  const [seattle, portland] = await Promise.all([
    mcp.weather.fetch({ city: "Seattle" }),
    mcp.weather.fetch({ city: "Portland" })
  ]);

  // Process in code
  const difference = Math.abs(seattle.temp - portland.temp);

  // Return only final result to LLM context
  return {
    cities: [seattle, portland],
    difference,
    warmer: seattle.temp > portland.temp ? "Seattle" : "Portland"
  };
}

The key difference: this code executes in a sandbox with direct MCP server access. The LLM doesn’t see intermediate results. It only gets the final output.
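How the sandbox exposes `mcp.weather.fetch(...)` is not spelled out here, so the following is only one plausible sketch: a JavaScript `Proxy` that turns `mcp.<server>.<tool>(args)` into a generic tool call. `createMcpProxy`, `CallTool`, `McpApi`, and the fake transport are all hypothetical names for illustration, and Cloudflare’s actual implementation may work differently.

```typescript
// Hypothetical transport: forwards one call to a named tool on a named MCP server.
type CallTool = (server: string, tool: string, args: unknown) => Promise<unknown>;

// Build an object where mcp.<server>.<tool>(args) forwards to callTool.
function createMcpProxy<T extends object>(callTool: CallTool): T {
  const serverHandler: ProxyHandler<object> = {
    get: (_target, server) =>
      new Proxy({}, {
        get: (_t, tool) => {
          // Ignore symbols and "then" so the proxy is never mistaken for a promise.
          if (typeof tool !== "string" || tool === "then") return undefined;
          return (args: unknown) => callTool(String(server), tool, args);
        },
      }),
  };
  return new Proxy({}, serverHandler) as T;
}

// Typed view of the API, matching the WeatherAPI interface above.
interface WeatherResult { temp: number; conditions: string }
type McpApi = { weather: { fetch(params: { city: string }): Promise<WeatherResult> } };

// Fake transport used only for this demonstration; it records which tools were hit.
const calls: string[] = [];
const mcp = createMcpProxy<McpApi>(async (server, tool, args) => {
  calls.push(`${server}.${tool}`);
  const { city } = args as { city: string };
  return { temp: city === "Seattle" ? 58 : 64, conditions: "cloudy" };
});

async function compareCities() {
  const [seattle, portland] = await Promise.all([
    mcp.weather.fetch({ city: "Seattle" }),
    mcp.weather.fetch({ city: "Portland" }),
  ]);
  // Only this small object would go back to the model.
  return { difference: Math.abs(seattle.temp - portland.temp), toolCalls: calls.length };
}
```

Calling `compareCities()` routes both requests through the transport as `weather.fetch`, and the caller only receives the computed difference and the call count.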
The Flow Comparison
User Request
      │
      ▼
┌─────────────┐
│     LLM     │ ──── Write code block
└─────────────┘
      │
      ▼
┌─────────────────┐
│  Code Sandbox   │
│  ┌───────────┐  │
│  │  Execute  │  │
│  │   Code    │  │
│  └───────────┘  │
│        │        │
│        ▼        │
│  ┌───────────┐  │
│  │ Call MCP  │  │
│  │ Server #1 │  │
│  └───────────┘  │
│        │        │
│        ▼        │
│  ┌───────────┐  │
│  │ Call MCP  │  │
│  │ Server #2 │  │
│  └───────────┘  │
│        │        │
│        ▼        │
│  ┌───────────┐  │
│  │  Process  │  │
│  │  Results  │  │
│  └───────────┘  │
└─────────────────┘
      │
      ▼  Final result only
┌─────────────┐
│     LLM     │
└─────────────┘

One round-trip. Only the final result enters the LLM context.
Token Efficiency
Same research task with Code Mode:
Step 1: User query + API types    ~2,500 tokens
Step 2: Final result only         ~4,000 tokens
─────────────────────────────────────────────────
Total:                            ~6,500 tokens

That’s roughly a 78% reduction in context usage for the same operation.
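The reduction can be checked directly from the two breakdowns. This small calculation only reuses the token estimates quoted in this post; the variable names are mine.

```typescript
// Token estimates copied from the traditional and Code Mode breakdowns above.
const traditionalSteps = [2000, 8000, 6000, 10000, 3000];
const codeModeSteps = [2500, 4000];

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

const traditionalTotal = sum(traditionalSteps); // 29,000
const codeModeTotal = sum(codeModeSteps);       // 6,500
const reductionPct = (1 - codeModeTotal / traditionalTotal) * 100;

console.log(
  `${traditionalTotal} -> ${codeModeTotal} tokens (${reductionPct.toFixed(1)}% less)`
); // 29000 -> 6500 tokens (77.6% less)
```

These are per-step sizes. In a real tool-calling loop the accumulated context is usually re-sent on every pass, so the billed input tokens for the traditional flow would likely be higher than this simple sum suggests.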
Real-World Example: Research Agent
Let me show you both approaches side by side.
Traditional Tool Calling Approach
async def research_traditional(agent, topic: str):
    # Round trip 1: Search documentation
    docs = await agent.call_tool(
        "context7_search",
        {"query": topic}
    )
    # docs is now in LLM context

    # Round trip 2: Search papers
    papers = await agent.call_tool(
        "arxiv_search",
        {"query": topic}
    )
    # papers is now in LLM context

    # Round trip 3: Search code examples
    code_examples = await agent.call_tool(
        "github_search_code",
        {"query": topic}
    )
    # code_examples is now in LLM context

    # Round trip 4: Synthesize (LLM processes all above)
    result = await agent.generate(
        f"Synthesize research on {topic} from:\n"
        f"Docs: {docs}\n"
        f"Papers: {papers}\n"
        f"Code: {code_examples}"
    )

    return result

Four round-trips. Every intermediate result passes through the LLM.
Code Mode Approach
// LLM writes this code block
async function researchTopic(topic: string) {
  // Parallel calls - executed in sandbox
  const [docs, papers, examples] = await Promise.all([
    mcp.context7.search({ query: topic }),
    mcp.arxiv.search({ query: topic }),
    mcp.github.searchCode({ query: topic })
  ]);

  // Process results in code, not LLM context
  const synthesized = {
    documentation: docs
      .filter(d => d.verified)
      .map(d => ({
        title: d.title,
        url: d.url,
        relevance: d.score
      })),

    papers: papers
      .slice(0, 5)
      .map(p => ({
        title: p.title,
        authors: p.authors,
        abstract: p.abstract.slice(0, 200)
      })),

    codeExamples: examples
      .slice(0, 3)
      .map(e => ({
        repo: e.repository,
        file: e.path,
        snippet: e.code.slice(0, 500)
      }))
  };

  // Only this returns to LLM context
  return synthesized;
}

One round-trip. The LLM only sees the final, cleaned result.
Counterpoints: Why Traditional Tool Calling Still Matters
After experimenting with both approaches, I don’t think Code Mode is a wholesale replacement. Here’s why:
1. Reasoning Between Steps
Sometimes you need the LLM to reason about intermediate results before deciding the next step:
async def diagnose_issue(agent, error: str):
    # Step 1: Search documentation
    docs = await agent.call_tool("search_docs", {"query": error})

    # LLM needs to analyze docs and decide:
    # - Is this a known issue?
    # - Do I need to search StackOverflow?
    # - Should I check the GitHub issues?

    analysis = await agent.analyze(
        f"Based on these docs: {docs}, "
        f"what's the likely cause of {error}?"
    )

    # LLM decides next action based on analysis
    if analysis.needs_community_help:
        community = await agent.call_tool(
            "search_stackoverflow",
            {"query": error}
        )
        return synthesize(docs, community)

    return docs

Code Mode can’t easily do this because the LLM doesn’t see intermediate results.
2. Error Handling and Retry Logic
With traditional tool calling, the LLM can see errors and adjust:
async def fetch_with_retry(agent, url: str):
    result = await agent.call_tool("fetch", {"url": url})

    if result.error:
        # LLM sees the error, reasons about it
        if result.error == "rate_limit":
            await agent.wait(60)
            return await agent.call_tool("fetch", {"url": url})
        elif result.error == "not_found":
            return None

    return result

In Code Mode, error handling must be pre-programmed in the code block, not dynamically reasoned about.
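For comparison, this is roughly what “pre-programmed” error handling could look like inside a Code Mode block. It is a sketch under assumptions: the `ToolResult` shape, `withRetry`, and the injectable `sleep` are hypothetical, and the error codes simply mirror the Python example above. The policy (retry rate limits with backoff, give up on `not_found`) is fixed when the code is written; nothing is decided at run time by the model.

```typescript
// Hypothetical result shape; the error codes mirror the Python example above.
type ToolResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: "rate_limit" | "not_found" | "unknown" };

interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry only on rate limits, with exponential backoff; return null on not_found.
async function withRetry<T>(
  call: () => Promise<ToolResult<T>>,
  options: RetryOptions = {},
): Promise<T | null> {
  const { maxAttempts = 3, baseDelayMs = 1000, sleep = defaultSleep } = options;

  for (let attempt = 1; ; attempt++) {
    const result = await call();
    if (result.ok) return result.value;
    if (result.error === "not_found") return null;
    if (result.error !== "rate_limit" || attempt >= maxAttempts) {
      throw new Error(`tool call failed: ${result.error} (attempt ${attempt})`);
    }
    // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
    await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
}
```

A generated block would wrap each MCP call, for example `await withRetry(() => mcp.web.fetch({ url }))`, where `mcp.web.fetch` is again a placeholder. Any failure the policy does not anticipate surfaces as a thrown error instead of being reasoned about.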
3. Tool Design Matters More Than Protocol
One Reddit commenter made a sharp observation:
“Blaming the protocol for bad prompt engineering is like blaming HTTP because your API has confusing endpoints.”
I think this is key. A well-designed tool schema with clear names and good documentation will work well with traditional calling. A poorly designed schema will fail regardless of Code Mode or traditional approach.
Implementation: Setting Up Code Mode
If you want to experiment with Code Mode, here’s a basic setup:
Define Your MCP Server Schema
import { z } from "zod";

const weatherServer = {
  name: "weather",
  tools: {
    fetch: {
      description: "Fetch current weather for a city",
      parameters: z.object({
        city: z.string().describe("City name"),
        units: z.enum(["celsius", "fahrenheit"]).optional()
      }),
      returns: z.object({
        temp: z.number(),
        conditions: z.string(),
        humidity: z.number(),
        wind: z.number()
      })
    },

    forecast: {
      description: "Get weather forecast for a city",
      parameters: z.object({
        city: z.string(),
        days: z.number().min(1).max(7)
      }),
      returns: z.array(z.object({
        date: z.string(),
        high: z.number(),
        low: z.number(),
        conditions: z.string()
      }))
    }
  }
};

Generate TypeScript API Types
// Auto-generated from MCP schema
interface WeatherAPI {
  fetch(params: {
    city: string;
    units?: "celsius" | "fahrenheit";
  }): Promise<{
    temp: number;
    conditions: string;
    humidity: number;
    wind: number;
  }>;

  forecast(params: {
    city: string;
    days: number;
  }): Promise<Array<{
    date: string;
    high: number;
    low: number;
    conditions: string;
  }>>;
}

LLM Writes Code Against This API
// The LLM writes this code block
async function planTrip(city: string, days: number) {
  // Current conditions
  const current = await mcp.weather.fetch({
    city,
    units: "fahrenheit"
  });

  // Forecast
  const forecast = await mcp.weather.forecast({
    city,
    days: Math.min(days, 7)
  });

  // Process in code (case-insensitive match on the conditions text)
  const rainyDays = forecast.filter(
    day => day.conditions.toLowerCase().includes("rain")
  ).length;

  const avgTemp = forecast.reduce(
    (sum, day) => sum + (day.high + day.low) / 2,
    0
  ) / forecast.length;

  return {
    currentConditions: current.conditions,
    currentTemp: current.temp,
    forecastDays: forecast.length,
    rainyDays,
    averageTemperature: Math.round(avgTemp),
    packingSuggestions: generatePackingList(current, forecast, rainyDays)
  };
}

// Result types derived from the WeatherAPI interface above
type WeatherData = Awaited<ReturnType<WeatherAPI["fetch"]>>;
type ForecastDay = Awaited<ReturnType<WeatherAPI["forecast"]>>[number];

function generatePackingList(
  current: WeatherData,
  forecast: ForecastDay[],
  rainyDays: number
): string[] {
  const items: string[] = [];

  if (rainyDays > 0) {
    items.push("umbrella", "waterproof jacket");
  }

  // Thresholds assume the forecast temperatures are in Fahrenheit
  const maxTemp = Math.max(...forecast.map(d => d.high));
  if (maxTemp > 80) {
    items.push("sunscreen", "light clothing");
  } else if (maxTemp < 50) {
    items.push("warm layers", "gloves");
  }

  return items;
}

When to Use Each Approach
Based on my experiments, here’s my decision framework:
Use Code Mode When:
- Batch Operations: Multiple independent operations that don’t require LLM reasoning between steps
- Clear Procedural Logic: When the workflow can be expressed as code
- Token Efficiency Matters: Long conversations or large result sets
- Latency Sensitive: Need to minimize round-trips
Use Traditional Tool Calling When:
- Reasoning Required: Need LLM to analyze intermediate results
- Dynamic Decision Making: Next step depends on previous result analysis
- Error Recovery: LLM should handle and retry from errors
- Simple Operations: Single tool call, no complex workflow
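The two checklists can be condensed into a small helper. This is only my reading of them as code: `TaskProfile`, its field names, and `chooseApproach` are hypothetical, and a real agent would likely need more nuance than three inputs.

```typescript
// Hypothetical task description; the fields paraphrase the two checklists above.
interface TaskProfile {
  toolCalls: number;                   // how many tool invocations the task needs
  needsReasoningBetweenSteps: boolean; // must the LLM inspect intermediate results?
  needsDynamicErrorRecovery: boolean;  // should the LLM decide how to react to errors?
}

type Approach = "traditional" | "code-mode";

function chooseApproach(task: TaskProfile): Approach {
  // A single call gains nothing from an extra sandbox hop.
  if (task.toolCalls <= 1) return "traditional";
  // The model has to see intermediate results or errors to pick the next step.
  if (task.needsReasoningBetweenSteps || task.needsDynamicErrorRecovery) {
    return "traditional";
  }
  // Multi-step, procedural work: batch it in one code block.
  return "code-mode";
}

console.log(
  chooseApproach({
    toolCalls: 3,
    needsReasoningBetweenSteps: false,
    needsDynamicErrorRecovery: false,
  })
); // "code-mode"
```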
Hybrid Approach
I’ve found the best pattern is to use both:
// Complex research with hybrid approach
async function hybridResearch(topic: string) {
  // Use Code Mode for batch data gathering
  const data = await codeModeExecute(async () => {
    const [docs, papers, news] = await Promise.all([
      mcp.context7.search({ query: topic }),
      mcp.arxiv.search({ query: topic }),
      mcp.news.search({ query: topic })
    ]);

    return { docs, papers, news };
  });

  // Use traditional calling for reasoning steps
  const analysis = await agent.call_tool("analyze", {
    data,
    instruction: "Identify contradictions and knowledge gaps"
  });

  // Code Mode for action
  await codeModeExecute(async () => {
    if (analysis.knowledgeGaps.length > 0) {
      await mcp.tasks.create({
        type: "research",
        gaps: analysis.knowledgeGaps
      });
    }
  });
}

Security Considerations
Code execution requires careful sandboxing. Here’s what I implemented:
const sandboxConfig = {
  // Resource limits
  maxExecutionTime: 30000, // 30 seconds
  maxMemoryMB: 256,
  maxFileSize: 10 * 1024 * 1024, // 10MB

  // Network restrictions
  allowedDomains: [
    "api.context7.com",
    "export.arxiv.org",
    "api.github.com"
  ],

  // MCP server permissions
  allowedTools: [
    "context7.search",
    "arxiv.search",
    "github.searchCode"
  ],

  // No filesystem access
  filesystem: "none",

  // No subprocess execution
  subprocesses: false
};

Without these restrictions, a malicious prompt could generate code that:
- Exfiltrates data
- Makes unauthorized API calls
- Consumes excessive resources
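A config object does nothing until something enforces it. As one illustration, here is a minimal sketch of how the `allowedTools` list could be checked before any MCP call leaves the sandbox. `assertToolAllowed`, `guardTransport`, and the `CallTool` signature are hypothetical, and this covers only the tool allowlist; memory, time, and network limits generally have to be enforced by the isolation layer itself.

```typescript
// Subset of the sandbox config above that this sketch enforces.
interface ToolPolicy {
  allowedTools: string[]; // entries use the "<server>.<tool>" form
}

// Hypothetical transport signature for forwarding a call to an MCP server.
type CallTool = (server: string, tool: string, args: unknown) => Promise<unknown>;

// Throw before the call is made if the tool is not on the allowlist.
function assertToolAllowed(policy: ToolPolicy, qualifiedName: string): void {
  if (!policy.allowedTools.includes(qualifiedName)) {
    throw new Error(`Tool "${qualifiedName}" is not allowed in this sandbox`);
  }
}

// Wrap a transport so every call from generated code passes through the check.
function guardTransport(policy: ToolPolicy, callTool: CallTool): CallTool {
  return async (server, tool, args) => {
    assertToolAllowed(policy, `${server}.${tool}`);
    return callTool(server, tool, args);
  };
}

const policy: ToolPolicy = {
  allowedTools: ["context7.search", "arxiv.search", "github.searchCode"],
};
```

Generated code that tries `mcp.tasks.create(...)` under this policy would fail at the guard with an explicit error, which is easier to audit than a silent network block.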
Common Mistakes
I made these mistakes when first implementing Code Mode:
1. Not Validating MCP Server Responses
// WRONG: Trust everything from MCP server
async function badExample() {
  const result = await mcp.external.fetch({ url: userInput });
  // What if result contains malicious data?
  return eval(result.code); // NEVER do this
}

// CORRECT: Validate with Zod
async function goodExample() {
  const result = await mcp.external.fetch({ url: userInput });
  const validated = SafeResponseSchema.parse(result);
  return validated;
}

2. Ignoring Rate Limits
// WRONG: Parallel calls might hit rate limits
const results = await Promise.all([
  mcp.api.call({ query: "a" }),
  mcp.api.call({ query: "b" }),
  mcp.api.call({ query: "c" }),
  mcp.api.call({ query: "d" }),
  mcp.api.call({ query: "e" })
]);

// CORRECT: Batch with rate limit awareness
// (batchWithRateLimit is a helper you provide; it is not defined in this post)
const queries = ["a", "b", "c", "d", "e"];
const results = await batchWithRateLimit(
  queries.map(q => () => mcp.api.call({ query: q })),
  { maxConcurrent: 3, delayMs: 100 }
);

3. Over-Engineering Simple Operations
// WRONG: Code Mode for simple single call
async function overEngineered() {
  await codeModeExecute(async () => {
    return await mcp.weather.fetch({ city: "Seattle" });
  });
}

// CORRECT: Traditional calling for simple operations
const weather = await agent.call_tool("weather_fetch", {
  city: "Seattle"
});

Summary
In this post, I explored the difference between Cloudflare’s Code Mode and traditional MCP tool calling. The key insight is that Code Mode treats MCP servers as APIs that LLMs program against, rather than tools they must select and invoke.
Code Mode reduces token waste and latency by executing multiple operations in a single code block without intermediate LLM review. It leverages the fact that LLMs are heavily trained on code patterns.
However, traditional tool calling still has value when you need:
- LLM reasoning between steps
- Dynamic error handling
- Simple, single operations
The best approach is likely hybrid: use Code Mode for batch data gathering and processing, use traditional calling for decision points that require LLM judgment.
The debate isn’t MCP vs Code Mode—they’re complementary. MCP provides the protocol and server ecosystem; Code Mode provides an execution pattern that’s more efficient for certain workloads. As with most engineering decisions, the right choice depends on your specific use case.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can contact me by email.