Prompt Processing vs Token Generation: What Matters More for AI Coding Agents?
Problem
When I was researching hardware for running local AI coding agents, I kept seeing the same benchmarks everywhere: token generation speed. “Mac Studio M2 Ultra hits 50 tokens per second!” “RTX 4090 achieves 80 tokens per second!”
But when I actually tried using a local coding agent on my codebase, I noticed something strange:
    [Agent reading my project files...]
    [Waiting...]
    [Still waiting...]
    [10 minutes later...]
    [Agent finally starts responding]

    Response: "I'll help you refactor this function."

The token generation was fast once it started. But the wait before each response was painful. I realized I had been looking at the wrong metric entirely.
What happened?
I asked on r/LocalLLM about hardware recommendations for coding agents. The responses were eye-opening:
“Before buying a Mac, you need to consider that the write speed (TP) is not necessarily the most important factor. With large contexts (meaning code), prompt processing (PP) is more important if you don’t want to wait 10 minutes between each step.”
Another user pointed out:
“You’ll notice that no one posts PP benchmarks on Reddit, only TP benchmarks when talking about Macs.”
This made me realize the problem: Everyone benchmarks what’s easy to measure, not what actually matters for coding workflows.
Two phases of LLM inference
To understand why PP matters more than TP for coding agents, I needed to understand how LLM inference actually works:
┌─────────────────────────────────────────────────────────────┐
│                     INFERENCE PIPELINE                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. PREFILL PHASE (Prompt Processing)                       │
│  ┌──────────────────────────────────────────┐               │
│  │ [Context] [User Message] [System Prompt] │               │
│  │ ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓         │               │
│  │ All tokens processed IN PARALLEL         │               │
│  │ Building KV cache for all inputs         │               │
│  └──────────────────────────────────────────┘               │
│  Time: Seconds to Minutes                                   │
│                                                             │
│  2. DECODE PHASE (Token Generation)                         │
│  ┌──────────────────────────────────────────┐               │
│  │ [Token1] → [Token2] → [Token3] → ...     │               │
│  │    ↓          ↓          ↓               │               │
│  │ SEQUENTIAL generation, one at a time     │               │
│  │ Each token reads full KV cache           │               │
│  └──────────────────────────────────────────┘               │
│  Time: Milliseconds per token                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Prefill Phase (PP)
The prefill phase processes ALL input tokens in parallel. This is where the model:
- Reads your entire codebase context
- Parses all files you’ve provided
- Builds the KV cache for attention computation
- Speed depends mostly on compute capability (prefill is typically compute-bound), with memory bandwidth as a secondary factor
Decode Phase (TP)
The decode phase generates output tokens sequentially. This is where the model:
- Generates one token at a time
- Each token requires reading the model weights and the entire KV cache
- Speed depends mostly on memory bandwidth
- This is what most benchmarks measure
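A rough back-of-the-envelope model can make the difference between the two phases concrete. The sketch below assumes that decode is limited by streaming all weights once per generated token, and that prefill is limited by roughly 2 FLOPs per parameter per prompt token (attention cost and runtime overhead are ignored). The function names and the example numbers (800 GB/s, 25 TFLOPS of usable compute, an 8B-parameter model stored at 4 bits) are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second_upper_bound(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed: every generated token streams all weights once."""
    return bandwidth_bytes_per_s / model_bytes


def prefill_tokens_per_second_upper_bound(n_params: float, usable_flops: float) -> float:
    """Upper bound on prefill speed: about 2 FLOPs per parameter per prompt token."""
    return usable_flops / (2 * n_params)


if __name__ == "__main__":
    # Hypothetical machine: 800 GB/s of memory bandwidth, 25 TFLOPS of usable compute.
    # Hypothetical model: 8B parameters stored at 4 bits (about 4 GB of weights).
    decode = decode_tokens_per_second_upper_bound(4e9, 800e9)
    prefill = prefill_tokens_per_second_upper_bound(8e9, 25e12)
    print(f"decode  <= {decode:.1f} tokens/s (bandwidth-bound)")
    print(f"prefill <= {prefill:.1f} tokens/s (compute-bound)")
```

Real numbers land below both bounds, but the structure shows why the two phases respond to different hardware properties.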
Why PP dominates coding agent workflows
For coding agents, the prefill phase is the bottleneck. Here’s why:
    def calculate_inference_time(prompt_tokens, output_tokens, pp_speed, tp_speed):
        """
        Calculate total inference time for LLM.

        Args:
            prompt_tokens: Number of input tokens (codebase context)
            output_tokens: Number of output tokens (agent response)
            pp_speed: Prompt processing speed (tokens/second)
            tp_speed: Token generation speed (tokens/second)
        """
        prefill_time = prompt_tokens / pp_speed
        decode_time = output_tokens / tp_speed

        return {
            "prefill_time": prefill_time,
            "decode_time": decode_time,
            "total_time": prefill_time + decode_time,
            "prefill_percentage": prefill_time / (prefill_time + decode_time) * 100
        }

    # Typical coding agent scenario:
    # - 50,000 tokens of codebase context
    # - 500 tokens of generated response (plan, code changes, explanation)

    scenario = calculate_inference_time(
        prompt_tokens=50000,
        output_tokens=500,
        pp_speed=5000,  # Mac Studio M2 Ultra approximate
        tp_speed=50     # Mac Studio M2 Ultra approximate
    )

    print(f"Prefill time: {scenario['prefill_time']:.1f}s")
    print(f"Decode time: {scenario['decode_time']:.1f}s")
    print(f"Prefill accounts for: {scenario['prefill_percentage']:.1f}% of total time")

Running this calculation:

    Prefill time: 10.0s
    Decode time: 10.0s
    Prefill accounts for: 50.0% of total time

Now imagine your agent runs multiple steps (plan, edit, review, test). Unless the runtime reuses a cached prompt prefix between steps, each step re-processes the context:
    Step 1 (Analyze):  10s prefill + 10s decode = 20s
    Step 2 (Plan):     10s prefill +  8s decode = 18s
    Step 3 (Edit):     10s prefill +  5s decode = 15s
    Step 4 (Review):   10s prefill +  3s decode = 13s
    ─────────────────────────────────────────────
    Total:             40s prefill + 26s decode = 66s

    Prefill accounts for: 60.6% of total time

If your PP is slow (say, 1,000 tokens/s instead of 5,000), each step takes 50s just for prefill:

    Step 1 (Analyze):  50s prefill + 10s decode = 60s
    Step 2 (Plan):     50s prefill +  8s decode = 58s
    Step 3 (Edit):     50s prefill +  5s decode = 55s
    Step 4 (Review):   50s prefill +  3s decode = 53s
    ─────────────────────────────────────────────
    Total:            200s prefill + 26s decode = 226s
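The two step tables can be reproduced with a few lines of Python. This is a minimal sketch that assumes every step re-processes the full 50,000-token prompt with no prompt-prefix caching; `agent_run_time` and `STEP_DECODE_SECONDS` are names introduced here for illustration.

```python
def agent_run_time(prompt_tokens, decode_seconds_per_step, pp_speed):
    """Return (prefill_s, decode_s, total_s) for a multi-step agent run.

    Assumes every step re-processes the full prompt, i.e. no prompt-prefix
    caching between steps.
    """
    prefill_per_step = prompt_tokens / pp_speed
    prefill_total = prefill_per_step * len(decode_seconds_per_step)
    decode_total = sum(decode_seconds_per_step)
    return prefill_total, decode_total, prefill_total + decode_total


# Decode seconds per step, taken from the tables (analyze, plan, edit, review).
STEP_DECODE_SECONDS = [10, 8, 5, 3]

for pp_speed in (5000, 1000):
    prefill, decode, total = agent_run_time(50_000, STEP_DECODE_SECONDS, pp_speed)
    share = prefill / total * 100
    print(f"PP {pp_speed} tok/s: {prefill:.0f}s prefill + {decode:.0f}s decode "
          f"= {total:.0f}s ({share:.1f}% prefill)")
```

At 1,000 tokens/s the same four steps add up to 226 seconds, with prefill taking the large majority of the time.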
That's nearly 4 minutes instead of about 1 minute!

Hardware comparison: PP vs TP
Here’s what I found when comparing hardware for coding agents:
    NVIDIA_A100:
      memory_bandwidth: "2,000 GB/s"
      prefill_speed: "Very High (~15,000 tokens/s)"
      decode_speed: "Very High (~100 tokens/s)"
      context_limit: "Excellent (80GB VRAM)"
      verdict: "Best for large contexts, but expensive"

    RTX_4090:
      memory_bandwidth: "1,008 GB/s"
      prefill_speed: "High (~8,000 tokens/s)"
      decode_speed: "High (~80 tokens/s)"
      context_limit: "Limited (24GB VRAM)"
      verdict: "Great decode speed, but model size limited by VRAM"

    Mac_Studio_M2_Ultra:
      memory_bandwidth: "800 GB/s"
      prefill_speed: "Good (~5,000 tokens/s)"
      decode_speed: "Good for local (~50 tokens/s)"
      context_limit: "Excellent (up to 192GB unified memory)"
      verdict: "Best for running large models locally, slower PP"

    Mac_Studio_M4_Max:
      memory_bandwidth: "546 GB/s"
      prefill_speed: "Moderate (~3,500 tokens/s)"
      decode_speed: "Good (~45 tokens/s)"
      context_limit: "Good (up to 128GB unified memory)"
      verdict: "Lower bandwidth = slower PP for large contexts"

The key insight: Mac's unified memory advantage comes at the cost of lower memory bandwidth and less GPU compute than a high-end NVIDIA card, and both directly impact prompt processing speed.
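To see what these approximate figures mean for a single agent step, the sketch below plugs them into the same prefill-plus-decode arithmetic used earlier (50,000 prompt tokens, 500 output tokens). The `HARDWARE` dictionary and `step_seconds` are names introduced here. The speeds are the rough values from the comparison, not new measurements, and the sketch ignores whether a given model fits in each device's memory.

```python
# Approximate speeds copied from the comparison (tokens/s).
HARDWARE = {
    "NVIDIA A100": {"pp": 15_000, "tp": 100},
    "RTX 4090": {"pp": 8_000, "tp": 80},
    "Mac Studio M2 Ultra": {"pp": 5_000, "tp": 50},
    "Mac Studio M4 Max": {"pp": 3_500, "tp": 45},
}


def step_seconds(pp, tp, prompt_tokens=50_000, output_tokens=500):
    """Seconds for one agent step: prefill time plus decode time."""
    return prompt_tokens / pp + output_tokens / tp


for name, speeds in HARDWARE.items():
    total = step_seconds(speeds["pp"], speeds["tp"])
    print(f"{name:<22} {total:5.1f}s per step")
```

Multiplying the per-step time by the number of agent steps gives a quick feel for how much a PP difference costs over a whole task.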
The KV cache factor
Understanding KV cache helped me see why memory bandwidth matters so much for PP:
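For a 70B model at 4-bit quantization (the 0.5 bytes below assumes the KV cache itself is stored at 4 bits and that every attention head keeps its own keys and values; an FP16 cache or grouped-query attention changes the result):

    KV cache size per token ≈ 2 × layers × heads × head_dim × bytes_per_value

    Example calculation:
    - 80 layers, 64 heads, 128 head_dim
    - KV cache per token ≈ 2 × 80 × 64 × 128 × 0.5 bytes = ~655 KB per token
    - For 50K context: 655 KB × 50,000 = 32.75 GB of KV cache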
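The same arithmetic as a small Python sketch makes it easy to try other model shapes. The helper names are introduced here for illustration. The second configuration (8 KV heads with an FP16 cache) is an assumed variant to show how grouped-query attention and cache precision shift the result, not a claim about any specific model.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens, **shape):
    """Total KV cache size in decimal gigabytes for a given context length."""
    return kv_cache_bytes_per_token(**shape) * context_tokens / 1e9


# The example from the text: 64 KV heads, 4-bit (0.5 byte) cache values.
article = dict(layers=80, kv_heads=64, head_dim=128, bytes_per_value=0.5)
# Assumed variant: grouped-query attention with 8 KV heads and an FP16 cache.
gqa_fp16 = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

for label, shape in (("article example", article), ("GQA + FP16", gqa_fp16)):
    print(f"{label}: {kv_cache_bytes_per_token(**shape) / 1e3:.0f} KB/token, "
          f"{kv_cache_gb(50_000, **shape):.2f} GB at 50K tokens")
```

Without rounding the per-token size to 655 KB first, the 50K-token total comes out at about 32.77 GB rather than 32.75 GB.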
During prefill:
- Model must fill all 32.75 GB with computed values
- Memory bandwidth limits how fast the cache can be written, while compute throughput limits how fast the values are produced
- This is why PP speed tracks both memory bandwidth and compute capability

What I learned about benchmarks
The Reddit community made a valid point: PP benchmarks are rarely shared because they’re harder to measure.
TP benchmarks are simple:
- Run llama-bench with any prompt
- Measure tokens generated per second
- Result is consistent across prompt sizes
PP benchmarks are complex:
- Vary significantly with context size
- Require specific testing methodology
- Not as impressive in marketing (“50,000 tokens in 10 seconds” vs “50 tokens per second”)
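Because PP results vary with context size, one reasonable methodology is to sweep several prompt lengths in a single run. The sketch below only builds the llama-bench command line; `llama_bench_pp_command`, the `model.gguf` path, and the chosen prompt sizes are placeholders introduced here. It relies on llama-bench accepting comma-separated values for `-p` and an output format via `-o`.

```python
import shlex


def llama_bench_pp_command(model_path, prompt_sizes, gpu_layers=99):
    """Build a llama-bench command that measures prefill only at several prompt sizes."""
    sizes = ",".join(str(n) for n in prompt_sizes)
    return [
        "llama-bench",
        "-m", model_path,         # model file
        "-p", sizes,              # comma-separated prompt lengths to test
        "-n", "0",                # no token generation, prefill only
        "-ngl", str(gpu_layers),  # layers to offload to the GPU
        "-o", "json",             # machine-readable output
    ]


cmd = llama_bench_pp_command("model.gguf", [2048, 8192, 32768])
print(shlex.join(cmd))
# To actually run it (requires llama.cpp's llama-bench on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Reporting the tokens/s figure for each prompt length, rather than a single number, makes PP results comparable across machines.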
I found a helpful comparison from a user testing MiniMax:

"The prompt processing is near same and token gen on an a10b model such as MiniMax even at Q6 is near 50 token/s."

Key insight: For coding agents, you want the biggest, smartest model you can run locally. TP speed is secondary if you're waiting minutes for prefill on every step.

Practical recommendations
Based on my research, here’s what I recommend for local coding agents:
1. Check PP benchmarks, not just TP
When evaluating hardware, specifically look for prefill benchmarks:
    # Using llama.cpp benchmarking
    llama-bench -m model.gguf -p 50000 -n 0 -ngl 99

    # This tests prefill with 50K context, no generation
    # Compare PP speeds across different hardware

2. Consider your typical context size

    < 10K tokens:      PP is fast on most hardware, focus on TP
    10K - 30K tokens:  PP becomes noticeable, balance PP and TP
    > 30K tokens:      PP dominates, prioritize high PP hardware

3. Mac vs NVIDIA decision matrix
    Choose Mac Studio if:
      ✓ You need to run models > 24GB (larger models)
      ✓ Your model plus KV cache regularly exceeds 24GB of VRAM
      ✓ You want one unified system (no separate GPU)
      ✓ You can tolerate slower PP for larger model capacity

    Choose NVIDIA GPU if:
      ✓ Your model fits in VRAM (≤24GB for 4090)
      ✓ You prioritize fast PP for many agent steps
      ✓ You're okay with model size limitations
      ✓ You need maximum inference speed

4. Test with your actual workload
Don’t rely on synthetic benchmarks:
    import time

    def benchmark_agent_step(model, codebase_files, prompt, max_tokens=500):
        """Benchmark a realistic coding agent step.

        build_context, count_tokens, and model.generate are placeholders for
        whatever your agent framework and inference runtime provide.
        """
        # Build full context (like a coding agent would) before starting the timer
        full_context = build_context(codebase_files, prompt)
        token_count = count_tokens(full_context)

        # Measure time to first token (approximate PP)
        start = time.perf_counter()
        model.generate(full_context, max_tokens=1)
        first_token_time = time.perf_counter() - start

        # Run the full generation; this repeats the prefill, so prompt caching
        # must be disabled for the subtraction below to be meaningful
        start = time.perf_counter()
        model.generate(full_context, max_tokens=max_tokens)
        total_time = time.perf_counter() - start

        return {
            "context_tokens": token_count,
            "first_token_time": first_token_time,  # PP approximation
            "pp_speed": token_count / first_token_time,
            "total_time": total_time,
            # Assumes the model actually generated max_tokens tokens
            "decode_speed": (max_tokens - 1) / (total_time - first_token_time),
        }

Summary
In this post, I explained why prompt processing speed matters more than token generation speed for AI coding agents. The key points are:
- Coding agents process massive contexts: A typical coding task involves reading entire files or repositories, which means 50K+ tokens of context
- Each agent step re-processes context: Plan, edit, and review cycles all require re-reading the codebase
- PP dominates total time: For large contexts, prefill accounted for roughly 50-90% of inference time in the examples above
- Benchmarks are misleading: Marketing focuses on TP because it’s easier to measure, not because it matters more
When choosing hardware for local coding agents:
- Look for PP benchmarks, not just TP numbers
- Consider your typical context sizes
- Balance model capacity vs PP speed (Mac vs NVIDIA trade-off)
- Test with your actual workload, not synthetic benchmarks
A 50 token/s decode speed doesn’t matter if you’re waiting 10 minutes for prefill on every agent step.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit Discussion: Hardware for Local Coding Agents
- 👨💻 Understanding LLM Inference: Prefill vs Decode
- 👨💻 KV Cache in Transformer Models
- 👨💻 Apple Silicon Unified Memory Architecture
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!