
Prompt Processing vs Token Generation: What Matters More for AI Coding Agents?

Problem

When I was researching hardware for running local AI coding agents, I kept seeing the same benchmarks everywhere: token generation speed. “Mac Studio M2 Ultra hits 50 tokens per second!” “RTX 4090 achieves 80 tokens per second!”

But when I actually tried using a local coding agent on my codebase, I noticed something strange:

Coding Agent Waiting Experience
[Agent reading my project files...]
[Waiting...]
[Still waiting...]
[10 minutes later...]
[Agent finally starts responding]
Response: "I'll help you refactor this function."

The token generation was fast once it started. But the waiting time before each response was painful. I realized I had been looking at the wrong metric entirely.

What happened?

I asked on r/LocalLLM about hardware recommendations for coding agents. The responses were eye-opening:

“Before buying a Mac, you need to consider that the write speed (TP) is not necessarily the most important factor. With large contexts (meaning code), prompt processing (PP) is more important if you don’t want to wait 10 minutes between each step.”

Another user pointed out:

“You’ll notice that no one posts PP benchmarks on Reddit, only TP benchmarks when talking about Macs.”

This made me realize the problem: Everyone benchmarks what’s easy to measure, not what actually matters for coding workflows.

Two phases of LLM inference

To understand why PP matters more than TP for coding agents, I needed to understand how LLM inference actually works:

LLM Inference Timeline
┌─────────────────────────────────────────────────────────────┐
│                     INFERENCE PIPELINE                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. PREFILL PHASE (Prompt Processing)                       │
│  ┌──────────────────────────────────────────┐               │
│  │ [Context] [User Message] [System Prompt] │               │
│  │ ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓         │               │
│  │ All tokens processed IN PARALLEL         │               │
│  │ Building KV cache for all inputs         │               │
│  └──────────────────────────────────────────┘               │
│  Time: Seconds to Minutes                                   │
│                                                             │
│  2. DECODE PHASE (Token Generation)                         │
│  ┌──────────────────────────────────────────┐               │
│  │ [Token1] → [Token2] → [Token3] → ...     │               │
│  │     ↓          ↓          ↓              │               │
│  │ SEQUENTIAL generation, one at a time     │               │
│  │ Each token reads full KV cache           │               │
│  └──────────────────────────────────────────┘               │
│  Time: Milliseconds per token                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Prefill Phase (PP)

The prefill phase processes ALL input tokens in parallel. This is where the model:

  1. Reads your entire codebase context
  2. Parses all files you’ve provided
  3. Builds the KV cache for attention computation
  4. Speed depends mainly on compute throughput, and less on memory bandwidth

Decode Phase (TP)

The decode phase generates output tokens sequentially. This is where the model:

  1. Generates one token at a time
  2. Each token requires reading the model weights and the entire KV cache
  3. Speed depends on memory bandwidth
  4. This is what most benchmarks measure
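
A back-of-the-envelope bound makes the difference between the two phases concrete. Decode has to stream every weight from memory for each new token, while prefill spreads that read over the whole prompt and is limited mostly by arithmetic throughput. The sketch below uses the common approximation of about 2 FLOPs per parameter per token and ignores attention and KV cache traffic. The 35 GB weight size (70B parameters at 4 bits) and the 100 TFLOP/s figure are illustrative assumptions, not measurements of any particular machine.

```python
def decode_tokens_per_second_bound(bandwidth_bytes_per_s, weight_bytes):
    """Upper bound on decode speed: every token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


def prefill_seconds_bound(params, prompt_tokens, flops_per_s):
    """Lower bound on prefill time, assuming ~2 FLOPs per parameter per token."""
    return 2 * params * prompt_tokens / flops_per_s


# Illustrative assumptions: 70B parameters at 4 bits per weight (~35 GB),
# 800 GB/s of memory bandwidth, and a hypothetical 100 TFLOP/s of usable compute.
decode_bound = decode_tokens_per_second_bound(800e9, 35e9)
prefill_bound = prefill_seconds_bound(70e9, 50_000, 100e12)

print(f"Decode upper bound: {decode_bound:.1f} tokens/s")
print(f"Prefill lower bound for 50K tokens: {prefill_bound:.0f}s")
```

Real numbers will be worse than these bounds, but the shape holds: decode scales with bandwidth divided by model size, prefill scales with prompt length divided by compute.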

Why PP dominates coding agent workflows

For coding agents, the prefill phase is the bottleneck. Here’s why:

inference_time_calculation.py
def calculate_inference_time(prompt_tokens, output_tokens, pp_speed, tp_speed):
    """
    Calculate total inference time for LLM.

    Args:
        prompt_tokens: Number of input tokens (codebase context)
        output_tokens: Number of output tokens (agent response)
        pp_speed: Prompt processing speed (tokens/second)
        tp_speed: Token generation speed (tokens/second)
    """
    prefill_time = prompt_tokens / pp_speed
    decode_time = output_tokens / tp_speed
    return {
        "prefill_time": prefill_time,
        "decode_time": decode_time,
        "total_time": prefill_time + decode_time,
        "prefill_percentage": prefill_time / (prefill_time + decode_time) * 100
    }

# Typical coding agent scenario:
# - 50,000 tokens of codebase context
# - 500 tokens of generated response (plan, code changes, explanation)
scenario = calculate_inference_time(
    prompt_tokens=50000,
    output_tokens=500,
    pp_speed=5000,  # Mac Studio M2 Ultra approximate
    tp_speed=50     # Mac Studio M2 Ultra approximate
)
print(f"Prefill time: {scenario['prefill_time']:.1f}s")
print(f"Decode time: {scenario['decode_time']:.1f}s")
print(f"Prefill accounts for: {scenario['prefill_percentage']:.1f}% of total time")

Running this calculation:

Output
Prefill time: 10.0s
Decode time: 10.0s
Prefill accounts for: 50.0% of total time

Now imagine your agent runs multiple steps (plan, edit, review, test). Unless the runtime can reuse a cached prompt prefix, each step re-processes the context:

Multi-step Agent Timeline
Step 1 (Analyze): 10s prefill + 10s decode = 20s
Step 2 (Plan): 10s prefill + 8s decode = 18s
Step 3 (Edit): 10s prefill + 5s decode = 15s
Step 4 (Review): 10s prefill + 3s decode = 13s
─────────────────────────────────────────────
Total: 40s prefill + 26s decode = 66s
Prefill accounts for: 60.6% of total time

If your PP is slow (say, 1000 tokens/s instead of 5000), each step takes 50s just for prefill:

Slow PP Scenario
Step 1 (Analyze): 50s prefill + 10s decode = 60s
Step 2 (Plan): 50s prefill + 8s decode = 58s
Step 3 (Edit): 50s prefill + 5s decode = 55s
Step 4 (Review): 50s prefill + 3s decode = 53s
─────────────────────────────────────────────
Total: 200s prefill + 26s decode = 226s
That's nearly 4 minutes instead of 1 minute!
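
The two timelines above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic: it assumes every step re-reads the full 50,000-token prompt with no prefix caching, and the per-step output sizes are my own back-calculation from the 10s/8s/5s/3s decode times at 50 tokens/s.

```python
def agent_run_time(prompt_tokens, step_output_tokens, pp_speed, tp_speed):
    """Total prefill and decode seconds when every step re-reads the full prompt."""
    prefill = len(step_output_tokens) * prompt_tokens / pp_speed
    decode = sum(step_output_tokens) / tp_speed
    return prefill, decode


# Output sizes chosen to match the 10s/8s/5s/3s decode times at 50 tokens/s.
steps = [500, 400, 250, 150]

for pp_speed in (5000, 1000):
    prefill, decode = agent_run_time(50_000, steps, pp_speed, tp_speed=50)
    share = prefill / (prefill + decode) * 100
    print(f"PP {pp_speed} t/s: {prefill:.0f}s prefill + {decode:.0f}s decode "
          f"= {prefill + decode:.0f}s ({share:.1f}% prefill)")
```

At 1000 tokens/s the prefill share climbs to roughly 88%, so a 5x drop in PP speed turns a balanced workload into one that is almost entirely waiting.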

Hardware comparison: PP vs TP

Here’s what I found when comparing hardware for coding agents:

Hardware Comparison for 50K Context
NVIDIA_A100:
  memory_bandwidth: "2,000 GB/s"
  prefill_speed: "Very High (~15,000 tokens/s)"
  decode_speed: "Very High (~100 tokens/s)"
  context_limit: "Excellent (80GB VRAM)"
  verdict: "Best for large contexts, but expensive"
RTX_4090:
  memory_bandwidth: "1,008 GB/s"
  prefill_speed: "High (~8,000 tokens/s)"
  decode_speed: "High (~80 tokens/s)"
  context_limit: "Limited (24GB VRAM)"
  verdict: "Great decode speed, but model size limited by VRAM"
Mac_Studio_M2_Ultra:
  memory_bandwidth: "800 GB/s"
  prefill_speed: "Good (~5,000 tokens/s)"
  decode_speed: "Good for local (~50 tokens/s)"
  context_limit: "Excellent (up to 192GB unified memory)"
  verdict: "Best for running large models locally, slower PP"
Mac_Studio_M4_Max:
  memory_bandwidth: "546 GB/s"
  prefill_speed: "Moderate (~3,500 tokens/s)"
  decode_speed: "Good (~45 tokens/s)"
  context_limit: "Good (up to 128GB unified memory)"
  verdict: "Lower bandwidth = slower PP for large contexts"

The key insight: Mac’s unified memory buys you capacity, but it comes with lower memory bandwidth and less GPU compute than a high-end NVIDIA card, and the compute gap is what hurts prompt processing most.

The KV cache factor

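Understanding the KV cache helped me see why large contexts are so expensive in both memory and time:

KV Cache Memory Requirements
For a 70B-class model with the KV cache stored at 4 bits per element:
KV cache size per token ≈ 2 × layers × kv_heads × head_dim × bytes_per_element
Example calculation:
- 80 layers, 64 KV heads, 128 head_dim (no grouped-query attention)
- KV cache per token ≈ 2 × 80 × 64 × 128 × 0.5 bytes = ~655 KB per token
- For 50K context: 655 KB × 50,000 ≈ 32.8 GB of KV cache
- Quantizing the weights does not shrink the cache; stored at FP16 (2 bytes) it would be 4× larger
During prefill:
- Model must compute and write all ~32.8 GB of keys and values
- Computing them is the expensive part, so PP speed tracks compute throughput more than memory bandwidth
- During decode, the whole cache is read again for every token, which is where memory bandwidth dominates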
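
To try the KV cache formula with other shapes, here is a small calculator. It is a sketch of the same arithmetic rather than a measurement: the 16-bit cache and the 8-KV-head variant are hypothetical configurations I added for comparison, and `bytes_per_element` refers to how the cache itself is stored, which is independent of how the weights are quantized.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_element, tokens):
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return per_token * tokens


# The example above: 80 layers, 64 heads, head_dim 128, 0.5 bytes per element.
example = kv_cache_bytes(80, 64, 128, 0.5, 50_000)

# Same shape with a 16-bit cache (2 bytes per element).
fp16 = kv_cache_bytes(80, 64, 128, 2, 50_000)

# Hypothetical grouped-query variant with 8 KV heads and a 16-bit cache.
gqa_fp16 = kv_cache_bytes(80, 8, 128, 2, 50_000)

for name, size in [("4-bit, 64 KV heads", example),
                   ("16-bit, 64 KV heads", fp16),
                   ("16-bit, 8 KV heads", gqa_fp16)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

The spread between the three cases (roughly 33 GB, 131 GB, and 16 GB) shows how much the answer depends on the cache precision and the number of KV heads, so it is worth plugging in the actual values from your model's config.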

What I learned about benchmarks

The Reddit community made a valid point: PP benchmarks are rarely shared because they’re harder to measure.

TP benchmarks are simple:

  • Run llama-bench with any prompt
  • Measure tokens generated per second
  • Result changes little with prompt size, at least at short contexts

PP benchmarks are complex:

  • Vary significantly with context size
  • Require specific testing methodology
  • Not as impressive in marketing (“50,000 tokens in 10 seconds” vs “50 tokens per second”)

I found a helpful comparison from a user testing MiniMax:

MiniMax User Experience
"The prompt processing is near same and token gen on an a10b model
such as MiniMax even at Q6 is near 50 token/s."
Key insight: For coding agents, you want the biggest, smartest model
you can run locally. TP speed is secondary if you're waiting minutes
for prefill on every step.

Practical recommendations

Based on my research, here’s what I recommend for local coding agents:

1. Check PP benchmarks, not just TP

When evaluating hardware, specifically look for prefill benchmarks:

PP Benchmark Example
# Using llama.cpp benchmarking
llama-bench -m model.gguf -p 50000 -n 0 -ngl 99
# This tests prefill with 50K context, no generation
# Compare PP speeds across different hardware
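
Because PP speed changes with context size, it can help to test several prompt lengths in one run. The helper below is a hypothetical convenience wrapper of my own, not part of llama.cpp; it only builds the command line, using the `-m`, `-p`, `-n`, and `-ngl` options and passing the prompt lengths as a comma-separated list. The model path and sizes are placeholders.

```python
import shlex


def llama_bench_prefill_cmd(model_path, prompt_sizes, gpu_layers=99):
    """Build a llama-bench command that measures prefill only at several sizes."""
    sizes = ",".join(str(n) for n in prompt_sizes)
    return [
        "llama-bench",
        "-m", model_path,          # model file
        "-p", sizes,               # prompt lengths to test, comma-separated
        "-n", "0",                 # generate no tokens, so only prefill is timed
        "-ngl", str(gpu_layers),   # layers to offload to the GPU
    ]


cmd = llama_bench_prefill_cmd("model.gguf", [2048, 8192, 32768])
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.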

2. Consider your typical context size

Context Size Impact
< 10K tokens: PP is fast on most hardware, focus on TP
10K - 30K tokens: PP becomes noticeable, balance PP and TP
> 30K tokens: PP dominates, prioritize high PP hardware
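
These thresholds are rules of thumb. The actual crossover depends on how long the responses are and on the ratio of PP to TP speed for your hardware. As a rough sketch, the prompt length at which prefill time equals decode time can be computed directly; the speeds below are the same approximate figures used earlier in this post, not new measurements.

```python
def break_even_prompt_tokens(output_tokens, pp_speed, tp_speed):
    """Prompt length at which prefill time equals decode time."""
    # prompt / pp_speed == output / tp_speed  =>  prompt = output * pp_speed / tp_speed
    return output_tokens * pp_speed / tp_speed


# Same assumed speeds as the earlier example: PP 5000 tokens/s, TP 50 tokens/s.
for output_tokens in (100, 250, 500):
    tokens = break_even_prompt_tokens(output_tokens, pp_speed=5000, tp_speed=50)
    print(f"{output_tokens} output tokens -> prefill dominates above {tokens:,.0f} prompt tokens")
```

Short agent responses, such as a single tool call, push the crossover much lower, which is why prefill tends to dominate in agent loops.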

3. Mac vs NVIDIA decision matrix

Hardware Decision Guide
Choose Mac Studio if:
✓ You need to run models > 24GB (larger models)
✓ Your model weights plus KV cache regularly exceed 24GB
✓ You want one unified system (no separate GPU)
✓ You can tolerate slower PP for larger model capacity
Choose NVIDIA GPU if:
✓ Your model fits in VRAM (≤24GB for 4090)
✓ You prioritize fast PP for many agent steps
✓ You're okay with model size limitations
✓ You need maximum inference speed
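
The capacity side of this decision is easy to check with arithmetic. The sketch below estimates weight size as parameters × bits / 8 and adds a KV cache figure and a fixed overhead; the overhead value and the KV cache sizes are hypothetical placeholders, and `capacity_gb` should be the memory actually usable by the GPU, which can be less than the advertised total.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, kv_cache_gb, capacity_gb, overhead_gb=2.0):
    """True if weights, KV cache, and a hypothetical fixed overhead fit in memory."""
    needed = weight_gb(params_billion, bits_per_weight) + kv_cache_gb + overhead_gb
    return needed <= capacity_gb


# Hypothetical workloads; the KV cache figures are placeholders, not measurements.
print(fits(70, 4, kv_cache_gb=16, capacity_gb=24))    # 70B at 4-bit on a 24 GB card
print(fits(70, 4, kv_cache_gb=16, capacity_gb=192))   # same model in 192 GB unified memory
print(fits(14, 4, kv_cache_gb=8, capacity_gb=24))     # smaller model on the 24 GB card
```

A 70B model at 4 bits needs about 35 GB for weights alone, so it cannot fit on a single 24 GB card regardless of context size.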

4. Test with your actual workload

Don’t rely on synthetic benchmarks:

Test Your Actual Workflow
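import time

def benchmark_agent_step(model, codebase_files, prompt, max_tokens=500):
    """Benchmark a realistic coding agent step.

    build_context() and count_tokens() are helpers you supply for your own
    stack; model.generate() is assumed to block until generation finishes.
    """
    # Build full context (like a coding agent would) before starting the clock
    full_context = build_context(codebase_files, prompt)
    token_count = count_tokens(full_context)
    # Measure time to first token (approximate PP)
    start = time.perf_counter()
    model.generate(full_context, max_tokens=1)
    first_token_time = time.perf_counter() - start
    # Time a full generation separately. It repeats the prefill unless the
    # runtime caches the prompt, so subtract the prefill estimate.
    start = time.perf_counter()
    model.generate(full_context, max_tokens=max_tokens)
    total_time = time.perf_counter() - start
    decode_time = max(total_time - first_token_time, 1e-9)
    return {
        "context_tokens": token_count,
        "first_token_time": first_token_time,  # PP approximation
        "pp_speed": token_count / first_token_time,
        "total_time": total_time,
        # Assumes the model actually produced max_tokens tokens
        "decode_speed": max_tokens / decode_time,
    }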
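
If your runtime can stream tokens, a single streamed request separates the two phases more cleanly than calling generate twice. The sketch below times any iterator of tokens; `fake_stream` is a stand-in I wrote that sleeps to imitate prefill and decode, so replace it with whatever streaming call your own stack provides.

```python
import time


def measure_stream(token_stream):
    """Time a stream of tokens: returns (time_to_first_token, decode_tokens_per_s)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        count += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    # Tokens after the first one were produced during the decode phase.
    decode_time = end - first_token_at
    decode_speed = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, decode_speed


def fake_stream(prefill_s=0.05, per_token_s=0.005, n_tokens=20):
    """Stand-in for a real streaming API: sleeps to imitate prefill and decode."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


ttft, decode_speed = measure_stream(fake_stream())
print(f"Time to first token: {ttft:.3f}s, decode: {decode_speed:.0f} tokens/s")
```

Dividing the prompt's token count by the measured time to first token then gives an approximate PP speed for your real workload.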

Summary

In this post, I explained why prompt processing speed matters more than token generation speed for AI coding agents. The key points are:

  1. Coding agents process massive contexts: A typical coding task involves reading entire files or repositories, which means 50K+ tokens of context
  2. Each agent step re-processes context: Plan, edit, and review cycles all require re-reading the codebase
  3. PP dominates total time: For large contexts, prefill can account for half or more of inference time (50% to nearly 90% in the examples above)
  4. Benchmarks are misleading: Marketing focuses on TP because it’s easier to measure, not because it matters more

When choosing hardware for local coding agents:

  • Look for PP benchmarks, not just TP numbers
  • Consider your typical context sizes
  • Balance model capacity vs PP speed (Mac vs NVIDIA trade-off)
  • Test with your actual workload, not synthetic benchmarks

A 50 token/s decode speed doesn’t matter if you’re waiting 10 minutes for prefill on every agent step.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

