
How to Optimize Token Usage with Codex Plugins and Avoid Rate Limits

I hit my rate limit again. Midway through implementing a feature, the API stopped responding:

rate-limit-error.txt
Error: Rate limit exceeded. Please retry after 60 seconds.
You have used 150,000 tokens out of your 100,000 daily limit.

This happened three times this week. I knew I needed to fix my token usage, but I wasn’t sure where to start.

What I Found

After digging through Reddit discussions and documentation, I discovered that token optimization requires a three-layer approach:

  1. Prompt-level - Compress prompts with the caveman plugin
  2. Context-level - Fetch only needed docs with Context7 MCP
  3. Model-level - Route simple tasks to cheaper models

The biggest impact came from model routing, not plugins. But combining all three gave me the best results.

Layer 1: Prompt Optimization with Caveman

I installed the caveman plugin first. It optimizes prompts by:

  • Removing verbose language
  • Compressing redundant instructions
  • Structuring requests for token efficiency

The setup was simple, but I couldn’t find detailed public documentation. Based on community feedback, it works as a compression layer between my input and the model.

caveman-usage.sh
# Without optimization
codex "Please help me implement a debounce hook in React that handles cleanup properly"
# With caveman optimization
codex --plugin caveman "implement useDebounce hook with cleanup"

I saw about a 20-30% token reduction on prompts. Not bad, but I needed more.
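To make the idea of a compression layer concrete, here is a minimal sketch of stripping filler phrases from a prompt and estimating the saving. It is not how caveman works internally (I couldn't find that documented). The phrase list, the function names, and the "about 4 characters per token" estimate are all assumptions for illustration.

```python
import re

# Hypothetical filler phrases; the real plugin's rules are not documented here.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bhelp me\b",
    r"\bcould you\b",
    r"\bi would like you to\b",
]


def compress_prompt(prompt: str) -> str:
    """Strip filler phrases and collapse the leftover whitespace."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def reduction_percent(before: str, after: str) -> float:
    """Percentage of estimated tokens removed by compression."""
    b, a = estimate_tokens(before), estimate_tokens(after)
    return round(100 * (b - a) / b, 1)


original = "Please help me implement a debounce hook in React that handles cleanup properly"
compressed = compress_prompt(original)
print(compressed)
print(reduction_percent(original, compressed))
```

A real tokenizer would give more accurate counts than the character heuristic, so treat the printed percentage as a rough indicator only.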

Layer 2: Context Optimization with Context7 MCP

Then I realized documentation lookups were eating most of my tokens. I was sending entire documentation files when I only needed specific sections.

Context7 MCP changed this completely. Instead of including full docs, it fetches only relevant sections.

codex-settings.json
{
  "mcpServers": {
    "context7": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}

Here’s the difference:

token-comparison.txt
# Without Context7 (before)
Prompt: "Read all Next.js docs and help me with app router"
Tokens sent: ~500,000 (entire documentation)
# With Context7 (after)
Prompt: "Use context7 to get Next.js app router routing guide"
Tokens sent: ~5,000 (only relevant section)

That’s a 99% reduction for this lookup. For documentation-heavy prompts, Context7 became my most valuable optimization.
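The principle behind targeted fetching can be sketched in a few lines: split a document into sections and send only the ones that match the query. This is an illustration of the idea, not Context7's implementation. The `## ` heading convention, the word-overlap scoring, and the function names are assumptions.

```python
def split_sections(doc: str) -> dict[str, str]:
    """Split a document into sections keyed by '## ' headings."""
    sections: dict[str, str] = {}
    title = "preamble"
    lines: list[str] = []
    for line in doc.splitlines():
        if line.startswith("## "):
            sections[title] = "\n".join(lines).strip()
            title, lines = line[3:].strip(), []
        else:
            lines.append(line)
    sections[title] = "\n".join(lines).strip()
    return sections


def select_relevant(doc: str, query: str, limit: int = 1) -> list[str]:
    """Return titles of up to `limit` sections sharing the most words with the query."""
    words = set(query.lower().split())
    scored = []
    for title, body in split_sections(doc).items():
        text = set((title + " " + body).lower().split())
        score = len(words & text)
        if score:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


DOC = """Intro text.
## Routing
The app router maps folders to routes.
## Styling
Use CSS modules for styling.
## Deployment
Deploy with a single command.
"""

print(select_relevant(DOC, "app router routes"))
```

Only the selected section would then go into the prompt, which is where the token saving comes from.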

Layer 3: Model Routing with OpenRouter

But the real game-changer was model routing. I was using the most expensive model for everything, even simple formatting tasks.

OpenRouter lets me route tasks to appropriate models based on complexity:

openrouter-config.js
const routing = {
  default: "anthropic/claude-3-haiku", // Cheaper default
  routes: [
    {
      pattern: /(architecture|design|complex|refactor)/i,
      model: "anthropic/claude-opus-4"
    },
    {
      pattern: /(implement|debug|feature)/i,
      model: "anthropic/claude-3.5-sonnet"
    },
    {
      pattern: /(format|comment|simple|fix typo)/i,
      model: "anthropic/claude-3-haiku"
    }
  ]
};
export default routing;
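The config above only declares patterns, so here is a minimal sketch of how such a table could be evaluated. It assumes first-match-wins ordering, which the config itself does not specify, and `pick_model` is a hypothetical helper rather than part of OpenRouter.

```python
import re

DEFAULT_MODEL = "anthropic/claude-3-haiku"

# Same patterns as the config above, checked in order (assumption: first match wins).
ROUTES = [
    (re.compile(r"(architecture|design|complex|refactor)", re.IGNORECASE),
     "anthropic/claude-opus-4"),
    (re.compile(r"(implement|debug|feature)", re.IGNORECASE),
     "anthropic/claude-3.5-sonnet"),
    (re.compile(r"(format|comment|simple|fix typo)", re.IGNORECASE),
     "anthropic/claude-3-haiku"),
]


def pick_model(prompt: str) -> str:
    """Return the model of the first route whose pattern matches, else the default."""
    for pattern, model in ROUTES:
        if pattern.search(prompt):
            return model
    return DEFAULT_MODEL


for prompt in [
    "format this file with prettier",
    "implement useDebounce hook with cleanup",
    "design authentication architecture for microservices",
    "simple refactor of this helper",
]:
    print(f"{pick_model(prompt)} <- {prompt}")
```

Under this ordering, a prompt that contains keywords from several routes goes to the earliest one, so "simple refactor" lands on the most expensive model. If cheap routes should win, they would need to be listed first.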

Here’s my cost comparison:

| Task Type | Old Model | New Model | Cost Savings |
|---|---|---|---|
| Simple refactoring | Opus 4.5 | Haiku 3.5 | ~90% cheaper |
| Code completion | Opus 4.5 | Sonnet 3.5 | ~70% cheaper |
| Complex reasoning | Opus 4.5 | Opus 4.5 | Same (worth it) |

As one Reddit user said: “routing your simple tasks through a cheaper model via openrouter or routers like herma instead of using 5.5 for everything helps more than any plugin will”

Combined Workflow

I now use all three optimizations together:

optimized-workflow.sh
# Step 1: Use Context7 for targeted docs (saves ~50k tokens)
codex "Use context7 to get React 18 useEffect docs"
# Step 2: Use caveman-optimized prompt
codex --plugin caveman "implement useDebounce hook with cleanup"
# Step 3: Route to appropriate model
# Simple task -> Haiku (cheap)
codex --route simple "format this file with prettier"
# Complex task -> Opus (expensive, but worth it)
codex --route complex "design authentication architecture for microservices"

I also added token budget tracking to avoid surprises:

token_tracker.py
class TokenBudget:
    """Track daily token usage and decide when to fall back to a cheaper model."""

    def __init__(self, daily_limit=100000):
        self.daily_limit = daily_limit
        self.used = 0

    def should_use_cheap_model(self, estimated_tokens):
        # Budget that would be left after sending this request
        remaining = self.daily_limit - self.used - estimated_tokens
        return remaining < self.daily_limit * 0.3  # Switch to cheap at 30% remaining

    def log_usage(self, model, tokens, task_type):
        # Record the usage first so the printed total includes this call
        self.used += tokens
        print(f"[{model}] {task_type}: {tokens} tokens")
        print(f"Daily budget: {self.used}/{self.daily_limit}")
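One gap in a tracker like this is that a daily limit needs a daily reset. The following is a sketch of a variant that resets when the date changes and picks the model in one call. `DailyTokenBudget`, its method names, and the `today` parameter (there to make the behaviour easy to test) are all hypothetical.

```python
from datetime import date
from typing import Optional


class DailyTokenBudget:
    """Hypothetical token budget that resets when the calendar date changes."""

    def __init__(self, daily_limit: int = 100_000, cheap_threshold: float = 0.3,
                 today: Optional[date] = None):
        self.daily_limit = daily_limit
        self.cheap_threshold = cheap_threshold
        self.used = 0
        self.day = today or date.today()

    def _roll_over(self, today: date) -> None:
        # Start a fresh budget on the first call of a new day
        if today != self.day:
            self.day = today
            self.used = 0

    def remaining(self, today: Optional[date] = None) -> int:
        self._roll_over(today or date.today())
        return self.daily_limit - self.used

    def choose_model(self, preferred: str, cheap: str, estimated_tokens: int,
                     today: Optional[date] = None) -> str:
        """Pick `cheap` if this request would leave less than the threshold share of the budget."""
        left_after = self.remaining(today) - estimated_tokens
        if left_after < self.daily_limit * self.cheap_threshold:
            return cheap
        return preferred

    def record(self, tokens: int, today: Optional[date] = None) -> None:
        self._roll_over(today or date.today())
        self.used += tokens


budget = DailyTokenBudget()
budget.record(60_000)
print(budget.choose_model("anthropic/claude-opus-4", "anthropic/claude-3-haiku", 15_000))
```

The 30% threshold mirrors the one in the tracker above. The right value depends on how your usage is spread across the day.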

What Didn’t Work

I tried just using caveman without model routing. That saved maybe 20-30%, but I still hit rate limits.

I also tried implementing routing without Context7. That worked better, but documentation lookups still consumed too many tokens.

Why This Matters

Token optimization is not a single solution. It requires a layered strategy:

| Optimization Layer | Token Savings | Setup Complexity |
|---|---|---|
| Caveman plugin | 20-40% | Low |
| Context7 MCP | 80-95% on docs | Medium |
| Model routing | 50-80% overall | Medium-High |
| Combined approach | 70-90% total | High |
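Per-layer percentages don't simply add up, because each layer acts on a different slice of your usage. As a rough sketch, the overall token reduction can be estimated as a weighted sum over usage categories. The shares and reductions below are hypothetical numbers for illustration, not measurements.

```python
def overall_reduction(shares: dict[str, float], reductions: dict[str, float]) -> float:
    """Weighted token reduction across usage categories.

    shares: fraction of total tokens spent in each category (must sum to 1).
    reductions: fraction of tokens removed within each category (default 0).
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * reductions.get(name, 0.0) for name, share in shares.items())


# Hypothetical breakdown, not measured data
shares = {"docs": 0.7, "prompts": 0.1, "other": 0.2}
reductions = {"docs": 0.9, "prompts": 0.25}
print(f"{overall_reduction(shares, reductions):.1%}")
```

Model routing is left out of this calculation because it changes which model handles the tokens rather than how many are sent. The practical takeaway is to measure where your tokens go before deciding which layer to set up first.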

I think the key reason most people fail at optimization is that they look for one magic bullet. The real solution is combining prompt compression, smart context fetching, and intelligent model routing.

Summary

In this post, I showed how to optimize Codex token usage through three layers: caveman for prompt compression, Context7 MCP for targeted documentation, and model routing for cost-efficient task distribution. The key point is that model routing provides the biggest impact, but combining all three layers yields 70-90% total token savings.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.

