How to Optimize Token Usage with Codex Plugins and Avoid Rate Limits

Apr 30, 2026

I hit my rate limit again. Mid-way through implementing a feature, the API just stopped responding:

Error: Rate limit exceeded. Please retry after 60 seconds.
You have used 150,000 tokens out of your 100,000 daily limit.

This happened three times this week. I knew I needed to fix my token usage, but I wasn’t sure where to start.

What I Found

After digging through Reddit discussions and documentation, I discovered that token optimization requires a three-layer approach:

Prompt-level - Compress prompts with the caveman plugin
Context-level - Fetch only the docs you need with Context7 MCP
Model-level - Route simple tasks to cheaper models

The biggest impact came from model routing, not plugins. But combining all three gave me the best results.

Layer 1: Prompt Optimization with Caveman

I installed the caveman plugin first. It optimizes prompts by:

Removing verbose language
Compressing redundant instructions
Structuring requests for token efficiency

The setup was simple, but I couldn’t find detailed public documentation. Based on community feedback, it works as a compression layer between my input and the model.

# Without optimization
codex "Please help me implement a debounce hook in React that handles cleanup properly"

# With caveman optimization
codex --plugin caveman "implement useDebounce hook with cleanup"

I saw about 20-30% token reduction on prompts. Not bad, but I needed more.
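
To check a number like this on your own prompts, a rough estimate is enough. The sketch below assumes the common rule of thumb of about four characters per token for English text, which is an approximation and not the tokenizer of any particular model. `estimate_tokens` and `reduction_percent` are illustrative names of mine, not part of caveman or Codex.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token (assumption)."""
    return max(1, round(len(text) / 4))


def reduction_percent(before: str, after: str) -> float:
    """Percentage of estimated tokens saved by the shorter prompt."""
    before_tokens = estimate_tokens(before)
    after_tokens = estimate_tokens(after)
    return 100 * (before_tokens - after_tokens) / before_tokens


verbose = "Please help me implement a debounce hook in React that handles cleanup properly"
compressed = "implement useDebounce hook with cleanup"

print(f"verbose:    ~{estimate_tokens(verbose)} tokens")
print(f"compressed: ~{estimate_tokens(compressed)} tokens")
print(f"reduction:  ~{reduction_percent(verbose, compressed):.0f}%")
```

This only measures the prompt text itself, and a single hand-picked pair will not match a session-wide average. For exact counts, use the tokenizer of the model you are calling.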

Layer 2: Context Optimization with Context7 MCP

Then I realized documentation lookups were eating most of my tokens. I was sending entire documentation files when I only needed specific sections.

Context7 MCP changed this completely. Instead of including full docs, it fetches only relevant sections.

{
  "mcpServers": {
    "context7": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}

Here’s the difference:

# Without Context7 (before)
Prompt: "Read all Next.js docs and help me with app router"
Tokens sent: ~500,000 (entire documentation)

# With Context7 (after)
Prompt: "Use context7 to get Next.js app router routing guide"
Tokens sent: ~5,000 (only relevant section)

That’s a 99% reduction on documentation lookups. For documentation-heavy prompts, Context7 became my most valuable optimization.
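
The 99% figure follows directly from the two approximate token counts in the comparison. As a quick check, here is the arithmetic as a small helper (the function name is mine):

```python
def savings_percent(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens saved when going from tokens_before to tokens_after."""
    if tokens_before <= 0:
        raise ValueError("tokens_before must be positive")
    return 100 * (tokens_before - tokens_after) / tokens_before


# Approximate counts from the before/after comparison
print(savings_percent(500_000, 5_000))  # 99.0
```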

Layer 3: Model Routing with OpenRouter

But the real game-changer was model routing. I was using the most expensive model for everything, even simple formatting tasks.

OpenRouter lets me route tasks to appropriate models based on complexity:

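// Keyword-based routing: each route maps a case-insensitive pattern to a model ID
const routing = {
  default: "anthropic/claude-3-haiku", // Cheaper default when no pattern matches

  routes: [
    {
      // Architecture and design work goes to the most capable model
      pattern: /(architecture|design|complex|refactor)/i,
      model: "anthropic/claude-opus-4"
    },
    {
      // Everyday implementation and debugging
      pattern: /(implement|debug|feature)/i,
      model: "anthropic/claude-3.5-sonnet"
    },
    {
      // Mechanical edits (same model as the default, listed for clarity)
      pattern: /(format|comment|simple|fix typo)/i,
      model: "anthropic/claude-3-haiku"
    }
  ]
};

export default routing;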
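
How a config like this gets applied depends on the tool that consumes it. As a minimal sketch, assuming routes are checked in order and the first matching pattern wins (my assumption, not documented OpenRouter behavior), the selection logic could look like this in Python. `pick_model` is a hypothetical helper; the patterns and model IDs are the ones from the config.

```python
import re

DEFAULT_MODEL = "anthropic/claude-3-haiku"

# (pattern, model) pairs, checked top to bottom; first match wins (assumption)
ROUTES = [
    (re.compile(r"(architecture|design|complex|refactor)", re.I), "anthropic/claude-opus-4"),
    (re.compile(r"(implement|debug|feature)", re.I), "anthropic/claude-3.5-sonnet"),
    (re.compile(r"(format|comment|simple|fix typo)", re.I), "anthropic/claude-3-haiku"),
]


def pick_model(prompt: str) -> str:
    """Return the model of the first route whose pattern matches the prompt."""
    for pattern, model in ROUTES:
        if pattern.search(prompt):
            return model
    return DEFAULT_MODEL


print(pick_model("format this file with prettier"))
print(pick_model("implement useDebounce hook with cleanup"))
print(pick_model("fix typo in the architecture overview"))
```

With first-match-wins, the last prompt goes to Opus because "architecture" matches before "fix typo" is ever tried, so the order of the routes matters as much as the patterns themselves.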

Here’s my cost comparison:

Task Type	Old Model	New Model	Cost Savings
Simple refactoring	Opus 4.5	Haiku 3.5	~90% cheaper
Code completion	Opus 4.5	Sonnet 3.5	~70% cheaper
Complex reasoning	Opus 4.5	Opus 4.5	Same (worth it)
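
How much this saves overall depends on how your spend is split across task types. As a worked example, the sketch below weights the per-task savings from the table by a hypothetical spend mix. The 30/50/20 split is made up for illustration and is not a measurement.

```python
# Per-task savings from the table (as fractions)
savings_by_task = {
    "simple_refactoring": 0.90,
    "code_completion": 0.70,
    "complex_reasoning": 0.00,
}

# Hypothetical share of total spend per task type (should sum to 1.0)
spend_share = {
    "simple_refactoring": 0.30,
    "code_completion": 0.50,
    "complex_reasoning": 0.20,
}


def overall_savings(savings: dict[str, float], share: dict[str, float]) -> float:
    """Spend-weighted average of the per-task savings."""
    return sum(savings[task] * share[task] for task in savings)


print(f"Overall savings: {overall_savings(savings_by_task, spend_share):.0%}")
```

With this mix the result is about 62%. The more of your spend that sits in complex reasoning, the lower the overall number gets.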

As one Reddit user said: “routing your simple tasks through a cheaper model via openrouter or routers like herma instead of using 5.5 for everything helps more than any plugin will”

Combined Workflow

I now use all three optimizations together:

# Step 1: Use Context7 for targeted docs (saves ~50k tokens)
codex "Use context7 to get React 18 useEffect docs"

# Step 2: Use caveman-optimized prompt
codex --plugin caveman "implement useDebounce hook with cleanup"

# Step 3: Route to appropriate model
# Simple task -> Haiku (cheap)
codex --route simple "format this file with prettier"

# Complex task -> Opus (expensive, but worth it)
codex --route complex "design authentication architecture for microservices"

I also added token budget tracking to avoid surprises:

class TokenBudget:
    """Tracks tokens used against a daily limit."""

    def __init__(self, daily_limit=100000):
        self.daily_limit = daily_limit
        self.used = 0

    def should_use_cheap_model(self, estimated_tokens):
        # Budget that would be left after sending this request
        remaining = self.daily_limit - self.used - estimated_tokens
        return remaining < self.daily_limit * 0.3  # Switch to cheap at 30% remaining

    def log_usage(self, model, tokens, task_type):
        # Record the usage first so the printed total includes this request
        self.used += tokens
        print(f"[{model}] {task_type}: {tokens} tokens")
        print(f"Daily budget: {self.used}/{self.daily_limit}")
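
A daily limit also implies a daily reset, which the tracker leaves out. The following is a minimal sketch of one way to add it and to tie the budget to model choice. The class name, the injectable `today` function, and the method names are assumptions for illustration; the 30% threshold and the model IDs are the ones used elsewhere in this post.

```python
from datetime import date


class DailyTokenBudget:
    """Tracks token usage per calendar day and resets when the date changes."""

    def __init__(self, daily_limit=100_000, cheap_threshold=0.3, today=date.today):
        self.daily_limit = daily_limit
        self.cheap_threshold = cheap_threshold
        self._today = today  # injectable so the reset can be tested
        self._day = today()
        self.used = 0

    def _roll_over(self):
        # Start a fresh budget if the calendar day has changed
        current = self._today()
        if current != self._day:
            self._day = current
            self.used = 0

    def record(self, tokens):
        self._roll_over()
        self.used += tokens

    def pick_model(self, estimated_tokens, preferred, cheap):
        """Fall back to the cheap model if the request would leave less
        than the threshold share of the daily budget."""
        self._roll_over()
        remaining_after = self.daily_limit - self.used - estimated_tokens
        if remaining_after < self.daily_limit * self.cheap_threshold:
            return cheap
        return preferred


budget = DailyTokenBudget()
budget.record(65_000)
# 100,000 - 65,000 - 10,000 = 25,000 left, which is under the 30,000 threshold
print(budget.pick_model(10_000, "anthropic/claude-opus-4", "anthropic/claude-3-haiku"))
```

Passing `today` in as a function keeps the reset logic testable without waiting for midnight.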

What Didn’t Work

I tried just using caveman without model routing. That saved maybe 20-30%, but I still hit rate limits.

I also tried implementing routing without Context7. That worked better, but documentation lookups still consumed too many tokens.

Why This Matters

Token optimization is not a single solution. It requires a layered strategy:

Optimization Layer	Token Savings	Setup Complexity
Caveman plugin	20-30%	Low
Context7 MCP	80-95% on docs	Medium
Model routing	50-80% overall	Medium-High
Combined approach	70-90% total	High

I think the main reason most people fail at optimization is that they look for one magic bullet. The real solution is combining prompt compression, smart context fetching, and intelligent model routing.

Summary

In this post, I showed how to optimize Codex token usage through three layers: caveman for prompt compression, Context7 MCP for targeted documentation, and model routing for cost-efficient task distribution. The key point is that model routing provides the biggest impact, but combining all three layers yields 70-90% total token savings.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 OpenRouter
👨‍💻 Context7 MCP

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!