How to Cache OpenAI API Responses Locally to Reduce Costs

Mar 30, 2026

Problem

I was burning through my OpenAI API budget. Every time I ran my test suite, the same prompts generated the same expensive responses. When I was debugging an issue, I’d make the same request multiple times, paying $0.05 each time for identical output.

Here’s what my bill looked like after a week of development:

Testing workflow (daily):
- Same 20 prompts x 5 runs = 100 identical calls
- Average cost: $0.03 per call
- Daily waste: $3.00 on repeated identical responses

Monthly cost from repeated calls: ~$90
Actual unique responses needed: ~$18
Wasted money: 5x the necessary cost

The OpenAI API has no built-in caching. Every call hits the API and charges per token, even if you made the identical request 10 seconds ago.

What happened?

I was already using ai-menshen as my local proxy (see my previous post on setting it up). But I hadn’t configured caching properly—I left the defaults and didn’t understand how the cache worked.

When I dug into the documentation, I found that ai-menshen caches successful HTTP 200 responses by default. The cache uses SQLite storage, so cached responses persist across restarts. But I needed to understand the configuration to make it work for my use case.

Here’s how the cache mechanism works:

Request arrives at ai-menshen
       │
       ▼
┌──────────────────┐
│   Cache Check    │── Same request body + model?
│   (SQLite DB)    │
└──────────────────┘
       │
    YES│     NO
       │      │
       ▼      ▼
┌─────────┐  ┌─────────────┐
│ Return  │  │ Call Upstream│
│ Cached  │  │ (OpenAI API) │
│ Response│  └─────────────┘
│         │         │
│ $0 cost │         ▼
│ 0ms     │  ┌─────────────┐
└─────────┘  │ Cache the   │
             │ Response    │
             └─────────────┘
                    │
                    ▼
             Return to client
             (cost incurred)

The cache matches requests based on:

Request body (including messages, model, parameters)
HTTP method and endpoint

If everything matches, you get the cached response instantly. No upstream call, no cost.

How to solve it?

Step 1: Check your cache configuration

I looked at my config file:

cat ~/.config/ai-menshen/config.toml

The cache section showed:

[cache]
enable = true              # Default: cache is ON
max_body_bytes = 5242880  # Default: 5 MiB max response size
max_age = 0               # Default: never expire

[logging]
log_request_body = true   # Required for cache matching
log_response_body = true  # Required for cache storage

The defaults looked fine for my use case. Cache was enabled, and response body logging was on (required for caching). But I had two problems:

max_age = 0 meant cached responses never expired—I needed TTL for dynamic content
I didn’t know if the cache was actually working

Step 2: Set appropriate TTL

I updated the config based on my use cases:

[cache]
enable = true
max_body_bytes = 5242880  # 5 MiB - fine for most responses
max_age = 3600           # 1 hour TTL for my testing workflow

[logging]
log_request_body = true
log_response_body = true

The TTL choice depends on your scenario:

| Use Case              | Recommended TTL | Why                           |
|-----------------------|-----------------|-------------------------------|
| Development testing   | 86400 (24h)     | Same prompts repeated often   |
| Production (static)   | 3600 (1h)       | Balance cost vs freshness     |
| Production (dynamic)  | 300 (5min)      | Dynamic content needs refresh |
| Code generation       | 0 (never)       | Same code = same output       |

max_age = 0 means cached responses never expire.
max_age > 0 means responses expire after X seconds.

Step 3: Verify cache is working

I made two identical requests to test:

# First request - should hit upstream
time curl http://localhost:8080/chat/completions \
  -H "Authorization: Bearer my-proxy-token" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'

# Output
# {"id":"chatcmpl-xxx","choices":[{"message":{"content":"4"}}]}
# real    2.3s  <- Actual API call took 2.3 seconds

# Second identical request - should be cached
time curl http://localhost:8080/chat/completions \
  -H "Authorization: Bearer my-proxy-token" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'

# Output
# {"id":"chatcmpl-xxx","choices":[{"message":{"content":"4"}}]}
# real    0.01s  <- Cached response took 10ms

The second request was 230x faster. It returned from cache without calling OpenAI.

Step 4: Check the dashboard

I opened the ai-menshen dashboard at http://localhost:8080/. The logs view showed:

Request #1:
- Endpoint: /chat/completions
- Model: gpt-4o
- Status: 200 (from upstream)
- Latency: 2300ms
- Cost: incurred

Request #2:
- Endpoint: /chat/completions
- Model: gpt-4o
- Status: 200 (from cache)
- Latency: 10ms
- Cost: $0

The dashboard clearly showed which requests came from cache. This gave me confidence the feature was working.

The reason

Why does caching work this way?

Cache matching is strict: The cache key includes the entire request body. If you change one character in your prompt, it’s a cache miss. This ensures you always get the correct response for your exact request.

Same request body + same model + same endpoint = Cache HIT
Any difference in request = Cache MISS (fresh upstream call)

Request 1: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
→ Cached

Request 2: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
→ Cache HIT (identical)

Request 3: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}
→ Cache MISS (different content, exclamation mark added)

Request 4: {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}
→ Cache MISS (different model)

SQLite persistence: Cached responses are stored in SQLite. When you restart ai-menshen, the cache survives. Your cached responses work across sessions.

Body size limits: Large responses might not be worth caching (storage cost vs. API savings). The max_body_bytes setting lets you skip caching huge responses:

[cache]
max_body_bytes = 5242880  # 5 MiB default

# Responses larger than 5 MiB are NOT cached
# They still work, but always hit upstream

I learned some common mistakes to avoid:

Mistake 1: Disabling response body logging

# BAD: Cache won't work without response body logging
[logging]
log_response_body = false  # Cache cannot store responses

[cache]
enable = true  # Won't cache anything!

# GOOD: Enable response body logging for cache
[logging]
log_response_body = true  # Required for cache

[cache]
enable = true

Mistake 2: Setting TTL too long for dynamic content

# BAD: 24-hour cache for news/sports content
[cache]
max_age = 86400  # Content stale after 24 hours

# GOOD: Short TTL for dynamic content
[cache]
max_age = 300  # 5 minutes for dynamic topics

Mistake 3: Cache body size too restrictive

# BAD: Can't cache large code generation responses
[cache]
max_body_bytes = 10240  # Only 10 KB - too small

# GOOD: Allow larger responses for code generation
[cache]
max_body_bytes = 10485760  # 10 MiB for code output

Real cost savings

I tracked my API costs before and after proper cache configuration:

Before cache optimization:
- Daily testing: 100 calls, ~$3.00
- Monthly: ~$90 wasted on repeated calls

After cache optimization (1-hour TTL):
- Daily testing: 20 unique calls, 80 cached
- Daily cost: ~$0.60 (unique calls only)
- Monthly: ~$18

Savings: $72/month (80% reduction)

The savings depend on your workflow. If you run the same prompts repeatedly (testing, development, debugging), caching can cut costs dramatically.

Summary

In this post, I showed how to use ai-menshen’s caching feature to reduce OpenAI API costs. The key point is that identical requests return cached responses instantly with zero cost.

Configure max_age for your use case (TTL), ensure log_response_body = true, and verify caching works through the dashboard or timing tests. The SQLite-based cache persists across restarts, making it reliable for long-term cost savings.

For development and testing workflows where prompts are repeated, caching can reduce costs by 80% or more.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 ai-menshen GitHub Repository
👨‍💻 OpenAI API Pricing

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!