
How to Cache OpenAI API Responses Locally to Reduce Costs

Problem

I was burning through my OpenAI API budget. Every time I ran my test suite, the same prompts generated the same expensive responses. While debugging an issue, I would send the same request several times and pay a few cents each time for identical output.

Here’s what my bill looked like after a week of development:

cost-analysis.txt
Testing workflow (daily):
- Same 20 prompts x 5 runs = 100 identical calls
- Average cost: $0.03 per call
- Daily waste: $3.00 on repeated identical responses
Monthly cost including repeated calls: ~$90
Actual unique responses needed: ~$18
Overspend: 5x the necessary cost (~$72/month wasted)
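The arithmetic behind these figures can be checked with a short sketch. It assumes a 30-day month and a flat $0.03 per call, as in the numbers above; the function name `repeated_call_costs` is made up for illustration.

```python
def repeated_call_costs(unique_prompts, runs_per_day, cost_per_call, days=30):
    """Return (monthly_total, monthly_necessary, monthly_wasted) in dollars."""
    daily_calls = unique_prompts * runs_per_day
    monthly_total = daily_calls * cost_per_call * days
    # Only one call per unique prompt per day is strictly necessary.
    monthly_necessary = unique_prompts * cost_per_call * days
    return monthly_total, monthly_necessary, monthly_total - monthly_necessary


total, necessary, wasted = repeated_call_costs(20, 5, 0.03)
print(f"total=${total:.2f} necessary=${necessary:.2f} wasted=${wasted:.2f}")
# prints: total=$90.00 necessary=$18.00 wasted=$72.00
```

So the total bill is 5x what the unique calls would cost, and about $72 of the $90 pays for repeats.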

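The OpenAI API does not serve repeated requests from a stored response. Every call runs the model again and is billed per token, even if you made the identical request 10 seconds ago. (OpenAI's prompt caching discounts repeated input prefixes on long prompts, but the output is still generated and billed.)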

What happened?

I was already using ai-menshen as my local proxy (see my previous post on setting it up). But I hadn’t configured caching properly—I left the defaults and didn’t understand how the cache worked.

When I dug into the documentation, I found that ai-menshen caches successful HTTP 200 responses by default. The cache uses SQLite storage, so cached responses persist across restarts. But I needed to understand the configuration to make it work for my use case.

Here’s how the cache mechanism works:

cache-flow.txt
    Request arrives at ai-menshen
             │
             ▼
    ┌──────────────────┐
    │ Cache Check      │── Same request body + model?
    │ (SQLite DB)      │
    └──────────────────┘
   YES│            NO│
      ▼              ▼
┌──────────┐  ┌──────────────┐
│ Return   │  │ Call Upstream│
│ Cached   │  │ (OpenAI API) │
│ Response │  └──────────────┘
│          │         │
│ $0 cost  │         ▼
│ 0ms      │  ┌──────────────┐
└──────────┘  │ Cache the    │
              │ Response     │
              └──────────────┘
                     │
                     ▼
              Return to client
              (cost incurred)

The cache matches requests based on:

  • Request body (including messages, model, parameters)
  • HTTP method and endpoint

If everything matches, you get the cached response instantly. No upstream call, no cost.
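To make the flow concrete, here is a minimal read-through cache sketch in Python. It is not ai-menshen's actual code: the class name `ReadThroughCache`, the in-memory dict (the real cache lives in SQLite), and the `fake_upstream` function are all assumptions for illustration.

```python
import json


class ReadThroughCache:
    """In-memory stand-in for the proxy's cache (hypothetical, for illustration)."""

    def __init__(self, upstream):
        # upstream: callable(method, path, body) -> (status, response_body)
        self.upstream = upstream
        self.store = {}
        self.upstream_calls = 0

    def handle(self, method, path, body):
        key = (method, path, body)  # method + endpoint + exact request body
        if key in self.store:
            return self.store[key], "cache"
        self.upstream_calls += 1
        status, response = self.upstream(method, path, body)
        if status == 200:  # only successful responses are stored
            self.store[key] = response
        return response, "upstream"


def fake_upstream(method, path, body):
    # Pretend to be the OpenAI API; a real call would cost money here.
    return 200, b'{"choices":[{"message":{"content":"4"}}]}'


proxy = ReadThroughCache(fake_upstream)
body = json.dumps(
    {"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}
).encode()

print(proxy.handle("POST", "/chat/completions", body)[1])  # upstream
print(proxy.handle("POST", "/chat/completions", body)[1])  # cache
print(proxy.upstream_calls)  # 1
```

The second call never reaches `fake_upstream`, which is the whole point: one billed call, any number of free replays.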

How to solve it?

Step 1: Check your cache configuration

I looked at my config file:

check-config.sh
cat ~/.config/ai-menshen/config.toml

The cache section showed:

default-cache.toml
[cache]
enable = true # Default: cache is ON
max_body_bytes = 5242880 # Default: 5 MiB max response size
max_age = 0 # Default: never expire
[logging]
log_request_body = true # Required for cache matching
log_response_body = true # Required for cache storage

The defaults looked fine for my use case. Cache was enabled, and response body logging was on (required for caching). But I had two problems:

  1. max_age = 0 meant cached responses never expired—I needed TTL for dynamic content
  2. I didn’t know if the cache was actually working

Step 2: Set appropriate TTL

I updated the config based on my use cases:

updated-cache.toml
[cache]
enable = true
max_body_bytes = 5242880 # 5 MiB - fine for most responses
max_age = 3600 # 1 hour TTL for my testing workflow
[logging]
log_request_body = true
log_response_body = true

The TTL choice depends on your scenario:

ttl-guidelines.txt
| Use Case             | Recommended TTL   | Why                           |
|----------------------|-------------------|-------------------------------|
| Development testing  | 86400 (24h)       | Same prompts repeated often   |
| Production (static)  | 3600 (1h)         | Balance cost vs freshness     |
| Production (dynamic) | 300 (5min)        | Dynamic content needs refresh |
| Code generation      | 0 (never expires) | Same prompt = same output     |

max_age = 0 means cached responses never expire.
max_age > 0 means responses expire after that many seconds.
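The two max_age rules can be written as a small freshness check. This is a sketch of the semantics described above, not ai-menshen's implementation. The function name `is_fresh` is hypothetical, and treating an entry as expired at exactly max_age seconds is my assumption.

```python
def is_fresh(stored_at, now, max_age):
    """Return True if an entry stored at `stored_at` is still usable at `now`.

    Times are Unix timestamps in seconds. max_age == 0 means never expire.
    """
    if max_age == 0:
        return True
    # Assumption: an entry exactly max_age seconds old counts as expired.
    return (now - stored_at) < max_age


print(is_fresh(stored_at=0, now=3599, max_age=3600))   # True
print(is_fresh(stored_at=0, now=3600, max_age=3600))   # False
print(is_fresh(stored_at=0, now=10**9, max_age=0))     # True
```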

Step 3: Verify cache is working

I made two identical requests to test:

test-cache.sh
# First request - should hit upstream
time curl http://localhost:8080/chat/completions \
-H "Authorization: Bearer my-proxy-token" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'
# Output
# {"id":"chatcmpl-xxx","choices":[{"message":{"content":"4"}}]}
# real 2.3s <- Actual API call took 2.3 seconds
# Second identical request - should be cached
time curl http://localhost:8080/chat/completions \
-H "Authorization: Bearer my-proxy-token" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'
# Output
# {"id":"chatcmpl-xxx","choices":[{"message":{"content":"4"}}]}
# real 0.01s <- Cached response took 10ms

The second request was 230x faster. It returned from cache without calling OpenAI.
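The same check can be scripted instead of eyeballing `time` output. The sketch below uses only the Python standard library. The URL and token are the ones from my curl test, and the helper names (`post_json`, `time_repeated`, `check_cache`) are made up for illustration.

```python
import json
import time
import urllib.request


def post_json(url, token, payload):
    """POST `payload` as JSON and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def time_repeated(send, repeats=2):
    """Call `send()` `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        send()
        durations.append(time.perf_counter() - start)
    return durations


def check_cache(url="http://localhost:8080/chat/completions", token="my-proxy-token"):
    """Send the same request twice and print both timings."""
    payload = {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
    }
    first, second = time_repeated(lambda: post_json(url, token, payload))
    speedup = first / max(second, 1e-9)  # guard against a zero-length timing
    print(f"first: {first:.3f}s  second: {second:.3f}s  speedup: {speedup:.0f}x")
```

With the proxy running, calling `check_cache()` should print a much smaller second timing if the cache is working. Both requests serialize the same dict, so the request bodies are byte-for-byte identical.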

Step 4: Check the dashboard

I opened the ai-menshen dashboard at http://localhost:8080/. The logs view showed:

dashboard-logs.txt
Request #1:
- Endpoint: /chat/completions
- Model: gpt-4o
- Status: 200 (from upstream)
- Latency: 2300ms
- Cost: incurred
Request #2:
- Endpoint: /chat/completions
- Model: gpt-4o
- Status: 200 (from cache)
- Latency: 10ms
- Cost: $0

The dashboard clearly showed which requests came from cache. This gave me confidence the feature was working.

The reason

Why does caching work this way?

Cache matching is strict: The cache key includes the entire request body. If you change one character in your prompt, it’s a cache miss. This ensures you always get the correct response for your exact request.

cache-matching.txt
Same request body + same model + same endpoint = Cache HIT
Any difference in request = Cache MISS (fresh upstream call)
Request 1: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
→ Cached
Request 2: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
→ Cache HIT (identical)
Request 3: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}
→ Cache MISS (different content, exclamation mark added)
Request 4: {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}
→ Cache MISS (different model)
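One plausible way to get this strict behavior is to hash the method, endpoint, and raw body into a single key. The sketch below replays the four requests above against such a key. The function `cache_key` and the SHA-256 choice are assumptions; the post's sources do not say how ai-menshen derives its key internally.

```python
import hashlib
import json


def cache_key(method, path, body):
    """Hash method, path, and raw body bytes into one hex digest."""
    digest = hashlib.sha256()
    digest.update(method.upper().encode())
    digest.update(b"\n")
    digest.update(path.encode())
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()


def chat_body(model, content):
    """Serialize a one-message chat request the same way every time."""
    return json.dumps(
        {"model": model, "messages": [{"role": "user", "content": content}]}
    ).encode()


bodies = [
    chat_body("gpt-4o", "Hello"),
    chat_body("gpt-4o", "Hello"),
    chat_body("gpt-4o", "Hello!"),
    chat_body("gpt-4", "Hello"),
]

seen = set()
for number, body in enumerate(bodies, start=1):
    key = cache_key("POST", "/chat/completions", body)
    print(f"Request {number}: {'HIT' if key in seen else 'MISS'}")
    seen.add(key)
# Request 1: MISS (then stored), Request 2: HIT, Request 3: MISS, Request 4: MISS
```

Under a byte-level key like this one, even a whitespace or key-order difference in the JSON would be a miss, so it helps to build request bodies the same way every time.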

SQLite persistence: Cached responses are stored in SQLite. When you restart ai-menshen, the cache survives. Your cached responses work across sessions.
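To illustrate why a SQLite-backed cache survives restarts, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the helper names (`open_cache`, `put`, `get`) are hypothetical and not ai-menshen's real schema.

```python
import os
import sqlite3
import tempfile


def open_cache(path):
    """Open (or create) a cache database at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "key TEXT PRIMARY KEY, response BLOB NOT NULL, stored_at REAL NOT NULL)"
    )
    return conn


def put(conn, key, response, stored_at):
    with conn:  # commits the transaction on success
        conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, response, stored_at),
        )


def get(conn, key):
    row = conn.execute(
        "SELECT response FROM cache WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


db_path = os.path.join(tempfile.mkdtemp(), "cache.db")

conn = open_cache(db_path)
put(conn, "abc", b'{"answer": "4"}', 0.0)
conn.close()  # simulate stopping the proxy

conn = open_cache(db_path)  # simulate starting it again
print(get(conn, "abc"))  # b'{"answer": "4"}'
conn.close()
```

Because the entry lives in a file on disk, the second connection finds it even though the first one was closed.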

Body size limits: Large responses might not be worth caching (storage cost vs. API savings). The max_body_bytes setting lets you skip caching huge responses:

size-limit.toml
[cache]
max_body_bytes = 5242880 # 5 MiB default
# Responses larger than 5 MiB are NOT cached
# They still work, but always hit upstream
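Put together with the "HTTP 200 only" rule, the storage decision looks roughly like this. It is a sketch of the behavior described in this post; the function name `should_cache` is hypothetical, and the default mirrors the 5 MiB setting.

```python
def should_cache(status, body, max_body_bytes=5 * 1024 * 1024):
    """Return True if a response qualifies for caching.

    Only successful (HTTP 200) responses no larger than max_body_bytes qualify.
    """
    return status == 200 and len(body) <= max_body_bytes


print(should_cache(200, b"x" * 1024))                  # True
print(should_cache(200, b"x" * (5 * 1024 * 1024 + 1)))  # False: over 5 MiB
print(should_cache(429, b"rate limited"))              # False: not a 200
```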

I learned some common mistakes to avoid:

Mistake 1: Disabling response body logging

logging-mistake.toml
# BAD: Cache won't work without response body logging
[logging]
log_response_body = false # Cache cannot store responses
[cache]
enable = true # Won't cache anything!
# GOOD: Enable response body logging for cache
[logging]
log_response_body = true # Required for cache
[cache]
enable = true

Mistake 2: Setting TTL too long for dynamic content

ttl-mistake.toml
# BAD: 24-hour cache for news/sports content
[cache]
max_age = 86400 # Answers can be up to 24 hours stale
# GOOD: Short TTL for dynamic content
[cache]
max_age = 300 # 5 minutes for dynamic topics

Mistake 3: Cache body size too restrictive

size-mistake.toml
# BAD: Can't cache large code generation responses
[cache]
max_body_bytes = 10240 # Only 10 KB - too small
# GOOD: Allow larger responses for code generation
[cache]
max_body_bytes = 10485760 # 10 MiB for code output

Real cost savings

I tracked my API costs before and after proper cache configuration:

cost-comparison.txt
Before cache optimization:
- Daily testing: 100 calls, ~$3.00
- Monthly: ~$90 total, of which ~$72 went to repeated calls
After cache optimization (1-hour TTL):
- Daily testing: 20 unique calls, 80 cached (when all 5 runs fall within the TTL window)
- Daily cost: ~$0.60 (unique calls only)
- Monthly: ~$18
Savings: $72/month (80% reduction)

The savings depend on your workflow. If you run the same prompts repeatedly (testing, development, debugging), caching can cut costs dramatically.

Summary

In this post, I showed how to use ai-menshen’s caching feature to reduce OpenAI API costs. The key point is that identical requests return cached responses instantly with zero cost.

Configure max_age for your use case (TTL), ensure log_response_body = true, and verify caching works through the dashboard or timing tests. The SQLite-based cache persists across restarts, making it reliable for long-term cost savings.

For development and testing workflows where prompts are repeated, caching can reduce costs by 80% or more.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
