How to Cache OpenAI API Responses Locally to Reduce Costs
Problem
I was burning through my OpenAI API budget. Every time I ran my test suite, the same prompts generated the same expensive responses. When I was debugging an issue, I’d make the same request multiple times, paying $0.05 each time for identical output.
Here’s what my bill looked like after a week of development:
Testing workflow (daily):
- Same 20 prompts x 5 runs = 100 calls, 80 of them repeats
- Average cost: $0.03 per call
- Daily cost: $3.00, of which $2.40 goes to repeated identical responses

Monthly cost including repeated calls: ~$90
Actual unique responses needed: ~$18
Total spend: 5x the necessary cost
The OpenAI API has no built-in response caching. Every call hits the API and is billed per token, even if you made the identical request 10 seconds ago.
What happened?
I was already using ai-menshen as my local proxy (see my previous post on setting it up). But I hadn’t configured caching properly: I left the defaults and didn’t understand how the cache worked.
When I dug into the documentation, I found that ai-menshen caches successful HTTP 200 responses by default. The cache uses SQLite storage, so cached responses persist across restarts. But I needed to understand the configuration to make it work for my use case.
Here’s how the cache mechanism works:
Request arrives at ai-menshen
         │
         ▼
┌──────────────────┐
│   Cache Check    │── Same request body + model?
│   (SQLite DB)    │
└──────────────────┘
         │
     YES │ NO
   │            │
   ▼            ▼
┌──────────┐  ┌───────────────┐
│ Return   │  │ Call Upstream │
│ Cached   │  │ (OpenAI API)  │
│ Response │  └───────────────┘
│          │          │
│ $0 cost  │          ▼
│ 0ms      │  ┌───────────────┐
└──────────┘  │ Cache the     │
              │ Response      │
              └───────────────┘
                      │
                      ▼
              Return to client
              (cost incurred)

The cache matches requests based on:
- Request body (including messages, model, parameters)
- HTTP method and endpoint
If everything matches, you get the cached response instantly. No upstream call, no cost.
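To make the matching rule concrete, here is a minimal Python sketch of how a cache key could be derived from the HTTP method, the endpoint, and the raw request body. It is an illustration only, not ai-menshen's actual implementation: the `cache_key` function and the choice of SHA-256 are my assumptions.

```python
import hashlib


def cache_key(method: str, endpoint: str, body: bytes) -> str:
    """Derive a deterministic key from method, endpoint, and raw body bytes.

    Hypothetical helper for illustration; the real proxy may build its
    key differently.
    """
    digest = hashlib.sha256()
    digest.update(method.upper().encode("utf-8"))
    digest.update(b"\n")
    digest.update(endpoint.encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)  # any changed byte in the body changes the key
    return digest.hexdigest()


body_a = b'{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'
body_b = b'{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+3?"}]}'

key_1 = cache_key("POST", "/chat/completions", body_a)
key_2 = cache_key("POST", "/chat/completions", body_a)
key_3 = cache_key("POST", "/chat/completions", body_b)

print(key_1 == key_2)  # True: identical request, same key
print(key_1 == key_3)  # False: one character differs, different key
```

Hashing the raw bytes keeps the sketch consistent with strict matching: two bodies that differ in a single character never share a key.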
How to solve it?
Step 1: Check your cache configuration
I looked at my config file:
cat ~/.config/ai-menshen/config.toml

The cache section showed:
[cache]
enable = true              # Default: cache is ON
max_body_bytes = 5242880   # Default: 5 MiB max response size
max_age = 0                # Default: never expire

[logging]
log_request_body = true    # Required for cache matching
log_response_body = true   # Required for cache storage

The defaults looked fine for my use case. Cache was enabled, and response body logging was on (required for caching). But I had two problems:
- max_age = 0 meant cached responses never expired, and I needed a TTL for dynamic content
- I didn’t know if the cache was actually working
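A check like this is easy to automate. The sketch below is a hypothetical helper of my own (`check_cache_config` is not part of ai-menshen) that inspects an already-parsed config dictionary and reports the settings discussed above. It assumes that a missing key falls back to the "on" default.

```python
from typing import List


def check_cache_config(cfg: dict) -> List[str]:
    """Return warnings for a parsed ai-menshen-style config (illustrative)."""
    warnings = []
    cache = cfg.get("cache", {})
    logging_cfg = cfg.get("logging", {})

    # Assumption: a missing key means the default (on).
    if cache.get("enable", True) is not True:
        warnings.append("cache.enable is not true: nothing will be cached")
    for key in ("log_request_body", "log_response_body"):
        if logging_cfg.get(key, True) is not True:
            warnings.append(f"logging.{key} is not true: the cache needs it")
    if cache.get("max_age", 0) == 0:
        warnings.append("cache.max_age is 0: cached responses never expire")
    return warnings


# The config from this step, written as the dict a TOML parser would produce.
config = {
    "cache": {"enable": True, "max_body_bytes": 5242880, "max_age": 0},
    "logging": {"log_request_body": True, "log_response_body": True},
}

for warning in check_cache_config(config):
    print(warning)
```

On Python 3.11 or newer, the dictionary can be produced from the real file with the standard library: open the file in binary mode and pass it to `tomllib.load`.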
Step 2: Set appropriate TTL
I updated the config based on my use cases:
[cache]
enable = true
max_body_bytes = 5242880   # 5 MiB - fine for most responses
max_age = 3600             # 1 hour TTL for my testing workflow

[logging]
log_request_body = true
log_response_body = true

The TTL choice depends on your scenario:
| Use Case              | Recommended TTL (seconds) | Why                           |
|-----------------------|---------------------------|-------------------------------|
| Development testing   | 86400 (24h)               | Same prompts repeated often   |
| Production (static)   | 3600 (1h)                 | Balance cost vs freshness     |
| Production (dynamic)  | 300 (5min)                | Dynamic content needs refresh |
| Code generation       | 0 (never expires)         | Same code = same output       |
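The TTL values in the table are in seconds. As a rough sketch of how an expiry check with these semantics could look, here is a small Python function. The name `is_fresh` and the handling of the exact boundary (an entry counts as expired once its age reaches `max_age`) are my assumptions, not ai-menshen's code.

```python
import time
from typing import Optional


def is_fresh(stored_at: float, max_age: int, now: Optional[float] = None) -> bool:
    """Return True if a cached entry may still be served.

    stored_at: Unix timestamp when the response was cached.
    max_age:   TTL in seconds; 0 means the entry never expires.
    now:       current time, injectable so the check is easy to test.
    """
    if max_age == 0:
        return True  # never expire
    if now is None:
        now = time.time()
    return (now - stored_at) < max_age


cached_at = 1_000_000.0
print(is_fresh(cached_at, max_age=3600, now=cached_at + 60))    # True: 1 minute old
print(is_fresh(cached_at, max_age=3600, now=cached_at + 7200))  # False: 2 hours old
print(is_fresh(cached_at, max_age=0, now=cached_at + 10**9))    # True: no expiry
```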
max_age = 0 means cached responses never expire.
max_age > 0 means responses expire after X seconds.
Step 3: Verify cache is working
I made two identical requests to test:
# First request - should hit upstream
time curl http://localhost:8080/chat/completions \
  -H "Authorization: Bearer my-proxy-token" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'

# Output
# {"id":"chatcmpl-xxx","choices":[{"message":{"content":"4"}}]}
# real 2.3s   <- Actual API call took 2.3 seconds

# Second identical request - should be cached
time curl http://localhost:8080/chat/completions \
  -H "Authorization: Bearer my-proxy-token" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is 2+2?"}]}'

# Output
# {"id":"chatcmpl-xxx","choices":[{"message":{"content":"4"}}]}
# real 0.01s  <- Cached response took 10ms

The second request was 230x faster. It returned from cache without calling OpenAI.
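The same timing test can be scripted. The sketch below is my own illustration: `time_repeated` and `looks_cached` are hypothetical helpers, and `send` stands for any zero-argument function that performs the request (for example, one that posts the JSON body above to the proxy). The 10x speedup threshold is an arbitrary assumption; pick one that suits your latencies.

```python
import time
from typing import Callable, List


def time_repeated(send: Callable[[], object], runs: int = 2) -> List[float]:
    """Call `send` several times and return each call's duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        send()
        durations.append(time.perf_counter() - start)
    return durations


def looks_cached(durations: List[float], speedup: float = 10.0) -> bool:
    """Heuristic: every repeat is at least `speedup` times faster than the first call."""
    first, rest = durations[0], durations[1:]
    return bool(rest) and all(d * speedup <= first for d in rest)


# The timings measured above: 2.3 s from upstream, 0.01 s from cache.
print(looks_cached([2.3, 0.01]))  # True
print(looks_cached([2.3, 2.1]))   # False: the repeat was not meaningfully faster
```

Latency is only indirect evidence, so I treat this as a quick smoke test and rely on the dashboard for confirmation.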
Step 4: Check the dashboard
I opened the ai-menshen dashboard at http://localhost:8080/. The logs view showed:
Request #1:
- Endpoint: /chat/completions
- Model: gpt-4o
- Status: 200 (from upstream)
- Latency: 2300ms
- Cost: incurred

Request #2:
- Endpoint: /chat/completions
- Model: gpt-4o
- Status: 200 (from cache)
- Latency: 10ms
- Cost: $0

The dashboard clearly showed which requests came from cache. This gave me confidence the feature was working.
The reason
Why does caching work this way?
Cache matching is strict: The cache key includes the entire request body. If you change one character in your prompt, it’s a cache miss. This ensures you always get the correct response for your exact request.
Same request body + same model + same endpoint = Cache HIT
Any difference in request = Cache MISS (fresh upstream call)

Request 1: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
→ Cached

Request 2: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}
→ Cache HIT (identical)

Request 3: {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}
→ Cache MISS (different content, exclamation mark added)

Request 4: {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}
→ Cache MISS (different model)

SQLite persistence: Cached responses are stored in SQLite. When you restart ai-menshen, the cache survives. Your cached responses work across sessions.
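To show why a SQLite-backed cache survives restarts, here is a small self-contained Python sketch. The table layout and the `open_cache`, `put`, and `get` helpers are hypothetical, chosen for illustration; I have not inspected ai-menshen's real schema. For brevity the raw request body serves as the key.

```python
import os
import sqlite3
import tempfile
from typing import Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) a file-backed cache database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def put(conn: sqlite3.Connection, key: str, response: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)", (key, response)
    )
    conn.commit()


def get(conn: sqlite3.Connection, key: str) -> Optional[str]:
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


db_path = os.path.join(tempfile.mkdtemp(), "cache.db")
request_1 = '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'
request_3 = '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}'

conn = open_cache(db_path)
put(conn, request_1, '{"content": "Hi there"}')
conn.close()  # simulate stopping the proxy

conn = open_cache(db_path)  # simulate a restart: same file, new connection
hit = get(conn, request_1)   # stored before the "restart", still present
miss = get(conn, request_3)  # differs by one character, never stored
conn.close()

print(hit)   # {"content": "Hi there"}
print(miss)  # None
```

Because the data lives in a file rather than in process memory, closing and reopening the connection does not lose it.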
Body size limits: Large responses might not be worth caching (storage cost vs. API savings). The max_body_bytes setting lets you skip caching huge responses:
[cache]
max_body_bytes = 5242880   # 5 MiB default

# Responses larger than 5 MiB are NOT cached
# They still work, but always hit upstream

I learned some common mistakes to avoid:
Mistake 1: Disabling response body logging
# BAD: Cache won't work without response body logging
[logging]
log_response_body = false   # Cache cannot store responses

[cache]
enable = true               # Won't cache anything!

# GOOD: Enable response body logging for cache
[logging]
log_response_body = true    # Required for cache

[cache]
enable = true

Mistake 2: Setting TTL too long for dynamic content
# BAD: 24-hour cache for news/sports content
[cache]
max_age = 86400   # Content stale after 24 hours

# GOOD: Short TTL for dynamic content
[cache]
max_age = 300     # 5 minutes for dynamic topics

Mistake 3: Cache body size too restrictive
# BAD: Can't cache large code generation responses
[cache]
max_body_bytes = 10240      # Only 10 KB - too small

# GOOD: Allow larger responses for code generation
[cache]
max_body_bytes = 10485760   # 10 MiB for code output

Real cost savings
I tracked my API costs before and after proper cache configuration:
Before cache optimization:
- Daily testing: 100 calls, ~$3.00
- Monthly: ~$90, most of it spent on repeated calls

After cache optimization (1-hour TTL):
- Daily testing: 20 unique calls, 80 cached
- Daily cost: ~$0.60 (unique calls only)
- Monthly: ~$18

Savings: $72/month (80% reduction)
The savings depend on your workflow. If you run the same prompts repeatedly (testing, development, debugging), caching can cut costs dramatically.
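To estimate the savings for a different workflow, the arithmetic above can be wrapped in a small function. This is a back-of-the-envelope sketch with a hypothetical `estimate_costs` helper. It assumes a flat average cost per call and that each unique prompt reaches the upstream API once per day, with every repeat served from cache.

```python
def estimate_costs(unique_prompts: int, runs_per_day: int,
                   cost_per_call: float, days: int = 30) -> dict:
    """Estimate spend with and without caching over `days` days."""
    uncached = unique_prompts * runs_per_day * cost_per_call * days
    cached = unique_prompts * cost_per_call * days  # one upstream call per prompt per day
    savings = uncached - cached
    reduction = savings / uncached if uncached else 0.0
    return {"uncached": uncached, "cached": cached,
            "savings": savings, "reduction": reduction}


# The numbers from this post: 20 prompts, 5 runs a day, $0.03 per call.
result = estimate_costs(unique_prompts=20, runs_per_day=5, cost_per_call=0.03)
print(f"Without cache: ${result['uncached']:.2f}")  # Without cache: $90.00
print(f"With cache:    ${result['cached']:.2f}")    # With cache:    $18.00
print(f"Savings:       ${result['savings']:.2f} ({result['reduction']:.0%})")  # Savings:       $72.00 (80%)
```

With these assumptions the reduction depends only on the number of runs per day: five runs of the same prompts give 80%, ten runs give 90%.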
Summary
In this post, I showed how to use ai-menshen’s caching feature to reduce OpenAI API costs. The key point is that identical requests return cached responses instantly with zero cost.
Configure max_age for your use case (TTL), ensure log_response_body = true, and verify caching works through the dashboard or timing tests. The SQLite-based cache persists across restarts, making it reliable for long-term cost savings.
For development and testing workflows where prompts are repeated, caching can reduce costs by 80% or more.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in a way that helps others. If you want to contact me, you can reach me by email.