How I Cut My AI Inference Costs by 78% Using Local Models
Problem
I opened my OpenRouter bill and saw $200 gone in a week. I wasn’t building anything crazy—just some content generation and code assistance for my side projects.
Here’s what my usage looked like:
Week 1 spending:
- Claude for simple text formatting: $45
- GPT-4 for basic Q&A: $60
- Claude for moderate content tasks: $55
- Claude for actual complex reasoning: $40

Total: $200 (and counting)

I realized I was using premium models for tasks that didn’t need premium intelligence. Formatting JSON? That’s a $45 waste. Basic Q&A? $60 down the drain.
There had to be a better way.
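To put a number on that waste, here is a small sketch that totals the week's line items and separates out the one that arguably needed a premium model. The dictionary keys and the `needed_premium` set are labels chosen for this example, not anything exported from OpenRouter.

```python
# Week 1 spend from the breakdown above, in USD.
week1_spend = {
    "simple text formatting": 45,
    "basic Q&A": 60,
    "moderate content tasks": 55,
    "complex reasoning": 40,
}

# Assumption: only the complex-reasoning line item justified a premium model.
needed_premium = {"complex reasoning"}

total = sum(week1_spend.values())
overspend = sum(
    cost for task, cost in week1_spend.items() if task not in needed_premium
)

print(f"Total: ${total}")
print(f"Spent on tasks a cheaper tier could handle: ${overspend} ({overspend / total:.0%})")
```

Under that assumption, $160 of the $200 (80%) went to work that did not need a premium model.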
What I Learned
I found a Reddit thread where developers shared their cost-cutting strategies. One user said: “Get a VPS with 32GB RAM—install Ollama and run Qwen locally for zero-cost inference.”
Another mentioned: “I use DeepSeek V3 for fractions of a penny per call.”
But someone else warned: “Mine just blew through $200 OpenRouter credits on nothing.”
The pattern was clear: the developers saving money had a routing strategy. The ones losing money sent everything to the most expensive model.
My Three-Tier Solution
I built a simple routing system that matches task complexity to model cost:
                    Task Complexity
                           │
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
        Simple          Moderate         Complex
           │               │               │
           ▼               ▼               ▼
     Local Ollama    DeepSeek API    Claude/GPT-4
        (FREE)      (~$0.001/call)     (Premium)
           │               │               │
      Formatting        Content      Architecture
       Basic Q&A      Generation       Reasoning
     Simple Tasks      Analysis     Critical Tasks

Let me show you how I set this up.
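As a rough sketch, the diagram can also be written down as a lookup table before any model is wired up. The `Tier` dataclass, its field names, and the `describe` helper are hypothetical, and the per-call costs are the approximate labels from the diagram, not measured values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Tier:
    backend: str                    # which model family serves this tier
    approx_cost: Optional[float]    # USD per call; None where the diagram only says "Premium"
    examples: Tuple[str, ...]       # task types listed under this branch

TIERS = {
    "simple": Tier("Local Ollama", 0.0, ("Formatting", "Basic Q&A", "Simple Tasks")),
    "moderate": Tier("DeepSeek API", 0.001, ("Content Generation", "Analysis")),
    "complex": Tier("Claude/GPT-4", None, ("Architecture", "Reasoning", "Critical Tasks")),
}

def describe(complexity: str) -> str:
    """Return a one-line summary of where a task of this complexity goes."""
    tier = TIERS[complexity]
    if tier.approx_cost is None:
        cost = "premium pricing"
    else:
        cost = f"~${tier.approx_cost:.3f}/call"
    return f"{complexity} -> {tier.backend} ({cost})"

for name in TIERS:
    print(describe(name))
```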
Step 1: Set Up Local Inference
I rented a VPS with 32GB RAM for about $25/month. Here’s the setup:
    # Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull a capable model (Qwen 32B quantized)
    ollama pull qwen2.5:32b

    # Test it
    ollama run qwen2.5:32b "Format this JSON: {name:'john',age:30}"

The model runs entirely on CPU. It’s slower than cloud APIs, but for simple tasks, the latency doesn’t matter.
    import requests

    def call_local_ollama(prompt: str, model: str = "qwen2.5:32b") -> str:
        """Call local Ollama - no per-call cost"""
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False
            },
            timeout=60
        )
        # Surface HTTP errors here instead of a confusing KeyError below
        response.raise_for_status()
        return response.json()["response"]

    # Usage - costs $0
    result = call_local_ollama("Format this JSON: {name:'john',age:30}")

I route all simple tasks here: JSON formatting, basic Q&A, text extraction. Zero cost per call.
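Because a local model can return malformed output, it may be worth validating formatting results before trusting them. The sketch below is an addition, not part of the setup above: `extract_json` is a hypothetical helper that takes the model's raw text and returns the parsed JSON, or `None` if nothing parses. It uses only the standard library and does not call Ollama itself.

```python
import json
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code fence marker

def extract_json(raw: str) -> Optional[Any]:
    """Parse model output as JSON, tolerating a surrounding Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# The prompt's input is not valid JSON, so it is rejected...
print(extract_json("{name:'john',age:30}"))
# ...while a correctly formatted answer parses.
print(extract_json('{"name": "john", "age": 30}'))
```

A caller could retry, or escalate to the next tier, whenever this returns `None`.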
Step 2: Add Cheap API Fallback
Some tasks need more intelligence than local models provide. But they don’t need Claude-level reasoning either.
I signed up for DeepSeek API. Their pricing is incredibly cheap compared to premium models:
Model Pricing Comparison (per 1M tokens):
- GPT-4 Turbo: $10.00 input / $30.00 output
- Claude 3.5 Sonnet: $3.00 input / $15.00 output
- DeepSeek V3: $0.27 input / $1.10 output
- DeepSeek R1: $0.55 input / $2.19 output

At these prices, DeepSeek V3 is roughly 30x cheaper than GPT-4 Turbo and more than 10x cheaper than Claude 3.5 Sonnet.
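To see what those per-million prices mean for a single call, here is a minimal sketch. The dictionary keys are labels chosen for this example (not API model IDs), and the 500-token prompt with a 1,000-token answer is an assumed workload.

```python
# Prices in USD per 1M tokens, copied from the comparison above: (input, output).
PRICES_PER_1M = {
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "deepseek-v3": (0.27, 1.10),
    "deepseek-r1": (0.55, 2.19),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assumed workload: a 500-token prompt that produces a 1,000-token answer.
for model in PRICES_PER_1M:
    print(f"{model:18s} ${call_cost(model, 500, 1000):.6f}")
```

For that workload DeepSeek V3 comes to about $0.0012 per call, in line with the ~$0.001 per call quoted elsewhere in this article, while GPT-4 Turbo comes to $0.035 and Claude 3.5 Sonnet to $0.0165.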
    import os
    import requests

    DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY")

    def call_deepseek(prompt: str, model: str = "deepseek-chat") -> str:
        """Call DeepSeek API - ultra cheap"""
        response = requests.post(
            "https://api.deepseek.com/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {DEEPSEEK_API_KEY}",
                "Content-Type": "application/json"
            },
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.7
            },
            timeout=30
        )
        # Fail loudly on auth or quota errors instead of a KeyError below
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    # Usage - costs ~$0.001 for a typical prompt
    result = call_deepseek("Write a blog post about REST APIs")

I route moderate tasks here: content generation, code explanation, simple analysis.
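One way to make the fallback explicit is to try the local model first and only pay for the API when the local call fails. This is a sketch under stated assumptions, not the routing code from this article: `with_fallback` is a hypothetical helper, and the two stand-in functions simulate a failing local call so the example runs without Ollama or an API key.

```python
from typing import Callable

def with_fallback(primary: Callable[[str], str],
                  fallback: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that tries `primary` and uses `fallback` if it raises."""
    def call(prompt: str) -> str:
        try:
            return primary(prompt)
        except Exception as exc:  # e.g. connection refused, timeout, bad response
            print(f"primary failed ({exc!r}); falling back")
            return fallback(prompt)
    return call

# Stand-ins so the sketch runs offline; swap in the real call functions.
def fake_local(prompt: str) -> str:
    raise ConnectionError("Ollama is not running")

def fake_api(prompt: str) -> str:
    return f"api answer to: {prompt}"

infer = with_fallback(fake_local, fake_api)
print(infer("Summarize this paragraph"))
```

In real use the call would look like `with_fallback(call_local_ollama, call_deepseek)`, so the paid tier is only hit when the free one is unavailable.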
Step 3: Reserve Premium for Complex Tasks
For architecture decisions, complex reasoning, and critical tasks, I still use premium models. But only when necessary.
    from enum import Enum

    class TaskComplexity(Enum):
        SIMPLE = "simple"      # Formatting, basic Q&A, extraction
        MODERATE = "moderate"  # Content generation, code explanation
        COMPLEX = "complex"    # Architecture, reasoning, critical decisions

    def route_inference(
        prompt: str,
        complexity: TaskComplexity = TaskComplexity.MODERATE
    ) -> str:
        """Route to appropriate model based on task complexity"""
        if complexity == TaskComplexity.SIMPLE:
            # Free local inference
            return call_local_ollama(prompt)
        elif complexity == TaskComplexity.MODERATE:
            # Ultra-cheap API (~$0.001)
            return call_deepseek(prompt)
        elif complexity == TaskComplexity.COMPLEX:
            # Premium model (only when truly needed)
            return call_claude(prompt)
        raise ValueError(f"Unknown complexity: {complexity!r}")

    def call_claude(prompt: str) -> str:
        """Call Claude for complex reasoning"""
        # Your Claude API call here
        raise NotImplementedError("Add your Claude API call here")

Step 4: Add Cost Tracking
I wrapped everything with cost monitoring to see where money goes:
    import time
    from functools import wraps
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class InferenceCost:
        model: str
        tokens: int
        cost_usd: float
        latency_ms: float

    # Flat cost per 1K tokens (approximate; ignores the input/output price split)
    MODEL_COSTS = {
        "ollama-local": 0.0,          # Free
        "deepseek-chat": 0.00014,     # ~$0.14 per 1M tokens
        "claude-3.5-sonnet": 0.003,   # ~$3 per 1M tokens
    }

    def track_cost(model_name: str, cost_per_1k: float):
        """Decorator to track API costs"""
        def decorator(func: Callable) -> Callable:
            @wraps(func)
            def wrapper(prompt: str, *args, **kwargs):
                start = time.time()
                result = func(prompt, *args, **kwargs)
                latency_ms = (time.time() - start) * 1000

                # Estimate tokens (rough: 4 chars per token)
                input_tokens = len(prompt) // 4
                output_tokens = len(result) // 4
                total_tokens = input_tokens + output_tokens

                cost = (total_tokens / 1000) * cost_per_1k
                record = InferenceCost(model_name, total_tokens, cost, latency_ms)

                print(f"[{record.model}] {record.tokens} tokens, ${record.cost_usd:.6f}")
                return result
            return wrapper
        return decorator

    @track_cost("deepseek-chat", MODEL_COSTS["deepseek-chat"])
    def call_deepseek_tracked(prompt: str) -> str:
        return call_deepseek(prompt)

Now I can see exactly what each call costs:

    [ollama-local] 450 tokens, $0.000000
    [deepseek-chat] 1200 tokens, $0.000168
    [claude-3.5-sonnet] 800 tokens, $0.002400

The Results
After implementing this routing system, my monthly costs dropped dramatically:
Before:
- Everything to Claude/GPT-4: $200+/month

After:
- VPS (32GB RAM): $25/month (fixed)
- DeepSeek API: $3/month
- Claude (complex only): $15/month

Total: $43/month (78% savings)

But wait, let me show you the actual breakdown by task type:

Task Distribution:
- Simple tasks (60%): Free via Ollama
- Moderate tasks (30%): $3 via DeepSeek
- Complex tasks (10%): $15 via Claude

The 60% of simple tasks that used to cost me $120/month? Now free.
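The savings figure can be checked with a few lines of arithmetic. The numbers are the ones listed above; the variable names are just for this sketch.

```python
before = 200.0  # USD/month, the "Before" figure above

after = {
    "VPS (32GB RAM)": 25.0,
    "DeepSeek API": 3.0,
    "Claude (complex only)": 15.0,
}

total_after = sum(after.values())
savings = (before - total_after) / before
vps_share = after["VPS (32GB RAM)"] / total_after

print(f"After: ${total_after:.0f}/month")
print(f"Savings: {savings:.1%}")
print(f"VPS share of remaining cost: {vps_share:.0%}")
```

This gives $43/month and savings of 78.5%, reported above as 78%. It also shows that the fixed VPS fee is the largest remaining item, so the local tier is free per call rather than free overall.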
Common Mistakes I Made
Mistake 1: Using Premium Models for Simple Tasks
I was sending JSON formatting to Claude. Complete waste:
    # BEFORE: Wasteful
    result = call_claude("Format this JSON: {name:'john',age:30}")
    # Cost: ~$0.01

    # AFTER: Routed correctly
    result = call_local_ollama("Format this JSON: {name:'john',age:30}")
    # Cost: $0.00

Mistake 2: No Cost Visibility
I had no idea where my money was going until I added tracking. Now I log every call:
    # Simple logging function
    def log_inference(model: str, prompt: str, cost: float):
        with open("inference_log.csv", "a") as f:
            f.write(f"{model},{len(prompt)},{cost}\n")

Mistake 3: Underestimating Local Models
I thought local models couldn’t handle real work. But Qwen 32B handles 80% of my tasks fine:
Tasks Qwen 32B handles well:
- JSON/YAML formatting
- Text summarization
- Basic Q&A
- Code explanation
- Simple refactoring

Tasks requiring cloud models:
- Complex architecture decisions
- Multi-step reasoning
- Critical production code

Mistake 4: Wrong VPS Size
I first tried a 16GB VPS. It couldn’t run larger models. The recommendation from Reddit was right: 32GB minimum for good local inference.
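A back-of-the-envelope estimate suggests why 16GB falls short. The sketch assumes that model weights dominate memory use and that a quantized build stores roughly 4 to 5 bits per parameter. The exact footprint depends on the quantization format and context length, so treat the numbers as rough.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate RAM needed for model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# Assumption: a "32B" model has roughly 32 billion parameters.
for bits in (4, 5, 16):
    print(f"32B at {bits}-bit: ~{weight_memory_gb(32, bits):.0f} GB")
```

Even at 4 bits the weights alone come to about 16 GB, before the context cache and the operating system take their share, which is consistent with the 16GB VPS running out of memory.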
    # Check if your VPS can handle the model
    ollama run qwen2.5:32b

    # If you see "out of memory" errors, you need more RAM

Complete Routing Implementation
Here’s my full routing system:
    import os
    from enum import Enum

    import requests

    class TaskComplexity(Enum):
        SIMPLE = "simple"
        MODERATE = "moderate"
        COMPLEX = "complex"

    class SmartRouter:
        def __init__(self):
            self.ollama_url = "http://localhost:11434"
            self.deepseek_key = os.environ.get("DEEPSEEK_API_KEY")
            self.claude_key = os.environ.get("ANTHROPIC_API_KEY")

        def infer(
            self,
            prompt: str,
            complexity: TaskComplexity = TaskComplexity.MODERATE
        ) -> str:
            """Route inference to optimal model"""
            if complexity == TaskComplexity.SIMPLE:
                return self._call_ollama(prompt)
            elif complexity == TaskComplexity.MODERATE:
                return self._call_deepseek(prompt)
            else:
                return self._call_claude(prompt)

        def _call_ollama(self, prompt: str) -> str:
            """Free local inference"""
            response = requests.post(
                f"{self.ollama_url}/api/generate",
                json={"model": "qwen2.5:32b", "prompt": prompt, "stream": False},
                timeout=60
            )
            response.raise_for_status()
            return response.json()["response"]

        def _call_deepseek(self, prompt: str) -> str:
            """Ultra-cheap cloud inference"""
            response = requests.post(
                "https://api.deepseek.com/v1/chat/completions",
                headers={"Authorization": f"Bearer {self.deepseek_key}"},
                json={
                    "model": "deepseek-chat",
                    "messages": [{"role": "user", "content": prompt}]
                },
                timeout=30
            )
            response.raise_for_status()
            return response.json()["choices"][0]["message"]["content"]

        def _call_claude(self, prompt: str) -> str:
            """Premium inference for complex tasks"""
            # Imported here so the other two tiers work without the anthropic package
            import anthropic
            client = anthropic.Anthropic(api_key=self.claude_key)
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=4096,
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text

    # Usage
    router = SmartRouter()

    # Simple task - free
    formatted = router.infer(
        "Format this JSON: {name:'john',age:30}",
        TaskComplexity.SIMPLE
    )

    # Moderate task - cheap
    article = router.infer(
        "Write a blog post about REST APIs",
        TaskComplexity.MODERATE
    )

    # Complex task - premium (only when needed)
    architecture = router.infer(
        "Design a microservices architecture for an e-commerce platform...",
        TaskComplexity.COMPLEX
    )

How to Choose Task Complexity
I use this simple heuristic:
SIMPLE (free local):
- Text formatting (JSON, YAML, Markdown)
- Basic Q&A (factual questions)
- Text extraction and cleanup
- Simple transformations
- Code formatting

MODERATE (cheap API):
- Content generation
- Code explanation
- Document summarization
- Translation
- Simple analysis

COMPLEX (premium):
- Architecture decisions
- Multi-step reasoning
- Complex debugging
- Security analysis
- Critical production code

Summary
I reduced my AI inference costs by 78% using a three-tier routing strategy:
- Free tier: Local Ollama for simple tasks (60% of usage)
- Cheap tier: DeepSeek API for moderate tasks (30% of usage)
- Premium tier: Claude/GPT-4 only for complex tasks (10% of usage)
The key insight: most tasks don’t need premium models. I was burning money sending JSON formatting to Claude.
Start by auditing where your API costs go. You’ll probably find the same pattern: a few expensive calls for tasks that don’t justify the cost.
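A starting point for that audit, assuming calls were logged in the `model,prompt_length,cost` format written by `log_inference` earlier. `summarize_log` is a hypothetical helper, and the sample rows are illustrative only, there to exercise the function without a real log file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO

def summarize_log(log: TextIO) -> Dict[str, float]:
    """Sum logged cost per model from rows of model,prompt_length,cost."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.reader(log):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        model, _prompt_len, cost = row
        totals[model] += float(cost)
    return dict(totals)

# Illustrative rows only; in practice pass open("inference_log.csv").
sample = io.StringIO(
    "ollama-local,120,0.0\n"
    "deepseek-chat,800,0.000168\n"
    "claude-3.5-sonnet,640,0.0024\n"
    "claude-3.5-sonnet,300,0.0024\n"
)

totals = summarize_log(sample)
# Most expensive model first, so the biggest line item is at the top.
for model, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:20s} ${cost:.6f}")
```

Sorting by total cost puts the model worth investigating first at the top of the list.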
Final Words + More Resources
My intention with this article was to share my knowledge and experience in order to help others. If you want to contact me, you can reach me by email.