
How I Cut My AI Inference Costs by 90% Using Local Models

Problem

I opened my OpenRouter bill and saw $200 gone in a week. I wasn’t building anything crazy—just some content generation and code assistance for my side projects.

Here’s what my usage looked like:

Week 1 spending:
- Claude for simple text formatting: $45
- GPT-4 for basic Q&A: $60
- Claude for moderate content tasks: $55
- Claude for actual complex reasoning: $40
Total: $200 (and counting)

I realized I was using premium models for tasks that didn’t need premium intelligence. Formatting JSON? That’s a $45 waste. Basic Q&A? $60 down the drain.

There had to be a better way.

What I Learned

I found a Reddit thread where developers shared their cost-cutting strategies. One user said: “Get a VPS with 32GB RAM—install Ollama and run Qwen locally for zero-cost inference.”

Another mentioned: “I use DeepSeek V3 for fractions of a penny per call.”

But someone else warned: “Mine just blew through $200 OpenRouter credits on nothing.”

The pattern was clear: the developers saving money had a routing strategy. The ones losing money sent everything to the most expensive model.

My Three-Tier Solution

I built a simple routing system that matches task complexity to model cost:

                 Task Complexity
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
     Simple         Moderate         Complex
        │               │               │
        ▼               ▼               ▼
  Local Ollama    DeepSeek API    Claude/GPT-4
     (FREE)      (~$0.001/call)     (Premium)
        │               │               │
   Formatting        Content      Architecture
    Basic Q&A      Generation       Reasoning
  Simple Tasks      Analysis     Critical Tasks

Let me show you how I set this up.

Step 1: Set Up Local Inference

I rented a VPS with 32GB RAM for about $25/month. Here’s the setup:

terminal
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a capable model (Qwen 32B quantized)
ollama pull qwen2.5:32b
# Test it
ollama run qwen2.5:32b "Format this JSON: {name:'john',age:30}"

The model runs entirely on CPU. It’s slower than cloud APIs, but for simple tasks, the latency doesn’t matter.

local_client.py
import requests


def call_local_ollama(prompt: str, model: str = "qwen2.5:32b") -> str:
    """Call local Ollama - completely free"""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        timeout=60
    )
    # Fail loudly on HTTP errors instead of hitting a KeyError below
    response.raise_for_status()
    return response.json()["response"]


# Usage - costs $0
result = call_local_ollama("Format this JSON: {name:'john',age:30}")

I route all simple tasks here: JSON formatting, basic Q&A, text extraction. Zero cost per call.
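Because CPU inference is slow, a long generation can outlast a fixed request timeout. One way around that is to stream the response. The following is a minimal sketch, assuming Ollama's `/api/generate` streaming format (one JSON object per line, each with a `response` fragment and a `done` flag). The names `collect_stream` and `stream_local_ollama` are hypothetical, and the sketch uses the standard library's `urllib` so it has no extra dependencies.

```python
import json
import urllib.request
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the `response` fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():  # skip blank lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk: stop reading
            break
    return "".join(parts)


def stream_local_ollama(prompt: str, model: str = "qwen2.5:32b") -> str:
    """Same request as the non-streaming call, but read token by token."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # The timeout applies to each socket read, not the whole generation;
    # 120 seconds is an assumed value, tune it for your VPS.
    with urllib.request.urlopen(request, timeout=120) as response:
        return collect_stream(response)
```

With streaming, the timeout only has to cover the gap between tokens, so a slow but steady generation no longer fails.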

Step 2: Add Cheap API Fallback

Some tasks need more intelligence than local models provide. But they don’t need Claude-level reasoning either.

I signed up for the DeepSeek API. Their pricing is far lower than that of the premium models:

Model Pricing Comparison (per 1M tokens):
- GPT-4 Turbo: $10.00 input / $30.00 output
- Claude 3.5 Sonnet: $3.00 input / $15.00 output
- DeepSeek V3: $0.27 input / $1.10 output
- DeepSeek R1: $0.55 input / $2.19 output

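Going by these list prices, DeepSeek V3 is roughly 27x to 37x cheaper than GPT-4 Turbo and roughly 11x to 14x cheaper than Claude 3.5 Sonnet, depending on the input/output mix.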
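To sanity-check the ratios and the "~$0.001 per call" figure, here is the arithmetic as a small sketch. The prices are copied from the table above. The call size (300 input tokens, 800 output tokens) is an assumed example, not a measurement, and the function names are hypothetical.

```python
# USD per 1M tokens as (input, output), copied from the pricing table.
PRICES = {
    "gpt-4-turbo": (10.00, 30.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "deepseek-v3": (0.27, 1.10),
    "deepseek-r1": (0.55, 2.19),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, pricing input and output tokens separately."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def price_ratio(expensive: str, cheap: str) -> tuple:
    """How many times pricier `expensive` is than `cheap`: (input, output)."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


# Assumed example call: 300 tokens in, 800 tokens out.
for model in PRICES:
    print(f"{model}: ${call_cost(model, 300, 800):.6f} per call")

for premium in ("gpt-4-turbo", "claude-3.5-sonnet"):
    in_ratio, out_ratio = price_ratio(premium, "deepseek-v3")
    print(f"{premium} vs deepseek-v3: {in_ratio:.0f}x input, {out_ratio:.0f}x output")
```

Output tokens dominate the bill for generation-heavy tasks, so the output ratio is usually the one that matters.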

deepseek_client.py
import os
import requests

DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY")


def call_deepseek(prompt: str, model: str = "deepseek-chat") -> str:
    """Call DeepSeek API - ultra cheap"""
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {DEEPSEEK_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7
        },
        timeout=30
    )
    # Surface auth or rate-limit errors instead of a KeyError below
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


# Usage - costs ~$0.001 for typical prompt
result = call_deepseek("Write a blog post about REST APIs")

I route moderate tasks here: content generation, code explanation, simple analysis.

Step 3: Reserve Premium for Complex Tasks

For architecture decisions, complex reasoning, and critical tasks, I still use premium models. But only when necessary.

routing.py
from enum import Enum

# Assumes call_local_ollama and call_deepseek from the earlier snippets are in scope.


class TaskComplexity(Enum):
    SIMPLE = "simple"      # Formatting, basic Q&A, extraction
    MODERATE = "moderate"  # Content generation, code explanation
    COMPLEX = "complex"    # Architecture, reasoning, critical decisions


def route_inference(
    prompt: str,
    complexity: TaskComplexity = TaskComplexity.MODERATE
) -> str:
    """Route to appropriate model based on task complexity"""
    if complexity == TaskComplexity.SIMPLE:
        # Free local inference
        return call_local_ollama(prompt)
    elif complexity == TaskComplexity.MODERATE:
        # Ultra-cheap API (~$0.001)
        return call_deepseek(prompt)
    elif complexity == TaskComplexity.COMPLEX:
        # Premium model (only when truly needed)
        return call_claude(prompt)
    # Guard against silently returning None for an unexpected value
    raise ValueError(f"Unknown complexity: {complexity!r}")


def call_claude(prompt: str) -> str:
    """Call Claude for complex reasoning"""
    # Your Claude API call here
    raise NotImplementedError("Add your Claude API call here")
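Step 2 is titled a "fallback", but the router above picks exactly one tier and does nothing if that tier fails (for example, when the local model times out). A minimal sketch of one way to add escalation, assuming each tier is a plain `prompt -> str` callable. `infer_with_fallback` is a hypothetical helper, not part of the original setup.

```python
from typing import Callable, Sequence

ModelCall = Callable[[str], str]


def infer_with_fallback(prompt: str, tiers: Sequence[ModelCall]) -> str:
    """Try each tier in order (cheapest first) and return the first usable answer."""
    errors = []
    for call in tiers:
        name = getattr(call, "__name__", repr(call))
        try:
            result = call(prompt)
        except Exception as exc:  # broad on purpose: timeouts, HTTP errors, bad JSON
            errors.append(f"{name}: {exc}")
            continue
        if result and result.strip():
            return result
        errors.append(f"{name}: empty response")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

With the functions from this article, a call might look like `infer_with_fallback(prompt, [call_local_ollama, call_deepseek, call_claude])`. Note that escalation can quietly raise costs, so it is worth logging which tier actually answered.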

Step 4: Add Cost Tracking

I wrapped everything with cost monitoring to see where money goes:

cost_tracker.py
import time
from functools import wraps
from dataclasses import dataclass
from typing import Callable

# Assumes call_deepseek from deepseek_client.py is in scope.


@dataclass
class InferenceCost:
    model: str
    tokens: int
    cost_usd: float
    latency_ms: float


# Cost per 1K tokens (approximate, single blended rate; output tokens
# are priced higher in practice, so treat these as lower bounds)
MODEL_COSTS = {
    "ollama-local": 0.0,         # Free
    "deepseek-chat": 0.00014,    # ~$0.14 per 1M tokens
    "claude-3.5-sonnet": 0.003,  # ~$3 per 1M tokens
}


def track_cost(model_name: str, cost_per_1k: float):
    """Decorator to track API costs"""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(prompt: str, *args, **kwargs):
            start = time.time()
            result = func(prompt, *args, **kwargs)
            latency_ms = (time.time() - start) * 1000
            # Estimate tokens (rough: 4 chars per token)
            input_tokens = len(prompt) // 4
            output_tokens = len(result) // 4
            total_tokens = input_tokens + output_tokens
            cost = (total_tokens / 1000) * cost_per_1k
            record = InferenceCost(model_name, total_tokens, cost, latency_ms)
            print(f"[{record.model}] {record.tokens} tokens, ${record.cost_usd:.6f}")
            return result
        return wrapper
    return decorator


@track_cost("deepseek-chat", MODEL_COSTS["deepseek-chat"])
def call_deepseek_tracked(prompt: str) -> str:
    return call_deepseek(prompt)

Now I can see exactly what each call costs:

[ollama-local] 450 tokens, $0.000000
[deepseek-chat] 1200 tokens, $0.000168
[claude-3.5-sonnet] 800 tokens, $0.002400
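The four-characters-per-token estimate is rough, and a single blended rate hides the fact that output tokens cost more. OpenAI-compatible chat APIs return a `usage` object with exact counts, and to my knowledge DeepSeek's does too, so a more precise figure can be computed from the response itself. A minimal sketch under that assumption. `cost_from_usage` is a hypothetical helper, and the sample response below is made up for illustration.

```python
def cost_from_usage(
    response_json: dict,
    input_per_1m: float,
    output_per_1m: float,
) -> float:
    """Cost in USD from the `usage` block of a chat-completions response."""
    usage = response_json.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    return (
        prompt_tokens * input_per_1m + completion_tokens * output_per_1m
    ) / 1_000_000


# Made-up response with the same 1200 total tokens as the log line above,
# priced at the DeepSeek V3 rates from the pricing table.
sample = {
    "choices": [{"message": {"content": "..."}}],
    "usage": {"prompt_tokens": 200, "completion_tokens": 1000, "total_tokens": 1200},
}
print(f"${cost_from_usage(sample, 0.27, 1.10):.6f}")
```

For this sample the split pricing comes out several times higher than the blended estimate for the same 1200 tokens, which is why it is worth tracking input and output separately.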

The Results

After implementing this routing system, my monthly costs dropped dramatically:

Before:
- Everything to Claude/GPT-4: $200+/month
After:
- VPS (32GB RAM): $25/month (fixed)
- DeepSeek API: $3/month
- Claude (complex only): $15/month
Total: $43/month (78% savings)

But wait—let me show you the actual breakdown by task type:

Task Distribution:
- Simple tasks (60%): Free via Ollama
- Moderate tasks (30%): $3 via DeepSeek
- Complex tasks (10%): $15 via Claude

The 60% of simple tasks that used to cost me $120/month? Now free.
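The savings figure depends on whether the fixed VPS cost is counted. A short sketch of the arithmetic, using only the numbers above. Reading the roughly 90% headline figure as "API spend only" is one plausible interpretation, stated here as an assumption.

```python
# Monthly figures in USD, taken from the before/after breakdown.
before_monthly = 200.0
after_monthly = {"vps": 25.0, "deepseek": 3.0, "claude": 15.0}

# Everything counted, including the fixed VPS rental.
total_after = sum(after_monthly.values())
overall_savings = 1 - total_after / before_monthly

# Pay-per-call API spend only (VPS excluded).
api_after = after_monthly["deepseek"] + after_monthly["claude"]
api_savings = 1 - api_after / before_monthly

print(f"Total after: ${total_after:.0f}/month")
print(f"Overall savings: {overall_savings:.1%}")
print(f"API-only savings: {api_savings:.1%}")
```

The VPS is a fixed cost, so the local tier only pays off while the simple tasks it absorbs would have cost more than $25/month through an API.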

Common Mistakes I Made

Mistake 1: Using Premium Models for Simple Tasks

I was sending JSON formatting to Claude. Complete waste:

# BEFORE: Wasteful
result = call_claude("Format this JSON: {name:'john',age:30}")
# Cost: ~$0.01
# AFTER: Routed correctly
result = call_local_ollama("Format this JSON: {name:'john',age:30}")
# Cost: $0.00

Mistake 2: No Cost Visibility

I had no idea where my money was going until I added tracking. Now I log every call:

# Simple logging function
def log_inference(model: str, prompt: str, cost: float):
    with open("inference_log.csv", "a") as f:
        f.write(f"{model},{len(prompt)},{cost}\n")
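A log is only useful if it gets read. Here is a minimal sketch that totals calls and cost per model, assuming the three-column layout written above (model, prompt length in characters, cost). `summarize_log` is a hypothetical helper.

```python
import csv
from collections import defaultdict


def summarize_log(path: str = "inference_log.csv") -> dict:
    """Return {model: {"calls": n, "cost": total_usd}} for the CSV log."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 3:
                continue  # skip malformed lines
            model, _prompt_chars, cost = row
            totals[model]["calls"] += 1
            totals[model]["cost"] += float(cost)
    return dict(totals)
```

Sorting the result by cost shows at a glance which model is eating the budget.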

Mistake 3: Underestimating Local Models

I thought local models couldn’t handle real work. But Qwen 32B handles 80% of my tasks fine:

Tasks Qwen 32B handles well:
- JSON/YAML formatting
- Text summarization
- Basic Q&A
- Code explanation
- Simple refactoring
Tasks requiring cloud models:
- Complex architecture decisions
- Multi-step reasoning
- Critical production code

Mistake 4: Wrong VPS Size

I first tried a 16GB VPS. It couldn’t run larger models. The recommendation from Reddit was right: 32GB minimum for good local inference.

Terminal window
# Check if your VPS can handle the model
ollama run qwen2.5:32b
# If you see "out of memory" errors, you need more RAM
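You can also estimate the memory requirement before renting anything. A back-of-the-envelope sketch: weights take roughly parameters times bits per weight divided by 8 bytes, plus headroom for the context cache and runtime. The 4.5 bits per weight (a typical 4-bit quantization) and the 20% overhead are assumptions, not measured values, and `estimate_model_ram_gb` is a hypothetical helper.

```python
def estimate_model_ram_gb(
    params_billion: float,
    bits_per_weight: float = 4.5,  # assumed: typical 4-bit quantization
    overhead: float = 1.2,         # assumed: ~20% for context cache and runtime
) -> float:
    """Rough RAM needed to run a quantized model, in GB."""
    # 1e9 parameters * (bits / 8) bytes each = that many GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


for vps_ram_gb in (16, 32):
    needed = estimate_model_ram_gb(32)
    verdict = "fits" if needed < vps_ram_gb else "does not fit"
    print(f"32B model needs ~{needed:.1f} GB: {verdict} in {vps_ram_gb} GB")
```

Under these assumptions a 32B model lands a little over 20 GB, which matches the experience above: too big for 16GB, comfortable on 32GB.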

Complete Routing Implementation

Here’s my full routing system:

smart_router.py
from enum import Enum
import os

import requests


class TaskComplexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


class SmartRouter:
    def __init__(self):
        self.ollama_url = "http://localhost:11434"
        self.deepseek_key = os.environ.get("DEEPSEEK_API_KEY")
        self.claude_key = os.environ.get("ANTHROPIC_API_KEY")

    def infer(
        self,
        prompt: str,
        complexity: TaskComplexity = TaskComplexity.MODERATE
    ) -> str:
        """Route inference to optimal model"""
        if complexity == TaskComplexity.SIMPLE:
            return self._call_ollama(prompt)
        elif complexity == TaskComplexity.MODERATE:
            return self._call_deepseek(prompt)
        else:
            return self._call_claude(prompt)

    def _call_ollama(self, prompt: str) -> str:
        """Free local inference"""
        response = requests.post(
            f"{self.ollama_url}/api/generate",
            json={"model": "qwen2.5:32b", "prompt": prompt, "stream": False},
            timeout=60
        )
        response.raise_for_status()
        return response.json()["response"]

    def _call_deepseek(self, prompt: str) -> str:
        """Ultra-cheap cloud inference"""
        response = requests.post(
            "https://api.deepseek.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {self.deepseek_key}"},
            json={
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": prompt}]
            },
            timeout=30
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    def _call_claude(self, prompt: str) -> str:
        """Premium inference for complex tasks"""
        # Imported lazily so the SDK is only needed when this tier is used
        import anthropic
        client = anthropic.Anthropic(api_key=self.claude_key)
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text


# Usage
router = SmartRouter()

# Simple task - free
formatted = router.infer(
    "Format this JSON: {name:'john',age:30}",
    TaskComplexity.SIMPLE
)

# Moderate task - cheap
article = router.infer(
    "Write a blog post about REST APIs",
    TaskComplexity.MODERATE
)

# Complex task - premium (only when needed)
architecture = router.infer(
    "Design a microservices architecture for an e-commerce platform...",
    TaskComplexity.COMPLEX
)

How to Choose Task Complexity

I use this simple heuristic:

SIMPLE (free local):
- Text formatting (JSON, YAML, Markdown)
- Basic Q&A (factual questions)
- Text extraction and cleanup
- Simple transformations
- Code formatting
MODERATE (cheap API):
- Content generation
- Code explanation
- Document summarization
- Translation
- Simple analysis
COMPLEX (premium):
- Architecture decisions
- Multi-step reasoning
- Complex debugging
- Security analysis
- Critical production code
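The router expects the caller to pick a tier by hand. If you want a default guess, the categories above can be turned into a crude keyword check. This is a minimal sketch and not part of the original setup. `guess_complexity` and both keyword lists are hypothetical and will misclassify plenty of prompts, so treat the result as a starting point you can override. `TaskComplexity` is redefined here only to keep the example self-contained.

```python
import re
from enum import Enum


class TaskComplexity(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


# Word-boundary prefixes, so "format" does not match inside "information".
SIMPLE_PATTERN = re.compile(r"\b(format|reformat|extract|convert|clean up)")
COMPLEX_PATTERN = re.compile(r"\b(architecture|security|debug|multi-step|production)")


def guess_complexity(prompt: str) -> TaskComplexity:
    """Guess a tier from keywords; anything unrecognized goes to the middle tier."""
    text = prompt.lower()
    # Check the expensive tier first: under-serving a critical task
    # costs more than over-serving a simple one.
    if COMPLEX_PATTERN.search(text):
        return TaskComplexity.COMPLEX
    if SIMPLE_PATTERN.search(text):
        return TaskComplexity.SIMPLE
    return TaskComplexity.MODERATE
```

Defaulting to the middle tier keeps unknown prompts cheap without sending them to a model that may be too weak for them.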

Summary

I reduced my AI inference costs by 78% using a three-tier routing strategy:

  1. Free tier: Local Ollama for simple tasks (60% of usage)
  2. Cheap tier: DeepSeek API for moderate tasks (30% of usage)
  3. Premium tier: Claude/GPT-4 only for complex tasks (10% of usage)

The key insight: most tasks don’t need premium models. I was burning money sending JSON formatting to Claude.

Start by auditing where your API costs go. You’ll probably find the same pattern: a large share of the spend going to simple tasks that don’t need a premium model.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
