How I Built a Hybrid LLM Setup That Saved Me 90% on API Costs
My LLM API bill hit $847 last month. That’s when I realized I was doing everything wrong.
I was using Claude’s API for everything—simple classification tasks, complex reasoning, batch processing, even quick one-off questions. Every request cost money, and the monthly total shocked me.
Then I discovered a hybrid approach that other developers were using. A Reddit thread showed someone achieving the same results at one-twelfth the cost. The secret? Mixing subscriptions, APIs, and local models intelligently.
Here’s the architecture I built and the lessons I learned.
The Problem: Single-Provider Blindness
When I started building LLM-powered features, I did what most developers do—I picked a provider and stuck with it. Claude was my choice because of its reasoning capabilities.
The problem? I wasn’t thinking about costs at all.
| Task Type | Monthly Calls | Cost/Month |
|---|---|---|
| Complex reasoning | 500 calls | $180 |
| Code generation | 1,200 calls | $240 |
| Simple classification | 5,000 calls | $200 |
| Batch processing | 10,000 calls | $180 |
| Quick Q&A | 2,000 calls | $47 |
| TOTAL | 18,700 calls | $847 |

When I analyzed my usage, I realized I was using a Ferrari to deliver pizza. Complex reasoning tasks justified premium API costs, but simple classification? That could run on much cheaper alternatives.
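The same usage analysis can be reproduced in a few lines. This is a minimal sketch that uses only the figures from the table above; the `summarize` helper is a hypothetical name, not part of any library.

```python
# Monthly usage figures copied from the table above: (calls, cost in USD).
USAGE = {
    "Complex reasoning": (500, 180.0),
    "Code generation": (1_200, 240.0),
    "Simple classification": (5_000, 200.0),
    "Batch processing": (10_000, 180.0),
    "Quick Q&A": (2_000, 47.0),
}


def summarize(usage):
    """Return {task: (cost_per_call, share_of_total_spend)}."""
    total = sum(cost for _, cost in usage.values())
    return {
        task: (cost / calls, cost / total)
        for task, (calls, cost) in usage.items()
    }


if __name__ == "__main__":
    for task, (per_call, share) in summarize(USAGE).items():
        print(f"{task:<22} ${per_call:.3f}/call  {share:.0%} of spend")
```

Simple classification and batch processing together account for about 45% of the bill ($380 of $847), which is why they are the first candidates for a cheaper route.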
The Hybrid Architecture
I built a routing layer that decides which provider to use based on task characteristics. Here’s the core architecture:
```text
+---------------------------------------------------------------+
|                    HYBRID LLM ARCHITECTURE                    |
+---------------------------------------------------------------+
|                                                               |
|  +------------------+     +------------------+                |
|  |   APPLICATION    | --> |  ROUTING LAYER   |                |
|  |      LAYER       |     |    (Decision)    |                |
|  +------------------+     +--------+---------+                |
|                                    |                          |
|       +------------+------------+--+---------+                |
|       |            |            |            |                |
|       v            v            v            v                |
|  +----------+ +----------+ +----------+ +----------+          |
|  |  CLAUDE  | |  OPENAI  | | DEEPSEEK | |  OLLAMA  |          |
|  | Pro/API  | | Plus/API | |   API    | |  LOCAL   |          |
|  +----------+ +----------+ +----------+ +----------+          |
|                                                               |
|  +------------------+     +------------------+                |
|  |   COST MONITOR   | <-- |  FALLBACK LOGIC  |                |
|  |    & ALERTING    |     |  & RETRY QUEUE   |                |
|  +------------------+     +------------------+                |
|                                                               |
+---------------------------------------------------------------+
```

The key insight: not all tasks need premium models. I could route simple tasks to cheaper options while preserving quality for complex work.
Step 1: Define Your Provider Tiers
I categorized providers into tiers based on cost and capability:
| Tier | Provider | Cost Model | Best For |
|---|---|---|---|
| Premium | Claude Pro subscription | $20/month flat | Daily coding, research, quick questions |
| Premium | OpenAI Plus subscription | $20/month flat | Alternative for subscription usage |
| Pay-per-use Premium | Claude API | $3-15 per million tokens | Complex reasoning tasks |
| Pay-per-use Mid | OpenAI GPT-4 API | $10-30 per million tokens | Code generation, overflow |
| Budget | DeepSeek API | $0.14-0.28 per million tokens | Batch tasks, simple classification |
| Free | Ollama (local) | Hardware cost only | Privacy-sensitive, offline use |
My tier strategy:
- Premium subscriptions for daily interactive use
- Budget APIs for high-volume tasks
- Local models for privacy-sensitive data
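To make the gap between tiers concrete, here is a rough sketch that prices one workload against the pay-per-use tiers. The per-million-token prices are the ones quoted in this article and will drift over time; the `Tier` class, the `monthly_cost` helper, and the workload numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices as quoted in this article; check current pricing before relying on them.
TIERS = [
    Tier("Claude API (Sonnet)", 3.0, 15.0),
    Tier("DeepSeek API", 0.14, 0.28),
    Tier("Ollama (local)", 0.0, 0.0),  # hardware cost is not modeled here
]


def monthly_cost(tier, calls, input_tokens, output_tokens):
    """Cost in USD of `calls` requests with the given token counts per request."""
    per_call = (
        input_tokens * tier.input_per_million
        + output_tokens * tier.output_per_million
    ) / 1_000_000
    return calls * per_call


if __name__ == "__main__":
    # Hypothetical workload: 5,000 classification calls, 400 tokens in, 20 out.
    for tier in TIERS:
        print(f"{tier.name:<22} ${monthly_cost(tier, 5_000, 400, 20):.2f}")
```

For this assumed workload the Claude tier comes to $7.50 and the DeepSeek tier to about $0.31, so the budget tier only pays off when quality is still acceptable for the task.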
Step 2: Build the Routing Logic
I started with a simple rule-based router. No machine learning, just explicit rules.
```yaml
routing_rules:
  - name: "privacy_sensitive"
    condition:
      contains_pii: true
    route_to: "ollama"
    fallback: null

  - name: "code_generation"
    condition:
      task_type: "code"
    route_to: "claude_sonnet"
    fallback: "gpt4o"

  - name: "batch_processing"
    condition:
      batch_size: ">100"
    route_to: "deepseek"
    fallback: "ollama"

  - name: "simple_tasks"
    condition:
      expected_tokens: "<500"
      complexity: "low"
    route_to: "deepseek"
    fallback: "ollama"

  - name: "complex_reasoning"
    condition:
      complexity: "high"
    route_to: "claude_opus"
    fallback: "gpt4o"
```

This configuration lives in a YAML file, making it easy to adjust routing without code changes.
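Because routing now depends on a hand-edited file, it may be worth validating the parsed config before the router uses it. The sketch below assumes the config has already been loaded into a dict (for example with `yaml.safe_load`); `KNOWN_PROVIDERS` and `validate_rules` are hypothetical names introduced for illustration.

```python
KNOWN_PROVIDERS = {"claude_opus", "claude_sonnet", "gpt4o", "deepseek", "ollama"}


def validate_rules(config):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen = set()
    for i, rule in enumerate(config.get("routing_rules", [])):
        name = rule.get("name", f"<rule {i}>")
        if name in seen:
            problems.append(f"{name}: duplicate rule name")
        seen.add(name)
        if not rule.get("condition"):
            problems.append(f"{name}: empty condition")
        if rule.get("route_to") not in KNOWN_PROVIDERS:
            problems.append(f"{name}: unknown route_to {rule.get('route_to')!r}")
        fallback = rule.get("fallback")
        if fallback is not None and fallback not in KNOWN_PROVIDERS:
            problems.append(f"{name}: unknown fallback {fallback!r}")
        if fallback is not None and fallback == rule.get("route_to"):
            problems.append(f"{name}: fallback equals route_to")
    return problems


# The structure a YAML loader would produce for two of the rules above,
# with a deliberate typo in the second route_to.
EXAMPLE = {
    "routing_rules": [
        {"name": "privacy_sensitive", "condition": {"contains_pii": True},
         "route_to": "ollama", "fallback": None},
        {"name": "batch_processing", "condition": {"batch_size": ">100"},
         "route_to": "deepsek", "fallback": "ollama"},
    ]
}

if __name__ == "__main__":
    for problem in validate_rules(EXAMPLE):
        print(problem)
```

Running the check at startup turns a typo in the config into an immediate error instead of a failed request at routing time.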
Step 3: Implement the Router
Here’s a simplified Python implementation:
```python
import yaml
from dataclasses import dataclass
from enum import Enum


class Provider(Enum):
    CLAUDE_OPUS = "claude_opus"
    CLAUDE_SONNET = "claude_sonnet"
    GPT4O = "gpt4o"
    DEEPSEEK = "deepseek"
    OLLAMA = "ollama"


@dataclass
class TaskContext:
    task_type: str
    complexity: str  # "low", "medium", "high"
    contains_pii: bool
    expected_tokens: int
    batch_size: int = 1


class LLMRouter:
    def __init__(self, config_path: str = "routing-config.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        # The client classes are defined elsewhere; each one exposes generate(prompt).
        self.providers = {
            Provider.CLAUDE_OPUS: ClaudeOpusClient(),
            Provider.CLAUDE_SONNET: ClaudeSonnetClient(),
            Provider.GPT4O: OpenAIClient(),
            Provider.DEEPSEEK: DeepSeekClient(),
            Provider.OLLAMA: OllamaClient(),
        }

    def route(self, context: TaskContext, prompt: str) -> str:
        """Route request to appropriate provider based on context."""
        # Rules are checked top to bottom; the first match wins.
        for rule in self.config["routing_rules"]:
            if self._matches_rule(context, rule):
                provider = Provider(rule["route_to"])
                try:
                    return self.providers[provider].generate(prompt)
                except Exception:
                    if rule.get("fallback"):
                        fallback = Provider(rule["fallback"])
                        return self.providers[fallback].generate(prompt)
                    raise

        # Default to Claude Sonnet
        return self.providers[Provider.CLAUDE_SONNET].generate(prompt)

    def _matches_rule(self, context: TaskContext, rule: dict) -> bool:
        """Check if task context matches routing rule.

        A rule matches only when every condition it lists holds.
        """
        cond = rule["condition"]

        if "contains_pii" in cond and cond["contains_pii"] != context.contains_pii:
            return False
        if "task_type" in cond and cond["task_type"] != context.task_type:
            return False
        if "complexity" in cond and cond["complexity"] != context.complexity:
            return False
        if "batch_size" in cond:
            # Thresholds are written as ">100" in the config
            if context.batch_size <= int(cond["batch_size"].lstrip(">")):
                return False
        if "expected_tokens" in cond:
            # Limits are written as "<500" in the config
            if context.expected_tokens >= int(cond["expected_tokens"].lstrip("<")):
                return False

        return True


# Usage example
router = LLMRouter()
context = TaskContext(
    task_type="code",
    complexity="high",
    contains_pii=False,
    expected_tokens=1500,
)
response = router.route(context, "Write a Python function to parse CSV files")
```

The router tries the primary provider first, then falls back to the secondary provider if the primary fails.
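The fallback path is easy to exercise without API keys. The following is a stripped-down sketch rather than the router itself: `StubProvider`, the lambda-based `RULES`, and the `route` function are all stand-ins invented for the demonstration.

```python
class StubProvider:
    """Stand-in for a real client; fails on demand to exercise the fallback."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def generate(self, prompt):
        if self.fail:
            raise ConnectionError(f"{self.name} unavailable")
        return f"[{self.name}] {prompt}"


# Evaluated top to bottom; the first matching rule wins.
RULES = [
    {"when": lambda ctx: ctx["contains_pii"],
     "route_to": "ollama", "fallback": None},
    {"when": lambda ctx: ctx["task_type"] == "code",
     "route_to": "claude_sonnet", "fallback": "gpt4o"},
]


def route(providers, ctx, prompt, default="claude_sonnet"):
    for rule in RULES:
        if rule["when"](ctx):
            try:
                return providers[rule["route_to"]].generate(prompt)
            except Exception:
                if rule["fallback"] is None:
                    raise  # no fallback configured, e.g. PII stays local
                return providers[rule["fallback"]].generate(prompt)
    return providers[default].generate(prompt)


providers = {
    "ollama": StubProvider("ollama"),
    "claude_sonnet": StubProvider("claude_sonnet", fail=True),  # simulated outage
    "gpt4o": StubProvider("gpt4o"),
}
code_ctx = {"contains_pii": False, "task_type": "code"}
pii_ctx = {"contains_pii": True, "task_type": "code"}

if __name__ == "__main__":
    print(route(providers, code_ctx, "parse a CSV"))  # served by the fallback
    print(route(providers, pii_ctx, "parse a CSV"))   # privacy rule wins
```

Note that the privacy rule sits first and has no fallback, so a request containing PII either runs locally or fails loudly.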
Step 4: Handle Failures Gracefully
One Reddit commenter warned me: “once you introduce routing logic across multiple providers, the system becomes harder to reason about. Failures, inconsistencies, or rate limits don’t always surface clearly.”
They were right. I needed robust failure handling.
Circuit Breaker Pattern
```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure: Optional[datetime] = None
        self.state = CircuitState.CLOSED

    def can_execute(self) -> bool:
        """Check if requests should be allowed through."""
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            if datetime.now() - self.last_failure > timedelta(seconds=self.recovery_timeout):
                self.state = CircuitState.HALF_OPEN
                return True
            return False

        # HALF_OPEN - allow one test request
        return True

    def record_success(self):
        """Record successful request."""
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        """Record failed request."""
        self.failure_count += 1
        self.last_failure = datetime.now()

        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
```

Each provider gets its own circuit breaker. When one provider starts failing, the router automatically falls back to alternatives.
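One way to wire per-provider breakers into a fallback chain is sketched below. It is self-contained, so it uses a compact `Breaker` stand-in with an injectable clock instead of the class above; `call_first_available` and the `(name, callable)` chain format are assumptions made for this example.

```python
import time


class Breaker:
    """Compact stand-in for a circuit breaker, with an injectable clock."""

    def __init__(self, failure_threshold=3, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def can_execute(self):
        if self.opened_at is None:
            return True
        # Once the timeout has passed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.recovery_timeout

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()


def call_first_available(chain, breakers, prompt):
    """Try providers in order, skipping any whose breaker is open.

    `chain` is a list of (name, callable) pairs; returns (name, result).
    """
    last_error = None
    for name, generate in chain:
        breaker = breakers[name]
        if not breaker.can_execute():
            continue
        try:
            result = generate(prompt)
        except Exception as exc:
            breaker.record(ok=False)
            last_error = exc
        else:
            breaker.record(ok=True)
            return name, result
    raise RuntimeError("no provider available") from last_error
```

With this shape, a provider that keeps failing stops receiving traffic after `failure_threshold` errors and is retried only after `recovery_timeout` seconds, so callers do not pay the latency of a doomed request on every call.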
Retry with Exponential Backoff
```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> T:
    """Execute function with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            # Out of attempts: surface the last error to the caller
            if attempt == max_retries - 1:
                raise

            # Exponential delay plus up to 1s of jitter, capped at max_delay
            delay = min(
                base_delay * (2 ** attempt) + random.uniform(0, 1),
                max_delay,
            )
            time.sleep(delay)

    raise RuntimeError("Should not reach here")
```

Step 5: Monitor Costs Religiously
The biggest risk with hybrid setups is surprise bills. I built cost tracking into every request.
```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, List, Optional


@dataclass
class CostEvent:
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: datetime
    request_id: str
    cost_center: str  # Which project/feature used this


class CostTracker:
    # USD per million tokens
    PRICING = {
        "claude_opus": {"input": 15.0, "output": 75.0},
        "claude_sonnet": {"input": 3.0, "output": 15.0},
        "gpt4o": {"input": 2.5, "output": 10.0},
        "deepseek": {"input": 0.14, "output": 0.28},
        "ollama": {"input": 0.0, "output": 0.0},
    }

    def __init__(self):
        self.events: List[CostEvent] = []

    def track_request(
        self,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        request_id: str,
        cost_center: str,
    ) -> float:
        """Track cost of a request and return cost in USD."""
        # `model` must be one of the PRICING keys; unknown models are priced at zero.
        pricing = self.PRICING.get(model, {"input": 0.0, "output": 0.0})

        cost = (
            (input_tokens / 1_000_000) * pricing["input"]
            + (output_tokens / 1_000_000) * pricing["output"]
        )

        event = CostEvent(
            provider=provider,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            timestamp=datetime.now(),
            request_id=request_id,
            cost_center=cost_center,
        )

        self.events.append(event)
        return cost

    def get_daily_spend(self, day: Optional[date] = None) -> Dict[str, float]:
        """Get total spend by provider for a specific date (defaults to today)."""
        if day is None:
            day = datetime.now().date()

        daily_spend: Dict[str, float] = {}
        for event in self.events:
            if event.timestamp.date() == day:
                provider = event.provider
                daily_spend[provider] = daily_spend.get(provider, 0.0) + event.cost_usd

        return daily_spend

    def check_budget_alert(self, daily_limit: float = 10.0):
        """Check if daily spend exceeds limit."""
        today_spend = sum(self.get_daily_spend().values())
        if today_spend > daily_limit:
            # Send alert (email, Slack, etc.)
            print(f"WARNING: Daily spend ${today_spend:.2f} exceeds limit ${daily_limit:.2f}")
```

Every request gets tagged with a cost center (project/feature), making it easy to attribute costs.
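Once events carry a cost center, attribution is a simple group-by. The sketch below works on plain tuples so it runs on its own; the event values, the per-center limits, and the `spend_by` and `over_budget` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical event records: (cost_center, provider, cost_usd).
EVENTS = [
    ("support-bot", "deepseek", 0.40),
    ("support-bot", "claude_sonnet", 6.10),
    ("code-review", "claude_opus", 9.25),
    ("nightly-batch", "deepseek", 1.75),
]


def spend_by(events, index):
    """Sum cost_usd grouped by the tuple field at `index` (0 = cost center, 1 = provider)."""
    totals = defaultdict(float)
    for event in events:
        totals[event[index]] += event[2]
    return dict(totals)


def over_budget(events, limits):
    """Return cost centers whose spend exceeds their limit, highest spend first."""
    spend = spend_by(events, 0)
    offenders = [
        center for center, total in spend.items()
        if total > limits.get(center, float("inf"))  # no limit means never flagged
    ]
    return sorted(offenders, key=lambda center: spend[center], reverse=True)


if __name__ == "__main__":
    print(spend_by(EVENTS, 1))
    print(over_budget(EVENTS, {"support-bot": 5.0, "code-review": 10.0}))
```

Grouping by cost center answers "which feature is expensive", while grouping by provider answers "which route is expensive"; both views come from the same event log.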
Step 6: Handle Provider Differences
Different providers have different APIs and response formats. I built an abstraction layer.
```python
from abc import ABC, abstractmethod

import anthropic
import openai
import requests


class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate response from provider."""

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""

    @abstractmethod
    def get_model_name(self) -> str:
        """Return model identifier."""


class ClaudeProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get("max_tokens", 4096),
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def count_tokens(self, text: str) -> int:
        # Counts the text as a single user message (recent SDK versions)
        result = self.client.messages.count_tokens(
            model=self.model,
            messages=[{"role": "user", "content": text}],
        )
        return result.input_tokens

    def get_model_name(self) -> str:
        return self.model


class DeepSeekProvider(LLMProvider):
    def __init__(self, api_key: str):
        # DeepSeek exposes an OpenAI-compatible endpoint
        self.client = openai.OpenAI(
            api_key=api_key,
            base_url="https://api.deepseek.com/v1",
        )

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def count_tokens(self, text: str) -> int:
        # Approximate token count
        return int(len(text.split()) * 1.3)

    def get_model_name(self) -> str:
        return "deepseek-chat"


class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.2"):
        self.model = model
        self.base_url = "http://localhost:11434"

    def generate(self, prompt: str, **kwargs) -> str:
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False},
            timeout=kwargs.get("timeout", 120),
        )
        response.raise_for_status()
        return response.json()["response"]

    def count_tokens(self, text: str) -> int:
        # Rough word-based estimate
        return len(text.split())

    def get_model_name(self) -> str:
        return self.model
```

The abstraction layer means I can add new providers without changing the routing logic.
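As a small illustration of that claim, the sketch below adds a new provider by implementing the same three methods and registering it under a name. `EchoProvider`, `REGISTRY`, and `register` are hypothetical; the interface is re-declared here only so the example runs on its own.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Same three-method shape as the interface above."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str: ...

    @abstractmethod
    def count_tokens(self, text: str) -> int: ...

    @abstractmethod
    def get_model_name(self) -> str: ...


class EchoProvider(LLMProvider):
    """Hypothetical offline provider, useful as a test double."""

    def generate(self, prompt: str, **kwargs) -> str:
        return f"echo: {prompt}"

    def count_tokens(self, text: str) -> int:
        return len(text.split())

    def get_model_name(self) -> str:
        return "echo-1"


REGISTRY = {}


def register(name, provider):
    """Add a provider under a routing name, rejecting objects that skip the interface."""
    if not isinstance(provider, LLMProvider):
        raise TypeError(f"{name} does not implement LLMProvider")
    REGISTRY[name] = provider


register("echo", EchoProvider())
```

The router only ever looks up a name and calls `generate`, so a new provider needs one class and one `register` call, plus a rule in the routing config that points at the new name.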
Step 7: Real-World Routing Decisions
Here’s my decision matrix for routing:
| Task Type | Primary Provider | Fallback | Cost Tier | Why This Route |
|---|---|---|---|---|
| Complex reasoning | Claude Opus | GPT-4o | Premium | Best reasoning quality |
| Code generation | Claude Sonnet | GPT-4o | Mid | Excellent code understanding |
| Simple classification | DeepSeek | Ollama | Budget | 10x cheaper, sufficient quality |
| Batch processing | DeepSeek | Ollama | Budget | Cost efficiency matters most |
| Privacy-sensitive | Ollama (local) | N/A | Free | Data never leaves machine |
| Quick questions | Claude Pro sub | ChatGPT sub | Fixed | Already paid subscription |
| High-volume tasks | DeepSeek API | Ollama | Budget | Marginal cost optimization |
| Production features | Claude API | GPT-4 API | Premium | SLA guarantees |
Example routing scenarios:
```python
# Scenario 1: Code generation with PII in prompt
context = TaskContext(
    task_type="code",
    complexity="high",
    contains_pii=True,
    expected_tokens=2000,
)
# Routes to: Ollama (privacy-sensitive rule matches first)

# Scenario 2: Batch classification job
context = TaskContext(
    task_type="classification",
    complexity="low",
    contains_pii=False,
    expected_tokens=100,
    batch_size=500,
)
# Routes to: DeepSeek (batch_processing rule matches)

# Scenario 3: Complex reasoning for research
context = TaskContext(
    task_type="analysis",
    complexity="high",
    contains_pii=False,
    expected_tokens=3000,
)
# Routes to: Claude Opus (complex_reasoning rule matches)
```

The Results: Cost Savings
After three months of running this hybrid setup, here’s what I found:
| Task Type | Before (Claude API) | After (Hybrid) | Savings |
|---|---|---|---|
| Complex reasoning | $180 | $90 | 50% |
| Code generation | $240 | $120 | 50% |
| Simple classification | $200 | $8 | 96% |
| Batch processing | $180 | $12 | 93% |
| Quick Q&A | $47 | $20 (sub) | 57% |
| TOTAL | $847 | $250 | 70% |

The biggest savings came from routing simple tasks to DeepSeek and batch jobs to the cheapest provider. My total monthly cost dropped from $847 to approximately $250.
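The percentages in the table follow directly from the before and after columns. A quick check, using only the figures above (`savings_pct` is a hypothetical helper):

```python
# (before, after) monthly cost in USD, copied from the table above.
COSTS = {
    "Complex reasoning": (180, 90),
    "Code generation": (240, 120),
    "Simple classification": (200, 8),
    "Batch processing": (180, 12),
    "Quick Q&A": (47, 20),
}


def savings_pct(before, after):
    """Percentage saved, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)


total_before = sum(before for before, _ in COSTS.values())
total_after = sum(after for _, after in COSTS.values())

if __name__ == "__main__":
    for task, (before, after) in COSTS.items():
        print(f"{task:<22} {savings_pct(before, after)}%")
    print(f"{'TOTAL':<22} {savings_pct(total_before, total_after)}%")
```

The two budget-tier rows contribute $360 of the $597 saved per month, even though each premium row saves more in absolute terms than its 50% figure might suggest.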
Cost breakdown after hybrid implementation:
| Cost Component | Monthly Cost | Notes |
|---|---|---|
| Claude Pro subscription | $20 | Daily interactive use |
| OpenAI Plus subscription | $20 | Overflow, alternative |
| DeepSeek API | $30 | Batch processing, simple tasks |
| Claude API (complex tasks) | $180 | Reasoning, code generation |
| Total | $250 | 70% reduction |
The Tradeoffs: Complexity vs. Savings
The Reddit warning about system complexity is real. Here’s what I learned:
| Challenge | Problem | My Solution |
|---|---|---|
| Multiple failure modes | Failures at any provider | Circuit breaker per provider |
| Inconsistent responses | Same prompt, different outputs | Response validation, quality checks |
| Rate limit confusion | Which provider hit limit? | Per-provider rate limit tracking |
| Cost attribution | Who spent what? | Tag every request with cost center |
| Debugging difficulty | Where did request fail? | Correlation IDs for all requests |
| Testing overhead | Must test all providers | Provider-agnostic test suites |
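For the "inconsistent responses" row, a small normalization step goes a long way when different providers phrase the same answer differently. This is a minimal sketch for a two-label classifier; `ALLOWED` and `normalize_label` are assumptions for the example, not part of any provider SDK.

```python
import re

ALLOWED = ("positive", "negative")


def normalize_label(raw, allowed=ALLOWED):
    """Map a free-form model reply onto one allowed label, or None if ambiguous."""
    text = raw.strip().lower()
    hits = [
        label for label in allowed
        if re.search(rf"\b{re.escape(label)}\b", text)
    ]
    # Exactly one label must appear; zero or several means the reply is unusable.
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    for reply in ["Positive.", " Sentiment: NEGATIVE\n", "positive or negative", "unsure"]:
        print(repr(reply), "->", normalize_label(reply))
```

A `None` result is a natural trigger for a retry on the fallback provider, and counting how often each provider produces one gives a cheap quality signal.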
Debugging tip: Add correlation IDs to every request.
```python
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TracedRequest:
    correlation_id: str
    provider: str
    model: str
    prompt_hash: str
    timestamp: datetime
    status: str
    latency_ms: int
    cost_usd: float
    error: Optional[str] = None


def trace_request(provider: str, prompt: str) -> TracedRequest:
    # One ID per request, attached to every log line and metric for that request
    correlation_id = str(uuid.uuid4())
    # Log to observability system
    return TracedRequest(
        correlation_id=correlation_id,
        provider=provider,
        # ... other fields
    )
```

Practical Recommendations
After running this system for months, here’s what I recommend:
Start Simple
Don’t start with ML-based routing. Use rule-based routing first. My initial router was 50 lines of Python with a YAML config. ML-based routing sounds cool, but you need data to train it, and the complexity isn’t worth it initially.
Invest in Observability
Track everything from day one:
- Request latency per provider
- Cost per request
- Error rates per provider
- Token usage per provider
I use Prometheus + Grafana, but any metrics system works.
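The four metrics above can be prototyped in memory before committing to a metrics stack. The sketch below uses only the standard library; `ProviderStats`, `STATS`, and `record` are hypothetical names, and a real setup would export these values to a metrics system instead of holding them in a dict.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProviderStats:
    requests: int = 0
    errors: int = 0
    tokens: int = 0
    cost_usd: float = 0.0
    latencies_ms: list = field(default_factory=list)

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    @property
    def avg_latency_ms(self):
        if not self.latencies_ms:
            return 0.0
        return sum(self.latencies_ms) / len(self.latencies_ms)


# One stats object per provider, created on first use.
STATS = defaultdict(ProviderStats)


def record(provider, latency_ms, tokens=0, cost_usd=0.0, error=False):
    """Record the outcome of one request against its provider."""
    stats = STATS[provider]
    stats.requests += 1
    stats.errors += int(error)
    stats.tokens += tokens
    stats.cost_usd += cost_usd
    stats.latencies_ms.append(latency_ms)
```

Calling `record("deepseek", 120, tokens=80, cost_usd=0.00002)` after each request is enough to compare providers on latency, error rate, and spend side by side.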
Design for Failure
Assume every provider will fail. Build circuit breakers and fallback chains. I’ve seen OpenAI outages, Claude rate limits, and Ollama crashes. The system should degrade gracefully.
Monitor Costs Daily
Set budgets and alerts. I have a daily budget alert at $20. If I exceed it, I get a notification. This has saved me from surprise bills multiple times.
Test Provider-Agnostic
Write tests against interfaces, not providers. This makes it easy to swap providers without rewriting tests.
```python
def test_classification_quality(router: LLMRouter):
    """Test that classification works across all providers."""
    context = TaskContext(
        task_type="classification",
        complexity="low",
        contains_pii=False,
        expected_tokens=100,
    )

    result = router.route(context, "Classify this text: positive or negative?")

    # Providers differ in casing and whitespace, so normalize before comparing
    assert result.strip().lower() in ["positive", "negative"]
    # Test works regardless of which provider handles it
```

Alternative Approaches
I considered other strategies before settling on this hybrid approach:
Option 1: All Subscription
Use Claude Pro and ChatGPT Plus subscriptions only. Cost: $40/month.
Pros: Simple, predictable cost. Cons: Limited API access, no batch processing, rate limits on heavy use.
Option 2: All API
Use Claude API or OpenAI API for everything. Cost: $400-800+/month.
Pros: No limits, maximum flexibility. Cons: Expensive, unpredictable costs.
Option 3: All Local
Use Ollama/LMStudio for everything. Cost: Hardware only.
Pros: Free, private, no rate limits. Cons: Quality gap on complex tasks, requires GPU, maintenance overhead.
My Choice: Hybrid
The hybrid approach gives me:
- Quality where I need it (complex reasoning)
- Cost savings where I don’t (simple tasks)
- Privacy when required (local models)
When to Skip the Hybrid Approach
Not everyone needs this complexity. Skip the hybrid approach if:
- You use LLMs sparingly. If your monthly API bill is under $50, just pick one provider.
- You don’t have batch tasks. If all your requests are real-time and user-facing, the routing overhead might not be worth it.
- You need consistent outputs. If your application requires identical outputs for the same prompt, multiple providers will cause headaches.
- You lack observability skills. Without proper monitoring, a multi-provider system becomes unmanageable.
My Current Stack
For an individual developer, I recommend:
| Tier | Use Case | Provider | Cost |
|---|---|---|---|
| Primary | Daily coding, research | Claude Pro subscription | $20/mo |
| Secondary | Overflow, complex tasks | OpenAI Plus subscription | $20/mo |
| Batch | High-volume processing | DeepSeek API | $10-30/mo |
| Privacy | Sensitive data | Ollama (local) | Hardware only |
Total estimated monthly cost: $50-70
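Compare this to my previous $847/month with the Claude API only: roughly a 90% cost reduction, with the same (or better) quality on most tasks. It comes in below the $250 I measured above because the stack in this table moves complex tasks off the pay-per-use Claude API and onto the flat-rate subscriptions.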
Lessons Learned
- Cost optimization requires usage analysis. I couldn’t optimize until I understood where money was going.
- Simple routing beats smart routing. My rule-based router works great. ML-based routing is overkill for most use cases.
- Observability is non-negotiable. Without tracking costs, failures, and latency, a multi-provider system becomes a black box.
- Local models have improved. Ollama with Llama 3.2 handles many tasks that previously required API calls.
- Subscriptions are underutilized. Claude Pro and ChatGPT Plus cover heavy interactive use for a flat $20/month each, subject to their usage limits. Use them for interactive work.
- Fallback chains save the day. When Claude went down during a critical task, my system automatically switched to GPT-4o. No user noticed.
- Test each provider independently. Different providers have different strengths and weaknesses. Test them all.
What’s Next
I’m experimenting with:
- ML-based routing: Using historical quality scores to improve routing decisions.
- Cost prediction: Estimating costs before running requests.
- Automatic provider selection: Let the system learn which provider works best for each task type.
- Quality scoring: Automatically rating response quality to inform future routing.
Related Knowledge
- Circuit Breaker Pattern: Essential for distributed systems. Read Martin Fowler’s article on circuit breakers.
- Cost Attribution: Every request should have a cost center. This helps identify which features drive costs.
- Observability: Prometheus + Grafana is my stack, but any metrics system works. The key is tracking per-provider metrics.
- Local LLMs: Ollama makes running local models trivial. Worth exploring if you have privacy requirements or high-volume simple tasks.
- LangChain: Provides abstractions for multi-provider LLM usage. My router is custom-built, but LangChain’s routing abstractions are getting better.
Reference Links
- OpenClaw Cost Optimization Discussion - The Reddit thread that started my journey
- Claude API Pricing - Official pricing page
- DeepSeek API - Budget API provider
- Ollama - Run LLMs locally
- LangChain Multi-Provider Support - Abstraction layer for multiple providers
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.