How I Built a Hybrid LLM Setup That Saved Me 90% on API Costs

My LLM API bill hit $847 last month. That’s when I realized I was doing everything wrong.

I was using Claude’s API for everything—simple classification tasks, complex reasoning, batch processing, even quick one-off questions. Every request cost money, and the monthly total shocked me.

Then I discovered a hybrid approach that other developers were using. A Reddit thread showed someone achieving the same results at one-twelfth of the cost. The secret? Mixing subscriptions, APIs, and local models intelligently.

Here’s the architecture I built and the lessons I learned.

The Problem: Single-Provider Blindness

When I started building LLM-powered features, I did what most developers do—I picked a provider and stuck with it. Claude was my choice because of its reasoning capabilities.

The problem? I wasn’t thinking about costs at all.

My naive monthly API usage breakdown
Task Type | Monthly Calls | Cost/Month
-----------------------|---------------|------------
Complex reasoning | 500 calls | $180
Code generation | 1,200 calls | $240
Simple classification | 5,000 calls | $200
Batch processing | 10,000 calls | $180
Quick Q&A | 2,000 calls | $47
-----------------------|---------------|------------
TOTAL | 18,700 calls | $847

When I analyzed my usage, I realized I was using a Ferrari to deliver pizza. Complex reasoning tasks justified premium API costs, but simple classification? That could run on much cheaper alternatives.
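
To make that analysis concrete, here is a small sketch that ranks the task types from the breakdown above by cost per call and by share of the bill. The dictionary simply re-enters the table's numbers; the names `USAGE` and `summarize` are mine, chosen for illustration.

```python
# Monthly usage from the breakdown above: task -> (calls, cost in USD).
USAGE = {
    "Complex reasoning": (500, 180.0),
    "Code generation": (1_200, 240.0),
    "Simple classification": (5_000, 200.0),
    "Batch processing": (10_000, 180.0),
    "Quick Q&A": (2_000, 47.0),
}


def summarize(usage):
    """Return rows of (task, cost_per_call, share_of_bill), priciest call first."""
    total = sum(cost for _, cost in usage.values())
    rows = [
        (task, cost / calls, cost / total)
        for task, (calls, cost) in usage.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for task, per_call, share in summarize(USAGE):
        print(f"{task:<22} ${per_call:.4f}/call  {share:.0%} of bill")
```

Classification and batch calls cost a few cents or less each, yet together they make up roughly 45% of the total, which is why they were the first candidates for cheaper providers.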

The Hybrid Architecture

I built a routing layer that decides which provider to use based on task characteristics. Here’s the core architecture:

Hybrid LLM architecture overview
+---------------------------------------------------------------+
| HYBRID LLM ARCHITECTURE |
+---------------------------------------------------------------+
| |
| +------------------+ +------------------+ |
| | APPLICATION | --> | ROUTING LAYER | |
| | LAYER | | (Decision) | |
| +------------------+ +--------+---------+ |
| | |
| +------------------------+------------------------+|
| | | | | ||
| v v v v ||
| +----------+ +----------+ +----------+ +----------+ ||
| | CLAUDE | | OPENAI | | DEEPSEEK | | OLLAMA | ||
| | Pro/API | | Plus/API | | API | | LOCAL | ||
| +----------+ +----------+ +----------+ +----------+ ||
| |
| +------------------+ +------------------+ |
| | COST MONITOR | <-- | FALLBACK LOGIC | |
| | & ALERTING | | & RETRY QUEUE | |
| +------------------+ +------------------+ |
| |
+---------------------------------------------------------------+

The key insight: not all tasks need premium models. I could route simple tasks to cheaper options while preserving quality for complex work.

Step 1: Define Your Provider Tiers

I categorized providers into tiers based on cost and capability:

Tier                | Provider                 | Cost Model                    | Best For
--------------------|--------------------------|-------------------------------|----------------------------------------
Premium             | Claude Pro subscription  | $20/month flat                | Daily coding, research, quick questions
Premium             | OpenAI Plus subscription | $20/month flat                | Alternative for subscription usage
Pay-per-use Premium | Claude API               | $3-15 per million tokens      | Complex reasoning tasks
Pay-per-use Mid     | OpenAI GPT-4 API         | $10-30 per million tokens     | Code generation, overflow
Budget              | DeepSeek API             | $0.14-0.28 per million tokens | Batch tasks, simple classification
Free                | Ollama (local)           | Hardware cost only            | Privacy-sensitive, offline use

My tier strategy:

  • Premium subscriptions for daily interactive use
  • Budget APIs for high-volume tasks
  • Local models for privacy-sensitive data
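
To see why the tiers matter, here is a rough sketch that prices the same workload at each pay-per-use tier. The rates are the low ends of the ranges in the tier table; the 50-million-token workload is an invented figure for illustration, and real bills also depend on the input/output split.

```python
# USD per million tokens, low end of each range in the tier table.
PRICE_PER_MILLION = {
    "Claude API": 3.00,
    "OpenAI GPT-4 API": 10.00,
    "DeepSeek API": 0.14,
}


def workload_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of processing `tokens` tokens at a flat per-million rate."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    tokens = 50_000_000  # hypothetical monthly batch volume
    for name, price in PRICE_PER_MILLION.items():
        print(f"{name:<18} ${workload_cost(tokens, price):,.2f}")
```

At these rates the same volume costs about $150 on the Claude API and about $7 on DeepSeek, so the tier a task lands in dominates its cost.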

Step 2: Build the Routing Logic

I started with a simple rule-based router. No machine learning, just explicit rules.

routing-config.yaml
routing_rules:
  - name: "privacy_sensitive"
    condition:
      contains_pii: true
    route_to: "ollama"
    fallback: null
  - name: "code_generation"
    condition:
      task_type: "code"
    route_to: "claude_sonnet"
    fallback: "gpt4o"
  - name: "batch_processing"
    condition:
      batch_size: ">100"
    route_to: "deepseek"
    fallback: "ollama"
  - name: "simple_tasks"
    condition:
      expected_tokens: "<500"
      complexity: "low"
    route_to: "deepseek"
    fallback: "ollama"
  - name: "complex_reasoning"
    condition:
      complexity: "high"
    route_to: "claude_opus"
    fallback: "gpt4o"

This configuration lives in a YAML file, making it easy to adjust routing without code changes.
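
Because the routing lives in a hand-edited file, a typo in a provider name only shows up when a request hits that rule. One way to catch this early is a small validation pass at startup. The sketch below is my own addition, not part of the original router: it takes the already-parsed `routing_rules` list, and the names `KNOWN_PROVIDERS`, `KNOWN_CONDITIONS`, and `validate_rules` are hypothetical.

```python
# Provider and condition names taken from the routing config above.
KNOWN_PROVIDERS = {"claude_opus", "claude_sonnet", "gpt4o", "deepseek", "ollama"}
KNOWN_CONDITIONS = {
    "contains_pii", "task_type", "complexity", "batch_size", "expected_tokens",
}


def validate_rules(rules):
    """Return a list of human-readable problems; an empty list means the rules look usable."""
    problems = []
    for index, rule in enumerate(rules):
        name = rule.get("name", f"rule #{index}")
        if rule.get("route_to") not in KNOWN_PROVIDERS:
            problems.append(f"{name}: unknown route_to {rule.get('route_to')!r}")
        fallback = rule.get("fallback")
        if fallback is not None and fallback not in KNOWN_PROVIDERS:
            problems.append(f"{name}: unknown fallback {fallback!r}")
        unknown = set(rule.get("condition") or {}) - KNOWN_CONDITIONS
        if unknown:
            problems.append(f"{name}: unknown condition keys {sorted(unknown)}")
    return problems


if __name__ == "__main__":
    rules = [
        {"name": "typo", "condition": {"complexty": "low"},
         "route_to": "deepseak", "fallback": "ollama"},
    ]
    for problem in validate_rules(rules):
        print(problem)
```

In practice you would pass it `yaml.safe_load(f)["routing_rules"]` and refuse to start if the returned list is non-empty.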

Step 3: Implement the Router

Here’s a simplified Python implementation:

llm_router.py
import yaml
from dataclasses import dataclass
from enum import Enum


class Provider(Enum):
    CLAUDE_OPUS = "claude_opus"
    CLAUDE_SONNET = "claude_sonnet"
    GPT4O = "gpt4o"
    DEEPSEEK = "deepseek"
    OLLAMA = "ollama"


@dataclass
class TaskContext:
    task_type: str
    complexity: str  # "low", "medium", "high"
    contains_pii: bool
    expected_tokens: int
    batch_size: int = 1


class LLMRouter:
    def __init__(self, config_path: str = "routing-config.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        # The client classes are defined elsewhere; each must expose generate(prompt) -> str
        self.providers = {
            Provider.CLAUDE_OPUS: ClaudeOpusClient(),
            Provider.CLAUDE_SONNET: ClaudeSonnetClient(),
            Provider.GPT4O: OpenAIClient(),
            Provider.DEEPSEEK: DeepSeekClient(),
            Provider.OLLAMA: OllamaClient(),
        }

    def route(self, context: TaskContext, prompt: str) -> str:
        """Route request to appropriate provider based on context."""
        # Rules are checked in file order; the first match wins
        for rule in self.config["routing_rules"]:
            if self._matches_rule(context, rule):
                provider = Provider(rule["route_to"])
                try:
                    return self.providers[provider].generate(prompt)
                except Exception:
                    # Primary failed: use the fallback if the rule names one
                    if rule.get("fallback"):
                        fallback = Provider(rule["fallback"])
                        return self.providers[fallback].generate(prompt)
                    raise
        # Default to Claude Sonnet
        return self.providers[Provider.CLAUDE_SONNET].generate(prompt)

    def _matches_rule(self, context: TaskContext, rule: dict) -> bool:
        """Check that the task context satisfies every condition in the rule."""
        for key, expected in rule["condition"].items():
            actual = getattr(context, key)
            if isinstance(expected, str) and expected.startswith((">", "<")):
                # Numeric threshold such as ">100" or "<500"
                threshold = int(expected[1:])
                ok = actual > threshold if expected[0] == ">" else actual < threshold
            else:
                ok = actual == expected
            if not ok:
                return False
        return True


# Usage example
router = LLMRouter()
context = TaskContext(
    task_type="code",
    complexity="high",
    contains_pii=False,
    expected_tokens=1500
)
response = router.route(context, "Write a Python function to parse CSV files")

The router tries the primary provider first, then falls back to the secondary provider if the primary fails.

Step 4: Handle Failures Gracefully

One Reddit commenter warned me: “once you introduce routing logic across multiple providers, the system becomes harder to reason about. Failures, inconsistencies, or rate limits don’t always surface clearly.”

They were right. I needed robust failure handling.

Circuit Breaker Pattern

circuit_breaker.py
from enum import Enum
from datetime import datetime, timedelta
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure: Optional[datetime] = None
        self.state = CircuitState.CLOSED

    def can_execute(self) -> bool:
        """Check if requests should be allowed through."""
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if datetime.now() - self.last_failure > timedelta(seconds=self.recovery_timeout):
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # HALF_OPEN - allow one test request
        return True

    def record_success(self):
        """Record successful request."""
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        """Record failed request."""
        self.failure_count += 1
        self.last_failure = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

Each provider gets its own circuit breaker. When one provider starts failing, the router automatically falls back to alternatives.
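
The wiring between the breakers and the fallback chain is not shown above, so here is one way it could look. This is a sketch under my own assumptions: `SimpleBreaker` is a trimmed stand-in with the same three methods as `CircuitBreaker` (so the example runs on its own), and `call_with_breakers`, `broken`, and `healthy` are hypothetical names.

```python
import time


class SimpleBreaker:
    """Minimal stand-in exposing can_execute / record_success / record_failure."""

    def __init__(self, failure_threshold=2, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp while the breaker is open

    def can_execute(self):
        if self.opened_at is None:
            return True
        # Once the timeout has passed, let a trial request through.
        return time.monotonic() - self.opened_at >= self.recovery_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def call_with_breakers(chain, breakers, prompt):
    """Try each (name, generate) pair in order, skipping providers whose breaker is open."""
    last_error = None
    for name, generate in chain:
        breaker = breakers[name]
        if not breaker.can_execute():
            continue
        try:
            result = generate(prompt)
        except Exception as error:  # real code should catch provider-specific errors
            breaker.record_failure()
            last_error = error
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers unavailable") from last_error


def broken(prompt):
    """Simulated provider that is always down."""
    raise ConnectionError("provider down")


def healthy(prompt):
    """Simulated provider that always answers."""
    return f"echo: {prompt}"


if __name__ == "__main__":
    breakers = {"primary": SimpleBreaker(), "backup": SimpleBreaker()}
    chain = [("primary", broken), ("backup", healthy)]
    for _ in range(3):
        print(call_with_breakers(chain, breakers, "hello"))
    print("primary allowed:", breakers["primary"].can_execute())
```

After two failures the primary's breaker opens, so the third request goes straight to the backup without paying for another failed call.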

Retry with Exponential Backoff

retry_handler.py
import time
import random
from typing import Callable, TypeVar

T = TypeVar('T')


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0
) -> T:
    """Execute function with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            # Out of attempts: surface the last error to the caller
            if attempt == max_retries - 1:
                raise
            # Double the delay each attempt, add jitter, and cap at max_delay
            delay = min(
                base_delay * (2 ** attempt) + random.uniform(0, 1),
                max_delay
            )
            time.sleep(delay)
    raise RuntimeError("Should not reach here")
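
To make the timing concrete, the sketch below computes the waits that `retry_with_backoff` sleeps between attempts, leaving out the random jitter (which adds up to one second before the cap is applied). `backoff_schedule` is a helper name I made up for this illustration.

```python
def backoff_schedule(max_retries=3, base_delay=1.0, max_delay=30.0):
    """Delays in seconds (before jitter) slept between attempts."""
    # The final attempt re-raises instead of sleeping, hence max_retries - 1 delays.
    return [
        min(base_delay * 2 ** attempt, max_delay)
        for attempt in range(max_retries - 1)
    ]


if __name__ == "__main__":
    print(backoff_schedule())               # default settings
    print(backoff_schedule(max_retries=8))  # shows the cap at max_delay
```

With the defaults, a request that keeps failing waits about one second, then about two, and the third failure is raised to the caller. With eight attempts the delays double up to 16 seconds and then stay at the 30-second cap.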

Step 5: Monitor Costs Religiously

The biggest risk with hybrid setups is surprise bills. I built cost tracking into every request.

cost_tracker.py
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, List, Optional


@dataclass
class CostEvent:
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: datetime
    request_id: str
    cost_center: str  # Which project/feature used this


class CostTracker:
    # USD per million tokens; keys must match the `model` passed to track_request
    PRICING = {
        "claude_opus": {"input": 15.0, "output": 75.0},
        "claude_sonnet": {"input": 3.0, "output": 15.0},
        "gpt4o": {"input": 2.5, "output": 10.0},
        "deepseek": {"input": 0.14, "output": 0.28},
        "ollama": {"input": 0.0, "output": 0.0},
    }

    def __init__(self):
        self.events: List[CostEvent] = []

    def track_request(
        self,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        request_id: str,
        cost_center: str
    ) -> float:
        """Track cost of a request and return cost in USD."""
        # Unknown models are priced at zero, so keep PRICING in sync with your providers
        pricing = self.PRICING.get(model, {"input": 0, "output": 0})
        cost = (
            (input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"]
        )
        event = CostEvent(
            provider=provider,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            timestamp=datetime.now(),
            request_id=request_id,
            cost_center=cost_center
        )
        self.events.append(event)
        return cost

    def get_daily_spend(self, day: Optional[date] = None) -> Dict[str, float]:
        """Get total spend by provider for a specific date (default: today)."""
        if day is None:
            day = datetime.now().date()
        daily_spend: Dict[str, float] = {}
        for event in self.events:
            if event.timestamp.date() == day:
                provider = event.provider
                daily_spend[provider] = daily_spend.get(provider, 0) + event.cost_usd
        return daily_spend

    def check_budget_alert(self, daily_limit: float = 10.0):
        """Check if daily spend exceeds limit."""
        today_spend = sum(self.get_daily_spend().values())
        if today_spend > daily_limit:
            # Send alert (email, Slack, etc.)
            print(f"WARNING: Daily spend ${today_spend:.2f} exceeds limit ${daily_limit:.2f}")

Every request gets tagged with a cost center (project/feature), making it easy to attribute costs.
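
As a worked example of both the pricing formula and the attribution, the sketch below prices individual requests and sums them per cost center. The two prices are copied from the tracker's pricing table; the request list and the cost center names are made up for illustration.

```python
from collections import defaultdict

# USD per million tokens, copied from the pricing table in cost_tracker.py.
PRICING = {
    "claude_sonnet": {"input": 3.0, "output": 15.0},
    "deepseek": {"input": 0.14, "output": 0.28},
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request: tokens times per-million price, per direction."""
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def spend_by_cost_center(requests):
    """Sum request costs per cost center; items are (center, model, in_tokens, out_tokens)."""
    totals = defaultdict(float)
    for center, model, input_tokens, output_tokens in requests:
        totals[center] += request_cost(model, input_tokens, output_tokens)
    return dict(totals)


if __name__ == "__main__":
    requests = [  # hypothetical traffic
        ("search", "claude_sonnet", 1_200, 800),
        ("tagging", "deepseek", 1_000, 100),
        ("tagging", "deepseek", 1_000, 100),
    ]
    for center, total in spend_by_cost_center(requests).items():
        print(f"{center:<8} ${total:.6f}")
```

A single Sonnet request with 1,200 input and 800 output tokens costs about $0.0156, most of it from the output side, while a comparable DeepSeek call is a small fraction of a cent.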

Step 6: Handle Provider Differences

Different providers have different APIs and response formats. I built an abstraction layer.

provider_interface.py
from abc import ABC, abstractmethod

import anthropic
import openai
import requests


class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate response from provider."""
        pass

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        pass

    @abstractmethod
    def get_model_name(self) -> str:
        """Return model identifier."""
        pass


class ClaudeProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get("max_tokens", 4096),
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def count_tokens(self, text: str) -> int:
        # Recent SDK versions count tokens through the Messages API;
        # the count covers the whole user message, not just the raw text
        result = self.client.messages.count_tokens(
            model=self.model,
            messages=[{"role": "user", "content": text}]
        )
        return result.input_tokens

    def get_model_name(self) -> str:
        return self.model


class DeepSeekProvider(LLMProvider):
    def __init__(self, api_key: str):
        # DeepSeek exposes an OpenAI-compatible endpoint, so the OpenAI client works
        self.client = openai.OpenAI(
            api_key=api_key,
            base_url="https://api.deepseek.com/v1"
        )

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    def count_tokens(self, text: str) -> int:
        # Approximate token count (about 1.3 tokens per word)
        return int(len(text.split()) * 1.3)

    def get_model_name(self) -> str:
        return "deepseek-chat"


class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.2"):
        self.model = model
        self.base_url = "http://localhost:11434"

    def generate(self, prompt: str, **kwargs) -> str:
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False},
            timeout=kwargs.get("timeout", 120)
        )
        response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
        return response.json()["response"]

    def count_tokens(self, text: str) -> int:
        # Rough estimate: one token per whitespace-separated word
        return len(text.split())

    def get_model_name(self) -> str:
        return self.model

The abstraction layer means I can add new providers without changing the routing logic.
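
As an illustration of what adding a provider involves, here is a sketch of an offline provider that returns a canned reply. It is my own example rather than part of the original stack: `CannedProvider` and its default model name are hypothetical, and the interface is repeated so the snippet runs on its own.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """The same three-method interface, repeated so this sketch is self-contained."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str: ...

    @abstractmethod
    def count_tokens(self, text: str) -> int: ...

    @abstractmethod
    def get_model_name(self) -> str: ...


class CannedProvider(LLMProvider):
    """Hypothetical offline provider: returns a fixed reply and records prompts."""

    def __init__(self, reply: str, model: str = "canned-v0"):
        self.reply = reply
        self.model = model
        self.prompts = []  # every prompt seen, for later inspection

    def generate(self, prompt: str, **kwargs) -> str:
        self.prompts.append(prompt)
        return self.reply

    def count_tokens(self, text: str) -> int:
        return len(text.split())  # whitespace word count as a rough proxy

    def get_model_name(self) -> str:
        return self.model


if __name__ == "__main__":
    provider = CannedProvider("positive")
    print(provider.generate("Classify: great product!"))
    print(provider.get_model_name(), provider.count_tokens("three short words"))
```

A provider like this is also handy in tests, since it exercises the routing and tracking code without network calls or API spend.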

Step 7: Real-World Routing Decisions

Here’s my decision matrix for routing:

Task Type             | Primary Provider | Fallback    | Cost Tier | Why This Route
----------------------|------------------|-------------|-----------|--------------------------------
Complex reasoning     | Claude Opus      | GPT-4o      | Premium   | Best reasoning quality
Code generation       | Claude Sonnet    | GPT-4o      | Mid       | Excellent code understanding
Simple classification | DeepSeek         | Ollama      | Budget    | 10x cheaper, sufficient quality
Batch processing      | DeepSeek         | Ollama      | Budget    | Cost efficiency matters most
Privacy-sensitive     | Ollama (local)   | N/A         | Free      | Data never leaves machine
Quick questions       | Claude Pro sub   | ChatGPT sub | Fixed     | Already paid subscription
High-volume tasks     | DeepSeek API     | Ollama      | Budget    | Marginal cost optimization
Production features   | Claude API       | GPT-4 API   | Premium   | SLA guarantees

Example routing scenarios:

routing_examples.py
# Scenario 1: Code generation with PII in prompt
context = TaskContext(
    task_type="code",
    complexity="high",
    contains_pii=True,
    expected_tokens=2000
)
# Routes to: Ollama (privacy-sensitive rule matches first)

# Scenario 2: Batch classification job
context = TaskContext(
    task_type="classification",
    complexity="low",
    contains_pii=False,
    expected_tokens=100,
    batch_size=500
)
# Routes to: DeepSeek (batch_processing rule matches)

# Scenario 3: Complex reasoning for research
context = TaskContext(
    task_type="analysis",
    complexity="high",
    contains_pii=False,
    expected_tokens=3000
)
# Routes to: Claude Opus (complex_reasoning rule matches)
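
These scenarios can be checked without calling any provider. The sketch below re-enters the routing rules as plain dictionaries and applies first-match-wins, assuming that every condition listed in a rule must hold and that strings such as ">100" are numeric thresholds. `condition_holds` and `first_match` are names I introduced for the example.

```python
# Rules in the same order as routing-config.yaml (fallbacks omitted for brevity).
RULES = [
    {"condition": {"contains_pii": True}, "route_to": "ollama"},
    {"condition": {"task_type": "code"}, "route_to": "claude_sonnet"},
    {"condition": {"batch_size": ">100"}, "route_to": "deepseek"},
    {"condition": {"expected_tokens": "<500", "complexity": "low"}, "route_to": "deepseek"},
    {"condition": {"complexity": "high"}, "route_to": "claude_opus"},
]


def condition_holds(expected, actual):
    """Compare one condition; strings like ">100" or "<500" are numeric thresholds."""
    if isinstance(expected, str) and expected.startswith((">", "<")):
        threshold = int(expected[1:])
        return actual > threshold if expected[0] == ">" else actual < threshold
    return expected == actual


def first_match(context, rules=RULES, default="claude_sonnet"):
    """Return the route of the first rule whose conditions all hold for `context`."""
    for rule in rules:
        if all(condition_holds(v, context[k]) for k, v in rule["condition"].items()):
            return rule["route_to"]
    return default


SCENARIOS = {
    "code with PII": dict(task_type="code", complexity="high", contains_pii=True,
                          expected_tokens=2000, batch_size=1),
    "batch classification": dict(task_type="classification", complexity="low",
                                 contains_pii=False, expected_tokens=100, batch_size=500),
    "research analysis": dict(task_type="analysis", complexity="high",
                              contains_pii=False, expected_tokens=3000, batch_size=1),
}

if __name__ == "__main__":
    for name, context in SCENARIOS.items():
        print(f"{name:<22} -> {first_match(context)}")
```

Rule order matters here: the PII rule sits first so that sensitive prompts never reach a hosted API, even when a later rule would also match.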

The Results: Cost Savings

After three months of running this hybrid setup, here’s what I found:

Monthly cost comparison (before vs after hybrid)
Task Type | Before (Claude API) | After (Hybrid) | Savings
-----------------------|--------------------|----------------|---------
Complex reasoning | $180 | $90 | 50%
Code generation | $240 | $120 | 50%
Simple classification | $200 | $8 | 96%
Batch processing | $180 | $12 | 93%
Quick Q&A | $47 | $20 (sub) | 57%
-----------------------|--------------------|----------------|---------
TOTAL | $847 | $250 | 70%

The biggest savings came from routing simple tasks to DeepSeek and batch jobs to the cheapest provider. My total monthly cost dropped from $847 to approximately $250.
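
The percentages in the comparison follow from a one-line formula, savings = (before - after) / before. As a quick check, the sketch below recomputes them from the table's own figures; `COSTS`, `savings_pct`, and `totals` are illustrative names.

```python
# (before, after) monthly cost in USD, copied from the comparison table above.
COSTS = {
    "Complex reasoning": (180, 90),
    "Code generation": (240, 120),
    "Simple classification": (200, 8),
    "Batch processing": (180, 12),
    "Quick Q&A": (47, 20),
}


def savings_pct(before: float, after: float) -> float:
    """Percentage saved when a cost drops from `before` to `after`."""
    return (before - after) / before * 100


def totals(costs):
    """Sum the before and after columns."""
    before = sum(b for b, _ in costs.values())
    after = sum(a for _, a in costs.values())
    return before, after


if __name__ == "__main__":
    for task, (before, after) in COSTS.items():
        print(f"{task:<22} {savings_pct(before, after):5.1f}%")
    before, after = totals(COSTS)
    print(f"{'TOTAL':<22} {savings_pct(before, after):5.1f}%")
```

The total comes to about 70.5%, in line with the 70% shown in the table.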

Cost breakdown after hybrid implementation:

Cost Component             | Monthly Cost | Notes
---------------------------|--------------|-------------------------------
Claude Pro subscription    | $20          | Daily interactive use
OpenAI Plus subscription   | $20          | Overflow, alternative
DeepSeek API               | $30          | Batch processing, simple tasks
Claude API (complex tasks) | $180         | Reasoning, code generation
Total                      | $250         | 70% reduction

The Tradeoffs: Complexity vs. Savings

The Reddit warning about system complexity is real. Here’s what I learned:

Challenge              | Problem                       | My Solution
-----------------------|-------------------------------|------------------------------------
Multiple failure modes | Failures at any provider      | Circuit breaker per provider
Inconsistent responses | Same prompt, different outputs | Response validation, quality checks
Rate limit confusion   | Which provider hit limit?     | Per-provider rate limit tracking
Cost attribution       | Who spent what?               | Tag every request with cost center
Debugging difficulty   | Where did request fail?       | Correlation IDs for all requests
Testing overhead       | Must test all providers       | Provider-agnostic test suites

Debugging tip: Add correlation IDs to every request.

request_tracing.py
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TracedRequest:
    correlation_id: str
    provider: str
    model: str
    prompt_hash: str
    timestamp: datetime
    status: str
    latency_ms: int
    cost_usd: float
    error: Optional[str] = None


def trace_request(provider: str, prompt: str) -> TracedRequest:
    correlation_id = str(uuid.uuid4())
    # Log to observability system
    return TracedRequest(
        correlation_id=correlation_id,
        provider=provider,
        # The values below are placeholders; update them once the call completes
        model="unknown",
        prompt_hash=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        timestamp=datetime.now(),
        status="pending",
        latency_ms=0,
        cost_usd=0.0,
    )
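
To show where the correlation ID earns its keep, here is a sketch of a wrapper that times a provider call and emits one log record per request, whether it succeeds or fails. It is an illustration under my own assumptions: `traced_call` is a hypothetical helper, and the `log` callback stands in for whatever observability system you use.

```python
import hashlib
import time
import uuid


def traced_call(provider_name, generate, prompt, log=print):
    """Call `generate(prompt)` and log one record carrying a correlation ID."""
    record = {
        "correlation_id": str(uuid.uuid4()),
        "provider": provider_name,
        # Hash the prompt so records can be matched without storing its text.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "status": "ok",
        "error": None,
    }
    started = time.perf_counter()
    try:
        return generate(prompt)
    except Exception as error:
        record["status"] = "error"
        record["error"] = repr(error)
        raise
    finally:
        # Runs on both success and failure, so every request leaves a trace.
        record["latency_ms"] = int((time.perf_counter() - started) * 1000)
        log(record)


if __name__ == "__main__":
    print(traced_call("fake", lambda p: p.upper(), "hello"))
```

When a fallback fires, both the failed and the successful attempt leave a record; passing the same correlation ID into each attempt (instead of generating it per call as done here) would tie them to a single user request.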

Practical Recommendations

After running this system for months, here’s what I recommend:

Start Simple

Don’t start with ML-based routing. Use rule-based routing first. My initial router was 50 lines of Python with a YAML config. ML-based routing sounds cool, but you need data to train it, and the complexity isn’t worth it initially.

Invest in Observability

Track everything from day one:

  • Request latency per provider
  • Cost per request
  • Error rates per provider
  • Token usage per provider

I use Prometheus + Grafana, but any metrics system works.

Design for Failure

Assume every provider will fail. Build circuit breakers and fallback chains. I’ve seen OpenAI outages, Claude rate limits, and Ollama crashes. The system should degrade gracefully.

Monitor Costs Daily

Set budgets and alerts. I have a daily budget alert at $20. If I exceed it, I get a notification. This has saved me from surprise bills multiple times.

Test Provider-Agnostic

Write tests against interfaces, not providers. This makes it easy to swap providers without rewriting tests.

provider_agnostic_tests.py
from llm_router import LLMRouter, TaskContext


def test_classification_quality(router: LLMRouter):
    """Test that classification works across all providers."""
    context = TaskContext(
        task_type="classification",
        complexity="low",
        contains_pii=False,
        expected_tokens=100
    )
    result = router.route(context, "Classify this text: positive or negative?")
    # Normalize first: providers differ in casing and surrounding whitespace
    assert result.strip().lower() in ["positive", "negative"]
    # Test works regardless of which provider handles it

Alternative Approaches

I considered other strategies before settling on this hybrid approach:

Option 1: All Subscription

Use Claude Pro and ChatGPT Plus subscriptions only. Cost: $40/month.

Pros: Simple, predictable cost. Cons: Limited API access, no batch processing, rate limits on heavy use.

Option 2: All API

Use Claude API or OpenAI API for everything. Cost: $400-800+/month.

Pros: No limits, maximum flexibility. Cons: Expensive, unpredictable costs.

Option 3: All Local

Use Ollama/LMStudio for everything. Cost: Hardware only.

Pros: Free, private, no rate limits. Cons: Quality gap on complex tasks, requires GPU, maintenance overhead.

My Choice: Hybrid

The hybrid approach gives me:

  • Quality where I need it (complex reasoning)
  • Cost savings where I don’t (simple tasks)
  • Privacy when required (local models)

When to Skip the Hybrid Approach

Not everyone needs this complexity. Skip the hybrid approach if:

  1. You use LLMs sparingly. If your monthly API bill is under $50, just pick one provider.

  2. You don’t have batch tasks. If all your requests are real-time and user-facing, the routing overhead might not be worth it.

  3. You need consistent outputs. If your application requires identical outputs for the same prompt, multiple providers will cause headaches.

  4. You lack observability skills. Without proper monitoring, a multi-provider system becomes unmanageable.

My Current Stack

For an individual developer, I recommend:

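Tier      | Use Case                | Provider                 | Cost
----------|-------------------------|--------------------------|--------------
Primary   | Daily coding, research  | Claude Pro subscription  | $20/mo
Secondary | Overflow, complex tasks | OpenAI Plus subscription | $20/mo
Batch     | High-volume processing  | DeepSeek API             | $10-30/mo
Privacy   | Sensitive data          | Ollama (local)           | Hardware only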

Total estimated monthly cost: $50-70

Compare this to my previous $847/month with Claude API only. That’s a 90% cost reduction for the same (or better) quality on most tasks.

Lessons Learned

  1. Cost optimization requires usage analysis. I couldn’t optimize until I understood where money was going.

  2. Simple routing beats smart routing. My rule-based router works great. ML-based routing is overkill for most use cases.

  3. Observability is non-negotiable. Without tracking costs, failures, and latency, a multi-provider system becomes a black box.

  4. Local models have improved. Ollama with Llama 3.2 handles many tasks that previously required API calls.

  5. Subscriptions are underutilized. Claude Pro and ChatGPT Plus give you generous access for a flat $20/month each, subject to usage limits under heavy use. Use them for interactive work.

  6. Fallback chains save the day. When Claude went down during a critical task, my system automatically switched to GPT-4o. No user noticed.

  7. Test each provider independently. Different providers have different strengths and weaknesses. Test them all.

What’s Next

I’m experimenting with:

  1. ML-based routing: Using historical quality scores to improve routing decisions.

  2. Cost prediction: Estimating costs before running requests.

  3. Automatic provider selection: Let the system learn which provider works best for each task type.

  4. Quality scoring: Automatically rating response quality to inform future routing.

  • Circuit Breaker Pattern: Essential for distributed systems. Read Martin Fowler’s article on circuit breakers.

  • Cost Attribution: Every request should have a cost center. This helps identify which features drive costs.

  • Observability: Prometheus + Grafana is my stack, but any metrics system works. The key is tracking per-provider metrics.

  • Local LLMs: Ollama makes running local models trivial. Worth exploring if you have privacy requirements or high-volume simple tasks.

  • LangChain: Provides abstractions for multi-provider LLM usage. My router is custom-built, but LangChain’s routing abstractions are getting better.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
