How I Built a Hybrid LLM Setup That Saved Me 90% on API Costs

Mar 24, 2026

My LLM API bill hit $847 last month. That’s when I realized I was doing everything wrong.

I was using Claude’s API for everything—simple classification tasks, complex reasoning, batch processing, even quick one-off questions. Every request cost money, and the monthly total shocked me.

Then I discovered a hybrid approach that other developers were using. A Reddit thread showed someone achieving the same results at one-twelfth the cost. The secret? Mixing subscriptions, APIs, and local models intelligently.

Here’s the architecture I built and the lessons I learned.

The Problem: Single-Provider Blindness

When I started building LLM-powered features, I did what most developers do—I picked a provider and stuck with it. Claude was my choice because of its reasoning capabilities.

The problem? I wasn’t thinking about costs at all.

Task Type              | Monthly Calls | Cost/Month
-----------------------|---------------|------------
Complex reasoning      | 500 calls     | $180
Code generation        | 1,200 calls   | $240
Simple classification  | 5,000 calls   | $200
Batch processing       | 10,000 calls  | $180
Quick Q&A              | 2,000 calls   | $47
-----------------------|---------------|------------
TOTAL                  | 18,700 calls  | $847

When I analyzed my usage, I realized I was using a Ferrari to deliver pizza. Complex reasoning tasks justified premium API costs, but simple classification? That could run on much cheaper alternatives.
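
The analysis itself is simple division: each row's monthly cost over its call count. A minimal sketch using the figures from the table above (the variable names are illustrative, not from my codebase):

```python
# Monthly usage from the table above: task -> (calls, cost in USD).
usage = {
    "Complex reasoning": (500, 180),
    "Code generation": (1_200, 240),
    "Simple classification": (5_000, 200),
    "Batch processing": (10_000, 180),
    "Quick Q&A": (2_000, 47),
}

# Cost per call separates tasks that are expensive per request
# from tasks that are expensive only because of volume.
cost_per_call = {task: cost / calls for task, (calls, cost) in usage.items()}

total_calls = sum(calls for calls, _ in usage.values())
total_cost = sum(cost for _, cost in usage.values())

# Print the most expensive task types per call first.
for task, per_call in sorted(cost_per_call.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<22} ${per_call:.4f} per call")
print(f"{'TOTAL':<22} {total_calls} calls, ${total_cost}")
```

Simple classification and batch processing together account for $380 of the bill at four cents or less per call, which is why they are the first candidates for a cheaper tier.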

The Hybrid Architecture

I built a routing layer that decides which provider to use based on task characteristics. Here’s the core architecture:

+---------------------------------------------------------------+
|                    HYBRID LLM ARCHITECTURE                     |
+---------------------------------------------------------------+
|                                                               |
|  +------------------+     +------------------+               |
|  |   APPLICATION    | --> |   ROUTING LAYER  |               |
|  |   LAYER          |     |   (Decision)     |               |
|  +------------------+     +--------+---------+               |
|                                    |                          |
|           +------------------------+------------------------+|
|           |             |            |            |          ||
|           v             v            v            v          ||
|     +----------+  +----------+  +----------+  +----------+  ||
|     | CLAUDE   |  | OPENAI   |  | DEEPSEEK |  | OLLAMA   |  ||
|     | Pro/API  |  | Plus/API |  |   API    |  |  LOCAL   |  ||
|     +----------+  +----------+  +----------+  +----------+  ||
|                                                               |
|  +------------------+     +------------------+               |
|  |  COST MONITOR    | <-- |  FALLBACK LOGIC  |               |
|  |  & ALERTING      |     |  & RETRY QUEUE   |               |
|  +------------------+     +------------------+               |
|                                                               |
+---------------------------------------------------------------+

The key insight: not all tasks need premium models. I could route simple tasks to cheaper options while preserving quality for complex work.

Step 1: Define Your Provider Tiers

I categorized providers into tiers based on cost and capability:

Tier	Provider	Cost Model	Best For
Premium	Claude Pro subscription	$20/month flat	Daily coding, research, quick questions
Premium	OpenAI Plus subscription	$20/month flat	Alternative for subscription usage
Pay-per-use Premium	Claude API	$3-15 per million tokens	Complex reasoning tasks
Pay-per-use Mid	OpenAI GPT-4 API	$10-30 per million tokens	Code generation, overflow
Budget	DeepSeek API	$0.14-0.28 per million tokens	Batch tasks, simple classification
Free	Ollama (local)	Hardware cost only	Privacy-sensitive, offline use

My tier strategy:

Premium subscriptions for daily interactive use
Budget APIs for high-volume tasks
Local models for privacy-sensitive data
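
To see why the tiers matter, it helps to price one workload across them. The sketch below is an estimate under stated assumptions: it reads the low end of each price range in the table as the input price and the high end as the output price, and the batch job (10,000 calls, 500 input and 100 output tokens each) is hypothetical.

```python
# USD per million tokens, taken from the tier table above. Treating the low end
# of each range as the input price and the high end as the output price is an
# assumption of this sketch.
PRICES = {
    "Claude API": {"input": 3.0, "output": 15.0},
    "DeepSeek API": {"input": 0.14, "output": 0.28},
    "Ollama (local)": {"input": 0.0, "output": 0.0},
}

def workload_cost(provider: str, calls: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of `calls` requests with the given tokens per request."""
    price = PRICES[provider]
    per_call = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return calls * per_call

# Hypothetical batch job: 10,000 calls, 500 input and 100 output tokens each.
estimates = {name: workload_cost(name, 10_000, 500, 100) for name in PRICES}
for name, cost in estimates.items():
    print(f"{name:<15} ${cost:,.2f}")
```

Under these assumptions the same job costs about $30 on the Claude API and under $1 on DeepSeek, with local inference costing only hardware and electricity.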

Step 2: Build the Routing Logic

I started with a simple rule-based router. No machine learning, just explicit rules.

routing_rules:
  - name: "privacy_sensitive"
    condition:
      contains_pii: true
    route_to: "ollama"
    fallback: null

  - name: "code_generation"
    condition:
      task_type: "code"
    route_to: "claude_sonnet"
    fallback: "gpt4o"

  - name: "batch_processing"
    condition:
      batch_size: ">100"
    route_to: "deepseek"
    fallback: "ollama"

  - name: "simple_tasks"
    condition:
      expected_tokens: "<500"
      complexity: "low"
    route_to: "deepseek"
    fallback: "ollama"

  - name: "complex_reasoning"
    condition:
      complexity: "high"
    route_to: "claude_opus"
    fallback: "gpt4o"

This configuration lives in a YAML file, making it easy to adjust routing without code changes.
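
Because the config can change without a code change, a typo in a provider name only shows up at request time. One way to catch that earlier is a small validation pass at startup. This is a sketch, not part of my router: `validate_rules` and `KNOWN_PROVIDERS` are hypothetical names, and the function works on the plain list of dicts that `yaml.safe_load` would return.

```python
from typing import Any, Dict, List

# Provider identifiers used in the routing config above.
KNOWN_PROVIDERS = {"claude_opus", "claude_sonnet", "gpt4o", "deepseek", "ollama"}

def validate_rules(rules: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means no issues were found."""
    problems = []
    seen_names = set()
    for index, rule in enumerate(rules):
        name = rule.get("name", f"<rule #{index}>")
        if name in seen_names:
            problems.append(f"{name}: duplicate rule name")
        seen_names.add(name)
        if not rule.get("condition"):
            problems.append(f"{name}: missing condition")
        if rule.get("route_to") not in KNOWN_PROVIDERS:
            problems.append(f"{name}: unknown route_to {rule.get('route_to')!r}")
        fallback = rule.get("fallback")
        if fallback is not None and fallback not in KNOWN_PROVIDERS:
            problems.append(f"{name}: unknown fallback {fallback!r}")
        if fallback is not None and fallback == rule.get("route_to"):
            problems.append(f"{name}: fallback is the same as route_to")
    return problems

# Two rules from the config above, plus one deliberately broken rule.
rules = [
    {"name": "privacy_sensitive", "condition": {"contains_pii": True},
     "route_to": "ollama", "fallback": None},
    {"name": "batch_processing", "condition": {"batch_size": ">100"},
     "route_to": "deepseek", "fallback": "ollama"},
    {"name": "typo_rule", "condition": {"complexity": "high"},
     "route_to": "claude_opuss", "fallback": "gpt4o"},
]
problems = validate_rules(rules)
for problem in problems:
    print(problem)
```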

Step 3: Implement the Router

Here’s a simplified Python implementation:

import yaml
from typing import Optional, Dict, Any
from dataclasses import dataclass
from enum import Enum

class Provider(Enum):
    CLAUDE_OPUS = "claude_opus"
    CLAUDE_SONNET = "claude_sonnet"
    GPT4O = "gpt4o"
    DEEPSEEK = "deepseek"
    OLLAMA = "ollama"

@dataclass
class TaskContext:
    task_type: str
    complexity: str  # "low", "medium", "high"
    contains_pii: bool
    expected_tokens: int
    batch_size: int = 1

class LLMRouter:
    def __init__(self, config_path: str = "routing-config.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
        self.providers = {
            Provider.CLAUDE_OPUS: ClaudeOpusClient(),
            Provider.CLAUDE_SONNET: ClaudeSonnetClient(),
            Provider.GPT4O: OpenAIClient(),
            Provider.DEEPSEEK: DeepSeekClient(),
            Provider.OLLAMA: OllamaClient(),
        }

    def route(self, context: TaskContext, prompt: str) -> str:
        """Route request to appropriate provider based on context."""
        for rule in self.config["routing_rules"]:
            if self._matches_rule(context, rule):
                provider = Provider(rule["route_to"])
                try:
                    return self.providers[provider].generate(prompt)
                except Exception as e:
                    if rule.get("fallback"):
                        fallback = Provider(rule["fallback"])
                        return self.providers[fallback].generate(prompt)
                    raise

        # Default to Claude Sonnet
        return self.providers[Provider.CLAUDE_SONNET].generate(prompt)

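    def _matches_rule(self, context: TaskContext, rule: dict) -> bool:
        """Check if task context satisfies every condition in the rule."""
        cond = rule["condition"]

        # All listed conditions must hold (AND). Returning True on the first
        # matching key would let "simple_tasks" ignore its expected_tokens limit.
        if "contains_pii" in cond and cond["contains_pii"] != context.contains_pii:
            return False

        if "task_type" in cond and cond["task_type"] != context.task_type:
            return False

        if "complexity" in cond and cond["complexity"] != context.complexity:
            return False

        if "batch_size" in cond:
            # Config values look like ">100"
            threshold = int(cond["batch_size"].lstrip(">"))
            if context.batch_size <= threshold:
                return False

        if "expected_tokens" in cond:
            # Config values look like "<500"
            limit = int(cond["expected_tokens"].lstrip("<"))
            if context.expected_tokens >= limit:
                return False

        return True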

# Usage example
router = LLMRouter()
context = TaskContext(
    task_type="code",
    complexity="high",
    contains_pii=False,
    expected_tokens=1500
)
response = router.route(context, "Write a Python function to parse CSV files")

The router tries the primary provider first, then falls back to the secondary provider if the primary fails.

Step 4: Handle Failures Gracefully

One Reddit commenter warned me: “once you introduce routing logic across multiple providers, the system becomes harder to reason about. Failures, inconsistencies, or rate limits don’t always surface clearly.”

They were right. I needed robust failure handling.

Circuit Breaker Pattern

from enum import Enum
from datetime import datetime, timedelta
from typing import Optional

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure: Optional[datetime] = None
        self.state = CircuitState.CLOSED

    def can_execute(self) -> bool:
        """Check if requests should be allowed through."""
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            if datetime.now() - self.last_failure > timedelta(seconds=self.recovery_timeout):
                self.state = CircuitState.HALF_OPEN
                return True
            return False

        # HALF_OPEN - let requests through until the next recorded
        # success (closes the circuit) or failure (re-opens it)
        return True

    def record_success(self):
        """Record successful request."""
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        """Record failed request."""
        self.failure_count += 1
        self.last_failure = datetime.now()

        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

Each provider gets its own circuit breaker. When one provider starts failing, the router automatically falls back to alternatives.
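
The wiring between breakers and the fallback chain can be sketched as follows. This is an illustration rather than my production code: `call_with_breakers` is a hypothetical helper, and `SimpleBreaker` is a stripped-down stand-in that exposes the same three methods as `CircuitBreaker` but has no recovery timer, so the example stays short and deterministic.

```python
from typing import Callable, Dict, List

class SimpleBreaker:
    """Stand-in with the same three methods as CircuitBreaker (no recovery timer)."""
    def __init__(self, failure_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.failure_count = 0

    def can_execute(self) -> bool:
        return self.failure_count < self.failure_threshold

    def record_success(self) -> None:
        self.failure_count = 0

    def record_failure(self) -> None:
        self.failure_count += 1

def call_with_breakers(
    chain: List[str],
    providers: Dict[str, Callable[[str], str]],
    breakers: Dict[str, SimpleBreaker],
    prompt: str,
) -> str:
    """Try each provider in order, skipping any whose breaker is open."""
    last_error = None
    for name in chain:
        breaker = breakers[name]
        if not breaker.can_execute():
            continue  # Breaker open: do not even attempt the call.
        try:
            result = providers[name](prompt)
        except Exception as error:
            breaker.record_failure()
            last_error = error
            continue
        breaker.record_success()
        return result
    raise RuntimeError(f"All providers in {chain} failed or are open") from last_error

# Simulated providers: the primary always fails, the backup always answers.
calls = {"primary": 0, "backup": 0}

def failing_primary(prompt: str) -> str:
    calls["primary"] += 1
    raise ConnectionError("simulated outage")

def healthy_backup(prompt: str) -> str:
    calls["backup"] += 1
    return f"backup answered: {prompt}"

providers = {"primary": failing_primary, "backup": healthy_backup}
breakers = {name: SimpleBreaker(failure_threshold=2) for name in providers}

answers = [
    call_with_breakers(["primary", "backup"], providers, breakers, "ping")
    for _ in range(4)
]
```

After two failures the primary's breaker opens, so the last two requests go straight to the backup without paying the latency of a doomed call.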

Retry with Exponential Backoff

import time
import random
from typing import Callable, TypeVar

T = TypeVar('T')

def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0
) -> T:
    """Execute function with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise

            delay = min(
                base_delay * (2 ** attempt) + random.uniform(0, 1),
                max_delay
            )
            time.sleep(delay)

    raise RuntimeError("Should not reach here")
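
One refinement worth considering: retrying every exception also retries errors that will never succeed, such as an invalid API key. The variant below is a sketch of retrying only errors marked as transient. `TransientError`, `retry_transient`, and the injectable `sleep` parameter are hypothetical additions for illustration; real provider SDKs raise their own exception types, which you would list in `retryable`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""

def retry_transient(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TransientError,),
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry func on retryable errors only; anything else propagates at once."""
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with up to one second of jitter, capped.
            delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
            sleep(delay)
    raise ValueError("max_retries must be at least 1")

# A function that fails twice with a transient error, then succeeds.
attempts = {"count": 0}

def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "ok"

# Pass list.append as the sleep function so the demo records delays
# instead of actually waiting.
delays = []
result = retry_transient(flaky, sleep=delays.append)
```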

Step 5: Monitor Costs Religiously

The biggest risk with hybrid setups is surprise bills. I built cost tracking into every request.

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List
import json

@dataclass
class CostEvent:
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: datetime
    request_id: str
    cost_center: str  # Which project/feature used this

class CostTracker:
    PRICING = {
        "claude_opus": {"input": 15.0, "output": 75.0},
        "claude_sonnet": {"input": 3.0, "output": 15.0},
        "gpt4o": {"input": 2.5, "output": 10.0},
        "deepseek": {"input": 0.14, "output": 0.28},
        "ollama": {"input": 0.0, "output": 0.0},
    }

    def __init__(self):
        self.events: List[CostEvent] = []

    def track_request(
        self,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        request_id: str,
        cost_center: str
    ) -> float:
        """Track cost of a request and return cost in USD."""
        pricing = self.PRICING.get(model, {"input": 0, "output": 0})

        cost = (
            (input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"]
        )

        event = CostEvent(
            provider=provider,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            timestamp=datetime.now(),
            request_id=request_id,
            cost_center=cost_center
        )

        self.events.append(event)
        return cost

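    def get_daily_spend(self, date: datetime = None) -> Dict[str, float]:
        """Get total spend by provider for a specific date (defaults to today)."""
        # Normalize to a calendar date so that passing a datetime compares
        # correctly against event.timestamp.date() below.
        date = (date or datetime.now()).date()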

        daily_spend = {}
        for event in self.events:
            if event.timestamp.date() == date:
                provider = event.provider
                daily_spend[provider] = daily_spend.get(provider, 0) + event.cost_usd

        return daily_spend

    def check_budget_alert(self, daily_limit: float = 10.0):
        """Check if daily spend exceeds limit."""
        today_spend = sum(self.get_daily_spend().values())
        if today_spend > daily_limit:
            # Send alert (email, Slack, etc.)
            print(f"WARNING: Daily spend ${today_spend:.2f} exceeds limit ${daily_limit:.2f}")

Every request gets tagged with a cost center (project/feature), making it easy to attribute costs.
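
Attribution then becomes a group-by over the tracked events. A minimal standalone sketch: the prices are copied from the `PRICING` table above, while `request_cost`, the cost center names, and the token counts are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# USD per million tokens, copied from the PRICING table above.
PRICING = {
    "claude_sonnet": {"input": 3.0, "output": 15.0},
    "deepseek": {"input": 0.14, "output": 0.28},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD; raises KeyError for unknown models instead of assuming $0."""
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical requests: (cost_center, model, input_tokens, output_tokens).
requests_log: List[Tuple[str, str, int, int]] = [
    ("search", "claude_sonnet", 1_200, 400),
    ("search", "deepseek", 10_000, 2_000),
    ("tagging", "deepseek", 50_000, 5_000),
]

# Sum spend per cost center.
spend_by_center: Dict[str, float] = defaultdict(float)
for center, model, tokens_in, tokens_out in requests_log:
    spend_by_center[center] += request_cost(model, tokens_in, tokens_out)

for center, spend in sorted(spend_by_center.items()):
    print(f"{center:<10} ${spend:.5f}")
```

Raising on an unknown model is a deliberate choice in this sketch: a silent default of $0 would make a newly added, unpriced model look free on the dashboard.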

Step 6: Handle Provider Differences

Different providers have different APIs and response formats. I built an abstraction layer.

from abc import ABC, abstractmethod

import anthropic  # pip install anthropic
import openai     # pip install openai (DeepSeek exposes an OpenAI-compatible API)

class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate response from provider."""
        pass

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        pass

    @abstractmethod
    def get_model_name(self) -> str:
        """Return model identifier."""
        pass

class ClaudeProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
        self.client = anthropic.Client(api_key=api_key)
        self.model = model

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=kwargs.get("max_tokens", 4096),
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

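    def count_tokens(self, text: str) -> int:
        # Recent SDK versions expose token counting on the Messages API
        result = self.client.messages.count_tokens(
            model=self.model,
            messages=[{"role": "user", "content": text}],
        )
        return result.input_tokens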

    def get_model_name(self) -> str:
        return self.model

class DeepSeekProvider(LLMProvider):
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(
            api_key=api_key,
            base_url="https://api.deepseek.com/v1"
        )

    def generate(self, prompt: str, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    def count_tokens(self, text: str) -> int:
        # Approximate token count (about 1.3 tokens per word); cast to int
        # to honor the declared return type
        return int(len(text.split()) * 1.3)

    def get_model_name(self) -> str:
        return "deepseek-chat"

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.2"):
        self.model = model
        self.base_url = "http://localhost:11434"

    def generate(self, prompt: str, **kwargs) -> str:
        import requests
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False}
        )
        # Surface HTTP errors (e.g. model not pulled) instead of a KeyError below
        response.raise_for_status()
        return response.json()["response"]

    def count_tokens(self, text: str) -> int:
        return len(text.split())

    def get_model_name(self) -> str:
        return self.model

The abstraction layer means I can add new providers without changing the routing logic.
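
The same interface also makes it cheap to add a fake provider for offline tests. A sketch under stated assumptions: `CannedProvider` is a hypothetical class, and the abstract base is repeated here only so the example runs on its own.

```python
from abc import ABC, abstractmethod
from typing import List

class LLMProvider(ABC):
    """Same three-method interface as above, repeated so the sketch is standalone."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str: ...

    @abstractmethod
    def count_tokens(self, text: str) -> int: ...

    @abstractmethod
    def get_model_name(self) -> str: ...

class CannedProvider(LLMProvider):
    """Hypothetical offline provider that returns a fixed reply; useful in tests."""

    def __init__(self, reply: str, model: str = "canned-v0"):
        self.reply = reply
        self.model = model
        self.prompts: List[str] = []  # Records every prompt for later assertions.

    def generate(self, prompt: str, **kwargs) -> str:
        self.prompts.append(prompt)
        return self.reply

    def count_tokens(self, text: str) -> int:
        return len(text.split())  # Whitespace approximation, as in OllamaProvider.

    def get_model_name(self) -> str:
        return self.model

provider: LLMProvider = CannedProvider(reply="positive")
answer = provider.generate("Classify: I love this product")
```

Anything that accepts an `LLMProvider` can now be exercised without network access or API spend.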

Step 7: Real-World Routing Decisions

Here’s my decision matrix for routing:

Task Type	Primary Provider	Fallback	Cost Tier	Why This Route
Complex reasoning	Claude Opus	GPT-4o	Premium	Best reasoning quality
Code generation	Claude Sonnet	GPT-4o	Mid	Excellent code understanding
Simple classification	DeepSeek	Ollama	Budget	10x cheaper, sufficient quality
Batch processing	DeepSeek	Ollama	Budget	Cost efficiency matters most
Privacy-sensitive	Ollama (local)	N/A	Free	Data never leaves machine
Quick questions	Claude Pro sub	ChatGPT sub	Fixed	Already paid subscription
High-volume tasks	DeepSeek API	Ollama	Budget	Marginal cost optimization
Production features	Claude API	GPT-4 API	Premium	SLA guarantees

Example routing scenarios:

# Scenario 1: Code generation with PII in prompt
context = TaskContext(
    task_type="code",
    complexity="high",
    contains_pii=True,
    expected_tokens=2000
)
# Routes to: Ollama (privacy-sensitive rule matches first)

# Scenario 2: Batch classification job
context = TaskContext(
    task_type="classification",
    complexity="low",
    contains_pii=False,
    expected_tokens=100,
    batch_size=500
)
# Routes to: DeepSeek (batch_processing rule matches)

# Scenario 3: Complex reasoning for research
context = TaskContext(
    task_type="analysis",
    complexity="high",
    contains_pii=False,
    expected_tokens=3000
)
# Routes to: Claude Opus (complex_reasoning rule matches)
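
These outcomes depend on rule order, since the first matching rule wins. The standalone sketch below re-expresses the rules as plain dicts and checks the three scenarios. It assumes every condition in a rule must hold; `first_match` and `condition_holds` are hypothetical helpers, not the router's actual methods.

```python
from typing import Any, Dict, List, Optional

# The routing rules from Step 2, in the same order, as plain dicts.
RULES: List[Dict[str, Any]] = [
    {"name": "privacy_sensitive", "condition": {"contains_pii": True}, "route_to": "ollama"},
    {"name": "code_generation", "condition": {"task_type": "code"}, "route_to": "claude_sonnet"},
    {"name": "batch_processing", "condition": {"batch_size": ">100"}, "route_to": "deepseek"},
    {"name": "simple_tasks", "condition": {"expected_tokens": "<500", "complexity": "low"},
     "route_to": "deepseek"},
    {"name": "complex_reasoning", "condition": {"complexity": "high"}, "route_to": "claude_opus"},
]

def condition_holds(key: str, expected: Any, task: Dict[str, Any]) -> bool:
    """Compare one condition; '>N' and '<N' strings are numeric thresholds."""
    actual = task[key]
    if isinstance(expected, str) and expected.startswith((">", "<")):
        threshold = int(expected[1:])
        return actual > threshold if expected[0] == ">" else actual < threshold
    return actual == expected

def first_match(task: Dict[str, Any], rules: List[Dict[str, Any]] = RULES) -> Optional[str]:
    """Return the name of the first rule whose conditions all hold, else None."""
    for rule in rules:
        if all(condition_holds(k, v, task) for k, v in rule["condition"].items()):
            return rule["name"]
    return None

scenario_1 = {"task_type": "code", "complexity": "high", "contains_pii": True,
              "expected_tokens": 2000, "batch_size": 1}
scenario_2 = {"task_type": "classification", "complexity": "low", "contains_pii": False,
              "expected_tokens": 100, "batch_size": 500}
scenario_3 = {"task_type": "analysis", "complexity": "high", "contains_pii": False,
              "expected_tokens": 3000, "batch_size": 1}

matches = [first_match(s) for s in (scenario_1, scenario_2, scenario_3)]

# Moving the privacy rule to the end changes the outcome for scenario 1.
reordered = RULES[1:] + RULES[:1]
leaky_match = first_match(scenario_1, reordered)
```

With the privacy rule moved to the end, scenario 1 matches `code_generation` first and the PII-bearing prompt would go to a cloud provider, so the privacy rule has to stay at the top of the file.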

The Results: Cost Savings

After three months of running this hybrid setup, here’s what I found:

Task Type              | Before (Claude API) | After (Hybrid) | Savings
-----------------------|--------------------|----------------|---------
Complex reasoning      | $180               | $90            | 50%
Code generation        | $240               | $120           | 50%
Simple classification  | $200               | $8             | 96%
Batch processing       | $180               | $12            | 93%
Quick Q&A              | $47                | $20 (sub)      | 57%
-----------------------|--------------------|----------------|---------
TOTAL                  | $847               | $250           | 70%

The biggest savings came from routing simple tasks to DeepSeek and batch jobs to the cheapest provider. My total monthly cost dropped from $847 to approximately $250.
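
The savings column is just one minus the ratio of after to before. A short sketch that recomputes it from the table's figures, useful as a sanity check when the numbers change month to month:

```python
# Monthly cost in USD per task type, from the table above.
before = {
    "Complex reasoning": 180,
    "Code generation": 240,
    "Simple classification": 200,
    "Batch processing": 180,
    "Quick Q&A": 47,
}
after = {
    "Complex reasoning": 90,
    "Code generation": 120,
    "Simple classification": 8,
    "Batch processing": 12,
    "Quick Q&A": 20,
}

# Percentage saved per task, rounded to the nearest whole percent.
savings_pct = {task: round(100 * (1 - after[task] / before[task])) for task in before}

total_before = sum(before.values())
total_after = sum(after.values())
total_savings_pct = round(100 * (1 - total_after / total_before))

for task, pct in savings_pct.items():
    print(f"{task:<22} {pct}%")
print(f"{'TOTAL':<22} {total_savings_pct}% (${total_before} -> ${total_after})")
```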

Cost breakdown after hybrid implementation:

Cost Component	Monthly Cost	Notes
Claude Pro subscription	$20	Daily interactive use
OpenAI Plus subscription	$20	Overflow, alternative
DeepSeek API	$30	Batch processing, simple tasks
Claude API (complex tasks)	$180	Reasoning, code generation
Total	$250	70% reduction

The Tradeoffs: Complexity vs. Savings

The Reddit warning about system complexity is real. Here’s what I learned:

Challenge	Problem	My Solution
Multiple failure modes	Failures at any provider	Circuit breaker per provider
Inconsistent responses	Same prompt, different outputs	Response validation, quality checks
Rate limit confusion	Which provider hit limit?	Per-provider rate limit tracking
Cost attribution	Who spent what?	Tag every request with cost center
Debugging difficulty	Where did request fail?	Correlation IDs for all requests
Testing overhead	Must test all providers	Provider-agnostic test suites

Debugging tip: Add correlation IDs to every request.

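import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TracedRequest:
    correlation_id: str
    provider: str
    model: str
    prompt_hash: str
    timestamp: datetime
    status: str
    latency_ms: int
    cost_usd: float
    error: Optional[str] = None

def trace_request(provider: str, model: str, prompt: str) -> TracedRequest:
    """Create a trace record for a request that is about to be sent."""
    correlation_id = str(uuid.uuid4())
    # Log to observability system
    return TracedRequest(
        correlation_id=correlation_id,
        provider=provider,
        model=model,
        # Store a hash rather than the raw prompt so traces never contain PII
        prompt_hash=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        timestamp=datetime.now(),
        # Placeholder values; update them once the provider responds
        status="pending",
        latency_ms=0,
        cost_usd=0.0,
    )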
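
To make the ID show up on every log line for a request, the standard library's `logging.LoggerAdapter` is enough. A minimal sketch: the logger name, the log format, and the `request_logger` helper are illustrative choices, and the output goes to an in-memory buffer so the example has no side effects.

```python
import io
import logging
import uuid

# In-memory sink so the example has no side effects.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
)

logger = logging.getLogger("llm_router_demo")  # Hypothetical logger name.
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # Keep demo output out of the root logger.

def request_logger(correlation_id: str) -> logging.LoggerAdapter:
    """Bind a correlation ID so every log line for one request carries it."""
    return logging.LoggerAdapter(logger, {"correlation_id": correlation_id})

correlation_id = str(uuid.uuid4())
log = request_logger(correlation_id)
log.info("routing to %s", "deepseek")
log.warning("primary failed, falling back to %s", "ollama")

lines = stream.getvalue().splitlines()
```

Searching the logs for one correlation ID then returns the full path of a request across the primary attempt and any fallbacks.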

Practical Recommendations

After running this system for months, here’s what I recommend:

Start Simple

Don’t start with ML-based routing. Use rule-based routing first. My initial router was 50 lines of Python with a YAML config. ML-based routing sounds cool, but you need data to train it, and the complexity isn’t worth it initially.

Invest in Observability

Track everything from day one:

Request latency per provider
Cost per request
Error rates per provider
Token usage per provider

I use Prometheus + Grafana, but any metrics system works.
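
Before wiring up a metrics backend, the four measurements can be prototyped in memory. The sketch below is not my Prometheus setup: `ProviderStats`, `Metrics`, and the sample numbers are hypothetical, and a real deployment would export these values rather than hold them in a dict.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProviderStats:
    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0
    tokens: int = 0
    cost_usd: float = 0.0

    @property
    def requests(self) -> int:
        return len(self.latencies_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def avg_latency_ms(self) -> float:
        return sum(self.latencies_ms) / self.requests if self.requests else 0.0

class Metrics:
    """Per-provider latency, cost, error, and token counters."""

    def __init__(self) -> None:
        self.by_provider: Dict[str, ProviderStats] = defaultdict(ProviderStats)

    def record(self, provider: str, latency_ms: float, tokens: int,
               cost_usd: float, ok: bool = True) -> None:
        stats = self.by_provider[provider]
        stats.latencies_ms.append(latency_ms)
        stats.tokens += tokens
        stats.cost_usd += cost_usd
        if not ok:
            stats.errors += 1

# Made-up sample data: two fast DeepSeek calls, one slow Opus call, one Opus timeout.
metrics = Metrics()
metrics.record("deepseek", 420.0, 600, 0.0001)
metrics.record("deepseek", 380.0, 550, 0.0001)
metrics.record("claude_opus", 2100.0, 3000, 0.12)
metrics.record("claude_opus", 30000.0, 0, 0.0, ok=False)

for name, stats in metrics.by_provider.items():
    print(f"{name:<12} avg {stats.avg_latency_ms:.0f} ms, "
          f"errors {stats.error_rate:.0%}, tokens {stats.tokens}, ${stats.cost_usd:.4f}")
```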

Design for Failure

Assume every provider will fail. Build circuit breakers and fallback chains. I’ve seen OpenAI outages, Claude rate limits, and Ollama crashes. The system should degrade gracefully.

Monitor Costs Daily

Set budgets and alerts. I have a daily budget alert at $20. If I exceed it, I get a notification. This has saved me from surprise bills multiple times.

Test Provider-Agnostic

Write tests against interfaces, not providers. This makes it easy to swap providers without rewriting tests.

def test_classification_quality(router: LLMRouter):
    """Test that classification works across all providers."""
    context = TaskContext(
        task_type="classification",
        complexity="low",
        contains_pii=False,
        expected_tokens=100
    )

    result = router.route(context, "Classify this text: positive or negative?")

    # Normalize before comparing: providers differ in casing and whitespace
    assert result.strip().lower() in ["positive", "negative"]
    # Test works regardless of which provider handles it

Alternative Approaches

I considered other strategies before settling on this hybrid approach:

Option 1: All Subscription

Use Claude Pro and ChatGPT Plus subscriptions only. Cost: $40/month.

Pros: Simple, predictable cost.
Cons: Limited API access, no batch processing, rate limits on heavy use.

Option 2: All API

Use Claude API or OpenAI API for everything. Cost: $400-800+/month.

Pros: No limits, maximum flexibility.
Cons: Expensive, unpredictable costs.

Option 3: All Local

Use Ollama/LMStudio for everything. Cost: Hardware only.

Pros: Free, private, no rate limits.
Cons: Quality gap on complex tasks, requires GPU, maintenance overhead.

My Choice: Hybrid

The hybrid approach gives me:

Quality where I need it (complex reasoning)
Cost savings where I don’t (simple tasks)
Privacy when required (local models)

When to Skip the Hybrid Approach

Not everyone needs this complexity. Skip the hybrid approach if:

You use LLMs sparingly. If your monthly API bill is under $50, just pick one provider.
You don’t have batch tasks. If all your requests are real-time and user-facing, the routing overhead might not be worth it.
You need consistent outputs. If your application requires identical outputs for the same prompt, multiple providers will cause headaches.
You lack observability skills. Without proper monitoring, a multi-provider system becomes unmanageable.

My Current Stack

For an individual developer, I recommend:

Tier	Use Case	Provider	Cost
Primary	Daily coding, research	Claude Pro subscription	$20/mo
Secondary	Overflow, complex tasks	OpenAI Plus subscription	$20/mo
Batch	High-volume processing	DeepSeek API	$10-30/mo
Privacy	Sensitive data	Ollama (local)	Hardware only

Total estimated monthly cost: $50-70

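Compare this to my previous $847/month with Claude API only. That’s a cost reduction of more than 90% for the same (or better) quality on most tasks. Note that this stack leaves out pay-per-use Claude API calls; keeping those for complex reasoning and code generation, as in my results above, brings the total to about $250, a 70% reduction.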

Lessons Learned

Cost optimization requires usage analysis. I couldn’t optimize until I understood where money was going.
Simple routing beats smart routing. My rule-based router works great. ML-based routing is overkill for most use cases.
Observability is non-negotiable. Without tracking costs, failures, and latency, a multi-provider system becomes a black box.
Local models have improved. Ollama with Llama 3.2 handles many tasks that previously required API calls.
Subscriptions are underutilized. Claude Pro and ChatGPT Plus give you generous access for a flat $20/month each, subject to rate limits under heavy use. Use them for interactive work.
Fallback chains save the day. When Claude went down during a critical task, my system automatically switched to GPT-4o. No user noticed.
Test each provider independently. Different providers have different strengths and weaknesses. Test them all.

What’s Next

I’m experimenting with:

ML-based routing: Using historical quality scores to improve routing decisions.
Cost prediction: Estimating costs before running requests.
Automatic provider selection: Let the system learn which provider works best for each task type.
Quality scoring: Automatically rating response quality to inform future routing.

Circuit Breaker Pattern: Essential for distributed systems. Read Martin Fowler’s article on circuit breakers.
Cost Attribution: Every request should have a cost center. This helps identify which features drive costs.
Observability: Prometheus + Grafana is my stack, but any metrics system works. The key is tracking per-provider metrics.
Local LLMs: Ollama makes running local models trivial. Worth exploring if you have privacy requirements or high-volume simple tasks.
LangChain: Provides abstractions for multi-provider LLM usage. My router is custom-built, but LangChain’s routing abstractions are getting better.

Reference Links

OpenClaw Cost Optimization Discussion - The Reddit thread that started my journey
Claude API Pricing - Official pricing page
DeepSeek API - Budget API provider
Ollama - Run LLMs locally
LangChain Multi-Provider Support - Abstraction layer for multiple providers
