Skip to content

Why Do Open Source AI Models Perform Worse Through OpenRouter?

The Problem

I ran the same open source AI models through OpenRouter and got terrible results. Then I tried the exact same models through Ollama Cloud and they worked perfectly.

Same models. Same prompts. Wildly different outcomes.

This confused me for weeks. I thought I was debugging model issues, but I was actually debugging platform problems.

What I Was Testing

I was running agentic coding workflows with open source models. Two models in particular:

  • GLM-5 - For code analysis and modification
  • Devstral 2 - For agentic coding tasks

Both available on OpenRouter and Ollama Cloud. Same models, different platforms.

The Shocking Difference

GLM-5 Test Results
Platform: OpenRouter
Result: FAILED
- 2 "false starts" where the model froze completely
- On the third try, it generated philosophical content instead of analyzing code
- Wasted time debugging why the model "went crazy"
Platform: Ollama Cloud
Result: SUCCESS
- Ran without problems on first try
- 11 files changed
- 47 insertions(+), 42 deletions(-)
Devstral 2 Test Results
Platform: OpenRouter
Result: CATASTROPHIC FAILURE
- Model went berserk
- Hammered the API with 200k token prompts
- Burned $5 in a single failed run
- No useful output
Platform: Ollama Cloud
Result: SUCCESS
- Ran without problems
- 8 files changed
- 20 insertions(+), 15 deletions(-)

The pattern was clear. OpenRouter was causing problems that looked like model failures but were actually platform failures.

Why This Happens

After digging deeper, I found several reasons why open source models perform worse through aggregators like OpenRouter.

1. Inference Pipeline Fragmentation

OpenRouter is an aggregator. It routes your request to different backend providers, each with their own inference stack:

Request Flow Comparison
OpenRouter:
You -> OpenRouter API -> Backend Provider A (unknown config) -> Model
Ollama Cloud:
You -> Ollama API -> Unified Optimized Stack -> Model

The backend provider for GLM-5 on OpenRouter might be completely different from one request to the next. Each provider has different:

  • Hardware configurations
  • Software versions
  • Optimization settings
  • Token handling

Ollama Cloud runs a single, unified stack optimized for their models.

2. Prompt Processing Layers

OpenRouter adds intermediate processing that can corrupt prompts:

Prompt Transformation
Your prompt -> OpenRouter routing layer -> Backend provider layer -> Model
Each layer can:
- Truncate context that exceeds limits
- Reformat prompts for compatibility
- Add system instructions you didn't request
- Modify token handling

This explains why GLM-5 generated philosophical content instead of code analysis. Something in the pipeline corrupted the task.

3. Context Window Mismanagement

OpenRouter’s routing layer sometimes fails to handle context windows properly:

Context Window Issues
Problem: Request goes to backend with smaller context window than expected
Result: Context truncated mid-conversation
Symptom: Model "forgets" earlier instructions, produces irrelevant output
Problem: Context not properly formatted for specific backend
Result: Model receives malformed context
Symptom: Model outputs nonsense or freezes

4. Rate Limiting Chaos

Aggregators have complex rate limiting across multiple backends:

Rate Limiting Complexity
OpenRouter manages:
- Your account rate limits
- Backend provider A rate limits
- Backend provider B rate limits
- Global routing quotas
Failure mode:
Your request -> Rate limited by provider A
-> Rerouted to provider B
-> Different model version or config
-> Unpredictable output

This explains the “false starts” I experienced. Requests were hitting rate limits, timing out, or being rerouted to different backends.

5. Model Version Drift

The most insidious issue: OpenRouter may route to different implementations of the “same” model:

Model Version Confusion
You request: GLM-5
You might get: GLM-5 v1.2 on provider A
or GLM-5 v1.1 on provider B
or GLM-5-quantized on provider C
Each version has different:
- Quantization levels
- Context lengths
- Fine-tuning
- Performance characteristics

You think you’re testing the same model, but you’re not.

The Cost of Platform Problems

The Devstral incident cost me $5 in a single failed run. But the real cost was higher:

Hidden Costs of Platform Instability
Time debugging "model issues": ~4 hours
Failed API calls: 12
Wasted tokens: ~300k
Frustration: Immeasurable
Actual model quality issues: 0 (it was the platform all along)

I spent hours thinking the models were bad, when the platform was the problem.

When to Use Each Platform

Based on my experience:

Platform Selection Guide
Use Ollama Cloud when:
- You need stable, predictable inference
- Running production workloads
- Using agentic coding workflows
- Cost predictability matters
- Debugging model behavior (not platform behavior)
Use OpenRouter when:
- Exploring different models
- Comparing model capabilities
- Experimenting before committing
- You need access to many models quickly
- Cost per query matters more than stability

Common Mistakes I Made

Blaming the Model

When GLM-5 output philosophical ramblings instead of code analysis, I thought the model was bad. It wasn’t. The platform corrupted the task.

Assuming All APIs Are Equal

I assumed GLM-5 via OpenRouter would be identical to GLM-5 via Ollama Cloud. They’re not. The inference infrastructure matters as much as the model.

Not Monitoring Per-Platform Costs

I didn’t track costs by platform. When I did, I found OpenRouter’s instability was more expensive than Ollama Cloud’s subscription:

Cost Comparison
Ollama Cloud: $20/month for stable inference
OpenRouter: $5 single failed Devstral run + countless retries
Monthly OpenRouter costs for unreliable inference: Often exceeded $20

Using Aggregators for Production

OpenRouter is great for exploration. It’s terrible for production agentic workflows where consistency matters.

How to Test This Yourself

If you’re experiencing inconsistent AI model behavior, test across platforms:

  1. Run the same prompt 5 times on each platform

    • OpenRouter: Note variations in output quality
    • Ollama Cloud: Should be consistent
  2. Monitor token usage

    • OpenRouter: Watch for unexpected token spikes
    • Ollama Cloud: Should match your expectations
  3. Test agentic workflows specifically

    • These stress test the inference pipeline
    • Single prompts may work fine, multi-step agents expose problems
  4. Track costs per successful output

    • Include failed attempts in your calculation
    • Unstable platforms cost more than they appear

The Bottom Line

Open source AI models through OpenRouter perform worse because of platform instability, not model quality. The aggregator layer introduces variability that manifests as:

  • Frozen responses
  • Irrelevant outputs
  • Uncontrolled token usage
  • Failed agentic workflows

Ollama Cloud’s $20/month plan provides stable inference that saves money compared to OpenRouter’s failed runs and wasted tokens.

My recommendation: Use Ollama Cloud for production workloads with open source LLMs. Reserve OpenRouter for model exploration and comparison shopping before committing to a provider.

The quality difference you’re experiencing might not be the model. It might be the platform.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments