Why Do Open Source AI Models Perform Worse Through OpenRouter?
The Problem
I ran the same open source AI models through OpenRouter and got terrible results. Then I tried the exact same models through Ollama Cloud and they worked perfectly.
Same models. Same prompts. Wildly different outcomes.
This confused me for weeks. I thought I was debugging model issues, but I was actually debugging platform problems.
What I Was Testing
I was running agentic coding workflows with open source models. Two models in particular:
- GLM-5 - For code analysis and modification
- Devstral 2 - For agentic coding tasks
Both available on OpenRouter and Ollama Cloud. Same models, different platforms.
The Shocking Difference
Platform: OpenRouterResult: FAILED- 2 "false starts" where the model froze completely- On the third try, it generated philosophical content instead of analyzing code- Wasted time debugging why the model "went crazy"
Platform: Ollama CloudResult: SUCCESS- Ran without problems on first try- 11 files changed- 47 insertions(+), 42 deletions(-)Platform: OpenRouterResult: CATASTROPHIC FAILURE- Model went berserk- Hammered the API with 200k token prompts- Burned $5 in a single failed run- No useful output
Platform: Ollama CloudResult: SUCCESS- Ran without problems- 8 files changed- 20 insertions(+), 15 deletions(-)The pattern was clear. OpenRouter was causing problems that looked like model failures but were actually platform failures.
Why This Happens
After digging deeper, I found several reasons why open source models perform worse through aggregators like OpenRouter.
1. Inference Pipeline Fragmentation
OpenRouter is an aggregator. It routes your request to different backend providers, each with their own inference stack:
OpenRouter:You -> OpenRouter API -> Backend Provider A (unknown config) -> Model
Ollama Cloud:You -> Ollama API -> Unified Optimized Stack -> ModelThe backend provider for GLM-5 on OpenRouter might be completely different from one request to the next. Each provider has different:
- Hardware configurations
- Software versions
- Optimization settings
- Token handling
Ollama Cloud runs a single, unified stack optimized for their models.
2. Prompt Processing Layers
OpenRouter adds intermediate processing that can corrupt prompts:
Your prompt -> OpenRouter routing layer -> Backend provider layer -> Model
Each layer can:- Truncate context that exceeds limits- Reformat prompts for compatibility- Add system instructions you didn't request- Modify token handlingThis explains why GLM-5 generated philosophical content instead of code analysis. Something in the pipeline corrupted the task.
3. Context Window Mismanagement
OpenRouter’s routing layer sometimes fails to handle context windows properly:
Problem: Request goes to backend with smaller context window than expectedResult: Context truncated mid-conversationSymptom: Model "forgets" earlier instructions, produces irrelevant output
Problem: Context not properly formatted for specific backendResult: Model receives malformed contextSymptom: Model outputs nonsense or freezes4. Rate Limiting Chaos
Aggregators have complex rate limiting across multiple backends:
OpenRouter manages:- Your account rate limits- Backend provider A rate limits- Backend provider B rate limits- Global routing quotas
Failure mode:Your request -> Rate limited by provider A -> Rerouted to provider B -> Different model version or config -> Unpredictable outputThis explains the “false starts” I experienced. Requests were hitting rate limits, timing out, or being rerouted to different backends.
5. Model Version Drift
The most insidious issue: OpenRouter may route to different implementations of the “same” model:
You request: GLM-5You might get: GLM-5 v1.2 on provider A or GLM-5 v1.1 on provider B or GLM-5-quantized on provider C
Each version has different:- Quantization levels- Context lengths- Fine-tuning- Performance characteristicsYou think you’re testing the same model, but you’re not.
The Cost of Platform Problems
The Devstral incident cost me $5 in a single failed run. But the real cost was higher:
Time debugging "model issues": ~4 hoursFailed API calls: 12Wasted tokens: ~300kFrustration: ImmeasurableActual model quality issues: 0 (it was the platform all along)I spent hours thinking the models were bad, when the platform was the problem.
When to Use Each Platform
Based on my experience:
Use Ollama Cloud when:- You need stable, predictable inference- Running production workloads- Using agentic coding workflows- Cost predictability matters- Debugging model behavior (not platform behavior)
Use OpenRouter when:- Exploring different models- Comparing model capabilities- Experimenting before committing- You need access to many models quickly- Cost per query matters more than stabilityCommon Mistakes I Made
Blaming the Model
When GLM-5 output philosophical ramblings instead of code analysis, I thought the model was bad. It wasn’t. The platform corrupted the task.
Assuming All APIs Are Equal
I assumed GLM-5 via OpenRouter would be identical to GLM-5 via Ollama Cloud. They’re not. The inference infrastructure matters as much as the model.
Not Monitoring Per-Platform Costs
I didn’t track costs by platform. When I did, I found OpenRouter’s instability was more expensive than Ollama Cloud’s subscription:
Ollama Cloud: $20/month for stable inferenceOpenRouter: $5 single failed Devstral run + countless retries
Monthly OpenRouter costs for unreliable inference: Often exceeded $20Using Aggregators for Production
OpenRouter is great for exploration. It’s terrible for production agentic workflows where consistency matters.
How to Test This Yourself
If you’re experiencing inconsistent AI model behavior, test across platforms:
-
Run the same prompt 5 times on each platform
- OpenRouter: Note variations in output quality
- Ollama Cloud: Should be consistent
-
Monitor token usage
- OpenRouter: Watch for unexpected token spikes
- Ollama Cloud: Should match your expectations
-
Test agentic workflows specifically
- These stress test the inference pipeline
- Single prompts may work fine, multi-step agents expose problems
-
Track costs per successful output
- Include failed attempts in your calculation
- Unstable platforms cost more than they appear
The Bottom Line
Open source AI models through OpenRouter perform worse because of platform instability, not model quality. The aggregator layer introduces variability that manifests as:
- Frozen responses
- Irrelevant outputs
- Uncontrolled token usage
- Failed agentic workflows
Ollama Cloud’s $20/month plan provides stable inference that saves money compared to OpenRouter’s failed runs and wasted tokens.
My recommendation: Use Ollama Cloud for production workloads with open source LLMs. Reserve OpenRouter for model exploration and comparison shopping before committing to a provider.
The quality difference you’re experiencing might not be the model. It might be the platform.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit: Open Source Model Quality Through OpenRouter vs Ollama Cloud
- 👨💻 OpenRouter Documentation
- 👨💻 Ollama Cloud
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments