How to Set Up GLM-5.1 + Qwen 3.5 for Planning and Execution Roles
When I started building AI-powered applications, I hit a wall. Frontier models like Claude or GPT-4 delivered great results, but the costs added up fast. Smaller open-source models were cheaper, but they struggled with complex reasoning tasks.
I needed a way to get the best of both worlds: deep reasoning when it matters, and fast cheap execution for everything else.
The Core Problem
Using a single model for all tasks is wasteful. Planning needs deep reasoning—breaking down problems, considering edge cases, designing architecture. Execution needs speed and consistency—generating code, processing data, making API calls.
When I ran everything through a frontier model, I burned budget on tasks that didn’t need that level of intelligence. When I used only smaller models, complex planning suffered.
The Solution: Role-Based Model Routing
I found a pattern that works: route planning tasks to GLM-5.1 and execution tasks to Qwen 3.5. Both run through Ollama Cloud, so I don’t need local GPU hardware.
Here’s why this combination clicks:
GLM-5.1 (Planning Role):
- Handles complex reasoning and task decomposition
- Achieves ~94.6% of Claude Opus 4.6’s coding benchmark performance
- Slower but smarter—exactly what planning needs
Qwen 3.5 (Execution Role):
- Apache 2.0 licensed (fully open source, commercial-friendly)
- Near-frontier performance on agentic tasks
- Fast and cost-effective for high-volume operations
Setting Up Ollama Cloud
First, I installed Ollama and pulled the models with the :cloud suffix. This routes requests through ollama.com instead of running locally.
# Install Ollamacurl -fsSL https://ollama.com/install.sh | sh
# Pull models with cloud suffixollama pull glm-5.1:cloudollama pull qwen-3.5:cloud
# Verify cloud routing is activeollama list | grep cloudThe :cloud suffix is critical. Without it, Ollama tries to run models locally, which requires GPU resources I don’t have on my laptop.
Configuring Role-Based Routing
I use oh-my-pi to define which model handles which role. The configuration is straightforward:
models: slow/plan: model: glm-5.1:cloud temperature: 0.3 max_tokens: 4096
default: model: qwen-3.5:cloud temperature: 0.7 max_tokens: 2048
routing: planning_tasks: - task_decomposition - strategy_design - code_review_planning - architecture_decisions
execution_tasks: - code_generation - data_processing - api_calls - routine_operationsThe slow/plan role uses GLM-5.1 with lower temperature for more deterministic output. The default role uses Qwen 3.5 with higher temperature for creative execution.
Using the Pipeline in Code
Here’s how I use this setup in a Python project:
from oh_my_pi import Agent
# Planning phase uses GLM-5.1planner = Agent(role="slow/plan")plan = planner.run("Design a REST API for user management")
# Execution phase uses Qwen 3.5executor = Agent(role="default")result = executor.run(f"Implement: {plan}")The planner breaks down the problem, considers edge cases, and produces a detailed plan. The executor takes that plan and generates working code.
What I Learned
After running this setup for a few weeks, a few patterns emerged:
Cost dropped significantly. I only pay for deep reasoning on tasks that actually need it. Routine code generation, data processing, and API calls run through the cheaper Qwen 3.5.
Quality stayed high. GLM-5.1’s planning quality rivals frontier models. The plans it produces are thorough and well-structured.
Flexibility improved. When better models come out, I swap them in the config. No code changes needed.
Mistakes I Made
A few things I got wrong at first:
Using GLM-5.1 for everything. The first week, I routed all tasks through the planner. Costs went up, not down. Be selective about what needs deep reasoning.
Skipping the :cloud suffix. Requests failed silently because I forgot the suffix. Always double-check the model names.
Not defining clear role boundaries. Initially, I had vague rules about which role to use. The routing became inconsistent. Explicit task lists for each role fixed this.
When to Use This Setup
This pattern works well when:
- You have a mix of planning-heavy and execution-heavy tasks
- Cost optimization matters
- You want open-source flexibility for some components
- Your tasks can be clearly categorized
It’s overkill for simple projects with uniform task types. If all your tasks are similar, a single model is easier to manage.
Getting Started
- Install Ollama and pull both models with
:cloudsuffix - Set up oh-my-pi with the routing configuration above
- Identify which of your tasks are planning vs. execution
- Test with a few representative tasks
- Adjust the routing rules based on results
The architecture scales well. As new models emerge, you can swap them in without rebuilding your pipeline.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Ollama Cloud
- 👨💻 GLM-5.1 Model
- 👨💻 Qwen 3.5 Model
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments