How to Run Local AI Models with Ollama for OpenClaw
I was paying $20/month for cloud-based AI models with OpenClaw, and I kept wondering: “Can I run local AI models instead?” After all, I have a GPU sitting in my machine that’s mostly idle. Here’s what I discovered and how I set it up.
The Problem: Cloud Model Costs Add Up
Every month, I found myself spending more on API calls. The convenience of cloud-based models comes with a price tag. But when I tried to find free alternatives, I hit roadblocks:
- Some “free” services have severe rate limits
- Others require constant re-authentication
- Many just don’t work reliably with OpenClaw
Then I discovered Ollama, and it changed everything.
What is Ollama?
Ollama is a tool that lets you run large language models locally on your machine. It’s like having your own private ChatGPT running on your GPU. The key benefits:
- Free: No API costs after initial setup
- Private: Your data stays on your machine
- Fast: No network latency for inference
But the big question remained: Would it work with OpenClaw?
Hardware Check: Can My Machine Handle It?
Before diving in, I needed to verify my hardware. Local LLMs are GPU-hungry. Here’s what I learned from the community:
Minimum viable setup:
- GPU: Nvidia 3060 12GB (this is the baseline)
- RAM: 16GB system memory
- Storage: 20GB+ for model files
I have an older machine with an Nvidia 3060 12GB. According to Reddit users, this setup works reliably with smaller models like Qwen 3.5:4B for chat and Qwen 3.5:9B for coding. To check how much VRAM your card has and how much is in use, run:
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv
My output showed 12GB total VRAM, which is the minimum for decent performance. If you have less, you might need to use CPU inference (much slower) or consider a GPU upgrade.
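If you’d rather check this from a script, here’s a minimal Python sketch that turns that CSV output into numbers. The function name parse_gpu_memory and the sample text are my own illustration, and I’m assuming the usual output shape: a header row followed by one row per GPU with values like "12288 MiB".

```python
import csv
import io


def parse_gpu_memory(csv_text: str) -> list[dict[str, int]]:
    """Parse `nvidia-smi --query-gpu=memory.used,memory.total --format=csv` output.

    Returns one dict per GPU with used, total, and free memory in MiB.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    gpus = []
    for row in rows[1:]:  # rows[0] is the header line
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Each cell looks like "512 MiB"; keep only the number.
        used, total = (int(cell.strip().split()[0]) for cell in row[:2])
        gpus.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return gpus


# Sample text in the shape nvidia-smi prints; the numbers are made up.
sample = "memory.used [MiB], memory.total [MiB]\n512 MiB, 12288 MiB\n"
for gpu in parse_gpu_memory(sample):
    print(f"{gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```

To use live data, pass in the stdout of the nvidia-smi command above, for example via subprocess.run(..., capture_output=True, text=True).stdout.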
Step 1: Installing Ollama
The installation was straightforward. On Linux/macOS:
    curl -fsSL https://ollama.ai/install.sh | sh
After installation, verify it’s working:
    ollama --version
Step 2: Choosing the Right Model
This is where I made my first mistake. I initially pulled a massive model that barely fit in my GPU memory, causing slow responses and occasional crashes.
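A back-of-the-envelope estimate would have saved me that mistake. The sketch below is a rough rule of thumb, not how Ollama itself accounts for memory: weights take roughly (parameters × bits per weight ÷ 8) bytes, plus a flat allowance for the KV cache and buffers. The 4-bit default, the 1.5GB overhead, and the 1GB headroom are all assumptions of mine; real usage grows with context length and varies by quantization, so treat the result as a lower bound.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: weight bytes plus a flat overhead allowance."""
    # 1 billion parameters at 8 bits each is roughly 1 GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


def fits(params_billion: float, vram_gb: float, headroom_gb: float = 1.0,
         **estimate_kwargs) -> bool:
    """True if the estimate plus some headroom fits in the given VRAM."""
    return estimate_vram_gb(params_billion, **estimate_kwargs) + headroom_gb <= vram_gb


# Compare a few model sizes against a 12GB card (assumed 4-bit quantization).
for size in (4, 7, 9, 14, 32):
    verdict = "fits" if fits(size, vram_gb=12) else "too large"
    print(f"{size}B: ~{estimate_vram_gb(size)} GB estimated, {verdict}")
```

The same 7B model at 16 bits per weight comes out above 12GB with this formula, which is why quantized builds matter so much on consumer cards.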
After researching and testing, here’s what works best for different use cases:
For Chat/Conversation:
- Qwen 3.5:4B - Very fast, minimal VRAM usage, good for casual conversation
- Qwen 2.5:7B - Balanced quality and speed
For Coding Tasks:
- Qwen 3.5:9B - Better reasoning for coding problems
- Qwen 2.5-Coder:7B - Specifically trained for code generation
    # Start with the most balanced option
    ollama pull qwen2.5:7b
    # For coding tasks
    ollama pull qwen3.5:9b
    # For resource-constrained systems
    ollama pull qwen3.5:4b
To see which models you have installed:
    ollama list
Step 3: Testing Ollama Locally
Before connecting to OpenClaw, I wanted to verify the model worked:
    ollama run qwen2.5:7b "Explain what OpenClaw is in one sentence"
The response was instant. My GPU utilization spiked to 80-90% during inference, then dropped back down. This confirmed everything was working correctly.
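Since OpenClaw will talk to Ollama over HTTP rather than through the CLI, it can also help to test the HTTP side directly. Here’s a minimal sketch using only the Python standard library and Ollama’s /api/generate endpoint with streaming turned off. The helper names build_generate_request and generate are mine, and the code assumes Ollama is listening on its default address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, print(generate("qwen2.5:7b", "Explain what OpenClaw is in one sentence")) should print the same kind of answer as the CLI test.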
Step 4: Connecting OpenClaw to Ollama
This step confused me at first. OpenClaw’s documentation mentioned local model support, but the exact configuration wasn’t obvious.
Here’s what worked:
- Open OpenClaw settings
- Navigate to provider configuration
- Select “Ollama” as the provider
- Set the endpoint to http://localhost:11434 (Ollama’s default port)
- Choose your pulled model from the dropdown
If you need to configure it manually via config file:
    {
      "provider": "ollama",
      "baseUrl": "http://localhost:11434",
      "model": "qwen2.5:7b",
      "temperature": 0.7,
      "maxTokens": 4096
    }
Step 5: Monitoring Performance
I wanted to see how my GPU handled the workload during actual use:
    watch -n 1 nvidia-smi
With Qwen 2.5:7B, my GPU memory usage stayed around 6-8GB during inference. The 3060 12GB handled it comfortably. When I tested Qwen 3.5:9B, memory usage climbed to 10GB, leaving less headroom but still functional.
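Memory is only half the picture; generation speed is the other half. When you call Ollama’s /api/generate endpoint with "stream": false, the JSON reply includes eval_count (tokens generated) and eval_duration (in nanoseconds), so tokens per second is a simple division. A small sketch, with a helper name of my own choosing and made-up sample numbers:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count (tokens) and eval_duration (ns)."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0  # field missing or zero: nothing to measure
    return response.get("eval_count", 0) / (duration_ns / 1e9)


# Made-up numbers for illustration, not a measurement from my machine.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Comparing this figure across models is a quick way to see what a larger model costs you in responsiveness.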
The Trade-offs I Discovered
After using this setup for a week, I found there’s truth to the saying: “You can’t have cheap (or free), good, and fast. You can have only two.”
What works well:
- General chat and conversation
- Simple code explanations
- Quick documentation lookups
- Refactoring suggestions
Where it struggles:
- Complex multi-file refactoring
- Detailed architectural reasoning
- Tasks requiring extensive context
For those complex tasks, I still occasionally fall back to paid cloud models. But for 80% of my daily use, the local setup is sufficient.
Model Selection Guide
After extensive testing, here’s my recommendation matrix:
| Your GPU | Chat Model | Coding Model | Notes |
|---|---|---|---|
| 3060 12GB | qwen3.5:4b | qwen2.5:7b | Good for light use |
| 3070/3080 | qwen2.5:7b | qwen3.5:9b | Balanced performance |
| 4090/5090 | qwen2.5:14b | qwen2.5-coder:14b | Heavy-duty tasks |
Troubleshooting Common Issues
Model won’t load / Out of memory:
- Try a smaller quantization
- Close other GPU-intensive applications
- Check if your model is too large for your VRAM
Slow responses:
- Verify GPU is being used (check nvidia-smi)
- Consider a smaller model
- Check system RAM usage (swap kills performance)
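For the first check, Ollama can also tell you directly how much of a loaded model sits in VRAM. The sketch below assumes the /api/ps endpoint (the HTTP counterpart of ollama ps) returns a "models" list whose entries carry size and size_vram in bytes, which is what the API documentation shows; verify the field names against your Ollama version. The helper names are mine.

```python
def gpu_share(model_entry: dict) -> float:
    """Fraction of a loaded model that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def describe_loaded_models(ps_payload: dict) -> list[str]:
    """One summary line per loaded model from a parsed /api/ps response."""
    lines = []
    for entry in ps_payload.get("models", []):
        lines.append(f"{entry.get('name', '?')}: {gpu_share(entry):.0%} in VRAM")
    return lines


# Made-up payload in the assumed shape: half of this model spilled to system RAM.
sample = {"models": [{"name": "qwen2.5:7b", "size": 8_000_000_000,
                      "size_vram": 4_000_000_000}]}
for line in describe_loaded_models(sample):
    print(line)
```

Anything well below 100% means part of the model is running from system RAM, which usually explains slow responses.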
OpenClaw connection errors:
- Make sure Ollama is running; if it isn’t, start it with: ollama serve
- Check if port 11434 is accessible
- Test with curl: curl http://localhost:11434/api/tags
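A common cause of connection errors is a model name in the OpenClaw config that doesn’t match anything Ollama has pulled. Here’s a small sketch that checks a name against the /api/tags response, which lists installed models under a "models" key with a "name" per entry. The function names are hypothetical helpers of mine, and fetch_tags needs a running Ollama server.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list[str]:
    """Model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def model_is_installed(tags_payload: dict, model: str) -> bool:
    """Check whether the configured model name has been pulled."""
    # Ollama treats a bare name such as "llama3" as the ":latest" tag.
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in installed_models(tags_payload)


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.loads(resp.read())
```

With the server up, model_is_installed(fetch_tags(), "qwen2.5:7b") tells you whether the model from the config is actually available. If fetch_tags raises a connection error instead, the problem is the server or the port rather than the model name.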
Is It Worth It?
For me, yes. I went from paying $20-30/month to effectively $0 for most of my AI-assisted tasks. The initial setup took an afternoon, and the hardware investment (the GPU) was something I already had.
However, this isn’t for everyone. If you don’t have a capable GPU, the experience will be frustratingly slow. And if you need top-tier reasoning for complex tasks, paid models still have an edge.
Final Thoughts
Running local AI models with Ollama for OpenClaw is not only possible but practical for many use cases. The key is matching your hardware to the right model size and understanding the limitations.
My setup—an old tower with a 3060 12GB—runs Qwen 2.5:7B smoothly for daily tasks. When I need more power, I switch to cloud models. It’s the best of both worlds: free for everyday use, paid for heavy lifting.
If you have the hardware, give it a try. The worst case is you learn something new about local LLMs. The best case is you save hundreds of dollars a year.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.