
How to Run Local AI Models with Ollama for OpenClaw

I was paying $20/month for cloud-based AI models with OpenClaw, and I kept wondering: “Can I run local AI models instead?” After all, I have a GPU sitting in my machine that’s mostly idle. Here’s what I discovered and how I set it up.

The Problem: Cloud Model Costs Add Up

Every month, I found myself spending more on API calls. The convenience of cloud-based models comes with a price tag. But when I tried to find free alternatives, I hit roadblocks:

  • Some “free” services have severe rate limits
  • Others require constant re-authentication
  • Many just don’t work reliably with OpenClaw

Then I discovered Ollama, and it changed everything.

What is Ollama?

Ollama is a tool that lets you run large language models locally on your machine. It’s like having your own private ChatGPT running on your GPU. The key benefits:

  • Free: No API costs after initial setup
  • Private: Your data stays on your machine
  • Fast: No network latency for inference

But the big question remained: Would it work with OpenClaw?

Hardware Check: Can My Machine Handle It?

Before diving in, I needed to verify my hardware. Local LLMs are GPU-hungry. Here’s what I learned from the community:

Minimum viable setup:

  • GPU: Nvidia 3060 12GB (this is the baseline)
  • RAM: 16GB system memory
  • Storage: 20GB+ for model files

I have an older machine with an Nvidia 3060 12GB. According to Reddit users, this setup works reliably with smaller models like Qwen 3.5:4B for chat and Qwen 3.5:9B for coding.

Check your GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

My output showed 12GB total VRAM, which is the minimum for decent performance. If you have less, you might need to use CPU inference (much slower) or consider a GPU upgrade.
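
If you'd rather script this check than read the numbers by eye, here is a minimal Python sketch. It parses the CSV that the nvidia-smi command above prints and compares the total against the 12GB baseline. The function names and the threshold constant are my own for illustration, and the threshold sits slightly under 12 × 1024 MiB on the assumption that some drivers report a little less than the nominal size.

```python
import subprocess

# Assumption: a bit of slack under the nominal 12 GiB (12288 MiB).
MIN_VRAM_MIB = 12000


def parse_gpu_memory(csv_text):
    """Parse `nvidia-smi --query-gpu=memory.used,memory.total --format=csv`.

    Returns one (used_mib, total_mib) tuple per GPU.
    """
    lines = [ln.strip() for ln in csv_text.strip().splitlines() if ln.strip()]
    gpus = []
    for line in lines[1:]:  # the first line is the CSV header
        # Each field looks like "12288 MiB"; keep only the number.
        used, total = (field.strip().split()[0] for field in line.split(","))
        gpus.append((int(used), int(total)))
    return gpus


def meets_baseline(total_mib, minimum_mib=MIN_VRAM_MIB):
    """True if the card has at least the baseline amount of VRAM."""
    return total_mib >= minimum_mib


if __name__ == "__main__":
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    for index, (used, total) in enumerate(parse_gpu_memory(out)):
        verdict = "OK" if meets_baseline(total) else "below the 12GB baseline"
        print(f"GPU {index}: {used}/{total} MiB used ({verdict})")
```

Keeping the parsing separate from the subprocess call means you can try the logic on pasted output from a machine you don't have shell access to.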

Step 1: Installing Ollama

The installation was straightforward. On Linux/macOS:

Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

After installation, verify it’s working:

Verify Ollama installation
ollama --version

Step 2: Choosing the Right Model

This is where I made my first mistake. I initially pulled a massive model that barely fit in my GPU memory, causing slow responses and occasional crashes.

After researching and testing, here’s what works best for different use cases:

For Chat/Conversation:

  • Qwen 3.5:4B - Very fast, minimal VRAM usage, good for casual conversation
  • Qwen 2.5:7B - Balanced quality and speed

For Coding Tasks:

  • Qwen 3.5:9B - Better reasoning for coding problems
  • Qwen 2.5-Coder:7B - Specifically trained for code generation

Pull recommended models
# Start with the most balanced option
ollama pull qwen2.5:7b
# For coding tasks
ollama pull qwen3.5:9b
# For resource-constrained systems
ollama pull qwen3.5:4b

To see which models you have installed:

List installed models
ollama list
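
The same information is available over Ollama's local HTTP API at /api/tags, which is handy if you want it in a script. Below is a small sketch that turns that response into name and size pairs. It assumes Ollama is listening on its default port 11434; `summarize_models` and `fetch_tags` are names I made up for this example.

```python
import json
import urllib.request


def summarize_models(tags_payload):
    """Turn an /api/tags response dict into (name, size_gb) pairs, largest first."""
    models = tags_payload.get("models", [])
    pairs = [(m["name"], round(m.get("size", 0) / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fetch_tags(base_url="http://localhost:11434"):
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for name, size_gb in summarize_models(fetch_tags()):
        print(f"{name:30s} {size_gb:5.1f} GB")
```

Sorting by size makes it easy to spot which download is eating your disk.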

Step 3: Testing Ollama Locally

Before connecting to OpenClaw, I wanted to verify the model worked:

Test model directly
ollama run qwen2.5:7b "Explain what OpenClaw is in one sentence"

The response was instant. My GPU utilization spiked to 80-90% during inference, then dropped back down. This confirmed everything was working correctly.
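
If you want a number instead of "instant", you can send the same prompt through Ollama's /api/generate endpoint and compute tokens per second from the timing fields in the reply. This is a rough sketch, assuming the default port and a non-streaming request; the helper names are mine, and the 120 second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(reply):
    """eval_count is generated tokens; eval_duration is in nanoseconds."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def generate(model, prompt):
    """Send the prompt and return the parsed JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = generate("qwen2.5:7b", "Explain what OpenClaw is in one sentence")
    print(reply["response"])
    print(f"{tokens_per_second(reply):.1f} tokens/s")
```

Run it a couple of times: the first call includes loading the model into VRAM, so later calls give a fairer picture.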

Step 4: Connecting OpenClaw to Ollama

This step confused me at first. OpenClaw’s documentation mentioned local model support, but the exact configuration wasn’t obvious.

Here’s what worked:

  1. Open OpenClaw settings
  2. Navigate to provider configuration
  3. Select “Ollama” as the provider
  4. Set the endpoint to http://localhost:11434 (Ollama’s default port)
  5. Choose your pulled model from the dropdown

If you need to configure it manually via config file:

OpenClaw Ollama configuration
{
  "provider": "ollama",
  "baseUrl": "http://localhost:11434",
  "model": "qwen2.5:7b",
  "temperature": 0.7,
  "maxTokens": 4096
}
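
A typo in this file is an easy way to lose an evening, so here is a small sanity checker, offered as a sketch only. It uses the key names from the snippet above (provider, baseUrl, model, maxTokens), which I have not checked against OpenClaw's actual schema, so treat both the keys and the rules as assumptions. `check_config` is a hypothetical helper, not part of OpenClaw or Ollama.

```python
import json
import sys
from urllib.parse import urlparse

# Assumption: these are the keys the config snippet in this article uses.
REQUIRED_KEYS = ("provider", "baseUrl", "model")


def check_config(config):
    """Return a list of problems; an empty list means nothing looked wrong."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if "provider" in config and config["provider"] != "ollama":
        problems.append(f"provider is {config['provider']!r}, expected 'ollama'")
    if "baseUrl" in config:
        url = urlparse(config["baseUrl"])
        if url.scheme not in ("http", "https") or not url.hostname:
            problems.append("baseUrl is not an http(s) URL")
        elif url.port != 11434:
            problems.append(f"baseUrl port is {url.port}, Ollama's default is 11434")
    max_tokens = config.get("maxTokens")
    if max_tokens is not None and (not isinstance(max_tokens, int) or max_tokens <= 0):
        problems.append("maxTokens should be a positive integer")
    return problems


if __name__ == "__main__":
    # Usage: python check_config.py path/to/config.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        found = check_config(json.load(handle))
    print("\n".join(found) if found else "config looks sane")
```

A non-default port is only flagged, not rejected, since you may have moved Ollama on purpose.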

Step 5: Monitoring Performance

I wanted to see how my GPU handled the workload during actual use:

Monitor GPU usage in real-time
watch -n 1 nvidia-smi

With Qwen 2.5:7B, my GPU memory usage stayed around 6-8GB during inference. The 3060 12GB handled it comfortably. When I tested Qwen 3.5:9B, memory usage climbed to 10GB, leaving less headroom but still functional.
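
Watching the screen works, but it is easy to miss the peak. As an alternative, here is a minimal sketch that polls nvidia-smi once a second and records the highest memory reading per GPU. The 30-sample window and the function names are arbitrary choices of mine; start it, then run your prompts in OpenClaw while it samples.

```python
import subprocess
import time

# noheader/nounits makes nvidia-smi print one bare integer (MiB) per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse the bare-integer output into a list with one entry per GPU."""
    return [int(value) for value in output.split()]


def track_peak(samples):
    """Given an iterable of per-GPU readings, return the peak for each GPU."""
    peak = []
    for reading in samples:
        if not peak:
            peak = list(reading)
        else:
            peak = [max(old, new) for old, new in zip(peak, reading)]
    return peak


def sample_gpu(count, interval_s=1.0):
    """Yield `count` readings from nvidia-smi, `interval_s` seconds apart."""
    for _ in range(count):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        yield parse_used_mib(result.stdout)
        time.sleep(interval_s)


if __name__ == "__main__":
    print("peak MiB per GPU:", track_peak(sample_gpu(count=30)))
```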

The Trade-offs I Discovered

After using this setup for a week, I found there’s truth to the saying: “You can’t have cheap (or free), good, and fast. You can have only two.”

What works well:

  • General chat and conversation
  • Simple code explanations
  • Quick documentation lookups
  • Refactoring suggestions

Where it struggles:

  • Complex multi-file refactoring
  • Detailed architectural reasoning
  • Tasks requiring extensive context

For those complex tasks, I still occasionally fall back to paid cloud models. But for 80% of my daily use, the local setup is sufficient.

Model Selection Guide

After extensive testing, here’s my recommendation matrix:

Your GPU  | Chat Model  | Coding Model      | Notes
3060 12GB | qwen3.5:4b  | qwen2.5:7b        | Good for light use
3070/3080 | qwen2.5:7b  | qwen3.5:9b        | Balanced performance
4090/5090 | qwen2.5:14b | qwen2.5-coder:14b | Heavy-duty tasks

Troubleshooting Common Issues

Model won’t load / Out of memory:

  • Try a smaller quantization
  • Close other GPU-intensive applications
  • Check if your model is too large for your VRAM
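
For that last check, a back-of-the-envelope estimate is usually enough before you download anything. The sketch below multiplies parameter count by bits per weight and adds a flat overhead. Both defaults (4-bit quantization and a 20% overhead for context and runtime buffers) are assumptions of mine, not measured values, and real usage grows with context length, so read the result as a floor rather than a prediction.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4, overhead_factor=1.2):
    """Rough VRAM floor in GB: weight size plus a flat overhead.

    Assumptions: 4-bit weights and 20% overhead unless you pass other values.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead_factor


def fits(params_billion, vram_gb, **estimate_options):
    """True if the rough estimate is within the card's VRAM."""
    return estimate_vram_gb(params_billion, **estimate_options) <= vram_gb


if __name__ == "__main__":
    for size in (4, 7, 9, 14, 70):
        estimate = estimate_vram_gb(size)
        verdict = "fits" if fits(size, 12) else "too large"
        print(f"{size}B at 4-bit: about {estimate:.1f} GB ({verdict} on 12GB)")
```

If the estimate is already close to your card's limit, that is a strong hint to try a smaller quantization or a smaller model first.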

Slow responses:

  • Verify GPU is being used (check nvidia-smi)
  • Consider a smaller model
  • Check system RAM usage (swap kills performance)

OpenClaw connection errors:

  • Make sure Ollama is running; if it isn’t, start it with: ollama serve
  • Check if port 11434 is accessible
  • Test with curl: curl http://localhost:11434/api/tags
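
The three checks above can be rolled into one small script. This is a sketch under the assumption that Ollama is on the default host and port; `port_open` and `diagnose` are names I picked for illustration, and the messages are just my wording.

```python
import json
import socket
import urllib.request


def port_open(host="localhost", port=11434, timeout=2.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(host="localhost", port=11434):
    """Return a one-line description of what seems to be wrong, if anything."""
    if not port_open(host, port):
        return f"nothing is listening on {host}:{port}; try starting `ollama serve`"
    try:
        url = f"http://{host}:{port}/api/tags"
        with urllib.request.urlopen(url, timeout=5) as resp:
            models = json.load(resp).get("models", [])
    except (OSError, ValueError) as exc:
        # URLError is an OSError; a non-JSON reply raises a ValueError.
        return f"port is open but /api/tags failed: {exc}"
    if not models:
        return "Ollama is up but no models are pulled yet"
    return f"Ollama is up with {len(models)} model(s) available"


if __name__ == "__main__":
    print(diagnose())
```

If this reports that Ollama is up but OpenClaw still cannot connect, the problem is more likely in the OpenClaw provider settings than in Ollama itself.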

Is It Worth It?

For me, yes. I went from paying $20-30/month to effectively $0 for most of my AI-assisted tasks. The initial setup took an afternoon, and the hardware investment (the GPU) was something I already had.

However, this isn’t for everyone. If you don’t have a capable GPU, the experience will be frustratingly slow. And if you need top-tier reasoning for complex tasks, paid models still have an edge.

Final Thoughts

Running local AI models with Ollama for OpenClaw is not only possible but practical for many use cases. The key is matching your hardware to the right model size and understanding the limitations.

My setup, an old tower with a 3060 12GB, runs Qwen 2.5:7B smoothly for daily tasks. When I need more power, I switch to cloud models. It’s the best of both worlds: free for everyday use, paid for heavy lifting.

If you have the hardware, give it a try. The worst case is you learn something new about local LLMs. The best case is you save hundreds of dollars a year.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

