How to Run AI Models Locally with Ollama: Zero-Cost Alternative to ChatGPT API
Problem
I’ve been using ChatGPT and DeepSeek APIs for my projects, and the costs keep adding up. Every API call costs money, and I worry about sending my code and data to third-party services.
I want to run AI models locally on my machine, but I thought it required expensive GPU hardware and complex setup. Then I discovered Ollama.
What Is Ollama?
Ollama is a tool that lets you run large language models entirely on your local computer. Think of it like Docker for AI models:
- One command to install models
- No API key required
- No cloud costs
- Your data never leaves your machine
- Works with or without GPU
I can run models like Qwen 2.5, DeepSeek-R1, and Llama 3 locally without any cloud dependency.
Hardware Requirements
Before diving in, let me clarify what hardware you actually need:
| Configuration | What It Runs |
|---|---|
| 8GB RAM (no GPU) | 7B models (Qwen 2.5 7B, Llama 3 8B) |
| 16GB RAM + 6GB VRAM | 14B+ models smoothly |
| 32GB+ RAM | Larger models like Qwen 2.5 32B |
Key insight: You don’t need a GPU. CPU inference works, it is just slower (roughly 5-10x, depending on the hardware and model). My laptop with 16GB RAM runs Qwen 2.5 7B without problems.
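As a rough sanity check on the table above, the memory needed for a model's weights can be estimated from its parameter count and the number of bits stored per weight. This is only a sketch: the default of 4.5 bits per weight (a typical figure for 4-bit quantized files) and the 1.2x overhead factor for context and runtime are my assumptions, not measured values, and the function names are hypothetical.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


def estimate_ram_gb(params_billions: float,
                    bits_per_weight: float = 4.5,   # assumed quantization width
                    overhead: float = 1.2) -> float:  # assumed runtime overhead
    """Weights plus an assumed overhead factor for context and runtime."""
    return estimate_weights_gb(params_billions, bits_per_weight) * overhead


# Print a rough estimate for the model sizes mentioned in this article.
for size in (3, 7, 14, 32):
    print(f"{size}B parameters: roughly {estimate_ram_gb(size):.1f} GB")
```

The same 7B model stored as 16-bit floats would need about 14 GB for the weights alone, which is why quantized models are what make 8GB and 16GB machines usable.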
How to Install Ollama
Step 1: Install Ollama
macOS:

    brew install ollama

Linux:

    curl -fsSL https://ollama.com/install.sh | sh

Windows:
Download the installer from ollama.com.

Verify installation:

    ollama --version

Step 2: Download Your First Model
I’ll start with Qwen 2.5 3B, a small model good for testing:

    ollama pull qwen2.5:3b

This downloads about 2GB. The download speed depends on your internet connection.
Now chat with it directly:

    ollama run qwen2.5:3b

You’ll see a prompt where you can type questions:

    >>> What is Python?
    Python is a high-level, interpreted programming language known for its
    simple syntax and readability. It supports multiple programming paradigms...

To exit, press Ctrl+D or type /bye.
Step 3: Check Installed Models

    ollama list

Output:

    NAME          ID              SIZE      MODIFIED
    qwen2.5:3b    123abc456def    2.0 GB    2 hours ago

Recommended Models for Different Use Cases
| Model | Size | Best For |
|---|---|---|
| qwen2.5:3b | ~2GB | Testing, low-memory systems |
| qwen2.5:7b | ~5GB | Best Chinese capability, balanced performance |
| deepseek-r1:7b | ~4.5GB | Strong reasoning, chain-of-thought |
| llama3:8b | ~4.7GB | English-focused, Meta’s flagship |
| qwen2.5:14b | ~10GB | High-quality output, needs 16GB RAM |
I recommend starting with qwen2.5:3b for testing, then upgrading to qwen2.5:7b for real work.
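The recommendation above can also be written as a small lookup. The sketch below uses the approximate sizes from the table and a hypothetical rule of thumb that keeps 3 GB of RAM free for the operating system and other apps; that reserve is my assumption, so adjust it for your machine.

```python
# (model tag, approximate size in GB) taken from the table above
MODELS = [
    ("qwen2.5:3b", 2.0),
    ("deepseek-r1:7b", 4.5),
    ("llama3:8b", 4.7),
    ("qwen2.5:7b", 5.0),
    ("qwen2.5:14b", 10.0),
]


def models_that_fit(ram_gb: float, reserve_gb: float = 3.0) -> list:
    """Return the model tags whose size fits in RAM minus an assumed reserve."""
    budget = ram_gb - reserve_gb
    return [name for name, size_gb in MODELS if size_gb <= budget]


# Show which models are candidates for a few common RAM sizes.
for ram in (8, 16):
    print(f"{ram} GB RAM: {models_that_fit(ram)}")
```

With these assumptions an 8 GB machine tops out at the 7B/8B models, which matches the hardware table earlier in the post.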
Using the Local API
Ollama provides an OpenAI-compatible API under http://127.0.0.1:11434/v1. This means I can switch from paid APIs to local models with minimal code changes.
Start the API Server

    ollama serve

The server runs at http://127.0.0.1:11434 by default.
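Before wiring up any client code, it can help to confirm that the server is actually reachable. The sketch below queries Ollama's native /api/tags endpoint, which lists installed models, using only the standard library. The helper names are hypothetical, and the code assumes the default address shown above.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default address; change if you moved it


def parse_model_names(tags_payload: dict) -> list:
    """Extract model names from a /api/tags style response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except OSError:
        # urllib's URLError, refused connections, and timeouts are all OSError
        return None
```

Calling list_local_models() returns None when nothing is listening on the port, which usually means the server has not been started yet.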
Python Integration
I’ll show you how to use the OpenAI Python SDK with Ollama. Install the SDK with pip install openai, then save the following as ollama_chat.py:

    from openai import OpenAI

    # Point the SDK at the local Ollama server instead of a cloud endpoint
    client = OpenAI(
        base_url="http://127.0.0.1:11434/v1",
        api_key="ollama",  # Any value works for local
    )

    # The model must already be downloaded (ollama pull qwen2.5:7b)
    response = client.chat.completions.create(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
        temperature=0.3,
    )

    print(response.choices[0].message.content)

Run it:

    python ollama_chat.py

Streaming Output for Responsive UIs
For chat interfaces, streaming gives a better user experience:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://127.0.0.1:11434/v1",
        api_key="ollama",
    )

    print("Chat with your local AI (type 'quit' to exit)")
    print("-" * 50)

    messages = []  # Full conversation history, resent on every turn
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ['quit', 'exit', 'q']:
            break

        messages.append({"role": "user", "content": user_input})

        print("AI: ", end="", flush=True)
        stream = client.chat.completions.create(
            model="qwen2.5:7b",
            messages=messages,
            stream=True,
            temperature=0.7,
        )

        full_response = ""
        for chunk in stream:
            # Skip chunks that carry no text
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content

        print()
        # Store the reply so the next turn has the full context
        messages.append({"role": "assistant", "content": full_response})

Switching from Paid APIs
Here’s the before and after comparison:
Before: Using DeepSeek API (costs money)

    from openai import OpenAI

    client = OpenAI(
        api_key="sk-xxxx",  # Real API key = cost
        base_url="https://api.deepseek.com",
    )
    # Every call costs money...

After: Using Local Ollama (free)

    from openai import OpenAI

    client = OpenAI(
        api_key="ollama",  # Dummy key, no cost
        base_url="http://127.0.0.1:11434/v1",
    )
    # Unlimited free calls!

Apart from the model name passed with each request (for example qwen2.5:7b instead of a DeepSeek model), the rest of your code stays the same. This is the key advantage: Ollama’s OpenAI-compatible API means almost no code refactoring.
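One way to keep that switch in a single place is to load the connection settings from the environment, so the same script can target either backend. This is a minimal sketch rather than a recommended layout: the LLM_PROVIDER and DEEPSEEK_API_KEY variable names and the LLMSettings class are hypothetical, and deepseek-chat is assumed as the cloud model name.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


def load_settings(env=None) -> LLMSettings:
    """Pick cloud or local settings from a hypothetical LLM_PROVIDER variable."""
    env = os.environ if env is None else env
    if env.get("LLM_PROVIDER", "ollama") == "deepseek":
        return LLMSettings(
            base_url="https://api.deepseek.com",
            api_key=env["DEEPSEEK_API_KEY"],  # hypothetical variable name
            model="deepseek-chat",            # assumed cloud model name
        )
    # Default: the local Ollama server described in this article
    return LLMSettings(
        base_url="http://127.0.0.1:11434/v1",
        api_key="ollama",  # dummy key, ignored by the local server
        model="qwen2.5:7b",
    )


# Usage with the OpenAI SDK (requires: pip install openai):
#   from openai import OpenAI
#   s = load_settings()
#   client = OpenAI(base_url=s.base_url, api_key=s.api_key)
#   client.chat.completions.create(model=s.model, messages=[...])
```

Defaulting to the local server means a missing or misspelled variable never sends data to a paid API by accident.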
Creating Custom Models
I can create custom models with specific behaviors using a Modelfile:

    FROM qwen2.5:7b

    SYSTEM """You are a technical documentation assistant.
    - Explain concepts in simple language
    - Include code examples when relevant
    - Add practical tips"""

    PARAMETER temperature 0.3

Create the custom model:

    ollama create doc-assistant -f Modelfile

Now run it:

    ollama run doc-assistant

Common Issues and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Slow model download | Network or CDN speed | Set HTTPS_PROXY if you are behind a proxy, or try again later |
| OOM (Out of Memory) | Model too large | Use smaller model or close other apps |
| API connection failed | Server not running | Run ollama serve first |
| GPU not detected | Missing or outdated drivers | Install or update the GPU driver (for NVIDIA cards, a recent NVIDIA driver) |
| Slow generation | CPU inference | Use a GPU if available, or switch to a smaller model |
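To tell whether generation is actually slow, it helps to measure it. Ollama's native /api/generate endpoint reports eval_count (tokens generated) and eval_duration (in nanoseconds) in its response, so tokens per second can be computed directly. The helper names below are hypothetical, and the sketch assumes the default server address.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Generation speed from the eval_count and eval_duration (ns) fields."""
    seconds = result["eval_duration"] / 1e9  # nanoseconds to seconds
    return result["eval_count"] / seconds


def benchmark(model: str, prompt: str,
              base_url: str = "http://127.0.0.1:11434") -> float:
    """Send one non-streaming request to /api/generate and return tokens/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))
```

For example, benchmark("qwen2.5:3b", "What is Python?") gives a number you can compare before and after switching models or enabling a GPU.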
Why I Use Local Models
The main benefits I get from running Ollama:
- Zero API costs - No per-token charges accumulating
- Privacy - Company code and personal documents stay local
- Offline access - Works without internet
- No rate limits - Unlimited queries
- Customization - Tailor model behavior to my needs with a Modelfile (system prompt and parameters)
Summary
In this post, I showed how to run AI models locally with Ollama. The key point is that Ollama provides an OpenAI-compatible API at http://127.0.0.1:11434/v1, so switching from paid cloud APIs to free local models requires little more than changing the base_url, the api_key, and the model name.
Start with ollama pull qwen2.5:3b to test it out, then upgrade to qwen2.5:7b for better performance. Your data stays on your machine, and you never pay API costs again.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.