How to Run AI Models Locally with Ollama: Zero-Cost Alternative to ChatGPT API

Problem

I’ve been using ChatGPT and DeepSeek APIs for my projects, and the costs keep adding up. Every API call costs money, and I worry about sending my code and data to third-party services.

I wanted to run AI models locally on my machine, but I assumed it required expensive GPU hardware and a complex setup. Then I discovered Ollama.

What Is Ollama?

Ollama is a tool that lets you run large language models entirely on your local computer. Think of it like Docker for AI models:

  • One command to install models
  • No API key required
  • No cloud costs
  • Your data never leaves your machine
  • Works with or without a GPU

I can run models like Qwen 2.5, DeepSeek-R1, and Llama 3 locally without any cloud dependency.

Hardware Requirements

Before diving in, let me clarify what hardware you actually need:

Configuration | What It Runs
8GB RAM (no GPU) | 7B models (Qwen 2.5 7B, Llama 3 8B)
16GB RAM + 6GB VRAM | 14B+ models smoothly
32GB+ RAM | Larger models like Qwen 2.5 32B

Key insight: You don’t need a GPU. CPU inference works; it is just roughly 5-10x slower, depending on your hardware. My laptop with 16GB RAM runs Qwen 2.5 7B perfectly fine.
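
To sanity-check a model against your RAM before downloading it, here’s a rough back-of-the-envelope sketch. The numbers are my assumptions, not Ollama’s: I assume about 4.5 bits per weight (roughly what a 4-bit quantized model uses) plus a guessed 1 GB of overhead for context and runtime buffers. The function name is hypothetical.

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: float = 4.5,
                             overhead_gb: float = 1.0) -> float:
    """Rough memory estimate for a quantized model, in GB.

    Assumptions (not taken from Ollama's docs): the weights dominate
    memory use, ~4.5 bits per weight approximates 4-bit quantization,
    and overhead_gb is a guess for context cache and runtime buffers.
    """
    # billions of parameters * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


for label, size in [("3B", 3), ("7B", 7), ("14B", 14), ("32B", 32)]:
    print(f"{label}: ~{estimate_model_memory_gb(size)} GB")
```

Treat the result as a lower bound: long conversations and other running apps need memory too.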

How to Install Ollama

Step 1: Install Ollama

macOS:

Install via Homebrew
brew install ollama

Linux:

Install via script
curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com.

Verify installation:

Check version
ollama --version
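
If you want to run the same check from a setup script, here’s a minimal sketch using only the Python standard library. The helper name ollama_version is my own; it simply wraps the ollama --version command shown above and returns None when the binary is missing.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version() -> Optional[str]:
    """Return the output of `ollama --version`, or None if unavailable."""
    if shutil.which("ollama") is None:
        return None  # the binary is not on PATH
    try:
        result = subprocess.run(
            ["ollama", "--version"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some CLIs print version info to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output or None


version = ollama_version()
print(version if version else "ollama not found on PATH")
```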

Step 2: Download Your First Model

I’ll start with Qwen 2.5 3B, a small model good for testing:

Pull a model
ollama pull qwen2.5:3b

This downloads about 2GB. The download speed depends on your internet connection.

Now chat with it directly:

Run interactive chat
ollama run qwen2.5:3b

You’ll see a prompt where you can type questions:

Chat output
>>> What is Python?
Python is a high-level, interpreted programming language known for its
simple syntax and readability. It supports multiple programming paradigms...

[Screenshot: Ollama terminal chat interface showing a conversation with the qwen2.5 model]

To exit, press Ctrl+D or type /bye.

Step 3: Check Installed Models

List downloaded models
ollama list

Output:

Model list output
NAME          ID              SIZE      MODIFIED
qwen2.5:3b    123abc456def    2.0 GB    2 hours ago

Model | Size | Best For
qwen2.5:3b | ~2GB | Testing, low-memory systems
qwen2.5:7b | ~5GB | Best Chinese capability, balanced performance
deepseek-r1:7b | ~4.5GB | Strong reasoning, chain-of-thought
llama3:8b | ~4.7GB | English-focused, Meta’s flagship
qwen2.5:14b | ~10GB | High-quality output, needs 16GB RAM

I recommend starting with qwen2.5:3b for testing, then upgrading to qwen2.5:7b for real work.

Using the Local API

Ollama provides an OpenAI-compatible API at http://127.0.0.1:11434/v1. This means I can switch from paid APIs to local models with minimal code changes.

Start the API Server

Start Ollama server
ollama serve

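The server listens on http://127.0.0.1:11434 by default. If Ollama is already running in the background (for example, via the desktop app or a system service), this command reports that the address is already in use and you can skip it.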
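
To confirm the server is actually reachable, here’s a small sketch that asks Ollama’s native /api/tags endpoint for the installed models, using only the standard library. The helper names are mine, and I’m assuming the response has a "models" list whose entries carry a "name" field.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default address used in this article


def model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


models = list_local_models()
if models is None:
    print("Server not reachable; is `ollama serve` running?")
else:
    print("Installed models:", models)
```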

Python Integration

I’ll show you how to use the OpenAI Python SDK with Ollama:

ollama_chat.py
from openai import OpenAI
client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama",  # Any value works for local
)
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    temperature=0.3,
)
print(response.choices[0].message.content)

Run it:

Run the script
python ollama_chat.py
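
If you’d rather not install the OpenAI SDK, here’s a sketch of the same request made with only the standard library. I’m assuming the /v1/chat/completions route accepts the usual OpenAI-style JSON body and returns the reply under choices[0].message.content; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def build_chat_payload(model: str, prompt: str, temperature: float = 0.3) -> dict:
    """Build the JSON body for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(prompt: str, model: str = "qwen2.5:7b",
         base_url: str = "http://127.0.0.1:11434/v1", timeout: float = 120.0):
    """POST to /chat/completions; return the reply text, or None on failure."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server down, model missing (HTTP error), timeout, or invalid JSON.
        return None
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    reply = chat("Explain quantum computing in simple terms")
    print(reply or "No response; is the server running and the model pulled?")
```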

Streaming Output for Responsive UIs

For chat interfaces, streaming gives a better user experience because the reply appears as it is generated:

ollama_streaming.py
from openai import OpenAI
client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama",
)
print("Chat with your local AI (type 'quit' to exit)")
print("-" * 50)
messages = []
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ['quit', 'exit', 'q']:
        break
    messages.append({"role": "user", "content": user_input})
    print("AI: ", end="", flush=True)
    stream = client.chat.completions.create(
        model="qwen2.5:7b",
        messages=messages,
        stream=True,
        temperature=0.7,
    )
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content
    print()
    messages.append({"role": "assistant", "content": full_response})
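
To see what the SDK is doing for you in that loop, here’s a sketch that assembles a reply from raw stream lines. I’m assuming the endpoint uses OpenAI-style server-sent events: each line starts with "data:", carries a JSON chunk, and the stream ends with "data: [DONE]". The sample lines are hand-written for illustration, not captured from a real server.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the content deltas from OpenAI-style server-sent-event lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hand-written sample lines (an assumption about the wire format).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # prints: Hello, world
```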

Switching from Paid APIs

Here’s the before and after comparison:

Before: Using DeepSeek API (costs money)

paid_api.py
from openai import OpenAI
client = OpenAI(
    api_key="sk-xxxx",  # Real API key = cost
    base_url="https://api.deepseek.com",
)
# Every call costs money...

After: Using Local Ollama (free)

local_ollama.py
from openai import OpenAI
client = OpenAI(
    api_key="ollama",  # Dummy key, no cost
    base_url="http://127.0.0.1:11434/v1",
)
# Unlimited free calls!

Apart from the model name you pass to each call, the rest of your code stays the same. This is the key advantage: because Ollama exposes an OpenAI-compatible API, the switch needs almost no refactoring.
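
One way to make the switch reversible is to read the provider settings from the environment and fall back to local Ollama. Here’s a minimal sketch; the LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL variable names are hypothetical ones I made up for illustration, not something Ollama or the OpenAI SDK reads by itself.

```python
import os
from typing import Mapping

# Local defaults taken from the examples in this article.
LOCAL_DEFAULTS = {
    "base_url": "http://127.0.0.1:11434/v1",
    "api_key": "ollama",  # dummy value; any value works locally
    "model": "qwen2.5:7b",
}


def client_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read settings from hypothetical LLM_* variables, defaulting to local Ollama."""
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_DEFAULTS["base_url"]),
        "api_key": env.get("LLM_API_KEY", LOCAL_DEFAULTS["api_key"]),
        "model": env.get("LLM_MODEL", LOCAL_DEFAULTS["model"]),
    }


settings = client_settings()
print(settings["base_url"], settings["model"])
```

You would then build the client with OpenAI(base_url=settings["base_url"], api_key=settings["api_key"]) and pass model=settings["model"] to each call.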

Creating Custom Models

I can create custom models with specific behaviors using a Modelfile:

Modelfile
FROM qwen2.5:7b
SYSTEM """
You are a technical documentation assistant.
- Explain concepts in simple language
- Include code examples when relevant
- Add practical tips
"""
PARAMETER temperature 0.3

Create the custom model:

Create custom model
ollama create doc-assistant -f Modelfile

Now run it:

Run custom model
ollama run doc-assistant
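
If you maintain several assistants, you could generate the Modelfile text instead of writing it by hand. Here’s a small sketch that only uses the FROM, SYSTEM, and PARAMETER directives shown above. The render_modelfile helper is my own invention, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base_model}", 'SYSTEM """', system_prompt.strip(), '"""']
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base_model="qwen2.5:7b",
    system_prompt="You are a technical documentation assistant.",
    parameters={"temperature": 0.3},
)
print(modelfile_text)
```

Save the printed text to a file named Modelfile and pass it to ollama create as before.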

Common Issues and Solutions

Problem | Cause | Solution
Slow model download | CDN speed | Try again later, or set HTTPS_PROXY if you need a proxy (Ollama pulls models over HTTPS)
OOM (Out of Memory) | Model too large | Use a smaller model or close other apps
API connection failed | Server not running | Run ollama serve first
GPU not detected | Missing drivers | Install the NVIDIA driver and CUDA toolkit
Slow generation | CPU inference | Use a GPU if available, or switch to a smaller model

Why I Use Local Models

The main benefits I get from running Ollama:

  1. Zero API costs - No per-token charges accumulating
  2. Privacy - Company code and personal documents stay local
  3. Offline access - Works without internet
  4. No rate limits - Unlimited queries
  5. Customization - Tailor models to my specific needs with a Modelfile

Summary

In this post, I showed how to run AI models locally with Ollama. The key point is that Ollama provides an OpenAI-compatible API at http://127.0.0.1:11434/v1, so switching from paid cloud APIs to free local models mostly comes down to changing the base_url parameter and the model name.

Start with ollama pull qwen2.5:3b to test it out, then upgrade to qwen2.5:7b for better output quality. Your data stays on your machine, and you pay no per-token API costs.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!