How to Run AI Models Locally with Ollama: Zero-Cost Alternative to ChatGPT API

May 9, 2026

AI Technology

Problem

I’ve been using ChatGPT and DeepSeek APIs for my projects, and the costs keep adding up. Every API call costs money, and I worry about sending my code and data to third-party services.

I wanted to run AI models locally on my machine, but I assumed that required expensive GPU hardware and a complex setup. Then I discovered Ollama.

What Is Ollama?

Ollama is a tool that lets you run large language models entirely on your local computer. Think of it like Docker for AI models:

One command to install models
No API key required
No cloud costs
Your data never leaves your machine
Works with or without GPU

I can run models like Qwen 2.5, DeepSeek-R1, and Llama 3 locally without any cloud dependency.

Hardware Requirements

Before diving in, let me clarify what hardware you actually need:

Configuration	What It Runs
8GB RAM (no GPU)	7B models (Qwen 2.5 7B, Llama 3 8B)
16GB RAM + 6GB VRAM	14B+ models smoothly
32GB+ RAM	Larger models like Qwen 2.5 32B

Key insight: You don’t need a GPU. CPU inference works; it is just noticeably slower (often around 5-10x, depending on the model and hardware). My laptop with 16GB RAM runs Qwen 2.5 7B perfectly fine.
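To sanity-check the table above for other model sizes, here is a rough back-of-the-envelope sketch. It assumes about 4.5 bits per weight (a guess at a typical 4-bit quantized build) and a 20% overhead for context and runtime buffers. Both numbers are rule-of-thumb assumptions rather than measurements, and `estimate_memory_gb` is a hypothetical helper, not part of Ollama.

```python
def estimate_memory_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate in GB for a quantized model (rule of thumb only)."""
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billion and bits_per_weight must be positive")
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weight_gb = params_billion * bits_per_weight / 8
    # overhead is an assumed multiplier for context and runtime buffers
    return round(weight_gb * overhead, 1)


if __name__ == "__main__":
    for size in (3, 7, 14, 32):
        print(f"{size}B parameters: roughly {estimate_memory_gb(size)} GB")
```

Treat the result as a lower bound on free memory: the operating system and other applications need room too, which is why a 7B model is comfortable on 16GB but tight on 8GB.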

How to Install Ollama

Step 1: Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com.

Verify installation:

ollama --version
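If you want a setup script to perform the same check, a minimal sketch could look like this. It only relies on the `ollama --version` command shown above; the helper name `ollama_version` is my own, and the exact text the command prints may differ between releases.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version(binary: str = "ollama") -> Optional[str]:
    """Return the output of `<binary> --version`, or None if it is unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None  # not installed, or not on PATH
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output or None


if __name__ == "__main__":
    version = ollama_version()
    print(version if version else "ollama not found on PATH")
```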

Step 2: Download Your First Model

I’ll start with Qwen 2.5 3B, a small model good for testing:

ollama pull qwen2.5:3b

This downloads about 2GB. The download speed depends on your internet connection.

Now chat with it directly:

ollama run qwen2.5:3b

You’ll see a prompt where you can type questions:

>>> What is Python?
Python is a high-level, interpreted programming language known for its
simple syntax and readability. It supports multiple programming paradigms...

[Screenshot: Ollama terminal chat showing a conversation with the qwen2.5 model]

To exit, press Ctrl+D or type /bye.

Step 3: Check Installed Models

ollama list

Output:

NAME              ID              SIZE    MODIFIED
qwen2.5:3b        123abc456def    2.0 GB  2 hours ago
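The same information is available over HTTP from a running server through the `/api/tags` endpoint. The sketch below uses only the standard library and assumes the default address used in this article; `summarize_models` and `fetch_models` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # default address used in this article


def summarize_models(tags_response: dict) -> list:
    """Turn an /api/tags response into (name, size in GB) tuples."""
    rows = []
    for model in tags_response.get("models", []):
        size_gb = round(model.get("size", 0) / 1e9, 1)  # size is reported in bytes
        rows.append((model.get("name", "?"), size_gb))
    return rows


def fetch_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list:
    """Ask a running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return summarize_models(json.load(resp))


if __name__ == "__main__":
    try:
        for name, size_gb in fetch_models():
            print(f"{name:<24}{size_gb} GB")
    except OSError as exc:  # URLError is an OSError subclass
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
```

Keeping the parsing separate from the network call makes it easy to test the formatting without a server.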

Recommended Models for Different Use Cases

Model	Size	Best For
qwen2.5:3b	~2GB	Testing, low-memory systems
qwen2.5:7b	~5GB	Strong Chinese-language capability, balanced performance
deepseek-r1:7b	~4.5GB	Strong reasoning, chain-of-thought
llama3:8b	~4.7GB	English-focused, Meta’s flagship
qwen2.5:14b	~10GB	High-quality output, needs 16GB RAM

I recommend starting with qwen2.5:3b for testing, then upgrading to qwen2.5:7b for real work.

Using the Local API

Ollama provides an OpenAI-compatible API under http://127.0.0.1:11434/v1. This means I can switch from paid APIs to local models with minimal code changes.

Start the API Server

ollama serve

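The server runs at http://127.0.0.1:11434 by default. If you installed the desktop app, or the Linux install script set up a background service, the server is usually already running, and this command will report that the address is in use.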
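Before pointing any code at the API, it helps to confirm that something is actually listening. This is a minimal sketch that assumes the default address above; `server_is_up` is a hypothetical helper, and it only checks that the root URL answers with HTTP 200, not that a particular model is loaded.

```python
import urllib.request


def server_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if something answers HTTP 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # URLError, HTTPError, timeouts and refused connections are all OSErrors;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", server_is_up())
```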

Python Integration

I’ll show you how to use the OpenAI Python SDK (installed with pip install openai) with Ollama:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama"  # Any value works for local
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
    temperature=0.3
)

print(response.choices[0].message.content)

Save the script as ollama_chat.py and run it:

python ollama_chat.py
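To see what the SDK is doing on the wire, here is a sketch of the same call with only the standard library. It assumes the server accepts the usual OpenAI-style JSON body at `/v1/chat/completions`; the helper names `build_chat_request` and `send_chat` are mine, and the bearer token is the same placeholder key as above.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "qwen2.5:7b",
                       base_url: str = "http://127.0.0.1:11434/v1",
                       temperature: float = 0.3) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key, as in the SDK example
        },
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request("Explain quantum computing in simple terms")
    try:
        print(send_chat(req))
    except OSError as exc:
        print(f"Request failed (is the server running and the model pulled?): {exc}")
```

For real projects I would still use the SDK; this version is mainly useful for debugging or for environments where installing extra packages is not an option.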

Streaming Output for Responsive UIs

For chat interfaces, streaming gives a better user experience because text appears as it is generated instead of after the full reply is ready:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama"
)

print("Chat with your local AI (type 'quit' to exit)")
print("-" * 50)

messages = []
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ['quit', 'exit', 'q']:
        break

    messages.append({"role": "user", "content": user_input})

    print("AI: ", end="", flush=True)
    stream = client.chat.completions.create(
        model="qwen2.5:7b",
        messages=messages,
        stream=True,
        temperature=0.7
    )

    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content

    print()
    messages.append({"role": "assistant", "content": full_response})
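The accumulation step in that loop can be pulled into a small function, which makes it testable without a server. This is a sketch under two assumptions: chunks have the `choices[0].delta.content` shape used above, and some OpenAI-compatible servers may occasionally send a chunk with an empty `choices` list, which the loop above would not tolerate. `collect_stream` and `fake_chunk` are hypothetical names.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable, on_token=None) -> str:
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # skip None and empty deltas
            parts.append(text)
            if on_token is not None:
                on_token(text)  # e.g. print each piece as it arrives
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk, so the function can be tried without a server."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]
    reply = collect_stream(demo, on_token=lambda t: print(t, end="", flush=True))
    print()
```

In the chat loop, the inner `for chunk in stream` block could then be replaced by a single call that passes the stream and a printing callback.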

Switching from Paid APIs

Here’s the before and after comparison:

Before: Using DeepSeek API (costs money)

from openai import OpenAI

client = OpenAI(
    api_key="sk-xxxx",  # Real API key = cost
    base_url="https://api.deepseek.com"
)
# Every call costs money...

After: Using Local Ollama (free)

from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # Dummy key, no cost
    base_url="http://127.0.0.1:11434/v1"
)
# Unlimited free calls!

Apart from the model name you pass in each request, the rest of your code stays the same. This is the key advantage: Ollama’s OpenAI-compatible API means little or no code refactoring.
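One way to make the switch reversible is to read the connection settings from environment variables, so the same script can target either backend. This is only a sketch: the variable names `LLM_BASE_URL`, `LLM_API_KEY` and `LLM_MODEL` are made up for illustration, and the defaults simply mirror the local values used in this article.

```python
import os
from typing import Mapping, Optional

LOCAL_DEFAULTS = {
    "base_url": "http://127.0.0.1:11434/v1",
    "api_key": "ollama",  # placeholder; any value works locally
    "model": "qwen2.5:7b",
}


def llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read connection settings from the environment, defaulting to local Ollama."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_DEFAULTS["base_url"]),
        "api_key": env.get("LLM_API_KEY", LOCAL_DEFAULTS["api_key"]),
        "model": env.get("LLM_MODEL", LOCAL_DEFAULTS["model"]),
    }


if __name__ == "__main__":
    settings = llm_settings()
    print(f"Using {settings['model']} at {settings['base_url']}")
    # With the SDK from the examples above (not imported here):
    #   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
    #   client.chat.completions.create(model=settings["model"], messages=[...])
```

With no variables set the script talks to the local server; exporting the three variables points it back at a hosted API without touching the code.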

Creating Custom Models

I can create custom models with specific behaviors using a Modelfile, a plain-text file named Modelfile that sets the base model, a system prompt, and parameters:

FROM qwen2.5:7b

SYSTEM """
You are a technical documentation assistant.
- Explain concepts in simple language
- Include code examples when relevant
- Add practical tips
"""

PARAMETER temperature 0.3

Create the custom model:

ollama create doc-assistant -f Modelfile

Now run it:

ollama run doc-assistant
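If you maintain several assistants like this, generating the Modelfile from a script keeps them consistent. The sketch below only emits the three instructions used above (FROM, SYSTEM, PARAMETER temperature); `render_modelfile` is a hypothetical helper, and it refuses prompts containing triple quotes because those would end the SYSTEM block early.

```python
def render_modelfile(base_model: str, system_prompt: str, temperature: float) -> str:
    """Build Modelfile text with the same three instructions used above."""
    if '"""' in system_prompt:
        raise ValueError('system_prompt must not contain """')
    return (
        f"FROM {base_model}\n\n"
        f'SYSTEM """\n{system_prompt.strip()}\n"""\n\n'
        f"PARAMETER temperature {temperature}\n"
    )


if __name__ == "__main__":
    text = render_modelfile(
        "qwen2.5:7b",
        "You are a technical documentation assistant.\n"
        "- Explain concepts in simple language",
        0.3,
    )
    # Save this output to a file named Modelfile, then run `ollama create` as shown above.
    print(text)
```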

Common Issues and Solutions

Problem	Cause	Solution
Slow model download	Network or CDN speed	Set HTTPS_PROXY if you are behind a proxy, or try again later
OOM (Out of Memory)	Model too large	Use a smaller model or close other apps
API connection failed	Server not running	Run `ollama serve` first
GPU not detected	Missing or outdated drivers	Install or update the NVIDIA driver (the full CUDA toolkit is usually not required)
Slow generation	CPU inference	Use a GPU if one is available, or switch to a smaller model

Why I Use Local Models

The main benefits I get from running Ollama:

Zero API costs - No per-token charges accumulating
Privacy - Company code and personal documents stay local
Offline access - Works without internet
No rate limits - Unlimited queries
Customization - Tailor models to my specific needs with a Modelfile (system prompt and parameters)

Summary

In this post, I showed how to run AI models locally with Ollama. The key point is that Ollama provides an OpenAI-compatible API at http://127.0.0.1:11434/v1, so switching from paid cloud APIs to free local models mostly comes down to changing the base_url, the API key, and the model name.

Start with ollama pull qwen2.5:3b to test it out, then upgrade to qwen2.5:7b for better performance. Your data stays on your machine, and you never pay API costs again.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email.
