How to Run Gemma Locally with Ollama

Apr 4, 2026

AI neural network visualization Photo by Unsplash - AI and neural network visualization

I wanted to run a capable AI model locally without relying on cloud services or paying for API subscriptions. After trying several options, I found that Google’s Gemma model combined with Ollama provides a straightforward solution for fully local, offline-capable AI inference.

The Problem with Cloud AI Services

Most AI services require constant internet connectivity, charge per API call, and send your data to remote servers. For privacy-sensitive work or projects requiring offline capability, this creates significant limitations.

I needed a solution that:

Runs entirely on my local machine
Works offline after initial setup
Requires no API keys or subscriptions
Provides good quality responses

Gemma with Ollama delivers on all these requirements.

What is Gemma?

Gemma is Google’s family of open-weights AI models. The latest version, Gemma 4, ranks #3 among open models according to various benchmarks. Being open-weights means you can download and run the model locally with full control.

The model comes in several sizes:

2B - Lightweight, runs on most machines
9B - Good balance of performance and resource usage
27B - Higher quality, requires more hardware
31B - Top-tier performance among open models

Installing Ollama

Ollama acts as a local model runner that handles the infrastructure for running large language models. First, I installed it on my system.

For macOS:

# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

For Linux, the same command works. Windows users can download the installer from ollama.com.

After installation, Ollama runs as a background service on port 11434.

Pulling the Gemma Model

With Ollama installed, pulling Gemma is straightforward:

# Pull the default Gemma model
ollama pull gemma4

# Or specify a particular size
ollama pull gemma4:9b
ollama pull gemma4:27b

The download size varies by model:

gemma4:2b   - ~1.5 GB
gemma4:9b   - ~5.5 GB
gemma4:27b  - ~16 GB
gemma4:31b  - ~19 GB

I started with the 9B model as it offers good performance while fitting within my 16GB RAM system.

Running Gemma

Once downloaded, running the model is simple:

# Start interactive session
ollama run gemma4

# Or run with a specific prompt
ollama run gemma4 "Explain quantum computing in simple terms"

The interactive session provides a chat interface where you can have extended conversations:

>>> What are the advantages of running AI models locally?

Running AI models locally has several advantages:

1. **Privacy**: Your data stays on your machine
2. **No internet required**: Works offline after download
3. **No API costs**: Free unlimited usage
4. **Low latency**: No network round-trips
5. **Full control**: You own the model and data

>>> How does Gemma compare to other open models?

Gemma 4 ranks highly among open models, particularly the 27B and 31B
variants. It offers strong performance on reasoning tasks, code
generation, and general conversation while maintaining a relatively
efficient architecture.

Hardware Requirements

The model you choose depends on your hardware. Here’s what I found works:

| Model Size | Minimum RAM | Recommended RAM | GPU |
|------------|-------------|-----------------|-----|
| 2B         | 4GB         | 8GB             | Optional |
| 9B         | 8GB         | 16GB            | Recommended |
| 27B        | 16GB        | 32GB            | Required for speed |
| 31B        | 20GB        | 40GB            | Required for speed |

I tested on a MacBook Pro with 16GB RAM. The 9B model runs smoothly, though I noticed slower response times compared to GPU-accelerated systems. For the 27B model, I had to close other applications to free up memory.

Using the API

Ollama exposes a REST API, making it easy to integrate with applications:

import requests
import json

def generate_response(prompt, model="gemma4"):
    """Generate response from local Gemma model."""
    url = "http://localhost:11434/api/generate"

    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }

    response = requests.post(url, json=payload)
    return response.json()

# Example usage
result = generate_response("What is machine learning?")
print(result["response"])

This API runs entirely locally with no external calls.

For streaming responses:

import requests
import json

def stream_response(prompt, model="gemma4"):
    """Stream response from local Gemma model."""
    url = "http://localhost:11434/api/generate"

    payload = {
        "model": model,
        "prompt": prompt,
        "stream": True
    }

    response = requests.post(url, json=payload, stream=True)

    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            if "response" in data:
                print(data["response"], end="", flush=True)

# Example usage
stream_response("Write a short poem about technology")

Integration with LangChain

For more complex applications, I integrated with LangChain:

from langchain_community.llms import Ollama
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# Initialize local Gemma model
llm = Ollama(model="gemma4")

# Create conversation with memory
memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

# Have a conversation
response = conversation.predict(
    input="I want to learn about neural networks"
)
print(response)

# Continue the conversation
follow_up = conversation.predict(
    input="Can you recommend some learning resources?"
)
print(follow_up)

This setup provides persistent conversation context while keeping everything local.

Performance Considerations

Running models locally has trade-offs. Here’s what I observed:

CPU-Only Performance (MacBook Pro M1, 16GB):

Model: gemma4:9b
Inference speed: ~15-20 tokens/second
Memory usage: ~10GB
First token latency: ~2 seconds

With a dedicated GPU:

Model: gemma4:27b (NVIDIA RTX 4090)
Inference speed: ~60-80 tokens/second
VRAM usage: ~20GB
First token latency: ~0.5 seconds

The difference is substantial for interactive use. For batch processing or non-time-sensitive tasks, CPU inference works fine.

Customizing the Model

Ollama allows model customization through Modelfiles:

FROM gemma4:9b

# Set custom parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Add a custom system prompt
SYSTEM You are a helpful coding assistant focused on Python development.

# Save as: ollama create my-gemma -f Modelfile

Create your customized model:

# Create model from Modelfile
ollama create my-gemma -f Modelfile

# Run the customized model
ollama run my-gemma

This approach lets you create specialized variants for specific use cases like code review, technical writing, or data analysis.

Offline Capability

Once downloaded, Gemma runs completely offline. I tested by disconnecting from the internet:

# Disable network and test
sudo ifconfig en0 down
ollama run gemma4 "Test message while offline"
# Works perfectly

# Re-enable network when done
sudo ifconfig en0 up

This makes it ideal for air-gapped environments or locations with unreliable connectivity.

Troubleshooting Common Issues

I encountered a few issues during setup:

Issue: Model downloads fail

# Check Ollama service status
ollama serve

# Or restart the service
# On macOS:
launchctl stop com.ollama.ollama
launchctl start com.ollama.ollama

Issue: Out of memory errors

1. Close other applications
2. Use a smaller model (try gemma4:2b)
3. Reduce context window in Modelfile
4. Add swap space on Linux:
   sudo fallocate -l 16G /swapfile
   sudo chmod 600 /swapfile
   sudo mkswap /swapfile
   sudo swapon /swapfile

Issue: Slow inference on CPU

# Check if Metal acceleration is enabled (macOS)
# It should be automatic on Apple Silicon

# Verify GPU usage on Linux
nvidia-smi -l 1  # Real-time GPU monitoring

When to Use Local vs Cloud AI

Local Gemma works well for:

Privacy-sensitive data processing
Offline environments
Projects with strict data residency requirements
Development and experimentation
Avoiding API rate limits

Cloud services still make sense for:

State-of-the-art models like GPT-4 or Claude
Collaborative features requiring cloud sync
Projects without capable local hardware
Use cases requiring massive scale

I use local Gemma for day-to-day coding assistance and drafting, while reserving cloud services for tasks requiring the absolute best model performance.

Summary

In this post, I showed how to run Google’s Gemma AI model locally using Ollama. The key point is that with just one command ollama pull gemma4, you get fully local, offline-capable AI inference without API keys or cloud dependency.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!