Skip to content

How to Run Gemma Locally with Ollama

AI neural network visualization Photo by Unsplash - AI and neural network visualization

I wanted to run a capable AI model locally without relying on cloud services or paying for API subscriptions. After trying several options, I found that Google’s Gemma model combined with Ollama provides a straightforward solution for fully local, offline-capable AI inference.

The Problem with Cloud AI Services

Most AI services require constant internet connectivity, charge per API call, and send your data to remote servers. For privacy-sensitive work or projects requiring offline capability, this creates significant limitations.

I needed a solution that:

  • Runs entirely on my local machine
  • Works offline after initial setup
  • Requires no API keys or subscriptions
  • Provides good quality responses

Gemma with Ollama delivers on all these requirements.

What is Gemma?

Gemma is Google’s family of open-weights AI models. The latest version, Gemma 4, ranks #3 among open models according to various benchmarks. Being open-weights means you can download and run the model locally with full control.

The model comes in several sizes:

  • 2B - Lightweight, runs on most machines
  • 9B - Good balance of performance and resource usage
  • 27B - Higher quality, requires more hardware
  • 31B - Top-tier performance among open models

Installing Ollama

Ollama acts as a local model runner that handles the infrastructure for running large language models. First, I installed it on my system.

For macOS:

install-ollama.sh
# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version

For Linux, the same command works. Windows users can download the installer from ollama.com.

After installation, Ollama runs as a background service on port 11434.

Pulling the Gemma Model

With Ollama installed, pulling Gemma is straightforward:

pull-gemma.sh
# Pull the default Gemma model
ollama pull gemma4
# Or specify a particular size
ollama pull gemma4:9b
ollama pull gemma4:27b

The download size varies by model:

model-sizes.txt
gemma4:2b - ~1.5 GB
gemma4:9b - ~5.5 GB
gemma4:27b - ~16 GB
gemma4:31b - ~19 GB

I started with the 9B model as it offers good performance while fitting within my 16GB RAM system.

Running Gemma

Once downloaded, running the model is simple:

run-gemma.sh
# Start interactive session
ollama run gemma4
# Or run with a specific prompt
ollama run gemma4 "Explain quantum computing in simple terms"

The interactive session provides a chat interface where you can have extended conversations:

gemma-session.txt
>>> What are the advantages of running AI models locally?
Running AI models locally has several advantages:
1. **Privacy**: Your data stays on your machine
2. **No internet required**: Works offline after download
3. **No API costs**: Free unlimited usage
4. **Low latency**: No network round-trips
5. **Full control**: You own the model and data
>>> How does Gemma compare to other open models?
Gemma 4 ranks highly among open models, particularly the 27B and 31B
variants. It offers strong performance on reasoning tasks, code
generation, and general conversation while maintaining a relatively
efficient architecture.

Hardware Requirements

The model you choose depends on your hardware. Here’s what I found works:

hardware-requirements.txt
| Model Size | Minimum RAM | Recommended RAM | GPU |
|------------|-------------|-----------------|-----|
| 2B | 4GB | 8GB | Optional |
| 9B | 8GB | 16GB | Recommended |
| 27B | 16GB | 32GB | Required for speed |
| 31B | 20GB | 40GB | Required for speed |

I tested on a MacBook Pro with 16GB RAM. The 9B model runs smoothly, though I noticed slower response times compared to GPU-accelerated systems. For the 27B model, I had to close other applications to free up memory.

Using the API

Ollama exposes a REST API, making it easy to integrate with applications:

gemma-api.py
import requests
import json
def generate_response(prompt, model="gemma4"):
"""Generate response from local Gemma model."""
url = "http://localhost:11434/api/generate"
payload = {
"model": model,
"prompt": prompt,
"stream": False
}
response = requests.post(url, json=payload)
return response.json()
# Example usage
result = generate_response("What is machine learning?")
print(result["response"])

This API runs entirely locally with no external calls.

For streaming responses:

gemma-stream.py
import requests
import json
def stream_response(prompt, model="gemma4"):
"""Stream response from local Gemma model."""
url = "http://localhost:11434/api/generate"
payload = {
"model": model,
"prompt": prompt,
"stream": True
}
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
if line:
data = json.loads(line)
if "response" in data:
print(data["response"], end="", flush=True)
# Example usage
stream_response("Write a short poem about technology")

Integration with LangChain

For more complex applications, I integrated with LangChain:

langchain-gemma.py
from langchain_community.llms import Ollama
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
# Initialize local Gemma model
llm = Ollama(model="gemma4")
# Create conversation with memory
memory = ConversationBufferMemory()
conversation = ConversationChain(
llm=llm,
memory=memory,
verbose=True
)
# Have a conversation
response = conversation.predict(
input="I want to learn about neural networks"
)
print(response)
# Continue the conversation
follow_up = conversation.predict(
input="Can you recommend some learning resources?"
)
print(follow_up)

This setup provides persistent conversation context while keeping everything local.

Performance Considerations

Running models locally has trade-offs. Here’s what I observed:

CPU-Only Performance (MacBook Pro M1, 16GB):

cpu-performance.txt
Model: gemma4:9b
Inference speed: ~15-20 tokens/second
Memory usage: ~10GB
First token latency: ~2 seconds

With a dedicated GPU:

gpu-performance.txt
Model: gemma4:27b (NVIDIA RTX 4090)
Inference speed: ~60-80 tokens/second
VRAM usage: ~20GB
First token latency: ~0.5 seconds

The difference is substantial for interactive use. For batch processing or non-time-sensitive tasks, CPU inference works fine.

Customizing the Model

Ollama allows model customization through Modelfiles:

Modelfile
FROM gemma4:9b
# Set custom parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Add a custom system prompt
SYSTEM You are a helpful coding assistant focused on Python development.
# Save as: ollama create my-gemma -f Modelfile

Create your customized model:

create-custom-model.sh
# Create model from Modelfile
ollama create my-gemma -f Modelfile
# Run the customized model
ollama run my-gemma

This approach lets you create specialized variants for specific use cases like code review, technical writing, or data analysis.

Offline Capability

Once downloaded, Gemma runs completely offline. I tested by disconnecting from the internet:

test-offline.sh
# Disable network and test
sudo ifconfig en0 down
ollama run gemma4 "Test message while offline"
# Works perfectly
# Re-enable network when done
sudo ifconfig en0 up

This makes it ideal for air-gapped environments or locations with unreliable connectivity.

Troubleshooting Common Issues

I encountered a few issues during setup:

Issue: Model downloads fail

troubleshoot-download.sh
# Check Ollama service status
ollama serve
# Or restart the service
# On macOS:
launchctl stop com.ollama.ollama
launchctl start com.ollama.ollama

Issue: Out of memory errors

memory-fix.txt
1. Close other applications
2. Use a smaller model (try gemma4:2b)
3. Reduce context window in Modelfile
4. Add swap space on Linux:
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Issue: Slow inference on CPU

improve-speed.sh
# Check if Metal acceleration is enabled (macOS)
# It should be automatic on Apple Silicon
# Verify GPU usage on Linux
nvidia-smi -l 1 # Real-time GPU monitoring

When to Use Local vs Cloud AI

Local Gemma works well for:

  • Privacy-sensitive data processing
  • Offline environments
  • Projects with strict data residency requirements
  • Development and experimentation
  • Avoiding API rate limits

Cloud services still make sense for:

  • State-of-the-art models like GPT-4 or Claude
  • Collaborative features requiring cloud sync
  • Projects without capable local hardware
  • Use cases requiring massive scale

I use local Gemma for day-to-day coding assistance and drafting, while reserving cloud services for tasks requiring the absolute best model performance.

Summary

In this post, I showed how to run Google’s Gemma AI model locally using Ollama. The key point is that with just one command ollama pull gemma4, you get fully local, offline-capable AI inference without API keys or cloud dependency.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments