How to Run Gemma Locally with Ollama
Photo by Unsplash - AI and neural network visualization
I wanted to run a capable AI model locally without relying on cloud services or paying for API subscriptions. After trying several options, I found that Google’s Gemma model combined with Ollama provides a straightforward solution for fully local, offline-capable AI inference.
The Problem with Cloud AI Services
Most AI services require constant internet connectivity, charge per API call, and send your data to remote servers. For privacy-sensitive work or projects requiring offline capability, this creates significant limitations.
I needed a solution that:
- Runs entirely on my local machine
- Works offline after initial setup
- Requires no API keys or subscriptions
- Provides good quality responses
Gemma with Ollama delivers on all these requirements.
What is Gemma?
Gemma is Google’s family of open-weights AI models. The latest version, Gemma 4, ranks #3 among open models according to various benchmarks. Being open-weights means you can download and run the model locally with full control.
The model comes in several sizes:
- 2B - Lightweight, runs on most machines
- 9B - Good balance of performance and resource usage
- 27B - Higher quality, requires more hardware
- 31B - Top-tier performance among open models
Installing Ollama
Ollama acts as a local model runner that handles the infrastructure for running large language models. First, I installed it on my system.
For macOS:
# Download and install Ollamacurl -fsSL https://ollama.com/install.sh | sh
# Verify installationollama --versionFor Linux, the same command works. Windows users can download the installer from ollama.com.
After installation, Ollama runs as a background service on port 11434.
Pulling the Gemma Model
With Ollama installed, pulling Gemma is straightforward:
# Pull the default Gemma modelollama pull gemma4
# Or specify a particular sizeollama pull gemma4:9bollama pull gemma4:27bThe download size varies by model:
gemma4:2b - ~1.5 GBgemma4:9b - ~5.5 GBgemma4:27b - ~16 GBgemma4:31b - ~19 GBI started with the 9B model as it offers good performance while fitting within my 16GB RAM system.
Running Gemma
Once downloaded, running the model is simple:
# Start interactive sessionollama run gemma4
# Or run with a specific promptollama run gemma4 "Explain quantum computing in simple terms"The interactive session provides a chat interface where you can have extended conversations:
>>> What are the advantages of running AI models locally?
Running AI models locally has several advantages:
1. **Privacy**: Your data stays on your machine2. **No internet required**: Works offline after download3. **No API costs**: Free unlimited usage4. **Low latency**: No network round-trips5. **Full control**: You own the model and data
>>> How does Gemma compare to other open models?
Gemma 4 ranks highly among open models, particularly the 27B and 31Bvariants. It offers strong performance on reasoning tasks, codegeneration, and general conversation while maintaining a relativelyefficient architecture.Hardware Requirements
The model you choose depends on your hardware. Here’s what I found works:
| Model Size | Minimum RAM | Recommended RAM | GPU ||------------|-------------|-----------------|-----|| 2B | 4GB | 8GB | Optional || 9B | 8GB | 16GB | Recommended || 27B | 16GB | 32GB | Required for speed || 31B | 20GB | 40GB | Required for speed |I tested on a MacBook Pro with 16GB RAM. The 9B model runs smoothly, though I noticed slower response times compared to GPU-accelerated systems. For the 27B model, I had to close other applications to free up memory.
Using the API
Ollama exposes a REST API, making it easy to integrate with applications:
import requestsimport json
def generate_response(prompt, model="gemma4"): """Generate response from local Gemma model.""" url = "http://localhost:11434/api/generate"
payload = { "model": model, "prompt": prompt, "stream": False }
response = requests.post(url, json=payload) return response.json()
# Example usageresult = generate_response("What is machine learning?")print(result["response"])This API runs entirely locally with no external calls.
For streaming responses:
import requestsimport json
def stream_response(prompt, model="gemma4"): """Stream response from local Gemma model.""" url = "http://localhost:11434/api/generate"
payload = { "model": model, "prompt": prompt, "stream": True }
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines(): if line: data = json.loads(line) if "response" in data: print(data["response"], end="", flush=True)
# Example usagestream_response("Write a short poem about technology")Integration with LangChain
For more complex applications, I integrated with LangChain:
from langchain_community.llms import Ollamafrom langchain.chains import ConversationChainfrom langchain.memory import ConversationBufferMemory
# Initialize local Gemma modelllm = Ollama(model="gemma4")
# Create conversation with memorymemory = ConversationBufferMemory()conversation = ConversationChain( llm=llm, memory=memory, verbose=True)
# Have a conversationresponse = conversation.predict( input="I want to learn about neural networks")print(response)
# Continue the conversationfollow_up = conversation.predict( input="Can you recommend some learning resources?")print(follow_up)This setup provides persistent conversation context while keeping everything local.
Performance Considerations
Running models locally has trade-offs. Here’s what I observed:
CPU-Only Performance (MacBook Pro M1, 16GB):
Model: gemma4:9bInference speed: ~15-20 tokens/secondMemory usage: ~10GBFirst token latency: ~2 secondsWith a dedicated GPU:
Model: gemma4:27b (NVIDIA RTX 4090)Inference speed: ~60-80 tokens/secondVRAM usage: ~20GBFirst token latency: ~0.5 secondsThe difference is substantial for interactive use. For batch processing or non-time-sensitive tasks, CPU inference works fine.
Customizing the Model
Ollama allows model customization through Modelfiles:
FROM gemma4:9b
# Set custom parametersPARAMETER temperature 0.7PARAMETER top_p 0.9PARAMETER num_ctx 4096
# Add a custom system promptSYSTEM You are a helpful coding assistant focused on Python development.
# Save as: ollama create my-gemma -f ModelfileCreate your customized model:
# Create model from Modelfileollama create my-gemma -f Modelfile
# Run the customized modelollama run my-gemmaThis approach lets you create specialized variants for specific use cases like code review, technical writing, or data analysis.
Offline Capability
Once downloaded, Gemma runs completely offline. I tested by disconnecting from the internet:
# Disable network and testsudo ifconfig en0 downollama run gemma4 "Test message while offline"# Works perfectly
# Re-enable network when donesudo ifconfig en0 upThis makes it ideal for air-gapped environments or locations with unreliable connectivity.
Troubleshooting Common Issues
I encountered a few issues during setup:
Issue: Model downloads fail
# Check Ollama service statusollama serve
# Or restart the service# On macOS:launchctl stop com.ollama.ollamalaunchctl start com.ollama.ollamaIssue: Out of memory errors
1. Close other applications2. Use a smaller model (try gemma4:2b)3. Reduce context window in Modelfile4. Add swap space on Linux: sudo fallocate -l 16G /swapfile sudo chmod 600 /swapfile sudo mkswap /swapfile sudo swapon /swapfileIssue: Slow inference on CPU
# Check if Metal acceleration is enabled (macOS)# It should be automatic on Apple Silicon
# Verify GPU usage on Linuxnvidia-smi -l 1 # Real-time GPU monitoringWhen to Use Local vs Cloud AI
Local Gemma works well for:
- Privacy-sensitive data processing
- Offline environments
- Projects with strict data residency requirements
- Development and experimentation
- Avoiding API rate limits
Cloud services still make sense for:
- State-of-the-art models like GPT-4 or Claude
- Collaborative features requiring cloud sync
- Projects without capable local hardware
- Use cases requiring massive scale
I use local Gemma for day-to-day coding assistance and drafting, while reserving cloud services for tasks requiring the absolute best model performance.
Summary
In this post, I showed how to run Google’s Gemma AI model locally using Ollama. The key point is that with just one command ollama pull gemma4, you get fully local, offline-capable AI inference without API keys or cloud dependency.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments