
How to Run Google Gemma 4 Locally on Your Computer

The Problem

I wanted to run a modern LLM on my laptop. Every guide I found said I needed an NVIDIA GPU with 16GB+ VRAM. My laptop has 8GB RAM and no dedicated GPU. I thought local AI was impossible for me.

Then I tried downloading a full-precision model anyway:

First Attempt - Failed
$ python run_model.py
Loading model...
CUDA out of memory. Tried to allocate 12.4GB
RuntimeError: No GPU available for inference

I spent hours trying different approaches: PyTorch, Hugging Face Transformers, various quantization methods. Each attempt failed with memory errors or dependency conflicts.

The breakthrough came when I discovered Google’s Gemma 4 family. The smaller variants (E2B and E4B) are specifically designed for consumer hardware. A 4-bit quantized E2B model needs only 1.2GB RAM. My laptop could actually run this.

Understanding Gemma 4 Model Variants

Before diving into installation, I needed to understand what I was choosing. Gemma 4 comes in four sizes:

Gemma 4 Model Family
┌────────────────────────────────────────────────┐
│                                                │
│  E2B (2 Billion params)                        │
│  ├── Smallest, fastest                         │
│  ├── Runs on phones/tablets                    │
│  └── Q4_K_M: 1.2GB RAM                         │
│                                                │
│  E4B (4 Billion params)                        │
│  ├── Good balance of speed/quality             │
│  ├── Perfect for laptops (8GB RAM)             │
│  └── Q4_K_M: 2.5GB RAM                         │
│                                                │
│  26B-A4B (26 Billion params)                   │
│  ├── High quality, needs dedicated hardware    │
│  └── Q4_K_M: 15GB RAM                          │
│                                                │
│  31B (31 Billion params)                       │
│  ├── Largest, best quality                     │
│  └── Q4_K_M: 18GB RAM                          │
│                                                │
└────────────────────────────────────────────────┘

For my 8GB RAM laptop, E2B and E4B were the only realistic options. I started with E2B since it’s the smallest and fastest.

Solution 1: Ollama (The Simplest Path)

Ollama is the easiest way to get started. One command installs everything.

Step 1: Install Ollama

Install Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS: download the app from https://ollama.com/download

The installer handles dependencies automatically. No manual CUDA setup, no Python version conflicts.

Step 2: Run Gemma 4

Run Gemma 4 with Ollama
# Pull and run the model
ollama run gemma4:2b
# Interactive chat starts automatically
>>> Write a short poem about coding

That’s it. Two commands and I had a working local LLM.

But I wanted more control. Ollama hides most inference settings behind defaults, and I wanted to tune parameters directly and embed the model in my Python projects.
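That said, Ollama does serve a local HTTP API (port 11434 by default), which can be enough for light scripting. The sketch below is my own illustration under a few assumptions, not part of my original setup: it posts to the /api/generate endpoint using only the standard library, build_generate_request and generate are hypothetical helper names, and the model tag is the one used above.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="gemma4:2b", temperature=0.7, num_predict=256):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON reply instead of a stream of chunks
        "options": {"temperature": temperature, "num_predict": num_predict},
    }


def generate(prompt, **kwargs):
    """POST the request to the local server and return the generated text."""
    body = json.dumps(build_generate_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read())["response"]
```

With the Ollama server running and the model pulled, print(generate("Write a short poem about coding", temperature=0.2)) should return the text in one piece.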

Solution 2: llama.cpp (For CPU-Only Systems)

llama.cpp is a pure C++ inference engine. It runs efficiently on CPU and optionally uses GPU if available.

Step 1: Install llama.cpp

Install llama.cpp
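# Clone and build with CMake (the old Makefile build is no longer supported)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Binaries land in build/bin/; link llama-cli here so ./llama-cli works
ln -sf build/bin/llama-cli llama-cli
# Or use Homebrew on macOS (then call llama-cli without the ./ prefix)
brew install llama.cpp

Building from source took about 5 minutes. CMake detects the platform and enables the matching optimizations automatically.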

Step 2: Download the GGUF Model

GGUF is the model file format that llama.cpp loads. It stores the weights and metadata in a single file and supports quantized tensors. I needed to download the 4-bit version:

Download Gemma 4 GGUF
# Create models directory
mkdir -p models
# Download 4-bit quantized model
wget https://huggingface.co/google/gemma-4-2b-it-gguf/resolve/main/gemma-4-2b-it-q4_k_m.gguf

The download was about 1.5GB, much smaller than the 4GB FP16 (unquantized) version.
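A failed download can leave an HTML error page saved under a .gguf name. A quick sanity check, sketched here as my own addition, is to read the file header: GGUF files begin with the four bytes GGUF followed by a little-endian 32-bit version number. The helper name read_gguf_version is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The magic is followed by a little-endian unsigned 32-bit version number.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling read_gguf_version("gemma-4-2b-it-q4_k_m.gguf") before the first run turns a confusing loader error into a clear message.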

Step 3: Run Inference

Run llama.cpp Inference
# CPU-only inference
./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -p "Explain recursion in simple terms" -n 512
# Interactive chat mode
./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -cnv

The -cnv flag enables conversation mode: llama-cli keeps the chat history in context, so the model can refer back to earlier turns.

I measured performance on my laptop:

CPU Performance Results
Model: Gemma 4 E2B Q4_K_M
Hardware: MacBook Pro M1, 8GB RAM
Tokens per second: 10-12
First token latency: ~200ms
Quality: Comparable to GPT-3.5 for simple tasks

10 tokens per second is usable for chat. Not blazing fast, but responsive enough.
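To put that number in perspective, here is the back-of-the-envelope arithmetic as a small sketch. It assumes a constant decoding speed and the ~200ms first-token latency I measured; response_seconds is a hypothetical helper name.

```python
def response_seconds(num_tokens, tokens_per_second, first_token_latency=0.2):
    """Rough time to generate num_tokens: startup latency plus steady-state decoding."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return first_token_latency + num_tokens / tokens_per_second


for n in (50, 200, 512):
    print(f"{n:>4} tokens at 10 t/s: about {response_seconds(n, 10):.1f} s")
```

A short 50-token reply arrives in about 5 seconds, while a full 512-token answer takes roughly 51 seconds, which is why streaming output matters at this speed.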

Optional: GPU Acceleration

If you have a GPU, add the -ngl flag:

GPU-Accelerated Inference
# Offload 35 layers to GPU
./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -ngl 35 -p "Your prompt"

GPU offloading dramatically improves speed. An RTX 4090 can hit 150+ tokens/second on larger models.
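If you end up launching llama-cli from scripts, it helps to assemble the arguments in one place. This is a minimal sketch using only the flags shown in this article (-m, -p, -n, -cnv, -ngl); llama_cli_command is a hypothetical helper of mine, and the binary path is an assumption you may need to adjust.

```python
import shlex


def llama_cli_command(model_path, prompt=None, n_predict=512, n_gpu_layers=0,
                      binary="./llama-cli"):
    """Return the llama-cli argument list for a one-shot prompt or a chat session."""
    args = [binary, "-m", model_path]
    if n_gpu_layers > 0:
        args += ["-ngl", str(n_gpu_layers)]  # number of layers to offload to the GPU
    if prompt is None:
        args.append("-cnv")  # no prompt given: start interactive conversation mode
    else:
        args += ["-p", prompt, "-n", str(n_predict)]
    return args


cmd = llama_cli_command("gemma-4-2b-it-q4_k_m.gguf", "Your prompt", n_gpu_layers=35)
print(shlex.join(cmd))
```

The returned list can be passed straight to subprocess.run(cmd), which avoids shell-quoting problems with prompts that contain spaces or quotes.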

Solution 3: Unsloth (For Python Integration)

I wanted to use Gemma 4 in my Python projects. Unsloth provides a clean Python API with optimized inference.

Step 1: Install Unsloth

Install Unsloth
curl -fsSL https://unsloth.ai/install.sh | sh

This one-line installer sets up PyTorch, transformers, and all dependencies. It took about 2 minutes.

Step 2: Load and Run the Model

run_gemma4.py
from unsloth import FastLanguageModel

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-4-2b-it-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,          # auto-detect the best dtype for the hardware
    load_in_4bit = True,
)

# Enable faster inference
FastLanguageModel.for_inference(model)

# Generate text
messages = [
    {"role": "user", "content": "Write a haiku about debugging"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)  # the prompt must be on the same device as the model

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=100,
    do_sample=True,        # without this, temperature and top_p are ignored
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Step 3: Integrate into Your Application

The Python API allows full customization:

custom_inference.py
# Adjust generation parameters
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,      # Longer responses
    do_sample=True,          # Required for the sampling settings below to apply
    temperature=0.7,         # Randomness (> 0; lower is more deterministic)
    top_p=0.9,               # Nucleus sampling
    top_k=50,                # Top-k sampling
    repetition_penalty=1.1,  # Reduce repetition
)
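To make temperature and top_p less abstract, here is a toy, pure-Python version of what those two knobs do to a next-token distribution. It is my own simplified illustration, not the code that Transformers or Unsloth actually runs; the logits are made up and the helper names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}  # renormalized to sum to 1


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.3, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], sorted(top_p_filter(probs, 0.9)))
```

Running it shows the pattern to expect: at low temperature almost all probability sits on the top token, so the nucleus shrinks; at higher temperature the distribution flattens and more candidates survive the top_p cut.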

Memory Requirements Reference

Choosing the right model depends on your hardware. Here’s a complete reference:

Memory Requirements by Model Size
┌─────────────┬──────┬──────┬────────┬───────────┬───────────┐
│ Model       │ FP16 │ Q8_0 │ Q4_K_M │ CPU Speed │ GPU Speed │
├─────────────┼──────┼──────┼────────┼───────────┼───────────┤
│ Gemma 4 E2B │ 4GB  │ 2GB  │ 1.2GB  │ 10-15 t/s │ 80+ t/s   │
│ Gemma 4 E4B │ 8GB  │ 4GB  │ 2.5GB  │ 8-12 t/s  │ 70+ t/s   │
│ Gemma 4 26B │ 52GB │ 26GB │ 15GB   │ 2-5 t/s   │ 40+ t/s   │
│ Gemma 4 31B │ 62GB │ 31GB │ 18GB   │ 1-3 t/s   │ 35+ t/s   │
└─────────────┴──────┴──────┴────────┴───────────┴───────────┘
Recommendations:
- 4-8GB RAM: Use E2B Q4_K_M (1.2GB)
- 8-16GB RAM: Use E4B Q4_K_M (2.5GB)
- 16-32GB RAM: Use E4B Q8_0 or 26B Q4_K_M
- 32GB+ RAM: Use 26B or 31B Q4_K_M
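The memory columns follow from simple arithmetic: parameter count times bits per weight, divided by 8. The sketch below reproduces them approximately. The bits-per-weight values for Q8_0 and Q4_K_M are approximate averages that I am assuming here, and the estimate covers the weights only, not the KV cache or runtime overhead.

```python
# Approximate bits stored per weight. These are assumed averages, not exact values.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}


def weight_memory_gb(params_billion, quant):
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8  # billions of params * bytes per param


for name, params in [("E2B", 2), ("E4B", 4), ("26B", 26), ("31B", 31)]:
    row = ", ".join(f"{q}: {weight_memory_gb(params, q):.1f} GB" for q in BITS_PER_WEIGHT)
    print(f"{name:>4}  {row}")
```

For E2B this gives 4.0 GB at FP16 and 1.2 GB at Q4_K_M, matching the table; the larger models land within about a gigabyte of the listed figures.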

Common Mistakes I Made

Mistake 1: Choosing the Wrong Model Size

I initially tried the 26B model. My 8GB RAM laptop crashed immediately.

Wrong Model Size Error
$ ./llama-cli -m gemma-4-26b-q4_k_m.gguf
Error: insufficient memory
Required: 15GB, Available: 6GB

The fix: Start small. E2B or E4B for hardware with less than 16GB RAM.
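A check like the following would have saved me the crash. It is a rough sketch that uses the Q4_K_M figures quoted in this article; the 2GB reserve for the OS and other apps is my own guess, and largest_variant_that_fits is a hypothetical helper name.

```python
# Q4_K_M memory figures quoted in this article, in GB, smallest to largest.
Q4_K_M_GB = {"E2B": 1.2, "E4B": 2.5, "26B-A4B": 15.0, "31B": 18.0}


def largest_variant_that_fits(total_ram_gb, reserved_gb=2.0):
    """Pick the largest variant whose Q4_K_M weights fit after reserving some RAM.

    reserved_gb is a guessed allowance for the OS, other apps, and the KV cache.
    """
    budget = total_ram_gb - reserved_gb
    fitting = [name for name, gb in Q4_K_M_GB.items() if gb <= budget]
    return fitting[-1] if fitting else None  # dict order runs from smallest to largest


for ram in (4, 8, 16, 32):
    print(f"{ram} GB RAM -> {largest_variant_that_fits(ram)}")
```

With these assumptions, 8GB and 16GB machines both land on E4B, and the 26B model only becomes an option once roughly 17GB is available.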

Mistake 2: Using Unquantized Models

I downloaded the FP16 version first. It required 4GB for E2B, which left little room for other processes on an 8GB machine.

Precision Comparison
FP16 (16-bit, unquantized): 4GB RAM, best quality
Q8_0 (8-bit quantized): 2GB RAM, about 99% quality retention
Q4_K_M (4-bit quantized): 1.2GB RAM, about 95% quality retention
The quality loss from Q4_K_M is negligible for most use cases.

The fix: Always use quantized models (Q4_K_M or Q5_K_M) for local inference.
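To see why 4 bits can be enough, here is a toy round trip through 4-bit quantization. It is a deliberately simplified symmetric scheme with one scale per block, not the actual Q4_K_M layout used by llama.cpp; the weights are made-up values and the function names are hypothetical.

```python
def quantize_block_4bit(values):
    """Symmetric 4-bit quantization of one block: integers in [-8, 7] plus one scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.40, 0.33, 0.05, -0.21, 0.70, -0.66, 0.02]  # made-up example values
codes, scale = quantize_block_4bit(weights)
restored = dequantize_block(codes, scale)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print(f"scale: {scale:.4f}, worst-case error: {worst:.4f}")
```

Each weight is stored as a small integer, and the rounding error per weight is bounded by half the block scale. Real formats refine this idea with smaller blocks and extra per-block parameters.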

Mistake 3: Not Using GPU When Available

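I ran llama.cpp on my friend’s RTX 4090 system without the -ngl flag. Performance was identical to a CPU-only run, since nothing was offloaded to the GPU.

GPU Offloading Comparison
Without -ngl: 12 tokens/second (CPU only)
With -ngl 35: 140 tokens/second (GPU accelerated)
Roughly a 12x improvement with one flag.

The fix: If you have a GPU, pass -ngl with at least the model's layer count. A value above the layer count (99 is a common choice) simply offloads every layer.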

Mistake 4: Skipping Dependencies

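I built llama.cpp on an NVIDIA system without enabling its CUDA backend. The build succeeded, but GPU offloading failed.

Dependency Check
# Check that the NVIDIA driver sees the GPU
nvidia-smi
# Check that the CUDA toolkit (compiler) is installed
nvcc --version
# If missing, install CUDA toolkit
# Ubuntu: sudo apt install nvidia-cuda-toolkit
# Then build llama.cpp with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
# macOS: Not needed (Metal is built-in)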

The fix: Install platform-specific dependencies before building.
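The same check can be scripted before kicking off a long build. This sketch only looks for the executables on PATH: nvidia-smi ships with the NVIDIA driver and nvcc with the CUDA toolkit. The list of tools and the name find_gpu_tools are my own choices.

```python
import shutil


def find_gpu_tools(tools=("nvidia-smi", "nvcc", "cmake")):
    """Map each tool name to its path on PATH, or None when it is not installed."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_gpu_tools().items():
    print(f"{tool:<12} {path or 'not found'}")
```

A missing nvcc next to a working nvidia-smi is exactly the situation I hit: the driver was fine, but there was no toolkit to compile the GPU backend.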

Why Run Locally Instead of Cloud APIs?

Running local LLMs has clear advantages:

Local vs Cloud Comparison
┌───────────────┬─────────────────────┬────────────────────────┐
│ Aspect        │ Local (Gemma 4)     │ Cloud (GPT-4)          │
├───────────────┼─────────────────────┼────────────────────────┤
│ Cost          │ Free after setup    │ $0.03/1K tokens        │
│ Privacy       │ Complete            │ Data sent to server    │
│ Offline       │ Yes                 │ No                     │
│ Rate limits   │ None                │ Yes                    │
│ Customization │ Full control        │ Limited                │
│ Speed         │ 10-150 t/s          │ Variable (API latency) │
│ Quality       │ Good for most tasks │ Excellent              │
└───────────────┴─────────────────────┴────────────────────────┘

For sensitive data (medical, legal, proprietary), local inference is essential. No data leaves your machine.
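The cost row is easier to judge with numbers. Here is the arithmetic as a small sketch, assuming the flat $0.03 per 1K tokens from the table applies to every token; real pricing varies by model and by input versus output, so treat the results as illustrative only.

```python
def cloud_cost_usd(tokens, price_per_1k_tokens=0.03):
    """Cost of processing `tokens` tokens at a flat per-1K-token price."""
    return tokens / 1000 * price_per_1k_tokens


for tokens in (10_000, 1_000_000, 50_000_000):
    print(f"{tokens:>11,} tokens -> ${cloud_cost_usd(tokens):,.2f}")
```

At that rate a million tokens costs $30, and 50 million tokens costs $1,500. That is the point where a local model that is "good for most tasks" starts to pay for itself.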

Complete Setup Script

Here’s a bash script that sets up everything:

setup-gemma4-local.sh
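#!/bin/bash
set -e
echo "=== Setting up Gemma 4 locally ==="
# Method 1: Ollama (simplest); the commands are only printed, run them yourself
echo ""
echo "Method 1: Ollama"
echo "curl -fsSL https://ollama.com/install.sh | sh"
echo "ollama run gemma4:2b"
# Method 2: llama.cpp (more control)
echo ""
echo "Method 2: llama.cpp"
# Skip the clone if the directory already exists so the script can be re-run
[ -d llama.cpp ] || git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
mkdir -p models
# -c resumes a partial download; -P saves the file into models/
wget -c -P models https://huggingface.co/google/gemma-4-2b-it-gguf/resolve/main/gemma-4-2b-it-q4_k_m.gguf
echo ""
echo "=== Setup complete ==="
echo "Run with: cd llama.cpp && ./build/bin/llama-cli -m models/gemma-4-2b-it-q4_k_m.gguf -cnv"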

Summary

Running Gemma 4 locally is now accessible to anyone with a modern computer:

  1. Ollama - Two commands, simplest setup, good for beginners
  2. llama.cpp - More control, CPU-optimized, works on any hardware
  3. Unsloth - Python integration, full customization, best for developers

The key is choosing the right model size. For 4-8GB RAM systems, use E2B Q4_K_M. It runs at 10+ tokens/second with only 1.2GB memory.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
