How to Run Google Gemma 4 Locally on Your Computer

Apr 3, 2026

The Problem

I wanted to run a modern LLM on my laptop. Every guide I found said I needed an NVIDIA GPU with 16GB+ VRAM. My laptop has 8GB RAM and no dedicated GPU. I thought local AI was impossible for me.

Then I tried downloading a full-precision model anyway:

$ python run_model.py
Loading model...
CUDA out of memory. Tried to allocate 12.4GB
RuntimeError: No GPU available for inference

I spent hours trying different approaches: PyTorch, Hugging Face Transformers, various quantization methods. Each attempt failed with memory errors or dependency conflicts.

The breakthrough came when I discovered Google’s Gemma 4 family. The smaller variants (E2B and E4B) are specifically designed for consumer hardware. A 4-bit quantized E2B model needs only 1.2GB RAM. My laptop could actually run this.

Understanding Gemma 4 Model Variants

Before diving into installation, I needed to understand what I was choosing. Gemma 4 comes in four sizes:

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   E2B (2 Billion params)                                        │
│   ├── Smallest, fastest                                         │
│   ├── Runs on phones/tablets                                    │
│   └── Q4_K_M: 1.2GB RAM                                         │
│                                                                 │
│   E4B (4 Billion params)                                        │
│   ├── Good balance of speed/quality                             │
│   ├── Perfect for laptops (8GB RAM)                             │
│   └── Q4_K_M: 2.5GB RAM                                         │
│                                                                 │
│   26B-A4B (26 Billion params)                                   │
│   ├── High quality, needs dedicated hardware                    │
│   └── Q4_K_M: 15GB RAM                                          │
│                                                                 │
│   31B (31 Billion params)                                       │
│   ├── Largest, best quality                                     │
│   └── Q4_K_M: 18GB RAM                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

For my 8GB RAM laptop, E2B and E4B were the only realistic options. I started with E2B since it’s the smallest and fastest.

Solution 1: Ollama (The Simplest Path)

Ollama is the easiest way to get started. One command installs everything.

Step 1: Install Ollama

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

The installer handles dependencies automatically. No manual CUDA setup, no Python version conflicts.

Step 2: Run Gemma 4

# Pull and run the model
ollama run gemma4:2b

# Interactive chat starts automatically
>>> Write a short poem about coding

That’s it. Two commands and I had a working local LLM.

But I wanted more control. Ollama abstracts too much. I couldn’t customize inference parameters or integrate it into my Python projects.

Solution 2: llama.cpp (For CPU-Only Systems)

llama.cpp is a pure C++ inference engine. It runs efficiently on CPU and optionally uses GPU if available.

Step 1: Install llama.cpp

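# Clone and build (current llama.cpp builds with CMake, not make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or use Homebrew on macOS
brew install llama.cpp

Building from source took about 5 minutes. CMake picks CPU and Metal optimizations for the platform automatically. The binaries land in build/bin/, so run them from there or copy llama-cli next to your model before following the commands below.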

Step 2: Download the GGUF Model

GGUF is the model file format that llama.cpp loads, and it usually stores quantized weights. I needed to download the 4-bit version:

# Create models directory
mkdir -p models

# Download 4-bit quantized model
wget https://huggingface.co/google/gemma-4-2b-it-gguf/resolve/main/gemma-4-2b-it-q4_k_m.gguf

The download was about 1.5GB. Much smaller than the 4GB full-precision version.
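A truncated or failed download is an easy way to lose an hour, so a quick sanity check helps. The sketch below reads the first bytes of the file and looks for the GGUF magic bytes and the version field that follows them. The function name looks_like_gguf is my own, and the check only covers the header, not whether the rest of the file is intact.

```python
import struct

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def looks_like_gguf(path):
    """Return the GGUF version number if the header looks valid, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The magic is followed by a little-endian uint32 version field.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling looks_like_gguf("gemma-4-2b-it-q4_k_m.gguf") after the download should return a small integer. If it returns None, the file is probably an HTML error page or an incomplete download.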

Step 3: Run Inference

# CPU-only inference
./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -p "Explain recursion in simple terms" -n 512

# Interactive chat mode
./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -cnv

The -cnv flag enables conversation mode. The model remembers context between prompts.

I measured performance on my laptop:

Model: Gemma 4 E2B Q4_K_M
Hardware: MacBook Pro M1, 8GB RAM
Tokens per second: 10-12
First token latency: ~200ms
Quality: Comparable to GPT-3.5 for simple tasks

10 tokens per second is usable for chat. Not blazing fast, but responsive enough.
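To put that number in perspective, here is a rough back-of-the-envelope estimate of how long a full answer takes. It assumes generation speed is constant and simply adds the first-token latency I measured. The helper name estimated_seconds is mine.

```python
def estimated_seconds(new_tokens, tokens_per_second, first_token_latency_s=0.2):
    """Rough wall-clock time: time to first token plus steady generation."""
    return first_token_latency_s + new_tokens / tokens_per_second


# A 512-token answer (the -n 512 limit used earlier) at 10 tokens/second
print(round(estimated_seconds(512, 10), 1))  # 51.4
```

So a maximum-length answer takes the better part of a minute, while a short chat reply of 50 tokens arrives in about five seconds.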

Optional: GPU Acceleration

If you have a GPU, add the -ngl flag:

# Offload 35 layers to GPU
./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -ngl 35 -p "Your prompt"

GPU offloading dramatically improves speed. An RTX 4090 can hit 150+ tokens/second on larger models.
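If you want to call llama-cli from a script, a thin wrapper is enough. This is only a sketch built from the flags shown above (-m, -p, -n, -ngl). The function names and the default binary path are my own assumptions. Some llama-cli builds drop into interactive chat when the model ships a chat template, in which case you may need an extra flag, so check llama-cli --help for your version.

```python
import subprocess


def build_llama_cli_args(model_path, prompt, n_predict=512, gpu_layers=0,
                         binary="./llama-cli"):
    """Build an argument list from the flags used above: -m, -p, -n, -ngl."""
    args = [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]
    if gpu_layers > 0:
        args += ["-ngl", str(gpu_layers)]  # only offload when asked to
    return args


def run_prompt(model_path, prompt, **kwargs):
    """Run llama-cli once and return whatever it printed to stdout."""
    args = build_llama_cli_args(model_path, prompt, **kwargs)
    result = subprocess.run(
        args,
        stdin=subprocess.DEVNULL,  # never wait for keyboard input
        capture_output=True,
        text=True,
        check=True,                # raise if llama-cli exits with an error
    )
    return result.stdout
```

For example, run_prompt("gemma-4-2b-it-q4_k_m.gguf", "Explain recursion in simple terms", gpu_layers=35) mirrors the GPU command above.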

Solution 3: Unsloth (For Python Integration)

I wanted to use Gemma 4 in my Python projects. Unsloth provides a clean Python API with optimized inference.

Step 1: Install Unsloth

curl -fsSL https://unsloth.ai/install.sh | sh

This one-line installer sets up PyTorch, transformers, and all dependencies. It took about 2 minutes.

Step 2: Load and Run the Model

from unsloth import FastLanguageModel

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-4-2b-it-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Enable faster inference
FastLanguageModel.for_inference(model)

# Generate text
messages = [
    {"role": "user", "content": "Write a haiku about debugging"}
]

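inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)  # keep the inputs on the same device as the model

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=100,
    do_sample=True,   # temperature and top_p only take effect when sampling
    temperature=0.7,
    top_p=0.9,
)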

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Step 3: Integrate into Your Application

The Python API allows full customization:

# Adjust generation parameters
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,        # Longer responses
    do_sample=True,            # Required for the sampling settings below
    temperature=0.7,           # Randomness (lower is more deterministic)
    top_p=0.9,                 # Nucleus sampling
    top_k=50,                  # Top-k sampling
    repetition_penalty=1.1,    # Reduce repetition
)
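To make these parameters less abstract, here is a simplified sketch of what temperature, top_k, and top_p do to the model's next-token scores. It is plain Python over a toy list of logits, not the code that Transformers or Unsloth actually run. The function name filtered_probs and the example numbers are mine.

```python
import math


def filtered_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature: divide logits before softmax. Values below 1 sharpen the
    # distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens and renormalise over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability
    # reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalise over the survivors; sampling would draw from this.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


print(filtered_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9))
```

With these toy logits, top_k=3 drops the least likely token and top_p=0.9 then drops the third, so only tokens 0 and 1 remain as candidates.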

Memory Requirements Reference

Choosing the right model depends on your hardware. Here’s a complete reference:

┌──────────────────────────────────────────────────────────────────┐
│ Model          │ FP16   │ Q8_0  │ Q4_K_M │ CPU Speed │ GPU Speed │
├──────────────────────────────────────────────────────────────────┤
│ Gemma 4 E2B    │ 4GB    │ 2GB   │ 1.2GB  │ 10-15 t/s │ 80+ t/s   │
│ Gemma 4 E4B    │ 8GB    │ 4GB   │ 2.5GB  │ 8-12 t/s  │ 70+ t/s   │
│ Gemma 4 26B    │ 52GB   │ 26GB  │ 15GB   │ 2-5 t/s   │ 40+ t/s   │
│ Gemma 4 31B    │ 62GB   │ 31GB  │ 18GB   │ 1-3 t/s   │ 35+ t/s   │
└──────────────────────────────────────────────────────────────────┘

Recommendations:
- 4-8GB RAM: Use E2B Q4_K_M (1.2GB)
- 8-16GB RAM: Use E4B Q4_K_M (2.5GB)
- 16-32GB RAM: Use E4B Q8_0 or 26B Q4_K_M
- 32GB+ RAM: Use 26B or 31B Q4_K_M
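The same recommendations can be written as a small lookup, which is handy if you script your setup. This is just my list above turned into code. The ranges overlap at their boundaries, so I am assuming a boundary value (for example exactly 8GB) falls into the higher tier. The function name recommend_model is hypothetical.

```python
def recommend_model(ram_gb):
    """Map total system RAM in GB to the recommendation tiers listed above."""
    if ram_gb < 4:
        return None  # below the smallest tier in the list
    if ram_gb < 8:
        return "E2B Q4_K_M"
    if ram_gb < 16:
        return "E4B Q4_K_M"
    if ram_gb < 32:
        return "E4B Q8_0 or 26B Q4_K_M"
    return "26B or 31B Q4_K_M"


for ram in (6, 8, 24, 64):
    print(ram, "GB ->", recommend_model(ram))
```

If you are right on a boundary and the machine is also running a browser and an editor, dropping down one tier is the safer choice.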

Common Mistakes I Made

Mistake 1: Choosing the Wrong Model Size

I initially tried the 26B model. My 8GB RAM laptop crashed immediately.

$ ./llama-cli -m gemma-4-26b-q4_k_m.gguf
Error: insufficient memory
Required: 15GB, Available: 6GB

The fix: Start small. E2B or E4B for hardware with less than 16GB RAM.

Mistake 2: Using Full-Precision Models

I downloaded the FP16 version first. It required 4GB for E2B, which left no room for other processes.

FP16 (16-bit, unquantized): 4GB RAM, best quality
Q8_0 (8-bit quantized):     2GB RAM, roughly 99% quality retention
Q4_K_M (4-bit quantized):   1.2GB RAM, roughly 95% quality retention

The quality loss from Q4_K_M is negligible for most use cases.

The fix: Always use quantized models (Q4_K_M or Q5_K_M) for local inference.
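The memory figures follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8 to get bytes. Here is a minimal sketch of that calculation. The 4.8 bits-per-weight figure for Q4_K_M is my assumption (the format stores scaling data alongside the 4-bit values), and real files vary. The estimate covers weights only, so the context cache adds to it at runtime.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# E2B at the three precisions discussed above
for name, bits in (("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.8)):
    print(name, round(weight_memory_gb(2, bits), 2), "GB")
```

The same formula reproduces the larger rows too: 26 billion parameters at 16 bits comes to 52GB, which is why the big models are out of reach without quantization.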

Mistake 3: Not Using GPU When Available

I ran llama.cpp on my friend’s RTX 4090 system without the -ngl flag. Performance was identical to CPU.

Without -ngl:    12 tokens/second (CPU only)
With -ngl 35:    140 tokens/second (GPU accelerated)

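More than a 10x improvement with one flag.

The fix: If you have a GPU, add -ngl with the number of layers to offload (35 here). A value at or above the model's layer count offloads every layer.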

Mistake 4: Skipping Dependencies

I tried to build llama.cpp without CUDA support on an NVIDIA system. The build succeeded, but GPU offloading failed.

# Check that the NVIDIA driver sees the GPU
nvidia-smi

# Check that the CUDA toolkit is installed
nvcc --version

# If missing, install CUDA toolkit
# Ubuntu: sudo apt install nvidia-cuda-toolkit
# macOS: Not needed (Metal is built-in)

The fix: Install platform-specific dependencies before building, then enable CUDA in the build (-DGGML_CUDA=ON when configuring with CMake).
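A quick pre-build check saves a wasted compile. The sketch below only looks for the two NVIDIA command-line tools on PATH: nvidia-smi ships with the driver and nvcc ships with the CUDA toolkit. It does not verify versions or that the GPU actually works, and the function name is my own.

```python
import shutil


def gpu_build_prerequisites():
    """Report which NVIDIA command-line tools are on PATH."""
    return {
        "nvidia-smi": shutil.which("nvidia-smi") is not None,  # driver
        "nvcc": shutil.which("nvcc") is not None,              # CUDA toolkit
    }


status = gpu_build_prerequisites()
for tool, found in status.items():
    print(tool, "found" if found else "missing")
```

If nvidia-smi is found but nvcc is missing, the driver is installed and the toolkit is not, which matches the situation I ran into.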

Why Run Locally Instead of Cloud APIs?

Running local LLMs has clear advantages:

┌─────────────────────────────────────────────────────────────────┐
│ Aspect          │ Local (Gemma 4)      │ Cloud (GPT-4)          │
├─────────────────────────────────────────────────────────────────┤
│ Cost            │ Free after setup     │ $0.03/1K tokens        │
│ Privacy         │ Complete             │ Data sent to server    │
│ Offline         │ Yes                  │ No                     │
│ Rate limits     │ None                 │ Yes                    │
│ Customization   │ Full control         │ Limited                │
│ Speed           │ 10-150 t/s           │ Variable (API latency) │
│ Quality         │ Good for most tasks  │ Excellent              │
└─────────────────────────────────────────────────────────────────┘

For sensitive data (medical, legal, proprietary), local inference is essential. No data leaves your machine.
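The cost row is easy to turn into numbers. This sketch uses the $0.03 per 1K tokens figure from the table as a flat rate, which is a simplification since real pricing differs by model and by input versus output tokens. The function name cloud_cost_usd is mine.

```python
def cloud_cost_usd(tokens, usd_per_1k_tokens=0.03):
    """Cost of a token volume at a flat per-1K-token rate."""
    return tokens / 1000 * usd_per_1k_tokens


# One million tokens at the rate from the table: roughly 30 dollars
print(round(cloud_cost_usd(1_000_000), 2))
```

At 10 tokens per second a laptop needs more than a day of continuous generation to produce a million tokens, so the saving matters most for steady, long-running workloads.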

Complete Setup Script

Here’s a bash script that prints the Ollama commands, then builds llama.cpp and downloads the model:

#!/bin/bash
set -e

echo "=== Setting up Gemma 4 locally ==="

# Method 1: Ollama (simplest)
echo ""
echo "Method 1: Ollama"
echo "curl -fsSL https://ollama.ai/install.sh | sh"
echo "ollama run gemma4:2b"

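# Method 2: llama.cpp (more control)
echo ""
echo "Method 2: llama.cpp"
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Current llama.cpp builds with CMake; binaries land in build/bin/
cmake -B build
cmake --build build --config Release

# Download the model into llama.cpp/models without leaving the repo root
mkdir -p models
wget -P models https://huggingface.co/google/gemma-4-2b-it-gguf/resolve/main/gemma-4-2b-it-q4_k_m.gguf

echo ""
echo "=== Setup complete ==="
echo "Run from the llama.cpp directory with: ./build/bin/llama-cli -m models/gemma-4-2b-it-q4_k_m.gguf -cnv"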

Summary

Running Gemma 4 locally is now accessible to anyone with a modern computer:

- Ollama: two commands, simplest setup, good for beginners
- llama.cpp: more control, CPU-optimized, works on any hardware
- Unsloth: Python integration, full customization, best for developers

The key is choosing the right model size. For 4-8GB RAM systems, use E2B Q4_K_M. It runs at 10+ tokens/second with only 1.2GB memory.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are the most important links from this article, along with some further resources on running models locally:

👨‍💻 Unsloth Installation Guide
👨‍💻 llama.cpp GitHub Repository
👨‍💻 Ollama Official Website
👨‍💻 Reddit: Running Gemma 4 Locally Discussion
👨‍💻 Google Gemma Model Card on HuggingFace
