How to Run Google Gemma 4 Locally on Your Computer
The Problem
I wanted to run a modern LLM on my laptop. Every guide I found said I needed an NVIDIA GPU with 16GB+ VRAM. My laptop has 8GB RAM and no dedicated GPU. I thought local AI was impossible for me.
Then I tried downloading a full-precision model anyway:
    $ python run_model.py
    Loading model...
    CUDA out of memory. Tried to allocate 12.4GB
    RuntimeError: No GPU available for inference

I spent hours trying different approaches. PyTorch, HuggingFace transformers, various quantization methods. Each attempt failed with memory errors or dependency conflicts.
The breakthrough came when I discovered Google’s Gemma 4 family. The smaller variants (E2B and E4B) are specifically designed for consumer hardware. A 4-bit quantized E2B model needs only 1.2GB RAM. My laptop could actually run this.
Understanding Gemma 4 Model Variants
Before diving into installation, I needed to understand what I was choosing. Gemma 4 comes in four sizes:
E2B (2 billion params)
- Smallest, fastest
- Runs on phones/tablets
- Q4_K_M: 1.2GB RAM

E4B (4 billion params)
- Good balance of speed/quality
- Perfect for laptops (8GB RAM)
- Q4_K_M: 2.5GB RAM

26B-A4B (26 billion params)
- High quality, needs dedicated hardware
- Q4_K_M: 15GB RAM

31B (31 billion params)
- Largest, best quality
- Q4_K_M: 18GB RAM

For my 8GB RAM laptop, E2B and E4B were the only realistic options. I started with E2B since it’s the smallest and fastest.
Solution 1: Ollama (The Simplest Path)
Ollama is the easiest way to get started. One command installs everything.
Step 1: Install Ollama
    # macOS/Linux
    curl -fsSL https://ollama.ai/install.sh | sh

The installer handles dependencies automatically. No manual CUDA setup, no Python version conflicts.
Step 2: Run Gemma 4
    # Pull and run the model
    ollama run gemma4:2b

    # Interactive chat starts automatically
    >>> Write a short poem about coding

That’s it. Two commands and I had a working local LLM.
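For scripted use, Ollama also serves a local HTTP API while it is running. The sketch below builds a request for its /api/generate endpoint using only the Python standard library. It assumes the default address http://localhost:11434 and reuses the gemma4:2b tag from above; the helper names are illustrative and not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_generate_request(prompt, model="gemma4:2b", temperature=0.7):
    """Build (but do not send) a non-streaming request for Ollama's generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]
```

With the server running, calling generate("Write a short poem about coding") should return the model's reply as a string. Keeping the request builder separate from the network call makes the payload easy to inspect.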
But I wanted more control. Ollama hides most of the inference stack, and I wanted direct access to inference parameters and tighter integration with my Python projects.
Solution 2: llama.cpp (For CPU-Only Systems)
llama.cpp is a pure C++ inference engine. It runs efficiently on CPU and optionally uses GPU if available.
Step 1: Install llama.cpp
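    # Clone and build
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make

    # Or use Homebrew on macOS
    brew install llama.cpp

Building from source took about 5 minutes. The Makefile handles platform-specific optimizations automatically. Note that recent llama.cpp releases have replaced the Makefile with CMake; if make fails, build with cmake -B build followed by cmake --build build --config Release, which puts the binaries in build/bin.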
Step 2: Download the GGUF Model
GGUF is a quantized format optimized for llama.cpp. I needed to download the 4-bit version:
    # Create models directory
    mkdir -p models

    # Download 4-bit quantized model
    wget https://huggingface.co/google/gemma-4-2b-it-gguf/resolve/main/gemma-4-2b-it-q4_k_m.gguf

The download was about 1.5GB. Much smaller than the 4GB full-precision version.
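A failed or redirected download can leave an HTML error page saved under the .gguf name. One quick sanity check is to confirm that the file begins with the four-byte GGUF magic. This is a minimal sketch; the function name is hypothetical, and it checks only the header, not the integrity of the whole file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with Path(path).open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC
```

Calling looks_like_gguf("gemma-4-2b-it-q4_k_m.gguf") before the first run catches the most common bad-download case early.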
Step 3: Run Inference
    # CPU-only inference
    ./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -p "Explain recursion in simple terms" -n 512

    # Interactive chat mode
    ./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -cnv

The -cnv flag enables conversation mode. The model remembers context between prompts.
I measured performance on my laptop:
    Model: Gemma 4 E2B Q4_K_M
    Hardware: MacBook Pro M1, 8GB RAM
    Tokens per second: 10-12
    First token latency: ~200ms
    Quality: Comparable to GPT-3.5 for simple tasks

10 tokens per second is usable for chat. Not blazing fast, but responsive enough.
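To see what these numbers mean in practice, a rough estimate of reply time is the first-token latency plus the token count divided by throughput. The sketch below uses the measurements above as defaults; it is a back-of-the-envelope model that ignores prompt length and assumes constant decoding speed.

```python
def estimated_response_seconds(new_tokens, tokens_per_second, first_token_latency=0.2):
    """Estimate wall-clock time for a reply: startup latency plus steady decoding."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return first_token_latency + new_tokens / tokens_per_second


# A full 512-token answer (the -n 512 limit used earlier) at 10 tokens/second
print(f"{estimated_response_seconds(512, 10):.1f} s")
```

Under these assumptions a maximum-length 512-token answer takes a little under a minute, while a short 100-token reply arrives in about ten seconds.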
Optional: GPU Acceleration
If you have a GPU, add the -ngl flag:
    # Offload 35 layers to GPU
    ./llama-cli -m gemma-4-2b-it-q4_k_m.gguf -ngl 35 -p "Your prompt"

GPU offloading dramatically improves speed. An RTX 4090 can hit 150+ tokens/second on larger models.
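When calling llama-cli from a script, it helps to assemble the flags in one place. The helper below is a hypothetical sketch that uses only the flags shown in this article (-m, -p, -n, -cnv, -ngl); the function name and defaults are my own, and the binary path is assumed to be ./llama-cli.

```python
def build_llama_cli_args(model_path, prompt=None, max_tokens=512, gpu_layers=0,
                         binary="./llama-cli"):
    """Assemble a llama-cli argument list from the flags used in this article."""
    args = [binary, "-m", model_path]
    if prompt is None:
        args.append("-cnv")  # no prompt given: start interactive conversation mode
    else:
        args += ["-p", prompt, "-n", str(max_tokens)]
    if gpu_layers > 0:
        args += ["-ngl", str(gpu_layers)]  # offload this many layers to the GPU
    return args
```

The returned list can be passed straight to subprocess.run(args, check=True). Building a list rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.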
Solution 3: Unsloth (For Python Integration)
I wanted to use Gemma 4 in my Python projects. Unsloth provides a clean Python API with optimized inference.
Step 1: Install Unsloth
    curl -fsSL https://unsloth.ai/install.sh | sh

This one-line installer sets up PyTorch, transformers, and all dependencies. It took about 2 minutes.
Step 2: Load and Run the Model
    from unsloth import FastLanguageModel

    # Load model with 4-bit quantization
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gemma-4-2b-it-bnb-4bit",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
    )

    # Enable faster inference
    FastLanguageModel.for_inference(model)

    # Generate text
    messages = [
        {"role": "user", "content": "Write a haiku about debugging"}
    ]

    # Apply the chat template and move the token tensor to the model's device
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=100,
        do_sample=True,   # sample instead of greedy decoding so temperature/top_p apply
        temperature=0.7,
        top_p=0.9,
    )

    # The decoded text includes the prompt followed by the model's reply
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(response)

Step 3: Integrate into Your Application
The Python API allows full customization:
    # Adjust generation parameters
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=256,       # Longer responses
        do_sample=True,           # Sample so the settings below take effect
        temperature=0.7,          # Randomness (lower is more deterministic)
        top_p=0.9,                # Nucleus sampling
        top_k=50,                 # Top-k sampling
        repetition_penalty=1.1,   # Reduce repetition
    )

Memory Requirements Reference
Choosing the right model depends on your hardware. Here’s a complete reference:
Model        | FP16  | Q8_0  | Q4_K_M | CPU Speed | GPU Speed
-------------|-------|-------|--------|-----------|----------
Gemma 4 E2B  | 4GB   | 2GB   | 1.2GB  | 10-15 t/s | 80+ t/s
Gemma 4 E4B  | 8GB   | 4GB   | 2.5GB  | 8-12 t/s  | 70+ t/s
Gemma 4 26B  | 52GB  | 26GB  | 15GB   | 2-5 t/s   | 40+ t/s
Gemma 4 31B  | 62GB  | 31GB  | 18GB   | 1-3 t/s   | 35+ t/s
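The Q4_K_M column can be turned into a quick lookup: pick the largest variant whose weights fit within a fraction of total RAM. This is a sketch under a stated assumption, namely that only half of system memory should go to the model so the OS, other apps, and the context cache have room. The 0.5 margin and the function name are my own choices, not measured values.

```python
# Q4_K_M memory figures (GB) copied from the table above, smallest first
Q4_K_M_SIZES_GB = [
    ("Gemma 4 E2B", 1.2),
    ("Gemma 4 E4B", 2.5),
    ("Gemma 4 26B", 15.0),
    ("Gemma 4 31B", 18.0),
]


def pick_model(total_ram_gb, usable_fraction=0.5):
    """Return the largest variant whose Q4_K_M weights fit in the RAM budget.

    usable_fraction is an assumed safety margin, not a measured value.
    Returns None when even the smallest variant does not fit.
    """
    budget = total_ram_gb * usable_fraction
    fitting = [name for name, size in Q4_K_M_SIZES_GB if size <= budget]
    return fitting[-1] if fitting else None
```

For example, pick_model(8) selects Gemma 4 E4B and pick_model(4) selects Gemma 4 E2B. Raising usable_fraction makes the choice more aggressive on machines that run little else.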
Recommendations:
- 4-8GB RAM: Use E2B Q4_K_M (1.2GB)
- 8-16GB RAM: Use E4B Q4_K_M (2.5GB)
- 16-32GB RAM: Use E4B Q8_0 or 26B Q4_K_M
- 32GB+ RAM: Use 26B or 31B Q4_K_M

Common Mistakes I Made
Mistake 1: Choosing the Wrong Model Size
I initially tried the 26B model. My 8GB RAM laptop crashed immediately.

    $ ./llama-cli -m gemma-4-26b-q4_k_m.gguf
    Error: insufficient memory
    Required: 15GB, Available: 6GB

The fix: Start small. E2B or E4B for hardware with less than 16GB RAM.
Mistake 2: Using Full-Precision Models
I downloaded the FP16 version first. It required 4GB for E2B, which left no room for other processes.
- FP16 (Full precision): 4GB RAM, best quality
- Q8_0 (8-bit quantized): 2GB RAM, 99% quality retention
- Q4_K_M (4-bit quantized): 1.2GB RAM, 95% quality retention

The quality loss from Q4_K_M is negligible for most use cases.

The fix: Always use quantized models (Q4_K_M or Q5_K_M) for local inference.
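The memory figures follow from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The sketch below reproduces the E2B numbers for a 2-billion-parameter model. The bits-per-weight values for the quantized formats are approximate averages that I am assuming (quantized formats store scale factors alongside the weights), and the estimate covers weights only, not the context cache.

```python
# Approximate bits stored per weight. The two quantized values are assumed
# averages for these formats, not figures taken from this article.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weight_memory_gb(num_params, quant):
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: {weight_memory_gb(2e9, quant):.1f} GB")
```

The results land close to the 4GB, 2GB, and 1.2GB figures listed above, which is a useful cross-check when sizing a model that is not in the table.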
Mistake 3: Not Using GPU When Available
I ran llama.cpp on my friend’s RTX 4090 system without the -ngl flag. Performance was identical to CPU.
    Without -ngl: 12 tokens/second (CPU only)
    With -ngl 35: 140 tokens/second (GPU accelerated)

10x improvement with one flag.

The fix: Always add -ngl 35 (or higher) if you have a GPU.
Mistake 4: Skipping Dependencies
I tried to build llama.cpp without CUDA on an NVIDIA system. The build succeeded but GPU offloading failed.
    # Check CUDA installation
    nvidia-smi

    # If missing, install CUDA toolkit
    # Ubuntu: sudo apt install nvidia-cuda-toolkit
    # macOS: Not needed (Metal is built-in)

The fix: Install platform-specific dependencies before building.
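The same check can be scripted before a build. This is a rough sketch: it treats macOS as a Metal system and uses the presence of nvidia-smi on the PATH as a heuristic for NVIDIA support. Finding nvidia-smi does not prove the CUDA toolkit itself is installed, and the function names are hypothetical.

```python
import platform
import shutil


def choose_backend(system, has_nvidia_tools):
    """Pick a llama.cpp acceleration backend from basic platform facts."""
    if system == "Darwin":
        return "metal"  # macOS: Metal needs no extra toolkit
    if has_nvidia_tools:
        return "cuda"   # NVIDIA tools found: a CUDA-enabled build is worth trying
    return "cpu"        # otherwise fall back to a CPU-only build


def detect_backend():
    """Inspect the current machine; nvidia-smi on PATH is only a heuristic."""
    return choose_backend(platform.system(), shutil.which("nvidia-smi") is not None)
```

Splitting the decision from the detection keeps the logic easy to test on a machine without a GPU.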
Why Run Locally Instead of Cloud APIs?
Running local LLMs has clear advantages:
Aspect         | Local (Gemma 4)      | Cloud (GPT-4)
---------------|----------------------|------------------------
Cost           | Free after setup     | $0.03/1K tokens
Privacy        | Complete             | Data sent to server
Offline        | Yes                  | No
Rate limits    | None                 | Yes
Customization  | Full control         | Limited
Speed          | 10-150 t/s           | Variable (API latency)
Quality        | Good for most tasks  | Excellent

For sensitive data (medical, legal, proprietary), local inference is essential. No data leaves your machine.
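The cost row can be made concrete with a little arithmetic. The sketch below uses the $0.03 per 1K tokens figure from the table and the 10 tokens/second CPU speed measured earlier; it assumes a flat price and ignores electricity and hardware costs on the local side.

```python
def api_cost_usd(tokens, price_per_1k_tokens=0.03):
    """Cost of processing `tokens` tokens at a flat per-1K-token price."""
    return tokens / 1000 * price_per_1k_tokens


def local_generation_hours(tokens, tokens_per_second):
    """Hours needed to generate `tokens` tokens locally at a steady speed."""
    return tokens / tokens_per_second / 3600


million = 1_000_000
print(f"API cost for 1M tokens: ${api_cost_usd(million):.2f}")
print(f"Local time at 10 t/s: {local_generation_hours(million, 10):.1f} hours")
```

Under these assumptions, a million tokens costs about $30 through the API and takes a bit over a day of continuous CPU generation locally, so the trade is money against time.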
Complete Setup Script
Here’s a bash script that sets up everything:
    #!/bin/bash
    set -e

    echo "=== Setting up Gemma 4 locally ==="

    # Method 1: Ollama (simplest); commands are printed for reference, not executed
    echo ""
    echo "Method 1: Ollama"
    echo "curl -fsSL https://ollama.ai/install.sh | sh"
    echo "ollama run gemma4:2b"

    # Method 2: llama.cpp (more control)
    echo ""
    echo "Method 2: llama.cpp"
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make

    # Download the model into llama.cpp/models, then return to llama.cpp
    mkdir -p models
    cd models
    wget https://huggingface.co/google/gemma-4-2b-it-gguf/resolve/main/gemma-4-2b-it-q4_k_m.gguf
    cd ..

    echo ""
    echo "=== Setup complete ==="
    echo "Run with: ./llama-cli -m models/gemma-4-2b-it-q4_k_m.gguf -cnv"

Summary
Running Gemma 4 locally is now accessible to anyone with a modern computer:
- Ollama - Two commands, simplest setup, good for beginners
- llama.cpp - More control, CPU-optimized, works on any hardware
- Unsloth - Python integration, full customization, best for developers
The key is choosing the right model size. For 4-8GB RAM systems, use E2B Q4_K_M. It runs at 10+ tokens/second with only 1.2GB memory.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Unsloth Installation Guide
- llama.cpp GitHub Repository
- Ollama Official Website
- Reddit: Running Gemma 4 Locally Discussion
- Google Gemma Model Card on HuggingFace
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!