How to Fine-Tune Google Gemma 4 Locally: A Complete LoRA/QLoRA Guide
The Problem
I tried to fine-tune Gemma 4 on my RTX 3060 with 12GB VRAM. Two minutes into training, I got this:
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.5 GiB
GPU 0 has a total capacity of 11.8 GiB
Already allocated: 9.2 GiB

Full fine-tuning a 4B parameter model requires updating billions of weights. That means loading the entire model in memory, storing gradients for each weight, and keeping optimizer states. Even a modest 4B model needs 40+ GB VRAM for full training.
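The 40+ GB figure follows from rough per-parameter arithmetic. The sketch below assumes bf16 weights and gradients (2 bytes each) plus two fp32 Adam moment buffers (8 bytes together). These byte counts are illustrative assumptions, not measurements, and activations are left out entirely.

```python
# Rough VRAM estimate for full fine-tuning: weights + gradients + Adam states.
# Assumed byte counts per parameter (illustrative, not measured):
#   bf16 weights = 2, bf16 gradients = 2, fp32 Adam moments m and v = 4 + 4.
# Activations, CUDA context, and fragmentation are NOT included.

BYTES_WEIGHTS = 2
BYTES_GRADS = 2
BYTES_ADAM_STATES = 8


def full_finetune_gb(num_params: float) -> float:
    """Approximate memory footprint in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = BYTES_WEIGHTS + BYTES_GRADS + BYTES_ADAM_STATES
    return num_params * bytes_per_param / 1e9


print(f"4B params: ~{full_finetune_gb(4e9):.0f} GB before activations")
```

Under these assumptions a 4B model already lands in the high 40s of GB, which is why a 12GB card runs out almost immediately.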
I thought local fine-tuning was only for people with A100s. Turns out I was wrong.
The Solution: LoRA and QLoRA
The key insight: you don’t need to update all parameters. LoRA (Low-Rank Adaptation) adds tiny trainable adapter matrices instead of modifying the full model.
How LoRA Works
Traditional fine-tuning:
  Weight matrix W (d x d, where d is the layer's hidden size)
  → Update all parameters
  → Massive memory

LoRA approach:
  Weight matrix W (d x d, unchanged)
  + Adapter A (d x r), where r = 16 (tiny!)
  + Adapter B (r x d)
  → Train only A and B
  → A small fraction of total parameters (usually well under 5%)
  → Often close to full fine-tuning quality

For a weight matrix W, the update becomes: W’ = W + (A x B). The rank r controls adapter size. With the configuration used in this guide, a rank of 16 means training roughly 0.2% of the original parameters.
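To make the shapes concrete, here is a toy version of the update W’ = W + (A x B) in plain Python. The sizes (d = 8, r = 2) and the choice to start B at zero are assumptions for illustration only; real layers are thousands of units wide.

```python
# Toy LoRA update W' = W + A x B using plain Python lists.
# Shapes follow the text: W is d x d, A is d x r, B is r x d.

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_merge(w, a, b):
    """Return W + A x B without modifying W."""
    delta = matmul(a, b)
    return [[w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d, r = 8, 2  # toy sizes, chosen only for readability
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weights
A = [[0.1] * r for _ in range(d)]  # trainable, d x r
B = [[0.0] * d for _ in range(r)]  # trainable, r x d, started at zero in this sketch

merged = lora_merge(W, A, B)
print("unchanged before training:", merged == W)
print("adapter values:", d * r + r * d, "vs frozen values:", d * d)
```

Because the adapters grow with d times r while W grows with d squared, the trainable share shrinks quickly as the layer gets wider.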
QLoRA goes further by loading the base model in 4-bit precision:
Full fine-tuning: 40+ GB VRAM
LoRA (16-bit): 12-15 GB VRAM
QLoRA (4-bit): 8-10 GB VRAM <-- Works on consumer GPUs!

Step-by-Step with Unsloth
Unsloth is an optimized training framework that advertises up to 2x faster QLoRA training with up to 70% less memory.
Step 1: Install Unsloth
# One-line installer (takes 1-2 minutes)
curl -fsSL https://unsloth.ai/install.sh | sh

# Or via pip for existing environments
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Step 2: Load Model with 4-bit Quantization
from unsloth import FastLanguageModel
import torch

# Load Gemma 4 E4B with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-4-4b-it-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,         # Auto-detect
    load_in_4bit = True,  # QLoRA mode - critical!
)
print(f"Model loaded! VRAM usage: ~8GB")

The load_in_4bit=True flag is the memory-saving magic. Without it, you’d need 15+ GB.
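The ~8GB in that print statement is hard-coded, so it is worth measuring what your own machine reports. A minimal sketch using PyTorch's CUDA memory counters follows; the helper names to_gib and vram_report are my own, and the function falls back to a message when PyTorch or a GPU is missing.

```python
# Report GPU memory if PyTorch with CUDA is available; otherwise say so.
# to_gib and vram_report are hypothetical helper names for this sketch.

def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / (1024 ** 3)


def vram_report() -> str:
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device available"
    allocated = to_gib(torch.cuda.memory_allocated())
    peak = to_gib(torch.cuda.max_memory_allocated())
    total = to_gib(torch.cuda.get_device_properties(0).total_memory)
    return f"allocated {allocated:.1f} GiB, peak {peak:.1f} GiB, total {total:.1f} GiB"


print(vram_report())
```

Calling vram_report() right after loading the model, and again after a few training steps, shows how much headroom is left on the card.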
Step 3: Configure LoRA Adapters
# Apply LoRA - only train 0.2% of parameters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank (try 8, 16, 32, 64)
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    lora_alpha = 16,   # Scaling (same as r works well)
    lora_dropout = 0,  # No dropout (faster training)
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Saves 30% more VRAM
    random_state = 3407,
)
print(f"LoRA applied! Trainable params: ~0.2% of total")

The use_gradient_checkpointing="unsloth" setting is crucial for consumer GPUs. It trades a small speed decrease for significant memory savings.
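One detail the lora_alpha comment glosses over: in the standard LoRA formulation the adapter update is multiplied by lora_alpha / r. The sketch below assumes that plain scaling (not a rank-stabilised variant) and shows why setting lora_alpha equal to r keeps the factor at 1 when you change the rank.

```python
# Effective LoRA scaling in the standard formulation: scale = lora_alpha / r.
# Assumption: plain LoRA scaling, not a rank-stabilised variant.

def lora_scale(lora_alpha: float, r: int) -> float:
    """Factor applied to the adapter update A x B."""
    if r <= 0:
        raise ValueError("rank must be positive")
    return lora_alpha / r


for rank in (8, 16, 32, 64):
    fixed = lora_scale(16, rank)      # alpha left at 16 while r changes
    matched = lora_scale(rank, rank)  # alpha set equal to r
    print(f"r={rank:>2}  alpha=16 -> {fixed:.2f}   alpha=r -> {matched:.2f}")
```

If you raise r but leave lora_alpha at 16, the update is scaled down, which can look like the higher rank "learning less". Moving both together avoids that confusion.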
Step 4: Prepare Your Dataset
from datasets import load_dataset

# Load Alpaca-style dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def format_prompt(example):
    """Format for instruction tuning"""
    if example["input"]:
        return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

dataset = dataset.map(lambda x: {"text": format_prompt(x)})
print(f"Dataset ready: {len(dataset)} samples")

For custom data, create a JSON file:
import json
from datasets import Dataset

# Your training examples
training_data = [
    {
        "instruction": "Write a Python function to reverse a string",
        "input": "",
        "output": "def reverse_string(s):\n    return s[::-1]"
    },
    # ... more examples
]

# Save and load
with open("my_data.json", "w") as f:
    json.dump(training_data, f, indent=2)

dataset = Dataset.from_json("my_data.json")

Step 5: Configure Training
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "./outputs",
    per_device_train_batch_size = 2,  # Small batch for limited VRAM
    gradient_accumulation_steps = 4,  # Simulates batch size of 8
    warmup_ratio = 0.1,
    num_train_epochs = 3,
    learning_rate = 2e-4,             # LoRA works well with 2e-4
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),  # Use BF16 on Ampere+ GPUs
    logging_steps = 10,
    optim = "adamw_8bit",             # 8-bit optimizer saves memory
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    report_to = "none",               # Disable wandb
)

The key settings for consumer GPUs:
- per_device_train_batch_size = 2: Small batch keeps memory low
- gradient_accumulation_steps = 4: Accumulate gradients to simulate larger batch
- optim = "adamw_8bit": 8-bit optimizer uses less memory than full Adam
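These settings interact, so it helps to work out what they imply before starting a long run. The sketch below computes the effective batch size, optimizer steps, and warmup steps for a 10K-sample dataset. The function name and the rounding are my own assumptions: the trainer's exact step count can differ by a step or so, and sequence packing reduces the number of training rows.

```python
import math

# Back-of-the-envelope training schedule from the settings above.
# training_schedule is a hypothetical helper, not part of any library.

def training_schedule(num_samples, per_device_batch, grad_accum,
                      epochs, warmup_ratio, num_gpus=1):
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = training_schedule(
    num_samples=10_000,  # dataset size used for the timing figures in this guide
    per_device_batch=2,
    grad_accum=4,
    epochs=3,
    warmup_ratio=0.1,
)
for key, value in plan.items():
    print(f"{key}: {value}")
```

With logging_steps = 10, a run of this size produces a few hundred loss readings, which is plenty to see whether training is converging.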
Step 6: Train
# Note: newer TRL releases expect dataset_text_field, max_seq_length, and
# packing on SFTConfig, and take processing_class instead of tokenizer.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = True,  # Pack short sequences together
    args = training_args,
)

# Start training
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Speed: {trainer_stats.metrics['train_samples_per_second']:.2f} samples/s")

On my RTX 3060 with 10K samples, training takes about 3 hours. An RTX 4090 would finish in 30-45 minutes.
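The train_samples_per_second metric is handy for planning other runs. As a rough sketch, assuming the 3 hours above covers all 3 epochs over 10K samples, you can back out the implied throughput and reuse it for a different dataset size. Both helper names are hypothetical, and real throughput varies with sequence length and packing.

```python
# Convert between throughput and wall-clock time for a fine-tuning run.
# Assumption: the 3-hour figure covers 3 epochs over 10K samples.

def implied_throughput(num_samples: int, epochs: int, hours: float) -> float:
    """Samples per second implied by a finished run."""
    return num_samples * epochs / (hours * 3600)


def estimated_hours(num_samples: int, epochs: int, samples_per_second: float) -> float:
    """Expected wall-clock hours at a given throughput."""
    return num_samples * epochs / samples_per_second / 3600


rate = implied_throughput(10_000, 3, 3.0)
print(f"implied throughput: {rate:.2f} samples/s")
print(f"2K samples, 3 epochs: ~{estimated_hours(2_000, 3, rate):.1f} hours")
```

Once you have a real samples/s number from your own GPU, plug it into estimated_hours instead of the implied one.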
Step 7: Save and Export
# Save LoRA adapters only (~50-100MB, portable)
model.save_pretrained("gemma4-lora")
tokenizer.save_pretrained("gemma4-lora")

# Merge LoRA with base model for standalone use
model.save_pretrained_merged(
    "gemma4-finetuned",
    tokenizer,
    save_method = "merged_16bit",
)

# Export to GGUF for llama.cpp inference
model.save_pretrained_gguf(
    "gemma4-gguf",
    tokenizer,
    quantization_method = "q4_k_m",
)

Saving LoRA separately is important. The adapter files are tiny (~50MB). You can swap them without reloading the base model.
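The size claim is easy to check on your own output. This small standard-library sketch sums the files in the adapter directory; dir_size_mb is a hypothetical helper, and the directory name is the one used in the save step.

```python
from pathlib import Path

# Sum the size of every file under a directory, in MB (1 MB = 1e6 bytes).
# dir_size_mb is a hypothetical helper for this sketch.

def dir_size_mb(path) -> float:
    root = Path(path)
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file()) / 1e6


adapter_dir = Path("gemma4-lora")  # directory name from the save step
if adapter_dir.is_dir():
    print(f"{adapter_dir}: {dir_size_mb(adapter_dir):.1f} MB")
else:
    print(f"{adapter_dir} not found; run the save step first")
```

Running the same helper on the merged and GGUF directories shows how much larger the standalone exports are than the adapters alone.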
Why LoRA Rank Matters
The rank r controls how much the model can adapt:
r = 8: Minimal adaptation, fastest training, 70% memory saved
  Best for: Style adjustments, tone changes
r = 16: Balanced (recommended default), 60% memory saved
  Best for: General domain adaptation, chatbots
r = 64: Maximum adaptation, slower training, 40% memory saved
  Best for: Complex domain specialization (medical, legal)
r = 128: Overkill for most use cases, minimal memory benefit

I started with r=8 for a style adjustment task. The model learned the new tone but struggled with domain-specific terminology. Switching to r=16 solved this without noticeable speed impact.
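To see how adapter size scales with rank, here is a quick count for a single projection layer. The 4096 x 4096 shape is a hypothetical example, not Gemma's actual dimensions, and adapter_params is my own helper name.

```python
# Adapter size as a function of rank for one projection of shape (d_in, d_out).
# The 4096 x 4096 shape is hypothetical, not Gemma's actual dimensions.

def adapter_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds a (d_in x r) matrix and an (r x d_out) matrix."""
    return r * (d_in + d_out)


D_IN = D_OUT = 4096
frozen = D_IN * D_OUT
for r in (8, 16, 64, 128):
    n = adapter_params(D_IN, D_OUT, r)
    print(f"r={r:<3} adapter params={n:>9,}  ({n / frozen:.2%} of the frozen matrix)")
```

Adapter parameters grow linearly with r, so doubling the rank doubles the adapter weights along with their gradients and optimizer states.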
Common Mistakes I Made
Mistake 1: Using full fine-tuning
WRONG:
  Load full model in 16-bit
  Update all parameters
  → OutOfMemoryError on 12GB GPU

RIGHT:
  Load model in 4-bit (load_in_4bit=True)
  Apply LoRA adapters
  Train only adapters
  → Works on 8GB GPU

Mistake 2: Wrong learning rate
LoRA training: Use 2e-4 to 5e-4
Full fine-tuning: Use 1e-5 to 2e-5

Using 1e-5 for LoRA → Training barely progresses
Using 5e-4 for full → Model diverges, garbage output

Mistake 3: Skipping gradient checkpointing
# Without gradient checkpointing
use_gradient_checkpointing = False
→ 12GB VRAM needed for 4B model
→ Runs out on RTX 3060 12GB with any sequence length

# With Unsloth's optimized checkpointing
use_gradient_checkpointing = "unsloth"
→ 8GB VRAM needed for 4B model
→ Fits comfortably on RTX 3060

Mistake 4: Poor dataset quality
500 high-quality, consistent examples
→ Good results after 2-3 epochs

5000 noisy, inconsistent examples
→ Poor results even after 10 epochs

Key: Clean data beats volume. Consistency in format matters.

Mistake 5: Over-training
Epoch 1-2: Training loss drops, model improves
Epoch 3: Training loss stable, validation loss stable
Epoch 4+: Training loss drops, validation loss RISES

Stop at epoch 3. Further training degrades generalization.

Hardware Requirements
Gemma 4 E2B + QLoRA: 4-5GB → GTX 1660, RTX 3050
Gemma 4 E4B + QLoRA: 8-10GB → RTX 3060 12GB, 4060 Ti
Gemma 4 E4B + LoRA: 12GB → RTX 4070, 3080
Gemma 4 27B + QLoRA: 20-24GB → RTX 3090, 4090

Training Speed (10K samples, E4B):
RTX 3060: 2-4 hours
RTX 4070: 1-2 hours
RTX 4090: 30-60 minutes

Using Your Fine-Tuned Model
from unsloth import FastLanguageModel

# Option A: load the merged model directly
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./gemma4-finetuned",
    max_seq_length = 2048,
)

# Option B: load base + LoRA adapters instead (use one option, not both)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-4-4b-it-bnb-4bit",
    max_seq_length = 2048,
)
model.load_adapter("./gemma4-lora")

# Enable fast inference
FastLanguageModel.for_inference(model)

# Generate
prompt = """### Instruction:
Explain LoRA fine-tuning in simple terms.

### Response:
"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample = True,   # temperature only takes effect when sampling
    temperature = 0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Complete Script
Here’s the full training script I use:
#!/usr/bin/env python3
"""
Fine-tune Gemma 4 E4B with QLoRA on consumer GPU
Requires: RTX 3060 12GB or similar
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# Configuration
MODEL_NAME = "unsloth/gemma-4-4b-it-bnb-4bit"
MAX_SEQ_LENGTH = 2048
LORA_RANK = 16
OUTPUT_DIR = "./gemma4-finetuned"

# Step 1: Load model
print("Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = None,
    load_in_4bit = True,
)

# Step 2: Apply LoRA
print("Configuring LoRA...")
model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = LORA_RANK,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Step 3: Load dataset
print("Loading dataset...")
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def format_prompt(example):
    if example["input"]:
        return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

dataset = dataset.map(lambda x: {"text": format_prompt(x)})

# Step 4: Training config
training_args = TrainingArguments(
    output_dir = OUTPUT_DIR,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_ratio = 0.1,
    num_train_epochs = 3,
    learning_rate = 2e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 10,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    report_to = "none",
)

# Step 5: Train
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = MAX_SEQ_LENGTH,
    dataset_num_proc = 2,
    packing = True,
    args = training_args,
)

print("Starting training...")
trainer_stats = trainer.train()

print(f"\nDone! Time: {trainer_stats.metrics['train_runtime']:.2f}s")

# Step 6: Save
model.save_pretrained("gemma4-lora")
tokenizer.save_pretrained("gemma4-lora")
model.save_pretrained_merged(OUTPUT_DIR, tokenizer, save_method = "merged_16bit")
model.save_pretrained_gguf("gemma4-gguf", tokenizer, quantization_method = "q4_k_m")

print("Saved: LoRA adapters, merged model, GGUF")

Summary
Fine-tuning Gemma 4 locally is possible with consumer GPUs. The key is using QLoRA (4-bit quantization + LoRA adapters) through Unsloth:
- Load model with load_in_4bit=True: reduces memory from 15GB to 8GB
- Apply LoRA adapters with rank 16: trains only 0.2% of parameters
- Enable use_gradient_checkpointing="unsloth": saves 30% more VRAM
- Use learning rate 2e-4 and 8-bit optimizer
- Train for 2-3 epochs, stop when validation loss rises
- Save LoRA adapters separately (~50MB) for easy swapping
With these techniques, I fine-tuned Gemma 4 on my RTX 3060 in 3 hours. The resulting model matches domain-specific needs without requiring cloud GPUs.