Skip to content

How to Fine-Tune Google Gemma 4 Locally: A Complete LoRA/QLoRA Guide

The Problem

I tried to fine-tune Gemma 4 on my RTX 3060 with 12GB VRAM. Two minutes into training, I got this:

Error Message
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.5 GiB
GPU 0 has a total capacity of 11.8 GiB
Already allocated: 9.2 GiB

Full fine-tuning a 4B parameter model requires updating billions of weights. That means loading the entire model in memory, storing gradients for each weight, and keeping optimizer states. Even a modest 4B model needs 40+ GB VRAM for full training.

I thought local fine-tuning was only for people with A100s. Turns out I was wrong.

The Solution: LoRA and QLoRA

The key insight: you don’t need to update all parameters. LoRA (Low-Rank Adaptation) adds tiny trainable adapter matrices instead of modifying the full model.

How LoRA Works

LoRA Concept
Traditional fine-tuning:
Weight matrix W (7B x 7B) → Update all parameters → Massive memory
LoRA approach:
Weight matrix W (unchanged)
+ Adapter A (7B x r) where r = 16 (tiny!)
+ Adapter B (r x 7B)
→ Train only A and B
→ 1-5% of total parameters
→ 95%+ of full fine-tuning quality

For a weight matrix W, the update becomes: W’ = W + (A x B). The rank r controls adapter size. A rank of 16 means training roughly 0.2% of original parameters.

QLoRA goes further by loading the base model in 4-bit precision:

QLoRA Memory Savings
Full fine-tuning: 40+ GB VRAM
LoRA (16-bit): 12-15 GB VRAM
QLoRA (4-bit): 8-10 GB VRAM <-- Works on consumer GPUs!

Step-by-Step with Unsloth

Unsloth is an optimized training framework that makes QLoRA training 2x faster with 70% less memory.

Step 1: Install Unsloth

Terminal
# One-line installer (takes 1-2 minutes)
curl -fsSL https://unsloth.ai/install.sh | sh
# Or via pip for existing environments
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Step 2: Load Model with 4-bit Quantization

load_model.py
from unsloth import FastLanguageModel
import torch
# Load Gemma 4 E4B with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/gemma-4-4b-it-bnb-4bit",
max_seq_length = 2048,
dtype = None, # Auto-detect
load_in_4bit = True, # QLoRA mode - critical!
)
print(f"Model loaded! VRAM usage: ~8GB")

The load_in_4bit=True flag is the memory-saving magic. Without it, you’d need 15+ GB.

Step 3: Configure LoRA Adapters

configure_lora.py
# Apply LoRA - only train 0.2% of parameters
model = FastLanguageModel.get_peft_model(
model,
r = 16, # LoRA rank (try 8, 16, 32, 64)
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj", # MLP
],
lora_alpha = 16, # Scaling (same as r works well)
lora_dropout = 0, # No dropout (faster training)
bias = "none",
use_gradient_checkpointing = "unsloth", # Saves 30% more VRAM
random_state = 3407,
)
print(f"LoRA applied! Trainable params: ~0.2% of total")

The use_gradient_checkpointing="unsloth" setting is crucial for consumer GPUs. It trades a small speed decrease for significant memory savings.

Step 4: Prepare Your Dataset

prepare_dataset.py
from datasets import load_dataset
# Load Alpaca-style dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
def format_prompt(example):
"""Format for instruction tuning"""
if example["input"]:
return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
else:
return f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
dataset = dataset.map(lambda x: {"text": format_prompt(x)})
print(f"Dataset ready: {len(dataset)} samples")

For custom data, create a JSON file:

custom_data.py
import json
from datasets import Dataset
# Your training examples
training_data = [
{
"instruction": "Write a Python function to reverse a string",
"input": "",
"output": "def reverse_string(s):\n return s[::-1]"
},
# ... more examples
]
# Save and load
with open("my_data.json", "w") as f:
json.dump(training_data, f, indent=2)
dataset = Dataset.from_json("my_data.json")

Step 5: Configure Training

training_config.py
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir = "./outputs",
per_device_train_batch_size = 2, # Small batch for limited VRAM
gradient_accumulation_steps = 4, # Simulates batch size of 8
warmup_ratio = 0.1,
num_train_epochs = 3,
learning_rate = 2e-4, # LoRA works well with 2e-4
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(), # Use BF16 on Ampere+ GPUs
logging_steps = 10,
optim = "adamw_8bit", # 8-bit optimizer saves memory
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none", # Disable wandb
)

The key settings for consumer GPUs:

  • per_device_train_batch_size = 2: Small batch keeps memory low
  • gradient_accumulation_steps = 4: Accumulate gradients to simulate larger batch
  • optim = "adamw_8bit": 8-bit optimizer uses less memory than full Adam

Step 6: Train

train.py
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = 2048,
dataset_num_proc = 2,
packing = True, # Pack short sequences together
args = training_args,
)
# Start training
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Speed: {trainer_stats.metrics['train_samples_per_second']:.2f} samples/s")

On my RTX 3060 with 10K samples, training takes about 3 hours. An RTX 4090 would finish in 30-45 minutes.

Step 7: Save and Export

save_model.py
# Save LoRA adapters only (~50-100MB, portable)
model.save_pretrained("gemma4-lora")
tokenizer.save_pretrained("gemma4-lora")
# Merge LoRA with base model for standalone use
model.save_pretrained_merged(
"gemma4-finetuned",
tokenizer,
save_method = "merged_16bit",
)
# Export to GGUF for llama.cpp inference
model.save_pretrained_gguf(
"gemma4-gguf",
tokenizer,
quantization_method = "q4_k_m",
)

Saving LoRA separately is important. The adapter files are tiny (~50MB). You can swap them without reloading the base model.

Why LoRA Rank Matters

The rank r controls how much the model can adapt:

LoRA Rank Comparison
r = 8: Minimal adaptation, fastest training, 70% memory saved
Best for: Style adjustments, tone changes
r = 16: Balanced (recommended default), 60% memory saved
Best for: General domain adaptation, chatbots
r = 64: Maximum adaptation, slower training, 40% memory saved
Best for: Complex domain specialization (medical, legal)
r = 128: Overkill for most use cases, minimal memory benefit

I started with r=8 for a style adjustment task. The model learned the new tone but struggled with domain-specific terminology. Switching to r=16 solved this without noticeable speed impact.

Common Mistakes I Made

Mistake 1: Using full fine-tuning

Wrong vs Right
WRONG:
Load full model in 16-bit
Update all parameters
→ OutOfMemoryError on 12GB GPU
RIGHT:
Load model in 4-bit (load_in_4bit=True)
Apply LoRA adapters
Train only adapters
→ Works on 8GB GPU

Mistake 2: Wrong learning rate

Learning Rate Guide
LoRA training: Use 2e-4 to 5e-4
Full fine-tuning: Use 1e-5 to 2e-5
Using 1e-5 for LoRA → Training barely progresses
Using 5e-4 for full → Model diverges, garbage output

Mistake 3: Skipping gradient checkpointing

Memory Settings
# Without gradient checkpointing
use_gradient_checkpointing = False
12GB VRAM needed for 4B model
→ Runs out on RTX 3060 12GB with any sequence length
# With Unsloth's optimized checkpointing
use_gradient_checkpointing = "unsloth"
8GB VRAM needed for 4B model
→ Fits comfortably on RTX 3060

Mistake 4: Poor dataset quality

Dataset Quality Impact
500 high-quality, consistent examples
→ Good results after 2-3 epochs
5000 noisy, inconsistent examples
→ Poor results even after 10 epochs
Key: Clean data beats volume. Consistency in format matters.

Mistake 5: Over-training

Overfitting Signs
Epoch 1-2: Training loss drops, model improves
Epoch 3: Training loss stable, validation loss stable
Epoch 4+: Training loss drops, validation loss RISES
Stop at epoch 3. Further training degrades generalization.

Hardware Requirements

VRAM Requirements by Model
Gemma 4 E2B + QLoRA: 4-5GB → GTX 1660, RTX 3050
Gemma 4 E4B + QLoRA: 8-10GB → RTX 3060 12GB, 4060 Ti
Gemma 4 E4B + LoRA: 12GB → RTX 4070, 3080
Gemma 4 27B + QLoRA: 20-24GB → RTX 3090, 4090
Training Speed (10K samples, E4B):
RTX 3060: 2-4 hours
RTX 4070: 1-2 hours
RTX 4090: 30-60 minutes

Using Your Fine-Tuned Model

inference.py
from unsloth import FastLanguageModel
# Load merged model directly
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "./gemma4-finetuned",
max_seq_length = 2048,
)
# Or load base + LoRA adapters
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/gemma-4-4b-it-bnb-4bit",
max_seq_length = 2048,
)
model.load_adapter("./gemma4-lora")
# Enable fast inference
FastLanguageModel.for_inference(model)
# Generate
prompt = """### Instruction:
Explain LoRA fine-tuning in simple terms.
### Response:
"""
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens = 256,
temperature = 0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Complete Script

Here’s the full training script I use:

finetune_gemma4.py
#!/usr/bin/env python3
"""
Fine-tune Gemma 4 E4B with QLoRA on consumer GPU
Requires: RTX 3060 12GB or similar
"""
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch
# Configuration
MODEL_NAME = "unsloth/gemma-4-4b-it-bnb-4bit"
MAX_SEQ_LENGTH = 2048
LORA_RANK = 16
OUTPUT_DIR = "./gemma4-finetuned"
# Step 1: Load model
print("Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = MODEL_NAME,
max_seq_length = MAX_SEQ_LENGTH,
dtype = None,
load_in_4bit = True,
)
# Step 2: Apply LoRA
print("Configuring LoRA...")
model = FastLanguageModel.get_peft_model(
model,
r = LORA_RANK,
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = LORA_RANK,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 3407,
)
# Step 3: Load dataset
print("Loading dataset...")
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
def format_prompt(example):
if example["input"]:
return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
return f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
dataset = dataset.map(lambda x: {"text": format_prompt(x)})
# Step 4: Training config
training_args = TrainingArguments(
output_dir = OUTPUT_DIR,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_ratio = 0.1,
num_train_epochs = 3,
learning_rate = 2e-4,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 10,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none",
)
# Step 5: Train
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = MAX_SEQ_LENGTH,
dataset_num_proc = 2,
packing = True,
args = training_args,
)
print("Starting training...")
trainer_stats = trainer.train()
print(f"\nDone! Time: {trainer_stats.metrics['train_runtime']:.2f}s")
# Step 6: Save
model.save_pretrained("gemma4-lora")
tokenizer.save_pretrained("gemma4-lora")
model.save_pretrained_merged(OUTPUT_DIR, tokenizer, save_method = "merged_16bit")
model.save_pretrained_gguf("gemma4-gguf", tokenizer, quantization_method = "q4_k_m")
print("Saved: LoRA adapters, merged model, GGUF")

Summary

Fine-tuning Gemma 4 locally is possible with consumer GPUs. The key is using QLoRA (4-bit quantization + LoRA adapters) through Unsloth:

  1. Load model with load_in_4bit=True - reduces memory from 15GB to 8GB
  2. Apply LoRA adapters with rank 16 - trains only 0.2% of parameters
  3. Enable use_gradient_checkpointing="unsloth" - saves 30% more VRAM
  4. Use learning rate 2e-4 and 8-bit optimizer
  5. Train for 2-3 epochs, stop when validation loss rises
  6. Save LoRA adapters separately (~50MB) for easy swapping

With these techniques, I fine-tuned Gemma 4 on my RTX 3060 in 3 hours. The resulting model matches domain-specific needs without requiring cloud GPUs.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments