How to Fine-Tune Google Gemma 4 Locally: A Complete LoRA/QLoRA Guide

Apr 3, 2026

The Problem

I tried to fine-tune Gemma 4 on my RTX 3060 with 12GB VRAM. Two minutes into training, I got this:

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.5 GiB
GPU 0 has a total capacity of 11.8 GiB
Already allocated: 9.2 GiB

Full fine-tuning a 4B parameter model requires updating billions of weights. That means loading the entire model in memory, storing gradients for each weight, and keeping optimizer states. Even a modest 4B model needs 40+ GB VRAM for full training.
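
As a rough sanity check on that figure, here is a back-of-the-envelope sketch. The byte counts are assumptions for a common mixed-precision Adam setup (16-bit weights and gradients, two 32-bit optimizer moments per parameter), not measurements, and activations are ignored entirely.

```python
def full_finetune_gib(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Rough VRAM estimate (GiB) for full fine-tuning, ignoring activations.

    Assumed defaults: 16-bit weights and gradients, plus two 32-bit Adam
    moments per parameter (8 bytes). Real setups vary.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 1024**3


estimate = full_finetune_gib(4e9)  # 4e9 is a round figure for a "4B" model
print(f"4B parameters: ~{estimate:.0f} GiB before activations")
```

Even under these optimistic assumptions the total lands well above what a 12GB card can hold.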

I thought local fine-tuning was only for people with A100s. Turns out I was wrong.

The Solution: LoRA and QLoRA

The key insight: you don’t need to update all parameters. LoRA (Low-Rank Adaptation) adds tiny trainable adapter matrices instead of modifying the full model.

How LoRA Works

Traditional fine-tuning:
  Weight matrix W (d x k) → Update all d x k values → Massive memory

LoRA approach:
  Weight matrix W (d x k, frozen)
  + Adapter A (d x r)    where r = 16 (tiny!)
  + Adapter B (r x k)
  → Train only A and B
  → Typically under 1% of total parameters
  → Often close to full fine-tuning quality

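For a weight matrix W of shape d x k, the update becomes: W’ = W + (A x B), where A is d x r and B is r x k. The rank r controls adapter size: each adapted matrix gains r x (d + k) trainable values instead of d x k. A rank of 16 means training roughly 0.2% of the original parameters.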
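
To make the size difference concrete, here is a minimal sketch that counts adapter values for a single matrix. The 4096 x 4096 shape is a hypothetical example, not a real Gemma layer size, and the helper names are mine.

```python
def lora_param_count(d, k, r):
    """Trainable values added by one LoRA adapter pair: A is d x r, B is r x k."""
    return r * (d + k)


def lora_fraction(d, k, r):
    """Adapter size relative to the frozen d x k weight matrix."""
    return lora_param_count(d, k, r) / (d * k)


# Hypothetical 4096 x 4096 projection; real layer shapes depend on the model.
d, k = 4096, 4096
for r in (8, 16, 64):
    print(f"r={r:>3}: {lora_param_count(d, k, r):>9,} trainable values "
          f"({lora_fraction(d, k, r):.2%} of this matrix)")
```

The share of the whole model is lower than the per-matrix share, because layers that get no adapter (embeddings, for example) still count toward the total.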

QLoRA goes further by loading the base model in 4-bit precision:

Full fine-tuning:    40+ GB VRAM
LoRA (16-bit):       12-15 GB VRAM
QLoRA (4-bit):       8-10 GB VRAM  <-- Works on consumer GPUs!
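
A quick sketch of where the savings come from, counting only the storage for the base weights. The 4e9 parameter count is an assumed round figure. The totals above are higher because training also needs activations, adapter gradients, optimizer state, and framework overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 4e9  # assumed round figure for a "4B" model
for label, bits in (("16-bit", 16), ("4-bit", 4)):
    print(f"{label}: ~{weight_memory_gb(n_params, bits):.0f} GB of weights")
```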

Step-by-Step with Unsloth

Unsloth is an optimized training framework that, according to its developers, makes QLoRA training about 2x faster with up to 70% less memory.

Step 1: Install Unsloth

# One-line installer (takes 1-2 minutes)
curl -fsSL https://unsloth.ai/install.sh | sh

# Or via pip for existing environments
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Step 2: Load Model with 4-bit Quantization

from unsloth import FastLanguageModel
import torch

# Load Gemma 4 E4B with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-4-4b-it-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,               # Auto-detect
    load_in_4bit = True,        # QLoRA mode - critical!
)
print(f"Model loaded! VRAM allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")

The load_in_4bit=True flag is the memory-saving magic. Without it, you’d need 15+ GB.

Step 3: Configure LoRA Adapters

# Apply LoRA - only train 0.2% of parameters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                     # LoRA rank (try 8, 16, 32, 64)
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    lora_alpha = 16,            # Scaling (same as r works well)
    lora_dropout = 0,           # No dropout (faster training)
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Saves 30% more VRAM
    random_state = 3407,
)
model.print_trainable_parameters()  # Reports trainable vs. total parameter counts

The use_gradient_checkpointing="unsloth" setting is crucial for consumer GPUs. It trades a small speed decrease for significant memory savings.

Step 4: Prepare Your Dataset

from datasets import load_dataset

# Load Alpaca-style dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def format_prompt(example):
    """Format for instruction tuning"""
    if example["input"]:
        return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

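# Append the EOS token (tokenizer from Step 2) so the model learns where a response ends
dataset = dataset.map(lambda x: {"text": format_prompt(x) + tokenizer.eos_token})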
print(f"Dataset ready: {len(dataset)} samples")

For custom data, create a JSON file:

import json
from datasets import Dataset

# Your training examples
training_data = [
    {
        "instruction": "Write a Python function to reverse a string",
        "input": "",
        "output": "def reverse_string(s):\n    return s[::-1]"
    },
    # ... more examples
]

# Save and load
with open("my_data.json", "w") as f:
    json.dump(training_data, f, indent=2)

dataset = Dataset.from_json("my_data.json")
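
Before training on hand-written data, it helps to check that every record has the fields format_prompt expects. This is a minimal sketch: the helper name and the rules it enforces are my own assumptions, not part of Unsloth or the datasets library.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_bad_records(records):
    """Return (index, reason) pairs for records that would break format_prompt."""
    problems = []
    for i, rec in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing:
            problems.append((i, f"missing keys: {missing}"))
        elif not str(rec["instruction"]).strip() or not str(rec["output"]).strip():
            problems.append((i, "empty instruction or output"))
    return problems


sample = [
    {"instruction": "Reverse a string", "input": "", "output": "s[::-1]"},
    {"instruction": "Say hi", "output": "hi"},                       # no "input" key
    {"instruction": "Explain LoRA", "input": "", "output": "   "},   # blank output
]
print(find_bad_records(sample))
```

An empty "input" is allowed here because format_prompt already handles that case.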

Step 5: Configure Training

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "./outputs",
    per_device_train_batch_size = 2,      # Small batch for limited VRAM
    gradient_accumulation_steps = 4,      # Simulates batch size of 8
    warmup_ratio = 0.1,
    num_train_epochs = 3,
    learning_rate = 2e-4,                 # LoRA works well with 2e-4
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(), # Use BF16 on Ampere+ GPUs
    logging_steps = 10,
    optim = "adamw_8bit",                 # 8-bit optimizer saves memory
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    report_to = "none",                   # Disable wandb
)

The key settings for consumer GPUs:

per_device_train_batch_size = 2: Small batch keeps memory low
gradient_accumulation_steps = 4: Accumulate gradients to simulate larger batch
optim = "adamw_8bit": 8-bit optimizer uses less memory than full Adam
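
Here is a small sketch of how these settings turn into step counts. It is approximate: the Trainer's own rounding can differ slightly, and with packing enabled the number of packed sequences is lower than the raw sample count. The function name and the 10,000-sample figure are just for illustration.

```python
import math


def training_schedule(n_samples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, n_gpus=1):
    """Approximate effective batch size and step counts for a training run."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    n_samples=10_000, per_device_batch=2, grad_accum=4, epochs=3, warmup_ratio=0.1
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Halving the per-device batch and doubling the accumulation steps keeps the effective batch the same while lowering peak memory, at the cost of more forward passes per update.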

Step 6: Train

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = True,                       # Pack short sequences together
    args = training_args,
)

# Start training
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Speed: {trainer_stats.metrics['train_samples_per_second']:.2f} samples/s")

On my RTX 3060 with 10K samples, training takes about 3 hours. An RTX 4090 would finish in 30-45 minutes.

Step 7: Save and Export

# Save LoRA adapters only (~50-100MB, portable)
model.save_pretrained("gemma4-lora")
tokenizer.save_pretrained("gemma4-lora")

# Merge LoRA with base model for standalone use
model.save_pretrained_merged(
    "gemma4-finetuned",
    tokenizer,
    save_method = "merged_16bit",
)

# Export to GGUF for llama.cpp inference
model.save_pretrained_gguf(
    "gemma4-gguf",
    tokenizer,
    quantization_method = "q4_k_m",
)

Saving the LoRA adapters separately is worth doing. The adapter files are tiny (~50-100MB), and you can swap them without reloading the base model.

Why LoRA Rank Matters

The rank r controls how much the model can adapt:

r = 8:   Minimal adaptation, fastest training, smallest adapter files
         Best for: Style adjustments, tone changes

r = 16:  Balanced (recommended default)
         Best for: General domain adaptation, chatbots

r = 64:  Maximum adaptation, slower training, larger adapters and optimizer state
         Best for: Complex domain specialization (medical, legal)

r = 128: Overkill for most use cases

I started with r=8 for a style adjustment task. The model learned the new tone but struggled with domain-specific terminology. Switching to r=16 solved this without noticeable speed impact.

Common Mistakes I Made

Mistake 1: Using full fine-tuning

WRONG:
  Load full model in 16-bit
  Update all parameters
  → OutOfMemoryError on 12GB GPU

RIGHT:
  Load model in 4-bit (load_in_4bit=True)
  Apply LoRA adapters
  Train only adapters
  → Works on 8GB GPU

Mistake 2: Wrong learning rate

LoRA training:     Use 2e-4 to 5e-4
Full fine-tuning:  Use 1e-5 to 2e-5

Using 1e-5 for LoRA → Training barely progresses
Using 5e-4 for full  → Model diverges, garbage output

Mistake 3: Skipping gradient checkpointing

# Without gradient checkpointing
use_gradient_checkpointing = False
→ 12GB VRAM needed for 4B model
→ Runs out on RTX 3060 12GB with any sequence length

# With Unsloth's optimized checkpointing
use_gradient_checkpointing = "unsloth"
→ 8GB VRAM needed for 4B model
→ Fits comfortably on RTX 3060

Mistake 4: Poor dataset quality

500 high-quality, consistent examples
→ Good results after 2-3 epochs

5000 noisy, inconsistent examples
→ Poor results even after 10 epochs

Key: Clean data beats volume. Consistency in format matters.
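
A simple first pass at cleaning is to drop empty outputs and repeated prompts. The sketch below is one possible approach and the normalization rule (lowercase, collapsed whitespace) is my assumption. It will not catch paraphrased duplicates or inconsistent answer styles.

```python
def clean_examples(records):
    """Drop records with blank outputs and prompts already seen.

    Prompts are compared after lowercasing and collapsing whitespace.
    """
    seen = set()
    cleaned = []
    for rec in records:
        key = (
            " ".join(rec["instruction"].lower().split()),
            " ".join(rec.get("input", "").lower().split()),
        )
        if not rec["output"].strip() or key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


records = [
    {"instruction": "Reverse a string", "input": "", "output": "s[::-1]"},
    {"instruction": "reverse  a string ", "input": "", "output": "return s[::-1]"},
    {"instruction": "Explain LoRA", "input": "", "output": ""},
]
print(f"Kept {len(clean_examples(records))} of {len(records)} examples")
```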

Mistake 5: Over-training

Epoch 1-2:  Training loss drops, model improves
Epoch 3:    Training loss stable, validation loss stable
Epoch 4+:   Training loss drops, validation loss RISES

Stop at epoch 3. Further training degrades generalization.
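
The stopping rule can be sketched in plain Python. The loss values below are made up for illustration, and the function is my own helper. Note that the training configuration in this guide does not set up an evaluation split, so you would need to hold one out to get validation losses at all. In a real run, the EarlyStoppingCallback from transformers applies the same idea during training.

```python
def best_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the 1-based epoch with the lowest validation loss seen before
    `patience` consecutive epochs fail to improve on it (0 if no losses)."""
    best, best_idx, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_idx, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_idx


# Made-up validation losses, one per epoch, for illustration only
losses = [1.42, 1.18, 1.17, 1.23, 1.31]
print(f"Keep the checkpoint from epoch {best_epoch(losses)}")
```

A larger patience value tolerates a noisy epoch or two before giving up.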

Hardware Requirements

Gemma 4 E2B + QLoRA:   4-5GB  → GTX 1660, RTX 3050
Gemma 4 E4B + QLoRA:   8-10GB → RTX 3060 12GB, 4060 Ti
Gemma 4 E4B + LoRA:    12GB   → RTX 4070, 3080
Gemma 4 27B + QLoRA:   20-24GB → RTX 3090, 4090

Training Speed (10K samples, E4B):
RTX 3060:   2-4 hours
RTX 4070:   1-2 hours
RTX 4090:   30-60 minutes

Using Your Fine-Tuned Model

from unsloth import FastLanguageModel

# Option A: load the merged model directly
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./gemma4-finetuned",
    max_seq_length = 2048,
)

# Option B: load base + LoRA adapters.
# Use this instead of Option A, not in addition to it:
# loading both puts two models in VRAM.
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "unsloth/gemma-4-4b-it-bnb-4bit",
#     max_seq_length = 2048,
# )
# model.load_adapter("./gemma4-lora")

# Enable fast inference
FastLanguageModel.for_inference(model)

# Generate
prompt = """### Instruction:
Explain LoRA fine-tuning in simple terms.

### Response:
"""

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens = 256,
    do_sample = True,           # Required for temperature to take effect
    temperature = 0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Complete Script

Here’s the full training script I use:

#!/usr/bin/env python3
"""
Fine-tune Gemma 4 E4B with QLoRA on consumer GPU
Requires: RTX 3060 12GB or similar
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# Configuration
MODEL_NAME = "unsloth/gemma-4-4b-it-bnb-4bit"
MAX_SEQ_LENGTH = 2048
LORA_RANK = 16
OUTPUT_DIR = "./gemma4-finetuned"

# Step 1: Load model
print("Loading model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = None,
    load_in_4bit = True,
)

# Step 2: Apply LoRA
print("Configuring LoRA...")
model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = LORA_RANK,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Step 3: Load dataset
print("Loading dataset...")
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def format_prompt(example):
    if example["input"]:
        return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

# Append the EOS token so the model learns where a response ends
dataset = dataset.map(lambda x: {"text": format_prompt(x) + tokenizer.eos_token})

# Step 4: Training config
training_args = TrainingArguments(
    output_dir = "./outputs",  # Checkpoints; the merged model is saved to OUTPUT_DIR
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_ratio = 0.1,
    num_train_epochs = 3,
    learning_rate = 2e-4,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 10,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 3407,
    report_to = "none",
)

# Step 5: Train
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = MAX_SEQ_LENGTH,
    dataset_num_proc = 2,
    packing = True,
    args = training_args,
)

print("Starting training...")
trainer_stats = trainer.train()

print(f"\nDone! Time: {trainer_stats.metrics['train_runtime']:.2f}s")

# Step 6: Save
model.save_pretrained("gemma4-lora")
tokenizer.save_pretrained("gemma4-lora")
model.save_pretrained_merged(OUTPUT_DIR, tokenizer, save_method = "merged_16bit")
model.save_pretrained_gguf("gemma4-gguf", tokenizer, quantization_method = "q4_k_m")

print("Saved: LoRA adapters, merged model, GGUF")

Summary

Fine-tuning Gemma 4 locally is possible with consumer GPUs. The key is using QLoRA (4-bit quantization + LoRA adapters) through Unsloth:

Load model with load_in_4bit=True - reduces memory from 15GB to 8GB
Apply LoRA adapters with rank 16 - trains only 0.2% of parameters
Enable use_gradient_checkpointing="unsloth" - saves 30% more VRAM
Use learning rate 2e-4 and 8-bit optimizer
Train for 2-3 epochs, stop when validation loss rises
Save LoRA adapters separately (~50MB) for easy swapping

With these techniques, I fine-tuned Gemma 4 on my RTX 3060 in 3 hours. The resulting model matches domain-specific needs without requiring cloud GPUs.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Unsloth Official Documentation
👨‍💻 Reddit: Run Gemma 4 Locally Discussion
👨‍💻 HuggingFace LoRA Documentation
👨‍💻 QLoRA Paper

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!