
rsLoRA vs LoRA: Why My Fine-Tuning Kept Crashing at Higher Ranks

I stared at the training loss graph. It oscillated wildly, then flatlined.

Training loss output
Step 100: loss=2.34
Step 200: loss=1.87
Step 300: loss=1.52
Step 400: loss=NaN <-- What?!

I had increased the LoRA rank from 16 to 64 hoping for better fine-tuning quality. Instead, my training crashed. After hours of debugging, I discovered the problem wasn’t my data or hyperparameters. It was standard LoRA itself.
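
A crash like the one in that log is easier to catch if each loss value is checked as it arrives. Here is a minimal sketch, assuming the logged losses are available as plain Python floats. `first_non_finite_step` is a hypothetical helper written for this post, not part of any training library.

```python
import math

def first_non_finite_step(losses, log_every=100):
    """Return the step of the first NaN/inf loss, or None if all are finite.

    `losses` holds one float per logging interval; `log_every` is the
    (assumed) number of steps between log lines, as in the log above.
    """
    for index, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            return index * log_every
    return None

logged = [2.34, 1.87, 1.52, float("nan")]
print(first_non_finite_step(logged))  # 400
```

Stopping the run at that step saves GPU time and leaves the last healthy checkpoint easy to identify.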

That’s when I found rsLoRA (rank-stabilized LoRA). Here’s what I learned about why standard LoRA fails at higher ranks and how rsLoRA fixes it.

The Problem: LoRA Gets Unstable at High Ranks

LoRA (Low-Rank Adaptation) is a popular method for fine-tuning large language models efficiently. Instead of updating all model weights, LoRA adds small trainable matrices with a low rank r.

The standard LoRA formula applies a scaling factor:

Standard LoRA formula
output = Wx + (BA)x * (alpha / r)
Where:
- W = original frozen weights
- B, A = low-rank trainable matrices
- alpha = scaling hyperparameter
- r = rank
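
To make the formula concrete, here is a small pure-Python sketch of the forward pass using lists instead of tensors. The helper names (`matvec`, `lora_forward`) and the tiny matrices are made up for illustration. Real implementations work on batched tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute Wx + (alpha / r) * B(Ax), with r taken from A's row count."""
    r = len(A)                         # A is (r x d_in), B is (d_out x r)
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # r = 1, d_in = 2
B = [[0.5], [0.0]]             # d_out = 2, r = 1
x = [2.0, 3.0]
print(lora_forward(W, A, B, x, alpha=2.0))  # [7.0, 3.0]
```

If B is all zeros, the result is just Wx, so an adapter with a zeroed B leaves the base model's output unchanged.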

Here’s the problem: with alpha held fixed, the scaling factor alpha / r shrinks in inverse proportion to r. Quadruple the rank and the adapter’s contribution is scaled down by a factor of four. This causes issues:

  1. Vanishing gradients at higher ranks
  2. Training instability as rank increases
  3. Hyperparameter sensitivity - small changes cause big effects

I thought increasing rank would help my model learn more complex patterns. Instead, the shrinking scaling factor made training unstable.

The Solution: rsLoRA Changes the Scaling

rsLoRA modifies the scaling formula in a simple but powerful way:

rsLoRA vs standard LoRA scaling
Standard LoRA: scale = alpha / r
rsLoRA: scale = alpha / sqrt(r)

By dividing by the square root of rank instead of rank itself, rsLoRA maintains gradient stability across different rank sizes.
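
The two rules differ by a single line. A minimal sketch of that arithmetic follows. `lora_scale` is an illustrative function name, not a PEFT API.

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update BAx."""
    if use_rslora:
        return alpha / math.sqrt(r)   # rank-stabilized: alpha / sqrt(r)
    return alpha / r                  # standard: alpha / r

print(lora_scale(32, 16))                   # 2.0
print(lora_scale(32, 16, use_rslora=True))  # 8.0
```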

Why does this matter?

Scaling comparison by rank
Rank 16:
  Standard LoRA: alpha/16 = 0.0625 * alpha
  rsLoRA: alpha/sqrt(16) = 0.25 * alpha
Rank 64:
  Standard LoRA: alpha/64 = 0.0156 * alpha (much smaller!)
  rsLoRA: alpha/sqrt(64) = 0.125 * alpha (still reasonable)
Rank 128:
  Standard LoRA: alpha/128 = 0.0078 * alpha (tiny!)
  rsLoRA: alpha/sqrt(128) = 0.088 * alpha (stable)

As rank increases, standard LoRA’s scaling shrinks dramatically. rsLoRA’s scaling decreases much more slowly, keeping training stable.
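
One way to read these numbers: at a given rank, rsLoRA with some alpha applies the same multiplier as standard LoRA with alpha multiplied by sqrt(r). The sketch below only checks that arithmetic. `equivalent_standard_alpha` is a hypothetical helper name.

```python
import math

def equivalent_standard_alpha(alpha, r):
    """Alpha that standard LoRA would need to match rsLoRA's alpha / sqrt(r)."""
    return alpha * math.sqrt(r)

alpha = 16
for r in (16, 64):
    rslora_multiplier = alpha / math.sqrt(r)
    standard_multiplier = equivalent_standard_alpha(alpha, r) / r
    print(r, rslora_multiplier, standard_multiplier)
```

So the two schemes can be made to agree at any single rank. The difference shows up when you change the rank and keep alpha where it was.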

Code Comparison: Standard LoRA vs rsLoRA

Here’s how to configure both in HuggingFace PEFT:

Standard LoRA Configuration

standard_lora.py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Standard LoRA configuration
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,  # scaling hyperparameter; effective scale is alpha / r = 2
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# Training parameters for standard LoRA at rank 16
# learning_rate: 2e-4 works well
# At higher ranks (64+), you'd need to adjust learning rate

rsLoRA Configuration

rslora.py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# rsLoRA configuration (requires PEFT >= 0.7.0)
lora_config = LoraConfig(
    r=64,  # Can use higher rank now!
    lora_alpha=64,  # With rsLoRA, I set alpha equal to rank
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_rslora=True,  # This is the key parameter!
)
model = get_peft_model(base_model, lora_config)

# With rsLoRA, training stays stable even at rank 64
# The rank-16 learning rate is a reasonable starting point at other ranks

The key difference is use_rslora=True. That one parameter changes the scaling behavior.

Practical Example: Fine-Tuning with Unsloth + rsLoRA

I use Unsloth for memory-efficient training on consumer GPUs. Here’s a complete setup:

unsloth_rslora.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model with 4-bit quantization for memory efficiency
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-2-7b",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Add rsLoRA adapters with higher rank
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # Higher rank enabled by rsLoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=32,  # Alpha = rank for simplicity
    lora_dropout=0.05,
    use_rslora=True,  # Rank-stabilized scaling
    use_gradient_checkpointing="unsloth",  # Memory efficiency
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Stable across ranks with rsLoRA
    logging_steps=10,
    save_steps=100,
)

# `dataset` is your prepared training set; loading it is not shown here
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)
trainer.train()

This combination - rsLoRA + Unsloth + 4-bit quantization - lets me fine-tune on a free Google Colab T4 GPU without crashes.
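
The batch settings above are easier to reason about as a single number. A minimal sketch of that arithmetic, assuming one GPU. `effective_batch_size` is an illustrative helper, not a Trainer API.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of sequences that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

batch = effective_batch_size(per_device_batch=4, grad_accum_steps=4)
# Upper bound on tokens per update, reached only if every sequence is full length
max_tokens = batch * 2048
print(batch, max_tokens)  # 16 32768
```

If memory is tight, halving per_device_train_batch_size and doubling gradient_accumulation_steps keeps this number the same.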

When to Use rsLoRA vs Standard LoRA

After testing both extensively, here’s my decision matrix:

| Scenario | Recommendation | Reason |
| --- | --- | --- |
| Rank ≤ 8 | Standard LoRA OK | Low ranks are stable either way |
| Rank 16-32 | rsLoRA recommended | Noticeable stability improvement |
| Rank > 32 | rsLoRA strongly recommended | Standard LoRA often fails |
| Consumer GPU (T4, etc.) | rsLoRA + Unsloth | Memory efficiency + stability |
| Production models | rsLoRA default | Predictable, reproducible training |
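
The rank rows of that matrix fit in a few lines of code. This is only a sketch of my own rule of thumb. The table says nothing about ranks 9-15, so folding them into "rsLoRA recommended" is an assumption, and `recommend_scaling` is a hypothetical helper name.

```python
def recommend_scaling(rank):
    """Map a LoRA rank to the recommendation from the decision matrix."""
    if rank <= 8:
        return "standard LoRA OK"
    if rank <= 32:
        # Ranks 9-15 are not in the table; treated like 16-32 here (assumption)
        return "rsLoRA recommended"
    return "rsLoRA strongly recommended"

for rank in (8, 16, 64):
    print(rank, recommend_scaling(rank))
```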

Why I Now Default to rsLoRA

After my rank-64 training crashed, I ran a comparison:

Training comparison results
Standard LoRA at r=64:
  - Needed 5 learning-rate tuning attempts
  - Training loss oscillated wildly
  - 3 out of 5 runs diverged (NaN loss)
  - Final model quality: inconsistent
rsLoRA at r=64:
  - Same learning rate as rank 16 worked
  - Training loss curve was smooth
  - 5 out of 5 runs converged
  - Final model quality: consistent and better

The stability improvement is real. I saved hours of hyperparameter tuning.

Mistakes I Made (So You Don’t Have To)

Mistake 1: Setting Alpha Incorrectly with rsLoRA

With standard LoRA, I typically set alpha = 2 * r. With rsLoRA, I initially kept that ratio and got poor results.

Wrong vs correct alpha setting
# WRONG for rsLoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,  # Too high for rsLoRA
    use_rslora=True,
)

# CORRECT for rsLoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,  # Alpha = rank is simpler and works well
    use_rslora=True,
)

With rsLoRA, setting alpha = r is a good starting point. The multiplier is then alpha / sqrt(r) = sqrt(r), which is 8 at rank 64. With alpha = 128 it would be 16.
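
To see what each alpha choice does to the multiplier under rsLoRA, it helps to compute it directly. The snippet is plain arithmetic on the alpha / sqrt(r) rule. `rslora_multiplier` is an illustrative name.

```python
import math

def rslora_multiplier(alpha, r):
    """Scale applied to the adapter update when use_rslora=True."""
    return alpha / math.sqrt(r)

for r in (16, 64):
    same_as_rank = rslora_multiplier(r, r)        # alpha = r
    double_rank = rslora_multiplier(2 * r, r)     # alpha = 2 * r
    print(r, same_as_rank, double_rank)
# 16 4.0 8.0
# 64 8.0 16.0
```

Carrying the alpha = 2 * r habit over from standard LoRA doubles the multiplier at every rank, which is consistent with the poor results I saw.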

Mistake 2: Not Adjusting Learning Rate for Higher Ranks

Even with rsLoRA, higher ranks benefit from lower learning rates. The rank-16 rate of 2e-4 still trained without diverging at rank 64, but 1e-4 worked better.

Learning rate by rank
# Rank 16
learning_rate = 2e-4
# Rank 32
learning_rate = 1.5e-4
# Rank 64
learning_rate = 1e-4
# The relationship isn't linear, so experiment
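
If you want a starting point for a rank that isn't listed, one simple option is to reuse the value from the nearest rank I tested. This is a sketch, not a tuning rule. The three values are the ones above, while the nearest-rank fallback and the `starting_lr` helper are assumptions for illustration.

```python
# Learning rates that worked for me at each tested rank (from the list above)
TESTED_LR = {16: 2e-4, 32: 1.5e-4, 64: 1e-4}

def starting_lr(rank):
    """Return the learning rate of the closest tested rank.

    On a tie (e.g. rank 24), the smaller tested rank wins because it
    comes first in the dictionary.
    """
    nearest = min(TESTED_LR, key=lambda tested: abs(tested - rank))
    return TESTED_LR[nearest]

print(starting_lr(64))   # 0.0001
print(starting_lr(128))  # 0.0001
```

Treat the result as the first value in a small sweep, not as the final setting.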

Mistake 3: Ignoring Gradient Checkpointing

Higher ranks mean more parameters. I ran out of VRAM trying to use rank 64 on my T4 GPU.
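
How many more parameters? Each adapted linear layer adds an A matrix of shape (r x d_in) and a B matrix of shape (d_out x r). The sketch below counts them for the q_proj/v_proj setup used earlier, assuming Llama-2-7B's 32 decoder layers and 4096 x 4096 projections. `lora_param_count` is an illustrative helper.

```python
def lora_param_count(r, d_in, d_out, num_modules):
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted module."""
    return num_modules * r * (d_in + d_out)

# Assumed shapes: 32 layers, q_proj and v_proj adapted in each, both 4096 x 4096
modules = 32 * 2
for r in (16, 64):
    print(r, lora_param_count(r, 4096, 4096, modules))
# 16 8388608
# 64 33554432
```

The count grows linearly with rank, so going from 16 to 64 means four times as many adapter weights, gradients, and optimizer states.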

The solution: combine rsLoRA with gradient checkpointing and 4-bit quantization.

Memory-efficient setup
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    use_rslora=True,
    use_gradient_checkpointing="unsloth",  # Critical for memory
    # ... other params
)

The Math Behind the Stability

If you’re curious why sqrt(r) works better than r, here’s the intuition:

Standard LoRA divides by r, which would be the right normalization if the size of the update BAx grew in proportion to rank. According to the rsLoRA paper, that factor is too aggressive: as r grows, the scaled update and its gradients shrink, so learning slows down at exactly the ranks where you hoped to gain capacity.

The paper argues that dividing by sqrt(r) is the scaling that matches how the low-rank product actually behaves. This keeps gradient magnitudes consistent regardless of rank size.

For a deeper dive, the original rsLoRA paper (arXiv:2312.03732) explains the mathematical derivation in detail.
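
A toy simulation gives a feel for where the square root comes from. It is not the paper's derivation. It assumes the entries of A and B behave like independent standard normal values, which is a simplification (at initialization B is typically zero), and `rms_of_rank_r_sum` is a hypothetical helper. Under that assumption, one output coordinate of BAx is a sum of r independent zero-mean terms, so its typical size grows like sqrt(r), not r.

```python
import math
import random

def rms_of_rank_r_sum(r, trials=2000, seed=0):
    """RMS of sum_{i=1..r} b_i * a_i with a_i, b_i drawn from N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(r))
        total += s * s
    return math.sqrt(total / trials)

for r in (16, 64):
    rms = rms_of_rank_r_sum(r)
    # Column 2: divided by sqrt(r) (stays near 1); column 3: divided by r (shrinks)
    print(r, round(rms / math.sqrt(r), 2), round(rms / r, 2))
```

In this toy setting, dividing by sqrt(r) keeps the magnitude roughly constant across ranks, while dividing by r makes it smaller as the rank goes up.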

Quick Reference: Implementation Checklist

rsLoRA implementation checklist
1. Install PEFT >= 0.7.0
   pip install "peft>=0.7.0"
2. Add use_rslora=True to LoraConfig
   lora_config = LoraConfig(r=32, lora_alpha=32, use_rslora=True, ...)
3. Set alpha equal to rank for simplicity
   lora_alpha = r
4. Lower learning rate for higher ranks
   r=16: lr=2e-4
   r=64: lr=1e-4
5. Enable gradient checkpointing for memory
   use_gradient_checkpointing=True
6. Use Unsloth for consumer GPUs
   from unsloth import FastLanguageModel
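
Step 1 can be checked from Python before a long run starts. Here is a minimal sketch using only the standard library. The 0.7.0 minimum is the one from the checklist, and `parse_version` and `meets_minimum` are hypothetical helpers that handle plain dotted versions only.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(text):
    """Turn '0.7.1' into (0, 7, 1); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def meets_minimum(installed, minimum="0.7.0"):
    """Compare versions numerically so that 0.10.0 counts as newer than 0.7.0."""
    return parse_version(installed) >= parse_version(minimum)

try:
    print("peft OK:", meets_minimum(version("peft")))
except PackageNotFoundError:
    print("peft is not installed")
```

Comparing tuples instead of strings matters here: as strings, "0.10.0" sorts before "0.7.0".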

Final Thoughts

If you’re fine-tuning LLMs with LoRA, I recommend using rsLoRA as your default. The stability improvement is significant, especially at ranks above 16.

The key differences:

  • Standard LoRA: scale = alpha / r - gets unstable at high ranks
  • rsLoRA: scale = alpha / sqrt(r) - stable across all ranks

For practical fine-tuning on consumer hardware, the combination of rsLoRA + Unsloth has been game-changing. I can now reliably train with rank 32-64 on free T4 GPUs without the crashes that plagued my earlier attempts.

One parameter change (use_rslora=True) saved me hours of debugging and hyperparameter tuning. That’s a good trade in my book.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
