rsLoRA vs LoRA: Why My Fine-Tuning Kept Crashing at Higher Ranks
I stared at the training loss graph. It was oscillating wildly, then flatlined.
```text title="Training loss output"
Step 100: loss=2.34
Step 200: loss=1.87
Step 300: loss=1.52
Step 400: loss=NaN  <-- What?!
```

I had increased the LoRA rank from 16 to 64 hoping for better fine-tuning quality. Instead, my training crashed. After hours of debugging, I discovered the problem wasn’t my data or hyperparameters. It was standard LoRA itself.
That’s when I found rsLoRA (rank-stabilized LoRA). Here’s what I learned about why standard LoRA fails at higher ranks and how rsLoRA fixes it.
The Problem: LoRA Gets Unstable at High Ranks
LoRA (Low-Rank Adaptation) is a popular method for fine-tuning large language models efficiently. Instead of updating all model weights, LoRA adds small trainable matrices with a low rank r.
The standard LoRA formula applies a scaling factor:
```text title="Standard LoRA formula"
output = Wx + (BA)x * (alpha / r)
```

Where:

- W = original frozen weights
- B, A = low-rank trainable matrices
- alpha = scaling hyperparameter
- r = rank

Here’s the problem: for a fixed alpha, the scaling factor alpha / r shrinks in inverse proportion to r as you increase the rank. This causes issues:
- Vanishing gradients at higher ranks
- Training instability as rank increases
- Hyperparameter sensitivity - small changes cause big effects
I thought increasing rank would help my model learn more complex patterns. Instead, the shrinking scaling factor made training unstable.
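To make the formula concrete, here is a minimal, dependency-free sketch of the forward pass. The helper names (`matvec`, `lora_forward`) and the toy 2x2 values are my own illustrative assumptions, not code from PEFT or any other library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute Wx + (BA)x * (alpha / r) using plain Python lists.

    W: d_out x d_in frozen weights
    A: r x d_in trainable matrix
    B: d_out x r trainable matrix
    """
    scale = alpha / r
    base = matvec(W, x)               # frozen path: Wx
    update = matvec(B, matvec(A, x))  # low-rank path: B(Ax)
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy values: d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]      # LoRA conventionally initializes B to zero
B_trained = [[0.5], [-0.5]]  # made-up values standing in for a trained B
x = [2.0, 3.0]

out_init = lora_forward(W, A, B_zero, x, alpha=2, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(out_init, out_trained)
```

With B at zero the adapter contributes nothing and the output is just Wx. Once B is non-zero, the size of the adapter's contribution is controlled entirely by the `alpha / r` multiplier.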
The Solution: rsLoRA Changes the Scaling
rsLoRA modifies the scaling formula in a simple but powerful way:
```text title="rsLoRA vs standard LoRA scaling"
Standard LoRA: scale = alpha / r
rsLoRA:        scale = alpha / sqrt(r)
```

By dividing by the square root of rank instead of rank itself, rsLoRA maintains gradient stability across different rank sizes.
Why does this matter?
```text title="Scaling comparison by rank"
Rank 16:
  Standard LoRA: alpha/16 = 0.0625 * alpha
  rsLoRA: alpha/sqrt(16) = 0.25 * alpha

Rank 64:
  Standard LoRA: alpha/64 = 0.0156 * alpha (much smaller!)
  rsLoRA: alpha/sqrt(64) = 0.125 * alpha (still reasonable)

Rank 128:
  Standard LoRA: alpha/128 = 0.0078 * alpha (tiny!)
  rsLoRA: alpha/sqrt(128) = 0.088 * alpha (stable)
```

As rank increases, standard LoRA’s scaling shrinks dramatically. rsLoRA’s scaling decreases much more slowly, keeping training stable.
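If you want to reproduce these numbers for other ranks, a few lines of Python are enough. This is just a sketch of the two formulas from this article; the function names `lora_scale` and `rslora_scale` are my own and do not come from PEFT.

```python
import math


def lora_scale(alpha, r):
    """Standard LoRA multiplier: alpha / r."""
    return alpha / r


def rslora_scale(alpha, r):
    """rsLoRA multiplier: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 1.0  # with alpha = 1 the printed values are the per-alpha factors
for r in (16, 64, 128):
    ratio = rslora_scale(alpha, r) / lora_scale(alpha, r)  # equals sqrt(r)
    print(
        f"r={r:>3}  LoRA={lora_scale(alpha, r):.4f}  "
        f"rsLoRA={rslora_scale(alpha, r):.4f}  ratio={ratio:.1f}x"
    )
```

The ratio between the two multipliers is always sqrt(r), so the gap between standard LoRA and rsLoRA widens as the rank grows.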
Code Comparison: Standard LoRA vs rsLoRA
Here’s how to configure both in HuggingFace PEFT:
Standard LoRA Configuration
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Standard LoRA configuration
lora_config = LoraConfig(
    r=16,              # rank
    lora_alpha=32,     # scaling factor (fixed ratio)
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)

# Training parameters for standard LoRA at rank 16
# learning_rate: 2e-4 works well
# At higher ranks (64+), you'd need to adjust learning rate
```

rsLoRA Configuration
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# rsLoRA configuration (requires PEFT >= 0.7.0)
lora_config = LoraConfig(
    r=64,              # Can use higher rank now!
    lora_alpha=64,     # With rsLoRA, alpha typically equals rank
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_rslora=True,   # This is the key parameter!
)

model = get_peft_model(base_model, lora_config)

# With rsLoRA, training stays stable even at rank 64
# You can use the same learning rate across different ranks
```

The key difference is use_rslora=True. That one parameter changes the scaling behavior.
Practical Example: Fine-Tuning with Unsloth + rsLoRA
I use Unsloth for memory-efficient training on consumer GPUs. Here’s a complete setup:
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model with 4-bit quantization for memory efficiency
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-2-7b",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Add rsLoRA adapters with higher rank
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                # Higher rank enabled by rsLoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=32,       # Alpha = rank for simplicity
    lora_dropout=0.05,
    use_rslora=True,     # Rank-stabilized scaling
    use_gradient_checkpointing="unsloth",  # Memory efficiency
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Stable across ranks with rsLoRA
    logging_steps=10,
    save_steps=100,
)

# `dataset` is assumed to be a training dataset you have already loaded (not shown here)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

trainer.train()
```

This combination - rsLoRA + Unsloth + 4-bit quantization - lets me fine-tune on a free Google Colab T4 GPU without crashes.
When to Use rsLoRA vs Standard LoRA
After testing both extensively, here’s my decision matrix:
| Scenario | Recommendation | Reason |
|---|---|---|
| Rank ≤ 8 | Standard LoRA OK | Low ranks are stable either way |
| Rank 16-32 | rsLoRA recommended | Noticeable stability improvement |
| Rank > 32 | rsLoRA strongly recommended | Standard LoRA often fails |
| Consumer GPU (T4, etc.) | rsLoRA + Unsloth | Memory efficiency + stability |
| Production models | rsLoRA default | Predictable, reproducible training |
Why I Now Default to rsLoRA
After my rank-64 training crashed, I ran a comparison:
Standard LoRA at r=64:

- Required 5x learning rate tuning attempts
- Training loss oscillated wildly
- 3 out of 5 runs diverged (NaN loss)
- Final model quality: inconsistent

rsLoRA at r=64:

- Same learning rate as rank 16 worked
- Training loss curve was smooth
- 5 out of 5 runs converged
- Final model quality: consistent and better

The stability improvement is real. I saved hours of hyperparameter tuning.
Mistakes I Made (So You Don’t Have To)
Mistake 1: Setting Alpha Incorrectly with rsLoRA
With standard LoRA, I typically set alpha = 2 * r. With rsLoRA, I initially kept that ratio and got poor results.
```python
# WRONG for rsLoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,  # Too high for rsLoRA
    use_rslora=True,
)

# CORRECT for rsLoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,   # Alpha = rank is simpler and works well
    use_rslora=True,
)
```

With rsLoRA, setting alpha = r is a good starting point. The sqrt(r) scaling already handles the math.
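To see why carrying over alpha = 2 * r is aggressive under rsLoRA, it helps to compute the effective multiplier each choice produces. A small sketch, using only the two scaling formulas from this article (the function name `effective_scale` is hypothetical):

```python
import math


def effective_scale(alpha, r, use_rslora):
    """Multiplier applied to the low-rank update (BA)x."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


r = 64
standard_2r = effective_scale(2 * r, r, use_rslora=False)  # 128 / 64
rslora_2r = effective_scale(2 * r, r, use_rslora=True)     # 128 / 8
rslora_r = effective_scale(r, r, use_rslora=True)          # 64 / 8

print(standard_2r, rslora_2r, rslora_r)  # 2.0 16.0 8.0
```

At r=64, keeping alpha = 2 * r with rsLoRA applies a multiplier of 16 instead of the 2 I was used to from standard LoRA. Setting alpha = r halves that to 8, which is sqrt(64).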
Mistake 2: Not Adjusting Learning Rate for Higher Ranks
Even with rsLoRA, higher ranks benefit from lower learning rates. I used 2e-4 for rank 16, but 1e-4 worked better for rank 64.
```python
# Rank 16
learning_rate = 2e-4

# Rank 32
learning_rate = 1.5e-4

# Rank 64
learning_rate = 1e-4

# The relationship isn't linear, so experiment
```

Mistake 3: Ignoring Gradient Checkpointing
Higher ranks mean more parameters. I ran out of VRAM trying to use rank 64 on my T4 GPU.
The solution: combine rsLoRA with gradient checkpointing and 4-bit quantization.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    use_rslora=True,
    use_gradient_checkpointing="unsloth",  # Critical for memory
    # ... other params
)
```

The Math Behind the Stability
If you’re curious why sqrt(r) works better than r, here’s the intuition:
Standard LoRA assumes the product BA has values that scale with rank. When you divide by r, you’re normalizing for this assumption. But in practice, the values in BA don’t scale linearly with rank.
rsLoRA’s sqrt(r) scaling better matches the actual variance of the low-rank matrices. This keeps gradient magnitudes consistent regardless of rank size.
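One rough way to check this intuition numerically is to draw random A and B and measure how large the entries of (BA)x are at different ranks. This is only a sketch under assumptions of my own (i.i.d. Gaussian entries, `d = 64`, the helper name `update_rms`). It looks at random matrices rather than trained adapters, and the paper's actual argument is about training dynamics, so treat it as an illustration and not a proof.

```python
import math
import random


def update_rms(r, d=64, samples=50, seed=0):
    """RMS of the entries of (BA)x for random Gaussian A, B, and x."""
    rng = random.Random(seed)
    # A is r x d with entry variance 1/d, so each entry of Ax has variance ~1.
    A = [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(r)]
    # B is d x r with unit-variance entries.
    B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
    total, count = 0.0, 0
    for _ in range(samples):
        x = [rng.gauss(0, 1) for _ in range(d)]
        ax = [sum(a * v for a, v in zip(row, x)) for row in A]
        bax = [sum(b * v for b, v in zip(row, ax)) for row in B]
        total += sum(v * v for v in bax)
        count += len(bax)
    return math.sqrt(total / count)


for r in (4, 16, 64):
    rms = update_rms(r)
    print(
        f"r={r:>2}  raw={rms:.2f}  "
        f"/r={rms / r:.3f}  /sqrt(r)={rms / math.sqrt(r):.3f}"
    )
```

In a run like this, the `/sqrt(r)` column should stay close to 1 while the `/r` column keeps shrinking as the rank grows, which is the mismatch described above.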
For a deeper dive, the original rsLoRA paper (arXiv:2312.03732) explains the mathematical derivation in detail.
Quick Reference: Implementation Checklist
```text title="rsLoRA implementation checklist"
1. Install PEFT >= 0.7.0
   pip install "peft>=0.7.0"

2. Add use_rslora=True to LoraConfig
   lora_config = LoraConfig(r=32, lora_alpha=32, use_rslora=True, ...)

3. Set alpha equal to rank for simplicity
   lora_alpha = r

4. Lower learning rate for higher ranks
   r=16: lr=2e-4
   r=64: lr=1e-4

5. Enable gradient checkpointing for memory
   use_gradient_checkpointing=True

6. Use Unsloth for consumer GPUs
   from unsloth import FastLanguageModel
```

Final Thoughts
If you’re fine-tuning LLMs with LoRA, I recommend using rsLoRA as your default. The stability improvement is significant, especially at ranks above 16.
The key differences:
- Standard LoRA: scale = alpha / r - gets unstable at high ranks
- rsLoRA: scale = alpha / sqrt(r) - stable across all ranks
For practical fine-tuning on consumer hardware, the combination of rsLoRA + Unsloth has been game-changing. I can now reliably train with rank 32-64 on free T4 GPUs without the crashes that plagued my earlier attempts.
One parameter change (use_rslora=True) saved me hours of debugging and hyperparameter tuning. That’s a good trade in my book.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 LoRA Paper
- 👨💻 rsLoRA Paper: Fine-Tuning with Low-Rank Adaptation in a Stable Way
- 👨💻 PEFT Documentation
- 👨💻 Unsloth Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!