LoRA Misses Facts Because Factual Updates Are High-Rank; RS-LoRA Fixes the Scaling
LoRA assumes weight updates are low-rank. That holds for style (99% of update variance captured at r=8) but not for facts (28% at r=8). Raising the rank recovers the missing information, but the standard α/r scaling shrinks to 0.25 at r=64, crushing the update signal. RS-LoRA's α/√r scaling stays at 2.0 at the same rank, keeping high-rank learning stable.
Style Updates Are Low-Rank, Facts Are High-Rank
Style changes like tone or format concentrate in a few dimensions: the singular values decay fast (leading values: 5.0, 4.5, 4.0, 3.5, then 0.5, ...). At rank 4, LoRA captures nearly all of the signal; rank 8 reaches 99% cumulative variance. Facts like medical data or statistics spread across many dimensions (leading values: 3.0, 2.9, 2.8, ..., decaying slowly). Rank 8 captures only 28% of the variance, so low-rank LoRA (r=4-8) sounds fluent but emits wrong or incomplete facts: the model drops the high-dimensional tail of the update.
To simulate this, generate a low-rank delta with true_rank=4 and singular values on a linspace from 5 to 0.5, and a high-rank delta with a linspace from 3 to 0.5 spanning all min(d, k) = 64 dimensions. Orthogonalize the U and V factors via QR and add 1% Gaussian noise. The Frobenius-normalized approximation error then quantifies the information loss; a sketch follows.
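A minimal NumPy sketch of that setup (the seed, dimensions, and the helper name `make_delta` are assumptions of this walkthrough, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed for reproducibility

def make_delta(d, k, true_rank, s_max, s_min, noise=0.01):
    """Build a d x k update with a prescribed singular-value spectrum."""
    # Random orthonormal factors via reduced QR.
    U, _ = np.linalg.qr(rng.normal(size=(d, true_rank)))
    V, _ = np.linalg.qr(rng.normal(size=(k, true_rank)))
    S = np.linspace(s_max, s_min, true_rank)        # linear spectrum decay
    delta = (U * S) @ V.T                           # scale columns of U by S
    return delta + noise * rng.normal(size=(d, k))  # 1% noise tail

d = k = 64
style_delta = make_delta(d, k, true_rank=4,          s_max=5.0, s_min=0.5)  # fast decay
facts_delta = make_delta(d, k, true_rank=min(d, k),  s_max=3.0, s_min=0.5)  # slow decay
```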
Standard LoRA Scaling Collapses at High Ranks, Weakening Updates
Increasing the rank captures more of the factual update (error drops from 0.85 at r=4 to 0.42 at r=32), but the standard scaling α/r (α=16) shrinks the update as rank grows: r=1→16.0, r=4→4.0, r=8→2.0, r=16→1.0, r=32→0.5, r=64→0.25. The adapter gains capacity while its output weakens, forcing the optimizer to overcompensate and leading to instability and poor convergence. The sketch below shows where the scaling enters the forward pass.
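A sketch of a LoRA forward pass with the α/r factor in place (layer shapes, initialization, and names here are illustrative assumptions):

```python
import numpy as np

d, k, r, alpha = 64, 64, 8, 16.0
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))          # frozen pretrained weight
A = 0.01 * rng.normal(size=(r, k))   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x, scaling=alpha / r):
    # Standard LoRA: scaling = alpha / r. With alpha=16 this is 2.0 at r=8
    # but only 0.25 at r=64; the update term is scaled toward zero.
    return W @ x + scaling * (B @ (A @ x))
```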
Frobenius-normalized approximation error (64×64 matrix):
| Rank | Style Err | Facts Err |
|---|---|---|
| 2 | 0.201 | 0.916 |
| 4 | 0.015 | 0.850 |
| 8 | 0.002 | 0.692 |
| 16 | 0.001 | 0.553 |
| 32 | 0.000 | 0.417 |
| 48 | 0.000 | 0.289 |
Style error drops to ~0 by r=8; facts need r≥32, exactly where standard α/r scaling has collapsed.
RS-LoRA's √r Scaling Enables High-Rank Fact Learning
Changing the scaling to α/√r gives r=1→16.0, r=4→8.0, r=8→5.7, r=16→4.0, r=32→2.8, r=64→2.0: a gradual decline that preserves the update magnitude (comparison sketch below). With RS-LoRA, the facts error improves steadily, r=2→0.894, r=4→0.775, r=8→0.585, r=16→0.413, r=32→0.199, r=48→0.099, versus the plateau under standard scaling.
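A quick side-by-side of the two scaling rules, plain arithmetic matching the numbers above:

```python
import numpy as np

alpha = 16.0
for r in [1, 4, 8, 16, 32, 64]:
    standard = alpha / r            # collapses: 16.0 -> 0.25
    rs       = alpha / np.sqrt(r)   # gradual:   16.0 -> 2.0
    print(f"r={r:<3d} standard={standard:6.2f}  rs-lora={rs:5.2f}")
```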
LoRA approximation via truncated SVD: decompose the delta as U, S, Vt; keep the top r components; set B = U[:, :r] · diag(S[:r]) and A = Vt[:r, :], so that B @ A is the best rank-r approximation in Frobenius norm, as in the sketch below.
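A sketch under those definitions (the function name and the reuse of the deltas from the earlier sketch are assumptions):

```python
import numpy as np

def lora_approx_error(delta, r):
    """Frobenius-normalized error of the best rank-r approximation."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :r] * S[:r]   # absorb singular values into B
    A = Vt[:r, :]
    return np.linalg.norm(delta - B @ A) / np.linalg.norm(delta)

# Rank sweep over both deltas from the earlier sketch:
# for r in [2, 4, 8, 16, 32, 48]:
#     print(r, lora_approx_error(style_delta, r), lora_approx_error(facts_delta, r))
```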