SageMaker Fine-Tuning: LoRA Beats QLoRA on Cost-Perf Balance
LoRA cuts trainable parameters by over 99% versus full fine-tuning, balancing cost savings and accuracy on Llama2-7B/Mistral7B; QLoRA saves roughly 8x memory but trains more slowly due to dequantization overhead.
Fine-Tuning Methods: Trade-Offs in Params, Memory, and Speed
Full fine-tuning updates all 7B parameters of models like Llama2-7B, delivering the best accuracy (highest ROUGE-1/2/L, BERTScore F1, and intent accuracy on the Banking77 dataset) but at the highest cost and training time; it is appropriate only when budget is unconstrained or compliance demands that no accuracy be sacrificed.
LoRA (a PEFT method) freezes the original weights and trains two low-rank matrices A and B per adapted layer: for a 2048x2048 update matrix (~4.2M parameters), a rank-4 adapter trains (2048x4) + (4x2048) = 16,384 parameters, a reduction of roughly 99.6% for that layer. The low-rank update is applied on the fly during inference (or merged into the base weights), so the model keeps its general knowledge while specializing on domain data such as finance intents. Accuracy drops slightly versus full fine-tuning, but the GPU and time savings are large; the extra adapter pass adds a minor inference delay unless the weights are merged. A minimal configuration sketch follows.
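As a concrete illustration of the parameter math above, a minimal PEFT sketch; the rank, alpha, target modules, and model name here are illustrative assumptions, not the article's exact training-script settings:

```python
# Minimal LoRA sketch with Hugging Face PEFT; hyperparameters are illustrative
# assumptions, not the exact values used in the article's training script.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=4,                                   # rank of the A/B matrices (the 2048x4 / 4x2048 example)
    lora_alpha=16,                         # scaling applied to the low-rank update
    target_modules=["q_proj", "v_proj"],   # which weight matrices receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)     # base weights frozen; only A/B are trainable
model.print_trainable_parameters()         # confirms the >99% reduction in trainable params
```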
QLoRA quantizes the frozen base-model weights to 4-bit NF4 (e.g., 0.117 stored as roughly 0.12) while the LoRA adapters stay in higher precision, yielding roughly 8x memory savings; NF4 allocates more quantization levels near zero, where most weights lie, and fewer to outliers. This makes fine-tuning large models feasible on a single GPU, but training slows by 25%+ because of gradient checkpointing (which trades recomputation for roughly 45% less activation memory), dequantization on every forward/backward pass, and the paged 8-bit Adam optimizer (e.g., paged_adamw_8bit). Use it for prototypes or severe hardware constraints where a slight accuracy loss is acceptable. A loading sketch follows.
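A minimal QLoRA-style setup sketch using the standard bitsandbytes/PEFT path; the model name and settings are assumptions, not the article's exact configuration:

```python
# QLoRA-style loading sketch: 4-bit NF4 quantized base model plus LoRA adapters.
# Model name and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4: more quantization levels near zero
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for each matmul
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_cfg, device_map="auto"
)
base = prepare_model_for_kbit_training(base)  # sets up gradient checkpointing hooks

model = get_peft_model(base, LoraConfig(r=4, lora_alpha=16, task_type="CAUSAL_LM"))
```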
AWS SageMaker Implementation: Universal Script Across Approaches
Prepare the Banking77 dataset (HF: PolyAI/banking77) as train/test .jsonl files and upload them to an S3 bucket (e.g., finetuning-llm-blog-harshitdawar/Banking77/{train,test}). Bundle requirements.txt (key libraries: torch, transformers, peft, bitsandbytes, trl, datasets, accelerate) and training_script.py into training-scripts.tar.gz; the script accepts model_name (Llama2-7B, Mistral7B-v0.1, GPT-NeoX-20B), approach (full/lora/qlora), epochs, batch_size=8, lr (auto-tuned), and hf_token for gated models. A preparation sketch follows.
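A rough sketch of the dataset preparation and upload step; the bucket and prefixes follow the article, but the prompt template is an assumption:

```python
# Prepare PolyAI/banking77 as train/test .jsonl and upload to S3.
# The text template is an illustrative assumption; the article's script may differ.
import boto3
from datasets import load_dataset

ds = load_dataset("PolyAI/banking77")
labels = ds["train"].features["label"].names

def to_prompt(ex):
    # Map each example to a single training text with query and intent name.
    return {"text": f"Query: {ex['text']}\nIntent: {labels[ex['label']]}"}

for split in ("train", "test"):
    ds[split].map(to_prompt).to_json(f"{split}.jsonl")   # JSON lines by default

s3 = boto3.client("s3")
bucket = "finetuning-llm-blog-harshitdawar"
for split in ("train", "test"):
    s3.upload_file(f"{split}.jsonl", bucket, f"Banking77/{split}/{split}.jsonl")
```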
Add an S3 bucket policy granting SageMaker access. In SageMaker Training Jobs, use the Hugging Face PyTorch container (e.g., 763104351884.dkr.ecr.ap-south-1.amazonaws.com/huggingface-pytorch-training:2.1.0-...) on ml.g5.xlarge or larger GPU instances, scaled per model and approach (e.g., Llama2 QLoRA on g5.xlarge with batch size 8; GPT-NeoX-20B LoRA on p4d.24xlarge with batch size 1). Hyperparameters reference the S3 code and output paths, input channels supply the train/test data, and output goes to s3://.../models/{model}-{approach}. Spot instances are optional; ensure the IAM role has S3 permissions and request service quotas for the instance types. A launch sketch follows.
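A hedged sketch of launching one such job with the SageMaker Python SDK; the role ARN, framework versions, instance type, and S3 paths are placeholders, and the article configures jobs against the container above rather than this exact code:

```python
# Launch one fine-tuning job via the SageMaker SDK; role ARN, versions,
# instance type, and S3 paths are placeholders, not the article's exact values.
from sagemaker.huggingface import HuggingFace

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder
bucket = "finetuning-llm-blog-harshitdawar"

estimator = HuggingFace(
    entry_point="training_script.py",
    source_dir=f"s3://{bucket}/code/training-scripts.tar.gz",     # bundled code archive
    role=role,
    instance_type="ml.g5.xlarge",        # scale up (g5.12xlarge, p4d.24xlarge) per model
    instance_count=1,
    transformers_version="4.36",         # assumption; must match the chosen container
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={
        "model_name": "Llama2-7B",
        "approach": "qlora",
        "epochs": 1,
        "batch_size": 8,
    },
    output_path=f"s3://{bucket}/models/Llama2-7B-qlora",
    use_spot_instances=False,            # optionally True (with max_wait) for spot pricing
)

estimator.fit({
    "train": f"s3://{bucket}/Banking77/train",
    "test": f"s3://{bucket}/Banking77/test",
})
```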
Of the nine model/approach combinations, run all except GPT-NeoX-20B full fine-tuning (skipped due to cost); evaluate each on 500 test samples using ROUGE, BERTScore, intent accuracy, parse rate, and inference time in seconds.
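A minimal sketch of that evaluation step using the Hugging Face evaluate library; the parse-rate logic here (fraction of generations from which an intent could be extracted) is an assumption, not the article's exact definition:

```python
# Score generations against references; "parse rate" is assumed to be the share
# of outputs from which an intent label could be parsed (None = unparseable).
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def score(predictions, references, parsed_intents, true_intents):
    r = rouge.compute(predictions=predictions, references=references)
    b = bertscore.compute(predictions=predictions, references=references, lang="en")
    parse_rate = sum(p is not None for p in parsed_intents) / len(parsed_intents)
    intent_acc = sum(p == t for p, t in zip(parsed_intents, true_intents)) / len(true_intents)
    return {
        "rouge1": r["rouge1"], "rouge2": r["rouge2"], "rougeL": r["rougeL"],
        "bert_f1": sum(b["f1"]) / len(b["f1"]),   # average BERTScore F1 over samples
        "parse_rate": parse_rate,
        "intent_accuracy": intent_acc,
    }
```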
Results: LoRA Wins on Cost per Performance Point
On Banking77 intents, full fine-tuning tops every metric (e.g., Llama2 full FT posts the highest intent accuracy), LoRA comes close with only a slight drop, and QLoRA trails but remains a viable baseline. On training time and cost, QLoRA is cheapest per hour thanks to its memory savings yet can cost more in total because of its overheads; LoRA is the optimum, far cheaper than full fine-tuning and ahead of QLoRA on performance per dollar. At inference, full and LoRA models are faster per request than QLoRA, so cost per performance point favors LoRA.
Resources: merged fine-tuned models end up roughly the size of the originals (merging adapters removes the storage advantage); GPU utilization is high across runs (e.g., Llama2 QLoRA peaks at 100% GPU memory), and QLoRA maxes out the smaller instances. The author spent over $200 across the runs, so obtain credits or cost estimates first.
Recommendations: Match Approach to Constraints
Full FT: maximum accuracy with no compromises (e.g., regulated finance). LoRA: the production sweet spot, cutting trainable parameters by over 99% while staying near full fine-tuning performance and preserving base knowledge. QLoRA: quick prototypes or tight hardware constraints (it democratizes research). Scale instances to the model (e.g., 7B full fine-tuning on g5.12xlarge; 20B LoRA on p4d.24xlarge). Merge LoRA adapters for inference speed (sketch below), and test baselines before scaling.
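For the merge-before-inference recommendation, a minimal PEFT sketch; the model name and adapter/output paths are placeholders:

```python
# Merge trained LoRA adapters back into the base weights so inference pays no
# adapter overhead; model name and paths are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()        # folds B·A into the original weight matrices
merged.save_pretrained("llama2-7b-banking77-merged")
```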