Phi-4-Mini Masterclass: Quantized LLM Pipelines
Build end-to-end Phi-4-mini workflows in Colab: 4-bit inference, streaming chat, CoT reasoning, tool calling, RAG, and LoRA fine-tuning—all in one notebook with full code.
Load Phi-4-Mini in 4-Bit Quantization for Efficient Inference
Phi-4-mini (3.8B params) runs on a single T4 GPU in Colab using 4-bit NF4 quantization via BitsAndBytes. Start by installing pinned versions: transformers (4.49-4.56), accelerate, bitsandbytes, peft, datasets, sentence-transformers, faiss-cpu. Clear caches to avoid clashes.
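A minimal install cell might look like this (the exact pins are illustrative; match them to your Colab runtime):

!pip install -q "transformers>=4.49,<4.57" accelerate bitsandbytes peft datasets sentence-transformers faiss-cpu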
Key code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

PHI_MODEL_ID = "microsoft/Phi-4-mini-instruct"

# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)

phi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)
phi_model = AutoModelForCausalLM.from_pretrained(
    PHI_MODEL_ID, quantization_config=bnb_cfg,
    device_map="auto", torch_dtype=torch.bfloat16,
)
Uses ~2-3GB VRAM. Enable use_cache=True for inference speed. Pad token set to EOS. Assumes CUDA GPU; fails gracefully without.
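In code, those post-load touches are roughly (a sketch):

if phi_tokenizer.pad_token is None:
    phi_tokenizer.pad_token = phi_tokenizer.eos_token  # pad with EOS, as noted above
phi_model.config.use_cache = True  # KV cache speeds up generation
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")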
Unified ask_phi function handles all inference: applies chat template (with optional tools), generates with top_p=0.9, temperature control, streaming via TextStreamer. Supports max_new_tokens up to 512. Strips special tokens post-decode.
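One plausible shape for that helper (a sketch based on the description above; argument names and defaults are assumptions, not the notebook's exact code):

from transformers import TextStreamer

def ask_phi(messages, tools=None, temperature=0.2, max_new_tokens=512, stream=False):
    # Build the prompt with the chat template; pass tool schemas when provided.
    prompt = phi_tokenizer.apply_chat_template(
        messages, tools=tools, tokenize=False, add_generation_prompt=True
    )
    inputs = phi_tokenizer(prompt, return_tensors="pt").to(phi_model.device)
    streamer = TextStreamer(phi_tokenizer, skip_prompt=True) if stream else None
    gen_kwargs = dict(max_new_tokens=max_new_tokens, streamer=streamer,
                      pad_token_id=phi_tokenizer.eos_token_id)
    if temperature > 0:
        gen_kwargs.update(do_sample=True, temperature=temperature, top_p=0.9)
    with torch.no_grad():
        out = phi_model.generate(**inputs, **gen_kwargs)
    # Decode only the newly generated tokens and strip special tokens.
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return phi_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()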
Trade-off: Quantization trades a small precision loss for roughly 4x memory savings vs. BF16; NF4 with double quantization works well for the Phi architecture.
Common mistake: Skipping cache clear—leads to tokenizer/model version mismatches.
Chain-of-Thought Reasoning and Streaming Chat
Test basic capabilities with system prompts. For chat: a concise research-assistant persona answers in bullets on SLM benefits (e.g., on-device AI: low latency, privacy, efficient compute, edge deployment).
For CoT: Math word problem (trains meeting). Prompt: "Reason step by step, label each step, final 'Answer:' line." Temperature=0.2. Model computes the closing speed (140 mph), the time to meet (300/140 ≈ 2.14 h after 10 AM), so the trains meet at ~12:09 PM.
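With such a helper, the CoT cell could look roughly like this (the individual train speeds are assumed as 60 and 80 mph, which give the 140 mph closing speed described):

cot_messages = [
    {"role": "system", "content": "Reason step by step, label each step, and end with a final 'Answer:' line."},
    {"role": "user", "content": "Two trains 300 miles apart leave at 10 AM heading toward each other "
                                "at 60 mph and 80 mph. At what time do they meet?"},
]
answer = ask_phi(cot_messages, temperature=0.2, stream=True)  # streams steps token by token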
Principle: Explicit step-labeling + low temp forces structured output. Before: hallucinated jumps; after: verifiable steps.
Streaming shows token-by-token generation, ideal for UI. Quality criteria: Coherent steps, exact arithmetic, clock time format.
Exercise: Adapt for your math/logic puzzles—vary temp (0 for deterministic, 0.3 for creative).
Tool Calling: Parse, Execute, and Iterate
Define JSON schemas for tools (e.g., get_weather: city/unit; calculate: expression). Fake impls for demo.
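Illustrative schemas and stub implementations along those lines (field names are assumptions; get_weather returns fixed fake data matching the example turn below):

import json, re

tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"},
                                      "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
                       "required": ["city"]}}},
    {"type": "function", "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {"type": "object",
                       "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
]

def get_weather(city, unit="fahrenheit"):
    # Fake implementation for the demo.
    return {"city": city, "temperature": 75, "unit": unit, "conditions": "sunny"}

def calculate(expression):
    # Safe eval: only digits, whitespace, and basic operators allowed.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expression):
        return {"error": "unsupported expression"}
    return {"result": eval(expression)}

TOOL_FUNCS = {"get_weather": get_weather, "calculate": calculate}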
run_tool_turn loop:
- Initial assistant call with tools=tools, temp=0 (greedy).
- Regex/JSON parse tool calls from output (handles <tool_call> tags or raw JSON).
- Execute: map name to fn, pass args (json.loads if given as a string), collect results.
- Append as {"role": "tool", "content": json.dumps(results)}.
- Second call for final answer (temp=0.2).
Example: "Tokyo weather F, 47*93" → Calls both, gets 75F sunny + 4371, synthesizes.
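A sketch of that loop, assuming the ask_phi helper and tool registry above; the tag-stripping regex is a guess at the model's output format:

def parse_tool_calls(raw):
    # Strip <tool_call>-style tags if present, then try to parse JSON.
    cleaned = re.sub(r"</?\|?/?tool_call\|?>", "", raw).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else [parsed]

def run_tool_turn(user_msg):
    messages = [
        {"role": "system", "content": "You can call tools when helpful. Only call a tool if needed."},
        {"role": "user", "content": user_msg},
    ]
    raw = ask_phi(messages, tools=tools, temperature=0)  # greedy first pass
    calls = parse_tool_calls(raw)
    if not calls:
        return raw  # no tool calls: the direct answer stands
    results = []
    for call in calls:
        name = call.get("name")
        fn = TOOL_FUNCS.get(name)
        args = call.get("arguments", {})
        if isinstance(args, str):
            args = json.loads(args)  # args may arrive as a JSON string
        try:
            results.append({name: fn(**args) if fn else {"error": "unknown tool"}})
        except TypeError as exc:
            results.append({name: {"error": str(exc)}})  # invalid args -> error dict
    # Feed tool results back as a 'tool' message, then ask for the final answer.
    messages.append({"role": "assistant", "content": raw})
    messages.append({"role": "tool", "content": json.dumps(results)})
    return ask_phi(messages, temperature=0.2)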
Key insight: Phi natively parses function-calling schema (no extra training). Extract via flexible regex for robustness.
Pitfalls: Invalid args → error dict; unsupported expr → safe eval check (regex digits/ops). No calls? Direct answer.
Quality check: Calls tools only when needed (system prompt: "Only call a tool if needed"); handles multiple calls in a single turn.
RAG: Embed, Retrieve, Ground Responses
Simple vector DB: 7 Phi docs → all-MiniLM-L6-v2 embeddings (384D), FAISS IndexFlatIP (cosine sim).
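A sketch of the index build (the doc snippets here are placeholders condensed from the facts the queries below retrieve):

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

docs = [
    "Phi-4-multimodal handles audio and vision inputs via a mixture-of-LoRAs (MoLoRA) design.",
    "Phi-4-mini supports a 128K token context window.",
    "Phi-4-mini can be fine-tuned cheaply with LoRA/QLoRA on a single GPU.",
    # ... remaining documents
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # unit norm -> inner product = cosine
index = faiss.IndexFlatIP(doc_vecs.shape[1])                 # 384-dim flat inner-product index
index.add(np.asarray(doc_vecs, dtype="float32"))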
retrieve(q, k=3): Encode query, top-k indices → docs.
rag_answer: Format context as bullets, system: "Answer ONLY from context or say don't know." Temp=0.1.
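Retrieval and grounded answering, continuing from the index sketch above:

def retrieve(query, k=3):
    q_vec = embedder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q_vec, dtype="float32"), k)
    return [docs[i] for i in idx[0]]

def rag_answer(query, k=3):
    context = "\n".join(f"- {d}" for d in retrieve(query, k))
    messages = [
        {"role": "system", "content": "Answer ONLY from the provided context. "
                                      "If the context is insufficient, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    return ask_phi(messages, temperature=0.1)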
Queries:
- Audio? → Phi-4-multimodal (MoLoRA).
- Cheap fine-tune? → LoRA/QLoRA on single GPU.
- Context? → 128K.
Method: Normalize embeddings so inner product equals cosine similarity. Grounding curbs hallucinations: the model refuses questions the context can't answer.
Trade-offs: CPU FAISS fine for <1K docs; scale to GPU/HNSW for millions. MiniLM fast but domain-general.
Before/after: Vanilla Phi fabricates; RAG cites exact facts.
LoRA Fine-Tuning: Inject Facts on Quantized Base
Probe: "What is Zorblax-7?" → Before: Hallucinates/nothing.
Dataset: 6 Q&A pairs on fictional alloy (x4 repeats). Chat template → tokenized features (max_len=384, labels=copy input_ids).
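A sketch of the data prep (the Q&A text is a placeholder; the notebook's Zorblax-7 facts aren't reproduced here):

from datasets import Dataset

qa_pairs = [
    {"q": "What is Zorblax-7?", "a": "Zorblax-7 is a fictional alloy used in this fine-tuning demo."},
    # ... remaining pairs
] * 4  # repeat x4 so the tiny dataset yields enough optimization steps

def to_features(example):
    text = phi_tokenizer.apply_chat_template(
        [{"role": "user", "content": example["q"]},
         {"role": "assistant", "content": example["a"]}],
        tokenize=False,
    )
    toks = phi_tokenizer(text, truncation=True, max_length=384, padding="max_length")
    toks["labels"] = toks["input_ids"].copy()  # causal LM: labels mirror inputs
    return toks

train_ds = Dataset.from_list(qa_pairs).map(to_features, remove_columns=["q", "a"])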
Prep: prepare_model_for_kbit_training. LoRA: r=16, alpha=32, dropout=0.05, targets=qkv_proj, o_proj, gate_up_proj, down_proj. ~1-2% trainable params.
Train: 3 epochs, bs=1 acc=4, lr=2e-4, warmup=0.05, paged_adamw_8bit, grad checkpoint, bf16. No eval/save.
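The adapter setup and training loop, mirroring the settings above (a sketch; output path and logging choices are arbitrary):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import Trainer, TrainingArguments

phi_model.config.use_cache = False  # cache must be off during training
phi_model = prepare_model_for_kbit_training(phi_model)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)
phi_model = get_peft_model(phi_model, lora_cfg)
phi_model.print_trainable_parameters()  # expect roughly 1-2% trainable

args = TrainingArguments(
    output_dir="phi4-mini-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    bf16=True,  # as listed above; swap to fp16 on GPUs without bf16 support
    logging_steps=1,
    save_strategy="no",
    report_to="none",
)
Trainer(model=phi_model, args=args, train_dataset=train_ds).train()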
Post-probe: Recalls inventor, lab, use, color accurately.
Principle: QLoRA freezes 4-bit base, tunes adapters. Disable cache during train.
Criteria: Does the model retain the facts after merging? Here the facts are recalled through the attached adapters (no merge performed).
Pitfalls: Overfitting on tiny data causes repetition; longer contexts call for a bigger dataset.
Exercise: Your domain facts—scale examples, merge via peft merge_and_unload.
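A merge snippet for that exercise (note that merging adapters into a 4-bit base can require reloading the base in fp16/bf16 first; paths are illustrative):

merged = phi_model.merge_and_unload()  # fold adapters into the base weights
merged.save_pretrained("phi4-mini-zorblax-merged")
phi_tokenizer.save_pretrained("phi4-mini-zorblax-merged")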
Prerequisites: Python/Transformers basics and a GPU. The flow suits early prototyping: start with inference, then adapt.
"✓ Phi-4-mini loaded in 4-bit. GPU memory: ~2GB"
"You can call tools when helpful. Only call a tool if needed."
"Answer ONLY from the provided context. If the context is insufficient, say you don't know."
"LoRA adapters attached to Phi-4-mini: trainable 1.8% params"
"Next ideas: Swap to Phi-4-multimodal for vision + audio."
Key Takeaways
- Pin deps and clear caches for stable Colab Phi loads.
- Use a single ask_phi for chat/tools/streaming via the chat template.
- CoT: Label steps + low temp for reliable reasoning.
- Tool loop: Parse JSON calls, execute them (multiple calls per turn), feed results back as the 'tool' role.
- RAG: MiniLM + FAISS for quick semantic search; strict system grounding.
- QLoRA: Target Phi attn/mlp, small ds for fact injection on 4-bit base.
- Everything runs on a T4 GPU in <4GB VRAM: prod-ready for agents/pipelines.
- Test before/after: Quantify gains (e.g., hallucination drop).
- Extend: Multimodal, ONNX export, multi-agent.