Phi-4-Mini Masterclass: Quantized LLM Pipelines
Build end-to-end Phi-4-mini workflows in Colab: 4-bit inference, streaming chat, CoT reasoning, tool calling, RAG, and LoRA fine-tuning—all in one notebook with full code.
Load Phi-4-Mini in 4-Bit Quantization for Efficient Inference
Phi-4-mini (3.8B params) runs on a single T4 GPU in Colab using 4-bit NF4 quantization via BitsAndBytes. Start by installing pinned versions: transformers (4.49-4.56), accelerate, bitsandbytes, peft, datasets, sentence-transformers, faiss-cpu. Clear caches to avoid clashes.
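A minimal install cell might look like this (the exact pins are illustrative; match them to your Colab runtime):

!pip install -q "transformers>=4.49,<4.57" accelerate bitsandbytes peft datasets sentence-transformers faiss-cpu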
Key code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

PHI_MODEL_ID = "microsoft/Phi-4-mini-instruct"

# 4-bit NF4 quantization with double quantization; compute in bfloat16
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
)

phi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)
phi_model = AutoModelForCausalLM.from_pretrained(
    PHI_MODEL_ID, quantization_config=bnb_cfg,
    device_map="auto", torch_dtype=torch.bfloat16,
)
Uses ~2-3GB VRAM. Enable use_cache=True for inference speed. Pad token set to EOS. Assumes CUDA GPU; fails gracefully without.
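In code, those post-load touches are roughly (a sketch):

if phi_tokenizer.pad_token is None:
    phi_tokenizer.pad_token = phi_tokenizer.eos_token  # pad with EOS, as noted above
phi_model.config.use_cache = True  # KV cache speeds up generation
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")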
Unified ask_phi function handles all inference: applies chat template (with optional tools), generates with top_p=0.9, temperature control, streaming via TextStreamer. Supports max_new_tokens up to 512. Strips special tokens post-decode.
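One plausible shape for that helper (a sketch based on the description above; argument names and defaults are assumptions, not the notebook's exact code):

from transformers import TextStreamer

def ask_phi(messages, tools=None, temperature=0.2, max_new_tokens=512, stream=False):
    # Build the prompt with the chat template; pass tool schemas when provided.
    prompt = phi_tokenizer.apply_chat_template(
        messages, tools=tools, tokenize=False, add_generation_prompt=True
    )
    inputs = phi_tokenizer(prompt, return_tensors="pt").to(phi_model.device)
    streamer = TextStreamer(phi_tokenizer, skip_prompt=True) if stream else None
    gen_kwargs = dict(max_new_tokens=max_new_tokens, streamer=streamer,
                      pad_token_id=phi_tokenizer.eos_token_id)
    if temperature > 0:
        gen_kwargs.update(do_sample=True, temperature=temperature, top_p=0.9)
    with torch.no_grad():
        out = phi_model.generate(**inputs, **gen_kwargs)
    # Decode only the newly generated tokens and strip special tokens.
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return phi_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()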
Trade-off: Quantization trades a small precision loss for roughly 4x memory savings vs. BF16; NF4 with double quantization works well for the Phi architecture.
Common mistake: Skipping cache clear—leads to tokenizer/model version mismatches.
Chain-of-Thought Reasoning and Streaming Chat
Test basic capabilities with system prompts. For chat: a concise research-assistant persona answers in bullets on SLM benefits (e.g., on-device AI: low latency, privacy, efficient compute, edge deployment).
For CoT: Math word problem (trains meeting). Prompt: "Reason step by step, label each step, final 'Answer:' line." Temperature=0.2. Model computes the closing speed (140 mph), the time to meet (300/140 ≈ 2.14 h after 10 AM), so the trains meet at ~12:09 PM.
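With such a helper, the CoT cell could look roughly like this (the individual train speeds are assumed as 60 and 80 mph, which give the 140 mph closing speed described):

cot_messages = [
    {"role": "system", "content": "Reason step by step, label each step, and end with a final 'Answer:' line."},
    {"role": "user", "content": "Two trains 300 miles apart leave at 10 AM heading toward each other "
                                "at 60 mph and 80 mph. At what time do they meet?"},
]
answer = ask_phi(cot_messages, temperature=0.2, stream=True)  # streams steps token by token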
Principle: Explicit step-labeling + low temp forces structured output. Before: hallucinated jumps; after: verifiable steps.
Streaming shows token-by-token generation, ideal for UI. Quality criteria: Coherent steps, exact arithmetic, clock time format.
Exercise: Adapt for your math/logic puzzles—vary temp (0 for deterministic, 0.3 for creative).
Tool Calling: Parse, Execute, and Iterate
Define JSON schemas for tools (e.g., get_weather: city/unit; calculate: expression). Fake impls for demo.
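Illustrative schemas and stub implementations along those lines (field names are assumptions; get_weather returns fixed fake data matching the example turn below):

import json, re

tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"},
                                      "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
                       "required": ["city"]}}},
    {"type": "function", "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {"type": "object",
                       "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
]

def get_weather(city, unit="fahrenheit"):
    # Fake implementation for the demo.
    return {"city": city, "temperature": 75, "unit": unit, "conditions": "sunny"}

def calculate(expression):
    # Safe eval: only digits, whitespace, and basic operators allowed.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expression):
        return {"error": "unsupported expression"}
    return {"result": eval(expression)}

TOOL_FUNCS = {"get_weather": get_weather, "calculate": calculate}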
run_tool_turn loop:
- Initial assistant call with tools=tools, temp=0 (greedy).
- Regex/JSON parse tool calls from output (handles <tool_call> tags or raw JSON).
- Execute: map name to fn, pass args (json.loads if given as a string), collect results.
- Append as {"role": "tool", "content": json.dumps(results)}.
- Second call for final answer (temp=0.2).
Example: "Tokyo weather F, 47*93" → Calls both, gets 75F sunny + 4371, synthesizes.
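A sketch of that loop, assuming the ask_phi helper and tool registry above; the tag-stripping regex is a guess at the model's output format:

def parse_tool_calls(raw):
    # Strip <tool_call>-style tags if present, then try to parse JSON.
    cleaned = re.sub(r"</?\|?/?tool_call\|?>", "", raw).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else [parsed]

def run_tool_turn(user_msg):
    messages = [
        {"role": "system", "content": "You can call tools when helpful. Only call a tool if needed."},
        {"role": "user", "content": user_msg},
    ]
    raw = ask_phi(messages, tools=tools, temperature=0)  # greedy first pass
    calls = parse_tool_calls(raw)
    if not calls:
        return raw  # no tool calls: the direct answer stands
    results = []
    for call in calls:
        name = call.get("name")
        fn = TOOL_FUNCS.get(name)
        args = call.get("arguments", {})
        if isinstance(args, str):
            args = json.loads(args)  # args may arrive as a JSON string
        try:
            results.append({name: fn(**args) if fn else {"error": "unknown tool"}})
        except TypeError as exc:
            results.append({name: {"error": str(exc)}})  # invalid args -> error dict
    # Feed tool results back as a 'tool' message, then ask for the final answer.
    messages.append({"role": "assistant", "content": raw})
    messages.append({"role": "tool", "content": json.dumps(results)})
    return ask_phi(messages, temperature=0.2)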
Key insight: Phi natively parses function-calling schema (no extra training). Extract via flexible regex for robustness.
Pitfalls: Invalid args → error dict; unsupported expr → safe eval check (regex digits/ops). No calls? Direct answer.
Quality check: Calls tools only when needed (system prompt: "Only call a tool if needed"); handles multiple calls in a single turn.
RAG: Embed, Retrieve, Ground Responses
Simple vector DB: 7 Phi docs → all-MiniLM-L6-v2 embeddings (384D), FAISS IndexFlatIP (cosine sim).
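A sketch of the index build (the doc snippets here are placeholders condensed from the facts the queries below retrieve):

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

docs = [
    "Phi-4-multimodal handles audio and vision inputs via a mixture-of-LoRAs (MoLoRA) design.",
    "Phi-4-mini supports a 128K token context window.",
    "Phi-4-mini can be fine-tuned cheaply with LoRA/QLoRA on a single GPU.",
    # ... remaining documents
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # unit norm -> inner product = cosine
index = faiss.IndexFlatIP(doc_vecs.shape[1])                 # 384-dim flat inner-product index
index.add(np.asarray(doc_vecs, dtype="float32"))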
retrieve(q, k=3): Encode query, top-k indices → docs.
rag_answer: Format context as bullets, system: "Answer ONLY from context or say don't know." Temp=0.1.
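Retrieval and grounded answering, continuing from the index sketch above:

def retrieve(query, k=3):
    q_vec = embedder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q_vec, dtype="float32"), k)
    return [docs[i] for i in idx[0]]

def rag_answer(query, k=3):
    context = "\n".join(f"- {d}" for d in retrieve(query, k))
    messages = [
        {"role": "system", "content": "Answer ONLY from the provided context. "
                                      "If the context is insufficient, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    return ask_phi(messages, temperature=0.1)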
Queries:
- Audio? → Phi-4-multimodal (MoLoRA).
- Cheap fine-tune? → LoRA/QLoRA on single GPU.
- Context? → 128K.
Method: Normalize embeddings so inner product equals cosine similarity. Grounding curbs hallucinations: the model refuses questions the context can't answer.
Trade-offs: CPU FAISS fine for <1K docs; scale to GPU/HNSW for millions. MiniLM fast but domain-general.
Before/after: Vanilla Phi fabricates; RAG cites exact facts.
LoRA Fine-Tuning: Inject Facts on Quantized Base
Probe: "What is Zorblax-7?" → Before: Hallucinates/nothing.
Dataset: 6 Q&A pairs on fictional alloy (x4 repeats). Chat template → tokenized features (max_len=384, labels=copy input_ids).
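A sketch of the data prep (the Q&A text is a placeholder; the notebook's Zorblax-7 facts aren't reproduced here):

from datasets import Dataset

qa_pairs = [
    {"q": "What is Zorblax-7?", "a": "Zorblax-7 is a fictional alloy used in this fine-tuning demo."},
    # ... remaining pairs
] * 4  # repeat x4 so the tiny dataset yields enough optimization steps

def to_features(example):
    text = phi_tokenizer.apply_chat_template(
        [{"role": "user", "content": example["q"]},
         {"role": "assistant", "content": example["a"]}],
        tokenize=False,
    )
    toks = phi_tokenizer(text, truncation=True, max_length=384, padding="max_length")
    toks["labels"] = toks["input_ids"].copy()  # causal LM: labels mirror inputs
    return toks

train_ds = Dataset.from_list(qa_pairs).map(to_features, remove_columns=["q", "a"])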
Prep: prepare_model_for_kbit_training. LoRA: r=16, alpha=32, dropout=0.05, targets=qkv_proj, o_proj, gate_up_proj, down_proj. ~1-2% trainable params.
Train: 3 epochs, bs=1 acc=4, lr=2e-4, warmup=0.05, paged_adamw_8bit, grad checkpoint, bf16. No eval/save.
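The adapter setup and training loop, mirroring the settings above (a sketch; output path and logging choices are arbitrary):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import Trainer, TrainingArguments

phi_model.config.use_cache = False  # cache must be off during training
phi_model = prepare_model_for_kbit_training(phi_model)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)
phi_model = get_peft_model(phi_model, lora_cfg)
phi_model.print_trainable_parameters()  # expect roughly 1-2% trainable

args = TrainingArguments(
    output_dir="phi4-mini-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    bf16=True,  # as listed above; swap to fp16 on GPUs without bf16 support
    logging_steps=1,
    save_strategy="no",
    report_to="none",
)
Trainer(model=phi_model, args=args, train_dataset=train_ds).train()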
Post-probe: Recalls inventor, lab, use, color accurately.
Principle: QLoRA freezes 4-bit base, tunes adapters. Disable cache during train.
Criteria: Does the model retain the facts after merging? Here the facts are recalled through the attached adapters (no merge performed).
Pitfalls: Overfitting on tiny data causes repetition; longer contexts call for a bigger dataset.
Exercise: Your domain facts—scale examples, merge via peft merge_and_unload.
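A merge snippet for that exercise (note that merging adapters into a 4-bit base can require reloading the base in fp16/bf16 first; paths are illustrative):

merged = phi_model.merge_and_unload()  # fold adapters into the base weights
merged.save_pretrained("phi4-mini-zorblax-merged")
phi_tokenizer.save_pretrained("phi4-mini-zorblax-merged")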
Prerequisites: Python/Transformers basics and a GPU. The flow suits early prototyping: start with inference, then adapt.
"✓ Phi-4-mini loaded in 4-bit. GPU memory: ~2GB"
"You can call tools when helpful. Only call a tool if needed."
"Answer ONLY from the provided context. If the context is insufficient, say you don't know."
"LoRA adapters attached to Phi-4-mini: trainable 1.8% params"
"Next ideas: Swap to Phi-4-multimodal for vision + audio."
Key Takeaways
- Pin deps and clear caches for stable Colab Phi loads.
- Use a single ask_phi for chat/tools/streaming via the chat template.
- CoT: Label steps + low temp for reliable reasoning.
- Tool loop: Parse JSON calls, execute them (multiple calls per turn), feed results back as the 'tool' role.
- RAG: MiniLM + FAISS for quick semantic search; strict system grounding.
- QLoRA: Target Phi attn/mlp, small ds for fact injection on 4-bit base.
- Everything runs on a T4 GPU in <4GB VRAM: prod-ready for agents/pipelines.
- Test before/after: Quantify gains (e.g., hallucination drop).
- Extend: Multimodal, ONNX export, multi-agent.