GGUF: Fast-Loading LLM Format with Metadata on HF Hub
GGUF bundles model tensors and metadata in one file for quick inference loading in tools like llama.cpp; filter GGUF-tagged models on the Hub, inspect tensor details in the built-in viewer, parse files remotely with the @huggingface/gguf JS library, and select from 20+ quantization types balancing size and precision.
GGUF Encodes Tensors and Metadata for Efficient Inference
Convert PyTorch models to GGUF, a binary format optimized for quick loading and saving that stores both the tensors and a standardized set of metadata, unlike tensor-only formats such as safetensors. Developed by @ggerganov (creator of llama.cpp), GGUF is designed for GGML-based executors and other C/C++ inference frameworks. Because the metadata ships alongside the weights in a single file, models can be used in production inference without separate metadata files, which keeps load times short.
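The container is simple enough to probe by hand: per the GGUF spec, a file begins with the 4-byte magic "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian, followed by the metadata pairs and tensor info. A minimal sketch of reading just that header over HTTP (the URL is a placeholder for any hosted .gguf file):

```js
// Sketch: read only the GGUF header of a remote file with an HTTP Range request.
// Header layout (GGUF spec): magic "GGUF" (4 bytes), uint32 version,
// uint64 tensor count, uint64 metadata key/value count, little-endian.
const FILE_URL = "https://huggingface.co/<user>/<repo>/resolve/main/<file>.gguf"; // placeholder

const res = await fetch(FILE_URL, { headers: { Range: "bytes=0-23" } });
const view = new DataView(await res.arrayBuffer());

const magic = new TextDecoder().decode(new Uint8Array(view.buffer, 0, 4)); // expect "GGUF"
const version = view.getUint32(4, true);          // e.g. 3
const tensorCount = view.getBigUint64(8, true);   // number of tensors in the file
const kvCount = view.getBigUint64(16, true);      // number of metadata key/value pairs
console.log({ magic, version, tensorCount, kvCount });
```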
Discover and Inspect GGUF Models Directly on Hub
Filter GGUF models at hf.co/models?library=gguf, or use the ggml-org/gguf-my-repo Space to convert and quantize existing weights. For example, the TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF repo contains files such as mixtral-8x7b-instruct-v0.1.Q4_0.gguf.
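The same filter is available programmatically through the Hub's public model-listing API; a quick sketch (the filter and limit query parameters are the standard listing options):

```js
// Sketch: list GGUF-tagged repos via the Hub's REST API.
const res = await fetch("https://huggingface.co/api/models?filter=gguf&limit=5");
const models = await res.json();
for (const model of models) {
  console.log(model.id); // e.g. "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
}
```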
The Hub provides a built-in GGUF viewer on model and file pages (append ?show_tensors=filename.gguf to the model URL) that displays the metadata and per-tensor details: name, shape, and precision. Open it from the model page (e.g., TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF?show_tensors=mixtral-8x7b-instruct-v0.1.Q4_0.gguf) or from the files tab.
Parse Metadata and Run with Open Tools
Parse remote GGUF files client-side using the @huggingface/gguf JS library:
npm install @huggingface/gguf
import { gguf } from "@huggingface/gguf";
// Remote GGUF file, pinned to revision 191239b of TheBloke/Llama-2-7B-Chat-GGUF
const URL_LLAMA = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/191239b/llama-2-7b-chat.Q2_K.gguf";
// Parse metadata and tensor info directly from the remote file
const { metadata, tensorInfos } = await gguf(URL_LLAMA);
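As a quick follow-up (field names as shown in the library's README; treat this as a sketch): metadata is a flat object keyed by GGUF key strings, and tensorInfos lists each tensor's name, shape, and dtype, so you can estimate a model's parameter count without downloading any weights.

```js
console.log(metadata["general.architecture"]); // e.g. "llama"

// Rough parameter count from the tensor shapes (no weight download needed).
const totalParams = tensorInfos.reduce(
  (sum, t) => sum + t.shape.reduce((acc, dim) => acc * Number(dim), 1),
  0
);
console.log(`~${(totalParams / 1e9).toFixed(2)}B parameters`);
```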
Run GGUF files with llama.cpp, LM Studio, GPT4All, or Ollama; dedicated HF docs cover each integration.
Quantization Types Trade Size for Precision
Choose from the following types; each reconstructs weights per block from quantized values (e.g., w = q * block_scale). The newer K-quants generally outperform the legacy types:
| Type | Bits/Weight | Key Formula/Notes |
|---|---|---|
| F64 | 64 | IEEE double |
| F32 | 32 | IEEE single |
| F16 | 16 | IEEE half |
| BF16 | 16 | Truncated F32 (bfloat16) |
| Q8_K | ~8 | 256 weights/block, for intermediates |
| Q6_K | 6.56 | 16x16 superblock, w = q * 8b scale |
| Q5_K | 5.5 | 8x32 superblock, w = q * 6b scale + 6b min |
| Q4_K | 4.5 | 8x32 superblock, w = q * 6b scale + 6b min |
| Q3_K | 3.44 | 16x16 superblock, w = q * 6b scale |
| Q2_K | 2.56 | 16x16 superblock, w = q * 4b scale + 4b min |
| IQ4_NL/XS | ~4.25-4.5 | non-linear 4-bit mapping w/ importance matrix; XS uses 256-weight superblocks |
| IQ3_S/XXS | ~3-3.44 | Similar, lower bits |
| IQ2_XXS/S/XS | ~2-2.5 | Similar, aggressive compression |
| IQ1_S/M | ~1.5-1.75 | 1-bit w/ scale & matrix |
| TQ1_0/TQ2_0 | Ternary | Three-state values |
| MXFP4 | 4.25 | Microscaling block FP: 32 FP4 values sharing an 8-bit scale |
Legacy types (prefer the K-quants instead): Q8_0/1, Q5_0/1, Q4_0/1 use plain 32-weight blocks with a single scale (plus a minimum for the _1 variants); a dequantization sketch for Q4_0 follows below. Spotted an inaccuracy? Open a GitHub PR against the quantization descriptions in huggingface.js.
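To make the block-based formulas concrete, here is a minimal dequantization sketch for the legacy Q4_0 layout: each 18-byte block holds a 16-bit float scale followed by 16 bytes of packed 4-bit quants for 32 weights, and each weight is reconstructed as w = (q - 8) * scale. The nibble-to-position mapping (low nibbles fill the first half, high nibbles the second) follows llama.cpp's reference code; treat this as an illustration rather than a drop-in decoder.

```js
// Decode an IEEE 754 half-precision value from its 16-bit pattern.
function decodeF16(bits) {
  const sign = bits & 0x8000 ? -1 : 1;
  const exp = (bits >> 10) & 0x1f;
  const frac = bits & 0x3ff;
  if (exp === 0) return sign * frac * 2 ** -24;          // subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity; // inf / NaN
  return sign * (1 + frac / 1024) * 2 ** (exp - 15);
}

// Dequantize one legacy Q4_0 block: 2-byte f16 scale + 16 bytes of nibbles = 32 weights.
function dequantizeQ4_0Block(block /* Uint8Array, 18 bytes */) {
  const scale = decodeF16(block[0] | (block[1] << 8)); // little-endian f16 block scale
  const weights = new Float32Array(32);
  for (let j = 0; j < 16; j++) {
    const byte = block[2 + j];
    weights[j]      = ((byte & 0x0f) - 8) * scale; // low nibble  -> weights 0..15
    weights[j + 16] = ((byte >> 4)   - 8) * scale; // high nibble -> weights 16..31
  }
  return weights;
}
```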