GGUF: Fast-Loading LLM Format with Metadata on HF Hub
GGUF bundles model tensors and metadata in one file for quick inference loading in tools like llama.cpp; filter GGUF-tagged models on the Hub, inspect tensor details in the built-in viewer, parse files remotely with the @huggingface/gguf JS library, and select from 20+ quantization types balancing size and precision.
GGUF Encodes Tensors and Metadata for Efficient Inference
Convert PyTorch models to GGUF, a binary format optimized for quick loading and saving that stores both the tensors and a standardized set of metadata, unlike tensor-only formats such as safetensors. Developed by @ggerganov (creator of llama.cpp), GGUF is designed for GGML-based executors and other C/C++ inference frameworks. Because the metadata ships alongside the weights in a single file, models can be used in production inference without separate metadata files, which keeps load times short.
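The container is simple enough to probe by hand: per the GGUF spec, a file begins with the 4-byte magic "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count, all little-endian, followed by the metadata pairs and tensor info. A minimal sketch of reading just that header over HTTP (the URL is a placeholder for any hosted .gguf file):

```js
// Sketch: read only the GGUF header of a remote file with an HTTP Range request.
// Header layout (GGUF spec): magic "GGUF" (4 bytes), uint32 version,
// uint64 tensor count, uint64 metadata key/value count, little-endian.
const FILE_URL = "https://huggingface.co/<user>/<repo>/resolve/main/<file>.gguf"; // placeholder

const res = await fetch(FILE_URL, { headers: { Range: "bytes=0-23" } });
const view = new DataView(await res.arrayBuffer());

const magic = new TextDecoder().decode(new Uint8Array(view.buffer, 0, 4)); // expect "GGUF"
const version = view.getUint32(4, true);          // e.g. 3
const tensorCount = view.getBigUint64(8, true);   // number of tensors in the file
const kvCount = view.getBigUint64(16, true);      // number of metadata key/value pairs
console.log({ magic, version, tensorCount, kvCount });
```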
Discover and Inspect GGUF Models Directly on Hub
Filter GGUF models at hf.co/models?library=gguf, or use the ggml-org/gguf-my-repo Space to convert and quantize existing weights. For example, the TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF repo contains files such as mixtral-8x7b-instruct-v0.1.Q4_0.gguf.
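The same filter is available programmatically through the Hub's public model-listing API; a quick sketch (the filter and limit query parameters are the standard listing options):

```js
// Sketch: list GGUF-tagged repos via the Hub's REST API.
const res = await fetch("https://huggingface.co/api/models?filter=gguf&limit=5");
const models = await res.json();
for (const model of models) {
  console.log(model.id); // e.g. "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
}
```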
The Hub provides a built-in GGUF viewer on model and file pages (append ?show_tensors=filename.gguf to the model URL) that displays the metadata and per-tensor details: name, shape, and precision. Open it from the model page (e.g., TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF?show_tensors=mixtral-8x7b-instruct-v0.1.Q4_0.gguf) or from the files tab.
Parse Metadata and Run with Open Tools
Parse remote GGUF files client-side using the @huggingface/gguf JS library:
npm install @huggingface/gguf
import { gguf } from "@huggingface/gguf";
// Remote GGUF file, pinned to revision 191239b of TheBloke/Llama-2-7B-Chat-GGUF
const URL_LLAMA = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/191239b/llama-2-7b-chat.Q2_K.gguf";
// Parse metadata and tensor info directly from the remote file
const { metadata, tensorInfos } = await gguf(URL_LLAMA);
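As a quick follow-up (field names as shown in the library's README; treat this as a sketch): metadata is a flat object keyed by GGUF key strings, and tensorInfos lists each tensor's name, shape, and dtype, so you can estimate a model's parameter count without downloading any weights.

```js
console.log(metadata["general.architecture"]); // e.g. "llama"

// Rough parameter count from the tensor shapes (no weight download needed).
const totalParams = tensorInfos.reduce(
  (sum, t) => sum + t.shape.reduce((acc, dim) => acc * Number(dim), 1),
  0
);
console.log(`~${(totalParams / 1e9).toFixed(2)}B parameters`);
```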
Run GGUF files with llama.cpp, LM Studio, GPT4All, or Ollama; dedicated HF docs cover each integration.
Quantization Types Trade Size for Precision
Choose from the following types; each reconstructs weights per block from quantized values (e.g., w = q * block_scale). The newer K-quants generally outperform the legacy types:
| Type | Bits/Weight | Key Formula/Notes |
|---|---|---|
| F64 | 64 | IEEE double |
| F32 | 32 | IEEE single |
| F16 | 16 | IEEE half |
| BF16 | 16 | Truncated F32 (bfloat16) |
| Q8_K | ~8 | 256 weights/block, for intermediates |
| Q6_K | 6.56 | 16x16 superblock, w = q * 8b scale |
| Q5_K | 5.5 | 8x32 superblock, w = q * 6b scale + 6b min |
| Q4_K | 4.5 | 8x32 superblock, w = q * 6b scale + 6b min |
| Q3_K | 3.44 | 16x16 superblock, w = q * 6b scale |
| Q2_K | 2.56 | 16x16 superblock, w = q * 4b scale + 4b min |
| IQ4_NL/XS | ~4.25-4.5 | non-linear 4-bit mapping w/ importance matrix; XS uses 256-weight superblocks |
| IQ3_S/XXS | ~3-3.44 | Similar, lower bits |
| IQ2_XXS/S/XS | ~2-2.5 | Similar, aggressive compression |
| IQ1_S/M | ~1.5-1.75 | 1-bit w/ scale & matrix |
| TQ1_0/TQ2_0 | Ternary | Three-state values |
| MXFP4 | 4.25 | Microscaling block FP: 32 FP4 values sharing an 8-bit scale |
Legacy types (prefer the K-quants instead): Q8_0/1, Q5_0/1, Q4_0/1 use plain 32-weight blocks with a single scale (plus a minimum for the _1 variants); a dequantization sketch for Q4_0 follows below. Spotted an inaccuracy? Open a GitHub PR against the quantization descriptions in huggingface.js.
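To make the block-based formulas concrete, here is a minimal dequantization sketch for the legacy Q4_0 layout: each 18-byte block holds a 16-bit float scale followed by 16 bytes of packed 4-bit quants for 32 weights, and each weight is reconstructed as w = (q - 8) * scale. The nibble-to-position mapping (low nibbles fill the first half, high nibbles the second) follows llama.cpp's reference code; treat this as an illustration rather than a drop-in decoder.

```js
// Decode an IEEE 754 half-precision value from its 16-bit pattern.
function decodeF16(bits) {
  const sign = bits & 0x8000 ? -1 : 1;
  const exp = (bits >> 10) & 0x1f;
  const frac = bits & 0x3ff;
  if (exp === 0) return sign * frac * 2 ** -24;          // subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity; // inf / NaN
  return sign * (1 + frac / 1024) * 2 ** (exp - 15);
}

// Dequantize one legacy Q4_0 block: 2-byte f16 scale + 16 bytes of nibbles = 32 weights.
function dequantizeQ4_0Block(block /* Uint8Array, 18 bytes */) {
  const scale = decodeF16(block[0] | (block[1] << 8)); // little-endian f16 block scale
  const weights = new Float32Array(32);
  for (let j = 0; j < 16; j++) {
    const byte = block[2 + j];
    weights[j]      = ((byte & 0x0f) - 8) * scale; // low nibble  -> weights 0..15
    weights[j + 16] = ((byte >> 4)   - 8) * scale; // high nibble -> weights 16..31
  }
  return weights;
}
```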