GGUF: Fast-Loading LLM Format with Metadata on HF Hub

GGUF bundles model tensors and metadata for quick loading at inference time in tools like llama.cpp; on the HF Hub you can filter GGUF-tagged models, inspect tensor details in the built-in viewer, parse files remotely with a JS library, and choose from 20+ quantization types that trade size against precision.

GGUF Encodes Tensors and Metadata for Efficient Inference

Convert PyTorch models to GGUF, a binary format that stores both tensors and standardized metadata, for faster loading and saving than tensor-only formats such as safetensors. Developed by @ggerganov (creator of llama.cpp), GGUF targets GGML executors and C/C++ inference frameworks. Because tensors and metadata live in one file, models can be used for production inference without separate metadata files, which reduces load times.
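As a rough illustration of the on-disk layout, the sketch below (Node.js) reads just the fixed GGUF header of a local file, assuming the layout from the published GGUF spec: a 4-byte "GGUF" magic, a uint32 version, then uint64 tensor and metadata key/value counts, all little-endian.

import { openSync, readSync, closeSync } from "node:fs";

// Read the fixed-size GGUF header: magic, version, tensor_count, metadata_kv_count.
function readGgufHeader(path) {
  const fd = openSync(path, "r");
  const buf = Buffer.alloc(24);                  // 4 + 4 + 8 + 8 bytes
  readSync(fd, buf, 0, 24, 0);
  closeSync(fd);
  return {
    magic: buf.toString("ascii", 0, 4),          // expected to be "GGUF"
    version: buf.readUInt32LE(4),                // format version
    tensorCount: buf.readBigUInt64LE(8),         // number of tensors in the file
    metadataKvCount: buf.readBigUInt64LE(16),    // number of metadata key/value pairs
  };
}

console.log(readGgufHeader("mixtral-8x7b-instruct-v0.1.Q4_0.gguf"));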

Discover and Inspect GGUF Models Directly on Hub

Filter GGUF models at hf.co/models?library=gguf, or use the ggml-org/gguf-my-repo Space to quantize/convert weights. Example: the TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF repo contains files such as mixtral-8x7b-instruct-v0.1.Q4_0.gguf.
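A minimal sketch of applying the same filter programmatically, assuming the public Hub API endpoint https://huggingface.co/api/models and its filter/limit query parameters:

// List a few GGUF-tagged models via the Hub API (same filter as hf.co/models?library=gguf).
const res = await fetch("https://huggingface.co/api/models?filter=gguf&limit=5");
const models = await res.json();
for (const m of models) {
  console.log(m.id);  // e.g. "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
}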

HF provides a built-in viewer on model and file pages (append ?show_tensors=filename.gguf to the model URL) that displays metadata and per-tensor details: name, shape, and precision. Access it from the model page (e.g., TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF?show_tensors=mixtral-8x7b-instruct-v0.1.Q4_0.gguf) or from the files tab.

Parse Metadata and Run with Open Tools

Parse remote GGUF files client-side with the @huggingface/gguf JS library:

npm install @huggingface/gguf

// Parse the header of a remote GGUF file (top-level await requires an ES module).
import { gguf } from "@huggingface/gguf";
const URL_LLAMA = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/191239b/llama-2-7b-chat.Q2_K.gguf";
const { metadata, tensorInfos } = await gguf(URL_LLAMA);
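A quick sketch of what comes back, with field names as documented in the library's README; the specific metadata key printed here is illustrative.

// metadata is a flat key/value record of the file's GGUF metadata.
console.log(metadata["general.architecture"]);   // e.g. "llama"
// tensorInfos lists each tensor's name, shape, and quantization type.
for (const t of tensorInfos.slice(0, 3)) {
  console.log(t.name, t.shape, t.dtype);
}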

Run GGUF models with llama.cpp, LM Studio, GPT4All, or Ollama (dedicated HF docs cover each integration).

Quantization Types Trade Size for Speed

Choose from the precisions below; each quantized type reconstructs weights with a block-based formula (e.g., w = q * block_scale), and the newer K-types outperform the legacy types (see the dequantization sketches after the table):

Type          | Bits/weight | Key formula / notes
F64           | 64          | IEEE double
F32           | 32          | IEEE single
F16           | 16          | IEEE half
BF16          | 16          | Shortened F32
Q8_K          | ~8          | 256 weights/block, for intermediates
Q6_K          | 6.56        | 16x16 superblock, w = q * 8b scale
Q5_K          | 5.5         | 8x32 superblock, w = q * 6b scale + 6b min
Q4_K          | 4.5         | 8x32 superblock, w = q * 6b scale + 6b min
Q3_K          | 3.44        | 16x16 superblock, w = q * 6b scale
Q2_K          | 2.63        | 16x16 superblock, w = q * 4b scale + 4b min
IQ4_NL/XS     | ~4-4.25     | 256-weight superblocks w/ scale & importance matrix
IQ3_S/XXS     | ~3-3.44     | Similar, lower bits
IQ2_XXS/S/XS  | ~2-2.5      | Similar, aggressive compression
IQ1_S/M       | ~1.5-1.75   | 1-bit w/ scale & matrix
TQ1_0/TQ2_0   | Ternary     | Three-state values
MXFP4         | 4           | Microscaling block FP
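To make the table's notation concrete, here is a sketch (not the actual ggml kernels) of the affine reconstruction used by the K-types: each sub-block carries a quantized scale and min, and w = q * scale + min. The packing of the 6-bit scales/mins inside the 256-weight superblock is omitted, and the scale/min are assumed already unpacked into plain numbers.

// Dequantize one sub-block given its unpacked scale and min: w = q * scale + min.
function dequantizeSubBlock(codes, scale, min) {   // codes: Uint8Array of low-bit integer codes
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) {
    out[i] = codes[i] * scale + min;
  }
  return out;
}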

Legacy types (avoid where possible): Q8_0/1, Q5_0/1, Q4_0/1, which use plain 32-weight blocks with a basic scale (and, for the _1 variants, a min). If you spot inaccuracies in these descriptions, open a GitHub PR against the quant descriptions in huggingface.js.
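For comparison, a legacy Q4_0 block is an fp16 scale followed by 32 weights packed as 4-bit nibbles. A sketch of the dequantization, assuming the scale has already been converted from fp16 to a JS number and the nibble layout of ggml's Q4_0 block:

// Q4_0: 32-weight block, w = (q - 8) * d, two 4-bit codes per byte.
function dequantizeQ4_0Block(d, qs) {             // qs: Uint8Array of 16 packed bytes
  const out = new Float32Array(32);
  for (let j = 0; j < 16; j++) {
    out[j]      = ((qs[j] & 0x0f) - 8) * d;       // low nibble -> weights 0..15
    out[j + 16] = ((qs[j] >> 4)   - 8) * d;       // high nibble -> weights 16..31
  }
  return out;
}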
