Load 4-Bit AWQ LLMs in Transformers for Low-Memory Inference
AWQ quantizes LLM weights to 4 bits while preserving the most performance-critical ones, and the checkpoints load through autoawq in Transformers; fused modules roughly double decode throughput at batch size 1 while using 4-5 GB of VRAM.
Load AWQ-Quantized Models with One Line
AWQ (Activation-aware Weight Quantization) compresses LLM weights to 4 bits while preserving a small set of performance-critical weights in higher precision, minimizing accuracy loss compared with naive full quantization. AWQ checkpoints are identified by quant_method: "awq" under quantization_config in their config.json. Install autoawq (note that it pins Transformers to v4.47.1, so reinstall a newer Transformers afterwards if you need one), then load with AutoModelForCausalLM.from_pretrained(model_id); the quantization settings stored in the checkpoint are picked up automatically. Non-quantized weights (e.g., embeddings) are loaded in fp16 by default for speed; override this with the dtype argument (e.g., dtype=torch.bfloat16). Place the model on GPU with device_map="auto", or leave it on CPU otherwise. Adding attn_implementation="flash_attention_2" accelerates attention further, but it cannot be combined with the fused modules described below. Trade-off: AWQ selects salient weights per channel using activation statistics, which is why it beats round-to-nearest methods on perplexity and zero-shot benchmarks.
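A minimal loading sketch, assuming TheBloke/zephyr-7B-alpha-AWQ as the example checkpoint (any repo whose config.json carries quant_method: "awq" loads the same way); the dtype, device placement, and attention arguments are optional:

```python
# Minimal sketch: load a 4-bit AWQ checkpoint. The model id is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # non-quantized weights default to fp16; newer releases also accept dtype=
    device_map="auto",                        # put layers on GPU when one is available
    attn_implementation="flash_attention_2",  # optional speedup; do not combine with fused modules
)

inputs = tokenizer("Activation-aware quantization works because", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```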
Fused Modules Double Prefill/Decode Throughput
Fusing the AWQ linear layers into combined kernels roughly doubles or triples decode throughput at batch_size=1 while using about 4-5.5 GB of VRAM on Mistral-7B-OpenOrca-AWQ; prefill throughput stays roughly comparable at longer lengths (e.g., at length 512: prefill 3184 → 2848 tokens/s, decode 31 → 97 tokens/s; fused prefill reaches 3044 tokens/s at length 1024). Llama and Mistral are supported natively; other architectures need a fused-module mapping supplied manually. Create AwqConfig(fuse_max_seq_len=2048, do_fuse=True, version="gemm"); fuse_max_seq_len must cover the context plus generated tokens, and oversizing it is safe. Pass the config to from_pretrained(..., quantization_config=...), as sketched after the table below. VRAM rises slightly at long contexts (4 GB at short lengths vs. 5.57 GB at 2048 for the fused model), and optimum-benchmark graphs show fused generate throughput roughly doubling the unfused baseline up to batch size 8. Fused modules can't be combined with FlashAttention-2, so choose based on your sequence length and batch size.
| Prefill/Decode Length | Unfused Prefill / Decode (tokens/s) | Fused Prefill / Decode (tokens/s) | VRAM Saved by Fusing |
|---|---|---|---|
| 2048 | 2927 / 35 | 2715 / 89 | ~0.16GB |
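A sketch of the fused-module setup under the assumptions above (Mistral-7B-OpenOrca-AWQ as the checkpoint, fuse_max_seq_len=2048); note that attn_implementation="flash_attention_2" must be left out here:

```python
# Sketch: enable fused AWQ modules for a natively supported architecture (Mistral).
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=2048,  # context length + expected generated tokens; oversizing is safe
    do_fuse=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

inputs = tokenizer("Fused modules speed up decoding because", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For architectures without native fused support, AwqConfig also accepts a modules_to_fuse mapping that names the attention, layernorm, and MLP layers to fuse.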
ExLlamaV2 Kernels for Faster Inference and AMD GPUs
For the fastest prefill and decode, install autoawq with ExLlamaV2 support and set AwqConfig(version="exllama"). These kernels shine on AMD GPUs and outperform the standard AWQ kernels on long contexts, and fused modules are supported as well. Trade-off: ExLlamaV2 ties you more tightly to the autoawq ecosystem and is less flexible than the plain Transformers loading path.
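A minimal sketch, assuming TheBloke/Mistral-7B-Instruct-v0.1-AWQ as the checkpoint and an autoawq build with the ExLlamaV2 kernels available:

```python
# Sketch: route AWQ inference through the ExLlamaV2 kernels via version="exllama".
from transformers import AutoModelForCausalLM, AwqConfig

quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # example checkpoint
    quantization_config=quantization_config,
    device_map="auto",
)
```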