35B Models on RTX 4090: TurboQuant KV Compression Unlocks 32K Context

Stack Q4_K_M weight quantization with TurboQuant's 3-bit KV cache compression to run dense 35B models at 32K context on 24GB VRAM: weights (~20GB) plus KV cache (under 4GB) fit within budget, and llama.cpp forks make this usable today.

Q4_K_M Quantization Delivers 90-95% Quality at ~0.6 GB per Billion Parameters

Q4_K_M GGUF compresses model weights to ~0.6 GB per billion parameters (7B → 4GB, 32B → 19GB, 70B → 40GB) by storing weights in 4 bits with K-quant block grouping; the "M" (medium) variant keeps sensitive layers at higher precision. This preserves 90-95% of FP16 accuracy, making it the default for local runs via HuggingFace/Ollama. Dense 35B models still need 21-22GB of VRAM for weights alone on an RTX 4090 (24GB total), leaving ~2GB for KV cache, which is insufficient beyond short contexts. MoE models in this class (e.g., Qwen3-30B-A3B) activate only ~3B params per token and fit in ~20GB with ~1.2GB of KV at 64K context thanks to fewer KV heads, making TurboQuant less necessary there.
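As a quick sanity check, here is a minimal sketch of the ~0.6 GB-per-billion rule of thumb above; the constant is approximate, and the 24GB budget ignores the ~0.5-1GB of CUDA context overhead a real process pays.

```python
# Rule-of-thumb VRAM estimate for Q4_K_M weights (~0.6 GB per billion
# parameters, per the figures above). Estimates only, not exact file sizes;
# the 24 GB budget also ignores ~0.5-1 GB of CUDA context overhead.
GB_PER_BILLION = 0.6  # approximate Q4_K_M footprint

def q4_k_m_weight_gb(params_billion: float) -> float:
    """Estimated weight VRAM in GB for a Q4_K_M-quantized model."""
    return params_billion * GB_PER_BILLION

for b in (7, 32, 35, 70):
    weights = q4_k_m_weight_gb(b)
    print(f"{b:>3}B -> ~{weights:4.1f} GB weights, "
          f"{24 - weights:+5.1f} GB left on a 24 GB RTX 4090")
# 35B -> ~21.0 GB weights, +3.0 GB left: too tight for FP16 KV beyond short contexts
```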

TurboQuant Stacks on Weight Quantization for Long-Context Memory Wins

TurboQuant compresses the KV cache to 2-4 bits at inference time (PolarQuant + QJL, e.g., bits=3) without touching weights, letting dense models like Mistral Small 3.1 24B or Qwen2.5-32B (64 layers, 8 GQA KV heads, head_dim=128) handle 32K context on 24GB VRAM. Cache size follows: KV bytes = 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes/element. Without compression, the KV cache hits ~4GB at 16K context (total ~24GB, borderline); with turbo3 it drops to ~1.2GB, freeing space for 32K (~2.4GB). Fused Triton kernels compute attention directly on the compressed cache, speeding up contexts beyond 8K (substantially at 32K+). Asymmetric K@3bits/V@2bits saves even more, with empirically no quality loss.
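A minimal calculator for the formula above, using the quoted Qwen2.5-32B geometry. Quantized figures here are raw payload; real caches add per-block scales and zero-points, which is why the article's turbo3 numbers run somewhat higher.

```python
# KV-cache size per the formula in the text:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/element
# Defaults match the Qwen2.5-32B geometry quoted above. Quantized results are
# raw payload only; real caches add per-block scales/zero-points, which is why
# the article's turbo3 figure at 16K (~1.2 GB) exceeds the raw 3-bit payload.

def kv_cache_gib(seq_len: int, bits: float,
                 layers: int = 64, kv_heads: int = 8, head_dim: int = 128) -> float:
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * (bits / 8)
    return total_bytes / 2**30

for ctx in (16_384, 32_768):
    print(f"{ctx:>6} tokens: fp16 {kv_cache_gib(ctx, 16):.1f} GiB, "
          f"3-bit {kv_cache_gib(ctx, 3):.2f} GiB payload")
# 16384 tokens: fp16 4.0 GiB, 3-bit 0.75 GiB payload
# 32768 tokens: fp16 8.0 GiB, 3-bit 1.50 GiB payload
```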

Three Paths to TurboQuant on 24GB GPUs Today

PyPI turboquant-kv: wrap an HF Transformers model (loaded with load_in_4bit) in TurboQuantModel(bits=3) and call enable_decoder_fused_attention() for Python scripts; handles 512+ new tokens on long inputs, as sketched below.
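A hedged sketch of this path: TurboQuantModel, bits=3, and enable_decoder_fused_attention come from the description above, but the import path, constructor signature, and generate() pass-through are assumptions about an experimental package.

```python
# Hypothetical sketch of the turboquant-kv path. Class and method names are
# from the article; the import path and exact signatures are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant_kv import TurboQuantModel  # assumed import path

model_id = "Qwen/Qwen2.5-32B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
base = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, device_map="cuda")  # 4-bit weights

model = TurboQuantModel(base, bits=3)   # 3-bit compressed KV cache
model.enable_decoder_fused_attention()  # fused attention over compressed KV

long_prompt = open("long_report.txt").read()  # any 8K+ token input
inputs = tok(long_prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```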

vLLM fork (0xSero/turboquant): call install_turboquant_vllm(bits=3, head_dim=128) before constructing LLM(model, gpu_memory_utilization=0.92); ships prebuilt codebooks for d=128/256 at 2/3/4 bits; server-friendly. A sketch follows.
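Again a sketch under stated assumptions: install_turboquant_vllm and its arguments are quoted above, but the module it is imported from and the max_model_len choice are assumptions; the rest is standard vLLM API.

```python
# Hypothetical sketch of the vLLM fork path (0xSero/turboquant). The install
# hook and its arguments are from the article; the import module name and
# max_model_len are assumptions.
from turboquant import install_turboquant_vllm  # assumed module name
from vllm import LLM, SamplingParams

# Patch the KV cache before engine construction; prebuilt codebooks cover
# head_dim 128/256 at 2/3/4 bits per the article.
install_turboquant_vllm(bits=3, head_dim=128)

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct",
          gpu_memory_utilization=0.92, max_model_len=32_768)
outputs = llm.generate(["Summarize this 20K-token report: ..."],
                       SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)
```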

llama.cpp fork (turboquant_plus): build with CUDA and run llama-server -m model-Q4_K_M.gguf --cache-type-k turbo3 --cache-type-v turbo2 -c 32768 -ngl 99. turbo4 ≈ q8_0 quality, turbo3 is the best tradeoff, turbo2 is the extreme option. Fits 32K on Qwen2.5-32B (19GB weights + <4GB KV).

Quality holds for models ≥8B; speedups are context-dependent (below ~2K the benefit is memory only). All of this is experimental: watch for the Google implementation (expected Q2-Q3 2026) and the llama.cpp #20969 and vLLM #38171 merges.

Optimal Stack: Q4_K_M GGUF + turboquant_plus turbo3/2

Download a Q4_K_M GGUF and run the llama.cpp fork at 16-32K context. This yields reliable dense-35B inference where the defaults run out of memory; 128K remains out of reach, since the KV cache still measures in GBs even after compression.
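A back-of-envelope check of both claims, reusing the raw 3-bit payload math from earlier; metadata overhead only widens the 128K shortfall.

```python
# Budget check for the recommended stack on a 24 GB card: Qwen2.5-32B
# Q4_K_M weights (~19 GB, from above) plus raw 3-bit KV payload
# (64 layers x 8 KV heads x head_dim 128). Metadata overhead is excluded
# and would only make the 128K case worse.
WEIGHTS_GB = 19.0

def kv_gib(seq_len: int, bits: float = 3) -> float:
    return 2 * 64 * 8 * 128 * seq_len * (bits / 8) / 2**30

for ctx in (32_768, 131_072):
    total = WEIGHTS_GB + kv_gib(ctx)
    verdict = "fits" if total < 24 else "exceeds"
    print(f"{ctx:>6} ctx: ~{total:.1f} GB total, {verdict} 24 GB")
# 32768 ctx: ~20.5 GB total, fits 24 GB
# 131072 ctx: ~25.0 GB total, exceeds 24 GB
```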
