Run Bonsai 1-Bit LLM on CUDA: 14x Smaller, 3x Faster
Bonsai-1.7B uses Q1_0_g128 quantization to fit in 0.24GB (a 14.2x reduction vs FP16), runs at 674 tok/s on an RTX 4090 via prebuilt llama.cpp CUDA binaries, and supports chat, JSON output, code generation, RAG, and an OpenAI-compatible server.
Q1_0_g128 Quantization Cuts Memory 14x to 1.125 Bits per Weight
Bonsai-1.7B packs each weight as a 1-bit sign (0 = -scale, 1 = +scale) and shares one FP16 scale per 128-weight group, yielding 1 + 16/128 = 1.125 bits per weight. This shrinks the 3.44GB FP16 model to 0.24GB (a 14.2x reduction), smaller even than MLX's 1-bit g128 format at 0.27GB. Per 128-weight group, FP16 costs 256 bytes while Q1_0_g128 costs 18 bytes (16 bytes of sign bits plus a 2-byte scale), the same 14.2x saving.

A reconstruction demo walks the full round trip: generate FP16 weights (first 8: 0.0621, -0.0284, ...), take the group's max absolute value as the scale (0.1587), quantize each weight to its sign bit (1, 0, ...), then dequantize to +/-scale, landing at an MSE of 0.001098. The sketch below reproduces the same round trip.

To deploy, grab the GGUF from prism-ml/Bonsai-1.7B-gguf (~248MB download) and run it with prebuilt llama.cpp binaries (e.g., prism-b8194-1179bfc for CUDA 12.4/12.8/13.1), offloading all layers to the GPU with -ngl 99 and using a 4096-token context via -c 4096.
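A minimal NumPy sketch of that round trip, with hypothetical helper names and randomly generated weights (so the exact values and MSE will differ from the 0.0621/0.1587/0.001098 figures above):

```python
# Q1_0_g128 round trip: 1-bit signs + one FP16 max-abs scale per 128-weight group.
import numpy as np

GROUP_SIZE = 128

def quantize_q1_g128(w: np.ndarray):
    """Quantize FP16 weights to sign bits plus one FP16 scale per group."""
    groups = w.astype(np.float32).reshape(-1, GROUP_SIZE)
    scales = np.abs(groups).max(axis=1).astype(np.float16)  # one scale per group
    bits = (groups >= 0).astype(np.uint8)                   # 1 = +scale, 0 = -scale
    return bits, scales

def dequantize_q1_g128(bits: np.ndarray, scales: np.ndarray) -> np.ndarray:
    signs = bits.astype(np.float32) * 2.0 - 1.0             # {0,1} -> {-1,+1}
    return (signs * scales.astype(np.float32)[:, None]).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, 256).astype(np.float16)             # two 128-weight groups
bits, scales = quantize_q1_g128(w)
w_hat = dequantize_q1_g128(bits, scales)

mse = float(np.mean((w.astype(np.float32) - w_hat) ** 2))
bpw = (bits.size * 1 + scales.size * 16) / w.size           # 1 + 16/128 = 1.125
print(f"bpw={bpw:.3f}  MSE={mse:.6f}")
print(f"bytes/group: fp16={GROUP_SIZE * 2}  q1_0_g128={GROUP_SIZE // 8 + 2}")
```

The bpw and bytes-per-group printouts recover the 1.125 bpw and 256B-vs-18B figures exactly, since they follow from the format alone.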
Benchmark 3x Speed Gains Over FP16 on Consumer GPUs
Measure tokens/sec with repeated inference (128 generated tokens, averaged over 3 runs): Bonsai-1.7B hits 674 tok/s TG128 on an RTX 4090 (3.0x FP16's 224 tok/s) and 250 tok/s on an M4 Pro (3.8x FP16's 65 tok/s); a timing sketch follows below.

Default sampling parameters are temp=0.5, top_p=0.85, top_k=20, repeat_penalty=1.0, n_predict=256, and varying them trades precision for diversity: a conservative temp=0.1/top_k=10/top_p=0.70 yields focused output ("A futuristic city powered entirely by 1-bit AI features crystalline spires pulsing with binary neural networks..."), while an aggressive temp=1.2/top_k=100/top_p=0.98 produces varied but less grounded text.

Multi-turn chat accumulates history in ChatML format (<|im_start|>role\nmsg<|im_end|>); a three-turn conversation about 1-bit trade-offs stays coherent as long as the accumulated history fits within the 4096-token context (see the second sketch below).
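A rough timing sketch, assuming the llama-cpp-python bindings and a hypothetical local GGUF filename rather than the prebuilt CUDA binaries the numbers above came from (absolute speeds will differ, and this timing includes prompt processing):

```python
# Average generation speed over repeated runs with the default sampling params.
import time
from llama_cpp import Llama

llm = Llama(model_path="bonsai-1.7b-q1_0_g128.gguf",  # hypothetical filename
            n_gpu_layers=99, n_ctx=4096, verbose=False)

N_TOKENS, RUNS = 128, 3
speeds = []
for _ in range(RUNS):
    t0 = time.perf_counter()
    llm("Describe a futuristic city powered entirely by 1-bit AI.",
        max_tokens=N_TOKENS, temperature=0.5, top_p=0.85, top_k=20,
        repeat_penalty=1.0)
    speeds.append(N_TOKENS / (time.perf_counter() - t0))

print(f"TG{N_TOKENS}: {sum(speeds) / len(speeds):.1f} tok/s over {RUNS} runs")
```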
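And a sketch of manual ChatML history accumulation, reusing the llm handle from the benchmark above; the chatml helper is hypothetical, written against the template shown in the paragraph:

```python
# Accumulate (role, message) turns and re-render the full ChatML prompt each turn.
def chatml(history):
    """Render turns as ChatML, ending with an open assistant turn."""
    prompt = "".join(f"<|im_start|>{role}\n{msg}<|im_end|>\n" for role, msg in history)
    return prompt + "<|im_start|>assistant\n"

history = [("system", "You are a concise assistant.")]
for question in ["What does 1-bit quantization trade away?",
                 "How do group scales limit that loss?",
                 "When is FP16 still the better choice?"]:
    history.append(("user", question))
    out = llm(chatml(history), max_tokens=256, temperature=0.5,
              top_p=0.85, top_k=20, stop=["<|im_end|>"])
    answer = out["choices"][0]["text"].strip()
    history.append(("assistant", answer))  # history must stay under 4096 tokens
```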
Production Pipelines: JSON, Code Gen, RAG, OpenAI Server
Force JSON output with the system prompt "Respond ONLY with valid JSON" plus a low temp=0.1, then parse after stripping any markdown fences (first sketch below). A typical result:

{"model_name": "Bonsai-1.7B", "parameter_count": "1.7B", "bits_per_weight": 1.125, "memory_gb": 0.24, "top_use_cases": ["edge deployment", "mobile AI", "fast inference"]}

Code generation holds up as well: prompted for a 1-bit quantizer function, the model produces code that executes correctly (256 input weights yield 2 bit arrays plus 2 scales at group_size=128). A 2048-token long-context prompt gets summarized into three bullets on transformer history. A mini-RAG setup injects knowledge-base snippets (e.g., Bonsai-1.7B: 32k context, 0.24GB; Bonsai-8B: 65k context) and returns grounded answers such as "Deployed file size of 1.7B: 0.24 GB".

For production serving, run the OpenAI-compatible server (llama-server --port 8088 -ngl 99) and query it with the openai Python client (second sketch below); prompt, completion, and total token counts come back accurately in the usage field. The model family spans 1.7B (0.24GB, 32k context, 14.2x), 4B (~0.6GB, 13x), and 8B (~0.9GB, 65k context, 13.9x).
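A sketch of the strict-JSON flow, reusing the llm handle and chatml helper from the earlier sketches; the user prompt wording here is illustrative, not the article's exact prompt:

```python
# System prompt forces JSON; strip markdown fences before parsing in case the
# model wraps its answer in ```json ... ``` blocks.
import json
import re

PROMPT = chatml([
    ("system", "Respond ONLY with valid JSON."),
    ("user", "Describe Bonsai-1.7B as a JSON object with model_name, "
             "parameter_count, bits_per_weight, memory_gb, top_use_cases."),
])

out = llm(PROMPT, max_tokens=256, temperature=0.1, stop=["<|im_end|>"])
raw = out["choices"][0]["text"].strip()
raw = re.sub(r"^```(?:json)?|```$", "", raw, flags=re.MULTILINE).strip()
data = json.loads(raw)  # raises ValueError if the model drifted from JSON
print(data["model_name"], data["top_use_cases"])
```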
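And a sketch of querying that server with the official openai Python client; the model id and API key are placeholders (llama-server serves the single loaded model, and a local key is not checked), and the GGUF filename is hypothetical:

```python
# Point the openai client at the local llama-server started with:
#   llama-server -m bonsai-1.7b-q1_0_g128.gguf --port 8088 -ngl 99
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8088/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="bonsai-1.7b",  # placeholder id
    messages=[{"role": "user", "content": "One sentence on 1.125 bpw quantization."}],
    temperature=0.5,
    max_tokens=64,
)
print(resp.choices[0].message.content)
u = resp.usage  # server reports token accounting here
print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")
```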