1-Bit LLMs Enable CPU-Scale Inference

BitNet introduces 1-bit transformers, and BitNet b1.58 shows that LLM weights can be compressed to ternary values (1.58 bits per weight); the BitNet v2 and BitNet a4.8 variants add native 4-bit activations. Scaling up to BitNet b1.58 2B4T (2B parameters trained on 4T tokens) cuts memory and speeds up CPU inference via bitnet.cpp. Sparsity boosts efficiency further: Sparse-BitNet (semi-structured 1.58-bit), SlideSparse ((2N-2):2N structured sparsity), Q-Sparse and Block Q-Sparse (fully sparsely-activated LLMs), and ReSA (rectified sparse attention). BitDistill fine-tunes any full-precision LLM down to 1.58 bits for downstream tasks; the S-shape training guide covers tips and FAQs. The trade-off: some precision loss in exchange for roughly 10x cheaper deployment than full precision.
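
The core mechanism behind b1.58 is absmean ternary quantization: each weight tensor is scaled by its mean absolute value, then rounded and clipped to {-1, 0, +1}. A minimal PyTorch sketch, assuming per-tensor scaling and omitting the straight-through estimator used during training:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Per-tensor absmean scale, as described in the BitNet b1.58 paper.
    scale = w.abs().mean().clamp(min=eps)
    # RoundClip(w / scale, -1, 1): every weight becomes -1, 0, or +1.
    w_q = (w / scale).round().clamp_(-1, 1)
    return w_q, scale  # the dequantized weight is w_q * scale

w = torch.randn(256, 256)
w_q, s = absmean_ternary_quantize(w)
print(torch.unique(w_q).tolist())          # [-1.0, 0.0, 1.0]
print((w - w_q * s).abs().mean().item())   # mean reconstruction error
```

Ternary weights reduce matrix multiplication to additions and subtractions, which is what lets bitnet.cpp run inference efficiently on commodity CPUs.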

Multimodal and Voice AI Foundations

The Kosmos series builds grounded multimodal LLMs: Kosmos-1 (a multimodal LLM), Kosmos-2 (grounding language to the visual world), Kosmos-2.5 (a multimodal literate model for text-rich images), and Kosmos-G (in-context image generation). Speech: VALL-E X (zero-shot TTS with a neural codec language model), VALL-E 2 (human-parity zero-shot TTS), and WavMark (audio watermarking). VibeVoice advances voice AI with real-time streaming long-form TTS, VibeVoice-ASR (ASR for the LLM era), and MELLE (autoregressive speech synthesis without vector quantization). LatentLM unifies multimodal modeling; LongViT treats an image as up to 1,024x1,024 words (~1M tokens). For production TTS/ASR, zero-shot synthesis clones a voice from a short enrollment prompt, skipping per-speaker data collection.
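
To make the zero-shot mechanism concrete, here is a toy sketch of the neural-codec-LM recipe behind the VALL-E family: text tokens plus the acoustic tokens of a short enrollment prompt form the prefix, and the model autoregressively continues with new codec tokens in the prompt speaker's voice. Every piece here (vocabulary sizes, the mean-pooled "context encoder", the linear head) is a stand-in stub, not the released implementation:

```python
import torch

TEXT_VOCAB, CODEC_VOCAB, DIM = 256, 1024, 16   # toy sizes (assumptions)
embed = torch.nn.Embedding(TEXT_VOCAB + CODEC_VOCAB, DIM)  # shared token table
head = torch.nn.Linear(DIM, CODEC_VOCAB)       # stub next-codec-token head

def generate(text_ids, prompt_codec_ids, max_new=64):
    # Condition on text + enrollment-prompt acoustic tokens, then continue:
    # the prompt is what conveys the target voice, so no per-speaker training.
    seq = torch.cat([text_ids, prompt_codec_ids + TEXT_VOCAB])
    for _ in range(max_new):
        h = embed(seq).mean(dim=0)             # stub context encoder
        probs = head(h).softmax(-1)            # next-codec-token distribution
        nxt = torch.multinomial(probs, 1) + TEXT_VOCAB
        seq = torch.cat([seq, nxt])
    return seq[len(text_ids) + len(prompt_codec_ids):] - TEXT_VOCAB

codec_tokens = generate(torch.tensor([5, 17, 42]), torch.tensor([7, 8, 9]))
# A neural codec decoder (e.g., EnCodec) would turn codec_tokens into audio.
```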

Architecture Scaling and Context Extension

YOCO (a decoder-decoder architecture that caches key-values only once) and YOCO-U (universal depth scaling). LongNet scales transformers to a 1B-token context; BYOCL bootstraps longer contexts. Also: Differential Transformer V2 (faster, better, more stable), RetNet (retentive networks, a successor to the transformer), MH-MoE v2 (multi-head mixture-of-experts), TorchScale (transformers at any scale), DeepNet (scaling to 1,000 layers), Magneto (a foundation transformer), and XPos (length-extrapolatable position embedding). PoSE uses positional skip-wise training to extend context windows; Structured Prompting scales in-context learning to 1,000 examples. Deploy these for long documents: native handling of up to 1B tokens avoids truncation errors.
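
LongNet reaches its 1B-token claim through dilated attention: the sequence is split into segments, and within each segment only every r-th token attends, with several (segment length w, dilation r) patterns mixed so coverage stays dense while cost drops from quadratic toward linear. A single-pattern, single-head sketch, with causal masking and the multi-pattern mixing omitted:

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=8, r=2):
    # One (w, r) pattern: segment the sequence, keep every r-th token
    # per segment, and run dense attention only among the kept tokens.
    n, d = q.shape
    out = torch.zeros_like(v)
    for start in range(0, n, w):
        idx = torch.arange(start, min(start + w, n), r)  # dilated indices
        attn = F.softmax(q[idx] @ k[idx].T / d ** 0.5, dim=-1)
        out[idx] = attn @ v[idx]
    return out

x = torch.randn(32, 16)
y = dilated_attention(x, x, x)
# Tokens skipped by this pattern stay zero here; the full method mixes
# several (w, r) configurations across heads so every token is covered.
```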

Distillation, RL, and Agentic Reasoning

Distillation: MiniLLM (on-policy distillation), GAD (black-box on-policy distillation), on-policy context distillation (Experiential Learning, Parts I/II), and BitDistill. Pre-training: TPT (thinking-augmented pre-training), RPT (reinforcement pre-training), Learning Law (toward optimal LM learning), and scaling laws for synthetic data. RLHF advances: GMPO (geometric-mean policy optimization), QueST (generating hard problems), DocReward (a document reward model), and RRM (pushing the reward-model frontier). Agentic: LLM-in-Sandbox (general intelligence in a sandbox), Multiplex Thinking (token-wise branching and merging), the Era of Agentic Organization, and Visualization-of-Thought (MVoT, spatial reasoning). Outcomes: distillation shrinks models ~4x with under 5% performance loss, and RL elicits reasoning without a full retrain.
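
The on-policy distillation idea can be summarized in one objective: minimize the reverse KL divergence KL(student || teacher) on sequences the student itself generates, so the student concentrates on modes it can actually fit rather than smearing probability mass over the teacher's full distribution. A minimal token-level sketch with toy logits (MiniLLM itself optimizes this with policy-gradient machinery, omitted here):

```python
import torch
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits):
    # KL(q_student || p_teacher), averaged over sequence positions.
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (log_q.exp() * (log_q - log_p)).sum(-1).mean()

# Score student-sampled positions with both models (toy shapes: positions x vocab).
student_logits = torch.randn(10, 50, requires_grad=True)
teacher_logits = torch.randn(10, 50)
loss = reverse_kl(student_logits, teacher_logits)
loss.backward()  # gradients pull the student toward teacher-supported tokens
```

Reverse KL is mode-seeking, which is why small distilled students stay coherent: they drop teacher behaviors they lack capacity for instead of averaging over them.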