1-Bit LLMs Enable CPU-Scale Inference

BitNet introduces 1-bit transformers, and BitNet b1.58 shows that LLM weights can be compressed to ternary values (1.58 bits per weight); the BitNet v2 and BitNet a4.8 variants add native 4-bit activations. Scaling up to BitNet b1.58 2B4T (2B parameters trained on 4T tokens) cuts memory and speeds up CPU inference via bitnet.cpp. Sparsity boosts efficiency further: Sparse-BitNet (semi-structured 1.58-bit), SlideSparse ((2N-2):2N structured sparsity), Q-Sparse and Block Q-Sparse (fully sparsely-activated LLMs), and ReSA (rectified sparse attention). BitDistill fine-tunes any full-precision LLM down to 1.58 bits for downstream tasks; the S-shape training guide covers tips and FAQs. The trade-off: some precision loss in exchange for roughly 10x cheaper deployment than full precision.
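
The core mechanism behind b1.58 is absmean ternary quantization: each weight tensor is scaled by its mean absolute value, then rounded and clipped to {-1, 0, +1}. A minimal PyTorch sketch, assuming per-tensor scaling and omitting the straight-through estimator used during training:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Per-tensor absmean scale, as described in the BitNet b1.58 paper.
    scale = w.abs().mean().clamp(min=eps)
    # RoundClip(w / scale, -1, 1): every weight becomes -1, 0, or +1.
    w_q = (w / scale).round().clamp_(-1, 1)
    return w_q, scale  # the dequantized weight is w_q * scale

w = torch.randn(256, 256)
w_q, s = absmean_ternary_quantize(w)
print(torch.unique(w_q).tolist())          # [-1.0, 0.0, 1.0]
print((w - w_q * s).abs().mean().item())   # mean reconstruction error
```

Ternary weights reduce matrix multiplication to additions and subtractions, which is what lets bitnet.cpp run inference efficiently on commodity CPUs.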

Multimodal and Voice AI Foundations

The Kosmos series builds grounded multimodal LLMs: Kosmos-1 (a multimodal LLM), Kosmos-2 (grounding language to the visual world), Kosmos-2.5 (a multimodal literate model for text-rich images), and Kosmos-G (in-context image generation). Speech: VALL-E X (zero-shot TTS with a neural codec language model), VALL-E 2 (human-parity zero-shot TTS), and WavMark (audio watermarking). VibeVoice advances voice AI with real-time streaming long-form TTS, VibeVoice-ASR (ASR for the LLM era), and MELLE (autoregressive speech synthesis without vector quantization). LatentLM unifies multimodal modeling; LongViT treats an image as up to 1,024x1,024 words (~1M tokens). For production TTS/ASR, zero-shot synthesis clones a voice from a short enrollment prompt, skipping per-speaker data collection.
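
To make the zero-shot mechanism concrete, here is a toy sketch of the neural-codec-LM recipe behind the VALL-E family: text tokens plus the acoustic tokens of a short enrollment prompt form the prefix, and the model autoregressively continues with new codec tokens in the prompt speaker's voice. Every piece here (vocabulary sizes, the mean-pooled "context encoder", the linear head) is a stand-in stub, not the released implementation:

```python
import torch

TEXT_VOCAB, CODEC_VOCAB, DIM = 256, 1024, 16   # toy sizes (assumptions)
embed = torch.nn.Embedding(TEXT_VOCAB + CODEC_VOCAB, DIM)  # shared token table
head = torch.nn.Linear(DIM, CODEC_VOCAB)       # stub next-codec-token head

def generate(text_ids, prompt_codec_ids, max_new=64):
    # Condition on text + enrollment-prompt acoustic tokens, then continue:
    # the prompt is what conveys the target voice, so no per-speaker training.
    seq = torch.cat([text_ids, prompt_codec_ids + TEXT_VOCAB])
    for _ in range(max_new):
        h = embed(seq).mean(dim=0)             # stub context encoder
        probs = head(h).softmax(-1)            # next-codec-token distribution
        nxt = torch.multinomial(probs, 1) + TEXT_VOCAB
        seq = torch.cat([seq, nxt])
    return seq[len(text_ids) + len(prompt_codec_ids):] - TEXT_VOCAB

codec_tokens = generate(torch.tensor([5, 17, 42]), torch.tensor([7, 8, 9]))
# A neural codec decoder (e.g., EnCodec) would turn codec_tokens into audio.
```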

Architecture Scaling and Context Extension

YOCO (a decoder-decoder architecture that caches key-values only once) and YOCO-U (universal depth scaling). LongNet scales transformers to a 1B-token context; BYOCL bootstraps longer contexts. Also: Differential Transformer V2 (faster, better, more stable), RetNet (retentive networks, a successor to the transformer), MH-MoE v2 (multi-head mixture-of-experts), TorchScale (transformers at any scale), DeepNet (scaling to 1,000 layers), Magneto (a foundation transformer), and XPos (length-extrapolatable position embedding). PoSE uses positional skip-wise training to extend context windows; Structured Prompting scales in-context learning to 1,000 examples. Deploy these for long documents: native handling of up to 1B tokens avoids truncation errors.
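
LongNet reaches its 1B-token claim through dilated attention: the sequence is split into segments, and within each segment only every r-th token attends, with several (segment length w, dilation r) patterns mixed so coverage stays dense while cost drops from quadratic toward linear. A single-pattern, single-head sketch, with causal masking and the multi-pattern mixing omitted:

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, w=8, r=2):
    # One (w, r) pattern: segment the sequence, keep every r-th token
    # per segment, and run dense attention only among the kept tokens.
    n, d = q.shape
    out = torch.zeros_like(v)
    for start in range(0, n, w):
        idx = torch.arange(start, min(start + w, n), r)  # dilated indices
        attn = F.softmax(q[idx] @ k[idx].T / d ** 0.5, dim=-1)
        out[idx] = attn @ v[idx]
    return out

x = torch.randn(32, 16)
y = dilated_attention(x, x, x)
# Tokens skipped by this pattern stay zero here; the full method mixes
# several (w, r) configurations across heads so every token is covered.
```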

Distillation, RL, and Agentic Reasoning

Distillation: MiniLLM (on-policy distillation), GAD (black-box on-policy distillation), on-policy context distillation (Experiential Learning, Parts I/II), and BitDistill. Pre-training: TPT (thinking-augmented pre-training), RPT (reinforcement pre-training), Learning Law (toward optimal LM learning), and scaling laws for synthetic data. RLHF advances: GMPO (geometric-mean policy optimization), QueST (generating hard problems), DocReward (a document reward model), and RRM (pushing the reward-model frontier). Agentic: LLM-in-Sandbox (general intelligence in a sandbox), Multiplex Thinking (token-wise branching and merging), the Era of Agentic Organization, and Visualization-of-Thought (MVoT, spatial reasoning). Outcomes: distillation shrinks models ~4x with under 5% performance loss, and RL elicits reasoning without a full retrain.
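
The on-policy distillation idea can be summarized in one objective: minimize the reverse KL divergence KL(student || teacher) on sequences the student itself generates, so the student concentrates on modes it can actually fit rather than smearing probability mass over the teacher's full distribution. A minimal token-level sketch with toy logits (MiniLLM itself optimizes this with policy-gradient machinery, omitted here):

```python
import torch
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits):
    # KL(q_student || p_teacher), averaged over sequence positions.
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (log_q.exp() * (log_q - log_p)).sum(-1).mean()

# Score student-sampled positions with both models (toy shapes: positions x vocab).
student_logits = torch.randn(10, 50, requires_grad=True)
teacher_logits = torch.randn(10, 50)
loss = reverse_kl(student_logits, teacher_logits)
loss.backward()  # gradients pull the student toward teacher-supported tokens
```

Reverse KL is mode-seeking, which is why small distilled students stay coherent: they drop teacher behaviors they lack capacity for instead of averaging over them.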