VibeVoice-Realtime-0.5B: 300ms Streaming TTS Model

Microsoft's 0.5B-parameter TTS model streams text input into real-time speech, producing first audio in ~300ms, handles roughly 10 minutes of long-form English audio, and posts strong zero-shot results (2.00% WER on LibriSpeech) while adding reasonable multilingual capability.

Build Real-Time TTS with Interleaved Streaming Design

Integrate VibeVoice-Realtime-0.5B to generate speech from streaming text input, producing initial audio in ~300ms (hardware-dependent) for live narration or spoken LLM responses. The 0.5B-parameter model uses an interleaved, windowed architecture: it encodes incoming text chunks incrementally while diffusion-based acoustic-latent generation continues in parallel from prior context. It drops the semantic tokenizer used by larger variants, relying on an efficient acoustic tokenizer at a 7.5 Hz frame rate for low latency. The model supports up to 8k tokens of context (roughly 10 minutes of generation) and a single English speaker, though other languages such as German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish work reasonably well. Websocket demos in the GitHub repo support real-time apps; plug it into any LLM to speak tokens as they arrive, before the full response completes (a client sketch follows). Trade-off: no multi-speaker or overlapping speech, so use the larger VibeVoice models (1.5B with 64k context or Large with 32k context) for conversations.
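
A minimal client sketch of this token-by-token flow, assuming a websocket demo server that accepts text chunks and streams back raw audio bytes. The endpoint URI, end-of-stream marker, sample rate, and message framing are illustrative assumptions, not the repo's documented protocol.

```python
import asyncio
import websockets  # pip install websockets

TTS_URI = "ws://localhost:8000/tts"   # hypothetical demo endpoint
SAMPLE_RATE = 24_000                  # assumed output sample rate

async def fake_llm_tokens():
    """Stand-in for an LLM token stream; replace with your model's streaming API."""
    for token in "Hello there, this sentence is synthesized as it arrives.".split():
        yield token + " "
        await asyncio.sleep(0.05)     # simulate generation latency

async def stream_tts(out_path: str = "out.pcm"):
    async with websockets.connect(TTS_URI) as ws:

        async def sender():
            # Forward text chunks to the TTS server as soon as they are produced.
            async for token in fake_llm_tokens():
                await ws.send(token)
            await ws.send("<EOS>")    # assumed end-of-stream marker

        async def receiver():
            # Collect audio frames as they stream back; a real app would play them.
            with open(out_path, "wb") as f:
                async for message in ws:
                    if isinstance(message, bytes):
                        f.write(message)

        await asyncio.gather(sender(), receiver())

if __name__ == "__main__":
    asyncio.run(stream_tts())
```

Because sending and receiving run concurrently, playback can begin as soon as the first audio frames arrive, which is where the ~300ms figure applies.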

Outperform Baselines on Zero-Shot TTS Benchmarks

Deploy for production-like quality: on LibriSpeech test-clean, the model achieves 2.00% WER (↓ better) and 0.695 speaker similarity (↑ better), ahead of VALL-E 2 (2.40%/0.643) and MELLE (2.10%/0.625) on both metrics and ahead of Voicebox (1.90%/0.662) on similarity. On SEED test-en, it reaches 2.05% WER and 0.633 similarity, competitive with MaskGCT (2.62%/0.714), Seed-TTS (2.25%/0.762), FireRedTTS (3.82%/0.460), SparkTTS (1.98%/0.584), and CosyVoice2 (2.57%/0.652). It excels at long-form generation more than short sentences; the architecture, a Qwen2.5-0.5B base LLM plus acoustic tokenizer and diffusion head, enables this without full retraining.
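
For context on the headline metric, the sketch below shows a generic word error rate (WER) calculation: the word-level edit distance between a reference transcript and the transcript of the generated audio, divided by the reference length. This is an illustrative implementation, not the benchmarks' exact scoring script, which typically applies text normalization first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of four -> 25% WER; one error per 50 words would give 2.0%.
print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```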

Mitigate Risks in Research Deployments

For research use only: install via the GitHub README and avoid commercial use without further testing. Pre-process inputs to strip code, formulas, and special symbols, which the model does not support (see the sketch below). Limitations: the model is English-focused (non-English output can be unpredictable), cannot produce non-speech audio or overlapping speakers, and inherits Qwen2.5 biases. Safeguards include an automatically embedded "This segment was generated by AI" disclaimer, an imperceptible watermark for provenance verification, and a removed acoustic tokenizer to block custom embeddings. Disclose AI use, comply with applicable laws and the MIT license, and contact VibeVoice@microsoft.com with issues; Microsoft Research welcomes feedback.
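
A hedged sketch of that pre-processing step: drop fenced code blocks, inline code, LaTeX-style math, and stray symbols before sending text to the model. The exact characters to strip are an assumption; the model card only notes that code and formulas are unsupported.

```python
import re

def sanitize_for_tts(text: str) -> str:
    """Remove constructs the model cannot speak (filter rules are illustrative)."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)                # fenced code blocks
    text = re.sub(r"`[^`]+`", " ", text)                                   # inline code
    text = re.sub(r"\$\$.*?\$\$|\$[^$]+\$", " ", text, flags=re.DOTALL)    # LaTeX math
    text = re.sub(r"[^\w\s.,!?;:'()\-]", " ", text)                        # leftover symbols
    return re.sub(r"\s+", " ", text).strip()

print(sanitize_for_tts("The loss is $L = -\\sum_i y_i \\log p_i$, see `train.py`."))
# -> "The loss is , see ."
```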
