Efficient Tokenization Drives Long-Form Processing

VibeVoice models process extended audio through acoustic and semantic tokenizers that run at a 7.5 Hz frame rate, preserving fidelity while handling sequences of up to 64K tokens. This enables single-pass transcription of 60-minute audio or synthesis of 90-minute speech without the losses introduced by chunking. A next-token diffusion framework pairs an LLM, which models textual context and dialogue flow, with a diffusion head that generates high-fidelity acoustics. According to the reported benchmarks, it outperforms chunked baselines on diarization error rate (DER), concatenated minimum-permutation word error rate (cpWER), and time-constrained minimum-permutation word error rate (tcpWER).
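To see why a 7.5 Hz frame rate makes hour-long audio fit in a single pass, the arithmetic can be sketched as follows. This assumes each tokenizer frame occupies one position in the sequence and reads "64K" as 65,536; neither detail is stated in the text, so treat the numbers as a rough budget rather than an exact accounting.

```python
FRAME_RATE_HZ = 7.5          # tokenizer frame rate stated in the text
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" read as 65,536


def audio_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


for minutes in (60, 90):
    frames = audio_frames(minutes)
    remaining = CONTEXT_TOKENS - frames
    print(f"{minutes} min -> {frames} frames, {remaining} positions left for text")

# 60 min -> 27000 frames, 38536 positions left for text
# 90 min -> 40500 frames, 25036 positions left for text
```

Under these assumptions, both the 60-minute and 90-minute cases leave room in the context for the accompanying text tokens.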

Apply this by loading models from Hugging Face (e.g., microsoft/VibeVoice-ASR-7B, VibeVoice-1.5B, VibeVoice-Realtime-0.5B) for inference; a vLLM plugin accelerates ASR serving.
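A minimal sketch of fetching the weights with the `huggingface_hub` library is shown below. The model IDs are copied from the text above, with the `microsoft/` prefix assumed for the two IDs that omit it; verify the exact repository names on the Hub before use. The `MODEL_IDS` mapping and `fetch_weights` helper are hypothetical names introduced for illustration, and the sketch only downloads files; running inference requires the model-specific code.

```python
# Model IDs as written in the text; the "microsoft/" prefix on the last two
# is an assumption. Check the Hugging Face Hub for the exact repository names.
MODEL_IDS = {
    "asr": "microsoft/VibeVoice-ASR-7B",
    "tts": "microsoft/VibeVoice-1.5B",
    "realtime": "microsoft/VibeVoice-Realtime-0.5B",
}


def fetch_weights(task: str) -> str:
    """Download (or reuse the cached) model snapshot and return its local path."""
    # Imported lazily so the mapping above is usable without the dependency.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=MODEL_IDS[task])


# Example (needs network access and several GB of disk space):
# local_dir = fetch_weights("realtime")
```

Starting with the smallest checkpoint keeps the first download short while the setup is being tested.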

ASR Delivers Structured 60-Minute Transcriptions

VibeVoice-ASR-7B transcribes long-form audio in one pass, jointly producing speaker diarization (Who), timestamps (When), and content (What). Supplying customized hotwords (names, terms, context) boosts accuracy on domain-specific audio. The model natively supports 50+ languages and processes the full recording at once, avoiding the context loss of short-segment processing. Fine-tuning scripts are available, and the model is integrated into Hugging Face Transformers v5.3.0. Test it via the playground (aka.ms/vibevoice-asr) or Colab.

Outcomes: consistent speaker tracking and semantic coherence over the full hour. Community apps such as Vibing use it for voice input on macOS and Windows.
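One way to work with Who/When/What output downstream is to hold each utterance in a small record. The sketch below is illustrative only: the `Segment` schema, the helper names, and the sample data are all made up here and do not reflect the model's actual output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str   # Who
    start: float   # When: start time in seconds
    end: float     # When: end time in seconds
    text: str      # What


def talk_time(segments):
    """Total speaking time in seconds per speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def render(segments):
    """Format segments as a time-ordered, human-readable transcript."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{s.start:7.2f}-{s.end:7.2f}] {s.speaker}: {s.text}" for s in ordered
    )


# Made-up sample data, not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome back to the show."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, glad to be here."),
    Segment("Speaker 1", 9.0, 12.0, "Let's get started."),
]

print(render(segments))
print(talk_time(segments))
```

Keeping speaker, timing, and text together in one record makes per-speaker summaries and subtitle-style exports straightforward.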

TTS Enables Expressive Multi-Speaker Dialogues

VibeVoice-TTS-1.5B generates up to 90 minutes of speech with up to 4 distinct speakers, natural turn-taking, and emotional nuance in English, Chinese, and cross-lingual settings. It handles spontaneous singing and long conversations (e.g., a 45-minute, 4-person climate discussion). VibeVoice-Realtime-0.5B adds streaming text input, ~300 ms first-audible latency, and robust generation of up to 10 minutes for deployment. Note: the TTS code was removed in September 2025 due to misuse beyond the research intent; use the HF weights. Experimental voices cover 9 languages plus 11 English styles. Try Realtime on Colab.

Acceptance as an ICLR 2026 Oral supports the long-form, multi-speaker quality claims.
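Before sending a dialogue to a multi-speaker TTS model, it can help to check that the script stays within the four-speaker limit mentioned above. The sketch below assumes a plain-text script with one `Speaker N: ...` turn per line; that format, along with the `parse_script` helper, is an assumption for illustration and should be checked against the model's documentation.

```python
import re

MAX_SPEAKERS = 4  # speaker limit stated in the text
# Assumed line format: "Speaker <number>: <utterance>"
LINE_RE = re.compile(r"^Speaker (\d+):\s*(.+)$")


def parse_script(script: str):
    """Parse a dialogue script into (speaker_id, utterance) turns."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines between turns are allowed
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        turns.append((int(match.group(1)), match.group(2)))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported"
        )
    return turns


script = """
Speaker 1: How long can a single generation run?
Speaker 2: The text says up to 90 minutes.
"""
print(parse_script(script))
```

Failing fast on a malformed or oversized script avoids spending a long generation run on input the model is not designed for.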

Quick Starts and Extensions

Stream TTS via the Colab notebook, and use the ASR playground for instant testing. Fine-tune ASR with the provided code. Contributions follow CONTRIBUTING.md; the project is MIT-licensed. The repository has 39.1k stars and 4.5k forks.