VIBEVOICE-ASR: Single-Pass 60-Min ASR with Diarization

VIBEVOICE-ASR handles 60-minute audio in one pass, unifying ASR, speaker diarization, and timestamping via low-rate tokenizers and LLM decoding. It beats Gemini on DER (3.42 avg) and tcpWER (15.66 avg) across 5 benchmarks and 10+ languages.

Single-Pass Processing Eliminates Context Fragmentation

Traditional long-form ASR pipelines chunk audio into clips under 30 seconds, breaking semantic dependencies, and require separate models for ASR, diarization, and timestamping, so errors propagate between stages. VIBEVOICE-ASR instead processes up to 60 minutes end-to-end in a single pass using dual tokenizers: an acoustic tokenizer at 3200× downsampling (7.5 tokens/sec) preserves spectral fidelity, while a semantic tokenizer provides linguistic alignment. This compresses one hour of audio to roughly 27,000 tokens, fitting modern LLM context windows such as Qwen 2.5's 65k. Global attention over the full recording enables homophone disambiguation, coreference resolution, and consistent speaker tracking without external clustering.

Output is structured "Rich Transcription" that interleaves speaker ID ("Who"), timestamps ("When"), and content ("What"). Prompt-based context injection prepends user-supplied information (hotwords, domain terms, speaker backgrounds) to boost accuracy on polyphonic names and jargon, and the model supports 50+ languages and code-switching without explicit language settings.
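The token budget above follows from simple arithmetic. A minimal sketch, assuming a 24 kHz sample rate (the sample rate and helper names here are illustrative assumptions; the 3200× factor and 7.5 tokens/sec come from the text):

```python
SAMPLE_RATE_HZ = 24_000      # assumed audio sample rate (not stated in the summary)
DOWNSAMPLE_FACTOR = 3_200    # acoustic tokenizer compression, from the text

def tokens_per_second(sample_rate: int, factor: int) -> float:
    """Token rate after downsampling raw samples by `factor`."""
    return sample_rate / factor

def tokens_for_minutes(minutes: float) -> int:
    """Total acoustic tokens for a clip of the given length."""
    return round(tokens_per_second(SAMPLE_RATE_HZ, DOWNSAMPLE_FACTOR) * minutes * 60)

print(tokens_per_second(SAMPLE_RATE_HZ, DOWNSAMPLE_FACTOR))  # 7.5
print(tokens_for_minutes(60))                                # 27000
```

At 7.5 tokens/sec, a full hour lands at 27,000 tokens, comfortably inside a 65k-token context window with room left for the prompt and transcript output.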

Robust Data Pipeline and Curriculum Training

Pre-training uses pseudo-labels from a pipeline that outperforms WhisperX and Emilia: Silero VAD segments audio into 30-second clips; Whisper-large-v3-turbo transcribes with word timestamps, refined at punctuation boundaries; WeSpeaker diarization clusters speaker embeddings (1.5 s window, 0.75 s hop, HDBSCAN, merging clusters above 0.67 cosine similarity); and clips are filtered out if more than 30% of segments exceed 20% WER or speech covers less than 60% of the duration. This yields lower DER/WER on AISHELL-4 (16.93/18.99), AMI-IHM (15.46/23.22), and other corpora (Table 1).

Supervised fine-tuning mixes data in fixed proportions: 0.5 standard benchmarks (MLC-SLM, Fisher), 0.1 music (Muse), 0.1 synthetic (GPT-5 scripts rendered by VIBEVOICE synthesis into 6k hours of code-switched audio, WER-filtered), and 0.3 long-form (GPT-5 refines chunked transcripts for coherence; GPT-Audio tags non-speech events such as Music and Silence). A curriculum ramps input length from 8k to 65k tokens.
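The clip-level quality filter can be sketched as follows. This is a minimal illustration of the two stated gates (segment-WER fraction and speech coverage); the `Segment` fields and function names are assumptions, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    wer: float    # word error rate of this segment's pseudo-label
    start: float  # seconds
    end: float    # seconds

def keep_clip(segments: list[Segment], clip_duration: float,
              max_bad_frac: float = 0.30, wer_thresh: float = 0.20,
              min_speech_frac: float = 0.60) -> bool:
    """Drop the clip if >30% of segments exceed 20% WER,
    or if speech covers <60% of the clip's duration."""
    if not segments:
        return False
    bad = sum(1 for s in segments if s.wer > wer_thresh)
    speech = sum(s.end - s.start for s in segments)
    return (bad / len(segments) <= max_bad_frac
            and speech / clip_duration >= min_speech_frac)

segs = [Segment(0.05, 0.0, 10.0), Segment(0.10, 11.0, 22.0),
        Segment(0.35, 23.0, 28.0), Segment(0.08, 28.0, 30.0)]
print(keep_clip(segs, clip_duration=30.0))  # True: 1/4 bad segments, 28/30 s speech
```

Here only one of four segments exceeds the WER threshold (25% ≤ 30%) and speech covers about 93% of the clip, so both gates pass.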

State-of-the-Art Benchmarks and Trade-offs

Evaluation uses MeetEval to compute DER (speaker attribution), WER (content), cpWER (speaker-consistent content), and tcpWER (time-aligned speaker content). Single-pass VIBEVOICE-ASR decisively outperforms chunked Gemini-2.5/3-Pro: average DER of 3.42 vs. 16.29/32.96; tcpWER of 15.66 vs. 28.90/58.81; best cpWER in 11 of 16 settings; lowest WER in 8 of 16 (Table 2, Figure 1). It excels in multi-speaker settings (e.g., AliMeeting DER 10.92) and multilingual ones (e.g., Japanese DER 0.82). Limitations: the English/Chinese focus of SFT causes forgetting on low-resource languages, and serial output misses overlapping speech (only the dominant speaker is transcribed). The team open-sources weights, vLLM inference, and fine-tuning code on GitHub/HuggingFace for community adaptation.

Summarized by x-ai/grok-4.1-fast via openrouter

© 2026 Edge