VibeVoice-ASR: 60-Min ASR with Speakers, Timestamps, Hotwords
Process up to 60 minutes of audio in one pass to produce structured transcripts (speaker IDs, timestamps, content) across 50+ languages, with custom hotwords that boost accuracy on proper nouns.
Unified Long-Form Transcription in a Single Pass
VibeVoice-ASR transcribes up to 60 minutes of audio within a 64K-token context, avoiding the losses introduced by chunking and preserving speaker consistency and semantic coherence across the recording. It jointly performs ASR, speaker diarization, and timestamping, emitting JSON-like structures with Start/End times, Speaker IDs, and Content. Load the model with Transformers >=5.3.0 via AutoProcessor and VibeVoiceAsrForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR-HF"). Prepare inputs with processor.apply_transcription_request(audio), run model.generate(**inputs), then call processor.decode(generated_ids, return_format="parsed") for a list of dicts, or return_format="transcription_only" for plain text. On podcast audio this yields segments such as {"Start": 0, "End": 15.43, "Speaker": 0, "Content": "Hello everyone..."}, preserving multi-speaker flow.
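Because the parsed output is a plain list of Python dicts, downstream formatting needs no extra tooling. A minimal sketch, where the format_transcript helper and the sample segments are illustrative (not part of the library), rendering segments in the schema above as timestamped speaker lines:

```python
def format_transcript(segments):
    """Render parsed ASR segments as timestamped speaker lines."""
    lines = []
    for seg in segments:
        start_m, start_s = divmod(int(seg["Start"]), 60)
        end_m, end_s = divmod(int(seg["End"]), 60)
        lines.append(
            f"[{start_m:02d}:{start_s:02d}-{end_m:02d}:{end_s:02d}] "
            f"Speaker {seg['Speaker']}: {seg['Content']}"
        )
    return "\n".join(lines)

# Sample segments mirroring the documented output schema.
segments = [
    {"Start": 0, "End": 15.43, "Speaker": 0, "Content": "Hello everyone..."},
    {"Start": 15.43, "End": 22.1, "Speaker": 1, "Content": "Thanks for having me."},
]
print(format_transcript(segments))
# [00:00-00:15] Speaker 0: Hello everyone...
# [00:15-00:22] Speaker 1: Thanks for having me.
```

The same loop works on the output of processor.decode(..., return_format="parsed"), since that call is documented to return this list-of-dicts structure.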
Custom hotwords, supplied via the prompt parameter, correct misrecognitions: on German-accented audio saying "VibeVoice", the model transcribes "Revevoices" without a prompt, while the prompt "About VibeVoice" produces an exact match. This makes hotwords ideal for proper names and domain-specific terms.
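One way to sanity-check prompt effectiveness is to scan the decoded text for each hotword after transcription. The check_hotwords helper below is an illustrative post-processing step, not part of the VibeVoice-ASR API, and the sample strings are hypothetical:

```python
def check_hotwords(text, hotwords):
    """Map each hotword to whether it appears (case-insensitively)
    in the transcription text."""
    lowered = text.lower()
    return {word: word.lower() in lowered for word in hotwords}

# Without a prompt the model may emit "Revevoices"; with the prompt
# "About VibeVoice" the term is transcribed exactly.
uncorrected = "Today we talk about Revevoices and its features."
corrected = "Today we talk about VibeVoice and its features."
print(check_hotwords(uncorrected, ["VibeVoice"]))  # {'VibeVoice': False}
print(check_hotwords(corrected, ["VibeVoice"]))    # {'VibeVoice': True}
```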
Flexible Inference and Optimization Techniques
Batch processing accepts lists of audio files and prompts for higher throughput. Adjust tokenizer_chunk_size (default 1440000 samples, i.e. 60 s at 24 kHz; values must be multiples of the 3200-sample hop length) to fit available memory, e.g. 64000 for shorter segments with cached states. Chat templates enable role-based inputs, e.g. [{"role": "user", "content": [{"type": "text", "text": "prompt"}, {"type": "audio", "path": "url"}]}], processed via apply_chat_template. torch.compile delivers 2x+ speedups on benchmarks (e.g. batch-4 German audio drops from ~0.2 s uncompiled to ~0.1 s compiled). Pipeline mode works but requires custom parsing of the raw JSON strings. For training, call model.train() and set output_labels=True in the chat template so loss is computed on the JSON-like targets.
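The chunk-size constraint (a multiple of the 3200-sample hop length, with audio at 24 kHz) can be wrapped in a small helper. valid_chunk_size below is an illustrative utility, not a library function; the messages list mirrors the documented chat-template structure:

```python
SAMPLE_RATE = 24000  # VibeVoice-ASR operates on 24 kHz audio
HOP_LENGTH = 3200    # tokenizer_chunk_size must be a multiple of this

def valid_chunk_size(seconds):
    """Convert a target duration to the nearest valid tokenizer_chunk_size."""
    samples = seconds * SAMPLE_RATE
    return round(samples / HOP_LENGTH) * HOP_LENGTH

print(valid_chunk_size(60))   # 1440000, the documented 60 s default
print(valid_chunk_size(2.7))  # 64000, ~2.67 s per chunk

# Role-based chat-template input, as documented, ready for
# processor.apply_chat_template(messages).
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "About VibeVoice"},
        {"type": "audio", "path": "https://example.com/audio.wav"},  # hypothetical URL
    ],
}]
```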
Proven Performance Across Benchmarks
VibeVoice-ASR achieves an average 7.77% WER on the Open ASR Leaderboard (e.g. 2.20% on LibriSpeech clean, 13.17% on earnings22, at an RTFx of 51.80, i.e. 51.8x faster than real time). The technical report shows low DER, cpWER, and tcpWER on long-form datasets. The model supports 50+ languages without requiring a language ID and handles code-switching; the training-distribution chart shows English-heavy data with broad language coverage. It is MIT-licensed and deployable on Foundry or via the Gradio playground.
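RTFx relates audio duration to processing time (higher is faster). A quick sketch of what the reported 51.80 figure implies for a one-hour recording, using an illustrative helper:

```python
def processing_time(audio_seconds, rtfx):
    """Estimated wall-clock seconds to transcribe audio at a given RTFx
    (real-time factor: audio duration divided by processing time)."""
    return audio_seconds / rtfx

# At the reported 51.80x, a 60-minute recording takes roughly 69.5 s.
print(round(processing_time(3600, 51.80), 1))  # 69.5
```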