Run VibeVoice STT Locally on Mac in One uv Command

Transcribe up to ~59 minutes of audio with Microsoft's MIT-licensed VibeVoice model via a single uv + mlx-audio command: on an M5 Max Mac, a roughly 1-hour podcast transcribes in 524s (about 8m45s) with 30-61GB peak RAM, producing speaker-diarized JSON segments.

This link post demonstrates running Microsoft's VibeVoice, a Whisper-style speech-to-text model with built-in speaker diarization, locally on Apple Silicon. The model was released January 21, 2026 under an MIT license; the demo uses the 5.71GB 4-bit MLX-quantized version of the 17.3GB original for efficient inference.

One-Liner Command Delivers Full Transcription

Install and run via uv and mlx-audio:

uv run --with mlx-audio mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-4bit \
  --audio lenny.mp3 --output-path lenny \
  --format json --verbose --max-tokens 32768

The command handles .mp3 and .wav inputs. The default --max-tokens of 8192 covers roughly 25 minutes of audio; raising it to 32768 extends coverage to roughly 59 minutes, the model's limit, beyond which longer files are trimmed. The output is a JSON array of segments like:

{
  "text": "And an open question for me is...",
  "start": 13.85,
  "end": 19.5,
  "duration": 5.65,
  "speaker_id": 0
}
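A minimal Python sketch of working with that output: load the segment array and total up talk time per speaker. The segments are embedded inline here (the first is the example above; the second is an invented segment for illustration), since the exact filename mlx-audio writes for --output-path lenny isn't shown in the post.

```python
import json
from collections import defaultdict

# Inline sample of the JSON segment array; in practice you would read
# this from the file written by --output-path. The second segment is
# a made-up example.
segments = json.loads("""[
  {"text": "And an open question for me is...",
   "start": 13.85, "end": 19.5, "duration": 5.65, "speaker_id": 0},
  {"text": "Yeah, that's a great question.",
   "start": 19.5, "end": 21.9, "duration": 2.4, "speaker_id": 1}
]""")

# Total talk time per speaker, in seconds.
talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker_id"]] += seg["duration"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"speaker {speaker}: {seconds:.1f}s")
```

The same grouping is what faceting by speaker_id gives you interactively.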

Load the JSON into Datasette Lite (https://lite.datasette.io/?json=URL) to facet by speaker_id and browse speaker turns. The model distinguishes speakers accurately, even catching voice changes in intros.

M5 Max Performance: Fast for Local Use

On 128GB M5 Max MacBook Pro, 99.8min podcast (trimmed to 59min) took 524.79s total:

  • Prompt: 26,615 tokens at 50.718 t/s
  • Generation: 20,248 tokens at 38.585 t/s
  • Peak reported: 30.44GB RAM (Activity Monitor showed 61.5GB prefill, 18GB generation)

That's 8min 45s for ~1hr audio, enabling quick local prototyping without cloud costs.
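As a quick sanity check on those numbers, the realtime factor works out to roughly 6.7x:

```python
# Rough realtime factor from the figures above: ~59 minutes of audio
# processed in 524.79 seconds of wall-clock time.
audio_seconds = 59 * 60        # trimmed podcast length
wall_seconds = 524.79          # total reported runtime
rtf = audio_seconds / wall_seconds
print(f"{rtf:.1f}x realtime")  # prints "6.7x realtime"
```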

Handling Long Audio Requires Splitting

The model caps out at ~59 minutes. For longer files, split the audio into chunks with 1-minute overlaps so that speaker IDs can be matched across chunks and no words are cut off, then align the overlapping segments in post-processing to merge the chunks into a single full transcript.
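The chunking step can be sketched in Python. The 55-minute chunk length (safely under the ~59-minute cap) and the 1-minute overlap are illustrative choices, not values from mlx-audio:

```python
# Plan (start, end) offsets that cover a long file with overlapping
# chunks, each short enough for the model's ~59-minute limit.
CHUNK_S = 55 * 60    # chunk length in seconds (assumed safe margin)
OVERLAP_S = 60       # 1 min of shared audio to align speakers/words

def plan_chunks(total_seconds):
    """Return a list of (start, end) offsets covering the file."""
    chunks, start = [], 0
    while start < total_seconds:
        end = min(start + CHUNK_S, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start = end - OVERLAP_S  # back up to create the overlap
    return chunks

# The 99.8-minute podcast (5988s) needs two chunks:
print(plan_chunks(5988))  # → [(0, 3300), (3240, 5988)]
```

Each planned chunk can then be cut with ffmpeg (its -ss and -t options take a start offset and duration), transcribed separately with the uv command above, and merged by matching segments that fall inside the shared overlap window.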

Summarized by x-ai/grok-4.1-fast via openrouter

5072 input / 2597 output tokens in 26872ms

© 2026 Edge