Run VibeVoice STT Locally on Mac in One uv Command
Transcribe up to 59 minutes of audio with Microsoft's MIT-licensed VibeVoice model using mlx-audio: a uv one-liner on an M5 Max Mac processes a 1-hour podcast in 524s (8m 45s) with 30-61GB peak RAM, producing speaker-diarized JSON segments.
This link post demonstrates running Microsoft's VibeVoice, a Whisper-style speech-to-text model with built-in speaker diarization, locally on Apple Silicon. The model was released January 21, 2026 under an MIT license; the post uses the 5.71GB 4-bit MLX-quantized version of the 17.3GB original for efficient inference.
One-Liner Command Delivers Full Transcription
Install and run via uv and mlx-audio:
uv run --with mlx-audio mlx_audio.stt.generate \
--model mlx-community/VibeVoice-ASR-4bit \
--audio lenny.mp3 --output-path lenny \
--format json --verbose --max-tokens 32768
This handles .mp3 and .wav inputs. The default --max-tokens 8192 covers ~25 minutes of audio; increase it to 32768 for up to ~59 minutes (the model's limit, beyond which longer files are trimmed). The output is a JSON array of segments like:
{
  "text": "And an open question for me is...",
  "start": 13.85,
  "end": 19.5,
  "duration": 5.65,
  "speaker_id": 0
}
Load the JSON into Datasette Lite (https://lite.datasette.io/?json=URL) to facet by speaker_id and browse the conversation turn by turn. The diarization accurately distinguishes speakers, even catching voice changes during intros.
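The token budget scales roughly linearly with audio length. A helper for sizing --max-tokens, where the ~330 tokens/minute rate is an inference from the figures above (8,192 tokens for ~25 minutes), not a documented constant:

```python
import math

TOKENS_PER_MIN = 8192 / 25   # rate implied by the defaults above (~328 t/min), an assumption
MODEL_LIMIT_MIN = 59         # audio beyond this is trimmed by the model

def suggest_max_tokens(audio_minutes: float) -> int:
    """Pick a --max-tokens value with ~20% headroom, rounded up to a power of two."""
    minutes = min(audio_minutes, MODEL_LIMIT_MIN)
    needed = minutes * TOKENS_PER_MIN * 1.2
    return 2 ** math.ceil(math.log2(needed))

print(suggest_max_tokens(25))   # 16384
print(suggest_max_tokens(59))   # 32768
```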
M5 Max Performance: Fast for Local Use
On a 128GB M5 Max MacBook Pro, a 99.8-minute podcast (trimmed to 59 minutes) took 524.79s total:
- Prompt: 26,615 tokens at 50.718 t/s
- Generation: 20,248 tokens at 38.585 t/s
- Peak memory reported by the tool: 30.44GB (Activity Monitor showed 61.5GB during prefill, 18GB during generation)
That works out to 8m 45s for roughly an hour of audio, enabling quick local prototyping without cloud costs.
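Quick arithmetic on the numbers above (the 59-minute trimmed length against the 524.79s wall time) gives the effective real-time factor:

```python
# Sanity-check the throughput figures above.
audio_s = 59 * 60          # trimmed audio length in seconds
wall_s = 524.79            # reported total transcription time
rtf = audio_s / wall_s     # real-time factor
print(f"{rtf:.1f}x faster than real time")  # ~6.7x
```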
Handling Long Audio Requires Splitting
The model caps input at ~59 minutes. For longer files, split the audio into chunks with 1-minute overlaps; the overlapping segments let you align speaker IDs across chunks and avoid words cut mid-sentence, then merge the chunk transcripts into one full transcript in post-processing.
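A sketch of the chunking step: compute overlapping windows and emit ffmpeg slicing commands. The chunk/overlap sizes follow the text above; the ffmpeg invocation and output naming are assumptions, not part of the original workflow:

```python
def chunk_bounds(total_min: float, chunk_min: float = 59, overlap_min: float = 1):
    """Compute (start, end) windows in minutes, overlapping so chunks can be stitched."""
    bounds, start = [], 0.0
    while start < total_min:
        end = min(start + chunk_min, total_min)
        bounds.append((start, end))
        if end >= total_min:
            break
        start = end - overlap_min  # back up by the overlap for the next chunk
    return bounds

for start, end in chunk_bounds(99.8):
    # -ss/-t take seconds; input seeking (-ss before -i) keeps slicing fast
    print(f"ffmpeg -ss {start*60:.0f} -t {(end-start)*60:.0f} -i lenny.mp3 chunk_{start:.0f}.mp3")
```

For the 99.8-minute podcast this yields two chunks, 0-59min and 58-99.8min, whose shared minute anchors the speaker-ID alignment.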