Run VibeVoice STT Locally on Mac in One uv Command

Transcribe up to ~59 minutes of audio with Microsoft's MIT-licensed VibeVoice model via a single uv + mlx-audio command: on an M5 Max Mac, a roughly 1-hour podcast transcribes in 524s (about 8m45s) with 30-61GB peak RAM, producing speaker-diarized JSON segments.

This link post demonstrates running Microsoft's VibeVoice, a Whisper-style speech-to-text model with built-in speaker diarization, locally on Apple Silicon. The model was released January 21, 2026 under an MIT license; the demo uses the 5.71GB 4-bit MLX-quantized version of the 17.3GB original for efficient inference.

One-Liner Command Delivers Full Transcription

Install and run via uv and mlx-audio:

uv run --with mlx-audio mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-4bit \
  --audio lenny.mp3 --output-path lenny \
  --format json --verbose --max-tokens 32768

The command handles .mp3 and .wav inputs. The default --max-tokens of 8192 covers roughly 25 minutes of audio; raising it to 32768 extends coverage to roughly 59 minutes, the model's limit, beyond which longer files are trimmed. The output is a JSON array of segments like:

{
  "text": "And an open question for me is...",
  "start": 13.85,
  "end": 19.5,
  "duration": 5.65,
  "speaker_id": 0
}
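A minimal Python sketch of working with that output: load the segment array and total up talk time per speaker. The segments are embedded inline here (the first is the example above; the second is an invented segment for illustration), since the exact filename mlx-audio writes for --output-path lenny isn't shown in the post.

```python
import json
from collections import defaultdict

# Inline sample of the JSON segment array; in practice you would read
# this from the file written by --output-path. The second segment is
# a made-up example.
segments = json.loads("""[
  {"text": "And an open question for me is...",
   "start": 13.85, "end": 19.5, "duration": 5.65, "speaker_id": 0},
  {"text": "Yeah, that's a great question.",
   "start": 19.5, "end": 21.9, "duration": 2.4, "speaker_id": 1}
]""")

# Total talk time per speaker, in seconds.
talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker_id"]] += seg["duration"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"speaker {speaker}: {seconds:.1f}s")
```

The same grouping is what faceting by speaker_id gives you interactively.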

Load the JSON into Datasette Lite (https://lite.datasette.io/?json=URL) to facet by speaker_id and browse speaker turns. The model distinguishes speakers accurately, even catching voice changes in intros.

M5 Max Performance: Fast for Local Use

On 128GB M5 Max MacBook Pro, 99.8min podcast (trimmed to 59min) took 524.79s total:

  • Prompt: 26,615 tokens at 50.718 t/s
  • Generation: 20,248 tokens at 38.585 t/s
  • Peak reported: 30.44GB RAM (Activity Monitor showed 61.5GB prefill, 18GB generation)

That's 8min 45s for ~1hr audio, enabling quick local prototyping without cloud costs.
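As a quick sanity check on those numbers, the realtime factor works out to roughly 6.7x:

```python
# Rough realtime factor from the figures above: ~59 minutes of audio
# processed in 524.79 seconds of wall-clock time.
audio_seconds = 59 * 60        # trimmed podcast length
wall_seconds = 524.79          # total reported runtime
rtf = audio_seconds / wall_seconds
print(f"{rtf:.1f}x realtime")  # prints "6.7x realtime"
```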

Handling Long Audio Requires Splitting

The model caps out at ~59 minutes. For longer files, split the audio into chunks with 1-minute overlaps so that speaker IDs can be matched across chunks and no words are cut off, then align the overlapping segments in post-processing to merge the chunks into a single full transcript.
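The chunking step can be sketched in Python. The 55-minute chunk length (safely under the ~59-minute cap) and the 1-minute overlap are illustrative choices, not values from mlx-audio:

```python
# Plan (start, end) offsets that cover a long file with overlapping
# chunks, each short enough for the model's ~59-minute limit.
CHUNK_S = 55 * 60    # chunk length in seconds (assumed safe margin)
OVERLAP_S = 60       # 1 min of shared audio to align speakers/words

def plan_chunks(total_seconds):
    """Return a list of (start, end) offsets covering the file."""
    chunks, start = [], 0
    while start < total_seconds:
        end = min(start + CHUNK_S, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start = end - OVERLAP_S  # back up to create the overlap
    return chunks

# The 99.8-minute podcast (5988s) needs two chunks:
print(plan_chunks(5988))  # → [(0, 3300), (3240, 5988)]
```

Each planned chunk can then be cut with ffmpeg (its -ss and -t options take a start offset and duration), transcribed separately with the uv command above, and merged by matching segments that fall inside the shared overlap window.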

Summarized by x-ai/grok-4.1-fast via openrouter

5072 input / 2597 output tokens in 26872ms

© 2026 Edge