Run VibeVoice STT on Mac with MLX in one command

Use `uv run --with mlx-audio mlx_audio.stt.generate --model mlx-community/VibeVoice-ASR-4bit --audio file.mp3 --output-path out --format json --max-tokens 32768` to transcribe up to ~59min of audio with speaker diarization; it processes a 1hr podcast in 524s (8min 45s) on an M5 Max, peaking at 30GB RAM.

Deploy VibeVoice Locally for Fast Transcription

Microsoft's MIT-licensed VibeVoice-ASR model, a Whisper-style speech-to-text system with built-in speaker diarization, runs on Mac via mlx-audio and a 5.71GB 4-bit MLX-quantized version from Hugging Face. Install with uv and execute in one line: `uv run --with mlx-audio mlx_audio.stt.generate --model mlx-community/VibeVoice-ASR-4bit --audio input.mp3 --output-path output --format json --verbose --max-tokens 32768`. This handles MP3 and WAV inputs, producing JSON segments with second-resolution timestamps and speaker IDs. The default `--max-tokens` of 8192 covers roughly 25min of audio; raise it to 32768 for full ~1hr files.
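The two data points above (8192 tokens for ~25min by default, 32768 for ~1hr) suggest a rough linear rule of thumb for budgeting `--max-tokens`. This helper is a hypothetical sketch based on that assumption, not part of mlx-audio:

```python
import math

# Assumption: generation tokens scale roughly linearly with audio length,
# anchored on the documented default of 8192 tokens for ~25 minutes.
TOKENS_PER_MINUTE = 8192 / 25  # ~328 tokens/min

def max_tokens_for(minutes: float, headroom: float = 1.25) -> int:
    """Estimate a --max-tokens budget with 25% headroom, rounded up
    to the next power of two for tidy CLI values."""
    needed = minutes * TOKENS_PER_MINUTE * headroom
    return 2 ** math.ceil(math.log2(needed))

print(max_tokens_for(59))  # ~1hr file
```

For a 59min file this lands on 32768, matching the value used in the command above.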

Achieve 8:45min Processing for 1hr Audio on Apple Silicon

On a 128GB M5 Max MacBook Pro, transcribing a 99.8min podcast trimmed to the 59min maximum takes 524.79s total: 26,615 prompt tokens at 50.718 t/s and 20,248 generation tokens at 38.585 t/s, with peak RAM reported as 30.44GB by MLX (Activity Monitor showed 61.5GB during prefill and 18GB during generation). For longer audio, split files with 1min overlaps to align speaker IDs across chunks and avoid cut-off words.
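The splitting advice above can be planned as simple arithmetic: given a total duration and the ~59min per-run ceiling, compute chunk boundaries with 1min overlaps. This is a hypothetical planning helper, not part of mlx-audio; the actual cutting would be done with a tool such as ffmpeg:

```python
def plan_chunks(total_min: float, max_chunk: float = 59.0, overlap: float = 1.0):
    """Return (start, end) times in minutes for overlapping chunks.
    Each chunk starts `overlap` minutes before the previous one ended,
    so speaker IDs can be matched across the seam and no words are cut."""
    chunks, start = [], 0.0
    while start < total_min:
        end = min(start + max_chunk, total_min)
        chunks.append((start, end))
        if end >= total_min:
            break
        start = end - overlap  # back up to create the overlap
    return chunks

print(plan_chunks(99.8))  # → [(0.0, 59.0), (58.0, 99.8)]
```

For the 99.8min podcast above, this yields two runs sharing minute 58–59, whose transcripts can then be merged by matching speakers in the overlap.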

Parse Output as Segmented JSON for Analysis

Output is an array of objects like `{"text": "...", "start": 13.85, "end": 19.5, "duration": 5.65, "speaker_id": 0}`, enabling per-speaker analysis (e.g., distinguishing hosts from sponsor reads). Load the JSON into Datasette Lite via URL for faceted browsing by `speaker_id`, which can reveal quirks such as multiple speaker IDs assigned to one person.
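A minimal sketch of loading that output and totalling talk time per speaker. The inline sample matches the segment shape shown above; the actual output filename depends on your `--output-path` and is an assumption here:

```python
import json
from collections import defaultdict

# Sample segments matching the shape described above; in practice you
# would read the file written under --output-path instead, e.g. with
# json.load(open(...)) on the generated JSON (filename is an assumption).
segments = json.loads('''[
  {"text": "Welcome back to the show.", "start": 13.85, "end": 19.5,
   "duration": 5.65, "speaker_id": 0},
  {"text": "Thanks for having me.", "start": 19.5, "end": 21.0,
   "duration": 1.5, "speaker_id": 1}
]''')

# Sum the duration of each speaker's segments.
talk_time = defaultdict(float)
for seg in segments:
    talk_time[seg["speaker_id"]] += seg["duration"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"speaker {speaker}: {seconds:.2f}s")
```

The same per-speaker grouping is what Datasette Lite's faceting by `speaker_id` gives you interactively.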

Summarized by x-ai/grok-4.1-fast via openrouter

5070 input / 1812 output tokens in 13237ms

© 2026 Edge