Select a Granite 4.1 Variant by Your ASR Bottleneck

IBM's Granite Speech 4.1 release comprises three ~2B-parameter models optimized for edge deployment, each targeting a specific constraint: accuracy, structured output, or throughput. Use the base model (ibm/granite-speech-4.1-2b) for top accuracy: it leads the Hugging Face Open ASR Leaderboard with 5.33% word error rate (WER) across diverse datasets, translating to ~95% word accuracy in real-world scenarios. Its real-time factor (RTF) reaches 231, i.e., roughly 4 minutes of audio per second of compute (a 1-hour file in ~16 seconds). It supports seven languages for transcription, including English, French, German, Spanish, Portuguese, and Japanese, plus bidirectional speech-to-text translation, punctuation, truecasing, and keyword biasing (pass domain-specific terms like names or acronyms in the prompt to boost recognition).
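As a quick illustration, here is how a keyword-biasing prompt could be assembled in Python. The tag format is the one quoted in the Transformers section below; the terms themselves are hypothetical placeholders.

import json

# Domain terms to bias recognition toward (hypothetical examples).
keywords = ["watsonx", "Arvind Krishna", "EBITDA"]

# json.dumps produces the double-quoted list the prompt format expects.
prompt = (
    "<|startoftranscript|><|en|><|transcribe|>"
    f"<|keywords|>{json.dumps(keywords)}<|endkeywords|>"
)
print(prompt)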

Switch to the Plus variant (ibm/granite-speech-4.1-2b-plus) for speaker-attributed ASR (diarization) and word-level timestamps. It labels speakers (e.g., Speaker 1, Speaker 2) for podcasts or meetings, with timestamp accuracy outperforming Whisper-X and customized Whisper models. Incremental decoding lets you prefix the prior transcript for seamless long-audio chunking with overlap, maintaining consistent speaker IDs; see the sketch below. Trade-offs: WER rises slightly, language coverage drops to five (no Japanese), and there is no translation or keyword biasing.
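A minimal sketch of the corresponding prompts, assuming the task tag shown later in the Transformers section. How much prior text to prefix is a tuning choice, so verify the convention against the model card.

# Task tag for speaker-attributed ASR with the Plus variant.
DIARIZE = "<|startoftranscript|><|en|><|transcribe|><|speaker_attributed_asr|>"

def chunk_prompt(prior_transcript: str = "") -> str:
    # Prefixing the tail of the previous chunk's transcript keeps
    # Speaker 1 / Speaker 2 labels consistent across the overlap.
    return DIARIZE + prior_transcript

first = chunk_prompt()
second = chunk_prompt("Speaker 2: so the roadmap slips a quarter. ")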

For bulk processing, pick the NAR model (ibm/granite-speech-4.1-2b-nar): its non-autoregressive design skips sequential token generation, reaching an RTF of 1820 when batched on an H100 (1 hour of audio in ~2 seconds). It offers no diarization, timestamps, translation, or keyword biasing, but WER stays competitive.
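The quoted RTF figures translate directly into wall-clock estimates (RTF is seconds of audio processed per second of compute):

def processing_seconds(audio_seconds: float, rtf: float) -> float:
    # RTF = audio duration / processing time, so time = duration / RTF.
    return audio_seconds / rtf

print(processing_seconds(3600, 231))    # base model: ~15.6 s per hour of audio
print(processing_seconds(3600, 1820))   # NAR on H100, batched: ~2 s per hour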

Model | Key Strengths           | WER         | RTF         | Languages | Features
Base  | Accuracy                | 5.33%       | 231         | 7         | Translation, keyword bias
Plus  | Diarization, timestamps | Higher      | Lower       | 5         | Incremental decode
NAR   | Throughput              | Competitive | 1820 (H100) | ?         | Raw transcripts

Non-Autoregressive Transcript Editing Beats Sequential Decoding

Standard ASR systems like Whisper or Parakeet use autoregressive transformers, generating tokens sequentially: each token depends on the previous ones, which bottlenecks GPUs with many tiny forward passes. The NAR model avoids this via NLE (non-autoregressive LLM-based editing): a cheap CTC encoder with bidirectional attention drafts a transcript, then an LLM edits it in parallel (copy, insert, delete, replace). This parallelizes decoding without losing conditioning on the full context, improving on one-shot CTC predictions. The result is a massive speedup without a large WER hit, ideal for transcribing hundreds of hours of raw audio.
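The operation set is easy to picture with Python's difflib. This is an illustration of the four edit operations only, not the model's learned editor:

from difflib import SequenceMatcher

draft = "the quick brow fox jump over lazy dog".split()        # noisy CTC draft
final = "the quick brown fox jumps over the lazy dog".split()  # edited output

# get_opcodes() yields the same operation set NLE applies: 'equal' (copy),
# 'insert', 'delete', and 'replace'.
for op, i1, i2, j1, j2 in SequenceMatcher(a=draft, b=final).get_opcodes():
    print(f"{op:8s} {draft[i1:i2]} -> {final[j1:j2]}")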

Run Locally with Transformers: Chunking and Fine-Tuning Tips

Load via Hugging Face Transformers: processor = AutoProcessor.from_pretrained(model_id); model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id). Use generate() with custom prompts for diarization (<|startoftranscript|><|en|><|transcribe|><|speaker_attributed_asr|>) or keyword biasing (<|startoftranscript|><|en|><|transcribe|><|keywords|>["term1", "term2"]<|endkeywords|>). The NAR model requires Flash Attention (compiled for CUDA 13+; expect issues on Colab's T4 GPUs).
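Putting the pieces together, a minimal end-to-end sketch might look like this. The processor's text/audio call signature and the 16 kHz mono preprocessing are assumptions to verify against the model card.

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm/granite-speech-4.1-2b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

wav, sr = torchaudio.load("meeting.wav")            # hypothetical input file
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)

prompt = "<|startoftranscript|><|en|><|transcribe|>"
inputs = processor(prompt, wav, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output, skip_special_tokens=True)[0])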

For long audio (e.g., 4-hour podcasts), chunk with overlap and prefix the prior text for continuity, as sketched below. Fine-tune on domain data like court transcripts or podcasts using the earlier Granite fine-tuning notebooks; training on host-specific accents improves WER. Measured RTF varies by hardware (an RTX 6000 Blackwell reaches good speeds but falls short of the H100 figures without batching). You can also wrap the model in a local API so agents can transcribe without cloud calls.
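A hedged chunking loop under those assumptions, reusing processor and model from the snippet above. Chunk length, overlap, and how much prior text to re-feed are all tuning knobs:

SR = 16000
CHUNK_S, OVERLAP_S = 30, 5  # assumed window sizes; tune for your model/memory

def transcribe_long(wav, base_prompt="<|startoftranscript|><|en|><|transcribe|>"):
    step = (CHUNK_S - OVERLAP_S) * SR
    pieces, prior_tail = [], ""
    for start in range(0, wav.shape[-1], step):
        chunk = wav[:, start:start + CHUNK_S * SR]
        # Prefix the previous chunk's tail so decoding stays consistent
        # across the overlap.
        inputs = processor(base_prompt + prior_tail, chunk,
                           return_tensors="pt").to(model.device)
        text = processor.batch_decode(
            model.generate(**inputs, max_new_tokens=512),
            skip_special_tokens=True,
        )[0]
        pieces.append(text)
        prior_tail = text[-200:]  # last ~200 characters as the next prefix
    return pieces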