Audio Flamingo Next: NVIDIA's Open Audio LLM
AF-Next processes up to 30 minutes of 16kHz audio for transcription, captioning, and QA over speech, environmental sounds, and music. Use the instruct-tuned checkpoint for chat/QA, the Think variant for reasoning traces, and the Captioner for dense descriptions. Install via Transformers.
Choose AF-Next Variant by Task to Maximize Output Quality
NVIDIA's Audio Flamingo Next (AF-Next) handles general audio understanding across speech, environmental sounds, and music, processing 16kHz audio in 30-second chunks, up to 1800 seconds (30 minutes) total. Select a variant based on your task:
| Task Type | Recommended Checkpoint | Key Strengths |
|---|---|---|
| QA, chat, ASR/AST, direct answers | nvidia/audio-flamingo-next-hf (Instruct) | Default for assistant-style responses. |
| Multi-step reasoning, timestamped evidence, long traces | nvidia/audio-flamingo-next-think-hf (Think) | Explicit reasoning chains grounded in audio timestamps. |
| Dense captions, timestamped breakdowns, descriptive outputs | nvidia/audio-flamingo-next-captioner-hf (Captioner) | Verbose scene descriptions and transcriptions. |
Start with Instruct for most use cases; switch to Think for complex analysis requiring evidence traces, or to Captioner for detailed summaries. The model excels at multi-turn chat but is licensed for non-commercial research only; this audio-text-to-text release does not include streaming TTS or voice-to-voice.
Known limitations include reduced fidelity on very long audio, weaker performance outside English, and imperfect music identification; structured prompting with Think or Captioner can partly mitigate these.
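As a convenience, the variant table above can be encoded as a small lookup. This is an illustrative sketch: the checkpoint IDs come from the table, while the task keys and helper name are our own shorthand, not part of any official API.

```python
# Illustrative task-to-checkpoint lookup based on the variant table above.
# Checkpoint IDs are from the table; the task keys are informal shorthand.
AF_NEXT_CHECKPOINTS = {
    "qa": "nvidia/audio-flamingo-next-hf",                    # Instruct: chat, ASR/AST, direct answers
    "reasoning": "nvidia/audio-flamingo-next-think-hf",       # Think: timestamped reasoning traces
    "captioning": "nvidia/audio-flamingo-next-captioner-hf",  # Captioner: dense descriptions
}

def pick_checkpoint(task: str) -> str:
    """Fall back to the Instruct checkpoint for unrecognized tasks."""
    return AF_NEXT_CHECKPOINTS.get(task, AF_NEXT_CHECKPOINTS["qa"])
```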
Prompt Precisely for ASR, Captioning, and QA Tasks
Craft prompts to target specific skills, always pairing a text instruction with the audio input in chat format. The following examples elicit precise outputs:
- ASR / diarized ASR: "Transcribe the input speech." or "Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels. Speaker 1 ... Speaker 2 ..." (Instruct/Think).
- Audio Captioning: Short: "Generate a caption for the input audio." Long: "Generate a detailed caption... transcribe all spoken content by all speakers precisely." (Captioner/Think).
- Music Analysis: "Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys." (Captioner/Instruct/Think).
- Lyrics: "Generate a lyrics transcription from the input song." (Instruct/Captioner/Think).
- Translation: "Translate any speech you hear from <src_lang> into <tgt_lang>." (Instruct/Think).
- Timestamped QA: "What precise description did the commentator use for the punch that ended the fight?" or multi-turn: Initial summary then "What happens right before the argument becomes heated?" (Instruct/Think).
Combine these in conversations: pair an audio path with the text prompt, then generate with max_new_tokens=1024 and repetition_penalty=1.2. For multi-turn exchanges, append alternating assistant and user turns to the conversation list.
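The chat-format input described above can be sketched in plain Python. The helper names below (user_turn, assistant_turn) are our own, and the example reply text is invented; only the role/content dict shape and the prompts come from this document.

```python
# Sketch of the chat-format input: each turn pairs audio and/or text content.
# user_turn/assistant_turn are illustrative helper names, not part of the library.
def user_turn(text, audio_path=None):
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "path": audio_path})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

def assistant_turn(text):
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}

# Single-turn ASR request:
conversation = [user_turn("Transcribe the input speech.", "meeting.wav")]

# Multi-turn: append the model's (hypothetical) reply, then a follow-up question.
conversation.append(assistant_turn("The speakers discuss the quarterly budget."))
conversation.append(user_turn("What happens right before the argument becomes heated?"))
```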
Implement in 5 Lines with Transformers for Single/Multi-Turn Inference
Install: `pip install --upgrade transformers accelerate`. Load via:

```python
import torch
from transformers import AutoModel, AutoProcessor

model_id = "nvidia/audio-flamingo-next-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
```
Build the conversation as a list of dicts, each with "role" ("user" or "assistant") and "content" (a list of {"type": "audio", "path": ...} or {"type": "text", "text": ...} items). Then process, generate, and decode:

```python
batch = processor.apply_chat_template(
    conversation, tokenize=True, add_generation_prompt=True, return_dict=True
).to(model.device)
prompt_len = batch["input_ids"].shape[1]  # length of the templated prompt
generated = model.generate(**batch, max_new_tokens=1024, repetition_penalty=1.2)
output = processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0]
```
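Putting the load, template, generate, and decode steps together, a minimal end-to-end helper might look like the sketch below. This is not an official API: load_af_next and ask are our own names, and prompt_len is taken from the templated input length so the decoded output excludes the echoed prompt.

```python
import torch
from transformers import AutoModel, AutoProcessor

def load_af_next(model_id="nvidia/audio-flamingo-next-hf"):
    """Load the AF-Next processor and model (sketch; names are illustrative)."""
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    return model, processor

def ask(model, processor, audio_path, question):
    """Single-turn QA over one audio file."""
    conversation = [{
        "role": "user",
        "content": [{"type": "audio", "path": audio_path},
                    {"type": "text", "text": question}],
    }]
    batch = processor.apply_chat_template(
        conversation, tokenize=True, add_generation_prompt=True, return_dict=True
    ).to(model.device)
    prompt_len = batch["input_ids"].shape[1]  # tokens belonging to the prompt
    with torch.inference_mode():
        generated = model.generate(**batch, max_new_tokens=1024, repetition_penalty=1.2)
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0]
```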
Training totals: 45K hours of pre-training audio, 200K+ mid-training samples (5 datasets, 30 epochs), 2M+ post-training instructions, 1M GRPO-aligned instructions, plus 30K AF-Think samples for reasoning. Architecture: audio encoder (hidden size 1280, 32 layers), text decoder (hidden size 3584, 28 layers, max position 131072), 128 experts, 30-second audio patches, 2 connector types.
Training Curriculum Builds Robust Audio Reasoning
Four-stage pipeline: pre-train on raw audio-text pairs (45K hours), mid-train on 200K+ clips (5 datasets, 30 epochs), post-train on 2M+ instructions, then GRPO alignment for chat, safety, and AudioSkills-XL. The final AF-Think dataset (30K samples) adds temporal grounding. Datasets: nvidia/LongAudio, AF-Chat, AF-Think.