Audio Flamingo Next: NVIDIA's Open Audio LLM
AF-Next processes up to 30 minutes of 16kHz audio for transcription, captioning, and QA over speech, environmental sounds, and music. Use the instruct-tuned checkpoint for chat/QA, the Think variant for reasoning traces, and the Captioner for dense descriptions. Install via Transformers.
Choose AF-Next Variant by Task to Maximize Output Quality
NVIDIA's Audio Flamingo Next (AF-Next) handles general audio understanding across speech, environmental sounds, and music, processing 16kHz audio in 30-second chunks, up to 1800 seconds (30 minutes) total. Select a variant based on your task:
| Task Type | Recommended Checkpoint | Key Strengths |
|---|---|---|
| QA, chat, ASR/AST, direct answers | nvidia/audio-flamingo-next-hf (Instruct) | Default for assistant-style responses. |
| Multi-step reasoning, timestamped evidence, long traces | nvidia/audio-flamingo-next-think-hf (Think) | Explicit reasoning chains grounded in audio timestamps. |
| Dense captions, timestamped breakdowns, descriptive outputs | nvidia/audio-flamingo-next-captioner-hf (Captioner) | Verbose scene descriptions and transcriptions. |
Start with Instruct for most use cases; switch to Think for complex analysis requiring evidence traces, or to Captioner for detailed summaries. The model excels at multi-turn chat but is licensed for non-commercial research only; this audio-text-to-text release does not include streaming TTS or voice-to-voice.
Known limitations include reduced fidelity on very long audio, weaker performance outside English, and imperfect music identification; structured prompting with Think or Captioner can partly mitigate these.
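As a convenience, the variant table above can be encoded as a small lookup. This is an illustrative sketch: the checkpoint IDs come from the table, while the task keys and helper name are our own shorthand, not part of any official API.

```python
# Illustrative task-to-checkpoint lookup based on the variant table above.
# Checkpoint IDs are from the table; the task keys are informal shorthand.
AF_NEXT_CHECKPOINTS = {
    "qa": "nvidia/audio-flamingo-next-hf",                    # Instruct: chat, ASR/AST, direct answers
    "reasoning": "nvidia/audio-flamingo-next-think-hf",       # Think: timestamped reasoning traces
    "captioning": "nvidia/audio-flamingo-next-captioner-hf",  # Captioner: dense descriptions
}

def pick_checkpoint(task: str) -> str:
    """Fall back to the Instruct checkpoint for unrecognized tasks."""
    return AF_NEXT_CHECKPOINTS.get(task, AF_NEXT_CHECKPOINTS["qa"])
```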
Prompt Precisely for ASR, Captioning, and QA Tasks
Craft prompts to target specific skills, always pairing a text instruction with the audio input in chat format. The following examples elicit precise outputs:
- ASR / diarized ASR: "Transcribe the input speech." or "Transcribe the input audio. If multiple speakers are present, provide diarized transcripts with speaker labels. Speaker 1 ... Speaker 2 ..." (Instruct/Think).
- Audio Captioning: Short: "Generate a caption for the input audio." Long: "Generate a detailed caption... transcribe all spoken content by all speakers precisely." (Captioner/Think).
- Music Analysis: "Summarize the track with precision: mention its musical style, BPM, key, arrangement, production choices, and the emotions or story it conveys." (Captioner/Instruct/Think).
- Lyrics: "Generate a lyrics transcription from the input song." (Instruct/Captioner/Think).
- Translation: "Translate any speech you hear from <src_lang> into <tgt_lang>." (Instruct/Think).
- Timestamped QA: "What precise description did the commentator use for the punch that ended the fight?" or multi-turn: Initial summary then "What happens right before the argument becomes heated?" (Instruct/Think).
Combine these in conversations: pair an audio path with the text prompt, then generate with max_new_tokens=1024 and repetition_penalty=1.2. For multi-turn exchanges, append alternating assistant and user turns to the conversation list.
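The chat-format input described above can be sketched in plain Python. The helper names below (user_turn, assistant_turn) are our own, and the example reply text is invented; only the role/content dict shape and the prompts come from this document.

```python
# Sketch of the chat-format input: each turn pairs audio and/or text content.
# user_turn/assistant_turn are illustrative helper names, not part of the library.
def user_turn(text, audio_path=None):
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "path": audio_path})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

def assistant_turn(text):
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}

# Single-turn ASR request:
conversation = [user_turn("Transcribe the input speech.", "meeting.wav")]

# Multi-turn: append the model's (hypothetical) reply, then a follow-up question.
conversation.append(assistant_turn("The speakers discuss the quarterly budget."))
conversation.append(user_turn("What happens right before the argument becomes heated?"))
```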
Implement in 5 Lines with Transformers for Single/Multi-Turn Inference
Install: `pip install --upgrade transformers accelerate`. Load via:

```python
import torch
from transformers import AutoModel, AutoProcessor

model_id = "nvidia/audio-flamingo-next-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
```
Build the conversation as a list of dicts, each with "role" ("user" or "assistant") and "content" (a list of {"type": "audio", "path": ...} or {"type": "text", "text": ...} items). Then process, generate, and decode:

```python
batch = processor.apply_chat_template(
    conversation, tokenize=True, add_generation_prompt=True, return_dict=True
).to(model.device)
prompt_len = batch["input_ids"].shape[1]  # length of the templated prompt
generated = model.generate(**batch, max_new_tokens=1024, repetition_penalty=1.2)
output = processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0]
```
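Putting the load, template, generate, and decode steps together, a minimal end-to-end helper might look like the sketch below. This is not an official API: load_af_next and ask are our own names, and prompt_len is taken from the templated input length so the decoded output excludes the echoed prompt.

```python
import torch
from transformers import AutoModel, AutoProcessor

def load_af_next(model_id="nvidia/audio-flamingo-next-hf"):
    """Load the AF-Next processor and model (sketch; names are illustrative)."""
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    return model, processor

def ask(model, processor, audio_path, question):
    """Single-turn QA over one audio file."""
    conversation = [{
        "role": "user",
        "content": [{"type": "audio", "path": audio_path},
                    {"type": "text", "text": question}],
    }]
    batch = processor.apply_chat_template(
        conversation, tokenize=True, add_generation_prompt=True, return_dict=True
    ).to(model.device)
    prompt_len = batch["input_ids"].shape[1]  # tokens belonging to the prompt
    with torch.inference_mode():
        generated = model.generate(**batch, max_new_tokens=1024, repetition_penalty=1.2)
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0]
```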
Training totals: 45K hours of pre-training audio, 200K+ mid-training samples (5 datasets, 30 epochs), 2M+ post-training instructions, 1M GRPO-aligned instructions, plus 30K AF-Think samples for reasoning. Architecture: audio encoder (hidden size 1280, 32 layers), text decoder (hidden size 3584, 28 layers, max position 131072), 128 experts, 30-second audio patches, 2 connector types.
Training Curriculum Builds Robust Audio Reasoning
Four-stage pipeline: pre-train on raw audio-text pairs (45K hours), mid-train on 200K+ clips (5 datasets, 30 epochs), post-train on 2M+ instructions, then GRPO alignment for chat, safety, and AudioSkills-XL. The final AF-Think dataset (30K samples) adds temporal grounding. Datasets: nvidia/LongAudio, AF-Chat, AF-Think.