MOSS-Audio Unifies Audio Tasks in One Open Model

MOSS-Audio open-source models (4B/8B) handle speech, sound, music analysis, emotion detection, and time-aware QA in a single system, beating 30B+ rivals on benchmarks via DeepStack injection and time-markers.

Single Foundation Model Replaces Audio Toolchains

Build audio apps without stitching together ASR, emotion detectors, sound classifiers, and music analyzers: the MOSS-Audio-4B/8B models process raw audio for transcription with timestamps, speaker ID, emotion from tone/timbre/context, background-scene inference, music style/instrumentation/emotion arcs, captioning, summarization, and multi-hop reasoning over podcasts and meetings. Use the Instruct variants (Qwen3-4B/8B backbone, ~4.6B/8.6B params) for structured outputs in pipelines; the Thinking variants excel at chain-of-thought for complex inference. The model takes raw audio; its encoder emits 12.5 Hz representations that are projected into LLM embedding space for text generation.
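The 12.5 Hz encoder rate above sets the token budget for long audio. A minimal sketch of that arithmetic, where the helper name and ceiling rounding are assumptions for illustration (only the frame rate comes from the article):

```python
import math

FRAME_RATE_HZ = 12.5  # encoder output rate stated for MOSS-Audio

def encoder_frame_count(duration_s: float) -> int:
    """Estimated number of 12.5 Hz encoder frames for a clip of duration_s seconds."""
    return math.ceil(duration_s * FRAME_RATE_HZ)

# A 60-minute meeting recording yields 45,000 frames for the LLM to reason over:
print(encoder_frame_count(60 * 60))  # 45000
```

At this rate, an hour of audio stays in the tens of thousands of frames, which is what makes single-pass reasoning over podcasts and meetings plausible.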

DeepStack Injection and Time-Markers Boost Fidelity

Avoid losing prosody, timbre, and transients by injecting multi-layer encoder features: the DeepStack module selects early, intermediate, and final encoder layers, projects each separately, and feeds them into the LLM's early layers to retain granular acoustic detail alongside semantics. Gain native temporal awareness without post-processing: pretraining inserts fixed-interval time tokens between frames, enabling 'what happened at 2:00?' QA, event localization, and long-audio reasoning directly in autoregressive generation. The encoder is trained from scratch for robust speech across domains, rather than relying on a generic frontend.
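The two mechanisms above can be sketched in a few lines. All names, layer picks, marker format, and intervals here are assumptions for illustration; the summary does not specify MOSS-Audio's actual choices:

```python
from typing import List

def deepstack_select(encoder_layers: List[List[float]],
                     picks=(0, None, -1)) -> List[List[float]]:
    """Pick early / intermediate / final encoder layers for injection.
    `None` stands for the middle layer. In the real model each selected
    layer would be projected separately and fed into an early LLM layer."""
    mid = len(encoder_layers) // 2
    return [encoder_layers[mid if p is None else p] for p in picks]

def insert_time_tokens(frames: List[str], frame_rate_hz: float = 12.5,
                       interval_s: float = 2.0) -> List[str]:
    """Interleave fixed-interval time-marker tokens into the frame stream,
    so timestamps are visible to the LLM during autoregressive generation."""
    step = int(interval_s * frame_rate_hz)  # frames between markers
    out: List[str] = []
    for i, frame in enumerate(frames):
        if i % step == 0:
            out.append(f"<t={i / frame_rate_hz:.1f}s>")  # hypothetical token format
        out.append(frame)
    return out

# 50 frames at 12.5 Hz = 4 s of audio; markers land at 0.0 s and 2.0 s:
print(insert_time_tokens(["f"] * 50)[:3])
```

Because the markers are ordinary tokens in the sequence, questions like "what happened at 2:00?" become plain next-token prediction rather than a post-hoc alignment step.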

Outperforms Larger Models on Key Benchmarks

MOSS-Audio-8B-Thinking averages 71.08% accuracy (MMAU 77.33, MMAU-Pro 64.92, MMAR 66.53, MMSU 75.52), topping open-source models including the 33B Step-Audio-R1 (70.67) and 30B Qwen3-Omni (67.91); 4B-Thinking hits 68.37, beating all larger Instruct models. It leads speech captioning (8B-Instruct: 3.7252/5 across 13 traits such as accent, emotion, and personality, scored by an LLM judge) and posts the lowest ASR CER of 11.30 across 12 dimensions (health, code-switching, singing, and more). On timestamped ASR (AAS, lower is better), 8B-Instruct scores 35.77 on AISHELL-1 and 131.61 on LibriSpeech vs. 833.66 for 30B Qwen3-Omni and 708.24 for Gemini-3.1-Pro.
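The 71.08% headline figure is simply the unweighted mean of the four benchmark scores quoted above, which a quick check confirms:

```python
# Reproduce the reported average from the four per-benchmark scores.
scores = {"MMAU": 77.33, "MMAU-Pro": 64.92, "MMAR": 66.53, "MMSU": 75.52}
avg = sum(scores.values()) / len(scores)
# avg is ~71.075, matching the reported 71.08 (vs. 70.67 for 33B Step-Audio-R1)
print(avg)
```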

Download from Hugging Face collections or GitHub repo to integrate into apps today.

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge