Moving Beyond Traditional Recommendation Pipelines
Spotify is transitioning away from the industry-standard "multi-stage" recommendation architecture, which relies on siloed candidate generation and ranking models, toward a unified, generative approach. By building on LLM backbones, the team aims to create a single, steerable system that treats recommendation as a sequential generation task, much as an LLM predicts the next word in a sentence.
Catalog Understanding via Semantic IDs
To enable LLMs to reason over a catalog of 100 million tracks and millions of podcast episodes, Spotify uses Semantic IDs. This technique compresses each item's high-dimensional content vector into a sequence of 4-6 discrete tokens.
- Hierarchical Representation: These tokens capture both broad categories (e.g., genre) and niche characteristics. For example, two pop artists might share the first two tokens in their sequence, while the remaining tokens diverge to capture their unique stylistic differences.
- Autoregressive Generation: By tokenizing the catalog, the LLM can treat a user's listening history as a prompt and autoregressively generate the next Semantic ID, token by token. The completed sequence corresponds to the next song or episode the user is likely to enjoy.
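The hierarchical property described above can be illustrated with residual quantization, one common way to build coarse-to-fine token sequences. The source does not say which quantizer Spotify uses, so treat this as a sketch under that assumption. The function names, the 2-D vectors, and the three tiny codebooks are all hypothetical (the text describes 4-6 tokens over much higher-dimensional vectors).

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: List[Vector], v: Vector) -> int:
    """Return the index of the codebook entry closest to v (squared L2 distance)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def semantic_id(v: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Quantize v level by level; each level encodes the residual left by the previous one."""
    residual = list(v)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next level only sees what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


# Hypothetical toy codebooks: level 0 is coarse, later levels refine.
CODEBOOKS = [
    [(1.0, 0.0), (0.0, 1.0)],      # broad region of the content space
    [(0.2, 0.0), (-0.2, 0.0)],     # mid-level refinement
    [(0.0, 0.05), (0.0, -0.05)],   # fine-grained detail
]

pop_a = semantic_id((1.2, 0.04), CODEBOOKS)
pop_b = semantic_id((1.2, -0.06), CODEBOOKS)
other = semantic_id((0.1, 0.9), CODEBOOKS)
print(pop_a, pop_b, other)
```

In this toy setup the two nearby vectors share their first two tokens and differ only in the last, while the distant vector already diverges at the first token. That mirrors the shared-prefix behavior described for two pop artists.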
Personalization Through Soft Tokenization
Because it is impractical to fully fine-tune an LLM on the individual data of each of 750 million users, Spotify employs a soft tokenization layer.
- User Embeddings: The team maintains large user embedding models, built on autoencoder architectures and updated daily, that represent each user's historical taste.
- Projection into Token Space: This user vector is projected directly into the LLM’s token space. By inserting this "soft token" into the prompt, the frozen model gains user-specific context, allowing it to attend to the user's unique preferences during the generation process. This approach is currently in production for podcast episode recommendations.
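The projection step above can be sketched as a linear map from the user-embedding space into the LLM's token-embedding space, with the resulting vector placed in front of the embedded listening history. This is a minimal illustration, not Spotify's implementation: the linear form of the projection, the dimensions, the weights, and names such as `project` and `build_prompt_embeddings` are assumptions made for the example.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]  # rows = output dims, columns = input dims


def project(user_vec: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Linear map from user-embedding space into the LLM's token-embedding space."""
    return [
        sum(w * x for w, x in zip(row, user_vec)) + b
        for row, b in zip(weight, bias)
    ]


def build_prompt_embeddings(
    user_vec: Vector,
    weight: Matrix,
    bias: Vector,
    token_ids: List[int],
    embedding_table: Dict[int, Vector],
) -> List[Vector]:
    """Prepend the projected user vector (the "soft token") to the embedded history."""
    soft_token = project(user_vec, weight, bias)
    history = [embedding_table[t] for t in token_ids]
    return [soft_token] + history


# Hypothetical toy sizes: a 3-dim user embedding and a 2-dim token-embedding space.
WEIGHT = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
BIAS = [0.0, 0.0]
EMBEDDING_TABLE = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}

prompt = build_prompt_embeddings(
    user_vec=[0.5, 0.25, 0.25],
    weight=WEIGHT,
    bias=BIAS,
    token_ids=[2, 0],  # listening history, already tokenized
    embedding_table=EMBEDDING_TABLE,
)
print(prompt)
```

The soft token has the same width as every other row, so the model can attend to it like any other position. In a setup like this, only the projection would need training while the LLM weights stay frozen, which is one plausible reading of how the frozen model gains user context.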
Steerability and User Control
This generative architecture also makes recommendations more steerable. Spotify is rolling out "Taste Profiles" that expose what the system knows about a user and let them edit their preferences. These edits are fed back into the generative model, which can then adapt recommendations in real time based on explicit user feedback or natural language prompts.