The Failure of Partitioned Distillation
Standard knowledge distillation (KD) requires the teacher and student models to share a tokenizer. Previous cross-tokenizer methods, specifically GOLD (Generalized Output Logit Distillation), work around this by partitioning tokens into a 'common' subset (trained with KL divergence) and an 'uncommon' remainder (trained with rank-based ULD noise). NVIDIA researchers identified two critical flaws in this approach:
- Uncommon-token failure: When tokenizers fragment text differently (e.g., Llama-3 treating numbers as single tokens vs. Qwen3 splitting them), critical tokens are relegated to the 'uncommon' set. There they receive identity-agnostic noise and suppressive gradients, which force the model to suppress tokens regardless of ground truth.
- Over-conservative matching: GOLD relies on strict string equality, discarding useful alignment signals even when tokens are structurally equivalent (e.g., 'Hundreds' vs 'Hund' + 'reds').
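The partition described above can be illustrated with a minimal sketch. It is not GOLD's actual code: the function name `partition_vocab` and the toy vocabularies are hypothetical, chosen only to show how strict string equality pushes structurally equivalent tokens into the 'uncommon' set.

```python
def partition_vocab(student_vocab, teacher_vocab):
    """Split student tokens by strict string equality with the teacher vocabulary."""
    teacher_set = set(teacher_vocab)
    common = [tok for tok in student_vocab if tok in teacher_set]
    uncommon = [tok for tok in student_vocab if tok not in teacher_set]
    return common, uncommon


# Toy vocabularies (hypothetical, not real tokenizer contents).
# The student keeps "123" and "Hundreds" whole; the teacher splits them.
student_vocab = ["the", "Hundreds", "123"]
teacher_vocab = ["the", "Hund", "reds", "1", "2", "3"]

common, uncommon = partition_vocab(student_vocab, teacher_vocab)
print(common)    # ['the']
print(uncommon)  # ['Hundreds', '123']
```

In this toy case both the number and the word land in the 'uncommon' set, even though the teacher can represent each of them exactly as a short sequence of its own tokens.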
X-Token: Projection-Guided Distillation
X-Token replaces the rigid partition with a deterministic projection matrix (W) that maps student vocabulary space to teacher vocabulary space. This matrix is built once before training using exact-match canonicalization and multi-token weight assignment (using exponential decay for sub-token spans).
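A rough sketch of how such a matrix might be assembled is given below. Several details are assumptions rather than the researchers' construction: canonicalization is reduced to plain string equality, the hypothetical `segment` helper uses greedy longest-prefix matching as a stand-in for running the teacher tokenizer, and the decay rate of 0.5 and the per-row normalization are illustrative choices.

```python
def segment(token, teacher_index):
    """Greedy longest-prefix split of `token` into teacher tokens.

    Hypothetical stand-in for the teacher tokenizer. Returns None when
    the token cannot be fully covered by teacher tokens.
    """
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in teacher_index:
                pieces.append(token[i:j])
                i = j
                break
        else:  # no teacher token matches at position i
            return None
    return pieces


def build_projection(student_vocab, teacher_vocab, decay=0.5):
    """Build W with one row per student token and one column per teacher token."""
    teacher_index = {tok: k for k, tok in enumerate(teacher_vocab)}
    W = [[0.0] * len(teacher_vocab) for _ in student_vocab]
    for i, tok in enumerate(student_vocab):
        if tok in teacher_index:  # exact match: all mass on one column
            W[i][teacher_index[tok]] = 1.0
            continue
        pieces = segment(tok, teacher_index)
        if pieces is None:  # no alignment found: row stays zero
            continue
        # Assumed weighting: exponentially decaying weights over the
        # sub-token span, normalized so the row sums to 1.
        raw = [decay ** k for k in range(len(pieces))]
        total = sum(raw)
        for piece, weight in zip(pieces, raw):
            W[i][teacher_index[piece]] += weight / total
    return W


student_vocab = ["the", "Hundreds", "123"]
teacher_vocab = ["the", "Hund", "reds", "1", "2", "3"]
W = build_projection(student_vocab, teacher_vocab)
```

With these toy vocabularies, the row for 'Hundreds' places weight 2/3 on 'Hund' and 1/3 on 'reds', so the first sub-token carries most of the mass. Because the matrix depends only on the two vocabularies, it can be computed once and reused for every training step.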
Researchers introduced two loss formulations based on this projection:
- P-KL (Projection-KL): Removes the partition entirely, projecting student probability mass directly into the teacher's space. This eliminates ULD noise and suppressive gradients, making it ideal for cases where critical tokens are misaligned.
- H-KL (Hybrid-KL): Retains a relaxed partition, expanding the 'common' set using the projection matrix to include near-equivalent pairs. This is superior when the partition is structurally sound, as it provides sharper per-pair supervision.
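The P-KL idea can be sketched for a single position as follows. This is an illustration under stated assumptions, not the researchers' loss: the KL direction (teacher relative to projected student), the epsilon clamp, and the tiny hand-written matrix `W` are all hypothetical choices.

```python
import math


def project(student_probs, W):
    """Map a student distribution into teacher vocabulary space: q_j = sum_i p_i * W[i][j]."""
    n_teacher = len(W[0])
    return [
        sum(student_probs[i] * W[i][j] for i in range(len(W)))
        for j in range(n_teacher)
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is clamped to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical projection: 2 student tokens, 3 teacher tokens.
# Student token 1 splits its mass evenly over teacher tokens 1 and 2.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]
teacher_probs = [0.6, 0.2, 0.2]

aligned = project([0.6, 0.4], W)
misaligned = project([0.9, 0.1], W)

loss_aligned = kl_divergence(teacher_probs, aligned)
loss_misaligned = kl_divergence(teacher_probs, misaligned)
```

Every student token contributes through its row of `W`, so no token is routed to a separate 'uncommon' loss. In this toy case the loss is zero when the projected student distribution matches the teacher and grows as the two diverge.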
Performance and Multi-Teacher Scaling
In experiments using Llama-3.2-1B as the student, X-Token outperformed GOLD. With Qwen3-4B as the teacher, P-KL improved average performance by +3.82 points. With Phi-4-mini as the teacher, H-KL improved performance by +0.52 points.
Furthermore, X-Token supports multi-teacher distillation by aggregating per-teacher losses. The researchers found that static weighting outperforms confidence-adaptive schemes, and that teacher complementarity, rather than the sheer number of teachers, is the primary driver of performance gains. Combining Phi-mini and Llama-3B yielded the best results (40.48 avg), while adding a third, overlapping teacher (Qwen-4B) actually degraded reasoning performance.
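Static weighting of per-teacher losses can be sketched in a few lines. The function name `aggregate_losses`, the normalization by the weight sum, and the example values are assumptions for illustration; the source does not specify the exact aggregation formula.

```python
def aggregate_losses(per_teacher_losses, weights):
    """Combine per-teacher distillation losses with fixed (static) weights."""
    if len(per_teacher_losses) != len(weights):
        raise ValueError("need exactly one weight per teacher loss")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumed form: weighted average, so the scale does not grow with teacher count.
    weighted = sum(w * loss for w, loss in zip(weights, per_teacher_losses))
    return weighted / total_weight


# Hypothetical loss values for two teachers at one training step.
combined = aggregate_losses([1.0, 3.0], [1.0, 1.0])
print(combined)  # 2.0
```

The weights here are fixed before training and never change, in contrast to a confidence-adaptive scheme that would recompute them from teacher outputs at each step.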