Kernels Underpin Dot-Product Similarities in Attention

Dot products like QK^T act as kernels, measuring similarity in a feature space without computing the feature map explicitly. A kernel k(x, y) = φ(x)^T φ(y) implicitly maps inputs x and y to features φ; the kernel trick evaluates this inner product directly, avoiding the costly (possibly infinite-dimensional) explicit φ. Mercer's theorem, a result dating back to 1909, characterizes valid kernels as positive semi-definite functions, ensuring they define inner products in a Reproducing Kernel Hilbert Space (RKHS). Note that positive semi-definiteness is a property of the Gram matrix, not of individual entries: a dot product q^T k can be negative, and it is the exponentiation inside softmax that later maps these raw scores to positive weights.
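
A small numerical sketch (assuming NumPy; the vectors are hypothetical toy data) makes both points concrete: the kernel trick reproduces an explicit feature map without building it, and the linear kernel's Gram matrix is positive semi-definite even though individual entries can be negative.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

# Kernel trick: the degree-2 polynomial kernel k(x, y) = (x^T y)^2 equals
# phi(x)^T phi(y) with the explicit feature map phi(v) = outer(v, v).flatten(),
# but never needs to materialize those d^2-dimensional features.
implicit = (x @ y) ** 2
explicit = np.outer(x, x).ravel() @ np.outer(y, y).ravel()
print(np.isclose(implicit, explicit))      # True

# PSD is a property of the Gram matrix, not of its entries.
X = rng.normal(size=(5, 8))                # 5 inputs, 8-dimensional features
gram = X @ X.T                             # linear kernel k(x, y) = x^T y
print("min entry:", gram.min())            # can be negative (anti-aligned inputs)
print("min eigenvalue:", np.linalg.eigvalsh(gram).min())  # >= 0 up to rounding
```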

Attention lives in RKHS: queries and keys are compared through kernel evaluations in this space, and softmax normalizes each row of those evaluations into a probability distribution over keys, so the similarity geometry directly determines how values are mixed.

Attention Machines Match Kernel Methods Exactly

Transformer attention is a kernel machine. For sequence length n, QK^T forms an n×n Gram (kernel) matrix of query-key similarities, and softmax(QK^T / √d_k) gives the attention weights, so each output is a weighted average of values, much like Nadaraya-Watson kernel regression. This parallels kernel regression and SVMs, where kernels compute similarities to drive non-linear decisions. Gaussian/RBF kernels, exp(-||x - y||^2 / (2σ^2)), offer an alternative to dot products, potentially improving expressivity for certain data but requiring the bandwidth σ to be tuned and risking vanishing gradients when the kernel saturates.
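
As an illustrative sketch (NumPy, single head, no masking, toy shapes), standard scaled dot-product attention and an RBF-kernel variant differ only in the similarity function; the row-wise normalization and the weighted average of values stay the same. The bandwidth sigma below is an assumed hyperparameter of the RBF variant, not part of standard attention.

```python
import numpy as np

def softmax_rows(S):
    # Numerically stable row-wise softmax.
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # n x n Gram-style similarity matrix
    return softmax_rows(scores) @ V        # weighted average of the values

def rbf_attention(Q, K, V, sigma=1.0):
    # Gaussian/RBF similarity exp(-||q - k||^2 / (2 sigma^2)) in place of q^T k.
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
    scores = -sq_dists / (2 * sigma**2)
    return softmax_rows(scores) @ V        # same normalization and value mixing

rng = np.random.default_rng(0)
n, d = 4, 16
Q, K, V = rng.normal(size=(3, n, d))
print(dot_product_attention(Q, K, V).shape)  # (4, 16)
print(rbf_attention(Q, K, V).shape)          # (4, 16)
```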

Softmax Is Mathematically Required, Not Optional

Skip softmax and the weights can explode or go negative, and attention becomes unstable. Softmax enforces row-stochastic weight matrices (each row sums to 1), so the weights form a probability distribution over keys. Without it, raw QK^T is unnormalized, and outputs are dominated by the magnitudes of queries and keys rather than by relative similarity. Scaling the dot product by √d_k controls the variance of the scores, but softmax remains essential: it exponentiates the raw scores into positive weights and normalizes them, which is what keeps the kernel-smoothing interpretation valid and the computation numerically stable.
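
A minimal numerical sketch (NumPy, random toy matrices) of the contrast: the raw scaled scores can be negative and their rows sum to arbitrary values, while the softmax rows are non-negative and sum to 1, i.e. a probability distribution over keys.

```python
import numpy as np

def softmax_rows(S):
    # Numerically stable row-wise softmax.
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d_k = 16
Q = rng.normal(size=(3, d_k))
K = rng.normal(size=(4, d_k))

raw = Q @ K.T / np.sqrt(d_k)                  # scaled but unnormalized scores
print("raw scores can be negative:", (raw < 0).any())
print("raw row sums (arbitrary):", raw.sum(axis=1))

W = softmax_rows(raw)                          # row-stochastic attention weights
print("weights bounded in [0, 1]:", (W >= 0).all() and (W <= 1).all())
print("softmax row sums:", W.sum(axis=1))      # each row sums to 1
```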