PyTorch nn.Linear Mismatches Raw Matmul by 1e-4

Raw torch.matmul gives identical results for single vs. batched inputs (diff = 0), but nn.Linear differs by ~2e-5 between single and batched inputs and by ~9e-5 from raw matmul, reportedly due to fused kernels.

Raw Matmul Preserves Precision Across Batch Sizes

Use torch.matmul for exact equivalence: with seed 42, x = torch.randn(2, 768) and w = torch.randn(768, 768), z1 = x[0] @ w matches (x @ w)[0] exactly (max absolute difference is 0). This holds because the plain matmul takes the same computation path for each row regardless of batch shape, so no batch-dependent kernel selection comes into play.
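A minimal sketch of that check, using the seed and shapes from the summary; the exact-zero result is what the discussion reports, and may vary on other hardware or backends:

```python
import torch

torch.manual_seed(42)
x = torch.randn(2, 768)
w = torch.randn(768, 768)

z1 = x[0] @ w    # single row times weight
z2 = (x @ w)[0]  # same row taken from the batched product

# Reported to be exactly 0 in the discussion; on other hardware or
# backends a tiny nonzero difference is possible.
print((z1 - z2).abs().max().item())
```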

nn.Linear Introduces Numerical Drift

An nn.Linear(768, 768, bias=False) module with its weight set to w.T does not reproduce the raw matmul exactly: q1 = m(x[0]) differs from q2 = m(x)[0] by at most ~2e-5, and both deviate from the raw z1 by ~9e-5. Do not assume single-sample Linear output matches batched or raw-matmul output; prefer raw ops for precision-critical math.
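A sketch of the comparison, assuming the same seed and tensors as above. nn.Linear stores its weight as (out_features, in_features) and computes x @ weight.T, hence the w.T copy:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
x = torch.randn(2, 768)
w = torch.randn(768, 768)
z1 = x[0] @ w  # raw-matmul reference

# Copying w.T into the weight makes the module equal x @ w in exact arithmetic.
m = nn.Linear(768, 768, bias=False)
with torch.no_grad():
    m.weight.copy_(w.T)

q1 = m(x[0])  # single-sample path
q2 = m(x)[0]  # batched path

print((q1 - q2).abs().max().item())  # reported ~2e-5
print((q1 - z1).abs().max().item())  # reported ~9e-5
```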

Root Cause: Fused Operations in Batched Mode

A commenter notes that the torch source dispatches to fused kernels differently for batched (shape (2, 768)) versus single (shape (768,)) inputs, which causes the drift. To isolate the effect, try disabling autocast or fusion paths (e.g., torch.backends.cudnn.deterministic = True); this matters when debugging models where exact reproducibility outweighs speed.
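A sketch of that isolation test using standard PyTorch determinism settings; whether any of these flags removes this particular Linear drift is not confirmed in the discussion:

```python
import torch
import torch.nn as nn

# Common reproducibility settings; they trade speed for determinism and
# may or may not affect the Linear drift described above.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # raise on nondeterministic ops

torch.manual_seed(42)
x = torch.randn(2, 768)
w = torch.randn(768, 768)
m = nn.Linear(768, 768, bias=False)
with torch.no_grad():
    m.weight.copy_(w.T)

# Autocast explicitly disabled to rule out mixed-precision kernel
# selection (a no-op for plain float32 CPU runs).
with torch.autocast(device_type="cpu", enabled=False):
    print((m(x[0]) - m(x)[0]).abs().max().item())
```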

