PyTorch nn.Linear Mismatches Raw Matmul by 1e-4

Raw torch.matmul gives identical results for single vs. batched inputs (diff = 0), but nn.Linear differs by ~2e-5 between single and batched inputs and by ~9e-5 from raw matmul, reportedly due to fused kernels.

Raw Matmul Preserves Precision Across Batch Sizes

Use torch.matmul for exact equivalence: with seed 42, x = torch.randn(2, 768) and w = torch.randn(768, 768), z1 = x[0] @ w matches (x @ w)[0] exactly (max absolute difference is 0). This holds because the plain matmul takes the same computation path for each row regardless of batch shape, so no batch-dependent kernel selection comes into play.
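A minimal sketch of that check, using the seed and shapes from the summary; the exact-zero result is what the discussion reports, and may vary on other hardware or backends:

```python
import torch

torch.manual_seed(42)
x = torch.randn(2, 768)
w = torch.randn(768, 768)

z1 = x[0] @ w    # single row times weight
z2 = (x @ w)[0]  # same row taken from the batched product

# Reported to be exactly 0 in the discussion; on other hardware or
# backends a tiny nonzero difference is possible.
print((z1 - z2).abs().max().item())
```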

nn.Linear Introduces Numerical Drift

An nn.Linear(768, 768, bias=False) module with its weight set to w.T does not reproduce the raw matmul exactly: q1 = m(x[0]) differs from q2 = m(x)[0] by at most ~2e-5, and both deviate from the raw z1 by ~9e-5. Do not assume single-sample Linear output matches batched or raw-matmul output; prefer raw ops for precision-critical math.
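A sketch of the comparison, assuming the same seed and tensors as above. nn.Linear stores its weight as (out_features, in_features) and computes x @ weight.T, hence the w.T copy:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
x = torch.randn(2, 768)
w = torch.randn(768, 768)
z1 = x[0] @ w  # raw-matmul reference

# Copying w.T into the weight makes the module equal x @ w in exact arithmetic.
m = nn.Linear(768, 768, bias=False)
with torch.no_grad():
    m.weight.copy_(w.T)

q1 = m(x[0])  # single-sample path
q2 = m(x)[0]  # batched path

print((q1 - q2).abs().max().item())  # reported ~2e-5
print((q1 - z1).abs().max().item())  # reported ~9e-5
```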

Root Cause: Fused Operations in Batched Mode

A commenter notes that the torch source dispatches to fused kernels differently for batched (shape (2, 768)) versus single (shape (768,)) inputs, which causes the drift. To isolate the effect, try disabling autocast or fusion paths (e.g., torch.backends.cudnn.deterministic = True); this matters when debugging models where exact reproducibility outweighs speed.
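A sketch of that isolation test using standard PyTorch determinism settings; whether any of these flags removes this particular Linear drift is not confirmed in the discussion:

```python
import torch
import torch.nn as nn

# Common reproducibility settings; they trade speed for determinism and
# may or may not affect the Linear drift described above.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # raise on nondeterministic ops

torch.manual_seed(42)
x = torch.randn(2, 768)
w = torch.randn(768, 768)
m = nn.Linear(768, 768, bias=False)
with torch.no_grad():
    m.weight.copy_(w.T)

# Autocast explicitly disabled to rule out mixed-precision kernel
# selection (a no-op for plain float32 CPU runs).
with torch.autocast(device_type="cpu", enabled=False):
    print((m(x[0]) - m(x)[0]).abs().max().item())
```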

