#cuda
CUDA Matrix Transpose: Naive to Swizzled Optimization
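A GPU matrix transpose pits coalesced reads against coalesced writes, since a naive kernel can have only one of the two. Shared-memory tiling makes both coalesced; padding or XOR swizzling then removes the shared-memory bank conflicts that tiling introduces, and float4 vectorization pushes throughput toward peak bandwidth.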
Level Up Coding
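The bank-conflict step in the summary can be checked with a little index arithmetic. The following is a minimal host-side C++ sketch, not the article's kernel: it assumes a 32x32 tile of 4-byte elements and 32 shared-memory banks, and the names `Layout`, `word_offset`, and `column_conflict_degree` are hypothetical, introduced only for illustration. It counts how many lanes of a warp land on the same bank when the warp reads one tile column, which is the access pattern a tiled transpose uses on its read-back phase.

```cpp
#include <algorithm>

constexpr int kTile = 32;   // assumed tile width: one warp lane per column
constexpr int kBanks = 32;  // assumed number of 4-byte shared-memory banks

// Three ways to lay out a kTile x kTile tile in shared memory.
enum class Layout { Plain, Padded, Swizzled };

// Word offset of logical element (row, col) inside the tile.
int word_offset(Layout layout, int row, int col) {
    switch (layout) {
        case Layout::Plain:    return row * kTile + col;        // tile[32][32]
        case Layout::Padded:   return row * (kTile + 1) + col;  // tile[32][33]
        case Layout::Swizzled: return row * kTile + (col ^ row); // XOR swizzle
    }
    return 0;
}

// Worst case, over all columns, of how many lanes hit the same bank
// when lane i of a warp reads element (i, col) of the tile.
int column_conflict_degree(Layout layout) {
    int worst = 0;
    for (int col = 0; col < kTile; ++col) {
        int hits[kBanks] = {};
        for (int lane = 0; lane < kTile; ++lane) {
            int bank = word_offset(layout, lane, col) % kBanks;
            hits[bank] += 1;
            worst = std::max(worst, hits[bank]);
        }
    }
    return worst;
}
```

Under these assumptions the model reports a 32-way conflict for the plain layout and conflict-free column reads for both the padded and the XOR-swizzled layouts. Padding spends one extra word per row to get there, while swizzling keeps the tile at exactly 32x32 words by permuting columns within each row.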