#cuda
Building Tiled GPU Kernels with NVIDIA cuTile Python
NVIDIA cuTile lets developers write efficient, tile-based GPU kernels directly in Python, structuring memory access and computation around tiles; the resulting kernels can be benchmarked against standard PyTorch operations.
MarkTechPost
mKernel: Fusing Compute and Communication for GPU-Driven Scaling
mKernel eliminates host-driven communication bottlenecks by fusing intra-node NVLink, inter-node RDMA, and compute into persistent CUDA kernels, enabling fine-grained overlap at the tile level.
MarkTechPost
CUDA Matrix Transpose: Naive to Swizzled Optimization
A naive GPU matrix transpose cannot coalesce both its global-memory reads and its writes. Staging tiles in shared memory makes both coalesced, padding or XOR swizzling removes the resulting bank conflicts, and float4 vectorization pushes throughput toward peak bandwidth.
Level Up Coding