№ 02 / SUMMARIES

#cuda

DAY 01 · June 9, 2026 · 1 summary

MarkTechPost · Software Engineering · Jun 9, 2026

Building Tiled GPU Kernels with NVIDIA cuTile Python

NVIDIA cuTile lets developers write efficient, tile-based GPU kernels directly in Python. It provides a structured way to express memory access and computation, and the resulting kernels can be benchmarked against standard PyTorch operations.

DAY 02 · May 30, 2026 · 1 summary

MarkTechPost · Software Engineering · May 30, 2026

mKernel: Fusing Compute and Communication for GPU-Driven Scaling

mKernel eliminates host-driven communication bottlenecks by fusing intra-node NVLink, inter-node RDMA, and compute into persistent CUDA kernels, enabling fine-grained overlap at the tile level.

DAY 03 · May 6, 2026 · 1 summary

Level Up Coding · Software Engineering · May 6, 2026

CUDA Matrix Transpose: Naive to Swizzled Optimization

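A naive GPU matrix transpose cannot coalesce both its reads and its writes: whichever side follows the transposed layout ends up strided. Staging tiles through shared memory makes both global accesses coalesced, padding or XOR swizzling then removes the shared memory bank conflicts this introduces, and float4 vectorization pushes throughput toward peak bandwidth.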
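
The bank-conflict part of this summary can be checked with plain index arithmetic. The sketch below is not code from the summarized article. It is a small Python model that assumes a 32x32 tile of 4-byte elements and 32 shared memory banks, with hypothetical helper names, and counts how many threads of a warp land on the same bank when reading one tile column under three layouts.

```python
TILE = 32    # assumed tile edge, equal to the warp size
BANKS = 32   # assumed number of 4-byte-wide shared memory banks

def naive(row, col):
    # Plain row-major tile: tile[row][col]
    return row * TILE + col

def padded(row, col):
    # One extra column per row: tile[TILE][TILE + 1]
    return row * (TILE + 1) + col

def swizzled(row, col):
    # XOR swizzle: element (row, col) is stored at column (col ^ row)
    return row * TILE + (col ^ row)

def worst_column_conflict(addr_fn):
    """Largest number of threads hitting one bank when a warp reads a tile column."""
    worst = 0
    for col in range(TILE):
        hits = [0] * BANKS
        for row in range(TILE):  # thread `row` reads element (row, col)
            hits[addr_fn(row, col) % BANKS] += 1
        worst = max(worst, max(hits))
    return worst

if __name__ == "__main__":
    for name, fn in [("naive", naive), ("padded", padded), ("swizzled", swizzled)]:
        print(name, worst_column_conflict(fn))
```

Under these assumptions the naive layout gives a 32-way conflict on column reads, while both the padded and the swizzled layouts give 1 (conflict-free). The swizzle reaches that without the extra column of shared memory that padding costs.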
