Rust CUDA Kernels via Direct PTX Compilation

Native CUDA Model in Safe Rust Without C++ or DSLs

cuda-oxide targets CUDA's SIMT execution model natively in Rust, which puts it closer to C++ __global__ functions than to generic Rust async code. Kernels are marked with the #[kernel] proc macro, which namespaces them under the cuda_oxide_kernel__ prefix; only these functions compile to PTX through a custom rustc backend, while host code goes through the standard LLVM backend. The approach is single-source: running cargo oxide build on one .rs file produces both the host binary and target/debug/vecadd.ptx. Device-side library code is compiled lazily from the Stable MIR metadata stored in .rlib files, so unused code is never compiled up front.

Kernels can use match and if let, generics such as fn scale<T: Copy>, warp intrinsics (shfl_sync, ballot_sync), and thread::index_1d(), which returns each thread's unique hardware index. DisjointSlice keeps mutable outputs race-free: get_mut(idx) accepts only a ThreadIndex, an opaque usize derived from threadIdx, blockIdx, and blockDim, which prevents unsafe shared writes. On the host, launches go through CudaContext, DeviceBuffer, and the cuda_launch! macro; LaunchConfig::for_num_elems(1024) sizes the grid and blocks automatically.
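
The indexing and disjoint-write idea can be modeled on the CPU in plain Rust. The sketch below is not the cuda-oxide API: the ThreadIndex and DisjointSlice types here are hypothetical stand-ins that only borrow the names, blocks_for and simulate_vecadd are invented for illustration, and the block size is left to the caller because the source does not say what LaunchConfig::for_num_elems picks.

```rust
mod model {
    /// Hypothetical stand-in for an opaque thread index. The field is private,
    /// so code outside this module cannot forge an arbitrary index.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct ThreadIndex(usize);

    impl ThreadIndex {
        /// Models blockIdx.x * blockDim.x + threadIdx.x for a 1D launch.
        pub fn from_launch(block_idx: u32, block_dim: u32, thread_idx: u32) -> Self {
            assert!(thread_idx < block_dim, "threadIdx must be below blockDim");
            ThreadIndex(block_idx as usize * block_dim as usize + thread_idx as usize)
        }

        /// Read-only view of the index, e.g. for indexing shared inputs.
        pub fn get(self) -> usize {
            self.0
        }
    }

    /// Hypothetical CPU model of an output slice that is only writable
    /// through a ThreadIndex.
    pub struct DisjointSlice<'a, T> {
        data: &'a mut [T],
    }

    impl<'a, T> DisjointSlice<'a, T> {
        pub fn new(data: &'a mut [T]) -> Self {
            DisjointSlice { data }
        }

        /// Returns None for threads whose index falls past the end.
        /// This sequential model takes &mut self; a real GPU type would
        /// have to hand out elements to concurrently running threads.
        pub fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
            self.data.get_mut(idx.0)
        }
    }
}

use model::{DisjointSlice, ThreadIndex};

/// Number of blocks needed to cover num_elems (ceiling division).
pub fn blocks_for(num_elems: u32, block_dim: u32) -> u32 {
    assert!(block_dim > 0, "block_dim must be positive");
    num_elems / block_dim + u32::from(num_elems % block_dim != 0)
}

/// Runs every (block, thread) pair sequentially, as a stand-in for a launch.
pub fn simulate_vecadd(a: &[f32], b: &[f32], block_dim: u32) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    let mut out = vec![0.0f32; a.len()];
    let grid = blocks_for(a.len() as u32, block_dim);
    {
        let mut c = DisjointSlice::new(&mut out);
        for block in 0..grid {
            for thread in 0..block_dim {
                let idx = ThreadIndex::from_launch(block, block_dim, thread);
                let i = idx.get();
                // Threads beyond the end of the data get None and do nothing.
                if let Some(slot) = c.get_mut(idx) {
                    *slot = a[i] + b[i];
                }
            }
        }
    }
    out
}
```

With 1000 elements and an assumed block size of 256, blocks_for returns 4, and the 24 surplus threads in the last block fall through the None branch without writing anything.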

cuda-oxide disables rustc's JumpThreading pass on device code to preserve bar.sync convergence, because the code duplication that pass performs breaks GPU barrier semantics. It also marks synchronization calls as convergent in LLVM IR so that the optimizer does not reorder them.

Pure-Rust Pipeline Except One External Tool

cuda-oxide hooks into rustc at CodegenBackend::codegen_crate() and reads the program through rustc_public (Stable MIR), which is meant to insulate it from changes in compiler internals across versions. The pipeline is: Rust → rustc frontend → Stable MIR → dialect-mir (Rust MIR semantics: places, projections, rvalues, terminators) → mem2reg → dialect-llvm → textual .ll → llc (LLVM 21+ NVPTX) → .ptx.
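
As a reading aid, the stage order can be written down as data. This is a minimal sketch with invented names (Step, PIPELINE, external_tools, render); it only restates the sequence given above and does not reflect how cuda-oxide represents its passes internally.

```rust
/// One stage of the pipeline as described in the text (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Step {
    pub name: &'static str,
    /// True when the stage is an external, non-Rust tool.
    pub external_tool: bool,
}

/// Stages between the Rust source and the final .ptx file, in order.
pub const PIPELINE: [Step; 7] = [
    Step { name: "rustc frontend", external_tool: false },
    Step { name: "Stable MIR", external_tool: false },
    Step { name: "dialect-mir", external_tool: false },
    Step { name: "mem2reg", external_tool: false },
    Step { name: "dialect-llvm", external_tool: false },
    Step { name: "textual .ll", external_tool: false },
    Step { name: "llc", external_tool: true },
];

/// Names of the stages that rely on an external tool.
pub fn external_tools() -> Vec<&'static str> {
    PIPELINE
        .iter()
        .filter(|s| s.external_tool)
        .map(|s| s.name)
        .collect()
}

/// Renders the chain from source to PTX as a single arrow-separated line.
pub fn render() -> String {
    let names: Vec<&str> = PIPELINE.iter().map(|s| s.name).collect();
    format!("Rust → {} → .ptx", names.join(" → "))
}
```

Filtering on external_tool leaves a single entry, llc, which is the one external tool the section heading refers to.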

The dialects are built on Pliron, a Rust-native, MLIR-like IR framework: dialect-mir, dialect-llvm (which exports .ll), and dialect-nvvm (NVIDIA intrinsics such as thread indexing, barriers, and TMA). The entire stack builds with cargo, with no C++, CMake, or tablegen. To see the full trace, run cargo oxide pipeline vecadd, which prints every stage from MIR through PTX.

cuda-oxide complements rust-cuda (Rust ergonomics and async on the GPU) by focusing on the CUDA-native model. The vecadd example adds 1024 f32 elements on the GPU and verifies the result on the host; cargo oxide run vecadd prints ✓ SUCCESS: All 1024 elements correct!
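
The host-side check can be pictured as recomputing the sum on the CPU and comparing element by element. The sketch below assumes the inputs and the GPU output have already been copied back into ordinary slices; first_mismatch and report are hypothetical helpers, and the failure message is invented (only the success line comes from the example's output).

```rust
/// Returns the index of the first element where a[i] + b[i] differs from
/// out[i] by more than tol, or None if every element matches.
pub fn first_mismatch(a: &[f32], b: &[f32], out: &[f32], tol: f32) -> Option<usize> {
    assert!(a.len() == b.len() && a.len() == out.len(), "length mismatch");
    // Written as !(diff <= tol) so that a NaN difference counts as a mismatch.
    (0..out.len()).find(|&i| !((a[i] + b[i] - out[i]).abs() <= tol))
}

/// Formats a one-line verdict using an exact comparison (tol = 0.0).
pub fn report(a: &[f32], b: &[f32], out: &[f32]) -> String {
    match first_mismatch(a, b, out, 0.0) {
        None => format!("✓ SUCCESS: All {} elements correct!", out.len()),
        // Hypothetical failure message, not taken from cuda-oxide.
        Some(i) => format!("✗ FAILURE: first mismatch at index {}", i),
    }
}
```

An exact comparison is reasonable here because a single f32 addition follows IEEE 754 on both host and device; kernels with longer arithmetic chains would usually need a nonzero tolerance.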

Linux-Only Setup with Pinned Nightly and LLVM 21

Setup requires Ubuntu 24.04; Rust nightly-2026-04-03 (rustup toolchain install nightly-2026-04-03, plus the rust-src and rustc-dev components); LLVM 21 or newer with the NVPTX target (sudo apt install llvm-21, then confirm with llc-21 --version | grep nvptx); and Clang 21 (sudo apt install clang-21), which bindgen needs for cuda.h. Pin the llc binary with export CUDA_OXIDE_LLC=/usr/bin/llc-21.
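
The llc check above can also be scripted. The following is a rough sketch of the same test in Rust, not what cargo oxide doctor actually does: the function names are hypothetical, and the fallback to llc-21 when CUDA_OXIDE_LLC is unset is an assumption made for the example.

```rust
use std::process::Command;

/// Mirrors `grep nvptx`: true if any line of `llc --version` output
/// mentions an nvptx target.
pub fn lists_nvptx(version_output: &str) -> bool {
    version_output.lines().any(|line| line.contains("nvptx"))
}

/// Path to llc: the CUDA_OXIDE_LLC variable if set, otherwise "llc-21".
/// The fallback is an assumption for this sketch.
pub fn llc_path() -> String {
    std::env::var("CUDA_OXIDE_LLC").unwrap_or_else(|_| "llc-21".to_string())
}

/// Runs `<llc> --version` and reports whether the NVPTX target is registered.
/// Returns an io::Error if the binary cannot be started.
pub fn llc_supports_nvptx(llc: &str) -> std::io::Result<bool> {
    let output = Command::new(llc).arg("--version").output()?;
    Ok(lists_nvptx(&String::from_utf8_lossy(&output.stdout)))
}
```

Calling llc_supports_nvptx(&llc_path()) returns Ok(true) when the target list includes nvptx or nvptx64, and an Err when the binary is missing, which separates "not installed" from "installed without NVPTX".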

Either clone the repository or install the cargo subcommand with cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide. Then run cargo oxide doctor to check dependencies. To debug, cargo oxide debug vecadd --tui uses cuda-gdb on target/debug/vecadd together with its .ptx file.

Examples to try next: generics (cargo oxide run generic); host closures (cuda_launch_async! with .await or .sync()); async_mlp; and gemm_sol, which uses cta_group::2, index_2d(stride), and unsafe for &mut T. Documentation is at nvlabs.github.io/cuda-oxide.
