Native CUDA Model in Safe Rust Without C++ or DSLs

cuda-oxide targets CUDA's SIMT execution model natively in Rust, so a kernel is closer to a C++ __global__ function than to generic Rust async code. Kernels are marked with the #[kernel] proc macro, which namespaces them as cuda_oxide_kernel__; only these compile to PTX, through a custom rustc backend, while host code goes through the standard LLVM backend. The approach is single-source: one .rs file holds both host and device code, and cargo oxide build produces the host binary plus target/debug/vecadd.ptx. Device libraries are compiled lazily from the Stable MIR metadata in their .rlib files, so unused code is never compiled up front.

Device code supports match and if let, generics such as fn scale<T: Copy>, warp intrinsics (shfl_sync, ballot_sync), and thread::index_1d() for each thread's unique hardware index. DisjointSlice keeps mutable outputs race-free: get_mut(idx) requires a ThreadIndex, an opaque usize derived from threadIdx, blockIdx, and blockDim, so two threads cannot be handed the same element to write. On the host, launches use CudaContext, DeviceBuffer, and the cuda_launch! macro, with LaunchConfig::for_num_elems(1024) sizing the grid and blocks automatically.
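The opaque-index idea can be modeled on the host in plain Rust. This is a minimal sketch, not cuda-oxide's implementation: the ThreadIndex, index_1d, and DisjointSlice below are stand-ins that only borrow the names, get_mut taking the index by value is an assumption, and so is the block size of 256 (the block size chosen by LaunchConfig::for_num_elems is not stated here). The launch is replayed sequentially on the CPU.

```rust
// Host-only model; these are NOT cuda-oxide's real types.
mod model {
    /// Opaque global thread index. The field is private, so code outside
    /// this module cannot forge one from an arbitrary usize.
    pub struct ThreadIndex(usize);

    /// Stand-in for thread::index_1d(): blockIdx * blockDim + threadIdx.
    pub fn index_1d(block_idx: usize, block_dim: usize, thread_idx: usize) -> ThreadIndex {
        assert!(thread_idx < block_dim, "threadIdx must be below blockDim");
        ThreadIndex(block_idx * block_dim + thread_idx)
    }

    /// Output buffer whose elements are reachable only through a ThreadIndex.
    pub struct DisjointSlice<'a, T> {
        data: &'a mut [T],
    }

    impl<'a, T> DisjointSlice<'a, T> {
        pub fn new(data: &'a mut [T]) -> Self {
            Self { data }
        }

        /// Returns None when the index is past the end, which happens for
        /// the surplus threads of a grid that was rounded up.
        pub fn get_mut(&mut self, idx: ThreadIndex) -> Option<&mut T> {
            self.data.get_mut(idx.0)
        }
    }
}

/// Number of blocks needed to cover `n` elements (ceiling division).
fn blocks_for(n: usize, block_dim: usize) -> usize {
    (n + block_dim - 1) / block_dim
}

/// Sequentially replays every (block, thread) pair of a 1D launch.
fn simulate_vecadd(a: &[f32], b: &[f32], block_dim: usize) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let n = a.len();
    let mut out = vec![0.0f32; n];
    let grid_dim = blocks_for(n, block_dim);
    let mut c = model::DisjointSlice::new(&mut out);
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let i = block_idx * block_dim + thread_idx;
            let idx = model::index_1d(block_idx, block_dim, thread_idx);
            // Surplus threads get None and simply do nothing.
            if let Some(slot) = c.get_mut(idx) {
                *slot = a[i] + b[i];
            }
        }
    }
    out
}

fn main() {
    let a: Vec<f32> = (0..1000).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..1000).map(|i| 2.0 * i as f32).collect();
    let c = simulate_vecadd(&a, &b, 256);
    println!("blocks = {}, c[999] = {}", blocks_for(1000, 256), c[999]);
}
```

Because the index type has a private field, the only way to reach an output element is through an index computed from the launch geometry, and that is the property the paragraph above attributes to DisjointSlice.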

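cuda-oxide disables rustc's JumpThreading pass on device code to preserve bar.sync convergence: duplicating a barrier into separate control-flow paths breaks GPU barrier semantics. It also marks synchronization calls as convergent in LLVM IR, which stops the optimizer from moving them across control flow.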

Pure-Rust Pipeline Except One External Tool

cuda-oxide intercepts rustc at CodegenBackend::codegen_crate() and reads the program through rustc_public (Stable MIR), which reduces its exposure to changes in rustc internals. Pipeline: Rust → rustc frontend → Stable MIR → dialect-mir (Rust MIR semantics: places, projections, rvalues, terminators) → mem2reg → dialect-llvm → textual .ll → llc (LLVM 21+ NVPTX) → .ptx.
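The last hop of the pipeline, from textual .ll to .ptx, amounts to an llc invocation. The sketch below only assembles and prints such a command line. -march=nvptx64, -mcpu, and -o are standard llc options, but the exact flags cuda-oxide passes are not stated here; the sm_80 architecture, the intermediate target/debug/vecadd.ll path, and the llc_args helper are assumptions for illustration.

```rust
use std::path::Path;

/// Builds llc arguments that lower a textual .ll file to a .ptx file
/// placed next to it. `arch` is an NVPTX target processor such as "sm_80".
fn llc_args(ll_file: &Path, arch: &str) -> Vec<String> {
    // Same directory and stem as the input, with the extension swapped.
    let ptx_file = ll_file.with_extension("ptx");
    vec![
        "-march=nvptx64".to_string(),
        format!("-mcpu={}", arch),
        ll_file.display().to_string(),
        "-o".to_string(),
        ptx_file.display().to_string(),
    ]
}

fn main() {
    // Hypothetical intermediate file; the real location may differ.
    let args = llc_args(Path::new("target/debug/vecadd.ll"), "sm_80");
    println!("llc-21 {}", args.join(" "));
}
```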

The dialects are built on Pliron, a Rust-native MLIR-like IR: dialect-mir, dialect-llvm (exports .ll), and dialect-nvvm (NVIDIA intrinsics such as thread indexing, barriers, and TMA). The entire stack builds with cargo, with no C++, CMake, or tablegen. To observe the full trace, run cargo oxide pipeline vecadd, which prints every stage from MIR through PTX.

cuda-oxide complements rust-cuda (Rust ergonomics and async on the GPU) by focusing on the CUDA-native model. The vecadd example adds 1024 f32 values on the GPU and verifies the result on the host: cargo oxide run vecadd prints ✓ SUCCESS: All 1024 elements correct!
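The host-side check can be sketched in plain Rust as follows. This is an assumed shape for the verification, not the example's actual source: verify_vecadd and the input values are hypothetical, and the "GPU result" is computed on the CPU as a stand-in for the buffer copied back from the device. Exact comparison is used on the assumption that a single f32 addition rounds identically on host and device.

```rust
/// Checks c[i] == a[i] + b[i] for every element.
/// Returns the number of verified elements, or a description of the
/// first problem found.
fn verify_vecadd(a: &[f32], b: &[f32], c: &[f32]) -> Result<usize, String> {
    if a.len() != b.len() || a.len() != c.len() {
        return Err(format!(
            "length mismatch: {} / {} / {}",
            a.len(),
            b.len(),
            c.len()
        ));
    }
    for i in 0..c.len() {
        let expected = a[i] + b[i];
        if c[i] != expected {
            return Err(format!("element {}: expected {}, got {}", i, expected, c[i]));
        }
    }
    Ok(c.len())
}

fn main() {
    // Hypothetical inputs; the real example's data is not shown here.
    let a: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..1024).map(|i| (2 * i) as f32).collect();
    // Stand-in for the output buffer copied back from the GPU.
    let c: Vec<f32> = a.iter().zip(&b).map(|(x, y)| x + y).collect();

    match verify_vecadd(&a, &b, &c) {
        Ok(n) => println!("✓ SUCCESS: All {} elements correct!", n),
        Err(msg) => println!("✗ FAILURE: {}", msg),
    }
}
```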

Linux-Only Setup with Pinned Nightly and LLVM 21

Requires Ubuntu 24.04, Rust nightly-2026-04-03 (rustup toolchain install nightly-2026-04-03, then add the rust-src and rustc-dev components), LLVM 21+ with the NVPTX target (sudo apt install llvm-21; confirm with llc-21 --version | grep nvptx), and Clang 21 (sudo apt install clang-21), which bindgen uses to parse cuda.h. Pin the llc binary with export CUDA_OXIDE_LLC=/usr/bin/llc-21.
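The llc check above can also be expressed as a small Rust program. This is a sketch under assumptions: CUDA_OXIDE_LLC comes from the text, but falling back to the name llc-21 when it is unset is a guess at sensible behavior, and resolve_llc and lists_nvptx are hypothetical helpers rather than part of cuda-oxide.

```rust
use std::env;
use std::process::Command;

/// Picks the llc binary: an explicit, non-empty override wins; otherwise a
/// default name is used. The "llc-21" default is an assumption of this sketch.
fn resolve_llc(override_path: Option<String>) -> String {
    match override_path {
        Some(p) if !p.trim().is_empty() => p,
        _ => "llc-21".to_string(),
    }
}

/// True when `llc --version` output mentions an NVPTX target, which is the
/// same test as piping the output through `grep nvptx`.
fn lists_nvptx(version_output: &str) -> bool {
    version_output.lines().any(|line| line.contains("nvptx"))
}

fn main() {
    let llc = resolve_llc(env::var("CUDA_OXIDE_LLC").ok());
    match Command::new(&llc).arg("--version").output() {
        Ok(out) => {
            let text = String::from_utf8_lossy(&out.stdout);
            println!("{}: NVPTX target present = {}", llc, lists_nvptx(&text));
        }
        // Reached when the binary is missing or cannot be executed.
        Err(err) => println!("could not run {}: {}", llc, err),
    }
}
```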

Clone the repo, or install the cargo subcommand with cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide. cargo oxide doctor checks the dependencies. To debug, cargo oxide debug vecadd --tui runs cuda-gdb on target/debug/vecadd and its .ptx.

Next steps: generics (cargo oxide run generic), host closures (cuda_launch_async! with .await or .sync()), async_mlp, and gemm_sol (cta_group::2, index_2d(stride), unsafe for &mut T). Docs: nvlabs.github.io/cuda-oxide.