Compiling with Arrays to CUDA
Array programming languages have become increasingly relevant in many domains, especially machine learning and scientific computing. They offer concise, declarative syntax with high-level operations on whole arrays, which enables implicit parallelism; however, mainstream systems typically trade compile-time shape guarantees for performance on accelerators.
This thesis introduces SALT, a minimal shape-typed array programming language that guarantees compile-time shape safety, and a compiler for SALT that targets CUDA® via a one-kernel-one-instruction strategy. This design isolates each high-level operation in its own CUDA kernel, which simplifies reasoning, testing, profiling, and translation while preserving each SALT operation's high-level semantics in the corresponding kernel. We implement SALT as an embedded language in Lean 4 and generate CUDA code through a small intermediate representation. An extensive evaluation on a modern NVIDIA GPU assesses the performance on small end-to-end benchmarks. While individual kernels are quite efficient in isolation, end-to-end performance lags behind even simple baselines: for matrix multiplication we observed a slowdown of up to 63x, and dense vector layers were up to 4x slower. The main bottleneck is the excessive materialization of intermediate results in global memory. We discuss opportunities for optimization and show that SALT provides a solid foundation for future work. In summary, SALT demonstrates the feasibility of one-kernel-one-instruction compilation to CUDA and provides a foundation for future optimization work.
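The one-kernel-one-instruction strategy and its materialization cost can be illustrated with a minimal sketch. Each high-level array operation is lowered to its own dedicated CUDA kernel; the kernel names and signatures below are hypothetical illustrations, not taken from SALT's actual code generator.

```cuda
// Hypothetical kernel for one high-level operation: elementwise addition.
// Under one-kernel-one-instruction, each operation gets a kernel like this,
// writing its full result to global memory.
__global__ void salt_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// A second operation (here, scaling by a constant) gets its own kernel.
// An expression such as (a + b) * c therefore launches two kernels and
// materializes the intermediate sum in global memory between them --
// the bottleneck identified in the evaluation.
__global__ void salt_scale(const float* a, float c, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * c;
}
```

A fusing compiler would combine both operations into one kernel and avoid the intermediate write; keeping them separate is what makes each kernel easy to reason about, test, and profile in isolation.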