Title: The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
Authors: Robert Tjarko Lange, Aaditya Prasad, Qi Sun, Maxence Faldor, Yujin Tang, David Ha
Published: 2025-02-20
Link: https://pub.sakana.ai/static/paper.pdf

Abstract

Recent advances in Large Language Models have driven large-scale deployment, resulting in ever-growing inference-time and energy demands. While manual optimization of low-level code implementations is feasible, it is an arduous task that requires deep expertise to balance the complex interplay of algorithmic, software, and hardware bottlenecks. This report presents the first comprehensive agentic framework for fully automatic CUDA kernel discovery and optimization, enabling frontier large language models to translate torch code into CUDA kernels and then iteratively improve their runtime. We introduce The AI CUDA Engineer, which acts in sequential stages. First, it translates raw PyTorch code into equivalent CUDA kernels. Next, it optimizes their runtime performance using a novel evolutionary meta-generation procedure tailored to the CUDA ecosystem. Finally, it uses an innovation archive of discovered 'stepping stone' kernels to improve future performance on new tasks. The AI CUDA Engineer can produce CUDA kernels that exceed the performance of native and compiled torch kernels. Out of the 250 tasks tested, The AI CUDA Engineer successfully optimizes 186 tasks to a median speedup of 1.52x. For operations such as fused 3D convolutions or Diagonal Matrix Multiplication, we show runtime improvements of ≥50x over their torch implementations. Alongside this report, we release the best discovered kernels, an accompanying dataset of all discovered kernels, and an interactive webpage for exploring the results.
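To give a sense of where speedups of this magnitude can come from, the largest gains often stem from algorithmic simplification rather than micro-tuning alone. The sketch below is our own illustration, not one of the paper's released kernels: materializing diag(a) before a matmul costs O(N²M) work and an N×N temporary, while row-wise broadcasting computes the same result in O(NM).

```python
import torch

def diag_matmul_naive(a: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Reference semantics: materialize diag(a), then run a full matmul.
    # O(N^2 * M) work plus an N x N temporary allocation.
    return torch.diag(a) @ B

def diag_matmul_fast(a: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Same result via broadcasting: row i of the output is a[i] * B[i, :].
    # Only O(N * M) work and no temporary matrix.
    return a.unsqueeze(1) * B

# Quick semantics check (CPU suffices for verifying equivalence).
a = torch.randn(512)
B = torch.randn(512, 512)
assert torch.allclose(diag_matmul_naive(a, B), diag_matmul_fast(a, B))
```

A CUDA kernel exploiting this equivalence touches each output element exactly once, which is the kind of rewrite that produces order-of-magnitude rather than percentage-level improvements.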


Background (§2)

Evolutionary Code Optimization with LLMs. One particular flavor of test-time compute is evolutionary code optimization: the use, mutation, and recombination of previously generated code to produce new samples. This approach has previously been used to optimize reward and preference objectives (Lu et al., 2024a; Ma et al., 2023), mathematical science code (Romera-Paredes et al., 2024), entire machine learning papers (Lu et al., 2024b), and other applications (Berman, 2025; Lange et al., 2024; Lehman et al., 2022; Meyerson et al., 2023). Through prompting, LLMs are used as recombination engines (Lange et al., 2023; Meyerson et al., 2023), capable of simulating crossover between diverse code snippets and the rationales that produced them. A simpler form of this technique is retrieval-augmented generation (RAG, Gao et al., 2023; Lewis et al., 2020b), whereby relevant historical samples are injected into context based on embedding similarity or other filters. Here, we utilize both RAG and code crossover to improve our kernel-optimization results, as sketched below.
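The following is a minimal sketch of how the two mechanisms can be combined in this setting. The names `embed`, `retrieve`, and `crossover_prompt` are placeholder stand-ins of ours, not APIs from the paper; in practice an off-the-shelf code-embedding model and a frontier LLM would fill the corresponding roles.

```python
import numpy as np

def embed(code: str) -> np.ndarray:
    # Placeholder: deterministic pseudo-embedding keyed on the code string.
    # A real pipeline would call a code-embedding model here.
    rng = np.random.default_rng(abs(hash(code)) % (2**32))
    return rng.standard_normal(256)

def retrieve(query_code: str, archive: list[str], k: int = 2) -> list[str]:
    # RAG step: rank archived kernels by cosine similarity to the query.
    q = embed(query_code)
    scores = [
        float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        for e in (embed(c) for c in archive)
    ]
    top = np.argsort(scores)[::-1][:k]
    return [archive[i] for i in top]

def crossover_prompt(parent_a: str, parent_b: str, retrieved: list[str]) -> str:
    # Crossover step: ask the LLM to recombine two parent kernels,
    # with retrieved high-performing kernels injected as extra context.
    context = "\n\n".join(retrieved)
    return (
        "You are optimizing CUDA kernels.\n"
        f"Reference kernels from the archive:\n{context}\n\n"
        f"Parent A:\n{parent_a}\n\nParent B:\n{parent_b}\n\n"
        "Combine the strengths of both parents into one faster, correct kernel."
    )
```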

KernelBench. The KernelBench benchmark (Ouyang et al., 2025) is a set of 250 neural network tasks, defined as PyTorch modules, along with a subset of corresponding results for CUDA kernel generation across these tasks. The tasks are split into three categories, denoted levels 1, 2, and 3. Level 1 tasks are common ML primitives, such as softmax or various matrix multiplications. Level 2 tasks are few-step fusions of those primitives, such as a layer's activation followed by a normalization. Level 3 tasks represent full network architectures (e.g., a ResNet); a schematic Level 2-style task is sketched below.
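For concreteness, a hypothetical Level 2-style task might look like the module below. This is patterned on the benchmark's task convention (a `Model` class plus input helpers), but the specific operator choice and shapes are illustrative, not copied from KernelBench.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    # Hypothetical Level 2-style task: linear layer, GELU activation,
    # then layer normalization -- a natural candidate for kernel fusion.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.norm = nn.LayerNorm(out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(torch.nn.functional.gelu(self.linear(x)))

batch_size, in_features, out_features = 128, 1024, 1024

def get_inputs():
    # Random inputs for checking a candidate kernel against the reference.
    return [torch.randn(batch_size, in_features)]

def get_init_inputs():
    # Constructor arguments for instantiating the module.
    return [in_features, out_features]
```

A candidate CUDA kernel is judged by matching this module's outputs on random inputs while running faster than the torch reference.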