AI kernel generation promises to unlock heterogeneous compute for agentic workloads, but faces critical challenges in validation, complexity, and human-in-the-loop integration.
The Heterogeneous Compute Bottleneck
- Natalie Serrino, co-founder of Gimlet Labs, identifies a critical bottleneck in AI inference. Agentic AI workloads (complex pipelines of multiple models and tool calls) demand heterogeneous compute for optimal performance.
- Current kernel optimizations are hardware-specific, creating portability issues across diverse vendors and architectures.
- Low-level kernel optimization significantly boosts ML workload speed; Nvidia's optimized attention implementation, for example, yielded a 3x improvement in Llama throughput.
- An acute shortage of expert kernel engineers, coupled with exploding complexity across frameworks (CUDA, Triton, Pallas, Metal) and diverse hardware characteristics, exacerbates the problem.
- Different hardware platforms, even within a single vendor, possess unique properties (e.g., cache sizes) that impact optimal kernel implementation.
- "There's just not enough experts to be able to solve every problem in this space right now." - Natalie Serrino
AI Agents Mimic Human Kernel Optimization
- Gimlet Labs proposes AI agents automate the iterative human workflow for kernel porting and optimization.
- The human kernel expert workflow involves iterative steps: attempting an implementation, checking compilation, execution, and correctness, then profiling for bottlenecks.
- AI agents slot into this same loop, automating the compilation, execution, and correctness checks before optimization begins (a minimal sketch of the loop follows this section).
- Gimlet Labs' system demonstrated a 22% speedup over torch.compile on an H100 for a PyTorch workload by exploring candidate optimizations.
- "The idea here is to put AI as the kind of like where the human would go in that same loop." - Natalie Serrino
Benchmarking Challenges and Agent Performance
- Measuring AI-generated kernel performance introduces significant complexities beyond simple timing.
- Defining "correctness" for floating-point operations requires careful tolerance settings and well-selected input sizes to avoid measuring overhead instead of kernel performance.
- Reliable performance measurement demands meticulous warm-up runs and cache clearing, so that results are not skewed by state cached from prior runs (a sketch of such a harness follows this list).
- A lack of comprehensive low-level kernel benchmarks across diverse hardware platforms hinders effective agent training and evaluation.
- Preliminary results on Apple M4 (Metal framework) show agents achieving an average 24% speedup on moderately complex (L1/L2) problems, while performance degrades on highly complex (L3) tasks.
- "You have to be really neurotic about [benchmarking] otherwise you might get bad results." - Natalie Serrino
Successes and Strategic Limitations of AI Kernel Generation
- AI agents demonstrate specific optimization capabilities but struggle with highly optimized or trivial cases.
- Kernel Fusion: Agents successfully fuse multiple operations into a single "mega function," achieving 40% speedups on M4 by customizing to specific use cases. (Kernel fusion combines multiple sequential kernel operations into one larger kernel to reduce launch overhead and improve data locality; a minimal sketch follows this list.)
- Algorithmic Re-expression: Agents rewrite PyTorch code to leverage more heavily optimized underlying operations (e.g., re-expressing AvgPool1d as a convolution for an 80% speedup; see the second sketch below).
- PyTorch-level Fusion: Agents combine operations at the Python level, reducing op launches for efficiency.
- Limitations: Agents fail to outperform heavily hand-optimized operations like matrix multiplication, and can "cheat" by pruning unnecessary work (e.g., a 71,000x speedup achieved by eliminating redundant operations). Trivial optimizations also occur, such as swapping in PyTorch's already-optimized scaled dot-product attention (SDPA) module.
- "Matrix multiply is one of the most hand optimized ops that exists. So it's not that surprising that an agent would not do as well as something that a human expert spent a long time on." - Natalie Serrino
The Future: Human-in-the-Loop Agentic Swarms
- The path forward involves multi-agent systems with robust validation and human oversight.
- Standalone agents excel at generating diverse ideas and processing context for L1/L2 tasks, but require robust quality and performance validation.
- Empirical data from hardware-in-the-loop systems is essential to guide search and optimization, as low-level code performance is difficult to predict without execution.
- Modern agent design pairs a supervisor agent with a "synthesis agentic swarm" for idea generation and a "verification agent" for strict hardware-in-the-loop testing (a schematic sketch follows this section).
- Human prompting remains critical for guiding optimization paths and supervising results, especially for full models (e.g., a 70% speedup on an audio encoder using six custom kernels generated for an RTX 6000 Blackwell).
- "We still heavily rely on looking on profiling data... And we also need the human in the loop to supervise the results and guide the work." - Natalie Serrino
Investor & Researcher Alpha
- Capital Movement: Investment will shift towards platforms and tools that enable seamless orchestration of agentic workloads across diverse hardware, prioritizing performance and efficiency.
- New Bottleneck: The scarcity of expert kernel engineers will drive demand for AI-powered kernel generation tools, making robust benchmarking and validation frameworks critical infrastructure.
- Research Direction: Purely theoretical or simulation-based kernel optimization research risks obsolescence without empirical hardware-in-the-loop validation; the emphasis will shift toward practical, deployable solutions, multi-agent architectures, and formal verification for correctness.
Strategic Conclusion
AI kernel generation is a powerful tool for automating low-level optimization, addressing the expert bottleneck and enabling heterogeneous compute. The next step is developing sophisticated human-in-the-loop agentic systems with rigorous hardware validation to unlock its full potential across diverse AI workloads.