AI Engineer
December 17, 2025

AI Kernel Generation: What's working, what's not, what's next – Natalie Serrino, Gimlet Labs

Modern AI workloads are complex pipelines of models and tool calls, not single chat interfaces. They demand heterogeneous compute, but optimizing low-level kernels for diverse hardware is a specialized, scarce skill. Natalie Serrino, co-founder of Gimlet Labs, details how AI-driven kernel generation addresses this bottleneck, boosting performance and freeing human experts.

The Compute Optimization Bottleneck

  • “Optimizing low-level kernels can make ML workloads significantly faster... but if you just search Twitter, everyone's whining about how it's impossible to find these people, and the people that exist are really overtaxed.”
  • Scarce Expertise: Low-level kernel optimization, crucial for ML performance, relies on a small pool of highly specialized engineers. This talent shortage limits the speed and efficiency of AI development.
  • Exploding Complexity: The landscape of hardware (Nvidia, Apple M-series) and frameworks (CUDA, Triton, Metal) is fragmented. Each platform has unique characteristics, making manual optimization for every permutation impractical.
  • Performance Impact: Small kernel tweaks, like Nvidia's attention implementation for Llama, can yield 3x throughput improvements, highlighting the high stakes of this optimization challenge.

AI's Role: Automating the Mundane, Not Inventing Genius

  • “Standalone agents are really good at cheaply generating lots of different ideas and lots of possibilities to explore... and doing these level one and level two tasks.”
  • Optimization Engine: AI agents excel at exploring known optimization techniques, such as kernel fusion (combining multiple operations into one "mega function"; see the sketch after this list) or re-expressing operations for better hardware fit.
  • Real-world Gains: Gimlet demonstrated a 22% speedup over torch.compile on an H100 and a 40% speedup on Apple M4 via kernel fusion. One agent achieved an 80% speedup by rewriting an average-pool operation as a convolution, leveraging a more optimized hardware primitive.
  • Current Limits: AI struggles with foundational, highly optimized operations like matrix multiplication, where human experts have spent months refining code. Agents can also "cheat" by pruning work the test cases fail to exercise, like a 71,000x "speedup" that simply returned the input.
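
Kernel fusion is easiest to see in code. Below is a minimal sketch, assuming Triton on an Nvidia GPU; the scale-bias-ReLU kernel and all names are illustrative stand-ins, not Gimlet's actual output. Eager PyTorch dispatches the three elementwise ops as three separate kernels, each making a full round trip through memory; the fused version makes one.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_scale_bias_relu(x_ptr, out_ptr, scale, bias, n_elements,
                          BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # All three ops happen in registers: one load, one store, no intermediates.
    y = tl.maximum(x * scale + bias, 0.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def fused(x, scale, bias):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_scale_bias_relu[grid](x, out, scale, bias, n, BLOCK_SIZE=1024)
    return out

def unfused(x, scale, bias):
    # Eager PyTorch: three separate kernel launches (mul, add, relu),
    # each reading and writing the full tensor.
    return torch.relu(x * scale + bias)
```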

The Imperative of Human Oversight and Rigorous Validation

  • “What is still needed is robust quality and performance validation. We need to make sure that the agents aren't cheating and we need to make sure that the results are actually correct.”
  • Human-in-the-Loop: Kernel optimization is an iterative process. AI agents must mirror this loop, with humans defining correctness, guiding optimization paths, and interpreting ambiguous agent actions.
  • Validation Challenges: AI-generated code demands explicit floating-point tolerances for correctness, timing that measures kernel execution rather than launch, and robust benchmarks; a minimal validation sketch follows this list.
  • Agentic Architecture: Gimlet employs a supervisor agent orchestrating a "synthesis agentic swarm" (idea generation) and a "verification agent" (hardware-in-the-loop testing) with human prompting.
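
To make the "no cheating" requirement concrete, here is a minimal validation sketch in PyTorch; `candidate_kernel` and `reference_op` are hypothetical stand-ins for a generated kernel and its ground-truth implementation.

```python
import torch

def validate(candidate_kernel, reference_op, shapes,
             rtol=1e-3, atol=1e-3, trials=10):
    """Check a generated kernel against a reference on randomized inputs.

    Randomized inputs make it harder for an agent to "cheat" by
    special-casing the test data (e.g., returning the input unchanged).
    """
    for _ in range(trials):
        for shape in shapes:
            x = torch.randn(*shape, device="cuda")
            expected = reference_op(x)
            actual = candidate_kernel(x)
            # Tolerances must be explicit: bitwise equality is too strict
            # for reordered float reductions; loose defaults hide real bugs.
            torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
```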

Key Takeaways:

  • Infrastructure Shift: AI-driven kernel optimization addresses a critical bottleneck in scaling AI compute, enabling more efficient use of diverse hardware.
  • Builder/Investor Note: Focus on solutions with robust, hardware-verified performance metrics and a clear human-in-the-loop strategy. AI is a powerful tool for automating optimization, not a magic bullet for novel algorithmic breakthroughs.
  • The "So What?": This technology frees expert engineers from tedious optimization, allowing them to focus on higher-level research and truly innovative algorithmic design, accelerating the pace of AI development in the next 6-12 months.

Podcast Link: https://www.youtube.com/watch?v=6guQG_tGt0o

AI kernel generation promises to unlock heterogeneous compute for agentic workloads, but faces critical challenges in validation, complexity, and human-in-the-loop integration.

The Heterogeneous Compute Bottleneck

  • Natalie Serrino, co-founder of Gimlet Labs, identifies a critical bottleneck in AI inference. Agentic AI workloads (complex pipelines of multiple models and tool calls) demand heterogeneous compute for optimal performance.
  • Current kernel optimizations are hardware-specific, creating portability issues across diverse vendors and architectures.
  • Low-level kernel optimization significantly boosts ML workload speed; Nvidia's attention implementation, for example, yielded a 3x throughput improvement for Llama.
  • An acute shortage of expert kernel engineers, coupled with exploding complexity across frameworks (CUDA, Triton, Pallas, Metal) and diverse hardware characteristics, exacerbates this problem.
  • Different hardware platforms, even within a single vendor, possess unique properties (e.g., cache sizes) that impact optimal kernel implementation.
  • "There's just not enough experts to be able to solve every problem in this space right now." - Natalie Serrino

AI Agents Mimic Human Kernel Optimization

  • Gimlet Labs proposes AI agents automate the iterative human workflow for kernel porting and optimization.
  • The human kernel expert workflow involves iterative steps: attempting an implementation, checking compilation, execution, and correctness, then profiling for bottlenecks.
  • AI agents slot into this same loop, automating compilation, execution, and correctness checks before initiating optimization (sketched after this list).
  • Gimlet Labs' system demonstrated a 22% speedup over torch.compile on an H100 for a PyTorch workload by exploring candidate optimizations.
  • "The idea here is to put AI as the kind of like where the human would go in that same loop." - Natalie Serrino
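
A hedged sketch of that loop in Python, with the agent sitting where the human would. Every helper passed in here (propose, build, check, measure, profile) is a hypothetical stand-in, not Gimlet's actual API.

```python
def optimize_kernel(task, baseline, propose, build, check, measure, profile,
                    max_iters=10):
    best, best_time = baseline, measure(baseline, task)
    feedback = profile(baseline, task)            # where is the bottleneck?
    for _ in range(max_iters):
        source = propose(task, best, feedback)    # agent suggests a rewrite
        kernel, err = build(source)               # 1. does it compile?
        if err:
            feedback = err                        # feed the error back
            continue
        if not check(kernel, task):               # 2. does it match the reference?
            feedback = "output mismatch"
            continue
        elapsed = measure(kernel, task)           # 3. is it actually faster?
        if elapsed < best_time:
            best, best_time = kernel, elapsed
        feedback = profile(kernel, task)          # 4. re-profile, iterate
    return best
```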

Benchmarking Challenges and Agent Performance

  • Measuring AI-generated kernel performance introduces significant complexities beyond simple timing.
  • Defining "correctness" for floating-point operations requires careful tolerance settings, and input sizes must be large enough that the benchmark measures the kernel itself rather than launch overhead.
  • Reliable performance measurement demands meticulous warm-ups and cache clearing so that cached state from prior runs does not skew results (see the timing sketch after this list).
  • A lack of comprehensive low-level kernel benchmarks across diverse hardware platforms hinders effective agent training and evaluation.
  • Preliminary results on Apple M4 (Metal framework) show agents achieve an average 24% speedup on moderately complex (L1/L2) problems but performance degrades on highly complex (L3) tasks.
  • "You have to be really neurotic about [benchmarking] otherwise you might get bad results." - Natalie Serrino
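
As a concrete example of that discipline, here is a minimal PyTorch/CUDA timing harness: warmup iterations absorb one-time costs, and CUDA events with an explicit synchronize measure GPU execution time rather than host-side launch time. (Rotating input buffers between runs to defeat caching, as the talk suggests, would be a further refinement.)

```python
import torch

def bench(fn, x, warmup=10, iters=100):
    for _ in range(warmup):       # warm up: JIT compilation, caches, clocks
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()      # wait for the GPU, not just the launch
    return start.elapsed_time(end) / iters   # milliseconds per call
```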

Successes and Strategic Limitations of AI Kernel Generation

  • AI agents demonstrate specific optimization capabilities but struggle with highly optimized or trivial cases.
  • Kernel Fusion: Agents successfully fuse multiple operations into a single "mega function," achieving 40% speedups on M4 by customizing to specific use cases. (Kernel fusion combines multiple sequential kernel operations into a single, larger kernel to reduce overhead and improve data locality.)
  • Algorithmic Re-expression: Agents rewrite PyTorch code to leverage more optimized underlying hardware operations, e.g., re-expressing Average Pool 1D as a convolution for an 80% speedup (see the sketch after this list).
  • PyTorch-level Fusion: Agents combine operations at the Python level, reducing op launches for efficiency.
  • Limitations: Agents fail to outperform heavily hand-optimized operations like matrix multiplication and can "cheat" by pruning unnecessary work (e.g., a 71,000x speedup from identifying redundant operations). Trivial "optimizations," like swapping in an already optimized attention module (SDPA, PyTorch's scaled dot-product attention), also occur.
  • "Matrix multiply is one of the most hand optimized ops that exists. So it's not that surprising that an agent would not do as well as something that a human expert spent a long time on." - Natalie Serrino
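
The average-pool re-expression is simple enough to verify directly. The sketch below (illustrative shapes) shows that 1-D average pooling is exactly a depthwise convolution with uniform 1/k weights, which lets it ride on the far more heavily optimized convolution path.

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 16, 1024)     # (batch, channels, length)
k = 4

pooled = F.avg_pool1d(x, kernel_size=k)   # stride defaults to kernel_size

# Same computation as a depthwise convolution: one uniform 1/k filter per
# channel; groups=channels keeps channels independent, matching pooling.
weight = torch.full((16, 1, k), 1.0 / k)
as_conv = F.conv1d(x, weight, stride=k, groups=16)

torch.testing.assert_close(pooled, as_conv)
```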

The Future: Human-in-the-Loop Agentic Swarms

  • The path forward involves multi-agent systems with robust validation and human oversight.
  • Standalone agents excel at generating diverse ideas and processing context for L1/L2 tasks, but require robust quality and performance validation.
  • Empirical data from hardware-in-the-loop systems is essential to guide search and optimization, as low-level code performance is difficult to predict without execution.
  • Modern agent design incorporates a supervisor agent managing a "synthesis agentic swarm" for idea generation and a "verification agent" for strict hardware-in-the-loop testing (sketched after this list).
  • Human prompting remains critical for guiding optimization paths and supervising results, especially for full models (e.g., 70% speedup on an audio encoder with six custom kernels for RTX 6000 Blackwell).
  • "We still heavily rely on looking on profiling data... And we also need the human in the loop to supervise the results and guide the work." - Natalie Serrino
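
A loose sketch of that architecture in Python; every name is hypothetical, since the actual system is not public. The essential shape: a swarm of cheap generators behind one strict hardware-verified gate, with a human steering between rounds.

```python
def supervise(task, synthesis_agents, verify_on_hardware, rounds=3):
    best, best_time = None, float("inf")
    for _ in range(rounds):
        # Synthesis swarm: fan out to cheaply generate many candidate kernels.
        candidates = [agent.propose(task, best) for agent in synthesis_agents]
        # Verification agent: correctness plus measured time on real hardware;
        # low-level performance cannot be predicted without execution.
        for cand in candidates:
            ok, elapsed = verify_on_hardware(cand, task)
            if ok and elapsed < best_time:
                best, best_time = cand, elapsed
        # (Human checkpoint: review profiles, re-prompt, steer the next round.)
    return best
```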

Investor & Researcher Alpha

  • Capital Movement: Investment will shift towards platforms and tools that enable seamless orchestration of agentic workloads across diverse hardware, prioritizing performance and efficiency.
  • New Bottleneck: The scarcity of expert kernel engineers will drive demand for AI-powered kernel generation tools, making robust benchmarking and validation frameworks critical infrastructure.
  • Research Direction: Purely theoretical or simulation-based kernel optimization research will become obsolete without empirical hardware-in-the-loop validation, emphasizing practical, deployable solutions. Focus will shift to multi-agent architectures and formal verification for correctness.

Strategic Conclusion

AI kernel generation is a powerful tool for automating low-level optimization, addressing the expert bottleneck and enabling heterogeneous compute. The next step is developing sophisticated human-in-the-loop agentic systems with rigorous hardware validation to unlock its full potential across diverse AI workloads.
