Machine Learning Street Talk
November 23, 2025

"I Invented the Transformer. Now I'm Replacing It."

Llion Jones, a co-author of the original Transformer paper and co-founder of Sakana AI, is now moving on. This podcast unpacks his critique of the current AI landscape, where the Transformer's success has created a powerful local minimum, and introduces his new, biologically-inspired architecture designed to break free.

The Transformer Trap

  • "I personally don't think we're done. I don't think that this is the final architecture... There's some breakthrough that will occur at some point, and then it will once again become obvious that we're kind of wasting a lot of time right now."
  • "If you want to move the industry away from [Transformers], being better is not good enough. It has to be obviously, crushingly better. Transformers were that much better over RNNs."

The AI world is a victim of its own success. Jones argues the field is stuck in an "architecture lottery," making endless incremental tweaks to Transformers, much like the pre-Transformer era of RNNs. This "technology capture" stifles the bottom-up, unstructured freedom that birthed the Transformer in the first place. Today's commercial pressures and publish-or-perish incentives push researchers toward safe bets, not revolutionary leaps. To escape this gravity well, a new architecture can't just be better; it must be "crushingly better" to overcome the massive inertia of existing tools and expertise.

A New Blueprint: The Continuous Thought Machine

  • "Instead, what we did was pay respect to the brain, pay respect to nature and say, well, if we build these inspired things, what actually happens? What different ways of approaching a problem emerge?"

Enter the Continuous Thought Machine (CTM), Sakana AI's spotlight paper at NeurIPS 2025. The CTM is built on three core ideas: an internal "thought dimension" for sequential processing, neurons that are mini-models themselves, and neuron synchronization as the core representation. This nature-inspired design produces powerful emergent behaviors without being explicitly programmed for them. The model naturally exhibits adaptive computation, spending more time on harder problems, and is almost perfectly calibrated out of the box, a strong sign of a more robust foundation.
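Calibration here means that the model's stated confidence tracks how often it is actually right. As a rough, generic illustration of how such a claim is usually checked (standard expected calibration error, not code from the CTM paper), a sketch in Python:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare mean confidence
    to empirical accuracy in each bin (standard ECE)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# A well-calibrated model has confidence ~ accuracy, so ECE ~ 0.
conf = np.array([0.9, 0.8, 0.6, 0.95, 0.55])
hit  = np.array([1,   1,   0,   1,    1])
print(expected_calibration_error(conf, hit))
```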

The True Test of Reasoning

  • "If you wanted AGI, you wouldn't want all of the text that humans have ever created; you would actually want the thought traces in their head as they were creating the text."

Current models look smart, but are they actually reasoning? Jones introduces Sudoku Bench, a benchmark built from complex, variant Sudoku puzzles from the YouTube channel "Cracking the Cryptic." These puzzles require nuanced, multi-step reasoning that today's best models, including GPT-4o, largely fail at, scoring around 15%. The benchmark includes thousands of hours of human "thought traces" from the channel's experts, but the creative leaps required to solve the puzzles are too sparse for current RL or imitation learning methods to grasp, highlighting a fundamental gap in AI capabilities.

Key Takeaways:

  • The AI industry is trapped by the Transformer's success, focusing on incremental gains instead of fundamental breakthroughs. Progress demands a return to risk-taking research and new architectural paradigms.
  • Escape the Architecture Lottery. The inertia behind Transformers is immense. A new model must be demonstrably superior across the board to justify a paradigm shift.
  • Nature's Algorithms are the Next Frontier. The CTM proves that biologically-inspired principles like neuron synchronization can unlock powerful capabilities like adaptive computation and better calibration naturally.
  • Reasoning is Deeper Than Scaling. The Sudoku Bench benchmark shows that current SOTA models cannot perform the creative, nuanced reasoning humans do. Brute-force scaling has hit a wall against truly complex problems.

For further insights and detailed discussions, watch the full podcast: Link

This episode reveals why a co-inventor of the Transformer is now moving beyond it, arguing that AI is trapped in a local minimum and that new, biologically-inspired architectures like the Continuous Thought Machine are essential for the next breakthrough in reasoning.

Moving Beyond the Transformer: A Creator's Pivot

  • A co-inventor of the Transformer architecture opens by explaining his decision to drastically reduce research on the very model he helped create. He argues the field has become an "oversaturated space," creating a gravitational pull that discourages true exploration.
  • Instead of making incremental tweaks to a dominant technology, his new focus is on fundamental exploration and developing entirely new architectures.
  • This perspective signals a critical shift for investors: the alpha in AI research may no longer be in scaling existing models but in identifying and backing teams that are exploring novel, non-Transformer paradigms.
  • He states, "I'm going to drastically reduce the amounts of research that I'm doing specifically on the transformer because of the feeling that I have that it's an oversaturated space."

The Lost Freedom of AI Research

  • The speaker contrasts the current research environment with the one that produced the Transformer. He notes that the original breakthrough was a product of a bottom-up, curiosity-driven process where researchers had the freedom to explore ideas over months.
  • Today, immense pressure from competition, commercialization, and the need for investment returns curtails this creative freedom. This creates "technology capture," where the success of Large Language Models (LLMs) forces researchers into a narrow, commercially viable path.
  • This environment makes it difficult for radical new ideas to gain traction, as the industry defaults to scaling what already works.
  • Strategic Implication: Investors should scrutinize the research culture of AI labs. Companies that protect and fund open-ended exploration, like Sakana AI, may be better positioned to discover the next architectural leap, even if it means a longer, less predictable path to commercialization.

The Local Minimum of Current Architectures

  • The conversation draws a parallel between the current dominance of Transformers and the previous era of Recurrent Neural Networks (RNNs)—a class of neural networks that process sequential data step-by-step. Just as RNNs saw endless minor tweaks before being made obsolete by the Transformer, today's research is filled with incremental improvements to the same core architecture.
  • The speaker warns that the industry is likely "wasting time in exactly the same way," stuck in a local minimum while a more powerful architecture waits to be discovered.
  • He points out that architectures better than the Transformer already exist in research, but they are not "obviously crushingly better" enough to overcome the massive industry inertia, software tooling, and expertise built around Transformers.
  • This highlights a key market dynamic: a new technology must offer a monumental performance gain to displace an entrenched incumbent, making the barrier for entry for new architectures extremely high.

Jagged Intelligence and the Problem of "Too Good" Models

  • The discussion addresses the phenomenon of "jagged intelligence," where LLMs can solve PhD-level problems one moment and make obvious, jarring errors the next. The speaker argues this is a sign of a fundamental flaw in the current architecture.
  • He provocatively suggests that current models are "too good." They are universal function approximators that can be forced to solve any problem with enough data and compute, but they don't represent the underlying structure of the data naturally.
  • Using the example of a model learning to classify a spiral dataset, he explains that a standard network approximates the spiral with many small, linear pieces rather than understanding it as a spiral. This "brute force" approach works but lacks true generalization and understanding (a toy illustration follows this list).
  • For Researchers: This points to a need for models with better inductive biases—architectures that are naturally inclined to represent data in a more structured, human-like way.
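To make the spiral point concrete, here is a hedged toy sketch, not from the episode: a small ReLU MLP fits a two-spirals dataset to high accuracy, yet its decision boundary is a patchwork of linear regions rather than anything that represents the spiral as a spiral. The dataset generator and model sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def two_spirals(n=500, noise=0.2, seed=0):
    """Generate the classic two-spirals toy dataset."""
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0, 1, n)) * 3 * np.pi
    arm = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    X = np.concatenate([arm, -arm]) + rng.normal(scale=noise, size=(2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

X, y = two_spirals()
# A ReLU MLP carves the plane into many small linear regions; it can fit
# the spiral, but the "spiral-ness" is never represented explicitly.
clf = MLPClassifier(hidden_layer_sizes=(64, 64), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print("train accuracy:", clf.score(X, y))
```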

Introducing the Continuous Thought Machine (CTM)

  • The episode introduces the Continuous Thought Machine (CTM), a new model from Sakana AI designed to address the limitations of current architectures. The CTM is a novel recurrent model inspired by how biological neurons synchronize.
  • Luke, a research scientist at Sakana AI, explains the CTM's three core novelties (a simplified code sketch follows the list):
    1. Internal Thought Dimension: The model performs computations over an internal, sequential dimension, allowing it to "think" through a problem step-by-step, much like human reasoning.
    2. Neuron-Level Models: Each neuron is a small model itself (an MLP), creating more complex and dynamic internal states than simple activation functions like ReLU.
    3. Synchronization as Representation: The model's "thought" is not just the state of its neurons at one point in time. Instead, it's represented by measuring how pairs of neurons synchronize their firing patterns over time, creating a richer, temporal representation.
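To ground the three novelties, here is a deliberately tiny, hypothetical PyTorch sketch of the same ideas: an internal loop of "ticks", a small MLP applied to each neuron's recent activation history (shared across neurons here for brevity, whereas the description above gives each neuron its own model), and a readout built from pairwise synchronization statistics. Dimensions and wiring are illustrative assumptions, not the published CTM implementation.

```python
import torch
import torch.nn as nn

class TinyCTMSketch(nn.Module):
    def __init__(self, n_neurons=32, history=8, ticks=16, n_classes=10):
        super().__init__()
        self.n, self.h, self.ticks = n_neurons, history, ticks
        # (2) Neuron-level model: a small MLP over each neuron's recent
        # activation history (shared here; the real idea is per-neuron).
        self.neuron_mlp = nn.Sequential(
            nn.Linear(history, 16), nn.ReLU(), nn.Linear(16, 1))
        self.mix = nn.Linear(n_neurons, n_neurons)  # recurrent mixing
        # (3) Synchronization as representation: read out from the
        # n*(n+1)/2 pairwise co-activation statistics.
        self.readout = nn.Linear(n_neurons * (n_neurons + 1) // 2, n_classes)

    def forward(self, x):                               # x: (batch, n_neurons)
        hist = x.unsqueeze(-1).repeat(1, 1, self.h)     # activation history
        states = []
        for _ in range(self.ticks):                     # (1) internal thought dimension
            z = self.neuron_mlp(hist).squeeze(-1)       # per-neuron update
            z = torch.tanh(self.mix(z))
            hist = torch.cat([hist[..., 1:], z.unsqueeze(-1)], dim=-1)
            states.append(z)
        S = torch.stack(states, dim=1)                  # (batch, ticks, neurons)
        sync = torch.einsum('bti,btj->bij', S, S) / self.ticks  # pairwise sync
        iu = torch.triu_indices(self.n, self.n)
        return self.readout(sync[:, iu[0], iu[1]])

logits = TinyCTMSketch()(torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 10])
```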

Adaptive Computation and Emergent Behaviors

  • A key feature of the CTM is its native adaptive computation—the ability to dynamically allocate more "thinking" time to harder problems.
  • Unlike previous attempts that required complex loss functions to prevent models from being "greedy" with compute, this behavior emerges naturally from the CTM's architecture.
  • The model learns to solve easy examples in just a few steps while using its full computational budget for more challenging ones (a minimal halting sketch follows this list).
  • Luke shares a fascinating emergent behavior observed during maze-solving tasks: "It would start going one path in the maze and then suddenly it would realize, 'oh no, damn, I'm wrong,' and would backtrack and then take another path." This demonstrates a more flexible, human-like problem-solving process.
  • Another surprising finding was that when constrained by time, the model developed a "leapfrogging" algorithm to solve mazes, jumping ahead and filling in paths backward—a novel strategy that emerged entirely from the system's constraints.
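A hedged sketch of what "allocating more thinking time to harder problems" can look like at inference: step a recurrent model tick by tick and stop once its predictive certainty crosses a threshold. The `model.step` interface, the max-softmax certainty measure, and the threshold are illustrative assumptions, not the CTM's actual mechanism or training objective.

```python
import torch

@torch.no_grad()
def predict_with_adaptive_ticks(model, x, max_ticks=50, threshold=0.95):
    """Run a recurrent model tick by tick; halt once it is confident.
    Assumes `model.step(x, state)` returns (logits, new_state), a
    hypothetical interface rather than the published CTM API."""
    state = None
    for tick in range(1, max_ticks + 1):
        logits, state = model.step(x, state)
        certainty = torch.softmax(logits, dim=-1).max(dim=-1).values
        if bool((certainty >= threshold).all()):
            break                      # easy inputs exit after a few ticks
    return logits, tick                # hard inputs use the full budget
```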

Reasoning, Benchmarking, and the Sudoku Challenge

  • The conversation shifts to the challenge of evaluating true reasoning in AI. The Transformer co-inventor introduces Sudoku Bench, a benchmark he created to push AI beyond simple pattern matching.
  • This is not standard Sudoku. It features "variant Sudokus" with complex, handcrafted rules described in natural language, often with unique twists or even deliberate errors in the instructions that the model must identify.
  • The benchmark was inspired by the YouTube channel "Cracking the Cryptic," where expert solvers verbalize their entire reasoning process. These "thought traces" provide a rich dataset for imitation learning, representing how humans discover novel "break-ins" or logical leaps to solve a puzzle.
  • Current state-of-the-art models, including GPT-4o, score around 15% on this benchmark, failing on all but the simplest puzzles. They cannot make the creative leaps required and fall back to brute-force trial and error.
  • Actionable Insight: For researchers and investors, Sudoku Bench serves as a powerful, grounded tool to measure progress in advanced reasoning. Success on this benchmark would signal a genuine breakthrough beyond the capabilities of current LLMs (a sketch of a minimal scoring harness follows).
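As an illustration only, a minimal scoring harness for a benchmark of this shape might prompt a model with the variant rules and starting grid, parse the reply, and count exact-grid matches. The prompt format, puzzle fields, and `ask_model` callable are hypothetical and not the actual Sudoku Bench code.

```python
import re

def parse_grid(text, size=9):
    """Pull the last size*size digits out of the model's reply."""
    digits = re.findall(r"[1-9]", text)
    if len(digits) < size * size:
        return None
    return [int(d) for d in digits[-size * size:]]

def evaluate(puzzles, ask_model):
    """puzzles: list of dicts with 'rules', 'given_grid', 'solution'.
    ask_model: callable prompt -> reply string (hypothetical interface)."""
    solved = 0
    for p in puzzles:
        prompt = (f"Rules: {p['rules']}\n"
                  f"Given grid: {p['given_grid']}\n"
                  "Reply with the completed 9x9 grid, digits only.")
        answer = parse_grid(ask_model(prompt))
        solved += int(answer == p["solution"])
    return solved / len(puzzles)   # fraction of puzzles solved exactly
```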

Conclusion

This episode argues that the future of AI lies not in scaling existing models but in fostering research freedom to explore new architectures. The Continuous Thought Machine and the Sudoku Bench benchmark exemplify a shift toward building and measuring systems with more robust, human-like, and adaptive reasoning capabilities.
