Machine Learning Street Talk
July 16, 2025

The Humble Truths Behind Bombastic AI Papers

This episode features a category theorist navigating the messy, empirical world of machine learning, contrasting the rigorous proofs of pure mathematics with the "anything goes" ethos of AI research. The discussion explores the philosophical underpinnings of why models work, the limitations of our current evaluation methods, and the search for a more profound theory of intelligence.

The Crisis of Explanation in AI

  • “In pure mathematics, we have an incredibly brutal aesthetic/formal judgment for when something's true… But here is a context where you don't have that. So instead, you have to concoct something that will serve a similar role, and [Moritz] Hardt argues that this is what benchmarks have been for.”
  • For years, benchmarks were ML’s North Star, a substitute for formal proof. The system worked because while raw performance numbers varied, the ranking of models remained consistent across different benchmarks. That’s no longer true. This breakdown signals that either the low-hanging fruit is gone or, more radically, that the entire benchmark-centric approach to science is broken. The field now grapples with a new, harder question: What does a good explanation for why a model works even look like?

Building Sandcastles on Platonic Illusions

  • “These deep learning models, as good as they are, they build sandcastles… a structure with too many degrees of freedom. You prod the sand castle and it just collapses because there's no undergirding structure in it.”
  • Are AI models discovering deep, universal truths (Platonism) or just creating convincing fakes (constructivism)? The speaker argues for the latter: models don't find a perfect, underlying reality, but instead get very good at creating the illusion of one. This explains why techniques like Chain of Thought are effective—not because they replicate human logic, but because they "sparsify" a problem's dependencies, making it more computationally tractable, much like Transformers do for language. The goal isn’t to find the perfect structure, but to develop robust illusions of it.

Category Theory: An Algebra for Architectures

  • “What's the role for category theory in such a story? Well, it makes it relatively easy to have the idea of what you mean by an ‘algebra for constructing these systems.’”
  • Category theory isn't a silver bullet for designing new architectures. Its real power lies in providing a formal language—an "algebra"—to describe, index, and compare different ways of building models. For example, it allows us to see Transformers as a parallelized, finite-depth approximation of an RNN. This perspective isn't about creating a new model from scratch but about providing a unified lens to understand why existing models, like Transformers, are so good at handling the sparse dependencies found in natural data. It helps systematize the search for what works.

Key Takeaways:

  • The conversation suggests a field at an inflection point, moving from an adolescent phase of brute-force discovery to a more mature search for foundational principles.
  • Benchmarks are broken. The ML community can no longer rely on leaderboards as a proxy for truth. The new frontier is developing robust, qualitative explanations for why models succeed or fail.
  • Embrace the illusion. The most effective models aren’t finding universal laws but are constructing powerful, computationally efficient illusions of them. Progress lies in refining these illusions, not in a futile search for Platonic perfection.
  • Think like a physicist. The future of foundational AI research is to treat models as complex physical systems. The task is to design parametric models where stochastic processes, like SGD, can efficiently "relax" into a state that approximates the data distribution.

For further insights and detailed discussions, watch the full podcast: Link

This episode dissects the philosophical and practical realities of AI research, revealing how the field grapples with the absence of formal proof and why a shift from bombastic claims to humble, foundational understanding is critical for future breakthroughs.

The Illusion of Truth: Platonism vs. Constructivism in AI

  • Paul introduces the concept of metabolism as a beautiful example of a constructive, anticipatory process. A metabolism is a chemical cycle that runs in anticipation of receiving fuel, creating a structure that appears to have a purpose but is fundamentally a contingent, emergent process.
  • This leads to the idea that inventing concepts that don't physically exist, like "fitness" in biology, is a powerful problem-solving strategy. These abstractions make reasoning simpler, even if they are just useful fictions.
  • For researchers, this highlights the value of creating and using abstract models, even if they aren't perfect representations of reality. The key is whether the model provides explanatory power and predictive accuracy, not whether it reflects an ontological "truth." Investors should be wary of projects claiming to have discovered a "fundamental law" of AI, as progress is more likely to be messy and constructive.
  • “I don't think that there is a platonic thing here. I think that's wrong. But I think that developing the illusions of platonism is how structure happens in the world.” - Paul

The Role of Benchmarks and the Crisis of Scientific Judgment

  • Previously, different benchmarks for a similar task would generally preserve the performance ranking of various models (a toy rank-consistency check follows this list). Now, a model that excels on one benchmark may fail on another, indicating that the "easy" problems may be solved and the current approach is insufficient for more complex challenges.
  • This breakdown suggests two possibilities: either we need to abandon the expectation of consistent cross-benchmark performance, or the entire benchmark-centric approach to evaluation is fundamentally broken and needs to be replaced with "higher-order aesthetic sensibilities" for what constitutes a good explanation.
  • Investors and researchers should critically scrutinize performance claims based on single benchmarks. A model's true capability is better assessed by its performance across a diverse range of benchmarks and, more importantly, by the quality of the explanation for why it works. The era of SOTA-chasing on narrow benchmarks may be ending.
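To make the rank-consistency point concrete, here is a minimal sketch with made-up numbers (my own toy, not data from the episode): the old regime corresponds to a Spearman rank correlation near 1 between two leaderboards, and the breakdown corresponds to that correlation falling.

```python
# Toy check of cross-benchmark rank consistency (hypothetical scores).
import numpy as np

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score vectors (no tie handling)."""
    rank_a = np.argsort(np.argsort(scores_a))
    rank_b = np.argsort(np.argsort(scores_b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Hypothetical accuracies of five models on different benchmarks.
bench_1 = np.array([0.62, 0.71, 0.55, 0.80, 0.68])
bench_2_consistent = np.array([0.58, 0.69, 0.50, 0.77, 0.64])  # same ordering
bench_2_divergent = np.array([0.74, 0.52, 0.70, 0.60, 0.66])   # ordering breaks

print(spearman(bench_1, bench_2_consistent))  # 1.0: the rankings agree
print(spearman(bench_1, bench_2_divergent))   # -0.8: the rankings disagree
```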

Deconstructing the "Sand Castle" Critique of Deep Learning

  • Dennett's critique is that LLMs appear to have structure but will collapse when prodded because they represent every possible structure, which is equivalent to having no structure at all.
  • Paul references a statistical learning theory analysis of Chain-of-Thought (CoT) reasoning. CoT is a technique where a model is prompted to "think step-by-step." The analysis suggests CoT works because it sparsifies the dependency structure of a problem (a schematic version of this factorization appears after this list).
  • This suggests that structure in LLMs is not an illusion but an emergent property of the data and the reasoning process. For Crypto AI, this is highly relevant. Techniques like CoT could be crucial for developing more reliable and verifiable AI agents for on-chain tasks, where computational certainty and sparse dependencies are paramount.
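A schematic way to read the sparsification claim (my own gloss, not the specific statistical-learning-theory analysis referenced in the episode): instead of predicting the answer from the full prompt in one dense conditional, chain-of-thought routes the prediction through intermediate steps, each conditioned only on the previous step, so every conditional the model must learn has far fewer effective dependencies.

```latex
% Without CoT: one dense conditional from the full prompt x to the answer y.
p(y \mid x)

% With CoT: marginalize over intermediate steps z_1, \dots, z_k (with z_0 := x).
% Each factor depends on a single previous step, so the dependency structure
% of every learned conditional is much sparser.
p(y \mid x) \;=\; \sum_{z_1, \dots, z_k} p(y \mid z_k) \prod_{i=1}^{k} p(z_i \mid z_{i-1}),
\qquad z_0 := x
```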

The Pragmatic Role of Category Theory in Machine Learning

  • Paul explains that Category Theory provides a precise "algebra for constructing parametric models." It offers a formal way to describe how different systems can be built and composed, whether through wiring diagrams or other algebraic structures (a minimal code sketch of such a parametric-map algebra follows this list).
  • He emphasizes that his recent work has focused less on direct categorical applications and more on learning the "accumulated wisdom" of the ML world. The true value of Category Theory, he suggests, is in providing a framework to index and compare different architectures, which can then be tested experimentally.
  • Researchers shouldn't expect Category Theory to provide off-the-shelf solutions. Instead, its value lies in providing a rigorous language to define, compare, and generate novel architectures for AI systems, which is especially useful in the complex, compositional world of decentralized AI.
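As a minimal sketch of what an "algebra for constructing parametric models" can mean in code (my own illustration in the spirit of the Para construction used in categorical deep learning, not code from the paper): a parametric map is a function of (parameters, input), and composing two maps pairs up their parameters.

```python
# Minimal "algebra of parametric maps": a morphism is apply(params, x) -> y,
# and composition pairs up the parameters of the two maps being composed.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Para:
    """A parametric map: apply(params, x) -> y."""
    apply: Callable[[Any, Any], Any]

    def __rshift__(self, other: "Para") -> "Para":
        # Sequential composition; the composite's parameters are a pair.
        def composed(params, x):
            p_self, p_other = params
            return other.apply(p_other, self.apply(p_self, x))
        return Para(composed)

# Two toy "layers", each with its own parameters.
affine = Para(lambda p, x: p["w"] * x + p["b"])
scale = Para(lambda p, x: p["s"] * x)

model = affine >> scale                       # composite parametric map
params = ({"w": 2.0, "b": 1.0}, {"s": 3.0})   # parameters compose as a pair
print(model.apply(params, 5.0))               # (2*5 + 1) * 3 = 33.0
```

The toy arithmetic is beside the point; what matters is that composition is itself a lawful operation, which is what gives an "algebra" of architectures something to index and compare.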

The Bombastic Titles of AI Research

  • In pure math, a deep-seated "fear of saying something wrong" leads to cautious, precise titles. In contrast, ML culture, driven by rapid publication cycles and a need to capture attention, often favors provocative titles that imply a grand breakthrough, even for incremental improvements.
  • Paul admits his own paper, "Categorical Deep Learning: An Algebraic Theory of All Architectures," had a title he felt was too "hubristic," but editors pushed for a more forceful, bombastic framing to meet the discipline's standards.
  • This is a direct warning to investors: be highly skeptical of paper titles. A groundbreaking title may mask an incremental, or even flawed, result. The real value is in the paper's substance, not its marketing. Due diligence requires reading past the abstract and title to understand the true contribution.

Unifying Architectures: Transformers as Approximations of RNNs

  • Paul proposes thinking of a causally masked Transformer as a "parallelized approximation only n-layers deep of an RNN." While an RNN applies the same function sequentially, a Transformer performs a finite number of parallel updates.
  • In this view, an RNN remembers the function that moves data around. A Transformer, by contrast, remembers a series of displacements and passes these "bumpings" up the chain (a toy sketch of the contrast follows this list).
  • This perspective provides a unified mental model for reasoning about sequence processing. For researchers, it opens up new ways to think about architectural design, potentially combining the strengths of both parallel and recurrent processing.
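A toy sketch of the contrast (my own illustration, not the formal correspondence discussed in the episode): the RNN threads one state through the sequence, so its effective depth grows with length, while the "Transformer-like" version applies a fixed number of causally masked updates to all positions in parallel.

```python
# Sequential RNN recurrence vs. a fixed number of parallel, causally masked
# mixing layers standing in for a Transformer. Not a faithful attention
# implementation -- just the control-flow contrast.
import numpy as np

rng = np.random.default_rng(0)
T, d, n_layers = 6, 4, 3
x = rng.normal(size=(T, d))

# RNN: the same function applied sequentially, one step at a time.
W, U = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(T):
    h = np.tanh(W @ h + U @ x[t])    # depth grows with sequence length
    rnn_states.append(h)

# "Transformer-like": n_layers parallel updates under a causal mask.
causal = np.tril(np.ones((T, T)))             # position i sees positions <= i
causal /= causal.sum(axis=1, keepdims=True)   # uniform stand-in for attention
H = x.copy()
for _ in range(n_layers):
    V = rng.normal(size=(d, d)) * 0.1
    H = np.tanh(H + causal @ H @ V)           # all positions updated at once

print(np.array(rnn_states).shape, H.shape)    # both (T, d) = (6, 4)
```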

The "Information Squashing" Problem in Large Models

  • GNNs (Graph Neural Networks) are models designed to operate on graph-structured data, treating nodes and edges as computational units; the information-squashing (over-squashing) problem was first characterized in this setting, where signals from distant parts of a graph get compressed through bottlenecks as they propagate.
  • Paul references work from Peter Battaglia's group showing that even with argmax-style hard attention (routing each position to its single most-attended predecessor), the signal gets lost in very long contexts.
  • The proposed solution is to replace the full attention mechanism (where every token can attend to every previous token) with attention wired according to expander graphs: sparse graphs with few edges per node that nonetheless keep every pair of nodes within a short path, which makes them excellent for message passing (a toy sparse-mask sketch follows this list).
  • This is a critical area for Crypto AI. As on-chain data and smart contract interactions become more complex, models will need longer context windows. Research into sparse attention mechanisms and expander graphs is a key area to watch for breakthroughs in building more efficient and capable AI agents.
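As a rough illustration of the sparse-connectivity idea (my own stand-in pattern, not the expander construction from the referenced work): keep only a logarithmic number of well-spread edges per position instead of the full causal fan-in. Genuine expander graphs achieve this with strong guarantees on how quickly information can mix across the graph.

```python
# Stand-in for expander-style sparse attention: each position attends to
# itself and to positions at power-of-two offsets, so every position keeps
# only O(log n) edges yet any two positions are linked by a short hop path.
import numpy as np

def sparse_causal_mask(n):
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, i] = True
        k = 1
        while i - k >= 0:
            mask[i, i - k] = True   # offsets 1, 2, 4, 8, ... into the past
            k *= 2
    return mask

n = 16
dense = np.tril(np.ones((n, n), dtype=bool))
sparse = sparse_causal_mask(n)
print(int(dense.sum()), int(sparse.sum()))  # 136 vs. 65 edges at n = 16;
                                            # the gap grows as n^2 vs. n log n
```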

The Power of Multi-Domain Training

  • A model trained on multiple programming languages consistently performs better than a model trained on just one. The hypothesis is that this forces the model to learn higher-order regularities and strip away superficial syntactic patterns that might be confused for semantics.
  • Paul draws a parallel to training vision transformers: to make them approximately equivariant (consistent under transformations like rotation), the dataset is saturated with augmented data such as rotated and translated images (a minimal augmentation sketch follows this list).
  • When evaluating an AI project, look for models trained on diverse, multi-modal, or multi-domain datasets. This is a strong indicator of a more robust and generalizable model, which is less likely to be brittle or overfit to a narrow task.
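A compact sketch of the augmentation point (my own illustration): saturate a batch with rotated copies so the training signal pushes the model toward representations that do not depend on orientation. Real pipelines also use arbitrary angles, crops, flips, and colour jitter.

```python
# Augmentation-for-invariance: expand an image batch with 90/180/270-degree
# rotated copies (np.rot90 keeps the example dependency-free); the labels are
# duplicated unchanged, so orientation carries no training signal.
import numpy as np

def augment_with_rotations(batch, labels):
    """batch: (N, H, W, C) with H == W; returns batch plus rotated copies."""
    rotated = [np.rot90(batch, k=k, axes=(1, 2)) for k in range(4)]
    aug_batch = np.concatenate(rotated, axis=0)
    aug_labels = np.concatenate([labels] * 4, axis=0)
    return aug_batch, aug_labels

images = np.random.rand(8, 32, 32, 3)
labels = np.arange(8)
aug_images, aug_labels = augment_with_rotations(images, labels)
print(aug_images.shape, aug_labels.shape)  # (32, 32, 32, 3) (32,)
```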

The Walled Garden of Solvable Problems

  • The host shares an anecdote about searching for a solution to a series, only to find it in an old Russian mathematics tome, where it was simply defined by the series itself—no closed-form solution existed.
  • Paul suggests the reason for this pedagogical approach is to build confidence and prevent students from concluding that "math is stupid." By starting in a controlled environment, learners develop the skills and resilience needed to later tackle problems that are "almost always losing."
  • This connects to the concept of "Pavlovian anxiety," where a subject cannot learn if they don't know whether an action will be rewarded or punished.

A New Lens for ML Research: Designing Stochastic Processes

  • Paul describes neural networks as an "algebra for constructing a machine... a fake physics whose energy relaxation process computes the distribution of data we fed it."
  • In this view, the training process (like Stochastic Gradient Descent, or SGD) is analogous to bombarding a complex physical system with energy and observing the resulting distribution of its energy states (a toy relaxation sketch follows this list).
  • The core task of a foundational ML researcher, then, is to discover an efficient language for specifying these parametric models so that their associated stochastic processes compute quickly and effectively.
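A minimal sketch of the "fake physics" picture (my own toy, not Paul's formalism): treat the loss as an energy and take noisy gradient steps; the iterates relax toward low-energy configurations, and the injected noise means they settle into a distribution around the minimum rather than a single point.

```python
# Training as energy relaxation: noisy gradient descent on a simple "energy"
# (fitting a 1-D mean to data) drifts toward low-energy states, and the noise
# spreads the iterates into a distribution around the minimum.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)   # "the data distribution"

def energy_grad(mu, batch):
    # Gradient of the mean squared error between mu and a minibatch.
    return 2.0 * np.mean(mu - batch)

mu, lr, noise_scale = -5.0, 0.1, 0.05
trajectory = []
for step in range(2000):
    batch = rng.choice(data, size=32)     # minibatching injects stochasticity
    mu -= lr * energy_grad(mu, batch)     # descend the energy
    mu += noise_scale * rng.normal()      # extra Langevin-style kick
    trajectory.append(mu)

# After relaxation, the iterates hover around the data mean (about 2.0).
print(np.mean(trajectory[-500:]), np.std(trajectory[-500:]))
```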

This discussion reveals that progress in AI hinges on moving beyond superficial benchmarks to develop deeper, more elegant explanations for why models work. For investors and researchers, the key is to prioritize foundational understanding and architectural robustness over the hype of bombastic claims and narrow performance metrics.
