Machine Learning Street Talk
December 22, 2025

The "Final Boss" of Deep Learning

The "Final Boss" of Deep Learning

Author: Machine Learning Street Talk


This summary is for builders who realize that scaling data alone won't solve the "reasoning gap" in AI. It explains how Category Theory provides the mathematical blueprint to move from alchemical trial-and-error to principled neural architectures.

This episode answers:

  • Why do frontier models fail at second-grade math despite billions of parameters?
  • How can Category Theory turn "weight tying" from a hack into a provable law?
  • Can we build a "CPU" directly into a neural network's weights?

Petar Veličković and Andrew Dudzik of Google DeepMind argue that deep learning is currently in its "alchemy" phase. We have powerful empirical results but lack the fundamental theory to derive new architectures. They propose Categorical Deep Learning as the unifying framework to bridge the gap between neural networks and classical algorithms.

The Top 3 Ideas

THE PERIODIC TABLE
  • Principled Foundations: Current deep learning relies on ad hoc design choices. Categorical Deep Learning provides a formal framework to derive architectures instead of stumbling upon them.
  • Beyond Symmetries: Geometric Deep Learning assumes all transformations are invertible. This new framework handles non-invertible computation like deleting data or finding shortest paths.
  • Structuralist Logic: Synthetic mathematics focuses on relationships rather than internal substance, letting researchers set aside irrelevant internal detail and concentrate on how conclusions are derived.
ALGEBRA WITH COLORS
  • Compositional Constraints: Categories act like magnets with colored ends that only stick when the colors match, mirroring how non-square matrices or typed functions must align their dimensions to compose (see the sketch after this list).
  • Formal Weight Tying: Traditional weight sharing is an empirical trick. Two-categories provide a comprehensive theory to prove when sharing parameters preserves the intended structure.
NEURAL CPUS
  • The Carry Problem: Standard Graph Neural Networks struggle with the "carry" operation in addition because they track states rather than changes. Solving this requires continuous mathematics that mimics discrete logic.
  • Algorithmic Alignment: Pure tool use is a temporary patch. Internalizing algorithmic reasoning makes models more stable and efficient than constantly calling external calculators.
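
As a concrete illustration of the "colored magnets" idea above, here is a minimal NumPy sketch (the `compose` helper and its shape check are illustrative, not from the episode): composing two linear maps is only defined when the output dimension of the first matches the input dimension of the second.

```python
import numpy as np

def compose(g, f):
    """Compose linear maps as g . f, refusing mismatched "colors" (shapes).

    An (m x n) matrix accepts n-dimensional inputs and produces m-dimensional
    outputs, so g . f only makes sense when f's output size equals g's input size.
    """
    if g.shape[1] != f.shape[0]:
        raise TypeError(f"cannot compose: {g.shape} after {f.shape}")
    return g @ f

f = np.random.randn(4, 3)   # a map R^3 -> R^4
g = np.random.randn(2, 4)   # a map R^4 -> R^2
h = compose(g, f)           # fine: R^3 -> R^2
# compose(f, g) would raise TypeError: the "colors" (4 vs. 2) do not match
```

Category Theory takes this bookkeeping, which practitioners already do informally when stacking layers, and promotes it to the organizing principle of the framework.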

Actionable Takeaways

  • The Macro Shift: [Algorithmic Convergence]. The gap between symbolic logic and neural networks is closing through category theory. Expect architectures that are "correct by construction" rather than just "likely correct."
  • The Tactical Edge: [Audit Architecture]. Evaluate new models based on their "algorithmic alignment" rather than just parameter count. Prioritize implementations that bake in non-invertible logic.
  • The Bottom Line: The next year will see a shift from scaling data to scaling structural priors. If you aren't thinking about how your model's architecture mirrors the problem's topology, you are just an alchemist in a world about to discover chemistry.

Current frontier models perform billions of multiplications per token yet fail at basic arithmetic, necessitating a move from empirical pattern matching to a formal "periodic table" of neural architectures rooted in Category Theory.

Chronological Deep Dives

The Algorithmic Failure of Frontier Models

  • Frontier models like GPT-4 and VEO (Google's generative video model) mimic reasoning through pattern recognition but collapse when faced with simple algorithmic tasks. Petar Veličković notes that while these models appear realistic, they lack the precision required for robotics or scientific discovery.
  • LLMs fail at addition once simple "tricks" or familiar surface patterns are removed, indicating they have not internalized the underlying algorithmic procedure.
  • Frontier models perform massive computational overhead (billions of multiplications) but cannot reliably multiply small numbers.
  • External tool use (calculators or Model Context Protocol servers) acts as a patch rather than a structural fix for reasoning.
  • Internalizing computation is essential for efficiency, as constant tool calling creates significant latency and reasoning bottlenecks.

“Even the best tool in the world is not going to save you if you cannot predict the right inputs for that tool.”
Speaker Attribution: Petar Veličković

The Limits of Geometric Deep Learning

  • Geometric Deep Learning (GDL) uses group theory to build Equivariance (a property where transforming an input results in a predictably transformed output) into models. However, GDL assumes all transformations are invertible, which does not hold true for classical algorithms.
  • GDL handles spatial regularities like image rotation or graph permutation by assuming no information is lost (a minimal sketch of permutation equivariance follows this list).
  • Classical algorithms like Dijkstra (a method for finding the shortest path in a graph) are non-invertible because they destroy information during execution.
  • Transformers are inherently permutation equivariant models, which explains their efficiency but also their limitations in non-invertible reasoning.
  • Researchers are moving toward Category Theory to express "post-conditions" and "pre-conditions" that group theory cannot capture.
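
The equivariance property itself is easy to check in code. Below is a minimal sketch (the `message_pass` helper and the random graph are illustrative, not from the episode) showing that sum-aggregation message passing commutes with a relabelling of the nodes, which is exactly the permutation symmetry GDL builds in.

```python
import numpy as np

def message_pass(x, adj):
    """One round of sum-aggregation message passing: h_i = sum_j adj[i, j] * x_j."""
    return adj @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                     # features for 5 nodes
adj = (rng.random((5, 5)) < 0.4).astype(float)  # random adjacency matrix

perm = rng.permutation(5)                       # a relabelling of the nodes
P = np.eye(5)[perm]                             # ... as a permutation matrix

# Equivariance: relabel then message-pass == message-pass then relabel.
lhs = message_pass(P @ x, P @ adj @ P.T)
rhs = P @ message_pass(x, adj)
assert np.allclose(lhs, rhs)
```

The check passes because summation commutes with relabelling; a step like the min-relaxation inside Dijkstra destroys information about the losing candidates, so it cannot be captured by this invertible-symmetry picture alone.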

“Groups, which are the bread and butter of geometric deep learning, might not be enough for aligning to computation.”
Speaker Attribution: Petar Veličković

Category Theory as the Periodic Table of AI

  • Deep learning currently operates like alchemy, relying on empirical results without a unifying framework. Andrew Dudzik argues that Category Theory provides a synthetic mathematical foundation to derive architectures rather than discover them by trial and error.
  • Analytic mathematics focuses on what things are made of, while synthetic mathematics focuses on the principles of inference and relationship.
  • Category Theory uses Morphisms (generalized functions or arrows representing relationships between objects) to describe structure abstractly; a toy composition example follows this list.
  • The framework allows researchers to treat different neural architectures as instances of the same fundamental mathematical laws.
  • This structuralist approach aims to unify the probabilistic, neuroscience, and gradient-based perspectives of AI.
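
A toy illustration of the morphism viewpoint (the `Morphism` class below is a sketch of the general idea, not the formalism from the paper): objects are treated as opaque labels, and all the structure lives in how the arrows between them compose.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Morphism:
    """A generalized arrow: what matters is its domain, codomain, and how it
    composes with other arrows, not what the objects are "made of"."""
    dom: Any
    cod: Any
    fn: Callable[[Any], Any]

    def then(self, other: "Morphism") -> "Morphism":
        # self: A -> B followed by other: B -> C gives A -> C, and is only
        # defined when the intermediate objects agree.
        if self.cod != other.dom:
            raise TypeError(f"cannot compose {self.cod} with {other.dom}")
        return Morphism(self.dom, other.cod, lambda x: other.fn(self.fn(x)))

double    = Morphism("Int", "Int", lambda n: 2 * n)
stringify = Morphism("Int", "Str", str)
shout     = Morphism("Str", "Str", str.upper)

pipeline = double.then(stringify).then(shout)
print(pipeline.fn(21))   # -> "42"
```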

“Categorical deep learning is an attempt to find that periodic table for neural networks.”
Speaker Attribution: Andrew Dudzik

Formalizing Weight Tying with Two-Morphisms

  • Weight tying (sharing parameters across different parts of a network) is a standard practice in RNNs and Transformers that lacks a formal theoretical bridge. Higher category theory provides the language to prove when weight sharing preserves the necessary computational structure.
  • Two-morphisms (relationships between relationships) model the ways different neural network maps relate to one another.
  • Weight tying is formalized as a reparameterization in a two-category of parametric maps (a toy sketch follows this list).
  • This abstraction allows for weight sharing that goes beyond simple copying, enabling complex algebraic relationships between parameters.
  • Higher categories may explain emergent effects where the behavior of a composite system differs from its individual parts.
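
A toy sketch of weight tying as an explicit reparameterization, in plain NumPy (the `unrolled_rnn` and `tie_weights` helpers are illustrative; the two-categorical formalism is far more general). The unrolled network has one parameter slot per step, and a separate map fills every slot from a single shared parameter.

```python
import numpy as np

def unrolled_rnn(xs, per_step_weights, h0):
    """An RNN unrolled over len(xs) steps; each step has its *own* weight slot."""
    h = h0
    for x, W in zip(xs, per_step_weights):
        h = np.tanh(W @ np.concatenate([h, x]))
    return h

def tie_weights(shared_W, num_steps):
    """The reparameterization: a single shared parameter fills every step's slot."""
    return [shared_W] * num_steps

rng = np.random.default_rng(0)
hidden, inp, steps = 4, 3, 6
shared_W = 0.1 * rng.normal(size=(hidden, hidden + inp))
xs = [rng.normal(size=inp) for _ in range(steps)]
h = unrolled_rnn(xs, tie_weights(shared_W, steps), h0=np.zeros(hidden))
```

Treating the map from shared parameters to per-step slots as a first-class object is what lets the theory state when sharing preserves the intended structure, and admits sharings richer than simple copying.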

“Two-morphisms allow us to see this algebraic structure encoded as relationships between the weights.”
Speaker Attribution: Andrew Dudzik

The Carry Problem and Neural CPUs

  • A fundamental gap in Graph Neural Networks (GNNs) is the inability to model a "carry" (the mechanism in addition where a column sum of ten or more spills into the next column). Andrew Dudzik suggests that geometric subtleties in continuous space could finally enable "CPUs in neural networks."
  • Carrying is simple in discrete mathematics but extremely difficult to implement in continuous, gradient-based systems; the contrast is sketched after this list.
  • The Hopf fibration (a decomposition of the 3-sphere into linked circles) provides a potential geometric model for the carrying phenomenon.
  • Current systems are trained to always provide an answer rather than recognizing when a problem exceeds their computational budget.
  • The goal is a system that understands the "effort" required for a task and provides convergence guarantees.
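
To see why carrying is awkward for continuous systems, compare exact ripple-carry addition with a naive differentiable relaxation (both functions below are illustrative; the soft version is not the Hopf-fibration idea discussed in the episode).

```python
import numpy as np

def ripple_carry_add(a_digits, b_digits):
    """Exact base-10 addition, least-significant digit first: the carry is a
    hard, discrete decision (0 or 1) at every position."""
    out, carry = [], 0
    for a, b in zip(a_digits, b_digits):
        s = a + b + carry
        out.append(s % 10)
        carry = s // 10
    return out + [carry]

def soft_carry_add(a_digits, b_digits, temperature=0.1):
    """A continuous relaxation: the carry becomes a sigmoid "probability", so
    gradients flow, but small errors compound across positions."""
    out, carry = [], 0.0
    for a, b in zip(a_digits, b_digits):
        s = a + b + carry
        soft_carry = 1.0 / (1.0 + np.exp(-(s - 9.5) / temperature))
        out.append(s - 10.0 * soft_carry)
        carry = soft_carry
    return out + [carry]

print(ripple_carry_add([9, 9, 1], [1, 0, 0]))  # 199 + 1 = 200 -> [0, 0, 2, 0]
print(soft_carry_add([9, 9, 1], [1, 0, 0]))    # close to [0, 0, 2, 0], but not exact
```

The discrete version is trivially correct; the soft version is differentiable but drifts, and the drift grows with the number of digits, which is the gap the episode's geometric approach aims to close.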

“Are there ways to exploit this type of geometric subtlety to create the phenomenon of carrying and actually properly model algorithmic reasoning?”
Speaker Attribution: Andrew Dudzik

Investor & Researcher Alpha

  • The New Bottleneck: Capital is shifting from "scaling laws" (more data/compute) to "architectural priors." Investors should look for teams building Categorical Deep Learning frameworks that reduce data requirements exponentially by imbuing models with structural logic.
  • System 2 Architectures: Purely autoregressive models are reaching a plateau in reasoning. The next alpha lies in "System 2" systems that integrate neural pattern matching with algorithmic robustness, similar to AlphaGeometry or FunSearch.
  • Obsolete Research: Research focusing on "patching" LLMs with external tools for basic logic is a dead end. The industry is moving toward internalizing these operations through non-invertible Monoids (algebraic structures similar to groups but without required inverses); see the sketch below.
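
For concreteness, here is the kind of non-invertible monoid in question: the (min, +) aggregation behind shortest-path relaxation (the `bellman_ford_step` helper is a standard textbook step, written out here for illustration). min is associative and has an identity (infinity), but no inverse: once candidates are merged, the losers cannot be recovered, which is exactly the structure group-based symmetry cannot express.

```python
import math

def bellman_ford_step(dist, edges):
    """One relaxation round of Bellman-Ford over the (min, +) monoid:
    combine candidate path lengths with +, aggregate them with min."""
    new_dist = dict(dist)
    for u, v, w in edges:
        new_dist[v] = min(new_dist[v], dist[u] + w)
    return new_dist

edges = [("s", "a", 2.0), ("s", "b", 5.0), ("a", "b", 1.0), ("b", "t", 1.0)]
dist = {"s": 0.0, "a": math.inf, "b": math.inf, "t": math.inf}
for _ in range(3):          # |V| - 1 relaxation rounds
    dist = bellman_ford_step(dist, edges)
print(dist)                 # {'s': 0.0, 'a': 2.0, 'b': 3.0, 't': 4.0}
```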

Strategic Conclusion

Category Theory is the necessary bridge to move AI from stochastic parrots to verifiable reasoners. By formalizing non-invertible computation and weight tying, researchers can build architectures that respect the laws of logic. The industry must now prioritize structural synthesis over empirical scaling.
