Professor Yi Ma, a leading expert in deep learning and AI, challenges the prevailing view of artificial intelligence. He argues that current large models primarily excel at compression and memorization, falling short of true understanding and abstraction. His work proposes a mathematical theory of intelligence built on two first principles: parsimony and self-consistency.
Identify the "One Big Thing":
The core argument is that current AI (especially LLMs) primarily excels at compression and memorization of empirical data, but true intelligence, particularly at higher levels (like human scientific reasoning), requires abstraction and understanding. Professor Ma proposes a mathematical theory of intelligence based on two first principles: parsimony (finding the simplest representation) and self-consistency (ensuring the memory can accurately predict the world). This framework offers a principled path to designing more robust and genuinely intelligent AI, moving beyond empirical trial-and-error.
Extract Themes:
1. The Nature of Intelligence: Compression vs. Abstraction
- "The practice of artificial intelligence, the mechanism we have implemented, all the mechanisms behind all the large models, deep networks, and large models are truly are... and hence understand their limitations and also what it takes to truly build a system that has intelligent behaviors or capabilities."
- "Is there a difference between compression and abstraction? Is there a difference between memorizing and understanding? It's kind of similar to when Turing was faced with the question, 'What is computable, what is not computable?'"
2. Parsimony and Self-Consistency as First Principles
- "For this level of intelligence, for how our memory formed and how they work, is precisely the two principles are incredibly important... memory or knowledge is precisely trying to discover what's predictable about the world... finding the most simple representation of the data... that's the word captured by the word parsimony."
- "The second part of the sentence, 'not any simpler,' precisely says consistency. Make sure your memory is actually consistent with being able to recreate, simulate the world just right."
3. Principled Architecture Design and the Future of AI
- "If we believe there's something right, then we should be able to derive [architectures like] Crate from first principle, have a very clear unified understanding... we can dramatically simplify them... you can even throw away the MLP layer if you only care about the compression."
- "The scientific activity, our ability to revise our memory to acquire new memory, that is a generalizable ability. That is intelligence... Not the memory accumulated up to a certain point."
Synthesize Insights:
Theme 1: The Nature of Intelligence: Compression vs. Abstraction
- LLMs as Memorization Engines: Current large language models (LLMs) excel at statistically compressing vast amounts of text data, identifying correlations and low-dimensional structures. This is akin to memorizing a massive library.
- Analogy for Compression: Imagine taking a huge, messy pile of clothes and folding them neatly into a small suitcase. You've compressed the information (the clothes), but you don't necessarily understand how to make new clothes or why certain fabrics behave the way they do.
- The Abstraction Gap: True intelligence, as seen in human scientific discovery (e.g., Euclid's geometry, the concept of infinity), involves abstracting principles that go beyond empirical observation. LLMs struggle with this, as evidenced by their poor performance on abstract compositional reasoning tasks like the ARC challenge.
- Language as a Compressed Code: Natural language itself is a highly compressed code for human knowledge, grounded in physical senses and world models. LLMs are compressing this already compressed data, not necessarily understanding the underlying reality it represents.
- A Turing-Style Open Question: The distinction between memorizing data distributions and truly understanding them remains a fundamental open problem, similar to Turing's question of computability or the P vs. NP problem.
Theme 2: Parsimony and Self-Consistency as First Principles
- Parsimony Defined: Intelligence, at its core, seeks the simplest possible representation of the world that explains the data without losing predictive power. This is Einstein's "make things as simple as possible, but not any simpler."
- Analogy for Parsimony: Instead of remembering every single raindrop on a wet pavement, a parsimonious model would recognize the underlying "wet surface" concept.
- Self-Consistency's Role: Memory must be consistent, allowing accurate prediction and simulation of the world. This acts as a self-correction mechanism, ensuring the learned model remains faithful to reality.
- Lossy Coding and Noise: Lossy coding (allowing some information loss) is crucial for differentiating between low-dimensional models. Noise, surprisingly, plays a vital role in connecting discrete samples into continuous manifolds, enabling the discovery of underlying structures (like "all roads lead to Rome" via diffusion/denoising); a numeric sketch of the coding rate follows this list.
- Intelligence Seeks the Easy Path: Contrary to the idea that intelligence solves the hardest problems, Professor Ma argues it first identifies and solves the easiest problems with minimal effort, reflecting a resource-parsimonious approach (akin to the least action principle in physics).
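To make the lossy-coding point concrete, here is a minimal numeric sketch (my own toy, not code from the episode) of the Gaussian coding rate used throughout Ma's rate reduction work, R(Z) = 1/2 log det(I + d/(n ε²) Z Zᵀ): data concentrated near a low-dimensional subspace costs far fewer nats to encode at distortion ε than unstructured data of the same scale. The dimensions and ε value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def coding_rate(Z, eps=0.5):
    """Lossy coding rate (nats) of the columns of Z (d x n) at distortion eps:
    R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z @ Z.T)."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T))[1]

d, n = 20, 500
# Samples on a 2-D subspace of R^20 vs. unstructured full-rank samples,
# normalized so both have comparable column norms.
low_dim = rng.normal(size=(d, 2)) @ rng.normal(size=(2, n))
low_dim /= np.linalg.norm(low_dim, axis=0, keepdims=True)
full_rank = rng.normal(size=(d, n)) / np.sqrt(d)

print(f"coding rate, low-dim structure: {coding_rate(low_dim):6.1f}")
print(f"coding rate, unstructured:      {coding_rate(full_rank):6.1f}")
```

The gap between the two rates is exactly the kind of signal a parsimony-driven learner can optimize: it rewards representations that collapse data onto low-dimensional structure.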
Theme 3: Principled Architecture Design and the Future of AI
- From Empiricism to First Principles: Current AI architecture development (e.g., Transformers, ResNets) has been largely empirical, a "natural selection" process. Professor Ma's CRATE (Coding Rate Reduction Transformer) framework derives these architectures from the first principles of parsimony and self-consistency.
- Analogy for First Principles Design: Instead of randomly trying different bridge designs until one stands, a first-principles approach uses physics (stress, load-bearing) to deduce the optimal design.
- Benefits of Principled Design: This approach leads to dramatically simpler, more efficient, and explainable architectures. For example, multi-head self-attention can be derived as a gradient step on a coding-rate objective, and MLPs as sparsification operators (a layer sketch follows this list).
- Beyond Overfitting: If an AI operator is fundamentally performing compression, it will inherently not overfit, even with massive parameter counts. This explains phenomena like double descent and the generalization capabilities of large models.
- Continuous and Generalizable Learning: True intelligence is the mechanism for continuously revising and acquiring new knowledge, not the accumulated knowledge itself. This mechanism, grounded in self-correction and low-dimensional world models, is inherently generalizable, supporting lifelong learning.
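As a loose sketch of the "attention compresses, MLP sparsifies" story (a paraphrase of the CRATE idea, not the paper's exact operators; the subspace bases U, step size eta, and threshold lam below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K, p = 64, 128, 4, 16   # token dim, token count, heads/subspaces, subspace dim

# Illustrative per-head orthonormal subspace bases (assumed for this sketch).
U = [np.linalg.qr(rng.normal(size=(d, p)))[0] for _ in range(K)]

def compression_step(Z, eta=0.1):
    """Attention-like update: each head projects tokens onto its subspace,
    mixes them by softmax similarity, and nudges Z toward the result -- a
    gradient-step-flavored move that compresses tokens against K subspaces."""
    out = np.zeros_like(Z)
    for Uk in U:
        P = Uk.T @ Z                                   # p x n projections
        A = P.T @ P / np.sqrt(p)                       # token-token similarity
        A = np.exp(A - A.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)
        out += Uk @ (P @ A.T)                          # aggregate within subspace
    return Z + eta * (out / K - Z)

def sparsify_step(Z, lam=0.2):
    """MLP replacement: one ISTA-style soft-threshold step that pushes the
    token codes toward sparsity (parsimony)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

Z = rng.normal(size=(d, n))
Z = sparsify_step(compression_step(Z))   # one CRATE-flavored layer
print(Z.shape, f"fraction of zeroed entries: {(Z == 0).mean():.0%}")
```

The design point is that every operator has a stated purpose (compress, then sparsify), so the layer is inspectable in a way an empirically tuned block is not.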
Filter for Action:
- For Investors:
- Warning: Be wary of "general intelligence" claims based solely on scaling current LLM architectures. The fundamental shift from compression to abstraction is a deeper challenge.
- Opportunity: Invest in research and companies focusing on principled AI design and mathematical foundations rather than purely empirical scaling. Look for approaches that explicitly address abstraction, causality, and robust world modeling.
- Opportunity: Solutions that dramatically simplify existing complex architectures (like the CRATE-style simplification of DINO) offer significant efficiency gains and could be disruptive.
- For Builders:
- Shift Focus: Move beyond just optimizing for statistical correlation and token prediction. Explore architectures that explicitly encode parsimony and self-consistency.
- Embrace First Principles: Consider deriving architectural components from fundamental mathematical principles rather than relying on empirical trial-and-error. This can lead to more robust, efficient, and explainable systems.
- Rethink "World Models": Current "3D understanding" in AI often amounts to visualization, not true interactive comprehension. Focus on building models that facilitate manipulation, prediction, and spatial reasoning, grounded in structured representations.
- Leverage Noise: Understand the role of noise in connecting data points and enabling the discovery of low-dimensional manifolds, as seen in diffusion models. This isn't just a hack; it's a fundamental aspect of learning.
- Explore CRATE/Rate Reduction: Investigate Professor Ma's open-source CRATE architectures and the rate reduction objective function. These offer a principled path to designing more efficient and interpretable Transformers and other deep learning models.
New Podcast Alert: The Mathematical Foundations of Intelligence [Professor Yi Ma]
By Machine Learning Street Talk
The Abstraction Gap
"Is there a difference between compression and abstraction? Is there a difference between memorizing and understanding? It's kind of similar to when Turing was faced with the question, 'What is computable, what is not computable?'"
- LLMs as Statistical Compressors: Large language models (LLMs) are powerful statistical engines that compress vast datasets, identifying correlations and low-dimensional structures. This process is akin to memorizing an immense library of information.
- Beyond Memorization: Human intelligence, particularly in scientific discovery, moves beyond empirical observation to abstract principles. LLMs struggle with this, often failing at tasks requiring abstract compositional reasoning.
- Language as Code: Natural language itself is a highly compressed code for human knowledge. LLMs are compressing this already compressed data, not necessarily grasping the underlying reality it represents.
Parsimony and Self-Consistency
"For this level of intelligence, for how our memory formed and how they work, is precisely the two principles are incredibly important... memory or knowledge is precisely trying to discover what's predictable about the world... finding the most simple representation of the data... that's the word captured by the word parsimony."
- Parsimony's Core: Intelligence seeks the simplest possible representation of the world that explains observed data without sacrificing predictive power. Think of it as recognizing a "wet surface" rather than remembering every individual raindrop.
- Self-Correction through Consistency: Memory must be self-consistent, enabling accurate prediction and simulation of the world. This feedback loop allows for continuous self-correction and refinement of internal models.
- The Role of Noise: Lossy coding, which allows for some information loss, is essential for differentiating between low-dimensional models. Noise facilitates the connection of discrete samples into continuous manifolds, revealing underlying structures (a toy denoising sketch follows).
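A toy illustration of that last point (my sketch, not from the episode): smear samples of a 1-D manifold with Gaussian noise, then apply one kernel-weighted, mean-shift-style denoising step. The step contracts points back toward the manifold, in the same spirit as the denoisers inside diffusion models; the circle, noise level, and kernel width are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete "memorized" samples on a 1-D manifold (the unit circle in R^2).
theta = rng.uniform(0, 2 * np.pi, 200)
samples = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Noise smears the discrete samples into a continuous distribution.
sigma = 0.15
noisy = samples + sigma * rng.normal(size=samples.shape)

def denoise_step(points, data, sigma):
    """One mean-shift step: move each point toward the kernel-weighted average
    of the data -- an empirical analogue of a single denoising step, which
    contracts points toward the underlying manifold."""
    d2 = ((points[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    return w @ data

denoised = denoise_step(noisy, samples, sigma)

# Mean distance to the circle (radius 1) before and after denoising.
err = lambda p: np.abs(np.linalg.norm(p, axis=1) - 1).mean()
print(f"distance to manifold: noisy {err(noisy):.3f} -> denoised {err(denoised):.3f}")
```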
Principled Architecture Design
"If we believe there's something right, then we should be able to derive [architectures like] Crate from first principle, have a very clear unified understanding... we can dramatically simplify them... you can even throw away the MLP layer if you only care about the compression."
- Beyond Empirical Search: Current AI architecture development has largely been an empirical "natural selection" process. Professor Ma's "Crate" (Coding Rate Reduction Transformer) framework derives architectures from fundamental mathematical principles.
- Efficiency and Explainability: This principled approach yields simpler, more efficient, and explainable architectures. For instance, multi-head self-attention can be derived as a gradient step on a coding-rate objective.
- Generalizable Mechanisms: True intelligence resides in the mechanism for continuously revising and acquiring new knowledge, not in the accumulated knowledge itself. This self-correcting mechanism is inherently generalizable, supporting lifelong learning.
Key Takeaways:
- Strategic Implication: The next frontier in AI involves a fundamental shift from statistical compression to genuine abstraction and understanding.
- Builder/Investor Note: Focus on research and development that grounds AI in first principles, leading to more robust, efficient, and interpretable systems, rather than solely scaling existing empirical architectures.
- The "So What?": The pursuit of mathematically derived, parsimonious, and self-consistent AI architectures offers a path to overcome current limitations, enabling systems that truly learn, adapt, and reason in the next 6-12 months and beyond.
Podcast Link: https://www.youtube.com/watch?v=QWidx8cYVRs

This episode challenges conventional AI understanding, arguing that current large language models primarily perform sophisticated memorization, not true intelligence or abstraction. Professor Yi Ma, a leading expert in deep learning, unveils a mathematical theory of intelligence grounded in parsimony and self-consistency, offering a principled path to next-generation AI.
The Mathematical Foundations of Intelligence
- Professor Yi Ma, inaugural director of the School of Computing and Data Science at the University of Hong Kong, presents a unified mathematical theory of intelligence. This framework, detailed in his book "Learning Deep Representations of Data Distributions," posits that intelligence, both natural and artificial, operates on two core principles: parsimony and self-consistency.
- Parsimony (Compression): Intelligence seeks the simplest, lowest-dimensional representation of data, identifying predictable structures. This process is synonymous with compression, denoising, and dimensionality reduction.
- Self-Consistency: The learned memory or "world model" must accurately recreate and predict the world, ensuring consistency without oversimplification (a linear toy of both principles follows the quote below).
- LLMs as Memorization: Current large language models (LLMs) excel at compressing and memorizing vast amounts of text, treating language as raw signals. Professor Ma argues this is a sophisticated form of empirical knowledge acquisition, distinct from genuine understanding or abstract reasoning.
- “What those large language models are doing is further treating those texts as raw signals, [and] through compression, identifying their internal statistical structures.”
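A linear toy of the two principles working together (my sketch; the dimensions are assumed): parsimony is compressing 20-dimensional data down to the 3 directions that actually carry structure, and self-consistency is the demand that decoding those 3 numbers recreates the original data. In this exactly low-rank case, PCA via the SVD satisfies both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data with hidden low-dimensional structure: 3 latent factors in R^20.
n, D, k = 1000, 20, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, D))
mu = X.mean(axis=0)

# Parsimony: keep only the k directions that explain the data.
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
encode = lambda x: (x - mu) @ Vt[:k].T      # R^20 -> R^3 (compressed memory)
decode = lambda z: z @ Vt[:k] + mu          # R^3  -> R^20 (recreate the world)

# Self-consistency: the compressed memory must reproduce the data.
X_hat = decode(encode(X))
print(f"relative reconstruction error: {np.linalg.norm(X - X_hat) / np.linalg.norm(X):.2e}")
```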
Intelligence: From Evolution to Abstraction
- Intelligence evolves through distinct stages, from phylogenetic (evolutionary) to ontogenetic (individual lifetime) and social accumulation, culminating in scientific abstraction. Each stage employs parsimony and self-consistency, but through different mechanisms.
- Evolutionary Compression: Life compresses knowledge about the world into DNA, passing it through generations via brutal, random mutation and natural selection. This mirrors the empirical, trial-and-error development of early AI models.
- Empirical vs. Abstract Knowledge: Early human knowledge, like traditional medicine or celestial observation, was empirical—gained through passive observation and error correction. A "phase transition" occurred roughly 3,000 years ago, enabling abstraction (e.g., the concept of infinity, Euclidean geometry), which transcends direct empirical observation.
- The Abstraction Gap: Professor Ma questions if current AI, focused on compression and memorization, can bridge the gap to true abstraction—the ability to hypothesize, deduce, and create new knowledge not directly present in the data.
- “Is there a difference between compression and abstraction? Is there a difference between memorizing and understanding?”
Rate Reduction & Benign Optimization Landscapes
- Professor Ma introduces "rate reduction" as a core objective function for intelligence, driving the discovery of low-dimensional structures. This approach reveals surprisingly "benign" optimization landscapes, simplifying complex learning problems (the objective is written out after this list).
- Lossy Coding Necessity: Differentiating between low-dimensional models requires a generalized measure of data volume, leading to the necessity of lossy coding (e.g., using an epsilon ball to connect discrete samples into continuous manifolds).
- Denoising as Compression: Iterative denoising processes, like those in diffusion models, effectively reduce entropy, pushing representations towards lower-dimensional, structured distributions.
- Structured Memory: Memory is not random; it is highly organized (e.g., in the cortex and hippocampus) to facilitate efficient access and prediction. Maximizing rate reduction ensures this structured, organized representation.
- Blessing of Dimensionality: Contrary to traditional non-convex optimization, the landscapes for learning low-dimensional structures in high-dimensional spaces are often highly regular and "benign." This explains why simple algorithms like gradient descent succeed in deep learning, even with vast parameter counts.
- “Intelligence is precisely the ability to identify what is easy to address first, what is easy to learn, what is natural to learn first.”
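For reference, the rate reduction objective at the center of this section, as it appears in Ma and collaborators' MCR² line of work (reproduced from memory; consult the papers for the canonical form). Here Z holds n d-dimensional representations and Π = {Π_j} assigns them to k groups:

```latex
\Delta R(Z;\Pi,\epsilon) =
\underbrace{\frac{1}{2}\log\det\!\Big(I + \frac{d}{n\epsilon^{2}}\,ZZ^{\top}\Big)}_{R(Z,\epsilon):\ \text{rate of the whole set}}
\;-\;
\underbrace{\sum_{j=1}^{k}\frac{\operatorname{tr}(\Pi_j)}{2n}\,
\log\det\!\Big(I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}}\,Z\Pi_j Z^{\top}\Big)}_{R^{c}(Z,\epsilon\mid\Pi):\ \text{rate of the parts}}
```

Maximizing ΔR expands the volume of the whole representation (diversity) while compressing each part (parsimonious, low-dimensional groups), which is the precise sense in which "rate reduction" drives the discovery of structure.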
Principled Architecture Design: CRATE & Simplified DINO
- Professor Ma's work demonstrates that deep learning architectures, including Transformers, can be derived from first principles rather than empirical guesswork. This leads to more efficient, explainable, and performant models.
- CRATE (Coding Rate Reduction Transformer): This architecture derives multi-head self-attention as a gradient step on a coding-rate objective and MLPs as sparsification operators. It offers a principled explanation for the success of Transformers.
- Linear Complexity Attention: By finding a variational form of the rate reduction objective, CRATE achieves linear time complexity for self-attention, a dramatic improvement over the quadratic complexity of standard Transformers (a generic complexity contrast is sketched after this list).
- Simplified DINO: Applying these principles, Ma's team dramatically simplified Meta's state-of-the-art DINO (self-supervised visual representation model). The "Simplified DINO" architecture is 10x simpler, requires fewer hyperparameters, trains more efficiently, and achieves better performance, with explainable internal structures.
- “If we believe there's something right, then we should be able to derive CRATE from first principle, have a very clear unified understanding.”
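CRATE's linear-complexity attention comes from a specific variational form of the rate reduction objective; the construction is in the papers. As a generic illustration of why avoiding the n × n score matrix changes the complexity class (standard kernelized linear attention, not Ma's operator; the feature map phi is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

def softmax_attention(Q, K, V):
    """Standard attention: materializes the full n x n score matrix, O(n^2 d)."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: with scores phi(Q) @ phi(K).T, the n x n matrix
    is never formed -- d x d summary statistics give O(n d^2) cost."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V               # d x d summary, independent of n
    z = Qp @ Kp.sum(axis=0)     # per-query normalizer
    return (Qp @ KV) / z[:, None]

print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Same interface, different cost curve: doubling n quadruples the work for the softmax version but only doubles it for the kernelized one.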
Investor & Researcher Alpha
- Capital Reallocation: The shift from empirical, trial-and-error AI development to principled, mathematically derived architectures (like CRATE and Simplified DINO) suggests a future where R&D capital moves from brute-force model scaling to foundational theoretical work. Investments in teams focusing on first-principles AI design could yield disproportionate returns.
- New Bottleneck: Abstraction: The core challenge for next-generation AI is not more data or compute for compression, but the mechanism for true abstraction. Research into formalizing and implementing deductive reasoning, beyond statistical correlation, represents the next frontier for breakthrough AI.
- Obsolete Research Directions: Blindly optimizing existing Transformer architectures or pursuing "general intelligence" solely through accumulating more empirical knowledge (data, parameters) becomes increasingly inefficient. Research focused on explaining why certain architectures work, and then deriving new ones from those principles, will supersede purely empirical architectural search.
Strategic Conclusion
Professor Yi Ma's mathematical theory of intelligence provides a unified framework for understanding and building AI. By grounding intelligence in parsimony and self-consistency, the industry can move beyond empirical guesswork to design dramatically simpler, more efficient, and truly intelligent systems capable of abstraction. The next step for AI is to formalize and implement the mechanisms of deductive reasoning.