Machine Learning Street Talk
December 13, 2025

The Mathematical Foundations of Intelligence [Professor Yi Ma]

This podcast features Professor Yi Ma, a leading expert in deep learning and AI, discussing his book, "Learning Deep Representations of Data Distributions," which proposes a mathematical theory of intelligence based on parsimony and self-consistency. He explores the principles underlying deep networks and their limitations, while offering a new perspective on the nature of intelligence.

Principles of Intelligence: Parsimony and Self-Consistency

  • “Intelligence, artificial or natural, at a certain level, needs to be clarified scientifically or mathematically, so that we'll be able to talk about and study the mechanism behind it at each level.”
  • Intelligence, at its core, can be understood through the dual principles of parsimony (simplicity) and self-consistency.
  • Parsimony refers to finding the simplest representation of data, which is achieved through compression, denoising, and dimensionality reduction.
  • Self-consistency ensures that the derived representation accurately recreates and simulates the world, maintaining predictability and preventing oversimplification.

Intelligence as Compression

  • “The process to acquire knowledge to gain information about our outside world, that's a compression, find what is compressible, what has orders, what phenomena has orders.”
  • Intelligence, at a foundational level, involves identifying compressible patterns and structures in the world to enable prediction and decision-making.
  • Language models, while compressing data, may only do so superficially and fail to deeply understand or abstract knowledge in the same way humans do.
  • The mechanism of intelligence is similar to how life evolves through DNA, compressing what has been learned about the world and passing it on to the next generation.

Abstraction vs. Memorization

  • “Abstraction is definitely related to compression, but there seems to be something different, something more... Is there a difference between memorizing those data distribution than understanding it?”
  • There’s a critical distinction between memorizing data distributions through compression and truly understanding the underlying principles.
  • Science relies on abstraction: it hypothesizes and tests in order to formalize knowledge beyond empirical observation.
  • Current AI models excel at memorization and emulation but struggle with abstract compositional reasoning, which requires deeper, more structured knowledge representations.

The Role of Noise and Lossy Coding

  • "Adding noise to the data is precisely we're building roads, and denoising brings us back to remember where we come from... Noise is very important to help connect the dots."
  • Noise and lossy coding play essential roles in discovering the underlying structure of data. Adding noise helps "build roads" to the data distribution, while denoising retraces those paths back to the source.
  • Allowing lossy coding connects disparate data points, helping to bridge the gaps between isolated samples. Noise enables finite samples to be viewed as a continuum, helping define lines, planes and surfaces.
  • Iterative denoising serves as a form of compression, revealing the essential structure by eliminating irrelevant details.

Key Takeaways:

  • Embrace Parsimony and Self-Consistency: Adopt these principles as guiding forces in AI design. Build models that not only compress data efficiently but also maintain a high degree of self-consistency to ensure accurate and reliable world models.
  • Focus on Abstraction, Not Just Memorization: Prioritize developing systems that can abstract knowledge beyond mere memorization. Move beyond surface-level compression and aim for models that can discover and reason about the underlying principles of the world.
  • Understand and Reproduce the Brain’s Mechanisms: Focus on understanding and reproducing the mechanisms in the human brain that enable deductive reasoning, logical thinking, and the creation of new scientific theories to truly push AI to the next level.

For further insights, watch the full podcast here: Link

This episode unpacks Professor Yi Ma's mathematical theory of intelligence, revealing how parsimony and self-consistency underpin both natural and artificial learning, and challenging current AI paradigms.

Introduction to Professor Yi Ma and His Work

  • The host introduces Professor Yi Ma, a world-leading expert in deep learning and artificial intelligence, currently the inaugural director of the School of Computing and Data Science at the University of Hong Kong. Professor Ma's foundational work on sparse representation and low-rank structures has profoundly influenced modern computer vision and machine learning. His recently published book, "Learning Deep Representations of Data Distributions," proposes a mathematical theory of intelligence grounded in two principles, parsimony and self-consistency, which has led to "white-box" transformers (the CRATE architectures) derived from first principles.
  • Professor Ma explains his motivation for the book: to provide a principled approach to understanding deep networks and the underlying mechanisms of intelligence, moving beyond empirical guesswork.
  • He highlights the rapid progress in AI over the last decade and the necessity for a systematic organization of this knowledge, which he now teaches in a new course.

Defining Intelligence: Parsimony and Self-Consistency

  • Professor Ma emphasizes the need for a scientific and mathematical clarification of "intelligence," acknowledging its loaded nature and different levels. He focuses on a fundamental level common to animals and humans: how memory, or a "world model," is formed, evolves, and is used for prediction and decision-making. This level of intelligence, crucial for survival, is governed by two principles.
  • Parsimony: The pursuit of the simplest representation of data, discovering what is predictable and has low-dimensional structures. This involves compression, denoising, and dimension reduction, echoing Einstein's principle of making things "as simple as possible, but not any simpler." (A concrete coding-rate measure of this kind of simplicity is sketched after this list.)
  • Self-Consistency: Ensuring that the learned memory can accurately recreate and simulate the world. If a model is too simple, it loses predictive ability; consistency ensures the memory is robust and accurate.
  • “Memory or knowledge is precisely trying to discover what's predictable about the world.”
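
To make the parsimony principle concrete: in Professor Ma's line of work (the lossy coding and rate reduction framework the book builds on), the simplicity of a set of learned features Z = [z_1, ..., z_n] in R^d is measured by a lossy coding rate. The form below is a sketch of that measure as it commonly appears in this literature, with \varepsilon the allowed distortion; treat the exact constants as indicative rather than authoritative.

      R(Z, \varepsilon) \;=\; \tfrac{1}{2}\,\log\det\!\left( I + \tfrac{d}{n\varepsilon^{2}}\, Z Z^{\top} \right)

Roughly, R counts the bits needed to encode the features up to precision \varepsilon; a smaller rate means a simpler, lower-dimensional representation, which is exactly what compression, denoising, and dimension reduction are driving toward.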

Compression, Evolution, and Language Models

  • The discussion delves into compression as the fundamental process of acquiring knowledge, identifying order and low-dimensional structures in phenomena to improve prediction. Professor Ma draws an analogy between the evolution of life and the development of AI models.
  • Biological evolution compresses learned knowledge into DNA, passed through generations via brutal mechanisms like random mutation and natural selection. This mirrors the empirical, trial-and-error approach seen in the development of current large AI models.
  • Professor Ma suggests that current AI is at an "early stage of life form," primarily focused on this compression process.
  • He distinguishes between natural language (a result of compression of physical senses and world models over billions of years) and Large Language Models (LLMs) that reprocess natural text as raw signals. LLMs primarily memorize and regenerate text by identifying statistical structures, which is not equivalent to the deep, grounded understanding humans derive from physical senses.
  • “Our language is precisely trying to describe that—it’s abstraction of that world model we have in our brain.”

Levels of Intelligence: Empirical vs. Abstract

  • Professor Ma outlines four stages of intelligence: phylogenetic (evolutionary), ontogenetic (individual lifetime), social (collective knowledge), and scientific (abstract reasoning). All stages share the common goal of extracting and recording structures from data through compression, denoising, and dimensionality reduction. However, the mechanisms for acquiring and updating this information differ significantly.
  • Early stages (animals, humans before science) primarily rely on empirical, passive observation and learning from mistakes, similar to how traditional medicine or early astronomical observations accumulated knowledge.
  • A "phase transition" occurred around 3,000 years ago, enabling abstraction—the ability to develop knowledge beyond empirical observation, such as the concept of infinite numbers or Euclidean geometry's parallel postulate.
  • This raises critical open questions: What is the difference between compression and abstraction? Is there a distinction between memorizing (reproducing data distribution) and understanding (mastering underlying logic)? Professor Ma believes this "true artificial intelligence" is what the 1956 Dartmouth workshop pioneers envisioned, but current LLMs are still largely operating at the level of empirical memory formation.

The Role of Noise and Lossy Coding

  • Professor Ma explains the necessity of "lossy coding" and the nuanced roles of noise in understanding data distributions. He clarifies that noise is not merely a technical hack but a fundamental component.
  • Diffusion Denoising: Adding noise (like building roads from Rome) allows exploration of the entire data space. Denoising then guides the way back to the underlying low-dimensional structure (Rome), which represents the core knowledge. (A minimal numerical sketch of this follows the list.)
  • Connecting Dots: Noise helps connect finite data samples into continuous structures like lines or planes. This "percolation" phenomenon suggests that at a certain density, a connected, low-dimensional manifold becomes a more parsimonious explanation than isolated points.
  • This understanding of noise and lossy coding has dramatically advanced the ability to pursue low-dimensional structures from finite samples, a problem that baffled Professor Ma as a graduate student.
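
A minimal numerical sketch of the "noise builds roads, denoising retraces them" picture (the toy data, names, and noise schedule below are illustrative assumptions, not taken from the episode or the book): under a Gaussian-smoothed empirical distribution, the optimal denoiser is simply a softmax-weighted average of the training samples, and iterating it with a shrinking noise level pulls a badly corrupted point back onto the low-dimensional structure the samples trace out.

      import numpy as np

      def denoise_step(x_noisy, data, sigma):
          # Posterior mean E[x | x_noisy] under the Gaussian-smoothed empirical
          # distribution of `data` (the Tweedie-style optimal denoiser).
          d2 = ((data - x_noisy) ** 2).sum(axis=1)
          w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))  # stable softmax weights
          w /= w.sum()
          return w @ data                                   # weighted average of samples

      rng = np.random.default_rng(0)
      # Toy "data manifold": 500 samples near a 1-D line embedded in 50-D space.
      t = rng.uniform(-1, 1, size=(500, 1))
      direction = rng.standard_normal((1, 50))
      data = t @ direction + 0.01 * rng.standard_normal((500, 50))

      x0 = data[0] + rng.standard_normal(50)      # heavily corrupted sample
      x = x0.copy()
      for sigma in [1.0, 0.5, 0.25, 0.1, 0.05]:   # coarse-to-fine noise schedule
          x = denoise_step(x, data, sigma)

      nearest = lambda p: np.linalg.norm(data - p, axis=1).min()
      print(f"distance to data: before {nearest(x0):.3f}, after {nearest(x):.3f}")

The shrinking schedule plays the role of the "roads": large noise levels connect far-apart samples into one continuum, while successive denoising steps retrace the path back to the structure they share.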

Benign Landscapes and the Blessing of Dimensionality

  • Professor Ma discusses how the optimization landscapes for problems involving low-dimensional structures (like sparsity or low-rank matrices) are surprisingly "benign," contrary to traditional non-convex optimization theory.
  • These objective functions, arising from natural structures, exhibit high regularity and symmetry: there are no flat plateaus where gradients stagnate and no spurious local minima to get trapped in.
  • “The higher the dimension, the better. We call it a blessing of dimensionality.” This explains why simple algorithms like gradient descent can effectively find good solutions in high-dimensional spaces, even for complex deep networks. (A standard example of such a benign landscape is sketched after this list.)
  • Professor Ma argues that intelligence is not about solving the hardest problems but about identifying and exploiting what is easy to learn first, aligning with the principle of parsimony and the "least action principle" in physics.
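
One standard example of such a benign landscape from the optimization literature (offered here as an illustration; it was not spelled out in the episode): for a symmetric positive semidefinite matrix M of rank at most r, the low-rank factorization objective

      f(U) \;=\; \tfrac{1}{2}\,\lVert U U^{\top} - M \rVert_{F}^{2}, \qquad U \in \mathbb{R}^{n \times r}

is non-convex, yet every local minimum is a global minimum and every saddle point has a direction of strictly negative curvature, so gradient descent from random initialization does not get trapped at a suboptimal point. This is the kind of regularity Professor Ma is pointing to when he says natural low-dimensional structures give rise to easy optimization problems.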

Overfitting and Inductive Biases

  • Professor Ma challenges the conventional understanding of overfitting in deep learning, arguing that if neural networks are fundamentally performing compression, they should not overfit.
  • He posits that if an operator consistently shrinks a solution towards an underlying low-dimensional structure (e.g., a line), it will generalize, regardless of overparameterization. This is akin to PCA's power iteration, which converges to the principal component without overfitting (a small numerical sketch of this appears after the list).
  • He redefines "inductive bias" not as empirical additions but as fundamental first principles or initial assumptions (e.g., data distribution is low-dimensional) from which network architectures and operations are deduced.
  • For example, if the task requires translational invariance, convolution naturally emerges as the compression operator, rather than being an imposed inductive bias. This approach aims to eliminate trial-and-error in theory building.
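
A small sketch of the power-iteration analogy above (the toy data and numbers are illustrative assumptions): repeatedly applying the sample covariance and renormalizing keeps shrinking whatever component of the vector is not aligned with the dominant direction, so the iteration converges to the principal component no matter how large the ambient dimension is. That is the sense in which an operator that always contracts toward a low-dimensional structure does not overfit.

      import numpy as np

      rng = np.random.default_rng(0)
      n, d = 200, 100
      # Data whose population structure is essentially one-dimensional plus small noise.
      direction = rng.standard_normal(d)
      direction /= np.linalg.norm(direction)
      X = rng.standard_normal((n, 1)) * direction + 0.1 * rng.standard_normal((n, d))
      C = X.T @ X / n                      # sample covariance operator

      # Power iteration: each application of C shrinks the part of v orthogonal
      # to the dominant eigenvector relative to the part along it.
      v = rng.standard_normal(d)
      for _ in range(100):
          v = C @ v
          v /= np.linalg.norm(v)

      print("alignment with the true low-dimensional direction:", abs(v @ direction))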

Self-Consistency and Continuous Learning

  • Professor Ma elaborates on the self-consistency principle, emphasizing the crucial role of decoding (prediction/reconstruction) in verifying and improving memory.
  • Memory formation is an encoding process; decoding allows the system to predict future states or reconstruct observations, checking for errors.
  • Unlike traditional autoencoders with external error measurement, natural intelligence (animals, humans) must self-correct internally. This "closed-loop learning" involves constantly comparing internal predictions with observations through the same sensing channels (sketched after this list).
  • This self-correction is possible if the external world's data distribution is sufficiently low-dimensional, allowing the brain to discern differences. This mechanism supports continuous and lifelong learning, constantly revising and improving memory.
  • “The scientific activity, our ability to revise our memory to acquire new memory, that is a generalizable ability. That is intelligence.” He argues against the concept of "general intelligence" as accumulated knowledge, emphasizing that the mechanism of learning and adaptation is what is truly generalizable.
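
In the spirit of this closed-loop picture (a sketch, not the precise formulation in the book): with an encoder f playing the role of sensing and memory formation and a decoder g playing the role of prediction or simulation, the error is measured in representation space, through the same encoder, rather than against the raw world directly:

      Z = f(X), \qquad \hat{X} = g(Z), \qquad \hat{Z} = f(\hat{X}), \qquad \mathcal{L}_{\text{closed loop}} = d\big(Z,\ \hat{Z}\big)

where d is some discrepancy between the two sets of features (in Professor Ma's work, a rate-reduction-based measure). Because the comparison passes through the system's own sensing channel, the system can keep correcting itself without an external teacher, which is what makes the continuous, lifelong revision of memory described above possible.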

Crate Architectures and Transformer Derivation

  • Professor Ma discusses how his first-principles approach can derive and explain successful deep learning architectures, moving beyond empirical discovery.
  • ResNet: Its layer-by-layer, residual structure mirrors an iterative optimization process, with each layer performing an incremental step of compression toward low-dimensional structure.
  • Transformers: The self-attention mechanism, which computes correlations, aligns with the goal of identifying and organizing data distributions.
  • The "Crate" (Coding Rate Reduction Transformer) architecture demonstrates that multi-head self-attention can be derived as a gradient step on rate coding, and MLPs as sparsification operators.
  • This principled derivation leads to significant improvements, such as ToST (the Token Statistics Transformer), which achieves linear time complexity for self-attention, a dramatic improvement over the quadratic complexity of current transformers. The speedup comes from finding equivalent variational forms of the rate reduction objective, which make the optimization far more efficient.
  • “The search will no longer be random, [it] will actually be guided.”
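
A rough sketch of the objective behind this derivation, reusing the coding rate R introduced earlier (the notation follows the CRATE line of work only loosely and should be read as indicative): each layer is constructed as one incremental optimization step on a sparse rate reduction objective of the form

      \max_{Z}\;\; R(Z, \varepsilon) \;-\; R^{c}\big(Z, \varepsilon \mid U_{[K]}\big) \;-\; \lambda\,\lVert Z \rVert_{0}

where R^{c} measures the coding rate of the features against K learned subspace structures and the last term promotes sparsity. A gradient-style step on the compression term yields a multi-head attention-like operator, and a proximal (ISTA-style) step on the sparsification term yields the MLP-like block, so the transformer layer falls out of the objective instead of being hand-designed.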

Real-World Impact and Future Directions

  • Professor Ma highlights the practical implications of his principled approach, including simplifying and improving existing state-of-the-art models.
  • SimDINO (Simplified DINO): His group dramatically simplified Meta's DINO, a leading self-supervised visual representation model, cutting hyperparameters and complexity roughly tenfold while improving performance. This work has garnered attention from Meta and Google, who are now scaling up these new architectures.
  • ViT Comparison: CRATE architectures are already on par with Vision Transformers (ViTs) in performance while offering principled design and explainability. CRATE models learn semantically, statistically, and geometrically meaningful internal structures, where each attention head becomes an expert for specific visual patterns (e.g., animal legs, faces), a clarity not observed in standard ViTs.
  • Actionable Insight for Crypto AI Investors/Researchers: The shift from empirical AI development to principled, mathematically derived architectures promises more efficient, explainable, and scalable models. This could lead to breakthroughs in on-chain AI, verifiable computation (zkML), and resource-constrained environments, making these architectures a critical area for investment and research.
  • Professor Ma encourages ML engineers and researchers to explore his open-source code on GitHub (for CRATE, ToST, and SimDINO) and his book for a systematic understanding of the methodology.

Reflective and Strategic Conclusion

  • Professor Ma's framework shifts AI from empirical to principled design, highlighting compression and self-consistency as core to intelligence. Crypto AI investors and researchers should prioritize architectures derived from first principles. These promise greater efficiency, explainability, and scalability, crucial for decentralized and verifiable AI, unlocking new opportunities in resource-constrained and trust-minimized environments.
