Machine Learning Street Talk
September 19, 2025

Top AI Expert Reveals Best Deep Learning Strategies

NYU Professor Andrew Wilson dismantles deep learning’s biggest dogmas, arguing that our intuitions about model complexity, overfitting, and generalization are fundamentally broken. The path forward isn't smaller, constrained models, but bigger, more expressive ones with the right kind of bias.

The Paradox of Scale: Bigger is Simpler

  • "Larger models are often more inclined towards simple solutions. And there have been demonstrations of this hiding in plain sight like double descent... the only possible way that larger models could be generalizing better is because they have some other sort of bias like a simplicity bias."
  • "It's completely fine to build a huge model that will also have a stronger bias for simple solutions... these sorts of perspectives actually help us understand phenomena that are often seen as very mysterious like double descent and benign overfitting."
  • Contrary to popular belief, increasing a model’s size often enhances its preference for simple solutions. This "simplicity bias" is a key driver of why scaling up works so well.
  • The "double descent" phenomenon is a prime example. As models become large enough to fit the training data perfectly, generalization error, after initially rising, starts to decrease again. This second descent isn't explained by flexibility alone; it is driven by the stronger simplicity bias of larger models (see the sketch after this list).
  • Parameter counting is a poor measure of complexity. The real magic lies in the model's induced preferences, with larger models finding "flatter," more compressible solutions in the loss landscape.
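
The double-descent curve is easy to reproduce in miniature. The sketch below is our own illustration, not code from the episode: it fits minimum-norm random-feature regressions of increasing width, and test error typically peaks near the interpolation threshold (width ≈ number of training points) before falling again at much larger widths, where the minimum-norm solution is effectively simpler.

```python
# Toy double descent: minimum-norm regression on random ReLU features.
# All names and hyperparameters are illustrative choices, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 500

def target(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + 0.1 * rng.standard_normal(n_train)
x_te = rng.uniform(-1, 1, n_test)

def features(x, W, b):
    # Random ReLU features: phi(x) = max(0, x * W + b)
    return np.maximum(0.0, np.outer(x, W) + b)

for width in [5, 20, 40, 80, 400, 2000]:  # "model size"
    W, b = rng.standard_normal(width), rng.standard_normal(width)
    # pinv returns the minimum-norm interpolating solution once width > n_train.
    w = np.linalg.pinv(features(x_tr, W, b)) @ y_tr
    mse = np.mean((features(x_te, W, b) @ w - target(x_te)) ** 2)
    print(f"width={width:5d}  test MSE={mse:.4f}")
```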

Rethinking Overfitting and the Bias-Variance "Myth"

  • "I think the bias-variance trade-off is an incredible misnomer. There doesn't actually have to be a trade-off... building large neural nets are another way of getting both low bias and low variance; you actually have flexibility combined with a simplicity bias."
  • The classical bias-variance trade-off doesn't hold in the deep learning paradigm. Instead of trading one for the other, large models can achieve both low bias (high expressiveness) and low variance (strong generalization) simultaneously.
  • The standard prescription for overfitting—"build a smaller model"—is often the opposite of the correct approach. Making models bigger can actually alleviate overfitting by strengthening their inherent simplicity bias.
  • Instead of hard-coding constraints, models should combine expressiveness with "soft" biases. This approach is more adaptive, performing well on both small and large datasets without needing manual intervention.

Bayesian Inference: The Ultimate Occam's Razor

  • "Being Bayesian means that you want to represent the honest belief that you have uncertainty over what solution is correct given a finite data sample... the sum and product rules of probability say you should be doing marginalization."
  • Bayesian methods are most critical in deep learning, where expressive models can represent countless solutions for a given dataset. Ignoring this uncertainty is not an honest representation of our beliefs.
  • The core of Bayesian inference, marginalization, serves as an automatic and principled Occam's Razor. By averaging over all possible solutions, it naturally favors simpler, more robust explanations without needing complex regularizers (formalized in the equation after this list).
  • Techniques like deep ensembles, often seen as non-Bayesian, are actually powerful, practical approximations of Bayesian marginalization, outperforming many methods explicitly labeled as "Bayesian."
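
In standard notation, the marginalization described above is the posterior predictive distribution (our formalization, not an equation quoted in the episode):

```latex
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw
\approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, w_m), \qquad w_m \sim p(w \mid \mathcal{D})
```

A deep ensemble approximates the sum using the distinct modes found by independently trained networks rather than exact posterior samples.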

Key Takeaways:

  • When building models, the guiding philosophy should be to honestly represent your beliefs: the world is complex (embrace expressiveness), but simple explanations are more likely (embrace simplicity bias).
  • Stop Fearing Parameters. When in doubt, go bigger. Scale is not just about capacity; it’s a tool for inducing a powerful simplicity bias that improves generalization and paradoxically reduces overfitting.
  • Trade Hard Constraints for Soft Biases. Instead of rigidly constraining your model architecture, use gentle encouragements. An expressive model with a soft simplicity bias will find the simple solution if the data supports it, while retaining the flexibility to capture true complexity.
  • Think Like a Bayesian. Even if you don't run complex MCMC, adopt the core principle of marginalization. Techniques like ensembling or stochastic weight averaging approximate the benefits of considering multiple solutions, leading to more robust and generalizable models (a minimal SWA sketch follows).
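
For concreteness, stochastic weight averaging ships with PyTorch. A minimal sketch, assuming an existing `model`, `optimizer`, `loader`, and `loss_fn` (all placeholders):

```python
# Minimal SWA loop using PyTorch's built-in utilities.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

swa_model = AveragedModel(model)               # running average of weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant SWA learning rate

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= 75:                            # start averaging late in training
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)                   # recompute BatchNorm statistics
```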

For further insights and detailed discussions, watch the full podcast: Link

This episode dismantles the conventional wisdom of deep learning, revealing why bigger models can be simpler and how this counterintuitive insight into 'simplicity bias' is reshaping AI investment and research strategies.

Challenging the Mysteries of Deep Learning

  • Key Misconception: Many believe large models are inherently complex and prone to overfitting. Wilson argues the opposite can be true.
  • Core Insight: Large, overparameterized models can possess a stronger inductive bias—an inherent preference for certain types of solutions—towards simplicity. This is a form of Occam's Razor, the principle that simpler explanations are generally better.
  • Actionable Insight: Investors should not dismiss projects using massive models on relatively small datasets. This approach, often seen as inefficient, may be a deliberate strategy to leverage the simplicity bias that emerges at scale, leading to better generalization.

The Flaw in Conventional Model Selection

  • Parameter Counting is Misleading: Wilson asserts that the number of parameters is a poor proxy for a model's functional complexity. A model can have millions of parameters but be strongly biased towards simple functions.
  • The Power of Soft Constraints: Instead of imposing hard constraints (e.g., using a small model), Wilson advocates for highly expressive models with soft biases. These models can represent complex solutions but have a strong preference for simple ones, making them more adaptive to varying amounts of data without manual intervention (a toy comparison follows after this list).
  • Quote: "Parameter counting is a very bad proxy for model complexity. Really, what we care about is the properties of this sort of induced distribution over functions rather than just how many parameters the model happens to have."
  • Strategic Implication: Researchers should explore architectures that combine high expressiveness with strong, tunable simplicity biases. This is a move away from rigid, problem-specific models toward more universal, self-regulating systems.
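
The contrast is easy to see in a toy problem. The sketch below (our illustration; the degrees and penalty strength are arbitrary choices) compares a hard constraint, a small degree-3 polynomial, against a soft bias, a degree-30 polynomial shrunk toward simple fits by a ridge penalty:

```python
# Hard constraint (small model) vs. soft bias (big model + shrinkage).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)
x_te = np.linspace(-1, 1, 200)
y_te = np.sin(3 * x_te)

def design(x, degree):
    return np.vander(x, degree + 1, increasing=True)

# Hard constraint: restrict the hypothesis space outright.
w_hard = np.linalg.lstsq(design(x, 3), y, rcond=None)[0]

# Soft bias: keep the expressive model, penalize complexity via ridge.
Phi = design(x, 30)
lam = 1e-3
w_soft = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

for name, w, d in [("hard (deg 3)", w_hard, 3), ("soft (deg 30 + ridge)", w_soft, 30)]:
    print(f"{name:22s} test MSE = {np.mean((design(x_te, d) @ w - y_te) ** 2):.4f}")
```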

The Unsolved Question: Where Does Simplicity Bias Come From?

  • Current Intuition: The effect is linked to the geometry of the loss landscape, the high-dimensional surface representing the model's error. Larger models appear to have exponentially more "flat" regions of low error, which correspond to more compressible, simpler solutions (a crude flatness probe is sketched after this list).
  • Ongoing Research: Wilson identifies this as one of the most important areas of current research: rigorously understanding why scale produces this bias and finding more elegant ways to achieve it without simply building bigger models.
  • Investor Takeaway: The race for computational scale is not just about processing more data. It is fundamentally a search for models with stronger, more effective inductive biases. Projects that can articulate a clear research program for achieving simplicity bias more efficiently than brute-force scaling hold a significant long-term advantage.
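
One crude way to probe this geometry is to measure how much the loss rises under random weight perturbations: in a flat region it barely moves. A hypothetical helper, assuming an existing `model`, `loss_fn`, and data batch `(x, y)`:

```python
# Crude flatness probe: average loss increase under Gaussian weight noise.
import copy
import torch

def sharpness(model, loss_fn, x, y, sigma=0.01, trials=10):
    with torch.no_grad():
        base = loss_fn(model(x), y).item()
        deltas = []
        for _ in range(trials):
            noisy = copy.deepcopy(model)
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))
            deltas.append(loss_fn(noisy(x), y).item() - base)
    return sum(deltas) / trials  # near zero suggests a flat, compressible basin
```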

A Scientific Approach to AI Research

  • The Value of Understanding: He emphasizes that understanding why a model works is more valuable than the model itself. This knowledge is timeless, whereas specific methods quickly become obsolete.
  • Connecting Theory and Practice: Wilson's work combines classical theory with empirical validation, believing that true understanding must be demonstrable. He notes that even low-level engineering details, like numerical stability, can lead to high-level insights about model construction.
  • Research Focus: His work centers on key principles for building intelligent systems:
    • Inductive Biases: What assumptions should models make? This includes symmetries like equivariance, where a transformation of the input results in a corresponding transformation of the output (e.g., translating an image shifts the features in a convolutional neural network; see the demo after this list).
    • Uncertainty Representation: How can models quantify their confidence? He argues that predictions without error bars are not actionable in the real world and champions Bayesian methods as a principled framework for reasoning about uncertainty.
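
Translation equivariance in convolutions can be verified in a few lines. A minimal demo (ours; circular padding is used so the identity holds exactly rather than only away from image borders):

```python
# Shifting the input shifts the features: conv(shift(x)) == shift(conv(x)).
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 32, 32)
shift = lambda t: torch.roll(t, shifts=5, dims=-1)  # cyclic horizontal shift

print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))  # True
```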

Geometric Deep Learning and the Case for Soft Constraints

  • The Limits of Hard Constraints: In the real world, perfect symmetries are rare. Physical systems are rarely closed, and data is often noisy. Forcing a model to adhere to a strict constraint can be a dishonest representation of reality.
  • The Advantage of Soft Biases: A model with a "soft constraint" is biased towards a symmetry but not forced into it. If the data perfectly aligns with the symmetry, the model will naturally converge to that solution. If not, it retains the flexibility to find a better, non-symmetric solution (a minimal penalty sketch follows after this list).
  • Key Finding: Wilson's research shows that larger models can simultaneously increase expressiveness and strengthen their simplicity bias. This is demonstrated by double descent, a phenomenon where generalization error first worsens as a model becomes more complex (overfitting) and then surprisingly improves again as it becomes massively overparameterized.
  • Actionable Insight: For Crypto AI researchers, this suggests that instead of building rigid, domain-specific models (e.g., for a specific financial instrument), it may be more effective to use a larger, more general model with soft biases that can adapt to the underlying structure of the data, even if that structure is not perfectly understood beforehand.
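
A soft symmetry bias can be written as a penalty rather than an architectural constraint. A minimal sketch, where `model`, `loss_fn`, the batch, the shift transform, and the weight `lam` are all illustrative placeholders:

```python
# Penalize, rather than enforce, equivariance violations.
import torch

def transform(t):
    return torch.roll(t, shifts=1, dims=-1)  # example symmetry: a small shift

def soft_equivariance_loss(model, loss_fn, x, y, lam=0.1):
    task = loss_fn(model(x), y)
    # Deviation from exact equivariance, f(T(x)) == T(f(x)).
    violation = ((model(transform(x)) - transform(model(x))) ** 2).mean()
    return task + lam * violation  # lam tunes how "soft" the constraint is
```

With `lam` large the model behaves almost like a hard-constrained one; with `lam` small it is free to break the symmetry where the data demands it.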

The Bias-Variance Tradeoff is a "Misnomer"

  • You Can Have Both: He argues that modern techniques allow for both low bias and low variance. Ensembling models is one classical example. Critically, large neural networks trained with a simplicity bias achieve the same outcome: they are flexible enough to fit the data (low bias) but are regularized towards simple solutions that generalize well (low variance). A toy demonstration follows after this list.
  • Avoiding "Bad" Overfitting: The prescription for avoiding harmful overfitting is counterintuitive. Instead of making a model smaller, the answer is often to make it even bigger to push it into the "second descent" regime, where simplicity biases dominate and improve generalization.
  • Strategic Consideration: This fundamentally alters risk assessment for AI models. The risk of overfitting is not a simple function of parameter count. Investors and researchers must evaluate the entire system, including the scale, architecture, and training dynamics, to understand its true generalization potential.
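
The low-bias, low-variance combination from ensembling shows up even in a toy regression. The numpy illustration below (ours, not from the episode) fits a flexible polynomial to independently drawn noisy samples and compares the prediction variance of a single model against a 10-model average:

```python
# Averaging flexible models keeps low bias while cutting variance.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 50)

def fit_once():
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(x.size)  # fresh noisy sample
    coeffs = np.polyfit(x, y, deg=12)                       # flexible model
    return np.polyval(coeffs, 0.5)                          # prediction at x = 0.5

singles = np.array([fit_once() for _ in range(200)])
ensembles = np.array([np.mean([fit_once() for _ in range(10)]) for _ in range(200)])

print("single-model prediction variance:", singles.var())
print("10-model ensemble variance      :", ensembles.var())  # roughly 10x smaller
```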

The Bayesian Perspective: Why Deep Ensembles Are Secretly Bayesian

  • The Misconception: Deep ensembles were widely considered a "non-Bayesian" alternative to formal Bayesian methods.
  • The Reality: Wilson argued in a widely influential blog post that deep ensembles are, in fact, a highly effective approximation of the Bayesian ideal. They perform marginalization, averaging over many possible solutions, which is the core principle of Bayesian inference for handling uncertainty (a minimal ensemble sketch follows after this list).
  • The Takeaway: The goal should be to become "more Bayesian" by finding better ways to approximate the average over all plausible models. This was a crucial clarification that shifted research focus toward improving uncertainty quantification.
  • Relevance for Crypto AI: In volatile crypto markets, robust uncertainty estimates are critical for risk management. This insight validates the use of practical techniques like ensembling as a principled way to achieve more reliable predictions, even if they don't fit the textbook definition of a Bayesian model.
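
In code, the "secretly Bayesian" part is just averaging predictive distributions across independently trained members. A minimal sketch, where `make_model` and `train` are hypothetical placeholders for your architecture and training loop:

```python
# Deep ensemble as approximate marginalization over modes.
import torch

def ensemble_predict(make_model, train, x, n_members=5):
    probs = []
    for seed in range(n_members):
        torch.manual_seed(seed)            # different init -> different mode
        model = train(make_model())
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)  # averaged predictive distribution
```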

The Surprising Power of Simplicity in Real-World Data

  • Kolmogorov Complexity: This is a formal measure of simplicity, defined as the length of the shortest computer program that can produce a piece of data. Wilson's research suggests that real-world data has low Kolmogorov complexity, and successful deep learning models share this bias (a crude compression proxy is sketched after this list).
  • Challenging No-Free-Lunch Theorems: These theorems state that no single algorithm can be optimal for all possible problems. Wilson argues this is technically true but irrelevant in practice because the "universe of all problems" is mostly random noise. The real world occupies a small corner of this space, one that is highly structured and compressible.
  • Evidence from LLMs: The surprising ability of Large Language Models (LLMs) to transfer to unrelated tasks like time-series forecasting or materials science is presented as evidence. The pre-training on text instills a powerful bias for discovering compressible representations (like symmetries), which is a universal principle of induction that applies across domains.
  • Investor Insight: The generality of foundation models is not magic; it stems from learning a fundamental bias (compression) that aligns with the structure of real-world data. Investments in models that demonstrate strong cross-domain transferability are likely tapping into this powerful principle.
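
Kolmogorov complexity itself is uncomputable, but an off-the-shelf compressor gives a usable upper bound on description length. A toy illustration of the structured-versus-random gap (ours, and only a loose proxy for the formal quantity):

```python
# Compressed size as a crude stand-in for Kolmogorov complexity.
import os
import zlib

structured = bytes(range(256)) * 40   # highly regular, 10240 bytes
noise = os.urandom(10240)             # incompressible, 10240 bytes

print(len(zlib.compress(structured)))  # small: a short "program" suffices
print(len(zlib.compress(noise)))       # ~10240 bytes: no structure to exploit
```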

Bayesian Inference: The Ultimate Occam's Razor

  • What is Marginalization? Instead of picking a single "best" set of parameters, marginalization involves integrating (or averaging) over all possible parameter settings, weighted by how well they fit the data. This process naturally penalizes overly complex models because they spread their predictive power too thinly across many possible datasets (made precise by the evidence integral after this list).
  • Why It Matters for Deep Learning: Wilson states, "Bayesian marginalization... is going to be most important when we have a model that's very expressive." With billions of parameters, countless different solutions can fit the training data perfectly. Relying on just one is a massive, unprincipled bet.
  • Epistemic Uncertainty: This is uncertainty that can be reduced with more data. Wilson argues it is the most critical type of uncertainty to model. To not model it is "mathematically incorrect and could come at a huge cost."
  • Crypto AI Application: For decentralized AI systems, provably robust uncertainty quantification is essential for trust and safety. Bayesian methods provide a formal framework for this, and practical approximations like those discussed are key to implementation.
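
The quantity behind this automatic penalty is the marginal likelihood, or model evidence (standard notation, our addition): a model flexible enough to generate many datasets must spread its prior predictive mass thinly, so it assigns lower evidence to any particular dataset than a well-matched simpler model does.

```latex
p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid w, \mathcal{M})\, p(w \mid \mathcal{M})\, dw
```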

Grokking, Mode Connectivity, and the Shape of Learning

  • Mode Connectivity: Wilson's research discovered that different solutions (modes) found by a neural network are not isolated islands. Instead, they are connected by "paths" or "wormholes" of low loss. An optimizer can traverse these paths, finding solutions with the same low training error but different generalization properties (a simple linear-path probe is sketched after this list).
  • A New View of Optimization: This suggests that during training, especially prolonged training like in grokking, the model isn't just minimizing loss. It's moving within a connected manifold of perfect solutions, implicitly searching for one that is simpler or more compressible.
  • Quote: "The larger you make the model... the more it looks almost like a straight line path between the two solutions."
  • Research Frontier: Understanding this high-dimensional geometry is key to unlocking more efficient and reliable training methods. For researchers, this opens up new avenues for designing optimizers that don't just seek low loss but explicitly navigate these manifolds to find regions with desirable properties like flatness and compressibility.
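
Checking for linear mode connectivity is mechanical: load interpolated weights and evaluate the loss along the segment. A minimal probe, with `model`, `loss_fn`, a data batch, and the two trained state dicts as placeholders:

```python
# Evaluate the loss along the straight line between two trained solutions.
import torch

def loss_on_line(model, loss_fn, x, y, sd_a, sd_b, n_points=11):
    losses = []
    for t in torch.linspace(0, 1, n_points):
        interp = {k: ((1 - t) * v + t * sd_b[k]) if v.is_floating_point() else v
                  for k, v in sd_a.items()}  # keep integer buffers as-is
        model.load_state_dict(interp)
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
    return losses  # a flat profile suggests the two modes are linearly connected
```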

Actionable Philosophy and The Bitter Lesson

  • Practical Steps:
    1. Embrace Expressiveness: Use large models.
    2. Induce Simplicity: If compute allows, scale is an effective (though inelegant) way to do this. Otherwise, techniques like Stochastic Weight Averaging (SWA) or Bayesian marginalization can help find more compressible solutions.
  • Revisiting Sutton's "Bitter Lesson": This famous essay argues that brute-force computation and learning consistently outperform human-engineered knowledge over the long term. Wilson agrees but adds a crucial clarification: learning is impossible without assumptions (inductive biases).
  • The Real Goal: The goal is not to eliminate assumptions but to find the most general and effective ones. A bias towards simplicity (compression) appears to be one such universal assumption. Improving these assumptions can lead to exponential improvements in performance for a given amount of computation.

What's Missing: Beyond Prediction to Scientific Discovery

  • The Next Frontier: Wilson is most excited about developing AI systems that can discover novel scientific theories at the level of general relativity. This requires moving beyond black-box function approximation to systems that can generate new, interpretable insights.
  • The Role of AI: Current AI for science is powerful but mostly applies existing knowledge. The dream is an AI that can automate the process of scientific reasoning itself, proposing and testing new hypotheses.
  • Final Thought: The ultimate value of AI may not be in its ability to predict, but in its potential to compress reality into new, fundamental theories that expand human understanding.

Conclusion:

This episode argues that model scale is not just about capacity but about inducing a powerful simplicity bias that improves generalization. Crypto AI investors and researchers should prioritize platforms leveraging massive scale to discover more robust, compressible models, as this is where true performance gains and universality lie.
