This episode reveals the fundamental architectural limits of current LLMs—they are powerful navigators of existing knowledge, not creators of new science, a critical distinction for anyone investing in the future of AI.
The Information Theory Lens on LLMs
- Martin Casado introduces Vishal Misra, a Columbia University Computer Science professor, highlighting their shared background in networking. Martin praises Vishal's work for providing the most predictive and formal models for understanding how LLMs operate. Where the prevailing discourse swings between hype ("AGI is here") and oversimplification ("they're just stochastic parrots"), Vishal applies principles from information theory to build a structured framework for analyzing LLM reasoning.
- Martin explains his interpretation of Vishal's core idea: LLMs reduce the complex, multi-dimensional universe of information into a lower-dimensional geometric structure called a Bayesian Manifold. This manifold represents a high-confidence state space where the model can reason effectively.
- Bayesian Manifold: A conceptual geometric space representing the knowledge an LLM has learned. When reasoning, the LLM moves along paths within this manifold where it has high confidence; straying from it leads to hallucinations.
- Martin notes, "We take this very complex heavy-tailed stochastic universe and we reduce it to kind of this geometric manifold and then when we reason we just move along that manifold."
Entropy, Prompts, and Predictability
- Vishal confirms Martin's summary and elaborates on the mechanics. At their core, LLMs generate a probability distribution for the next token (word or part of a word). The key to understanding their behavior lies in the entropy of this distribution.
- Information Entropy (Shannon Entropy): A measure of uncertainty in a probability distribution, H = −Σ p(x) log₂ p(x). Low entropy means the next token is highly predictable, while high entropy means many tokens are plausible.
- LLMs perform best with prompts that are high in information (specific, rare) but lead to low prediction entropy (a clear, predictable next step).
- For example, "I'm going out for dinner" is a low-information prompt leading to high prediction entropy (many possibilities).
- In contrast, "I'm going to dinner with Martin Casado" is a high-information prompt that reduces prediction entropy, as the LLM can infer a narrower set of likely restaurant types (see the entropy sketch after this list).
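A minimal sketch of this entropy intuition in Python, using made-up next-token distributions (the probabilities are illustrative only, not drawn from any real model):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)) of a next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distribution after a low-information prompt ("I'm going out for dinner"):
# many continuations are plausible, so the entropy is high.
after_generic_prompt = [0.12, 0.10, 0.10, 0.09, 0.09] + [0.05] * 10

# Hypothetical distribution after a high-information prompt
# ("I'm going to dinner with Martin Casado"): a few continuations dominate.
after_specific_prompt = [0.70, 0.15, 0.10, 0.05]

print(shannon_entropy(after_generic_prompt))   # ~3.8 bits: the next token is hard to predict
print(shannon_entropy(after_specific_prompt))  # ~1.3 bits: the next step is nearly determined
```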
The Mechanics of Reasoning and Chain of Thought
- Vishal explains that this entropy-reduction mechanism is why techniques like Chain of Thought are effective.
- Chain of Thought: A prompting method where the LLM is asked to break down a problem into smaller, sequential steps before providing a final answer.
- When asked a complex question like "What is 769 * 1025?", the LLM's initial prediction entropy for the answer is high and diffuse.
- When the model instead invokes an algorithmic, step-by-step procedure (like long multiplication), the prediction entropy at each stage becomes very low: it knows exactly what to do next.
- Vishal states this process allows the LLM to "arrive at an answer which you're confident of and which is correct." This demonstrates that LLMs excel when they can follow learned, structured procedures within their known manifold, as the worked sketch below illustrates.
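As a toy illustration of why stepwise decomposition lowers entropy, here is the 769 * 1025 example worked out by hand in Python (the decomposition is hand-written, not produced by any model):

```python
# Chain-of-thought style decomposition of 769 * 1025: each intermediate
# step is a near-deterministic, low-entropy prediction for the model.
a, b = 769, 1025

step1 = a * 1000        # 769 * 1000 = 769_000
step2 = a * 25          # 769 * 25   =  19_225
answer = step1 + step2  # 769_000 + 19_225 = 788_225

assert answer == a * b == 788_225
```

Predicting the six-digit product in one shot is a high-entropy guess; predicting each partial product and the final sum is not.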
From Cricket Statistics to Accidental Innovation
- Vishal shares the origin story of his work on LLMs, which began with a personal project to fix the cumbersome user interface of Cricinfo's statistics database, StatsGuru. The goal was to replace a complex web form with a natural language query system.
- In July 2020, using the early GPT-3 API, he found the model couldn't handle the database's complexity directly due to its small 2,048-token context window.
- To solve this, he created a system where a user's query would retrieve similar example queries from a database and feed them into the GPT-3 prompt. This provided the necessary context for the model to generate the correct structured query (a minimal sketch of this retrieve-then-prompt pattern follows this list).
- This method, which he built 15 months before ChatGPT's release, was an accidental invention of what is now known as RAG (Retrieval-Augmented Generation).
- RAG: An AI framework that retrieves data from an external knowledge base to ground LLM responses in factual, specific information, reducing hallucinations.
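A minimal sketch of the retrieve-then-prompt pattern described above; `embed` and `complete` are hypothetical stand-ins for an embedding model and an LLM completion call, not any specific API:

```python
from typing import Callable, Dict, List

def answer_with_retrieval(
    user_query: str,
    example_store: List[Dict[str, str]],   # e.g. {"question": ..., "structured_query": ...}
    embed: Callable[[str], List[float]],   # hypothetical embedding function
    complete: Callable[[str], str],        # hypothetical LLM completion call
    k: int = 3,
) -> str:
    """Retrieve the k most similar worked examples and prepend them to the prompt,
    so the model's next-token predictions are conditioned on relevant evidence."""
    def cosine(u: List[float], v: List[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5
        return dot / norm if norm else 0.0

    q_vec = embed(user_query)
    ranked = sorted(example_store,
                    key=lambda ex: cosine(embed(ex["question"]), q_vec),
                    reverse=True)
    context = "\n\n".join(f"Q: {ex['question']}\nQuery: {ex['structured_query']}"
                          for ex in ranked[:k])
    return complete(f"{context}\n\nQ: {user_query}\nQuery:")
```

The retrieved examples play the same role as the StatsGuru query bank: they narrow the next-token distribution enough for the model to emit a correct structured query within a small context window.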
The Plateauing Pace of LLM Advancement
- Reflecting on the evolution since GPT-3, Vishal expresses surprise at the rapid pace of development but now sees signs of a plateau. He compares the current state of LLMs to the iPhone—after initial revolutionary leaps, recent iterations offer only incremental improvements, like a slightly better camera.
- He observes that across models from OpenAI, Anthropic, and Google, "the capabilities of LLMs has not fundamentally changed. They've become better, right? They've improved but they have not crossed into a different realm."
- Strategic Implication: For investors, this suggests that simply scaling current transformer architectures with more data and compute may yield diminishing returns. The next major breakthrough will likely require a new architectural paradigm.
A Formal Model: The Matrix Abstraction
- Martin praises Vishal for developing a formal model to analyze LLMs while others were focused on rhetoric. Vishal outlines his "matrix abstraction" to explain their inner workings.
- Imagine a massive matrix where each row is a possible prompt and each column is a token in the LLM's vocabulary. Each cell contains the probability of that token following that prompt.
- This matrix is astronomically large—more rows than atoms in the known universe—and thus cannot be stored directly. It is also extremely sparse, as most token sequences are nonsensical.
- LLMs create a compressed, interpolated representation of this matrix based on their training data. When given a new prompt, they use it as evidence to compute a Bayesian posterior distribution for the next token.
- This model elegantly explains in-context learning (or few-shot learning): providing examples in a prompt acts as new evidence that lets the LLM update its predictions and learn a new task on the fly without retraining, as sketched below.
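A toy numerical sketch of in-context learning as Bayesian updating: a handful of in-context examples act as evidence that reweights a prior over candidate tasks. All numbers here are illustrative, not drawn from any real model:

```python
# Prior over which task the prompt is asking for (hypothetical numbers).
prior = {"translate EN->FR": 0.05, "summarize": 0.60, "answer trivia": 0.35}

# Likelihood of the observed in-context examples under each task (hypothetical).
likelihood = {"translate EN->FR": 0.90, "summarize": 0.05, "answer trivia": 0.02}

# Bayes rule: posterior(task) is proportional to prior(task) * likelihood(examples | task).
unnormalized = {t: prior[t] * likelihood[t] for t in prior}
z = sum(unnormalized.values())
posterior = {t: round(p / z, 3) for t, p in unnormalized.items()}

print(posterior)  # {'translate EN->FR': 0.549, 'summarize': 0.366, 'answer trivia': 0.085}
```

A few well-chosen examples flip the posterior toward the intended task, which is exactly the "evidence" role the prompt plays in the matrix abstraction.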
Why Recursive Self-Improvement is Unlikely
- Using this framework, Vishal argues against the possibility of recursive self-improvement with current architectures. The output of an LLM is the inductive closure of its training data—it can only generate conclusions that can be logically derived from what it has already seen.
- Inductive Closure: The complete set of knowledge that can be inferred from an initial set of data. For LLMs, this means their outputs are fundamentally bound by their training data (see the fixed-point sketch after this list).
- Even with multiple LLMs interacting, they cannot generate truly new information because they are all operating within the same closed system defined by their initial training.
- Vishal gives a powerful example: "Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity. Einstein had to sort of reject the Newtonian physics and come up with this space-time continuum."
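A small sketch of the inductive-closure idea: start from a set of facts and inference rules, and repeatedly derive until nothing new appears. The fixed point is the boundary of what the system can ever produce; the facts and rules below are hypothetical placeholders:

```python
def inductive_closure(facts, rules):
    """Return the smallest superset of `facts` closed under `rules`.
    Each rule maps the current fact set to newly derivable facts."""
    closed = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(closed)
        if derived <= closed:   # fixed point: nothing new can be derived
            return closed
        closed |= derived

# Hypothetical toy knowledge base: "implies" edges plus a transitivity rule.
facts = {("A", "B"), ("B", "C")}
rules = [lambda kb: {(x, z) for (x, y1) in kb for (y2, z) in kb if y1 == y2}]

print(inductive_closure(facts, rules))  # adds ("A", "C"), but never a fact outside the closure
```

On this view, more compute lets a model reach its closure faster or more reliably, but a fact outside the closure (a new axiom, in the Einstein example) is unreachable by construction.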
Defining AGI: Creating New Manifolds
- Vishal defines AGI not as a system that is merely intelligent but as one that can create new knowledge and scientific paradigms.
- Current LLMs are navigators; they are exceptionally good at exploring and connecting dots within the existing manifolds of human knowledge they were trained on.
- True AGI will be a creator; it will have the ability to generate entirely new manifolds—discovering new axioms, new branches of mathematics, or new laws of physics.
- He argues that achieving this requires a new architecture, as simply adding more data or compute to current models will only "smoothen out the already existing manifolds."
The Path Forward: Beyond Current Architectures
- Vishal believes the next leap requires a new architecture that sits on top of or replaces LLMs. He points to several promising, albeit early, research directions:
- Simulation: Developing models that can run approximate mental simulations to test ideas, much as a human implicitly predicts a ball's trajectory in order to catch it, rather than relying solely on language processing.
- Energy-Based Models: Architectures like Yann LeCun's JEPA (Joint Embedding Predictive Architecture), which aim to learn more abstract representations of the world.
- Analyzing Failures: Studying why LLMs fail on abstract reasoning benchmarks (like the ARC Prize) to reverse-engineer the requirements for a more capable architecture.
- Actionable Insight: Researchers and investors should prioritize and explore these alternative architectures, as they represent the most likely path to overcoming the inherent limitations of today's LLMs.
Conclusion
- This discussion provides a formal framework for understanding LLM limitations: they are powerful Bayesian reasoners confined to the knowledge manifolds of their training data. For investors and researchers, the key takeaway is that true AGI requires a new architecture capable of creating knowledge, not just navigating it.