This episode unpacks the Reinforcement Learning (RL) renaissance, detailing how AI scaling has evolved beyond pre-training and inference time compute, and what this shift means for decentralized AI infrastructure and investment.
Guest Introduction: Sam Lehman
- Sam Lehman, Partner at Symbolic Capital and AI researcher, joins Tommy to discuss his influential post, "The World's RL Gym."
- Symbolic Capital focuses on pre-seed and seed investments in Web3, with a strong thesis around decentralized AI. Sam brings experience from traditional finance and even craft brewing, now deeply immersed in the intersection of AI and crypto.
The Evolution of AI Scaling: Pre-Training Focus (Phase 1)
- Sam outlines a three-phase history of AI scaling: pre-training, inference time compute, and the current RL scaling phase.
- The initial phase centered on pre-training: scaling models by increasing data and compute power during the initial training process.
- Key research from OpenAI (the 2020 scaling-laws paper) and DeepMind (the 2022 Chinchilla paper) codified the optimal ratio of training data (tokens) to model size (parameters); the Chinchilla result put it at roughly 20 tokens per parameter. This revealed earlier models were often "over-parameterized" relative to their training data (a small arithmetic sketch of the ratio follows this list).
- Strategic Insight: Understanding these scaling laws highlighted the critical bottleneck of data acquisition and compute for improving model performance in the early days, driving the race for massive datasets and GPU clusters.
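A minimal arithmetic sketch of the ~20 tokens-per-parameter heuristic mentioned above; the ratio is the approximate figure cited in the episode, and the model sizes are purely illustrative.

```python
# Rough compute-optimal data budget under the ~20 tokens-per-parameter
# heuristic associated with the Chinchilla results (illustrative only).

TOKENS_PER_PARAM = 20  # approximate ratio cited in the episode

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Return the roughly compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params

for n_params in (7e9, 70e9, 400e9):  # hypothetical model sizes
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>5.0f}B params -> ~{tokens / 1e9:,.0f}B training tokens")
```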
The Shift to Inference Time Compute (Phase 2)
- A later phase focused on inference time (or test time) compute scaling, particularly evident in 2023-2024 with the rise of reasoning models.
- Inference Time Compute: Refers to the computational resources a model uses after initial training, when it's actively solving a problem or responding to a query.
- Researchers discovered that allowing models more time and compute resources to "think" during inference significantly improved performance, even enabling smaller models to outperform larger ones by thinking longer. Google DeepMind's paper "Scaling LLM Test-Time Compute" was pivotal here (a minimal sketch of the idea follows this list).
- Sam notes Jensen Huang's (NVIDIA CEO) comment about inference potentially being "a billion times bigger than pre-training," marking a paradigm shift where runtime compute became a new lever for performance.
- Actionable Takeaway: The rise of inference-time compute signals a shift in resource demand. Investors should track innovations in inference optimization, hardware, and models designed for complex, multi-step reasoning, as this is becoming a major compute consumption area.
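One common way to spend extra compute at inference time is best-of-N sampling: draw several candidate answers and keep the one a scorer ranks highest. The sketch below illustrates only that general idea; `generate` and `score` are hypothetical stand-ins for a model's sampling call and a verifier or reward model, not any specific API.

```python
import random
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Spend more inference-time compute by sampling n candidates
    and returning the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy stand-ins so the sketch runs on its own.
def toy_generate(prompt: str) -> str:
    return f"candidate-{random.randint(0, 100)}"

def toy_score(prompt: str, answer: str) -> float:
    return random.random()  # a real system would use a verifier or reward model

print(best_of_n("What is 17 * 24?", toy_generate, toy_score, n=4))
```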
The Reinforcement Learning Renaissance (Phase 3)
- Sam identifies the current phase as the RL scaling renaissance, catalyzed significantly by DeepSeek's work.
- Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to achieve a goal, receiving rewards or penalties for its actions. In LLMs, this often involves optimizing responses based on feedback.
- While RL, particularly RLHF (Reinforcement Learning from Human Feedback), wasn't new, DeepSeek demonstrated an innovative and elegant application using GRPO (Group Relative Policy Optimization), a specific RL algorithm (its core idea is sketched after this list).
- The key breakthrough was using RL to elicit powerful reasoning capabilities from a base model with limited human intervention, showcasing a path to scale model intelligence more autonomously.
- Investor Note: The surge in diverse RL algorithms (like GRPO) post-DeepSeek indicates rapid experimentation. Tracking which RL techniques gain traction is crucial for understanding future model development pathways and potential efficiency gains.
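The distinguishing idea in GRPO is to drop the learned value function and instead standardize each sampled response's reward against the other responses in its group; that normalized score serves as the advantage in a PPO-style update. Below is a minimal sketch of just the advantage computation with toy binary rewards (the clipping and KL terms of the full objective are omitted).

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: each reward standardized against the
    mean and std of its own sampled group (GRPO's central idea)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in group_rewards]

# Toy example: four sampled responses to one prompt, binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # correct answers receive positive advantage
```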
Deep Dive: The DeepSeek Moment (V3, R1-Zero, R1)
- Sam clarifies the DeepSeek releases: V3 (a powerful 671B-parameter sparse base model, a Mixture of Experts (MoE) model where different experts specialize), followed by R1-Zero and R1 (reasoning models derived from V3 via RL).
- The R1-Zero training process was particularly notable:
- It fed the V3 base model verifiable problems (math, coding).
- Provided simple binary rewards (1 for correct, 0 for incorrect) without explicit instructions on how to solve them.
- Crucially, this involved minimal human-generated preference data. "With R1-Zero, it was literally just here's a bunch of problems. Start thinking, figure it out," Sam explains. (A minimal sketch of this kind of binary reward follows this list.)
- Through this trial-and-error process with binary rewards, the model learned to reason more effectively and "think longer" to achieve correct answers.
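A rough sketch of the kind of binary, verifiable reward described above: the grader only checks whether the final answer matches a known ground truth and says nothing about how to solve the problem. The answer-extraction convention (a final "Answer:" line) is an assumption for illustration, not DeepSeek's actual output format.

```python
def binary_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known solution,
    else 0.0 -- no partial credit and no hints about the reasoning itself."""
    # Assumed convention: the model ends its output with a line "Answer: ...".
    final_line = model_output.strip().splitlines()[-1]
    answer = final_line.removeprefix("Answer:").strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(binary_reward("Let me think step by step...\nAnswer: 408", "408"))  # 1.0
print(binary_reward("Hmm, probably...\nAnswer: 407", "408"))              # 0.0
```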
The Power and Quirks of RL-Generated Reasoning
- The R1-Zero model became highly performant at problem-solving but wasn't "human-legible." Its reasoning traces (the step-by-step thought process) might mix languages or lack proper syntax.
- Sam describes R1-Zero as a "wild child," super smart but not conditioned for typical human interaction, highlighting a trade-off between raw capability and human alignment/preference.
- The model learned that longer thinking often correlated with correct answers, leading to progressively longer response lengths during training – Sam references a chart showing this and the researchers' "aha moment" where the model explicitly noted the need to think longer.
- Research Consideration: This raises questions about the potential limitations imposed by human preference data (like in RLHF). Does optimizing for human legibility inhibit a model's raw problem-solving potential or creativity?
The Role of Human Data vs. Model Exploration (AlphaGo Analogy)
- Sam draws a parallel to AlphaGo's development. The original AlphaGo, trained in part on human game examples, performed well – and famously produced the unexpected "Move 37" against Lee Sedol. Its successor, AlphaGo Zero (and later AlphaZero), trained only on the rules and self-play (pure RL without human examples), became even stronger, discovering novel strategies humans wouldn't conceive of.
- This suggests that excessive reliance on human data might constrain a model's exploratory capabilities and ultimate performance ceiling.
- Strategic Question: Should the focus be purely on verifiable outcomes (like R1-Zero's binary rewards) rather than mimicking human thought processes, especially for complex, non-intuitive problem domains? This has implications for data strategies in AI development.
The Importance of Reasoning Traces
- Tommy raises the point that reasoning traces – the model's step-by-step "thinking" process – seem increasingly valuable, potentially more so than just raw data.
- Reasoning Trace: The sequence of intermediate steps or thoughts generated by a model as it works towards a final answer. DeepSeek made these visible, showing the model strategizing ("I should solve it this way..."). A hypothetical trace schema is sketched after this list.
- Sam agrees on their importance but notes the current reliance on high-quality initial question-answer pairs to kickstart the RL process, even in systems aiming for synthetic data generation.
- A key challenge is moving beyond easily verifiable domains (math, code) to apply RL effectively in creative or subjective areas where defining "correctness" is harder. Generalizability across domains is another under-researched area.
- Insight: The generation, verification, and utilization of reasoning traces represent a new, high-value data frontier. Infrastructure and methods for handling this data type are becoming critical.
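One way to picture a reasoning trace as a data artifact: a record pairing the prompt, the intermediate steps, the final answer, and a verification result. This is a hypothetical schema for illustration, not a format any lab has standardized.

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    """A verified reasoning trace as a training artifact (hypothetical schema)."""
    prompt: str
    steps: list[str]        # the model's intermediate "thinking" steps
    final_answer: str
    domain: str = "math"    # e.g., math, code, medicine, finance
    verified: bool = False  # set by whatever verification mechanism is used

trace = ReasoningTrace(
    prompt="What is 17 * 24?",
    steps=["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 68"],
    final_answer="408",
    verified=True,
)
print(trace.domain, trace.verified)
```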
Conceptualizing the Decentralized RL Network (Foundation, Gym, Refinery)
- Sam proposes a three-part mental model for decentralizing the RL process:
- The Foundation: A performant base model, ideally developed via decentralized pre-training (referencing projects like Nous Research, Prime Intellect, Gensyn).
- The Gym: An environment or platform to generate diverse, high-quality reasoning data (traces) across various domains and cognitive strategies.
- The Refinery: A network to perform the actual RL optimization (post-training) using the data from the gym to improve models. (A minimal interface sketch of the three roles follows this list.)
- Crypto AI Relevance: This framework directly maps to potential decentralized infrastructure plays – networks for compute (Foundation, Refinery) and specialized platforms for data generation/verification (Gym).
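A minimal way to express the Foundation / Gym / Refinery split as interfaces, purely to make the division of labor concrete; the names and method signatures below are assumptions for illustration, not any project's actual API.

```python
from typing import Protocol

class Foundation(Protocol):
    """A performant base model, ideally from (decentralized) pre-training."""
    def generate(self, prompt: str) -> str: ...

class Gym(Protocol):
    """Produces verified reasoning traces across many domains."""
    def collect_traces(self, model: Foundation, n: int) -> list[dict]: ...

class Refinery(Protocol):
    """Runs the RL post-training step on traces coming out of the Gym."""
    def optimize(self, model: Foundation, traces: list[dict]) -> Foundation: ...

def improvement_loop(model: Foundation, gym: Gym, refinery: Refinery) -> Foundation:
    """The intended flow: Gym data feeds the Refinery, which improves the model."""
    traces = gym.collect_traces(model, n=1_000)
    return refinery.optimize(model, traces)
```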
The "World's Gym": Generating Diverse Reasoning Data
- The "Gym" concept builds on existing ideas like OpenAI Gym or Carla (for self-driving). Sam envisions open, decentralized platforms where:
- Anyone can create specialized "environments" (e.g., for medicine, finance, physics).
- Users bring models to these environments to generate reasoning traces by attempting tasks.
- Robust, potentially decentralized, verification mechanisms assess the quality/correctness of these traces.
- This verified data corpus feeds back into model training (the "Refinery").
- Contribution isn't about manually writing traces, but about creating environments, running models, and participating in verification (a toy environment sketch follows this list).
- Opportunity: Platforms facilitating the creation of diverse RL environments and verifiable reasoning trace generation could become crucial infrastructure, potentially leveraging crypto-economic incentives.
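To make the Gym idea concrete, here is a toy environment in the spirit of OpenAI Gym: it poses verifiable arithmetic tasks, lets any model attempt them, and checks the result before a trace is kept. Everything here (class names, the arithmetic domain, the trace fields) is illustrative.

```python
import random
from typing import Callable

class ArithmeticEnv:
    """A toy reasoning environment: verifiable tasks plus a built-in checker."""

    def reset(self) -> str:
        a, b = random.randint(2, 99), random.randint(2, 99)
        self._answer = str(a * b)
        return f"What is {a} * {b}?"

    def verify(self, answer: str) -> bool:
        return answer.strip() == self._answer

def run_episode(env: ArithmeticEnv, model: Callable[[str], str]) -> dict:
    """Have a model attempt one task and record a (possibly verified) trace."""
    prompt = env.reset()
    output = model(prompt)
    return {"prompt": prompt, "output": output, "verified": env.verify(output)}

# A deliberately dumb stand-in "model" so the sketch runs end to end.
print(run_episode(ArithmeticEnv(), lambda prompt: "408"))
```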
Why Decentralize RL? Open vs. Closed Innovation
- Sam argues for decentralization based on the principle of "innovation at the edge" (borrowing from USV).
- He believes an open platform allowing massive experimentation across countless domains by diverse contributors is more likely to discover the absolute best reasoning strategies than closed, centralized labs.
- His analogy: A closed lab is like a single genius in a room with tutors; an open network is like a global school where the best minds collaborate and share diverse approaches freely.
- Core Argument: Decentralization fosters broader exploration and cross-pollination of ideas, essential for the trial-and-error nature of RL, potentially leading to faster and more diverse breakthroughs.
Feeding RL Insights Back: The Future of Model Training (Modular Models)
- How does the valuable data from the "Gym" improve models globally?
- One vision: A continuously improving "world model" fed by decentralized RL data streams.
- Sam's current interest: Highly Modular Sparse Models (like MoE). The idea is to have specialized "expert" sub-models within a larger architecture that can be independently trained, improved, and potentially swapped or combined ("Lego blocks").
- This modularity could allow specialized teams to develop best-in-class experts for niche domains (e.g., a specific coding language, medical diagnosis) that can then be plugged into broader models. Gensyn's HDEE paper explores training sub-experts in parallel on heterogeneous hardware. (A toy sketch of the swappable-expert idea follows this list.)
- Future Trend: Modular architectures could decentralize not just training data generation but also model development itself, creating markets for specialized AI components. This contrasts with monolithic model training.
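A toy sketch of the "Lego blocks" idea: a registry of independently developed expert modules plus a router that picks which expert handles a query. Real MoE models route per token inside the network; this per-query version is a deliberate simplification to show why swapping in a better expert is attractive. All names here are hypothetical.

```python
from typing import Callable

Expert = Callable[[str], str]

class ModularModel:
    """A router over independently swappable expert modules (toy version)."""

    def __init__(self) -> None:
        self.experts: dict[str, Expert] = {}

    def register(self, domain: str, expert: Expert) -> None:
        # Plugging in a better expert upgrades just that "block" of the model.
        self.experts[domain] = expert

    def answer(self, domain: str, prompt: str) -> str:
        return self.experts[domain](prompt)

model = ModularModel()
model.register("python", lambda p: "v1: use a list comprehension")
model.register("medicine", lambda p: "v1: start from a differential diagnosis")
model.register("python", lambda p: "v2: use a generator expression")  # swap in a newer expert
print(model.answer("python", "How do I filter a large list lazily?"))
```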
Open Source vs. Proprietary AI: An Evolving View
- Sam expresses a measured view: "Distributed AI is inevitable," citing frontier labs already using distributed training. Decentralized AI (trustless, heterogeneous compute) is the next step, particularly suited for RL's parallelizable nature.
- He acknowledges the success of projects like Prime Intellect and Gensyn in demonstrating decentralized RL's efficacy.
- However, he questions the necessity of tokens/speculative incentives for all aspects, while seeing potential in models like Pluralis's (sharding models, rewarding compute based on usage).
- The big uncertainty remains: Can open, decentralized efforts truly compete with the performance and resources of frontier labs like OpenAI? He hopes the diversity of open contribution provides an edge.
- Investor Consideration: The debate continues. While decentralized approaches show promise, the scale and integration capabilities of proprietary models (like OpenAI's) present significant competitive hurdles and lock-in risks.
The Lock-In Problem and Model Swapping Challenges
- Contrary to early beliefs that models would be easily swappable commodities, Sam argues that providers want lock-in.
- Features like OpenAI's Memory make models deeply personalized and harder to switch away from. Prompt engineering and workflows often become highly tuned to a specific model's behavior.
- "They want to lock you in and they want to make it as hard as possible for you to swap your model," Sam states, predicting this trend will intensify.
- Risk Factor: Investors in applications built on specific proprietary models should be aware of platform risk and the increasing difficulty/cost of migrating to alternatives.
Perspectives on AGI and Current Model Capabilities
- Sam shares his less-than-magical experience with GPT-4o on a specific web-scraping task, encountering bugs and loops, suggesting current models are still far from flawless or true AGI (Artificial General Intelligence).
- However, he sees the potential for models learning to use tools and autonomously string together actions (as seen with GPT-4o's direction) as the path toward more "real-world performant model behavior" that approaches AGI territory.
- He finds the concept of models developing reasoning traces and solving problems autonomously (even imperfectly) more indicative of progress towards AGI than simple next-token prediction. The "Arabesque" paper on letting models learn from world interaction is relevant here.
Investment Focus: Decentralized AI Landscape
- Symbolic Capital invests exclusively in Crypto AI / Decentralized AI.
- Sam highlights several impressive teams pushing the frontier: Pluralis, Prime Intellect, Gensyn, Nous Research, Ambient, Exo.
- He also mentions non-crypto players like SF Compute (financializing GPU compute) and teams working on bringing similar concepts on-chain or creating developer-friendly distributed compute platforms.
- He finds appeal in "middle-ground" approaches (distributed but not fully decentralized, using stablecoins for settlement) as they offer immediate utility, while acknowledging the high-risk, high-reward potential of fully decentralized visions like Pluralis.
- Area of Interest: Sam is particularly focused on modular model architectures and anyone working on training or composing specialized "expert" sub-models.
Final Thoughts: RL is Not Dead & Base Model Importance
- Sam addresses recent online discourse questioning RL's value, stemming from research suggesting base models can eventually reach the same answers as reasoning models if given enough sampling attempts (i.e., at high pass@k; see the sketch after this list).
- His counter: RL teaches models efficient reasoning behavior that gets the right answer faster and more reliably on the first try, which is crucial in practice. "RL is definitely not dead. RL got us [GPT-4o]," he emphasizes.
- Crucially, Sam stresses that the RL renaissance does not mean the death of pre-training. Quoting Dario Amodei (Anthropic), he notes that RL works best on already capable base models. Better pre-training leads to better RL outcomes.
- Key Takeaway: Both strong base models (via pre-training) and effective RL (for reasoning) are complementary and essential for advancing AI capabilities.
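The "base models get there eventually" argument is usually framed in terms of pass@k: with enough samples a base model's pass@k can approach a reasoning model's, while the RL-tuned model wins decisively at pass@1. Below is the standard unbiased pass@k estimator (from the Codex evaluation literature) applied to made-up numbers, just to show that first-try reliability and many-try coverage measure different things.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up numbers: a base model solves 5/100 samples, an RL-tuned model 60/100.
for name, correct in [("base model", 5), ("rl-tuned model", 60)]:
    print(f"{name}: pass@1={pass_at_k(100, correct, 1):.2f}, "
          f"pass@64={pass_at_k(100, correct, 64):.2f}")
```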
This episode highlights Reinforcement Learning as the pivotal next phase in AI scaling, driven by reasoning capabilities. Crypto AI investors and researchers must track RL advancements, decentralized data generation (like RL gyms), and modular architectures to identify emerging infrastructure needs and strategic investment opportunities.