Delphi Digital
April 30, 2025

Sam Lehman: What the Reinforcement Learning Renaissance Means for Decentralized AI

Sam Lehman, Partner at Symbolic Capital and AI researcher, joins the podcast to unpack the Reinforcement Learning (RL) renaissance, charting its evolution from pre-training scaling laws to its potential fusion with decentralized networks.

The Shifting Sands of AI Scaling

  • "I think about this history in three phases. Phase one being focused on pre-training, phase two on inference time... and then sort of this new emergent phase which is RL scaling."
  • "...by allowing models sort of more time to think, we could get more performant models."
  • AI model improvement initially focused on pre-training: scaling data and compute based on codified laws (like Chinchilla's 20 tokens/parameter ratio).
  • A paradigm shift occurred with inference-time scaling: realizing models improve by being given more compute during problem-solving ("thinking longer"), allowing smaller models to potentially outperform larger ones.
  • This evolution led to the current focus: scaling Reinforcement Learning processes.

Reinforcement Learning Takes Center Stage

  • "...they took a very performant base model and through... their very elegant GRPO focused RL process got a base model to develop extremely powerful reasoning capabilities... with minimal human intervention."
  • "A reasoning trace is just the chain of thought string that the model produced to end up at its eventual answer."
  • DeepSeek's R1-Zero and R1 models showcased RL's power, particularly using GRPO, to enhance a base model's reasoning significantly with limited human oversight, a departure from typical RLHF.
  • Models learned through trial-and-error on verifiable tasks (math, code), generating valuable "reasoning traces" (their step-by-step thought process). They discovered longer thinking often led to correct answers ("aha moment").
  • While highly performant, models trained this way (like R1-Zero) can be less human-legible, sometimes mixing languages or using odd syntax to reach the correct answer.

Decentralizing the RL Playground: The World's Gym

  • "My goal was to map out... how you would decentralize the RL process, but also why you might want to."
  • "The idea is to have many environments that can elicit the best strategy for a given domain. And then you need to pair with that robust verifiers."
  • Lehman proposes a decentralized RL framework: 1) A decentralized foundation (base models), 2) A "Gym" (diverse environments for generating reasoning traces across many domains), 3) A "Refinery" (network for optimizing models with verified traces).
  • The "World's RL Gym" concept allows open contribution: creating domain-specific environments (math, medicine, finance), letting models generate strategies (traces), and verifying results.
  • Projects like Prime Intellect's Genesis/Synthetic 2 and General Reasoning are early examples, building open libraries of reasoning traces.

Open vs. Closed: The Future Battlefield

  • "I personally think that you want a platform where as much experimentation across as many different domains can happen... because only then will you elicit the absolute best possible strategies."
  • "The incentives for them [frontier labs] are to pull up the ladder behind them, create moats wherever they can... They're trying to lock you in to their models..."
  • The argument for open, decentralized RL: broad experimentation across many domains and contributors (like a collaborative school) will likely yield better, more diverse strategies than closed labs (like a genius locked in a room).
  • Frontier labs leverage huge user bases and actively build moats like model-specific memory features to create lock-in, countering the idea of easily swappable models.
  • While distributed AI training is here, truly decentralized AI (trustless, heterogeneous compute) faces challenges competing with centralized resources, though projects like Pluralis, Gensyn, and Prime Intellect are pushing boundaries, often without tokens initially.

Key Takeaways:

  • Decentralized AI leverages RL's need for diverse experimentation, potentially outpacing closed labs through open collaboration. However, frontier models maintain significant advantages via scale and user lock-in tactics. The future likely involves a blend: highly performant base models enhanced by increasingly sophisticated, potentially decentralized, RL techniques.
  • RL is the New Scaling Frontier: Forget just bigger models; refining models via RL and inference-time compute is driving massive performance gains (DeepSeek, o3), focusing value on the process of reasoning.
  • Decentralized RL Unlocks Experimentation: Open "Gyms" for generating and verifying reasoning traces across countless domains could foster innovation beyond the scope of any single company.
  • Base Models + RL = Synergy: Peak performance requires both: powerful foundational models (better pre-training still matters) and sophisticated RL fine-tuning to elicit desired behaviors efficiently.

For further insights and detailed discussions, watch the full podcast: Link

This episode unpacks the Reinforcement Learning (RL) renaissance, detailing how AI scaling has evolved beyond pre-training and inference time compute, and what this shift means for decentralized AI infrastructure and investment.

Guest Introduction: Sam Lehman

  • Sam Lehman, Partner at Symbolic Capital and AI researcher, joins Tommy to discuss his influential post, "The World's RL Gym."
  • Symbolic Capital focuses on pre-seed and seed investments in Web3, with a strong thesis around decentralized AI. Sam brings experience from traditional finance and even craft brewing, now deeply immersed in the intersection of AI and crypto.

The Evolution of AI Scaling: Pre-Training Focus (Phase 1)

  • Sam outlines a three-phase history of AI scaling: pre-training, inference time compute, and the current RL scaling phase.
  • The initial phase centered on pre-training: scaling models by increasing data and compute power during the initial training process.
  • Key research from OpenAI (the 2020 scaling-laws work by Kaplan et al.) and DeepMind (the 2022 Chinchilla paper) codified the relationship between training data (tokens) and model size (parameters); Chinchilla put the compute-optimal ratio at roughly 20 tokens per parameter (a quick sketch of this rule of thumb follows below). This revealed early models were often "over-parameterized" relative to their training data.
  • Strategic Insight: Understanding these scaling laws highlighted the critical bottleneck of data acquisition and compute for improving model performance in the early days, driving the race for massive datasets and GPU clusters.
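
To make the ratio concrete, here is a minimal back-of-the-envelope sketch, assuming the ~20 tokens/parameter rule of thumb and the common 6·N·D approximation for training FLOPs; the function names are illustrative, not from any library:

```python
# Back-of-the-envelope Chinchilla arithmetic: ~20 training tokens per parameter,
# plus the widely used 6 * N * D approximation for total training FLOPs.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for n_params parameters."""
    return n_params * tokens_per_param

def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Common 6 * N * D estimate of total training compute."""
    return 6.0 * n_params * n_tokens

n_params = 70e9                                   # e.g. a 70B-parameter model
tokens = chinchilla_optimal_tokens(n_params)      # ~1.4 trillion tokens
flops = approx_training_flops(n_params, tokens)   # ~5.9e23 FLOPs
print(f"~{tokens / 1e12:.1f}T tokens, ~{flops:.1e} FLOPs")
```

The point of the ratio is exactly this kind of planning: for a given parameter count, it tells you how much data (and therefore compute) a compute-optimal training run should budget.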

The Shift to Inference Time Compute (Phase 2)

  • The second phase focused on inference-time (or test-time) compute scaling, particularly evident in 2023-2024 with the rise of reasoning models.
  • Inference Time Compute: Refers to the computational resources a model uses after initial training, when it's actively solving a problem or responding to a query.
  • Researchers discovered that allowing models more time and compute resources to "think" during inference significantly improved performance, even enabling smaller models to outperform larger ones by thinking longer (a toy illustration follows this list). Google DeepMind's paper "Scaling LLM Test-Time Compute" was pivotal here.
  • Sam notes Jensen Huang's (NVIDIA CEO) comment about inference potentially being "a billion times bigger than pre-training," marking a paradigm shift where runtime compute became a new lever for performance.
  • Actionable Takeaway: The rise of inference-time compute signals a shift in resource demand. Investors should track innovations in inference optimization, hardware, and models designed for complex, multi-step reasoning, as this is becoming a major compute consumption area.
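
One simple way to picture inference-time scaling is best-of-N sampling with majority voting (self-consistency). The sketch below is illustrative only; `generate_answer` is a stand-in for whatever model and decoding strategy is actually used:

```python
# Toy illustration of spending more compute at inference time: sample several
# answers from the same stochastic model and return the majority vote.
from collections import Counter
import random

def generate_answer(question: str) -> str:
    # Placeholder "model" that is right ~60% of the time on this question.
    return "144" if random.random() < 0.6 else str(random.randint(0, 99))

def answer_with_more_thinking(question: str, n_samples: int = 16) -> str:
    """More samples = more inference-time compute for the same fixed model."""
    votes = Counter(generate_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# With n_samples=1 the toy model is right ~60% of the time; with n_samples=16
# the majority vote is right far more often, without changing the model at all.
print(answer_with_more_thinking("What is 12 * 12?"))
```

This is only one flavor of test-time scaling; longer chains of thought, search, and verifier-guided sampling are others, but the underlying lever is the same: more compute per query.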

The Reinforcement Learning Renaissance (Phase 3)

  • Sam identifies the current phase as the RL scaling renaissance, catalyzed significantly by DeepSeek's work.
  • Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to achieve a goal, receiving rewards or penalties for its actions. In LLMs, this often involves optimizing responses based on feedback.
  • While RL, particularly RLHF (Reinforcement Learning from Human Feedback), wasn't new, DeepSeek demonstrated an innovative and elegant application using GRPO (Group Relative Policy Optimization), a specific RL algorithm (a minimal sketch of its group-relative advantage follows below).
  • The key breakthrough was using RL to elicit powerful reasoning capabilities from a base model with limited human intervention, showcasing a path to scale model intelligence more autonomously.
  • Investor Note: The surge in diverse RL algorithms (like GRPO) post-DeepSeek indicates rapid experimentation. Tracking which RL techniques gain traction is crucial for understanding future model development pathways and potential efficiency gains.
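
The heart of GRPO is its group-relative advantage: sample a group of completions per prompt, score them, and normalize each reward against the group. Below is a rough sketch of just that step; real implementations wrap it in a clipped policy-gradient objective with a KL penalty against a reference model:

```python
# Group-relative advantages in the spirit of GRPO: each sampled completion's
# reward is normalized against the mean and std of its own group, so no
# separate learned value model is needed.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 completions for one math prompt, scored 1/0 by a verifier.
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
print(group_relative_advantages(rewards))
# Correct completions get positive advantages, incorrect ones negative,
# with no human preference labels in the loop.
```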

Deep Dive: The DeepSeek Moment (V3, R10, R1)

  • Sam clarifies the DeepSeek releases: V3 (a powerful 671B-parameter sparse base model, a Mixture of Experts (MoE) model where different parts specialize), followed by R1-Zero and R1 (reasoning models derived from V3 via RL).
  • The R1-Zero training process was particularly notable:
    • It fed the V3 base model verifiable problems (math, coding).
    • Provided simple binary rewards (1 for correct, 0 for incorrect) without explicit instructions on how to solve them.
    • Crucially, this involved minimal human-generated preference data. "With R1-Zero, it was literally just, here's a bunch of problems. Start thinking, figure it out," Sam explains.
  • Through this trial-and-error process with binary rewards, the model learned to reason more effectively and "think longer" to achieve correct answers (a toy verifier sketch follows below).
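
A hedged sketch of what such a binary reward can look like for verifiable problems. The answer-extraction convention here ("Answer: <value>") is hypothetical; real setups often check a \boxed{...} expression or run unit tests for code:

```python
# Toy binary reward for verifiable tasks: 1.0 if the extracted final answer
# matches the known ground truth, 0.0 otherwise. No instructions on *how* to
# solve the problem are ever given to the model.
import re

def extract_final_answer(completion: str) -> str | None:
    """Assumes the completion ends with 'Answer: <value>' (an illustrative convention)."""
    match = re.search(r"Answer:\s*(.+)\s*$", completion.strip())
    return match.group(1).strip() if match else None

def binary_reward(completion: str, ground_truth: str) -> float:
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

print(binary_reward("12 * 12 = 144. Answer: 144", "144"))    # 1.0
print(binary_reward("I think it's 150. Answer: 150", "144"))  # 0.0
```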

The Power and Quirks of RL-Generated Reasoning

  • The R1-Zero model became highly performant at problem-solving but wasn't "human-legible." Its reasoning traces (the step-by-step thought process) might mix languages or lack proper syntax.
  • Sam describes R1-Zero as a "wild child," super smart but not conditioned for typical human interaction, highlighting a trade-off between raw capability and human alignment/preference.
  • The model learned that longer thinking often correlated with correct answers, leading to progressively longer response lengths during training – Sam references a chart showing this and the researchers' "aha moment" where the model explicitly noted the need to think longer.
  • Research Consideration: This raises questions about the potential limitations imposed by human preference data (like in RLHF). Does optimizing for human legibility inhibit a model's raw problem-solving potential or creativity?

The Role of Human Data vs. Model Exploration (AlphaGo Analogy)

  • Sam draws a parallel to AlphaGo's development. Initially trained with human game examples, it performed well. However, a later version (AlphaZero), trained only on rules and self-play (pure RL without human examples), became even better, discovering novel strategies (like the famous "Move 37") humans wouldn't conceive of.
  • This suggests that excessive reliance on human data might constrain a model's exploratory capabilities and ultimate performance ceiling.
  • Strategic Question: Should the focus be purely on verifiable outcomes (like R1-Zero's binary rewards) rather than mimicking human thought processes, especially for complex, non-intuitive problem domains? This has implications for data strategies in AI development.

The Importance of Reasoning Traces

  • Tommy raises the point that reasoning traces – the model's step-by-step "thinking" process – seem increasingly valuable, potentially more so than just raw data.
  • Reasoning Trace: The sequence of intermediate steps or thoughts generated by a model as it works towards a final answer. DeepSeek made these visible, showing the model strategizing ("I should solve it this way...").
  • Sam agrees on their importance but notes the current reliance on high-quality initial question-answer pairs to kickstart the RL process, even in systems aiming for synthetic data generation.
  • A key challenge is moving beyond easily verifiable domains (math, code) to apply RL effectively in creative or subjective areas where defining "correctness" is harder. Generalizability across domains is another under-researched area.
  • Insight: The generation, verification, and utilization of reasoning traces represent a new, high-value data frontier. Infrastructure and methods for handling this data type are becoming critical (an illustrative trace record follows below).
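
As a thought experiment, a verified reasoning trace could be stored as a record like the one below. The schema is entirely illustrative and not drawn from any existing project:

```python
# Hypothetical record for a verified reasoning trace, the kind of artifact a
# decentralized "gym" could contribute back to a shared training corpus.
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    domain: str            # e.g. "math", "code", "medicine", "finance"
    problem: str           # the task the model was given
    chain_of_thought: str  # the step-by-step reasoning string the model produced
    final_answer: str
    reward: float          # e.g. a binary verifier score
    verified: bool         # did an (ideally decentralized) verifier accept it?
    model_id: str = "unknown"
    metadata: dict = field(default_factory=dict)

trace = ReasoningTrace(
    domain="math",
    problem="What is 12 * 12?",
    chain_of_thought="12 * 12 = 12 * 10 + 12 * 2 = 120 + 24 = 144",
    final_answer="144",
    reward=1.0,
    verified=True,
)
```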

Conceptualizing the Decentralized RL Network (Foundation, Gym, Refinery)

  • Sam proposes a three-part mental model for decentralizing the RL process:
    • The Foundation: A performant base model, ideally developed via decentralized pre-training (referencing projects like Nous Research, Prime Intellect, and Gensyn).
    • The Gym: An environment or platform to generate diverse, high-quality reasoning data (traces) across various domains and cognitive strategies.
    • The Refinery: A network to perform the actual RL optimization (post-training) using the data from the gym to improve models.
  • Crypto AI Relevance: This framework directly maps to potential decentralized infrastructure plays – networks for compute (Foundation, Refinery) and specialized platforms for data generation/verification (Gym). A loose sketch of the flow follows below.
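
As a loose sketch (every function below is a placeholder, not any project's actual API), one iteration of the Foundation -> Gym -> Refinery loop might look like this:

```python
# Illustrative-only sketch of the three-part flow: a base model generates traces
# in many environments (Gym), verified traces are kept, and an RL post-training
# step (Refinery) produces an improved model.

def load_base_model():
    """Foundation: a performant base model, ideally from decentralized pre-training."""
    return {"name": "base-model", "version": 0}

def run_gym(model, environments):
    """Gym: run the model through domain environments, keeping only verified traces."""
    traces = []
    for env in environments:
        trace = env["run"](model)     # the model attempts the environment's tasks
        if env["verify"](trace):      # robust (possibly decentralized) verification
            traces.append(trace)
    return traces

def refine(model, traces):
    """Refinery: RL post-training on verified traces yields an improved model."""
    return {**model, "version": model["version"] + 1, "trained_on": len(traces)}

model = load_base_model()
environments = []  # contributed per domain: math, medicine, finance, ...
model = refine(model, run_gym(model, environments))
```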

The "World's Gym": Generating Diverse Reasoning Data

  • The "Gym" concept builds on existing ideas like OpenAI Gym or Carla (for self-driving). Sam envisions open, decentralized platforms where:
    • Anyone can create specialized "environments" (e.g., for medicine, finance, physics).
    • Users bring models to these environments to generate reasoning traces by attempting tasks.
    • Robust, potentially decentralized, verification mechanisms assess the quality/correctness of these traces.
    • This verified data corpus feeds back into model training (the "Refinery").
  • Contribution isn't about manually writing traces, but about creating environments, running models, and participating in verification.
  • Opportunity: Platforms facilitating the creation of diverse RL environments and verifiable reasoning trace generation could become crucial infrastructure, potentially leveraging crypto-economic incentives (a toy environment/verifier sketch follows below).
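
To make the contribution model concrete, here is a hypothetical sketch of a single domain environment plus a trivial verifier; the interface is invented for illustration:

```python
# A toy arithmetic "environment": it poses problems with known answers, runs a
# model's solve function over them, and marks each resulting trace as verified
# or not. Real gyms would use richer tasks and more robust verification.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MathEnvironment:
    problems: list[tuple[str, str]]  # (question, ground-truth answer)

    def run(self, solve: Callable[[str], tuple[str, str]]) -> list[dict]:
        traces = []
        for question, truth in self.problems:
            chain_of_thought, answer = solve(question)   # the model's attempt
            traces.append({
                "question": question,
                "trace": chain_of_thought,
                "answer": answer,
                "verified": answer == truth,             # simple exact-match verifier
            })
        return traces

env = MathEnvironment(problems=[("What is 12 * 12?", "144")])
traces = env.run(lambda q: ("12 * 12 = 144", "144"))
print(traces[0]["verified"])  # True -> eligible to join the shared trace corpus
```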

Why Decentralize RL? Open vs. Closed Innovation

  • Sam argues for decentralization based on the principle of "innovation at the edge" (borrowing from USV).
  • He believes an open platform allowing massive experimentation across countless domains by diverse contributors is more likely to discover the absolute best reasoning strategies than closed, centralized labs.
  • His analogy: A closed lab is like a single genius in a room with tutors; an open network is like a global school where the best minds collaborate and share diverse approaches freely.
  • Core Argument: Decentralization fosters broader exploration and cross-pollination of ideas, essential for the trial-and-error nature of RL, potentially leading to faster and more diverse breakthroughs.

Feeding RL Insights Back: The Future of Model Training (Modular Models)

  • How does the valuable data from the "Gym" improve models globally?
    • One vision: A continuously improving "world model" fed by decentralized RL data streams.
    • Sam's current interest: Highly Modular Sparse Models (like MoE). The idea is to have specialized "expert" sub-models within a larger architecture that can be independently trained, improved, and potentially swapped or combined ("Lego blocks").
  • This modularity could allow specialized teams to develop best-in-class experts for niche domains (e.g., a specific coding language, medical diagnosis) that can then be plugged into broader models. Gensyn's HDEE paper explores training sub-experts in parallel on heterogeneous hardware (a toy sketch of expert swapping follows after this list).
  • Future Trend: Modular architectures could decentralize not just training data generation but also model development itself, creating markets for specialized AI components. This contrasts with monolithic model training.
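
A toy sketch of the "Lego block" idea under stated assumptions: routing here is a trivial keyword gate and the experts are plain functions, whereas real MoE models learn the router and experts jointly or train them in parallel:

```python
# Modular experts as swappable components: replacing one expert does not require
# retraining the rest of the system.
from typing import Callable

ExpertFn = Callable[[str], str]

experts: dict[str, ExpertFn] = {
    "code": lambda prompt: f"[code expert] {prompt}",
    "medicine": lambda prompt: f"[medical expert] {prompt}",
}

def route(prompt: str) -> str:
    """Toy keyword-based gate; learned routers replace this in real MoE models."""
    if "bug" in prompt or "function" in prompt:
        return experts["code"](prompt)
    if "diagnosis" in prompt:
        return experts["medicine"](prompt)
    return f"[generalist] {prompt}"

# Swapping in a better, independently trained expert is just replacing one entry.
experts["code"] = lambda prompt: f"[improved code expert v2] {prompt}"
print(route("fix this bug in my function"))
```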

Open Source vs. Proprietary AI: An Evolving View

  • Sam expresses a measured view: "Distributed AI is inevitable," citing frontier labs already using distributed training. Decentralized AI (trustless, heterogeneous compute) is the next step, particularly suited for RL's parallelizable nature.
  • He acknowledges the success of projects like Prime Intellect and Gensyn in demonstrating decentralized RL's efficacy.
  • However, he questions the necessity of tokens/speculative incentives for all aspects, while seeing potential in models like Pluralis's (sharding models, rewarding compute based on usage).
  • The big uncertainty remains: Can open, decentralized efforts truly compete with the performance and resources of frontier labs like OpenAI? He hopes the diversity of open contribution provides an edge.
  • Investor Consideration: The debate continues. While decentralized approaches show promise, the scale and integration capabilities of proprietary models (like OpenAI's) present significant competitive hurdles and lock-in risks.

The Lock-In Problem and Model Swapping Challenges

  • Contrary to early beliefs that models would be easily swappable commodities, Sam argues that providers want lock-in.
  • Features like OpenAI's Memory make models deeply personalized and harder to switch away from. Prompt engineering and workflows often become highly tuned to a specific model's behavior.
  • "They want to lock you in and they want to make it as hard as possible for you to swap your model," Sam states, predicting this trend will intensify.
  • Risk Factor: Investors in applications built on specific proprietary models should be aware of platform risk and the increasing difficulty/cost of migrating to alternatives.

Perspectives on AGI and Current Model Capabilities

  • Sam shares his less-than-magical experience with GPT-4o on a specific web-scraping task, encountering bugs and loops, suggesting current models are still far from flawless or true AGI (Artificial General Intelligence).
  • However, he sees the potential for models learning to use tools and autonomously string together actions (as seen with GPT-4o's direction) as the path toward more "real-world performant model behavior" that approaches AGI territory.
  • He finds the concept of models developing reasoning traces and solving problems autonomously (even imperfectly) more indicative of progress towards AGI than simple next-token prediction. The "Arabesque" paper on letting models learn from world interaction is relevant here.

Investment Focus: Decentralized AI Landscape

  • Symbolic Capital invests exclusively in Crypto AI / Decentralized AI.
  • Sam highlights several impressive teams pushing the frontier: Pluralis, Prime Intellect, Gensyn, Nous Research, Ambient, Exo.
  • He also mentions non-crypto players like SF Compute (financializing GPU compute) and teams working on bringing similar concepts on-chain or creating developer-friendly distributed compute platforms.
  • He finds appeal in "middle-ground" approaches (distributed but not fully decentralized, using stablecoins for settlement) as they offer immediate utility, while acknowledging the high-risk, high-reward potential of fully decentralized visions like Pluralis.
  • Area of Interest: Sam is particularly focused on modular model architectures and anyone working on training or composing specialized "expert" sub-models.

Final Thoughts: RL is Not Dead & Base Model Importance

  • Sam addresses recent online discourse questioning RL's value, stemming from research suggesting base models can eventually reach similar answers to reasoning models if given enough attempts.
  • His counter: RL teaches models efficient reasoning behavior that gets the right answer faster and more reliably on the first try, which is crucial in practice (see the pass@k sketch after this list). "RL is definitely not dead. RL got us [o3]," he emphasizes.
  • Crucially, Sam stresses that the RL renaissance does not mean the death of pre-training. Quoting Dario Amodei (Anthropic), he notes that RL works best on already capable base models. Better pre-training leads to better RL outcomes.
  • Key Takeaway: Both strong base models (via pre-training) and effective RL (for reasoning) are complementary and essential for advancing AI capabilities.
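
This debate is often framed in terms of pass@k. Below is a small sketch using the standard unbiased estimator (as popularized by the Codex paper); the numbers are purely illustrative. A base model may nearly match a reasoning model once you allow many attempts, but RL is about being right on the first try.

```python
# pass@k estimator: probability that at least one of k sampled completions is
# correct, given n samples with c correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative-only numbers: a base model solves 5/100 samples, an RL-tuned model 60/100.
print(round(pass_at_k(100, 5, 1), 3))    # base pass@1  ~ 0.05
print(round(pass_at_k(100, 5, 64), 3))   # base pass@64 ~ 0.995 (it gets there eventually)
print(round(pass_at_k(100, 60, 1), 3))   # RL model pass@1 ~ 0.6 (right the first time)
```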

This episode highlights Reinforcement Learning as the pivotal next phase in AI scaling, driven by reasoning capabilities. Crypto AI investors and researchers must track RL advancements, decentralized data generation (like RL gyms), and modular architectures to identify emerging infrastructure needs and strategic investment opportunities.
