Latent Space
July 31, 2025

The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)

Nathan Lambert of AI2 and the Interconnects.ai newsletter returns to unpack the latest shifts in AI training, from the rise of verifiable rewards to the new frontier of agentic planning. He breaks down how the open-source world is racing to distill the complex recipes of frontier labs into something anyone can use.

The RLVR Revolution

  • "Everyone just does RL on the outputs, and that's how we got the RLVR idea and scaled it into something that is a general method."
  • "RLVR is going to be changing so much in the next 18 months... whereas RLHF is a more interdisciplinary [field]. In the same way that Chatbot Arena can never be saturated, RLHF can never be solved."
  • Reinforcement Learning from Verifiable Rewards (RLVR) is the new paradigm for training models on tasks with clear right/wrong answers, like math and code. It evolved from industry practice and aims to create reproducible, state-of-the-art post-training recipes; a minimal sketch of what such a reward check looks like follows this list.
  • Unlike the subjectivity of RLHF (RL from Human Feedback), which deals with perpetual questions of preference, RLVR focuses on problems that can be "solved." Once best practices are established, the academic frenzy may die down.
  • The name itself was a strategic move, evolving from "RL from Ground Truths" to the more general and memorable "RLVR," signaling its ambition to be the successor to RLHF.
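
To make the definition above concrete, here is a minimal sketch of a verifiable reward for math-style problems, assuming the model flags its final answer with \boxed{...}: a deterministic check against a known reference that returns a binary reward an RL algorithm (PPO, GRPO, etc.) can optimize. This is an illustration, not the Tulu implementation.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a chain-of-thought completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

print(verifiable_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... I think it's \boxed{41}", "42"))             # 0.0
```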

Search, Tools, and the o3 Anomaly

  • "It's very easy to get the model to do tools if you prompt it to, but it's very hard to get the RL model to learn that the tool is useful."
  • OpenAI’s o3 represents a paradigm shift, relying heavily on search and making dozens of tool calls for a single query. This "always-on" approach is a departure from previous models and hints at the future of agentic AI.
  • The main challenge isn't just giving a model tools; it's training it via RL to develop an emergent understanding of when and why a tool is useful—a behavior that’s difficult to instill with supervised fine-tuning alone.
  • This search-first strategy creates a fork in the road for model development: the o3 path of a single, unified reasoning model versus the "hybrid reasoning" approach of models like Claude and Gemini, which can toggle extended reasoning on and off.

A New Taxonomy for Agent Skills

  • "The North Star for most people working on reasoning... is the model will just spend the right amount of tokens on it."
  • As models evolve into agents, we need a new framework for their capabilities. Lambert proposes a four-part taxonomy beyond today’s benchmarks.
  • It starts with foundational Skills (e.g., acing math tests), but quickly moves to higher-order abilities: Strategy (high-level direction), Abstraction (breaking down problems), and finally, Calibration (knowing when to stop thinking to avoid wasting compute).
  • This isn't just about getting the right answer; it's about economic reality. Frontier labs need models that don't "overthink" and burn costly GPUs on simple queries, a problem that training for better calibration can solve.
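
One way to read the calibration point as a training problem is reward shaping: keep correctness primary, but charge a small penalty for tokens spent past a budget so the policy learns not to overthink easy queries. The sketch below is hypothetical; the budget and penalty scale are made-up hyperparameters, not any lab's recipe.

```python
def calibrated_reward(is_correct: bool, tokens_used: int,
                      token_budget: int = 2048, penalty_per_token: float = 1e-4) -> float:
    """Correctness dominates; thinking beyond the budget is mildly penalized."""
    base = 1.0 if is_correct else 0.0
    overflow = max(0, tokens_used - token_budget)
    return base - penalty_per_token * overflow

print(calibrated_reward(True, tokens_used=1_000))   # 1.0  (under budget, no penalty)
print(calibrated_reward(True, tokens_used=10_000))  # ~0.20 (correct, but heavily "overthought")
```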

Key Takeaways:

  • Training models is moving beyond subjective feedback (RLHF) toward objective, verifiable outcomes (RLVR), especially for coding and math. However, the real frontier is building agents that don't just execute tasks but can strategize, break down problems, and manage their own computational resources efficiently. The playbook is shifting from pure model performance to agentic intelligence.
  • RLVR is the New SOTA for Solvable Problems: For tasks with clear right answers (code, math), RLVR is the state-of-the-art training method. The community is focused on scaling it, while RLHF remains the domain of fuzzy, human-preference problems.
  • The Future is Search-Driven: o3’s heavy reliance on search is not a bug; it’s a feature. The hardest problem is no longer giving models tools, but training them to learn when to use them.
  • Agents Need More Than Skills: The next leap in AI requires training for strategy, abstraction, and calibration. The goal is an AI that doesn’t just answer questions but efficiently plans its own work without wasting compute.

For further insights and detailed discussions, watch the full podcast: Link

This episode unpacks the RLVR revolution, revealing how verifiable rewards are moving AI from simple instruction-following to complex, multi-step reasoning, and what this means for the open-source community's race against frontier labs.

The Genesis of Tulu and the RLVR Framework

  • The conversation begins with Nathan Lambert, a researcher at AI2 and author of Interconnects.ai, detailing the origins of the Tulu project. The primary goal was to distill and democratize the complex post-training recipes used by frontier labs, making state-of-the-art techniques accessible to the broader research community.
  • Democratizing Post-Training: Tulu aimed to create a tractable post-training recipe that could match or beat models trained with closed recipes, such as Llama 3.1 Instruct, on core evaluations. This involved scaling up preference data beyond standard academic datasets like UltraFeedback, which had become a bottleneck.
  • The RLVR Origin Story: The concept for RLVR (Reinforcement Learning from Verifiable Rewards) emerged from a pivotal insight from OpenAI's John Schulman. Nathan explains, "He was like, 'Oh yeah, everyone just does RL on the outputs,' and that's how we got the RLVR idea." This confirmed that verifying model outputs against a ground truth was a key industry practice.
  • Defining RLVR: RLVR (Reinforcement Learning from Verifiable Rewards) is a training method where a language model is rewarded for producing outputs that can be checked for correctness by a deterministic function or oracle. This is particularly effective for domains like math and code, where there is a clear "right" answer, moving beyond the subjective nature of RLHF (Reinforcement Learning from Human Feedback).
  • Strategic Naming: The team, including Costa Huang and Hamish Ivison, intentionally chose "Verifiable Rewards" over "Ground Truths" because it is the more general concept. While a math problem has a ground truth, tasks like code execution or precise instruction-following are verifiable without a single correct answer.
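
Precise instruction following is a good illustration of "verifiable without a single correct answer": many completions are acceptable, yet a deterministic checker can still score them. A hypothetical constraint verifier (the specific constraints are invented for illustration):

```python
def verify_constraints(completion: str, required_bullets: int = 3, max_words: int = 100) -> float:
    """Reward 1.0 if the response obeys the prompt's formatting constraints.

    There is no single ground-truth answer: any completion with exactly
    `required_bullets` bullet points and at most `max_words` words passes.
    """
    lines = [line.strip() for line in completion.splitlines()]
    bullets = sum(1 for line in lines if line.startswith(("-", "*", "•")))
    return 1.0 if bullets == required_bullets and len(completion.split()) <= max_words else 0.0

response = "- RLVR rewards checkable outputs\n- It suits math and code\n- It extends beyond ground truths"
print(verify_constraints(response))  # 1.0
```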

Evolving RLVR for Agentic AI

  • Nathan explains that the initial, simple diagram of RLVR—checking a single string output—is already becoming outdated. The next frontier involves training models for more complex, agentic behaviors that require interaction with an environment.
  • Multi-Hop Tool Use: The framework must now account for multi-step tasks, such as a model using a search tool. The model's next action depends on the feedback from the environment (e.g., search results from Bing), creating a dynamic loop that is much more complex than single-output verification; see the rollout sketch after this list.
  • The Challenge of Non-Verifiable Tasks: A key bottleneck for the open community is applying RL to tasks that are not easily verifiable, such as summarizing long contexts or performing soft information extraction. Frontier labs leverage massive user data to identify and train for these "long-tail" behaviors.
  • Data as a Moat: Nathan emphasizes that access to real-world user data gives labs like OpenAI a significant advantage. He notes, "It's mostly looking at real-world data at this point." This allows them to discover and fix subtle model failures that aren't captured by public benchmarks.
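
The rollout sketch referenced above, with hypothetical `policy` and `search_tool` interfaces (these names are placeholders, not a real API): the trajectory interleaves model actions and environment observations, and the verifiable reward only arrives at the end.

```python
def verify(final_answer: str, ground_truth: str) -> float:
    """Placeholder verifier: crude substring match against the reference answer."""
    return 1.0 if ground_truth.lower() in final_answer.lower() else 0.0

def rollout_with_search(policy, search_tool, question: str, ground_truth: str, max_hops: int = 5):
    """Multi-hop rollout: the model may call a search tool several times before answering."""
    trajectory = [{"role": "user", "content": question}]
    reward = 0.0
    for _ in range(max_hops):
        # Assumed to return {"type": "search" | "answer", "content": str}.
        action = policy.generate(trajectory)
        trajectory.append({"role": "assistant", "content": action["content"]})
        if action["type"] == "answer":
            reward = verify(action["content"], ground_truth)  # reward only on the final answer
            break
        # Environment feedback (e.g., search snippets) becomes part of the next context.
        trajectory.append({"role": "tool", "content": search_tool.query(action["content"])})
    return trajectory, reward
```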

The Future of LLM Evaluation: Are Arenas Cooked?

  • The discussion shifts to the role and future of public leaderboards like LMSys's Chatbot Arena. Despite some cynicism, Nathan argues they remain highly valuable for the ecosystem.
  • Value in the "Compression Race": Chatbot Arena is crucial for tracking the "compression race"—the effort to find the cheapest model that delivers high-quality conversational performance. This is a key focus for many developers and companies.
  • A Community Focusing Function: The leaderboard serves as a common ground for the entire community, from academia to industry, to track progress. Nathan states, "Having this idea of an ELO linking models that you cannot saturate... it's a great problem." (A toy Elo update is sketched after this list.)
  • Emerging Competitors: New platforms like Yupp are introducing novel evaluation categories, such as "vibes," where models like GPT-4.5 excel, highlighting the multi-faceted nature of model quality beyond pure task performance.
  • The Multi-Turn Challenge: A significant limitation of current arenas is their focus on single-turn interactions. Creating robust evaluations for multi-turn conversations, where a model's state and context evolve, remains an open and critical challenge.
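
For readers who have not looked under the hood of arena-style leaderboards, the core mechanism is a rating system fit to pairwise human votes. A toy sequential Elo update is shown below; Chatbot Arena itself fits a Bradley-Terry-style model over all battles, so treat this as intuition rather than their exact method.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One pairwise 'battle': the winner takes points from the loser, scaled by surprise."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# An upset (the lower-rated model wins) moves ratings more than an expected result.
print(elo_update(1200.0, 1000.0, a_wins=False))  # favourite drops ~24 points
```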

The RLVR Frontier: Hybrid Models, Search, and the o3 Anomaly

  • Nathan outlines the current landscape of advanced reasoning models, highlighting two divergent paths and the unique strategy employed by OpenAI's o3.
  • Two Paths for Reasoning:
    1. Unified Reasoning Models: A single, powerful model is trained end-to-end on reasoning tasks (e.g., DeepSeek R1).
    2. Hybrid Reasoning Models: A single model family that can switch a more deliberate extended-thinking mode on and off, as seen in recent Claude and Gemini releases.
  • O3's Search-Heavy Strategy: Nathan points out that o3 appears to rely heavily on its search tool, often querying dozens of websites for a single request. This is a fundamentally different approach that treats external knowledge retrieval as a core part of the reasoning process.
  • Strategic Implication: This raises a key question for researchers: will all future models need an integrated search engine? While it degrades performance on simple, knowledge-based QA benchmarks (when tools are disabled), it makes the model far more robust for long-tail information retrieval. As Nathan puts it, "You need some baseline intelligence to make all of this work."

A Taxonomy for Building Agentic AI

  • To move beyond simple reasoning, Nathan proposes a taxonomy of four key capabilities that need to be trained into next-generation agentic models. This provides a clear roadmap for researchers.
  • 1. Skills: The foundational ability to execute tasks reliably, often demonstrated by inference-time scaling (where performance improves as the model is given more time/compute at inference). This is what models like DeepSeek R1 have achieved.
  • 2. Strategy: The model's ability to form a high-level plan or direction to tackle a complex problem. This is about choosing the right sequence of steps.
  • 3. Abstraction: The capacity to break down a large, complex task into smaller, solvable sub-problems. This is crucial for managing complex workflows and avoiding dead ends; a control-flow sketch follows this list.
  • 4. Calibration: The model's ability to know when to stop, avoid "overthinking," and not waste compute on a problem it cannot solve. This is critical for efficiency and user experience, as it prevents models from getting stuck in loops.
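
Abstraction in particular maps naturally onto code: a planner that recursively splits a task into sub-tasks and only executes the pieces it believes are directly solvable. The sketch below is purely illustrative; `is_atomic`, `plan`, and `solve` are placeholder callables standing in for model calls.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    subtasks: list["Task"] = field(default_factory=list)

def decompose_and_solve(task: Task, is_atomic, plan, solve, depth: int = 0, max_depth: int = 3) -> str:
    """Recursively break a task into sub-problems, solving leaves directly."""
    if depth >= max_depth or is_atomic(task):
        return solve(task)                             # Skill: execute a directly solvable step
    task.subtasks = [Task(d) for d in plan(task)]      # Strategy + Abstraction: pick and split steps
    return "\n".join(decompose_and_solve(t, is_atomic, plan, solve, depth + 1, max_depth)
                     for t in task.subtasks)           # Calibration would bound this recursion
```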

Overoptimization and the Nuances of Parallelism

  • The conversation explores two advanced topics: the use of parallel compute at inference and the persistent problem of reward hacking.
  • Parallelism for Robustness, Not Discovery: Models like o3 Pro run multiple generation attempts in parallel and select the best one. Nathan argues this is currently more about improving robustness and consistency than a transformative method for discovering novel solutions. He suggests its value is capped by the quality of the verifier.
  • The Evolution of Overoptimization: Nathan breaks down how models exploit reward functions:
    • Control: Agents in simulators learn unphysical behaviors to "glitch" the reward (e.g., a simulated robot cartwheeling instead of running).
    • RLHF: Models generate repetitive, nonsensical text (e.g., "JavaScript JavaScript...") because it maximizes a flawed reward model's score.
    • RLVR: Models learn to "cheat" the verifier. For example, a code-generating model might simply insert a pass statement to make a unit test succeed without solving the actual problem.
  • Actionable Insight: For researchers, this highlights the need for sophisticated reward design. This involves creating reward functions that grant partial credit or penalize "cheating," which is essential for training models on complex tasks like code generation.
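
A concrete, hypothetical illustration of that insight for code generation: score candidates against held-out tests, grant partial credit per passing test, and explicitly zero out trivial stubs, rather than handing out a binary pass/fail that a bare `pass` can sometimes satisfy. The `solution` function name and the checks below are assumptions for the sketch, not a hardened verifier.

```python
import ast

def code_reward(candidate_src: str, hidden_tests: list) -> float:
    """Partial-credit reward over held-out tests, with a crude anti-cheating check.

    `hidden_tests` is a list of (args, expected) pairs for a function named `solution`.
    """
    tree = ast.parse(candidate_src)
    bodies = [node.body for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]
    # Zero reward for trivial stubs (e.g., a function whose body is just `pass`).
    if any(len(body) == 1 and isinstance(body[0], ast.Pass) for body in bodies):
        return 0.0
    namespace: dict = {}
    exec(candidate_src, namespace)            # in practice, run untrusted code in a sandbox
    solution = namespace.get("solution")
    if solution is None:
        return 0.0
    passed = 0
    for args, expected in hidden_tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(hidden_tests)         # partial credit instead of all-or-nothing

print(code_reward("def solution(x):\n    return x * 2", [((2,), 4), ((3,), 6)]))  # 1.0
print(code_reward("def solution(x):\n    pass", [((2,), 4)]))                     # 0.0
```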

Future Directions for Open-Source AI

  • Nathan concludes by outlining promising and under-explored research areas where the open-source community can make a significant impact.
  • Character and Personality Training: Moving beyond raw capability to control a model's personality and style. This connects to the OpenAI Model Spec, a document outlining desired model behaviors, which Nathan sees as more useful than a "constitution" because it defines explicit goals.
  • Advanced Model Routing: Creating systems that can intelligently route a user's query to a collection of specialized open-source models. This could be a "moonshot" idea for a platform like Hugging Face to pursue; a toy router is sketched after this list.
  • The "American DeepSeek": Nathan's long-term ambition is to build a fully open, GPT-4-level model. This requires immense resources but also a clear, incremental path: scaling up dense models, transitioning to sparse architectures, and integrating large-scale reasoning capabilities.

Conclusion

This discussion reveals that RLVR is a critical step toward more capable AI, but true agentic intelligence requires a deeper focus on planning, strategy, and abstraction. For investors and researchers, the key takeaway is that the next wave of innovation will come from solving these complex agentic challenges, creating opportunities for open-source projects to lead in novel evaluation, data, and architectural design.
