The TWIML AI Podcast with Sam Charrington
December 17, 2025

Rethinking Pre-Training for Agentic AI [Aakanksha Chowdhery] - 759

Aakanksha Chowdhery, a member of technical staff at Reflection and a veteran of Google's PaLM and Gemini teams, argues that truly agentic AI requires a fundamental shift in pre-training. Current methods, optimized for static benchmarks, are hitting a wall. The next generation of models needs core capabilities like multi-step reasoning, planning, and error recovery baked in from the start.

Identify the "One Big Thing":

  • The core argument is that to achieve truly agentic AI, the industry must fundamentally rethink and re-architect pre-training, moving beyond static benchmarks and next-token prediction to embed multi-step reasoning, planning, and tool-use capabilities directly into the base models. Current post-training methods are hitting a ceiling.

Extract Themes:

1. The Pre-training Bottleneck for Agentic AI:

  • “For the longest time, we were measuring pre-training on static benchmarks. If you want these models to be useful as agents, they need to be able to interact with environments. And when we start caring about those agentic tasks, pre-training needs to rethink from fundamentals. This is not just a post-training problem.”
  • “What is especially important about agentic tasks like coding tasks is that if you're trying to read a codebase and then execute some parts of it or write about it, you will accumulate context over time... What you need a model to be able to do is have planning as a capability, have ability to reason over its context length, which might be very long.”

2. Rethinking the Pre-training Recipe (Architecture, Data, Loss):

  • “When I go and talk about rethinking pre-training, I always think about it in terms of what are the capabilities we want out of the models. And to me, architecture is just one part of the equation. A big part of the equation is what is the loss objective and what is the training data.”
  • “We have actually trained these models to probabilistically predict the next token. What is the most likely next token? And out of that we have managed to get these models to be extremely strong reasoners. But is that the best objective one could go after?”

3. The Challenge of Measurement and Emergent Capabilities:

  • “The kind of benchmarks we need for measuring this kind of intelligence is sometimes not available today.”
  • “Typically, model scaling is one way in which we notice model capabilities that we haven't noticed before and often the model capabilities that become available at scale then also with the right training data and the right set of things and training longer and whatnot become available in smaller models as well.”

Synthesize Insights:

Theme 1: The Pre-training Bottleneck for Agentic AI

  • Static Benchmarks Limit Progress: Traditional LLM pre-training optimizes for static benchmarks (e.g., GLUE, GSM8K) that measure isolated language understanding or generation. This approach fails to prepare models for dynamic, interactive agentic tasks.
  • Agents Need Dynamic Interaction: Agentic AI requires models to interact with environments, accumulate context, plan multi-step actions, and learn from execution feedback (a minimal loop is sketched after this list). Analogy: training a chess player only on static board positions instead of letting them play games and learn from their moves.
  • Post-training is Insufficient: Current methods like RAG or fine-tuning are "tacking on" capabilities. They hit a ceiling because the foundational pre-training doesn't embed core agentic intelligence, leading to context engineering problems and limited long-term reasoning.
  • Coding Agents as a Bellwether: Coding agents (e.g., refactoring large codebases) and deep research agents (e.g., finding and synthesizing multiple articles) highlight the need for models to handle long context, plan, and correct errors over multiple steps.
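
The gap between static evaluation and environment interaction can be made concrete. Below is a minimal sketch of the agent-environment loop that static benchmarks never exercise; the `model` and `env` interfaces are hypothetical stand-ins, not any specific framework's API.

```python
def run_agent(model, env, max_steps=20):
    """Minimal agent loop: act, observe, and accumulate context each step."""
    context = [env.reset()]            # initial observation, e.g., a task description
    for _ in range(max_steps):
        action = model.act(context)    # plan the next step given the full history
        observation, done = env.step(action)   # execute and get feedback
        context.extend([action, observation])  # context grows with every interaction
        if done:
            break
    return context
```

Everything Chowdhery flags as missing lives inside this loop: planning the next action, reasoning over the ever-growing context, and recovering when a step fails.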

Theme 2: Rethinking the Pre-training Recipe (Architecture, Data, Loss)

  • Beyond Next-Token Prediction: The standard next-token prediction loss, while powerful, may not be optimal for agentic capabilities like planning, reasoning, and tool use; new loss objectives are needed (the standard objective is sketched after this list for reference).
  • Attention Mechanism Evolution: Chowdhery does not advocate radical post-Transformer architectural shifts, but argues the attention mechanism must evolve to better handle long-form reasoning and to synthesize information from vast, disparate contexts. Analogy: a librarian who can not only find a specific book but also synthesize insights from 100 books on a topic.
  • High-Quality, Reasoning-Rich Data: Pre-training data needs to move beyond sheer volume and diversity to include high-quality, curated "reasoning traces" – examples of expert problem-solving, multi-step planning, and error correction.
  • Synthetic Data Challenges: Generating synthetic reasoning data is expensive and risks "smoking your own exhaust" (models learning from their own generated data, potentially limiting distribution expansion). The goal is to augment natural data, not replace it.
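
For reference, here is the objective being questioned, in a minimal PyTorch sketch: plain next-token cross-entropy over shifted targets. This is the standard formulation, not any lab's exact implementation.

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Standard causal LM loss: predict token t+1 from tokens <= t.

    logits: (batch, seq_len, vocab) from any causal LM
    token_ids: (batch, seq_len) input token ids
    """
    preds = logits[:, :-1, :]       # predictions for positions 0..n-2
    targets = token_ids[:, 1:]      # the "next token" at each position
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
```

Note that every token is weighted equally, whether it is boilerplate or the pivotal step of a plan; that uniform weighting is one reason this objective may be suboptimal for agentic behaviors like planning and tool use.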

Theme 3: The Challenge of Measurement and Emergent Capabilities

  • Lack of Agentic Benchmarks: There's a significant gap in benchmarks that truly measure multi-step reasoning, planning, error recovery, and tool learning in dynamic environments. Labs often build internal benchmarks.
  • Emergence at Scale: Core capabilities often emerge at large model scales first (e.g., GPT-4 showing reasoning, then smaller models acquiring it through distillation). This necessitates building large models to discover new capabilities before optimizing for efficiency.
  • Recoverability is Key: Agentic models must learn to recover from failed trajectories, understand when they lack information, and ask for help or verify solutions. This is distinct from hallucination (probabilistic next-token error) and requires embedding self-correction into the model's core learning. Analogy: A self-driving car that recognizes a dead end, re-plans, and asks for human input if stuck, rather than confidently driving into a wall.

Filter for Action:

  • For Investors:
    • Opportunity: Invest in companies focusing on fundamental pre-training innovation for agentic AI, not just post-training hacks. Look for teams with deep experience in scaling and debugging large models, as this is a high-capital, high-risk endeavor.
    • Warning: Be wary of claims of "agentic breakthroughs" that rely solely on prompt engineering or post-training without addressing the underlying model capabilities. These may hit a performance ceiling quickly.
  • For Builders:
    • Focus on Foundational Capabilities: If building agents, push for models that embed reasoning, planning, and long-context understanding at the pre-training level. Don't rely solely on context window stuffing or multi-agent orchestration as a long-term solution.
    • Develop Better Benchmarks: Contribute to or build new benchmarks that measure multi-step, interactive, and error-recovery capabilities. This is a critical missing piece for advancing agentic AI.
    • Explore Loss & Data Innovation: Experiment with novel loss objectives and high-quality, reasoning-rich training data (even if synthetic) to imbue models with agentic traits from the ground up.
    • Embrace Iteration & Scale: Recognize that discovering new capabilities often requires training at scale, then distilling those capabilities into smaller, more efficient models.

New Podcast Alert: Rethinking Pre-Training for Agentic AI [Aakanksha Chowdhery] - 759

For further insights and detailed discussions, watch the full podcast: Link

This episode exposes a fundamental flaw in current AI pre-training: models optimized for static benchmarks fail to deliver the dynamic, interactive capabilities essential for agentic AI. Aakanksha Chowdhery, a veteran of PaLM and Gemini, argues for a radical re-evaluation of pre-training objectives, data, and evaluation to unlock the next generation of intelligent agents.

The Agentic Imperative: Beyond Static Benchmarks

  • Aakanksha Chowdhery, a member of technical staff at Reflection, asserts that traditional pre-training, measured against static benchmarks like GLUE (General Language Understanding Evaluation) or AIME (American Invitational Mathematics Examination), fundamentally limits AI capabilities. Agentic AI, designed to interact with environments and achieve multi-step goals, demands a new approach.
  • Current models, primarily chatbots, struggle with complex, goal-oriented tasks requiring sustained interaction.
  • Coding agents (e.g., refactoring large codebases) and deep research agents (e.g., synthesizing multiple articles) exemplify the interactive, multi-step workflows demanding agentic intelligence.
  • Post-training techniques, while useful, only offer incremental gains; true agentic capabilities require foundational shifts in pre-training.
  • “When we start caring about those agentic tasks, pre-training needs to rethink from fundamentals. This is not just a post-training problem.”

Rethinking Pre-Training: Attention, Loss, and Data

  • Chowdhery emphasizes that achieving agentic capabilities hinges on evolving three core pre-training levers: architecture (specifically attention), loss objectives, and training data. Architectural changes are high-risk and therefore tend to be incremental; loss objectives and data offer more immediate leverage.
  • Attention Mechanism: Must evolve beyond local context to support long-form reasoning, enabling models to synthesize information from vast, disparate sources (e.g., 100 articles, an entire codebase). Memory architectures can augment attention, acting as tools for models to retrieve and reason over stored information.
  • Loss Objectives: The standard next-token prediction objective is insufficient. Masking techniques, like "fill-in-the-middle" for code generation or masking out tool calls, can teach models specific agentic behaviors during pre-training, unifying the pre- and post-training paradigms; a FIM data transform is sketched at the end of this section.
  • Training Data: Beyond sheer scale and diversity, quality curation and the inclusion of reasoning traces are paramount. While synthetic data poses challenges (distribution shift, "smoking your own exhaust"), augmenting natural data with expert problem-solving examples is a critical direction.
  • “Believe it or not, we have actually trained these models to probabilistically predict the next token. What is the most likely next token? And out of that we have managed to get these models to be extremely strong reasoners. But is that the best objective one could go after?”
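
The fill-in-the-middle idea has a simple data-side implementation: cut a span out of each document and move it to the end behind sentinel markers, so ordinary next-token prediction learns infilling. A sketch, with illustrative sentinel strings (real models use dedicated special tokens):

```python
import random

def fim_transform(text: str) -> str:
    """Rearrange a document so next-token prediction teaches infilling."""
    if len(text) < 2:
        return text
    i, j = sorted(random.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    # The model sees prefix and suffix first, then must generate the middle.
    return f"<FIM_PREFIX>{prefix}<FIM_SUFFIX>{suffix}<FIM_MIDDLE>{middle}"
```

Masking tool use, as Chowdhery suggests, would follow the same pattern: hide the tool invocation and train the model to reproduce it from the surrounding trajectory.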

The Challenge of Agentic Benchmarking

  • The rapid saturation of traditional benchmarks necessitates a new generation of evaluation systems tailored for agentic capabilities. Frontier labs often develop these in-house, focusing on real-world workflow complexities.
  • New benchmarks must measure multi-step problem-solving, planning, and the ability to recover from failed trajectories; a toy harness recording these metrics is sketched at the end of this section.
  • Examples include tasks with varying horizons, like METR's "software atomic actions," or multi-hop reasoning benchmarks such as MRCR v2 and LOFT, which test abstract reasoning over long contexts.
  • Early success with PaLM on novel, crowdsourced reasoning tasks demonstrated the power of targeted benchmarks in revealing emergent capabilities.
  • “The current paradigm that we have kind of settled on is that everyone builds benchmarks in-house to measure what is the set of model capabilities they’re going after.”
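
As a concrete illustration of what such in-house benchmarks could track, here is a hypothetical harness that records not just pass/fail but steps, failures, and recovery; the `agent` and `task` interfaces are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    solved: bool      # did the agent complete the task?
    steps: int        # how many actions it took
    failures: int     # actions that errored or were rejected
    recovered: bool   # success despite at least one earlier failure

def evaluate(agent, tasks, max_steps=50):
    results = []
    for task in tasks:
        failures, solved, steps = 0, False, 0
        for steps in range(1, max_steps + 1):
            outcome = task.execute(agent.next_action(task.state()))
            failures += int(not outcome.ok)
            if outcome.done:
                solved = outcome.ok
                break
        results.append(EpisodeResult(solved, steps, failures, solved and failures > 0))
    return results
```

The `recovered` field is the interesting one: a benchmark that only reports `solved` cannot distinguish an agent that never stumbles from one that recovers gracefully.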

Scaling for Emergence, Distilling for Efficiency

  • Chowdhery explains that larger models often reveal emergent capabilities sooner than smaller ones. While the ultimate goal is cost-efficient, highly capable models, achieving scale is a necessary first step to discover these new behaviors before distilling them into smaller, more practical deployments.
  • Emergence, the appearance of new capabilities at scale, is a consistent observation in large language model development.
  • Once discovered in large models (e.g., GPT-4), these capabilities can often be distilled or engineered into smaller models in subsequent generations.
  • Reflection aims for the most capable models first, then optimizes for cost-efficiency, often through distillation (a minimal distillation loss is sketched at the end of this section).
  • “Often times… emergence is something that you find fun and interesting things in the larger models and then you often distill them or get them in the smaller models in the next generation because you figured out exactly the recipe that it takes to get them.”
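
The distillation step itself is well established; below is a minimal sketch of the standard recipe (soft targets from the teacher at temperature T). This is the textbook formulation, not Reflection's specific pipeline.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Train a small student to match a large teacher's token distribution."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```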

Long-Form Reasoning and Recoverability

  • Agentic AI requires sophisticated long-form reasoning and the ability to recover from failures, moving beyond simple retrieval or next-token prediction.
  • Long-form reasoning involves abstracting and synthesizing information across vast, disparate contexts (e.g., understanding an entire codebase, or synthesizing multiple articles to plan a vacation). It also includes multi-step planning, anticipating future actions, and learning from past trajectories.
  • Recoverability is the model's ability to recognize and correct past failures instead of getting stuck in repetitive loops (see the sketch at the end of this section). It implies an awareness of when information is insufficient, prompting the model to seek verification, explore alternative actions, or ask for user input rather than confidently hallucinating.
  • This capability is not a solved problem but is a key focus for pre-training, requiring strong reasoners developed through specific training data and loss objectives.
  • “If you’re looking at an existing trajectory where there are failures and then recover from that often the models will get stuck in repetitive loops today because they don’t really know that they made that failure.”
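
The repetitive-loop failure in that quote is easy to picture in code. The sketch below shows the scaffold-level workaround agents rely on today: detect a repeated failed action, inject a hint, and escalate if still stuck. Chowdhery's point is that the model itself should learn this behavior during pre-training, making such scaffolding unnecessary; the `model` and `env` interfaces here are hypothetical.

```python
def act_with_recovery(model, env, max_steps=30):
    """Agent loop with an external loop-detector standing in for learned recoverability."""
    failed_actions = set()
    history = [env.reset()]
    for _ in range(max_steps):
        action = model.act(history)
        if action in failed_actions:
            # The model is repeating a known failure: inject a hint and re-plan.
            history.append("NOTE: that action already failed; try a different approach")
            action = model.act(history)
        if action in failed_actions:
            # Still stuck: escalate instead of looping forever.
            history.append("ESCALATE: agent is stuck; requesting user input")
            break
        observation, ok, done = env.step(action)
        if not ok:
            failed_actions.add(action)
        history.extend([action, observation])
        if done:
            break
    return history
```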

Learning New Tools and Cross-Domain Reasoning

  • True agentic intelligence extends to the ability to learn and adapt to new tools and environments, applying reasoning across diverse modalities and domains.
  • Models must meta-learn how to explore and utilize new tools through interaction and feedback (a toy exploration loop is sketched at the end of this section), rather than being pre-programmed or limited by context window size.
  • Promising directions include unified domain-specific languages (e.g., code formats) that facilitate this exploration and learning.
  • The principles developed for coding agents, such as reasoning across code and text, are broadly applicable to enterprise and consumer workflow agents, as the underlying capability is cross-modal and cross-domain reasoning.
  • “The question comes back to how can we get the models to meta-learn like learn how to learn what exactly is the format in which this exploration of the new environment needs to be done so that the model learns to take actions in that space.”
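
As a rough sketch of what tool exploration could look like at the scaffold level: the agent reads a tool's schema, probes it within a budget, and folds both successes and error messages back into context. The `tool` and `model` interfaces are invented for illustration.

```python
import json

def explore_tool(model, tool, budget=5):
    """Let an agent learn a never-before-seen tool from interaction feedback."""
    context = [f"New tool available:\n{json.dumps(tool.schema, indent=2)}"]
    for _ in range(budget):
        call_args = model.propose_call(context)   # model picks arguments to try
        try:
            result = tool.invoke(**call_args)
            context.append(f"CALL {call_args} -> OK: {result}")
        except Exception as err:
            # Error messages are the learning signal during exploration.
            context.append(f"CALL {call_args} -> ERROR: {err}")
    return context  # accumulated experience the agent can reason over later
```

A unified format for these interactions, such as a code-like call syntax, is what the "unified domain-specific languages" bullet above points toward.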

Investor & Researcher Alpha

  • Capital Shift: Investment is moving from incremental post-training tweaks to foundational pre-training innovation for agentic AI. Labs like Reflection, backed by Nvidia, are betting on end-to-end training of frontier open agentic models.
  • New Bottleneck: The primary bottleneck for agentic AI is no longer just compute or data volume, but the quality and structure of training data (especially reasoning traces) and the design of loss objectives that foster multi-step planning and recoverability.
  • Obsolete Research: Research solely focused on static benchmark optimization or post-training as a standalone solution for agentic capabilities risks becoming obsolete. The future demands integrated pre-training solutions that bake in agentic behaviors from the ground up.

Strategic Conclusion

The era of agentic AI demands a fundamental re-architecture of pre-training. By redefining loss objectives, curating high-quality reasoning data, and developing dynamic benchmarks, the industry can move beyond static performance to models capable of complex, interactive, and self-correcting intelligence. The next step is to unify pre- and post-training paradigms to build truly autonomous agents.
