The TWIML AI Podcast with Sam Charrington
December 17, 2025

Rethinking Pre-Training for Agentic AI [Aakanksha Chowdhery] - 759

Aakanksha Chowdhery, a veteran of Google's PaLM and Gemini teams, argues that building truly capable "agentic" AI requires a fundamental shift in how we approach pre-training. The current focus on static benchmarks and next-token prediction won't cut it for models that need to plan, reason, and act over multiple steps in dynamic environments. This isn't a post-training fix; it's a foundational problem.

The Agentic Imperative

  • “For the longest time, we were measuring pre-training on static benchmarks. If you want these models to be useful as agents, they need to be able to interact with environments. And when we start caring about those agentic tasks, pre-training needs to rethink from fundamentals.”
  • Static Limitations: Current LLM pre-training optimizes for benchmarks like GLUE or GSM8K, which measure isolated language tasks. This is like training a chef by only having them identify ingredients, not cook a meal.
  • Dynamic Demands: Agentic AI, such as coding agents that refactor large codebases or deep research agents synthesizing multiple articles, requires models to accumulate context, plan multi-step actions, and learn from environmental feedback.
  • Beyond Fine-Tuning: While post-training techniques (like RLHF) add some agentic behaviors, they often run up against the limits of context engineering and finite context windows. Core agentic capabilities must be embedded during pre-training.

Re-engineering the Foundation

  • “Architecture is just one part of the equation. A big part of the equation is what is the loss objective and what is the training data.”
  • Loss Objective Evolution: The standard next-token prediction loss, while powerful, may not be optimal for agentic tasks. Techniques like masking (as in BERT) can teach models specific behaviors, such as tool selection; a minimal sketch follows this list. Think of it as teaching a student to solve a specific type of problem, not just to recite facts.
  • Data Curation: Beyond sheer volume, the quality and type of training data are paramount. Incorporating "reasoning traces"—expert demonstrations of problem-solving, including failures and corrections—can teach models to plan and recover.
  • Long-Form Reasoning: Models need to process and reason over vast, disparate information (e.g., an entire codebase, 100 research articles) and understand complex relationships. This is like a lawyer synthesizing arguments from thousands of pages of legal documents, not just finding keywords.
  • Error Recovery: Models must learn from failed trajectories and adjust their course, rather than getting stuck in loops or confidently hallucinating. This is akin to a self-driving car learning from near-misses to improve future navigation.
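
The episode describes the masking idea only at a high level; as a concrete illustration, here is a minimal sketch of a masked loss objective for tool selection, assuming PyTorch and a standard causal LM. The span mask and the mixing weight are hypothetical, not details from the podcast.

```python
# A minimal sketch (not from the episode) of weighting the pre-training loss
# toward a specific behavior -- here, predicting which tool an agent calls.
# `tool_span_mask` is a hypothetical annotation of tool-call spans.
import torch
import torch.nn.functional as F

def masked_tool_selection_loss(logits, targets, tool_span_mask):
    """Cross-entropy computed only on tool-selection tokens.

    logits:         (batch, seq_len, vocab) model outputs
    targets:        (batch, seq_len) next-token ids
    tool_span_mask: (batch, seq_len) bool, True where the target token is
                    part of a tool name or tool-call argument
    """
    # Ordinary next-token prediction averages over every position; here we
    # ignore everything outside the tool-call spans (ignore_index=-100).
    labels = targets.masked_fill(~tool_span_mask, -100)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

# In practice such a term would be mixed with the standard LM loss, e.g.
# loss = lm_loss + tool_weight * masked_tool_selection_loss(...)
```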

The Measurement & Scale Challenge

  • “The kind of benchmarks we need for measuring this kind of intelligence is sometimes not available today.”
  • Benchmark Deficit: Existing benchmarks fail to measure multi-step planning, long-form reasoning, or error recovery. New, dynamic benchmarks are essential to simulate real-world agentic workflows.
  • Scaling for Discovery: Larger models often reveal emergent capabilities sooner. While the goal is cost-efficient models, discovering these capabilities typically requires pushing the boundaries of scale first, then distilling that knowledge into smaller models.
  • Synthetic Data Nuance: While synthetic data can augment reasoning traces, it risks "smoking your own exhaust" if not carefully managed. The goal is to augment natural data, not replace it, to maintain distribution integrity; one common filtering guard is sketched after this list.
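
As one concrete (and hypothetical) guard against that failure mode: score each synthetic sample with a reference model trained only on natural data and drop perplexity outliers. This sketch assumes a Hugging Face-style causal LM; the thresholds are arbitrary placeholders.

```python
# My illustration, not a method described in the episode: keep a synthetic
# sample only if its perplexity under a natural-data reference model falls
# inside the band observed on held-out natural text.
import math
import torch

@torch.no_grad()
def keep_synthetic_sample(reference_model, tokenizer, text,
                          ppl_low=5.0, ppl_high=60.0):
    """Return True if `text` looks like it came from the natural distribution."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = reference_model(ids, labels=ids)  # HF causal LMs return mean NLL as .loss
    ppl = math.exp(out.loss.item())
    # Too low: degenerate or repetitive; too high: off-distribution. Drop both.
    return ppl_low <= ppl <= ppl_high
```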

Key Takeaways:

  • Pre-Training is the New Frontier: The next leap in AI capabilities, particularly for agentic systems, will come from fundamental advancements in pre-training, not just post-training tweaks.
  • Builders & Investors: Focus on teams rethinking loss objectives, curating high-quality reasoning data, and developing dynamic benchmarks for agentic capabilities. Be wary of "agentic" claims that lack foundational pre-training innovation.
  • The "So What?": Over the next 6-12 months, expect a push for new benchmarks and data strategies that explicitly train models for multi-step planning, long-form reasoning, and error recovery, moving beyond simple next-token prediction.

For further insights, listen to the podcast: Link

This episode exposes the critical flaw in current AI development: pre-training models on static benchmarks fundamentally limits their potential as dynamic, agentic systems. Aakanksha Chowdhery, a veteran of PaLM and Gemini, argues that achieving true agentic intelligence demands a radical re-evaluation of pre-training from its core—loss objectives, training data, and even architectural nuances.

The Agentic Imperative: Beyond Static Benchmarks

  • Traditional pre-training, focused on static benchmarks like GLUE or AIME, fails to equip models for real-world agentic tasks. Aakanksha Chowdhery, a member of technical staff at Reflection, asserts that agentic AI requires models to interact with dynamic environments, a capability current pre-training methods do not adequately foster.
  • Current models, primarily tuned via post-training techniques like reinforcement learning, hit a ceiling in complex agentic tasks.
  • Agentic applications, such as coding agents refactoring large codebases or deep research agents synthesizing multiple articles, demand continuous context accumulation and multi-step goal achievement.
  • The "context engineering problem" highlights this limitation, where agent performance is constrained by the model's immediate context window.
  • Chowdhery emphasizes that achieving next-generation agentic capabilities is not merely a post-training problem; it necessitates a foundational shift in pre-training.

"For the longest time, we were measuring pre-training on static benchmarks. If you want these models to be useful as agents, they need to be able to interact with environments."

Re-architecting Pre-Training: Loss Objectives and Data

  • Rethinking pre-training extends beyond architecture to encompass fundamental changes in loss objectives and training data. Chowdhery, drawing on her experience with PaLM and Gemini, stresses that while architectural tweaks to the attention mechanism are considered, the primary levers are how models learn and what they learn from.
  • The attention mechanism, central to Transformers, faces scrutiny for its efficacy in long-form reasoning, despite its success in long-form retrieval.
  • Current models are trained to probabilistically predict the next token; Chowdhery questions if this objective is optimal for developing robust reasoning and planning capabilities.
  • Loss objectives can be modified to embed agentic behaviors: for coding models, "fill-in-the-middle" tasks teach code completion, while masking techniques train models for tool use by predicting which tool or query to employ (a fill-in-the-middle transform is sketched after this list).
  • This approach unifies pre- and post-training, moving capabilities like reasoning from a final tuning stage to a core, pre-trained attribute.
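
For concreteness, here is a minimal sketch of the fill-in-the-middle transform, following the prefix/suffix/middle format popularized by code models such as StarCoder. The sentinel strings and cut-point logic are illustrative rather than details from the episode.

```python
# Rearrange a document so ordinary next-token prediction teaches the model
# to fill in a missing middle span. Sentinel strings are illustrative;
# real models define their own special tokens.
import random

FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Rewrite a document so the model learns to predict a missing middle."""
    if len(document) < 2:
        return document  # too short to split
    # Pick two cut points; the span between them becomes the target.
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees prefix and suffix, then must generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```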

"Architecture is just one part of the equation. A big part of the equation is what is the loss objective and what is the training data."

Training Data: Quality, Reasoning Traces, and Synthetic Augmentation

  • The quality and composition of training data are paramount for agentic models. Chowdhery highlights that while scale and diversity remain crucial, the focus must shift towards high-quality curation and the integration of "reasoning traces."
  • High-quality data curation, as demonstrated by models from DeepSeek, significantly improves compute efficiency and model performance.
  • The challenge lies in acquiring expert reasoning traces (data reflecting human problem-solving processes across diverse domains) at scale; one hypothetical record format is sketched after this list.
  • While current methods use reinforcement learning environments to generate these traces, the goal is to integrate them directly into pre-training data volumes.
  • Synthetic data, though challenging due to potential distribution shifts and "smoking your own exhaust" problems, offers a path to augment natural data, provided it maintains representativeness of the desired distribution.
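
The episode does not prescribe a trace format; the sketch below is a hypothetical record schema showing how reasoning traces, including failures and corrections, could be captured and flattened into plain text for next-token pre-training.

```python
# A hypothetical schema (my illustration) for the "reasoning traces"
# discussed above: each step records thought, action, feedback, and
# whether it was a dead end, so recovery behavior is learnable.
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str                 # the expert's intermediate reasoning
    action: str                  # e.g. a tool call or code edit
    observation: str             # environment feedback
    failed: bool = False         # mark dead ends explicitly

@dataclass
class ReasoningTrace:
    task: str                    # the original goal
    domain: str                  # e.g. "coding", "research"
    steps: list[Step] = field(default_factory=list)
    outcome: str = ""            # final answer or artifact

    def to_training_text(self) -> str:
        """Flatten the trace into plain text for next-token pre-training."""
        lines = [f"Task: {self.task}"]
        for s in self.steps:
            tag = "FAILED" if s.failed else "OK"
            lines.append(f"[{tag}] Thought: {s.thought}\n"
                         f"Action: {s.action}\nObservation: {s.observation}")
        lines.append(f"Outcome: {self.outcome}")
        return "\n".join(lines)
```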

"What would be good reasoning traces that can feed back into the models at large scale is a question that is a thing that we're thinking a lot about."

Benchmarking Agentic Intelligence: Measuring Dynamic Capabilities

  • Existing benchmarks are quickly saturated, failing to capture the dynamic, multi-step capabilities required for agentic AI. Chowdhery advocates for developing new, in-house benchmarks that specifically test planning, recovery from failure, and tool learning.
  • New benchmarks must assess multi-step problem-solving, planning across varying horizons, and the ability to recover from failed trajectories; a minimal harness along these lines is sketched after this list.
  • Examples include coding tasks like Terminal-Bench or SWE-bench, and multi-hop reasoning benchmarks like MRCR v2, which test a model's ability to synthesize information from disparate sources.
  • The development of these benchmarks often stems from identifying current model limitations in real-world workflows, turning those gaps into measurable challenges.
  • The field is just beginning to see reinforcement learning workflows translate into robust benchmarks, which will be critical for future progress.
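
To make the contrast with static benchmarks concrete, here is a minimal harness of my own construction (not a benchmark described in the episode): it runs an episode loop, injects a failure mid-trajectory, and scores success and recovery separately. The agent and env interfaces are hypothetical.

```python
# A dynamic benchmark grades a whole trajectory, not a single completion.
# All interfaces (act, step, inject_failure, revised_plan, goal_reached)
# are hypothetical stand-ins for a real environment API.
def run_episode(agent, env, max_steps=20, inject_failure_at=5):
    """Score one agentic episode: success, recovery, and steps taken."""
    obs = env.reset()
    succeeded = recovered = False
    steps_taken = 0
    for t in range(max_steps):
        steps_taken = t + 1
        action = agent.act(obs)                    # multi-step planning lives here
        if t == inject_failure_at:
            obs, err = env.inject_failure(action)  # deliberately break this step
        else:
            obs, err = env.step(action)
        if err and agent.revised_plan():           # did the agent change course?
            recovered = True
        if env.goal_reached():
            succeeded = True
            break
    return {"success": succeeded, "recovered": recovered, "steps": steps_taken}
```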

"The kind of benchmarks we need for measuring this kind of intelligence is sometimes not available today."

Long-Form Reasoning, Recoverability, and Tool Learning

  • Long-Form Reasoning: This involves models reasoning about disparate information far apart in context, such as understanding an entire codebase or synthesizing a report from ten articles, going beyond simple retrieval. It also includes multi-step planning, akin to chess, where future actions are projected towards a goal.
  • Recoverability from Failure: Agents must learn from failed trajectories, correcting course instead of repeating errors. This requires the model to recognize failures, understand their implications, and choose new actions, moving beyond probabilistic next-token generation. This capability directly addresses issues like "hallucination" by enabling models to seek more information or ask for user guidance when uncertain; see the sketch after this list.
  • Tool Learning: Models need to meta-learn (learn how to learn) to explore new tools and understand their utility through interaction and feedback, rather than relying solely on tools described within their context window or extensive fine-tuning. This could involve unified domain-specific languages or learning by imitation.
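
A minimal sketch of the recoverability behavior described above, with all interfaces (propose_action, confidence, ask_user, and so on) hypothetical: the agent remembers dead ends, feeds error signals back, and asks for guidance when its confidence is low instead of looping or confidently hallucinating.

```python
# Illustrative only: an agent loop that avoids the "stuck in a loop" and
# "confident hallucination" failure modes described in the episode.
def recoverable_loop(agent, env, max_steps=30, min_confidence=0.4):
    """Run the agent, remembering dead ends and deferring when unsure."""
    failed_actions: set[str] = set()
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.propose_action(obs, avoid=failed_actions)
        if agent.confidence(action) < min_confidence:
            # Rather than guessing, gather more information from the user.
            obs = env.ask_user(f"Unsure how to proceed with: {action}")
            continue
        obs, err = env.step(action)
        if err:
            failed_actions.add(action)       # remember the dead end
            agent.note_failure(action, err)  # feed the error signal back
        elif env.goal_reached():
            return obs
    return None  # budget exhausted without reaching the goal
```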

"We want the model to learn how to learn and how to correct itself."

Investor & Researcher Alpha

  • Capital Shift: Investment is moving towards foundational pre-training for agentic capabilities, not just post-training. Companies like Reflection, backed by Nvidia, are prioritizing end-to-end training of frontier open agentic models. This signals a shift in where significant compute and research capital will be deployed.
  • New Bottlenecks: The primary bottlenecks are no longer just raw data volume but high-quality data curation, the generation and integration of expert reasoning traces, and the development of dynamic, multi-step benchmarks for agentic intelligence. Research into synthetic data generation that maintains natural data distribution is critical.
  • Research Direction: Approaches focused solely on post-training or context engineering for agentic systems are becoming obsolete. The future lies in embedding agentic capabilities—planning, long-form reasoning, failure recovery, tool meta-learning—directly into the pre-training phase. Architectural innovations that enhance attention for long-context reasoning, while high-risk, are also areas of intense focus.

Strategic Conclusion

Achieving truly capable agentic AI necessitates a fundamental re-evaluation of pre-training. The industry must move beyond static benchmarks and next-token prediction to embed dynamic reasoning, planning, and self-correction directly into models through innovative loss objectives, high-quality reasoning data, and advanced measurement. The next step for AI is to build models that learn how to learn, interact, and adapt from their earliest stages.
