This episode exposes a critical flaw in current AI development: optimizing pre-training against static benchmarks fundamentally limits models' potential as dynamic, agentic systems. Aakanksha Chowdhery, a veteran of PaLM and Gemini, argues that achieving true agentic intelligence demands a radical re-evaluation of pre-training from its core: loss objectives, training data, and even architectural choices.
The Agentic Imperative: Beyond Static Benchmarks
- Traditional pre-training, focused on static benchmarks like GLUE or AIME, fails to equip models for real-world agentic tasks. Aakanksha Chowdhery, a member of technical staff at Reflection, asserts that agentic AI requires models to interact with dynamic environments, a capability current pre-training methods do not adequately foster.
- Current models, primarily tuned via post-training techniques like reinforcement learning, hit a ceiling in complex agentic tasks.
- Agentic applications, such as coding agents refactoring large codebases or deep research agents synthesizing multiple articles, demand continuous context accumulation and multi-step goal achievement.
- The "context engineering problem" highlights this limitation, where agent performance is constrained by the model's immediate context window.
- Chowdhery emphasizes that achieving next-generation agentic capabilities is not merely a post-training problem; it necessitates a foundational shift in pre-training.
"For the longest time, we were measuring pre-training on static benchmarks. If you want these models to be useful as agents, they need to be able to interact with environments."
Re-architecting Pre-Training: Loss Objectives and Data
- Rethinking pre-training extends beyond architecture to encompass fundamental changes in loss objectives and training data. Chowdhery, drawing on her experience with PaLM and Gemini, stresses that while architectural tweaks to the attention mechanism are considered, the primary levers are how models learn and what they learn from.
- The attention mechanism, central to Transformers, faces scrutiny over its efficacy for long-form reasoning, even though it has proven effective for long-context retrieval.
- Current models are trained to probabilistically predict the next token; Chowdhery questions if this objective is optimal for developing robust reasoning and planning capabilities.
- Loss objectives can be modified to embed agentic behaviors: for coding models, "fill-in-the-middle" tasks teach the model to infill code from its surrounding context, while masking techniques train for tool use by scoring only the predicted tool and query tokens (see the sketch at the end of this section).
- This approach unifies pre- and post-training, moving capabilities like reasoning from a final tuning stage to a core, pre-trained attribute.
"Architecture is just one part of the equation. A big part of the equation is what is the loss objective and what is the training data."
Training Data: Quality, Reasoning Traces, and Synthetic Augmentation
- The quality and composition of training data are paramount for agentic models. Chowdhery highlights that while scale and diversity remain crucial, the focus must shift towards high-quality curation and the integration of "reasoning traces."
- High-quality data curation, as demonstrated by models from DeepSeek, significantly improves compute efficiency and model performance.
- The challenge lies in acquiring expert reasoning traces—data reflecting human problem-solving processes across diverse domains—at scale.
- While current methods use reinforcement learning environments to generate these traces, the goal is to fold them directly into the pre-training corpus at scale.
- Synthetic data, though fraught with distribution-shift and "smoking your own exhaust" problems, offers a path to augment natural data, provided it remains representative of the target distribution (a minimal filtering guard is sketched at the end of this section).
"What would be good reasoning traces that can feed back into the models at large scale is a question that is a thing that we're thinking a lot about."
Benchmarking Agentic Intelligence: Measuring Dynamic Capabilities
- Existing benchmarks are quickly saturated, failing to capture the dynamic, multi-step capabilities required for agentic AI. Chowdhery advocates for developing new, in-house benchmarks that specifically test planning, recovery from failure, and tool learning.
- New benchmarks must assess multi-step problem-solving, planning across varying horizons, and the ability to recover from failed trajectories (see the harness sketch at the end of this section).
- Examples include coding tasks like TerminalBench or SWE-bench, and multi-hop reasoning benchmarks like MCRV2, which test a model's ability to synthesize information from disparate sources.
- The development of these benchmarks often stems from identifying current model limitations in real-world workflows, turning those gaps into measurable challenges.
- The field is just beginning to see reinforcement learning workflows translate into robust benchmarks, which will be critical for future progress.
"The kind of benchmarks we need for measuring this kind of intelligence is sometimes not available today."
Long-Form Reasoning, Recoverability, and Tool Learning
- Long-Form Reasoning: This involves models reasoning about disparate information far apart in context, such as understanding an entire codebase or synthesizing a report from ten articles, going beyond simple retrieval. It also includes multi-step planning, akin to chess, where future actions are projected towards a goal.
- Recoverability from Failure: Agents must learn from failed trajectories, correcting course instead of repeating errors. This requires the model to recognize failures, understand their implications, and choose new actions, moving beyond probabilistic next-token generation (a recover-or-ask loop is sketched at the end of this section). This capability directly addresses issues like "hallucination" by enabling models to seek more information or ask for user guidance when uncertain.
- Tool Learning: Models need to meta-learn (learn how to learn) to explore new tools and understand their utility through interaction and feedback, rather than relying solely on tools described within their context window or extensive fine-tuning. This could involve unified domain-specific languages or learning by imitation.
"We want the model to learn how to learn and how to correct itself."
Investor & Researcher Alpha
- Capital Shift: Investment is moving towards foundational pre-training for agentic capabilities, not just post-training. Companies like Reflection, backed by Nvidia, are prioritizing end-to-end training of frontier open agentic models. This signals a shift in where significant compute and research capital will be deployed.
- New Bottlenecks: The primary bottlenecks are no longer just raw data volume but high-quality data curation, the generation and integration of expert reasoning traces, and the development of dynamic, multi-step benchmarks for agentic intelligence. Research into synthetic data generation that maintains natural data distribution is critical.
- Research Direction: Approaches focused solely on post-training or context engineering for agentic systems are becoming obsolete. The future lies in embedding agentic capabilities—planning, long-form reasoning, failure recovery, tool meta-learning—directly into the pre-training phase. Architectural innovations that enhance attention for long-context reasoning, while high-risk, are also areas of intense focus.
Strategic Conclusion
Achieving truly capable agentic AI necessitates a fundamental re-evaluation of pre-training. The industry must move beyond static benchmarks and next-token prediction to embed dynamic reasoning, planning, and self-correction directly into models through innovative loss objectives, high-quality reasoning data, and advanced measurement. The next step for AI is to build models that learn how to learn, interact, and adapt from their earliest stages.