This episode exposes a fundamental flaw in current AI pre-training: models optimized for static benchmarks fail to deliver the dynamic, interactive capabilities essential for agentic AI. Aakanksha Chowdhery, a veteran of PaLM and Gemini, argues for a radical re-evaluation of pre-training objectives, data, and evaluation to unlock the next generation of intelligent agents.
The Agentic Imperative: Beyond Static Benchmarks
- Aakanksha Chowdhery, a member of technical staff at Reflection AI, asserts that traditional pre-training, measured against static benchmarks like GLUE (General Language Understanding Evaluation) or AIME (American Invitational Mathematics Examination), fundamentally limits AI capabilities. Agentic AI, designed to interact with environments and achieve multi-step goals, demands a new approach.
- Current models, primarily chatbots, struggle with complex, goal-oriented tasks requiring sustained interaction.
- Coding agents (e.g., refactoring large codebases) and deep research agents (e.g., synthesizing multiple articles) exemplify the interactive, multi-step workflows demanding agentic intelligence.
- Post-training techniques, while useful, only offer incremental gains; true agentic capabilities require foundational shifts in pre-training.
- “When we start caring about those agentic tasks, pre-training needs to rethink from fundamentals. This is not just a post-training problem.”
Rethinking Pre-Training: Attention, Loss, and Data
- Chowdhery emphasizes that achieving agentic capabilities hinges on evolving three core pre-training levers: architecture (specifically attention), loss objectives, and training data. Architectural changes are high-risk and therefore tend to land incrementally, while loss and data offer more immediate impact.
- Attention Mechanism: Must evolve beyond local context to support long-form reasoning, enabling models to synthesize information from vast, disparate sources (e.g., 100 articles, an entire codebase). Memory architectures can augment attention, acting as tools for models to retrieve and reason over stored information; a toy sketch follows this list.
- Loss Objectives: The standard "next token prediction" is insufficient. Masking techniques, like "fill-in-the-middle" for code generation or masking tool use, can teach models specific agentic behaviors during pre-training, unifying pre- and post-training paradigms (see the FIM sketch after the quote below).
- Training Data: Beyond sheer scale and diversity, quality curation and the inclusion of reasoning traces are paramount. While synthetic data poses challenges (distribution shift, "smoking your own exhaust"), augmenting natural data with expert problem-solving examples is a critical direction.
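To ground the memory-as-tool idea, here is a minimal sketch in which explicit retrieval calls, not raw attention over the whole corpus, decide what enters the context window. The `MemoryStore` class and its keyword scoring are illustrative stand-ins for a learned memory or vector index, not a description of any lab's actual architecture.

```python
# Minimal sketch of "memory as a tool": the model issues retrieval calls
# against a store instead of attending over the entire corpus at once.
from collections import Counter

class MemoryStore:
    """Toy keyword index standing in for a learned memory / vector store."""
    def __init__(self, documents: list[str]):
        self.documents = documents

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Score documents by word overlap with the query; a real system
        # would use embeddings and approximate nearest-neighbor search.
        q = Counter(query.lower().split())
        scored = sorted(
            self.documents,
            key=lambda d: sum((q & Counter(d.lower().split())).values()),
            reverse=True,
        )
        return scored[:k]

store = MemoryStore([
    "def parse_config(path): ...",
    "class Scheduler: handles retry backoff",
    "README: deployment requires the scheduler service",
])
# Only the retrieved snippets enter the prompt, so attention stays short
# even when the underlying corpus is an entire codebase.
print(store.retrieve("how does retry backoff work"))
```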
- “Believe it or not, we have actually trained these models to probabilistically predict the next token. What is the most likely next token? And out of that we have managed to get these models to be extremely strong reasoners. But is that the best objective one could go after?”
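To make the masking idea concrete, here is a minimal sketch of a fill-in-the-middle (FIM) data transform in the spirit of the published FIM recipe. The `<PRE>`/`<SUF>`/`<MID>` sentinel strings are illustrative; real pipelines reserve dedicated tokenizer tokens, and this sketches the general technique rather than any particular lab's pipeline.

```python
# Minimal fill-in-the-middle (FIM) transform: reorder each document so the
# masked middle span comes last, then train with ordinary next-token loss.
import random

def fim_transform(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and emit the
    prefix-suffix-middle (PSM) layout."""
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # Because the middle comes last, plain next-token prediction now
    # conditions on context from both sides of the gap.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

rng = random.Random(0)
code = "def add(a, b):\n    return a + b\n"
print(fim_transform(code, rng))
```

The same trick generalizes: mask a tool call instead of a code span, and ordinary pre-training starts teaching the model to produce the call from its surrounding context.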
The Challenge of Agentic Benchmarking
- The rapid saturation of traditional benchmarks necessitates a new generation of evaluation systems tailored for agentic capabilities. Frontier labs often develop these in-house, focusing on real-world workflow complexities.
- New benchmarks must measure multi-step problem-solving, planning, and the ability to recover from failed trajectories; a toy scoring harness is sketched at the end of this section.
- Examples include tasks with varying horizons, like METR's "software atomic actions," or multi-hop reasoning benchmarks like MRCR v2 and LOFT, which test abstract reasoning over long contexts.
- Early success with PaLM on novel, crowdsourced reasoning tasks demonstrated the power of targeted benchmarks in revealing emergent capabilities.
- “The current paradigm that we have kind of settled on is that everyone builds benchmarks in-house to measure what is the set of model capabilities they’re going after.”
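Since the episode stays at the level of principles, here is a minimal sketch of what such an in-house harness can score: not just final success, but failures and recoveries along the trajectory. The `agent`/`env` interfaces are hypothetical placeholders, not any lab's real evaluation API.

```python
# Toy agentic eval harness: score multi-step episodes, counting not only
# success but how often the agent recovers after a failed step.
from dataclasses import dataclass

@dataclass
class StepResult:
    ok: bool
    observation: str

@dataclass
class EpisodeMetrics:
    success: bool = False
    steps: int = 0
    failures: int = 0
    recoveries: int = 0  # a failed step followed later by a successful one

def run_episode(agent, env, max_steps: int = 20) -> EpisodeMetrics:
    m = EpisodeMetrics()
    failed_last = False
    obs = env.reset()
    for _ in range(max_steps):
        result: StepResult = env.step(agent.act(obs))
        m.steps += 1
        if result.ok and failed_last:
            m.recoveries += 1
        if not result.ok:
            m.failures += 1
        failed_last = not result.ok
        obs = result.observation
        if env.done():
            m.success = True
            break
    return m
```

Tracking recoveries separately from raw pass rate is the point: two agents with identical success rates can differ sharply in how often they dig themselves out of a failed step.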
Scaling for Emergence, Distilling for Efficiency
- Chowdhery explains that larger models often reveal emergent capabilities sooner than smaller ones. While the ultimate goal is cost-efficient, highly capable models, achieving scale is a necessary first step to discover these new behaviors before distilling them into smaller, more practical deployments.
- Emergence, the appearance of new capabilities at scale, is a consistent observation in large language model development.
- Once discovered in large models (e.g., GPT-4), these capabilities can often be distilled or engineered into smaller models in subsequent generations.
- Reflection AI aims for the most capable models first, then optimizes for cost-efficiency, often through distillation (a loss sketch follows the quote below).
- “Often times… emergence is something that you find fun and interesting things in the larger models and then you often distill them or get them in the smaller models in the next generation because you figured out exactly the recipe that it takes to get them.”
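For readers who want the distillation step concrete, here is the standard knowledge-distillation loss as a minimal PyTorch sketch. The temperature and the soft/hard mixing weight are the usual knobs from the classic recipe; nothing here is specific to Reflection AI's training setup.

```python
# Standard logit distillation: the student matches the teacher's softened
# token distribution while also fitting the hard next-token labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend soft-label KL against the teacher with hard-label cross-entropy.
    Logits are (batch * sequence, vocab); targets are token ids."""
    t = temperature
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)  # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```

Run over a corpus with a frozen large teacher and a trainable small student, this is the mechanism by which a capability found at scale gets moved down into a smaller model a generation later.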
Long-Form Reasoning and Recoverability
- Agentic AI requires sophisticated long-form reasoning and the ability to recover from failures, moving beyond simple retrieval or next-token prediction.
- Long-form reasoning involves abstracting and synthesizing information across vast, disparate contexts (e.g., understanding an entire codebase, or synthesizing multiple articles into a vacation plan). It also includes multi-step planning, anticipating future actions, and learning from past trajectories.
- Recoverability is the model's ability to recognize and correct past failures, avoiding repetitive loops. This implies an awareness of when information is insufficient, prompting the model to seek verification, explore alternative actions, or ask for user input, rather than confidently hallucinating.
- This capability is not a solved problem but is a key focus for pre-training, requiring strong reasoners developed through specific training data and loss objectives; one way to construct such data is sketched after the quote below.
- “If you’re looking at an existing trajectory where there are failures and then recover from that often the models will get stuck in repetitive loops today because they don’t really know that they made that failure.”
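One concrete, data-level attack on the repetitive-loop problem is to keep failed steps in training trajectories and annotate the recovery, rather than training only on clean runs. The record fields and annotation text below are illustrative assumptions, not a pipeline described in the episode.

```python
# Sketch: serialize logged trajectories with failures kept in, so the model
# sees explicit failure -> notice -> alternative-action patterns in training.
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    observation: str
    ok: bool

def to_training_text(trajectory: list[Step]) -> str:
    """Render a trajectory, annotating failures instead of deleting them."""
    lines = []
    for step in trajectory:
        lines.append(f"ACTION: {step.action}")
        lines.append(f"OBSERVATION: {step.observation}")
        if not step.ok:
            # The annotation is the supervision: the model learns to notice
            # a failure rather than re-issuing the same action.
            lines.append("NOTE: previous action failed; try an alternative.")
    return "\n".join(lines)

trajectory = [
    Step("open('config.yml')", "FileNotFoundError", ok=False),
    Step("ls .", "config.yaml  src/", ok=True),
    Step("open('config.yaml')", "retries: 3 ...", ok=True),
]
print(to_training_text(trajectory))
```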
Learning New Tools and Cross-Domain Reasoning
- True agentic intelligence extends to the ability to learn and adapt to new tools and environments, applying reasoning across diverse modalities and domains.
- Models must meta-learn how to explore and utilize new tools through interaction and feedback, rather than being pre-programmed or limited by context window size.
- Promising directions include unified domain-specific languages (e.g., code formats) that facilitate this exploration and learning, as sketched at the end of this section.
- The principles developed for coding agents, such as reasoning across code and text, are broadly applicable to enterprise and consumer workflow agents, as the underlying capability is cross-modal and cross-domain reasoning.
- “The question comes back to how can we get the models to meta-learn like learn how to learn what exactly is the format in which this exploration of the new environment needs to be done so that the model learns to take actions in that space.”
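A minimal sketch of the unified-DSL idea: if every tool is exposed as a typed function in one code format, the model can read signatures, emit calls, and learn from execution feedback inside a single representation. The tool names and the docstring-derived spec below are invented for illustration.

```python
# Sketch of a code-shaped tool interface: tools are plain typed functions,
# and their signatures double as the spec the model explores against.
import inspect
from typing import Callable

def describe_tools(tools: dict[str, Callable]) -> str:
    """Render tool signatures as the specification shown to the model."""
    return "\n".join(
        f"{name}{inspect.signature(fn)}  # {fn.__doc__}"
        for name, fn in tools.items()
    )

def search_flights(origin: str, dest: str) -> list[str]:
    """Return candidate flights between two airports."""
    return [f"{origin}->{dest} 09:00", f"{origin}->{dest} 14:30"]

def book(flight: str) -> str:
    """Book a flight and return a confirmation id."""
    return f"confirmed:{flight}"

tools = {"search_flights": search_flights, "book": book}
print(describe_tools(tools))
# The model emits calls in this same format; executing them and feeding the
# results back is the interaction loop it meta-learns from.
```

Because the interface is just code, the same exploration format transfers from coding agents to enterprise and consumer workflows, which is the cross-domain point above.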
Investor & Researcher Alpha
- Capital Shift: Investment is moving from incremental post-training tweaks to foundational pre-training innovation for agentic AI. Labs like Reflection AI, backed by Nvidia, are betting on end-to-end training of frontier open agentic models.
- New Bottleneck: The primary bottleneck for agentic AI is no longer just compute or data volume, but the quality and structure of training data (especially reasoning traces) and the design of loss objectives that foster multi-step planning and recoverability.
- Obsolete Research: Research solely focused on static benchmark optimization or post-training as a standalone solution for agentic capabilities risks becoming obsolete. The future demands integrated pre-training solutions that bake in agentic behaviors from the ground up.
Strategic Conclusion
The era of agentic AI demands a fundamental re-architecture of pre-training. By redefining loss objectives, curating high-quality reasoning data, and developing dynamic benchmarks, the industry can move beyond static performance to models capable of complex, interactive, and self-correcting intelligence. The next step is to unify pre- and post-training paradigms to build truly autonomous agents.