This episode unpacks Augment Code's launch of the #1 SWE-Bench verified AI coding agent, detailing the technical strategies behind this achievement and the implications for developers navigating complex, large-scale codebases.
1. Augment Code Agent Launch & SWE-Bench Triumph
- Gari from Augment Code announces the launch of their new AI agent feature, building on their existing suite of tools: code completions, "next edit," and chat with codebase understanding. The agent aims to understand a user's request, identify the necessary code changes while respecting existing conventions, execute commands, and run tests, potentially automating the workflow all the way to a working pull request (PR). (A minimal sketch of this kind of agent loop follows this list.)
- Significantly, Gari reveals their agent achieved the #1 ranking on the SWE-Bench verified benchmark, surpassing previous leaders like the one from Weights & Biases. SWE-Bench (Software Engineering Benchmark) is a standard benchmark used to evaluate the ability of AI models to resolve real-world GitHub issues.
- Gari emphasizes that this top performance on SWE-Bench was achieved using off-the-shelf generation models, although their commercial product incorporates custom-trained models specifically for enhancing codebase understanding. He notes that while codebase understanding provides a small delta on SWE-Bench, the benchmark's structure often pinpoints required changes, limiting the observed benefit of deep context awareness in this specific test environment.
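A minimal, hypothetical sketch of the loop described in the first bullet above; the function names (`call_model`, `apply_edit`, `run_tests`) and the action schema are invented for illustration and are not Augment's implementation:

```python
# Sketch of an agent loop: ask the model for a step, apply edits, run tests,
# and stop once the change passes. `call_model` is a stub standing in for any
# off-the-shelf LLM API.
import subprocess


def call_model(history: list[dict]) -> dict:
    """Placeholder LLM call that would return the next action as a dict,
    e.g. {"action": "edit", "path": ..., "old": ..., "new": ...}."""
    return {"action": "finish", "summary": "stub: no model attached"}


def apply_edit(path: str, old: str, new: str) -> None:
    """Apply a simple exact-match string replacement to a file."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(old, new, 1))


def run_tests(command: str = "pytest -q") -> tuple[bool, str]:
    """Run the project's test command and report success plus combined output."""
    proc = subprocess.run(command.split(), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def agent(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(history)
        if step["action"] == "edit":
            apply_edit(step["path"], step["old"], step["new"])
            history.append({"role": "tool", "content": f"edited {step['path']}"})
        elif step["action"] == "test":
            ok, log = run_tests()
            history.append({"role": "tool", "content": log})
            if ok:
                return "tests pass; ready to open a PR"
        else:  # "finish" or anything unrecognized ends the loop
            return step.get("summary", "done")
    return "step budget exhausted"
```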
2. Technical Deep Dive: Strategies Behind the #1 Ranking
- Gari discusses the techniques contributing to their SWE-Bench success. Sequential thinking, in which the model reflects and reasons step by step before acting, provided a noticeable performance bump over the standard reasoning modes they tested with models like Claude 3.7 Sonnet. As Gari puts it, "sequential thinking helped bump the score."
- Ensembling, running the agent multiple times and combining the results via a simple majority vote, also improved the score in their experiments, though cost constraints kept the final submission at a single run (n=1). Ensembling combines predictions from multiple models, or from multiple runs of the same model, to improve overall accuracy or robustness (a toy majority-vote sketch follows this list).
- Gari highlights that even foundational elements like reliable file editing required significant iteration, underscoring the practical engineering challenges in building robust agents beyond pure model capability. The discussion, primarily led by Gari with Swyx probing, emphasizes an empirical, experimentation-driven approach to finding optimal configurations.
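A toy illustration of the majority-vote ensembling idea: run the agent several times and keep the most common candidate. `run_agent_once` is a stub, and a real pipeline would normalize patches or use a judge model before voting; this is not Augment's code:

```python
from collections import Counter


def run_agent_once(task: str, seed: int) -> str:
    """Stub for a single agent run that would return a candidate patch."""
    return f"candidate patch for {task!r} (variant {seed % 2})"  # runs disagree in practice


def majority_vote(task: str, n_runs: int = 5) -> str:
    """Run the agent n_runs times and keep the most frequent candidate."""
    candidates = [run_agent_once(task, seed) for seed in range(n_runs)]
    best, votes = Counter(candidates).most_common(1)[0]
    print(f"{votes}/{n_runs} runs agreed")
    return best


majority_vote("fix issue #123")
```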
3. Experimentation Framework and Evaluation Process
- Gari outlines Augment Code's iterative development process. It starts with small, curated sets of test samples (around 10) for initial feature development, allowing the team to know each example deeply. It then progresses to larger datasets (hundreds or thousands of samples) managed by internal infrastructure for automated evaluations, including checks with and without actual code execution (a toy evaluation harness is sketched after this list).
- A key challenge addressed is bridging the gap between research environments and production systems. Augment runs production systems against research evaluation sets to catch regressions before deployment, ensuring real-world performance aligns with development findings.
- For tasks difficult to evaluate automatically, like chat interactions and complex agent behaviors, Gari confirms they utilize human contractors, highlighting the continued need for human judgment in assessing nuanced AI performance.
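The staged workflow can be pictured with a toy harness like the one below; `Sample`, `check`, and `evaluate` are invented names, and real infrastructure would add sandboxed execution, caching, and regression reporting:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    task: str
    check: Callable[[str], bool]  # e.g. "does the produced patch pass the tests?"


def evaluate(agent: Callable[[str], str], samples: list[Sample]) -> float:
    """Score an agent over a sample set: fraction of outputs that pass their check."""
    passed = sum(1 for s in samples if s.check(agent(s.task)))
    return passed / len(samples)


# Start with ~10 hand-curated samples the team knows intimately...
curated = [Sample("rename the config flag", lambda out: "flag" in out)]
# ...then run the same harness over hundreds or thousands of mined tasks.
print(f"pass rate: {evaluate(lambda task: f'patch for {task}', curated):.0%}")
```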
4. Technical Wins, Cost Constraints, and UX Trade-offs
- Beyond sequential thinking and ensembling, Gari mentions smaller technical wins such as optimizing file modifications, e.g., preferring a plain string-replace edit over "smart paste" in certain contexts (a minimal string-replace sketch follows this list). The major trade-off discussed, however, is the cost versus performance and UX impact of techniques like ensembling.
- While more sophisticated ensembling could yield further performance gains (potentially a few percentage points on SWE-Bench), Gari points out a critical user experience (UX) issue. "Users really want to see what the agent is doing and follow along," he explains. Ensembling creates multiple execution paths, making real-time supervision difficult and potentially frustrating for users who prefer transparency over waiting for a final, combined result. This UX challenge might limit extensive ensembling even if costs decrease.
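A minimal sketch of the string-replace editing primitive mentioned above, with a uniqueness check so an ambiguous snippet fails loudly instead of silently editing the wrong spot (illustrative only, not Augment's tool):

```python
from pathlib import Path


def str_replace_edit(path: str, old: str, new: str) -> str:
    """Replace `old` with `new` in a file, refusing to edit on ambiguity."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return f"error: snippet not found in {path}"
    if count > 1:
        return f"error: snippet appears {count} times in {path}; include more context"
    Path(path).write_text(text.replace(old, new, 1))
    return f"edited {path}"
```

One common rationale for returning readable error strings instead of raising is that the model can read the failure and retry with more surrounding context on its next step.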
5. Multi-Agent Concepts: Orientation and Memory
- Gari emphasizes the importance of "orientation" for agent effectiveness, particularly in complex codebases. Augment implements an orientation phase where the agent analyzes the codebase to understand conventions, especially around testing frameworks and execution methods, crucial for SWE-Bench and real-world tasks.
- The product incorporates "Memories," allowing the agent to learn from past interactions and mistakes. If the agent identifies a generalizable fact about the codebase (e.g., testing conventions), it stores that fact to avoid repeating errors, progressively adapting to the specific environment (a toy memory store is sketched after this list).
- A more "thorough orientation" feature is planned, which will run for several minutes to deeply analyze languages, frameworks, versions, and testing setups, aiming for a better out-of-the-box experience.
6. Competitive Landscape and Augment's Philosophy
- Positioning Augment Code against competitors like Devin, Factory AI, and Sourcegraph, Gari articulates their core philosophy: focusing on developers working within existing large, complex enterprise codebases, rather than "0-to-1" greenfield development. This focus drives their investment in codebase understanding via their context engine.
- Gari states, "We want to meet developers where they are." This translates to integrating directly into existing developer workflows via IDE extensions (VS Code, JetBrains, Vim) rather than requiring radical workflow changes. This contrasts with approaches that might offer standalone agent platforms. He also notes compatibility with tools like Cursor.
- Regarding competitors like Magic.dev and Poolside potentially training large proprietary models, Gari confirms Augment currently prioritizes using external models for agent generation (except for codebase understanding) to achieve faster go-to-market, acknowledging the significant demand. However, anticipating exploding usage and costs, they see future potential for custom models to manage expenses.
7. Live Demo: Agent Modifying Itself
- Gari provides a live demo showcasing the agent adding a new feature (a simple user dialogue tool) to its own VS Code extension codebase.
- The demo illustrates the agent's workflow: fetching task details from a Linear ticket using a built-in integration, utilizing the Augment context engine for orientation, planning the changes, editing multiple files (inheriting from correct classes, implementing interfaces, using VS Code APIs), and handling dependencies.
- The demo concludes by successfully invoking the newly added tool within the IDE and initiating a PR creation process via GitHub integration, demonstrating an end-to-end task completion cycle managed by the agent within the developer's familiar environment.
8. The Future of IDEs vs. Agent-Centric Interfaces
- While Augment currently focuses on deep IDE integration, Gari speculates on the long-term evolution of developer workflows. He envisions a future where the IDE might become a secondary tool used for deep dives (perhaps 20% of the time), with developers spending most of their time (80%) managing tasks through dedicated agent control interfaces (web apps or standalone apps).
- However, he stresses that for their target users working in complex codebases, the need for IDE-based interaction remains strong today, justifying their current product strategy. This reflects a pragmatic view grounded in current user needs and feedback.
9. Tool Extensibility and MCP Support
- Alessio prompts a discussion on MCP (Model Context Protocol), the emerging open standard for connecting agents to external tools and data sources. Gari confirms Augment supports MCP, recognizing its value for power users who want to connect the agent to custom internal or external systems not covered by built-in integrations.
- He notes a current trade-off: Augment's first-party integrations (GitHub, Linear) offer smooth authentication, while matching that ease of use through the generalized MCP standard is still challenging. They would consider replacing custom integrations with MCP equivalents once the user experience, particularly authentication, becomes equally seamless. (A minimal MCP tool server is sketched below.)
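For readers unfamiliar with MCP, a minimal custom tool server might look roughly like the sketch below, assuming the reference MCP Python SDK and its FastMCP helper; the `lookup_ticket` tool and the "internal tracker" are invented for illustration:

```python
# pip install mcp  (reference Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tools")


@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Fetch a ticket summary from a hypothetical internal tracker."""
    return f"Summary for ticket {ticket_id}: ..."


if __name__ == "__main__":
    mcp.run()  # an MCP-capable agent can now discover and call lookup_ticket
```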
10. Research Directions and Market Perspectives
- Gari highlights promising research areas, particularly reinforcement learning (RL) for coding, i.e., training models on feedback from the outcomes of their actions. He specifically recommends the recent SWE-RL paper (Wei et al.) as an insightful contribution, alongside foundational work like the DeepSeek Coder RL paper and a solid understanding of techniques like DPO (Direct Preference Optimization) and its variants (the DPO objective is written out after this list).
- As an ex-Google employee from the PaLM 2 team, Gari offers perspective on Google's AI trajectory. He acknowledges the "crisis moment" post-ChatGPT but notes a significant internal cultural shift towards faster, more aggressive development. Based on conversations with current employees, he senses optimism about Google's ability to compete effectively, driven by strong talent and motivation, even as the AI race continues. Gari concludes that despite rapid progress, "it's still day one" in terms of model capabilities, indicating substantial room for improvement remains before achieving truly autonomous, human-level coding proficiency.
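For reference, the DPO objective mentioned above optimizes the policy directly on preference pairs (a chosen response y_w versus a rejected one y_l) without training a separate reward model:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Here $\pi_{\mathrm{ref}}$ is a frozen reference policy, $\beta$ controls how far the trained policy may drift from it, and $\sigma$ is the logistic function.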
11. Call to Action and Open Source Contribution
- Gari encourages listeners, especially those working with large, complex codebases, to try Augment Code by downloading the extension from augmentcode.com. A free tier is available for individual developers willing to share code for training purposes.
- Importantly for the research community, Gari announces that Augment Code has open-sourced their implementation for achieving the #1 SWE-Bench score, allowing others to examine and build upon their methods.
Augment's #1 SWE-Bench agent underscores how pragmatic engineering—blending off-the-shelf models with targeted techniques and deep workflow integration—currently leads in complex coding tasks. Investors and researchers should monitor how context-awareness and seamless IDE integration, beyond raw benchmark scores, drive practical adoption and value creation.