This episode unveils Augment Code's new AI coding agent, which has achieved the #1 spot on the SWE-Bench verified benchmark, and delves into the technical strategies, market positioning, and future implications for AI-assisted software development.
Introducing Gari and Augment Code's New Agent
- The episode features Gari from Augment Code, introduced by hosts Alessio (Partner & CTO, Decibel Partners) and Swyx (Founder, smol.ai).
- Gari announces Augment Code's new "agent" feature, building upon their existing offerings like code completions, "next edit" (modifying code away from the cursor), and chat with codebase understanding.
- This new agent possesses codebase understanding, allowing it to comprehend requests, identify where and how to make changes while respecting existing conventions, execute commands, run tests, and ultimately aim for a working Pull Request (PR).
Achieving #1 on SWE-Bench
- Augment Code's agent secured the top position on the SWE-Bench verified benchmark. SWE-Bench (Software Engineering Benchmark) is a standard dataset used to evaluate the ability of large language models to resolve real-world GitHub issues.
- Gari confirms this achievement, noting SWE-Bench has been crucial for exploring agent capabilities and refining their approach. Swyx notes that this dethrones previously featured competitors.
- Critically, Gari clarifies that the agent's generation capabilities rely on off-the-shelf models. Augment's proprietary, custom-trained models power the product's codebase understanding, but their impact on SWE-Bench itself was small because the benchmark's issues often already pinpoint where the change needs to be made.
Technical Strategies Behind the Benchmark Success
- Gari discusses techniques contributing to the SWE-Bench score, echoing strategies mentioned by previous guest Eric from Anthropic.
- Sequential Thinking: Having the model reflect step by step proved more effective in their coding tests than Anthropic's built-in reasoning mode with Claude 3.7 Sonnet, leading to a score bump. Gari notes, "sequential thinking helped bump the score... there were a few things that helped."
- Ensembling: Running the agent multiple times and combining the results via a simple majority vote also improved performance over a single run (N=1), though Gari acknowledges the associated cost increase (see the sketch after this list).
- Reliable File Edits: Gari emphasizes that achieving reliable file editing by the agent was a non-trivial engineering challenge they iterated on significantly.
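To make the ensembling step concrete, here is a minimal sketch (assumptions only, not Augment's implementation) of running an agent several times on the same issue and keeping the most common candidate patch by simple majority vote; `run_agent` is a hypothetical stand-in for one full agent trajectory.

```python
# Hedged sketch: majority-vote ensembling over multiple agent runs.
# run_agent is a hypothetical placeholder that returns a candidate patch (diff).
from collections import Counter
from typing import Callable

def ensemble_majority_vote(run_agent: Callable[[str], str], issue: str, n_runs: int = 5) -> str:
    """Run the agent n_runs times and return the most common candidate patch."""
    candidates = [run_agent(issue) for _ in range(n_runs)]
    # Ties resolve to the earliest-produced candidate (Counter preserves insertion order).
    patch, _votes = Counter(candidates).most_common(1)[0]
    return patch
```

In practice candidate patches rarely match byte-for-byte, so voting is usually done on normalized diffs or on behavior (e.g., test outcomes); either way, cost scales roughly linearly with the number of runs, which is the trade-off Gari flags above.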
Challenges: Cost, UX, and Multi-Model Necessity
- The conversation highlights the significant cost of running capable large models, making multi-model strategies (mixing and matching models, potentially from different providers such as OpenAI and Anthropic) a practical necessity for real-world systems. Augment Code was built with multi-model support from the start (e.g., separate retrieval models); a rough sketch of this kind of routing follows after this list.
- While more complex ensembling might yield further performance gains (potentially a few percentage points on SWE-Bench), Gari points out a crucial User Experience (UX) challenge: users want to supervise the agent's process in real-time. Ensembling creates multiple trajectories, making supervision difficult. This UX friction might limit complex ensembling even if costs decrease.
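As a rough illustration of the multi-model point (an assumption-laden sketch, not Augment's architecture), the snippet below routes different sub-tasks to different providers through a small registry; the model names and the `call_model` helper are placeholders.

```python
# Hedged sketch: a per-task model registry that mixes providers.
# All model identifiers and call_model() are illustrative placeholders.
MODEL_REGISTRY = {
    "retrieval": {"provider": "in-house", "model": "custom-retrieval-model"},
    "generation": {"provider": "anthropic", "model": "frontier-coding-model"},
    "review": {"provider": "openai", "model": "reasoning-model"},
}

def call_model(provider: str, model: str, prompt: str) -> str:
    raise NotImplementedError("Dispatch to the provider's SDK in a real system.")

def route(task: str, prompt: str) -> str:
    """Send the prompt to whichever model is registered for this task type."""
    cfg = MODEL_REGISTRY[task]
    return call_model(cfg["provider"], cfg["model"], prompt)
```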
Experimentation and Evaluation Frameworks
- Addressing how Augment Code iterates and evaluates, Gari outlines their process:
- Starts with a small, curated set of samples (e.g., 10) for initial development, often using notebooks.
- Scales to hundreds or thousands of samples using internal infrastructure for automated evaluations, including checks with code execution (running tests) and comparisons against ground truth (a minimal sketch of such a check follows after this list).
- Crucially, they bridge the research-to-production gap by running production systems against research evaluation sets to catch regressions during deployment.
- Utilizes human contractors for evaluations where automation is difficult, particularly for chat and agent interactions.
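As a toy illustration of the execution-based checks described above (not Augment's internal tooling), this sketch applies a candidate patch in a repository checkout and uses the test suite's exit code as the pass/fail signal; the paths and test command are assumptions.

```python
# Hedged sketch: execution-based evaluation of a candidate patch.
import subprocess

def evaluate_patch(repo_dir: str, patch_path: str, test_cmd: list[str]) -> bool:
    """Apply a patch and report whether the repo's tests pass afterwards."""
    applied = subprocess.run(["git", "apply", patch_path], cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # The patch did not even apply cleanly.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0  # Exit code 0 means the tests passed.

# Hypothetical usage:
# ok = evaluate_patch("/tmp/repo-checkout", "/tmp/candidate.diff", ["pytest", "-x"])
```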
Further Technical Insights and Multi-Agent Approaches
- Gari mentions small technical wins, such as finding that a simple string-replace edit was more effective than "smart paste" in certain contexts (a sketch of a string-replace edit follows after this list).
- Multi-Agent Concepts: Augment employs strategies that resemble multi-agent systems:
- Orientation Phase: An initial step where the agent analyzes the codebase to understand conventions, especially around testing frameworks and execution for SWE-Bench, and broader context (frameworks, versions) for the product.
- Memories: A feature allowing the agent to learn from mistakes or user guidance regarding codebase conventions (e.g., how tests are stored or run), preventing repeated errors.
- Planned Thorough Orientation: A future feature involving a more intensive (several minutes) analysis to deeply understand project structure, languages, tests, and frameworks.
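To illustrate the string-replace editing mentioned above (a sketch under assumptions, not Augment's actual edit tool), the snippet below only performs an edit when the target string appears exactly once in the file, one common way to keep such edits reliable and unambiguous.

```python
# Hedged sketch: a conservative string-replace file edit.
from pathlib import Path

def str_replace_edit(path: str, old_str: str, new_str: str) -> None:
    """Replace old_str with new_str, refusing missing or ambiguous matches."""
    text = Path(path).read_text()
    count = text.count(old_str)
    if count != 1:
        raise ValueError(f"Expected exactly one match for old_str, found {count}.")
    Path(path).write_text(text.replace(old_str, new_str, 1))
```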
Competitive Landscape and Augment's Philosophy
- Gari positions Augment Code relative to competitors like Devin, Factory AI, and Sourcegraph.
- Augment's core focus is on developers working within large, complex, existing codebases, rather than "0-to-1" greenfield development. This drives their investment in the context engine.
- Their philosophy is to meet developers where they are, integrating directly into IDEs (VS Code, JetBrains, Vim plugin) without forcing major workflow changes. Gari states, "our philosophy is that we want to meet developers where they are."
- Compared to Magic.dev or Poolside, who invest heavily in proprietary models, Augment currently prioritizes speed-to-market using off-the-shelf generation models, though Gari anticipates the exploding usage (and cost) of agents may necessitate custom models later.
Live Demo: Agent Modifying Itself
- Gari performs a live demo where the Augment agent adds a new feature (a tool to show a dialog box) to the agent's own VS Code extension codebase.
- The demo showcases the agent:
- Fetching task details from a Linear ticket using built-in integration.
- Using the Augment context engine to understand how tools are implemented.
- Planning the changes.
- Editing multiple files, respecting existing class structures and interfaces (e.g., inheriting correctly, implementing safety features).
- Building the modified extension.
- Successfully invoking the new tool within the updated agent.
- Initiating a PR creation process via GitHub integration.
The Evolving Role of the IDE
- While acknowledging agents are becoming powerful, potentially reducing IDE dependency for simpler tasks, Gari believes that for complex codebase work (Augment's target), the IDE remains essential for now.
- He envisions a future where the IDE might become more of a specialized tool for deep dives ("an app that you can launch when you need to dig in deeper"), with developers spending more time managing agents elsewhere, but asserts the market isn't there yet for their target users.
MCP Support and Extensibility
- Augment supports MCP (Model Context Protocol), an open standard for connecting agents to external tools and data sources, allowing users to extend the agent's capabilities by connecting it to external or internal systems beyond built-in integrations (like GitHub, Linear, Notion); a minimal server sketch follows after this list.
- Gari sees value in MCP for power users but notes that custom integrations currently offer a smoother authentication experience, presenting a trade-off. They would consider replacing custom integrations with MCP if the user experience becomes equally seamless.
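For readers unfamiliar with MCP, here is a minimal sketch of a custom MCP server built with the official Python SDK's FastMCP helper; the server name, the lookup_ticket tool, and its stubbed data are hypothetical and are not one of Augment's built-in integrations.

```python
# Hedged sketch: a tiny custom MCP server an agent could connect to as an extra tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tickets")

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return a short description for an internal ticket (stubbed data)."""
    tickets = {"ENG-123": "Add a dialog-box tool to the VS Code extension."}
    return tickets.get(ticket_id, "Ticket not found.")

if __name__ == "__main__":
    mcp.run()  # Serves over stdio by default, so an MCP client can attach.
```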
Research Directions: RL for Coding
- Gari expresses interest in Reinforcement Learning (RL) – a type of machine learning where agents learn by trial and error, receiving rewards or penalties – for improving coding models.
- He recommends a recent paper, SWIRL ("Synthesis With Reinforcement Learning") by Wey et al., as a valuable resource on RL applied to coding, alongside foundational work like the DeepSeek Coder paper and an understanding of techniques like DPO and GRPO (a small sketch of the GRPO-style group-relative reward idea follows after this list).
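As a rough illustration of the GRPO-style idea (a sketch, not any specific paper's implementation), the snippet below scores a group of sampled solutions with a pass/fail test reward and converts them into group-relative advantages by normalizing against the group's mean and standard deviation.

```python
# Hedged sketch: group-relative advantages (GRPO-style) from pass/fail test rewards.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its sampling group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # Avoid division by zero when all rewards are equal.
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled patches for one issue, rewarded 1.0 if tests pass, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```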
Market Perspective: The Google Gemini Revival
- As an ex-Google employee (on the PaLM 2 team) who left around the time of ChatGPT's launch, Gari shares insights into Google's response.
- He notes a significant cultural shift towards being more "startup-y, fast-moving, aggressive," with internal optimism about Google's ability to compete, driven by strong talent and motivation.
- Despite Gemini 1.5 Pro's strength, Gari believes "it's still day one" for AI capabilities, seeing substantial room for improvement beyond current models, especially in reducing the need for users to break down complex tasks. Gari reflects, "I really have to take a task and break it down to bite-sized pieces and feed it so I think there's... a ton of room for improvement."
Call to Action & Open Source Contribution
- Gari encourages listeners, especially those working with large, complex codebases, to try Augment Code via augmentcode.com. A free tier is available for individual developers willing to share code for training purposes.
- Significantly, Augment Code has open-sourced their SWE-Bench implementation, allowing researchers and developers to see exactly how they achieved the #1 ranking.
Augment Code's #1 SWE-Bench agent highlights the rapid progress in AI coding, emphasizing practical integration within developer workflows and existing complex codebases. Investors and researchers should track agent benchmark performance, IDE integration trends, and the evolving cost/capability trade-offs as key indicators of market direction and technological maturity.