Latent Space
April 2, 2025

The #1 SWE-Bench Verified Agent

Guy from Augment Code joins the pod to announce their new coding agent feature, which just snagged the #1 spot on the SWE-Bench Verified benchmark by demonstrating serious codebase understanding and workflow prowess.

Augment's Agent: #1 on SWE-Bench & Codebase Mastery

  • "Today we're launching agent with codebase understanding... the agent can understand the request, understand where to make changes and how in the codebase... and then it can go and execute commands, run tests, all the way to getting to a working PR."
  • "We just made number one on SWE Bench... for us, SWE Bench has been a useful tool for exploring how can we get the most out of agents."
  • Augment’s agent isn't just writing code; it's navigating complex projects end-to-end – understanding requests, pinpointing where to make changes while respecting codebase conventions, running tests, and pushing PRs – and it integrates neatly with tools like Linear and GitHub.
  • Hitting #1 on SWE-Bench wasn't about brute-forcing with custom models (for generation); it was achieved with off-the-shelf LLMs, showcasing that smart agent architecture and tooling can outperform raw model scale, at least for now.
  • In a nod to the community, Augment open-sourced their SWE-Bench implementation, laying bare the techniques that propelled them to the top rank.

Inside the Agent: Architecture & Tuning for Performance

  • "Sequential thinking lets the model kind of reflect and do better... another thing that helped was at the end run the agent a few times and Ensemble those scores."
  • "For agents, orientation in particular is extremely important for getting good value... we have this orientation phase where we try to look around the code base and try to understand the conventions."
  • Getting peak performance involved clever tricks: "Sequential thinking" prompting helped the model reflect and plan better than standard reasoning modes (a sketch of the idea follows this list), while ensembling (running the agent multiple times) boosted scores, though cost and UX complexity currently limit how far this can go.
  • The secret sauce includes deep codebase understanding. An "orientation phase" lets the agent learn project-specific conventions (testing frameworks, file structures) upfront, complemented by "memories" where it learns from mistakes – vital for real-world complex projects, even if SWE-Bench doesn’t fully test this yet.
  • Development relies on a multi-pronged evaluation strategy (curated samples, automated evals, contractor reviews) and supporting multiple LLMs (OpenAI, Anthropic) is essential for balancing cost, performance, and cloud compatibility in production.
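
One way to read the "sequential thinking" technique mentioned above is as a lightweight tool the model can call to lay out numbered thoughts between actions, rather than relying on a built-in reasoning mode. The sketch below is a hypothetical illustration of that pattern, not Augment's implementation; the class and method names are invented.

```python
# Hypothetical "sequential thinking" tool: the model calls it to record numbered
# thoughts and signal whether it wants to keep reflecting before acting.
class SequentialThinking:
    def __init__(self):
        self.thoughts: list[str] = []

    def think(self, thought: str, needs_more: bool) -> dict:
        """Record one reasoning step and echo the running plan back to the model."""
        self.thoughts.append(thought)
        return {
            "step": len(self.thoughts),
            "plan_so_far": self.thoughts,
            "continue": needs_more,  # the model decides when to stop reflecting
        }
```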

Integrating with Developers: Philosophy & The Road Ahead

  • "Our product is meant for developers who are working in organizations that have large complex code bases... it's less about 0-to-1 development and it's more about working within existing large code bases."
  • "Our philosophy is that we want to meet developers where they are... our product is a set of extensions... we really try to not ask developers to change their workflow."
  • Augment isn't trying to replace developers or radically change their habits. The focus is squarely on enhancing productivity within large, existing codebases by integrating smoothly into IDEs (VS Code, JetBrains, Vim), meeting developers in their natural habitat.
  • While agent capabilities are skyrocketing, Augment bets the IDE remains the crucial hub for developers today, especially for complex tasks. The future might see agents handling more upfront, but the IDE likely remains essential for deep dives and supervision.
  • Extensibility via MCP is supported, but Augment also builds polished first-party integrations (GitHub, Linear) prioritizing a seamless user experience, particularly around authentication. Expect the pressure of exploding agent usage and costs to push towards custom, cost-optimized models down the line.

Key Takeaways:

  • Augment's SWE-Bench victory highlights that sophisticated agent architecture, deep codebase context, and seamless workflow integration currently trump raw model power. The path forward involves balancing potent AI capabilities with pragmatic concerns like cost, user experience, and fitting into how developers actually work. Future agents need to be less like generic tools and more like context-aware collaborators.
  • Architecture Beats Models (For Now): Augment hit #1 on SWE-Bench with off-the-shelf LLMs, proving intelligent agent design and context injection are paramount.
  • Integrate, Don't Dictate: Winning developer adoption means embedding agents within existing IDEs and workflows, especially for navigating complex enterprise code.
  • Context & Cost Shape the Future: Deep codebase understanding ("orientation," "memory") and tackling the escalating cost of agent operation are the next major frontiers in agent development.

For further insights, watch the full podcast: Link

This episode unveils Augment Code's new AI coding agent, which has achieved the #1 spot on the SWE-Bench Verified benchmark, delving into the technical strategies, market positioning, and future implications for AI-assisted software development.

Introducing Guy and Augment Code's New Agent

  • The episode features Guy from Augment Code, introduced by hosts Alessio (Partner & CTO, Decibel Partners) and Swyx (Founder, smol.ai).
  • Guy announces Augment Code's new "agent" feature, building on their existing offerings: code completions, "next edit" (modifying code away from the cursor), and chat with codebase understanding.
  • This new agent possesses codebase understanding, allowing it to comprehend requests, identify where and how to make changes respecting existing conventions, execute commands, run tests, and ultimately aim for a working Pull Request (PR).
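
The loop described in that last bullet – understand the request, find and edit the right code, run tests, end with a PR – is, at its core, an LLM driving a small set of tools. Below is a minimal, hedged sketch of such a loop; the tool set, the call_llm helper, and the control flow are generic assumptions, not Augment's actual architecture.

```python
# Minimal agent loop: an LLM repeatedly picks a tool until it is ready to open a PR.
# Tool names and the call_llm interface are illustrative placeholders.
import subprocess

def run_command(cmd: str) -> str:
    """Run a shell command (e.g. the test suite) and return its combined output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def edit_file(path: str, old: str, new: str) -> str:
    """Apply an exact string replacement; refuse ambiguous or missing matches."""
    text = open(path).read()
    if text.count(old) != 1:
        return "error: old text not found exactly once"
    open(path, "w").write(text.replace(old, new))
    return "ok"

TOOLS = {"run_command": run_command, "edit_file": edit_file}

def agent(task: str, call_llm, max_steps: int = 20) -> str:
    """Drive the model until it declares the change ready (or the step budget runs out)."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(history, tools=list(TOOLS))   # model proposes the next action
        history.append({"role": "assistant", "content": str(step)})
        if step["type"] == "done":                    # e.g. tests pass, ready for a PR
            return step["summary"]
        observation = TOOLS[step["tool"]](**step["args"])
        history.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```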

Achieving #1 on SWE-Bench

  • Augment Code's agent secured the top position on the SWE-Bench Verified benchmark. SWE-Bench (Software Engineering Benchmark) is a standard dataset used to evaluate the ability of large language models to resolve real-world GitHub issues; the Verified subset is a human-validated selection of those issues.
  • Guy confirms the achievement, noting SWE-Bench has been crucial for exploring agent capabilities and refining their approach. Swyx highlights that this dethrones previously featured competitors.
  • Critically, Guy clarifies that code generation relies on off-the-shelf models, while Augment's proprietary, custom-trained models power the product's codebase understanding; their impact on SWE-Bench itself was small, since the benchmark's issues often already pinpoint where changes need to be made.
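
For readers unfamiliar with how the benchmark is scored: each SWE-Bench instance pins a repository to the commit where the issue was filed and lists tests the fix must make pass (plus tests that must keep passing). The sketch below is a simplified approximation of that scoring; the official harness runs inside per-repository Docker images, and many repos use their own test runners rather than plain pytest.

```python
# Simplified SWE-Bench-style scoring: apply the agent's patch at the issue's base
# commit, then require the fail-to-pass tests to pass and the pass-to-pass tests
# to keep passing. Field names follow the published dataset (JSON-encoded lists).
import json
import subprocess

def resolved(instance: dict, model_patch: str, repo_dir: str) -> bool:
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=repo_dir, check=True)
    applied = subprocess.run(["git", "apply", "-"], input=model_patch, text=True, cwd=repo_dir)
    if applied.returncode != 0:
        return False  # patch does not even apply
    tests = json.loads(instance["FAIL_TO_PASS"]) + json.loads(instance["PASS_TO_PASS"])
    for test in tests:
        run = subprocess.run(["python", "-m", "pytest", test], cwd=repo_dir)
        if run.returncode != 0:
            return False
    return True
```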

Technical Strategies Behind the Benchmark Success

  • Guy discusses techniques contributing to the SWE-Bench score, echoing strategies mentioned by previous guest Eric from Anthropic.
  • Sequential Thinking: This approach, where the model reflects step-by-step, proved more effective than Anthropic's "reasoning mode" with Claude 3.7 Sonnet in their tests for coding tasks, leading to a score bump. Guy notes, "sequential thinking helped bump the score... there were a few things that helped."
  • Ensembling: Running the agent multiple times and combining the results with a simple majority vote also improved performance (see the sketch after this list), though Guy acknowledges the associated cost increase.
  • Reliable File Edits: Guy emphasizes that getting the agent to edit files reliably was a non-trivial engineering challenge they iterated on significantly.
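
A minimal sketch of the majority-vote ensembling described above, assuming each run of the agent returns a candidate patch and the most common (lightly normalized) patch wins. The run_agent callable and the normalization are placeholders; the episode does not go into how ties or near-duplicate diffs are handled.

```python
# Ensembling by majority vote over independent agent runs.
from collections import Counter

def normalize(patch: str) -> str:
    """Crude normalization so trivially different diffs can still be counted together."""
    return "\n".join(line.rstrip() for line in patch.strip().splitlines())

def ensemble(task: str, run_agent, n: int = 5) -> str:
    """Run the agent n times and return the most frequently produced patch."""
    candidates = [normalize(run_agent(task)) for _ in range(n)]
    winner, _votes = Counter(candidates).most_common(1)[0]
    return winner
```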

Challenges: Cost, UX, and Multi-Model Necessity

  • The conversation highlights the significant cost of running capable large models, making multi-model strategies (mixing and matching models, potentially from different providers like OpenAI and Anthropic) a practical necessity for real-world systems; a simplified routing sketch follows this list. Augment Code was built with multi-model support from the start (e.g., separate retrieval models).
  • While more aggressive ensembling might yield further gains (potentially a few percentage points on SWE-Bench), Guy points out a crucial User Experience (UX) challenge: users want to supervise the agent's process in real time, and ensembling creates multiple trajectories, making supervision difficult. This UX friction may limit complex ensembling even if costs decrease.
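
In practice, "multi-model support" often boils down to a thin routing layer that sends each kind of request to the cheapest model that can handle it. The table below is purely illustrative: the task categories and model names are placeholders, not Augment's actual configuration.

```python
# Illustrative routing table: different models/providers for different jobs.
ROUTES = {
    "retrieval": {"provider": "in-house", "model": "custom-retriever"},  # proprietary retrieval model
    "generation": {"provider": "anthropic", "model": "claude-sonnet"},    # main coding model
    "summaries": {"provider": "openai", "model": "small-cheap-model"},    # low-cost side tasks
}

def route(task_kind: str) -> dict:
    """Pick a model for a request; fall back to the main generation model."""
    return ROUTES.get(task_kind, ROUTES["generation"])
```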

Experimentation and Evaluation Frameworks

  • Addressing how Augment Code iterates and evaluates, Guy outlines their process (a simplified harness sketch follows the list):
    • Starts with a small, curated set of samples (e.g., 10) for initial development, often using notebooks.
    • Scales to hundreds or thousands of samples using internal infrastructure for automated evaluations, including checks with code execution (running tests) and comparisons against ground truth.
    • Crucially, they bridge the research-to-production gap by running production systems against research evaluation sets to catch regressions during deployment.
    • Utilizes human contractors for evaluations where automation is difficult, particularly for chat and agent interactions.
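
A compressed sketch of that evaluation ladder: the same check that starts on a handful of curated samples in a notebook later runs across hundreds or thousands of cases, combining execution-based checks with ground-truth comparison. All function names below are placeholders standing in for internal infrastructure.

```python
# Illustrative evaluation loop: execute the agent's output and compare it
# against ground truth; the callables stand in for internal infrastructure.
def evaluate(samples, run_agent, run_tests, matches_ground_truth):
    results = []
    for sample in samples:
        output = run_agent(sample["task"])
        tests_ok = run_tests(sample["repo"], output)               # execution-based check
        gt_ok = matches_ground_truth(output, sample["expected"])   # reference comparison
        results.append({"id": sample["id"], "tests_ok": tests_ok, "gt_ok": gt_ok})
    pass_rate = sum(r["tests_ok"] for r in results) / len(results)
    return pass_rate, results

# Start with ~10 curated samples in a notebook, then point the same function at the
# full eval set – and at the production system – to catch regressions at deploy time.
```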

Further Technical Insights and Multi-Agent Approaches

  • Guy mentions small technical wins, like finding exact string replacement more effective than "smart paste" in certain contexts.
  • Multi-Agent Concepts: Augment employs strategies that resemble multi-agent systems:
    • Orientation Phase: An initial step where the agent analyzes the codebase to understand conventions, especially around testing frameworks and execution for SWE-Bench, and broader context (frameworks, versions) for the product (see the sketch after this list).
    • Memories: A feature allowing the agent to learn from mistakes or user guidance regarding codebase conventions (e.g., how tests are stored or run), preventing repeated errors.
    • Planned Thorough Orientation: A future feature involving a more intensive (several minutes) analysis to deeply understand project structure, languages, tests, and frameworks.
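
One way to picture the orientation phase and memories: before its first edit, the agent scans the repository for convention signals (test framework, packaging, CI), and anything it later gets wrong and is corrected on gets persisted for the next session. The sketch below is a hypothetical illustration under those assumptions, not Augment's implementation.

```python
# Hypothetical orientation pass plus a simple persistent "memories" store.
import json
import os

def orient(repo_dir: str) -> dict:
    """Collect convention signals the agent should respect before editing."""
    files = set()
    for _root, _dirs, names in os.walk(repo_dir):
        files.update(names)
    return {
        "test_framework": "pytest" if {"pytest.ini", "conftest.py"} & files else "unknown",
        "uses_pyproject": "pyproject.toml" in files,
        "has_ci": os.path.isdir(os.path.join(repo_dir, ".github")),
    }

class Memories:
    """Persist lessons (e.g. 'integration tests live under tests/e2e') across sessions."""
    def __init__(self, path: str = ".agent_memories.json"):
        self.path = path
        self.items = json.load(open(path)) if os.path.exists(path) else []

    def add(self, lesson: str) -> None:
        self.items.append(lesson)
        with open(self.path, "w") as f:
            json.dump(self.items, f)
```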

Competitive Landscape and Augment's Philosophy

  • Guy positions Augment Code relative to competitors like Devin, Factory AI, and Sourcegraph.
  • Augment's core focus is on developers working within large, complex, existing codebases, rather than "0-to-1" greenfield development. This drives their investment in the context engine.
  • Their philosophy is to meet developers where they are, integrating directly into IDEs (VS Code, JetBrains, a Vim plugin) without forcing major workflow changes. Guy states, "our philosophy is that we want to meet developers where they are."
  • Compared to Magic.dev or Poolside, which invest heavily in proprietary models, Augment currently prioritizes speed to market using off-the-shelf generation models, though Guy anticipates that the exploding usage (and cost) of agents may necessitate custom models later.

Live Demo: Agent Modifying Itself

  • Guy performs a live demo in which the Augment agent adds a new feature (a tool that shows a dialog box) to the agent's own VS Code extension codebase.
  • The demo showcases the agent:
    • Fetching task details from a Linear ticket using built-in integration.
    • Using the Augment context engine to understand how tools are implemented.
    • Planning the changes.
    • Editing multiple files, respecting existing class structures and interfaces (e.g., inheriting correctly, implementing safety features).
    • Building the modified extension.
    • Successfully invoking the new tool within the updated agent.
    • Initiating a PR creation process via GitHub integration.

The Evolving Role of the IDE

  • While acknowledging that agents are becoming powerful enough to reduce IDE dependency for simpler tasks, Guy believes that for complex codebase work (Augment's target), the IDE remains essential for now.
  • He envisions a future where the IDE might become more of a specialized tool for deep dives ("an app that you can launch when you need to dig in deeper"), with developers spending more time managing agents elsewhere, but asserts the market isn't there yet for their target users.

MCP Support and Extensibility

  • Augment supports MCP (Model Context Protocol), an open standard for connecting agents to external tools and data sources, allowing users to extend the agent's capabilities beyond the built-in integrations (GitHub, Linear, Notion) to external or internal systems.
  • Guy sees value in MCP for power users but notes that custom integrations currently offer a smoother authentication experience, presenting a trade-off. They would consider replacing custom integrations with MCP if the user experience becomes equally seamless.

Research Directions: RL for Coding

  • Guy expresses interest in Reinforcement Learning (RL) – a type of machine learning where agents learn by trial and error, receiving rewards or penalties – for improving coding models.
  • He specifically recommends a recent paper, SWIRL ("Synthesis With Reinforcement Learning") by Wey et al., as a valuable resource on RL applied specifically to coding, alongside foundational work like the DeepSeek Coder paper and an understanding of techniques like DPO and GRPO (the standard objectives are sketched below).
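
For reference, the standard formulations behind those two acronyms (general definitions from the literature, not anything specific to the papers mentioned): DPO optimizes pairwise preferences directly against a frozen reference policy, while GRPO drops the learned value function and uses group-normalized rewards as advantages in a PPO-style clipped update.

```latex
% DPO: widen the gap between chosen (y_w) and rejected (y_l) completions,
% measured as log-probability ratios against a frozen reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]

% GRPO: sample a group of G completions per prompt, score each (e.g. do the tests
% pass?), and use the group-normalized reward as the advantage.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
```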

Market Perspective: The Google Gemini Revival

  • As an ex-Google employee (on the PaLM 2 team) who left around the time of ChatGPT's launch, Guy shares insights into Google's response.
  • He notes a significant cultural shift toward being more "startup-y, fast-moving, aggressive," with internal optimism about Google's ability to compete, driven by strong talent and motivation.
  • Despite Gemini 2.5 Pro's strength, Guy believes "it's still day one" for AI capabilities, seeing substantial room for improvement beyond current models, especially in reducing how much users must break complex tasks down themselves. Guy reflects, "I really have to take a task and break it down to bite-sized pieces and feed it so I think there's... a ton of room for improvement."

Call to Action & Open Source Contribution

  • Guy encourages listeners, especially those working with large, complex codebases, to try Augment Code via augmentcode.com. A free tier is available for individual developers willing to share code for training purposes.
  • Significantly, Augment Code has open-sourced their SWE-Bench implementation, allowing researchers and developers to see exactly how they achieved the #1 ranking.

Augment Code's #1 SWE-Bench agent highlights the rapid progress in AI coding, emphasizing practical integration within developer workflows and existing complex codebases. Investors and researchers should track agent benchmark performance, IDE integration trends, and the evolving cost/capability trade-offs as key indicators of market direction and technological maturity.
