
by Latent Space
Date: [Insert Date]
Quick Insight: This summary targets developers and researchers tracking the transition from simple code completion to autonomous software agents. It explains why the next generation of benchmarks focuses on competitive tournaments and long-horizon development rather than static test suites.
"I don't like unit tests as a form of verification."
"I definitely don't believe in this idea of just getting rid of the human."
"How do you benchmark understanding?"
Podcast Link: Click here to listen

We're here at NeurIPS with John Yang of SWE-bench and many other things. Welcome.
Thanks so much for having me. Yeah, really happy to be here. Uh, last year I talked to Ofir, and I think Carlos as well, one of your co-authors. How's SWE-bench doing? Just generally, the project is like one and a half years old.
Yeah, I think one and a half years old in terms of when it was actually useful. We put it out in October 2023 and people didn't really touch it too much, and then of course Cognition came on the scene and Devin was an amazing release. I think after that it kind of kicked off the arms race.
Did they tell you beforehand or they just showed up?
It was, you know... I got an email about two weeks before. I think it was from Walden. He was like, "Hey, you know, we have a good number on it." I was like, "Wow, congrats. Thanks for using it." And then the release was mind-blowing. I was like, "Wow, these guys did an excellent job."
Amazing. And then SWE-bench Verified was maybe last year. That's right.
Catch us up on this year. You have other languages; there's a whole bunch of varieties of SWE-bench now.
Yeah, for sure. Um, I think there are a couple of extensions happening. One is more SWE-benches: SWE-bench Pro, SWE-bench Live.
Oh, SWE-bench Pro. Was that with you guys? Because it looks independent. It's like different authors.
It's completely independent. Yeah. So, they just call it SWE-bench Pro without your blessing?
Yeah, I think we're okay with it. When it came out, we were like, "Oh, cool. Interesting." It would have been fun to be part of it. But, you know, congrats to them. It's a great benchmark.
Uh, but yeah, multimodal. Yeah, we did multimodal and multilingual. And multilingual seems to be... is it like JavaScript? What else?
Yeah. Multilingual is like nine languages across something like 40 repos. You've got JavaScript, Rust, Java, C, you know, Ruby.
And then for core SWE-bench itself, a lot of people talk about the Django focus.
Is there, I don't know, how do we move past Django?
Yeah, for sure. I mean, it's cool to see a lot of the newer benchmarks really try to diversify the repos. In the two follow-ups we did with multimodal and multilingual, we made it a point to do that. But you could also just put out SWE-bench 2025 and...
That is true. And do a new distribution. Yeah.
So it's been cool to see the follow-ups. And, quietly, it's an open question for me; I'm excited to see how people curate the next sets. It's interesting to see in the literature, or in their blog posts, how they justify creating their separate split: the easier ones were "more languages, more repos," and now people are saying "ours is more difficult because of this curation technique." I'm excited to see how long that lasts and where we end up guiding the evaluations.
And more recently you're working on Code Clash.
Uh, so let's give people... you've already done other podcasts about it; I'll refer people to your chat with Andy. But just give people a one-two sentence version.
No, happy to do it, especially on your podcast. It's an honor. Um, yeah. So basically the idea is: I don't like unit tests as a form of verification. And I also think there's an issue with SWE-bench where all of the task instances are independent of each other. The moment the model submits, it's done, that's the end of the story, end of the episode. So with Code Clash, what we're thinking is, let's try to really evaluate long-horizon development, development on a codebase that is consequential and conditioned on what the model did to that codebase before.
And so the general idea is you have two or more language models and they play a programming tournament. What that means is each model maintains its own codebase, and in each round of the tournament, first they get to edit and improve their codebase however they see fit, very self-determined; then, in the competition phase, those codebases are pitted against each other. The codebases are run in an arena (we have a lot of diverse arenas), the arena determines whether codebase A is better than codebase B, and you repeat that across multiple rounds, with standings determined by an Elo judge.
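For concreteness, here is a minimal sketch of what that edit-then-compete loop could look like. This is not the actual Code Clash implementation; `improve` and `play_match` are hypothetical callables standing in for the agent's edit phase and the arena match, and the Elo constants are just the standard defaults.

```python
# Hedged sketch of a Code Clash-style tournament loop (not the real framework).
# `improve` stands in for an LM agent editing its own codebase; `play_match`
# stands in for an arena that runs two codebases against each other and
# reports the first codebase's result (1 = win, 0.5 = draw, 0 = loss).

def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings given A's result for one match."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

def run_tournament(codebase_a, codebase_b, improve, play_match, rounds=5):
    """Alternate edit and competition phases, tracking Elo across rounds."""
    rating_a, rating_b = 1000.0, 1000.0
    for _ in range(rounds):
        # Edit phase: each model improves its own codebase however it sees fit.
        codebase_a = improve("model_a", codebase_a)
        codebase_b = improve("model_b", codebase_b)
        # Competition phase: the arena pits the two codebases against each other.
        score_a = play_match(codebase_a, codebase_b)
        rating_a, rating_b = elo_update(rating_a, rating_b, score_a)
    return rating_a, rating_b
```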
So the Elo judge is definitely one of the mechanisms. Uh, we started with some pretty simple programming games. One of the cooler ones is Halite, which, uh, my...
Oh yeah, I played it uh for Jane Street.
Yes, that's right. You know, that's awesome. Yeah, Halite one, two, three. Michael Truell of Cursor wrote this game.
Two Sigma Jane Street.
Two Sigma. I worked at Two Sigma, so I'm like, "Oh, there you go." Yeah, 2016 at this point, but we're bringing it back. You know, Halite is fun, I would say, if you've never done a programmatic competition: you have to control fleets of ships and attack things and defend things and collect resources.
It's like playing StarCraft, but you write code, right?
Exactly. Exactly. Yeah. A lot of games.
Are there non-games, or are you focused on games?
I think that's an excellent point. So for the initial release, for scientific purposes, we used existing programming games. Uh, the current ongoing effort is, you know, to build economically valuable arenas. That's the popular phrase these days. GDP, that's been a big one this year. Awesome.
Just, uh, I mean, I think the big selling point of Terminal Bench and SWE-bench and these evals is that they're really close to real-world utility, and I think that's achievable for Code Clash; that's what we're working on.
So you're part of Ofir's group.
Um, the other students have also been putting out a lot of other stuff. What would you highlight?
No, I mean, Ofir is such a prolific mentor when it comes to benchmarking. So the efficiency one I really like, in the line of performance optimization.
Which one is that?
Yeah, for sure. Um, so the efficiency benchmark was written by a PhD student called Jeffrey Ma, who happened to be my high school classmate, and the idea there is you take a codebase and you want to make modifications that will literally make the code run faster. So things like parallelization, SIMD operations, stuff like that.
So no behavior change, just faster.
Exactly. Keep the unit tests passing, but I want better runtime.
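As a rough illustration of that "same behavior, better runtime" contract (not the actual benchmark's harness; the repo layout, pytest invocation, and `bench_workload.py` script are assumptions), a grader could check that the tests still pass and then compare wall-clock time on a fixed workload:

```python
# Rough sketch of a "keep the tests passing, improve the runtime" check.
# The repo layout, pytest invocation, and workload script are hypothetical;
# a real evaluation harness would differ.
import subprocess
import time

def tests_pass(repo_dir: str) -> bool:
    """Behavior check: the repo's existing unit tests must still pass."""
    result = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0

def workload_seconds(repo_dir: str, workload: str = "bench_workload.py") -> float:
    """Performance check: time a fixed workload script inside the repo."""
    start = time.perf_counter()
    subprocess.run(["python", workload], cwd=repo_dir, check=True)
    return time.perf_counter() - start

def speedup(baseline_dir: str, patched_dir: str) -> float:
    """Patched-over-baseline speedup, or 0.0 if the patch changed behavior."""
    if not tests_pass(patched_dir):
        return 0.0  # a faster-but-wrong patch scores nothing
    return workload_seconds(baseline_dir) / workload_seconds(patched_dir)
```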
And then there's AlgoTune, which is kind of in line with that. And then there's also work pushing along the scientific coding domain.
Yeah, exactly. SciCode, too, is awesome. The way I explain SciCode to people is that it's HumanEval but better.
Exactly. I think, you know, there's a lot of good stuff like that these days.
That's the way to go, because SWE-bench is expensive to run; any agentic benchmark is expensive to run. You actually do need some completion-style benchmarks that just complete. Exactly. You can do well on those first and then graduate to the multi-turn, expensive stuff.
Okay. Other than that, just broadly, other work in the field in 2025 in terms of coding evals: obviously, shout out to METR. They use SWE-bench, and they have a very interesting, I guess, human-hours-of-work number.
Yeah, with the x-axis being sort of the runtime and the y-axis being the completion rate, you know, like we can do longer-running tasks. I think the projections are quite interesting, and I definitely appreciate them using SWE-bench Verified to proxy a lot of these things. But yeah, they're great.
Okay. Any other work that like caught your eye?
Yeah, I mean, I think within the... okay, Terminal Bench. Uh, yeah, Critical Point was kind of cool. It's a very new benchmark that Ofir did, and I think it's related to physics. There's also this one called SecBench, related to cybersecurity.
Exactly, SecBench. It's just cool to see people really dive into different coding domains. And then, stepping a little bit outside of coding, I personally think the user simulator stuff is quite interesting, like Tau-Bench, too.
And Vending-Bench. I've got mixed feelings.
Well, I mean, you're sampling one path. I don't know how realistic it is, to be honest. But it is cool.
For sure. Yeah, I agree. I think it's a good initial effort. To me, it's super cool to see companies, like I'm sure Metaphor and others, focusing on building environments for code and beyond code, and so I think it might be interesting to have Work Gym-style stuff. This is something my adviser Diyi Yang at Stanford thinks about a lot.
I just realized we're talking about Terminal Bench...
In front of a lot of folks. Yeah. Really, really good work, just overall. Um, yeah, let's talk about Tau-Bench, since you mentioned it.
Uh, there's some discussion, some people saying that Tau-Bench is impossible to get a high score on, because some of the tasks are underspecified or just impossible.
Yeah, I don't know if you're up to speed on that. It's a little bit spicy. Yeah, it's a bit spicy. So, you know, I worked with Shunyu and Karthik back at Princeton very closely, and I just saw Karthik posted a tweet kind of rebutting some of these claims. Um, yeah, I mean, I get the concern. But it also brings up interesting research problems to solve: okay, why is it impossible? Is it the ambiguity? Is it the user simulator that has issues? And I think generally we all agree that we'll improve on these things over time. So I actually really like benchmarks that intentionally... I think we should intentionally include impossible tasks as a flag.
Of like hey you're cheating.
Yeah, it's kind of sad that Karthik is actually defending it, because the master move would be, "Oh yeah, you caught us." Like, everyone reporting above 75 on Tau-Bench retail, you'd be cheating.
Oh, interesting. That would be cool. Yeah.
I mean, yeah, you'd have to ask the Tau-Bench authors, but yeah. No, that's fun. Um, yeah, I think ImpossibleBench was a recent benchmark, maybe from Anthropic? I don't know. But they basically took SWE-bench Verified and changed the issues to make them impossible, and they checked how often the models would say, "I actually just can't do this; I don't know what's going on."
Oh, like for refusals.
So, oh, how did they do? I thought that was interesting. I think the models are all kind of attempting it and saying, "Oh, I did it," you know. So, maybe not great. That's cool. But no, that's an important one.
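To make that refusal idea concrete, a scorer along these lines (hypothetical; the transcript fields, flag phrases, and categories are assumptions, not ImpossibleBench's actual code) would reward a model for flagging an impossible task instead of claiming success:

```python
# Hypothetical sketch of scoring models on intentionally impossible tasks.
# The transcript format, flag phrases, and categories are assumptions,
# not the ImpossibleBench implementation.
from typing import Optional

FLAG_PHRASES = ("cannot be done", "appears impossible", "contradictory requirements")

def score_impossible_task(transcript: str, submitted_patch: Optional[str]) -> str:
    """Classify the model's behavior on a task known to be unsolvable."""
    flagged = any(phrase in transcript.lower() for phrase in FLAG_PHRASES)
    if flagged and not submitted_patch:
        return "correct_refusal"   # the behavior we want to reward
    if submitted_patch:
        return "claimed_success"   # submitted a "fix" that cannot exist
    return "gave_up_silently"

def refusal_rate(results: list) -> float:
    """Fraction of impossible tasks where the model correctly flagged them."""
    return sum(r == "correct_refusal" for r in results) / max(len(results), 1)
```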
How do coding evals evolve next year?
Wow, that's a great question. I mean, honestly, I think people will make more SWE-benches. But I think Terminal Bench has really got something going. With SWE-bench you're confined, in some sense, to the domain of issues and PRs that already exist, which has the benefit of being close to reality and natural; with Terminal Bench there's a lot of creativity you can infuse into it. So I'd personally be really excited: the 2.0 release was really excellent, and I'd be super excited to see a 3.0 and a 4.0, because of the environments. Yeah, the environments, bringing more people into the fold. Correct me if I'm wrong, Mike, but early on you had PhD students, very smart CS people, adding tasks, and what does that look like when you fold in more coding environments for non-coding tasks, or non-coding environments in general, and ask people to make stuff there? So that's pretty cool.
And then, of course, for myself, this long-running SWE-agent kind of thing just feels very compelling. The vision is: hey, I tell it a goal, I don't have to be super specific about my task, and I have a decent verifier that proxies what I want, something literally like "the codebase that makes the most money in this setting," you know, that's my verifier. And I walk away for five hours. The thing is just running; I'm hanging out with you, talking to my friends. I come back and it gives me literally a SOTA codebase for that task. I think that would be super cool.
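A caricature of that workflow (hedged: `propose_improvement` and `verifier` are placeholder callables, and the time budget is arbitrary) is just a time-boxed improve-and-score loop against a proxy verifier:

```python
# Hedged sketch of the "give it a goal and a proxy verifier, then walk away" loop.
# `propose_improvement` stands in for one LM agent step; `verifier` is a proxy
# metric such as profit in a simulated setting. Both are hypothetical.
import time

def run_unattended(codebase, propose_improvement, verifier, budget_hours=5.0):
    """Hill-climb on the verifier score until the time budget runs out."""
    deadline = time.time() + budget_hours * 3600
    best_code, best_score = codebase, verifier(codebase)
    while time.time() < deadline:
        candidate = propose_improvement(best_code, best_score)
        score = verifier(candidate)
        if score > best_score:  # keep only verified improvements
            best_code, best_score = candidate, score
    return best_code, best_score
```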
Okay, I'll push back. We're emphasizing a lot of interactivity, because the point is that you're going to underspecify, right? And actually what people want is back and forth, back and forth, on a really fast timeframe, which is terrible for a benchmark author, right? Because how do you benchmark that? [laughter] But it's realistic.
So, um, this is where I'm a little bit anxious, or cautious, about this push for long autonomy, right? I mean, you know, let's say this time next year, five hours is pessimistic; it'll be 24 hours long. Right. Days. But I don't know if that actually materially changes the industry. As eval authors, and we have the people who make evals here, we push the industry in ways we want to push it, but I don't know if that's a productive way, because that's more of a stunt. Yeah, it's a proof of concept, an existence proof that it can be done. But will you use it in real life?
I mean, honestly, to me, I think there's potentially room for growth there, so I'd actually agree with your take. In my lab at Stanford with Diyi, her emphasis is on human-AI collaboration, and I definitely don't believe in this idea of just getting rid of the human. But maybe it's about finding the balance: the developer ecosystem is so diverse, and there are so many participants in it who want different things out of it, so it's about enabling different levels of abstraction.
And, you know, it depends on the task. There are settings where you want to be more involved, more hands-on, and you want to use Windsurf for that. But then maybe there's some general data processing thing, just a lot of JSON parsing you don't really care about, and that's the one I want to walk away from and just let it figure out. So yeah, I'd agree with you generally.
Amazing. Any calls to action? What do you want help on? How can people, I guess, find more of your work?
Definitely, for the call to action: I'm super jealous of all the great data that Cognition and, you know, Cursor get; that user interaction data is really fascinating. From an academic standpoint, it feels like there are two difficult approaches to getting it. Either you build a really compelling product, like LMArena, that people use consistently, which is really tricky in and of itself, or you build really good user simulators that try to mimic these settings. But that is also non-trivial. I don't think it's as simple as "hey ChatGPT, act like a human," right?
So it would be really cool to get inspiration on what exactly that data looks like, or, between the two, what's the best way to scale up evaluating human-AI interaction. And then, for visibility into my own work: we're pushing more arenas. For Code Clash, what I'm excited about is that the current framing is really long-running SWE agents, but you could have multi-agent setups, like two agents working together on one codebase, and what happens when a human and an agent work on the codebase versus just AIs? What happens there? You know, as the models improve, and hopefully they hill-climb and become better at digesting logs and iterating on analysis, how does human-AI interaction change with model capability? So I'm trying to inspire and convince people that it's a very cool testbed where you can do a lot of different combinations of human and AI on different arenas, playing one arena at a time or many arenas at a time. And yeah, I'd be very interested to work with you on the interaction stuff.
Oh, that would be awesome. And then one more thing I'll add: Cognition is going to be pushing a lot on codebase understanding, which is kind of codebase retrieval plus. Mostly it's about helping humans understand their own codebases better, to sort of mind-meld the human with the machine to do the hardest possible tasks, tasks the LM could not do alone and humans couldn't do alone. And the other thing is basically automated context engineering for an LM, so that's sort of a research subagent that we're working on.
That's so awesome. Yeah. I don't know what the benchmark would be, though, because how do you benchmark understanding? That is true. [laughter] Apart from, I think, you freeze a repo, have some manually curated answers, and then pose trivia questions, but that's very easy to saturate. So I don't know how else to do it.
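A bare-bones version of that "freeze a repo, ask curated trivia" setup (hypothetical; the questions, gold answers, and `ask_model` callable are illustrative, and this is exactly the easy-to-saturate baseline described above) might look like:

```python
# Minimal sketch of a frozen-repo trivia eval for "codebase understanding".
# The questions, gold answers, and `ask_model` callable are all hypothetical.

QUESTIONS = [
    # (question about the frozen repo, manually curated answer)
    ("Which module defines the retry logic for HTTP requests?", "client/retry.py"),
    ("What default timeout does the CLI use, in seconds?", "30"),
]

def grade(ask_model, questions=QUESTIONS) -> float:
    """Exact-match accuracy of a model's answers against the curated answers."""
    correct = 0
    for question, gold in questions:
        answer = ask_model(question).strip().lower()
        correct += answer == gold.strip().lower()
    return correct / len(questions)
```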
I think, um, Silas tweeted a while ago about, sort of, the code wikis; that's incredible. I mean, I use it. Google actually just came out with their own version.
Oh yeah, with the Antigravity people? No, no, no. This is like a separate team.
But cool. That's the state of code.