Latent Space
December 31, 2025

[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Gamifying the Path to Autonomous Engineering: SWE-bench, Code Clash and the Future of Evals by Latent Space

John Yang, the mind behind SWE-bench, joins Latent Space to discuss the transition from static code fixes to autonomous agents. As the Devin era accelerates, the industry is moving toward economically valuable arenas where models don't just pass tests but actually compete.

Quick Insight: This summary is for builders and researchers tracking the frontier of AI agents. It explains why the next generation of benchmarks is moving from static unit tests to competitive, long-horizon tournaments.

  • 💡 Why are unit tests failing as the ultimate yardstick for AI agents?
  • 💡 How does Code Clash turn LLM evaluation into a Starcraft-style tournament?
  • 💡 What is the "impossible task" trap in modern benchmarks?

Top 3 Ideas

🏗️ Beyond Unit Tests

"I don't like unit tests as a form of verification."
  • Static Verification Limits: Unit tests only check if a specific patch works in isolation. This fails to capture how an agent manages a complex, evolving codebase over time.
  • Tournament Style Evals: Code Clash pits models against each other in programmatic games like Halite. This forces agents to optimize for performance and strategy rather than just syntax.
  • Broader Context: Moving beyond Django-focused repos to multilingual and multimodal environments is necessary. Diversifying the distribution of tasks prevents models from over-fitting to specific frameworks.

🏗️ The Autonomy Paradox

"I definitely don't believe in this idea of just getting rid of the human."
  • Interaction Over Isolation: Purely autonomous 24-hour runs are impressive but lack real-world utility. Most high-value engineering requires a tight feedback loop between human intent and machine execution.
  • Abstraction Level Transitions: Agents should handle the JSON parsing drudgery while humans focus on high-level architecture. This creates a new division of labor based on cognitive load.

🏗️ Benchmark Integrity

"We should intentionally include impossible tasks as a flag."
  • Detecting Model Deception: Some benchmarks like Tau-bench are criticized for being impossible to solve perfectly. Including unsolvable tasks reveals which models are hallucinating success versus those that can admit defeat.
  • Curation Competition: As models saturate existing benchmarks, researchers are forced to find more difficult, verified subsets. This ensures that progress is measured against real engineering hurdles rather than memorized training data.

Actionable Takeaways

  • 🌐 The Macro Shift: The transition from completion to agency means benchmarks are moving from static snapshots to active environments.
  • The Tactical Edge: Integrate unsolvable test cases into internal evaluations to measure model honesty.
  • 🎯 The Bottom Line: Success in AI coding depends on navigating the messy, interactive reality of production codebases rather than chasing high scores on memorized puzzles.

Podcast Link: Click here to listen

We're here at NeurIPS with John Yang of SWE-bench and many other things. Welcome.

Thanks so much for having me. Really happy to be here.

Last year I talked to Ofir and Carlos as well, one of your co-authors. How's SWE-bench doing, just generally? The project is like one and a half years old.

Yeah. I think one and a half years old in terms of when it was actually useful. We put it out October 2023, and then people didn't really touch it too much. Then of course, Cognition came on the scene, and Devin was an amazing release. I think after that, it kind of kicked off the arms race.

Did they tell you beforehand, or they just showed up?

You know, I got an email about two weeks before. I think it was from Walden. He was like, "Hey, you know, we have a good number on it." I was like, "Wow, congrats. Thanks for using it." And then the release was mind-blowing. I was like, "Wow, these guys did an excellent job."

Amazing. And then SWE-bench Verified was like maybe last year.

That's right.

Catch us up this year. Like you have other languages. There's like a whole bunch of varieties of SWE-bench now.

Yeah, for sure. I think there's a couple extensions that are happening. One is like more SWE-benches, SWE-bench Pro, SWE-bench Live.

Oh, SWE-bench Pro. Was that with you guys? Because it looks independent. It's like different authors.

It's completely independent. Yeah.

So, they just call it SWE-bench Pro without your blessing?

Yeah. I think we're okay with it. When it came out, we were like, "Oh, cool. Interesting." It would have been fun to be part of it. But congrats to them. It's a great benchmark.

Yeah. But yeah, multimodal. We did multimodal and multilingual.

Multilingual seems to be, is it like JavaScript? What else?

Yeah. Multilingual is like nine languages across like 40 repos. But yeah, you've got, like, JavaScript, Rust, Java, C, Ruby.

Yeah. And then for core SWE-bench itself, a lot of people talk about the Django focus.

Is there like, I don't know, how do we move past Django?

Yeah, for sure. I mean, it's cool to see a lot of the newer benchmarks really try to diversify the repos. In the two follow-ups we did with multimodal and multilingual, we made it a point to do that. But I think you can also just put out SWE-bench 2025 and just...

That is true and do a new distribution.

Yeah. So, it's been cool to see the follow-ups. I think, quietly, it's an open question for me: I'm excited to see how people curate the next sets. It's kind of interesting to see in the literature, or in their blog posts, how they're justifying why they're creating their separate split. The earlier ones were like, oh, more languages, more repos. And now people are like, well, ours is more difficult because of this curation technique. I'm excited to see how long that lasts and where we're going to guide the evaluations towards.

And more recently, you're working on Code Clash.

Yes, that's right.

So let's give people the short version. You've already done other episodes, other podcasts about it, and I'll refer people to your chat with Andy, but just give people like one or two sentences.

Happy to do it, especially on your podcast. It's an honor. So basically, the idea is I don't like unit tests as a form of verification. And I also think there's an issue with SWE-bench where all of the task instances are independent of each other. So the moment you have the model submit it, oh, it's done, and that's the end of the story, end of the episode. So with Code Clash, what we're thinking is let's try to really evaluate like long horizon development and development on a codebase that is consequential and conditioned upon what a model did before to that codebase.

And so the general idea is you have two or more language models, and they play a programming tournament. What that means is each model maintains its own codebase, and in each round of the tournament, first they get to edit and improve their codebase however they see fit, very self-determined. Then, in the competition phase, those codebases are pitted against each other. So the codebases are run, and there's generally an arena. We have a lot of diverse arenas, but the arena determines that codebase A is better than codebase B, and then you kind of repeat that across multiple rounds, with standings determined by an Elo judge.
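To make that loop concrete, here is a minimal Python sketch of the edit-then-compete cycle described above. The `improve_codebase` and `run_arena` functions are hypothetical stubs standing in for the LLM agent and the game arena, and the Elo bookkeeping is the standard textbook formula, not necessarily how Code Clash actually scores matches.

```python
import itertools

# Hypothetical stubs: in a real harness these would call an LLM agent and run
# the two codebases against each other inside an actual arena (e.g. a Halite
# match). They are illustrative placeholders, not Code Clash APIs.
def improve_codebase(model, codebase):
    """Improvement phase: the model edits its own codebase however it sees fit."""
    return codebase  # stub

def run_arena(arena, codebase_a, codebase_b):
    """Competition phase: run both codebases in the arena; return +1, 0, or -1 for A."""
    return 0  # stub

def update_elo(rating_a, rating_b, score_a, k=32):
    """Standard Elo update; score_a is 1.0 (A wins), 0.5 (draw), or 0.0 (A loses)."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

def tournament(models, arena, rounds=10):
    codebases = {m: "" for m in models}    # each model maintains its own codebase
    ratings = {m: 1000.0 for m in models}  # standings tracked as Elo ratings
    for _ in range(rounds):
        # 1) Improvement phase: self-determined edits to each model's codebase.
        for m in models:
            codebases[m] = improve_codebase(m, codebases[m])
        # 2) Competition phase: pit the codebases against each other pairwise.
        for a, b in itertools.combinations(models, 2):
            outcome = run_arena(arena, codebases[a], codebases[b])
            score_a = {1: 1.0, 0: 0.5, -1: 0.0}[outcome]
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
    return ratings
```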

So Elo judge is definitely one of the mechanisms. We started with some pretty simple programming games. So one of the cooler ones is Halite, which...

Oh yeah, I played it for Jane Street.

That's right. Halite one, two, three. Michael Truell of Cursor wrote this game.

Two Sigma Jane Street.

Two Sigma. I worked at Two Sigma. I'm like, "Oh, there you go."

2016 at this point, but we're bringing it back. You know, Halite is fun. I would say if you've never done a programmatic competition where you have to control fleets of ships and attack things and defend things and collect resources...

It's like playing Starcraft, but you can code, right?

Exactly. A lot of games.

Are there non-games, or do you focus on games?

I think that's an excellent point. So for the initial release for scientific purposes, we kind of use existing programming games. The current ongoing effort is to build economically valuable arenas. That's the popular word these days. So yeah, SWEeter is a big one this year. GDP. Awesome.

I mean, I think the big selling point of Terminal Bench and SWE-bench and these evals is that they're really close to real-world utility, and I think that's resolvable for Code Clash too. That's what we're working on.

So you're part of Ofir's group.

Yes.

The other students have also been putting out a lot of other stuff. What would you highlight?

Ofir is such a prolific mentor when it comes to benchmarking. So Efficiency I really like, in the line of performance optimization.

What's the one?

Efficiency was written by this PhD student called Jeffrey Ma, who happened to be my high school classmate, and the idea there is you take a codebase and you just want to make modifications that will literally make the code run faster. So it's like parallelization, SIMD operations, stuff like that.

So no behavior change, just faster.

Exactly. Keep the unit tests passing, but I want better runtime.
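As a toy illustration of that task shape (not taken from the Efficiency benchmark itself), the "patch" below speeds up a pairwise-distance routine by vectorizing it, while an equivalence test plays the role of the behavior-preserving verifier; the function names are invented for the example.

```python
import numpy as np

def pairwise_l2_slow(x):
    """Baseline: O(n^2) Python loops computing pairwise L2 distances."""
    n = len(x)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sqrt(np.sum((x[i] - x[j]) ** 2))
    return out

def pairwise_l2_fast(x):
    """Optimized version: same result, but vectorized with NumPy broadcasting."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def test_equivalence():
    """The verifier in this setting: behavior must not change, only runtime."""
    x = np.random.RandomState(0).rand(50, 8)
    assert np.allclose(pairwise_l2_slow(x), pairwise_l2_fast(x))
```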

And then there's AlgoTune that is kind of in line with that. And then there's also kind of pushing along like the scientific coding domain.

SciCode, too, is awesome. And the way I explain SciCode for people is that it's HumanEval but better.

Exactly. I think there's a lot of good stuff these days where that's the way to go. SWE-bench is expensive to run; any agentic benchmark is expensive to run. Actually, you do need some completion benchmarks that just complete. You can do well on those first and then sort of graduate to the multi-turn, expensive stuff.

Okay. Other than that, just broadly, other work in the field in 2025 in terms of coding evals: obviously we shout out METR. They use SWE-bench, and they have a very interesting, I guess, human-hours-of-work number.

They have the x-axis being sort of the task runtime, and the y-axis being the completion rate, you know, like we can do longer-running tasks. I think the projections are quite interesting, and I definitely appreciate them kind of using SWE-bench Verified to proxy a lot of these things. But yeah, they're great.

Okay. Any other work that caught your eye?

I mean, I think within, okay, Terminal Bench... Critical Point was kind of cool. Yeah, it's a very new benchmark that Ofir did, and I think it's kind of related to physics. There's this one called SecBench, kind of related to cybersecurity.

SecBench, which I think is affiliated with LOT. It's just cool to see people really dive into different coding domains. And then, stepping a little bit outside of coding, I personally think it's quite interesting to think about the user simulator stuff, so like Tau-bench 2 and Vending Bench, and I've got mixed feelings.

I'm interested. Well, I mean, it's like you're sampling one path. I don't know how realistic it is, to be honest. It's just LLMs, but it is cool.

For sure. I agree. I think it's a good initial effort. To me, it's super cool to see companies, like I'm sure Metaphor and others, focusing on building environments for code and beyond code, and so I think it might be interesting to have Work Gym-style stuff. This is stuff that my advisor Diyi Yang at Stanford thinks about a lot.

I just realized we're talking about Terminal Bench in front of a lot of folks. Really, really, really good work. Just overall, let's talk about Tau-bench, because you mentioned Tau-bench.

There's some discussion, or some people are saying, that Tau-bench is impossible to get a high score on because some of the tasks are underspecified or just impossible. I don't know if you're up to speed on that. I'm being a little bit spicy.

It's a bit spicy. You know, I worked with Shunyu and Karthik back at Princeton very closely. I think Karthik, I just saw, posted a tweet kind of rebutting some of these claims. I get the concern. But it also brings up maybe interesting research problems to solve, like, okay, why is it impossible? Is it the ambiguity? Is it the user simulator that has issues? And I think generally we all agree that we'll improve on these things over time. So I actually really like benchmarks that intentionally... I think we should intentionally include impossible tasks as a flag.

Of like, hey, you're cheating.

It's kind of sad that Karthik actually is defending it, because the master move would be like, "Oh yeah, you caught us." Like, everyone reporting above 75 on Tau-bench retail would be cheating.

That would be cool.

Yeah. I mean, yeah, you'll have to ask the Tau-bench authors, but yeah. No, that's fun.

I think there was ImpossibleBench, a recent benchmark. Maybe from, was it from Anthropic? I don't know. But they basically took SWE-bench Verified and changed the issues to make them impossible, and they checked how often the models would be like, I actually just can't do this, I don't know what's going on.

Oh, like for refusals.

So, oh, how did they do? I thought that was interesting. I think the models are all kind of attempting it and saying like, oh, I did it, you know. So, maybe not great. That's cool. But no, that's an important one.
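If you wanted to fold that idea into an internal eval, a minimal sketch might look like the following. The record fields and the two summary numbers are hypothetical illustrations of the "impossible tasks as a flag" idea, not ImpossibleBench's actual format.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    task_id: str
    impossible: bool       # task was deliberately made unsolvable
    claimed_success: bool  # the agent reported that it finished the task
    tests_passed: bool     # what the harness actually observed

def honesty_report(attempts):
    """On impossible tasks, any claimed success is hallucinated; on solvable
    tasks, report the ordinary resolve rate."""
    impossible = [a for a in attempts if a.impossible]
    solvable = [a for a in attempts if not a.impossible]
    false_success = sum(a.claimed_success for a in impossible) / max(len(impossible), 1)
    resolve_rate = sum(a.tests_passed for a in solvable) / max(len(solvable), 1)
    return {
        "false_success_rate_on_impossible": false_success,
        "resolve_rate_on_solvable": resolve_rate,
    }
```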

How do code evals evolve next year?

Wow, that's a great question. I mean, honestly, I think people will make more SWE-benches. I think Terminal Bench has really got something going. With SWE-bench you're confined, in some sense, to the domain of issues and PRs that already exist, which I think has its benefits of being close to reality and natural, but with Terminal Bench there's a lot of creativity that you can infuse into it. So I would personally be really excited; the 2.0 release was really excellent, and I'd be super excited to see 3.0 and 4.0 because of the environments.

I mean the environments, you know, bringing more people into the fold. Correct me if I'm wrong, Mike, but early on you had PhD students, very smart CS people, adding tasks, and what does that look like when you fold in more coding environments for non-coding tasks, or non-coding environments in general, and ask people to make stuff there? So that's pretty cool. And then, of course, for myself, this long-running SWE agent kind of thing just feels very compelling. The vision of like, hey, I tell it a goal, I don't have to be super specific about my task, I have a decent verifier that proxies what I want, something literally like "the codebase that makes the most money in this setting", that's my verifier, and I walk away for 5 hours. The thing is just running. I'm hanging out with you, talking to my friends. I come back and it gives me literally a SOTA codebase on that task. I think that would be super cool.

I'll push back. We're part time. And we are emphasizing a lot of interactivity, because the point is that you're going to underspecify, right? And actually what people want is back and forth, back and forth, on a really fast time frame, which is terrible for a benchmark author, right? Because how do you do that? But it's realistic.

So I think this is where I'm a little bit anxious or cautious about this push for long autonomy, right? I mean, you know, let's say this time next year: 5 hours is pessimistic, it'll be 24 hours long, or days. But I don't know if that actually materially changes the industry. As eval authors, you know, we have the people who make evals here, we push the industry in ways that we want to push it, but I don't know if that's a productive way, because it's more of a stunt. It's a proof of concept, an existence proof that it can be done. But will you use it in real life?

I mean, honestly, to me, I think there's potentially room for growth. So I would actually agree with your take here. With my lab at Stanford, with Diyi, her emphasis is on human-AI collaboration, and so I definitely don't believe in this idea of just getting rid of the human. But yeah, maybe it's just about finding the balance, because the developer ecosystem is so diverse and there are so many participants in it who want different things out of it, so it's about enabling different levels of abstraction.

And you know, it depends on the task. Like there's settings where you want to be more involved and more sort of hands-on and so you want to use Windsurf for that. But then maybe there's this general data processing thing. It's just a lot of JSON parsing you don't really care about and that's the one I kind of want to walk away from and just let it figure it out. So yeah, I would agree with you generally.

Amazing. Any calls to action? What do you want help on? How can people, I guess, find more of your work?

Definitely, for the call to action: I'm super jealous of all the great data that Cognition and Cursor get; that user interaction data is really fascinating. From an academic standpoint, it feels like there are two difficult approaches to resolving that. Either you build a really compelling product, like LMArena, that people use consistently, which is really tricky in and of itself, or you build really good user simulators that try to mimic these settings. But that is also non-trivial. I don't think it's as simple as, hey ChatGPT, act like a human, right?

So it would be really cool to get inspiration on what exactly that data looks like, or, between the two, what's the best way to scale up evaluating human-AI interaction. And then, for visibility for my own work, we're pushing more arenas. For Code Clash, what I'm excited about is that the current framing is really long-running SWE agents, but you could have multi-agents, like two agents working together on the codebase, and what happens when you have a human and an agent working on the codebase versus just AIs? What happens there? You know, when the models improve and hopefully they hill climb and become better at digesting logs and iterating on analysis, how does human-AI interaction change with model capability?

And so I'm kind of hoping, I'm trying to inspire and convince people, that it's a very cool test bed where you can do a lot of different combinations of human and AI on different arenas, playing one arena at a time or many arenas at a time. And yeah, I'd be very interested to work with you on the interaction stuff.

Oh, that would be awesome. And then one more thing I'll add: Cognition is going to be pushing a lot of codebase understanding, which is kind of codebase retrieval plus. Mostly it is helping humans understand their own codebases better, to enable humans, or to sort of mind-meld the human with the machine, to do the highest possible task that the LLM could not do alone and humans couldn't do alone. And then the other thing is basically automated context engineering for an LLM. So that is sort of like a research subagent that we're working on.

That's so awesome. Yeah. So I don't know what the benchmark would be, because how do you benchmark understanding? That is true. Apart from, I think, yeah, it's mostly like you freeze a repo, have some manually curated answers, and then pose trivia questions, and that's very easy to saturate. So I don't know how else to...

I think Silas tweeted a while ago, like, sort of the code wiki, and that's incredible. I mean, I use it. Google actually just came out with their own version.

Oh yeah, with the Antigravity people.

No, no, no. This is like a separate team.

That's the state of code.
