AI Engineer
December 24, 2025

METR's Benchmarks vs Economics: The AI capability measurement gap – Joel Becker, METR

The Productivity Paradox: Why Benchmarks Lie About AI

by Joel Becker

Date: October 2023

This summary is for builders and investors who equate rising benchmarks with immediate economic utility. Joel Becker reveals that while AI capability is scaling exponentially on benchmarks, giving frontier AI tools to some of the world's best open-source developers currently makes them 19% slower.

This episode answers:

  • Why did expert developers perform worse when given access to frontier models?
  • Is the "Time Horizon" of AI models actually doubling every six months?
  • How does the "Verification Tax" kill the productivity gains of fast code generation?

Joel Becker from METR (Model Evaluation and Threat Research) is the researcher measuring the distance between AI potential and human reality. He presents a jarring reality check: while model "Time Horizons" scale exponentially, the smartest humans in the room find that AI often gets in the way of complex work. The tension lies in the gap between clean benchmark success and messy real-world implementation.

The Benchmark Mirage

"Progress is rapid... but benchmarks can be low ceiling."
  • Exponential Time Horizons: METR measures model capability by the length, in human time, of tasks models can complete. This metric shows a remarkably steady doubling every six to seven months.
  • The Context Gap: Benchmarks usually test against low-context humans on their first day. Real work requires years of repository knowledge that models currently lack.
  • Clean Room Bias: Most benchmarks use contained environments. Real world code is a messy web of human coordination and legacy systems.

The 19% Slowdown

"The punch line is that we find that developers are slowed down by 19%."
  • Expert Friction: Top contributors to massive projects like scikit-learn were slower when using AI. These experts are limited by typing speed rather than basic problem solving.
  • The Verification Tax: Using AI requires constant auditing. If a model is only 80% reliable, the time spent correcting errors exceeds the time saved by generation.

The Reliability Threshold

  • The 99% Requirement: Productivity gains only kick in when model accuracy approaches near-perfect levels. Anything less forces a high-stakes proofreading cycle that kills flow (a rough cost model is sketched after this list).
  • Task Interdependence: Solving part of a problem with AI breaks the mental model needed for the next step. Humans must maintain the full context to ship reliably.
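
A minimal sketch of the arithmetic behind the verification tax is below. The cost model and every number in it are hypothetical, chosen only to illustrate why gains appear near-perfect reliability; none of it is data from METR's study.

```python
# Illustrative expected-time model of the "verification tax".
# Every number here is hypothetical; none of this is data from METR's study.

def expected_minutes_with_ai(prompt, review, fix, reliability):
    """Expected minutes per task when delegating generation to an AI:
    you always pay the prompting and review cost, and you pay the fix
    cost whenever the output is wrong (probability 1 - reliability)."""
    return prompt + review + (1 - reliability) * fix

MANUAL = 10.0                          # hypothetical: expert does it by hand in 10 min
PROMPT, REVIEW, FIX = 3.0, 4.5, 15.0   # hypothetical per-task costs, in minutes

# Break-even reliability: where delegating matches doing the task by hand.
break_even = 1 - (MANUAL - PROMPT - REVIEW) / FIX
print(f"break-even reliability: {break_even:.0%}")

for r in (0.50, 0.80, 0.95, 0.99):
    t = expected_minutes_with_ai(PROMPT, REVIEW, FIX, r)
    print(f"reliability {r:.0%}: {t:4.1f} min ({(t - MANUAL) / MANUAL:+.0%} vs. manual)")
```

With these made-up costs, delegation only pays off above roughly 83% reliability, and most of the gain arrives in the last few percentage points, which is the shape of the argument in the bullets above.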

Actionable Takeaways

  • The Macro Shift: The Capability-Productivity Gap. We are entering a period where model intelligence outpaces our ability to integrate it into high stakes production.
  • The Tactical Edge: Audit your stack. Identify tasks where "good enough" generation is a win versus high context tasks where AI is currently a net negative.
  • The Bottom Line: Do not mistake a climbing benchmark for a finished product. For the next year, the biggest wins are not in smarter models but in better verification loops.

Podcast Link: Click here to listen

Hey guys, thank you so much for having me. My name is Joel Becker. I work as a researcher, or member of technical staff, at METR, which stands for Model Evaluation and Threat Research. As we'll see in a second, I'm going to be talking about AI capabilities: how do we know how performant AIs are today, and how performant they might be in the near future, from two different sources of evidence that seem to give somewhat conflicting answers?

I could have done this whole talk without reference to METR papers in particular, but we'll look at two papers I've been involved with as examples of benchmark-style evidence and then more economic-style evidence. On the benchmark side, measuring AI ability to complete long tasks: this is the paper that comes with the charts that many of you will have seen on Twitter and so on, that METR is well known for. And then the second is the RCT measuring how allowing AI affects developer productivity. And then we'll talk about how to reconcile the gap that's implied between these two different kinds of measurements.

As I mentioned, METR stands for model evaluation and threat research. We are an independent research nonprofit that seeks to inform the public, policy makers, labs about the degree to which AIs might pose catastrophic risks to society. The model evaluation part means that we seek to understand AI capabilities and propensities. And the threat research part means we try to connect those capabilities and propensities to potential catastrophic risks.

Okay. The first paper we're going to talk about is associated with this chart that many of you might have seen. Taking a step back before we dive into the paper: how do we usually think about measuring AI capabilities? Using benchmarks, like SWE-bench or GPQA and so on. There's some notion of 0% performance, or random performance. For GPQA that's 25%, which corresponds to the floor, the worst you can possibly do. Perhaps there's a human baseline that's below 100%; for GPQA I think this is something like 75%, which represents expert human performance. And then of course you can go all the way up to 100% on these kinds of benchmarks.

But what does it mean? If I'm getting 50% on GPQA, if I'm halfway from the floor to the expert baseline, what does that really mean about how performant the AIs are? If I meet the human baseline, does that mean that the AIs are now as performant, or even more performant, than expert humans in a relevant sense that I care about? It's hard to interpret. Another thing you see from this graph is that benchmarks seem to have less and less time between coming online, giving any signal at all, and being fully saturated. It's harder and harder to create benchmarks that have plenty of signal, that might be informative to us about how capable models are, for an extended period of time.
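
As a toy illustration of why a raw score is hard to interpret on its own, one can rescale it between the random-guess floor and the expert human baseline. The helper below is purely illustrative and uses the approximate GPQA figures quoted above (25% floor, roughly 75% expert baseline).

```python
def normalized_score(raw, floor, human_baseline):
    """Rescale a raw benchmark score so that 0.0 means random guessing and
    1.0 means the expert human baseline (values above 1.0 exceed it)."""
    return (raw - floor) / (human_baseline - floor)

# Approximate GPQA reference points from the talk: four-option multiple
# choice gives a 25% random floor, and expert humans score roughly 75%.
for raw in (0.25, 0.50, 0.75, 0.85):
    frac = normalized_score(raw, floor=0.25, human_baseline=0.75)
    print(f"raw score {raw:.0%} -> {frac:.2f} of the floor-to-expert range")
```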

So, we're going to go about this a different way. First, we're going to gather human baseline data for diverse tasks spanning a range of difficulties. You should think of these humans as experienced experts, but on their first day or first week on the job. These are not people with context on these tasks in particular; it's not exactly the kind of thing that's come up in their work before, but if it's a software engineering task, they are relevantly skilled general software engineers. Same for the machine learning tasks and the cybersecurity tasks that we'll talk about.

The tasks come from three buckets, or task distributions. HCAST is a collection of software-based tasks that seem to require autonomy: interacting with tools, interacting with environments, thinking through the problem, not just a Q&A-style dataset. The SWAA suite contains atomic problems, problems that maybe GPT-2 can do, maybe it can't, problems like: here are four files, one of them is called passwords.txt, which file contains the passwords? And then on the other end of difficulty we have RE-Bench, which consists of challenging, novel, open-ended machine learning research engineering challenges that are very difficult even for top human experts.

In addition to gathering the human baseline data, we also, under conditions as close to identical as possible, measure AI performance on the same set of tasks for the AIs we're interested in, and then we convert the time it takes humans to complete these tasks into an estimate of AI autonomous capabilities, as I'll show you in a second.

Here's an illustrative diagram, in this case for Claude 3.7 Sonnet, which was the frontier model at the time this paper came out. You can see that for the very short tasks, something like 4 minutes or below, Sonnet is getting the answers correct essentially 100% of the time, or maybe even here literally 100% of the time. For the very hardest tasks it's struggling, and then there's some range where we're somewhere in the middle, between 10 and 90%. I'll say that this empirical pattern, where models are less performant at tasks that take humans longer, is not a fact of nature, but it's something we see pretty commonly and pretty robustly across models, at least on this task distribution, and I'd conjecture for other task distributions as well.

So we try and fit this dark purple line to something like this data on how long it took humans to complete the relevant tasks that the models are attempting. And then we call the point on the x-axis, this human time-to-complete axis, at which we predict the model will succeed 50% of the time, the time horizon of that model. There's much to debate in the 50% number; I can talk later about the reasons why we chose it. And then we do the same exercise for the other models.
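
Here is a minimal sketch, on invented data, of the kind of fit the time-horizon metric implies: model success probability as a logistic function of log human time-to-complete, then read off where the fitted curve crosses 50%. The task outcomes below are synthetic, and METR's actual estimation procedure differs in its details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic task data: how long each task takes a human baseliner (minutes)
# and whether the model succeeded on it. Entirely made up for illustration.
human_minutes   = np.array([1, 2, 4, 8, 8, 15, 15, 30, 30, 60, 60, 120, 240, 480])
model_succeeded = np.array([1, 1, 1, 1, 1,  1,  0,  1,  0,  1,  0,   0,   0,   0])

# Fit P(success) as a logistic function of log2(human time-to-complete).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_succeeded)

# The 50% time horizon is where the fitted curve crosses 0.5, i.e. where
# intercept + coef * log2(t) == 0.
log2_horizon = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"estimated 50% time horizon: {2 ** log2_horizon:.0f} human-minutes")
```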

So here, Claude 3 Opus has a time horizon of something like 4 minutes: that's where we're predicting it has a success probability of 50% on this task distribution. For o1-preview I'm seeing something like 15 minutes, and so on and so forth. And of course all these models come out over calendar time. So if we plot the time horizon, the x-coordinate on this set of plots, against calendar time, we find something like this. It looks like an exponential trend that's going up at some constant rate.

In fact, it doesn't just look like an exponential trend. A perfectly straight line here would indicate a perfectly exponential trend, and we see something really remarkably steady, actually much more steady than we were anticipating when we went about doing this research project, and that's continued to be the case. Many of you will have seen updates that we've made to this graph on Twitter. This goes all the way up to GPT-5.1 Codex Max, so it's extremely recent, and the predictions from this shockingly straight line have held up very well, I think.
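
To make the doubling-time claim concrete, the sketch below fits a straight line to log2(time horizon) against release date and converts the slope into a doubling time. The model names, dates, and horizon values are rough placeholders, not METR's published estimates.

```python
import numpy as np

# (release date as a decimal year, estimated 50% time horizon in minutes).
# These values are rough placeholders for illustration only.
releases = {
    "model A": (2023.2, 4.0),
    "model B": (2023.8, 8.0),
    "model C": (2024.3, 15.0),
    "model D": (2024.9, 35.0),
    "model E": (2025.4, 70.0),
}

years = np.array([year for year, _ in releases.values()])
log2_horizon = np.log2([horizon for _, horizon in releases.values()])

# A straight line in log2 space is exponential growth in raw minutes; the
# slope is doublings per year, so 12 / slope is the doubling time in months.
slope, _ = np.polyfit(years, log2_horizon, 1)
print(f"fitted doubling time: {12.0 / slope:.1f} months")
```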

Taking a quick step back, what are benchmarks, or here benchmark-like evidence, telling us? Well, one thing is that AIs can succeed at what for humans would be exceedingly difficult tasks. The tasks in RE-Bench are really far beyond my capabilities personally, and the AIs are having a good crack at them some decent percentage of the time. And the second, which is kind of obvious, is that progress is rapid.

On the other hand, how much stock should we put in the evidence suggested by benchmarks? What limitations might they have? Lots, but here are three that I'll note.

  • One is, as I mentioned, that these are humans who are expert in some relevant sense, but they're low context. It's something like their first week on the job. They haven't seen tasks exactly like this previously; they just have some relevant experience. Presumably people who not only have the relevant experience but are also highly familiar with the set of tasks would complete the tasks even sooner, and the AIs would look less impressive measured against those people than against our low-context baseliners.
  • The second is that benchmarks can be low ceiling. Even GPQA, to use that example again, is getting to the point where the benchmark is totally saturated, not providing additional information for marginal models, whereas time horizon provides a nice way to chain benchmarks together, in some sense, over time. But nonetheless it's still very hard to create ever harder tasks when the time horizon of models is doubling every six to seven months. So even time horizon, or the benchmarks underlying time horizon, might be saturated before too long.
  • And the next one is not a concern limited to the METR tasks, the tasks behind time horizon. It's also true for SWE-bench, and for many of your favorite agentic benchmarks, that the problems aren't very messy in some sense. They don't require a ton of coordination with humans. They're often in relatively small, contained environments where not much can go wrong: not these massive open source code bases, or other ways in which the problems can involve more interaction with the real world or be messy in some sense.

So we did this project, and then early this year we were trying to think about how we could attack some of these limitations. What's a different source of evidence that might have its own pros and cons, but importantly be more externally valid, in the scientific jargon? Perhaps field experiments are the answer: more economic-style evidence. Here we might be interested in very high-context developers who are expert in the kind of tasks they're already doing, and in speedup, or some notion of productivity boost, which seems to have more signal even through a range that's superhuman according to benchmarks.

Perhaps GPQA is fully saturated and you're getting a 1.5x or 2x speedup, something like that, but you can still achieve a 3x, 4x, 5x speedup even after that, so we maintain more signal. And the last is that the tasks are messier. They are tasks that come up in people's real work. They're not synthetic. They're not small and contained. This is a real deployment scenario.

Here's what we're going to do for this paper. We're going to gather 16 experienced developers on large, mature open source projects that we'll go through in a second. Each of these developers will on average complete about 16 tasks from their real work; these are issues on the relevant GitHub repositories, the kind of thing they might otherwise have completed, with the caveat that we're not including the very longest issues. The tasks will be randomly assigned to AI-disallowed or AI-allowed. AI-disallowed means what you think it means: software development in 2019. No AI-powered tab autocomplete, no Cursor agentic coding tools, no LLMs via the web UI. Or they can be randomly assigned to AI-allowed, in which case everything's on the table: any of the AI tools I just mentioned, or not using AI tools at all. If you're in the AI-allowed condition, you're not compelled to use AI; you just have the option. And we buy these developers Cursor Pro, so for the most part that's the tool they're using, typically with Claude 3.6 or 3.7 Sonnet on it, which was the frontier model at the time we conducted this work.

And then we're going to record the time it takes for the developers to complete each task and see the degree to which they save time when AI is allowed versus when it's not. These are some of the repositories; many of you will be familiar with them. We've got the Haskell compiler represented. We have scikit-learn. We have Hugging Face Transformers. These are on average a million-plus lines of code, and they've been around for 10-plus years. The developers who are going to be working on these repositories as part of this study are on average the third top contributor out of hundreds, or in some cases thousands, of contributors to these repositories. They personally have been contributing to the repository for something like 5 years on average. These are top experts.

Some of you might have seen this graph too, so the punch line's been spoiled; for the rest of you: we asked economics experts and machine learning experts, people at major AI companies and labs, top academics, some graduate students, and so on, how much they expect developers to save time when they're using AI. They say something like 40%, or a little bit less. We ask the developers themselves, the study participants, how much they expect to be sped up ahead of time, and they say something like 24 or 25%. Then we ask the developers, after the study has been completed, how much they think they were sped up by AI being allowed on the issues they completed as part of this study, and they say it sped them up by something like 20%.

And the punch line is that we find that developers are slowed down by 19%. They take 19% more time when AI is allowed relative to when AI is not allowed. When I first saw the data coming in, saw early versions of this plot, I thought presumably the same thing that many of you might be thinking right now: that we've messed something up, that something's gone wrong, that there's some issue in how we've set up the experiment. How could it possibly be the case? At the very least these developers have access to the zero point, because they can simply choose not to use AI at any time.
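
As an illustration of how a slowdown figure like this can be read off task-level data, here is a hedged sketch: regress log completion time on a randomized AI-allowed indicator and exponentiate the coefficient to get the multiplicative effect on time. The data is fabricated, and this is not the exact specification used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated task-level data: completion times in hours, with AI-allowed
# tasks taking roughly 19% longer on average. For illustration only.
n = 200
ai_allowed = rng.integers(0, 2, size=n)             # randomized treatment flag
base_hours = rng.lognormal(mean=1.0, sigma=0.6, size=n)
hours = base_hours * np.where(ai_allowed == 1, 1.19, 1.00)

# OLS of log(hours) on a constant and the treatment indicator; the
# exponentiated coefficient is the multiplicative effect on completion time.
X = np.column_stack([np.ones(n), ai_allowed])
beta, *_ = np.linalg.lstsq(X, np.log(hours), rcond=None)

effect = np.exp(beta[1]) - 1.0
print(f"estimated effect of allowing AI on completion time: {effect:+.1%}")
```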

So we pored over many, many hours of screen recordings from these developers working on issues as part of the study. We looked to dig into a bunch of hypotheses that might explain what's going on and tried to categorize which things we think are contributing versus not. Much of this is listed in the paper; I'll just quickly go through some of the things that we think are contributing.

  • First, overoptimism about AI usefulness. That seems like an obvious one. Even after the study is completed, the developers think that AI is going to be helpful to their work, so it makes sense that they might overuse AI on that basis.
  • Two, implicit repository context and high developer familiarity. These developers are coming to these problems already knowing the solution; they're so expert in this work that I imagine them not spending a bunch of time thinking through the solution the AI could work through. Instead, they're just limited by how fast they can type, which means that using AI, instructing AIs to do the work, comes with a significant time cost versus how they might otherwise have spent their time.
  • Third, I think many of us have the sense that AIs might be less performant on large and complex repositories, which is a difference from this benchmark-style evidence, or from some previous work.
  • And then low AI reliability. Maybe the AIs are performant on these kinds of tasks, but only 50% of the time, or 80%, or 20% of the time. So at the very least you need to check their work afterwards, and perhaps you even need to spend time correcting their work, which is something we see quite a lot on these issues.

One thing from the factors with an unclear effect that I'll mention briefly, and am happy to talk to people about later, is below-average use of AI tools, which came up in the public discussion. This is in the unclear column because there's evidence both for and against, and that's true for many of the things here. We don't have anything conclusive to say; we're still working on this line of work.

Here are some caveats, all important. First, obviously we do not provide evidence about all software developers or tasks. These are extremely experienced developers working on extremely complex, long-lived open source repositories. In my own work, I'm not as expert in the relevant sense as these people are, and I'm working on much smaller repositories; I feel more comfortable saying that even at this time I was sped up by AI tools, even if the developers weren't. This setting is weird. It's weird for the same reasons that it's interesting: this unusual developer population. Second, the experiment is concentrated in March 2025. As I mentioned, we know that AI progress is rapid. Perhaps this result will already have changed by the time I'm giving you this talk.

So there's a kind of puzzle suggested, right? The benchmark-style evidence gives a very impressive sense of what AI capabilities look like today, whereas the more economic-style evidence, and I include labor market impacts work here too, in addition to our field experiments, looks somewhat more bearish or unimpressive. Why is the former not translating to the latter? At least naively there seems to be a clash. How might we go about resolving this puzzle?

So one possibility is that in fact we messed something up. This is still live and on the table. Maybe the developers really are not very capable at using AI, and if we continue to run this experiment, as in fact we are, they'll become more familiar with the tools and so get productivity benefits that they weren't getting at the time. I'm a little skeptical of that story, but that's one possibility. Another, which economists like to bring up, is that we're not incentivizing these developers to finish quickly: we're paying them per hour, which we do for external validity reasons. Looking through their videos, I really do not think they're developing differently in accordance with these incentives, but that certainly is one possibility that's on the table. Another possibility, more statistical in nature, is that this is a small study, and you shouldn't over-update so much from small studies. We are doing bigger things that I'm excited to release at some point.

Okay, but let's assume we haven't messed something up, and that this is a result we think does hold up. How could we resolve the puzzle?

So, one possibility, as I alluded to briefly, is that reliability needs to be very high to save time. You need to be getting the answers to the problems developers are putting in correct something like 95 or 99% of the time in order for developers to tab-tab-tab through and not spend lots of time verifying the AI's work, which of course is pretty costly from a time perspective. Another possibility is SWE-bench-like, or algorithmic, scoring, which is costless at the margin, versus mergeability-like scoring. SWE-bench scores are not trying to account for whether the code is maintainable by other people in future, or whether it matches quality considerations that aren't captured by the unit tests. Perhaps AIs really are performant according to SWE-bench-like scoring but not performant according to the kind of more holistic scoring that we might care about. Low- versus high-context baseliners: as I mentioned previously, these developers are just much more skilled humans, and relative to those humans perhaps the AIs are less capable. Task distribution: maybe these are just different kinds of tasks, in particular messier than the benchmark-style tasks. Maybe that's explaining what's going on here.

Suboptimal capability elicitation. A huge amount of work has gone in at METR to making the agents as performant as possible, given the underlying models, on our kinds of tasks, and that involves churning through a load of AI tokens. Perhaps that was less the case for Cursor in particular at the time we completed the study. And then interdependence across tasks. Maybe if humans can complete task A and task B, while AIs can only complete task A, and of course can do task A faster, it still makes sense for humans to do both task A and task B rather than delegate task A, because they need to know the outputs, they need to know how task A was completed, in order to reliably complete task B. I think that's part of what's going on: you need to maintain context as you're working through these subtasks.
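
Here is a toy bit of arithmetic, with entirely invented numbers, for why delegating only the first of two interdependent tasks can cost time overall:

```python
# Toy arithmetic for the interdependence point; all numbers are invented.
# Task A feeds into task B, and doing B well requires understanding how A
# was solved.

HUMAN_A, HUMAN_B = 30, 40   # minutes for the expert to do each task by hand
AI_A = 5                    # the AI finishes task A much faster
RELEARN_A = 35              # minutes for the human to reconstruct the context
                            # of A (reading and verifying the AI's work) before B

do_both_by_hand = HUMAN_A + HUMAN_B
delegate_a_only = AI_A + RELEARN_A + HUMAN_B

print(f"human does A then B:                 {do_both_by_hand} min")
print(f"AI does A, human relearns A, does B: {delegate_a_only} min")
```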

Lastly, I will say that we are hiring, not just for this kind of work that you've seen being extended, ever longer tasks, ever more ambitious RCTs, even more sources of evidence from which we can triangulate the truth about AI capabilities, but also for much more besides. You can find this at metr.org/careers. In particular, I'm excited about research engineers and research scientists who might be hiding in the current audience. We're excited not just for research types with academic experience, but very much for scrappy startup people as well. And we're also hiring for a director of operations. And with that, thank you very much for listening.
