Latent Space
December 31, 2025

[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

The Post-Training Pivot: Why Token Efficiency is the New Scaling Law

by Latent Space


Quick Insight: This summary breaks down how OpenAI is moving from raw compute scaling to behavioral optimization and token efficiency. It is essential for builders deciding whether to invest in complex RAG or wait for model-native context improvements.

  • 💡 Why is token efficiency becoming a more critical metric than raw benchmark scores?
  • 💡 How does the "Anton vs. Clippy" divide dictate the future of developer-facing AI?
  • 💡 What specific skill set is currently the biggest bottleneck at the AI frontier?

Josh McGrath, a post-training researcher at OpenAI, explains the transition from GPT-4.1 to 5.1. The focus has moved from pre-training data curation to the complex infrastructure of reinforcement learning.

The Efficiency Arbitrage "[Do I want to make compute efficiency wins of like 3% or do I want to change the behavior by 40%?]"
  • Behavioral Gains: Post-training allows for massive behavioral improvements with relatively small compute budgets. This makes the final stage of model training the most fertile ground for product differentiation.
  • Token Economy: GPT-5.1 prioritizes doing more with fewer tokens. Reducing the intelligence tax is the primary path to making autonomous agents commercially viable.
  • Co-design Culture: OpenAI researchers bridge the gap between systems engineering and machine learning. Hardware constraints dictate the limits of algorithmic experimentation.
Verifiable Truth "[When you find the answer to a math problem, it's a lot less debatable than what the human preferred.]"
  • Objective Rewards: Moving from RLHF to RLVR provides a cleaner signal for model optimization. Using math and code as ground truth eliminates the variance found in human preference data.
  • The Anton Preference: Developers are opting for "Anton" models that perform tasks without unnecessary cheerfulness. Efficiency in interaction is becoming as important as efficiency in compute.
The Talent Bottleneck "[We're still having trouble producing people that want to do both systems work and ML work.]"
  • Dual Competency: The biggest constraint in AI is the lack of researchers who understand both distributed systems and statistics. Solving the data engineering problems of RL requires a full-stack skill set.
  • Contextual Transformation: Future models will focus on graph walks rather than simple retrieval. Models must learn to perform complex logic across the entire context window to be truly useful.

Actionable Takeaways

  • 🌐 The Macro Pivot: Intelligence is moving from a scarce resource to a commodity where the primary differentiator is the cost per task rather than raw model size.
  • ⚡ The Tactical Edge: Prioritize building on models that demonstrate high token efficiency to ensure your agentic workflows remain profitable as complexity grows.
  • 🎯 The Bottom Line: The next year will be defined by the systems vs. models tension. Success belongs to those who can engineer the environment as effectively as the algorithm.

Podcast Link: Click here to listen

We're here with Josh from OpenAI. Welcome. How would you introduce yourself?

Yeah, I work on a bunch of the thinking models at OpenAI. Recently I've been focused on doing search-related stuff. I'm just a post-training researcher at OpenAI.

You were on with us for GPT-4.1, when we were talking with Michelle, who's on maternity leave. Now we're on 5.1. It's been a whole generation. 4.1 was a non-thinking model, and then since then we switched into...

No, we're still releasing non-thinking models, but that one was the one we did that was API-specific and non-thinking. The focus has shifted a little.

How'd you get into post training?

Before OpenAI, I was doing pre-training data curation, and from the news and from reading papers, it seemed like, not that pre-training is dead, but that there was going to be so much interesting stuff in post-training. At that point I was like, I really want to make some contributions there. It's not even necessarily that pre-training was dead, but it was definitely changing. Do I want to make compute efficiency wins of like 3%, or do I want to change the behavior by 40%? It just seemed more exciting to go to post-training.

Many late nights later, that's definitely true. It's a different kind of data and engineering discipline too. It's very strange, the kind of work that you need, especially in RL, scaling it.

Definitely. The number of moving parts in an RL run is just a lot higher. If you think about pre-training, you're moving tokens to many machines, you're getting basically a scalar back from them, and then you're backpropagating. The issue with RL is you're doing tasks, and each task could have a different grading setup, and each one of those different grading setups is more infrastructure. When I'm staying up late trying to figure out what's going on with a run, it could be in way more things than in a pre-training run, generally.
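To make the "each task is its own grading setup" point concrete, here's a minimal sketch of what a per-task grader interface might look like in an RL harness. Everything here (the names, the "Answer:" convention) is a hypothetical illustration, not OpenAI's internal infrastructure.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    response: str

class Grader(ABC):
    """One grading setup per task family; each one is its own infrastructure."""
    @abstractmethod
    def grade(self, rollout: Rollout) -> float:
        """Return a scalar reward for a single rollout."""

class MathGrader(Grader):
    """Verifiable reward: compare the model's final answer to ground truth."""
    def __init__(self, answer: str):
        self.answer = answer

    def grade(self, rollout: Rollout) -> float:
        # Assumed convention: the model ends its response with "Answer: <value>".
        final = rollout.response.rsplit("Answer:", 1)[-1].strip()
        return 1.0 if final == self.answer else 0.0

class UnitTestGrader(Grader):
    """Code task: reward is the fraction of unit tests that pass."""
    def __init__(self, tests):
        self.tests = tests  # callables that take the response source string

    def grade(self, rollout: Rollout) -> float:
        passed = sum(1 for t in self.tests if t(rollout.response))
        return passed / max(len(self.tests), 1)
```

Multiply this by every task family in a run (browsing, math, code, agents), and the surface area for things to break at 12:30 a.m. gets large fast.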

Does it matter if you own the code of the task, or if it comes from an outsourced third party? My sense of it, and the external sense of it, obviously I don't see it up close, is that you work a lot with external partners, and I'm sure also some internal stuff. But which is better?

I don't think I'll comment too much on how many external partners there are; there are some, and there's some internal work too. But the technical tradeoff is really, "well, I don't own this code."

When it comes to "I don't own this code," actually, when I'm babysitting a run or something, it doesn't really matter whether it's internal or external. What matters is: do I understand the system underneath? You end up having to jump into a lot of code where you think, "I actually don't know what this does," because I work on my pieces of a run and there are also other people working on it. Do I understand what their code is doing? So that at 12:30 in the morning, when something looks wrong and I'm looking at this code, can I get context fast enough to understand it?

Throw Codex at it.

I use Codex so much. It's really changed how I work. I feel like there's a degree to which I sometimes feel trapped by Codex, because if I spend 30 or 40 minutes writing something that looks like a design doc, Codex can do, in 15 minutes, more work than I could do in a few hours. But then what do I do during those 15 minutes after? It's actually really changed how the flow of my day goes, because I now have to manage these 40-minute sessions with 15-minute gaps where I could do something, but it's not nearly as effective as this new flow to the day. I think I'm still getting used to that, honestly.

I think it should also be interesting for codebase understanding, when you're encountering unfamiliar code.

Before we started, you briefly talked a little bit about the shopping model, which is the latest hottest thing, and obviously we're recording this right after Black Friday and Cyber Monday. Any interesting findings from releasing shopping in ChatGPT right into that period?

Well, I think the first thing is, I don't know why I would say in a meeting in, you know, August or so, "Oh hey, Black Friday is coming up, maybe we could do a release by then." In hindsight: wait, why would I say something like that? They're like, "Yes, now you own it."

I guess the most interesting thing to me is the new interruptibility, the qualitative experience of using it. The same thing happens with Codex, right? You write a prompt and you can press escape and say, "Oh, I messed something up." We actually did the same thing in the shopping model. It shows you its chain of thought, with what products it's looking at, and you can write it new messages saying, "Oh, I actually wanted this," or "I wanted USB-C on this," or whatever it is. I think that's a really interesting new interaction paradigm that we have in a couple of our different services, and I'm excited to see how people use it and whether they enjoy it.

Why did it have to be its own model and not just like a new tool?

Stay tuned. I think there's no reason we couldn't do it in the same model eventually, but if we want to try out new things, sometimes it makes sense to make a new model. This time it just made sense to ask: can we do a deep research style model, but for shopping, where it's going to look really hard all across the internet for different things? I think if you look at the original deep research and GPT-5 Thinking on high reasoning today, you'll see that eventually the models all sort of converge in their capabilities.

This is a discussion, also a little spicy, that I've kicked off in the community. Maybe 30% of the community is still using deep research; a lot of them have moved over to just using 5 Thinking as deep research. Is that the spiritual successor, or are they direct replacements? Are there things we lose from the original deep research model if we do that?

I mean, I think if you look at our published evals, they look basically on par, if not better. That's personally what I do: I use Thinking on high versus using the deep research model. But as we've learned over the past few months, sometimes people prefer the quirks of one model over another, and if people like the deep research model, more power to them.

Was there anything special in the 4o post-training? Are people really responding to personality? Is that a differentiator that people really care about, and is it part of your job to care about personality?

I mean, people definitely care quite a bit about personality. Over the past few months we've been working a lot on giving users more choice over what personality they want.

Which is the toggles.

So now we have those toggles. What was your favorite toggle?

Honestly, custom instructions. I personally want my model to be a tool, so I don't necessarily want the warmth or anything. I just want some answers, because I'm mostly using it at work.

So, I call this the Anton versus Clippy divide. Anton is from HBO's Silicon Valley. It's a machine; it just does its owner's work. It tries to be helpful, but doesn't try to be cheery. Whereas Clippy tries to be cheery, and I'm like, well, stop smiling at me, I'm having a problem.

So, it sounds like you also come down on the side of using Anton.

I think a lot of developers want Anton, right? It just quietly does its work, and when it's done, it shuts up.

Well, I think we're doing a lot of work to provide people both Antons and Clippys, and I hope they all like it.

So, just generally, I was thinking about what we can update people on in post-training. What do we know today, in 2025, that we didn't know in 2024?

I would say, at the time, a lot of people were still in this whole PPO versus DPO discussion; that was a whole era. Since then we've moved on to RLVR and, I think, a lot of agent-specific RL training. Am I missing any large chunks of the post-training debates that are going on?

I mean, not necessarily internal debates, but my personal read from looking at the different papers that are coming out: when you look at an RLVR paper or an RLHF paper, they read more like optimization papers. To me, the interesting thing that's going on is that we have this spectrum of how high quality a signal is. Really, at the end of the day, RLHF and RLVR are both policy gradient methods; what's different is just the input data. It's always interesting to me that we call RLHF non-verifiable, because we've trained this model to be good at predicting human feedback. So in some sense that is verification, but obviously it's human preference rather than truth.

But if your value of truth is "does the user like this more," there's something strange there. I think we haven't looked at that axis: okay, how clean is this signal, how much do I trust it? I totally agree that you don't necessarily trust the RLHF signal as much as "is this the solution to this polynomial," but there's a whole spectrum of how high quality the signal is, and what's going to happen when I do a lot of optimization against it. That's very different from worrying about the variance of different gradients, which is what you end up seeing in a lot of the papers currently coming out. Rather than being very data-centric, they're pretty optimization-centric, even though I think the innovation really is where the data is coming from.
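A toy sketch of the point that RLHF and RLVR share the same policy-gradient machinery with different reward sources: below, a one-parameter-per-token REINFORCE loop where swapping the reward function is the only difference between a clean verifiable signal and a noisy preference proxy. This runs as written, but it is purely illustrative and reflects no lab's training stack.

```python
import math
import random

# Toy single-step "policy": a softmax over three candidate answers to "2+2".
VOCAB = ["3", "4", "5"]
logits = {tok: 0.0 for tok in VOCAB}

def sample():
    z = sum(math.exp(v) for v in logits.values())
    probs = {t: math.exp(v) / z for t, v in logits.items()}
    tok = random.choices(list(probs), weights=list(probs.values()))[0]
    return tok, probs

def reinforce_step(reward_fn, lr=0.5):
    """One REINFORCE update; RLHF and RLVR differ only in reward_fn."""
    tok, probs = sample()
    r = reward_fn(tok)
    for t in VOCAB:
        # Gradient of log p(tok) w.r.t. each logit: 1 for the chosen
        # token minus its probability, 0 minus probability otherwise.
        grad = (1.0 if t == tok else 0.0) - probs[t]
        logits[t] += lr * r * grad

def rlvr_reward(tok):
    # Verifiable reward: a programmatic check against ground truth.
    return 1.0 if tok == "4" else 0.0

def rlhf_reward(tok):
    # RLHF-style reward: a noisy learned proxy for human preference.
    return (1.0 if tok == "4" else 0.0) + random.gauss(0, 0.5)

random.seed(0)
for _ in range(200):
    reinforce_step(rlvr_reward)  # swap in rlhf_reward to compare signals
print(max(logits, key=logits.get))  # "4": the clean signal wins out
```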

I want to go broad before I go deep. Any other post-training debates happening at NeurIPS, or round about this time? When you meet your peers at Anthropic and DeepMind, what do you talk about?

Well, with Anthropic and DeepMind, we're all saying "I'm working on stuff and things," you know. It's more talking broadly with my friends there, or we're just talking about, man, it's so hard to keep up. We're not necessarily talking too much about methods directly, because on one level it kind of doesn't matter.

I think also there's something very different about academic work, where what really matters is how narrativizable it is. That's one of the reasons you see a lot of optimization papers come out: for a lot of the data work, there's a less clear narrative around it. In my view, the data and the scaling are actually more important than any specific method, but they don't necessarily have the same narrative you get out of some of the papers you see here. So it becomes more like: given a specific vertical, how do I understand that? I wish there were actually more papers on it here, but I think it can sometimes be harder to wrap up into a clean story.

That's also something we're actually having a lot of conversations about with other folks: what's next, right? Where do you go from here, now that we have some kind of roadmap? What's also interesting for me is that the innovations exposed by the Chinese models are maybe copies of, or discussions of, what's going on in the labs. Obviously, as you mentioned, a lot of these RL optimizations present themselves as optimizations. GRPO came out in the DeepSeek Math paper, and when it came out, I read it and was like, okay, this is kind of cool, it's a little bit cheaper. But it does seem to have had a broader impact on the industry as a whole than was initially appreciated. I don't feel like we've processed that enough.

Definitely. As you said, it came out in the DeepSeek Math paper, and it's an interesting optimization method, but the more interesting thing is that they have a new reward signal that we can really, really trust. When you find the answer to a math problem, it's a lot less debatable than "is this thing the human preferred actually what we want to do?"

Like you want to be right at math.

I think in some ways that's underappreciated in what's getting published.

Let's talk about long horizon. What do people consider very long horizon? We're talking 30 hours, more than a day of autonomy. Is it just more of the same, or is there anything qualitatively different?

Okay, so first off, I tend to think more in terms of the actual number of tokens than time, because the human in the loop can take a while. And it also gives you a different measure to optimize against, right? As I was saying earlier, when I use Codex, it does something that would take me much longer; it does in ten minutes what would take me four hours. What we can actually push on there is token efficiency. And that is a huge, huge research area.

You can see, from 5 to 5.1, our overall evals bumped some, but if you look at a 2D plot of how many tokens it takes for us to get that score, it went way down.

When you had that, that was such a great chart, dude. I live by those charts.

Not necessarily that chart, but that shape of chart. That's something we think about a lot, just because it contributes so much to your experience: how long does it take to do this task?

I think the other thing is, as you're pushing that token efficiency, it changes how many tool calls the agent can make, how many different things it can do, in a reasonable number of tokens that we can actually serve.
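The "2D plot" framing is easy to operationalize: instead of reporting a single eval score, record (tokens spent, score reached) pairs and compare how many tokens each model needs to hit a target. A minimal sketch, with made-up numbers:

```python
# Hypothetical (tokens spent, score reached) curves for two models.
runs = {
    "model-A": [(2_000, 0.41), (8_000, 0.55), (32_000, 0.63)],
    "model-B": [(2_000, 0.48), (8_000, 0.62), (32_000, 0.64)],
}

def tokens_to_reach(curve, target):
    """Smallest token budget at which the run hits `target` score."""
    for tokens, score in sorted(curve):
        if score >= target:
            return tokens
    return None

for name, curve in runs.items():
    print(name, "reaches 0.60 at", tokens_to_reach(curve, 0.60), "tokens")
# model-B hits the same score with ~4x fewer tokens: the efficiency win
# shows up on this plot even when the headline numbers look similar.
```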

I personally think in terms of tokens too. I think the interesting thing, or the hard-to-understand thing from the outside, is having an explicit router in GPT-5, but then also basically having an implicit router in the thinking-effort setting. That conflates things a little bit, right? At some point you do kind of need to merge them, or else you're going to get these weird bumps where sometimes the router at the top decides something and it's wrong, and actually if you'd just handed it to GPT-5 it would have figured it out.

I think, you know, we'll figure out the correct abstractions over time. The intention is still to merge; that's what was said in the paper. I think eventually we'll have AGI and you're not going to have to worry too much about how hard to think, directly. There'll be one tool that you always go to, and it knows how long to think for and things like that. The abstractions and the way we drive these things today will change. Even look at the amount we've changed already: from having a non-thinking model, to choosing between two, to now routing on how hard you want to think. We're adding lots of knobs, and eventually it'll probably simplify.

Another super interesting knob that everyone is doing is context compaction, or memory compaction. What's going on there?

Nothing to share at the moment.

Let me share, then. It's clearly an important feature, clearly inspired by Codex usage as well, obviously. From the engineer's point of view, it feels like I used to do that as part of my harness, and now the model's doing it for me. I don't know how to think about that; I'm used to having more control, and now I have less.

I guess the specific question, the feedback I'm getting, is: is this a trend, basically a permanent fact of life from here on out?

I see. You know, I don't know. I worked on long context; that was why I was on last time, for 4.1, where I think we 10x'd the effective context window. So there will always be some dance of: if we want to push what we can do, not only should we increase the length of the context window, but we should also have strategies for keeping that context window available for as long as possible. I'm guessing both things will happen, just because we want to put as much power into the models as possible.

I think we're still in a period where we should all be expecting changes in the interfaces that the models give us, so that we can improve the models. What would be sad, from my perspective, is if we lock the interface and then discover something new about models; we might trap that improvement under an interface that needs to change, right?
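For reference, here's roughly what the harness-side compaction described above looks like when you hand-roll it: fold old turns into a summary once a context budget is exceeded. The function names and the 4-characters-per-token heuristic are assumptions for illustration, not any product's actual implementation.

```python
def summarize(messages):
    # Placeholder: in a real harness this would be an LLM call that
    # compresses the old turns into a short synopsis.
    return {"role": "system",
            "content": f"[Summary of {len(messages)} earlier messages]"}

def compact(history, max_tokens=100_000, keep_recent=10,
            count_tokens=lambda m: len(m["content"]) // 4):
    """Fold old turns into a summary once the context budget is exceeded."""
    total = sum(count_tokens(m) for m in history)
    if total <= max_tokens:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

The model doing this natively just moves the same tradeoff (what to keep, what to fold away) from the engineer's code into the model's own behavior.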

Talking about long context as well, there is some discussion about context rot, or the utilization of the context: even if you gave us a million-token context window, we probably wouldn't use all of it. What's the recommendation there? Where are things going? Are we going to have perfect context by next year, or is that an impossible dream?

I don't know. No, it's not an impossible dream. I'll give a shout-out to some of the evals that we did for 4.1, called graph walks. I love graph walks. We covered this on the podcast.

You know, I think if you look over time, all of those evals are still climbing. One of the interesting things about them is that you have to do complicated transformations across the entire context window. That's sort of the issue with those needle-in-a-haystack heat-map plots: if you only have to sample from one point in the context window, it's sort of easy. Whereas with graph walks problems, you're having to do multiple transformations across the entire context window. So keep watching those. They've been climbing, and they'll continue to climb. I would say that's definitely a temporary issue that we are climbing on over time.
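To illustrate why graph walks stress the whole window where needle-in-a-haystack doesn't, here's a sketch of how such a task could be generated: edge facts are scattered throughout the context, and answering requires chaining hops across all of it. The construction is assumed for illustration, not OpenAI's exact eval recipe.

```python
import random

def make_graphwalk_task(n_nodes=200, n_edges=600, hops=5, seed=0):
    """Generate a long-context task whose answer requires multi-hop
    reasoning over edges spread across the entire context."""
    rng = random.Random(seed)
    nodes = [f"node{i}" for i in range(n_nodes)]
    edges = [(rng.choice(nodes), rng.choice(nodes)) for _ in range(n_edges)]

    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)

    # Build a ground-truth walk so the task is checkable.
    start = rng.choice([a for a, _ in edges])
    walk, cur = [start], start
    for _ in range(hops):
        if cur not in adj:
            break
        cur = rng.choice(adj[cur])
        walk.append(cur)

    rng.shuffle(edges)  # scatter edges so no single region is sufficient
    context = "\n".join(f"{a} -> {b}" for a, b in edges)
    prompt = (f"{context}\n\nStarting at {start}, follow {len(walk) - 1} "
              f"edges and report one valid path.")
    return prompt, walk

prompt, answer = make_graphwalk_task()
print(answer)  # reference walk; grading checks each hop exists in the edges
```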

And then, is 10 million tokens realistic? Is 100 million? Is there a natural end, or is there no end and we just keep going as far as the eye can see?

I don't know. What do you think?

I feel like, okay, there are use cases that require billions, and there are use cases that require many, many billions, maybe trillions.

Out of curiosity, what would be billions of tokens?

We just had the context engineering discussion about a RAG setup over support issues for a company, and it was 100,000 documents totaling about 8 billion tokens. You can't stick that in a context window, for now.

That's fair. So I would still say I don't know, but I've been really surprised. It reminds me of when I was doing more information retrieval stuff: BM25 and these very simple n-gram indexes were just super hard to beat. Agents with grep feel really similar to me; it's just unreasonably effective.
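For context on why BM25 is such a stubborn baseline, the entire scoring function fits in a few lines. A minimal from-scratch version, using the standard BM25 formula and whitespace tokenization for simplicity:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with the classic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))  # document frequency
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = ["reset your password via settings",
        "billing cycle starts on the first",
        "password reset emails can take minutes"]
print(bm25_scores("password reset", docs))  # first and third score highest
```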

So then I will not use your 10-million-token context window even if you gave it to me.

Maybe, but what if we're using that context window in service of some larger goal that just has a lot of sub-search calls? Which is why I'm saying I just don't know, and I think that's what makes it so exciting.

I would say also the other modalities, like video, would eat up a lot, and then obviously the hard sciences have proteins and all that; there's a lot of information just encoded in physics. So I have mixed feelings about it, because I'm like, well, this will never scale, not with full attention, and we probably just need to invest in systems anyway, which means we're good with what we have.

I mean, get your graph walks up, but I don't know if we need to 10x or 100x when actually maybe we need to figure out ways to 1,000x or 1,000,000x. These are just different slopes.

I'm definitely glad that you're happy with the current context windows. I think my dream would be to push it and see what happens anyway.

But the engineer's incentive is always to say, well, the systems matter more than the models, and the researcher's incentive is to say, well, screw your systems, we'll just push the models.

I think that's one of the most beautiful things about post-training at OpenAI: it's all so co-designed. I spend a lot of time just doing our systems stuff, and I also do lots of things on the learning side, like making graph walks. I think it's a great culture to have a place where people just move seamlessly between the two.

What are you guys hiring for? Presumably you're hiring. What are you hiring for that is hard to hire? What is the skill set where it's like: we really need this, we can't find it, please everyone go skill up on this?

This is definitely my personal opinion here. I think we're still having trouble, not at OpenAI but as a whole, producing lots of people that want to do both systems work and ML work. If you're trying to push the frontier, you don't know which place is currently bottlenecking the frontier, and it changes all the time; even within one project it might change multiple times where the current bottleneck is. But I think the education system we have right now isn't really optimized for that. I personally studied math, and then I was very lucky to have some great mentors after school who taught me to be a good software engineer. But it seems like if we're going to be in this place for a while, and I think we will be, we should probably be producing more students who are great at doing both distributed systems and a lot of core engineering, as well as the statistics and other things required to be a good machine learning researcher.

If we were to throw Codex at it, and obviously we can't throw Codex at everything, that's why it's still hard: which will progress faster? Which is more solvable by LLMs?

That's a spicy question. You can't say they're both equally hard. I don't know, maybe they are. I mean, they're differently hard. Like, one is more hill-climbable than the other; which is it? Because then we can go do it.

I think one thing that's slightly simpler about some of the ML research (and ML research is also distributed systems, to be clear) is that some of the things that traditionally get called ML research are things you can treat a bit more as a black box, whereas the environment to train on, building these different systems, is actually just a complicated data engineering problem. Theoretically, I would say they're probably roughly equal, but there's some amount of effort, I feel, to making the environments for the...

But you'd say they require GPUs in themselves as well.

I guess they both would. But yeah, that would be my guess; I don't have high confidence in it.

Well, a lot of people are building these AI scientists, right, that automate research. You guys have your own benchmark with PaperBench. And that's the one area that, for example, at Cognition we've just decided not to do, because it's so hard.

Any other people on the post-training team that you want to shout out who have done interesting work this year? People who should get more attention but aren't getting credit.

Well, okay. For sure everyone on the shopping team that I was just working with: Andrew Hoyel, Manuka Castrada, John Hullman, all great people. Isa Fulford, obviously, the manager for it; she was one of the original deep research people, there were like three of them. So definitely that part of the team. But I mean, everyone is so great; I think it's hard to give out a list. It's a really fun time on post-training right now. It's exciting every day.

It feels like we're all enjoying our Diet Cokes together in the office late at night.

Oh, I did want to squeeze this in before we end. Nobody actually serious is saying that pre-training is dead; it's just a meme. There's a lot of work going on in pre-training. In fact, a lot of my researcher friends are saying too much money is going to post-training. That's also spicy. I don't know. One of the charts I hold in memory from this year is the Grok 4 chart. I don't know if you've seen it, but it basically says: we scaled pre-training to here, about this level of compute, and now we're spending the same level of compute on post-training as well. That's very controversial to me, I guess, because we're all used to post-training taking orders of magnitude less data and compute. Obviously we're scaling that up now. Do we get to a point where they're equal? I don't know, but that's a topic for conversation: how much do we invest in this versus more of the pre-training that you've done before?

So, first off, neither one of those is dead. I think it's really interesting to be living through something like this; all of my other historic or technological revolutions are things that I read about in history books, and this one's live as it's happening.

We don't know the end yet, and so there's this almost fog-of-war feeling. I'm like, oh, what did people think when we got the steam engine? They had factories, and, I don't know if you know this, but the factories used to be very linear, because you had to drive one motor across an entire room. That made it so that when electricity got developed, they just tried to do the same thing, and they were like, ah, this isn't all that useful. It took a couple of decades before they realized you could put a motor wherever is most ergonomic, and then manufacturing was transformed by electricity. I think that really gives me no confidence in saying, "oh, this thing is dead."

Our timelines are so short, but the way good ideas get experimented with, funded, and propagated is actually still on a human timeline, not an AI timeline.

And so I think things will maybe be dormant, but it'll be spiky. All of a sudden there will be something, and then we'll all feel different. What's the meme? "It's so over. We're so back." It's going to be that many times. I think having some emotional stabilizer is probably going to be good for everyone's sanity.

More sanity. Well, thank you so much for joining. Thanks for all the great post-training this year.

Thank you. And yeah, continue giving feedback. I love to hear what you think.

Awesome.
