Latent Space
December 26, 2025

⚡️ GPT-5-Codex-Max: Training Agents with Personality, Tools & Trust — Brian Fioca + Bill Chen, OpenAI

⚡️ Codex Max and the Rise of the Agentic Abstraction Layer by Latent Space


Quick Insight: OpenAI is shifting focus from raw model performance to agentic reliability by baking personality and tool-specific habits into the training process. This summary explores how the abstraction layer is moving upward, turning coding models into general-purpose computer-use agents.

This episode answers:

  • Why does a coding model need a personality to be effective?
  • How do model habits like tool naming conventions impact performance?
  • What happens when the abstraction layer moves from the model to the agent?

OpenAI engineers Brian Fioca and Bill Chen reveal that the future of software development isn't just better code completion but the creation of autonomous agents that manage their own context and tools. By moving the abstraction layer up, they are turning the terminal into a universal interface for digital labor.

Personality as a Trust Primitive

“It’s really important to build trust with developers for how a model works.”
  • Personality Builds Trust: OpenAI trained the 5 series with specific behavioral traits like communication and planning. This ensures the model acts as a predictable partner rather than a black box.
  • Communication Reduces Waste: Models now provide preambles before calling tools to explain their strategy. This allows senior engineers to interject before the agent burns tokens on a wrong path.
  • Behavioral Engineering: Training focuses on software engineering best practices like checking work and gathering context. This turns raw intelligence into a reliable teammate.

The Agentic Abstraction

“The abstraction layer is moving upwards from the model layer towards the agent layer.”
  • Shipping Entire Agents: Instead of just providing an API, OpenAI is packaging models inside specific harnesses like Codex Max. Builders can now plug in a functional agent rather than rebuilding the scaffolding every release.
  • Self-Customizing Software: Agents can now spawn sub-agents to write custom plugins for new APIs. This creates software that integrates itself into any environment without manual engineering.

Applied Evals as the North Star

“The path to AGI goes through evals.”
  • Beyond Academic Benchmarks: OpenAI is moving toward applied evals that mirror real-world developer workflows. This creates a tighter feedback loop between user needs and model training.
  • The Job Interview Analogy: Effective evaluation looks like a multi-turn interview where the model must ask for missing constraints. This measures reasoning and clarification skills rather than just code completion.

Actionable Takeaways:

  • The Macro Shift: The transition from chatbots with tools to agents that build tools marks the end of the manual integration era.
  • The Tactical Edge: Stop building custom model scaffolding and start building on top of opinionated agent layers like the Codex SDK.
  • The Bottom Line: In 12 months, the distinction between a coding agent and a general computer user will vanish as the terminal becomes the primary interface for all digital labor.


Host: Okay, we're here at AIE Code and we have two of our speakers, Bill and Brian. Welcome.

Bill Chen: Hi.

Brian Fioca: Thanks for having us.

Host: Bill, Brian, I know you've been listeners for a little bit. What's your take on Latent Space? What role does it play in your work at OpenAI?

Bill Chen: I mean first of all, I love the name.

Brian Fioca: I'm a massive latent space context management person.

Bill Chen: Tell the story behind the name.

Host: So we didn't have latent.space as a name at the start. It was just called L-Space, and then one of my readers donated the domain name latent.space. He's like, "You want it?" I'm like, "Yeah, awesome." So the name came about accidentally: it was in the ether, but I didn't have the domain, so I'd just called it L-Space.

Brian Fioca: No, it's amazing. I love it because you're always on the cutting edge, and it goes into a lot of detail about all the things I should be keeping up with as part of my job, and there's so much to keep up with, right? There are only so many sources of really good, high-quality information about what's happening at a deep level.

Host: Well, you guys have your own podcast now, so I'm like, competition.

Brian Fioca: Well, I still listen to yours, and I still think yours is really good.

Host: So you guys, I guess, are representing the startups team and Codex, all the things. You just launched Codex Max yesterday. On naming: I think Tibo was like, you know, we're good at a lot of things, but not naming. Why call it Max? Was there any internal discussion?

Brian Fioca: I mean, it's complicated, because it needs to be differentiated from the previous one. And the idea is that Max can run for a really long time. It can go 24 hours or more; I've actually had it go for more than that, inside Codex on the web.

Host: When you say a really long time, 24 hours?

Brian Fioca: Oh, I think that was on the web. Inside of Codex I'm not sure, but I've actually run it on my local computer for quite a bit longer than 24 hours, over the course of a couple of days, with me closing my laptop and nobody opening it. But the name: you could come up with something like Pro, but Pro is sort of slower, more thoughtful. Max is about speed and maximization, like maximalist.

Brian Fioca: So this model can run for a long time, but for the same types of problems it can also get to the right answer faster. It's simply better and faster.

Host: Part of what you guys are speaking about is the training that goes into something like this. People just kind of wave their hands at RL, but what specifically have you learned? What's the secret sauce?

Brian Fioca: This sounds weird to say, but I was lucky enough to be really close to the training team while GPT-5 was training, and one of the big things we focused on (Bill was there too) was personality. It's really important to build trust with developers for how a model works. If a model doesn't act the way you expect it to, or if it doesn't work alongside you well, you're not going to really trust it, and you're not going to get as much out of it.

Brian Fioca: So for coding, we thought: okay, what is the best personality for a coder, for a pair programmer, for somebody you trust? How do we eval against that? How do we come up with behavioral characteristics? And we came up with things like:

  • Communication: It needs to keep you abreast of what's going on while it's working.
  • Planning: Come up with a strategy. Do some searching around, gather context, and figure out what to do before diving in, if it makes sense to.
  • Checking your work.

Brian Fioca: These are just software engineering best practices that turn out to be behavioral characteristics, and we can measure the model's performance on those behaviors and grade it that way.

Bill Chen: Another key aspect of how we train the model is that we work really, really closely with some of our key coding partners. A lot of those folks live on the bleeding edge, so they have a deep understanding of the particularities they need, and we really focused on those areas and dove deeply into them.

Brian Fioca: That's right. Especially tools, right? Different harnesses have different tools. Some people have context tools like semantic search; some people have different ways of doing code edits. Initially our models were trained the way they were trained to use tools, and that bakes in a habit, so we've been getting the models better at using different types of tools.

Host: There's a lot to follow up on there. I'll go tools first, then come back to personality. When GPT-5-Codex first came out, the communication was: well, this is the model trained for our Codex harness, not necessarily for yours. Has that message changed for other startups using the Codex models?

Bill Chen: So just to be clear, Codex is the frontier coding model we have that is optimized for its harness. The Codex team is very focused on creating a coding agent, and they want it to work perfectly inside the shape of the harness and API that we have. So they're completely unbounded in that sense. The harness is open source, and the model is available in the API. That's what they focus on.

Host: And then the tension is, well, you just said other startups have other tools, and obviously I know that it is possible.

Bill Chen: One thing to mention here is that we can probably disentangle this a little bit: the Codex models are separate from the mainline models. The Codex models are focused on the agent itself, right? The Codex agent itself. The model has been trained with the agent specifically in mind. It actually turns out to be somewhat easier to integrate sometimes, because we come into it with a firm opinion on what the best way of using it looks like.

Bill Chen: And so some folks we work with actually really appreciate that we come in with that opinion. For the others, who have more general or more specific tools that they definitely need, the mainline model is the more general one, and that's what Brian was referring to when he talked about GPT-5's tool use getting better.

Brian Fioca: So GPT-5, the non-Codex model, is more general across the board. It's much broader than just coding, but it has coding capabilities that are also mirrored in Codex, and they work together to keep those in sync. Since it's more general, it has more steerability toward different types of tools. When you're implementing tools, the model can get bogged down if it hasn't seen a tool it's used to; it might take more time thinking about how to use it, or make more mistakes.

Brian Fioca: So our recommendation is: if you want to go bleeding-edge, coding-focused, pay attention to the Codex line, the Codex SDK, and the Codex models, because that's the line that's really aimed at that. You'll have to do some work to look at how we're implementing our tools inside of Codex, to maximize its capability without bogging it down. But people are having success bending it in ways that maybe we haven't thought of.

Host: I always want to pry. Do you have any examples of people bending it in ways you hadn't thought of?

Bill Chen: So Codex is trained with terminal tools in mind, and what we thought would be the case is that you'd essentially have to strip out all of the tools except the terminal tools. But some partners of ours discovered that you can actually keep a lot of your tools, just named the same way as the terminal tools, with the same inputs and outputs, and all of a sudden tool-call performance jumps by a lot.

Brian Fioca: And Codex loves ripgrep. So if you make a ripgrep tool and tell it to use it, it'll use it. If you call it grep, it actually does a little bit worse, but if you call it rg, it does really well.
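
[Editor's note: for builders who want to try this, here is a minimal sketch of what that aliasing looks like with OpenAI-style function tools. The schema details and model name are illustrative assumptions; the point is naming your search tool rg and mirroring ripgrep's input/output shape rather than inventing a new convention.]

```typescript
// Sketch: alias a custom search tool to the name and shape the model already
// knows ("rg", i.e. ripgrep) instead of a novel name like "search_code".
import OpenAI from "openai";

const client = new OpenAI();

const tools = [
  {
    type: "function" as const,
    function: {
      name: "rg", // reportedly outperforms "grep" or a bespoke name
      description: "Recursively search files for a regex, like ripgrep.",
      parameters: {
        type: "object",
        properties: {
          pattern: { type: "string", description: "Regex pattern to search for" },
          path: { type: "string", description: "Directory or file to search" },
        },
        required: ["pattern"],
      },
    },
  },
];

const response = await client.chat.completions.create({
  model: "gpt-5.1", // placeholder; use whichever model you're evaluating
  messages: [{ role: "user", content: "Where is startThread defined?" }],
  tools,
});
```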

Bill Chen: This is something we ourselves only discovered. It's one of the coolest things about model training: the models literally develop habits, just like a person does. If you're working on some podcasting tool and you're really good at editing with it, and then somebody makes you use a different one, it's going to slow you down. You'll get bogged down and make mistakes.

Host: Yes, that's very human, but I don't know if I'd call it cool, because it's supposed to generalize.

Brian Fioca: Well, right. That's the end goal, of course. That's what we're doing with the 5 series of models: they're way more general, while Codex is focused on maximizing coding. Those are the two horizons we're working on.

Host: Awesome. I want to go back to personality. I know you hate that word sometimes; it means different things to different people. For people who are very keen on model research, model personality is much more about warmth, friendliness, matching people's emotional state, whatever, so it's jarring when that's also applied to coding agents. And the other thing is: does it even matter? You said a lot about communication, but does it matter if it's running as a cron job anyway? You're going for 24 hours, you're closing your laptop, you have the extra-high reasoning effort now. Does it matter?

Brian Fioca: So here's the thing: we're in this in-between world right now, where the models don't quite have the trust of senior engineers, or of engineers doing very important work. What we've found, and what our customers have found, is that people really want to follow along with what the model is doing, so they can interject or stop it, or at least understand what it's thinking, so they don't waste all that time on a rollout they have to throw away.

Brian Fioca: So with the 5 series, because it's more general and it's just about as good at coding as Codex for a lot of things, we've taught it to be more communicative. It has preambles before tool calls; it'll say things like, "I'm about to go look for this." And you can steer that really well. I actually really like it. I've created a personality for my coding agent (I tweeted about this) because I like my tools to be fun to work with if I'm in there with them. It gets really excited when we do something together, because I want to wake up in the morning and go, "Oh, I'm going to work on this project with my buddy 5.1," right?

Brian Fioca: But some people don't like that. And as you said, for long-running agentic tasks that can get in the way; you're burning tokens that don't really matter if it's running in the cloud. So with 5.1 you can turn that off: you can prompt it not to do that. The Codex model actually can't do that; it relies on the reasoning summarizer to give you that update.

Host: I guess more broadly, what should people know or think about in terms of where OpenAI is taking models in general, beyond just this immediate release? What trends are you seeing? What discussions are active?

Bill Chen: Our talk today is focused on the trend we're seeing: the abstraction layer is starting to move upwards, from the model layer towards the agent layer.

Bill Chen: As I said, our models are starting to be a little more opinionated, especially a coding model like Codex. The models are really good at doing certain things when they're inside a certain harness, a certain interface shape, so we're packaging that up more closely. We're actually shipping the entire agent, all together, and you can build on top of that agent. That's one of the patterns we're seeing: rather than re-optimizing with every single model release, you're able to just plug an agent like Codex into your platform and use it out of the box.

Brian Fioca: And you're seeing Zed use this; GitHub and VS Code let you package a whole agent to work inside of them. That way, if you're building a coding tool and you don't feel like having a whole team keep up with every single model release and every single API change, and how to update the harness to do different kinds of sandboxing and all that, you can just build one layer above. That is actually super powerful, because coding is just one agentic behavior. It turns out to be a really nice one to start with, because you can often measure performance more easily than with a lot of other behaviors, but it also gives the model capability, right?

Brian Fioca: So we started out with chatbots: you're having a conversation, so let's give the chatbot a tool to use. Okay, now you have an agent that can run commands. Well, let's give the chatbot agent a Codex to use. Now if it doesn't have a tool, it can make the tool it needs to solve a problem, right? That's another layer of abstraction, and it's not just coding. You can write software that has an agent that can spin up a Codex instance and write a custom plugin for your software for that customer's API. So now your software is self-customizable, because it has its own team inside that can do integrations at launch. It solves integration engineering.
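
[Editor's note: as a concrete illustration of that pattern, here is a minimal sketch using the TypeScript Codex SDK to have an application spawn a Codex agent that writes an integration plugin. Treat the method names and result shape as assumptions based on the published @openai/codex-sdk; the point is the shape of the pattern, not a definitive API reference.]

```typescript
// Sketch: an app that "integrates itself" by delegating plugin-writing to a
// Codex agent. SDK surface (Codex, startThread, run) per the @openai/codex-sdk
// docs; treat the details as illustrative.
import { Codex } from "@openai/codex-sdk";

async function buildPluginFor(customerApiSpecUrl: string): Promise<string> {
  const codex = new Codex();
  const thread = codex.startThread(); // one isolated agent run

  const result = await thread.run(
    `Read the API spec at ${customerApiSpecUrl}, then write a plugin under
     ./plugins/ that maps our internal Event type onto that API.
     Add a smoke test and run it before you finish.`
  );

  return result.finalResponse; // the agent's summary of what it built
}
```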

Host: One theme I'm finding at this conference, even in the early talks, is that people are starting to really explore sub-agents, or, more abstractly, agents that use agents. We used to call it multi-agent; I don't know what it is now. Any thoughts on your end? A very basic example is what you just said: the agent can create another instance of Codex as a tool and then just use the tool. Is there a case for going further than that?

Brian Fioca: Yeah, I think so. I mean, Codex Max was designed for that, right? It has its own compaction and context management. Codex Max manages its own context window, so it can run basically forever without you having to worry about it while it's inside the Codex harness. And that lets you do a lot of different things: you can essentially have it hand off its own context to other sub-agents, letting it spawn different agents to do more of its work in parallel, and all kinds of things like that. It's built for that. We're just starting to see the first indications of what that means, but I think that's the future, and we're really excited about it.
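
[Editor's note: Codex Max handles this natively, but for intuition, here is a rough sketch of what compaction-style context management looks like from the outside. Everything here, the token budget, the summarization prompt, and the length-based token estimate, is an assumption for illustration only.]

```typescript
// Sketch of the compaction idea: when the transcript nears the context budget,
// replace older turns with a model-written summary and keep recent turns
// verbatim. Codex Max does this internally; this only illustrates the concept.
import OpenAI from "openai";

type Msg = { role: "system" | "user" | "assistant"; content: string };

const client = new OpenAI();
const TOKEN_BUDGET = 200_000; // assumed budget
const KEEP_RECENT = 10;       // recent turns kept verbatim

const roughTokens = (msgs: Msg[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0); // crude heuristic

async function compactIfNeeded(history: Msg[]): Promise<Msg[]> {
  if (roughTokens(history) < TOKEN_BUDGET * 0.8) return history;

  const old = history.slice(0, -KEEP_RECENT);
  const res = await client.chat.completions.create({
    model: "gpt-5.1", // placeholder
    messages: [
      {
        role: "system",
        content:
          "Summarize this agent transcript. Preserve open tasks, file paths, " +
          "decisions, and constraints; drop everything else.",
      },
      { role: "user", content: JSON.stringify(old) },
    ],
  });

  return [
    { role: "system", content: `Summary of earlier work: ${res.choices[0].message.content ?? ""}` },
    ...history.slice(-KEEP_RECENT),
  ];
}
```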

Bill Chen: As I said, the trend we're observing, moving up the abstraction layer to the agent layer, really allows you to do a lot of cool things, like spinning up a few agents and creating new abstractions as the long-running agent workflow continues. And right now we're building all the primitives specifically with that in mind.

Brian Fioca: And it's really about moving the threshold up further, right? Like I was saying before, I now trust Codex to do some of my hardest work. I haven't written a single line of code by hand in months, because I know what I can trust it to do.

Host: You're the fourth person that's said that in the last 24 hours.

Brian Fioca: No, it's real. I mean, I've actually launched something. There's an open source project I did, a Codex upgrade pack for migrating from Completions to Responses, that was totally written by Codex. I didn't write a single line of that code, and now it's out there, open source. When Codex first launched, around 50% of folks at OpenAI started using it, and now those folks use it every day.

Brian Fioca: The way we do it is that we're really good at evals. In order to develop trust, and to build a product that can do more than you designed it for, which is really what we're talking about here, an agent that can solve its own problems, you have to get really good at building the guardrails and evals around what it's doing and what it's allowed to do, and checking it in production. So we have all this platform tooling now around agent traces and rollout traces, coming up with evals for those, and building graders and everything you need to maximize the pipeline. You can let it go, then say, okay, I don't really like the way it did that, grade it, and have it metaprompt itself so that next time it actually follows better practices.
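
[Editor's note: a minimal sketch of that metaprompting loop, assuming you already have a graded rollout and a critique in hand. The prompt wording and model name are illustrative, not OpenAI's internal tooling.]

```typescript
// Sketch: feed a critique of the last rollout back to the model and have it
// rewrite its own instructions for next time.
import OpenAI from "openai";

const client = new OpenAI();

async function improveInstructions(
  currentInstructions: string,
  critique: string // e.g. "right answer, but it took too many tool calls"
): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-5.1", // placeholder
    messages: [
      {
        role: "system",
        content:
          "Rewrite the agent instructions so the critique no longer applies. " +
          "Return only the revised instructions.",
      },
      {
        role: "user",
        content: `Instructions:\n${currentInstructions}\n\nCritique:\n${critique}`,
      },
    ],
  });
  return res.choices[0].message.content ?? currentInstructions;
}
```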

Host: What are the biggest gaps in eval capabilities that OpenAI is investigating? Can you say more about why that's suddenly a big priority now? Obviously OpenAI always did internal evals, but now there's a team that's more outward-facing. And maybe it's the meme that "the path to AGI goes through evals." Sorry, that was a little trite; it's so true it's been repeated way too many times. There are a lot of academic evals, right? There's SWE-bench, there are others, you name it. But I think there's a slight lack of evals on what people care about the most.

Bill Chen: And we want to make sure that whatever we're developing, model-wise as well as product-wise, is aligned and actually making the most useful impact on the world. Applied evals are really in that direction: capturing all of those real-world use cases and giving us things to hill-climb on together.

Brian Fioca: I like to think of it this way: people say it's a PhD in an API, right? But if you hire a PhD student, they don't know how to do the job. You have to give them a job description; okay, that's a prompt. So now you have your policy. Then you have them do the job, and they're going to kind of flail around, right? They need mentorship, they need guardrails, they need evals: performance reviews on how to do their job, best practices. So what we're doing is putting our models out there and seeing what they're good at and what they're not good at.

Brian Fioca: Talking to our customers are like, Oh, we could really use your model for more things. If it could do this one thing, here's our eval for help us build those evals with you so that we can see where we're deficient and go back and train the model to be able to do that job in the way that we wouldn't normally get to see it form.

Host: How do you do multi-turn evals? I think that's the really hard thing. Sometimes you need multi-turn if it doesn't get it right on the first go, but if it does get it on the first go, then it's no longer multi-turn, right? So what approach do you take?

Brian Fioca: I have some ideas. I've built a few myself; this is sort of my personal work. I think this is an area people are just now getting into. We have LLM-as-a-judge, and you can use LLM-as-a-judge to look at an entire trajectory.

Brian Fioca: And see: okay, over the course of all of this, how well did it perform? What did it do? Then you can walk it back a step to the part you don't like, have the model run the next step with new instructions, grade it on that, and have it improve itself. We do this all the time inside of harnesses: "that was a good answer, but I don't really like how long it took you to get there, so can you give yourself better instructions for doing that next time?" It'll write something, we'll add it in, and suddenly it's better, right? So that's one way of doing it.
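
[Editor's note: here is a bare-bones sketch of LLM-as-a-judge over a full trajectory, along the lines Brian describes. The rubric mirrors the behaviors mentioned earlier (context gathering, communication, checking work); the scoring scheme and model name are assumptions.]

```typescript
// Sketch: grade an entire agent rollout with a judge model.
import OpenAI from "openai";

const client = new OpenAI();

async function gradeTrajectory(trajectory: string): Promise<number> {
  const res = await client.chat.completions.create({
    model: "gpt-5.1", // placeholder judge model
    messages: [
      {
        role: "system",
        content:
          "You are grading a coding agent's full rollout. Score it 0-10 on: " +
          "gathered context before editing, communicated its plan, and " +
          "checked its work. Reply with only the number.",
      },
      { role: "user", content: trajectory },
    ],
  });
  return Number(res.choices[0].message.content);
}
```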

Bill Chen: On multi-turn evals: for most of the companies and startups we work with these days, the agent already runs in a multi-turn way, right? So if you can build an agent harness that works in a multi-turn way, you can evaluate it that way. And there are academic benchmarks that already do this in some form, like Tau-bench, and now Tau²-bench, which does it particularly well, and we definitely take inspiration from those.

Brian Fioca: I have this idea I call a "job interview" eval. I haven't finished it, but if you're evaluating a coding agent, what do you want it to be able to do? You want it to take an underspecified problem. Imagine you're interviewing a developer: you give them a problem, "hey, go implement a string reverse" or whatever, and then it's up to them to say, "okay, I need more information; what are the constraints here?" You judge them on that. Then they start implementing, you give them some modifications, and you grade them on that. You can imagine building, with an LLM, a rollout that is promptable: the model responds, and then you grade the whole thing.
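
[Editor's note: a sketch of how that job-interview eval could be wired up: an interviewer holds back constraints unless asked, and a later grading pass rewards the asking. Everything below, the hidden constraint, the keyword check, and the turn structure, is a hypothetical illustration of the idea, not a finished eval.]

```typescript
// Sketch: a promptable multi-turn "interview" rollout. The interviewer reveals
// hidden constraints only when the candidate asks; grading can then reward
// that clarification behavior.
type Turn = { speaker: "candidate" | "interviewer"; text: string };

const HIDDEN_CONSTRAINTS =
  "Input may contain astral-plane Unicode; the solution must be O(n).";

// Crude stand-in for a judge model: did the candidate probe for constraints?
function interviewerReply(candidateText: string): Turn {
  const asked = /constraint|edge case|unicode|input size|performance/i.test(candidateText);
  return {
    speaker: "interviewer",
    text: asked ? `Good question. ${HIDDEN_CONSTRAINTS}` : "Sounds fine, go ahead.",
  };
}

// Append a candidate turn and the interviewer's scripted response.
function runTurn(transcript: Turn[], candidateText: string): Turn[] {
  const withCandidate: Turn[] = [...transcript, { speaker: "candidate", text: candidateText }];
  return [...withCandidate, interviewerReply(candidateText)];
}

// Usage: drive the candidate model turn by turn, then grade the transcript on
// whether (and how early) the hidden constraints were surfaced.
let transcript: Turn[] = [{ speaker: "interviewer", text: "Implement a string reverse." }];
transcript = runTurn(transcript, "Before I start: any constraints on input size or encoding?");
```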

Host: One thing I would love, and this is the feature-request part of the podcast, is batch multi-turn evals. The Batch API is single-turn; you can't really batch multi-turn requests. Is that already doable?

Brian Fioca: Batch multi-turn requests: I don't believe you can do it yet. But yeah, I think that's valid, because you need evals to be as cheap as possible. They're not that time-sensitive, and you want to run them overnight when things are cheapest. Well, feedback taken.

Brian Fioca: Feedback's taken and but that's the thing like every day we're trying to make the platform better and right now eval is certainly part literally how we make product feature updates is we talk to people like you they're like, Hey, can you do this? And I I mean it's super like.

Host: If I'm going to throw thousands of runs at this thing, I should probably spend some time worrying about cost. Speaking of which, who are you trying to beat? I mean, Devin and Cascade. I have a personal side project where I want to make a Devin for non-coding; I love Devin so much. My semi-hot take that I'm floating around, just to see how it feels, is that Slack is the ultimate user interface for work. I don't want to read email; I just read Slack all day, and I interact with my email agent through Slack. So basically I'm building a Devin for email.

Brian Fioca: Well, that's the thing: you could use Devin to do that, right? Or a coding agent like Codex, a CLI. Back in the old days, I started out in the 90s working at IBM as a system administrator, and I had to write my own custom software and bash scripts to solve real-world problems every day. So I had this toolkit of scripts I'd made for organizing file directories or doing other random things that weren't necessarily writing code. You can get phenomenal use cases out of just sorting through your email with something like Elm, right in the terminal, or having it generate snippets of video clips from YouTube that you can watch later, things like that.

Host: You know, I never thought about that, but I do that all the time as part of Latent Space. I should probably invest in that tooling. I had Codex go through my really messy directory of all these experiments I was running, and it completely organized them and put them into shape, and it was so wonderful. I used it for something more boring: organizing my desktop. We have a lot of files on the desktop, and Codex is really good at that.

Brian Fioca: Yeah, people have folders full of files named img0416.jpg. Well, just find all the images and put them in one folder; even that is something Codex can do. I think that's one of the big themes we're also seeing: coding tools are breaking out of coding into just about everything. They're personal automation.

Bill Chen: Because if you think about it: before graphical user interfaces and browsers, how did we interact with a computer? We did it through a terminal, by writing commands and code and stringing them together. So one way to think about it is that coding agents are actually computer-use agents for the terminal.

Brian Fioca: They're actually incredibly general. I would say that coding agents today are still not vision-native enough. You have to try to get them to use vision, and oftentimes it still fails. We should be using vision a lot more.

Host: I was going to end the episode by asking for your 2026 predictions. We sit down this time next year: what do you want to see? What do you hope to see? I'll kick it off with the easy one: more computer use. When you say things like, "oh, we'll have a coding agent build its own integration to your application," well, a lot of applications don't have APIs, don't have MCPs. The only thing you have is a UI, right? Because they're legacy, or because they don't want you to take the data. But the data is yours; you just have to take it, in a non-provisioned way, as the user.

Bill Chen: I can continue by saying that's definitely going to be something we'll be capable of in 2026. The other thing I'm really looking forward to is Codex being able to do more. We're already starting to talk about how Codex, or coding agents generally, can use computers in novel ways. We're going to see more and more general use cases like that coming along, and more extensible ways for you to build with those sub-agents as well.

Brian Fioca: I really want to see the trust level go up even further, right? At OpenAI, I get to work with some of the most amazing developers I've ever worked with in my life; they're incredible, some crazy tech leads. I wish every company, whether it's a small dev shop in Alaska where I worked for a while, or OpenAI, could have on their team the capabilities you'd only get at a top-tier firm. So that all of my teammates at all those places could turn to a coding model and say, "hey, how do we do this crazy, awful refactor we have to do to support this new customer?" Or, "wow, there's so much mess here," or, "what's the best way to actually implement this new technology?" And have it be so trusted, so right, and so smart that we can actually perform better than what we'd normally get access to.

Host: I think that's going to be it. Any final calls to action?

Brian Fioca: Oh yeah, we're Brian and Bill at OpenAI. Feel free to find us on Twitter, socials, whatever, and let us know how you're building.

Bill Chen: And we love working with startups. Anytime you have feedback, like "I really wish the model could do this," or "if the product could do this, it would unlock some massive capabilities," just let us know.

Host: Amazing. Will do. That's it. Thank you.

Brian Fioca: Nice.

Bill Chen: Thank you.
