
Author: Nina Lopatina, Contextual by Latent Space
Date: [Insert Date]
Quick Insight: This summary is for builders moving beyond toy RAG demos to production-grade AI systems. It explains why the future of model performance lies in engineering the context rather than just scaling the weights.
Context engineering has moved from a niche research topic to the core bottleneck for production AI. Nina Lopatina of Contextual AI explains how the industry is pivoting from simple retrieval to complex agentic workflows. The tension lies in balancing massive data access with the cognitive limits of current frontier models.
"Agentic RAG is just generally better than RAG."
"This year is kind of the year of the sub-agent."
"Agents can get pretty confused pretty quickly."
Podcast Link: Click here to listen

We are back here with Nina Lopatina from Contextual. Welcome. We're going to talk about the state of context engineering in general. But one thing I also wanted to give people a sense of: we're not here for the NeurIPS-flavored discussion and all that. We were talking earlier — I was asking which are the best NeurIPS after-parties, and you said Mercor, and what was the other one? Turing and Nvidia, where there was a really good fireside chat, which is rare, because I often tend to shy away from fireside chats. But apparently the guest was Yejin Choi, who is a very well-known figure in these parts. What was it about?
Yeah, it was about the overall state of AI. Since she's at Nvidia, they talked a lot about scaling laws and the future of AI. I think Jonathan Siddharth, the CEO of Turing, who was interviewing her, was asking fun questions, as was the audience. I would say the last question there was kind of perfect. She focuses on small language models, and the question was: wouldn't your research cause Nvidia to lose a lot of market cap? In fact, she had emailed her research to Jensen Huang before joining, and he was like, yeah, that's great. And I think that's fair. The implication is that if you have smaller language models you can do more — you can have them run on phones — so you could actually end up using more compute overall to power more and more small language models to do more.
I would say, though, also at NeurIPS the OpenRouter team released their State of AI report based on OpenRouter usage, and they had this very interesting chart on the adoption of small versus medium versus large language models. The cutoff was: small was less than 15 billion parameters, medium was 15 to 70 billion, and large was anything above 70 billion. And you could see the market share of small models trending down over time in practice, the market share of medium trending up, and large staying about the same.
Yeah. And this is open models only, of course, and obviously the closed models have gone up and up. So it's just a data point.
Yeah, I think it was one of those things where Apple Intelligence is, I think, widely acknowledged to be a failure this year, and they were kind of the champions of the small-models-on-device movement. That really needed to work, and it did not work. Gemini launched Gemini Nano in Chrome this year. I haven't used it; I don't know if you have. Basically, the small-model hope and dream is still limited to very small use cases and not really rolled out, and I don't know what needs to happen for it to roll out.
Yeah, I would say that makes a lot of sense for general-purpose models. I think for other component models — say, a reranker — due to latency constraints, of course smaller is better; that's what I've heard from other developers. But I can see that being a trend more generally.
Yeah, you guys had a lot of reranker work last year. I don't know, is that as much of a focus this year for Contextual? We released the first instruction-following reranker in March, and we updated it recently. So it's something that I think we'll keep digging into to keep up with the state of the art, because it is really a key function within context engineering. For example, when you're reasoning over larger and larger databases, you want more recall in those initial retrievals, but you want more precision for what you're actually putting into the context window, for fear of context rot and poor performance. And so that reranker can really help you narrow down that initial retrieval.
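To make that recall-then-precision pattern concrete, here is a minimal sketch in Python; the function names are illustrative placeholders, not Contextual AI's actual API:

```python
# Minimal sketch of "wide recall, then precise rerank":
# retrieve many candidates, rerank, and put only a small slice into the context window.

def vector_search(query: str, k: int) -> list[str]:
    """Placeholder: return k candidate passages from a vector index."""
    return [f"passage-{i} about {query}" for i in range(k)]

def rerank(query: str, passages: list[str], instruction: str) -> list[tuple[str, float]]:
    """Placeholder: an instruction-following reranker would score passages here."""
    return sorted(((p, 1.0 / (i + 1)) for i, p in enumerate(passages)),
                  key=lambda x: x[1], reverse=True)

def build_context(query: str) -> str:
    candidates = vector_search(query, k=100)            # high recall: cast a wide net
    scored = rerank(query, candidates,
                    instruction="Prefer recent, authoritative sources.")
    top = [p for p, _ in scored[:8]]                     # high precision: small context slice
    return "\n\n".join(top)                              # only this reaches the LLM

print(build_context("warranty policy for product X")[:200])
```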
I made a brief comment that it's weird that that was the first instruction-following reranker, because don't you always want it to be instruction-following? And the simple answer I've gotten — I've literally asked the search startups this; those who know will know — is: well, to be instruction-following you have to have a larger model, and that affects our latency budget. And I'm like, I don't think that's the case; the latency for smaller models is pretty good these days. I don't know.
So, actually, that is the biggest complaint I've heard from developers about our reranker. But what we're using it for increasingly is dynamic agents, and there you're latency-insensitive, right? Just take however long you want to take.
Yeah, exactly. Yeah, that's fair. Let's get right to context engineering. I think the other big topic of conversation this year was, quote-unquote, the death of normal RAG and the rise of agentic RAG. Do you agree? Is the debate itself overrated, because obviously you use whatever fits the context — it depends? Is this a meaningful debate?
I mean, I think the debate is not so meaningful. I think progress is meaningful, but to me it's also somewhat a decided debate: agentic RAG is just generally better than RAG. Even that initial incremental step of doing query reformulation — when you receive that initial query, being able to break it down into subqueries so that you can better match those queries to documents you might want to retrieve, and then combine the results — even that step improves performance so dramatically that it became the new baseline.
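A rough sketch of that query-reformulation baseline, with the LLM decomposition and the retrieval calls stubbed out as placeholders:

```python
# Break the user query into subqueries, retrieve for each in parallel, then merge results.
from concurrent.futures import ThreadPoolExecutor

def llm_decompose(query: str) -> list[str]:
    """Placeholder: an LLM call would split the query into focused subqueries."""
    return [query, f"definitions related to: {query}", f"recent changes to: {query}"]

def retrieve(subquery: str) -> list[str]:
    """Placeholder: hits from a search index for one subquery."""
    return [f"doc for '{subquery}'"]

def agentic_retrieve(query: str) -> list[str]:
    subqueries = llm_decompose(query)
    with ThreadPoolExecutor() as pool:                   # fan out retrievals in parallel
        results = list(pool.map(retrieve, subqueries))
    merged, seen = [], set()
    for hits in results:                                 # merge and de-duplicate
        for doc in hits:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

print(agentic_retrieve("What changed in the 2024 return policy?"))
```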
That is obviously so helpful, and you can farm it out, parallelize, and then regroup. So this year, the work we did on SWE-grep did that. We had a fast context model, and I think a couple of the innovations were really interesting. One, it was trained to search much more in parallel than normal tool calling: normal parallelism is one to two, maybe a maximum of four, and we trained the baseline to be six parallel searches at a time, going up to eight. And then also limited agency: you don't want your agentic search to run forever; you do want it to terminate at some point and return the answer. So incentivizing the RL to do that was, I think, helpful, easy, and it actually scaled very well.
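A toy sketch of those two levers — capped parallel searches per step and a hard stop on total steps. These stubs are illustrative, not the actual SWE-grep implementation:

```python
MAX_PARALLEL = 6     # searches issued per step
MAX_STEPS = 4        # hard cap so the agent cannot search forever

def propose_searches(question: str, notes: list[str]) -> list[str]:
    """Placeholder: a model would propose the next batch of search queries."""
    return [f"{question} (angle {i})" for i in range(8)]

def run_search(query: str) -> list[str]:
    """Placeholder: results from a code/document search tool."""
    return [f"snippet for {query}"]

def try_answer(question: str, notes: list[str]) -> str | None:
    """Placeholder: a model would decide whether the notes suffice to answer."""
    return f"answer to '{question}' from {len(notes)} snippets" if len(notes) >= 12 else None

def search_agent(question: str) -> str:
    notes: list[str] = []
    for _ in range(MAX_STEPS):
        for query in propose_searches(question, notes)[:MAX_PARALLEL]:  # cap parallelism
            notes.extend(run_search(query))
        answer = try_answer(question, notes)     # reward early termination
        if answer:
            return answer
    return f"best-effort answer from {len(notes)} snippets"  # forced stop at the cap

print(search_agent("where is the retry logic configured?"))
```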
Yes. And actually we have found that having turn limits, and limits on the sub-agents checking and validating their work, is super important. Myself and a member of my team recently participated in a context engineering hackathon. Tell us about it.
Yeah. So Bryan Bischof and Hamel Husain hosted it in San Francisco last month, mid-November, and they had just under 100,000 documents — PDFs, all in the retail space, so it's called Retail Universe — plus log files and giant CSV tables. We used our dynamic agent to answer really challenging queries against this and generate structured output. One of the very early things we noticed is that with a dataset that large, the sub-agent — say, the unstructured or structured retrieval sub-agent — will want to take so many turns to make sure it has looked under every rock in the dataset, and we actually don't want that; the dataset is pretty large. And it also turns out the sub-agent will want to check its work over and over and over. So that's something we noticed at large scale. It's similar to what you found: you don't want it unlimited, whether you're enforcing that within the RL reward structure or with explicit instructions in the system prompt.
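A minimal sketch of how those limits might be enforced both in the loop and in the system prompt; the names and budgets below are purely illustrative:

```python
# Give each retrieval sub-agent an explicit turn budget and a cap on self-verification passes.
SUBAGENT_SYSTEM_PROMPT = (
    "You are a retrieval sub-agent. You have at most {max_turns} tool calls. "
    "Verify your answer at most {max_checks} time(s), then return what you have."
)

def run_subagent(task: str, call_tool, max_turns: int = 8, max_checks: int = 1) -> str:
    prompt = SUBAGENT_SYSTEM_PROMPT.format(max_turns=max_turns, max_checks=max_checks)
    findings, checks = [], 0
    for _ in range(max_turns):                         # hard turn budget
        result = call_tool(prompt, task, findings)
        if result.get("type") == "verify":
            checks += 1
            if checks > max_checks:                    # stop re-checking the same work
                break
            continue
        findings.append(result["content"])
        if result.get("done"):
            break
    return "\n".join(findings)

# Usage with a dummy tool that keeps "finding" facts until it decides it is done.
def dummy_tool(prompt, task, findings):
    return {"type": "find", "content": f"fact about {task}", "done": len(findings) >= 2}

print(run_subagent("average order value in Q3", dummy_tool))
```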
I would say this year is very much the year of the sub-agent for me, in terms of people using constrained agents to do very specific things. Sometimes being too general is actually an anti-pattern, because the agent doesn't go very far on its own, or it's not very reliable, or it doesn't have the tools it needs, and so on. It also makes sense because you can fine-tune the model toward that specific task, so you can hill-climb very easily. To me that has all the right elements of AI engineering that I want to see people do more of; people just needed a term for it, and I think they've settled on sub-agents. Just briefly staying on the hackathon: are there any other alternative approaches you found interesting that you want to shout out? I actually have a blog I'll write soon about our experience. Blog it, I love it. I think the leading team was using Mixedbread and Claude, and second place was using Cursor with Mixedbread. I've seen that name a couple of times.
Yeah, I think they also have a reranker and some other open-source models. Okay. Interesting name.
Yeah. And I forget what the other team was using. There was also a human doing the challenge, so there was a human benchmark. Oh yeah, how did the human do? I think they got 23 points; we got about 25, and the winner had about 29. So it was very — okay, superhuman.
Yeah, very cool. Okay, so let's get right into the meat, which is context engineering. It's been a very, very big year for context engineering, and your company is Contextual, which could not be more on the nose. How do you describe this year in context engineering?
It's been a very fast year, because context engineering really took hold six months ago, and that actually feels like a year. One thing that stands out to me is that there are a lot of design patterns bubbling up, but there isn't a uniform design that folks are using as the architecture. And I think there's a lot of optimization and efficiency to gain. With a lot of new development, you kind of start by letting the agent use as many tokens as it wants, and then later you figure out how to constrain and optimize that — say, with key-value caching or other approaches that can help really scale the technology. So to me it's maybe more in a prototyping stage, and I'm expecting next year we'll really see scale for context engineering.
What does scale look like? What will we be able to do by the end of next year that we cannot do, or are not doing, this year? I think the kinds of tasks that we'll be able to solve are going to increase. We're already seeing the start of that. There was one in the HAL benchmarks — is this the Chroma piece? No, this is from Princeton. Okay, I'm not familiar with it. I forget now what it stands for, but it's a set of benchmarks for evaluating longer-running agentic tasks, and in this case there was one where they were evaluating recreating a research paper. That benchmark came out in October, and it was saturated earlier this week — oh jeez — by Claude Code. Oh jeez. Yeah. And in fact they needed to have humans run the evaluation, because the solution took maybe a different approach than a human would — somewhat superhuman.
You know, the common pattern you see now is where the gold dataset actually has some errors in it, and so the model corrects them. Yeah. Because if it gets 100, then you're like, oh, something's wrong — it's a canary for something actually being wrong. Yeah. So we should do it on purpose.
Yeah. And I think we'll just see these new, really well-thought-out and really challenging benchmarks come out and then very quickly get saturated. And we'll continue to see that for more and more challenging tasks. Do you find, in general devrel work and marketing and leadership of a category, that it's useful to maintain a benchmark — like, "this is the Contextual benchmark that everyone should adopt"? I struggle with this, because obviously a lot of benchmarks come from research and not industry, but I feel like industry should have a role.
Yeah, actually, we've been using that dataset from the hackathon I described earlier somewhat as a benchmark. I think it's a really interesting dataset, because most benchmarks use a very small set of data to train or inference on, and this actually requires reasoning over — yeah, this will never fit in a context window, right? Yes. Do you know how many tokens it is? You say there are 100,000 documents, but I don't know how many tokens that translates to. I'd say a few thousand each, maybe 4,000 each. Probably more. Probably more. Okay, so that's hundreds of millions of tokens, if not billions. Yes. Yeah. Okay, cool.
Yes. So I would love to see more industry benchmarks, because those would help us actually evaluate at that scale and not have a toy example, as many benchmarks are.
Amazing. And, you know, just on the lore of context engineering: obviously we've got to shout out Dex, who did a couple of great talks this year, maybe three. I think Drew Breunig has also written a lot about the failure modes of context engineering. What's sticking with the people that you talk to? Context rot is a well-established term — shout out to Chroma. Anything else? Context poisoning I haven't heard as much, so that term is not sticking. What else is the topic of conversation that everyone should know?
Yeah, I mean, I think I see context rot cited in every blog about this. Yeah. And I'll take my L on that: I told Jeff, "Are you sure this needs to be written? Everyone knows this." And he's like, "No, everyone doesn't know this." Well, exactly — it was obvious to us, but, you know. Yes, I think it's very intuitive, but actually having the metrics and results to show for it — yes, he did the work. Exactly, he did the work. A lot of people have intuitions just based on using a model, but if you can actually put a number on it — like, okay, of this million-token context, at 700,000 tokens your retrieval accuracy is actually around 30% — then you can compare it to other performance gaps and see what's having a bigger impact.
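A toy harness in the spirit of that kind of measurement (not Chroma's actual methodology): drop a known fact at a random depth, grow the surrounding context, and track retrieval accuracy:

```python
import random

FILLER = "The quarterly report discussed logistics, staffing, and vendor contracts. "
NEEDLE = "The warehouse door code is 4172."

def ask_model(context: str, question: str) -> str:
    """Placeholder: call your LLM here; this stub simply 'forgets' in long contexts."""
    return "4172" if len(context) < 50_000 and NEEDLE in context else "unknown"

def accuracy_at_length(n_filler: int, trials: int = 20) -> float:
    hits = 0
    for _ in range(trials):
        chunks = [FILLER] * n_filler
        chunks.insert(random.randrange(len(chunks) + 1), NEEDLE)   # random needle depth
        answer = ask_model("".join(chunks), "What is the warehouse door code?")
        hits += ("4172" in answer)
    return hits / trials

for n in (100, 1_000, 5_000):       # roughly increasing context sizes
    print(f"{n:>6} filler chunks -> accuracy {accuracy_at_length(n):.0%}")
```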
I think Anthropic has had some really great blogs in this space. There's one on design and architecture choices that they put out fairly early that was really interesting. And, you know, it didn't come out this year, but I think MCP has been a huge driver of context engineering — and also a source of its flaws, I would say. Let's talk about it.
Yeah, because MCP is this giant JSON thing up front: you're stuffing in the descriptions of all the tools, so you very quickly get into straight-up context rot when you have, like, ten tools, especially if the tools are fat. Yeah. So there have been some really interesting blogs on tool use. Manus had one that was more general but had some best practices around tools, and Anthropic has written more on tool-use patterns. And Cloudflare — yeah, that's the other one. Yeah.
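One common mitigation is to select a small subset of tool schemas per request instead of sending them all. Here is a toy sketch using keyword overlap as a stand-in for the reranker-based selection described next; the tool names are made up:

```python
import re

TOOLS = [
    {"name": "search_orders",   "description": "Search retail orders by customer, date, or SKU."},
    {"name": "get_weather",     "description": "Get current weather for a city."},
    {"name": "query_warehouse", "description": "Query warehouse inventory levels."},
    {"name": "send_email",      "description": "Send an email to a recipient."},
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_tools(user_query: str, tools: list[dict], top_k: int = 2) -> list[dict]:
    query_words = words(user_query)
    scored = sorted(tools,
                    key=lambda t: len(query_words & words(t["description"])),
                    reverse=True)
    return scored[:top_k]           # only these schemas go into the prompt

selected = select_tools("How many units of SKU 881 do we have in the warehouse right now?", TOOLS)
print([t["name"] for t in selected])   # a lean tool list instead of every schema
```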
Funny enough, going back to our reranker: I actually set up a prototype of selecting which MCP servers to use. Being able to select those servers is already a context challenge, because there are so many of them. It's a sub-agent. Yeah, interesting. Is that something people are very excited about? Are they deploying it at scale? Do you see a lot of traction?
I think, similarly — yes, we've definitely seen a lot of great use for it. We have our dynamic agent use MCP servers as well, and earlier in the year I made some really fun demos where I could really quickly combine tools for a prototype. So I think it's really helped people prototype faster and show value in an early version, and then build that out at larger scale. And for me personally, in my dynamic agent configs, I'm moving more toward API calls and something a little more fixed once I've been able to prototype with an MCP server and figure out how I'm going to use it. Then you can reduce the complexity. Yeah. And reduce that dependency.
Mentioning MCP — I don't know how far you want to go into this, but the MCP gateway finally launched from Anthropic, and there are a bunch of MCP services doing various kinds of discovery and acting as registries. What should people know? What are people betting on? What's actually working in terms of MCP servers, directories, gateways, auth, anything of that nature — basically everything that has happened after the initial launch of MCP?
Yeah, there's been a bunch of work. There's MCP UI, but I don't know if that's really strictly context engineering. We added our MCP server to the registry — the official Anthropic one — and it really seemed like that registry is meant to be read by agents, not humans. Yeah, it's like a GitHub repo. Which is interesting, because I think, yes, there's definitely value to having a strong agent experience, and that's going to be a growing area over the next year for sure — letting agents just do things without a human in the loop. But for MCP servers, I think you still want a human to make the selection of what goes into your list of tools. You want to check the security and other things before you let it run.
Amazing. Okay, so more broadly, any other things you'd call out in the state of context engineering, any other good work by other companies you'd shout out? We mentioned Manus briefly. Yeah, I think there's been some really interesting research in optimizing the system prompt. Okay, so this is the continual prompt learning from Arize — or no, I'm thinking of GEPA. I think people are very excited about GEPA. The way I explain it — feel free to correct me — is that it's kind of an evolution of the original DSPy idea, where you set objectives and let LLMs optimize their own prompts by looking at outputs and reasoning about what they should add to the prompts to improve their evals. So it's a nice, sort of PyTorch-like model of a training loop, but it's only in the prompts, not in the weights. Mhm. You can obviously extend it to the weights too. And the other reason it's called GEPA: there's a genetic or evolutionary element, where you roll out multiple samples and select the best survivors. Anything else I missed?
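A very rough sketch of that reflective, evolutionary prompt-optimization loop (GEPA-flavored, not the actual GEPA or DSPy code), with the eval and reflection steps stubbed out as placeholders:

```python
import random

def run_eval(prompt: str) -> float:
    """Placeholder: run your eval set with this system prompt and return a score."""
    return random.random()

def llm_reflect_and_edit(prompt: str, score: float) -> str:
    """Placeholder: an LLM would read failing traces and propose a revised prompt."""
    return prompt + f" [refinement round, prev score {score:.2f}]"

def optimize_prompt(seed_prompt: str, generations: int = 3, population: int = 4) -> str:
    pool = [seed_prompt]
    for _ in range(generations):
        scored = sorted(((run_eval(p), p) for p in pool), reverse=True)
        survivors = [p for _, p in scored[: max(1, population // 2)]]   # keep the best
        children = [llm_reflect_and_edit(p, s) for s, p in scored[: population // 2]]
        pool = (survivors + children)[:population]                      # next generation
    return max(pool, key=run_eval)

print(optimize_prompt("You are a careful retail analyst. Cite your sources."))
```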
No, that captured it really well, and it actually jogged my memory of ACE — agentic context engineering. That approach has actually shown better benchmark performance on financial and other complex document sets, and the approach they've taken is quite interesting. Basically, if you take an approach like GEPA and you maybe throw out the whole prompt and start over, or you do many, many steps where you compress and expand and compress and expand, you're going to lose some information, and you can see a significant drop-off in performance. So they're using an agentic approach to make smaller tweaks to the current prompt rather than rewriting it from scratch. That's among the innovations of their approach, and I think it goes along with what Seth said about the KV cache: agents can get pretty confused pretty quickly, and that can really degrade performance and cause hallucinations. So wherever you can, use the KV cache — both for efficiency and as a more stable environment for the agent to take more and more actions in.
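A minimal sketch of that incremental-update idea: keep the accumulated context as discrete bullets and apply small deltas rather than rewriting the whole prompt. The reflection step here is a hypothetical placeholder, not the ACE implementation:

```python
def propose_delta(playbook: list[str], trace: str) -> dict:
    """Placeholder: an LLM would read the run trace and suggest a small edit."""
    return {"add": [f"Lesson from last run: {trace}"], "remove": []}

def apply_delta(playbook: list[str], delta: dict) -> list[str]:
    kept = [b for b in playbook if b not in set(delta.get("remove", []))]
    new = [b for b in delta.get("add", []) if b not in kept]   # avoid duplicates
    return kept + new                                          # existing bullets stay intact

playbook = ["Always filter by date range before aggregating.",
            "Prefer the structured CSVs over PDF tables for numbers."]
playbook = apply_delta(playbook, propose_delta(playbook, "join orders to inventory on SKU"))
print("\n".join(f"- {b}" for b in playbook))
```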
Yeah. How much KV cache decision-making is there in context engineering? I feel like the obvious answer is: the stuff that doesn't change, put it up front, and the stuff that changes a lot, put it at the bottom, right? And I think that's mostly because most agents that I care about are multi-turn, so the cache is the whole set of turns — the five turns that happened before — and I'm not changing the system prompt that much. With the KV cache you'd save money if you're serving the same prompt to a thousand customers; I guess that's it. But we don't do that, so I don't know.
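That ordering rule in code: keep a byte-stable prefix up front so the KV cache can be reused across turns, and append the volatile pieces last. This is a generic sketch, not any particular provider's API:

```python
STATIC_PREFIX = [
    {"role": "system", "content": "You are a retail analytics assistant."},
    {"role": "system", "content": "TOOLS: search_orders, query_warehouse"},  # unchanged across turns
]

def build_messages(history: list[dict], retrieved: str, user_turn: str) -> list[dict]:
    return (
        STATIC_PREFIX                                            # cache hit on this stable prefix
        + history                                                # prior turns, append-only
        + [{"role": "system", "content": f"Context:\n{retrieved}"},
           {"role": "user", "content": user_turn}]               # volatile content goes last
    )

history: list[dict] = []
msgs = build_messages(history, "Q3 revenue table...", "How did Q3 compare to Q2?")
print([m["role"] for m in msgs])
```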
Yeah, I think it can also improve performance. But as conversations take more and more turns, I haven't really seen a system that handles this well. With Cursor, I get to a certain point in the conversation and I'm just opening a new window even if I wasn't done with that conversation, because the context bloats. Yeah. So Dex would call this intentional context compression — intentional, frequent context compression — because you don't trust the model to do compaction just yet. Mhm. I would say that Anthropic, and I think OpenAI, and maybe Gemini as well, are all doing compaction inside the model with each of their frontier releases. I don't know if that's interesting to you. Yeah. Any evaluation?

Well, actually, I did what I like to call an embodied eval of ChatGPT last winter: I used it as a training coach for a snowboarding mogul race. What is a mogul race? You snowboard around moguls — a double-diamond mogul run that you lap 25 times. Twenty-five, that's a lot. Yeah, it's about 40,000 vertical feet. Actually, 23 laps. So training is very important to be able to do this and to do it safely. It's also a workout; I don't know how many calories you burn on that. Oh, wow. They give you a big pasta dinner at the end. So this was a really long-term project, like three or four months, and it also required me taking action in the real world in an embodied way and then closing that loop. And I just had to close that window and restart, and I lost a lot of training info. So that's made me proactively limit the number of turns, and maybe I'm missing some of the progress that's happened since then. This season I'm evaluating multiple models to see who is the best coach. Yeah. But then you have to copy-paste every training data entry into all of these models, right? That's not great, unless you have your own custom interface — or are you just going to chatgpt.com and claude.ai? I think I'm going to do initial interviews on LMArena. Okay, I see, and then maybe pick one or two. Yeah. Interesting. Okay, cool.

Automated context engineering is really great. Anything specifically for code? I don't know how much you guys encounter code — it sounds like yes. And is it very different from legal and retail and support and all these other domains?
Yeah, so our goal has been to create a platform that's really end-to-end and easy to get up and running for any domain and any use case. I think we saw that firsthand with a recent beta at that hackathon I mentioned for e-commerce — although we do have customers in that domain, I haven't really interacted with that work very much. And code has been one of the domains we've worked with as well. We actually just used our platform for test code generation for devices. What we found is that the same approach applies: we have multimodal ingestion and an ability to get the hierarchy of the document contents, and then a retrieval pipeline that includes filters, rerankers, and hybrid search. All of that combined is a really great starting point, and then we're able to hill-climb very quickly. We actually had state-of-the-art — or, I guess, the highest human-based evals — for that customer compared to coding platforms. So it requires a bit of customization, but I think context engineering applies to code the same way it applies to other domains.
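A toy sketch of that pipeline shape — metadata filters first, then hybrid (lexical plus vector) scoring, ahead of the reranking step sketched earlier. The scoring functions are stand-ins, not Contextual AI's platform:

```python
def hybrid_search(query: str, docs: list[dict], filters: dict, alpha: float = 0.5) -> list[dict]:
    def lexical(doc):   # toy BM25 stand-in: term overlap
        return len(set(query.lower().split()) & set(doc["text"].lower().split()))
    def vector(doc):    # toy embedding-similarity stand-in, precomputed here
        return doc.get("sim", 0.0)
    candidates = [d for d in docs if all(d.get(k) == v for k, v in filters.items())]
    return sorted(candidates,
                  key=lambda d: alpha * lexical(d) + (1 - alpha) * vector(d),
                  reverse=True)

docs = [
    {"text": "unit test template for device firmware", "domain": "code",   "sim": 0.81},
    {"text": "holiday return policy",                   "domain": "retail", "sim": 0.20},
]
print(hybrid_search("generate device unit test", docs, filters={"domain": "code"})[0]["text"])
```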
Awesome. A little bit of prediction corner here. What do you think is underrated now in context engineering that will be a big topic of conversation next year — basically, what should people talk about more that they're not? I think really the full system. Right now, people are talking about innovations in one part of the system or another — different components like a memory system or a reranker, or design patterns around compressing context, things like that. In the next year I think we'll have full systems that can themselves be a design pattern, and we'll be having the discussion at that level rather than at the level of components.
Amazing. This is maybe your fifth NeurIPS, I would say. For people who've been long-timers, I think the last time NeurIPS was anywhere near San Diego — what year was that, like 2017 or something? Yeah. I don't know if you were there for that. How did you reflect on the scene changing when you come back?
Yeah, it's so interesting. My first NeurIPS was in 2016 in Barcelona. That's nice. Yeah. I had just finished my graduate research in neuroscience, in reward learning and decision-making, and I was doing a postdoc at Berkeley, but I had a poster from my graduate work. So I flew out for a few days, and it was really not what I expected from a research conference, having mostly attended neuroscience research conferences. Oh yeah, how is it different?
Well, this is Neural Information Processing Systems. I know, I know. So I've run into so many neuroscientists here as well, and people from that earlier career stage. I was not expecting all the industry parties — that's just not a thing in neuroscience. Yeah. And it was just so much smaller. It seemed like a large conference to me, and everyone there was like, oh yeah, it used to be much, much quieter and smaller. And now there are so many people for whom it's their first year, and I'm just not seeing some of the same folks I ran into in the earlier years. They're still around; you just have to find them at parties and places like La Lounge.
Very cool. Uh, any calls to action? How can people help you, find you, anything like that?
Yeah, we have some really exciting updates coming in the domain of context engineering. So follow me, Nina Lopatina, on Twitter or LinkedIn, or follow Contextual AI, and stay tuned. Yeah, stay tuned. Awesome. Thank you.