AI Engineer
December 23, 2025

Continual System Prompt Learning for Code Agents – Aparna Dhinakaran, Arize

The Memento Method: How System Prompt Learning Fixes Coding Agents

Author: Aparna Dhinakaran

Date: December 23, 2025

Quick Insight: Building elite coding agents requires moving past static prompts toward a recursive loop of natural language feedback. This summary explains how to use LLM-as-a-judge evals to automate the iteration of system instructions for immediate performance gains.

This episode answers:

  • Why is natural language feedback more sample efficient than traditional reinforcement learning?
  • How did 150 training examples increase GitHub issue resolution by 15 percent?
  • What role does the meta-prompt play in evolving an agent's core logic?

Aparna Dhinakaran of Arize AI argues that the secret sauce of top-tier agents like Cursor or Claude is not just the base model. It is the massive iterated system prompt that acts as the agent's cognitive framework.

The Feedback Advantage

"It almost feels like humans learning because they take back English feedback."
  • Natural Language Rewards: Traditional RL provides a scalar score, like a 70 percent on a test. This forces the model to guess how to improve, which wastes compute.
  • The Teacher Analogy: Prompt learning provides specific notes on what the student missed. This allows the agent to correct specific logic errors in a single iteration.
  • Sample Efficiency: You do not need a massive data science team or millions of rows of data. Small datasets of 150 examples can drive double-digit performance gains. The difference between the two reward formats is sketched just after this list.
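A rough illustration of the two signal shapes, using hypothetical field names rather than any particular framework's schema:

from dataclasses import dataclass

# Traditional RL-style signal: a single scalar, with no hint about what to fix.
scalar_reward = 0.70

# Prompt-learning-style signal: a verdict plus English feedback the agent
# (or a meta-prompt) can act on directly. The explanation text here is made
# up purely for illustration.
@dataclass
class JudgeFeedback:
    label: str        # "pass" or "fail"
    explanation: str  # why it failed, in natural language

feedback = JudgeFeedback(
    label="fail",
    explanation="The patch fixed the reported bug but broke an existing unit "
                "test by changing the function's return type from list to generator.",
)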

The Memento Architecture

"It's almost like that movie Memento where the guy forgets what he learns and then he starts writing it down."
  • Persistent Rule Sets: Coding agents often start with empty rule files like CLAUDE.md or Cline rules. These files act as the external memory for the agent's best practices.
  • Automated Diffing: The system compares the old world of empty rules against a new world of generated instructions, so the agent stops repeating the same parsing errors. A minimal sketch of this rules-file update follows the list.
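A minimal sketch of persisting learned rules and diffing old against new, assuming a plain-text rules file such as CLAUDE.md; the file name and helper name are illustrative:

import difflib
from pathlib import Path

RULES_FILE = Path("CLAUDE.md")  # or a Cline rules file; typically starts out empty

def append_learned_rules(new_rules: str) -> str:
    """Persist newly learned rules and return a unified diff of old vs. new."""
    old = RULES_FILE.read_text() if RULES_FILE.exists() else ""
    new = (old + "\n" + new_rules) if old else new_rules
    RULES_FILE.write_text(new)
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="rules (old world)", tofile="rules (new world)", lineterm="",
    ))

print(append_learned_rules("- Run the repo's existing tests before finalizing a patch."))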

Eval Engineering Priority

"Writing really good evals is how you get the best insight into what you could do to improve your agents."
  • Judge Quality: The bottleneck for agent performance is the quality of the LLM-as-a-judge prompt. Better explanations from the judge lead to better rules out of the meta-prompt.
  • Optimization Loops: High-fidelity evals outperform generic optimizers by requiring fewer rollouts. Precision in the feedback phase reduces the need for brute-force compute.

Actionable Takeaways:

  • The Macro Shift: The transition from model-centric to loop-centric development. Performance is now a function of the feedback cycle rather than just the weights of the frontier model.
  • The Tactical Edge: Implement an LLM-as-a-judge step that outputs a "Reason for Failure" field, then feed this string directly into a meta-prompt to update your agent's system instructions automatically (see the sketch after this list).
  • The Bottom Line: Static prompts are technical debt. Teams that build automated systems to iterate on their agent's instructions will outpace those waiting for the next model training run.
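A minimal sketch of that tactical step, with a hypothetical judge-output shape rather than any specific library's API:

# Hypothetical judge output for one failed example.
judge_output = {
    "label": "fail",
    "reason_for_failure": "Edited the parser but never updated the regression "
                          "test fixtures, so two existing tests now fail.",
}

# Collect failure reasons and feed them straight into the meta-prompt that
# rewrites the agent's rules.
failure_notes = [judge_output["reason_for_failure"]]
meta_prompt = (
    "Here are recent failures and why they happened:\n"
    + "\n".join(f"- {note}" for note in failure_notes)
    + "\n\nRewrite the agent's rules so these mistakes are not repeated."
)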


Hi everyone. Thanks so much for coming. Today I'm excited. We're going to talk a little bit about prompt learning and how to use that with evals. If any of you are spending a lot of time thinking about the frontier coding models, there's so much attention on them. But what's not so obvious is how much time is actually spent on the system prompts by those building these coding agents. So here's a look at a tweet that went viral about the whole system prompt of Claude that was leaked. I'm sure they've changed it since then. But you can see there's Claude, there's Cursor, there's Cline, and just the length of the actual system prompt for each one of these. And what's not as obvious is that these aren't just static. They are repeatedly iterated on. And it's such an important piece of context that goes into making these coding agents the most successful agents out there.

It's not just us talking about it. Karpathy talks about it a lot, and there's a viral tweet that he posted about this paradigm of iterating on these prompts, which he's coined system prompt learning. What he said is that it almost feels like humans learning, because they take back English feedback and use that to iterate on what they should do differently the next time. And he wrote something like: it's almost like that movie Memento, where the guy forgets what he learns, so he starts writing it down and then uses that to get through his next day. So this is a little bit of the concept behind system prompt learning. What we wanted to do was show you a little bit of how that works, and then put it to the test on two of the most popular coding agents today, Claude Code and Cline.

So first off, how does prompt learning actually work? For those of you who are familiar with RL, I thought we'd do a little analogy comparing how RL works versus system prompt learning. For RL, take the analogy of a student who's trying to improve their exam scores. They take an exam, somebody grades the exam, and you have a scalar reward: they got a 70%, an 80%, a 90%. Then they have to figure out, almost blindly, just from that score, how to improve on the next exam. And I think this is one of the flaws. I mean, RL works, don't get me wrong, it's amazing in so many domains, but it can be a long path to figure out what the right solution is.

And some of the things we've noticed are that it can be sample inefficient. It takes a lot of data to get what you want. It's time intensive. It's data hungry. You need a whole data science team to do this. And it just might be overkill for teams who are trying to build agents, because LLMs are already so good. So if you're a team that's trying to build an agent, prompt learning might be a slightly more interesting paradigm for you.

So in this scenario, same analogy. You have a student who's taking an exam, and there's some exam score, except in this case what gets output isn't just the score, they got a 70 or an 80, you also get back some English feedback. Why did they get this answer wrong? What did they mess up on? Here are the concepts that they missed, and what they need to go study. Then they use this information to prepare for what to do next and get a better score. This is basically the concept that we applied to coding agents.

And we ran this kind of test on both Claude Code as well as Cline. Both of these, as you know, start off with some kind of system prompt, and in Claude Code this is a snippet of it. They both come with something that you can append rules to: Cline has its rules file and Claude Code has the CLAUDE.md file, and it starts off empty. You can go in and add whatever is important for your repo. So what we did was benchmark both Cline and Claude Code on SWE-bench. I'm going to run through this entire example on SWE-bench, but we also ran this whole thing on BBH and a ton of other software engineering datasets. You can see here that vanilla Cline and vanilla Claude Code, with nothing added to the CLAUDE.md or the Cline rules, resolved about 30% of the GitHub issues with Cline on Claude Sonnet 4.5, and about 40% of the GitHub issues with Claude Code.
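For context, here is a sketch of how you might pull the SWE-bench Lite problems for a run like this, assuming the Hugging Face datasets package; this is not necessarily the exact harness used in the talk.

from datasets import load_dataset

# SWE-bench Lite: real GitHub issues plus the repo state and reference patch.
swebench = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# The talk describes using roughly 150 examples as the training set.
train_tasks = swebench.select(range(150))
print(train_tasks[0]["problem_statement"][:300])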

So we took this as our starting benchmark, and the thesis was: could we use prompt learning to improve the system prompt, and would the new system prompt actually give us a better score on these benchmarks? We didn't do any fine-tuning, we didn't change the models, anything like that. It was just focused on the system prompt. This is the process that we went through. We took the coding agent and had it write some code. We ran unit tests, and then we passed that through to a model that was doing the LLM-as-a-judge evals. And I'll show you what that looks like. The LLM-as-a-judge eval gave back why it failed: did it fail because of this, can you give some examples of common scenarios where it didn't do well? Then we used those evals, added them to a meta-prompt, and came back with the system prompt rules that we're going to append.
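A rough sketch of that loop, with hypothetical callables standing in for the agent run, the unit tests, the judge, and the meta-prompt; none of these names come from a specific library.

from typing import Callable, Iterable

def prompt_learning_round(
    tasks: Iterable[dict],
    rules: str,
    run_agent: Callable[[dict, str], str],    # hypothetical: Claude Code / Cline run -> patch
    run_tests: Callable[[dict, str], str],    # hypothetical: execute repo tests -> test output
    judge: Callable[[dict, str, str], dict],  # hypothetical: LLM-as-a-judge -> label + explanation
    meta_prompt: Callable[[str, list], str],  # hypothetical: old rules + feedback -> new rules
) -> str:
    """One round: the agent writes patches, tests run, the judge explains failures,
    and the meta-prompt rewrites the rules that get appended to the system prompt."""
    feedback = []
    for task in tasks:
        patch = run_agent(task, rules)
        test_output = run_tests(task, patch)
        feedback.append(judge(task, patch, test_output))
    return meta_prompt(rules, feedback)

# Usage, with real implementations plugged in, starting from the empty "old world" rules:
# new_rules = prompt_learning_round(train_tasks, rules="", run_agent=..., run_tests=...,
#                                   judge=..., meta_prompt=...)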

So let's talk through the process. First we had the SWE-bench dataset, which in this scenario is just 150 examples. We did this for both Cline and Claude Code, where we took the original prompt, which had no rules, gave it the software engineering problem, and it generated some kind of patch to solve that. Then we ran the generated solution through the unit tests. Whatever the unit tests came back with, whether it was right or wrong, we then passed into an LLM-as-a-judge eval. And this is the most important part, because this actually generated the explanation for us. We passed in the problem statement, what the coding agent's solution was, the unit tests, and then the actual solution that it came up with. What you're looking at in the center here is the LLM-as-a-judge eval. Eval engineering is a whole concept that we spend a lot of time on, and writing really good evals is, I think, how you get the best insight into what you could do to improve your agents.
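Here is a minimal sketch of what such a judge prompt could look like. The wording, the field names, and the build_judge_input helper are illustrative, not the exact eval prompt from the talk.

JUDGE_PROMPT = """You are evaluating a coding agent's patch for a GitHub issue.

Problem statement:
{problem_statement}

Agent's patch:
{agent_patch}

Unit test results:
{test_output}

Reference solution:
{reference_patch}

Answer with a label, PASS or FAIL, followed by an explanation of exactly what
the agent got wrong (parsing errors, unhandled edge cases, broken tests, ...),
specific enough that a future system prompt rule could prevent the mistake.

Label:
Explanation:"""

def build_judge_input(example: dict, agent_patch: str, test_output: str) -> str:
    # `example` is assumed to carry SWE-bench-style fields like problem_statement and patch.
    return JUDGE_PROMPT.format(
        problem_statement=example["problem_statement"],
        agent_patch=agent_patch,
        test_output=test_output,
        reference_patch=example["patch"],
    )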

So in this scenario, what we did was write a good LLM-as-a-judge eval prompt. It output whether the solution failed or passed, and then, this is the key part, we asked for an explanation. Why did it actually mess up? For specific libraries in the SWE-bench Lite test it was parsing errors, or it was not handling something; there are all sorts of different categories of errors. We went through and looked at the explanation of what went wrong in each scenario. We then passed that into a huge meta-prompt. This is what's helping us iterate on our system prompt. We passed in the original Claude or Cline system prompt, we passed in the original rules, which for us started off empty, and then we passed in: here was the input, here was the LLM-as-a-judge eval, and here was the actual explanation from that eval. We passed all of that into the meta-prompt, and then we did a diff comparing the old world against the new world. Just to remember, the old world had the original Claude system prompt with no rules added or appended to it, and the new world had this entire set of generated rules about what to avoid, essentially what it had learned from all the mistakes it had made.
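And a sketch of that meta-prompt stage, again with illustrative wording and field names; the old-versus-new rules diff can be produced the same way as in the earlier rules-file sketch.

META_PROMPT = """You maintain the rules file for a coding agent.

Original system prompt:
{system_prompt}

Current rules (may be empty):
{current_rules}

Below are recent tasks, the judge's verdicts, and explanations of what went wrong:
{graded_examples}

Write an updated rules file: keep rules that still apply, and add concise new
rules that would have prevented the failures above. Output only the rules."""

def build_meta_prompt(system_prompt: str, current_rules: str, graded: list) -> str:
    # Each item in `graded` is assumed to hold the task, the judge's verdict, and its explanation.
    graded_examples = "\n\n".join(
        f"Task: {g['task']}\nVerdict: {g['label']}\nExplanation: {g['explanation']}"
        for g in graded
    )
    return META_PROMPT.format(
        system_prompt=system_prompt,
        current_rules=current_rules or "(empty)",
        graded_examples=graded_examples,
    )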

And then we ran this on the entire SWE-bench Lite again. What we saw was that, on 150 examples, we were able to get Claude Code up by 5% more GitHub issues resolved, and Cline up by 15%. And the key thing is that this was literally 150 examples of training data, used on the most powerful coding agents out there. So just think about the impact that could have for your agents.

Many of you in this room might be thinking: okay, prompt learning is cool, but how does it compare to GEPA? If you're familiar with DSPy, you've probably seen it; I don't know how it's pronounced, I've heard both ways. So GEPA, just in case you aren't familiar, is a prompt optimizer from DSPy that is essentially very similar to what we're talking about: taking English feedback and using that English feedback inside of the actual prompt. What we did was run a side-by-side benchmark where we compared our prompt learning against GEPA. What we saw was that GEPA required many, many loops and rollouts compared to a fraction of that with our approach. And I think the key difference here, I mean the underlying approach of using English feedback is the same, but the key thing that was really different was that we spent a lot of time developing and iterating on the evals, and the eval prompts really mattered for making sure that you gave really good explanations back to the agent. So eval engineering was super critical for us to be able to get this to work.

And if you're curious about learning more, check out our blog. We write a lot about evals and prompt optimization, and we're actively hiring, so come check us out.
