
by SallyAnn DeLucia & Fuad Ali, Arize
Quick Insight: This summary is for builders tired of brittle AI agents that fail on edge cases. It explains how to replace static instructions with a dynamic text-driven optimization loop that treats prompt engineering like machine learning.
SallyAnn DeLucia and Fuad Ali from Arize argue that agents do not fail because models are weak. They fail because instructions are static and environments are complex. By building a Prompt Learning Loop, builders can use LLM-as-a-judge feedback to turn failures into expertise.
"These LLMs are operating in the text domain... why wouldn't we use that rich text to improve?"
"15% improvement just through the addition of rules."
"The loop only works as well as your eval."
Podcast Link: Click here to listen

Hey everyone, gonna get started here. Thanks so much for joining us today. I'm Sally. I'm a director here at Arize. I'm going to be walking you through prompt learning. We're actually going to be building a text-driven optimization loop for the hands-on part of the workshop. I come from a technical background and started off in data science before I made my way over to product. I still like to be touching code today. I think one of my favorite projects that I work on is building our own agent, Alex, into our platform. So I'm very familiar with all of the pain points and how important it is to optimize your prompt. So I'm going to spend a little bit of time on slides. I like to just set the scene, make sure everybody here has context on what we're going to be doing, and then we'll jump into the code. Fuad, I'll let you do a little bit of an intro.
Yeah, thank you so much, Sally. Great to meet all of you. Excited to be walking through prompt learning with you all. I don't know if you got a chance to see the harness talk yesterday, but hopefully that gave you some good background on how powerful prompting and prompt learning can be. So, my name is Fuad. I'm a product manager here at Arize as well. And like Sally said, we like to stay in the code. We'll be doing a few slides, then we'll walk through the code, and we'll be floating around helping you guys debug and things like that. My background is also technical. I was a backend distributed systems engineer for a long time, so I'm no stranger to how important observability infrastructure really is, and I think AWS is an appropriate setting for that. So, yeah, excited to dive deep into prompt learning with you all. Thank you.
Awesome. All right, so we're gonna get started. Just to give you a little bit of an agenda of the things I'm going to be covering: we're gonna talk about why agents fail today, and what even is prompt learning. I want to go through a case study to kind of show y'all why this actually works. And we'll talk about prompt learning versus GEPA. I had a few people come up to me over the conference asking, what about GEPA? We have some benchmarking against that, and then we'll hop into our workshop. But with this I want to ask a question. How many people here are building agents today?
Okay, that's what I expected. And how many people actually feel like the agents they're building are reliable?
Yeah, that's what I also thought. So let's talk a little bit about why agents fail today. So why do they fail? Well, there are a few things that we're seeing with a lot of our folks, and that we're seeing even internally as we build with Alex, for why agents are breaking. I think that a lot of times it's not because the models are weak. A lot of times the environment and the instructions are weak. So: having no instructions learned from the environment, and no planning or very static planning. I feel like a lot of agents right now don't have planning. We do have some good examples of planning, like Claude Code and Cursor. Those are really great examples, but I'm not seeing it make its way into every agent that I come across. Missing tools is a big one. Sometimes you just don't have the tool sets that you need. And then missing tool guidance on which of the tools the agent should be picking. And then context engineering continues to be a big struggle for folks.
If I were to distill this out, I think it's these three core issues. First, adaptability and self-learning: no system instructions learned from the environment, which I touched on. Second, the determinism versus non-determinism balance: having planning, or no planning, versus doing very static planning. You want to have some flexibility there. And then context engineering, which I think is a term that just emerged in the last, you know, six to eight months, but it's something that's really, really important: missing tools, missing tool guidance, just not having context on your data, and not giving the LLM enough context. So those are the core issues, distilled.
But I think there's one other pretty important thing, and that is the distribution of who's responsible for what. So there are the technical users: your AI engineers, your data scientists, developers. They're really responsible for the code, the automation pipelines, actually managing the performance and costs. But then we have our domain experts, subject matter experts, AI product managers. These are the ones that actually know what the user experience should be. They're probably super familiar with the principles that we're actually building into our AI applications. They're tracking our evals and they're really trying to ensure the product's success. So there's this split in responsibilities: everybody is contributing, but there's this difference in terms of technical abilities. And with prompt learning, it's going to be a combination of all these things. Everybody's going to really need to be involved, and we can talk about that a little bit more.
So what even is prompt learning? I'm going to first go through some of the approaches that we borrowed from when we came up with prompt learning. This is something that Arize has been really, really dedicated to researching. One of the first things we borrow from is reinforcement learning. How many folks here are familiar with how reinforcement learning works?
All right, cool. So if I were to give a really silly kind of analogy: we have a reinforcement model. Pretend it's like a student's brain that we're trying to, you know, boost up. They're going to take an action, which might be something like taking a test, an exam, and there's going to be a score. A teacher is going to come through and actually, you know, score the exam; that's going to produce this kind of scalar reward. And pretend the student has an algorithm in their brain that can just take those scores and update the weights in their brain, which is the learning behavior there, and then we reprocess. So in this reinforcement setup, we're updating weights based off of some scalars. But it's really actually difficult to update the weights directly, especially in the LLM world. So reinforcement learning isn't going to quite work that well when we're doing things like prompting.
So, then there's metaprompting, which is very close to what we do with prompt learning, but still not quite right. Here with metaprompting, we're asking an LLM to improve the prompt. So again, we use that student example. We have an agent, which is our student, and it's going to produce some kind of output: a user asking a question and getting an output. That's our test in this example. And then we're going to score it. An eval is pretty much what you can think of there; it's going to output a score. And from there we have the metaprompt. So now the teacher is kind of like the metaprompt: it's going to take the result from our scorer and update the prompt based off of that. But it's still not quite what we want to do.
And that's where we introduce this idea of prompt learning. So prompt learning is going to take the exam and produce an output, and we're going to have our LLM evals on there. But there's also this really important piece, which is the English feedback. Which answers were wrong? Why were the answers wrong? Where does the student actually need to study? Really pinpointing those issues. And we're still asking an LLM to improve the prompt; it's just that the information we are giving that LLM is quite different. And so we're going to update the prompt there with all of this feedback, from our evals and from a subject matter expert going in and labeling, and use that to boost our prompt with better instructions and sometimes examples.
So this is kind of like the traditional prompt optimization, where we're treating it like ML: we have our data and we have the prompt, and we're saying, optimize this prompt and maximize our prediction outputs. But that doesn't quite work for LLMs; we're missing a lot of context. What we really found is that you need the human instructions of why it failed. So imagine you have your application data, your traces, a data set, whatever it is, and your subject matter expert goes in and they're not only annotating correct or incorrect. They're saying: this is why this is wrong, it failed to adhere to this key instruction, it didn't adhere to the context, it's missing something, whatever it is. And then you also have your eval explanations from LLM-as-a-judge, which is the same kind of principle, where instead of just the label, it provides the reasoning behind the label. And then we're pointing it at the exact instructions to change, changing the system prompt to help it improve, so that we then get, you know, prediction labels, but we also get those evals and explanations with them. So we're optimizing with more than just our output here.
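To make the loop concrete, here is a minimal sketch of what prompt learning can look like in code, assuming an OpenAI-style client. The function names, model choice, and meta-prompt wording are illustrative placeholders rather than Arize's actual implementation; the key point is that the optimizer receives the English explanations of failures, not just scores.

```python
# Minimal sketch of a prompt learning loop (illustrative; not Arize's implementation).
# Assumes an OpenAI-style client; model name and meta-prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.1"  # placeholder model name

def run_agent(system_prompt: str, user_input: str) -> str:
    """The 'student': run the agent with its current system prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_input}],
    )
    return resp.choices[0].message.content

def judge(user_input: str, output: str) -> dict:
    """LLM-as-a-judge: return a label AND an English explanation of why it failed."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": (
            f"Input: {user_input}\nOutput: {output}\n"
            "Label the output 'correct' or 'incorrect' and explain why in one sentence.\n"
            "Respond as: <label> | <explanation>")}],
    )
    label, _, explanation = resp.choices[0].message.content.partition("|")
    return {"label": label.strip(), "explanation": explanation.strip()}

def optimize_prompt(system_prompt: str, failures: list[dict]) -> str:
    """The 'teacher': rewrite the system prompt using the failure explanations, not just a score."""
    feedback = "\n".join(
        f"- input: {f['input']}\n  why it failed: {f['explanation']}" for f in failures)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": (
            f"Current system prompt:\n{system_prompt}\n\n"
            f"These examples failed, with reasons:\n{feedback}\n\n"
            "Rewrite the system prompt, adding general rules that would prevent these failures.")}],
    )
    return resp.choices[0].message.content
```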
And I think a really key learning that we've had is the explanations, whether from human instructions or through LLM-as-a-judge: that text is really, really valuable. I think that's what we see not being utilized in a lot of other prompt optimization approaches. They're either optimizing for a score or they're just paying attention to the output. But you can think of it this way: these LLMs are operating in the text domain. We have all this rich text that tells us exactly what it needs to do to improve, so why wouldn't we use that rich text to actually improve?

So that's kind of the basics of prompt learning, but everybody always comes up to me like, sounds great, Sally, but does it actually work? It does, and we have some examples. We did a little bit of a case study. Coding agents: everybody is pretty much using them at this point, and there are quite a few that have been really, really successful. I think Claude Code is a great example, Cursor too, but there's also Cline, which is more of an open version of this. And so we decided to take a look and compare to see if we could, you know, do anything to improve. So these are the baselines of where we started here.
You can see the difference between the different models; obviously some are using, you know, the state-of-the-art models there. But we also had this opportunity where Cline was using, you know, 4.1, and it was working decently well at around 30% versus 40%. So this is where we started, and we took a pass at optimizing the system prompt here. You can see what the old one looked like: it has no rules section. It was just very much, you are a coding agent, you're built on this model, you're here to do coding. But there were no rules, and so we took a pass at updating the system prompt. So there were all of these different rules associated: when dealing with errors or exceptions, handle them in a specific way; make sure that the changes align with, you know, the system's design; any changes should be accompanied by appropriate tests. So really just building in the rules that a good engineer would have, which were completely missing before.
And so we found that Cline performs better with the updated system prompt. Pretty simple; it's kind of the whole concept here. You can see across these different problems that things that were incorrect before are now being done correctly, just by simply adding more instructions. So it really demonstrates pretty well how those system prompts can improve things. And we benchmarked against SWE-bench Lite to get another coding benchmark for these coding agents, and we were able to improve by 15% just through the addition of rules. I think that's pretty powerful: no fine-tuning, no tool changes, no architecture changes. Those are the big things folks reach for when they're trying to improve their agents, but sometimes it's just about your system prompt and just adding rules. We've really seen that, and that's why we're really passionate about prompt learning and prompt optimization in general: it feels like the lowest-lift way to get massive improvement gains in your agent. 4.1 achieved performance near 4.5, which right now is pretty much considered state-of-the-art when it comes to coding questions, and at two-thirds of the cost, which is always really beneficial.
So these are some of the tables here; we will definitely distribute this so you can take a closer look. But the main point I want y'all to come away with is that 15% is a pretty powerful improvement in performance. Now, a question we get all the time: with prompt learning, we're taking these examples, and the way this works is we're going to take a data set. A lot of the time that data set is going to be a set of examples that didn't perform well, either because a human went through and labeled them and found that they were incorrect, or because you have your evals that are labeling them incorrect. So you've gathered all these examples, and that's what we're going to use to optimize our prompt. So I get this question all the time: well, aren't we going to overfit based off of these bad examples? But there's this rule of generalization, where prompt learning, done properly, enforces high-level, reusable coding standards rather than repo-specific fixes, and we are doing this train/test split to ensure that the rules generalize beyond just the local quirks of whatever our training data set is.
But if you think of it this way: you hire an engineer, right, to be an engineer at your company. You do kind of want them to overfit to the codebase that they're working on. So we feel that a better term for this kind of overfitting is expertise. We are, again, not training in the traditional sense. We are trying to build expertise, and as we'll talk about, this is not something we feel you do once. You're actually going to kind of continuously be running this. More problems are going to come up, and we're going to optimize our prompt for what the application is seeing now. So we don't actually think it's a flaw; we feel like it's expertise instead. We can adapt as needed, mirroring what humans would do if they were taking on a task themselves.
This is just another set of benchmarking, again on a diverse evaluation suite that focuses on tasks that are difficult for large language models, and we're again seeing success with our improvements. Now, GEPA just came out recently, and I think that's something everybody's really excited about. I think the previous DSPy optimizers were a little bit more focused on optimizing a metric, and as we talked about, we really want to be using the text modality that these applications are working in, which carries a lot of the reasons for how we need to improve. So we definitely wanted to do some benchmarking here. How many people are familiar with GEPA or have heard about it?
All right, cool. Well, I'll just give sort of a high level. The main difference from their other prompt optimizers is that GEPA is actually using reflection on the evaluations while it is doing the optimization. It's this evolutionary optimization, where there's Pareto-based candidate selection and probabilistic merging of prompts. What this really does under the hood is: we take candidate prompts and we evaluate them, then there's this reflection LM that's reviewing the evaluations and making some mutations, some changes, and kind of repeating until it feels like it has the right set of prompts. I think something that's important to note about GEPA is it doesn't really just choose one; it does try to keep the top candidates and then, you know, do the merging from there.
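For intuition, here is a very rough sketch of a generic evolutionary prompt-search loop of the kind described: keep a pool of candidates, evaluate them, have a reflection step propose a mutation, and keep the top performers. This is a simplification, not GEPA's actual implementation (it omits the Pareto-front selection and prompt merging); `evaluate` and `reflect_and_mutate` are assumed to be supplied by the caller.

```python
# Rough sketch of an evolutionary prompt search (simplified; not GEPA's actual code).
# `evaluate(prompt, data) -> float` and `reflect_and_mutate(parent, history) -> str`
# are assumed to be supplied by the caller.
import random

def evolutionary_optimize(seed_prompt, train_set, evaluate, reflect_and_mutate,
                          n_rounds=5, pool_size=4):
    candidates = [seed_prompt]
    for _ in range(n_rounds):
        scored = [(p, evaluate(p, train_set)) for p in candidates]   # score every candidate
        scored.sort(key=lambda ps: ps[1], reverse=True)
        survivors = [p for p, _ in scored[:pool_size]]               # keep the top candidates
        parent = random.choice(survivors)                            # pick a parent to evolve
        child = reflect_and_mutate(parent, scored)                   # reflection LM proposes a change
        candidates = survivors + [child]
    return max(candidates, key=lambda p: evaluate(p, train_set))
```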
But we benchmarked it, and prompt learning actually does do a little bit of a better job. And I think something that's really key is that it does it in a lower number of loops. Something we'll talk about in just a second is that it really does matter what your evals look like and how reliable they are. That's something we feel strongly about at Arize: you definitely want to be optimizing your agent prompts, but I think a lot of people forget that you should also be optimizing your eval prompts, because if you're using evals as a signal, you can't really rely on them if you don't feel confident in them. So it's just as important to invest there, making sure you're applying the same principles to your eval prompts that you apply to your agent prompts, so you have a really reliable signal that you can trust, and then feed that into your prompt optimization.
In both of these graphs, the pink line is prompt learning. We did also benchmark against MIPRO, their older optimization technique that I was mentioning, which functions more by optimizing around a score. And evals make the difference; I highlighted on this slide that it was with eval engineering that we were able to do this. We did have to make sure that the eval part of prompt learning was really high quality, because again, this only works if the eval itself is working. So, yep, evals make all the difference. Spend some time optimizing your eval prompt here. Again, it's all about making sure you have proper instructions; the same kind of rules apply.
So, I wanted to kind of walk through all of that. I know it's a lot of content, but I think it's really important to have context. Before we jump into the workshop, any questions I can answer about what I've discussed so far?
I have a question, or a comment. I think coding is the greatest example in terms of having the structure and evals. One thing I'm sort of curious about is whether you have other examples, sort of general prompts for conversational interactions with systems that are not as easily quantifiable. I'm just curious about any experience you guys have there.
Yeah. Is that for like evals in general?
Well, I think with coding it's clear how you would set up what the eval would look like, and I'm just wondering how you would do that for other types.

So the question is: is there any kind of guidance for how you should set up your evals? Coding seems like a very straightforward example; you kind of want to make sure the code's correct, right? But with some of these other agent tasks it's a little bit harder. The advice that I usually give folks is: we do have a set of out-of-the-box evals, and you can always start with things like QA correctness or focus on the task. But what I always suggest is getting all the stakeholders in the room, those subject matter experts, security, leadership, and really defining what success would look like, and then start converting that into different evaluations. An example is our agent Alex. I have some task-level evaluations. I really care: did it find the right data that it should have? Did it create a filter using semantic search or structured queries? Did it make the right tool call? And then I care: did it call things in the right order, was the plan correct? So it's thinking about what each step was. And then even security will say, well, we care how often people are trying to jailbreak Alex. So it's just taking each of those success criteria and converting it to an eval, and we do have different tools that can help you. But that's usually the framework I give folks: start with just defining success, and worry about converting it into an eval after.
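As one hypothetical example of that conversion, here is what a single success criterion ("did the agent make the right tool call?") might look like once written down as an LLM-as-a-judge template. The wording and placeholder names are illustrative, not one of the out-of-the-box evals mentioned above.

```python
# Hypothetical LLM-as-a-judge template for one success criterion: "did it make the right tool call?"
# The wording and placeholder names are illustrative.
TOOL_CALL_EVAL_TEMPLATE = """
You are evaluating a single step taken by an AI agent.

User request:
{user_request}

Tools available to the agent:
{tool_definitions}

Tool the agent actually called (with arguments):
{tool_call}

Was this the right tool, with the right arguments, for the request?
Answer with exactly one label, "correct" or "incorrect", followed by a one-sentence
explanation of your reasoning. The explanation is what gets fed back into prompt learning.
"""
```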
Yeah, just to add to that, maybe a more subjective use case: for example, Booking.com is one of our clients, and when they ask, what is a good listing for a property, what is a good picture? Defining that is really hard, right? To you, something might look like a very attractive listing for a hotel, but to someone else it might look really different. And sometimes, as Sally was alluding to, it's sufficient to just grade it as good or bad and then iterate from there. So: is this a good picture or a bad picture? Let the eval decide, and then break it down from there into specific feedback like, oh, this was dimly lit, the layout of the room was off, etc.
Yeah, that's actually building on the question I was going to ask, which is that you end up with a binary outcome, which doesn't necessarily give you a gradient to advance upon. Are you then effectively using those questions, like dimly lit or not, to get a more continuous space?

That's exactly right. And then from there, as you get more signal, you can refine your evaluator further and further, and you can actually put a lot of that in your prompting itself.

Right. So, yeah, I have two questions, and I'm not sure if I should ask both of them, or maybe your workshop will answer them. One is about rules and the rules section, or like operating procedures. I'm curious: do you just continuously refine that in the English language, and maybe reduce the friction of any contradictory rules? That's the first question. And the other is I would love to see the slide on evals, if you could just say a little bit more on how you approach that, because my issue in doing this work is whether or not to have a simulator of the product, where the simulator is evaluating, or to do what I'd like to do, which is an end-to-end evaluation that I build. I would love to hear you talk about that if you could.
Yeah, absolutely. So on the first one, about the instructions: it's definitely something you iterate on over time. A lot of times we take our best bet and write them by hand, right? And what we're trying to do with prompt optimization is leverage the data to dynamically change them. And it is, I think, great at removing redundant instructions, things like that. But the goal is that we want to move away from static instructions. We feel very confident that static instructions are not going to really scale; they're not going to lead to sustainable performance. So the idea with prompt learning is exactly that it's something you can run over time. We eventually see this as a long-running task, where you're building up examples of incorrect things, maybe having a human annotate them, and then the task is kind of always running, producing optimized prompts that you can then pull into production. It's a cycle that repeats over time.
Sorry, just to intervene. So are you saying that when you're doing this over a long period of time and you have examples, you're just feeding those back into your rules section?
Kind of. When we get to the actual optimization loop we're going to build, you'll see it as: you are feeding the data in, that's going to build a new set of instructions, and you would then, you know, push those to production to use.
Okay. I think your second question was around evals: where to start, how to write them, and how to optimize them. Is that right?
Yes. Yeah. So it's a very similar approach; it's just that the data you're reviewing is a little bit different. I should have pulled up the loops. Let me just try something really quick to show this. There we go. So this is how we see it: you have two co-evolving loops. I've been talking a lot about the one on the left, the blue one, where we're improving the agent, collecting failures, and sending those for fine-tuning or prompt learning. But you basically want to do the same thing with your evals, where we're collecting a data set of failures, but instead of thinking about the failures as the output of your agent, we're actually talking about the eval output. So having somebody go through and, you know, evaluate the evaluators, or using things like log probs as confidence scores, or jury-as-a-judge, to determine where the eval is not confident. We're doing the same thing: figuring out where your eval is low confidence, collecting that, annotating it, maybe having somebody go through and say, okay, this is where the eval went wrong. And so it's pretty much the same process for optimizing your eval prompt. I think folks think they can just grab something off the shelf, or write something once and then forget about it. But with this loop, I've said it a few times, the left loop only works as well as your eval.
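As a rough illustration of the log probs idea, using the judge's own token probabilities as a confidence score so that low-confidence eval labels get routed to a human annotator, here is a sketch assuming the OpenAI chat completions API with logprobs enabled. The model name, threshold, and single-token label trick are assumptions, not a prescribed setup.

```python
# Sketch: use the judge's token log probs as a rough confidence score, and flag
# low-confidence eval labels for human review. Model, threshold, and the
# single-token-label trick are illustrative assumptions.
import math
from openai import OpenAI

client = OpenAI()

def judge_with_confidence(eval_prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": eval_prompt}],
        max_tokens=1,            # assumes the label fits in one token (e.g. "correct")
        logprobs=True,
        top_logprobs=5,
    )
    token_info = resp.choices[0].logprobs.content[0]
    confidence = math.exp(token_info.logprob)      # probability of the emitted label token
    return {
        "label": token_info.token.strip(),
        "confidence": confidence,
        "needs_review": confidence < 0.8,          # route uncertain judgments to annotation
    }
```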
Sorry, I think my question is actually way more static and basic. When you talk about this orange circle, are you building a system or simulator for the eval, or are you just talking about a system prompt, user prompt, eval?
Yeah, right now what we're talking about is just the different prompts. You could definitely do simulation, but I think that's a whole different workshop.
Thank you. Any more questions before we get into the code? All right, let me switch back. So, here is a QR code for our prompt learning repo. I'll give everyone a few minutes to get set up with that, get it on your laptops. I know it's a little bit clunky with the QR code; I wasn't sure of a better way. I can also just show you here if you want to find it: it's in our Arize AI repo, under prompt learning, and you just want to clone that. We are going to be running it locally here.
Can you go back to the page with the URL?

Yes, sorry about that. Oh, the page with the URL. We'll give folks just a few minutes to get set up.

What's your process when you're building a new agent or workflow, anything that could be evaluated? Do you guys start by just trying something, prototyping, and then seeing where it's bad and then doing evals?
Yeah, I think there are different perspectives on this. Our perspective is that evals should never block you. You need to get started and you need to just build something really scrappy. We don't think you should, you know, spend a ton of time on evals up front. I think it's helpful to pull something out of the box in those situations, just because it's hard to comb through your data. That's something we've experienced with Alex: when you're getting started, just running a test and manually reviewing is kind of painful. So I think having evals is helpful but shouldn't be a blocker. Pull something off the shelf, maybe start with that, then as you're iterating and understanding where your issues are, you start refining your evals as you're refining your agent.
Yeah, one last question. So it makes sense to optimize the system prompt, but what about sub-agents or commands? How are you thinking about this in a multi-agent setup?
Yeah. So the question is: are you just doing one single prompt, or how do you think about this in a multi-agent setup? Right now we're thinking of these as independent tasks: you can optimize your prompts independently, and then run tests to get into the agent simulation of running them all together. So right now our approach is a little bit isolated, but I definitely see a future where we're going to meet the standard of sub-agents and everything else that's going on right now.
No, I think that's pretty accurate. And even in a single-agent use case versus a multi-agent use case, ultimately each of those agents may be specialized. They may have their own prompts that they need to learn from. So I think doing this in isolation still has benefits for the multi-agent system as a whole, and that can pass on over time in scenarios like hand-offs, etc., making something really, really specialized. Which comes back to what we were talking about with overfitting, again a question we get all the time, but really, you want to be overfit on your codebase as an engineer. You don't want to be so generalized that you're no longer good at picking up the specific quirks in your codebase.
Yeah. All right, everybody getting to the README? Okay. Anybody need any help? All right. So, we are going to be using OpenAI for this, so the next thing I'll have everyone do is probably spend some time just grabbing your API key, we'll get to it, and then I'll just start walking through our notebook here. We are going to be doing a JSON web page prompt example. You're going to find that under notebooks here, and we'll give everybody a second to pull it up. There are going to be just some slight adjustments we're going to add to this example to make it run a little faster and work a little better. First, what is this even doing? This is going to be a very simple example, just a JSON web page prompt. If anybody has a prompt or use case that they want to code along with, Fuad and I are absolutely glad to help adapt what you're working on to the use case here. It's something very simple, just to demonstrate the principles, and we are going to be using OpenAI, but we can definitely experiment; if you want to swap in any other providers, we can also definitely help you do that. The goal of this is essentially to iterate through different versions of a prompt using a data set, and we will optimize. So the first thing is, obviously, we need to do some installs. I'm just going to have you all update it: it says greater than 2.0.0, but we're actually going to pin, I think, 2.2 today. And then the next thing is just to make this run a little faster: we're going to run things async, which is missing, so you can go ahead and add these lines in the cell as well.
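If you're following along, the two tweaks just mentioned look roughly like this in a notebook cell. Treat the pinned version as a placeholder and match it to whatever the notebook's requirements actually say; nest_asyncio is the usual way to let async calls run inside a notebook's already-running event loop.

```python
# Pin the SDK version mentioned above (placeholder; match the notebook's requirements)
%pip install -q "openai==2.2.0"

# Let async calls run inside the notebook's already-running event loop
import nest_asyncio
nest_asyncio.apply()
```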
All right, is everyone following along? I never want to move too fast. Seems like head nods, cool. Let's talk about configuration. I talked about it a little bit when I was going through the slides: we are going to be doing some looping. The general idea is we start out with a data set with some feedback in it, and we'll look through the data set once we get it, but you're going to want to have either human evaluation, so annotations, either free text or labels, or you're going to want to have some evaluation data. The feedback is really important; that's what makes this work. We're going to then, you know, pass that to an LLM to do the optimization, and then it's basically going to run evals. So as it's optimizing, it's using that data set to run and assess whether or not it should keep optimizing, and it also provides you data that you can use to gauge which of the prompts it outputs to use in, you know, a production setting.

So we're going to do some configuration, and I've written out here what each of these means. We have the number of samples: this controls how many rows of the sample data set to use. You can set it to zero to use all the data, or use a positive number to limit it for faster experimentation. I think folks use different approaches here: sometimes you want to just move really quick so you set a low sample, sometimes you want to be a little bit more representative so you up it. I have it set to 100 here; feel free to adjust. The next thing is the train split. I think folks are probably pretty familiar with the concept of a train/test split, but it's just how much of the data we want to use for training, which again is what we're using to actually optimize, and how much we want to use for testing, when we're running the eval on the new prompt. Then there's the number of rules: basically how many of the generated rule sets to use for evaluation. This just determines which prompts to evaluate, since as we're running these loops we're outputting a bunch of different prompts. And then the key one here: the number of optimization loops. This sets how many optimization iterations to run per experiment; each loop generates outputs, evaluates them, and refines the prompt. So these just control the experiment scope, the data splitting, the prompt learning loop, and how much data we want to use. You can run these as they are, or adjust them if you'd like.

And then the next step is pretty simple: we're just going to grab that OpenAI key if you haven't already set it up. getpass is going to pop up. I'll show you here quick; it's going to pop up there, and you can just paste your API key in there. Before we start looking at the data a little bit: if anybody runs into any issues, just give us a wave.
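A sketch of what that configuration cell might look like; the variable names here are hypothetical stand-ins for the knobs just described, and the notebook's actual names may differ.

```python
# Hypothetical names for the configuration knobs described above; the notebook's
# actual variable names may differ.
import os
from getpass import getpass

NUM_SAMPLES = 100            # rows of the sample dataset to use (0 = use all data)
TRAIN_SPLIT = 0.8            # fraction used for optimization; the rest is held out for evaluation
NUM_RULESETS = 3             # how many generated prompt versions (rule sets) to evaluate
NUM_OPTIMIZATION_LOOPS = 3   # generate -> evaluate -> refine, repeated this many times

# Grab the OpenAI key via getpass if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```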
All right. I think we can get through this with the key I have, but if you have a free one you want to give me, I won't say no. All right, let's talk about the data. So we provided data for you with queries. You can see here that we're doing the 80/20 split based off of the configuration we set above. I'm just going to pull this train set here. Yeah, you're right, that's a mistake on my part; it is the 50. Let's take a look at what this data set looks like, just so folks can understand. We're starting here with just some basic input and output. In this train set we don't have any of the feedback in the rows that I printed out here, but you can imagine you can have different correctness labels here, explanations, any real validation data; it can be whatever you'd like it to be. Some folks use multiple eval feedbacks, sometimes it's a combination, but you really want to have, you know, the input and output that we'll use either way.
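For reference, the 80/20 split described above looks roughly like this, assuming the data is loaded as a pandas DataFrame with at least input and output columns and reusing the hypothetical NUM_SAMPLES and TRAIN_SPLIT knobs from the earlier sketch; the file path is a placeholder.

```python
# Rough sketch of the 80/20 train/test split described above (path and column names are placeholders)
import pandas as pd

df = pd.read_csv("sample_dataset.csv")   # placeholder; the notebook loads its own sample data
if NUM_SAMPLES:
    df = df.head(NUM_SAMPLES)

split_idx = int(len(df) * TRAIN_SPLIT)
train_set = df.iloc[:split_idx]          # used to optimize the prompt
test_set = df.iloc[split_idx:]           # held out to evaluate each new prompt version

train_set.head()                         # peek at the basic input/output rows
```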
Should my output of the train set be the same as yours?
Not necessarily, it depends. I didn't know if head was sorted or not. But if I did this, it should be the same for you, maybe just to make sure. Yeah, that's what you're seeing. Okay.

Quick question. Um, is it possible for the input to be like a chat history and not just a single input?
Great question. I think it depends on what it is you're trying to do. If you're doing just a simple system prompt with an input, you kind of want it to be one-to-one. You don't want to give it a ton of conversation data that's not relevant to the prompt you're optimizing. We generally just use the single input, but I think there are applications where you could do conversation-level inputs.
Yeah. Because quite often the failure is somewhere mid-conversation, right? So if you put just the original task in, then the probability of you hitting, you know, a failure in the middle of