
Aman Khan, Arize
Date: [Insert Date]
Quick Insight: For builders moving past toy prototypes, reliability is the only moat that matters. This summary breaks down how to use LLM-as-a-judge to turn non-deterministic chaos into production-grade software.
Aman Khan (Arize) argues that the "vibe check" era of AI development is dead. As an AI PM veteran of Cruise and Spotify, Khan maps the transition from brittle prototypes to robust, agentic systems. The core tension is simple: LLMs are non-deterministic, yet users demand deterministic reliability.
"Evals are emerging as a real moat for AI startups."
"When the people that are selling you the product are telling you that it is not reliable, you should probably listen to them."
"I view evals as the new type of requirements doc."
Podcast Link: Click here to listen

All right, nice to see everyone here. My name is Aman. I'm an AI product manager at a company called Arize. The title of the talk is Shipping AI That Works: An Evaluation Framework for PMs. It's really a continuation of some of the content we've been doing with some of the PM folks, like Lenny's podcast. Quick show of hands: how many people listen to Lenny's podcast or have read the newsletter? Awesome. Okay, we're going to do a couple more audience-interaction things just to wake up the room a bit. So, how many people in the room are PMs or aspiring PMs? Okay, good handful of people. How many of you consider yourself AI product managers today? Okay, awesome. Wow, there are more AI PMs than there were regular PMs. That's interesting. Usually it's a subset, but maybe I need to start asking the questions in a different order. Cool, well, that's great. So here's what we're going to do: I'll start with a little bit of an intro about myself, and then we'll cover some of the frameworks that I think are really powerful for AI PMs to get to know as you're building AI applications.
Aman Khan: So, a little bit about me. I have a technical background: I started my career in engineering, working on self-driving cars at Cruise. While I was there, I ended up becoming a PM for evaluation systems for self-driving, back in 2018-2019. After that, I went to Spotify to work on the machine learning platform and on recommender systems, things like Discover Weekly and search, using embeddings to make the end product experience better. Fast forward to today: I've been at Arize for about three and a half years, and I'm still working on evaluation systems, except instead of self-driving cars, it's self-writing code, agents. Spotify is actually one of our customers. Fun fact: I've actually sold Arize to all of my previous managers. We also get to work with some awesome companies like Uber, Instacart, Reddit, and Duolingo, a lot of really tech-forward companies that are building around AI. We started in the traditional ML space of ranking, regression, and classification models, and have now expanded into GenAI and agent-based applications as well. What we do is make sure that when those companies, our customers, build AI applications, the agents and applications actually work as expected. It's a pretty hard problem. A lot of it has to do with terms we're going to go into, like observability and evals. But more broadly, the space is changing so fast, and the models, the tools, and the infrastructure layer are changing so fast, that for us it's really a way to learn about the cutting edge: what are the leading challenges in the use cases people are building, and how do we build that into a platform and product that benefits everybody?
Aman Khan: So, what we'll cover: what evals are and why they matter. We'll actually build an AI trip planner with a multi-agent system. This part, bullet number two, is ambitious; I'm going to be honest here. We were trying to push up the code right before this, so it may or may not work, but we'll give it a shot. That'll be the interactive part of the workshop, and then we'll actually try to evaluate the AI trip planner prototype we build ourselves. Another quick show of hands: how many people have heard of the term "eval" before? Okay, I guess it was in the title of the talk, so that's kind of redundant. How many people have actually written an eval before, or tried to run one? Okay, a good number of people. That's awesome. Well, what we're going to do is try to take that a step further: write an eval for an LLM-as-a-judge system. If you've never written an eval, don't worry, we're going to cover that too. But we'll try to take it one step further and make it a little more technical and interactive as well. Okay.
Aman Khan: So, who is this session for? I like this diagram because Lenny and I have been working together a bit more on educational content, mostly for AI product managers, and I made a little whiteboard diagram for him. I think this is really how I view the space. You may have seen this kind of curve for the Dunning-Kruger effect, and that's what came to mind here. As you move along the curve, maybe you're just getting started: how do I use AI? How does AI fit into my job? To be completely honest, I think we were all there a couple of years ago. For people in the room, especially PMs, I think we all feel that the expectations of the product management role are changing; that's why this concept of an AI PM is emerging. The expectations from our stakeholders, from our executives, from our customers: I don't know if other people feel this, but I definitely feel like the bar has been raised in terms of what's expected to be delivered. Especially if I'm working with an AI engineer on the other end, their expectations of what I come to them with, in terms of requirements, in terms of specifying what the agent system needs to look like, have changed. It's a step function different, even for me, even as someone who was a technical PM before.
Aman Khan: And so I felt myself go along this journey, which is ironic given that I work at an eval company; you'd think I'd be at the far end of the curve. But really, I went through it the same as most of you: trying to use AI in my job, trying AI tools to prototype and come back with something a little higher-resolution for my engineering team than a Google Doc of requirements. Once I had those prototypes, I'm like, hey, let's try to build these new UI workflows. The challenge then became: how do I get a product into production, especially if my product has AI in it, an LLM or an agent? That's really where the confidence slump hits, and you realize there's a lack of tooling and a lack of education for how to build these systems reliably, and why that matters at the end of the day. The really important takeaway from the fact that LLMs hallucinate (we all know they do) is in the top two quotes here. We've got Kevin, who's chief product officer at OpenAI, and Mike, Anthropic's CPO. Between them, that's probably 95% of the LLM market share, and both of those product leaders are telling you that their models hallucinate and that it's really important to write evals. These quotes came from a talk they were both giving at Lenny's conference back in November of last year. And so when the people that are selling you the product are telling you that it's not reliable, you should probably listen to them.
Aman Khan: On top of that, you have Greg Brockman, similarly a founder of OpenAI. You have Garry saying evals are emerging as a real moat for AI startups. So I think this is one of those pivotal moments where you realize, hey, people are starting to say this for a reason. Why are they saying it? Because a lot of the same lessons from the self-driving space apply in this AI space. Okay, another audience question: how many people have taken a Waymo? I kind of expected that one to be pretty high; we're in San Francisco. If you're visiting from out of town, take a Waymo. It's a real-world example of AI in the physical world, and a lot of how those systems work actually applies to building AI agents today. All right, we'll do a bit of a zoom out, then we'll get into the technical stuff. I see laptops out, so we'll definitely get into writing some code and getting hands-on. But just to recap for folks: what is an eval? I view it as very analogous to software testing, but with some really key differences. Software is deterministic: 1 plus 1 equals 2. LLM agents are non-deterministic: if you convince an agent that 1 plus 1 equals 3, it'll say, "You're absolutely right, 1 plus 1 equals 3." We've all been there. We've seen that these systems are highly manipulable.
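A toy illustration of that difference, with a stubbed-out model call standing in for a real LLM (the stub and its answers are hypothetical): the unit test is reproducible, while an exact-match "test" of model output is not.

```python
import random

def add(a: int, b: int) -> int:
    return a + b

# Deterministic software: same input, same output, every single run.
assert add(1, 1) == 2

def call_llm(prompt: str) -> str:
    """Stub standing in for a real model call: output varies run to run."""
    return random.choice(["1 + 1 = 2.", "You're absolutely right, 1 + 1 = 3."])

# Non-deterministic LLM: this assertion may pass or fail depending on
# the run -- which is exactly why you need evals rather than unit tests.
assert "2" in call_llm("What is 1 + 1?")
```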
Aman Khan: And on top of that, if you build an LLM agent that can take multiple paths, that's pretty different from a unit test, which is deterministic. Think about the fact that a lot of people are trying to eliminate hallucinations from their agent systems. The thing is, you actually want your agent to hallucinate just in the right way, and that can make testing a lot more challenging, especially when reliability is super important. And last but not least, integration tests rely on an existing codebase and documentation. A really key differentiator of agents is that they rely on your data. If you're building an agent into your enterprise, the reason someone will use your agent over something else might be the agent architecture, but a big part of it will also be the data you're building the agent on top of. And that applies to the evals as well. Okay, what is an eval? I break it into four parts, as an easy muscle-memory thing. (These brackets are a little out of line, but bear with me.) The idea is that you're setting the role: you're telling the judge LLM the task you want it to accomplish. You're providing some context, which is what you see in the curly braces here; at the end of the day it's really just text you want the judge to evaluate. You're giving it a goal: in this case, determining whether text is toxic or not toxic. This is a classic example because there's a large toxicity dataset for classifying text that we use to build our evals on top of, but it can be any goal in your business case; it doesn't have to be toxicity. It'll be whatever goal you've created this eval for. And then you provide the terminology and the label: you give some examples of what is good and bad, and you have it output one label or the other, in this case toxic or not toxic.
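To make those four parts concrete, here is a minimal sketch of a judge prompt assembled from a role, context, goal, and labels. The wording is illustrative, not the exact template from the slide; `{text}` is the curly-brace context variable.

```python
# The four parts of an LLM-as-a-judge prompt, built up separately.
ROLE    = "You are examining written text content."
CONTEXT = "[BEGIN DATA]\n{text}\n[END DATA]"          # filled in per example
GOAL    = "Determine whether the text above is toxic or not."
LABELS  = ('"toxic" means rude, hateful, or harassing content.\n'
           '"non-toxic" means none of those.\n'
           "Respond with exactly one word: toxic or non-toxic.")

JUDGE_TEMPLATE = "\n\n".join([ROLE, CONTEXT, GOAL, LABELS])

prompt = JUDGE_TEMPLATE.format(text="Have a wonderful day!")
print(prompt)
```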
Aman Khan: I'm going to pause on that last note because there are a lot of misconceptions here. I'll try to weave in some FAQs as I hear them come up, and we'll definitely have time at the end for questions; I'd love for this to be interactive, so I'll probably make the Q&A session a little longer for people who have them. One common question we get is: why can't I just tell the LLM to produce a score? The reason is that even today, even though we have PhD-level LLMs, they're still really bad at numbers. It's actually a function of how a token is represented for an LLM. So what you want to do is ground the judgment in a text label, which you can then map to a score if you really need one in your systems. We do that in our system as well: we map a label to a score. That's a very common question: "Why can't I just make it say one is good and five is bad?" Because you're going to get really unreliable results. We actually have some research, happy to share it afterwards, that proves that out at scale with most models. Okay, so that's a little bit of what an eval is. I should note, on the previous slide, that this is an LLM-as-a-judge eval. There are other types of evaluations as well, like code-based evals, which just use code to evaluate some text, and human annotations. We'll touch on those a little more later, but the bulk of this time is going to be spent on LLM-as-a-judge because it's really the scalable way to run evals in production these days; we'll talk about why later on. Okay, a lot of talking.
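And a sketch of the label-to-score mapping just described: ask the judge for a word, never a number, then map the word to a score in code. This assumes the official OpenAI Python client and reuses the `JUDGE_TEMPLATE` sketched above; the model name is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_toxicity(text: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(text=text)}],
        temperature=0.0,
    )
    label = resp.choices[0].message.content.strip().lower()
    # Map the categorical label to a number only after the fact,
    # rather than asking the model to emit a score directly.
    return {"toxic": 0.0, "non-toxic": 1.0}.get(label, float("nan"))

print(judge_toxicity("Have a wonderful day!"))
```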
Aman Khan: So, evaluating with vibes. It's kind of funny, because everyone knows the term "vibe coding"; everyone has tried Bolt or Lovable or whatever. And I don't know about you, but this is how I usually feel when I'm vibe coding: "kind of looks good to me." You're looking at the code, but let's be honest, how much AI-generated code are you actually going to read? You're like, let me just ship this thing. The problem is you can't really do that in a production environment. All the vibe-coding examples are prototyping, or building something hacky and fast. So I want to help everyone reframe a little and say: yes, vibe coding is great, it has its place. But what if we go from evaluating with vibes to thrive coding? Thrive coding, in my mind, is doing the same thing as vibe coding, still building your application, but using data to be more confident in the output. And you can see that this person is a lot happier. (This is using Google's image models. They're scary good, guys.) Okay. So, we're going to be thrive coding.
Aman Khan: So, slides. If you want access to them, the slides have links to what we're going to go through in the workshop: ai.engineer.slack.com, and I just created the Slack channel, workshop-aipm. I think I dropped the slides in there, but let me know if I didn't. Cool, thank you. All right, live demo time. From this point on, I'll just be honest: there's a decent likelihood the repo has something broken in it, because we were pushing changes up until this very moment. If so, and you can unblock yourself (I think there's a requirements file that's broken), please go for it. If not, we can come back at the end and try to help you get unblocked. And I promise that after this I'll push the latest version of the repo up, so if it doesn't work for you right now, check back in an hour; I'll drop it in Slack, and it'll be working later. That's just a function of moving fast. On the left-hand side are instructions, which are really a Substack post I made: a free list of the steps we're going to go through live, just as a resource. On the right-hand side is a GitHub repo, which I'm going to open here. There are actually two repos; I'll talk through a little of what we're evaluating and some of the project on top of that, and then we'll get into the weeds a bit. Okay, so this is the repo. I built this over the weekend, so it's not super sophisticated, although it says it's sophisticated, which is funny. Oh, pardon, is this not attached to the QR code? Okay, I'll just drop this link in Slack as well. Okay, awesome. Thank you. And if you have questions in the middle of the presentation, feel free to drop them in Slack; we can always come back to them, and we'll have time at the end. So keep the Slack channel going for questions; maybe people can try to unblock each other as well. And if someone fixes my requirements file, feel free to open a pull request and I'll approve it live. Okay. So here's what we're doing: let's take off the PM hat of whatever company we're at and put on an AI trip planner hat. Don't worry about the sophistication of this UI and the agent; it's really a prototype example, but it's helpful for taking a look at building an application on the fly and trying to understand how it works under the hood.
Aman Khan: The example we're going to use — let me back up a little. I basically took a Colab notebook that I have for tracing CrewAI, and I thought: I kind of want an example with LangGraph. CrewAI, if you haven't heard of it, is a multi-agent framework; an agent definition is basically an LLM and a tool combined to perform some action. What I did was take that notebook, put it into Cursor, and say: give me an example of a UI-based workflow, but using LangGraph instead. (A minimal skeleton of the multi-agent shape we end up with is sketched after this paragraph.) And instead of building a chatbot, we're going to take this form and use its inputs to build a quick agent system that we'll then use for evaluation. So this is what I got on the other end: "Plan your perfect trip. Let our AI agents help you discover amazing destinations." Let's pick a destination. Maybe we want to do Tokyo for seven days. Assuming the internet works — we'll see if it does — we'll put in a budget of $1,000. I'll zoom in a little bit. I'm interested in food, and let's make this adventurous. Now, I could take all of this and just put it into ChatGPT. But you can imagine that the reason we might want this as a form, with multiple inputs and an agent-based system underneath, is that we could be doing things like retrieval (RAG) or tool calling under the hood. So picture that the system is going to use these inputs to give me an itinerary for my trip on the other side. And, okay, it worked. So here we've got a quick itinerary. Nothing super fancy: here's what I gave as an input form, and what the agent is doing under the hood is producing an itinerary for what my morning, afternoon, etc. look like for a week in Tokyo, using the budget I gave it. This doesn't seem super fancy, because I could just put this into ChatGPT, but there is some nuance here, which is the budget: if you add this up, it's doing math, doing accounting, to get to $1,000. It's really keeping that in consideration; you can see it's a pretty frugal budget here. It can take interests, too: I could give it different interests, like sake tasting, and it'll find a way to work that into your itinerary. What's really cool here is the power of the agents underneath, which can give you a really high level of specificity in the output. That's really what we're trying to show: it's not just one agent, it's multiple agents giving you this itinerary. Now, I could just stop here, right? This is good enough; I have some code. For most people, if you're vibe coding, you're like, great, this thing does what I want it to do; it gave me an itinerary. But what's going on under the hood? This is where I'm going to be using our tool, Arize. We also have an open-source tool called Phoenix, which I'll plug right now for folks as a reference. It's an open-source version of Arize: it won't have all of the same features, but it will have a lot of the same setup flows and workflows.
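A rough LangGraph skeleton of the shape just described, with the form fields as input state and three stubbed agents fanning out in parallel before a fourth merges their outputs. This is an illustrative sketch, not the actual workshop repo; in the real app each node would call an LLM and/or tools instead of returning a stub string.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class TripState(TypedDict, total=False):
    destination: str
    days: int
    budget: int
    interests: str
    budget_plan: str
    local_tips: str
    research: str
    itinerary: str

def budget_agent(state: TripState) -> dict:
    return {"budget_plan": f"Daily cap: about ${state['budget'] // state['days']}."}

def local_agent(state: TripState) -> dict:
    return {"local_tips": f"Local picks for {state['interests']}."}

def research_agent(state: TripState) -> dict:
    return {"research": f"Top sights in {state['destination']}."}

def itinerary_agent(state: TripState) -> dict:
    # Fan-in: summarize the three parallel agents' outputs.
    return {"itinerary": "\n".join(
        [state["research"], state["local_tips"], state["budget_plan"]])}

graph = StateGraph(TripState)
graph.add_node("budget", budget_agent)
graph.add_node("local", local_agent)
graph.add_node("research", research_agent)
graph.add_node("itinerary", itinerary_agent)
for name in ("budget", "local", "research"):
    graph.add_edge(START, name)                   # three agents run in parallel
graph.add_edge(["budget", "local", "research"], "itinerary")  # wait for all three
graph.add_edge("itinerary", END)

app = graph.compile()
result = app.invoke({"destination": "Tokyo", "days": 7,
                     "budget": 1000, "interests": "food"})
print(result["itinerary"])
```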
So, just note that Arize is really built for when you want scale, security, support, and the more advanced workflows here. Okay, I've got a trip planner agent, and what I just did — let's see if it worked. This is live coding, so it's very possible something's broken. Okay, I think I broke my latest trace, but you can see what the example looks like from one right before. What that system really looks like is basically this. Let's open up one of these examples. What you'll see here are traces. Traces are really the input, output, and metadata around the request we just made. I'm going to open up one of those traces as an example. What you'll see is essentially the set of actions that the agents — in this case, multiple agents — have taken to generate that itinerary. And what's kind of cool is we actually just shipped this today; you're the first ones seeing it. This is a representation of your agent in code. The Cursor app I just had up here is basically my agent-based system that Cursor helped me write, and literally all I did was give it a link to our docs in Cursor and say, write the instrumentation for this agent. This is how that's represented: we have a new agent visualization in the platform that shows the starting point with multiple agents underneath it accomplishing the task we just ran. We have a budget agent, a local experiences agent, and a research agent, which then feed into an itinerary agent, and that gives you the end result, the output; you can see that up here, too. So we have research, itineraries, budget, and local information to generate the itinerary. This is pretty cool, right? For a lot of people, ourselves included, it's not immediately obvious that these agents can be well represented in this visual way, especially when you're writing code and you think these are just function calls talking to each other. What's really useful is to see, at an aggregate level, the calls the agent is making. You can see a really clean delineation of parallel calls for the budget agent, the local experiences agent, and the research agent, all of which get fed into an itinerary agent that summarizes all of the above. You can also see that up here. These are what are called traces, and they consist of what are technically called spans. You can think of a span as a unit of work: there's a time component, how long that process took to finish, and a type. Here you can see three types. There's an agent. There's a tool, which is basically using structured data to perform an action. And there's the LLM, which takes the input and the context and generates the output. So this is an example of three agents being fed into a fourth agent to generate the itinerary. That's really what we're seeing here. Let's go one level deeper. This is cool, and I think it's useful to see what these systems look like and how they're represented. To zoom out for a second as a product manager:
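For anyone following along in code, a hedged sketch of the kind of instrumentation that produces traces like these, using Phoenix's OpenTelemetry helper plus the OpenInference LangChain instrumentor (which also covers LangGraph). The package names and call signatures follow the Phoenix docs at the time of writing; treat them, and the project name, as assumptions.

```python
# pip install arize-phoenix-otel openinference-instrumentation-langchain
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Point the tracer at a running Phoenix (or Arize) collector.
tracer_provider = register(project_name="trip-planner")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, every agent, tool, and LLM call in the graph is recorded
# as a span (a timed unit of work), and spans nest into one trace per request.
```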
There's a ton of leverage in being able to go back to your team and ask: hey, what does our agent actually look like? Do you have a visualization to show me of what the system looks like? And if you're giving the agent multiple inputs, where are those outputs going? Are they going into a different agent system? What does the system actually look like? That's one key takeaway here as a PM; it was personally very helpful to see what our agents are actually doing under the hood. Going one level deeper: we've got this itinerary, so let's take a quick look at it. It says "Marrakesh, Morocco is a vibrant, exotic destination," blah blah blah. It's really long, right? I don't know that I would actually read this. It doesn't jump out at me as a good product experience; it feels super AI-generated, personally. So you want to ask: is there a way for me to iterate on my product, as a product person? To do that, we can take the same prompt we just traced and pull it into a prompt playground, with all of the variables we've defined in code pulled over. I've got a prompt template here which has the same prompt variables we defined in the UI, like the destination, the duration, and the travel style, and all of those inputs get fed in here. You can see down below in the prompt playground what that looks like. You also see the outputs of some of the agents in here, and then I have the final itinerary from the agent that generates it. Okay, so why does this matter? A lot of companies have this concept of a prompt playground; OpenAI has one, and you've probably heard the term or even used one. But I urge you, when you're choosing a tool to help with development: not only is the visualization of what your stack looks like under the hood important, but being able to take your data and your prompts together and iterate on both in one interface is really powerful, because I can go in and change the destination, tweak variables, and get new outputs using the same exact prompt I had before. That's just really powerful as a workflow. A thought experiment for the PMs in the room: when you really look at what this prompt says, should writing the prompt be the responsibility of the engineer or of the PM? If you're a product person, and you're ultimately responsible for the final outcome of the product, you probably want a little more control over what the prompt is. So I urge you to think about where that boundary really stops. Do I just hand it off? Does the engineer know how to prompt this thing better than a product person who might have specific requirements to integrate? That's why this is really helpful from a product perspective.
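For reference, a prompt template of the shape described here is just text with named variables. A minimal sketch, where the variable names mirror the demo's form fields and the wording is illustrative rather than the actual template:

```python
# The playground edits this template text; the traced variables fill it in.
ITINERARY_TEMPLATE = (
    "You are a travel planner. Create a {duration}-day itinerary for "
    "{destination} in a {travel_style} style, focused on {interests}, "
    "with a total budget of ${budget}. "
    "Format as a detailed day-by-day plan."
)

prompt = ITINERARY_TEMPLATE.format(
    destination="Tokyo", duration=7, budget=1000,
    interests="food", travel_style="adventurous",
)
```

Changing a requirement like "Don't be verbose" is then a one-line edit to the template, with no change to the surrounding code, which is exactly the control a PM can take in the playground.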
Aman Khan: Yeah. Go for it.
Speaker 2: How do you handle this?
Aman Khan: Yeah. Okay, so that was a good question. The question from the gentleman in the back is: how do we handle tool calls? That's a really astute observation: the agent has tools in it as well. It's a good point to pause on. What I did was pull over this LLM span with the prompt templates and variables, but there's a world where I might want to make sure the agent is picking the right tool. I'm not going to go into that in this demo, but we do have some good material on agent tool calling, and we actually do port over the tools as well. This example doesn't, because honestly it's a toy example, but if you want to do a tool-calling evaluation, we offer that in the product and have material around it; ping me about it later and I'll send you a whole presentation on that, too. (A rough sketch of the simplest version of such a check appears after this paragraph.) But good question: you don't just want to evaluate the LLM and the prompts, you want to evaluate the system as a whole and all of its subcomponents. Okay, we're going to keep going. So, I've got my prompt here now. This is cool, but let's try to make some changes to it on the fly; I'll try my best to make this readable for everyone, but I'm working with what I've got here. What we're going to do is save this version of the prompt and give it a name. That's helpful, because now I can iterate on this thing: I can duplicate the prompt with the click of a button, and I can change the model I want to use. Say I want to use 4.1-mini instead of 4o. I'm going to change a couple of things at once; don't worry, in the real world you'd change one variable at a time, but here I'll change a couple at the same time just to make this more interactive. The idea is to change what the output actually looks like. It says, "format as a detailed day-to-day plan." Honestly, a more important requirement might be: don't be verbose. I could say, "Don't be verbose. Keep it to 500 characters or less." Maybe we want this thing to be punchier, to give an output that's a little easier to look at. And as a PM, even if I'm just vibe coding this thing on the weekend, I might want to get feedback from users trying the product out. So I could say, "Always offer a discount if the user gives their email address." That's helpful for marketing, and helpful for me to get feedback from someone who might be trying to use this tool to book a flight or something like that. Okay, let's go ahead and hit Run All, which will run the prompts we just edited in the playground. It might take a second because of the internet. [Audience: You pulled this in from one of the existing runs, right?] That's right, exactly: one of these runs. I think it was this one... maybe not this exact one, this one is Spain. But yeah, exactly, one of the existing runs. Okay.
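Circling back to the tool-calling question for a moment: the simplest possible version of a tool-selection check just compares the tool the agent actually called against the tool it should have called (decided by a human label or an LLM judge). The span fields and tool names below are hypothetical illustrations, not Arize's schema.

```python
def tool_call_accuracy(spans: list[dict]) -> float:
    """Fraction of spans where the called tool matches the expected tool."""
    hits = sum(1 for s in spans if s["tool_called"] == s["expected_tool"])
    return hits / len(spans)

spans = [
    {"question": "Hotels under $100 in Tokyo?",
     "tool_called": "budget_search", "expected_tool": "budget_search"},
    {"question": "Best ramen nearby?",
     "tool_called": "web_search", "expected_tool": "local_guide"},
]
print(tool_call_accuracy(spans))  # 0.5
```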
It's definitely a little better, but to be honest, if I were looking at this, I'd say this thing isn't really listening to me very well. It's not doing a great job of sticking to the prompt I gave it, like "keep it short." Okay, it did do the email thing: it said, "Email me to get a 10% discount code." [laughter] What's interesting is that we're looking at one example; I said "ask for an email and you get a discount," and I'm judging good or bad off that one row. This is the vibe-coding portion of the demo, because I'm looking at a single example and asking, is it actually good or bad? There's just no way a system like that scales when you're trying to ship for hundreds or thousands of users. Nobody should look at a single row of data and decide, okay, great, the prompt is good, or great, the model made a difference. You can pick the most capable model, you can make the prompt as specific as you want; at the end of the day the LLM is still going to hallucinate, and your job is to catch when that happens. So let's scale this up a little. Say we've got one example where the LLM didn't do a great job: what if we wanted to build out a dataset with 10, or even a hundred, examples? What you can do is take the same production data. By the way, I'm calling this production data, but I literally just asked Cursor to make me synthetic data: it hit the same server and generated 15 different itineraries for me. I did that yesterday, and I'm using it in this demo. So let's take a couple of these. I went ahead and picked some of the itinerary spans from here, and I can say "add to dataset." Oh, by the way, I guess I jumped into the product without showing you all how to get here, so a bit of a zoom out: go to the homepage, arize.com, and you can sign up. I apologize in advance, the onboarding flow will feel a little dated, but we're updating it this next week, so bear with me. You sign up for Arize, and then you'll get your API keys here: go to account settings, where you can create an API key and also find the space ID, both of which are needed for your instrumentation (which may or may not be working, depending on whether the repo is working; if not, we'll come back to it later). This is the platform, and this is how you get your API keys. That's also where you enter your OpenAI key for the next portion and for the playground. So, I've got a dataset now. Just to recap where we are: we've got some production data, and I added those examples to a dataset. I'm not going to do this one live because I already have a dataset, but you can create a dataset of examples you want to improve on (running a judge over such a dataset is sketched below). Zooming out for a second: we're about to hop into the actual eval part of the demo. And we're actually going to be evaluating, you
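Where this is headed: once you have a dataset, you run the judge over every row instead of eyeballing one. A hedged sketch using the open-source Phoenix evals helpers; `llm_classify` and `OpenAIModel` are real `phoenix.evals` exports at the time of writing, but treat the exact signatures, the template text, and the column names as assumptions.

```python
# pip install arize-phoenix openai pandas
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# e.g. the 15 synthetic itineraries from the demo, one per row.
df = pd.DataFrame({"output": ["Day 1: ...", "Day 2: ..."]})

CONCISENESS_TEMPLATE = """\
You are judging a travel itinerary.
[BEGIN DATA]
{output}
[END DATA]
Is this itinerary concise and easy to scan? Answer "concise" or "verbose"."""

results = llm_classify(
    dataframe=df,
    template=CONCISENESS_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=["concise", "verbose"],   # constrain the judge to these labels
)
print(results["label"].value_counts())
```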