Latent Space
January 9, 2026

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

The Gartner of Silicon: How Artificial Analysis Benchmarks the Intelligence Explosion by Latent Space

Author: Latent Space | Date: January 9, 2026

Quick Insight: This summary provides a roadmap for builders navigating the chaotic model market by identifying the true cost of intelligence. It explains why independent verification is the only way to separate lab hype from production reality.

  • 💡 Why is the cost per unit of intelligence crashing while total AI spend hits record highs?
  • 💡 How do "mystery shopper" policies prevent frontier labs from gaming their own benchmarks?
  • 💡 Can a model's "omniscience" be measured separately from its raw reasoning power?

swyx sits down with George Cameron and Micah Hill-Smith, the founders of Artificial Analysis. They are building the definitive data layer for the AI stack, helping developers choose between serverless inference and managed deployments.

The Intelligence Smile

"You can get intelligence at the level of GPT-4 for over 100 times cheaper than GPT-4 was at launch."
  • Crashing Intelligence Costs: GPT-4 level performance is now a commodity available at a fraction of its original price. Developers can now build complex apps that were economically impossible eighteen months ago.
  • The Spend Paradox: Total compute spend is rising because agentic workflows consume exponentially more tokens than simple chat. Efficiency gains are being reinvested into more complex reasoning loops rather than saved.
  • Hardware Efficiency Gains: Blackwell and future chips offer massive throughput improvements for large sparse models. Lower costs per token will enable even larger models to remain commercially viable.

The Trust Deficit

"There’s no use doing what we do unless it’s independent AI benchmarking."
  • Mystery Shopper Audits: Artificial Analysis runs benchmarks through anonymous accounts to ensure labs aren't serving "golden" models to known testers. Independent verification prevents labs from shipping optimized versions that don't match public performance.
  • The Goodhart Effect: Labs now target specific benchmarks like competition math to inflate scores. Builders must look past headline numbers to find models that solve specific economic tasks.

Measuring the Unknown

"It’s strictly more helpful to say I don’t know instead of giving a wrong answer."
  • The Omniscience Index: New metrics penalize incorrect answers more than silence to discourage hallucinations. This forces a shift in model training toward honesty rather than confident guessing.
  • Agentic Evaluation Shift: Benchmarks are moving from single-turn Q&A to multi-turn tasks like GDPval. Real-world utility is now measured by how many turns a model takes to solve a complex office task.

Actionable Takeaways

  • 🌐 The Macro Shift: The decoupling of parameter count from active compute via sparsity means intelligence is becoming a software optimization problem as much as a hardware one.
  • ⚡ The Tactical Edge: Audit your agentic workflows for turn efficiency rather than just cost per token.
  • 🎯 The Bottom Line: In a world of infinite tokens, the winner is the one who can verify the truth the fastest.


This is kind of a full circle moment for us in a way, because the first time Artificial Analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing. Which was January 2024. I don't even remember doing that, but yeah, it was very influential to me. Yeah, I'm looking at AI News for Jan 16 or 17, 2024. I said this gem of a models and hosts comparison site was just launched. And then I put in a few screenshots and I said, it's an independent third party, it clearly outlines the quality versus throughput tradeoff, and it breaks out by model and hosting provider. I did ding you for missing Fireworks: how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I've been following your progress. Congrats on an amazing year. You guys have really come together to be the presumptive new Gartner of AI. Okay, how do I pay you? Let's get right into that. How do you make money?

Well, very happy to talk about that. So, it's been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. First, we run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, and technologies across the AI stack for building stuff. We're very committed to doing that and intend to keep doing that. Along the way we have built a business that is working out pretty sustainably. We've got just over 20 people now and two main customer groups. We want to be who enterprises look to for data and insights on AI, so we help them with their decisions about models and technologies for building stuff. And then on the other side we do private benchmarking for companies throughout the AI stack who build AI stuff. No one pays to be on the website. We've been very clear about that from the very start, because there's no use doing what we do unless it's independent AI benchmarking.

Yeah. But it turns out a bunch of our stuff can be pretty useful to companies building AI stuff. And is it like: I'm a Fortune 500, I need an advisor for objective analysis, I call you guys, and you pull up a custom report for me? You come into my office and give me a workshop? What kind of engagement is that?

So we have a benchmarks and insights subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. For instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself. That's an example of the kind of decision big enterprises face, and it's hard to reason through; this AI stuff is really new to everybody, so with our reports and insights subscription we try to help companies navigate that. We also do custom private benchmarking, which is very different from the public benchmarking that we publicize, where there's no commercial model. For private benchmarking we'll at times create benchmarks or run benchmarks to specs that enterprises want, and we'll also do that sometimes for AI companies who have built things and want help understanding what they've built, mainly through the expertise we've developed by trying to support everybody publicly with our public benchmarks.

Yeah. Let's talk about the tech stack behind that. But okay, I'm going to rewind all the way to when you guys started this project. You were all the way in Sydney. Yeah. Well, Sydney, Australia for me. George was in SF; he's Australian but had already moved here. Yeah. And I remember that Zoom call with you. What was the impetus for starting Artificial Analysis in the first place? You started with public benchmarks, so let's start there and we'll get to the private stuff.

Yeah. Why don't we go back a little bit to why we thought it was needed. The story kind of begins in 2022, 2023. Both George and I had been into AI stuff for quite a while. In 2023 specifically, I was trying to build a legal AI research assistant. It actually worked pretty well for its era, I would say, but I was finding that the deeper you go into building something with LLMs, the more each bit of what you're doing ends up being a benchmarking problem. I had this multi-stage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it. As you build that out, you're trying to think about accuracy and a bunch of other metrics, plus performance and cost, and mostly no one was doing anything to independently evaluate all the models, and certainly not to look at the trade-offs for speed and cost.

So we basically set out to build a thing that developers could look at to see the trade-offs between all of those things, measured independently across all the models and providers. Honestly, it was probably meant to be a side project when we first started doing it. We didn't get together and say, "Hey, we're going to stop working on this other stuff and this is going to be our main thing." When I first called you, I think you hadn't decided on starting a company yet. That's actually true. George hadn't quit his job; I hadn't quit working on my legal AI thing. It was genuinely a side project. Yeah, we built it because we needed it as people building in the space, and we thought, "Oh, other people might find it useful too." So we bought a domain, linked it to the Vercel deployment that we had, and tweeted about it. Very quickly it started getting attention. Thank you, swyx, for doing an initial retweet and spotlighting the project that we released. It was useful to others, and it became more useful very quickly as the number of models being released accelerated.

We had Mixtral 8x7B, and it was a key one. That's a fun one. Yeah. An open source model that really changed the landscape and opened people's eyes to other serverless inference providers, to thinking about speed and thinking about cost, so the site became more useful quite quickly. Yeah. What I love about talking to people like you who sit across the ecosystem is that I have theories about what people want, but you have data, and that's obviously more relevant. But I want to stay on the origin story a little bit more. When you started out, the status quo was that every paper would come out and report its numbers versus competitor numbers, and that's basically it. And I remember I did the legwork. I think everyone has some version of an Excel sheet or Google Sheet where you just copy and paste the numbers from every paper and post it up there, and sometimes they don't line up, because they're independently run, so your numbers are going to look better, or your reproductions of other people's numbers are going to look worse because you don't hold their models correctly, or whatever the excuse is. I think Stanford HELM, Percy Liang's project, would also have some of these numbers, and I don't know if there's any other source that you can cite. If I were to start Artificial Analysis at the same time you guys started, I would have used EleutherAI's eval harness. Yep. That was some cool stuff.

At the end of the day, running these evals, if it's a simple Q&A eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy, but it turns out there are an enormous number of things that you've got to control for. Back when we started the website, one of the reasons we realized we had to run the evals ourselves, and couldn't just take results from the labs, was that they would all prompt the models differently. When you're competing over a few points, in the extreme you can practically put the answer into the model, and you get crazy cases, like back when Google did Gemini 1.0 Ultra and needed a number that was better than GPT-4, and constructed, I think never published, 32 chain-of-thought examples for every topic in MMLU to get the score. There are so many things. They never shipped Ultra, right? On this one, I'm sure it existed, but yeah. So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models. And we were also dead certain from the start that you couldn't look at those in isolation; you needed to look at them alongside the cost and performance stuff.
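To make the "same prompt, same grading" idea concrete, here is a minimal sketch of that kind of harness, assuming a generic `ask_model(model_name, prompt)` callable rather than any particular provider SDK; the questions, template, and grading below are illustrative and are not Artificial Analysis's actual pipeline.

```python
# Minimal sketch of the "run it yourself, same prompt for everyone" idea.
# `ask_model(model_name, prompt)` is a stand-in for whatever provider client
# you use; the key point is that the prompt template and grading logic are
# fixed across all models, rather than taken from each lab's own reporting.
from typing import Callable

QUESTIONS = [
    {"prompt": "What is the capital of Australia?", "answer": "Canberra"},
    {"prompt": "What is 17 * 23?", "answer": "391"},
]

PROMPT_TEMPLATE = (
    "Answer the question. Respond with the answer only, no explanation.\n\n"
    "Question: {prompt}\nAnswer:"
)

def run_eval(
    models: list[str],
    ask_model: Callable[[str, str], str],
) -> dict[str, float]:
    """Score each model on the same questions with the same prompt and grader."""
    scores: dict[str, float] = {}
    for model in models:
        correct = 0
        for q in QUESTIONS:
            reply = ask_model(model, PROMPT_TEMPLATE.format(prompt=q["prompt"]))
            # Deliberately simple grading, applied identically to every model.
            if reply.strip().lower() == q["answer"].lower():
                correct += 1
        scores[model] = correct / len(QUESTIONS)
    return scores
```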

Yeah. Okay, a couple of technical questions. Obviously I also thought about this, and I didn't do it because of cost. Did you not worry about cost? Were you funded already? Clearly not, but you know. No, we definitely weren't at the start, so we were paying for it personally. The numbers weren't nearly as bad a couple of years ago, so we certainly incurred some costs, but we were probably on the order of hundreds of dollars of spend across all the benchmarking we were doing. It was kind of fine. These days that's gone up an enormous amount, for a bunch of reasons we can talk about. But it wasn't that bad, because remember that the number of models we were dealing with was hardly any, and the complexity of what we wanted to do to evaluate them was a lot less. We were just asking some Q&A-type questions, and one specific thing is that for a lot of evals we were initially just sampling an answer directly without letting the models think. We weren't even doing chain-of-thought stuff initially, and that was the most useful way to get some results at the time.

Yeah. And for people who haven't done this work: literally parsing the responses is a whole thing, right? The models can answer any way they see fit, and sometimes they actually have the right answer but return it in the wrong format, and they will get a zero for that unless you work it into your parser, which involves more work. And there's an open question whether you should give points for not following your instructions on the format. So it depends what you're looking at. If you're trying to see whether or not a model can solve a particular type of reasoning problem, and you don't want to test its ability to do answer formatting at the same time, then you might want to use an LLM answer-extractor approach to make sure you get the answer out no matter how it answered. But these days it's mostly less of a problem. If you instruct a model and give it examples of what the answers should look like, it can get the answers into your format, and then you can do a simple regex.
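As a rough illustration of the parsing step just described (a sketch, not any benchmark's actual extractor): try the instructed format first, then fall back to a looser pattern so a correct answer in the wrong format isn't automatically scored as zero.

```python
import re

def extract_choice(reply: str) -> str | None:
    """Pull a multiple-choice letter (A-D) out of a free-form model reply.

    Tries the instructed format first ("Answer: C"), then falls back to the
    last standalone letter mentioned, so a correct but badly formatted answer
    isn't automatically graded as wrong.
    """
    # Preferred: the model followed the "Answer: X" instruction.
    m = re.search(r"Answer:\s*\(?([A-D])\)?", reply, flags=re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: last standalone A-D anywhere in the reply.
    letters = re.findall(r"\b([A-D])\b", reply.upper())
    return letters[-1] if letters else None
```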

Yeah. And then there are other questions: sometimes with a multiple-choice question there's a bias towards the first answer, so you have to randomize the options. All these nuances: once you dig into benchmarks, you wonder how anyone believes the numbers on these things, because it's such dark magic. You've also got different degrees of variance on different benchmarks, right? If you run a four-option multiple-choice eval on a modern reasoning model at the temperature suggested by the labs for their own models, the variance you can see is pretty enormous if you only do a single run, especially if it has a small number of questions. So one of the things we do is run an enormous number of repeats of all of our evals when we're developing new ones and doing upgrades to our Intelligence Index, so that we can dial in the right number of repeats and hit the 95% confidence intervals we're comfortable with. When we pull that together, we can be confident in the Intelligence Index to at least as tight as plus or minus one at 95% confidence.
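For a back-of-the-envelope feel for the repeats math, here is a simplified sketch that treats each question as an independent Bernoulli trial; that assumption ignores per-question difficulty and is not their actual statistical procedure, but it shows how repeat counts trade off against confidence-interval width.

```python
import math

def ci_halfwidth_points(p: float, n_questions: int, repeats: int) -> float:
    """Approximate 95% CI half-width, in points out of 100, for a score of p
    on an eval with n_questions, averaged over `repeats` independent runs.
    Treats each question as an independent Bernoulli(p) trial, which is only
    a rough sizing estimate."""
    se = math.sqrt(p * (1 - p) / (n_questions * repeats))
    return 1.96 * se * 100

def repeats_needed(p: float, n_questions: int, target_points: float = 1.0) -> int:
    """Smallest repeat count whose approximate 95% CI is within +/- target_points."""
    r = 1
    while ci_halfwidth_points(p, n_questions, r) > target_points:
        r += 1
    return r

# Example: a 1,000-question eval where the model scores around 60%:
# repeats_needed(0.6, 1000) -> 10 runs to get to roughly +/- 1 point.
```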

Yeah. Again, that just adds a straight multiple to the cost. Yeah, and that's one of many reasons that cost has gone up a lot more than linearly over the last couple of years. We report a cost to run the Artificial Analysis Intelligence Index on our website, and currently that assumes one repeat, in terms of how we report it, because we want it to reflect the weighting of the index. But our actual cost is a lot higher than what we report there because of the repeats. Yeah. And probably this is true, but just checking: you don't have any special deals with the labs? They don't discount it? You just pay out of pocket, or out of your customer funds?

Oh, there is a mix. So the issue is that sometimes they may give you a special endpoint. Which, 100%. Yeah. Exactly. So we are laser focused, in everything we do, on having the best independent metrics and making sure that no one can manipulate them in any way. There are quite a lot of processes we've developed over the last couple of years to make that true, including for the one you bring up right here: if we're working with a lab and they're giving us a private endpoint to evaluate a model, it is totally possible that what's sitting behind that black box is not the same as what they serve on a public endpoint. We're very aware of that. We have what we call a mystery shopper policy, and we're totally transparent with all the labs we work with about this: we will register accounts not on our own domain and run both intelligence evals and performance benchmarks without them being able to identify us. And no one's ever had a problem with that, because a thing that turns out to be quite a good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either.

That's true. I never thought about that. I was in the database industry before, and there are a lot of shenanigans around benchmarking, right? So I'm just going through the mental laundry list. Did I miss anything else in this category of shenanigans? I mean, okay, the biggest one that I'll bring up is more of a conceptual one, actually, than direct shenanigans. It's that the things that get measured become the things that get targeted by what they're trying to build. Right. Exactly. That doesn't mean anything we should really call shenanigans; I'm not talking about training on the test set. But if you know that you're going to be graded on a particular thing, and you're a researcher, there are a whole bunch of things you can do to try to get better at that thing. Preferably those are also helpful for the wide range of ways actual users want to use the thing you're building, but they won't necessarily be.

So, for instance, the models are exceptional now at answering competition maths problems. There is some relevance of that type of reasoning, that type of work, to how we might use modern coding agents and such, but it's clearly not one for one. So the thing we have to be aware of is that once an eval becomes the thing everyone's looking at, the scores can get better on it without that reflecting the overall generalized intelligence of these models getting better. That has been true for the last couple of years, and it'll be true for the next couple of years. There's no silver bullet to defeat that other than building new stuff to stay relevant and measure the capabilities that matter most to real users. Yeah. And we'll cover some of the new stuff that you guys are building as well, which is cool. You used to just run other people's evals, but now you're coming up with your own, and that is obviously a necessary path once you're at the frontier and have exhausted all the existing 101-level ones. I think the next point in history I have for you is AI Grant, which you guys decided to join, and the move here. What was it like? I think you were in, like, batch two.

Batch four. Batch four, okay. I mean, it was great. Nat and Daniel are obviously great, and it's a really cool group of companies that we were in AI Grant alongside. It was really great to get Nat and Daniel on board. Obviously, they've done a whole lot of great work in the space with a lot of leading companies, and they were extremely aligned with the mission of what we were trying to do. We're not quite typical of a lot of the other AI startups they've invested in, and they were very much here for the mission of what we want to do. Did they give any advice that really affected you in some way, or were any of the events particularly impactful?

That's an interesting question. I remember fondly a bunch of the speakers who came in for fireside chats at AI Grant, which is also a crazy list. Yeah. Oh, totally. There was something about speaking to Nat and Daniel about the challenges of working through a startup, working through the questions that don't have clear answers, and how to work through those methodically and get through the hard decisions. They've been great mentors to us as we've built Artificial Analysis. Another benefit for us was that other companies in the batch, and other companies in AI Grant, are pushing the capabilities of what AI can do right now. Being in contact with them and making sure Artificial Analysis is useful to them has been fantastic for helping us work out how we should build out Artificial Analysis to continue being useful to those building on AI.

To some extent I'm of mixed opinion on that one, because your target audience is not people in AI Grant, who are obviously at the frontier. To some extent, yes, but a lot of what the AI Grant companies are doing is taking capabilities coming out of the labs and trying to push the limits of what they can do across the entire stack to build great applications, which actually makes some of them pretty archetypical power users of Artificial Analysis, and some of the people with the strongest opinions about what we're doing well, what we're not doing well, and what they want to see next from us. Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models, and maybe switching reasonably frequently between models for different parts of your application, to optimize what you're able to do with them at an accuracy level and to get better speed and cost characteristics. So many of them are not commercial customers of ours (we don't charge for all that data on the website), but they are absolutely some of our power users.

So let's talk about the evals as well, right? You started out with the general MMLU and GPQA stuff. What's next? How do you build up to the overall index? What was in V1, and how did you evolve it?

Okay, so first, just as background: we're talking about the Artificial Analysis Intelligence Index, which is our synthesis metric that we currently pull together from 10 different eval datasets to give what we're pretty confident is the best single number to look at for how smart the models are. It obviously doesn't tell the whole story; that's why we publish the whole website of charts, to dive into every part of it and look at the trade-offs. But it's the best single number. Right now it has in it a bunch of Q&A-type datasets that have been very important to the industry, like the couple you just mentioned. It's also got a couple of agentic datasets, our own long-context reasoning dataset, and some other use-case-focused stuff. As time goes on, the things we're most interested in, the capabilities that are becoming more important in AI and that developers care about, are first around agentic capabilities. So, surprise surprise, we're all loving our coding agents, and how models perform there, and on similar things for different types of work, is really important to us. Linking to economically valuable use cases is extremely important to us. And then there are things the models still struggle with, like working really well over long contexts, that are not going to go away as specific capabilities and use cases we need to keep evaluating.
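As a toy illustration of what a synthesis metric means mechanically, here is a sketch that combines normalized per-eval scores into one number with a weighted average; the dataset buckets and weights are invented for the example and are not the published Intelligence Index methodology.

```python
# Toy composite index: weighted average of normalized per-eval scores (0-100).
# Bucket names and weights below are illustrative placeholders only.

WEIGHTS = {
    "qa_knowledge": 0.3,    # e.g. MMLU-style Q&A datasets
    "agentic_tasks": 0.4,   # multi-step / tool-use evals
    "long_context": 0.3,    # long-context reasoning
}

def intelligence_index(scores: dict[str, float]) -> float:
    """Weighted average of eval scores, each already normalized to 0-100."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# intelligence_index({"qa_knowledge": 85, "agentic_tasks": 60, "long_context": 70})
# -> 0.3*85 + 0.4*60 + 0.3*70 = 70.5
```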

Mhm. But I guess one thing I was driving at was V1 versus V2, and how the index has changed over time to get where we are. Yeah, I think that reflects the change in the industry, right? So that's a nice way to tell the story. Well, V1 would be completely saturated right now by almost every model coming out, because doing things like writing the Python functions in HumanEval is now pretty trivial. It's easy to forget how much progress has been made in the last two years. We obviously play the game constantly of today's version versus last week's and the week before, and all the small changes in the horse race between the current frontier and who has the best smaller-than-10B model right now, this week. That's very important to a lot of developers and people, especially in this particular city of San Francisco. But when you zoom out: a couple of years ago, literally most of what we were doing to evaluate the models would be 100% solved by even pretty small models today. And that, by the way, has been one of the key things driving down the cost of intelligence at every tier of intelligence, which we can talk about more in a bit. So V1, V2, V3: we made things harder, we covered a wider range of use cases, and we tried to get closer to things developers care about, as opposed to just the Q&A-type stuff that MMLU and GPQA represented.

Yeah. I don't know if you have anything to add there, or we could just go right into showing people the benchmark, clicking around and asking questions about it. Yeah, let's do it. Okay. This would be a pretty good way to chat about a few of the new things we've launched recently, and a little bit about the direction we want to take it and push benchmarking. Currently the Intelligence Index and our evals focus a lot on raw intelligence, but we want to diversify how we think about intelligence. The new evals we've built and partnered on focus on topics like hallucination, and there are a lot of topics that we think are not covered by the current eval set but should be, so we want to bring those forward. But before we get into that...

So, for listeners, just as a timestamp: right now number one is Gemini 3 Pro high, followed by Claude Opus at 70, GPT-5.1 high (you don't have 5.2 yet), and Kimi K2 Thinking. Wow, still hanging in there. So those are the top four. That will date this podcast quickly. Yeah. I mean, I love it. This time next year we'll look back and go, how cute. A quick view of that is... okay, there's a lot. I love this chart. This is such a favorite, right? At almost every conference we put this one up first just to situate where we are in this moment in history. This, I think, is the visual version of what I was saying before about zooming out and remembering how much progress there's been. If we go back to just over a year ago, before o1, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing, and the game was very, very different. If we go back even a little bit before then, we're in the era where, when you look at this chart, OpenAI was untouchable for well over a year. And you would remember that time period well: there were very open questions about whether or not AI was going to be competitive at all, whether OpenAI would just run away with it, whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs. I am quite happy overall that the world we have ended up in is multi-model and strictly more competitive every quarter over the last few years.

Yeah, this year has been insane. Yeah, you can see it. This chart with everything added is hard to read currently; there are so many dots on it, but I think it reflects a little of what we felt and how crazy it's been. Why 14 as the default? Is that a manual choice? Because you've got ServiceNow in there, which is a less traditional name. Yeah, it's the models that we're highlighting by default in our charts and in our Intelligence Index. Okay, so you just have a manually curated list. Yeah, that's right. But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose which models are shown. So if we take off a few names, it gets a little easier to read. Yeah. And I love that you can see the o1 jump, look at that, September 2024, and the DeepSeek jump, which got close to OpenAI's leadership. They were so close, I think.

Yeah, we remember that moment around this time last year, actually. Well, in a couple of weeks. It was Boxing Day in New Zealand when DeepSeek V3 came out. We'd been tracking DeepSeek and a bunch of the other less well-known global players over the second half of 2024 and had runs on the earlier models. I very distinctly remember Boxing Day in New Zealand, because I was with family for Christmas, running the benchmarks and getting results back one by one on DeepSeek V3. This was the first of their V3 architecture, the 671B MoE, and we were very, very impressed. That was the moment we were sure DeepSeek was no longer just one of many players but had jumped up to be a thing. The world really noticed when they followed it up with the RL working on top of V3 and R1 succeeding a few weeks later. But the groundwork was absolutely laid with an extremely strong base model, completely open weights, which we had as the best open weights model on Boxing Day last year.

Yep. Boxing Day is the day after Christmas, for those who don't know. I'm from Singapore; a lot of us remember Boxing Day for a different reason, the tsunami that happened. Of course. Yeah. But that was a long time ago. So this is the rough pitch of AAQI, or is it AQI, or AAII? Good memory, though. Once upon a time we did call it the Quality Index, and we would talk about quality, performance, and price, but we changed it to intelligence. There have been a few naming changes. We added hardware benchmarking to the site, so we have benchmarks at a system level, and we changed our throughput metric to what we now call output speed, since throughput makes more sense at a system level and took that name. Got it. Take me through more charts. What should people know? Obviously the way you look at the site is probably different from how a beginner might look at it.

That's fair. There's a lot of fun stuff to dive into. We have lots and lots of evals, so maybe we can skip past those, but the interesting ones to talk about today are a few of our recent things that probably not many people are familiar with yet. The first of those is our Omniscience Index. This one is a little different from most of the intelligence evals that we run. We built it specifically to look at the embedded knowledge in the models and to test hallucination by looking at what happens when the model doesn't know the answer: when it's not able to get it correct, what's its probability of saying "I don't know" versus giving an incorrect answer? The metric we use for omniscience goes from negative 100 to positive 100, because we simply take off a point if you give an incorrect answer to a question. We're pretty convinced this is an example of where it makes the most sense to do that, because it's strictly more helpful to say "I don't know" instead of giving a wrong answer to a factual knowledge question. And one of our goals is to shift the incentives that evals create for models and for the labs trying to get higher scores.
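A minimal sketch of that scoring rule as just described: plus one for a correct answer, zero for an abstention, minus one for an incorrect answer, scaled to a range of negative 100 to positive 100. How an abstention is detected is simplified here for illustration.

```python
def omniscience_score(results: list[str]) -> float:
    """Score graded outcomes ('correct', 'idk', 'incorrect') on a -100..+100
    scale: +1 for a correct answer, 0 for abstaining, -1 for a wrong answer."""
    points = {"correct": 1, "idk": 0, "incorrect": -1}
    return 100 * sum(points[r] for r in results) / len(results)

# Example: 50% correct, 30% "I don't know", 20% wrong -> 100 * (0.5 - 0.2) = 30.
# Guessing on the 30% abstentions and getting them all wrong would drop the
# score to 0, so abstaining beats confidently wrong answers under this rule.
```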

And almost every eval across all of AI up until this point has been graded by simple percentage correct as the main metric, the main thing that gets hyped, so you should take a shot at everything; there's no incentive to say "I don't know." So we did that for this one. I think there's a general field of calibration as well, like the confidence in your answer versus the rightness of the answer. Yeah, completely agree. One reason we didn't put that into this index is that we think the way to do it is not to ask the models how confident they are. I don't know, maybe it could be; you could give it a JSON field called "confidence" and maybe it spits out something. Yeah. You know, we have done a few eval podcasts over the years, and we did one with Clémentine of Hugging Face, who maintains the open source leaderboard, and this was one of her top requests: some kind of hallucination / lack-of-confidence calibration thing. So hey, this is one of them. And, like anything we do, it's not a perfect metric or the whole story of everything you think about as hallucination, but it's pretty useful and has some interesting results. One thing we saw in the hallucination rate is that Anthropic's Claude models are at the very left-hand side here, with the lowest hallucination rates out of the models we've evaluated omniscience on.

That is an interesting fact. I think it probably correlates with a lot of the previously not-really-measured vibes stuff that people like about some of the Claude models. Is the dataset public, or is there a held-out set? There's a held-out set for this one. We have published a public test set, but we've only published 10% of it. The reason is that for this one specifically it would be very easy to have data contamination, because it is just factual knowledge questions. We will update it over time to also prevent that, but we've kept most of it held out so that we can keep it reliable for a long time. It lets us do a bunch of really cool things, including breaking results down quite granularly by topic. We've got some of that disclosed publicly on the website right now, and there's lots more coming in terms of our ability to break out very specific topics. Yeah, I would be interested. Let's dwell a little on this hallucination one. I noticed that Haiku hallucinates less than Sonnet, which hallucinates less than Opus. Would that be the other way around in a normal capability environment? I don't know, what do you make of that?

One interesting aspect is that we've found there's not really a strong correlation between intelligence and hallucination rate. That is to say, how smart the models are in a generalist sense isn't correlated with their ability to say they don't know when they don't know something. It's interesting that Gemini 3 Pro Preview was a big leap here over Gemini 2.5 Flash and 2.5 Pro. And if I add Pro quickly here, I bet Pro is really good. Actually, no, I meant the GPT Pros. Oh yeah, because the GPT Pros are rumored (we don't know this for a fact) to be something like eight runs with an LLM judge on top. Yeah. So we saw a big jump in accuracy; this is just the percent they get correct, and Gemini 3 Pro knew a lot more than the other models. So there's a big jump in accuracy, but relatively no change in hallucination rate between the Google Gemini models across releases. Exactly. And so it's likely down to a different post-training recipe behind the Claude models.

Yeah, that's what's driven this. Yeah. You can partially blame us and how we define intelligence, having until now not counted hallucination as a negative in the way we think about intelligence. And so that's what we're changing. I know many smart people who are confidently incorrect. Look, that is very human, very true, and there's a time and a place for that. I think our view is that hallucination rate makes sense in this context, where it's about knowledge, but in many cases people want the models to hallucinate, to have a go. Often that's the case in coding, or when you're trying to generate newer ideas. One eval that we added to Artificial Analysis is Critical Point, and it's really hard physics problems. Okay. And is it sort of like a HumanEval type or something different, or like a FrontierMath type?

It's not dissimilar to FrontierMath. These are research questions that academics in the physics world would be able to answer but that models really struggle with. The top score here is 9%. The people who created this, like Minway, and actually Ofir, who was behind SWE-bench (what organization is this? It's Princeton, plus a range of academics from different institutions, really smart people), talked about how they turn the temperature up as high as they can when they're trying to explore new ideas in physics with a model as a thought partner, just because they want the models to hallucinate and sometimes maybe get something new. So it's not right in every situation, but we think it makes sense to test hallucination in the scenarios where it makes sense.
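For readers who haven't touched that knob: temperature is just a sampling parameter on the API request. Here is a hedged sketch using the OpenAI-style Python client; the model name and prompt are placeholders, and each provider documents its own allowed temperature range.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Crank temperature up when you *want* the model to speculate, e.g. as a
# brainstorming partner on open research questions; keep it low when grading
# factual evals where hallucination is penalized.
response = client.chat.completions.create(
    model="gpt-4o",          # placeholder model name
    temperature=1.5,         # higher temperature -> more diverse, riskier samples
    messages=[
        {"role": "user", "content": "Propose unconventional approaches to X."},
    ],
)
print(response.choices[0].message.content)
```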

Well, so the obvious question is: this is one of many. Every lab has a system card that shows some kind of hallucination number, and you've chosen not to endorse that and to make your own, and that's a choice. Totally. In some sense the rest of Artificial Analysis is public benchmarks that other people can independently rerun; you're providing a service here. You have to fight the "well, who are we to do this?" question, and your answer is that you have a lot of customers. But I guess, how do you converge the industry on one number that everyone actually agrees is the rate? Because you have your numbers, they have their numbers, and never the two shall meet. I mean, I think for hallucination specifically there are a bunch of different things that you might reasonably care about and that you'd measure quite differently. We've called this the omniscience hallucination rate; we're not trying to declare it the definitive one. Humanity's Last Hallucination... you could have some interesting naming conventions and all this stuff. The bigger-picture answer, and something I actually wanted to mention just as George was explaining Critical Point as well, is that as we go forward we are building evals internally, and we're partnering with academia and with AI companies to build great evals. We have pretty strong views, in various ways, for different parts of the AI stack, about where there are things that are not being measured well.
