Latent Space
December 31, 2025


The State of Evals: LMArena's $100M Vision

Author: Anastasios Angelopoulos, LMArena by Latent Space

Date: December 31, 2025


This summary is for builders and investors navigating the "vibe check" economy of AI model performance. It explains how Arena turned a Berkeley basement project into the industry's most trusted arbiter of truth.

This episode answers:

  • 💡 Why did Arena raise $100M: was it just to fund free inference?
  • 💡 How did a randomly named model like "Nano Banana" move Google's market cap?
  • 💡 Can a public leaderboard maintain integrity while transitioning into a commercial entity?

Anastasios Angelopoulos explains how Arena evolved from a Berkeley research project into the definitive "North Star" for AI evaluation. By capturing millions of organic human preferences, they created a benchmark that is impossible to "game" through traditional overfitting.

The Integrity Moat

"[The] public leaderboard that we show on LMArena I think of as a charity."
  • 💡 Zero Pay-to-Play: Models appear on the leaderboard regardless of whether providers pay or perform well. This ensures the platform remains a neutral arbiter rather than a marketing arm for big labs.
  • 💡 Organic Feedback Loops: Unlike static benchmarks, Arena relies on real users asking their own messy questions. This creates a living dataset that reflects actual utility rather than synthetic test scores.
  • 💡 Statistical Ground Truth: With tens of millions of monthly conversations, the Elo ratings represent a massive consensus. This scale makes the rankings a reliable proxy for product-market fit.

The Multimodal Frontier

"Multimodal models are going to become some of the most economically valuable aspects of AI."
  • 💡 Visual Reasoning Value: Image and video generation are moving beyond art into high-value enterprise sectors like marketing and design. This shift turns creative AI into a core productivity tool for content creators.
  • 💡 Expert Vertical Expansion: Arena is now tracking performance in specialized fields like medicine and law. This allows developers to target specific high-value niches rather than chasing general intelligence.

Scaling the Vibe Check

"Every user is earned every single day."
  • 💡 Infrastructure Overhaul: Moving from Gradio to React was necessary to support complex features like custom video notifications. This technical debt repayment allows for a more sticky consumer experience.
  • 💡 The Retention Engine: Persistent chat history was the primary driver for user sign-ins. Providing immediate utility through memory is the simplest way to compete with incumbents.

Actionable Takeaways

  • 🌐 The Macro Shift: The transition from static benchmarks to "Vibe-as-a-Service" means model labs must optimize for human delight rather than just loss curves.
  • The Tactical Edge: Use Arena’s open-source data releases to fine-tune models on real-world prompt distributions.
  • 🎯 The Bottom Line: In a world of synthetic data and benchmark saturation, human preference is the only remaining scarce resource for validating frontier capabilities.
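The tactical point about fine-tuning on real-world prompt distributions can be sketched as follows. This is a hypothetical example: the JSONL record shape (a `conversation` list with `role`/`content` fields) and the `sample_prompts` helper are assumptions for illustration, not the schema of any actual Arena data release.

```python
import json
import random

def sample_prompts(jsonl_lines, n, seed=0):
    """Pull opening user prompts from a conversation-data release.

    The field names (`conversation`, `role`, `content`) are
    illustrative; check the schema of the release you download.
    """
    prompts = []
    for line in jsonl_lines:
        record = json.loads(line)
        for turn in record.get("conversation", []):
            if turn.get("role") == "user":
                prompts.append(turn["content"])
                break  # keep only the opening prompt of each conversation
    random.Random(seed).shuffle(prompts)  # deterministic subsample
    return prompts[:n]

# Tiny in-memory stand-in for a downloaded JSONL data release.
release = [
    json.dumps({"conversation": [{"role": "user", "content": "Fix my SQL join"},
                                 {"role": "assistant", "content": "..."}]}),
    json.dumps({"conversation": [{"role": "user", "content": "Draft a contract clause"}]}),
]
prompts = sample_prompts(release, 2)
```

A sampled set like this gives a fine-tuning mix weighted by what real users actually ask, rather than by a synthetic benchmark's prompt distribution.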


All right, we're here with Anastasios from Arena. Congrats on all the success. You got the arena handle, big branding moment. I think X is becoming more commercial, so obviously you bought it, but at least you have a place to go where you can say, "Hey, we really like this." But I do think dropping the LM has changed the feel of it.

The reason we kept the LM at the beginning is because we started as LMs, right, out of the LMSYS conglomerate at Berkeley. So: language models. But we wanted to broaden a little bit. And we were the first Arena, so we feel like, let's try to own that.

Last time we had you guys on, you hadn't really spun out yet. I did a call with Allesia and I was like, these guys are going to start a company. I think you actually were already started at the time. An was your founding CEO, which people don't know. An is a very interesting character. We have a podcast scheduled with him.

He does a lot more than normal VCs. He's been incredible to us. The way the company started was as an incubation by An. What he did is he found us at Berkeley, picked us out of the basement, and was like, "Hey, these guys seem like they're on to something," and started working with us really early. He gave us some grants. He was not the only one to do this; we also had a great grant from Sequoia. But An was particularly supportive and gave us resources to continue building out Arena before we were even committed to starting a business. In that capacity he formed an entity for us and said, hey, you guys can walk away at any time if you don't want to start a business. It was really incredible.

Very aggressive investment move by him, right? Because of course, any money he spent, at the end of the day we could walk away and leave him with nothing. But I think he wisely knew that the right thing for Arena was to start a company out of it, that it was the only way we could scale, and that Wei-Lin, Ion, and I would ultimately see that and be excited about doing it ourselves, which ended up being the case.

Was there a moment where you were debating it yourself? You had other opportunities. What was the deciding factor for you? It became clear that the only way to scale what we were building was to build a company out of it, and that the world really needed something like Arena: a place to measure, understand, and advance frontier AI capabilities on real-world users and real-world usage, based on organic feedback.

In order to achieve the scale and distribution necessary, and of course the quality of platform necessary to do this effectively, we would need to start a company. We considered other options. Do we keep doing this as an academic project? Do we do it as a nonprofit? But ultimately, under those constructs, we didn't feel we'd have the resources necessary to accomplish our mission.

So you raised $100 million. That's a lot of resources. What's it for? First of all, we don't necessarily need to spend all that money, right? The purpose of money at a company is to give you cards to flip: to say, hey, you have enough resources that if your first bet fails, you can make another bet, and another bet. So that's not to say we're going to spend all of it. Of course, you want to spend responsibly.

Having said that, the platform is actually quite expensive to run. We fund all of the inference on the platform. You pay market rates? They don't give you a discount? No, we get discounts, but they are standard enterprise discounts, the same as would be given to any other customer.

Have you disclosed any numbers? I see numbers of votes, but what's that in, like, monthly tokens? I don't know about tokens; I'd have to back that out. We have more than 5 million users now, five, six million. We've had probably 250 million conversations over the course of the platform's life, and we're on the order of mid tens of millions of conversations happening every month. It's actually one of the largest consumer platforms for LLMs. Of course, nothing really compares to ChatGPT.

The benefit of this is that it's actually quite a diverse population. 25% of the people on our platform, for example, do software for a living, still, at this scale. We either survey them or we analyze the prompt distribution coming into the platform. We've done something called Expert Arena, which is trying to understand the distribution of experts coming to the platform. A lot of the usage can be unauthenticated, but about half of our users are now logged in, so we have some ability to understand them. We also run surveys on the platform that tell us a bit more about who the actual users are. Of course, there's always response bias in surveys, so you have to take it with a grain of salt.

You're not the only player. There's Artificial Analysis, and there's this arena.ai started by some crypto people. Have you had a conversation with them, like, hey, this is our thing, or let's work together on something? No, you know, I've actually talked to both groups; both seem fine. Am I missing any major players? It's just those. I think those are the large ones, depending on how you define the term. Artificial Analysis obviously has huge market mind share around the analysis of different AI systems.

They told me they want to be the Gartner of AI. That's their goal, and I think they're going after that consulting market. Artificial Analysis, from what I understand, is a team of consultants doing this. They seem like really nice guys going after that particular market. It's different from our platform in the sense that their analysis is based on aggregating public benchmarks, independently rerunning them, and turning those into analytics and reports that educate the field on the performance of all these different models.

They have arenas, but the arenas are not based on organic usage. The thing that distinguishes our platform from theirs is that the users are actually inputting their own use case; they're actually asking their own question. That gives a level of realism their platform doesn't have. They specialize in a slightly different thing, and I see those platforms diverging in that sense.

Sometimes that's the only way to do it. For example, for AA, their video arena uses pre-generated videos; you can't enter your own video. We do it organically. As a voter it does help, in that I don't have to wait. It does, but also, why would you go? Do you actually care about other people's videos? Form your own intuition. Maybe you're interested in comparing. I'm a terrible prompter; I learn by example.

Many prompters are much better than me. People have all sorts of cool ways of prompting. It's educational to see. The only way to learn is to look at other people's prompts, see the results, and go, "Oh, I didn't know you could do that."

Let's go back to Arena. Oh, one thing I do want to say is the number one use of funds is getting off Gradio. Gradio is an incredible platform; it scaled us to a million users. We're really grateful to Gradio for taking us this far. Eventually, it became time for us to move off of that and go to React. I'm sure Hugging Face would have loved you to stay on.

Was there a technical reason? You just couldn't get the performance? It just became hard to develop. There were all these tools we wanted in React, all the fancy things you can do in React. One example: say we wanted to create our own custom loading icons for video, with notifications. How are we going to do that in Gradio? It's hard. In React, you make a custom component. I'm sure the Hugging Face guys are going to come in and say, hey, you can do that in Gradio, and maybe you can, but also fewer developers know it. How are we going to hire for that? We'd have to reskill people; they're less familiar with that stack. So it's full React now.

Other uses of funds that might be interesting? That's basically the deployed resources: it's primarily inference, which funds the free usage of the platform, and then hiring headcount. We have an office; it's in SF. I'll tackle one of the major things this year, which I'm sure you're tired of thinking about, but for people who are not in the loop, this is going to be news to them: the Leaderboard Illusion, the whole thing with Cohere. Let's summarize what they said and then your response.

The Leaderboard Illusion is a paper that critiques LMArena, pretty brutally. Unscientifically. Cohere wasn't doing that well; Cohere was at like 74. It's all good; it's actually a respectable place they had on the leaderboard. I don't even think it was really the Cohere model developers doing this; it was more their research side. But in any case, what does the Leaderboard Illusion say? It says that LMArena was doing this undisclosed, quote unquote, private testing on our platform: that model providers send us pre-release models and we expose them, and so on, and that this creates so-called inequities in the leaderboard due to that pre-release testing. For example, they cited that Meta at some point tested a number of models with us. Of course, we can't disclose all the details of how that was done, but that is the main claim of the paper.

Our response to that paper is online; you can find our response to the Leaderboard Illusion. It essentially points out a series of factual mistakes in the paper that question the validity of the claims. You can go look at the first version of the paper on arXiv yourself and you'll see the claims. I think most scientists would view that as, oh, they've corrected it. Of course, but they didn't correct everything. They just corrected some aspects that were blatantly unscientific and false.

For example, they said that we only sampled like 9% open-source models and 60% closed-source models, and that this created a gap between open and closed source. In reality, we're actually really supportive of open-source models, and it was more like 60/40. So that was one example of an error in the claims. Another example is that they claimed there was some sort of bias introduced by this pre-release testing and that it was undisclosed. In reality, as you probably know, we've been doing pre-release testing for a long time. Our community loves it. They love getting the secret code names.

The secret code names, like Nano Banana and all that. So Nano Banana, by the way, started on you. Started on us, right? And people loved it. It was like a global sensation, a non-zero fraction of the global population using LMArena. Was naming it banana your decision, or theirs? It was their decision, I believe. But it was this sort of randomly generated thing, and it just went. Apparently it's named after Nana, who's a PM; her nickname is Nana, and she put the banana on it. I didn't know the origin story.

To us it just looked like a random thing. But it was clearly head and shoulders above everything else. Before that there was Reve Image, remember? And also BFL. All those models are also great, and I think those teams are improving quite quickly, but Nano Banana was a sensation. That moment alone changed Google's market cap; billions of dollars moved in Google stock because of Nano Banana, and now there's an OpenAI code red and everything. The Information reported this. I would say image generation has been this weird part of AI overall, because it's not strictly AGI-critical. It's not reasoning, it's not feeding more context into the model; it's the model generating a visual representation. I always think, well, Gemini used to get a lot of complaints for generating racist images or whatever, which was a hilarious moment, and others have had similar issues in the past.

I'm like, well, can we just get rid of this? Do we have to do image generation? Let's just focus the positive reputation of AI in general on language models and coding and the other stuff. I was wrong. I'm such a huge Nano Banana Pro shill now. I totally agree; I was also kind of wrong about this. I didn't see the positive benefits, but actually I think these multimodal models are going to become some of the most economically valuable aspects of AI, both in consumer and in enterprise, because one of the fastest-growing market segments in AI adoption is marketing and design.

I'm a content creator. I'm sure you're using it all the time. Infinite supplies of diagrams and explainers and infographics. Soon we're not even going to be making them for our papers; our paper figures are just going to be generated. They actually one-shot it. So DeepSeek came out with V3.2 recently. Their explanations are very wordy; they write very concise papers, 23 pages long but very dense. I took their explanations of the RL environment stuff, fed it into Nano Banana Pro, and got an image that I used to understand the paper better. The fact that I can casually generate a paper-quality diagram, which would usually take a PhD student a month in Photoshop or something, is incredible.

I want to ask about your principles running Arena. You manage a giant community, 5 million MAU. What have you decided are the core principles, before becoming a company and now that you're a company? I don't know if anything has changed for you. I don't think anything has really changed. We want to provide the North Star of the industry and center the use cases of real users, foreground those, so that people know what to target. The goal is to create a benchmark that is constantly fresh, that does not suffer from overfitting because we constantly have new data points coming in, that tracks all the new models and all the new use cases of AI, and gives the world a ground truth for how real users are using these models and how good the models are on those use cases.

We continue to do quite a few open-source data releases. We've probably released more data than basically anybody on the real-world use cases of AI: millions and millions of real conversations from real users that the community is using to study and improve on.

In terms of what you will build versus what you will not build: I'm not necessarily caught up on everything you've launched. I know recently you've done the dev, or Code Arena. Code Arena, that's the most recent arena. Expert Arena. So basically, what is on the critical path for you, let's say for next year, and what have you decided you'll never do?

Let me first talk about things we'll never do on the platform. Integrity comes first. The public leaderboard that we show on LMArena I think of as a charity. It's a loss leader for us; we don't really make money on the public leaderboard. You can't pay to get on it. It's not like a Gartner in that sense; it's not like any of these pay-to-play systems, and it's never going to be like that. Models are going to be listed on the leaderboard whether or not the providers pay and whether or not they're getting a good score. They can't pay to take it off either. That's very important. What that means is that the leaderboard has an integrity that will never be compromised.

Not all preview models will make it onto the leaderboard. Those preview models have never been released; who cares about putting unreleased models on a leaderboard? The point is that for every released model, the score you see on the leaderboard is statistically sound. It reflects the real-world capabilities of the model. Why? Because millions of people from around the world have voted on it, and that's where that number comes from. All we do to compute that number is take those millions of votes and turn them into a number. That's always going to remain a transparent and fair reflection of model performance.
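The vote-to-number step he describes can be sketched with a toy Elo-style aggregator. This is a minimal illustration under stated assumptions, not LMArena's production pipeline (which uses a statistically grounded Bradley-Terry style fit over the full vote matrix); the function name and parameters here are made up for the example.

```python
from collections import defaultdict

def elo_ratings(votes, k=4.0, base=1000.0, scale=400.0):
    """Turn pairwise human votes into Elo-style scores.

    votes: iterable of (winner, loser) model-name pairs.
    Illustrative only; the real leaderboard uses a Bradley-Terry
    style statistical fit rather than a sequential Elo update.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected win probability of the winner under current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / scale))
        delta = k * (1.0 - expected)  # bigger update for upset wins
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# model-a beats model-b three times as often, so it ends up rated higher.
votes = [("model-a", "model-b")] * 30 + [("model-b", "model-a")] * 10
scores = elo_ratings(votes)
```

Each vote shifts rating mass from loser to winner, so the total is conserved; with millions of votes the relative ordering stabilizes, which is the sense in which the score for a released model is statistically sound.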

Where are we going? Lots of new categories. I don't know if you saw, we recently exposed occupational and expert categories. A single-digit percentage of our user base (we're at millions to tens of millions of users, so a single-digit percentage means a lot) are in medicine, legal, business, finance, accounting, creative, marketing, things like this. And we're able to show the performance of these models in all these different verticals because we have all these users in our user base. We're working more towards multimodal; video we're going to launch on the site at some point, later this year or early next. So, lots of things in the pipeline.

Will you expose an API? We've thought about it; I think it's a possibility. What are the counterarguments, like why not? Well, there's obviously a need for an API. The question is more one of focus for our company. Because we're a startup, we really should be doing one thing well: arenas. So I'm not sure how far we want to splay out, or on what timeline we'd want to do that.

Any other community management tips? More broadly, every AI company really wants to grow their community, and you're obviously one of the strongest in the world. What's really worked? First of all, I want to give a shout-out to our community manager, Greg, who is doing an awesome job managing our community, whether that's on Discord or on LMArena. He's really incredible. So I would say hire Greg, but don't hire Greg. Find a Greg.

In general, how do you get to so many users, and keep and retain them? That is a tough question, because consumer is one of the hardest markets in the world. There are a lot of websites people can go to; why should they go to yours? The reality is, if you want to create a really dominant product, you have to provide people value. And to be frank, I don't think we're all the way there yet. It's not like I have the solution for how to build a great consumer product. If I did, we wouldn't be at tens of millions of users; we'd be at hundreds of millions, or a billion.

Is there a world where you're bigger than ChatGPT? I don't know. I don't know that we need to be, and I don't know that we ever will be, because that's an extraordinary, generational product that they built, right? It took a lot of time, and to some extent it also involved luck. There are a lot of lightning-in-a-bottle moments, like Nano Banana was for us, where your user base just goes up by a lot. But when those users come, they can just as easily leave. So the way I think about it is: every user is earned. You have to earn them every single day. They can leave at any moment; they're fickle. So all the time you have to be thinking, how do I provide this person value? How are they using my website? What more could I give them? And how do I build in all the retention mechanisms so that they stay, and then also bring their friends?

Is there one thing that's working in terms of retention? Like you said, a lot of people are signing in now. Sign-in was a big driver of retention. What did you give them to encourage it? Persistent chat history. That's it. That's enough; that's one thing that has had a big impact.

What do you want from people? What are you looking for help on? Any call to action? We are always looking for people to come join us. If you are one of the best people in the world in your area, whether that's consumer product, machine learning, B2B, go-to-market, marketing, all these things, we need you at Arena. We're building a high-performance team of real experts in everything they do, and I'm always looking for excellent people to work with.

What about partnerships? Let's say I'm at Cognition and I want to partner with LMArena, or just Arena. What works for you? What existing partnerships do you already have that are really fruitful? We partner with all the major model labs. That's straightforward: hey, we have a new model, here you go. Exactly. So I think the most straightforward thing for someone like Cognition would be, let's evaluate that agent. Well, Code Arena is an agent evaluation. That's true. But all these arenas tend to focus on the model rather than the harness. Maybe that should change; maybe we should be evolving in that direction. And I think Code Arena is a good example of an arena that could support a full-featured harness like Devin.

In my view, if I'm talking to Cognition, I'm saying, hey, let's get Devin on the arena and figure out how to loop in the harness. I'm sure there's something that could be really valuable there, especially given Devin. Last week, people were talking about Devin being dead. Did you see that? People were saying Devin's gone. Devin's not gone; Devin's everywhere. So can we highlight that for people and show them, hey, Devin is actually the best, or one of the best in the world, at doing what it does? LMArena can actually do that. Our place as a central evaluation platform allows that to happen.

Thank you for owning the state of your house. Thanks so much, and congrats on a wonderful year. Appreciate it. Congrats to you too, on all the growing momentum in your podcast and in your career. It's really impressive to see.
