The AI world is obsessed with benchmark scores, but what if those numbers tell us nothing about how useful or safe a model is in the real world? Andrew Gordon and Nora Petrova from Prolific argue that current evaluation methods are fundamentally flawed, creating a "wild west" where models ace exams but fail humans.
Identify the "One Big Thing":
- The single most important argument is that current AI model evaluation, heavily reliant on technical benchmarks (like MMLU scores), is insufficient and misleading. It fails to capture human utility, safety, and nuanced user experience, creating a "wild west" where models are optimized for exams rather than real-world human interaction and values. A new, human-centric, methodologically rigorous evaluation approach is crucial for building truly useful and safe AI.
Extract Themes:
The Flaw of Technical Benchmarks:
- Current AI evaluation prioritizes technical scores (e.g., MMLU) over real-world human utility and experience.
- “A model that is incredibly good on Humanity's Last Exam or MMLU might be an absolute nightmare to use day-to-day.”
- “If you just rely on those technical metrics, you miss half the point. These models are designed for humans to use at the end of the day.”
The Need for Human-Centric Evaluation:
- A more robust, demographically representative, and granular human preference evaluation is required to understand model performance beyond raw technical capability.
- “What you actually get is an actionable set of results that say, 'Okay, your model is struggling with trust,' or 'Your model is struggling with personality.' That's where you need to be focusing to really actually build a model that is good for real users in the real world.”
- “We can very confidently say that that model is preferred by as representative a set of the general public as we can possibly get.”
Safety, Bias, and the "Wild West" of AI Development:
- The lack of standardized, human-aligned safety metrics and the potential for biased evaluation methods create significant risks and hinder responsible AI development.
- “People are increasingly using these models for very sensitive topics and questions for mental health, for how they should navigate problems in their lives, and there is no oversight on that.”
- “Some companies are getting access to a lot more private testing in the background than others... which obviously undermines the integrity of the arena because the more comparisons you have for your model, the more access to prompts you have, the more data you have to refine a better model that's better at the arena.”
Synthesize Insights:
Theme 1: The Flaw of Technical Benchmarks
- Exam-Oriented AI: Current models are often optimized to ace academic benchmarks like MMLU (Massive Multitask Language Understanding), which are akin to a student studying for a specific test.
- Real-World Disconnect: High scores on these technical exams do not translate to practical utility or a good user experience in daily applications. Analogy: A Formula 1 car is engineered for peak track performance, but it's impractical for daily commuting.
- Fractured Field: The AI evaluation landscape is nascent and lacks standardization, leading to inconsistent reporting and difficulty comparing models across different labs.
- Missing the Human Element: Technical benchmarks exclude human feedback, failing to assess crucial factors like helpfulness, communication style, adaptability, or personality.
Theme 2: The Need for Human-Centric Evaluation
- Actionable Insights: Human preference leaderboards, like Prolific's "Humane," break down preferences into specific attributes (trust, personality, helpfulness), providing developers with targeted feedback (see the sketch after this list).
- Representative Sampling: Unlike anonymous, self-selected participants in platforms like Chatbot Arena, Prolific stratifies its participant pool by demographics (age, location, values) to ensure a representative sample of the general public. Analogy: Instead of asking random people on the street, they're conducting a statistically sound poll with diverse demographics.
- Rigorous Methodology: Prolific uses the TrueSkill rating algorithm (originally developed by Microsoft for Xbox Live skill estimation) to efficiently compare models, focusing on information gain to reduce uncertainty with fewer battles.
- Structured Conversations: Participants engage in multi-step conversations with models, with built-in QA to ensure high-effort, relevant interactions, avoiding "topic wandering" or low-effort prompts.
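For readers who want to see what the attribute-level breakdown looks like in practice, here is a minimal Python sketch; the record layout, attribute labels, and flagging threshold are illustrative assumptions, not Prolific's actual schema.

```python
# Minimal sketch: turning per-attribute battle outcomes into actionable feedback.
# Record layout and the 0.45 flag threshold are illustrative assumptions.
from collections import defaultdict
from statistics import mean

def attribute_report(results, flag_below=0.45):
    """results: list of dicts like
    {"model": "model-a", "attribute": "trust", "win": 1}  # 1 = preferred, 0 = not
    Returns per-model win rates per attribute, flagging weak areas."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["model"], r["attribute"])].append(r["win"])

    report = defaultdict(dict)
    for (model, attribute), wins in buckets.items():
        win_rate = mean(wins)
        report[model][attribute] = {
            "win_rate": round(win_rate, 3),
            "needs_attention": win_rate < flag_below,  # e.g. "struggling with trust"
        }
    return dict(report)
```

Run over a set of human-judged battles, this is the shape of output that turns "Model A beat Model B" into "your model is struggling with trust."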
Theme 3: Safety, Bias, and the "Wild West" of AI Development
- Sensitive Applications, No Oversight: People use LLMs for critical personal issues (mental health, life advice) without any regulatory or ethical oversight, creating a "wild west" scenario.
- Safety Blind Spot: There is no standardized leaderboard or metric for AI safety, despite its critical importance. Anthropic's work on Constitutional AI and mechanistic interpretability offers promising directions.
- Evaluation Bias: Open-source leaderboards like Chatbot Arena can be gamed; companies might submit numerous private test models to gather data and refine a single "winner," creating an unfair advantage and undermining integrity. Analogy: A company repeatedly sending its product to a public review site under different names to gather feedback and improve it before its official launch, skewing the overall perception.
- Sycophancy and Personality: Early findings suggest models perform poorly on personality and cultural alignment metrics, with a trend towards "people-pleasing" (sycophantic) behavior that users dislike. This points to potential issues in training data or fine-tuning.
Filter for Action:
- For Investors:
- Opportunity: Invest in companies building robust, human-centric AI evaluation platforms. The market for "AI quality assurance" is nascent but critical.
- Warning: Be wary of AI companies touting only technical benchmark scores. Dig deeper into their human preference and safety evaluation methodologies. High MMLU scores alone are not a proxy for market fit or responsible development.
- For Builders:
- Action: Prioritize human preference data and safety metrics alongside technical benchmarks. Integrate user experience feedback loops early and often.
- Action: Explore diverse sampling for user testing to avoid biased feedback. Consider structured conversation protocols for more meaningful user interactions.
- Warning: Do not optimize solely for technical benchmarks or public leaderboards that can be gamed. Focus on building models that are genuinely helpful, safe, and aligned with human values for real-world use cases.
- Opportunity: Develop tools and methodologies for mechanistic interpretability and constitutional AI to enhance model safety and alignment.
New Podcast Alert: Why High Benchmark Scores Don’t Mean Better AI
By Machine Learning Street Talk
The Illusion of Technical Prowess
“A model that is incredibly good on Humanity's Last Exam or MMLU might be an absolute nightmare to use day-to-day.”
- Exam-Optimized AI: Many large language models (LLMs) are fine-tuned to perform well on academic benchmarks like MMLU. This is like a student studying for a specific test, not necessarily learning for life.
- Real-World Disconnect: A high MMLU score doesn't guarantee a good user experience. Think of a Formula 1 car: engineered for peak track performance, but impractical for daily commuting.
- Fractured Evaluation: The AI evaluation field is nascent, lacking standardized reporting. This makes comparing models across different labs difficult, as some emphasize certain benchmarks while others omit data entirely.
The Imperative of Human-Centric Feedback:
“What you actually get is an actionable set of results that say, 'Okay, your model is struggling with trust,' or 'Your model is struggling with personality.' That's where you need to be focusing to really actually build a model that is good for real users in the real world.”
- Actionable Insights: Prolific's "Humane" leaderboard moves beyond simple "which is better" comparisons. It breaks down preferences into specific attributes like helpfulness, communication, adaptability, and personality, providing targeted feedback for developers.
- Representative Sampling: Unlike anonymous public leaderboards, Humane stratifies its participant pool by demographics (age, location, values) based on census data. This ensures feedback reflects a representative cross-section of the general public, not just a biased subset (see the sampling sketch after this list).
- Rigorous Methodology: Humane employs the TrueSkill algorithm (originally built for Xbox Live skill estimation) to efficiently compare models. It focuses on information gain, prioritizing battles that reduce uncertainty fastest, so fewer battles are needed to rank models reliably.
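As a rough illustration of the stratified sampling idea, the sketch below fills demographic quotas from census-style proportions; the strata and shares are placeholders, not actual census figures or Prolific's recruitment logic.

```python
# Minimal sketch of quota-based stratified sampling.
# Strata labels and population shares are illustrative placeholders.
import random

def stratified_sample(pool, census_shares, n):
    """pool: list of participant dicts, each with a 'stratum' key
    (e.g. an age-band x region label).
    census_shares: {stratum: share of the population}, summing to ~1.0.
    Returns roughly n participants whose stratum mix matches census_shares."""
    by_stratum = {}
    for person in pool:
        by_stratum.setdefault(person["stratum"], []).append(person)

    sample = []
    for stratum, share in census_shares.items():
        quota = round(n * share)
        candidates = by_stratum.get(stratum, [])
        sample.extend(random.sample(candidates, min(quota, len(candidates))))
    return sample
```

The key difference from a self-selected arena is that the demographic mix is fixed up front by quota rather than left to whoever happens to show up.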
Safety, Bias, and the "Wild West":
“People are increasingly using these models for very sensitive topics and questions for mental health, for how they should navigate problems in their lives, and there is no oversight on that.”
- Sensitive Applications, No Oversight: Users turn to LLMs for critical personal advice, yet there's no regulatory or ethical oversight. This creates a "wild west" environment, with some companies taking safety more seriously than others.
- Evaluation Bias: Public leaderboards can be gamed. Companies might submit numerous private test models to gather data and refine a single "winner," creating an unfair advantage and undermining the integrity of the evaluation.
- Personality Deficiencies: Early data from Humane suggests models perform poorly on personality and cultural alignment metrics. There's a trend towards "people-pleasing" (sycophantic) behavior that users dislike, pointing to potential issues in training data or fine-tuning.
Key Takeaways:
- Strategic Shift: The market will increasingly demand AI models optimized for human utility and safety, not just technical benchmarks.
- Builder/Investor Note: Invest in or build platforms that provide robust, human-centric AI evaluation. For builders, prioritize user experience and safety metrics from diverse populations.
- The "So What?": Over the next 6-12 months, companies that integrate rigorous human preference and safety evaluations will differentiate themselves, building more trusted and effective AI products.
Podcast Link: Link

This episode challenges the prevailing wisdom that high technical benchmark scores equate to superior AI. It argues that current evaluation methods, often devoid of human input, fail to capture real-world utility, safety, and user experience, creating a misleading picture of model performance.
The Benchmark Illusion: Technical Prowess vs. Real-World Utility
- Andrew Gordon, a staff researcher at Prolific, and Nora Petrova, an AI researcher, contend that models excelling on technical benchmarks like MMLU (Massive Multitask Language Understanding) often prove impractical for daily use. This disconnect mirrors the absurdity of using a Formula 1 car for daily commuting.
- Most AI reporting relies on technical benchmarks, where models receive evaluations on specific themes or exams without human involvement.
- This approach misses critical user experience factors: helpfulness, communication style, adaptiveness, and model personality.
- Andrew Gordon argues that relying solely on technical metrics ignores the fundamental purpose of these models: human interaction.
- “A model that is incredibly good on Humanity's Last Exam or MMLU might be an absolute nightmare to use day-to-day.”
The Human-Centric Imperative: Prolific's Humane Leaderboard
- Prolific introduces a human-centric evaluation framework, initially with a user experience leaderboard and now with "Humane," their main leaderboard. This system prioritizes actionable feedback on how humans perceive and interact with AI.
- Prolific's initial proof-of-concept involved 500 US participants evaluating single models on a Likert scale (e.g., 1-7 for helpfulness); a minimal aggregation sketch follows this list.
- The "Humane" leaderboard adopts a comparative battle approach, similar to Chatbot Arena, but with enhanced methodological rigor.
- This method yields actionable results, identifying specific model weaknesses like struggling with trust or personality.
- Nora Petrova emphasizes the need for fairer evaluation metrics, including diverse and stratified sampling based on user demographics and values.
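A minimal sketch of how that single-model, Likert-style data could be summarized is below; the 1-7 scale comes from the description above, while the field names and structure are assumptions for illustration.

```python
# Minimal sketch: summarizing 1-7 Likert ratings per (model, attribute).
# Field names are illustrative assumptions.
from collections import defaultdict
from statistics import mean, stdev

def likert_summary(responses):
    """responses: list of dicts like
    {"model": "model-a", "attribute": "helpfulness", "score": 6}  # 1-7 Likert
    Returns mean, spread, and sample size per (model, attribute)."""
    scores = defaultdict(list)
    for r in responses:
        scores[(r["model"], r["attribute"])].append(r["score"])
    return {
        key: {
            "mean": round(mean(vals), 2),
            "sd": round(stdev(vals), 2) if len(vals) > 1 else 0.0,
            "n": len(vals),
        }
        for key, vals in scores.items()
    }
```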
Exposing Flaws in Existing Human Preference Leaderboards
- The current landscape of AI evaluation is nascent and fractured, lacking standardized reporting. Andrew Gordon highlights significant issues with existing human preference leaderboards, particularly Chatbot Arena.
- The field of AI evaluation is nascent, leading to heterogeneous reporting standards across labs; some models launch without any benchmarking data.
- Chatbot Arena's open-source nature allows some companies to gain disproportionate access to private testing and prompt data, undermining integrity. Meta, for instance, tested 27 private model variants on the arena before Llama 4's official launch, gaining extensive data to optimize for the arena.
- Chatbot Arena's anonymous, undifferentiated user base provides limited demographic data, making it difficult to understand who is providing feedback.
- Feedback specificity is low, offering only "Model A is better than Model B" without explaining why, hindering actionable development insights.
- “The more comparisons you have for your model, the more access to prompts you have, the more data you have to refine a better model that's better at the arena.”
Prolific's Rigorous Methodology: Building a Fairer Standard
- Stratified Sampling: Prolific samples participants based on census data (age, ethnicity, political alignment) from the US and UK, ensuring a representative cross-section of the general public.
- Granular Feedback: Instead of simple preference, "Humane" breaks down feedback into constituent parts: helpfulness, communication, adaptiveness, and personality, providing actionable insights for model improvement.
- Quality Assurance (QA): The system incorporates QA for multi-turn conversations, penalizing low-effort or topic-wandering interactions to ensure nuanced model engagement.
- TrueSkill Algorithm: Prolific employs Microsoft's TrueSkill framework (originally for Xbox Live skill estimation) to assess model performance. This Bayesian system accounts for randomness and changing skill levels, using information gain to efficiently select battle pairs and minimize uncertainty (see the sketch after this list).
- Data-Driven Sampling: Battles are conducted only when data indicates high uncertainty between specific models, ensuring computational efficiency and targeted learning, unlike systems where popular models receive disproportionate attention.
- “We only ever do battles based on the need from the data. So the uncertainty is high for a specific model against another specific model. We conduct a battle for that to lower that uncertainty.”
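To make the rating-and-scheduling loop concrete, here is a minimal sketch built on the open-source `trueskill` Python package (an implementation of Microsoft's TrueSkill model). The pairing heuristic is an illustrative stand-in for Prolific's information-gain rule, not their actual code: it favors battles between models whose ratings are still uncertain and closely matched.

```python
# Minimal sketch of uncertainty-driven battle scheduling on top of TrueSkill.
# Requires the open-source `trueskill` package (pip install trueskill).
# The priority heuristic is an illustrative stand-in for an information-gain rule.
import itertools
import trueskill

env = trueskill.TrueSkill(draw_probability=0.05)
ratings = {name: env.create_rating() for name in ["model-a", "model-b", "model-c"]}

def next_battle(ratings):
    """Pick the currently most informative pair: high remaining uncertainty
    (sigma) and an evenly matched contest (high match quality)."""
    def priority(pair):
        a, b = pair
        return (ratings[a].sigma + ratings[b].sigma) * env.quality_1vs1(ratings[a], ratings[b])
    return max(itertools.combinations(ratings, 2), key=priority)

def record_result(ratings, winner, loser):
    """Bayesian update after a human-judged battle: the winner's mean rating
    rises and both models' uncertainty (sigma) shrinks."""
    ratings[winner], ratings[loser] = env.rate_1vs1(ratings[winner], ratings[loser])
```

Because each result shrinks the uncertainty for that pair, the scheduler naturally stops requesting battles it is already confident about, which is the "only ever do battles based on the need from the data" behavior described above.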
Beyond Performance: Model Personality and Safety Oversight
- Initial findings from Prolific's testing reveal models perform significantly worse on subjective metrics like personality and cultural alignment. This highlights a critical gap in current AI development and evaluation, particularly concerning safety.
- Models tested performed worse on personality and background/culture metrics compared to helpfulness or adaptiveness, suggesting a disconnect between training data and human expectations.
- Nora Petrova notes an increase in "sycophancy" (people-pleasing behavior) in models, which users generally dislike. Prolific plans to use its data to test whether sycophancy correlates with negative personality ratings (a minimal sketch of that check follows this list).
- Andrew Gordon stresses the urgent need for safety leaderboards, arguing that model safety should be as critical as speed or intelligence, especially as users engage with AI on sensitive topics like mental health.
- Anthropic's work on Constitutional AI and mechanistic interpretability (peering into model "thoughts" to trace input-output pathways) represents a crucial step towards building safer, more reliable models.
- “There is no leaderboard for safety, right? We don't grade LLMs by how safe they are.”
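The sycophancy point above is, at heart, a correlation check. A minimal sketch follows, assuming a per-conversation sycophancy score and a participant personality rating are available (neither is a published Prolific field):

```python
# Minimal sketch: does a higher sycophancy score go with a lower personality rating?
# Input fields are hypothetical; requires Python 3.10+ for statistics.correlation.
from statistics import correlation  # Pearson's r

def sycophancy_vs_personality(records):
    """records: list of dicts like
    {"model": "model-a", "sycophancy": 0.8, "personality_rating": 2}
    Returns Pearson's r; a clearly negative value would support the finding
    that users dislike people-pleasing behavior."""
    syco = [r["sycophancy"] for r in records]
    ratings = [r["personality_rating"] for r in records]
    return correlation(syco, ratings)
```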
Investor & Researcher Alpha
- Capital Reallocation: Investors should scrutinize AI companies' evaluation methodologies. Capital will increasingly flow to models demonstrating robust human alignment, safety, and nuanced user experience, not just peak MMLU scores.
- New Bottleneck: The true bottleneck for frontier AI development is no longer just compute, but the lack of standardized, human-centric evaluation frameworks that provide actionable insights for model improvement and ethical deployment.
- Research Shift: Purely technical benchmark optimization is becoming obsolete. Researchers must prioritize interdisciplinary approaches combining behavioral science, robust statistical methods (like TrueSkill), and ethical AI principles (e.g., Constitutional AI) to build truly useful and safe systems.
Strategic Conclusion
- The AI industry must pivot from a narrow focus on technical benchmarks to comprehensive, human-centric evaluation. This shift demands rigorous methodologies, diverse user feedback, and transparent safety metrics to ensure AI models are not just "smart," but genuinely beneficial and safe for humanity.
- The next step is establishing universal, human-aligned evaluation standards.