Weights & Biases
December 15, 2025

Evolving Large Language Model Evaluation: Practices and Insights from the Swallow Project

LLM evaluation is not a static scorecard; it's a dynamic act of creation that defines a model's value and steers its development. Dr. Okazaki from Tokyo Institute of Technology, leading the open-source Swallow LLM project, unpacks why evaluation frameworks must evolve as rapidly as the models themselves.

Identify the "One Big Thing":

  • The single most important argument is that LLM evaluation is not a static, simple task but a dynamic, complex, and crucial act of creation that directly shapes the development and capabilities of future models. As models evolve (especially with reasoning capabilities and instruction tuning), so too must the evaluation frameworks, or we risk mismeasuring and hindering progress.

Extract Themes:

1. Evaluation as Creation & Goal Setting:

  • “Evaluation is creation. There are two aspects to this: first, without evaluation, we don't know how much value something has. Second, value is born from how we evaluate.”
  • “When developing LLMs, if you decide on the evaluation method first, that becomes the goal, and development progresses towards it.”

2. The Evolving Complexity of LLM Evaluation:

  • “Even with the same benchmark, scores can vary significantly depending on the measurement method.”
  • “When DeepSeek-R1 came out, we evaluated a distilled Llama 3.1 8B Instruct model with Swallow Evaluation, and its performance was lower than the original model. This was strange.”

3. The "Swallow Evaluation Instruct" Framework: A New Paradigm:

  • “We developed Swallow Evaluation Instruct as an evaluation framework suitable for reasoning models and high-difficulty tasks.”
  • “It supports reasoning models by separating the reasoning process and the final answer from the LLM's output, allowing evaluation of only the final answer.”

Synthesize Insights:

Theme 1: Evaluation as Creation & Goal Setting

  • Evaluation Defines Value: Without robust evaluation, the true utility or "smartness" of an LLM remains unknown. It's not just about measuring; it's about defining what "good" means.
  • Goal-Oriented Development: The chosen evaluation method acts as a compass for LLM development. If you prioritize certain metrics (e.g., factual recall, coding ability), the model will optimize for those.
  • Beyond Simple Correctness: Evaluation extends beyond right/wrong answers to encompass language proficiency, common sense, and logical reasoning.
  • Real-world Analogy: Think of it like a chef developing a new recipe. If they only taste for saltiness, they might miss a lack of sweetness or umami. The evaluation (tasting criteria) dictates what aspects of the dish (LLM) they focus on improving.

Theme 2: The Evolving Complexity of LLM Evaluation

  • Context Matters: LLM performance is highly sensitive to inference settings (prompts, temperature parameters). A model might appear worse if evaluated with suboptimal settings.
  • Bias is Everywhere: Evaluation frameworks are susceptible to biases like choice order, output format preferences (e.g., longer answers preferred by LLM-as-a-judge), or even self-promotion bias in LLM-as-a-judge scenarios.
  • Outdated Methods Fail New Models: Older evaluation methods (e.g., not using chat templates, zero-shot without Chain-of-Thought) can underestimate the capabilities of newer, instruction-tuned, or reasoning-focused models.
  • Real-world Example: The DeepSeek-R1 case highlights this: a model designed for instruction following and reasoning performed worse on an older evaluation framework that didn't use chat templates or reasoning prompts, demonstrating the framework's inadequacy (a minimal sketch of the chat-template contrast follows this list).
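
A minimal sketch of that contrast, using Hugging Face transformers (the checkpoint name is illustrative; any instruction-tuned model with a chat template shows the same gap):

```python
# Minimal sketch: raw completion prompt vs. chat-templated prompt.
from transformers import AutoTokenizer

# Illustrative checkpoint; substitute any instruction-tuned model.
tokenizer = AutoTokenizer.from_pretrained(
    "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3"
)

question = "What is 1 + 2 + 3?"

# Older frameworks fed the raw string directly, as if asking a base
# model to continue a fragment of text.
raw_prompt = f"Q: {question}\nA:"

# Modern frameworks wrap the question in the model's own chat template,
# matching the format it saw during instruction tuning.
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant header the model expects
)

print(raw_prompt)
print(chat_prompt)
```

Scoring an instruction-tuned model on the raw form effectively tests it out of distribution, which is one way a capable model ends up looking worse than its base model.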

Theme 3: The "Swallow Evaluation Instruct" Framework: A New Paradigm

  • Reasoning-First Approach: This new framework is designed for instruction-tuned and reasoning models, allowing them to "think step-by-step" (Chain-of-Thought) before providing an answer.
  • Chat Completion & Free Generation: It moves beyond simple answer extraction to evaluate responses generated in a conversational format, including the reasoning process itself.
  • Pass@K for Potential: The inclusion of metrics like Pass@K (generating multiple answers and checking whether at least one is correct) helps assess a model's potential for reinforcement learning from human feedback (RLHF) or other fine-tuning, rather than just its single-best output (see the estimator sketch after this list).
    • Real-world Analogy: Pass@K is like a basketball coach seeing how many shots a player can make out of 100 attempts, even if only one shot is needed in a game. It measures underlying capability for improvement.
  • Leveraging Existing Tools: Built on Hugging Face's lighteval, it extends capabilities for Japanese tasks and supports various inference backends (local, API-based like OpenAI, DeepInfra).
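
For readers who want the arithmetic behind Pass@K, here is a minimal sketch of the standard unbiased estimator (the formulation popularized by the HumanEval paper), assuming you have already generated n samples per problem and counted the c correct ones:

```python
# Unbiased pass@k estimator: the probability that at least one of k
# samples drawn from n generations is correct, given that c of the n
# generations were correct. Equals 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 generations per problem, 37 of them correct.
print(pass_at_k(n=100, c=37, k=1))   # ~0.37, i.e. plain accuracy
print(pass_at_k(n=100, c=37, k=10))  # much higher: latent capability
```

The gap between Pass@1 and Pass@K is exactly the "potential" signal described above: a model with low Pass@1 but high Pass@K is a strong candidate for further fine-tuning or reinforcement learning.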

Filter for Action:

  • For Investors:
    • Warning: Be wary of LLM performance claims that don't specify evaluation methodology. A high score on an outdated benchmark might not reflect real-world utility.
    • Opportunity: Invest in companies building robust, adaptive evaluation tools and frameworks. The "picks and shovels" for LLM development are becoming increasingly critical.
  • For Builders:
    • Action: Adopt modern evaluation frameworks that account for instruction tuning, reasoning, and conversational interfaces. Don't just chase leaderboards; understand how those scores are achieved.
    • Action: Contribute to open-source evaluation efforts. The quality of evaluation directly impacts the quality of models.
    • Warning: Recognize that evaluation is an ongoing, iterative process. Your evaluation framework will need to evolve as your models do.
    • Opportunity: Explore metrics like Pass@K for internal model development, especially when training models for complex tasks or RLHF. It provides a deeper signal than single-answer accuracy.

🙊 New Podcast Alert: Evolving Large Language Model Evaluation: Practices and Insights from the Swallow Project

By Weights & Biases

Podcast Link: https://www.youtube.com/watch?v=N9b6oAVuVZc

This episode dissects the critical, often overlooked, challenge of accurately evaluating Large Language Models (LLMs) as they rapidly evolve.

The Swallow Project: Building Japanese LLMs

  • Tokyo Institute of Technology's Okazaki introduces the Swallow Project, an initiative to develop powerful, open-source LLMs specifically enhanced for Japanese language and knowledge. The project aims to uncover optimal "recipes" for intelligent LLM construction.
  • Swallow models are built by continual pre-training on top of existing open models, rather than pre-training from scratch, allowing efficient exploration of performance improvements relative to top-tier models.
  • All Swallow models are open-sourced on Hugging Face, ensuring unrestricted academic and research use; because they can be run locally, they avoid the data privacy concerns of API-only models.
  • Recent models, like Llama 3.3 Swallow 70B v0.4, approach GPT-4 performance on average scores, significantly excelling in Japanese-specific tasks.
  • Current development focuses on improving coding capabilities, identified as a weaker area in previous iterations.

"We want to know the recipe for building smart LLMs." – Okazaki

Evaluation: Defining Value and Progress

  • Okazaki asserts that evaluation is not merely measurement but a creative act that defines an LLM's value and sets its developmental goals. Effective evaluation is crucial for verifying progress towards desired capabilities.
  • Evaluation establishes the target capabilities (e.g., language proficiency, common sense, logical reasoning) an LLM should acquire.
  • It necessitates selecting appropriate benchmark data and methods aligned with the LLM's developmental stage (pre-trained vs. instruction-tuned).
  • Developing a stable and reliable evaluation infrastructure is paramount, despite the perceived simplicity of checking answers.
  • Inference settings (prompts, temperature parameters) and various biases (selection order, format, response length, self-assessment by LLM-as-a-judge) significantly impact scores, complicating fair comparison.

"Evaluation is creation. It defines how much value an LLM holds and how that value emerges." – Okazaki

The Limitations of Early Evaluation Frameworks

  • The initial Swallow Evaluation framework, developed during the project's continuous pre-training phase, employed methods that became outdated as LLMs advanced, particularly with the rise of instruction-tuned models.
  • Early evaluation avoided chat templates, directly feeding prompts like "1+2+3=" to models not trained on conversational formats. This aimed to measure raw predictive power.
  • It relied heavily on few-shot inference (providing examples within the prompt) to guide models that struggled with direct instructions, potentially overestimating their inherent understanding.
  • Evaluation often focused solely on the final answer, penalizing models that provided reasoning or extraneous text, even if the core answer was correct.
  • Some multiple-choice tasks used likelihood-based scoring, ranking candidate answers by the probability the model assigns to each, which does not carry over to models designed to answer in free-form text (sketched below).
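
A minimal sketch of that likelihood-based scoring style (the model name is illustrative and the tokenization handling is simplified): each choice is ranked by the total log-probability of its tokens given the question, and the model never generates any text at all.

```python
# Sketch of likelihood-based multiple-choice scoring: each candidate
# answer is scored by the summed log-probability of its tokens given
# the question; whichever choice the model finds most probable wins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM is scored the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def choice_logprob(question: str, choice: str) -> float:
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Logits at position i predict token i+1, so the choice token at
    # position pos is scored by the distribution at pos - 1.
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

question = "Q: What is the capital of Japan?\nA:"
choices = [" Tokyo", " Kyoto", " Osaka"]
print(max(choices, key=lambda c: choice_logprob(question, c)))
```

A model that answers in free-form text ("The capital is Tokyo.") gets no credit under this scheme, which is the applicability limit described above.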

"We realized our previous evaluation methods were not accurately measuring LLM capabilities." – Okazaki

The Shift to Instruction-Tuned Evaluation

  • The emergence of models like DeepSeek-R1 and its distilled variants (e.g., a Llama 3.1 8B Instruct model distilled from R1), together with the increasing complexity of benchmarks, exposed the shortcomings of older evaluation methods, prompting a complete overhaul.
  • The distilled model, when evaluated with the old Swallow framework, showed a performance decrease compared to its base model, indicating the evaluation itself was flawed.
  • The DeepSeek-R1 paper explicitly noted the model's sensitivity to prompts and its performance degradation under few-shot inference, directly contradicting the old framework's approach.
  • The rapid improvement of LLMs, especially with Chain-of-Thought (CoT) prompting (a technique where the model generates intermediate reasoning steps before the final answer), demanded benchmarks capable of assessing advanced reasoning (an illustrative zero-shot CoT request follows this list).
  • This necessitated a new framework, Swallow Evaluation Instruct, designed for instruction-tuned models and high-difficulty tasks.
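
As an illustration of the shift, here is a zero-shot Chain-of-Thought request in chat-completion form (the prompt wording, endpoint, and model name are placeholders, not the framework's actual configuration):

```python
# Hypothetical zero-shot CoT prompt: no few-shot examples packed into
# the context; the model is asked to reason first, then to emit the
# final answer in an easily extractable format.
from openai import OpenAI

messages = [
    {
        "role": "user",
        "content": (
            "Solve the following problem. Think step by step, then give "
            "only the final answer on the last line as 'Answer: <value>'.\n\n"
            "If 3 pencils cost 45 yen, how much do 7 pencils cost?"
        ),
    }
]

# Any OpenAI-compatible chat-completions backend works the same way.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
reply = client.chat.completions.create(model="my-model", messages=messages)
print(reply.choices[0].message.content)
```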

Swallow Evaluation Instruct: A New Paradigm

  • The new framework, Swallow Evaluation Instruct, leverages Hugging Face's LightEval as its base, extending it for Japanese tasks and adopting modern evaluation practices.
  • It supports chat completion APIs and allows for flexible backend inference engines (local models, DeepInfra, OpenAI-compatible APIs).
  • The framework embraces "think-and-solve" evaluation, extracting answers from free-form generation that includes reasoning processes.
  • It adopts zero-shot inference and explicitly supports Chain-of-Thought prompting to maximize model performance.
  • Evaluation parameters like temperature and top-P are configurable at runtime, allowing for nuanced performance analysis.
  • New metrics like Pass@K (checking if at least one correct answer exists among K generated responses) assess the potential of reasoning models, crucial for reinforcement learning from human feedback (RLHF) development.
  • For mathematical tasks, it integrates libraries like Math-Verify to handle diverse answer formats (e.g., fractions, decimals) beyond simple string matching (see the sketch after this list).
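
A minimal sketch of the last two points combined, assuming an R1-style </think> delimiter between reasoning and answer and using the open-source math-verify package (the framework's actual extraction logic may differ):

```python
# Sketch: strip the reasoning trace, keep only the final answer, and
# check it with math-verify so equivalent forms (fractions, decimals)
# compare as equal instead of failing a string match.
from math_verify import parse, verify

def final_answer(output: str) -> str:
    """Drop everything up to an R1-style </think> tag, if present."""
    _, sep, tail = output.rpartition("</think>")
    return tail.strip() if sep else output.strip()

raw_output = (
    "<think>Half of 3/2: dividing by 2 gives 3/4, i.e. 0.75.</think>\n"
    "The answer is $3/4$."
)

pred = parse(final_answer(raw_output))  # extracts the math expression
gold = parse("$0.75$")
print(verify(gold, pred))  # True: 3/4 and 0.75 are equivalent
```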

Investor & Researcher Alpha

  • Capital Reallocation: Investors should scrutinize LLM benchmark claims, especially for Japanese-centric models. Outdated evaluation methods can significantly misrepresent true model capabilities. Prioritize projects demonstrating robust, modern evaluation frameworks like Swallow Evaluation Instruct.
  • Research Direction Shift: Research focused solely on optimizing LLMs for few-shot, answer-only evaluation is becoming obsolete. The industry now demands models capable of complex reasoning, instruction following, and free-form generation. Future research must align with evaluation methods that capture these advanced capabilities.
  • New Bottleneck: The "unsexy" work of developing and maintaining sophisticated evaluation frameworks is a critical bottleneck. As LLMs evolve, the ability to accurately measure progress and compare models becomes increasingly challenging and vital. Investment in robust, open-source evaluation infrastructure offers significant long-term value.

Strategic Conclusion

The rapid advancement of LLMs mandates a continuous, sophisticated evolution of evaluation benchmarks and methodologies. The Swallow Project's shift to "Swallow Evaluation Instruct" exemplifies the industry's necessary move towards assessing complex reasoning and instruction-following capabilities. The next step for the industry is to standardize dynamic, bias-aware evaluation frameworks that prevent benchmark gaming and accurately reflect true model intelligence.
