This episode dissects the critical and often overlooked challenge of accurately evaluating Large Language Models (LLMs) as they rapidly evolve.
The Swallow Project: Building Japanese LLMs
- Tokyo Institute of Technology's Okazaki introduces the Swallow Project, an initiative to develop powerful, open-source LLMs specifically enhanced for Japanese language and knowledge. The project aims to uncover optimal "recipes" for intelligent LLM construction.
- Swallow models are built by continual pre-training of existing strong base models rather than training from scratch, letting the team efficiently explore recipes that improve performance relative to top-tier models.
- All Swallow models are open-sourced on Hugging Face, ensuring unrestricted academic and research use and avoiding the data-privacy concerns of closed APIs.
- Recent models, like Llama 3.3 Swallow 70B v0.4, approach GPT-4-level performance on average benchmark scores and excel particularly in Japanese-specific tasks.
- Current development focuses on improving coding capabilities, identified as a weaker area in previous iterations.
"We want to know the recipe for building smart LLMs." – Okazaki
Evaluation: Defining Value and Progress
- Okazaki asserts that evaluation is not merely measurement but a creative act that defines an LLM's value and sets its developmental goals. Effective evaluation is crucial for verifying progress towards desired capabilities.
- Evaluation establishes the target capabilities (e.g., language proficiency, common sense, logical reasoning) an LLM should acquire.
- It necessitates selecting appropriate benchmark data and methods aligned with the LLM's developmental stage (pre-trained vs. instruction-tuned).
- Developing a stable and reliable evaluation infrastructure is paramount, despite the perceived simplicity of checking answers.
- Inference settings (prompts, sampling temperature) and various biases (choice ordering, output format, response length, self-preference when an LLM judges its own outputs as LLM-as-a-judge) significantly impact scores, complicating fair comparison; the sketch below makes the choice-order effect concrete.
"Evaluation is creation. It defines how much value an LLM holds and how that value emerges." – Okazaki
The Limitations of Early Evaluation Frameworks
- The initial Swallow Evaluation framework, developed during the project's continuous pre-training phase, employed methods that became outdated as LLMs advanced, particularly with the rise of instruction-tuned models.
- Early evaluation avoided chat templates, directly feeding prompts like "1+2+3=" to models not trained on conversational formats. This aimed to measure raw predictive power.
- It relied heavily on few-shot inference (providing worked examples within the prompt) to guide models that struggled with direct instructions, potentially overestimating their inherent understanding; this prompt style is sketched just after this list.
- Evaluation often focused solely on the final answer, penalizing models that provided reasoning or extraneous text, even if the core answer was correct.
- Some multiple-choice tasks used likelihood-based scoring, measuring the probability of the model generating a specific choice, which limited applicability to models not designed for such output (a second sketch below shows this scoring).
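As a concrete picture of that older style, here is a minimal sketch of a raw few-shot prompt. The arithmetic demonstrations are illustrative, not the framework's actual benchmark data; the point is that no chat template is involved, and the base model simply continues the text pattern.

```python
# Old-style evaluation prompt: plain text completion, no chat template.
# The few-shot demonstrations convey the answer format by pattern, not by instruction.
FEW_SHOT = [("1+2+3=", "6"), ("10-4=", "6"), ("2*5=", "10")]

def build_raw_prompt(question: str) -> str:
    """Concatenate few-shot demonstrations and the target question as one string."""
    demos = "\n".join(f"{q}{a}" for q, a in FEW_SHOT)
    return f"{demos}\n{question}"

print(build_raw_prompt("7+8="))
# 1+2+3=6
# 10-4=6
# 2*5=10
# 7+8=
# A base model is expected to continue with "15"; grading then string-matches the
# continuation, so any added reasoning or preamble counts as a wrong answer.
```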
"We realized our previous evaluation methods were not accurately measuring LLM capabilities." – Okazaki
The Shift to Instruction-Tuned Evaluation
- The emergence of reasoning models such as DeepSeek-R1 (whose distilled variants are built from models like Llama 3.1 8B Instruct) and the increasing complexity of benchmarks exposed the shortcomings of older evaluation methods, prompting a complete overhaul.
- The distilled DeepSeek-R1 model, when evaluated with the old Swallow framework, scored lower than its base model, indicating the evaluation itself, not the model, was flawed.
- The DeepSeek-R1 paper explicitly notes the model's sensitivity to prompts and recommends zero-shot prompting because few-shot examples degrade its performance, directly contradicting the old framework's approach.
- The rapid improvement of LLMs, especially with Chain-of-Thought (CoT) prompting (a technique where the model generates intermediate reasoning steps before the final answer), demanded benchmarks capable of assessing advanced reasoning; a minimal CoT example follows this list.
- This necessitated a new framework, Swallow Evaluation Instruct, designed for instruction-tuned models and high-difficulty tasks.
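For contrast with the raw prompts above, here is a minimal sketch of zero-shot CoT prompting through a chat template; the model name and instruction wording are illustrative assumptions. `apply_chat_template` renders the turn in the special-token format the model was instruction-tuned on.

```python
from transformers import AutoTokenizer

# Illustrative instruction-tuned model; any model shipping a chat template works.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [{
    "role": "user",
    "content": (
        "If a train travels 60 km in 45 minutes, what is its speed in km/h?\n"
        "Think step by step, then give the final answer on the last line as 'Answer: <value>'."
    ),
}]

# Renders the conversation into the model's native chat format (role markers,
# special tokens) and appends the assistant header so generation starts from
# the model's turn -- the setting these models were actually trained for.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```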
Swallow Evaluation Instruct: A New Paradigm
- The new framework, Swallow Evaluation Instruct, leverages Hugging Face's LightEval as its base, extending it for Japanese tasks and adopting modern evaluation practices.
- It supports chat completion APIs and allows for flexible backend inference engines (local models, DeepInfra, OpenAI-compatible APIs).
- The framework embraces "think-and-solve" evaluation, extracting the final answer from free-form generation that includes the reasoning process (see the end-to-end sketch after this list).
- It adopts zero-shot inference and explicitly supports Chain-of-Thought prompting to maximize model performance.
- Evaluation parameters like temperature and top-p are configurable at runtime, allowing for nuanced performance analysis.
- New metrics like Pass@K (checking whether at least one correct answer exists among K generated responses) assess the potential of reasoning models, which matters for reinforcement-learning-based development; the sketch below computes it.
- For mathematical tasks, it integrates libraries like Math-Verify to judge answers given in diverse but equivalent formats (e.g., fractions vs. decimals) beyond simple string matching.
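Tying the list together, here is a minimal end-to-end sketch of grading free-form CoT output and estimating Pass@K. The `Answer:` extraction regex, the sample generations, and the exact-string grading are illustrative assumptions, not Swallow Evaluation Instruct's actual code; Pass@K uses the standard unbiased estimator over n samples of which c are correct.

```python
import re
from math import comb

ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$", re.MULTILINE)  # hypothetical output format

def extract_answer(generation: str) -> str | None:
    """Pull the last 'Answer: ...' line out of a free-form reasoning trace."""
    found = ANSWER_RE.findall(generation)
    return found[-1].strip() if found else None

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator (Chen et al., 2021): the probability that at
    least one of k responses drawn from n samples (c of them correct) is correct."""
    if n - c < k:  # every size-k draw must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Three sampled responses to the same problem; exact string match stands in for
# a checker like Math-Verify, which would accept equivalent forms ("80", "80.0").
gold = "80"
samples = [
    "45 min is 0.75 h, so speed = 60 / 0.75 = 80 km/h.\nAnswer: 80",
    "60 km per 3/4 hour scales to 80 km per hour.\nAnswer: 80",
    "Speed = 60 / 45 = 1.33 km per minute.\nAnswer: 1.33",  # reasoning slip
]
n = len(samples)
c = sum(extract_answer(s) == gold for s in samples)
print(f"pass@1 = {pass_at_k(n, c, 1):.2f}, pass@2 = {pass_at_k(n, c, 2):.2f}")
# pass@1 = 0.67, pass@2 = 1.00
```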
Investor & Researcher Alpha
- Capital Reallocation: Investors should scrutinize LLM benchmark claims, especially for Japanese-centric models. Outdated evaluation methods can significantly misrepresent true model capabilities. Prioritize projects demonstrating robust, modern evaluation frameworks like Swallow Evaluation Instruct.
- Research Direction Shift: Research focused solely on optimizing LLMs for few-shot, answer-only evaluation is becoming obsolete. The industry now demands models capable of complex reasoning, instruction following, and free-form generation. Future research must align with evaluation methods that capture these advanced capabilities.
- New Bottleneck: The "unsexy" work of developing and maintaining sophisticated evaluation frameworks is a critical bottleneck. As LLMs evolve, the ability to accurately measure progress and compare models becomes increasingly challenging and vital. Investment in robust, open-source evaluation infrastructure offers significant long-term value.
Strategic Conclusion
The rapid advancement of LLMs mandates a continuous, sophisticated evolution of evaluation benchmarks and methodologies. The Swallow Project's shift to "Swallow Evaluation Instruct" exemplifies the industry's necessary move towards assessing complex reasoning and instruction-following capabilities. The next step for the industry is to standardize dynamic, bias-aware evaluation frameworks that prevent benchmark gaming and accurately reflect true model intelligence.