This episode explores the rapidly evolving field of post-training distillation for large language models (LLMs), revealing how new techniques are dramatically improving efficiency and challenging traditional approaches.
The Genesis and Evolution of Distillation
- Fishy Aral, a Staff Research Scientist at GDM, explains the motivation behind his recent internal presentation on distillation. He notes that while the seminal 2015 paper by Hinton et al. introduced the concept, significant advancements have occurred, especially with the rise of LLMs.
- The discussion moves beyond the "purest" Hinton definition, acknowledging the expanded current usage of "distillation" to encompass various knowledge transfer methods from larger "teacher" models to smaller "student" models.
- Fishy emphasizes that distillation isn't just about cost savings; it's a crucial deployment strategy. As models like Llama 405B and GPT-4.5 become too expensive for widespread use, distillation enables practical application.
- Swyx highlights the marginal benefit versus cost argument: "It is really the marginal benefit versus cost argument, right? Like if the marginal benefit and capability is not worth the additional cost to you, then people don't want to use that bigger model."
Dark Knowledge and the Distillation Advantage
- The conversation touches on the phenomenon where distilled models often outperform models trained from scratch on the same data, a concept Hinton termed "Dark Knowledge."
- This suggests that larger models possess latent knowledge not fully captured in the training data, which distillation can transfer to smaller models.
- Alessio asks about potential changes to pre-training or preference tuning when the goal is distillation. Fishy clarifies that it depends on the desired outcome: distillation can be the final step, or a precursor to further refinement such as RLHF (Reinforcement Learning from Human Feedback).
Traditional Logit-Based Distillation
- Fishy explains the original Hinton-style distillation, which involves matching the teacher's and student's output distributions over the vocabulary, computed by applying a softmax to their "logits" (the raw pre-softmax scores each model assigns to every token).
- This is done by minimizing the Kullback-Leibler (KL) divergence, a measure of how much one probability distribution differs from another, between the teacher's and student's distributions.
- This approach, used during pre-training for models like Gemma 2, trains on "soft labels" (the teacher's full probability distribution over the vocabulary) instead of "hard labels" (one-hot encoded target tokens).
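For readers who want to see what logit matching looks like in code, here is a minimal sketch. It assumes PyTorch, a shared tokenizer between teacher and student, and logits of shape [batch, seq, vocab]; the function and tensor names are illustrative, not from the episode.

```python
# Minimal sketch of Hinton-style soft-label distillation (assumptions above).
import torch
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened token distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)        # soft labels
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (t ** 2)  # temperature rescaling from Hinton et al. (2015)
```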
Synthetic Data Distillation: A Practical Approach
- The discussion shifts to a widely used technique: synthetic data distillation. This involves generating outputs from the teacher model given a set of prompts and then using these input-output pairs to fine-tune the student model.
- This method, introduced in a 2016 paper on sequence-level knowledge distillation, is surprisingly principled: it minimizes the same KL divergence as logit-based distillation, but requires only sampling (API) access to the teacher model.
- Fishy highlights the DeepSeek approach, which uses "best-of-n" sampling: generating multiple outputs, filtering for correctness (e.g., passing test cases), and fine-tuning on the selected outputs.
- A key advantage of this method is its tokenizer agnosticism, allowing distillation between models with different vocabularies.
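A rough sketch of this rejection-sampling recipe is below. `teacher_generate`, `passes_tests`, and `finetune` are hypothetical placeholders for an inference API, an automatic verifier, and a standard SFT trainer; they are not names from the episode or from any specific library.

```python
# Sketch of best-of-n synthetic-data distillation with correctness filtering.
def build_distillation_set(prompts, n_samples=8):
    dataset = []
    for prompt in prompts:
        # Sample several candidate completions from the teacher.
        candidates = [teacher_generate(prompt, temperature=0.8)
                      for _ in range(n_samples)]
        # Keep only candidates that pass an automatic check
        # (e.g., unit tests for code, exact-match for math answers).
        good = [c for c in candidates if passes_tests(prompt, c)]
        dataset.extend({"prompt": prompt, "completion": c} for c in good)
    return dataset

# The student is then fine-tuned on the filtered pairs with ordinary SFT:
# student = finetune(student, build_distillation_set(prompts))
```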
Optimizing Synthetic Data: Top-K and Compute Considerations
- The conversation explores refinements to synthetic data distillation, such as using the "top-k" samples instead of just the top-1, similar to the MBR (Minimum Bayes Risk) technique in machine translation.
- A crucial insight emerges: generating more data from a smaller, cheaper model and filtering it can be more effective than generating less data from a larger, more expensive model.
- Fishy shares research showing that, in a compute-matched setting, generating data from a smaller Gemma 9B model and filtering it outperformed using data from a larger Gemma 27B model, even for improving the 27B model itself.
- He explains: "We found consistently that actually generating data from 9B in a compute-matched setting is always better, even better for distilling or actually improving the 27B model itself."
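To make the compute-matched comparison concrete: if sampling cost scales roughly linearly with parameter count (an assumption for illustration; real costs also depend on sequence length, batching, and hardware), a 9B model can produce about 3x as many candidate solutions as a 27B model for the same budget, giving the correctness filter more to work with.

```python
# Back-of-the-envelope compute-matched sampling budget (illustrative only).
def samples_at_matched_compute(samples_from_large, large_params_b=27.0,
                               small_params_b=9.0):
    """How many small-model samples fit in the budget of N large-model samples."""
    return samples_from_large * large_params_b / small_params_b

print(samples_at_matched_compute(4))  # 4 samples from 27B ~= 12 samples from 9B
```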
RL-Inspired Distillation: Addressing Train-Inference Mismatch
- The discussion introduces a novel approach: RL-inspired distillation, addressing the "exposure bias" or train-inference mismatch inherent in traditional distillation.
- This mismatch arises because, during training, the student model learns from fixed teacher-generated outputs, while at inference time, it generates tokens autoregressively.
- Drawing inspiration from DAgger, an imitation-learning technique from the RL literature, this method samples outputs from the student model, obtains the teacher's logits for those student-generated sequences, and minimizes the divergence between student and teacher on them.
- This creates a feedback loop where the student continuously asks the teacher for guidance, correcting its mistakes and learning from scenarios it might encounter during inference.
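A minimal sketch of one such on-policy step is below, assuming PyTorch and Hugging Face-style causal LMs that share a tokenizer. It is a simplification (for instance, prompt tokens are not masked out of the loss), the names are placeholders, and the token-level reverse KL used here is one natural choice of divergence; the following sections discuss why the direction matters.

```python
# Sketch of one on-policy distillation step: sample from the student,
# get teacher feedback on that sample, minimize a token-level divergence.
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, optimizer,
                           max_new_tokens=128):
    with torch.no_grad():
        # The student generates its own continuation (on-policy data);
        # no gradients flow through the sampling itself.
        rollout = student.generate(prompt_ids, do_sample=True,
                                   max_new_tokens=max_new_tokens)
        teacher_logits = teacher(rollout).logits      # teacher's feedback

    student_logits = student(rollout).logits          # gradients flow here
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher), averaged over positions. In practice
    # you would mask the prompt and only distill on the generated tokens.
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```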
Principled Derivation and Practical Implementation
- Fishy reveals a surprising mathematical connection: by reversing the direction of the KL divergence, one can derive this on-policy distillation approach, demonstrating its principled nature.
- This method avoids full RL by not backpropagating through the sampling distribution, simplifying the training process.
- Empirical results on small-scale tasks show that on-policy distillation (sampling from the student) often outperforms both logit-based and synthetic data distillation.
- Fishy notes that this technique was used in Gemma 2's post-training.
Mode Covering vs. Mode Seeking: A Trade-off
- The conversation delves into the implications of different KL divergence directions, highlighting the concepts of "mode covering" and "mode seeking."
- Traditional forward-KL distillation (mode covering) forces the student to spread its probability mass to cover everything the teacher might generate, which can leave mass in regions where the teacher itself has low probability; on-policy, reverse-KL distillation (mode seeking) instead concentrates the student on a single high-probability mode of the teacher's distribution.
- This leads to a trade-off between diversity and performance: mode covering yields higher diversity but lower performance, while mode seeking prioritizes performance at the cost of diversity.
- A practical solution is to use a mixture of both KL directions, balancing diversity and performance.
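In symbols, with p_T the teacher and p_S the student, the two KL directions and one simple way to mix them (the mixture weight lambda is illustrative, not a value quoted in the episode):

```latex
% Forward KL (mode covering): the student must put mass wherever the teacher does.
\mathrm{KL}(p_T \,\|\, p_S) = \sum_y p_T(y)\,\log\frac{p_T(y)}{p_S(y)}

% Reverse KL (mode seeking): the student is penalized for mass the teacher lacks.
\mathrm{KL}(p_S \,\|\, p_T) = \sum_y p_S(y)\,\log\frac{p_S(y)}{p_T(y)}

% A convex mixture trades off diversity (forward) against performance (reverse).
\mathcal{L}(\lambda) = \lambda\,\mathrm{KL}(p_T \,\|\, p_S) + (1-\lambda)\,\mathrm{KL}(p_S \,\|\, p_T)
```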
Simplifying Implementation: Leveraging RL Frameworks
- Fishy provides a remarkably simple recipe for implementing this RL-inspired distillation:
- Take an existing RL fine-tuning pipeline (e.g., one built for RLHF or RLAIF).
- Turn off the reward term.
- Replace the KL divergence anchor policy with the larger teacher policy.
- This highlights the close relationship between RL and this form of distillation, suggesting that existing RL infrastructure can be readily adapted.
- It's also possible to combine RL and distillation, leveraging both reward maximization and teacher guidance.
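A sketch of the repurposed objective is below. The coefficient names and tensor shapes are assumptions for illustration: with reward_coef = 0 and the KL anchored to the teacher, the loss reduces to pure on-policy distillation; with a nonzero reward_coef it combines reward maximization with teacher guidance, as in the last bullet.

```python
# Sketch: an RLHF-style loss repurposed for distillation
# (inputs are assumed to be torch.Tensors; shapes noted in the docstring).
def rl_or_distill_loss(seq_logprob, student_logp, teacher_logp, rewards,
                       reward_coef=0.0, kl_coef=1.0):
    """seq_logprob: [batch] log-prob of each sampled sequence under the student.
    student_logp, teacher_logp: [batch, seq, vocab] token log-distributions.
    rewards: [batch] scalar reward per sampled sequence."""
    # REINFORCE-style reward term; turned off (coef 0) for pure distillation.
    reward_term = -(rewards * seq_logprob).mean()
    # KL anchor: in standard RLHF this anchors to a frozen reference policy;
    # pointing it at the *teacher* turns the trainer into a distiller.
    kl_term = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()
    return reward_coef * reward_term + kl_coef * kl_term
```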
Speculative Decoding and AI Overviews
- The discussion touches on speculative decoding, a technique in which a smaller draft model proposes several tokens and the larger target model verifies them in parallel, accepting or rejecting them to accelerate inference.
- Distillation can improve speculative decoding by pulling the smaller model's distribution closer to the larger model's, which raises the token acceptance rate and therefore the speedup.
- Fishy reveals that this technique, combined with distillation, is used in Google's AI Overviews.
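For intuition, here is a heavily simplified, greedy sketch of the propose-and-verify loop. Real implementations verify all draft tokens in a single batched forward pass of the target model and use a rejection-sampling rule that exactly preserves the target distribution; `draft_next` and `target_next` are hypothetical callables returning next-token probability vectors.

```python
# Simplified speculative decoding sketch (greedy verification, not the
# distribution-preserving rejection-sampling variant used in practice).
import numpy as np

def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft model proposes k tokens; target model checks them in order."""
    proposals, ctx = [], list(prefix)
    for _ in range(k):
        tok = int(np.argmax(draft_next(ctx)))      # draft's greedy proposal
        proposals.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposals:
        target_tok = int(np.argmax(target_next(ctx)))
        if target_tok == tok:                      # target agrees: keep it
            accepted.append(tok)
            ctx.append(tok)
        else:                                      # first disagreement:
            accepted.append(target_tok)            # take the target's token
            break                                  # and stop this round
    return accepted

# The better the draft matches the target (e.g., via distillation), the more
# tokens are accepted per round and the larger the speedup.
```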
Frontier Distillation: Combining the Best of Both Worlds
- A recent paper introduces "frontier distillation," which combines on-policy sampling with speculative decoding.
- The student samples tokens, but the teacher intervenes if the student deviates too far, ensuring higher-quality feedback.
- While the interleaved sampling makes this approach more computationally expensive, the higher-quality feedback tends to produce a stronger student.
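The interleaved sampling loop might look roughly like the sketch below: the student proposes each token and the teacher steps in when the proposal falls outside its own top-k. The function names and the top-k acceptance rule are simplifying assumptions for illustration, not the paper's exact criterion.

```python
# Illustrative interleaved (teacher-intervening) sampling for distillation data.
import numpy as np

def interleaved_sample(prompt_ids, student_next, teacher_next,
                       max_new_tokens=64, top_k=25):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        s_probs = student_next(ids)                          # student proposes
        proposal = int(np.random.choice(len(s_probs), p=s_probs))
        t_probs = teacher_next(ids)
        if proposal in np.argsort(t_probs)[-top_k:]:         # teacher accepts
            ids.append(proposal)
        else:
            # Student has drifted: the teacher substitutes its own sample,
            # keeping the trajectory close to what the teacher would produce.
            ids.append(int(np.random.choice(len(t_probs), p=t_probs)))
    return ids
```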
Key Trade-offs: Online vs. Offline Distillation
- The episode concludes with a crucial comparison of online (RL-inspired) and offline (synthetic data) distillation.
- Online distillation doesn't require pre-collected data or annotations, but it's more compute-intensive due to the need for a live teacher during training.
- It addresses the train-test mismatch and is potentially more optimal for long-horizon tasks, drawing parallels to findings in the RL literature.
- Offline distillation is simpler and cheaper but may be suboptimal for complex tasks.
Reflective and Strategic Conclusion
- Distillation is evolving beyond simple cost reduction into a critical tool for deploying, and even enhancing, large language models. Researchers, practitioners, and investors should prioritize understanding these new techniques, particularly the RL-inspired methods, to capitalize on efficiency gains and emerging capabilities in areas like reasoning and agentic tasks.