This episode explores the rapidly evolving field of post-training distillation for large language models (LLMs), revealing how new techniques are dramatically improving efficiency and challenging traditional approaches.
The Genesis and Evolution of Distillation
- Fishy Aral, a Staff Research Scientist at GDM, explains the motivation behind his recent internal presentation on distillation. He notes that while the seminal 2015 paper by Hinton et al. introduced the concept, significant advancements have occurred, especially with the rise of LLMs.
- The discussion moves beyond the "purest" Hinton definition, acknowledging the expanded current usage of "distillation" to encompass various knowledge transfer methods from larger "teacher" models to smaller "student" models.
- Fishy emphasizes that distillation isn't just about cost savings; it's a crucial deployment strategy. As models like Llama 405B and GPT-4.5 become too expensive for widespread use, distillation enables practical application.
- Swyx highlights the marginal benefit versus cost argument: "It is really the marginal benefit versus cost argument, right? Like if the marginal benefit and capability is not worth the additional cost to you, then people don't want to use that bigger model."
Dark Knowledge and the Distillation Advantage
- The conversation touches on the phenomenon where distilled models often outperform models trained from scratch on the same data, a concept Hinton termed "Dark Knowledge."
- This suggests that larger models possess latent knowledge not fully captured in the training data, which distillation can transfer to smaller models.
- Alessio asks about potential changes to pre-training or preference tuning when the goal is distillation. Fishy clarifies that it depends on the desired outcome: distillation can be the final step, or a precursor to further refinement such as RLHF (Reinforcement Learning from Human Feedback).
Traditional Logit-Based Distillation
- Fishy explains the original Hinton-style distillation, which involves matching the teacher's and student's output distributions over the vocabulary, computed by applying a softmax to their "logits" (the raw pre-softmax scores each model assigns to every token).
- This is done by minimizing the Kullback-Leibler (KL) divergence, a measure of how much one probability distribution differs from another, between the teacher's and student's distributions.
- This approach, used during pre-training for models like Gemma 2, trains on "soft labels" (the teacher's full probability distribution over the vocabulary) instead of "hard labels" (one-hot encoded target tokens).
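For readers who want to see what logit matching looks like in code, here is a minimal sketch. It assumes PyTorch, a shared tokenizer between teacher and student, and logits of shape [batch, seq, vocab]; the function and tensor names are illustrative, not from the episode.

```python
# Minimal sketch of Hinton-style soft-label distillation (assumptions above).
import torch
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened token distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)        # soft labels
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (t ** 2)  # temperature rescaling from Hinton et al. (2015)
```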
Synthetic Data Distillation: A Practical Approach
- The discussion shifts to a widely used technique: synthetic data distillation. This involves generating outputs from the teacher model given a set of prompts and then using these input-output pairs to fine-tune the student model.
- This method, introduced in a 2016 paper on sequence-level knowledge distillation, is surprisingly principled: it minimizes the same KL divergence as logit-based distillation, but requires only sampling (API) access to the teacher model.
- Fishy highlights the DeepSeek approach, which uses "best-of-n" sampling: generating multiple outputs, filtering for correctness (e.g., passing test cases), and fine-tuning on the selected outputs.
- A key advantage of this method is its tokenizer agnosticism, allowing distillation between models with different vocabularies.
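A rough sketch of this rejection-sampling recipe is below. `teacher_generate`, `passes_tests`, and `finetune` are hypothetical placeholders for an inference API, an automatic verifier, and a standard SFT trainer; they are not names from the episode or from any specific library.

```python
# Sketch of best-of-n synthetic-data distillation with correctness filtering.
def build_distillation_set(prompts, n_samples=8):
    dataset = []
    for prompt in prompts:
        # Sample several candidate completions from the teacher.
        candidates = [teacher_generate(prompt, temperature=0.8)
                      for _ in range(n_samples)]
        # Keep only candidates that pass an automatic check
        # (e.g., unit tests for code, exact-match for math answers).
        good = [c for c in candidates if passes_tests(prompt, c)]
        dataset.extend({"prompt": prompt, "completion": c} for c in good)
    return dataset

# The student is then fine-tuned on the filtered pairs with ordinary SFT:
# student = finetune(student, build_distillation_set(prompts))
```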
Optimizing Synthetic Data: Top-K and Compute Considerations
- The conversation explores refinements to synthetic data distillation, such as using the "top-k" samples instead of just the top-1, similar to the MBR (Minimum Bayes Risk) technique in machine translation.
- A crucial insight emerges: generating more data from a smaller, cheaper model and filtering it can be more effective than generating less data from a larger, more expensive model.
- Fishy shares research showing that, in a compute-matched setting, generating data from a smaller Gemma 9B model and filtering it outperformed using data from a larger Gemma 27B model, even for improving the 27B model itself.
- He explains: "We found consistently that actually generating data from 9B in a compute-matched setting is always better, even better for distilling or actually improving the 27B model itself."
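To make the compute-matched comparison concrete: if sampling cost scales roughly linearly with parameter count (an assumption for illustration; real costs also depend on sequence length, batching, and hardware), a 9B model can produce about 3x as many candidate solutions as a 27B model for the same budget, giving the correctness filter more to work with.

```python
# Back-of-the-envelope compute-matched sampling budget (illustrative only).
def samples_at_matched_compute(samples_from_large, large_params_b=27.0,
                               small_params_b=9.0):
    """How many small-model samples fit in the budget of N large-model samples."""
    return samples_from_large * large_params_b / small_params_b

print(samples_at_matched_compute(4))  # 4 samples from 27B ~= 12 samples from 9B
```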
RL-Inspired Distillation: Addressing Train-Inference Mismatch
- The discussion introduces a novel approach: RL-inspired distillation, addressing the "exposure bias" or train-inference mismatch inherent in traditional distillation.
- This mismatch arises because, during training, the student model learns from fixed teacher-generated outputs, while at inference time, it generates tokens autoregressively.
- Drawing inspiration from DAgger, an imitation-learning technique from the RL literature, this method samples outputs from the student model, obtains the teacher's logits for those student-generated sequences, and minimizes the divergence between student and teacher on them.
- This creates a feedback loop where the student continuously asks the teacher for guidance, correcting its mistakes and learning from scenarios it might encounter during inference.
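A minimal sketch of one such on-policy step is below, assuming PyTorch and Hugging Face-style causal LMs that share a tokenizer. It is a simplification (for instance, prompt tokens are not masked out of the loss), the names are placeholders, and the token-level reverse KL used here is one natural choice of divergence; the following sections discuss why the direction matters.

```python
# Sketch of one on-policy distillation step: sample from the student,
# get teacher feedback on that sample, minimize a token-level divergence.
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, optimizer,
                           max_new_tokens=128):
    with torch.no_grad():
        # The student generates its own continuation (on-policy data);
        # no gradients flow through the sampling itself.
        rollout = student.generate(prompt_ids, do_sample=True,
                                   max_new_tokens=max_new_tokens)
        teacher_logits = teacher(rollout).logits      # teacher's feedback

    student_logits = student(rollout).logits          # gradients flow here
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher), averaged over positions. In practice
    # you would mask the prompt and only distill on the generated tokens.
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```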
Principled Derivation and Practical Implementation
- Fishy reveals a surprising mathematical connection: by reversing the direction of the KL divergence, one can derive this on-policy distillation approach, demonstrating its principled nature.
- This method avoids full RL by not backpropagating through the sampling distribution, simplifying the training process.
- Empirical results on small-scale tasks show that on-policy distillation (sampling from the student) often outperforms both logit-based and synthetic data distillation.
- Fishy notes that this technique was used in Gemma 2's post-training.
Mode Covering vs. Mode Seeking: A Trade-off
- The conversation delves into the implications of different KL divergence directions, highlighting the concepts of "mode covering" and "mode seeking."
- Traditional forward-KL distillation (mode covering) forces the student to spread its probability mass to cover everything the teacher might generate, which can leave mass in regions where the teacher itself has low probability; on-policy, reverse-KL distillation (mode seeking) instead concentrates the student on a single high-probability mode of the teacher's distribution.
- This leads to a trade-off between diversity and performance: mode covering yields higher diversity but lower performance, while mode seeking prioritizes performance at the cost of diversity.
- A practical solution is to use a mixture of both KL directions, balancing diversity and performance.
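In symbols, with p_T the teacher and p_S the student, the two KL directions and one simple way to mix them (the mixture weight lambda is illustrative, not a value quoted in the episode):

```latex
% Forward KL (mode covering): the student must put mass wherever the teacher does.
\mathrm{KL}(p_T \,\|\, p_S) = \sum_y p_T(y)\,\log\frac{p_T(y)}{p_S(y)}

% Reverse KL (mode seeking): the student is penalized for mass the teacher lacks.
\mathrm{KL}(p_S \,\|\, p_T) = \sum_y p_S(y)\,\log\frac{p_S(y)}{p_T(y)}

% A convex mixture trades off diversity (forward) against performance (reverse).
\mathcal{L}(\lambda) = \lambda\,\mathrm{KL}(p_T \,\|\, p_S) + (1-\lambda)\,\mathrm{KL}(p_S \,\|\, p_T)
```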
Simplifying Implementation: Leveraging RL Frameworks
- Fishy provides a remarkably simple recipe for implementing this RL-inspired distillation:
- Take an existing RL fine-tuning pipeline (e.g., one built for RLHF or RLAIF).
- Turn off the reward term.
- Replace the KL divergence anchor policy with the larger teacher policy.
- This highlights the close relationship between RL and this form of distillation, suggesting that existing RL infrastructure can be readily adapted.
- It's also possible to combine RL and distillation, leveraging both reward maximization and teacher guidance.
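A sketch of the repurposed objective is below. The coefficient names and tensor shapes are assumptions for illustration: with reward_coef = 0 and the KL anchored to the teacher, the loss reduces to pure on-policy distillation; with a nonzero reward_coef it combines reward maximization with teacher guidance, as in the last bullet.

```python
# Sketch: an RLHF-style loss repurposed for distillation
# (inputs are assumed to be torch.Tensors; shapes noted in the docstring).
def rl_or_distill_loss(seq_logprob, student_logp, teacher_logp, rewards,
                       reward_coef=0.0, kl_coef=1.0):
    """seq_logprob: [batch] log-prob of each sampled sequence under the student.
    student_logp, teacher_logp: [batch, seq, vocab] token log-distributions.
    rewards: [batch] scalar reward per sampled sequence."""
    # REINFORCE-style reward term; turned off (coef 0) for pure distillation.
    reward_term = -(rewards * seq_logprob).mean()
    # KL anchor: in standard RLHF this anchors to a frozen reference policy;
    # pointing it at the *teacher* turns the trainer into a distiller.
    kl_term = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()
    return reward_coef * reward_term + kl_coef * kl_term
```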
Speculative Decoding and AI Overviews
- The discussion touches on speculative decoding, a technique in which a smaller draft model proposes several tokens and the larger target model verifies them in parallel, accepting or rejecting them to accelerate inference.
- Distillation can improve speculative decoding by pulling the smaller model's distribution closer to the larger model's, which raises the token acceptance rate and therefore the speedup.
- Fishy reveals that this technique, combined with distillation, is used in Google's AI Overviews.
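For intuition, here is a heavily simplified, greedy sketch of the propose-and-verify loop. Real implementations verify all draft tokens in a single batched forward pass of the target model and use a rejection-sampling rule that exactly preserves the target distribution; `draft_next` and `target_next` are hypothetical callables returning next-token probability vectors.

```python
# Simplified speculative decoding sketch (greedy verification, not the
# distribution-preserving rejection-sampling variant used in practice).
import numpy as np

def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft model proposes k tokens; target model checks them in order."""
    proposals, ctx = [], list(prefix)
    for _ in range(k):
        tok = int(np.argmax(draft_next(ctx)))      # draft's greedy proposal
        proposals.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposals:
        target_tok = int(np.argmax(target_next(ctx)))
        if target_tok == tok:                      # target agrees: keep it
            accepted.append(tok)
            ctx.append(tok)
        else:                                      # first disagreement:
            accepted.append(target_tok)            # take the target's token
            break                                  # and stop this round
    return accepted

# The better the draft matches the target (e.g., via distillation), the more
# tokens are accepted per round and the larger the speedup.
```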
Frontier Distillation: Combining the Best of Both Worlds
- A recent paper introduces "frontier distillation," which combines on-policy sampling with speculative decoding.
- The student samples tokens, but the teacher intervenes if the student deviates too far, ensuring higher-quality feedback.
- While the interleaved sampling makes this approach more computationally expensive, the higher-quality feedback tends to produce a stronger student.
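The interleaved sampling loop might look roughly like the sketch below: the student proposes each token and the teacher steps in when the proposal falls outside its own top-k. The function names and the top-k acceptance rule are simplifying assumptions for illustration, not the paper's exact criterion.

```python
# Illustrative interleaved (teacher-intervening) sampling for distillation data.
import numpy as np

def interleaved_sample(prompt_ids, student_next, teacher_next,
                       max_new_tokens=64, top_k=25):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        s_probs = student_next(ids)                          # student proposes
        proposal = int(np.random.choice(len(s_probs), p=s_probs))
        t_probs = teacher_next(ids)
        if proposal in np.argsort(t_probs)[-top_k:]:         # teacher accepts
            ids.append(proposal)
        else:
            # Student has drifted: the teacher substitutes its own sample,
            # keeping the trajectory close to what the teacher would produce.
            ids.append(int(np.random.choice(len(t_probs), p=t_probs)))
    return ids
```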
Key Trade-offs: Online vs. Offline Distillation
- The episode concludes with a crucial comparison of online (RL-inspired) and offline (synthetic data) distillation.
- Online distillation doesn't require pre-collected data or annotations, but it's more compute-intensive due to the need for a live teacher during training.
- It addresses the train-test mismatch and is potentially more optimal for long-horizon tasks, drawing parallels to findings in the RL literature.
- Offline distillation is simpler and cheaper but may be suboptimal for complex tasks.
Reflective and Strategic Conclusion
- Distillation is evolving beyond simple cost reduction into a critical tool for deploying, and even enhancing, large language models. Researchers, practitioners, and investors should prioritize understanding these new techniques, particularly the RL-inspired methods, to capitalize on efficiency gains and emerging capabilities in areas like reasoning and agentic tasks.