
Author: Weights & Biases
Date: October 2023
Quick Insight: This summary is for developers looking to move beyond massive API costs by building specialized, efficient models. You will learn the specific recipes for turning raw base models into high-performance reasoning agents.
Base models are just expensive autocomplete engines that require a high-intensity boot camp to become useful. Maxime Labonne from Liquid AI explains how post-training converts raw statistical power into functional tools. The secret lies in moving past simple completion toward structured reasoning and human alignment.
"Data quality is really what's the most important during post training."
"DPO makes models that humans like because they sound more like them."
"We wanted the model to be good at math, but we also wanted it to compress the reasoning traces."
Podcast Link: Click here to listen

Hi everyone. Thank you. So in this presentation I'm going to talk about post-training models. At Liquid AI we create small language models. We've released, I counted this morning, 17 models since July, ranging from 350 million to 8 billion parameters. We have text models, vision models, audio models, and task-specific models, including retrieval. If you're interested, please check our Hugging Face page. Besides that, I've also done some other work, like books and projects in the open source community.
Let's get started with the question: what is post-training? With post-training we start with a pre-trained model, a base model that has been pre-trained on trillions and trillions of tokens. Some of our LFM2 models, for example, have been pre-trained on 10 trillion tokens. This model is only able to do completion: you ask it a question and, instead of answering it, it completes it, which is not very helpful. That is why we do post-training. Post-training is here to turn this model that can only do completion into a model that is able to answer questions and follow instructions.
The first step is called supervised fine-tuning (SFT), and this is where we train the model on at least 1 million samples with instructions and answers. Those are usually conversations, so the model learns to answer them and follow the instructions. Then, depending on what you want to do, you can transform it into a chat model that is more optimized for interacting with humans, or a reasoning model that is better at reasoning-heavy tasks like math and code. In this presentation we're going to get an overview of all these techniques.
When I talk about post-training I also talk about fine-tuning. To me it's mostly a difference between general purpose and task specific, and also a difference in the number of samples. If you do something very general purpose, you will need a lot of samples. If you do something more task specific, for example a spell checker, you don't need that many samples.
There are different reasons to use fine-tuning. You can use it to change a model's tone and format. You can add knowledge, although only somewhat superficially; you can't really add a new language during supervised fine-tuning, for example. You can also use it when you have a bigger model that is really helpful but too costly or too slow: you can distill it into a smaller model, which also increases the quality of that smaller model's output.
In terms of fine-tuning libraries, I would recommend TRL, which is maintained by Hugging Face. It's a really good one with a lot of up-to-date algorithms; for example, they just implemented on-policy distillation from Thinking Machines Lab, which is really cool if you want to play with it. There's also Axolotl, which has reusable YAML configurations and a lot of utilities; I really like this one. And there's Unsloth, which is super popular, especially for single-GPU fine-tuning, and also has a lot of utilities.
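As a rough illustration, here is a minimal SFT run with TRL's SFTTrainer; the model and dataset identifiers are placeholders, not the exact ones used at Liquid AI.

```python
# Minimal sketch of supervised fine-tuning with TRL's SFTTrainer.
# Model and dataset identifiers below are placeholders.
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# A conversational dataset with a "messages" column (chat format).
dataset = load_dataset("your-org/your-sft-dataset", split="train")

trainer = SFTTrainer(
    model="your-org/your-base-model",          # any causal LM checkpoint
    args=SFTConfig(output_dir="./sft-model"),
    train_dataset=dataset,
)
trainer.train()
```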
In terms of hardware requirements, you can use pretty much anything you want. Google Colab is nice if you want to start playing with it, but otherwise you can also use cloud GPUs, for example on RunPod or CoreWeave, or local GPUs if you have some.

Let's talk about supervised fine-tuning. In terms of data, this is the structure: you have an optional system prompt, you have a user instruction (what the user types to the model), and finally you have the expected answer from the model. During training, we take the system prompt and the instruction as context, and we only calculate the loss on the expected output. This is why it's really important to have very good outputs: this is what the model learns during training.
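For concreteness, here is what one training sample in that conversational format could look like (an illustrative example, not one of our actual samples):

```python
# One SFT sample in the chat format described above (illustrative).
# The loss is computed only on the assistant turn; system + user are context.
sample = {
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a base model is in one sentence."},
        {"role": "assistant", "content": "A base model is a language model pre-trained on raw "
                                         "text to predict the next token, before any instruction tuning."},
    ]
}
```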
This leads us to the question: what is a good dataset? A good dataset can be described in different ways. What I like to say is that the samples must be accurate, meaning they have to be factual: if you ask a question, the answer has to be right. They have to be very diverse, especially if you do general-purpose fine-tuning, which means covering a wide range of topics. And finally, they have to be complex enough to challenge the model: if your samples are too easy, the model already handles them and won't learn much from them.
Here's an example of a data generation pipeline for instruction following. Instruction following is when you ask your model a question but with some constraints, for example "write it in two paragraphs." Here is an example with two constraints: first, the answer should be in English, and second, it should be all lowercase. We start with seed data made of prompts and constraints like these. We can query an LLM, and it can be any LLM you want, ChatGPT or something else. Then we run tests to check: is it in English? Is it lowercase? If it passes, we keep the prompt and the generated answer. And then you can do other things like decontamination, to make sure you're not training on the test set.
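A minimal sketch of that verification step could look like this; the English check is a crude heuristic stand-in for a real language detector, and the example pairs are made up:

```python
# Keep a generated answer only if it satisfies the stated constraints.
def passes_constraints(answer: str) -> bool:
    is_lowercase = answer == answer.lower()
    looks_english = all(ord(ch) < 128 for ch in answer)  # rough ASCII proxy for "in English"
    return is_lowercase and looks_english

generated_pairs = [
    ("describe paris in one sentence, all lowercase, in english",
     "paris is the capital of france, known for the eiffel tower."),
    ("describe paris in one sentence, all lowercase, in english",
     "Paris is the capital of France."),  # fails the lowercase check
]
kept = [(prompt, answer) for prompt, answer in generated_pairs if passes_constraints(answer)]
```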
Here's an example of a dataset I created in the open source, and you can see that instruction following is actually only 4% of it. You also need different categories if you do general-purpose post-training, for example math, chat, and code. It really depends on what you want your model to be good at.
In terms of SFT techniques, SFT is very simple: it's the same objective as pre-training. What you are really doing during supervised fine-tuning is trying to predict the next token every time. The format is different because now we have conversations, and you can also mask parts of the system prompt and the user prompt, as we said before.
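As a toy illustration of that masking, the prompt tokens get a label of -100 (the index PyTorch's cross-entropy loss ignores), so only the answer tokens contribute to the loss; the token IDs below are made up:

```python
# Loss masking for one SFT sample (illustrative token IDs).
prompt_ids = [15, 87, 43, 902]   # system + user turns: context only
answer_ids = [311, 67, 5, 2]     # assistant answer: what the model learns

input_ids = prompt_ids + answer_ids
labels = [-100] * len(prompt_ids) + answer_ids  # -100 is ignored by the loss
```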
With full fine-tuning, you just take your model and train it on the supervised fine-tuning data. That's good because it maximizes quality, but it's also very heavy. For post-training at a company that's okay, but if you're fine-tuning on your own it might be too much and you won't have the hardware for it. That's why you can use parameter-efficient techniques such as LoRA. With LoRA, instead of retraining all the parameters, you freeze them, add small matrices at each module in the model, and only train these small matrices. So instead of training 100% of the parameters, you only train the equivalent of about 0.1% of them. It's a lot faster and requires less VRAM. But it still requires quite a lot of VRAM to load the model in memory, which is why we have QLoRA.
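Here is a minimal LoRA setup with the PEFT library; the model id is a placeholder, and the target modules depend on the architecture:

```python
# Sketch of LoRA with PEFT: freeze the base weights and train small
# low-rank adapter matrices instead.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")  # placeholder id
lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # depends on the model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```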
With QLoRA, instead of loading the full-precision model, we only load a quantized version of it, in 4-bit precision. This is nice because it requires less VRAM, but the problem is that it also degrades quality. So if you can afford LoRA, I really recommend using LoRA instead.
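A common way to set this up, sketched here, is to load the base model in 4-bit with bitsandbytes and then attach LoRA adapters on top as in the previous snippet (the model id is again a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model",             # placeholder id
    quantization_config=bnb_config,
)
```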
In terms of training parameters, here are some very important ones. The most important one is the learning rate, which determines the strength of the parameter updates; in my personal experience, this is the one you want to tune the most. Then you have the number of epochs, the number of passes over the dataset. You have the batch size: you want at least a few samples per batch so the gradient is not too noisy. There's the max length: for example, at Liquid AI, because our model architecture is super efficient, we only train on a 32k context length and never less than that. And there are the optimizer and the attention mechanism; those I would say you don't really need to tune. AdamW is a very strong default and FlashAttention-2 is a very strong candidate, so you don't have to change them.
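Sketched as a TRL SFTConfig (which inherits from transformers' TrainingArguments), those knobs might look like this; the values are illustrative starting points rather than the ones used at Liquid AI, and older TRL versions call `max_length` `max_seq_length`:

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="./sft-run",
    learning_rate=2e-5,              # the knob to tune first
    num_train_epochs=3,              # passes over the dataset
    per_device_train_batch_size=8,   # keep batches large enough to reduce noise
    max_length=32_768,               # training context length
    optim="adamw_torch",             # AdamW is a strong default
    bf16=True,
)
# FlashAttention-2 is enabled at model-loading time, e.g.
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
```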
Once your run is ongoing, you can monitor the experiment, and what you really want to track is the training loss. Here you see a bad example and a good example. In the bad example, there's a loss spike, which is really bad: if you try running the model after that, you'll see it's not coherent at all. The problem I had there was that my learning rate was too high; that's what created the loss spike. So I restarted with a lower learning rate and got this smooth curve, which is a good indication that the training is successful.
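With TRL or transformers trainers, one way to get that loss curve into Weights & Biases is simply to point the config at it; the run name below is hypothetical:

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="./sft-run",
    report_to="wandb",        # stream metrics to Weights & Biases
    logging_steps=10,         # log the loss often enough to catch spikes early
    run_name="sft-lr-2e-5",   # hypothetical run name
)
```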
Let's move on to DPO. DPO is a preference-alignment algorithm that lets you modify the behavior and the style of the model. During preference alignment, you provide chosen answers and rejected answers, and you want the model to act more like the chosen ones and less like the rejected ones. It's a contrastive training setup, which is really good at capturing subtle elements that might be missing from the supervised fine-tuning round.
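A single preference sample in that chosen/rejected format could look like this (an illustrative example):

```python
# One DPO preference sample: the same prompt with an answer to reinforce
# ("chosen") and an answer to push away from ("rejected").
preference_sample = {
    "prompt": "Explain overfitting to a beginner.",
    "chosen": "Overfitting is when a model memorizes its training data instead of "
              "learning general patterns, so it performs poorly on new data.",
    "rejected": "Overfitting is when the model is too fit. You should avoid it.",
}
```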
Here's a very popular dataset I made on Hugging Face. It combines a lot of different DPO datasets; I scored them, only kept the highly scored chosen answers, and added some filtering, which makes a pretty good dataset if you're interested in DPO. Here's a data generation example with something called UltraFeedback, which comes from a paper: for preference data, you can query multiple LLMs, score every answer you get, then pick the highest-scoring one as the chosen answer and the lowest-scoring one as the rejected answer. And here's a diagram where you can see the policy model, the model currently being trained, compared with a reference model that provides a baseline. What you're basically trying to achieve is to increase the probability of the policy model outputting the preferred answers and decrease the probability of it outputting the rejected ones. It's a very simple training setup, but it's very effective.
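A sketch of that UltraFeedback-style selection, with hypothetical scores (in practice they would come from an LLM judge or a reward model):

```python
def build_preference_pair(prompt: str, scored_answers: list[tuple[str, float]]) -> dict:
    # Keep the best-scored answer as "chosen" and the worst as "rejected".
    ranked = sorted(scored_answers, key=lambda pair: pair[1], reverse=True)
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}

pair = build_preference_pair(
    "What is gradient descent?",
    [("An optimization method that follows the negative gradient of the loss.", 9.1),
     ("A kind of neural network.", 3.2),
     ("It descends gradients, obviously.", 4.5)],
)
```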
In terms of training parameters, we have the same ones plus an additional one, the beta parameter, which controls the importance of the reference model. If it's very small, you're saying the reference model doesn't really matter and the policy can explore a lot; if it's high, you're saying the opposite: please stay close to the reference model, do not go crazy. That can help you do subtle DPO runs or, on the contrary, something much more exploratory. And as the evaluations show, what DPO really gives you is good chat models: models that humans like because they sound more like them.
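As a minimal sketch with TRL's DPOTrainer, beta is set in the DPOConfig; the model and dataset names are placeholders:

```python
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

# A preference dataset with "prompt", "chosen" and "rejected" columns.
dataset = load_dataset("your-org/your-preference-dataset", split="train")

trainer = DPOTrainer(
    model="your-org/your-sft-model",   # the policy model, starting from the SFT checkpoint
    args=DPOConfig(
        output_dir="./dpo-model",
        beta=0.1,                      # small beta = more freedom to drift from the reference
    ),
    train_dataset=dataset,
)
trainer.train()
```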
On top you have the Arena Elo from LMArena, which is a popular measure of human preference. If you check the correlation between this metric and others like MMLU or math benchmarks, you'll see it's only weakly correlated. Those really are two different things: you can make a model that is good at math but that humans don't like, or a model that humans like but that is bad at math.
Let's now talk about reinforcement learning with GRPO, a popular algorithm that was popularized by DeepSeek with DeepSeek-R1. In terms of data format, for GRPO and reinforcement learning algorithms in general, I would divide it into two components. First you have the instruction data. It's pretty much the same thing as before; the only difference is that we have think tokens, and the think tokens contain the reasoning trace. The model is trained to first output the think tokens and then output the final answer. That's really useful because you can do a first round of SFT just on that, which warms up the model to produce this specific structure, and then you can do real reinforcement learning. And here, instead of having an expected output, I have a ground truth, and this ground truth is just the real answer you expect from the model.
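Those two components could look roughly like this (illustrative samples, with the reasoning wrapped in think tags):

```python
# SFT sample with an explicit reasoning trace inside think tokens.
sft_reasoning_sample = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {"role": "assistant",
         "content": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think>\n408"},
    ]
}

# RL sample: only a prompt and a verifiable ground-truth answer.
rl_sample = {
    "prompt": "What is 17 * 24?",
    "ground_truth": "408",   # used by the reward function, not shown to the model
}
```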
Here's an example of how to create the SFT data. It's a simple data generation pipeline where you already have prompts and real answers, and you just query a reasoning model, DeepSeek-R1 for example. Then you check whether the final answer it outputs matches the ground truth from your seed data, filter out all the wrong answers, and then do deduplication and check that the format is respected. That can create a really good reasoning dataset. This is something we've done, most recently with a model called LFM2-350M-Math. It's a tiny model, 350 million parameters, it's really good at math, and this is the recipe we used to build it.
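A toy sketch of that filtering step; `extract_answer` is a simplified stand-in for real answer parsing (for example, extracting a boxed answer), and the generated traces below are made up:

```python
def extract_answer(completion: str) -> str:
    # Assume the final answer is whatever follows the closing think tag.
    return completion.split("</think>")[-1].strip()

def keep_sample(completion: str, ground_truth: str) -> bool:
    return extract_answer(completion) == ground_truth.strip()

generated_traces = [
    ("What is 17 * 24?", "<think>17*20 + 17*4 = 340 + 68 = 408</think>\n408", "408"),
    ("What is 17 * 24?", "<think>17 * 24 = 398</think>\n398", "408"),  # wrong: filtered out
]
filtered = [(prompt, completion) for prompt, completion, truth in generated_traces
            if keep_sample(completion, truth)]
```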
First we have the SFT step with the instruction dataset. This is really big, over 4 million samples, so that's a lot of data, and we only used open source data to do it, because there are a lot of good reasoning datasets in the open source community. Then we did reinforcement learning on a collection of open source datasets. That one is a lot smaller, and we used some tricks, which you can see in our blog post, to select only the best data.
With GRPO you have something very specific called a reward function. This is the function that provides the reward signal to your model: negative if the answer is not what you expect, positive if it is. Here's a very simple reward function I created. It tells the model to only output answers of about 50 characters: if the answer deviates from that, it gets a negative reward; if it's around 50 characters, the reward is positive.
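A minimal sketch of such a length-based reward function, rewarding answers close to a 50-character target; the signature (a list of completions in, a list of floats out) follows the pattern TRL's GRPOTrainer expects for custom reward functions:

```python
def reward_target_length(completions: list[str], target: int = 50, **kwargs) -> list[float]:
    # 1.0 at exactly the target length, going negative as answers drift far from it.
    return [1.0 - abs(len(completion) - target) / target for completion in completions]

print(reward_target_length(["short and sweet answer, roughly fifty characters.",
                            "way too long " * 20]))
```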
Let me show you: during training I track the length of the answers, and you can see it starts at around 85–90 characters. It goes up at first, but when it does it gets this negative reward signal, so it comes back down and stabilizes around 50 characters. It's a very simple training objective, and it could be used for a summarization model, for example, but it's very effective and it's done with a tiny model. You don't need really big models to do this; you can use a 350 million parameter model, for example.
Something interesting is that the loss normally goes down, but here it goes up. This is specific to GRPO because it uses KL divergence, which measures the difference between the reference baseline and the model you're training. If it's zero, the model isn't learning anything, so you want it to go up a bit: that's how you know you're actually training the model and it's not just sitting idle.
In terms of evaluation, here I have a table of results for the model we made. We used some popular benchmarks: GPQA Diamond, a popular one for knowledge and science; MATH-500 for math; and AIME24 and AIME25, which are math competitions. You can see that we started from the base model, with not excellent performance overall. The distilled model, the one that has been supervised fine-tuned, goes up quite a lot. And in the last round, the reinforcement learning one, we had two objectives: we wanted the model to be good at math, but we also wanted it to compress the reasoning traces, because if a tiny model outputs 32,000 tokens, inference takes very long, and we didn't want that. We wanted a model that is quite concise in its reasoning. It was really successful: the model now gets pretty much the same performance but only uses around 4,000 tokens instead.
So this is a way you can use reinforcement learning to enforce constraints on how the model operates.
And that's it. As a conclusion, I would say that we saw all the main steps of post-training. You start with the dataset, and I think this is where you should spend the most time. Data quality is really what's the most important during post-training, and any time you have should be spent first checking the quality of the data; that can be very manual, or it can be automated with an LLM judge, for example. Then there are the fine-tuning algorithms we saw. There's one I haven't mentioned, called model merging: it's not really training, but it's a very nice way of taking different checkpoints and averaging their weights together to build a stronger model. And then you have evaluations to track the quality of the model you've built. These will give you a feedback signal, for example "maybe my model is not that good at math," so you can add more math data or increase its quality, do another round of fine-tuning, and evaluate again. Post-training is really a cycle: you will never zero-shot a perfect model. It's always this cycle of iteration and improvement, and that is really core to it. Thank you for your attention.