
Author: Pratyush Maini | Date: October 2025
Quick Insight: Frontier AI models are not just fine-tuned for reasoning; they're pre-trained with self-correction data. This shift means specialized data strategies are now essential for building competitive, smaller models.
The AI world often treats data as a commodity, but Pratyush Maini, a founding team member at Datology and a CMU PhD candidate, argues it's the secret sauce. His investigative work, sparked by a curious "seahorse emoji" bug in frontier models, reveals a fundamental shift in how leading AI labs are building intelligence.
"Somehow data is still one of the more undervalued research topics."
"The self-reflection data has now actually become pretty much core to the training of all frontier models."
"The old idea of you have a foundation model and you can just fine-tune it and get the desired capability is kind of like we are past that stage now."
Podcast Link: Click here to listen

Hi, welcome to the Latent Space lightning pod. We have Pratyush here from Datology, one of the founding team members. Listeners to the pod will remember our episode with Ari Morcos; I think it was one of the top episodes of 2025. I think not enough people are looking at the data, and we want to encourage engineers and researchers to think more deeply about data. Datology is one of the best teams for that; it's basically all data nerds.
Pratyush, your bio is like 70% California, 20% Pittsburgh, 10% Delhi. You're in the South Bay right now at the office, right?
That's correct. I'm at the office.
How else do you introduce yourself? Any other things that people should know about you before we get started?
No, that was pretty good. Thank you so much for having me. I am basically finishing up my PhD at CMU and have been part of the founding team at Datology for more than two years now. It's been a while that we've been building this company.
I am very excited about data-centric aspects of AI, and about how we can build responsible and very high-performance systems by focusing on the data layer. Somehow data is still one of the more undervalued research topics, and that's been the focus of my PhD and now also a lot of my time at Datology. You put it well that we're all data nerds. We love looking at data in general.
I think one of the coolest parts about looking at data is that we can share strange artifacts from it. In fact, we even made a channel called "data is weird" on our Slack where we keep posting screenshots and snippets of all of these random artifacts that you see within datasets, and they keep surprising you. The internet is so weird in so many ways; you would never expect that the depths of the internet contain this stuff, and we're actually feeding these kinds of data points to our models. So I think there's a shared passion around the team for really looking into data, and an investigative approach in general.
Okay. Two follow-up questions and then we can go into the paper. One, what's your thesis going to be about? What's the overall theme of your works?
It's called responsible and efficient use of web-scale data for pre-training. When you're training on this messy web data and shoving all of it into the model, how can we train efficiently, meaning only use the things that are useful? But there's also the responsible angle: how can we make sure we attribute data and give the right credit to the artists who put it online, and how can we make sure the data is safe for model behaviors?
So I do a lot of work on the evaluation aspects of this as well but then also on the efficiency and performance aspects.
Okay. And then just from your "data is weird" channel, any fun recent anecdotes? Just pull something out. We're going to go into the GPT training data stuff, but I'm just kind of curious what gets posted there.
So some of the cool things come from researcher curiosity. One of the recent things we were exploring was how various models, for instance the Qwen models, do completions on certain questions. We took some example questions from JEE, which is an important olympiad-style exam in India, and we would seed the models with just the first few words of a question, like "the light bulb," and they would literally complete the actual question. Then we started comparing this across families of models, and we consistently saw how many of these models today are being massively fine-tuned on exam problems, massively overfit. Imagine if I ask you "the light bulb": what comes next in your mind won't be a question from an olympiad or a practice exam, right? But these models give back the full question, and they also give the answer options after that.
And so that is one of the fascinating findings from "data is weird" from the last month. Nobody is benchmarking on JEE, there's no JEE-bench, and similar things have been reported for various frontier models. JEE is just one example; I'm sure this happens across different exams. But it's very clear that the last stage of training for many of these models includes a massive amount of exam or benchmark data, because a model will not behaviorally complete exam questions, options and all, if it has not really seen them at the end of training for multiple epochs.
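To make the probe concrete, here is a minimal sketch of the kind of completion test described above, assuming a Hugging Face checkpoint. The model name, prompt, and reference question are illustrative placeholders, not the actual JEE items discussed:

```python
# Minimal sketch of a benchmark-contamination probe: feed a model only the
# first few words of a known exam question and check whether its greedy
# completion reproduces the rest of the question verbatim.
# Model name, prompt, and reference string are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"           # hypothetical choice; swap in any checkpoint
PROMPT = "The light bulb"                 # first words of a (hypothetical) exam question
REFERENCE = "The light bulb in a torch"   # rest of the known question, for comparison

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
# Greedy decoding: memorized text tends to be reproduced deterministically.
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
completion = tokenizer.decode(output[0], skip_special_tokens=True)

# Crude contamination signal: how heavily does the completion overlap with the
# reference question (including any answer options)?
overlap = len(set(completion.lower().split()) & set(REFERENCE.lower().split()))
print(completion)
print(f"word-level overlap with reference: {overlap}")
```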
And you're saying there's essentially perfect memorization. I'm wondering if you have ablated this across model sizes: does an 8B do the same as a 30B? Probably not, right? I'm curious about the memorization power.
That's a great question. I think it's a mix of two things: one is the recency of seeing that data, and the second is the size of the model. We don't see this phenomenon in the earlier versions like Qwen 1.5 or even Qwen 2, but we start seeing it in Qwen 3, so there's something about a significantly higher weight on benchmark, JEE-style evaluation questions during the final stages of training in these recent models. And then we definitely see the effect of bigger models memorizing to a significantly higher degree, though I would say that with the coming of the MoE models we're starting to see this phenomenon even in models with a smaller number of active parameters. A phenomenon that we would observe at a 72B dense model in the past is probably now visible in a 20B-active MoE as well.
So in some way the active parameter count might be smaller in the MoE setting, but the model can still regurgitate exact memorization.
Yeah, it's interesting: when memorization is done through the router, the MoE actually ends up becoming like an index, where you look up the thing you need and it's routed to the expert that has memorized the answer.
Exactly. That's a way to cheat.

Okay. So, tell us about the seahorse emoji and GPT.
So around October last year, I saw on Twitter that a few people had posted this very fascinating result where they would ask a frontier model such as GPT, "is there a seahorse emoji," and that's the only question they would ask. What you would see is that the model keeps going on and on, and that phenomenon really fascinated me. Imagine asking GPT a question and the answer does not stop. That was really surprising to me, because these are frontier models; they should not have this problem of continually going back and forth on an answer.
If you look at it, it says yes in the beginning, then it starts to think more about it, then it says no, and then goes yes and no and yes and no. So there's a huge amount of internal turmoil in the model, a phenomenon where it just keeps going and does not really stop. So then I tried this across models, and I consistently saw across Grok and GPT and Gemini this phenomenon where models will self-correct themselves quite a bit. And these were non-thinking models, the standard auto mode or the smallest fast model, and they would still do this. I was quite perplexed: how is it possible that these models, which are so strong that they can solve IMO problems, get into this self-narrative, existential dilemma when you ask them whether there is a seahorse emoji? This was also a time when I was thinking a lot about reasoning data, and whether models should learn more self-correction behavior during pre-training or not. My initial thought was that this behavior we were seeing, of models flipping between yes and no, is very reminiscent of all of the backtracking and self-reflection behavior that is becoming very predominant in modern AI models. There are a lot of papers showing that if models learn how to correct their own wrong traces in their pre-training or post-training data, they become much stronger at reasoning. So I was thinking this is very reminiscent of that, and it somehow feels like there is an emergence of that behavior even when I did not ask for reasoning.
So I thought it would be very interesting to see what really causes this behavior, because we don't really understand it, and maybe it has some tie-in with the prevalence of reasoning data or self-correcting monologues within the mid-training or pre-training phase of non-thinking models. My first thought was: is this something that happened always, or is it just a GPT-5 phenomenon? People had been noting this in October, but potentially people asked this of GPT a year ago as well: would those models produce this kind of self-correcting monologue? The interesting thing, and I've put the number of output tokens on the y-axis of this graph and the release date of the model on the x-axis, is that I ran the OpenAI API across models released from 2023 to 2025, and all the models until December 2024 had very terse, short responses to the question "is there a seahorse emoji." They would either say it exists, or sometimes say it does not exist, but they would never go into this self-correction loop where they say one answer in the beginning and a different answer at the end.
And so something very interesting was happening, because we know that December 2024 is when the o1 model came out, which is the first time a model that could think and self-correct was available to the public. Then, four months after that, we notice that the GPT-4.1 series is the first series where the response length suddenly started to increase. The ChatGPT-4o model update in the month of May again had this recursive behavior, and it became significantly more prevalent by the time of GPT-5.
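As a rough illustration of the sweep Pratyush describes, a sketch like the following, using the OpenAI Python client, would collect the output-token counts that form the y-axis of that graph. The model list here is an illustrative subset, not the exact set he ran:

```python
# Sketch of the sweep described above: ask each model "Is there a seahorse
# emoji?" and record how many output tokens it produces. Plotting completion
# tokens against each model's release date reproduces the kind of graph
# discussed in the episode.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4o", "gpt-4.1", "gpt-5"]  # illustrative subset
QUESTION = "Is there a seahorse emoji?"

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    # usage.completion_tokens is the y-axis value in the plot described above
    print(f"{model:>16}: {resp.usage.completion_tokens} output tokens")
```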
And just to insert a question here: is this specifically the cutoff date that they have declared themselves? Are you ordering by cutoff date or by release date?
This is ordered by release date, so it's unclear what the cutoff date would be, but it's very likely that the model has seen samples very close to that date. For instance, ChatGPT-4o is an interesting case: its original release was somewhere around December 2024, but then they had an update in May, and only the May update shows this behavior; the previous version does not. So they potentially post-trained the model on these kinds of reasoning traces, and it then started showing the self-reflective behavior.
I think it's important to point out why the seahorse emoji is an interesting question and why it really brings out this behavior: the existence of the seahorse emoji is a kind of Mandela effect, where a lot of people on Reddit say, hey, I have seen the seahorse emoji, it looked blue and the horse was pointing to the left, or something like that. So there is enough doubt, enough conversation on the internet, with some people saying yes and some saying no, that it acts like a trigger to elicit that thinking response. Which is great for us, because we do want these edge cases that help us do this discovery, this investigative analysis.
So by now what we have figured out is that around the GPT-4.1 and then GPT-5 series, something is happening that leads the models to do a lot of self-reflection. Now the question is what happened over here that made models start doing this. For me this was interesting because I wanted to know whether the frontier model labs also care about putting reasoning data in the pre-training or mid-training phases. You might have seen a lot of papers coming out these days saying you should do reasoning pre-training, or RL pre-training, all of those papers saying to put thinking tokens in from the beginning. As academic researchers, or people who are not training frontier models, it's always a curiosity question: are these things really valuable at the frontier level, or are they only interesting at the 1B scale, where you can move your data distribution? That's why I was excited about understanding whether there is relevant data in the mid-training phase of non-thinking models, and something I was able to validate this with is the OLMo series.
The OLMo models are great for doing investigative science because they release everything: the data, the models, and they also have a very fantastic tool called OLMoTrace. With OLMoTrace you can trace what data in OLMo's pre-training was closest to its response.
Yeah, I was thinking of this when you were talking about your PhD thesis; it's directly the tool to use.
Exactly. Do you use their approach?
Yeah. Honestly, OLMoTrace, or really the original paper that produced infini-gram, which I think came out last year at COLM 2024, was one of my favorite papers that year. I think they did a fantastic job at actually being able to trace data back. It's a big engineering problem to be able to look things up that fast, and they did a fantastic job. I've used it a lot for multiple projects in my PhD.
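OLMoTrace and infini-gram are purpose-built indexes over trillions of tokens; purely as a toy illustration of the underlying idea, the sketch below brute-force searches a tiny in-memory corpus for the longest span of a model response that appears verbatim. The corpus and response strings are placeholders:

```python
# Toy illustration of the idea behind OLMoTrace / infini-gram: find the longest
# contiguous word span of a model response that appears verbatim in the
# training corpus. The real systems use suffix-array / n-gram indexes over
# trillions of tokens; this brute-force version is only a conceptual sketch.
def longest_verbatim_span(response: str, corpus_docs: list[str]) -> tuple[str, int]:
    tokens = response.split()
    best_span, best_len = "", 0
    for i in range(len(tokens)):
        # Try the longest candidate span starting at position i first.
        for j in range(len(tokens), i, -1):
            if j - i <= best_len:
                break
            span = " ".join(tokens[i:j])
            if any(span in doc for doc in corpus_docs):
                best_span, best_len = span, j - i
                break
    return best_span, best_len

# Placeholder data standing in for a pre-training corpus and a model response.
corpus = ["there is no seahorse emoji in the unicode standard", "emoji list 2015"]
response = "Wait, actually there is no seahorse emoji in the Unicode standard"
print(longest_verbatim_span(response.lower(), corpus))
```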
So, coming back to this phenomenon, I was curious: now that we know this inflection point happens around the GPT-4.1 series, can we trace it back to an open series of models and see if the same phenomenon happens there? You will see that the OLMo 2 model, which was released a year earlier, says there exists such an emoji; it's the wrong answer, but it does not really try to self-correct itself. On the other hand, the OLMo 3 32B model has the same self-correction behavior: it's a long response, I only have the last part of it, but it first says yes, then says "but wait, there is no such emoji." So we see the same monologue within the OLMo 3 model. And if you look at OLMoTrace, there is no particular example that it is referencing here that leads to this phenomenon, so it's more of a capability rather than a regurgitation of text.
That's great to know, because another thought in my mind was that someone had poisoned the data on the internet and the models were regurgitating something from there, but that's not happening. So that's a good sanity check that what we're seeing is a capability and not a regurgitation from some website or Reddit post that might have said yes and no. The more interesting thing is that we can now actually trace the exact difference in the datasets for OLMo 2 and OLMo 3, and the main difference here is for the instruct model. OLMo 3 has multiple variants: some are thinking variants, some are instruct variants.
With the instruct variant, they do not have any thinking data in the post-training phase, but they do mention, there's a line they write, that there is an intentional addition of thinking traces in the mid-training phase of OLMo 3's data. So that's the information available from their report, and we can also see the actual datasets they use. This observation corroborates really well with the whole finding about the OpenAI models, where they potentially did the same thing: even for the non-thinking models, the instruct models where you do not put thinking traces in the post-training phase, you do put thinking traces in the mid-training phase, and that leads to better models. This was quite interesting for me, because what it suggests about the GPT training data is that self-reflection data has now actually become pretty much core to the training of all frontier models, since we're seeing it in non-thinking models across the board.
Which is very interesting to me, because for the longest time people believed that the idea of foundation models is that you train the foundation once and then you just post-train it. That meant there's a general-purpose model, you don't need to put everything you care about into it, and you can post-train it to get the capability you want. But this really shows us that the foundation needs to contain the ingredients of the capability that you desire at the end. Putting thinking traces in the foundation is actually useful for the post-training or fine-tuning of the thinking models. That's why they can have a single backbone, and you would actually prefer the foundation to also represent the capabilities you desire. Self-reflection is a capability we desire, and it has become core to the foundation, not left as a cosmetic afterthought of post-training.
I also thought it was interesting that the GPT-4.1 series came about four months after the o1 data was available. It's kind of interesting how fast frontier labs move. o1 reasoning traces would have been available to the pre-training team, or the mid-training team I should say, in December, and then from December until, what, four months later at the end of April, the mid-training team would have potentially used the traces, which they obviously were developing. And there was a lot of talk about the omni models, but not directly, right? Like, it just went into the web and then back into the pre-training mix?
No, I would be very surprised if it went into the web. I think this was very intentional, because that's exactly how the OLMo series happened, where they have an intentional addition of the thinking data. They added a dataset which has this self-reflection behavior, where you self-correct yourself by first generating a wrong sequence and then the right one.
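What such a sample actually looks like is not public, so the snippet below is purely a hypothetical illustration of the shape of self-correction mid-training data: a wrong first attempt, an explicit correction, and a final answer packed into one document. It is not the actual OLMo 3 or OpenAI format:

```python
# Hypothetical illustration of a self-correction ("thinking trace") sample of
# the kind described above: a wrong first attempt, an explicit correction, and
# the final answer, packed into a single mid-training document.
# The exact format used by OLMo 3 or any frontier lab is not public; this is
# only meant to show the shape of the data.
sample = {
    "text": (
        "Q: Is there a seahorse emoji?\n"
        "A: Yes, the seahorse emoji is \U0001F40E... wait, that is the horse emoji. "
        "Let me check the Unicode emoji list again. "
        "Actually, no: there is no seahorse emoji in the Unicode standard.\n"
        "Final answer: No."
    ),
    "source": "synthetic-self-correction",  # illustrative metadata field
}
print(sample["text"])
```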
Yeah. I mean, this is why I wanted to feature this piece, because how people are training reasoning into their models, and when they're choosing to include it, is actually very important and not well known. And this is one of the first actual, somewhat investigative pieces I've seen about it.
Yeah, and you might have seen all these papers showing that RL is only enhancing or amplifying a capability the model already has, and that you actually need the core model to have that capability. So I think this reinforces all of those things: the core needs to contain the capability, and then you can do better RL, better post-training, all of that. In general it was very nice to see that these things are not just relevant to a 3B or 7B model, but also relevant at the real frontier where people are training the best models.
Any reactions, criticisms, or follow-up work?

For now, in general, I think the community was pretty excited about this. A lot of people felt it was either reinforcing their beliefs or that it was very interesting investigative work. I think that's something people always appreciate, when you put your detective hat on and try to dig into whatever is happening.
Something we are following this up with, and in general have been working on for a few months, is a work that should be out in a few weeks called the fine-tuner's fallacy. It argues pretty much the same thing: if there is a core capability that you actually care about, that capability should be part of the foundation and not a fine-tuned artifact. We show this for various types of domains, and in general we're seeing adoption of AI in a lot of domain-specific use cases. So the old idea that you have a foundation model and can just fine-tune it to get the desired capability, we are past that stage now, and this really means there's going to be a lot of specialized pre-training going forward. I think 2026 and 2027 are going to be the years when different enterprises start doing specialized pre-training, because the cost of pre-training amortizes itself very fast once you consider that, by doing specialized pre-training, you can train a smaller model which, when fine-tuned, is as capable as a much larger model. That's the core thesis we've been working towards in the past few months, and it's also very relevant to Datology in general.
Okay, got it. I may need to wrap up soon because I'm being chased down for something. But this is secretly a Datology pitch, I'm realizing, because everything you're saying is exactly about those pieces of the data, which is very fascinating. You've also done other work; you're one of the authors on BeyondWeb. Is there anything else you want to plug or feature, in terms of the stuff that's going on?
Yeah. I can quickly share something about BeyondWeb and how it's been making waves, and I also want to talk about the same idea of having the core capability you care about at pre-training time, in something called safety pre-training, for a few minutes. So let's go to BeyondWeb first. This was one of our big releases last year, where we were trying to showcase how to scale synthetic data to the trillion-token scale. There are works doing 100-billion-token or 200-billion-token training runs with synthetic data, but we wanted to understand the actual dynamics of how to work well with synthetic data at that scale.
So this work was largely about understanding, and giving lessons back to the community, about how to do good hybrid model training where part of the data is synthetic. Just as high-level flagship numbers, the results we released were very strong. The Nemotron dataset is one of the top datasets today, built with a lot of synthetic data, and the Datology model we released, called BeyondWeb, is the blue line here. As you can see, we achieve the same performance as the Nvidia model in almost 2.7x less time, and much faster than anything that Hugging Face or RedPajama does. Very interestingly, our 3B model has pretty much the same performance as Nvidia's 8B model, and that's quite strong given that Nvidia's Nemotron data is literally the most downloaded open-source dataset; I was seeing recently that it's being downloaded about a million times every month, so that's huge.
So internally we've been doing a lot to improve synthetic data. A little bit of backstory to explain the types of synthetic-data approaches that exist today: the beginning of how synthetic data came into the picture was through TinyStories and the Phi model family.

So really, yeah, I think that was the big push: "Textbooks Are All You Need."
Yeah, "Textbooks Are All You Need." And even before that, there was this paper called TinyStories, by a couple of researchers at Microsoft, about how you can train small models on entirely synthetic data. I would bucket these approaches into something I call the generator-driven paradigm, which means you have a big model, for instance GPT-4, and you query that model to generate a textbook or an essay or a paragraph, but all the information in your dataset is coming from the generator, from the model itself. So you really need the generator to be huge and massive and to contain information about everything in the world in order to train models in the generator-driven paradigm.
A couple of years ago, when I was an intern at Apple, we released a paper called Rephrasing the Web, and we modeled an alternate paradigm, which I call the source-rephrasing paradigm. We said the generator-driven paradigm is exciting and doing well, but it just does not scale: you need the model to be massive, and generating data will be very expensive. Plus, it really depends on you prompting in the right way, because if I ask the model to generate a paragraph about the laws of motion, then my data will have it, and otherwise it will not. So it puts the burden on the researchers to make sure the data is diverse. On the other hand, we have the entire corpus of the internet, which has so much knowledge available; the only issue is that the data might not be high quality. So we can rephrase all of this data into higher-quality data. The internet becomes the source of knowledge, and the synthetic-data generation model only transforms that knowledge into styles you care about. For instance, if question answering is something you care about, then you repurpose the knowledge of the internet into the question-answering style.
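As a sketch of the source-rephrasing idea (not the actual WRAP or BeyondWeb pipeline), the following uses a small instruct model to rewrite a web passage into question-answering style; the model name and document are placeholders:

```python
# Sketch of the source-rephrasing paradigm described above: the knowledge comes
# from an existing web document, and a small off-the-shelf model only rewrites
# it into a target style (here, question answering). Model name and document
# are illustrative; this is not the actual WRAP / BeyondWeb pipeline.
from transformers import pipeline

rephraser = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")  # any small instruct model

web_document = (
    "Newton's first law states that an object remains at rest or in uniform "
    "motion unless acted upon by a net external force."
)

prompt = (
    "Rewrite the following passage as a question-and-answer pair, keeping all "
    f"facts unchanged:\n\n{web_document}\n\nQ:"
)

# Deterministic decoding keeps the rephrasing faithful to the source passage.
qa_style = rephraser(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
print(qa_style)
```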
This really changes the whole philosophy of generating data by making the cost of synthetic-data generation extremely small, because now you can use a very small off-the-shelf model, feed it the actual information, and ask it to rephrase it into question answering. The capability to transform data is very cheap; even a 1B or 3B model can do this very well, and you do not need all the knowledge of the internet to be inside it. The source-rephrasing paradigm has actually become the dominant paradigm in 2026. Even Kimi K2 has a long section on how they rephrase internet content, Grok 4 has been using the same source-rephrasing paradigm, and Nvidia's Nemotron data that we benchmark against also uses this idea of source rephrasing. So philosophically, in the pre-training phase, we have almost finalized how we do synthetic data: we transform existing knowledge into patterns that are useful for us. In BeyondWeb we really go beyond what we had released at Apple when I was interning there, which was Rephrasing the Web, and multiply the advantages of what we were doing with synthetic data.

It's really good to get a sense of this, and I want to use Latent Space to feature this kind of work. People are going to read the paper on their own; we're not going to cover the whole thing here. But if they want to reach out to you, I'll leave all the socials there. And probably join you as well.
Awesome. Yeah, thank you so much for having me, and I'm very excited to see the response.

Thanks for all the great work. I would say it's amazing to me, because when RedPajama came out I realized that was the start of something. It's interesting to me that every single generation is a different company: it was Together AI, then it's Microsoft or whatever, then Apple, then Hugging Face, then Nvidia, and now it's you guys. And I'm like, where's the persistence? Is it just such a competitive field, or do you all keep changing companies?

It's the same people. Now we have a hub for all data enthusiasts. I think it's going to be the same for a while.

Okay. Well, thank you so much. That was great.

Thank you so much. Bye-bye.