
Author: AI Engineer
Date: November 2025
Quick Insight: This is for builders tired of RAG reasoning limits and context costs. It explains why the next phase of AI involves baking proprietary data into model weights rather than just retrieving it.
Jack Morris argues that we are hitting the ceiling of Retrieval Augmented Generation and massive context windows. As a Cornell researcher and founder, he suggests the future isn't just finding data. It is injecting it into the model weights.
"Even if you can technically build a model that doesn't break at millions of tokens, it's not actually better."
"Embeddings are the file system of today, but they're not the file system of the future."
"To get better models, you're going to need to pay somewhere."
Podcast Link: Click here to listen

Let's talk about ChatGPT. I think ChatGPT knows a lot of things. It's actually extremely impressive. I use it all the time. I used it to help prepare for this presentation. I used it to cook last night. I'm growing increasingly dependent on it. And yet, there's a lot that ChatGPT doesn't know. It didn't know why my speaker pass wasn't working when I was trying to get into the building. And if you ask it, did the Blue Jays win the World Series? The answer is no. I know that because I watched the World Series, but ChatGPT doesn't know it if you don't enable web search, because it has something called a knowledge cutoff. All the training data is segmented by date, and things after a certain date are simply not known to ChatGPT.
If you ask ChatGPT, help me optimize this kernel I wrote for AMD GPUs, it's so bad at it, and I think there are a few reasons for this. One, it's really hard. Two, there's not a lot of data for it. But three, I think it's more that the data that does exist is such a small portion of its training data that it just can't do it very well. And a lot of tasks like this, which I would guess a lot of you face in your jobs, the things that are more niche, or what I call long-tail here, are really hard for ChatGPT to do. Even if you say please, or "I want you to learn more about this," or "practice," it can't learn more about this, it can't practice, it doesn't know what to do when you ask it that.
If you ask what are the terms of our partnership agreement with BlackRock, it doesn't know about your company. Or which shirts should I order from Amazon. Implement a new feature in our company monorepo. Write an email in my style. Diagnose this patient given their history. What arguments did the opposing counsel use in the Martinez settlement negotiations? Is this question already answered on our company internal wiki? None of these things can possibly be answered by ChatGPT, because they're not in the training data, or they're too niche, or they require some data that's not available to it.
So the question I want to talk about today is: what's the right way to solve this problem? If we want to build new systems that actually know the things we want them to know, how should we build them? The way I want to think about it is: how do we take some knowledge and inject it into the parameters of the model? What's the right way to do this? The way I think about it, and the way this manifests in my research and other people's research, is that there are three ways. There's full context: you take as much stuff as you can and cram it into the language model. There's RAG, or retrieval augmented generation, where you have so many things that you can't fit them all in, so you retrieve the most useful ones and then feed them in. And then there's this third thing, which I think is really new and no one is doing it yet, which is training things into weights. What I mostly want to talk about today is why I think we should be training things into weights, but I'm going to start with the other two. And along the way, about 10% of the time, I'm going to be shilling my own research, but I'm going to try to be honest about it, and you can just tune me out if you want.
So, I think the easiest way to solve these problems is to put everything into context. If you work at a small company, or all you care about is, say, the hundred-some World Series that have occurred, you can just copy all the data and paste it into ChatGPT, or paste it into Grok or whatever model you use. That's finite enough that the model can understand it. And this works pretty well. I think this is something that got people really excited for a while a few years ago. I have this example of a doctor answering a question from a medical record. A medical record is small enough that it can presumably be input into the context of the model, and the model can do pretty well.
I think there are a few problems with this. Maybe the main one is just that it's so expensive. If you do anything like this in your day-to-day workflow, you put a ton of tokens into context and start generating. One, it's going to cost a lot of money, like US dollars, but two, it's just so slow. A few months ago I was writing my thesis, and I wrote it myself, but I did ask Claude for feedback a few times. It's maybe 80 pages of text; as documents go, it's medium length. The second you paste that into Claude, everything slows down by 10x or something.
I have this stat here: if you have 1,000 tokens of context, we can output 10,000 tokens per second. If you have 128k tokens of context, we can output 130 tokens per second. That's roughly two orders of magnitude of slowdown, and I think we've all faced this. It's very annoying, and it's hard to imagine how we can get around it. I'll give you the quick background from the research world, which maybe people know, which is that this is an inherent limitation of the models we use. The models we use are transformers. Transformers look like this. The real problem with transformers comes in this one little box right here called self-attention. The problem is that all of the words that go into the transformer need to look at each other, and this has a quadratic dependency. So if there are four words, four tokens, the matrix has 16 entries. If there are 12 tokens, there are 144 entries. We can manage this for a while, but at some point it becomes infeasible, especially from a memory perspective: we can't keep all these things in context.
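To make the quadratic part concrete, here's a tiny sketch in PyTorch. It is a toy single-head attention with made-up sizes, purely for illustration; the point is just that the score matrix has seq_len × seq_len entries.

```python
import torch

# Every token attends to every other token, so the attention matrix
# has seq_len * seq_len entries -- this is the quadratic dependency.
def attention_matrix_entries(seq_len: int) -> int:
    return seq_len * seq_len

for n in [4, 12, 1_000, 128_000]:
    print(f"{n:>8} tokens -> {attention_matrix_entries(n):,} attention entries")

# A minimal single-head attention to make the shapes concrete
# (hypothetical sizes, just for illustration).
d_model, seq_len = 64, 12
q = k = v = torch.randn(seq_len, d_model)
scores = q @ k.T / d_model**0.5       # shape: (seq_len, seq_len) -> 144 entries
weights = torch.softmax(scores, dim=-1)
out = weights @ v                      # shape: (seq_len, d_model)
print(scores.shape)                    # torch.Size([12, 12])
```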
You might say, well, Jack, Grok 4 has a two million token context window. Yeah, 2 million tokens, it's a very large number. Gemini 3 dropped during this conference, and Gemini 3 has a 1 million token context window. You also might ask why Gemini 3 didn't do a larger context window even though it came after Grok. I think the reason is that there's a difference between the model not breaking when you put in that many tokens and the model actually, properly reasoning across many large chunks of tokens. The second part we're still figuring out. People have figured out how to train models that don't break with more and more tokens, but we haven't really gotten to the point where we can train models that truly work as well on a million tokens as they do on a thousand tokens.
If you're more curious about this, there's a really good report from Chroma called Context Rot, about how performance degrades when you add other stuff into the context. The graph shows that the larger the context grows, even with the same fixed amount of relevant information, the worse the LLMs get. Two things to observe here that I think are interesting. One, Claude is the best by far. I like graphs like this because if you talk to people, a lot of people think Claude is the best, but if you measure on a lot of standard benchmarks it actually looks worse; then you use it and you're like, "Oh, something's better here." So I like this because it captures what people actually say to me. But I also like it because once you get out here, the performance is horrible. Once they add a bunch of stuff that doesn't actually help you solve the problem and you get to 10^4 tokens, which is 10,000, the models don't work at all. And even though they're not breaking, they're outputting things that make sense and are grammatical, they're not actually solving the problem. So context rot is a huge issue.
Maybe just anecdotally, if you look it up, there are a ton of people saying stuff like this: the context window is so long, why does it not actually work? Or people notice that Claude Code, when it fills up the context window, sort of stops working. There are a ton of people working on these efficient architectures that you might hear about: Mamba, state space models, linear attention, hybrid attention, sparse attention, sliding window. They're all more efficient, but they basically have the same properties as transformers. Even if they can run faster or with a lower memory requirement, there's some trade-off in the performance they give you. So even if you build a linear attention model that can fit infinite context, it's not good. It's not going to solve the problem you have, which is: how do I actually reason and get smarter when I input more tokens into the model?
There are so many examples of this. I saw this recent post; if you're kind of deep in the model architecture world, maybe you've seen it. This is from a couple weeks ago. There's a new Chinese model, MiniMax M2. It's one of the state-of-the-art open models. A bunch of the other Chinese labs have been pushing these new hybrid architectures that are more efficient and can take longer context, and MiniMax M2 just didn't do that. They just used the regular quadratic attention I was showing you. And they have this really long story about how they tried and tried, and it's basically just not worth it. There's an inherent trade-off between how much computation you use and how good the models are. So even if you can technically build a model that doesn't break at millions of tokens, it's not actually better for any of the tasks they care about. So no one is really doing this.
So to conclude this part: we're pretty limited by the context window with full context. There's one systems problem, that you can't put millions of tokens into the model. And then there's another reasoning problem, that even if you can, the models don't actually get better. So it's probably not practical. And if you work in industry, I'm sure you see document sets that are much, much larger, on the order of billions to trillions of tokens. Even though we're getting better at training the models, and on the systems side we're getting much better at running them more efficiently, faster, cheaper, we're not near fitting trillions of tokens into a model. I think that's pretty far off.
So I would guess a lot of you are doing RAG. How many people in this room use or work on a RAG system on like a weekly basis? That's actually pretty crazy. Okay, so over half for sure. So now we're going to talk about RAG. I'm going to talk about why it's good and then I'll talk about why I think it's fundamentally limited and the products of the future will use something better than RAG.
So if you use RAG, you probably use a vector database. There are many vector databases. I think I know some of these: Turbopuffer, Chroma, and so on; some of them run on S3 now. I made this slide. They all offer slightly different trade-offs; they give you your vectors cheaper or faster. Vector databases are the way that memory works in production. If you're using a company-internal question answering system, it's definitely running on RAG, which is powered by a vector database, which stores embeddings. ChatGPT memory uses embeddings. Andrej Karpathy has this diagram from a year or two ago of what an operating system that runs on language models would look like, and he called embeddings the file system of LLMs.
I think that's true in today's terms. Today, November 22nd, 2025, if you think of what you're working on as an operating system, the file system is embeddings. But I think embeddings are the file system of today, and they're not the file system of the future. That's what I'm going to talk about today. I also want to point out that they're extremely easy to use. Any of the tools I'm going to talk about at the end of the talk, the ones related to training things into models, are just fundamentally harder. But this is really nice, and we can all take a moment to appreciate it. You just take your text, embed it, run this, and that's all. It's five lines of code. That's really, really good.
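For reference, here's a minimal sketch of what that five-or-so lines looks like in practice, assuming the sentence-transformers library; the model name and documents are placeholders, not what any particular product uses.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical documents -- in practice these would be your company's data.
docs = [
    "Our partnership agreement with BlackRock covers ...",
    "Internal wiki: how to request a speaker pass ...",
    "Q3 financial summary for the monorepo migration ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder embedding model
doc_vecs = model.encode(docs, convert_to_tensor=True)  # this is the "vector database"

query_vec = model.encode("What are the terms of our BlackRock partnership?",
                         convert_to_tensor=True)
hits = util.semantic_search(query_vec, doc_vecs, top_k=2)  # cosine-similarity retrieval
print(hits[0])  # top matching documents with scores
```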
The problem is they just aren't that good, and they have a lot of problems. Okay, how many people work on RAG or experience a RAG system and are completely satisfied with it? Okay, that's great. So I think we're all kind of in agreement here that maybe there could be something more; even if we don't know exactly what it is, there must be something else out there. I'll talk about a few problems that I've run into in my own research.
So, let's start with this abstraction. This is the vector database that powers RAG. Every dot here is supposed to be a document. The document goes through the LLM, and the LLM is trained to give you just this one vector that represents the document. I projected them down to two dimensions for the slide, but each document is one dot. If you actually look at what's in the vector database, it looks like this: lots of numbers. There's no one in the world who can tell you what they mean. One thing that I think is interesting is that even though they look random and no one can actually read them, if you build a system to read them, it works pretty well. So if you're working with RAG and you're sending someone embeddings, you're actually sending them something analogous to text.
And I think this is important because a lot of the actual systems, Turbopuffer, Pinecone, what have you, store only embeddings. So maybe there's this false premise that if you just send them embeddings, there are no security flaws. But actually, an even slightly motivated person can build this system here, the white arrow on the right, which takes the embedding and produces maybe not the exact same text, but something extremely close to it. This is what I worked on for about a year of my PhD. This is an animation of it: I type in a sentence, it goes into the embedding model, it gets stored in the vector database, and then we run this multi-round correction procedure, and by the end we can recover most of the text. In our research, up to a certain length, we can get 90% of the text back exactly from vector databases. So the takeaway here is that there's no security benefit to using a vector database, and they're also very hard to run at scale. So this is an inherent problem for people with sensitive data. That's the paper.
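The multi-round correction idea, very roughly, looks like the loop below. This is a conceptual sketch, not the actual implementation from the paper; `inversion_model`, `corrector`, and `embed` are hypothetical stand-ins for trained models and the embedding function.

```python
import torch

def invert_embedding(target_emb, embed, inversion_model, corrector, rounds=5):
    # Step 1: generate an initial guess of the text from the embedding alone.
    guess = inversion_model.generate(target_emb)
    for _ in range(rounds):
        # Step 2: re-embed the current guess and check how close we are.
        guess_emb = embed(guess)
        if torch.cosine_similarity(guess_emb, target_emb, dim=-1) > 0.99:
            break
        # Step 3: propose a corrected hypothesis, conditioned on the previous
        # guess, its embedding, and the target embedding we're trying to match.
        guess = corrector.generate(prev_text=guess,
                                   prev_emb=guess_emb,
                                   target_emb=target_emb)
    return guess
```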
A second problem that I personally have with embeddings is that they're not adaptive. There's this one universal sense of what the world looks like captured in these vectors, and it's not adjustable based on what you work on. To give you a concrete example, we created a database of embeddings of credit card related documents; I think half of them were from Mastercard and half were from Visa. But if you look at where the embeddings actually get stored, it's only right here. There's this really large space of all possible semantics, and the embeddings only represent one universal slice of it, if that makes sense. So credit cards are all clustered in this really small area, and that means search works badly. To give you a concrete example, take these two documents, one from Visa, one from Mastercard. At least in the system we were designing, if you search something that's about a Visa query, you should never receive Mastercard, but they're all so close to each other that they're completely jumbled together. And this is a problem with all conventional embedding mechanisms.
So we built this new model that lets you feed in some surrounding documents. To give you an example, this is the first half of our model. We would feed in a bunch of credit card documents. I put Amex here, but there actually was no Amex when we did it. The model works like this: when it produces the embedding for the text, which is here, it also looks at a bunch of surrounding documents. So it can know, okay, this text is about Visa, but also all the other documents are about either Visa or Mastercard, and it gets trained so that it can dynamically adjust the embeddings based on the surrounding context. So I thought this was cool, and it works better. In this Visa/Mastercard case, the similarity between a Visa and a Mastercard document is now 0.144, and anything containing Visa has a much higher similarity. So that's maybe correcting one small thing.
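To give a feel for the intuition without the actual two-stage architecture, here's an over-simplified sketch: condition each embedding on the surrounding corpus so that what all the documents share (being about credit cards) is de-emphasized relative to what distinguishes them (Visa vs. Mastercard). This is an illustration of the idea only, not the model from the paper.

```python
import torch
import torch.nn.functional as F

def contextualize(doc_vecs: torch.Tensor) -> torch.Tensor:
    """Adjust embeddings using the surrounding corpus (toy illustration)."""
    corpus_context = doc_vecs.mean(dim=0, keepdim=True)  # shared "credit card" signal
    centered = doc_vecs - corpus_context                 # keep within-corpus differences
    return F.normalize(centered, dim=-1)

# doc_vecs: (num_docs, dim) embeddings of the Visa/Mastercard corpus
doc_vecs = torch.randn(100, 768)   # stand-in for real embeddings
ctx_vecs = contextualize(doc_vecs)
```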
It works better on out-of-domain stuff. We have a climate data set, a data set of arguments, a data set of financial questions, and I think scientific articles. The point I'm making here is that if you do this contextual thing, embeddings work a bit better. If you build them in a way that they can dynamically adapt to the domain, they can solve some problems. But at the end of the day, they're still embeddings.
Was this approach picked up by anyone else? Do you know? Yeah, I think we know they're using it at OpenAI and Anthropic behind the scenes now; the embedding models are contextual. It's kind of a free lunch: you add these extra tokens. It is somewhat hard to build, because you have to build this two-stage model, and when you embed something you have to grab some embeddings from the surrounding documents. But once you build it, it just works better, especially on long-tail stuff. If you look at something like MS MARCO, which is a large web-scale embedding task, it really doesn't get much better when you add surrounding documents, because it's already pretty global, if that makes sense. But if you look at really niche things, the embeddings work a lot better. So yeah, I know it's productionized at some other companies. If you're actually building an embedding model at your company and you want to put effort into making it better, this is probably the easiest way besides data; the first thing is probably data.
There's some recent work worth mentioning about fundamental limitations of embeddings, vector databases, and RAG, which says that there are some relationships that simply cannot be captured in a fixed-dimensional vector; you have to reason about things to answer all possible tasks. It's a kind of combinatorial setup where there are so many possible relationships that the embeddings simply can't store them all. So in theory, embeddings are obviously not the best way to capture all possible relationships between texts. But I think everyone knows that RAG has issues; I'm glad that no one raised their hand when I asked if anyone was going to really stand up and speak for RAG.
I actually think this is a hard point to make. Everyone kind of knows this, but it's hard to come up with examples that retrieval can't solve in practice. Speaking as someone who's recently sat down and tried to make benchmarks for tasks that I care about, it's hard to express questions that require this kind of latent reasoning over multiple documents in a way that RAG doesn't solve. But they do appear: anything that requires association between multiple things, or questions that are sort of implied but not explicitly answered by the documents, are just not solvable by current techniques. And if you have interesting examples of this, I'd love to hear them after the presentation. Hopefully I've made my case about RAG.
I'm curious if you would classify agentic search as RAG as well. Yeah, that's a good question. The way I think of agentic search is a model that can grab things, makes a bunch of queries in a row, and then responds. I don't think I would classify it as RAG, but I think it has different fundamental limitations that are also tough to overcome. What you would really want is a model that reads the entire thing and reasons about every possible relationship and then answers. In theory, maybe you could build an agentic RAG system that does that, but it would be very expensive.
Isn't deep research in the direction of that, where it goes through and pulls hundreds or thousands of sources, but then what ends up in context is only a small subset of those? Yeah, I actually think deep research is really in the right direction. They're trying to do something that's a little bit higher level and requires a lot of compute. Anything that works better than RAG is going to be more expensive, so just the property that it takes a while, makes a lot of searches, and thinks a lot is good. I think there's probably a more elegant way to train a really big research-esque system, but that's actually a good way of doing this. It's not the one I'm talking about today, but it's very promising as well. Maybe the question is whether you're willing to spend a lot of money at training time or at inference time. Deep research doesn't spend a lot of money on training, but it's willing to wait a long time at inference. The things I'm going to talk about today are more: you spend a lot of money up front, and you get a really smart model that knows all your data already and is really cheap at inference. So they're different sides of the same trade-off.
And I think a good way of thinking about these things is: to get better models, you're going to need to pay somewhere. You're either going to need to generate better data and spend more time on the data, spend more on training, or spend more on inference. A nice thing about RAG is it kind of just works, but anything better will cost more.
Getting back to your example of Mastercard versus Visa, I don't know if that's in your presentation later, but what are your thoughts on using a knowledge graph for that, as a kind of augmentation? It's a good question. Maybe ask me after; I have to think about knowledge graphs, it's been a while.
So let's talk about how to learn things into weights. The question we want to get at is: say we have the example I showed earlier, or you have a small data set you collected from your own work, and you want to teach it to the model. It's one thing to put it into context, and that's a good way to get started; if you don't have that much data, that'll get you pretty far. But I think we can do more. There are some questions that even when your data is in context, the model can't answer. So what I want us to think about is: how can we inject things into a model such that it learns better than in context, and also doesn't forget everything it already knows?
I want to point out something from my own research, which is that there is a fixed capacity to language models. One way to think about this is that ChatGPT has only so many parameters. We have this measurement that a model can store about 3.6 bits per parameter. So a billion-parameter model at 3.6 bits per parameter is, what, under a gigabyte of information? That's something, but it's actually not that much. So the models basically do their best to fit the training distribution and throw everything else out. To give you a concrete example, this morning while putting this together I asked Claude, "What is the capital of the smallest province in Tajikistan?" And it gave me a very detailed answer. It's actually very impressive: no web search, the model just knows this in its parameters. I guess I'm arguing that this is bad. If you want to build a system that can answer really detailed documentation questions for your company, you don't need it to know the capital of the smallest province in Tajikistan. And since we know these models have fixed capacity, that's a waste. What we really want is to know how to find this kind of thing, delete it, and replace it with the things we care about. I think that's what we're getting towards, but we don't 100% know how to do that yet.
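As a quick back-of-the-envelope check, taking the 3.6 bits-per-parameter figure at face value:

```python
# Back-of-the-envelope capacity estimate at 3.6 bits per parameter.
params = 1_000_000_000                 # a 1B-parameter model
bits_per_param = 3.6                   # measured capacity figure discussed above
total_bytes = params * bits_per_param / 8
print(f"{total_bytes / 1e9:.2f} GB")   # ~0.45 GB of storable information
```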
When I originally put this talk together, the way I was thinking of explaining it was calling it a neural file system. Then I decided to just call it weights; I think it's easier to understand, but this slide still says neural file systems. So there are a few questions here. We want to train all our data into the model. One question is: how do we train it? Do we do RL? Do we do SFT? What even is the data? Another question is: out of all the possible data, what do we use? Do we just fine-tune directly on our data? Do we try to generate more? My argument is that we should try to generate more, and I'll show you why. And then there's an architectural question. For a long time, people in the machine learning and deep learning community really cared about what architectures we should use, and then for the last eight years or so, everyone who knows what they're doing has really just been using transformers, unless they're trying to make them better. I think now, in a world where we're trying to train stuff into models, where each of us has our own model, or maybe multiple models, and those models are getting updated a lot, we start to care about architecture again, and I'll tell you why and what I think the options are.
So first let's talk about learning. The mental model here, which I mentioned before, is that we're trying to train the model to learn the data as well as it possibly can, and it's going to be expensive. We didn't like RAG, but RAG also didn't cost us very much money. To do better than RAG, we're going to have to pay for some GPU hours, and that's just the state of the world. Okay, fine. So this is our model, this homogeneous blob of parameters, and this is our data. Maybe we have the Mastercard data set, or maybe we collected data about ourselves, or maybe I collected all my coding traces from November and December and I want to train the model to understand my problems better. What do I do? How do I actually do this? Let's start with the dumbest possible approach and just see what happens: take the data set and train on it directly, using next-token prediction.
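In code, the dumbest possible approach looks roughly like this. It's a sketch using Hugging Face transformers; the model name, file path, and hyperparameters are placeholders, not the exact setup from the experiment described next.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"            # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token                     # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw_text = open("3m_10k_report.txt").read()       # hypothetical path to the document
chunks = [raw_text[i:i + 4000] for i in range(0, len(raw_text), 4000)]
ds = (Dataset.from_dict({"text": chunks})
      .map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
           remove_columns=["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="naive-ft", num_train_epochs=10,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # next-token prediction
)
trainer.train()   # loss goes to ~0: the model memorizes the document verbatim
```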
So we actually ran this little experiment. This is 3M, the company; they make duct tape and a lot of other things, and this is one of their financial reports. Maybe you work there and you really don't want to read all of this, so you just want the model to really understand it and be able to answer questions, and RAG isn't really working because it has this weird structure and there are a lot of ways the documents interrelate. Okay, cool. So we just train the model using next-token prediction and see what happens. And you know what? Even if you don't train the whole model, you still get to zero loss. The model can perfectly memorize this entire 3M 10-K financial report. It's extremely impressive. Okay, so now let's talk to it. We did this, and then we didn't want to ask anything that's exactly present in the document, because we want to see if the model is actually good. And everyone loves to test poems, so we started with a poem. We said: can you write a poem about 3M in fiscal year 2025? So, register your bets. What do you think happened?
It's terrible. Someone said it. It just spits back a passage from the report and calls it a poem. It's crazy. So now maybe we ask: why does this happen, and how do we fix it? Unfortunately, this doesn't work, and I actually think this is one of the reasons people haven't been doing this yet: the dumbest possible approach usually does work in machine learning, but in this case we have to do something a little more sophisticated. So maybe take a second and think about what you would do if you were facing this problem at work or in a side project. I think there are two things we need to fix. One is that the data is not exactly what we want to train on. And two is that we probably don't want to update the entire model, because what we did there basically overwrote all the stuff about Tajikistan and everything else in the model with just this 3M knowledge. That's too specific; the model is now obsessed with 3M and will only produce exact copied sentences from the document. That's clearly too much. So we need a better way to update the model, and we need a better way to change the data.
There's some pretty relevant work here. I don't know if you follow this LLM chat project from Andrej Karpathy (shout out, I think it's very educational), where he built a small LLM and trained it from scratch, and then he wanted to teach it about himself. Okay, maybe the first thing you would try is RAG: you put in a little database of information about yourself. But that's only scalable to a certain amount, and the model can't really combine things; it can only regurgitate facts. So he wants to actually teach it properly, he says, meaning in weights. And notice he doesn't just take one example and train the model on it with next-token prediction. He does something a bit more complicated. You don't have to care about the specifics, but he basically makes a diverse training data set of examples that look like the thing he cares about, and then trains on it. And if you go look, you can find this; it actually does work pretty well, which is cool. So he's able to teach a novel behavior to a model by generating a lot of synthetic data that looks like the example he cares about, fine-tuning the model for a little bit, and it learns.
There's a really good paper from last year from some folks at Stanford called synthetic continued pre-training, and they have the same problem. They have a really small data set, and they want to teach the model the data set without essentially bricking the model. They have a fancy way of generating synthetic data by extracting entities, but I think the important part is that they take a small data set and generate a much larger, more diverse data set representative of the thing they care about. And this is something that breaks the conventional machine learning paradigm. They only have a small training data set, so what you learn in school would tell you that you would just overfit and there's nothing you can do; you have to go back and collect more data. But actually, because LLMs are so good now, we can do this second thing, where we generate a much larger training data set. It really contains only the facts that were present in the original data, but it's so large that you can train a model on it. It's very strange. It only recently started working, but it does work.
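A rough sketch of that generation loop, in the spirit of the entity-based approach but not the paper's exact algorithm; `llm` here is a placeholder for whatever generation API you use.

```python
import random

def extract_entities(document: str, llm) -> list[str]:
    # Ask the LLM to pull out the key entities mentioned in the document.
    prompt = f"List the key entities (people, products, terms) in this document:\n{document}"
    return llm(prompt).splitlines()

def generate_synthetic_corpus(document: str, llm, n_samples: int = 10_000) -> list[str]:
    entities = extract_entities(document, llm)
    corpus = []
    for _ in range(n_samples):
        # Sample a pair of entities and ask the LLM to explain, using only the
        # source document, how they relate. This multiplies a small document
        # into a large, diverse corpus that still contains only the original facts.
        a, b = random.sample(entities, 2)
        prompt = (f"Using only the document below, write a short passage explaining "
                  f"how '{a}' relates to '{b}'.\n\n{document}")
        corpus.append(llm(prompt))
    return corpus

# The resulting corpus is then used for ordinary next-token fine-tuning,
# as in the earlier sketch, instead of training on the raw document itself.
```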
I'll show you some evidence. The green line is what happens when you do the dumb thing from before: you just fine-tune the model on the data. It actually starts at the black line and, surprisingly, gets worse. It memorizes the data so well that it can't answer any slightly different questions about it. What they do, and they have two different ways of doing it, is basically generate lots of synthetic data that describes the things in the original data set. It works very well; at some scale, around 100 million tokens to close to a billion, they can actually outperform GPT-4 on this data set, which is really cool. So the takeaway is: even though you don't have a lot of data, if you're willing to generate a large synthetic data set that describes the data you have, you can train a model on it and it works really well.
There are a bunch of other papers that do this. One is called active reading: they basically ask the LLM what types of things they should generate, and then generate from that. There's self-study, from the cartridges paper, which is more question-answering oriented, asking the model to quiz itself. And then there's this rephrasing-the-web work, where they rephrase an entire pre-training data set, so this actually works at scale in a kind of surprising way. There's a lot more work in this direction, and I'm really excited about it and kind of monitoring it. There's a company called Datology that's doing this really well; they're generating really high-quality synthetic data. It's just not something that used to be possible until very recently, when LLMs crossed some threshold where they're able to generate data that's good enough to train themselves on.
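The self-study flavor, very roughly, is just having the model quiz itself on the document and then training on the resulting pairs. Another hedged sketch, with `llm` again a placeholder generation function rather than any particular paper's implementation:

```python
def self_study_pairs(document: str, llm, n_questions: int = 200) -> list[dict]:
    pairs = []
    for _ in range(n_questions):
        # Have the model pose a question the document can answer...
        q = llm(f"Ask one question that this document answers:\n{document}")
        # ...then answer it using only the document, and keep the pair for fine-tuning.
        a = llm(f"Answer the question using only the document.\nQuestion: {q}\n\n{document}")
        pairs.append({"question": q, "answer": a})
    return pairs
```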
Oh, there's actually something pretty cool that's not in the slides. It's called self-adapting language models, SEAL, where they ask the model what data to generate to make itself better. And under some constrained scenarios, this is actually working. That's actually quite bizarre, and it obviously doesn't work indefinitely, or else they would have caused an intelligence explosion. But the fact that it works at all is really remarkable and worth monitoring.
Now I think the money question here is: how do we inject the information into the model? Before, I mentioned we were training all the parameters, and we tried that and it worked really badly. And this is a problem that'