
Authors: Gabriele Corso and Jeremy Wohlwend
Date: October 2023
Quick Insight: This summary is for investors and builders tracking the bleeding edge of AI in biotech, revealing how Boltz is turning protein structure prediction into a generative design engine. It highlights the strategic implications of open-source models and specialized infrastructure for accelerating drug discovery.
"I think we'll steer away from the term solved because we have many friends in community who get pretty upset at that word and I think you know fairly so."
"The critical part about a product is that even you know for example with an open source model you know running the model is not free."
"I think at the end of the day like you know for people to be convinced you have to show them something that they didn't think was possible."
Podcast Link: Click here to listen

Actually, we only trained the big model once. That's how much compute we had. We could only train it once. And so, while the model was training, we were finding bugs left and right, a lot of them that I wrote.
I remember us doing surgery in the middle, stopping the run, making the fix, relaunching, and yeah, we never actually went back to the start. We just kept training it with the bug fixes along the way, which is impossible to reproduce now. No, that model has gone through such a curriculum that it's learned some weird stuff, but somehow by miracle it worked out.
It's a pleasure to have with us today Gabriele Corso and Jeremy Wohlwend. They recently founded Boltz, a company trying to democratize and bring state-of-the-art structure prediction in biology to the masses. They are both recent PhD grads from MIT and have been working on all sorts of foundational papers in generative biology. Pleasure to have you here. Thanks for coming.
I guess we're maybe six years post Alphafold 2 right now, which was kind of a big moment. Is that right?
Yeah. I think was it 2021? So yeah, going on five years.
So maybe for the audience, can we go back to that moment in time and explain what this big moment was and why it was interesting? Why was everyone so excited, and I think you two were probably quite excited. So why were you personally excited?
I would start with why that was interesting from a scientific standpoint. Maybe first, as a kind of introduction for the ones in the audience who are not structural biologists: the idea of structural biology is that we want to understand how proteins and other molecules take shape inside our cells and how they interact.
Structural biology is this beautiful discipline where we are somehow able to understand these minuscule structures at atomic detail using incredibly complex methods like X-ray crystallography. The dream of computational biology has always been: can we understand the structures without having to resolve a crystal, shoot X-rays, and so on?
Alphafold was a real breakthrough in this problem of protein folding, which is trying to understand the structure of a single protein. To me, it was exciting across many dimensions. One, I was a computer scientist. I was working a lot on machine learning, and I saw the impact that the work, somewhat similar to what I was doing, could have on a longstanding scientific problem.
From a more personal side, seeing the structures coming out of these models where you see this beautiful creation of life is something that was very inspiring to me, and so that was one of the things that led me to start working on structural biology and in particular with machine learning.
Were you a structural biologist before Alphafold came out? I mean, you did machine learning, but it was not in structural biology, so that actually shifted your career quite dramatically.
Yeah, very dramatically. I was working on some pretty theoretical, methodological things, and I was starting to see some of the challenges in doing purely theoretical or methodological work. Alphafold was really a machine learning breakthrough, but applied machine learning, and seeing the potential impact of that kind of work led me to want to start working in applied ML.
Our group at the time was already working a lot on small molecules, and I think Alphafold is what triggered this shift to working on biologics. At the time, it opened as many questions as it answered. The immediate follow-ups were: okay, can we do this on things other than proteins? Can we do interactions of small molecules with proteins, nucleic acids with proteins? Can we model more complex protein systems?
Very rapidly after Alphafold, people realized that machine learning could target this problem very differently than previous methodologies.
So what does small molecule mean? What does protein mean? What are the terms that you just mentioned?
Maybe we can start with proteins. The protein is maybe the most fundamental one. It's what gets decoded out of our DNA: essentially a sequence of amino acids. Each amino acid you can consider as a small molecule, and there are 20 of them, at least in the human body.
Any composition of these 20 amino acids in a sequence creates a different protein, and there is a very large number of sequences you can create. Small molecules are typically much smaller in their number of atoms, and the atoms that compose them are generally a bit more diverse; amino acids always draw from the same fixed set of building blocks.
With small molecules, there's a larger set of possible atoms to consider, which also makes the problem pretty challenging. Then we have nucleic acids, DNA and RNA, which are also very interesting to model the structure of, and those are a little more similar to proteins: they're composed of four nucleotides, and you form sequences from them.
Any codon, which is three nucleotides, translates into a specific amino acid. These are all different forms of molecules; at the end of the day, they're just atoms bonded together whose interactions we try to understand.
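As a small aside to make that mapping concrete, here's a toy sketch in Python; the codon table is a tiny illustrative subset of the real 64-entry genetic code, not a reference implementation.

```python
# Toy illustration of the sequence bookkeeping above. The codon table is a tiny subset
# of the real 64-entry genetic code, chosen only so the example runs.
CODON_TABLE = {
    "ATG": "M",  # methionine (also the start codon)
    "TGG": "W",  # tryptophan
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
}

def translate(dna: str) -> str:
    """Translate a DNA coding sequence into a one-letter amino acid string."""
    protein = []
    for i in range(0, len(dna) - 2, 3):                      # step through codons, 3 bases at a time
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))   # 'X' for codons outside the toy table
    return "".join(protein)

print(translate("ATGTGGGGCAAA"))  # -> "MWGK", a 4-residue toy protein
```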
Going back to the Alphafold 2 moment, I remember this very well. I was at NeurIPS when the results of this famous competition came out. So, can you talk about CASP and what it is and why it was so interesting and exciting?
CASP, the Critical Assessment of Structure Prediction, is a competition held every couple of years. The goal has always been to find protein structures that are a little bit different from what's known. Over the years, CASP has put in a lot of effort to gather structures from academic groups and even industry groups to create a test set that would be difficult for the different methods.
CASP 14 was when Alphafold 2 really blew everything out of the water. The improvement was so large over the previous methods and over the previous competitions. CASP continues: we've had CASP 15, we have CASP 16, and what's happened now is that it's really expanding to all these other modalities, like protein with small molecules and with nucleic acids.
The goal remains to really challenge the models: how well do they generalize? We've seen in some of the latest CASP competitions that while we've become really good at proteins, basically monomeric proteins, the other modalities remain pretty difficult. It's really essential for the field that there are these efforts to gather challenging benchmarks, because they keep us honest about what the models can and cannot do.
It's interesting you say that, like in some sense, at CASP 14 a problem was solved and pretty comprehensively, but at the same time, it was really only the beginning. So can you explain what the specific problem you would argue was solved and then what is remaining, which is probably quite open?
I think we'll steer away from the term solved because we have many friends in the community who get pretty upset at that word. The problem on which a lot of progress was made was the ability to predict the structure of single-chain proteins. Proteins can be composed of many chains, and single-chain proteins are just a single sequence of amino acids.
One of the reasons that we've been able to make such progress is also because we take a lot of hints from evolution. The way the models work is that they decode a lot of hints that come from evolutionary landscapes. If you have some protein in an animal and you go find the similar protein across different organisms, you might find different mutations in them.
If you take a lot of these sequences together and you analyze them, you see that some positions in the sequence tend to evolve at the same time as other positions of the sequence. This correlation between different positions turns out to be a hint that these two positions are close in three dimensions.
Part of the breakthrough has been our ability to decode that very effectively, but what it also implies is that in the absence of that co-evolutionary landscape, the models don't perform as well. When that information is available, maybe one could say the problem is somewhat solved from the perspective of structure prediction. When it isn't, it's much more challenging.
It's also worth differentiating structure prediction and folding. Folding is the more complex process of actually understanding how it goes from this disordered state into a structured state, and that I don't think we've made that much progress on, but the idea of going straight to the answer, we've become pretty good at.
So there's this protein that is just a long chain, and it folds up. We're good at getting from that long chain, in whatever form it was originally, to the final shape, but we don't necessarily know how it gets to that state, and there might be intermediate states that it's sometimes in that we're not aware of.
That's right. That relates also to our general ability to model. Proteins are not static. They move, they take different shapes based on their energy states. We are also not that good at understanding the different states that the protein can be in and at what frequency, what probability. So still a lot to solve.
It was very surprising that even with these evolutionary hints that we were able to make such dramatic progress.
I want to ask why the intermediate states matter, but first I want to understand why do we care what proteins are shaped like?
Proteins are the machines of our body. The way that all the processes in our cells work is typically through proteins, sometimes with other molecules mediating the interactions, and through those interactions we get all sorts of cell functions.
When we try to understand how our body works, or how disease works, we often try to boil it down to what is going right in the case of normal biological function and what is going wrong in the disease state, and we boil that down to proteins and other molecules and their interactions.
When we try predicting the structure of proteins, it's critical to have an understanding of those interactions. It's a bit like seeing the difference between having a list of parts that you would put in a car and seeing the car in its final form. Seeing the car really helps you understand what it does.
Going to your question of why do we care about how the protein folds or how the car is made, sometimes when something goes wrong, there are cases of proteins misfolding in some diseases. If we don't understand this folding process, we don't really know how to intervene.
So do proteins when they're in the body, are they typically in that folded state or are they just doing whatever until they're in a location where they need to interact with something?
That's a great question, and it really depends on the protein. It depends on basically the stability of the protein. There are some proteins that are very stable, and so once they are produced from the ribosome, they sort of fold in this shape and then more or less they keep that shape with minor variations.
The ribosome is the part of the cell that actually translates RNA into proteins. So once they come out, they're pretty stable. On the other hand, there are some that have multiple states that they switch between depending on their environment.
Biology has really figured out some incredible machines. There are proteins where, depending on whether another molecule is present or not, they will take different shapes, and that different shape will give them a different function. We have these so-called fold-switching proteins that take multiple conformations, and we have some proteins that are completely disordered. These disordered proteins are actually pretty important in many diseases, and those are the ones we have the least understanding of.
There's this nice line in the Alphafold 2 manuscript where they discuss why we were even hopeful that we could tackle this in the first place: this notion that for proteins that fold, the folding process is almost instantaneous, which is a strong signal that we might be able to predict it, because it's a very constrained thing that the protein does so quickly.
That's not the case for all proteins, and there are a lot of really interesting mechanisms in the cell. One of the interesting things about the protein folding problem is that it used to be studied as a classical example of an NP-hard problem. There are so many different shapes that a chain of amino acids could take, and the number grows combinatorially with the length of the sequence.
There used to be a lot of theoretical computer science work thinking about and studying protein folding as an NP-hard problem, and so from that perspective it was very surprising to see machine learning succeed. Clearly there is some signal in those sequences, through evolution but also through other things that we as humans are probably not really able to understand, but that these models have learned.
Andrew White said that he was following the development of this and that there were actually ASICs that were developed just to try to solve this problem. Many millions of computational hours were spent trying to solve this problem before Alphafold.
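To give a sense of why folding looked like an intractable search, here's the usual back-of-the-envelope Levinthal-style arithmetic, assuming roughly three conformations per residue (a textbook simplification, not a measured number):

```python
# Back-of-the-envelope Levinthal-style arithmetic. The "3 conformations per residue" figure
# is the usual textbook simplification, not a measurement.
conformations_per_residue = 3
protein_length = 100  # a smallish protein

total_conformations = conformations_per_residue ** protein_length
print(f"{total_conformations:.2e}")  # ~5e47 candidate shapes -- hopeless to enumerate directly
```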
Just to be clear, one thing that you mentioned was that there's this co-evolution of mutations and that you see this again and again in different species. So explain: why does that give us a good hint that they're close to each other?
If I have some amino acid that mutates, it's going to impact everything around it in three dimensions, and so it's almost like the protein through several random mutations in evolution ends up figuring out that this other amino acid needs to change as well for the structure to be conserved.
The whole principle is that the structure is probably largely conserved because there's this function associated with it. It's really different positions compensating for each other.
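To make the co-evolution signal concrete, here's a toy sketch: given a multiple sequence alignment, columns whose amino acids vary together are candidate 3D contacts. Real methods (direct coupling analysis, or the learned processing inside AlphaFold-style models) are far more sophisticated; this just computes a naive mutual-information score on a made-up alignment.

```python
import numpy as np

# Toy multiple sequence alignment: rows are homologous sequences, columns are positions.
# In this made-up example, column 3 (E/Q) always mutates together with column 4 (K/R).
msa = np.array([list(s) for s in [
    "ACDEK",
    "ACDQR",
    "GCDQR",
    "GCDEK",
]])

def covariation(msa: np.ndarray, i: int, j: int) -> float:
    """Naive co-variation score: mutual information between two alignment columns."""
    col_i, col_j = msa[:, i], msa[:, j]
    score = 0.0
    for a in set(col_i):
        for b in set(col_j):
            p_ab = np.mean((col_i == a) & (col_j == b))
            p_a, p_b = np.mean(col_i == a), np.mean(col_j == b)
            if p_ab > 0:
                score += p_ab * np.log(p_ab / (p_a * p_b))
    return score

print(covariation(msa, 3, 4))  # high: these positions mutate together -> candidate 3D contact
print(covariation(msa, 0, 1))  # ~0: column 1 is fully conserved, so no co-evolution signal
```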
So those hints in aggregate give us a lot of information about what is close to what, and then you can start to look at what kinds of folds are possible given those constraints and what the end state is, and therefore you can make a lot of inferences about the actual overall shape.
That's right. It's almost like you have this big three-dimensional valley where you're trying to find these low-energy states, and there's so much to search through that it's almost overwhelming. These hints put you in an area of the space that's already close to the solution, maybe not quite there yet.
There's always this question of how much physics are these models learning versus just pure statistics. Once you're in that approximate area of the solution space, then the models have some understanding of how to get you to the low energy state, and so maybe you have some light understanding of physics but maybe not quite enough to know how to navigate the whole space well.
So we need to give it these hints to get it into the right valley and then it finds the minimum or something.
One interesting explanation of how Alphafold works that I think is quite insightful, though of course it doesn't cover the entirety of what it does, is one I'm going to borrow from Sergey Ovchinnikov at MIT. The interesting thing about Alphafold is that it's got this very peculiar architecture that we have all since used, and this architecture operates on this pair-wise context between amino acids.
The idea is that the MSA, the multiple sequence alignment, gives you a first hint about which amino acids are potentially close to each other. From this evolutionary information about potential contacts, it's almost as if the model is running some kind of algorithm where it's decoding: okay, these have to be close; and if these are close and this is connected to this, then this has to be somewhat close too.
You decode this into what becomes basically a pair-wise distance matrix, and then from this rough pair-wise distance matrix you decode the actual structure.
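As an intuition-level caricature of that "propagate closeness" idea (and only a caricature; it is not how AlphaFold actually computes anything), one can start from a few sparse contact hints plus chain connectivity and propagate distance bounds:

```python
import numpy as np

# Intuition-level caricature only (not AlphaFold's actual computation): start from a few
# sparse "these residues are close" hints plus chain connectivity, then propagate distance
# upper bounds with Floyd-Warshall: if i-k and k-j are both close, i-j cannot be far.
n = 6
INF = 1e9
d = np.full((n, n), INF)
np.fill_diagonal(d, 0.0)

def add_hint(i, j, dist):
    d[i, j] = d[j, i] = min(d[i, j], dist)

for i in range(n - 1):
    add_hint(i, i + 1, 3.8)   # backbone connectivity: sequence neighbors are ~3.8 A apart
add_hint(0, 4, 6.0)           # a co-evolution hint: residues 0 and 4 appear to be in contact

for k in range(n):
    for i in range(n):
        for j in range(n):
            d[i, j] = min(d[i, j], d[i, k] + d[k, j])

print(np.round(d, 1))  # a rough pair-wise bound matrix that a 3D structure could be fit to
```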
So there are kind of two different things going on: the coarse-grained and then the fine-grained optimization.
You mentioned Alphafold 3, so maybe it's a good time to move on to that. Alphafold 2 came out and it was fairly groundbreaking for this field; everyone got very excited. A few years later, Alphafold 3 came out. For some more history, what were the advancements in Alphafold 3? After that, we'll talk a bit about how it connects to Boltz.
After Alphafold 2 came out, Jeremy and I got into the field, and with many others the clear problem that was obvious was: okay, now we can do individual chains, can we do interactions? Interactions of different proteins, proteins with small molecules, proteins with other molecules. Quick question: why are interactions important?
Interactions are important because that's how these machines, these proteins, carry out their function. The function comes from the way they interact with other proteins and other molecules. The individual machines are often not made of a single chain but of multiple chains, and then these multiple chains interact with other molecules to give them their function.
When we try to intervene on these interactions, think about a disease, think about a biosensor, or many other applications, we are trying to design molecules or proteins that interact in a particular way with what we would call a target protein, or target. After Alphafold 2, this became clearly one of the biggest problems in the field to solve.
Many groups, including ours and others, started making contributions to this problem of trying to model these interactions, and Alphafold 3 was a significant advancement on it. One of the interesting things they were able to do, while much of the rest of the field tried to model different interactions separately, how a protein interacts with small molecules, how a protein interacts with other proteins, how RNA or DNA take their structure, was to put everything together and train a very large model, with a lot of advances including changes to some of the key architectural choices. They got a single model that set a new state-of-the-art performance across all of these different modalities: protein with small molecules, which is critical to developing new drugs, protein with protein, and interactions of proteins with RNA and DNA.
Just to satisfy the AI engineers in the audience, what were some of the key architectural and data changes that made that possible?
One critical change, which was not necessarily unique to Alphafold 3 since a few other teams in the field, including ours, proposed it, was moving from modeling structure prediction as a regression problem, where there is a single answer and you're trying to shoot for that answer, to a generative modeling problem, where you have a posterior distribution of possible structures and you're trying to sample from that distribution.
This achieves two things. One is that it starts to allow us to model more dynamic systems: some of these molecules can actually take multiple structures, and you can now capture that by modeling the entire distribution. From a more core modeling perspective, when you move from a regression problem to a generative modeling problem, you are really handling uncertainty in the model in a different way.
If the model is undecided between different answers, what happens in a regression model is that it tries to output an average of those different answers. With a generative model, what you do instead is sample all these different answers and then maybe use a separate model to analyze them and pick out the best one.
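Schematically, the contrast between the two framings looks something like this; `regression_model`, `generative_model`, and `confidence_model` are hypothetical stand-ins, not any real API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins, not a real API; a "structure" here is just an (n_residues, 3) array.
def regression_model(sequence):
    # Dummy single deterministic answer; a trained regression model would tend to average
    # over ambiguous possibilities, blurring them into one structure.
    return np.zeros((len(sequence), 3))

def generative_model(sequence, n_samples=5):
    # Dummy sampler; a trained generative model (e.g. diffusion) would draw several
    # plausible structures from its learned posterior distribution.
    return [rng.normal(size=(len(sequence), 3)) for _ in range(n_samples)]

def confidence_model(structure):
    # Dummy scorer standing in for the separate confidence model used to rank samples.
    return -float(np.abs(structure).mean())

sequence = "ACDEFGHIK"
samples = generative_model(sequence)
best = max(samples, key=confidence_model)  # sample many answers, keep the most confident one
single = regression_model(sequence)        # versus one averaged answer from the regression view
print(best.shape, single.shape)
```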
That was one of the critical improvements. The other improvement is that they significantly simplified the architecture, especially of the final module that takes those pair-wise representations and turns them into an actual structure. That now looks a lot more like a traditional transformer than the very specialized equivariant architecture used in Alphafold 2.
This is a bitter lesson a little bit.
There is some aspect of the bitter lesson, but the interesting thing is that it's very far from being a simple transformer. This field is one of the very few in applied machine learning where we still have architectures that are very specialized. Many people have tried to replace these architectures with simple transformers, and there's a lot of debate in the field, but the broad consensus is that the performance we get from the specialized architecture is far superior to what we get from a plain transformer.
Can you talk a bit about that specialized architecture? I assume you're referring to triangle layers as the core idea, or maybe it's something quite fundamental about the fact that we model this in a second-order way: instead of just the sequence, we model every single pair, and then to update every pair we need these sort of triangular operations.
What's interesting about it is a couple of things. One, it relates a little bit to what the input is. We talked about these multiple sequence alignments before and this notion that we need to look at pairs of residues to try to understand maybe this initial distance matrix that Gabri was talking about, and that's something that is very natural to model in 2D.
There's something about the output as well, where supervising over these pairs is quite powerful. It's this idea of telling the model: hey, these two things are close to one another, these two things are not. Doing that in a 1D representation, where we model the coordinates in three dimensions directly, is probably more challenging for the model.
It's really survived the test of time. This thing came out in 2021 and it's largely the same. There's been this change to the structure module, which has been largely simplified, but where a lot of the magic happens is still the same place: this heavy pair-wise interaction modeling is maybe the most differentiated portion.
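A stripped-down sketch of the kind of triangle update that acts on the pair-wise representation is below; the real AlphaFold 2/3 modules add layer norms, gating, and learned projections, so treat this only as an illustration of the O(n^3) pattern, not a faithful reimplementation:

```python
import numpy as np

def triangle_update_outgoing(pair: np.ndarray) -> np.ndarray:
    """Simplified "triangle multiplicative update" over a pair representation.

    pair has shape (n, n, c): one feature vector per residue pair (i, j). Each edge (i, j)
    is updated from all two-step paths through a third residue k, which is where the
    cubic-in-n cost comes from. The published modules add layer norm, gating and learned
    projections; those are omitted here on purpose.
    """
    n = pair.shape[0]
    a = pair  # in the real module, a and b would be two different learned projections of pair
    b = pair
    update = np.einsum("ikc,jkc->ijc", a, b) / n  # aggregate over the third residue k
    return pair + update

pair = np.random.default_rng(0).normal(size=(8, 8, 4))
print(triangle_update_outgoing(pair).shape)  # (8, 8, 4)
```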
The other part in Alphafold 3 is moving away from modeling just at the amino acid level to having the model alternate between atomic-resolution modeling and what's called token-level modeling, which is at the amino acid level. That was also something they introduced that I think was particularly helpful in modeling these other modalities like small molecules. This idea of coarse grain and finer grain is actually quite popular in other areas as well, so that's maybe not too surprising.
The fact that the models gain so much from the inductive bias of this 2D representation is, I think, very interesting.
So you mentioned coarse and fine grain, and that brings to mind the ribbony diagrams of proteins that everyone has probably seen. Can you actually pull up a molecule and talk about the different components of the protein we're looking at, the spirals and the arrows and all that? What level of granularity are we looking at, and how does a model think about that?
Here's a little image from our own Boltz platform. I have a protein here, and you can actually see both the coarse grain and the finer grain. We have the ribbon-like structure here, representing the different amino acids in the protein, and when we zoom in on this interaction with the small molecule, you see at the atomic level how these things interact with one another. Even the actual bond interactions are shown here.
We go from this very abstract representation of these things, the sequence, the graph of the molecule, and the goal is that every single atom should have a coordinate, and it ends up looking like this. It's actually pretty elegant. One thing this field has done really nicely is make beautiful visualizations of this stuff, which is really nice to look at.
So there are ribbons, the coily ribbons, arrows, and some sort of not-coily ribbons. What do those mean, and how does someone think about them?
We can zoom into a few different areas of the protein. This one's actually a good example because there are a few different secondary structures here. Here you have what we call an alpha helix. There are essentially three categories. There are the alpha helices, where the chain takes this coiled ribbon shape. Then here is what we call a beta sheet, which, as the name says, is like a ribbon going back and forth to form a bit of a sheet.
Then you have these more loopy regions, which look more unstructured, and those are the parts of the protein that are most flexible. They are super important. Maybe one of the most canonical drug modalities is antibodies, and antibodies have six of these loops that are largely flexible but settle into a fixed structure when interacting with their target.
So harder to model and really critical to interactions. Those are largely the three big families.
As a structural biologist, or just a biologist, when you look at that, you're basically saying, okay, here's the sheet part, here's the part that'll be bendy, and then I have these coils. What do those mean to you when you look at them?
I should say I am not a structural biologist in any way, shape, or form. There are certain types of interactions that are more canonically associated with these different types of structures. A more well-versed structural biologist could give you a more thorough answer than that.
We've seen some of the early successes of protein design being able to design a binder to any target. A lot of the early success was with these very alpha-helix-centric peptides, which are almost like bricks; the models had a pretty good understanding of those interactions, so there was good success with that, and then it took a little bit of time to go from that to more exotic binders and things like that.
There's certainly a lot of important interaction behaviors associated with these structures.
Another interesting thing, staying on the modeling and machine learning side, which is somewhat counterintuitive compared to other fields and applications, is that scaling hasn't really worked the same way in this field. Models like Alphafold 2 and Alphafold 3 are still large models, but in terms of parameters they're actually not very big; they are well below a billion parameters.
If you hear about a model with less than a billion parameters in the LLM space these days, you wouldn't expect it to do much. But when you look at the computational cost of running these models, they are actually a lot more expensive to run than a language model, because instead of quadratic operations we now have cubic operations.
It's interesting how right now in the field, and this is maybe related to having less data or needing more inductive biases, we have a ratio of computation to parameters that is much, much higher than in other places.
If I recall, Alphafold 2 was, what, 70 million parameters, something like that? It's something like that, quite small, around 100 million or so.
These decisions, the triangle layers and, for Alphafold 2, this interesting equivariant architecture, really were priors that baked in a lot of the physics of the system. And the co-evolution data, people have argued, is almost like a database lookup of sorts, so in some sense that provides more parameters as well.
Definitely the amount of pure compute, the flops, is very high, and it's almost more reasoning-based than pure information extraction. Part of the reason LLMs are so large isn't just their reasoning capability but also the sheer quantity of information they store. Here there's a little less of that, and it's more about decoding the input than memorizing as much of it.
So is there a loop in the architecture that allows it to compute more per parameter? How does that work?
Part of it is just the fact that instead of operations that act on the single chain, they act on the pair-wise representation, so instead of a quadratic number of interactions you have a cubic number. That on its own pushes you toward smaller representation sizes but more representations, which leads to more flops but fewer parameters.
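A rough way to see the quadratic-versus-cubic point (the hidden sizes are made up and all constants are ignored):

```python
# Rough flop-count intuition for the quadratic-vs-cubic point above; the hidden sizes and
# the omission of all constants are deliberate simplifications, not real model numbers.
def attention_flops(n_tokens: int, d: int) -> int:
    return n_tokens**2 * d            # self-attention over a 1D sequence scales as n^2

def triangle_flops(n_tokens: int, d_pair: int) -> int:
    return n_tokens**3 * d_pair       # triangle updates over the 2D pair representation scale as n^3

for n in (256, 512, 1024):
    ratio = triangle_flops(n, 64) / attention_flops(n, 64)
    print(n, ratio)                   # the gap grows linearly with sequence length
```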
There is also this idea, somewhat similar to reasoning, of recycling, which comes from Alphafold 2 and carries over to Alphafold 3. They have this interesting framework where, as we were discussing, the input to the model is this initial understanding of the interactions, either from the evolutionary information in the multiple sequence alignment or potentially from what we call templates, which are basically database lookups of similar structures.
The way the model works is that it decodes these and tries to arrive at a good rough structure of the pair-wise interactions, and then you can do this recycling where you feed that understanding back into the input of the model and try to decode it again. People do this three or four times, and in some cases have even tried doing it tens of times, so you can see it as a very early version of reasoning.
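A minimal sketch of the recycling loop, with made-up function names; the real models also recycle structural information and handle gradients carefully during training:

```python
# Hypothetical sketch of recycling (function and argument names are made up for illustration):
# feed the model's own pair-wise "understanding" back in as an extra input and decode again.
def predict_with_recycling(model, sequence, msa, templates, n_recycles=4):
    pair_prev = None
    structure = None
    for _ in range(n_recycles):
        # each pass refines the pair representation using the previous pass's output
        pair_prev, structure = model(sequence, msa, templates, recycled_pair=pair_prev)
    return structure

# Dummy model so the sketch runs end to end; a real model would be a trained network.
def dummy_model(sequence, msa, templates, recycled_pair=None):
    pair = (recycled_pair or 0) + 1   # pretend the pair representation improves each pass
    return pair, f"structure refined {pair}x"

print(predict_with_recycling(dummy_model, "ACDEFG", msa=None, templates=None))
```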
So Alphafold 2, really cool; Alphafold 3, really cool. But Alphafold 3 came with a catch, and I think this catch was important for the development of Boltz and so on. The catch was that it was an amazing Nature paper, but unfortunately they decided not to release the model.
Alphafold 2 was open source, and the reported numbers say it has since been used by more than a million scientists. With Alphafold 3, for commercial reasons, DeepMind, which has since spun off Isomorphic Labs, now trying to become sort of a new pharmaceutical company, decided to keep the model internal and only use it internally.
Now, we were in the field building on top of models like Alphafold, and suddenly we no longer had that base starting point to build on. Even more importantly, everyone in both academic research and industry no longer had access to this incredible model that was really useful for trying to understand biology and for developing new therapeutics.
We decided to take the matter into our own hands and try to build a model of similar accuracy. Largely using the information that was in the Alphafold 3 manuscript, we went ahead and built Boltz-1, which was the first fully open-source model to approach the level of accuracy of Alphafold 3.
Along the way we realized that it was probably too ambitious to see this as only an academic project, and there were a lot of things still missing, so we also decided to start a public benefit company to push this mission of democratizing access to these models that we started with Boltz-1.
I remember this; it was actually shocking how fast you got Boltz-1 out. It was just two or three months, right?
I think we started in late May and it came out in November, if I remember correctly. So slightly longer, but yeah.
It was relatively quick. We were working on some similar ideas at the time; this idea of having a diffusion model on top of this pair-wise trunk was something we were exploring independently. When the paper came out, it was really clear, especially for example on the data pipelines, that there was so much we were not doing, so there was a lot to catch up on.
We were already in a place where we had some experience working with the data and with these types of models, and I think that put us in a good position to produce it quickly. I would even say we could have done it quicker. The problem was that for a while we didn't really have the compute, so we couldn't really train the model, and actually we only trained the big model once.
That's how much compute we had. We could only train it once, and so while the model was training we were finding bugs left and right, a lot of them that I wrote. I remember us doing surgery in the middle: stopping the run, making the fix, relaunching. We never actually went back to the start; we just kept training with the bug fixes along the way, which is impossible to reproduce now.
No, that model has gone through such a curriculum that it's learned some weird stuff, but somehow by miracle it worked out.
The other funny thing is that we were training most of that model on a cluster from the Department of Energy, but that's a shared cluster that many groups use. We were basically training the model for two days and then it would go back into the queue and sit there for a week. It was pretty painful.
Towards the end I caught up with Deon, the CEO of Genesis, and I was telling him a bit about the project and about this frustration with the compute. He offered to help, and we got help from Genesis to finish up the model; otherwise it probably would have taken an extra couple of weeks.
Boltz-1: how did that compare to Alphafold 3? And then there's some progression from there.
I would say Boltz-1, but also this other set of models that came out around the same time, was a big leap from the previous open-source models and really approached the level of Alphafold 3. Even to this day there are some specific cases where Alphafold 3 works better; one common example is antibody-antigen prediction, where Alphafold 3 still seems to have an edge in many situations.
Obviously these are somewhat different models; you run them and you obtain different results. It's not always the case that one model is better than the other, but in aggregate, especially at the time, Alphafold 3 still had a bit of an edge.
We should talk about this more when we talk about Boltz, but how do you know one model is better than the other? I make a prediction, you make a prediction: how do you know?
The great thing about structure prediction, and once we get into the design space of designing new small molecules and new proteins this becomes a lot more complex, is that, a bit like what CASP was doing, the way you can evaluate these models is to train them on the structures that were released across the field up until a certain time.
One of the things we didn't talk about that was really critical in all this development is the PDB, the Protein Data Bank. It's this common resource, basically a common database where every structural biologist publishes their structures, so we can train on all the structures that were put in the PDB until a certain date and then look at recent structures.
Which structures look pretty different from anything that was published before? Because we really want to understand generalization, and on these new structures we evaluate all the different models.
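A sketch of that time-based split; the records and field names are illustrative, not the actual PDB schema:

```python
from datetime import date

# Illustrative records only -- not the real PDB schema or real entries.
structures = [
    {"pdb_id": "AAAA", "release_date": date(2020, 5, 1)},
    {"pdb_id": "BBBB", "release_date": date(2023, 3, 9)},
]

CUTOFF = date(2021, 9, 30)  # train only on structures released up to some cutoff date

train = [s for s in structures if s["release_date"] <= CUTOFF]
test = [s for s in structures if s["release_date"] > CUTOFF]
# In practice the test set is further filtered to structures with low similarity to anything
# in training, so the evaluation measures generalization rather than memorization.
print(len(train), len(test))
```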
So you just know when Alphafold 3 was trained, and you intentionally train to the same date, or something like that.
Exactly. This is the way you can somewhat easily compare these models; obviously that assumes the training cutoffs line up. You've always been very passionate about validation. I remember DiffDock, and then there was DiffDock-L and DockGen; you've thought very carefully about this in the past.
I mean, actually, I think DockGen is a really funny story; I don't know if you want to talk about it, it's an interesting one. One of the amazing things about putting things out open source is that we get a ton of feedback from the field, and sometimes we get great feedback of people really liking the model. Honestly, most of the time, and to be honest that's maybe also the most useful feedback, it's people sharing where it doesn't work.
It's critical, and this is true across other fields of machine learning as well: to make progress in machine learning you have to set clear benchmarks, and as you start making progress on certain benchmarks, you need to improve them and make them harder and harder. That's how the field progresses.