
Author: Latent Space | Date: N/A
This is for builders tired of treating LLMs as unpredictable magic. It explains how Goodfire uses mechanistic interpretability to turn black boxes into steerable, high-stakes tools for science and enterprise.
Goodfire founders Mark and Jack explain why the science of deep learning is moving beyond black boxes. They argue that understanding internal mechanisms is the only way to build trustworthy systems for high-stakes industries.
“It’s the equivalent of using GPT-5 as a judge, but it’s 500 times cheaper.”
“It basically shatters the representation into many pieces.”
“Interpretability should be useful and we’re getting it used right now.”
Podcast Link: Click here to listen

Okay, we are here live in Europe with two good folks. We want to cover the state of mech interp, basically, and you guys are very passionate. Mark, I had you on for AIE before. Welcome back.
Thank you.
Jack, you're new, but also part of Goodfire. How would you describe what you guys do, and your path into mech interp? Maybe Jack, you want to go first?
So we do interpretability research, primarily. We're a company focused on making models interpretable, robust, and safe. I would describe the type of work we do as the science of deep learning: trying to make models not just black boxes, but things that we can actually trust and deploy in high-stakes industries. My own path into interpretability: I was a PhD student from 2020 to 2025, I graduated in May, and I started out working on language models. Specifically, grounding in language models, which is basically the idea that you need more than text data to represent meaning in the world.
Basically right as I started, GPT-3 had come out, and I had a moment of, oh, this is actually really good at understanding the world. It was kind of a slow transition, but somewhere along the way in grad school I switched to fully doing interpretability.
Cool. As Jack mentioned, Goodfire is an AI research company focused on building a platform for interpreting models of all kinds, across lots of different modalities and domains. My path looks very different from Jack's. Prior to Goodfire, I was at Palantir as an engineer on the healthcare team, and then I joined Goodfire back in March. One cool thing about the state of interpretability as a field, and also how Goodfire is set up, is that there's a lot of foundational research still to be done, which folks like Jack and our research team are working on.
But at Goodfire, I'm more focused on our applied, real-world use cases for interp: building out a platform that can help in use cases like scientific discovery, or inference-time monitoring of models deployed in enterprises, things like that. So there's a lot of exciting, totally new theoretical research to be done, but especially over the past year I think we're starting to see interpretability have practical use cases and actually get deployed, particularly in situations where models are being used for high-stakes industries and problems, which is super exciting.
I think a lot of people aren't aware that we're actually at the point where you can apply these things to real-life use cases. I saw the platform directly; you guys had a launch party in your office for the diffusion thing. Maybe you want to recap what that was, so that people can go play around with it again.
Yeah. So this is kind of a research preview that we put out. It's at paint.goodfire.ai, and it's live. That was a use case of interpretability for the creative domain. Just to give one hint at it: interpretability gives you a set of, I think of them almost as, power-user tools for accessing models and doing things with them that you might not have realized you could. In that example you take Stable Diffusion XL Turbo, a model where typically you use text input to prompt the model and you get an image out.
But using interpretability techniques you can plug directly into the mind of the model, and you get a 2D canvas where you can basically paint directly into its mental map of the image. We used unsupervised techniques to figure out all the concepts the model has internally, so animals, backgrounds, scenes, and stuff like that. You can select any of those concepts, which again were found totally unsupervised, and just paint a lion over here, then drag it and move it over there. It's a totally new way to use it: this is a model that only takes text prompts as input, but when you plug into the brain, you can do these cool things.
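To make the "painting into activations" idea concrete, here is a minimal sketch of the general pattern, not Goodfire's actual implementation. It assumes you already have a concept direction (for example, a decoder column from a sparse autoencoder trained on the UNet's activations; the `lion_dir` and `canvas_mask` names below are hypothetical) and adds that direction to the spatial activations with a forward hook while generating as usual.

```python
# Minimal sketch (not Goodfire's code): "painting" a concept into a diffusion
# model's spatial activations. Assumes you already have a concept direction
# `concept_dir` of shape [channels] (e.g. an SAE decoder column) and a
# [height, width] mask marking where on the canvas to paint it.
import torch

def make_paint_hook(concept_dir: torch.Tensor, mask: torch.Tensor, strength: float = 8.0):
    """Return a forward hook that adds `strength * concept_dir` to the
    activations at every spatial location where `mask` is 1."""
    def hook(module, inputs, output):
        # output: [batch, channels, height, width] activations of this block
        acts = output[0] if isinstance(output, tuple) else output
        steer = concept_dir.view(1, -1, 1, 1) * mask.view(1, 1, *mask.shape)
        acts = acts + strength * steer
        return (acts, *output[1:]) if isinstance(output, tuple) else acts
    return hook

# Usage sketch: register the hook on an intermediate UNet block, then generate
# as usual; the text prompt stays unchanged, the "paint" comes from the hook.
# handle = pipe.unet.mid_block.register_forward_hook(
#     make_paint_hook(lion_dir, canvas_mask))
# image = pipe("a photo of a savanna").images[0]
# handle.remove()
```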
And then highlighting other work, because I think that one went relatively viral. Jack, I don't know if you want to take a turn at other highlights of your year in terms of stuff that you shared. You could start with the memorization work.
Okay, sure. So there's a lot of work, going back years, on how models memorize a lot of their training data. That's a privacy concern, but it's also just scientifically interesting to understand, and there's an unclear picture of how we should think about the way it's represented within a model. Is it like a computer with a file system we can tap into, where maybe files are redundantly stored and memorized training sequences are kind of spread throughout? Or should we think about it in some other way? We proposed a different lens on it. An exciting question around the work was whether we can disentangle core cognitive capacities in a language model from knowledge.
I was thinking like GPQA, you know.
Exactly right. So this is a paper we put out recently, in October, and we showed there's this nice spectrum of capabilities. There's also another new lens on memorization here, which is that in language models it's not so black and white what is memorized and what's not. Memorizing that the capital of France is Paris is memorized in some sense, but it's a lot different from memorizing a single-page license agreement document that shows up a thousand times in the pre-training data, or something like that. With the way we disentangle memorization, you can actually see this gradient of memorization in between, both mechanistically and behaviorally, with logical reasoning tasks being quite distinct from rote memorization and factual recall somewhere in the middle.
What's your follow-on work after you've done this?
So I think understanding the reasoning capabilities is very interesting. Understanding how post-training affects those capabilities in general is also super interesting, and I think we're going to keep pushing on that.
Can you induce forgetting selectively?
Yeah, so there's a lot of work on unlearning. I would describe most of it as not unlearning but maybe suppression. As for real guarantees that you've fully removed information from a model, I don't think that's been convincingly shown anywhere yet. And part of the point we want to make in our work is that when you look at memorization this way, it makes a point about how hard, or possibly intractable, that is.
I don't know if this is a related problem or if it's too different, but instead of unlearning, what about updating? Say I moved the capital of France to Marseilles, but there's so much training data saying the capital of France is Paris. I need to be able to tell the model, hey, this is now out of date, but short of tagging a date on everything, I don't know when that makes sense. It's a weird one.
Yeah, a really big paper from a couple of years ago on this exact thing, on fact editing, was the ROME paper. It stands for rank-one model editing, a nice acronym.
Yeah, here we go. And they look at exactly what you're describing. It's a really nice paper. I don't think it's being used in deployment anywhere, but it's really an interpretability paper, and I think that's somewhere where, from a basic science perspective, interpretability has so much to offer: understanding how we can do something like that, which is currently just very difficult.
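For intuition, here is a toy sketch of the rank-one update at the heart of ROME. The real method also uses causal tracing to pick the layer and solves a constrained problem (with an activation covariance term) for the key and value vectors; this simplified version just shows how a single outer-product update can remap one key to a new value.

```python
# Toy sketch of a rank-one model edit in the spirit of ROME.
# Simplified update (no covariance term): W_new = W + (v* - W k*) k*^T / (k*^T k*),
# so that W_new @ k* == v* while other directions are only minimally disturbed.
import torch

def rank_one_edit(W: torch.Tensor, k_star: torch.Tensor, v_star: torch.Tensor) -> torch.Tensor:
    """Return an edited weight matrix that maps key k* to the new value v*."""
    residual = v_star - W @ k_star                      # what the edit must add
    update = torch.outer(residual, k_star) / (k_star @ k_star)
    return W + update

# Sanity check on random tensors:
W = torch.randn(16, 8)
k = torch.randn(8)
v = torch.randn(16)
W_edited = rank_one_edit(W, k, v)
assert torch.allclose(W_edited @ k, v, atol=1e-4)
```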
So let me call over to Mark a little bit for industry stuff. What can people do at the end of 2025 that they could not do at the start of 2025?
So I'll preface by saying we still have a long way to go, but we are excited about seeing the seeds of things actually being deployed in practice. I would say we're continuing to make progress on understanding that models have a lot of latent capacity that you can't get at by treating them entirely as black boxes. We're seeing interpretability show up in model cards now, in evals people are running, in various red-teaming exercises and stuff.
Yeah, I know. There's some stuff in Gemini 3, and Claude, I think Claude 4, and it's definitely Victor.
Yeah, for sure. I mean, the interp team there is phenomenal, and I think it works across some of their other teams. And then from Goodfire's perspective, one of our partners, Rakuten, is deploying an interpretability-based tool in production with one of their language agents. This is a really cool use case: what they needed to do was take chats between their customers and their agent, find instances where personally identifiable information is mentioned, so names, emails, phone numbers, things like that, and scrub it out. And interpretability turned out to be the best way to do this at scale and in a cost-effective way. It was both more effective and cheaper than alternative techniques.
And you do that by having, like, a PII feature?
Yeah, exactly. So the customer is talking with an agent, but then separately you're putting that transcript through what we call a sidecar model. What's really interesting is that if you ask that model directly, try to use it as an LLM-as-a-judge, it's not very good. But if you probe its mind and detect when the features related to personally identifiable information are firing, that gets you the highest recall of anything. It's the equivalent of using GPT-5 as a judge, but it's, you know, like 500 times cheaper. So they're deploying that.
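As a rough illustration of the pattern described here, and not Rakuten's or Goodfire's actual pipeline: run the transcript through a small sidecar model, read an intermediate layer's activations, and score each token against a set of PII-related feature directions. The model choice, layer index, `pii_features` matrix, and threshold below are all placeholders.

```python
# Minimal sketch of feature-based PII flagging with a small "sidecar" model.
# `pii_features` is a hypothetical [n_features, d_model] matrix, e.g. SAE
# decoder rows labeled as name/email/phone features; `threshold` would be
# tuned on labeled data. GPT-2 is just a stand-in sidecar model.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def pii_token_scores(text: str, pii_features: torch.Tensor, layer: int = 6) -> torch.Tensor:
    """Per-token maximum activation over the PII feature directions."""
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).hidden_states[layer][0]   # [seq_len, d_model]
    scores = hidden @ pii_features.T                      # [seq_len, n_features]
    return scores.max(dim=-1).values

def redact(text: str, pii_features: torch.Tensor, threshold: float = 4.0) -> str:
    """Replace tokens whose PII score exceeds the threshold."""
    batch = tok(text, return_tensors="pt")
    scores = pii_token_scores(text, pii_features)
    tokens = tok.convert_ids_to_tokens(batch["input_ids"][0])
    kept = ["[REDACTED]" if s > threshold else t for t, s in zip(tokens, scores)]
    return tok.convert_tokens_to_string(kept)
```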
And then I'm personally very excited about the use cases for scientific discovery. Some of our partners are in the life sciences and in materials, where there are these narrowly superhuman models for things like genomics, medical imaging, proteomics, materials science. Those are especially uninterpretable because they're working in domains we can't natively read. I don't speak genome, you know; it's literally base pairs in, base pairs out. But they're superhuman at tasks that are very interesting. So we have some exciting early results in finding novel biomarkers of disease with some of our partners, which we're excited to share soon. And yeah, I think AI for science is becoming a very hot topic, and I think for good reason.
Yeah, we're literally starting an AI for science pod, like we're spinning out a separate Latent Space AI for science pod.
I love to hear that. Moving on to other notable work done this year, I feel like I have to mention Anthropic's circuit tracing paper. I don't know how much discussion there was internally for you guys. Let's just let you riff: what do you think?
Yeah, Jack I know has done a lot of circuit tracing. You want to take that?
Yeah, so when that came out, that was back in like March, we did a...
If you could, let's say people know about the SAE work, what is the difference? Because I struggle summarizing it. My summary, and you can correct me: it's like, you have full access to a model, you take an individual layer, and you train a replacement circuit that simplifies what it does.
Yeah, that's right. A good way to put it is that you take a representation in the middle, which is this weird, dense, uninterpretable web of concepts and features, and the SAE decomposes it into primitives: concepts firing for mentions of coffee shops, or mentions of New York City, or something like that. Those are much more interpretable. It basically shatters the representation into many, many pieces.
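For readers who haven't seen one, here is a minimal sparse autoencoder sketch of the decomposition described above: a dense activation is encoded into a much wider, mostly-zero feature vector and then reconstructed, with each decoder row acting as one interpretable "primitive". The dimensions and loss coefficient are illustrative only.

```python
# Minimal sparse autoencoder (SAE) sketch: a dense activation x (d_model) is
# encoded into a much wider, mostly-zero feature vector f (d_features), then
# reconstructed as x_hat. Each decoder row corresponds to one "primitive"
# concept (e.g. a coffee-shop feature). Hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))     # sparse feature activations
        x_hat = self.decoder(f)             # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes features toward zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Usage: collect residual-stream activations from a model, train the SAE on
# them, then inspect which inputs make each feature fire to label the features.
```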
The big change with the cross-layer transcoder, the circuit tracing work, was to really scale it up to incorporate every layer. So it's cross-layer: the model is called a cross-layer transcoder because it ties features across different layers. And then the tracing part is a method for creating what's called an attribution graph through those features, which are interpretable, to describe how the model produces one output, through every layer and through a bunch of token positions. I can talk a bit more about it.
When that came out in March or April, it was still pretty early for us; there were like eight to ten of us at the time, I think. We tried to replicate some of their findings. We were really curious about what it would look like to train one, what it's like to use it, and basic scientific questions as well, like whether it could rediscover rich representations from previously understood circuits. So we put out a post, I don't remember exactly when it went up, but I think over the summer, like May or June, on our replication effort and how that works.
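To contrast with the per-layer SAE sketch above, here is a rough sketch of the cross-layer transcoder idea as described in this conversation; the details differ from Anthropic's actual implementation. Features encoded from the residual stream at one layer contribute, through separate decoders, to the replaced MLP outputs at that layer and every later layer, which is what lets attribution graphs trace influence across layers.

```python
# Rough sketch of a cross-layer transcoder: sparse features encoded at layer L
# write, via separate decoders, into the predicted MLP outputs of layer L and
# every later layer. Sizes are illustrative only.
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    def __init__(self, n_layers: int = 12, d_model: int = 768, d_features: int = 8192):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_model, d_features) for _ in range(n_layers))
        # decoders[src][dst - src] maps layer-`src` features into layer-`dst` MLP output
        self.decoders = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_features, d_model, bias=False)
                          for _ in range(n_layers - src))
            for src in range(n_layers)
        )

    def forward(self, resid: list[torch.Tensor]) -> list[torch.Tensor]:
        """resid[L] is the residual-stream input to layer L's MLP.
        Returns predicted MLP outputs for every layer."""
        n = len(resid)
        feats = [torch.relu(self.encoders[L](resid[L])) for L in range(n)]
        outputs = []
        for dst in range(n):
            # Sum contributions from this layer's and all earlier layers' features.
            outputs.append(sum(self.decoders[src][dst - src](feats[src])
                               for src in range(dst + 1)))
        return outputs
```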
Is there an obvious next step? What is this all leading up to? Because to me it's basically just always scaling it up, always making it more unsupervised. I don't know if there are other trends you can see, like, yeah, these are the core principles we're just exploring.
That's a good point to bring up interpretability as an alignment science versus as a science for understanding models more broadly. I think it depends on what your motivations are. If you're focused on reading a model's mind and understanding what's going on internally, to make sure it's not having bad thoughts or isn't misaligned, it makes a lot of sense to do exactly what they're doing. I don't know exactly what they're up to, but applying these techniques to read out what's going on inside a model's mind is a very nice detection framework. But if we want really robust control of models, of what they learn during training, I think that's where there's so much more work to be done. Many different teams are working on these problems, but in terms of next steps, those are some other directions to go.
To Jack's point, I think the circuit tracing work is super cool, and it's sort of one arrow in a larger quiver of useful techniques, and which technique is useful depends on the task at hand. The techniques that are useful for alignment science and evaluations, that's one use case, but there's a whole world of other things you might wish to apply other techniques to. Maybe you don't need circuits for things that can be uncovered using probes, or using other techniques for understanding how post-training changed a model, so this model diffing use case. And maybe the things you want to use as inference-time guarantees for models look a little bit different. So I think it's super cool, but to your point, there's a variety of techniques depending on the final use case you're interested in.
Last thing, which a lot of people in New York are talking about, I've had like two conversations about it already, is what's happening with Neel Nanda. I don't like celebrity culture, but it's hard to avoid Neel's impact. He basically said he's pivoting his team at DeepMind to no longer focus on whatever they were doing, and now it's pragmatic interpretability. What's that conversation about? What do insiders think?
I read Neel's post, and I also share the hope of a pragmatic future for interpretability, for sure. I think a lot of the...
Why didn't we think of that earlier? Oh god.
Well, I also think that with a lot of the response to it, I'm not sure folks actually read it all the way through. For some reason there's a lot of reading of it as, oh, interpretability is dead or something. If anything, it says interpretability is alive and well: there are use cases for non-black-box techniques that can be brought to bear in real-world settings. So we probably agree with Neel on that. And then I think it's also good for different companies to have different agendas they're pursuing. It's dangerous, you know; a field will stagnate if everyone just converges on the exact same approaches.
Not sure if you have other thoughts, but to the point about whether interpretability is dead or something, I think that's a gross misattribution of what that post is about.
Nobody's saying it. I'm not saying it, and nobody I talked to was saying it. It was much more that the existing approaches, for him, were not scaling to some extent. The way I'd put it is kind of: okay, let's forget about complete understanding and manage by outcome. Let's measure ourselves by our ability to steer outcomes, and forget whether we know precisely what's going on.
Yeah, totally. So in terms of whether we should just be focusing on reverse engineering models from the ground up, that's maybe where the big change is. I think we'd probably agree that interpretability should be useful, and we're trying to get it used right now, and we are getting it used right now. Maybe where we land on this is that we're really focused on use-case-inspired basic research. So yes, we're very pragmatic, but there's also a lot of very deep foundational science to be done on understanding models, which is, yes, driven by outcomes, but we also want to develop the understanding of what goes on in models. Because if we're not doing that, then every lab producing models is just a black-box AI lab, and that's not how things should be done.
Oh, Jack snuck in a little reference there that I think was good. There's this concept we like to talk about internally called Pasteur's quadrant. So, Louis Pasteur...
Is this like a statistical thing with the same distribution but four different... okay.
No, this is more of a conceptual framework. The two axes of the quadrant, I've heard it described as discovery versus invention, or pure research versus applied research, something in that domain. The idea is that you can have just pure basic research; the classic example there is Niels Bohr, understanding the atom, understanding the electron, just out of theoretical physics interest. Then you have the corner that is purely applied research, where you have an end in sight, which is Edisonian research: Thomas Edison says, I want to make a light bulb, I'll learn the chemistry I need to, but the goal is the goal. And then there's Louis Pasteur, who spans both of these. The idea is that it's not just a linear spectrum between basic research and applied research; you can have a combination of the two. Pasteur was a pioneer of germ theory, but also engineered some of the first vaccines. So the idea is that you should be bouncing back and forth between open-ended foundational research and having a goal in sight, and be able to really move between the two. I think there are a lot of cases where you see this being a really productive way to make progress in a field.
So it's kind of a company hero-slash-mascot thing. I think it's Tom McGrath, one of the co-founders of Goodfire, who started the interp team at DeepMind back in the day, and he has a lot of these great references from other domains of science that are really good grounding points for how we operate.
Amazing. Let's get a quick call to action in. You guys are hiring. What are you hiring for? What's hard to find?
Definitely actively hiring. Folks in the AI engineer community should definitely go to goodfire.ai and check out our careers page. We're actively hiring for researchers and engineers, and there's a combination there. We're doing foundational interpretability research, but we're also building out a platform to apply this in real-world use cases, with customers across life sciences, materials, government, and financial services. So if you're someone with an MLE background, no interp experience is required whatsoever. You can see we have different backgrounds: I'm coming from industry, from engineering; Jack's coming from a PhD; other folks at Goodfire have backgrounds in quant trading firms or frontier labs, all sorts of places. But if you like training big models, building agents, engineering systems, we're hiring across a variety of roles and really looking to fill some of those engineering gaps.
Excellent. I think that's it.
Yeah, that's it. Thanks for having us.
Thanks for coming.