Latent Space
February 5, 2026

Goodfire AI’s Bet: Interpretability as the Next Frontier of Model Design — Myra Deng & Mark Bissell


Goodfire AI pioneers interpretability methods to move beyond black-box AI, enabling precise control and intentional design. This shift promises safer, more powerful AI, especially in high-stakes applications, by allowing humans to understand and guide model behavior.

  • 💡 How can interpretability techniques like steering move beyond stylistic tweaks to enable surgical, high-stakes model customization?
  • 💡 What are the current limitations of unsupervised interpretability methods (like SAEs) in real-world production scenarios, and what comes next?
  • 💡 How can interpretability bridge the gap between human understanding and superhuman AI capabilities, particularly in scientific discovery?

Top 3 Ideas

🏗️ Defining the Problem

*If you ask 50 people who quote unquote work in interp like what is interpretability, you'll probably get 50 different answers.*
  • Core Challenge: Interpretability lacks a clear definition. Goodfire defines it as methods to understand, learn from, and design AI.
  • Beyond Poking: Current AI development often "pokes" at models post-training. Goodfire aims for intentional design, integrating interpretability from the ground up.
  • Science of Deep Learning: Goodfire views interpretability as foundational to the "science of deep learning," demystifying internal representations across the AI lifecycle.

🏗️ Surgical Control

*Nobody knows what's going on, right? Like subliminal learning is just an insane concept when you think about it.*
  • Precision Edits: Interpretability enables "surgical edits" to model behavior, removing unwanted traits like bias or hallucinations, or instilling desired traits.
  • Gen Z Kimi: A demo showed real-time steering on a trillion-parameter model (Kimi K2), transforming its output into Gen Z slang and highlighting dynamic, inference-time control.
  • Beyond Style: While stylistic steering is a compelling demo, the real goal is applying this control to high-stakes issues like hallucination mitigation, where black-box methods struggle.

🏗️ Intentional Design

*The current life cycle of training, deploying, and evaluating models is, to us, deeply broken and has opportunities to improve.*
  • RL's Limits: Current reinforcement learning (RL) is likened to teaching a child with only "cookies or a slap," lacking intentional feedback.
  • Human-AI Interface: Interpretability provides a bidirectional communication channel, allowing humans to control models and enabling superhuman AI to teach us new scientific knowledge.
  • Scientific Discovery: Goodfire applies interpretability to scientific models to find novel biomarkers for diseases like Alzheimer's, showing AI can make discoveries beyond human intuition.

Actionable Takeaways

  • 🌐 The Macro Shift: The era of opaque, black-box AI is ending; the future demands intentionally designed models with human understanding and control. This shift is driven by reliability in high-stakes applications and extracting novel insights.
  • The Tactical Edge: Investigate interpretability tools (like Goodfire's platform) to gain granular control over model behavior, moving beyond basic fine-tuning for critical applications.
  • 🎯 The Bottom Line: Interpretability is not a niche; it's the missing piece for scaling AI safely into mission-critical domains. Mastering model understanding and intentional design will yield unprecedented capabilities and competitive advantage.

Podcast Link: Click here to listen

If you ask 50 people who quote unquote work in interp like what is interpretability, you'll probably get 50 different answers to some extent.

Also, where Goodfire sits in the space: we're an AI research company above all else, and interpretability is a set of methods that we think are really useful and worth specializing in in order to accomplish the goals we want to accomplish.

But I think we also see some of the goals as even broader, almost like the science of deep learning: taking a non-black-box approach to internal representations, and then bringing interpretability to training, which I don't think has been done all that much before.

So welcome to Latent Space. We're back in the studio with our special guest co-host Vibhu, and Mochi, the mechanistic interpretability doggo. We have with us Mark and Myra from Goodfire. Welcome.

Thanks for having us on.

Maybe we can introduce Goodfire first and then introduce you two. How do you introduce Goodfire today?

Yeah, it's a great question. So Goodfire, we like to say, is an AI research lab that focuses on using interpretability to understand, learn from, and design AI models. And we really believe that interpretability will unlock the next frontier of safe and powerful AI models. That's our description right now, and I'm excited to dive more into the work we're doing to make that happen.

Yeah. And there's always like the official description. Is there like an unofficial one that sort of resonates more with a different audience?

Well, being an AI research lab that's focused on interpretability, obviously people have a lot in mind when they think of interpretability. And I think we have a pretty broad definition of what that means and the types of places it can be applied, in particular applying it in production scenarios in high-stakes industries, and really taking it from the research world into the real world. It's a new field, so that hasn't been done all that much, and we're excited about actually seeing it put into practice.

Yeah, I would say it wasn't too long ago that the field was still putting out toy models of superposition and that kind of stuff, and I wouldn't have pegged it to be this far along.

When you and I talked at NeurIPS, you were talking a little bit about your production use cases and your customers. And then, not to bury the lede, today we're also announcing the fundraise: your Series B, $150 million at a $1.25B valuation. Congrats, you're a unicorn.

Thank you. Yeah, things move fast. We were talking to you in December and there are already some big updates since then.

Let's dive, I guess, into a bit of your backgrounds as well. Mark, you were at Palantir working on health stuff, which is really interesting because Goodfire has some interesting health use cases. I don't know how related they are in practice.

Yeah, not super related, but it was helpful context to know what it's like to work with health systems and generally in that domain.

Yeah. And Myra, you were at Two Sigma, which actually I was also at, really back in the day. Wow, nice. Did we overlap at all? No.

That was when I was briefly a software engineer, before I became a sort of developer relations person. And now you're head of product. What are your respective roles, just to introduce people to what all gets done at Goodfire?

Yeah, prior to Goodfire I was at Palantir for about three years as a forward deployed engineer (now a hot term; it wasn't always that way) and as a technical lead on the healthcare team. At Goodfire I'm a member of the technical staff, and honestly that's about as specific as I can describe myself, because I've worked on a range of things, and it's a fun time to be on a team that's still reasonably small.

I think when I joined I was one of the first 10 or so employees; now we're above 40. But there's still always a mix of research and engineering and product and all of the above that needs to get done, and I think everyone across the team is pretty much a switch hitter in the roles they play.

So I think you've seen some of the stuff I worked on related to image models, which was sort of a research demo. More recently I've been working on our scientific discovery team with some of our life sciences partners, but also building out our core platform, flexing some of the MLE and developer skills as well. Very generalist.

And you also had a very founding-engineer-type role. Yeah. So I also started as, and still am, a member of technical staff. I did a wide range of things from the very beginning, including finding our office space and all of that nitty-gritty. People visited when you had that open house thing. It was really nice. Thank you. Yeah, plug to come visit our office.

It has room for like 200 people, but you guys were like 10. Yeah, for a while it was very empty. But yeah, like Mark, I spend a lot of my time as head of product. I think product is a bit of a weird role these days, but a lot of it is thinking about how we take our frontier research and really apply it to the most important real-world problems, how that then translates into a repeatable platform or product, and working across the engineering and research teams to make that happen.

And also communicating to the world: what is interpretability, what is it used for, what is it good for, why is it so important? All of these things are part of my day-to-day as well. I love "what is" questions, because that's a very crisp starting point for people coming to a field.

I'll do a fun thing. Vibhu, do you want to try tackling "what is interpretability," and then they can correct us? Okay, great. So one thing, just to kick off: it's a very interesting role to be head of product, because you guys, at least as a lab, are more of an applied interp lab, which is pretty different from typical interp with a lot of background research. You actually ship an API to try these things, you have Ember, you have products around it, which not many do. Okay, what is interp? Basically, you're trying to have an understanding of what's going on in the model internals. There are different approaches to do that: probing, SAEs, transcoders, all this stuff.

But basically, you have a hypothesis, something you want to learn about what's happening in a model's internals, and then you're trying to test that. From there you can do things like activation mapping, you can try steering; there's a lot you can do. But the key question is: from input to output, we want a better understanding of what's happening, and how we can adjust what's happening in the model internals.
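
To make that "hypothesis about the internals" workflow concrete, here is a minimal sketch of one common recipe: capture residual-stream activations from a model and fit a linear probe for a concept you hypothesize it represents. The model name, layer index, and toy labeled data below are illustrative assumptions, not anything specific to Goodfire's stack.

```python
# A minimal sketch of the "hypothesis about internals" workflow: capture
# residual-stream activations and fit a linear probe for a concept.
# Model name, layer index, and the toy labeled data are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"   # stand-in; any causal LM exposing hidden states works
LAYER = 6        # hypothesized layer where the concept is represented

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def residual_at_last_token(text: str) -> torch.Tensor:
    """Residual-stream activation at the final token of `text`."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: [d_model]

# Toy labels: 1 = text contains the concept of interest, 0 = it does not.
texts = ["The Eiffel Tower is in Paris.", "I love sunny weather.",
         "The Colosseum is in Rome.", "My cat sleeps all day."]
labels = [1, 0, 1, 0]

X = torch.stack([residual_at_last_token(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```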

How'd I do? That was really good, I think that was great. It's also kind of a minefield: if you ask 50 people who quote-unquote work in interp what interpretability is, you'll probably get 50 different answers.

And to some extent, also where Goodfire sits in the space: I think we're an AI research company above all else, and interpretability is a set of methods that we think are really useful and worth specializing in in order to accomplish the goals we want to accomplish.

But I think we also see some of the goals as even broader, almost like the science of deep learning: taking a non-black-box approach to any part of the AI development lifecycle, whether that means using interp for data curation while you're training your model, for understanding what happened during post-training, or for understanding activations and internal representations, what is in there semantically. And then a lot of the exciting updates that are also part of the fundraise are around bringing interpretability to training, which I don't think has been done all that much before.

A lot of the existing work is post hoc poking at models, as opposed to actually using this to intentionally design them. Is this post-training or pre-training, or is it focused on post-training? It's focused on post-training, but there's no reason the techniques wouldn't also work in pre-training.

It seems like it would be more applicable post-training, because I'm thinking of rollouts, or having different variations of a model that you can tweak with your steering.

Yeah. And in a lot of the news you've seen on Twitter or wherever, there have been a lot of unintended side effects coming out of post-training processes: overly sycophantic models, or models that exhibit strange reward-hacking behavior. Those are extreme examples; there are also more mundane enterprise use cases where someone tries to customize or post-train a model to do something and it learns some noise, or it doesn't appropriately learn the target task.

And a big question that we've always had is like how do you use your understanding of what the model knows and what it's doing to actually guide the learning process more effectively?

Yeah. I mean, just to anchor this for people, one of the biggest controversies of last year was 4o glazegate. I've never heard... I didn't know that was what it was called. No, they called it that in the blog post, and I was like, wow, OpenAI officially used that term, that's funny. But I guess the pitch is that if they had worked with Goodfire, they would have avoided it? I think so. Yeah.

I think that's certainly one of the use cases. And another reason post-training is a place where this makes a lot of sense is that a lot of what we're talking about is surgical edits: you want to be able to have expert feedback very surgically change how your model behaves, whether that's removing a certain behavior it has or something else.

So one of the things we've been looking at, another common area where you would want to make a somewhat surgical edit, is models that have, say, political bias. Like, you look at Qwen or R1 and they have this sort of CCP bias in them. And is there a CCP vector? Well, there are certainly internal parts of the representation space where you can see where that lives.

Yeah. And you want to extract that piece out. Well, I always say, whenever you find a vector, a fun exercise is to just make it very negative to see what the opposite of CCP is. The Super America bald eagles flying everywhere.

But yeah, in general there are lots of post-training tasks where you'd want to be able to do that, whether it's unlearning a certain behavior, or another case where this comes up: are you familiar with the grokking behavior? I know the machine learning term of grokking. Yeah, sort of this double descent idea of having a model that is able to learn a generalizing solution. Even if memorization of some task would suffice, you want it to learn the more general way of doing the thing.

And so another way you can think about having surgical access to a model's internals would be: learn from this data, but learn it in the right way, if there are many possible ways to do that. Can interp solve the double descent problem? Depends, I guess, on how you... okay, so I view double descent as a problem because then you think, well, if the loss curves level out then you're done, but maybe you're not done, right? But if you can actually interpret what is generalizing, or what is still changing even though the loss is not changing, then maybe you don't have to view it as a problem; you're just translating the space in which you view loss, and then you have a smooth curve.

Yeah, I think that's certainly the domain of problems that we're looking to get at.

To me, double descent is like the biggest thing in ML research, where if you believe in scaling you need to know where to scale, but if you believe in double descent you don't believe that anything really levels off. Also, tangentially, when you talk about the China vector, there's the subliminal learning work from the Anthropic Fellows program, where basically you can have hidden biases in a model, and as you distill down, as you train on distilled data, those biases still show up even if you explicitly try not to train on them.

So it's just another use case: okay, if we can interpret what's happening in post-training, can we clear some of this out? Can we even determine what's there? Because there's some worrying research out there showing that we really don't know what's going on.

That is, yeah, I think that's the biggest sentiment we're hoping to tackle. Nobody knows what's going on, right? Subliminal learning is just an insane concept when you think about it: train a model on, not even the logits, literally the output text of a bunch of random numbers, and now your model loves owls. You see behaviors like that that just defy intuition, and there are mathematical explanations you can get into, but it's early days. Objectively, there are sequences of numbers that are more owl-like than others.

There should be, according to certain models, right? It's interesting. I think it only applies to models that were initialized from the same starting point, usually. Yes. But I think that's a cheat code because there's not enough compute. If you believe in platonic representations, it probably will transfer across different models.

Oh, you think so? I think of it more as a statistical artifact of models initialized from the same seed. There's something path-dependent from that seed that might cause certain overlaps in the latent space, and then doing this distillation pushes it towards having certain other tendencies. Got it.

I think there's research on a bunch of these open-ended questions, right? Like, you can't train in new stuff during the RL phase: RL only reorganizes weights, and you can only do stuff that's somewhat there in your base model. You're not learning new stuff, you're just reordering chains and stuff.

But okay, my broader question is: when you work at an interp lab, how do you decide what to work on, and what's the thought process? Because we can ramble for hours about "I want to know this, I want to know that," but how do you do it concretely? What's the workflow? There are different approaches towards solving a problem: I can try prompting, I can look at chain of thought, I can train probes, SAEs. But how do you determine whether something is going anywhere? Do you have a set process? If you can talk about that.

It's a really good question. From the very beginning of the company, we've thought about it as: let's go and try to learn what isn't working in machine learning today, whether that's talking to customers or talking to researchers at other labs, trying to understand both where the frontier is going and where things are really falling apart today, and then developing a perspective on how we can push the frontier using interpretability methods.

So even our chief scientist Tom spends a lot of time talking to customers and trying to understand what the real-world problems are, then taking that back and trying to apply the current state of the art in interp to those problems, seeing where it falls down, and then using those failures or shortcomings to understand which hills to climb when it comes to interpretability research.

So on the fundamental side, for instance, when we have done work applying SAEs and probes, we've encountered some shortcomings in SAEs that we found a little bit surprising, so we've gone back to the drawing board and done work on better foundational interpreter models; a lot of our team's research is focused on what the next evolution beyond SAEs is. And when it comes to control and design of models, we tried steering with our first API and realized it still fell short of black-box techniques like prompting or fine-tuning, so we went back to the drawing board and asked how to make that not the case and how to improve beyond it. One of our researchers, Ekdeep, who just joined (actually, Ekdeep and Atticus are our steering experts), has spent a lot of time trying to figure out what research enables us to do this in a much more powerful, robust way.

So yeah, the answer is: look at real-world problems, try to translate that into a research agenda, and then hill-climb on both of those at the same time.

Yeah. Mark has the steering CLI demo queued up, which we'll go into in a sec, but I always want to double-click when you drop hints like "we found some problems with SAEs." Okay, what are they? And then we can go into the demo.

Yeah, and I'm curious if you have more thoughts here as well, because you've done it in the healthcare domain. But for instance, when we do things like trying to detect behaviors within models that are harmful, or behaviors that a user might not want their model to have (hallucinations, harmful intent, PII, all of these things), we first tried using SAE probes for a lot of these tasks: taking the feature activation space from SAEs, training classifiers on top of that, and seeing how well we can detect the properties we want to detect in model behavior.

And we've seen in many cases that probes trained on raw activations seem to perform better than SAE probes, which is a bit surprising if you think SAEs are actually capturing the concepts you want to capture cleanly and more surgically. So that's an interesting observation. I'm not down on SAEs at all; there are many, many things they're useful for. But we have definitely run into cases where the concept space described by SAEs is not as clean and accurate as we would expect it to be for real-world downstream performance metrics.
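
As a rough illustration of that comparison, here is a sketch of the evaluation harness: train one probe on raw activations and one on SAE feature activations and compare held-out accuracy. The activations, labels, and the "SAE encoder" below are random placeholders standing in for real data and a trained SAE, not Goodfire's pipeline.

```python
# Sketch of the comparison harness: probe on raw activations vs. probe on
# SAE feature activations. Activations, labels, and the "SAE encoder" are
# random placeholders standing in for real data and a trained SAE.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(512, 768))      # residual-stream activations
y = rng.integers(0, 2, size=512)         # labels for the behavior of interest

W_enc = rng.normal(size=(768, 2048))     # stand-in for SAE encoder weights
def sae_encode(x):
    return np.maximum(x @ W_enc, 0.0)    # ReLU(x @ W_enc): sparse-ish features

X_sae = sae_encode(X_raw)

for name, X in [("raw activations", X_raw), ("SAE features", X_sae)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(f"{name:16s} probe accuracy: {acc:.3f}")
```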

Fair enough. Yeah, it's the blessing and the curse of unsupervised methods: you get to peek into the AI's mind, but sometimes you wish you saw other things when you looked inside. Although in the PII instance, didn't an SAE-based approach actually prove to be the most generalizable?

Well, in the case that we published with Rakuten, I think a lot of the reason it worked well was because we had a noisier dataset. So actually the blessing of unsupervised learning was that we got more meaningful, generalizable signal from SAEs when the data was noisy. But in other cases, where we've had good datasets, that hasn't been the case.

And just because you named Rakuten, and I don't know if we'll get another chance: what is Rakuten's overall usage, or production usage?

Yeah. So they are using us to essentially guardrail and monitor, at inference time, their language model and agent usage, to detect things like PII so that they don't route private user information to downstream model providers. That's going through all of their user queries every day, and it's something we deployed with them a few months ago. And now we're exploring very early partnerships, not just with Rakuten but with others, around how we can help with training and customization use cases as well.

Yeah, and for those who don't know, Rakuten is, I think, the number one or number two e-commerce store in Japan. Yes. And I think that use case actually highlights a lot of what it looks like to deploy things in practice that you don't always think about when you're doing research tasks.

Some of the stuff that came up there is more complex than your idealized version of the problem. They were encountering things like synthetic-to-real transfer of methods: they couldn't train probes, classifiers, things like that on actual customer PII data. So what they had to do was use synthetic datasets and then hope that transfers out of domain to real datasets; we could evaluate performance on the real datasets, but not train on customer PII. So that, right off the bat, is a big challenge.

You have multilingual requirements: this needed to work for both English and Japanese text, and Japanese text has all sorts of quirks, including tokenization behaviors that caused lots of bugs and had us pulling our hair out. And then, for a lot of tasks, you might make simplifying assumptions if you're treating it as the easiest version of the problem just to get general results; maybe you classify a whole sentence and ask, does this contain PII? But the need Rakuten had was token-level classification, so that you could precisely scrub out the PII.

So as we learned more about the problem, and this speaks to what it looks like in practice, a lot of assumptions end up breaking. That was just one instance where a problem that seems simple right off the bat ends up being more complex as you keep diving into it. Excellent.

One of the things that's also interesting with interp is that a lot of these methods are very efficient, right? You're just looking at the model's internals, compared to a separate guardrail LLM-as-a-judge: a separate model that, one, you have to host, and two, adds a whole latency hit, and if you use a big model you have a second call. Some of the work around self-detection of hallucination is also deployed for efficiency. So for someone like Rakuten doing this live in production, that's just another thing people should consider. Yeah, and something like a probe is super lightweight; it adds essentially no extra latency. Excellent.
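
For a sense of why a probe adds essentially no overhead, here is a hypothetical token-level PII scrubber: the per-token activations already come out of the forward pass you are running anyway, and the probe itself is a single dot product per token. The model, layer, and probe weights below are stand-ins, not the production Rakuten setup.

```python
# Hypothetical token-level PII scrubber: per-token activations come out of the
# forward pass you are already running, and the probe is one dot product per
# token. Model, layer, and probe weights are stand-ins, not the Rakuten setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

d_model = model.config.hidden_size
probe_w, probe_b = torch.randn(d_model), torch.tensor(0.0)  # would be trained weights

def scrub_pii(text: str, threshold: float = 0.9) -> str:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[LAYER][0]        # [seq, d_model]
    scores = torch.sigmoid(hidden @ probe_w + probe_b)       # per-token PII score
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    kept = ["[PII]" if s > threshold else t for t, s in zip(tokens, scores)]
    return tok.convert_tokens_to_string(kept)

print(scrub_pii("My name is Jane Doe and my phone number is 555-0100."))
```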

You have the steering demo lined up, so let's see what you've got. I don't actually know if this is the latest-latest or more of an alpha thing. No, this is a pretty hacky demo from a presentation that someone else on the team recently gave, but it will give a sense for steering in action. Honestly, I think the biggest thing it highlights is that as we've been growing as a company and taking on more and more ambitious versions of interpretability-related problems, a lot of that comes down to scaling up in various forms.

And so here you're going to see steering on a one-trillion-parameter model. This is Kimi K2. It's fun that, in addition to the research challenges, there are engineering challenges we're now tackling, because for any of this to be useful in production, you need to be thinking about what it looks like to use these methods on frontier models, as opposed to toy model organisms.

So yeah, this was thrown together hastily and is pretty fragile behind the scenes, but I think it's quite a fun demo. Screen sharing is on. I've got two terminal sessions pulled up here. On the left is a forked version of the Kimi CLI that we've got running, pointed at our custom-hosted Kimi model. And on the right is a setup that will allow us to steer on certain concepts. So I should be able to chat with Kimi over here; tell it hello. Is this running locally?

The CLI is running locally, but the Kimi server is running back at the office. Well, hopefully it is. That's too much to run on that Mac. Yeah, it takes a full H100 node; I think you can run it on 8 GPUs. So yeah, Kimi's running and we can prompt it. It's got a forked version of the SGLang codebase that we've been working on. So I'm going to tell it: hey, this SGLang codebase is slow, I think there's a bug, can you try to figure it out? It's a big codebase, so it'll spend some time doing this.

And then on the right here, I'm going to initialize some steering in real time. Let's see: continue searching for any bugs. Feature ID 43205, layers 20, 30, 40. This is basically a feature that we found inside Kimi that seems to cause it to speak in Gen Z slang. On the left it's still thinking normally; it might take, I don't know, 15 seconds for this to kick in, but then we should hopefully start seeing it.

"Dude, this codebase is massive, for real." So we're going to start seeing Kimi transition, as the steering kicks in, from normal Kimi to Gen Z Kimi, both in its chain of thought and its actual outputs. And interestingly, you can see it's still able to call tools and so on; it's purely its demeanor that changes. There are other features we found for interesting things like concision (that's a more practical one; you can make it more concise) and the types of programming languages it uses.
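
Mechanically, the kind of steering in the demo amounts to adding a feature's direction to the residual stream at a few layers while the model generates. Here is a minimal sketch of that idea using forward hooks; the model, layers, scale, and random direction are illustrative assumptions (in the demo, the direction would come from the SAE decoder row for feature 43205).

```python
# Sketch of inference-time steering: add a feature direction to the residual
# stream at a few layers via forward hooks. Model, layers, scale, and the
# random direction are illustrative; in the demo the direction would come
# from the SAE decoder row for feature 43205.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
LAYERS = [4, 6, 8]   # the demo steered layers 20/30/40 on a much larger model
SCALE = 8.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

direction = torch.randn(model.config.hidden_size)
direction /= direction.norm()                      # unit-norm steering vector

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction            # shift the residual stream
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handles = [model.transformer.h[i].register_forward_hook(steering_hook) for i in LAYERS]

ids = tok("Tell me about this codebase:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()                                     # un-steer: back to normal behavior
```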

But yeah, as we're seeing it come in, pretty good output: "Scheduler code is actually wild." Something about how agents for interp are different from coding agents, I don't know. While this is running, how do we find feature 43205?

Yeah. So in this case, our platform, which we've been building out for a long time now, supports all the classic out-of-the-box interp techniques you might want, like SAE training, probing, things of that kind. I'd say the techniques for vanilla SAEs are pretty well established now: you take the model you're interpreting, run a whole bunch of data through it, gather activations, and then it's a fairly straightforward pipeline to train an SAE. There are a lot of varieties: top-k SAEs, batch top-k SAEs, normal ReLU SAEs. And then once you have your sparse features, to your point, assigning labels to them so you actually understand that this is the Gen Z feature is where a lot of the magic happens.

The most basic, standard technique is: look at all of the input dataset examples that cause this feature to fire most highly, and you can usually pick out a pattern. So for this feature, if I've run a diverse enough dataset through my model, feature 43205 probably tends to fire on all the tokens that sound like Gen Z slang. You could have a human go through all 43,000 concepts and look for the pattern, but to automate that you just hand those examples off to a frontier LLM and ask it to identify the pattern.
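
The auto-labeling step he describes is simple enough to sketch end to end: collect the dataset examples where the feature fires hardest and hand them to a frontier model with a "name the pattern" prompt. The texts and activation values below are made up; only the shape of the pipeline is the point.

```python
# Sketch of auto-labeling an SAE feature: gather its top-activating examples
# and hand them to a frontier LLM with a "name the pattern" prompt. The texts
# and activation values are made up; only the pipeline shape is the point.
import numpy as np

texts = ["no cap that's bussin fr", "the quarterly report is attached",
         "lowkey this slaps ngl", "please review the merge request"]
feature_acts = np.array([0.92, 0.03, 0.88, 0.01])  # one feature's activation per text

top_idx = np.argsort(-feature_acts)[:2]            # top-activating examples
examples = "\n".join(f"- {texts[i]} (activation {feature_acts[i]:.2f})" for i in top_idx)

prompt = (
    "These snippets most strongly activate one feature of a sparse autoencoder.\n"
    f"{examples}\n"
    "In a short phrase, what concept does this feature represent?"
)
print(prompt)  # send this to any frontier LLM to get a human-readable label
```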

And I've got to ask the basic question: can we get examples where it hallucinates, pass them through, see what feature activates for hallucination, and then just turn hallucination down?

Oh wow, you really solved it. You really predicted a project we're already working on right now, which is detecting hallucinations using interpretability techniques. And this is interesting because hallucination is something that's very hard to detect; it's kind of a hairy problem and something that black-box methods really struggle with. Whereas for Gen Z slang you could always train a simple classifier to detect it, hallucination is harder. But we've seen that models internally have some awareness of uncertainty, or some sort of user-pleasing behavior, that leads to hallucinatory behavior.

And so yeah, we have a project that's trying to detect that accurately, and we're also working on mitigating the hallucinatory behavior in the model itself. Yeah. And I would say most people are still at the level of "oh, I'll just turn temperature to zero and that turns off hallucination." And I'm like, well, that's a fundamental misunderstanding of how this works.

Yeah. Although, part of what I like about that question is that there are SAE-based approaches that might help you get at that. But oftentimes the beauty of SAEs, and as we said the curse, is that they're unsupervised. So when you have a behavior that you deliberately want to remove, that's more of a supervised task, and it's often better to use something like probes and specifically target the thing you're interested in reducing, as opposed to hoping that when you fragment the latent space, one of the vectors that pops out will be the thing you care about.
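
A concrete version of "specifically target the thing you're interested in reducing" is directional ablation: learn a direction for the behavior with a supervised probe, then project that component out of the residual stream at inference. Below is a generic sketch of that technique, not Goodfire's specific method; the direction here is random where a real one would be learned, and off-target effects would still need careful evaluation.

```python
# Generic sketch of directional ablation: project a learned behavior direction
# out of the residual stream at inference. The direction here is random; a real
# one would come from a supervised probe, and off-target effects would still
# need careful evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

behavior_dir = torch.randn(model.config.hidden_size)
behavior_dir /= behavior_dir.norm()

def ablate_direction(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    coeff = hidden @ behavior_dir                         # component along the direction
    hidden = hidden - coeff.unsqueeze(-1) * behavior_dir  # remove only that component
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(ablate_direction)
ids = tok("The capital of France is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```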

And as much as we're training an autoencoder to be sparse, we're not certain we'll get something that just correlates to hallucination, right? You'll probably split that up into 20 other things, and who knows what they'll be.

Of course, right. So there are known problems like feature splitting and feature absorption, and then there are the off-target effects. Ideally you want to be very precise: if you reduce the hallucination feature and suddenly your model can't write creatively anymore, maybe you don't like that; you still want it to stop hallucinating facts and figures.

Good. So there's a paper to recommend there that we'll put in the show notes. But since your demo is done, are there any other things you want to highlight, or any other interesting features you want to show?

I don't think so. Yeah, like I said, this is a pretty small snippet. The main point I think is exciting is that there's not a whole lot of interp being applied to models at quite this scale. Anthropic certainly has some research, and other teams as well, but it's nice to see these techniques being put into practice. Not that long ago, the idea of real-time steering of a trillion-parameter model would have sounded crazy.

Yeah, the fact that it's real time, that you started the thing and then edited the steering vector, is an interesting one. TBD what the actual production use case would be for the real-time editing.

That's the fun part of the demo, right? You can see how this could be served behind an API: you only have so many knobs, and you can just tweak them a bit more. And I don't know how it plays in; people haven't done that much with how this works with or without prompting, or how it works with fine-tuning. There's the whole hype of continual learning, right? So there's just so much to see. Is this another parameter? Is it a parameter we just leave at a default and don't use? So I don't know, maybe someone here wants to put out a guide on how to use this with prompting and when to do what.

Oh, well, I have a paper recommendation that I think you would love, from Ekdeep on our team, who is an amazing researcher; I can't say enough amazing things about him.

He actually has a paper, along with some others from the team and elsewhere, that goes into the essential equivalence of activation steering and in-context learning, and how those relate. He thinks of everything in a cognitive-neuroscience, Bayesian framework. But basically, it shows how you can precisely demonstrate that prompting, in-context learning, and steering exhibit similar behaviors, and even get quantitative about the magnitude of steering you would need to induce a certain amount of behavior, similar to certain prompting, even for things like jailbreaks. It's a really cool paper.

Are you saying steering is less powerful than prompting?

More like you can almost write a formula that tells you how to convert between the two of them, so they're formally equivalent, actually, in the limit. One case study of this is jailbreaks. Have you seen the stuff where you can do many-shot jailbreaking, where you flood the context with examples of the behavior? When that paper came out, a lot of people were like, yeah, we've been doing this, guys.

What's in this in-context learning and activation steering equivalence paper is that you can predict the number of examples you would need to put in the context in order to jailbreak the model, by doing steering experiments and using this sort of equivalence mapping.

That's cool. That's really cool. That's very neat.

Yeah, I was going to say, I can back-rationalize that this makes sense, because what context does is basically update the KV cache, and then every next-token inference is still the sum of everything: all the weights plus all the context so far. And you could, I guess, theoretically replace that with your steering. The only problem is steering is typically on one layer, maybe three layers like you did, so it's not exactly equivalent.

Right, right. You need to get precise about how you define steering and how you're modeling the setup.

But yeah, I've got the paper pulled up here. The title is "Belief dynamics reveal the dual nature of in-context learning and activation steering," by Eric Bigelow and Daniel Wurgaft, who are doing fellowships at Goodfire; Ekdeep is the final author there. I think, actually, to your question of what the production use case of steering is: maybe if you just think one level beyond steering as it is today, imagine if you could adapt
