Latent Space
December 31, 2025

[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton

1000 Layer Networks for Self-Supervised RL

Kevin Wang et al, Princeton By Latent Space

Quick Insight: This summary explains how Princeton researchers broke the "shallow RL" ceiling by replacing reward-based learning with self-supervised representation learning. It provides a blueprint for applying LLM-style scaling to robotics and complex decision-making.

  • 💡 Why traditional RL fails: Why does traditional RL fail when you simply add more layers?
  • 💡 State classification advantage: How does moving from "reward maximization" to "state classification" enable massive depth?
  • 💡 Performance with 1000 layers: Can you achieve state-of-the-art performance with 1000 layers on a single H100?

Top 3 Ideas

🏗️ The Depth Deficit

"Reinforcement learning was like this one anomaly where we continue to use these really shallow networks."

  • The Depth Deficit: Traditional RL relies on noisy reward signals that break down in deep architectures. By moving to self-supervised objectives, researchers can finally use the same deep-learning playbooks that powered the LLM boom.
  • Vertical Efficiency: Scaling depth is linear in parameter growth, while scaling width is quadratic. Depth is the more efficient path to intelligence for resource-constrained builders.
  • The Architectural Recipe: Simply adding layers isn't enough; you need residual connections and layer normalization to prevent performance loss. This combination creates a "critical depth" where performance suddenly explodes.

🏗️ Death of the Reward

"Our code doesn't have a line of code saying maximize rewards here."

  • Classification Over Regression: The team replaced TD-error regression with a binary classification task. This change turns a "noisy" RL problem into a "stable" deep learning problem.
  • Implicit World Models: By learning to distinguish future states, the agent builds a map of the environment without needing explicit frame prediction. It is a leaner way to give robots intuition about their surroundings.

🏗️ The Robotics Frontier

"We can actually train robotic agents... with absolutely no human supervision."

  • Data Velocity Gains: Using GPU-accelerated, JAX-based environments, the team collected hundreds of millions of transitions in hours. This high-speed data pipeline is the fuel that 1000-layer networks need to saturate their learning capacity.
  • The Batch Size Unlock: Deep networks have the "headroom" to benefit from massive batch sizes that would overwhelm shallow models. Scaling depth actually makes other scaling dimensions more effective.

Actionable Takeaways

  • 🌐 The Macro Shift: The Great Convergence. The wall between RL and self-supervised learning is crumbling, leading to a unified "representation-first" approach to AI.
  • ⚡ The Tactical Edge: Swap your reward-heavy objectives for contrastive representation learning to access deeper, more stable architectures.
  • 🎯 The Bottom Line: If you aren't planning for RL models with 100x the current depth, you're building for the past. The next year of robotics will be defined by this vertical scaling.

Podcast Link: Click here to listen

So, welcome to Latent Space. We are basically trying to provide the best possible podcast experience of NeurIPS for people who are not here. And congrats on your paper. How does it feel?

Yeah, it was very exciting. We had a poster yesterday and then today we'll have an oral talk. We just got mobbed. There were a lot of people. It was like three hours straight of waves of people. I've never received a best paper award. Do you just find out on the website? I just woke up one day and checked my email and they told me, "Oh, you've been awarded best paper." Maybe you know from the reviews as well, right?

Yeah, we knew from the reviews that we did well. But there's a difference between doing well in the reviews and getting best paper. So, that part we didn't actually know.

Okay. So, I skipped a little bit. Maybe we can go one by one and introduce who you are and what you did on the team.

I'm Kevin. I was an undergrad at Princeton. I just graduated, and I led the project, started the project, and was very happy to collaborate with Ishaan and Michał and Ben.

And were you in the same research group? What's your social context?

So, we're all from Princeton. This project actually started from an IW seminar, an independent work research seminar that Ben was teaching, and this was one of my first experiences in ML research. It was really valuable to get that experience. Ishaan was also in that seminar and working on adjacent things, so we collaborated a lot during that seminar. The project turned out to have some pretty cool results, and later on others working on similar things also joined the project, and it became a good collaboration.

I don't know if any of you guys want to chime in on how you came to decide on this problem.

Broadly my lab works on deep reinforcement learning, but historically deep meant like two or three or four layers, not 10,000. When Kevin and Ishaan mentioned they wanted to try really deep networks, I was kind of skeptical it was going to work. I've tried this before, it doesn't work; other people have tried this before, and it doesn't work. So I was very, very skeptical starting out. I don't know if I said this at the time, but that was my prior going in.

But do you view your job as screening, like, "Hey guys, this probably isn't going to work, you should try a different idea"? Or should you be encouraging even if it's dumb?

It's selecting bets. And this was a bet I was willing to make.

What made you willing to make a bet?

It seemed relatively low cost, in that Michał in particular had spent the past year developing infrastructure that made it a lot easier to run some of these experiments, and the precedent was that deeper networks could do a whole lot better; that's what the deep learning revolution has been over the last decade. Why did we stop making them deeper? Reinforcement learning was like this one anomaly where we continue to use these really shallow networks, and that's particularly true in the settings we were looking at, where you're starting from scratch.

Any other perspectives you guys want to chime in with?

I guess maybe I should just go over an overview of our project.

Yes. Okay. Sorry. Yes. So, the way that I kind of view our project is that if you look at the landscape of deep learning, you have NLP, like language, then vision, and then RL. And as Ben alluded to, in language and vision we've sort of converged to these paradigms of scaling to massive networks, right? Like hundreds of billions of parameters, trillions of parameters. And there's been a lot gained in deep learning from that. But it seems like in the third branch of deep learning, deep RL, that has not yet been the case. I was very surprised when I was looking at the networks: why are we still using a simple two-layer MLP for these frontier, state-of-the-art RL algorithms?

I was very curious: can we design RL algorithms, can we put together a recipe for RL, that allows it to scale in ways analogous to how language and vision scale? We know that traditional RL, say value-based RL, doesn't really scale; this is pretty clear from the literature. So we tried a different approach called self-supervised RL, where instead of learning a value function, we learn representations of states, actions, and future states such that representations along the same trajectory are pushed together and representations from different trajectories are pushed apart. This is just a different approach to RL that allows us to learn in a self-supervised manner, so we can solve tasks and reach goals without any human-crafted reward signal. And we know that self-supervised learning is scalable in these other areas of deep learning, so can self-supervised RL scale in similar ways?
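
To make that objective concrete, here is a minimal sketch of an InfoNCE-style contrastive loss of the kind described above, assuming the (state, action) pairs and future states have already been passed through encoder networks. The function name and shapes are illustrative, not taken from the paper's code:

```python
import jax
import jax.numpy as jnp

def contrastive_critic_loss(sa_repr, future_repr):
    """InfoNCE-style loss: the (state, action) representation in row i should
    match the future-state representation in row i (same trajectory); every
    other row in the batch serves as a negative (different trajectory)."""
    logits = sa_repr @ future_repr.T            # (batch, batch) similarity scores
    labels = jnp.arange(sa_repr.shape[0])       # positives sit on the diagonal
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(log_probs[labels, labels])
```

Each row of the batch becomes a small classification problem: pick out the future state that actually came from the same trajectory.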

When we first tried it, it actually didn't work. We made the networks deeper and the performance totally degraded. But then, drawing on some other work in our literature, we tried residual connections, and there were a few other architectural components we had to put into the recipe. All of a sudden one day I ran this experiment, and there was this one environment where doubling the depth didn't really do anything, but doubling the depth again, with these different components, suddenly skyrocketed performance.

Getting this to work was very non-trivial, in the sense that usually when we think about hyperparameter optimization, we try changing A, see if it makes things better, try changing B, see whether it makes things better. Here, just making the depth bigger made it worse; just adding residual connections didn't make it better. It was really this combination of factors that Kevin and Ishaan figured out that made this work. As a precursor to that, we also tried scaling along different dimensions: scaling the batch size, scaling the width of the network, so the hidden layers.

That was pretty similar to just scaling depth naively. And then once we started introducing residual connections, layer normalization, these specific architectural choices, that's when we saw these significant jumps in performance, these critical depths at which performance multiplies by a pretty huge factor. That's where we really noticed unlocking some significant performance gains, as opposed to scaling just along width, which did yield some performance improvements.
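
As a rough illustration of the architectural recipe mentioned here, below is a minimal pre-norm residual MLP block in Flax. The exact block structure, activation, and layer sizes in the paper may differ, so treat this as a sketch of the idea rather than the authors' implementation:

```python
import flax.linen as nn

class ResidualBlock(nn.Module):
    """Pre-norm residual block: x + MLP(LayerNorm(x)).

    Residual connections plus layer normalization are the two ingredients
    described as necessary for depth to help rather than hurt."""
    hidden_dim: int = 256

    @nn.compact
    def __call__(self, x):
        h = nn.LayerNorm()(x)
        h = nn.Dense(self.hidden_dim)(h)
        h = nn.relu(h)
        h = nn.Dense(x.shape[-1])(h)
        return x + h

class DeepEncoder(nn.Module):
    """Stack many residual blocks to build a very deep encoder."""
    num_blocks: int = 64
    hidden_dim: int = 256

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.hidden_dim)(x)
        for _ in range(self.num_blocks):
            x = ResidualBlock(self.hidden_dim)(x)
        return x
```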

When you look at the number of parameters your network has as you grow width, it grows roughly quadratically, as opposed to roughly linearly when you grow depth. So depth is in some sense more parameter efficient, and also more sample efficient in the experiments that we conducted.

Nice. In some ways you're sort of replicating stuff that is seen in the wild, but on a very small model that you can study. Would you say that's fair?

We saw these huge performance improvements in language models and image generation models by making them larger, making them deeper, which seems very intuitive. And so in our work we draw from foundational research, right? Like residual networks, which employ residual connections to avoid vanishing gradients. That's something we show in some of our ablations in the paper, further down, probably in the appendices: we did experiments without these residual connections. So we're borrowing concepts that have existed in other fields, applying them to this setting in RL, and showing that it works.

Before Ben has to go, I'll leave the last word to him. What additional work does this inspire that you want to push on next?

I think there's one thing I'd clarify about the paper, and then I'll directly answer the question. The thing I might clarify is that I think a lot of people reading the title are like, wow, big networks, they're great, I'll take big networks, you've solved it, we can just go: add them to PPO, add them to SAC, add them to your favorite reinforcement learning algorithm. But I think that's actually not the main conclusion. I think the main conclusion is that using big networks not only requires these architectural tricks, but also, as Kevin mentioned before, it requires using a different objective. This objective doesn't actually use rewards in it.

There's another word in the title, reinforcement learning, that also might be a little bit of a misnomer, because we aren't directly trying to maximize rewards. Our code doesn't have a line of code saying maximize rewards here. So is this, at the end of the day, a reinforcement learning method? I don't know. It looks much more similar to the self-supervised methods in other areas of machine learning. And so I think the work really stands at an interesting intersection of reinforcement learning and self-supervised learning research. We had this little figure on the bottom left of the poster, which was a screenshot of a slide from Yann LeCun talking about how to build intelligent systems and whether that's going to be done by unsupervised learning or supervised learning or reinforcement learning. And I think what our paper really suggests is that the boundary between these things is really blurry, and maybe the keys to building intelligent systems are going to come from leveraging insights from all of them.

Yeah, the layer cake. Well, thank you for your time. I know you have to go soon.

Thank you so much for coming. I think that insight about blurring things is interesting. I don't know if you were talking about the abstraction layer of representation learning, and whether that triggers anything in terms of the mix between self-supervised and reinforcement learning. Is that something fundamental that you've discovered, or that people don't understand when they read the paper?

I think the best way I would explain it is that we know standard RL is not super scalable. So why can this different approach, this different objective for RL, be scalable? I think it's because we're fundamentally shifting the burden of learning from something like Q-learning, regressing to TD errors, which we know is quite spurious and noisy and biased, to fundamentally a classification problem. We're trying to classify whether a future state is along the same trajectory or a different trajectory, and we do this with representation learning. We know that classification with a cross-entropy loss and representation learning are scalable in the deep learning literature, if we think about language and some of the objectives there. So in some sense we're blurring the lines: we're doing reinforcement learning, it's still an actor-critic, goal-conditioned reinforcement learning algorithm, but the objective, the burden of solving that RL task, shifts to something more similar to the objectives you might see in language and vision, which we know have scaled so much. And so I think that's one of the fundamental insights: by approaching RL in this different way, we were able to get so much more out of it, and we were able to scale our networks significantly beyond what is standard in RL.

Can I jump in? I will just give a bit more context about the architecture, because we use another objective, the InfoNCE, so the contrastive loss. However, the architecture is quite similar to previous works, previous papers like BRO, or SimBa, SimBa 1 and SimBa 2. So we also tweaked this architecture a bit; it's not that we invented the wheel for the first time. It's the merging between the architecture and the objective that makes the scale really go up, and performance follows the scale. I think that's something we should probably mine deeper.

I guess, what domains, what industries? You've applied it on multiple different types of networks or datasets; is there a particular affinity that you think is sort of low-hanging fruit?

Actually, if you look at a lot of our tasks, they're particularly robotics-type tasks. So personally I'd be very curious about how a work like this could impact the robotics field. My understanding of robotics is that right now there are a few different approaches. One approach is to train robots using imitation learning: we try to collect an insane amount of data, with a ton of human supervision, and we scale up this data and learn with imitation learning. But on the other hand, perhaps there's another approach, for example goal-conditioned reinforcement learning, where we can actually train robotic agents, train RL agents, to solve meaningful tasks with absolutely no human supervision, no demonstrations; it's much more scalable. So this could serve as an alternate approach. Perhaps instead of scaling data and scaling manual human supervision, which is not super scalable, there are ways to make goal-conditioned reinforcement learning scalable, where we can just scale the architecture, or scale with these different objectives.

Right, with these different objectives. I think that could be very exciting, and to see how that can affect a field like robotics, for example.

Double-click on just one thing on the efficiency, which you were talking about. I would expect that the deeper it is, this should be quadratically worse. I'm not familiar with the pre-existing literature, I'm just working out intuitions. But basically, what are the trade-offs that you've found that you might want to warn people about? Because you were the one who mentioned efficiency.

Yeah. So I was referring to one of the figures on our poster, also in our paper, where we compare the number of parameters that models have as we scale along the axis of depth and as we scale along the axis of width, from our baseline architecture. The most baseline one would be a width of 256, so the hidden layers have 256 neurons, and a depth of four hidden layers. The point I was making is that when you scale along depth, the number of parameters your model has grows roughly linearly. Whereas with width, you're making your network outputs wider, and the input to the next layer grows as well, so the number of parameters grows approximately quadratically. One of the experiments we did was to examine, as we grow the number of parameters by scaling along these two different choices, which one yields better performance for approximately the same number of parameters. The depth curve jumps up pretty fast; that's present throughout our paper. For width, it grows a little more slowly. So the takeaway is that if you are a bit more resource constrained, scaling along depth might be better, because you get more out of a smaller number of learnable parameters; width is expensive. Exactly. And in general, of course, more parameters is also going to be more expensive. So that's just another consideration to think about when using these networks, I suppose.
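
As a back-of-the-envelope illustration of that trade-off (treating the network as a plain stack of hidden layers and ignoring biases and the input/output layers, which is an approximation rather than the paper's exact accounting):

```python
def approx_mlp_params(depth, width):
    """Rough parameter count for an MLP with `depth` hidden layers of
    `width` units each (ignoring input/output layers and bias terms)."""
    return depth * width * width

base   = approx_mlp_params(depth=4, width=256)   # ~0.26M parameters
deeper = approx_mlp_params(depth=8, width=256)   # ~0.52M  (2x depth -> ~2x params)
wider  = approx_mlp_params(depth=4, width=512)   # ~1.05M  (2x width -> ~4x params)
```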

Any other rules of thumb like that that I can extract? That's just the most basic one I could think of.

To your original question about the trade-offs: one of the limitations is that obviously if you make the networks bigger, it will take longer to run, right? So if you double the depth, at some level of depth it might take roughly twice as long to make a forward pass through the network. However, within our paper, for most environments we are able to saturate, to get to almost perfect performance, without even needing a thousand layers; maybe just 64 layers, for example, is sufficient. And in this regime the latency of the network is not necessarily a significant bottleneck. You can imagine there are a lot of tasks, especially in RL, where collecting data might be the bottleneck, and making forward passes through the network may not be. In our research we specifically used the JaxGCRL environments, which are JAX-based, GPU-accelerated environments, so we can collect thousands of environment trajectories in parallel at the same time. Oh, this is built in, right? This is built in, so we can collect, you know, a thousand trajectories at the same time across all these environments, and that makes sure we have enough data to saturate the learning. Wow, okay.
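
A toy sketch of why a JAX-based, GPU-accelerated environment changes the data economics: if the environment step is a pure function, it can be vectorized over thousands of environments with `vmap` and compiled with `jit`. The `toy_step` environment below is made up purely for illustration; JaxGCRL provides real goal-conditioned environments with this flavor.

```python
import jax
import jax.numpy as jnp

def toy_step(state, action):
    """Illustrative pure-function environment step (not a real JaxGCRL env)."""
    next_state = state + 0.1 * action
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward

# vmap over thousands of environments, then jit-compile the batched step.
batched_step = jax.jit(jax.vmap(toy_step))

num_envs, obs_dim = 4096, 8
states  = jnp.zeros((num_envs, obs_dim))
actions = jnp.ones((num_envs, obs_dim))
next_states, rewards = batched_step(states, actions)  # one call steps all 4096 envs
```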

I don't know if you want to expound upon that, on the infrastructure maybe. Most people are familiar with PyTorch, less familiar with JAX. I think JAX is getting traction especially in the RL field, because for online reinforcement learning, getting as much data as you can is the most important thing. There's got to be a PyTorch equivalent. But anyway, any tips for other people also exploring this kind of rollout?

I can also recommend, for goal-conditioned RL, JaxGCRL, but there are also multi-agent JAX implementations and others. So going back to our paper, if you look at the plots, we only see this huge performance increase when we cross something like 50 million transitions. So I think the data is crucial here.

I guess to build on that, I like drawing analogies to successes in other areas of deep learning. For example, in large language models, the reason we're able to scale to such large networks is that we found a paradigm in which we can leverage internet-scale data to learn from. Data in RL has traditionally been hard to come by, but now with these GPU-accelerated environments we can collect hundreds of millions of time steps of data within just a few hours. So I think this serves as a really good test bed for us to also find ways to scale up network capacity and get similar kinds of gains.

Are you saying that you would do pre-training differently in LLMs? Like, what's the different objective now?

Very simply, the paradigm that you're referencing is next word or next token prediction, right? It's very robust.

How do you change that?

Oh, I'm not saying that. I want to leverage insights from that to apply to RL. I feel like you should go the other way.

You think it should go the other way? Maybe. I mean, that would be a very interesting research direction, too. But actually, even on that point, one of the things I was thinking about is that the way our objective works is, in some sense, not exactly next word prediction, but kind of like next state prediction, right? You imagine you're at some current state with some current action, and we want to predict whether or not a certain state is a future state along the same trajectory or a different trajectory. So in some sense we are actually doing some sort of implicit world model; I don't know if that's a bad word these days. In language you do a cross-entropy loss to classify the next token, and here we're just doing a binary classification of whether or not some state is a future state. Yeah, it's a classification. And so I do see some parallels here that perhaps we should dig into deeper, to see what is at the core of what enables deep learning to scale, and how we can distill those insights and apply them across all these different fields, whether it's language or reinforcement learning.

Did you get my meaning about the world model stuff?

Actually, I think I might have heard Professor Eysenbach yesterday talking about this at a poster. He was explaining to a couple of people that because this is doing representation learning, trying to learn these meaningful representations for a given state and action and for a given goal, in some sense you can think of it almost like learning a model of the environment, learning a model of the world, but without having to do any sort of next-frame prediction or anything like that, which is a bit more high-dimensional and complex.

I think the angle that I'm trying to think about and push is that instead of learning the next world state, you basically generate a number of candidate possible worlds and classify them, to your point, which is exactly how I do things. Let's say I'm playing poker and I'm trying to classify what hand you have. Well, there's a range of hands based on what you're doing. And the more information I get, the more I resolve to: oh, I know exactly what hand you have based on what you're showing, or you're bluffing, but that's a different thing. You know what I mean? I feel like that is the ultimate end goal of representation, which is a world. But I don't know if that is too vague compared to the more concrete types of world models that, let's say, the video generation people are doing.

I guess one other thing I'm also exploring: you mentioned the deep models being slower or more expensive. There is a trend in the inference world of making models shallower, right? And I wonder if this short catchphrase I was thinking about, deep teacher, shallow student, would be a good deployment paradigm: you push the frontier capabilities with depth and then you distill it back. Actually, this is a good point. If you go to our website, this is one of the future directions that we list at the very bottom. We would love to see if we could get similar performance with smaller models. We do achieve state-of-the-art performance on goal-conditioned RL in JaxGCRL by a significant amount, and it was very exciting to see the frontier of the ability to train RL agents pushed. If we can do that in a way that is also just as efficient as standard networks, that would be very cool, because training doesn't have to be the same thing that you deploy at inference, right? You know what I mean? So, yeah, if there are ways to distill down to a smaller model, or prune the model, and still retain performance, that's a very interesting research direction. What else are your personal passions?

So, currently I'm pursuing the direction of stitching in reinforcement learning. We are trying to generalize reinforcement learning from shorter sub-behaviors so that they are stitched, merged, at test time. I think this is one of the last papers that I will tackle during the PhD. Personally, I'm very curious about advancing the frontier as much as possible. If you actually look at our paper, we focus on scaling depth, but we notice that scaling width actually also improves performance, and we also find that by scaling depth we unlock the ability to scale along batch size as well.

So this is one of the things. I guess for context, in traditional RL, like value-based RL, scaling batch size is not super effective. But there's also other work in other areas of deep learning showing that scaling batch size is only most effective when there's a large enough network capacity to take advantage of the scaled batch size. So one hypothesis might be that the reason scaling batch size isn't that effective in traditional RL is that we've been using these tiny networks that haven't been able to take advantage of it. And one of our experiments showed that, because we enabled successful training of deep networks, this was a great test bed for testing that hypothesis, and we find that indeed, as we scale the network capacity, we also unlock this different dimension of scaling, batch size. So all that to say, I'm very curious for someone with enough compute to take some of these environments, scale up depth to the maximum capability, also scale along width, also scale along batch size, and basically, in the same way that in language we're scaling along so many different axes, see whether we can unlock different dimensions of scaling as well, what capabilities emerge, and how far we can push the frontier of training these RL agents. Before we pass to Ishaan: when you say enough compute, what kind of compute budget did you have? I just want to see what you guys got. Good question. We wanted to make it quite accessible, so actually the nice thing is that all of our experiments, even the thousand-layer networks, can be run on one single 80-gigabyte H100 GPU.

So that's accessible in dollar terms. Right, right. So everything can be run on one GPU. But in theory, if we had a distributed training setup and could just blast compute through this and really wanted to push the frontier, it'd be very interesting to see how things go.

Cool. And I've actively been trying to learn as much as I can about vision language action models and world models at NeurIPS, going to a lot of vision language action sessions. And yeah, I'm curious about applications of representation learning for these. Yeah, exactly, for robotics. I'm actively trying to explore more in that area, so just reading a lot of literature, talking to as many people as I can. Yeah, we just released our episode with General Intuition, where, if you know a bit about their history, they started as a gaming clipping company, and they basically have a vision language action model which I saw a preview of. It was very impressive. I'm not sure exactly how transferable it is to embodied use cases, but it doesn't have to be; screen is fine, you know. I don't know if you have any takes on that.

It's an exciting research direction, definitely. I think the concept of actions as something that you are outputting is actually not that popular in industry, right? Only because text has completely dominated the last three years, and tool calling, which is just another form of structured text. And I feel like with action research, I don't know what needs to happen in order to unlock the next phase there. If you've seen anything interesting out here, shout it out.

There's a lot of cool work on leveraging pre-trained VLMs: you freeze them, and then on top of that you apply some sort of expert to output actions. Also systems for hierarchical planning, maybe outputting some higher-level plan; this is a larger network that takes a little longer to do inference, so it outputs its plans less frequently, in some sort of chunk, and then from there there's a second system that operates a bit faster. I think there's quite a bit of interesting research in that direction. So that's sort of what I'm looking forward to.

Cool. Final question. Hardest question you were asked at the poster session, or just a favorite encounter, anyone famous that you met?

I actually haven't gotten a chance to go to the conference that much. I'm actually working full-time now. So far, I literally just got my badge a few moments before the session, so I guess I wouldn't be the best to answer that question. No, no, no, like people ask you stuff, right? People asking you, or meeting you; just give a vibe of what people are saying. Yeah, I think people thought it was a very eye-opening paper, because the objective is quite simple, quite elegant, and we were able to, I don't want to say overturn, but sort of challenge the conventional wisdom that RL is not super scalable, push it to such limits, like a thousand layers deep, and see performance continue to improve. The general impression I've gotten is that this could be really cool if we can build along this direction, really scale along all these different dimensions, and push the frontier of the ability of RL. I'm very curious to see how that goes.

All right. Well, thank you so much for dropping by. Congrats on the paper again. And, uh, good luck in your future work. Thank you. Thanks for having us. Yeah. Thanks.
