
Author: Daniel Ho
Date: October 2023
This summary is for investors and builders tracking the bleeding edge of AI and robotics. It unpacks how 1X is using world models to enable humanoid robots to learn complex tasks from human video, accelerating deployment and self-improvement.
Daniel Ho, Head of Eval at 1X, joins RoboPapers to discuss the company's groundbreaking work with world models. 1X is building humanoid robots that learn from the vast ocean of human video data, moving beyond traditional, task-specific robot training to a future of generalized, adaptable, and self-improving machines.
"The high level takeaway is that it really allows you to zero shot to new tasks because of the generalizability of video pre-training."
"If we can predict video correctly then we can execute it and that kind of cause and effect can become clear."
"The real world budget is the most limited and it's going to be the lowest throughput and so we need some higher throughput signals to be able to create the champion models that we go and evaluate every day."
Podcast Link: Click here to listen

Hello everyone. Welcome to another episode of RoboPapers. Super happy today to finally get Daniel Ho on the pod. Daniel is the Head of Eval at 1X. He's got lots of things to say about world models today. So thank you so much for making the time, Daniel.
Maybe as a start if you can share a little bit about yourself before we dive in.
Yeah, absolutely. I've been at 1X for around two years now, doing a lot of work on policies, world models, and evaluations. On the world model side, we've shared two releases so far: how to use world models as policies, so they can take actions in the real world using our robot Neo, and on the eval side, how we can evaluate policies using the world model as kind of a learned simulator.
Prior to 1X, I also worked at Google, I worked at Waymo, I worked on some research there, and yeah, I did my undergrad at Berkeley.
Very cool. So yeah, I think you guys just had a big drop a couple of days ago, where you basically learn directly from world models to do things with Neo. Previously, a couple of months back, you guys had another blog post about the eval. So maybe it'd be great if you can talk us through the journey and what you can share about those.
Yeah, I think it probably makes sense to just dive right into the latest results and see what the current state is. I think it's very visual when it comes to thinking about what world models can do.
So maybe as a first topic, what is different between a world model policy and another kind of robot policy, like a VLA policy?
I think the main difference is that a world model is trained with video prediction as the main objective, which conveys forward dynamics, and then uses that to predict the actions the robot should take in order to solve a task.
Whereas with a VLA, rather than learning directly on video or predicting what happens next in a video, you would instead regress what happens in the action sequence. You have a sequence of data from video frames and robot actions, and you predict: given that I see this current observation, this current robot image, I should take this trajectory of actions. So it's a more direct regression problem that a VLA learns.
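To make that contrast concrete, here is a minimal sketch of the two training objectives in PyTorch. The module names and loss choices are hypothetical placeholders, not 1X's implementation.

```python
import torch.nn.functional as F

# Minimal, hypothetical sketch of the two objectives; placeholder names only.

def vla_step(vla, obs_image, language, action_chunk):
    # VLA: directly regress the action trajectory from the current observation.
    pred_actions = vla(obs_image, language)            # shape: (T, action_dim)
    return F.mse_loss(pred_actions, action_chunk)

def world_model_step(world_model, frames, language):
    # World model: next-frame prediction on video; actions are recovered later
    # by a separate inverse dynamics model (IDM).
    pred_frames = world_model(frames[:-1], language)   # predict frame t+1 from frame t
    return F.mse_loss(pred_frames, frames[1:])
```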
I think in this blog post we share that there are a lot of advantages to learning in the world model way. The high level takeaway is that it really allows you to zero shot to new tasks because of the generalizability of video pre-training, and because this next-token, next-frame prediction problem meshes very well with reasoning through the huge amount of human video that already exists.
We can then transfer all of that knowledge into this policy. Now if you ask it to do a new task, even though we've never seen robot data of that task, it's able to understand the task, because our robot has two hands, the robot kind of looks like a human, and we can extrapolate a little bit from what we've seen before.
As a next step, let's look at what kind of tasks work today, at some of the actual tasks a robot policy can solve. There are some direct grasping tasks where, on the left side here, it's the world model generation, and on the right side, it's the actual rollout of that generation in the real world. From the starting frame, which is whatever the robot sees, we first imagine what happens next in the scene if I were to grab the bag of chips. From that I get a sequence of actions that corresponds to those video frames, and then I'm able to execute it, and that's what you see.
To us it's actually very exciting to see how similar these side-by-side comparisons can sometimes get, meaning there's something here: if we can predict video correctly, then we can execute it, and that kind of cause and effect becomes clear. So the left side here is the generated video, the right side is the real rollout, and this is all zero shot, just to be clear.
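As a rough sketch of the rollout just described (start frame plus a text prompt, generate a video, infer the in-between actions with an inverse dynamics model, then execute), assuming hypothetical `world_model`, `idm`, and `robot` interfaces rather than 1X's actual API:

```python
# Hypothetical zero-shot rollout loop; interfaces are illustrative only.

def world_model_rollout(world_model, idm, robot, prompt):
    start_frame = robot.get_camera_frame()        # whatever the robot currently sees
    # 1. Imagine what should happen next in the scene, conditioned on the prompt.
    imagined = world_model.generate(start_frame=start_frame, text=prompt)
    # 2. Infer the actions that bridge each pair of consecutive imagined frames.
    actions = [idm(f_t, f_next) for f_t, f_next in zip(imagined, imagined[1:])]
    # 3. Execute the inferred action sequence on the real robot.
    for action in actions:
        robot.execute(action)
    return imagined, actions
```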
Very cool.
Yeah, you can imagine that some of these tasks might be a bit less zero shot, like grabbing a salt shaker or grabbing a bag of chips, but all of these new behaviors here are completely novel. There's nothing even remotely close to these kinds of behaviors in the data set. We have never done wiping. We've never done this kind of badging into the office. These are all just random things we've tried.
If you look at our data set, it's primarily a pick and place data set. So really the question is: given a focus on pick and place, which is where you would start with robot data, can we really analyze how well we can zero shot? We're using the robot data primarily as a shim to teach and post-train a model on robot morphology and robot kinematics, like what are the constraints of the range of motion I have, for example how far can I reach my hand out before it becomes unrealistic.
What we're trying to convey is that once the model understands that, the type of robot data you train on doesn't really matter. All that matters is that you have some data of the robot doing things, which may or may not intersect with the actual tasks you want to solve.
So sorry, Danny, just to be clear, on the left, for the world model, when it's basically dreaming up this video, its prompt is, I imagine, just one image plus some text, where the text says, okay, pull up the chair. Is that what's happening? And then it dreams up the video along with some action labels that go with that video.
Exactly. If you hover over some of these bars on our plot, you can see the conditioning that goes into the policy. For steaming the shirt, the prompt we have is: right hand is holding the steamer, right hand brings the steamer up to the shirt, right hand steams the shirt in an up and down motion. If we were to find the equivalent steaming behavior, it might be something like what gets output through the video generation, and then the separate IDM process is what we use today to infer the actions that take you from one frame to the other, the transitions between the frames.
Okay. So literally, in this left-right scenario, you show the first image, and the left is what the world model imagines, and on the right is the real rollout, where you just apply the action labels, sorry, the action instructions suggested by the world model, and this is what actually happens, and it's so close. Okay, got it.
Exactly. So the action sequence, or sorry, the prompt sequence, as you might have noticed, can be pretty long, right? "Grab the bag of chips" becomes "right hand picks up the bag of chips, left hand remains still on sliding door." It could be really long, like "right hand reaches out and grabs the bottom of the metal handle on the right, right hand slides the handle across to the left, closing this sliding glass door."
Some of these prompts are pretty specific, because today the model won't always figure out on its own exactly what should happen in a scene. In the future, I think there's a lot of work to do to handle both upsampled prompts like this, where we've conveyed very granular instruction, and simple prompts like "close the sliding door."
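For illustration, the two conditioning styles being contrasted might look like this; the granular prompt below is paraphrased from the examples in this conversation, and the exact phrasing is an assumption:

```python
# Simple vs. upsampled (granular) prompt for the same task.
simple_prompt = "Close the sliding door."

upsampled_prompt = (
    "Right hand reaches out and grabs the bottom of the metal handle on the right. "
    "Right hand slides the handle across to the left, closing the sliding glass door. "
    "Left hand remains still."
)
```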
In the future there's probably going to be some kind of interpretability layer between a VLM and some kind of planner like this world model, where the VLM needs to output text in a manner that is interpretable and understandable by the world model, so it can say, oh, I understand this kind of language.
Maybe you could talk a little bit about, I think in your blog post you mentioned you guys start with basically internet-scale video, then you mid-train with some egocentric data, and then finally some robot data. Could you talk us through that? And I think somewhere in there you have some captioning that you do in addition.
Yeah, so the process we're sharing is that from the web-scale pre-training, we first mid-train on egocentric data, say around 900 hours for example, and then we fine-tune on a smaller set of robot data, say 70 hours. Again, this robot data may not be the kind of action or task that corresponds to the ones we try out in the blog post, but the mid-training may include very general home-style tasks that better correspond to some of these other chores.
We didn't explicitly go out and collect egocentric data of every single task that we've done. This is more or less unbiased egocentric data. As for why we do this, we show with some ablations how the model improves from being mid-trained on egocentric data.
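Schematically, the staging described above might be expressed like this; the hours are the ones quoted here, and everything else in the sketch is a hypothetical placeholder:

```python
# Hypothetical sketch of the three-stage recipe described above.
TRAINING_STAGES = [
    {"name": "web_pretraining",        "data": "internet-scale video",            "hours": None},
    {"name": "egocentric_midtraining", "data": "unbiased egocentric human video", "hours": 900},
    {"name": "robot_finetuning",       "data": "Neo logs, mostly pick-and-place", "hours": 70},
]

def train(world_model, load_dataset, stages=TRAINING_STAGES):
    # load_dataset is a placeholder data loader supplied by the caller.
    for stage in stages:
        dataset = load_dataset(stage["data"], hours=stage["hours"])
        world_model.fit(dataset)   # same next-frame prediction objective at every stage
    return world_model
```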
So here we compare doing no egocentric mid-training versus doing egocentric mid-training, and we're able to measure a delta in performance. First we can measure this on video quality, as judged by human labelers who label whether the video looks realistic or not, noticing problems with physical plausibility, objects disappearing, or random artifacting: if it looks bad, reject it; if it looks good, accept it.
We see, for example, on generalization tasks there's improvement when you add egocentric training, and interestingly, on in-distribution tasks you actually lose a bit of quality. We think that makes sense, because the egocentric data is really there to diversify the task list. You're trying to broaden the distribution, so maybe you dilute a little bit of the specific in-distribution overfitting in order to generalize.
In the real world we also run actual ablation studies, where we count the number of successes across each of these ablations. For egocentric, for example, you see pretty significant improvement across most of the tasks, in this case I think all of the tasks, and for some of them the improvement is substantial.
For example, for scrubbing a dish, we found that egocentric was actually pretty important. Without including that egocentric data, you could really easily overfit to the Neo data. This is the same very granular prompt, saying scrub the dish and doing specifically that: with the right hand grab this point, scrub the dish. The model can overfit to the robot data if all it sees is pick and place, but by including the mid-training stage you're able to broaden the distribution like we hoped. You can see that the world model generations after the mid-training stage are now able to not overfit to grasping but actually solve the task, and that was really good to see.
Pretty much all the failures on scrub the dish for the prior, no-ego baselines come from the model just not understanding the task correctly. So that was a level of control that we got from egocentric. And I think the other thing you mentioned, caption upsampling, was also true: if you look at the ablation results you see, oh, caption upsampling seems like it generally helps.
What that means is, if you think about language models or VLMs now, it's pretty common to train not just on simple commands like "open a door" but to be really granular, like second by second, giving a play-by-play of what's happening in the scene, and that also gives you a more granular level of control over the model.
Did you need to do anything for your egocentric data, like put a version of Neo's cameras on the glasses, or does anything work? How do you think about that kind of stuff?
Yeah, that's a good question. For this experiment, we did not do any of that. Like I said before, this is completely unbiased: we didn't collect this egocentric data in our office. It's more like egocentric data you would acquire from third-party providers or from the web, rather than us saying, oh, we're going to collect egocentric scrubbing data in order to target egocentric scrubbing. So you don't even need to specially collect it yourself. This is not collected by us; it's collected by third parties where we didn't really provide an informed opinion about what they should be doing. You can imagine it's a lot of general home tasks or factory tasks or other places where people have collected egocentric data before.
I'm actually curious, and you would know this: I imagine the very base model, which is trained on sort of all internet video, right, wouldn't there also be some egocentric video in the mix? What do you think this step of mid-training on egocentric adds? Just to highlight, I imagine there has to be some egocentric view in there already.
That's a good question. I think there definitely is a lot of that in the data mix. The problem is that it doesn't make up a big percentage of it. If you take an unbiased set of data from the internet, probably not a lot is egocentric, and even less of that is egocentric footage of doing useful tasks. So this is a bridge, right, a bridge between the pre-training and the post-training, where we know at post-training time it's going to be fully first-person robot views and very specific home chores.
Can we get to a middle point where we can train on data which is still of humans and still very in-distribution for the pre-training? Seeing egocentric data is, like you said before, not strange or high-perplexity to the model, but it is able to understand: oh, it seems like we want to go down this egocentric route, I should focus on understanding these types of motions, so that when we get to the robot-style egocentric it doesn't come as a shock and overfit the model to that change. That's what we think; we see it in the ablations, but we're also just speculating here, to be clear.
Got it. Do you mind going back to the success rate chart, the last one, the scrub dish one?
Yeah. So I'm curious what you think about the absolute success rate, especially for scrub dish. Clearly without egocentric it's basically useless. Once you add the ego data it works, but it's also only 20%, you know.
Yeah. Do you think it's just a matter of having a lot more egocentric data, so hopefully you have more scrubbing in those 900 hours? Or do you think it's still on the robot data side, because you were saying the 70 hours is basically pick and place, so maybe you have to expand that as well? What's going to move the needle for this?
I think the takeaway I have here is that the work is still early and there are so many ways to improve from this. Given that we didn't try very hard to fit to the scrub dish task, you could, like you said, be more targeted and collect that data. But it wasn't the intention of the work to show results like that, because the study here is really, at a high level, should we be thinking about world model backbones as a more scalable class of architecture than VLAs? If we had gone and done that, it would be the VLA style of thinking rather than the world model style of thinking. Of course we could fit results like that and make them very close, if you put a data engine to it, right, with hundreds or thousands of hours of that task.
But it doesn't really convey the same kind of result. We're trying to see the generalization performance, because we're imagining that even if we have a good sense of the tasks we want to solve in homes, the diversity in general is just going to be so high. The task of scrubbing the dish is going to be different because everybody has different dishes and different sponges and different homes. So it's a proxy for the general, you know, transfer problem across tasks.
If you look at the actual task itself, the scrub dish task is pretty hard, which is why I think the success rate score was pretty low. It's both grabbing the sponge and scrubbing at the same time. So I have to grab the sponge, then I have to scrub it and make contact with the dish, and that's actually a pretty contact-rich task, right?

And you just have pick and place robot data, too. So this is a pretty hard transfer problem, right? Is that fair?
Okay. Yeah. But I think, also to your point, another way we're excited about hill climbing this performance, which is enabled by having a world-model-style backbone and wasn't possible before, is the ability to do a post-training process that follows LLMs. For example, rather than collecting teleop data of this task like you would do for a VLA, I could instead just roll this out, right? I get 20% success.
So, you know, I don't succeed all the time, but I'm able to safely roll out these videos, try a bunch, get a bunch of failures, get a bunch of successes, and I can train on the autonomy data. The video sequences of those successes and failures can be used during world model training, too. That would help me align the video model directly to these tasks and hill climb that 20% higher and higher, as well as allow me to do things like training value functions, or, like we've seen in some other works, take the autonomy data and label it with more information.
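A minimal sketch of that self-improvement loop, reusing the hypothetical `world_model_rollout` interface from the earlier sketch and assuming success labels come from a human or a learned classifier (everything here is an illustrative assumption):

```python
# Hypothetical autonomy-data flywheel: roll out, label, retrain.

def collect_and_post_train(world_model, idm, robot, prompt,
                           train_value_function, num_rollouts=20):
    successes, failures = [], []
    for _ in range(num_rollouts):
        world_model_rollout(world_model, idm, robot, prompt)
        episode_video = robot.get_episode_video()
        if robot.task_succeeded():          # human or learned success label
            successes.append(episode_video)
        else:
            failures.append(episode_video)
    # Align the world model to this task by training on its own successful
    # rollouts; the failures can supervise a value function instead.
    world_model.fit(successes)
    value_fn = train_value_function(successes, failures)
    return world_model, value_fn
```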
I think in your blog you also... Oh, sorry. Sorry, Chris. You go. [laughter]

I was gonna ask how much of a difference you thought the pre-training data mix made. I could imagine a world where eventually you get, I don't know, petabytes of egocentric data; do you think you'd still need all that egocentric data, or is there still stuff that's hard to capture from it? What does the mix look like? Do you think you need a lot? Do you have lots of kitchen tasks in it?
I think, being humanoid-pilled, the answer is that we really want a recipe that learns very well from all human data, not only ego but exo, right? The bet here is also that 900 hours of ego data is, in the grand scheme of things, not a lot; it's not enough to generalize for the most part. So we're really not only learning from the ego here, we're learning a lot from exocentric, third-person-view data of people doing things too, and that's possible because the morphology of the robot is very similar to the morphology of a human.
I'd imagine transfer like this would become trickier, or at least less data-efficient, if we're doing non-human form factor robots. There, to your point, you may need to do a lot more of that style of collection, or other kinds of collection, where you design specific hardware to better fit the morphology of a custom robot: my robot has wrist cams and six-DoF hands that operate in this way with these kinds of grippers, so I want to make sure I'm collecting so-called egocentric data in that specific way. But that's a more manual bridge than just directly learning from the web-scale video of humans that already exists.
Yeah, actually, while you guys were talking, obviously this video is on loop, the world model one, right? I just want to point out two very interesting things I just saw and get Daniel's and Chris's thoughts on it. The first thing is that you notice when you drop it, it was in the first slot, and then magically it's in the second slot. It started distracting me partway through too. Sorry. [laughter] Yeah.
So it's doing something odd, but at the same time, when you try to put the second piece of bread in, it kind of knows there's something there; it doesn't drop straight away. It's as if it has memory, or some spatial intelligence, where it knows there's something in there so it won't drop all the way. Anyway, I just wanted to point out that it's silly but at the same time it's also impressive.
Yeah. But also it got it right, right? The motions the robot did were correct, because it actually works on the real side, even though the bagel teleported a little bit. Yeah. It takes a little give; I think the bagel, if you wiggle it, just falls in, depending on whether it falls into one slot or the other. There's a lot of high uncertainty there, right? A lot of the minor contact details would change where it falls in.
I think that's something world models today don't have: that precise level of control where you can say, I want to move a couple of millimeters forward so that it goes into this one slot rather than the other. That's something we would appreciate training a lot more robot data for, especially if the robot data includes things like tactile or higher-fidelity sensors, so that you could hill climb this more granular level of manipulation.
So do you think this problem, the really granular differences in robot video generation, is something that will come out with scale, basically? That it's not a huge issue and with enough data it'll just be solved?
Yeah, I think that's like the simple way to look at it, right? Of course, under the hood, a lot of people had to do a lot of very hard things to make that work. Like, you know, build really good robots with a lot of good sensors and, you know, build all the infra to make that happen. But, um, those are kind of like known problems in a way, right? Things that we could hill climb even though they are hard.
The idea would be that with this kind of policy, you can ask the robot to do things. It may fail, right, it may succeed, but that data, especially combined with a robot with high-dimensional sensing information, is going to let you hill climb the policy more and more. If you really scale this up, like if you deploy thousands of robots, then really quickly you'll collect way more data than we've ever trained on.
That's the ability to go from a world where post-training is the cherry on top to a world where it's really the focus, this more multimodal style of post-training, where rather than just adapting to robot morphology you're really adapting to the full sensing suite of the robot that you have. Yeah, that makes sense.
So on the sensing suite note, one thing I'm curious about: there's this older style of world models which takes state plus action and predicts the next state. That's not what you guys followed, right? You followed this newer wave of models which use an inverse dynamics model to estimate the actions. How do you think about that? Why make that change, and what are the trade-offs?
Yeah, we could talk about the state, action, next-state one; I think that's what we used for evals, and it's very clear and necessary for making a true model-based simulator. But in this case, if the idea is that I want to be able to specify very open-ended actions, doing so in text space is more natural. This is more like the ChatGPT style of doing robotics. We also built this very fun chat interface where everybody at the company can just type to the model and have it do things, and people have tried really random things. We actually show some of the videos from the random things people have tried.
It's like an interface, like a GPT for a robot. You connect to a robot, you see what the robot sees, you say pick up this thing or open the door, and then you see the video: oh, good video, do it, or bad video, don't do it. It's a very natural thing where people actually have a lot of interpretability into the process. If you talk about state-action models, you can't really have this kind of chatbot style: how am I going to specify this trajectory in 3D space?

Maybe on this one, could you comment a little bit about how the IDM part actually performed? Do you think it's more or less able to faithfully draw out the actual actions that seem to be suggested by the world model?
Yeah, here we deliberately broke the IDM out separately. The IDM is actually the much more stable thing to train. The reason is that you can train the IDM on every single robot log you've collected, no matter if it's teleoperated or autonomous, no matter if it's good or bad. The IDM basically says that from any starting frame, if you take any action, you get to some next frame; or, another way to say it, given any two frames with any quality of action in between, you can just draw the bridge between those two frames.
So it's a really robust way of training that's well specified and able to handle everything; it converges very well. It's a classical supervised learning problem with a very stable loss and very dense supervision you can train on. You can train on every pair of images across a video, take any pairwise combination, and slide a window across any sequence length, so there's no limit to it, unlike narrower ways of training like teleop, where I must focus on this one specific action and it has to be done exactly in this way to be correct and consistent.
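A minimal sketch of that IDM training setup, assuming frame pairs and the actions between them can be pulled from every robot log (all names here are hypothetical, not 1X's code):

```python
import torch.nn.functional as F

# Hypothetical IDM training: every pair of frames in every robot log is a
# training example, whether the episode was teleoperated or autonomous,
# successful or not.

def idm_step(idm, frame_t, frame_t_plus_k, actions_between):
    # Regress the action chunk that bridges the two frames.
    pred_actions = idm(frame_t, frame_t_plus_k)
    return F.mse_loss(pred_actions, actions_between)

def frame_pairs(video, actions, max_window=16):
    # Slide a window over the episode: any pairwise combination of frames works.
    for t in range(len(video) - 1):
        for k in range(1, min(max_window, len(video) - t)):
            yield video[t], video[t + k], actions[t:t + k]
```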
So we can trust the IDM; the hard part is really just generating the video. That's kind of what we found. In the future we could also merge the two; in a machine learning sense that seems very plausible.
How about in practice? I think in your blog post you mentioned something about actual inference, like how long it takes to actually run the imagination. Could you talk about that, and how it scales out? Obviously this is your first attempt at how to do it in practice.
Yeah, and that's a big reason to merge the IDM and the world model: it can make things faster, because then you don't need to run two models, you could just run one model, or run the model partially and only care about the action part of it. The current model today takes between 10 and 20 seconds to run end to end. If you use the best hardware, I think it's around 11 seconds to do inference, and it predicts about 5 seconds of video. So it's running synchronously, right; it's not running in async mode, it's not a closed-loop rollout where you take actions continuously with the robot.
I think if you could get the latency down, that becomes more and more feasible and more natural. But yeah, following the trends of optimization work in diffusion models, that's something we really want to show as a next step.
How about also allowing the world model to dream up more variations? Do you think that's actually helpful?
Yeah, that's a good question, and it's actually sometimes in direct conflict with the latency point you just mentioned. We showed some work where you can generate multiple versions and pick the best one, and humans are very good at picking the best video from among a series of videos that look realistic; we actually have a very good critic of videos in our brains. So if you allow humans to look at eight videos rather than one and just pick the best, we saw a pretty good boost on most tasks, I would say, because you can immediately rule out some of the bad generations, right?
Maybe today the model still needs to be preference-tuned or RLHF'd, right; it doesn't always do exactly as you say. Maybe in the future this will go away. There are two different ways to solve the problem: one is you weed out the bad ones by generating multiple times; the other is you train a preference function, so that given pairwise preferences across generations, you can post-train the model to only predict the ones that humans prefer. That's also something I think will be really interesting next work.
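A sketch of the best-of-N idea, with a critic (a human labeler today, perhaps a learned preference function later) choosing the winner; the interfaces are assumptions for illustration:

```python
# Hypothetical best-of-N generation: sample several imagined videos and keep
# the one the critic scores highest.

def best_of_n(world_model, critic, start_frame, prompt, n=8):
    candidates = [
        world_model.generate(start_frame=start_frame, text=prompt) for _ in range(n)
    ]
    scores = [critic(video, prompt) for video in candidates]   # higher is better
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
```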
Do you think this is the kind of thing you could also just scale up, like if you generated 64 you keep getting better? Is this like test-time compute for robots, or, I don't know.
Yeah, I think for some tasks the generations are pretty good all the time, right? Especially grasping, since we train on grasping. But if it's a really hard task, you kind of have to be more selective. So there are some interesting conclusions there: some tasks do require a lot more thinking to get right, which you could solve with test-time compute. You could also solve it with RLHF at the planning stage, especially if you combine it with a value function; then you can really automate this process, spin up multiple servers, and do things in parallel. It will probably be a combination of all of these, right? So if you really try a novel task, maybe it takes the model a long time because it's a bit slower and thinking through a lot of it, like in the LLM sense, but maybe it's able to do it right, and that's a new kind of avenue you can explore with these kinds of robot policies that wasn't possible before.
Do you think it will take any clever methods or changes to make sure that you get enough diversity of options for that kind of thing to work? Because I could imagine it collapsing pretty easily to just a couple.
Yeah. With diffusion and the way we've done our training, we saw a good amount of diversity already. The collapsing mostly came from in-distribution tasks, right, where you could argue it's already good, or you could argue it's overfit, and therefore it's always doing the same thing. Whereas with rarer tasks, you actually see that the model is just naturally more uncertain, because it hasn't really seen this before in the robot data set; it's like, oh, there are many ways to do it. Usually it doesn't collapse, but I think it's future work to measure exactly how the trends line up there.
Daniel, how do you think about the fact that right now you basically let it dream up 5 seconds of video and then do 5 seconds of action, or something like that, right? In the real world there will be tasks where maybe within those first 5 seconds, say two seconds in, something changes. So do you think this is just a matter of engineering, where you layer on a system to make it more dynamic?
Yeah, I think that's a good question. There are many ways to solve this and it's still open research. I would say that in synchronous mode you're always going to have this problem where you can't really handle dynamic tasks or changing environments. If you run it asynchronously, where you're constantly replanning and aware of change, then you could solve it. There are also maybe ways to do system one / system two, which we've also seen, where you can have something more reactive.
Yeah, maybe this is a good segue to the other, earlier paper where you do eval. Yeah, absolutely. How do you know whether this world model version is better than any other one? Or how do you even know a world model is better than a VLA?
Yeah, right, you could compare anything. So maybe as a bridge toward the point you mentioned earlier, there's a different way to formulate the world model problem. In the current work we're talking about, it was going from image and text to future video, but now we're going from image and action to future video. The goal is to see how far you can go with action-conditioned world models.
With action-conditioned world models, you really want the world model to follow the actual command. So if I command the hand to move a certain way, I really need the model to predict video that shows that. If we can do that correctly, then this turns the world model into a learned simulator where I can trust that it's going to faithfully execute this action sequence and give me the future state of what would happen next, like a normal simulator, and therefore you can do cool things with it, like running evaluations in it.
People have also shown other demos; I think Tesla showed a demo of driving in this kind of world model too, where you could also hypothetically put a VR headset on and do some actual teleoperation yourself inside a world model. Those are all possible if you can faithfully follow the action sequences. So what that means is that from the initial frame, if you command different actions, the model follows the corresponding action sequence.
Here we command it with four radically different action sequences and the robot does four radically different things: it walks, it wipes, it does many kinds of things. If you're able to train it like that, then the evaluation work and the future work kind of follow. The architecture here is the same as before: let's say we have a pre-trained base model that understands video well. There's an extra step now you need to add to turn the conditioning signal from text to actions. Actions are very robot-specific and not something you would normally see in a base model; even egocentric data is just the video itself, it doesn't come with a direct action. The action here is really the transition you could take to go from one state to the next state.
So how do the hands move over time? What is the desired command to move the scene around? It mostly only applies to egocentric data, right; if you have third-person-view data, there's less of a direct, singular action that causes the scene to change. So it's a new concept that we're injecting, and we can condition the model on actions with an action encoder, passing the action latents into the model.
If we train it right, then we can do that kind of conditioning. And for evaluation specifically, we also need to add this extra value piece on top. So not only do we need to predict the future video sequence, we also need to predict some value, or have a separate value function, that tells us: given that I'm predicting this state, or I'm in this state, how well have I completed the task, how much reward am I expected to get in the future starting from this state, or some discounted version of that.
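Putting those pieces together, a policy evaluation loop inside such a learned simulator might look like the following sketch; the `predict_next` and `value_head` interfaces are hypothetical assumptions, not 1X's API:

```python
# Hypothetical evaluation inside the learned simulator: the world model is
# conditioned on the policy's actions instead of text, and a value head scores
# how much of the task has been completed so far.

def evaluate_policy(policy, action_world_model, value_head, start_frame, horizon=50):
    frame = start_frame
    total_value = 0.0
    for _ in range(horizon):
        action = policy(frame)
        # Action-conditioned prediction: "if I command this, what happens next?"
        frame = action_world_model.predict_next(frame, action)
        total_value += value_head(frame)   # expected task progress / reward
    return total_value
```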