RoboPapers
February 4, 2026

Ep#61: 1x World Model

Robots Dream in Video: How 1x World Models Are Unlocking Zero-Shot Action by RoboPapers

Quick Insight: For investors and builders tracking the bleeding edge of AI, 1x is showing how world models, trained on internet-scale video, are enabling humanoid robots to perform complex, novel tasks without specific prior training. This approach promises to accelerate robot deployment by making them adaptable, not just task-specific.

  • 💡 How do world models enable robots to perform tasks they have never seen before?
  • 💡 What is the secret sauce in 1x's data training pipeline for robust robot generalization?
  • 💡 Can world models replace traditional simulators for evaluating robot performance?
Daniel Ho, Head of Eval at 1x, pulls back the curtain on their latest work, demonstrating how world models are transforming robot capabilities. The core tension: traditional robot policies (like VLAs) require direct action regression, limiting their adaptability. 1x's world model approach, however, uses video pre-training to predict future states, allowing robots to "imagine" tasks and act with unprecedented generalization.

The Zero-Shot Leap

"The high level takeaway is that it really allows you to zero shot to new tasks."
  • Video Pre-training: World models learn forward dynamics by predicting what happens next in video. This means they can understand a task by "imagining" its completion.
  • Humanoid Advantage: 1x's Neo robot, with its human-like form, can extrapolate from vast human video data. This allows it to perform tasks like "wiping" or "badging into an office" without specific robot training.
  • Generalization Power: Unlike behavior cloning, which overfits to specific robot data, world models broaden the task distribution. This means a robot trained primarily on "pick and place" can still tackle novel, contact-rich tasks like "scrubbing a dish."

Data's Layered Wisdom

"If we can predict video correctly then we can execute it."
  • Tiered Training: 1x trains models in stages: first on web-scale video, then mid-training on 900 hours of unbiased egocentric data, and finally fine-tuning on 70 hours of robot data. This layered approach builds a robust understanding of the world.
  • Egocentric Bridge: Unbiased egocentric data, collected from third parties, acts as a crucial bridge. It teaches the model first-person human motions, preventing overfitting to robot-specific data and improving generalization to diverse home tasks.
  • Granular Prompts: Detailed text prompts, like "right hand picks up the bag of chips, left hand remains still on sliding door," guide the world model's video generation. This allows for precise control, with future work aiming for simpler, high-level commands.

World Models as Simulators

"The world model of robotics is the same as a simulator of self-driving cars."
  • Learned Simulation: Action-conditioned world models predict future video from an image and an action sequence. This turns the model into a "learned simulator" that can faithfully execute and predict outcomes.
  • Targeted Evaluation: This learned simulator allows for offline evaluation of robot policies, even for specific failure modes. Instead of costly real-world tests, you can replay a thousand real-world failures to fix a specific door handle grasp.
  • Budget Optimization: Real-world robot evaluation is expensive and slow. World models offer a high-throughput proxy, enabling teams to rank hundreds of models and select only the best few for physical testing, optimizing development cycles.

Actionable Takeaways

  • 🌐 The Macro Shift: The scaling laws seen in large language and video models are now extending to physical robotics. Internet-scale human video data, combined with humanoid morphology, is creating a new paradigm for robot generalization.
  • The Tactical Edge: Invest in or build systems that prioritize multi-stage data pipelines, especially those incorporating diverse egocentric data. This approach is proving key to unlocking zero-shot capabilities in physical AI.
  • 🎯 The Bottom Line: World models are not just a research curiosity; they are a practical tool for accelerating robot deployment. Their ability to generalize and act as learned simulators will redefine how robots are trained, tested, and ultimately integrated into our daily lives over the next 6-12 months.

Podcast Link: Click here to listen

Hello everyone. Welcome to another episode of RoboPapers. Super happy today to finally get Daniel Ho on the pod. Daniel is the Head of Eval at 1x. He's got lots of things to say about world models today. Thank you so much for making the time, Daniel.

Maybe as a start, if you can share a little bit about yourself before we dive in.

Yeah, absolutely. I've been at 1X for around 2 years now, doing a lot of work on policies, world models, evaluations. On the world model side, we've shared two releases so far on how to use world models as policies, so they can take actions in the real world using our robot Neo, as well as, on the eval side, how we can evaluate policies using the world model as kind of a learned simulator.

Prior to 1x, I also worked at Google. I worked at Waymo, I worked on some research there, and I did my undergrad at Berkeley.

Very cool. So you guys just had a big drop a couple of days ago where you basically act directly from world models to do things with Neo. Previously, a couple months back, you guys had another blog post about the eval. So maybe it'd be great if you can talk us through the journey and what you can share about those.

Yeah, I think it probably makes sense to just dive right into the latest results and just see what the current state is. I think it's very visual.

When it comes to thinking about what world models can do, if we were to go through what is different between a world model policy and another kind of robot policy like a VLA policy, the main difference is that a world model is trained with video pre-training as the main objective, which conveys forward dynamics, in order to predict the actions that the robot should take to solve a task.

Whereas with a VLA, rather than learning directly on video or predicting what happens next in a video, you would instead regress what happens in the action sequence. So you would have a sequence of data from video frames and robot actions, and then you would predict: given that I see this current observation, this current robot image, I should take this trajectory of actions. So it's kind of a more direct regression problem that a VLA learns.
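To make the contrast concrete, here is a minimal sketch in Python. The module interfaces (vla, world_model.generate, idm) are assumptions for illustration, not 1X's actual code: a VLA maps the observation straight to an action trajectory, while the world-model policy first generates future frames and then recovers the connecting actions with an inverse dynamics model (IDM), which comes up later in the conversation.

```python
# Illustrative contrast only -- module names and shapes are assumptions,
# not 1X's actual architecture.
import torch

def vla_policy(vla, image, instruction):
    # VLA: regress an action trajectory directly from the current observation.
    return vla(image, instruction)                 # -> (horizon, action_dim)

def world_model_policy(world_model, idm, image, instruction, horizon=16):
    # World-model policy: first "imagine" future video conditioned on the
    # current frame and a text prompt...
    frames = world_model.generate(image, instruction, num_frames=horizon)
    # ...then recover the actions between consecutive imagined frames with a
    # separately trained inverse dynamics model (IDM).
    actions = [idm(frames[t], frames[t + 1]) for t in range(horizon - 1)]
    return torch.stack(actions)                    # -> (horizon - 1, action_dim)
```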

In this blog post, we share that there are a lot of advantages to learning in the world model way. The high-level takeaway is that it really allows you to zero-shot to new tasks, because of the generalizability of video pre-training and the ability to do this kind of next-token, next-frame prediction problem, which meshes very well with reasoning through human video, which has been done a lot in the past.

We can then transfer all of that knowledge into this policy, which now, if you ask it to do a new task, even though we've never seen robot data of that task, is able to understand that task, because our robot has two hands, the robot kind of looks like a human, and we can extrapolate a little bit from what we've seen before.

If we were to go on to next steps and say what kind of works today, what kinds of tasks work: if we were to look at some of the actual tasks that a robot policy can solve, there are some direct grasping tasks where, on the left side here, it's the world model generation, and on the right side here, it's the actual rollout of that generation in the real world. That means that from the starting frame, which is whatever the robot sees, we'll first imagine what happens next in the scene if I were to grab the bag of chips. From that, I get a sequence of actions that correspond to those video frames, and then I'm able to execute them, and that's what you see.

It's very exciting to see how similar sometimes these side-by-side comparisons can get, meaning that there's something here in terms of if we can predict video correctly, then we can execute it, and that cause and effect can become clear. So the left side here is the generated video, the right side is the real roll out, and this is all zero shot just to be clear.

Very cool.

You can imagine that some of these tasks might be a bit less zero-shot, like grabbing a salt shaker or grabbing a bag of chips, but all of these new behaviors here are completely novel. There's nothing even remotely close to these kinds of behaviors in the data set. We have never done wiping. We've never done this kind of badging into the office. These are all just random things which we've tried to do.

If you look at our data set, it's primarily a pick and place data set. So really the question is, given a focus on pick and place, which is where you would start from with robot data, can we really analyze how well we can zero-shot?

We're using the robot data primarily as a shim in order to teach and post-train a model on robot morphology and robot kinematics, like what are the constraints of the range of motion I have, for example, how far can I reach my hand out before it becomes unrealistic. Once I understand that, what we're trying to convey is that it doesn't really matter what type of robot data you train on. All that matters is that you have some data that corresponds to a robot doing things, which may or may not intersect with the actual tasks you want to solve.

So just to be clear, on the left, for the world model, when it's dreaming up this video, it's presented with one image and then with some text, where the text will say, okay, pull up the chair. Is that what's happening? And then it will generate the video along with some action labels that go with those video frames.

Exactly. So you can imagine that the conditioning into the policy is, if you hover over some of these bars on our plot, you can see that for steaming the shirt, the prompt we have is: the right hand is holding the steamer, the right hand brings the steamer up to the shirt, the right hand steams the shirt in an up-and-down motion. If we were to find that equivalent steaming behavior, it might be something like this that gets output through the video generation, and then the separate IDM process is what we use today in order to infer the actions that take you from one frame to the other, or the transitions between the frames.

So literally, in this left-right scenario, you show the first image, and then that's what the WM, the world model, imagines. On the real one on the right, you basically just apply the action labels, the action instructions suggested by the world model, and then this is what actually happens, and it's so close. Okay, got it.

Exactly. The action sequence or the prompt sequence can be pretty long. Grab the bag of chips is like the right hand picks up the bag of chips, the left hand remains still on the sliding door. It could be really long. The right hand reaches out and grabs the bottom of the metal handle on the right. The right hand slides the handle across to the left, closing the sliding glass door.

Some of these prompts are pretty specific because we want to convey exactly what should happen in a scene today. In the future, there's a lot of work to do to work with both upsampled prompts like this, where we've really conveyed granular instruction, as well as just simple prompts like close the sliding door.

In the future, it's probably going to be some kind of interpretability layer between a VLM and some kind of planner like this world model, where the VLM needs to output text in some manner that is interpretable and understandable by the world model, so it can say, oh, I understand this kind of language.

Could you talk a little bit about, I think in your blog post, you mentioned you guys start with basically internet-scale video, and then you mid-train with some egocentric data, and then finally some robot data. Maybe could you talk us through that? And I think somewhere in there you have some captioning that you do in addition.

The process that we're sharing is that from the webscale pre-training, we first mid-train on egocentric data, like 900 hours, for example, and then we fine-tune on a smaller set of robot data. So, for example, 70 hours. This robot data may not be the type of action or task that corresponds to the ones we try out in the blog post, but the mid-training may include very general home-style tasks that may better correspond to some of these other chores.
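As a rough sketch of that staged recipe (web-scale pre-training, roughly 900 hours of egocentric mid-training, roughly 70 hours of robot fine-tuning), the loop below shows the idea; the dataset names, hour caps, and function signatures are placeholders rather than 1X's pipeline.

```python
# Hypothetical staging of the training recipe described above.
STAGES = [
    ("pretrain", "web_scale_video", None),          # internet-scale video
    ("midtrain", "egocentric_human", 900 * 3600),   # ~900 h unbiased ego data
    ("finetune", "neo_robot_logs", 70 * 3600),      # ~70 h robot data
]

def train_world_model(model, load_dataset, run_stage):
    for stage_name, dataset_name, max_seconds in STAGES:
        data = load_dataset(dataset_name, max_seconds=max_seconds)
        # The objective (next-frame prediction) stays the same across stages;
        # only the data distribution shifts from generic video toward robot video.
        model = run_stage(model, data, objective="next_frame_prediction")
    return model
```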

We didn't explicitly go out and collect egocentric data of every single task that we've done. This is more or less just unbiased egocentric data. The reason why we do this: we show with some ablations the ability of the model to improve from being mid-trained on egocentric data. So here, we show that between doing no egocentric mid-training and doing egocentric mid-training, we're able to measure some delta in performance.

First, we can measure this with video quality, as judged by human labelers who label: does this video look realistic or not, noticing problems with physical plausibility, objects disappearing, or just random artifacting. If it looks bad, reject it. If it looks good, accept it. On generalization tasks, we see improvement when you add egocentric training, and interestingly, on in-distribution tasks, you actually lose a bit of performance on quality.

We think that makes sense because you can imagine that egocentric really is there in order to diversify the task list. So you're actually trying to broaden the distribution. Maybe you dilute a little bit of the specific in-distribution overfitting that you can have, in order to generalize. In the real world, we also further run actual ablation studies where we count the number of successes across each of these different ablations.

For egocentric, for example, you see that across most of the tasks, in this case, all of the tasks, there's pretty significant improvement. It's quite a big difference for some of these tasks. For scrubbing a dish, we found that egocentric was pretty important. So on scrubbing the dish, you could really easily overfit to Neo data without including that egocentric data.

For example, this is the same prompt, which is very granular, saying scrub the dish and doing specifically that: with the right hand, grab this point, scrub the dish. The model may overfit to the robot data if all you see is pick and place, for example, but by including the mid-training stage, you are able to broaden the distribution like we hoped, and you can maybe see that the world model generations after the mid-training stage are now able to not overfit to grasping but actually solve the task, and that was really good to see.

Pretty much all the failures on scrub the dish for the prior no-ego baselines come from the model just not understanding the task correctly. So that was a level of control that we got from egocentric. If you look at the other case you mentioned, the caption upsampling, that was also true. You can also look at the ablation results; you see caption upsampling seems like it generally helps.

If you think about language models or VLMs, now it's pretty common to train on not just simple commands like open a door, but to be really granular, like second by second, to give a play-by-play of what's happening in the scene, and that's able to exert a more granular level of control over the model.

Did you need to do anything like, so for your egocentric data, do you need to do anything like put a version of Neo's cameras on the glasses, or does anything work, or how do you think about that kind of stuff?

That's a good question. For this experiment, we did not do any of that. This is completely unbiased. We didn't do this egocentric data in our office. This was more like egocentric data that you would acquire from third-party providers or from the web, rather than being like, oh, we're going to collect egocentric data of scrubbing in order to target egocentric scrubbing. So you didn't even need to specially collect it yourself. This is not collected by us. This is just collected by third parties, where we didn't really provide an informed opinion about what they should be doing.

You can imagine it's a lot of general home tasks or factory tasks or other places where people have collected egocentric data before.

I'm curious whether you would know this. I imagine if you look at the very base model, which is trained on general internet video, wouldn't there also be some egocentric video in the mix? Why do you think this step of mid-training on egocentric is needed? Just to highlight, I imagine there has to be some egocentric view already in there.

That's a good question. There definitely is a lot of that in the data mix. The problem is that it doesn't make up a big percentage of it. If you take an unbiased set of data from the internet, probably not a lot is egocentric, and less of that is egocentric of doing useful tasks.

This is a bridge between the pre-training and the post-training, where we know that at post-training time it's going to be fully first-person robot views, fully very specific home chores. So can we get to a middle point where we can train on data which is still of humans, which is still very in-distribution to the pre-training? Seeing egocentric data is not strange or high-perplexity to the model, but it is able to understand that it seems like we want to go down this egocentric path: I should focus on understanding these types of motions, so that when we get to the robot-style egocentric, it doesn't come as a shock and overfit the model to that change. We think so, and we see it from the ablations, but we're also just speculating here, to be clear.

Got it. Do you mind going back to the success rate chart? The last one, the scrub dish one.

I'm curious what you think about the absolute success rate, especially for scrub dish. Clearly, without egocentric, it's basically useless. Once you add ego, then it works, but it's also only 20%.

The takeaway I have here is that the work here is still early, and there are so many ways to improve from this. Given we didn't try very hard to fit to the scrub dish task, you could be more targeted and collect that. It wasn't the intention of the work to show results like that, because the study here is more, at a high level, should we be thinking about world model backbones as a more scalable class of architecture than VLAs?

If we were to have gone and done that, that would be doing the VLA style of thinking rather than the world model style of thinking. Of course, we could fit results like that and make them very close, and collect, if you put a data engine to it, hundreds or thousands of hours of that task, but it doesn't really convey the same kind of result, where we're trying to see the generalization performance. We're imagining that even if we have a good sense of the tasks we want to solve in homes, the diversity in general is just going to be so different. The task of scrubbing the dish is going to be different because everybody has different dishes and different sponges and different homes. So it's a proxy for the general transfer problem across tasks.

If you look at the actual task itself, the scrub dish task is pretty hard, which is why I think the success rate was pretty low. It's both grabbing the sponge and scrubbing at the same time. You can break it down: I have to grab the sponge, and then I have to scrub, and I have to make contact with the dish, and that's actually a pretty contact-rich task. And you just have pick-and-place robot data, too. So this is a pretty hard transfer problem, right?

But also, to your point, another way that we're excited about hill climbing this performance, which is enabled by having a world model style backbone and wasn't possible before, is the ability to do this post-training process that follows LLMs. For example, rather than collecting teleop data of this task like you would for a VLA, I could instead just roll this out. I get 20% success. So I don't succeed all the time, but I'm able to safely roll out these videos and try a bunch, get a bunch of failures, get a bunch of successes, and I could train on the autonomy data.

The video sequences of those successes and failures could be used during world model training, too. And then that would help me align the video model directly to these tasks and hill climb that 20% higher and higher, as well as allow me to do things like training value functions or taking the autonomy data and labeling it with more information.
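A hedged sketch of the post-training loop being suggested: roll the policy out autonomously, keep both successes and failures, and feed the labeled videos back into world-model and value-function training. All of the interfaces here, including the success labeler, are assumed for illustration.

```python
# Illustrative self-improvement loop on autonomy data (not 1X's code).
def collect_autonomy_data(policy, env, label_success, num_rollouts=100):
    episodes = []
    for _ in range(num_rollouts):
        video, actions = env.rollout(policy)       # autonomous rollout, no teleop
        episodes.append({
            "video": video,
            "actions": actions,
            "success": label_success(video),       # human or learned outcome label
        })
    return episodes

def post_train(world_model, value_fn, episodes, fit):
    # Successes and failures are both useful: the world model learns dynamics
    # from all of them, while the value function learns to tell them apart.
    world_model = fit(world_model, [e["video"] for e in episodes])
    value_fn = fit(value_fn, [(e["video"], e["success"]) for e in episodes])
    return world_model, value_fn
```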

I think in your blog you also.

I was going to ask how much of a difference you thought the pre-training data mix made. I could imagine a world where eventually you get petabytes of egocentric data. Do you think you'd still need all that egocentric data, or is there still just stuff that's hard to capture from that? What does the mix look like? Do you think you need a lot? Do you have lots of kitchen tasks in it?

Being humanoid-pilled, the answer is that we really want a recipe that learns very well from all human data, not only ego but exo. So the bet here is also that 900 hours of ego data is, in the grand scheme of things, not a lot. It's also not enough to generalize for the most part. So we're really not only learning from the ego here; we're learning a lot from exocentric, third-person view data of people doing things, too, and that's really possible because the morphology of the robot is very similar to the morphology of a human.

I'd imagine transfer like this would become trickier, or at least less data efficient, if we're doing non-humanoid form factor robots. There you may need to do a lot more of that style or other kinds of collection, where you design specific kinds of hardware in order to better fit the morphology of a custom robot, because my robot has wrist cams and six-DOF hands that operate in this way with these kinds of grippers, so I want to make sure that I'm collecting so-called egocentric data in that specific way. But that's a more manual bridge than just directly learning from the web-scale video of humans that already exists.

I just want to point out, while you guys were talking, this video has been on loop, the one from the world model. I want to point out two very interesting things I just saw and get Daniel's and Chris's thoughts on them. The first thing is that you notice when you drop it, it was in the first slot, and then magically it'll be in the second slot.

It started distracting me partway through too.

Yeah. So it's doing something odd, but at the same time, when you try to put the second bread in, it kind of knows that there's something there, so it didn't drop straight away. It's as if it has memory or some spatial intelligence: it knows there's something in there, so it won't drop all the way. Anyway, I just want to point out that it's silly, but at the same time it's also impressive.

But also it got it right. Right. Like, the motions the robot did were correct, because it actually works on the real side.

Even though the bagel teleported a little bit.

It takes like a little give. I think the bagel, if you wiggle it, just falls in depending on whether it falls into one or the other. There's a lot of high uncertainty. A lot of the minor contact details would change where it falls in.

That's something which world models today don't have, that precise level of control where you can say exactly, I want you to move a couple of millimeters forward so that you get into this one slot rather than the other. That's something, I think, that we would appreciate training a lot more robot data for, especially if the robot data includes things like tactile or higher-fidelity sensors, so that you could hill climb this more granular level of manipulation.

Do you think that this problem, the really granular differences in robot video generation, is something that will come out with scale, basically? That it's not a huge issue, and that with enough data it'll just be solved?

That's the simple way to look at it. Of course, under the hood, a lot of people had to do a lot of very hard things to make that work. Build really good robots with a lot of good sensors and build all the infra to make that happen. But those are known problems in a way. Things that we could hill climb even though they are hard.

The idea would be that with this kind of policy, you can ask the robot to do things. It may fail, it may succeed, but that, especially combined with a robot with high-dimensional sensing information, is going to let you hill climb the policy more and more. If you really scale this up, if you deploy thousands of robots, then really quickly you'll collect way more data than we've ever trained on.

That's the ability to go from a world where post-training is the cherry on top to a world where it's really the focus, this kind of maybe more multimodal style of post-training where, rather than just adapting to robot morphology, you're really adapting to the full sensing suite of the robot that you have.

On the sensing suite note, one thing that I'm curious about: there's this old style of world models, which takes state plus action and predicts a next state, and that's not what you guys followed. You followed this sort of newer wave of models, which use an inverse dynamics model to estimate the actions. So how do you think about that? Why make that change, and what are the trade-offs?

We could talk about the state, action, and next state one. I think that was what we used for evals. I think that's very clear and necessary for making a true kind of model-based simulator. But in this case, if the idea is that I want to be able to specify very open-ended actions, doing so in text space is more natural. This is more like the ChatGPT style of doing robotics. We also built this very fun chat interface where we have everybody at the company try to just type to the model and have it do things, and people have tried really random things. We actually show some of these videos from the random things people have tried.

It's an interface like ChatGPT for a robot. You connect to a robot, you see what the robot sees, type pick up this thing or open the door, and then you see the video. It's like, oh, good video, do it, or bad video, don't do it, and then it's a very natural thing where people actually have a lot of interpretability into the process. If you talk about state-action models, then you can't really have this kind of chatbot style. How am I going to specify this trajectory in 3D space?

Maybe on this one, could you comment a little bit about how you think the IDM part actually performed? Do you think it's more or less able to faithfully draw out the actual actions that seem to be suggested by the world model?

Here we deliberately broke the IDM out separately. The IDM is actually the much more stable thing to train. You can imagine that the reason is because you can train the IDM on every single robot log you've collected, no matter if it's teleoperated or autonomous, no matter if it's good or bad. The IDM basically says that from any starting frame, if you take any action, you get to some next frame. Or, another way to say it: given any two frames, with any quality of action in between, you can just draw the bridge between those two frames.

It's a really robust way of training that's really well specified and really able to handle everything. It converges very well. It's a classical supervised learning problem with a very stable loss and very dense signal you can train on. You can train on every pair of images across a video, and you can take any pairwise combination. You can also slide a window across any kind of sequence length, and there's no limit to it, unlike more narrow ways of training like teleop, where I must focus on this one specific action and it has to be done exactly in this way to be correct and consistent.

The IDM is more like, we can trust the IDM; it's really about just generating the video. That's kind of what we found. In the future, we could also merge the two. In a machine learning sense, that seems very plausible.
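A small sketch of why the IDM objective is so dense and stable: every pair of frames within a window, from any log, good or bad, gives a supervised example mapping (frame_a, frame_b) to the actions in between. The window size and data layout here are assumptions.

```python
# Illustrative construction of IDM training examples (assumed details).
import itertools

def idm_training_pairs(frames, actions, max_gap=4):
    """Yield (frame_a, frame_b, actions_between) for every frame pair inside a
    sliding window. Every robot log contributes, regardless of whether the
    behavior in it was good, bad, teleoperated, or autonomous."""
    for a, b in itertools.combinations(range(len(frames)), 2):
        if b - a <= max_gap:
            yield frames[a], frames[b], actions[a:b]
```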

How about in practice? I think in your blog post you mentioned something about the actual inference: how long does it take to actually run the imagination? Could you talk about that, and how things scale out? Obviously, this is kind of your first attempt at how to do it in practice.

That's a big reason to merge the IDM and the world model: that can make it faster. Then you don't need to run two models; you could just run one model, and you could model partially and do things like just care about the action part of it. The current model today takes between 10 and 20 seconds to run end to end. If you use the best hardware, I think it's around 11 seconds to inference, and it predicts about 5 seconds of video. So it's running synchronously. It's not running in async mode. It's not like a closed-loop rollout where you take actions continuously with the robot.

If you could get the latency down, that becomes more and more feasible, and it could be more and more natural. That's something where, following the trends of optimization work in diffusion models, it's something we really want to show as a next step.

How about also maybe allowing the world model to dream up more variations? Do you think that's actually helpful?

That's a good question, and that's sometimes in direct conflict with the latency point you just mentioned. We showed some work that you can generate multiple versions and pick the best one, and humans are very good at picking the best video from among a series of videos that look realistic. We actually have a very good critic of videos in our brains. So if you allow humans the ability to look at eight videos rather than one video and just pick the best, then I think we saw a pretty good boost on most tasks, I would say, because you could immediately rule out some of the bad generations. Maybe today the model still needs to be kind of preference-tuned or RLHF'd; it doesn't always do exactly as you say.

Maybe in the future this will go away. There are two different ways to solve the problem: one is you weed out the bad ones by generating multiple times; the other way is you train a preference function, too, and then, given pairwise preferences across generations, I can just post-train the model on only predicting the ones that humans prefer. That's also something I think will be really interesting next work.
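A minimal sketch of those two options: best-of-N selection with a critic (today, a human picking among roughly eight generations), which also yields the pairwise preferences you would need for later preference tuning. The scoring function and generate interface are placeholders.

```python
# Hypothetical best-of-N selection over world-model generations.
def best_of_n(world_model, image, prompt, score, n=8):
    candidates = [world_model.generate(image, prompt) for _ in range(n)]
    # 'score' could be a human rating today or a learned critic/preference
    # model later; either way, only the best-looking imagination is executed.
    ranked = sorted(candidates, key=score, reverse=True)
    best = ranked[0]
    # The same comparisons can be logged as pairwise preferences and used to
    # preference-tune the world model itself down the line.
    preference_pairs = [(best, worse) for worse in ranked[1:]]
    return best, preference_pairs
```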

Do you think this is also the kind of thing that you could just scale up? Like if you generated 64, do you keep getting better? Is this like test-time compute for robots, or, I don't know.

For some tasks, generations are pretty good all the time, especially grasping, since, let's say, we train on grasping. If it's a really hard task, you kind of have to be more selective. So I think there are some interesting conclusions you can see, where some tasks do require a lot more thinking in order to get right, which you could solve with test-time compute. You could also solve it with RLHF at the planning stage, especially if you combine it with a value function; then you could really automate this process, and you could spin up multiple servers and do things really in parallel. It could probably be a combination of all of these.

If you really try a novel task, maybe it takes the model a long time, because it's a bit slower and thinking through a lot of it, like in the LLM sense, but maybe it's able to do it right, and that will be a new kind of avenue that you can explore with these kinds of robot policies that wasn't possible before.

Do you think it will take any clever methods or any changes to make sure that you get enough diversity of options for that kind of thing to work? Because I could imagine it collapsing pretty easily to just a couple.

With diffusion and the way we've done our training, we saw a good amount of diversity already. The collapsing mostly came from in-distribution tasks, where you could argue it's already good, or you could argue it's overfit; therefore, it's always doing the same thing. Whereas with rarer tasks, you actually see that the model is just naturally more uncertain because it hasn't really seen this before in your robot data set. It's like, oh, there are many ways to do it. Usually it doesn't collapse, but I guess it's future work to measure exactly how the trends line up there.

Daniel, how do you think about this: right now you basically let it dream up 5 seconds of video and then do 5 seconds of action, or something like that. In the real world, there will be tasks where maybe within the first 5 seconds, like two seconds in, something changes. So do you think this is just a matter of engineering, where you layer a system on top to make it more dynamic?

That's a good question. There are many ways to solve this, and it's still open research. I would say that in synchronous mode you're always going to have this problem, where you can't really handle dynamic tasks or changing environments. If you run it really asynchronously, where you're constantly replanning and being aware of change, then you could solve it. There are also maybe ways to do system one / system two, which we've also seen, where you can have something more reactive.

Maybe this is a good segue to the other, earlier paper where you do eval. Absolutely. How do you know whether this world model version is better than any other one? Or how do you even know a world model is better than a VLA?

You could compare anything. So maybe as a bridge, toward the point you mentioned earlier, there's a different way to formulate the world model problem. In the current work we're talking about, it was going from image and text to future video, but now we're going from image and action to future video. The goal is to see how far you can go with action-conditioned world models.

With action-conditioned world models, you really want the world model to follow the actual command. So if I command the hand to move a certain way, I really, really need the model to predict video that shows that. If we can do that correctly, then this turns the world model into a learned simulator, where I can trust that it's going to faithfully execute this action sequence and give me the future state of what would happen next, like a normal simulator, and therefore you can do cool things with it, like run evaluations in it.

People have also shown other demos; I think Tesla showed a demo of driving in this kind of world model too, where you could also maybe hypothetically put a VR headset on and do some actual teleoperation yourself in a world model. Those are all possible if you can faithfully follow the action sequences. So what that means is that from the initial frame, if you command different actions, the model can follow each action sequence.

Here, we command it with four radically different action sequences, and the robot does four radically different things. It walks, it wipes, it does many kinds of things. If you're able to train it like that, then the evaluation work and future work kind of follow. The architecture here is the same as before: let's say we have a pre-trained base model that understands video well. There's an extra step you now need to add, to turn the conditioning signal from text to actions. Actions are very robot-specific and not something that you would normally see in a base model. Even if you see egocentric data, it's just the video itself; it doesn't really come with a direct action. And the action here is really the transition that you take to go from one state to the next state.

How do the hands move over time? What is the desired command to move the scene around? It mostly only applies to egocentric data, right? If you have third-person view data, there's less of a direct singular action that causes the scene to change. So it's kind of a new concept that we're injecting. We can condition the model on actions with an action encoder, passing the action latents into the model. If we train it right, then we can do that kind of conditioning.
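A schematic of that conditioning change, sketched in PyTorch: an action encoder maps the commanded action sequence to latents that condition the pre-trained video backbone in place of text. The module boundaries and shapes are assumptions, not the actual model.

```python
# Illustrative action-conditioned world model (assumed interfaces).
import torch.nn as nn

class ActionConditionedWorldModel(nn.Module):
    def __init__(self, video_backbone, action_dim, latent_dim):
        super().__init__()
        self.video_backbone = video_backbone       # pre-trained video model
        self.action_encoder = nn.Sequential(       # new piece: actions -> latents
            nn.Linear(action_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, start_frame, action_sequence):
        # Swap the text conditioning for action latents, so the model must
        # faithfully follow the commanded motion -- a learned simulator.
        action_latents = self.action_encoder(action_sequence)
        return self.video_backbone(start_frame, conditioning=action_latents)
```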

For evaluation specifically, we also need to add this extra value piece on top. Not only do we need to predict the future video sequence, but we also need to predict some value, or have a separate kind of value function, that tells us: given that I'm predicting this state, or I'm in this state, how well have I completed the task, how much reward am I expected to get in the future starting from this state, or some discounted version of that.

If you can predict a value, then, putting everything together, you could take multiple different actions from the same state. Let's say each action was predicted by a different network. Then, by passing them through the world model, you're able to see different values: maybe with VLA number one you got a value of 0.3, with VLA number two you got a value of 0.5, and so you could say that VLA number two performed better, that its action sequence seemed better. At least the world model judged it better.

From that point, you could really do things like come up with a complete analysis of, across a whole aggregate of a data set, how well VLA number one performs compared to VLA number two. If you look at some of these videos, you can see that we have action-conditioned sequences where, by replaying the actions of, say, a log, you're able to achieve very similar generations of the robot state between the original and the world model generation. The way the hands move looks almost the same between the two, meaning that we're able to correctly get controllability of actions to be a property of this model.

If you were to look at the next steps, it would be: how well are you able to evaluate, how well are you able to simulate, given the data you train this world model on? And the way that we measure that is how accurate our alignment is. If you have a real robot log where you've done a given task, I want to grab an object, there's going to be a success or failure associated with that. I could have done that correctly, or I could have failed on that task. If you were to replay the sequence in the world model, you could also predict the equivalent success or failure using your value function.

If you have a score of 50/50, then you're pretty much guessing: it could be a success, it could be a failure, I don't really know. That's the baseline. If you get a score of 100, then it means I fully understand the world, I'm able to fully and faithfully simulate whether your action sequence led to a successful trajectory or not, and it will match your real log. The reality is it will be somewhere in between, most of the time.
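One way to read that alignment score: replay each real log's actions inside the world model, have the value function call success or failure, and measure agreement with the real outcome, where 50% is chance and 100% is a perfectly faithful simulator. A sketch with placeholder interfaces:

```python
# Illustrative alignment metric between world-model verdicts and real outcomes.
def alignment_score(world_model, value_fn, logs, threshold=0.5):
    correct = 0
    for log in logs:
        # Replay the real action sequence inside the learned simulator.
        imagined = world_model.rollout(log.start_frame, log.actions)
        predicted_success = value_fn(imagined) > threshold
        correct += int(predicted_success == log.real_success)
    return correct / len(logs)   # 0.5 ~ guessing, 1.0 ~ faithful simulation
```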
