
Author: RoboPapers | Date: Unknown
Quick Insight: Robotics needs scalable, reliable evaluation. PolaRiS offers a novel real-to-sim approach, using generative visuals and minimal sim co-training, to create benchmarks that accurately predict real-world robot performance. This tool accelerates policy development by making robust, diverse evaluations accessible and reproducible.
Real-world robot evaluations are slow, costly, and hard to reproduce. This creates a disconnect: the community cares about real-world performance, but simulated benchmarks often fail to predict it. Arhan (U Washington) and Carl (Physical Intelligence) introduce PolaRiS, a system designed to bridge this gap and make robot policy evaluation scalable and reliable.
Podcast Link: Click here to listen

Hey everyone, welcome to another episode of RoboPapers. I'm Chris and I'm here with Arhan and Carl. Would you guys like to introduce yourselves and get started?
I guess I can start. I'm Arhan. I'm a first-year PhD student at the University of Washington, advised by Abhishek Gupta, and I'm going to talk about the first project of my PhD, done with Carl, Abhishek, and many other collaborators, on using sim for robot evaluation.
Yeah, my name is Carl. I'm a member of technical staff at Physical Intelligence. I like to work on large-scale robot learning problems, and I had the pleasure of helping Arhan a little bit with his project here.
I guess I can give a high-level overview to start of what it is. So I briefly touched on using sim for evaluation, but more importantly, what we wanted was a scalable way to do sim evaluation.
There have been previous works like Simpler and RoboArena and a few others, but a few key things were missing. They're either too hard to construct without hand-crafting your assets, which a lot of robot learning people don't necessarily have the resources for.
Some of them, such as Simpler, don't really support wrist cameras, and the RoboArena work has the limitation that if you're green-screening your external camera views, you can't have a camera that moves around in your environment.
One thing we're doing differently in our approach is using real-to-sim to generate environments, which makes it cheap to get eval environments. And if you have some explicit representation like a Gaussian splat, you can actually move around in your environment and get novel views that aren't just green screens of your training dataset.
I think it may be helpful to actually zoom out a little bit first and try to explain why should we even do this kind of evaluation for robotics at all?
Sure.
Do you want me to do like a one minute summary of that?
Yeah, sure. Go for it.
So basically, if you thought about it from the start, you would say: as roboticists, we should always do real-world evals, right? Because real-world evals are what we care about — the metric we're ultimately optimizing for is robots that actually work in the real world.
But then when you do that in practice, and I'm sure many of the people who are listening and who are roboticists know the pain of that, it's not actually very scalable because running real robot evals is really painful and it takes a lot of time and it's very hard to reproduce.
Even historically as a community, we have used a lot of sim evals, right? We have often built benchmarks. I think Jeff also has some great benchmarks that he built in simulation. The benefit is clear, right? Like it's very fast, it's very scalable, it's perfectly reproducible. So everybody loves sim evals.
But then there's this disconnect, right? We care about real-world performance, but we build sim evals and we all test on sim evals. This work is trying to close that gap a little: build evals that are sim evals, but that aren't really optimizing for the best performance in sim — instead, they're trying to be indicative of real-world performance.
The policies that we will evaluate in this paper today are actually trained on real data most of the time. They can run on real robots and in fact a lot of the experiments that Arhan did for this paper are real robot experiments.
The final product of this paper is a sim eval tool where you run a simulated evaluation like any other sim eval. But what we show, for example in this little graph on the right here, is that the performance of these policies in simulation is actually indicative of what they will do in real.
Basically, a policy that does better in our sim benchmark will do better in real, and so this is where the value is. Basically, we can build a simulated tool that allows you to test performance that then hopefully is similarly reflected in the real world.
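To make the correlation claim concrete, here is a minimal sketch — not the paper's actual evaluation code, and the numbers below are invented — of how one might check whether sim success rates track real success rates across a set of policies:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Invented per-policy success rates (fraction of successful rollouts).
# In practice each number would be estimated from many evaluation episodes.
sim_success  = np.array([0.15, 0.40, 0.55, 0.70, 0.85])   # simulated benchmark
real_success = np.array([0.10, 0.35, 0.60, 0.65, 0.90])   # paired real-robot evals

# Pearson measures linear agreement; Spearman measures whether the *ranking*
# of policies is preserved, which is what matters for picking the best policy.
print("Pearson r:   ", pearsonr(sim_success, real_success)[0])
print("Spearman rho:", spearmanr(sim_success, real_success)[0])
```

If the ranking of policies in sim matches the ranking in real, the sim benchmark is useful for model selection even when absolute success rates differ.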
Okay. And so I think now is maybe a good time to then zoom in and say, okay, what was there before because we're not the first people to work on sim eval. And then how does our technical solution expand the capabilities of what was there before?
So I'll just hand it back to Arhan to take it from here.
Yeah, just to add a bit to what Carl mentioned — it would be great if Carl could give a bit of context for some of our non-robotics audience: why is evaluation so much harder for robotics as compared to LLMs, VLMs, computer vision, NLP?
Yeah, I think that would be really great for the audience.
I think one thing that makes robotics particularly hard compared to all these other fields you mentioned is the issue of compounding error — compounding error and multimodality, I should say.

Most of the offline metrics those other fields use, like validation MSE, are one-step metrics: if you're predicting an answer to what's in an image, or predicting housing prices or something, these are one-step problems.
In robotics, whatever you predict affects what you see next, so you have this issue of compounding error: the more errors in prediction you make, the larger the drift, and your evaluation metric might not truly represent what your policy can do.
This is the difference between open-loop and closed-loop evals, and why open-loop metrics fall short. That's why we're trying to use simulators — to actually be able to roll out your policy effectively.
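To see why a one-step metric can look fine while closed-loop behavior drifts, here is a toy sketch (purely illustrative, not from the paper) where a tiny per-step action error accumulates over a rollout:

```python
import numpy as np

rng = np.random.default_rng(0)

def closed_loop_drift(horizon=200, step_error=0.02):
    """Toy 1-D illustration of compounding error.

    A one-step metric (the average per-step action error) stays around
    `step_error`, but over a closed-loop rollout those small errors
    accumulate, so the state drifts further and further from the
    reference trajectory. With a real policy the drift is typically
    worse still, because off-distribution states cause larger errors.
    """
    reference, actual = 0.0, 0.0
    for _ in range(horizon):
        action = 1.0 + rng.normal(0.0, step_error)  # slightly wrong action
        reference += 1.0                             # state an ideal action reaches
        actual += action                             # state the noisy action reaches
    return abs(actual - reference)

print("per-step error ~0.02, drift after 200 steps:", round(closed_loop_drift(), 3))
```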
It's actually quite an interesting point, because I don't think LLMs will get around this problem for much longer. In some sense, LLMs could choose to start with single-step problems, where you just take an input and produce an output and that's it.

Over time, the problems LLMs want to tackle become more like sequential decision-making problems: they may want to interact with users, or do very complicated things in codebases where they need to run code, wait for a result, and do it again, and so on.

That sequential nature will come to LLMs too, and then they will hit a lot of these same problems — maybe they'll need to simulate users, because that's their environment — and evaluation becomes hard again.
So far they could choose to not do sequential decision-making and so they had easier evals. For robotics, you cannot make that choice because there's no use for robotics without sequential decision-making and so we have to tackle this problem from the start.
Sounds great. Please continue.
Cool. I guess next I can get into what we actually did — what our method was. So I talked a little bit about how we're doing this sort of 3D reconstruction to make it cheap to create new evaluation environments.
Specifically, we use 2D Gaussian splatting to extract a mesh, and then you have Gaussian splatting for your visuals. We also have this scene composition tool, which I can cover a little later, where you can pull in different assets that you scan.
For environments we're using Gaussian splatting, but for objects we use generative models. In the paper we use Trellis, but recently more things have come out — SAM 3D, for example — where you give single-view or multi-view inputs and a generative model produces the object.
The benefit of this — why we use it for objects instead of scanning them normally — is that your objects are going to be partially observable. If an object is sitting on a table, you can't really capture what its bottom surface looks like, so generative models are helpful here because they can complete what you can't see.
Anyway, you put this together in the scene builder and you get a scene, with a task you care about, that is paired with the real world. In practice, we first tried to get this to work zero-shot — I can show the plot right after this. Zero-shot, if you transfer your real policies, trained only on real-world data, into this simulation with Gaussian-splatted background visuals and either ray-traced or splatted objects, the policy will actually do something reasonable.
If you don't do a systematic analysis, it'll look like the policy is transferring to sim and this is probably a good evaluation. But if you compute the paired real-world correlations, the zero-shot transfer is not very highly correlated. What we found is that if you throw a small sim co-training dataset into your mix — not specific to the task or environment you're testing on, just random simulation data — and then evaluate your policy, the correlations look much, much better.
The intuition is that you're doing some sort of alignment to get around the distribution shift your policy sees going from real to sim. On the right is an example of what some of our correlations look like after this short sim co-training.
I want to stress that the data we pick for co-training is completely unrelated to what we test on. There's no overlap with our evaluation environments — it's not even the same robot or anything like that.
Oh, okay — it is the same robot, I will say that. It would be interesting to test multi-embodiment, just throwing in random sim data from other robots, but that wasn't really within our academic capacity, and I actually don't know how it would change things. We use the same DROID setup, but all our objects and environments are completely different.
To be fair, the policies we're evaluating here are basically single-robot policies, right? They were at some point pre-trained on multi-robot datasets, but for all practical purposes they're heavily fine-tuned on the single robot embodiment of a Franka robot, so it probably makes the most sense to put in some data for that same robot.
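As a rough sketch of what "a small amount of sim data in the co-training mix" might look like in code — illustrative only, with made-up dataset names and a hypothetical 10% sampling weight rather than the paper's exact recipe:

```python
import random

def make_cotraining_sampler(real_episodes, sim_episodes, sim_fraction=0.1):
    """Return a sampler that draws from real and sim data with a fixed mix.

    `sim_fraction` is the sampling weight given to sim episodes, not their
    relative dataset size: the sim set can be tiny (a handful of unrelated
    environments) and still be up-weighted to ~10% of training batches.
    """
    def sample():
        source = sim_episodes if random.random() < sim_fraction else real_episodes
        return random.choice(source)
    return sample

# Hypothetical usage: briefly fine-tune an existing real-data policy on this
# mixture (a few hundred gradient steps) before running the sim benchmark.
sampler = make_cotraining_sampler(
    real_episodes=[f"real_ep_{i}" for i in range(1000)],
    sim_episodes=[f"sim_ep_{i}" for i in range(30)],
)
batch = [sampler() for _ in range(8)]
print(batch)
```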
Can I ask how you're thinking about physics in all this?
Well, let me start.
Because if you're doing Gaussian splatting, obviously you don't have any idea what the frictions or the masses of the different objects are, and presumably your code is giving you some of that for the robot — but could you tell me how you think about this?
To clarify — is the question about how we're specifying those physics parameters?

Yeah, like do you randomize them, how do you handle them, or is this just not an important problem? How do you think about it? That's a very open-ended question, I guess.
So most of the things we evaluated were rigid-body tasks, where the objects we place in front of the robot are things it should be able to pick up. There could be stronger sys ID done — we didn't do extensive system identification — basically because of something we did test at one point: are the dynamics between sim and real actually different?
We did this "hardware in the loop" test, as we called it. Basically, we roll out the policy in the real world, but the inputs to the policy are not real-world inputs, they're mirrored sim inputs — sorry, wait, the other way around: you roll out the policy in sim, but the inputs to the policy are real-world inputs.
Basically, you keep the real-world visuals but isolate the sim dynamics. That was a test to see whether our dynamics are an issue — whether it's object masses and friction, or robot dynamics and controllers. For the robot side, differences would mostly show up on contact, and for most of the objects we use, if the robot can grasp it, it will pick it up.
For that reason, we didn't focus too hard on object mass and friction randomization. When you get into more dexterous tasks and maybe deformables, this will definitely become a much bigger issue. In our case, we basically just set them to default values approximated from the size of the object and a default density — it was computed automatically.
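A sketch of the "hardware in the loop" ablation described above, with entirely hypothetical robot and simulator interfaces, just to make the real-visuals / sim-dynamics split explicit:

```python
def hardware_in_the_loop_rollout(policy, sim_env, real_robot, horizon=300):
    """Sketch of the 'hardware in the loop' ablation described above.

    The episode is stepped in simulation (so sim dynamics decide the
    outcome), but the observations fed to the policy come from the real
    robot, which mirrors the simulated robot's joint configuration at
    every step. A performance gap that persists here points at dynamics
    rather than visuals. All method names below are hypothetical.
    """
    sim_env.reset()
    for _ in range(horizon):
        # Mirror the simulated robot's configuration on the real robot so
        # the real cameras see a matching scene.
        real_robot.move_to_joint_positions(sim_env.robot_joint_positions())
        obs = real_robot.get_camera_images()   # real-world visuals
        action = policy.act(obs)
        sim_env.step(action)                   # simulated dynamics
    return sim_env.task_success()
```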
I think these observations are somewhat in line with what we had seen in previous work as well. In the Simpler paper — which tackled at least a similar problem setting, if with a very different approach — we explicitly tried to see how sensitive the whole system is to different physics parameters, and it wasn't particularly sensitive.
Arhan's experiment here is actually a very nice way to delineate the two components of a simulated evaluation system — the dynamics and the visuals — and I think it was a quite nice demonstration that the dynamics, at least in these kinds of tasks, as Arhan said, are not the key driver of correlation issues; it's really mostly about the visuals. That is what motivated us to use the Gaussian splatting approach in the first place, because it's one way to get very high-quality visuals with relatively low effort on the user side.
Basically, making a very nice-looking Gaussian splat is much, much easier than making a very nice-looking handcrafted simulation environment. The ability to get these nice visuals quickly was one of the key motivations for this technique, and it did turn out that the visuals are really what drives good correlation.
I have a question regarding one of the points that was made, which is leveraging sim data in the co-training. So you did co-training with sim DROID data collected on different tasks, and that was about 10% of the dataset?
No, it's — oh, I see. It's maybe 10% of the co-training mixture, but it is a very, very tiny dataset compared to the actual DROID training dataset.
So, that's the question. Why would the correlation get better if the amount of sim data is so little?
So I think there's an important insight here, which is that the sim data is not there to teach the robot something it doesn't already know. It's there to teach the robot to ignore certain differences that are spurious correlations.
In your sim data you have certain artifacts induced by, say, the rendering of a Gaussian splat, which looks a little fuzzy when you get close — and wrist cameras get really close — and you don't have that in your real data. But all the robot has seen so far is real data, so this is confusing, and these models don't generalize the way humans do: a small distribution shift like this can really throw off a model.
Clearly it's nothing about the task, right? The tasks we have here are very similar to what we've seen in real. It's just that the visuals look a little different, in a way that's unintuitive for humans but really makes a difference for these models. The purpose of the sim data is only to teach the model that it can be robust to this difference — whether the image looks like a real image or a slightly washed-out version of a real image doesn't actually matter for what it should do in this environment.
Just a tiny bit of sim data on unrelated tasks, trained on for only a few hundred steps, is enough to essentially teach the model this robustness, and then your correlation results get much, much better.
Yeah, I'm convinced that it doesn't help with the task but really just improves the correlation. I'm just curious — did you try different mixtures to see whether the correlation gets better and better as the sim data increases? Is there a scaling effect where the correlation between real and sim gets more and more aligned?
We couldn't scale it too far, just given how hard it is to collect the data in the first place. We had, I guess, two amounts of sim data that we tested, which is not a lot of data points. I'd be curious to see how it changes as you scale up the sim data.
I think the first one was on the order of six environments, and the second was closer to 15 to 20. The correlations did improve. It just becomes more expensive to find more out-of-distribution environments and teleop demonstrations to put into the sim data after that.
There was actually a very interesting, nuanced point in that experiment. Arhan, I don't know if we have the plot somewhere in the paper about the different types of sim co-training data that we tested.
Um because there's a few different choices you could make, right? You could actually collect data in totally different environments. You could collect data in the same environment but on different tasks. You could collect data literally on the task that you're planning to eval on, right? These are all reasonable choices.
It turns out they have different effects on the correlation values. "In-domain" is basically what we called collecting data on the exact tasks you're going to eval on, and naively you would think that's the best thing to do — the closer the data is to what I want to eval, the better the model should be. But that's actually not the optimal thing to do, because in some sense what you get is a model that overfits to this small sim dataset, which essentially tells it the solution to your test set, since you're going to eval on those same tasks.
That makes all of your policies better, but it doesn't make your evaluation discriminative any longer — now all your policies do roughly equally well. If instead you have data that's only vaguely related, which tells your model how to bridge the distribution shift but doesn't give it the answer to the test you're going to pose, then you get better correlations in the end.
So there's a bit of a nuanced point: how do you want to choose this dataset? You actually don't want it to be too close to your test set, because then your test set loses some of its discriminability between different policies.
Yeah, I think this is what the plot shows. The purple curve is what our final method ends up doing, which is to train on data that is not featured in the environment you're testing on. Training on other tasks in the same environment is also a good choice — in some sense a little bit better, even. Training on the tasks you're literally going to evaluate on is somewhat worse. And one of the nice things is that having these unrelated tasks in your co-training dataset makes it much easier to evaluate a new task, rather than having to collect data in every new environment you want to evaluate.
Okay. Was there any other question? I think another interesting, higher-level point — maybe if you want to go to the main result figure — is that a lot of people are quite excited about using video models.
Is it on this plot? Maybe you need to go to the paper — it'll be clearer there.
So basically there's a lot of people who are excited about using video models to evaluate policies which is an even easier way in some sense to get an evaluation environment because the model just is the environment.
For some context, what this would mean is that you still have your policy, trained on real data, you have a video model that's also trained on real data, and then you just loop them. Your policy looks at the outputs of the video model and produces the next action; you pass that into the video model, it steps forward — you may want to call it a world model if it's action-conditioned — and you loop. You basically get a rollout of your policy inside the video model.
You can score it — at the moment people do this by hand — and then you can see whether the performance in that video model is indicative of real-world performance. Again, an important part here is that we don't just care that policies do something reasonable; we actually want these evaluations to be correlated with our real-world results.
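For readers who want the loop spelled out, here is a schematic sketch of policy-in-a-video-model evaluation as described above; all interfaces are hypothetical, and grading is still manual:

```python
def world_model_rollout(policy, world_model, init_frames, horizon=100):
    """Sketch of evaluating a policy inside an action-conditioned video
    ('world') model, as described above. All interfaces are hypothetical.

    The policy never touches a simulator or a robot: it looks at frames
    generated by the model, emits an action, and the model predicts the
    next frame conditioned on that action.
    """
    frames = list(init_frames)                 # seed with real observed frames
    for _ in range(horizon):
        action = policy.act(frames[-1])
        next_frame = world_model.predict_next_frame(frames, action)
        frames.append(next_frame)
    # Success is currently judged by a human watching the generated video,
    # which is part of why grading these rollouts is expensive.
    return frames
```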
We did a quite rigorous test of this in this project, where we tried the best open-source world model currently available for the DROID setup. We ran our policies through that model, and we did find that they do something for sure — they're not just randomly waving the arm — but the actual correlation, when it comes down to it, is not nearly as strong as what you get today from a more classic way of simulating the environment.
There's in some sense a continuum: you have the fully handcrafted sim, and you have the fully learned world-model simulator. Arhan's project is somewhere in the middle, where you have a bit of learned structure and visuals, but the actual physics are still classic. At the moment, that seems to give the best trade-off between how easy it is to make and how good the correlation you end up with is.
Can you talk a little more about how you set up the video model and why it does worse? Is this basically just because of predictions getting worse over time — is it just diverging, or what's the problem?
I think there are two parts to it. One is the divergence over time: in some cases it even becomes hard to grade, because objects vanish, or they vanish and then come back to where they started. That's one part of it, though it's not the only thing messing up correlations — mainly it just makes things hard to grade sometimes.
The other thing is that a lot of the trajectories you see are going to be success trajectories. I don't know the exact data mix this specific world model was trained on, but there are so many more ways to fail than to succeed, and modeling that — collecting that amount of failure data — is going to be hard.
At least with the data these models are trained on right now, it's more likely that the rollout will gravitate towards success. That's kind of the intuition.

So even if the action was, say, 2% off and that would have caused a failure, the point is that failures like that aren't well represented in the underlying video dataset, so the rollout won't show the failure. Is that the intuition?
Yeah, I think so. The other thing is that there are probably efforts going on right now to make these models more action-conditionable and more faithful to what the outcome would be. Also, it's hard to know what the deviation is, other than doing some sort of image similarity score, and that makes it hard to assess.
I don't think the takeaway people should have from this discussion is that video model evals will never work, or that this is the final solution. I actually think the upper ceiling of video model evals is much higher. If you think about what kinds of tasks you can test in a classic simulator like the one we have here: it's great for rigid objects, it's probably decent for articulated objects, but it's going to be a whole lot harder to simulate a cloth folding task, or a wiping task with some liquid in there.
You can try to solve these tasks classically and build out all the physics models to make them realistic and well-correlated. Or you can take the data-driven route and make your video models better, adding rollout data into your video model training so it gets more accustomed to failures and to policies that are not optimal. There is strong interest in that second, data-driven route, and I'm sure these models will get better.
Maybe the main takeaway for listeners is not that these will always be bad. It's more that today, if you want an evaluation that's a good proxy, doing something closer to a classic sim — or a mixed approach like Arhan's project — will give you a better result than a fully data-driven video model. Maybe six months or a year from now we can re-evaluate. That's one of the nice things here: we have this offline dataset of real-world performance, policies, and initial states, so once a new, better video model comes out, it's very easy to run that offline evaluation through it.
Right now Arhan needs to do a lot of grading when we have a lot of videos, so that's a bit of a pain. If somebody figures out the grading aspect, then it will be a very easy test to run. There's a lot of potential there; it's just not quite there today.
I have a more philosophical question from this point, which is that in robotics we see so many evaluations and benchmarks — people doing real evaluation, sim evaluation — and yes, sure, there are points where a few benchmarks are the ones people converge towards, but then they saturate: from what I last checked, Libero has been at something like 98.5% for the last year to year and a half. So what are your thoughts here? Is the problem finding the right evaluation, or getting people on board with the evaluation?
It's probably a bit of both, right? You need some agreement as a community on which evaluations to compare on, because if everybody makes their own and nobody compares, then we don't learn much from an evaluation. But finding the right evaluation, or the right tasks, is also an important problem. As you said, a lot of people agreed on using Libero as their evaluation suite, but now it's kind of losing its discriminability.
Everybody does well on Libero now, and it's a bit hard to say what is really going on. Agreeing on benchmarks is good, but I think it's also good to develop benchmarks that either co-develop with the capabilities of the policies, or that are easy enough to make or extend that we can keep making them harder as the policies get better. One aspect that I think benchmarks like Libero don't push on too hard is generalization.
Usually, at least so far, most of the benchmarks that people put out come with a training dataset — which is not something that happens in LLM land anymore. When people make evals in LLM land, they don't make training sets; they just make test sets, and then you test whether the off-the-shelf model generalizes. In robotics we haven't really done this, because so far nothing generalized widely enough for it to make much sense.
One of the exciting aspects of this project is that you can actually test zero-shot generalization. We can test in environments we've never seen, with objects we've never seen, and the policies can do something. So now we're much more in a place where you can very quickly develop different kinds of benchmarks that poke at different capabilities — in LLM land people make benchmarks to test factuality, benchmarks to test language translation, the ability to do coding — and people very quickly come up with these different tasks and build up a suite of benchmarks over time that lets you holistically test what these models can do. We haven't had that in robotics, because the models didn't use to generalize at all, so we always tested them on the train set.
Now I think we're getting to a point where it actually makes sense. Tools like this one are a way to get us to that same place: people can propose, say, "How does this model do in very cluttered scenes?", quickly make a PolaRiS eval of very cluttered scenes, put it out, everybody can test it, and for the next generation of models we'll know how well they do in cluttered scenes. So I think we're at a point where we can actually get into a mode like this, but we will still need buy-in from the community — if nobody evaluates on it, there's not much point in making those benchmarks.
A follow-up question to this, maybe also to clarify for the audience about the zero-shot: is the zero-shot capability enabled by the fact that you have a dataset like DROID, which is really great and beneficial for the community, or is it more enabled by the base model being pretty strong? And how do we move forward — how do we get more people on board with this and more models, essentially, running on this platform?
My current feeling is that there's a lot to be done on this kind of benchmark just using the DROID dataset. It usually helps to use a pre-trained model — it makes your life a little easier because it trains faster and it's a decent initialization. And there's a lot of interesting research to be done on the path from that pre-trained checkpoint to the actual thing you're going to evaluate.
So far, all the models we're testing here are trained in a fairly vanilla way — one pass of data filtering, and that's the model we put out. There's a lot of potential to iterate on this, and I don't want people to think it's all about who can collect the largest pre-training dataset and that's how you win this kind of eval. There's a lot of interesting research to be done between that pre-trained checkpoint and the actual thing being evaluated, if that answers the question.
Yeah, I think that answers it. So what's next for this kind of benchmark? You're putting this out there and people can come up with their own PolaRiS environments — how hard is it to actually specify a task? If I want to build a pyramid, or put a peg in a hole, how much work is it to add a new task in practice?
It obviously depends to some extent on the task, but the nice thing about rigid-body tasks is that you typically don't have to do anything crazy to define a sparse reward or success condition. In this case we didn't really do any articulated objects, but articulated objects are also not too bad. For most pick-and-place tasks, it's usually: is this bounding box inside, or overlapping to some extent with, this other bounding box?
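As an illustration of how simple such a sparse success condition can be, here is a small sketch (not the actual PolaRiS reward code) of a bounding-box overlap check for a pick-and-place task:

```python
import numpy as np

def aabb_overlap(box_a, box_b):
    """Boxes are (min_xyz, max_xyz) pairs of axis-aligned corners.
    Returns True if the two boxes intersect at all."""
    a_min, a_max = box_a
    b_min, b_max = box_b
    return bool(np.all(a_max >= b_min) and np.all(b_max >= a_min))

def pick_and_place_success(object_box, target_region_box):
    """Sparse success condition of the kind described above: the object's
    bounding box must overlap the target region's bounding box."""
    return aabb_overlap(object_box, target_region_box)

# Hypothetical usage, with boxes read out of the simulator state:
cube = (np.array([0.30, 0.10, 0.02]), np.array([0.34, 0.14, 0.06]))
bowl = (np.array([0.28, 0.08, 0.00]), np.array([0.40, 0.20, 0.08]))
print(pick_and_place_success(cube, bowl))  # True
```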
I think the pyramid example you gave is a bit more unique — that one would be a little harder because it's a chain of conditions. To be clear, what Arhan is describing is defining the success condition; there's a whole other part about actually scanning objects and so on.
I would say our pipeline makes this relatively easy compared to having to handcraft scenes and objects and then do all of the reward definition on top. We really tried to make all of this fairly easy: people can take existing assets and puzzle them into new tasks, which I think is one very good way of making new evals, or they can scan in their own assets — if you want a certain object or a certain scene in there, you can scan it.
Sometimes it's as easy as taking a few pictures with your phone, or you can scan it with a camera you have on your robot. Then there's a Gaussian splatting pipeline that will make a mesh for you — the mesh doesn't look very nice, but the visuals are much nicer. There's a little browser tool that we built; you don't even need to install anything. You just drop all your assets in there, puzzle them into the configuration you like, click save, and it will save out all the assets and their initial conditions. That's basically the eval. Now you just need to define a reward — the success condition.
Arhan has a little bit of code where you can just specify: this object needs to be in this position, and this object needs to be in that position. And that's already an eval, right? You can package it, upload it to Hugging Face, and now everybody can easily download and test it.
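To give a feel for what "packaging an eval" might involve, here is a hypothetical task spec sketch — the field names and format are invented for illustration, not the actual PolaRiS format:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TaskSpec:
    """Hypothetical packaging of what the scene builder saves out: which
    scanned assets to load, where the objects may be randomized, and which
    success condition to check. Field names are illustrative only."""
    scene_splat: str                                   # Gaussian-splat background
    assets: dict = field(default_factory=dict)         # object name -> asset path
    randomization_region: tuple = ((0.25, -0.10, 0.0),
                                   (0.45, 0.10, 0.0))  # corners of the spawn area
    success_condition: str = "cube_in_bowl"            # key into a checker registry

task = TaskSpec(
    scene_splat="scenes/kitchen_counter.splat",
    assets={"cube": "objects/cube.glb", "bowl": "objects/bowl.glb"},
)

# Serialize the spec; together with the assets, this is what could be shared
# (e.g. on Hugging Face) so others can download and run the identical eval.
print(json.dumps(asdict(task), indent=2))
```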
In this video on the right, I guess we're seeing you generating or adding two objects, getting the scaling right, and then there's this little green box, which is presumably the area where they're randomized. I'm not quite sure how you're specifying the goal, but I'm assuming that's in there, too.
The goal is not in the GUI itself. Right now it's done with some very simple code: you just compose conditions, and I have functions defined for things like "X cube is on top of Y" or "is in this area". It's a little harder to do that graphically for an arbitrary task. One other thing I wanted to clarify is the point about whether you can just scan things in right now.
There are two aspects to this, and it's only going to get easier as the 3D vision world progresses. For objects, I'd say you can already just take an image and scan them in, because it's a generative model. And as other tools develop — for example World Labs-style splat generation — this is something I've been trying, and I'll probably test how evaluations hold up on it: just taking an image, generating the splat for the scene, and evaluating in that. At that point it would be something you can draft up in five minutes.