RoboPapers
February 11, 2026

Ep#62: PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies


Author: RoboPapers

Date: February 11, 2026

Quick Insight: This summary unpacks PolaRiS, a novel framework for robot policy evaluation that dramatically cuts the cost and time of real-world testing. It’s a must-read for anyone building or investing in robotics, offering a path to faster iteration and more reliable real-world performance.

  • 💡 Why is robot policy evaluation uniquely challenging compared to LLMs or computer vision?
  • 💡 How can a tiny amount of unrelated simulation data significantly boost real-to-sim correlation for robot policies?
  • 💡 Where do fully data-driven video models fall short for robot evaluation today, and what role does PolaRiS play?

Robotics is hard. Real-world robot evaluation is harder still: a slow, costly, and often irreproducible bottleneck. Arhan and Carl, from the University of Washington and Physical Intelligence respectively, introduce PolaRiS, a system that bridges this gap, making high-fidelity, real-world-correlated simulation accessible and scalable for generalist robot policies.

The Real Problem

  • Robotics is Unique: Unlike LLMs or vision models, robotics faces compounding error and sequential decision-making, where one small mistake cascades. This makes simple, one-step offline metrics insufficient for true policy evaluation.
  • Real World Pain: Testing robots in the physical world is slow, expensive, and difficult to reproduce consistently. This limits the speed of iteration and the diversity of environments policies can be tested in.
  • Sim’s Promise: Simulation offers speed, scalability, and perfect reproducibility, but traditional simulators often fail to accurately predict real-world performance due to the "sim-to-real gap."

The PolaRiS Solution

  • Visual Fidelity: PolaRiS uses real-to-sim environment generation via 3D Gaussian splatting, creating high-quality, visually rich simulation environments from simple scans. This ensures the visual input to the robot policy in sim closely mirrors reality, a critical factor for correlation.
  • Minimal Data Magic: A small, task-agnostic dataset of simulated robot interactions, co-trained with real-world data, dramatically improves real-to-sim correlation. This "sim data" doesn't teach new tasks; it teaches the policy to ignore spurious visual differences between real and sim, making it robust to distribution shifts.
  • Hybrid Approach: PolaRiS combines learned visuals (Gaussian splatting) with classic physics, offering a sweet spot between ease of creation and strong real-world correlation. This outperforms purely data-driven video models, which often struggle with failure modes and long-horizon tasks due to data limitations.

Actionable Takeaways

  • 🌐 The Macro Shift: The robotics community is moving from hand-crafted, task-specific simulations to generalist policies that demand scalable, real-world correlated evaluation. PolaRiS enables this by making it cheap and easy to create diverse, high-fidelity sim environments from real scans, allowing for generalization testing akin to LLM benchmarks.
  • The Tactical Edge: Implement PolaRiS for rapid policy iteration. Use its real-to-sim environment generation and minimal, unrelated sim data co-training to quickly validate robot policies against real-world performance, reducing costly physical robot time.
  • 🎯 The Bottom Line: PolaRiS offers a critical infrastructure upgrade for robot AI development. By providing a fast, reproducible, and highly correlated simulation environment, it allows builders to iterate on generalist robot policies at software speed, significantly de-risking and accelerating the path to real-world deployment and broader robot capabilities over the next 6-12 months.

Podcast Link: Click here to listen

Hey everyone, welcome to another episode of RoboPapers. I'm Chris, and I'm here with Arhan and Carl. Would you guys like to introduce yourselves and get started?

I guess I can start. I'm Arhan. I'm a first-year PhD student at the University of Washington, advised by Abhishek Gupta. What I'm going to talk about is the first project I did in my PhD, with Carl and Abhishek and many other collaborators, on using sim for robot evaluation.

Yeah, my name is Carl. I'm a member of technical staff at Physical Intelligence. I like to work on large-scale robot learning problems, and I had the pleasure of helping Arhan a little bit with his project here.

Right. I guess I can give a high-level overview of what it is to start. I briefly touched on using sim for evaluation, but I think more importantly, what we wanted was a scalable way to do sim evaluation.

There have been previous works like Simpler and Robot Arena and a few others, but a few key things were missing. They're either too hard to construct without handcrafting your assets, which a lot of robot learning people don't necessarily have access to.

Also, some of them don't really support wrist cameras. Simpler and the Robot Arena work have this limitation where, if you're doing green screening on your external camera views, you can't really have a camera that moves around in your environment.

So the approach we took is different: we use a real-to-sim approach to generate environments, which makes it cheap to get eval environments. And if you have some sort of explicit representation like a Gaussian splat, you can actually move around in your environment and get novel views that are not just green screens of your training dataset.

I think it may be helpful to actually zoom out a little bit first and explain why we should even do this kind of evaluation for robotics at all.

Sure. Do you want me to do a one minute summary of that?

Yeah, sure. Go for it.

Okay. So basically, if you thought about it from the start, you would say, okay, as a roboticist we should always do real world evals, right? Because real world evals are what we care about. That's the metric we're optimizing for: having robots actually work in the real world.

But when you do that in practice, and I'm sure many of the roboticists listening know the pain of this, it's not actually very scalable: running real robot evals is really painful, it takes a lot of time, and it's very hard to reproduce.

And so even historically, as a community we have used a lot of sim evals; we have often built benchmarks. I think Jeff also has some great benchmarks that he built in simulation. The benefit is clear, right? It's very fast, it's very scalable, it's perfectly reproducible. So everybody loves sim evals.

But then there's kind of this disconnect, right? We care about real world performance, but we build sim evals and we all test on sim evals. So this work is trying to close that gap a little bit: to build evals that are sim evals, but that aren't really optimizing for the best performance in sim; instead, they're optimizing to be indicative of real world performance.

So the policies we evaluate in this paper are mostly trained on real data and can run on real robots, and in fact a lot of the experiments Arhan did for this paper are real robot experiments.

But the final product, so to say, or the outcome of this paper, is a sim eval tool where you run a simulated evaluation like any other sim eval. What we show, for example in this little graph on the right, is that the performance of these policies in simulation is actually indicative of what they will do in real: basically, a policy that does better in our sim benchmark will do better in real. And that's where the value is; we can build a simulated tool that lets you test performance in a way that is hopefully similarly reflected in the real world.
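
To make the correlation idea concrete, here is a minimal sketch (not the paper's evaluation code; the success rates are placeholder numbers) of how one might quantify real-to-sim correlation across a set of policies:

```python
# Minimal sketch: how well do simulated success rates track real-world
# success rates across several policies? (Placeholder numbers, not paper results.)
from scipy.stats import pearsonr, spearmanr

sim_success = [0.15, 0.40, 0.55, 0.70, 0.85]    # per-policy success rate in the sim benchmark
real_success = [0.10, 0.35, 0.60, 0.65, 0.90]   # paired real-robot success rates

r, _ = pearsonr(sim_success, real_success)      # linear correlation
rho, _ = spearmanr(sim_success, real_success)   # rank correlation (does sim preserve the ordering?)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```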

Okay. So I think now is maybe a good time to zoom in and say, okay, what was there before, because we're not the first people to work on sim evals, and how does our technical solution expand the capabilities of what was there before.

So I'll just hand it back to Arhan. Yeah, just to add a bit on what Carl mentioned, it would be great if Carl could give a bit of context to some of our non-robotics audience: why is evaluation so much harder for robotics compared to LLMs, VLMs, you know, computer vision, NLP?

Yeah, I think that would be really great for the audience. Yeah, that's a good point. I think one thing that makes robotics particularly hard compared to the other fields you mentioned is the issue of compounding error, and I should say compounding error and multimodality.

All these other metrics, like validation MSE, are mostly one-step metrics: if you're predicting an answer to what's in an image, or predicting housing prices or something, those are one-step problems. In robotics, whatever you predict affects what you see next, so you have this issue of compounding error where you start to drift: the more errors in your predictions, the larger the drift, and so your evaluation metric might not truly represent what your policy can do.

So this is the issue of open-loop versus closed-loop evals, and why open-loop metrics fall short. And that's why we're trying to use simulators, to actually be able to roll out your policy effectively.

It's actually quite an interesting point, because I don't think that, for example, LLMs will get around this problem for much longer. In some sense, LLMs could choose to start with single-step problems, where you just take an input and produce an output and that's it.

But over time, the problems that LLMs want to tackle become more sequential decision-making problems, right? They may want to interact with users, or do very complicated things in codebases where they need to run code, wait for a result, and do it again, and so on. So the sequential nature will come to LLMs too, and then they will have a lot of these same problems: maybe they need to simulate users now, because that's their environment, and again evaluation becomes hard. But so far they could choose not to do sequential decision-making, so they had easier evals. For robotics you cannot make that choice, because there's no use for robotics without sequential decision-making, so we have to tackle this problem from the start. Yeah, sounds great. Please continue. Cool.

I guess next I can get into what we actually did, what our method was. I talked a little bit about how we're doing this sort of 3D reconstruction to make it cheap to create new evaluation environments. Specifically, we're using 2D Gaussian splatting for extracting a mesh, and then you have Gaussian splatting for your visuals.

And then we have this scene composition thing, which I can cover a little later, where you can basically pull in different assets that you scan. For environments we're using Gaussian splatting, but for objects we're using generative models. In the paper we use Trellis, but recently more things have come out, like SAM 3D for example, where you just give single-view or multi-view inputs and a generative model produces the object.

The benefit of this, why we use it for objects rather than scanning them normally, is that your objects are going to be partially observable. If an object is sitting on a table, you won't really be able to capture what the bottom surface looks like. That's why generative models are helpful here: they can complete what you can't see.

Anyway, you put this together in the scene builder, and you get a scene with a task you care about that is paired with the real world. In practice, we tried to get this to work zero shot in the beginning, and I can show the plot right after this. Zero shot, if you transfer policies that are only trained on real-world data to this simulation, with Gaussian-splatted background visuals and either ray-traced or splatted objects, the policy will actually do something reasonable. If you don't do a systematic analysis, it'll look like the policy is transferring to sim and this is probably a good evaluation. But if you do the paired real-world correlations, the zero-shot transfer is not super highly correlated. What we found is that if you throw in a short co-training dataset that's not specific to the task or environment you're testing on, just random simulation data in your co-training mix, and then evaluate your policy, the correlations look much, much better.

The intuition behind this is that you're doing some sort of alignment to get around the distribution shift your policy sees when going from real to sim. On the right you have an example of what some of our correlations look like after doing some short sim co-training.

I want to stress that the data we pick for co-training is completely unrelated to what we test on. There's no overlap with our evaluation environments; it's not even the same robot or anything like that.

Oh, okay, it is the same robot, I will say that. It'd be interesting to see how things change with multiple embodiments if you just throw in random sim data; I guess that wasn't really within our academic capacity to test, and I don't know how it would change things. But we used the same DROID setup, just with all our objects and environments completely different.

To be fair, the policies we're evaluating here are basically single-robot policies, right? They were at some point pre-trained on multi-robot datasets, but for all practical purposes they're heavily fine-tuned on the single robot embodiment of a Franka robot. So it probably makes the most sense to put some data for that same robot in there.
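
As a rough illustration of what such a co-training mixture could look like, here is a sketch assuming a PyTorch-style setup; the dataset sizes and the 10% sampling fraction are stand-ins, not the paper's exact recipe:

```python
# Illustrative co-training mixture: a tiny, task-agnostic sim dataset sampled
# at ~10% of each batch alongside a much larger real dataset (dummy data here).
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

real_dataset = TensorDataset(torch.randn(10_000, 8))  # stand-in for real teleop demos
sim_dataset = TensorDataset(torch.randn(200, 8))      # stand-in for the small sim dataset

combined = ConcatDataset([real_dataset, sim_dataset])

# Per-sample weights so roughly 10% of each batch is sim data,
# even though sim is a tiny fraction of the total data.
sim_fraction = 0.10
weights = [(1 - sim_fraction) / len(real_dataset)] * len(real_dataset) + \
          [sim_fraction / len(sim_dataset)] * len(sim_dataset)

sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=64, sampler=sampler)
# The policy is then co-trained (or briefly fine-tuned) on this mixture.
```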

Can I ask how you're thinking about physics in all this? Oh, okay, well, let me start. If you're doing Gaussian splatting, obviously you don't have any idea what the frictions or masses of the different objects are, and presumably your code is giving you some of that for the robot. So could you tell me how you think about this?

So to clarify, is the question about how we're specifying it? Yeah, like do you randomize it, or how do you handle it, or is this not an important problem? How do you think about it? That's very open-ended, I guess. Yeah, I think most of the things we evaluated were rigid body tasks, where the robot should be able to pick up the objects we place in front of it. There could be stronger sys ID done; we didn't do extensive sys ID, basically because of what we were testing.

What we did test at some point is whether the dynamics between sim and real are different. We did this hardware-in-the-loop test, as we called it, where basically we roll out the policy in the real world, but the inputs to the policy are not real world inputs, they're just mirrored sim inputs.

Sorry, wait, other way around: you roll out the policy in sim, but the inputs to the policy are real world inputs. So you keep real world visuals while isolating the sim dynamics. That was basically a test to see whether our dynamics are an issue, whether that's object masses and friction or robot dynamics like controllers and such. But that would mostly only show up on contact, right? For the robot side it would mostly matter on contact, and for most of the objects we're using, if the robot is able to pick it up, it will pick it up.
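
A rough sketch of that kind of test is below; the `policy`, `sim`, and camera/mirroring interfaces are hypothetical, not the PolaRiS API. The point is only to show the separation: real-world visuals go into the policy while the simulator supplies the dynamics.

```python
# Hypothetical hardware-in-the-loop style rollout: the policy sees real camera
# images, but its actions are executed by the simulator (sim dynamics),
# which isolates the dynamics gap from the visual gap.
def hardware_in_the_loop_rollout(policy, sim, get_real_images, mirror_real_robot, horizon=200):
    sim.reset()
    for _ in range(horizon):
        obs = {
            "images": get_real_images(),          # real-world visual input
            "proprio": sim.get_robot_state(),     # robot state from the simulator
        }
        action = policy(obs)
        sim.step(action)                          # simulated dynamics execute the action
        mirror_real_robot(sim.get_robot_state())  # keep the real robot mirroring sim so the
                                                  # real cameras see a matching scene
    return sim.task_success()
```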

For that reason, we didn't focus too hard on object mass and friction randomization. When you get into more dexterous tasks, and maybe deformables, this will definitely become a much bigger issue. In our case, we basically just set them to default values approximated from the size of the object and the density or whatever; it was computed automatically.

Yeah, these are observations that are somewhat in line with what we had seen in previous work as well. In the Simpler paper, which is kind of a previous take on a similar problem setting with a very different approach, we explicitly tried to see how sensitive the whole system is to different physics parameters, and it wasn't particularly sensitive. Arhan's experiment here is a very nice way to delineate the two components of a simulated evaluation system, the dynamics and the visuals, and I think it was a quite nice demonstration that the dynamics, at least in these kinds of tasks, as Arhan said, are not the key driver of correlation issues. It's really mostly about the visuals, which is what motivated us in the first place to use the Gaussian splatting approach, because it's one way to get very high quality visuals with relatively low effort on the user side.

Basically, making a very nice-looking Gaussian splat is much, much easier than making a very nice-looking handcrafted simulation environment. So the ability to get these nice visuals quickly was one of the key motivations to use this technique. It also did turn out that the visuals are really what drives good correlation.

I have a question regarding one of the points that was made, which is leveraging sim data in the co-training. So you did co-training with sim DROID data collected on different tasks, and that was about 10% of the dataset? It's maybe 10% of the co-training mixture, but it is a very, very tiny dataset compared to the actual DROID training dataset. Yeah, so that's the question: why would the correlation get better if the amount of sim data is so little?

So I think there's an important insight here, which is that the sim data is not there to teach the robot something it doesn't already know. It's there to teach the robot to ignore certain differences that are spurious correlations.

Right. In your sim data, you have certain artifacts induced by, say, the rendering of a Gaussian splat, which looks a little fuzzy when you get close, for example when wrist cameras get really close, and you don't have that in your real data. But all the robot has seen so far is real data, so this is kind of confusing. These models tend not to generalize like humans; they have particular properties where a small distribution shift like this can really throw off a model. But clearly it's nothing about the task, right? The tasks we have here are very similar to what we have seen in real; it's just that the visuals look a little bit different, in a way that's unintuitive for humans but really makes a difference for these models.

So the purpose of the sim data is only to teach the model that it can be robust to this difference: whether the image looks like a real image or a slightly washed-out version of a real image doesn't actually matter for what you should do in the environment. That's why just a tiny bit of sim data on unrelated tasks, trained on for only a few hundred steps, is enough to essentially teach the model this robustness, and then your correlation results get much, much better.

Yeah, I'm convinced that it doesn't help with the task but really just makes the correlation better. I'm just curious, did you try different mixtures to see whether the correlation gets better and better as your sim data increases? Is there a scaling effect, where more sim data makes real and sim more and more aligned? Yeah, so we couldn't scale it too far, just given how hard it is to collect the data in the first place.

We had two amounts of sim data that we tested, which is not a lot of data points. I'd be curious to see how it changes as you scale up the sim data further. The first was on the order of six environments, and this one was closer to 15 to 20, and the correlations did improve. But it just becomes more expensive to find more out-of-distribution environments and teleop demonstrations to put into the sim data after that.

Yeah, I think there was actually a very interesting, nuanced point in that experiment. Arhan, I don't know if we have the plot somewhere in the paper about the different types of sim co-training data we tested. There are a few different choices you could make, right? You could collect data in totally different environments. You could collect data in the same environment but on different tasks. Or you could collect data literally on the task you're planning to eval on. These are all reasonable choices.

It turns out they have different effects on the correlation values. Specifically, "in-domain" is what we called collecting data on the exact tasks you're going to eval on, and naively you might think that's the best thing to do: the closer the data is to what I want to eval, the better the model should be. But actually this is not the optimal thing to do, because what you get then is a model that overfits to this small sim dataset, which essentially tells it the solution to your test, since you're going to eval on those same tasks.

That makes all of your policies better, but it doesn't make your evaluation discriminative any longer; now all your policies do well, roughly speaking. Whereas if you have data that's only vaguely related, that tells your model how to bridge the distribution shift but doesn't give it the answer to the test you're going to pose, then you get better correlations in the end. So there's a nuanced point about how you want to choose this dataset: you actually don't want it to be too close to your test set, because then your test set loses some of its discriminability between different policies.

Yeah, I think this is what the plot shows. The purple plot here is what our final method ends up doing, which is to train on data that is not from the environment you're testing in. Training on other tasks in the same environment is also a good choice, actually in some sense a slightly better one. But training on the tasks you're literally going to evaluate on is somewhat worse. And one of the nice things is that having these unrelated tasks in your co-training dataset makes it much easier to evaluate a new task, rather than having to collect data in every new environment you want to evaluate in.

Okay, was there any other question? I think another interesting, higher-level point that we evaluated, maybe if you want to go to the main result figure, is that there are a lot of people who are quite excited about using video models. Is it on this plot? Maybe you need to go to the paper; it'll be clearer on this one. Yeah.

So basically, there are a lot of people who are excited about using video models to evaluate policies, which in some sense is an even easier way to get an evaluation environment, because the model just is the environment. For some context, what this means is that you still have your policy trained on real data, you have a video model that's trained on real data, and then you just loop them. Your policy looks at the outputs of the video model and produces the next action, you pass that into your video model, which predicts the next step, or you may want to call it a world model if it's action conditioned, and then you loop. So you basically get a rollout of your policy in the video model, and you can score it; at the moment people do this by hand.
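
For readers who want that loop spelled out, here is a minimal sketch with hypothetical `policy` and `video_model` interfaces (not any specific released model's API):

```python
# Hypothetical evaluation loop inside an action-conditioned video ("world") model:
# the policy acts on generated frames and the model predicts the next frame.
def rollout_in_video_model(policy, video_model, initial_frame, horizon=100):
    frames = [initial_frame]
    for _ in range(horizon):
        action = policy(frames[-1])                       # policy looks at the generated frame
        next_frame = video_model.predict(frames, action)  # model "executes" the action
        frames.append(next_frame)
    return frames  # the rollout video, currently scored by hand
```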

Then you can see whether the performance in that video model is indicative of real world performance. Again, an important part here is that we don't just care that policies do something reasonable; we actually want these evaluations to be correlated with our real world results. So we did quite a rigorous test of this in this project, where we tried the current best open-source DROID model that is available. We ran our policies through that model, and we found that they do something for sure, they're not just randomly waving the arm, but the actual correlation, when it comes down to it, is not nearly as strong as what you get if you use a more classic way of simulating the environment at the moment.

In some sense there's a bit of a continuum: we have the fully handcrafted sim, we have the fully learned world-model simulator, and Arhan's project is somewhere in the middle, where you have a bit of learned structure and visuals but the actual physics is still classic. At the moment that seems to give you the best trade-off between how easy it is to make and how good the correlation is that you end up with.

Can you talk a little more about how you set up the video model and why it would do worse? Is this basically just because of predictions getting worse over time? Is it just diverging, or what's the problem?

I think there are two parts to it. One is divergence over time. I'll talk about the divergence first: in some cases it even becomes hard to grade, where objects vanish or something, they vanish and then come back to where they started from. That's one part of it, and I guess it's not the only thing messing up correlations; mainly it just makes it hard to grade sometimes.

The other thing is that a lot of the trajectories in the training data are probably success trajectories. I don't know the exact data mix this particular world model was trained on, but there are so many more ways to fail than to succeed, so being able to collect enough failure data to model that is going to be hard. At least with the data these models are trained on right now, rollouts are more likely to gravitate towards success. That's kind of the intuition.

So even if the action was 2% off and that would have caused a failure, the point is that failures like that aren't well represented in the underlying video dataset, and so it's going to fail to show up. Is that the intuition? Yeah, I think so. And there's probably work going on right now on making these models more action-conditionable and more true to what the outcome would be. But it's also kind of hard to know what the deviation is, other than doing some sort of image similarity score, so that makes it hard to assess.

Yeah, I don't think the takeaway people should take from this discussion is that video model evals will never work, or something like that, or that this is the final solution. I actually think the ceiling of video model evals is much higher, right? If you think about what kinds of tasks you can actually test in a classic simulator like the one we have here: it's great for rigid objects, probably decent for articulated objects, but it's going to be a whole lot harder to, I don't know, simulate a cloth folding task, or simulate a wiping task with some liquid in there.

So you can try to solve these tasks classically, building out all the physics models to make them realistic and correlate well. Or you can take the data-driven route: make your video models better, and add rollout data to your video model training so it gets more accustomed to failures and to how policies may not be optimal. I think there is strong interest in that second, data-driven route, and I'm sure these models will get better.

So maybe the main takeaway for listeners from this discussion is not that these are always going to be bad. It's more that today, if you want an evaluation that's a good proxy, doing something closer to a classic sim, or a mixed approach like Arhan's project, is probably going to give you a better result than a fully data-driven video model. But maybe six months or a year from now we can re-evaluate. That's one of the nice things here: we have this offline dataset of real world performance, policies, and initial states, so once a new, better video model comes out, it's very easy to just run that offline evaluation through it. Right now Arhan needs to do a lot of grading when we have a lot of videos, so that's a bit of a pain, but if somebody figures out the grading aspect, then it will be a very easy test to run.

Yeah, and I think there's a lot of potential there, it's just not quite there today. I have a more philosophical question from this point: in robotics we see so many evaluations, so many benchmarks, so many people doing real evaluation and sim evaluation. Sure, there will be points where a few benchmarks emerge that people start to converge towards, but then they saturate; from what I last checked, the top numbers have been at like 98.5% for the last one and a half years. So what are your thoughts here? Is the problem finding the right evaluation, or getting people on board with the evaluation?

It's probably a bit of both, right? You need some agreement as a community on which evaluations are good to compare on, because if everybody makes their own and nobody compares, then we don't learn much from an evaluation. Finding the right evaluation, or the right tasks, is also an important problem. As you said, a lot of people agreed on using Libero as their evaluation suite, but now it's kind of losing its discriminability; everybody does well on Libero now, and it's a bit hard to say what is really going on.

So I think agreeing on benchmarks is good, but it's also good to develop benchmarks that are either designed to co-develop with the capabilities of the policies, or that are easy enough to make or extend that we can keep making them harder and harder as the policies get better. And one aspect specifically that I think benchmarks like Libero don't push on too hard is generalization.

Right. Usually, at least so far, most of the benchmarks people put out come with a training dataset, which is not something that happens in LLM land anymore. When people make evals in LLM land, they don't make training sets; they just make test sets, and then you test whether the off-the-shelf model generalizes. In robotics we haven't really done this, because so far nothing generalized widely enough for that to make much sense.

But I think one of the exciting aspects of this project is that you can actually test zero-shot generalization: we can test in environments we have never seen, with objects we have never seen, and the policies can do something. So now we're much more in a place where you can quickly develop different benchmarks that poke at different capabilities. In LLM land, people make benchmarks to test factuality, then benchmarks to test, I don't know, language translation, the ability to do coding, and so on. People very quickly come up with these different tasks, and you build up a suite of benchmarks over time that lets you holistically test what these models can do. We haven't had that in robotics because the models didn't used to generalize at all, so we always tested them on the train set.

Now I think we're getting to a point where it actually makes sense, and tools like this one are one way to get us there. People can just propose: how does this model do in very cluttered scenes? Let me quickly make a PolaRiS eval of very cluttered scenes, put it out, everybody can test it, and we'll know how well the next generation of models does in cluttered scenes. So I think we're at a good point now where we can actually get to a mode like this, but we will still need buy-in from the community; if nobody evaluates on it, there's not much point in making those benchmarks.

A follow-up question to this, maybe also to clarify for the audience about the zero-shot aspect: is the zero-shot capability empowered by the fact that you have a dataset like DROID, which is really great and beneficial for the community, or is it more empowered by the base model being pretty strong? And how do we move forward, get more people on board with this, and get more models running on this platform?

My current feeling is that there's a lot to be done on this kind of benchmark just using the DROID dataset. It usually helps to use a pre-trained model; that makes your life a little easier because it trains faster and it's a decent initialization. But I think there's a lot of interesting research to be done on the path from that pre-trained checkpoint to the actual thing you're going to evaluate. So far, all the models we're testing here are trained in a very vanilla way: one pass of data filtering was done and that's it, and that's the model we're putting out.

So I think there's a lot of potential to iterate on this, and I don't want people to think it's all about who can collect the largest pre-training dataset for their model and that's how you win this kind of eval. There's a lot of interesting research to be done from that pre-trained checkpoint to the actual thing being evaluated, if that answers the question.

Yeah, I think that answers it. So what do you think is next for this kind of benchmark? You're putting this out there, and people are able to come up with their own PolaRiS environments. How hard is it to actually specify a task? If I want to build a pyramid, or put a peg in a hole, how much work is it to add a new task in practice?

I mean, obviously it depends to some extent on the task, but the nice thing about the rigid body tasks is that typically you don't have to do anything crazy to define a sparse reward or success condition.

Yeah, in this case we didn't really do any articulated objects, but articulated objects are also not too bad. For most pick and place tasks, it's usually: is this bounding box inside, or overlapping to some extent with, this other bounding box? The pyramid example you gave is a little more unique, and that one would be a little harder because it's a chain of things and lots of complex conditions.
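
As a sketch of the kind of sparse success check being described, assuming objects are reduced to axis-aligned bounding boxes (an illustrative simplification, not the actual PolaRiS code):

```python
# Sparse pick-and-place success check on axis-aligned bounding boxes,
# each box given as (min_xyz, max_xyz).
def boxes_overlap(box_a, box_b):
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    return all(a_min[i] <= b_max[i] and b_min[i] <= a_max[i] for i in range(3))

def box_inside(inner, outer):
    (i_min, i_max), (o_min, o_max) = inner, outer
    return all(o_min[i] <= i_min[i] and i_max[i] <= o_max[i] for i in range(3))

# Example: "put the mug in the bin" succeeds if the mug's box ends up in the bin region.
mug = ((0.42, 0.10, 0.02), (0.48, 0.16, 0.10))
bin_region = ((0.40, 0.05, 0.00), (0.60, 0.25, 0.15))
success = box_inside(mug, bin_region) or boxes_overlap(mug, bin_region)
```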

So to be clear, what Arhan is describing is defining the success condition. There's a whole other part, actually scanning objects and so on, and I would say our pipeline makes that relatively easy compared to handcrafting scenes and objects and then doing all of the reward definition on top. We really tried to make all of this fairly easy. Basically, we try to make it easy for people to take existing assets and puzzle them into new tasks, which I think is one very good way of making new evals, or to scan in their own assets. If you want a certain object or a certain scene in there, you can scan it, sometimes as easily as taking a few pictures with your phone, or with a camera you have on your robot. Then there's a Gaussian splatting pipeline that will make a mesh for you; the mesh doesn't look very nice, but the visuals are much nicer. And then there's a little browser tool we built, so you don't even need to install anything. You just drop all your assets in there, puzzle them into the configuration you like, click save, and it saves out all the assets and their initial conditions. That's basically the eval. Then you just need to define a reward, the success condition; Arhan has a little bit of code where you can specify, okay, this object needs to be in this position and this object needs to be in this position, and that's already the eval, right? Then you can package that and upload it to Hugging Face, and now everybody can easily download and test it.
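
Purely as an illustration of what such a packaged eval might bundle together (this is not the actual PolaRiS file format; all names and fields here are made up), the saved artifact could look something like:

```python
# Hypothetical packaged task spec: scanned assets, their initial poses,
# a language instruction, and a sparse success condition.
task_spec = {
    "scene_splat": "assets/kitchen_counter.splat",  # Gaussian-splat background
    "objects": [
        {"mesh": "assets/mug.glb", "init_pose": [0.45, 0.13, 0.05, 0.0, 0.0, 0.0, 1.0]},
        {"mesh": "assets/bin.glb", "init_pose": [0.50, 0.15, 0.00, 0.0, 0.0, 0.0, 1.0]},
    ],
    "instruction": "put the mug in the bin",
    "success": {"type": "box_inside", "object": "mug", "region": "bin"},
}
```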

Yeah. So in this video on the right, I guess we're seeing you generating two objects, or adding two objects, getting the scaling right...
