
by The TWIML AI Podcast with Sam Charrington
Date: October 2025
Quick Insight: This summary is for builders and investors tracking the transition from viral clips to industrial utility. It explains why the Sim-to-Real gap remains the final boss of physical AI.
Nikita Rudin, CEO of Flexion Robotics, argues that while hardware is accelerating, we are still in the net-negative value phase of robotics. The gap between a scripted video and a reliable warehouse worker is wider than most realize.
The Value Mirage "I think there is not a single humanoid robot today that actually generates value."
Modular Brains "The more pragmatic approach is to split the problem."
The Sim-to-Real Final Boss "Switching robots is fairly easy... bringing them to a new task is more challenging."

My hot take on that, and I'll be happy to be proven wrong, but I think there is not a single humanoid robot today that actually generates value. Meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not the exact task. So in the end it's not generating value, because it's not doing the actual thing it's supposed to do.
All right, everyone. Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Nikita Rudin. Nikita is co-founder and CEO of Flexion Robotics. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Nikita, welcome to the podcast.
Thank you. Very excited to be here.
I'm excited to have you on the show, and I'm looking forward to digging into our topic for the conversation, which is really the gap between where we are today with robotics and where we need to be to fulfill the vision of the technology. You've been working in this space for quite a while. You did your PhD at ETH Zurich and spent some time at NVIDIA. Why don't you share a little bit about your PhD and the focus of your research?
So when I started, we were trying to use simulation with reinforcement learning to teach legged robots very simple things, like just walking on flat ground. And when the robot could take a few steps, that was already a big success. And the core focus was to reduce the training time needed to achieve that.
And when you say legged, like a quadruped?
Exactly. Like a quadrupedal legged robot, a robot dog. We were not using Boston Dynamics' Spot. We were using the ANYbotics ANYmal. ANYbotics is a Swiss startup that was a spin-off from our lab. It's very similar to a Spot, but it's red and made in Switzerland. Yeah, we were really trying to reduce the training time needed to achieve that. Before I started, there were some results of reinforcement learning for such quadrupeds, but it would take weeks of computation to achieve anything, and using GPUs and massively parallel simulators we managed to reduce that to just a few minutes. So actually we had a demo on stage at some point at some conference where we were running training live on a laptop while I was holding the robot, and every 15 seconds the laptop would send the latest policy to the robot, and you could literally see how it went from just falling over to taking a first step, and then after three or four minutes it would be able to walk around the stage. That was a pretty cool visual demo for everyone to see exactly how the learning process happens. And from there, my PhD was about pushing the agility of that robot using similar techniques. So it was still training neural networks in simulation and then transferring them to the real world, but the inputs got more complicated, the tasks got more complicated. So by the end we could go to a search and rescue facility here in Switzerland. So you have to imagine collapsed buildings, a lot of mud, moss, gravel, big rocks, terrain that is very hard to navigate even for a human. And we would just tell the robot to go from point A to point B, and it would use its whole body. So it would use its knees to climb on top of big rocks and then jump over gaps, all autonomously, all end to end, using images and the state of the robot to plan its next actions.
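To make the massively parallel idea concrete, here is a minimal toy sketch, not the actual ANYmal or Isaac Gym pipeline (which used a full physics simulator and PPO): thousands of environments are held as one batched tensor, so every policy update sees a huge amount of experience. All dimensions, the fake "physics" step, and the REINFORCE-style update are invented for illustration.

```python
# Toy illustration of massively parallel RL: thousands of simulated "robots"
# are stepped as one batched tensor operation, so each gradient update sees
# a huge amount of experience.
import torch

num_envs, obs_dim, act_dim = 4096, 8, 2          # thousands of parallel envs
policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, act_dim)
)
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

state = torch.randn(num_envs, obs_dim)           # batched simulator state
for _ in range(100):                             # minutes of wall clock, not weeks
    mean = policy(state)                         # one forward pass for all envs
    dist = torch.distributions.Normal(mean, 0.1)
    action = dist.sample()
    # Fake batched "physics": the action nudges part of the state.
    state = state + 0.01 * torch.randn_like(state)
    state[:, :act_dim] += 0.05 * action
    reward = -state[:, 0].abs()                  # stand-in for a locomotion reward
    loss = -(dist.log_prob(action).sum(-1) * reward).mean()
    optim.zero_grad(); loss.backward(); optim.step()
```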
In telling that story about deploying this robot in a search and rescue context, and envisioning the demo, I've seen similar things: the robot dog maybe opening some doors, maybe that wasn't part of your demo, but I've seen similar demos of the robot dog climbing hills and crossing rubble. And I think those demos in many cases attempt to land the idea that, hey, flag in the ground, we're done here. Talk a little bit about the distance between what you were able to accomplish with that demo and what you think needs to be done to deploy one of these robot dogs in a real search and rescue scenario, for example.
It's a great question, and I had this debate so many times, even with my colleagues at NVIDIA and ETH. The general question is: is locomotion solved or not? And we already had this debate five years ago, when the robot could just barely walk on flat ground and people were saying, yeah, locomotion is solved, we don't need to focus on it anymore. My take was always: no, until the robot can really go anywhere a human can go, and you don't even need to think about whether it can do it or how reliable it is, locomotion is not solved. So where we are today is that anything that is blind can be very, very robust. Blind meaning that the robot does not perceive its terrain, so it's just reacting. That makes the training much easier, because you just have to throw a lot of things at it and it always tries to be stable. Especially for a quadruped, it's very easy to remain stable. For example, it can even walk upstairs: it will just hit the first step, realize there is something, and then climb up. Doing it with perception, now you want it to actually react. You don't want it to hit the first step. You want it to place the feet much more carefully. That's harder. Today, at the end of 2025, I'd say we can do that. We can have fairly good policies that plan the sequence of actions based on the perceptive inputs. Because we're training everything in simulation, this means that we need to be much more careful in how we simulate those sensors as well, to put a lot of effort into creating complicated terrains that can be seen by the cameras, and also all sorts of noise models to simulate all the disturbances and defects of those images.
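As an illustration of what such a noise model for a simulated depth camera can look like, here is a small sketch. The corruption types (Gaussian noise, dropout holes, limited range, quantization) are loosely inspired by real depth sensors, but every parameter value is made up; this is not the actual sensor model used at ETH or Flexion.

```python
# Illustrative depth-image corruption for sim-to-real: the clean depth map a
# simulator renders is degraded with noise patterns loosely inspired by real
# stereo/ToF cameras. Parameter values here are made up for the sketch.
import numpy as np

def corrupt_depth(depth: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    noisy = depth + rng.normal(0.0, 0.01, depth.shape)   # sensor noise (meters)
    holes = rng.random(depth.shape) < 0.02               # random dropout pixels
    noisy[holes] = 0.0                                    # report "no return"
    noisy[depth > 4.0] = 0.0                              # limited sensing range
    return np.round(noisy / 0.005) * 0.005                # 5 mm quantization

rng = np.random.default_rng(0)
clean = rng.uniform(0.3, 6.0, size=(64, 64))              # fake rendered depth map
noisy = corrupt_depth(clean, rng)
print(noisy.min(), noisy.max())
```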
So, what I'm hearing you say is that the addition of the extra information that you would think would help actually makes it more difficult, because it introduces a lot of noise. And whereas previously the robot could kind of stumble through the terrain, now it's trying to incorporate this visual input to plan, but it ends up making it harder.
Yeah. What actually happens when it's trying to do this? Do you see it stuttering, or does it just not work? Is it just hard to train from a model perspective? Like, what happens?
So in the end, what happens is that the final behavior is better if you do everything right. But that's a big if. The typical thing we refer to is the so-called sim-to-real gap. If you're training things in simulation and deploy them in real life, it's not the same, and things that work really well in simulation might not work at all in real life. That sim-to-real gap is much larger once you have perception in the loop. Once you're simulating either depth images or RGB images, it's even worse. So that makes the job of the researcher, of the engineer, harder. You need to cross that sim-to-real gap for both the physics of the robot and the perceptive inputs. You have to simulate the sensors carefully. But if you do it right, then the final behavior is actually much better, because you can see that the robot is not simply reacting to whatever is happening under its feet, but actually planning accordingly in advance.
Is this problem of locomotion solved then, with vision, at least if we take quadrupeds as an example? Or are there still outstanding issues, like a generality gap? How do you think about it?
It's interesting. It really depends where you draw the line on locomotion. Once the robot can cross complicated terrain, the next step goes more into navigation, which means: where should it go? Should it climb that thing or should it avoid it? For now, everything I've described so far was mostly using geometry, so there are no semantics. If you give the robot a point A and a point B, it's going to take the straight line and kind of plow through whatever is between here and there, as opposed to thinking about which way to go, more or less. If there is a huge wall, it might avoid it, but mostly it will try to climb on whatever is in front of it.
Okay. Now if you imagine the robot walking behind me in the office, you don't really want it to climb on every single desk. You want it to avoid some things but also walk on others, right? So if there are stairs, you want to take the stairs, but you don't want to hit plants or whatever else, which means that suddenly you have to add semantics to the policy. And what does it mean to add semantics to the policy? Once again, if you go from simulation to reality, it makes the sim-to-real gap bigger, because suddenly you have to simulate all these offices, all these different objects. Plus, you probably need to give it images, not just depth images; you need to give it RGB images, which means you need to simulate it in a photorealistic way. Or the other option, which is probably the more correct one in the short term, is to split the problem. So you train the robot to be very good at walking on anything, but you don't give it semantic information, and you train another thing on top, which you can call the planner, or a higher-level policy, that will steer it around.
Yeah, historically, in the conversations I've had with roboticists, this has been a big debate: whether we should be using end-to-end deep learning models that can figure all this stuff out, or a more modular approach. It sounds like what you're saying is that a more modular approach can still be a pragmatic way to overcome the challenges of end-to-end training.
Yes. It's interesting, because my whole PhD was about going more and more end to end, specifically for locomotion, and still I'm here arguing that we should not do everything end to end. I think at some point we'll get there, but in the short and medium term, as you said, the more pragmatic approach is to split the problem and use different techniques for different parts of the problem.
We're splitting the problem into kind of a locomotion model and a planner model. What's the objective for the planner? The English-language objective is: I want this thing to intelligently choose the best path. But what is the best path? How do you define that? How do you create an objective around that? Maybe it's the path that's most power-efficient, if you're on a robot like Spot, or maybe it's the path that gets you there faster, or with the least distance. How do you balance all that?
There are different ways to do this. If you still choose the RL route, you can train these planners with reinforcement learning, and typically you have to define this reward function. It would include things like: don't hit anything, so avoid objects, and don't move too fast, because that's one thing that makes robots seem very dangerous. These reinforcement learning policies will try to optimize everything, so they will go very quickly to the goal. That is not really what you want with a robot that operates around humans; typically we're actually trying to slow them down as much as possible. And that's mostly it. Then it depends really on what is in front of the robot. There might be things that it's okay to walk on, others that it's not.
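A hedged sketch of what such a planner reward could look like, with the terms mentioned here: reward progress toward the goal, penalize getting close to obstacles, and penalize speed so the robot stays slow around people. The weights and thresholds are invented, not Flexion's actual reward.

```python
# Toy reward for a high-level navigation planner: progress toward the goal,
# a penalty for getting close to obstacles, and a speed penalty.
import numpy as np

def planner_reward(pos, prev_pos, goal, obstacles, velocity, safe_dist=0.5):
    progress = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    nearest = min(np.linalg.norm(pos - o) for o in obstacles)
    collision_pen = max(0.0, safe_dist - nearest)          # only near obstacles
    speed_pen = max(0.0, np.linalg.norm(velocity) - 0.5)   # cap around 0.5 m/s
    return 2.0 * progress - 5.0 * collision_pen - 1.0 * speed_pen

r = planner_reward(np.array([1.0, 0.0]), np.array([1.2, 0.0]),
                   goal=np.array([0.0, 0.0]),
                   obstacles=[np.array([0.6, 0.3])],
                   velocity=np.array([0.4, 0.0]))
print(r)
```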
There is also another approach. Once you split the problem in half, you could train the locomotion with pure reinforcement learning and simulation, but you could train the planner with other data, for example videos of humans walking around the office, and you can extract the trajectories from that. And then you don't really need reinforcement learning anymore. You train a behavior cloning, imitation learning policy that will just steer the robot the way a human would walk.
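And a corresponding behavior-cloning sketch: once (observation, velocity command) pairs have been extracted from human trajectories, training the planner is plain supervised regression. The tensors below are random placeholders standing in for the extracted trajectory data.

```python
# Minimal behavior-cloning sketch: fit a steering policy to imitate
# (observation, velocity command) pairs extracted from human trajectories.
import torch

obs = torch.randn(10_000, 32)        # e.g. local occupancy + goal direction
cmd = torch.randn(10_000, 2)         # (forward speed, turn rate) labels
policy = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 2))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(20):
    loss = torch.nn.functional.mse_loss(policy(obs), cmd)
    optim.zero_grad(); loss.backward(); optim.step()
print(loss.item())
```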
And thus far, speaking about how a human would walk, we've been talking primarily about quadrupeds. How much of all of this translates from quadrupeds to humanoid robots?
All of it transfers. That is the magic of reinforcement learning: these policies don't really care if they're controlling a quadruped or a humanoid. And this was the big switch from my PhD to Flexion, where we're working mostly on humanoids. We've seen that the exact same techniques transfer. There is one interesting thing that happens with humanoids.
That makes sense to me for a planner, but it's less intuitive for a locomotion model. And maybe the locomotion model itself that I'm thinking of is split into multiple components. But let me be more specific. I'm including in locomotion the outputs that control stepper motors and all that kind of stuff. Is that part of what is trained in the locomotion model? And then I would think that you would need to at least tune it or do something else if you're going to change the form factor of your robot.
No, this is a very good point. When I say it transfers, the general techniques transfer; the models themselves don't. So for sure you need to retrain a new policy, a new controller, for the new robot. But if all your simulation pipelines are general enough, it can be as easy as changing the input file, the URDF that describes the robot, retraining it, and then you're ready to deploy. The tuning part is an interesting one, because it is a little bit harder for humanoids compared to quadrupeds. And I don't think it's really related to the fact that it has two or four legs or anything like that. Personally, I think it's mostly related to the fact that we have very specific expectations of how a humanoid robot should walk. Whereas a quadruped, if it walks slightly differently from a dog, that's completely fine. But a humanoid, if it doesn't move the arms in the right way, if it bends the knees too much or walks a little bit sideways, humans have a very strong reaction to that.
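A small sketch of the "change the input file" idea: a training pipeline can read joint names and effort/velocity limits straight out of the URDF, so pointing it at a new robot's description reconfigures the action space before retraining. The inline URDF snippet and joint names below are invented for illustration.

```python
# Sketch: derive an action-space spec directly from a robot's URDF, so
# swapping robots means swapping the description file. The snippet below
# is a made-up toy URDF, not a real robot description.
import xml.etree.ElementTree as ET

URDF = """
<robot name="toy_humanoid">
  <joint name="left_knee" type="revolute">
    <limit effort="120.0" velocity="15.0" lower="-2.6" upper="0.0"/>
  </joint>
  <joint name="right_knee" type="revolute">
    <limit effort="120.0" velocity="15.0" lower="-2.6" upper="0.0"/>
  </joint>
</robot>
"""

def action_space_from_urdf(urdf_xml: str):
    robot = ET.fromstring(urdf_xml)
    spec = {}
    for joint in robot.findall("joint"):
        limit = joint.find("limit")
        if joint.get("type") != "fixed" and limit is not None:
            spec[joint.get("name")] = {
                "effort": float(limit.get("effort")),
                "velocity": float(limit.get("velocity")),
            }
    return spec

print(action_space_from_urdf(URDF))
```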
I saw a tweet touching on the same idea just the other day. It was essentially this humanoid robot that was locomoting like a quadruped, like it was on its back with its arms like this, moving really quickly. And the main thrust of the tweet was that humanoid robots move like humanoid robots because that's our expectation, but that, even with that form factor, there are potentially other, more efficient ways for these things to move; they just seem wrong to us.
Yeah, that's true. But if we want these robots operating around humans, we have to create some trust. So we have to take the less optimal route if it makes humans feel a bit more comfortable.
We've been working on robots for a really long time, but it seems like over the past year we're seeing advancements, via video demos, coming very quickly. I kind of asked this question before, but I want to have you parse through how to think about these videos. We were seeing humanoid robots walking at the beginning of the year, and now they're running. Now they're doing dishes and all these things. What goes into creating a demo like that, and what are the limitations of what it says about what the robot is capable of?
First, I want to say that it's really exciting that the whole ecosystem is moving so quickly. It feels like every single day there is a new video of a new robot doing something, and that's really good. We see a lot of progress both in hardware and on the AI side, in what these robots are capable of doing. Having said that, typically when I personally see a demo of a robot doing something, my approach is to think about what is the absolute easiest way to achieve that. And typically, that is how it's done. So as an ecosystem, we're showing a vision of what robots should be doing, but behind the scenes it's sometimes a little bit different.
For example, if you have a robot standing and doing some sort of manipulation on a table, folding sheets or something like that, typically one of two things is true. Either there is someone hiding behind the robot, behind a curtain, somewhere in another room, teleoperating the robot, so it's not really autonomous. That's, let's say, one-third of the cases. And the other two-thirds are where the robot is actually autonomous. But to get there, 100 people had to teleoperate 100 robots to collect a lot of data, hours and hours, in some cases thousands of hours, of robots doing very similar things in probably the same environment. They collect data, train a policy that can imitate that data, and then they're able to deploy robots autonomously. Which is not exactly what you would see when you watch that video, because it feels like the robot can just adapt to anything and come to your home and do everything you're doing. On that front, we're just not there yet, but we'll get there soon.
Remind me the name of the company that started taking pre-orders for a humanoid robot that is, you know, ready for the home, quote unquote.
You're referring to 1X.
1X. I looked at that, and I've had a lot of conversations along these lines about where robots are. And looking at that, I question: okay, are we really a lot further along than I think we are, or is something else happening here, where the early buyers are going to be beta testers and it may or may not work once it gets to the house? Do you have any takes, not on that specific company necessarily, but on the readiness of humanoid robots for the home?
Also, 1X is not the only one. There are a few more who announced similar things. I mean, in a way it's good. It's very, very ambitious to sell robots next year into people's homes. I think they have a big challenge ahead of them, so let's see how far they can get next year. But usually these companies are also fairly honest that it is a beta of a beta, an alpha program. It will be just for early adopters. It will take a few more years before you can really buy these robots and send them into homes. That's partially why, as a company, we're focusing more on industrial use cases. There are a lot of other challenges with industrial use cases, because now suddenly performance is really important. You need to be very fast. You cannot slow everything down, at least in most cases. But you have a little bit more control over what's happening. So for example, if we want to deploy 10 robots in a new warehouse, it's easier for us to send an engineer for one or two days to check that everything is in order, that they operate as they should, and if they don't, fine-tune a few things and then let the robots work, which is not something you can do in everyone's home.
And are the tasks in the industrial setting, I'm imagining, more repetitive, more consistent, with less variation than, you know, run to the fridge and grab me a Coke?
Yeah, that's right. And we get to decide which tasks we tackle and which ones we leave for the future. So we can start with simpler things. A lot of it is moving objects around: bringing objects from point A to point B, moving boxes, opening boxes, taking items out, or putting objects into boxes, putting the box in a truck, sending it further. And this seems really within reach for next year, maybe the year after that.
But even though you might have seen a video of a robot doing that, that doesn't mean that it's ready now and people are doing it now, without it being in a development phase. Is that fair?
Yeah, that's fair. My hot take on that, and I'll be happy to be proven wrong, but I think there is not a single humanoid robot today that actually generates value. Meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not the exact task. So in the end, it's not generating value, because it's not doing the actual thing it's supposed to do.
Meaning it's doing some variant of the thing, or there's a handler that's fixing things up, you know, cleaning up after the robot as it makes a mess across the...
Exactly. Yeah. And typically you would have more handlers than you had people before. You could argue the value is negative. But once again, we'll fix that.
Yeah. We talked about how you create these demos, and the idea that there's either real-time teleoperation or many, many people doing teleoperation to collect training data. Talk a little bit about, after that training data is collected via teleoperation, what the approach is for training. Is that data then used as part of RL, or is it more a supervised learning type of approach?
Typically it's supervised. So you record the data, which is the images from the camera and the commands that the teleoperator sent to the robot, which typically is how you should move your hands in space and how you should move your fingers. That is recorded, and then a big transformer is trained to produce the same actions from the same images. Now, what's interesting is the whole field shifted a little bit, from just training these transformers from scratch to using vision and language encoders that were pre-trained on internet-scale data, like off-the-shelf VLMs. You take a VLM, you remove the output, you train a new part of the network on top, and then you call that a VLA. So it's a vision-language-action model, where the vision-language part was pre-trained before and the action part is trained from scratch.
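Schematically, a VLA of the kind described can be sketched like this: a pre-trained vision-language backbone is kept frozen (stubbed here with small random modules, where a real system would load an off-the-shelf VLM), and only a new action head is trained to imitate teleoperation data. The module sizes, the 14-dimensional action, and the stub itself are assumptions for the sketch.

```python
# Schematic VLA: frozen "pre-trained" vision-language backbone + a new action
# head trained from scratch on teleoperation data (all tensors are fake).
import torch
import torch.nn as nn

class StubVLMBackbone(nn.Module):              # stands in for a pre-trained VLM
    def __init__(self, embed_dim=256):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, embed_dim)   # toy "image encoder"
        self.text = nn.Embedding(1000, embed_dim)         # toy "language encoder"
    def forward(self, image, tokens):
        return self.vision(image.flatten(1)) + self.text(tokens).mean(1)

class ActionHead(nn.Module):                   # the part trained from scratch
    def __init__(self, embed_dim=256, act_dim=14):        # e.g. arm joints + gripper
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 256), nn.GELU(),
                                 nn.Linear(256, act_dim))
    def forward(self, features):
        return self.net(features)

backbone, head = StubVLMBackbone(), ActionHead()
for p in backbone.parameters():
    p.requires_grad_(False)                    # "pre-trained" part stays frozen
optim = torch.optim.Adam(head.parameters(), lr=1e-4)

images = torch.randn(64, 3, 32, 32)            # fake teleop batch
tokens = torch.randint(0, 1000, (64, 8))       # fake instruction tokens
teleop_actions = torch.randn(64, 14)           # what the operator commanded
pred = head(backbone(images, tokens))
loss = nn.functional.mse_loss(pred, teleop_actions)
optim.zero_grad(); loss.backward(); optim.step()
print(loss.item())
```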
Got it. Got it. So as opposed to predicting the next language token, you're now predicting an action token, which is then translated into a separate motor motion or something like that.
Yeah, exactly.
Kind of compare and contrast that approach with what folks were doing before. Are we doing that because it's cool, or how much does having a pre-trained model to start with save us compared to the generic transformers?
It's a very good question. The general thought is that it helps with generalization: since the language and vision encoders were trained on internet-scale data, they're supposed to generalize. A typical case was that if you don't do that, you would train a robot during the day, and then if the lights go down at night, it won't be able to perform anymore. Even though there's still light, everything should just work, and a human wouldn't even see the difference, because the image embedding changes a bit, the policy doesn't perform anymore. I believe that this gets better with the pre-trained encoders. To be completely honest, the generalization capabilities overall still need to be proven. I think Russ Tedrake gave an amazing talk at Stanford where they were talking about their efforts at Toyota Research Institute, comparing training policies on a very specific task with little data versus training more generalist policies with a lot of data, and they were seeing some signs of generalization. I don't want to quote him directly, but it seemed like it's not fully understood yet how much of the generalization is coming from that.
Now map those two, little data and lots of data, to the transformer-versus-VLM discussion. The VLM would be the little data and the transformer would be the lot of data, because we're assuming the VLM was pre-trained? Or is it reversed?
No, it's actually reversed. If you include the pre-training as data that you get for free, then you have a massive amount of data for pre-training, and then you can add less data for fine-tuning this action head.
I guess it doesn't matter which one is which, because the results were somewhat inconclusive, is what I'm hearing.
We need to go into more detail, and I'm quoting other people here, so it's a bit hard to elaborate. I think it makes a lot of sense to have pre-trained visual encoders and language encoders, because you don't want to relearn language every single time you want the robot to do something. Language is language, and by the way, we have amazing VLMs now, so we might as well use them. There's more of a question about this action head: should you train it on a lot of random data, or just on the things that you want the robot to do in the end? And this is still an open question.
One question that raises for me: I just had a conversation where we were talking about how VLMs generally kind of ignore a lot of the visual information and rely more heavily on the language information. And it seems like in a robotic scenario, that would be even more harmful to what you're trying to accomplish. Do you run into that as a challenge?
I've heard the same thing. I haven't seen it in the VLA case. I would guess it's because the robot cannot ignore the visual input; it's the main source of information. What tends to happen, however, is that they ignore the language inputs. If you train the robot to always do the same thing, I don't know, if you have a box with an object inside and it always has to take it out, it will completely ignore the language. It will just do the same thing. It will try to guess from the image what it's supposed to do.
You mentioned the sim-to-real gap. This has been a known issue for many, many years, and we have been making good progress on closing that gap. Talk a little bit about your experience: what is required today to create a model in sim and have it run in real? Do you have to do things explicitly or specifically to address the real world, or is it just that the models are better, the process is better, and you don't really think about that anymore, it just kind of works?
You need to do a lot of things very explicitly. The challenge is that to cross the sim-to-real gap, you need to have a very deep understanding of both worlds: of the simulation and how it works, and of the real world. Which means that if you want to have a robot that walks around as it should, in sim you need to go very deep. You need to know exactly what's happening between a command that the policy outputs and then all the way down to torque in the motors. And there are typically ten different layers of transformations, even just in software, of how we go from a high-level command to actual current in the motors. It's very tempting to ignore that. But by understanding every single layer and knowing all the different transformations, then you can properly simulate it, and this really unlocks better performance.
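One typical layer in that command-to-torque chain for learned locomotion controllers is a PD loop that turns the policy's joint-position targets into motor torques, together with a command delay and a torque clamp. A minimal sketch, with made-up gains, limits, and a one-step delay:

```python
# One layer of the command-to-torque chain: policy outputs joint-position
# targets, a PD loop (with a command delay and torque clamp) produces torques.
# Gains, limits, and the one-control-step delay are invented for the sketch.
import numpy as np

KP, KD, TORQUE_LIMIT = 40.0, 1.0, 80.0       # made-up gains and limit (N·m)

def pd_actuator(target_q, q, dq, delayed_target):
    """Return clipped torques; `delayed_target` models a one-step command lag."""
    torque = KP * (delayed_target - q) - KD * dq
    return np.clip(torque, -TORQUE_LIMIT, TORQUE_LIMIT), target_q  # next delayed cmd

q, dq = np.zeros(12), np.zeros(12)            # 12 joints, e.g. a quadruped
delayed = np.zeros(12)
target = 0.3 * np.ones(12)                    # policy output: position targets (rad)
tau, delayed = pd_actuator(target, q, dq, delayed)
print(tau[:3])
```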
So that suggests the level of simulation that you're doing isn't like: you pull up your sim environment, get a generic humanoid robot, train some model, and deploy it to something else. It's like you have a digital twin of your humanoid robot in a high-fidelity simulation environment, and you're training to a very fine level of detail, which sounds very computationally expensive.
No. We're much closer to what you described first. We have a very generic simulation environment, but there are some very specific things that are important. One clear example is the torque and velocity limits of a motor: you cannot expect it to do something that is not possible on the real robot. So you need to add those limits, and there are a few more things like that, like what kind of delay you can expect after a command. So you need to identify a few of those parameters, and we usually do that in what we call a real-to-sim process. We take the real robot, we hang it in the air, we let it shake a little bit, collect data from all the different motors, and then we know which are those important effects that we need to identify. We identify them and add them to the simulator. But simulation speed is the most important thing, so you cannot afford to simulate all those different effects, currents, magnetic fields, etc. You need to abstract all of it away.
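A toy version of one such identification step: from logged command and measured-response traces (synthetic here), estimate the command-to-response delay by cross-correlation, so the same delay can be baked into the simulator. The signals, rates, and the 20 ms "true" delay are all invented.

```python
# Toy real-to-sim identification: estimate the command-to-response delay from
# shake-test logs (synthetic here) via cross-correlation.
import numpy as np

dt, true_delay_steps = 0.005, 4                 # 200 Hz log, 20 ms true delay
t = np.arange(0, 2.0, dt)
command = np.sin(2 * np.pi * 3.0 * t)           # excitation signal sent to a motor
measured = np.roll(command, true_delay_steps) + 0.05 * np.random.randn(t.size)

lags = np.arange(0, 20)
scores = [np.dot(command[:-lag or None], measured[lag:]) for lag in lags]
estimated = lags[int(np.argmax(scores))]
print(f"estimated delay: {estimated * dt * 1000:.1f} ms")
```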
That sounded very hard and expensive. And my personal take is that you still need to understand those effects, even though you're not simulating them.
Got it. So it sounds then like the result of that process is not a general model that you could deploy to any humanoid robot, but one that is specific to the humanoid robot for which you did the real-to-sim and identified those key parameters. But because you're able to abstract it down to a handful, or several handfuls, of key parameters, it's relatively easy to do new robots.
Yeah. And this was also the surprising part, one of the key learnings of this year: switching robots is fairly easy, as long as the hardware performs reasonably well. So now, as a company, we work with a few different suppliers of robots, and a few different partners as well, with whom we're working closely. We've deployed controllers on, let's say, between five and ten different robots. And we see now that making a new robot walk is basically a few days of work, and it should be less; it should be less than one day of work if we optimize some of our processes. Now, bringing them to a new task, this is a bit more challenging. It requires more engineering today, and this is what we're focusing on. So one of our key metrics is how much human effort is involved in bringing a new robot to a new task. A new robot is very easy today. A new task is something we're working on.
And in this context, we've kind of talked a little bit about this, but how specific is a task? Meaning, is a task pick-and-place, or is a task: robot in this warehouse picking off of this line and placing into these bins?
And that's a great question. More like pick-and-place. But there's an interesting concept there, which is that we are trying to leverage the information contained in large VLMs to orchestrate and break down complex tasks into clear subtasks, even though that's not what we're focusing on today. Cooking is a great example, a great metaphor. If you wanted to train the robot to cook every single meal on the planet and you said each meal is its own task, you would never finish, right? The set of tasks is huge. But what you could do is give the recipe to a VLM. You also give it images of what the robot sees. If the recipe says cut a cucumber, the VLM would say, you know, grab the knife, grab the cucumber, and do this sort of motion to cut it. And then you can break it down into much simpler primitives, like cutting things, holding a pan, putting it down somewhere, filling a glass of water, pouring it, things like that. And suddenly the set of these primitives is not infinite anymore. The challenge is that now we need a higher-level intelligence that will orchestrate all these primitives. But what's interesting is that that part is basically solved with a VLM. It's not 100% there, but it's moving much faster than the actual physical interaction of doing all these motions.
So can you elaborate on that? The orchestration is solved? And if so, how is it solved, and what's the relationship between that orchestration and what we talk about as the reasoning capabilities of these large models? Is it the same thing, or related?
It's similar. Maybe one way to describe this: on our website we have two videos. We have a video of a robot walking in a forest and picking up trash. This is mostly there to showcase what's possible, and also to play a little bit on our Swiss angle, using our nature. We have another video where the robot is doing the same thing in our office. And in that second video, it's 100% autonomous. You give it a text prompt; I think we're saying something like, pick up the toys in front of you and drop them in the basket at the end. And we're using an off-the-shelf VLM for that. The way this works is we're giving the images to the VLM, and we're allowing it to call specific skills of the robot. So the VLM would say, "Oh, I see a toy there on the ground. Let's walk to the toy." And this "walk to the toy" is a skill that is actually triggered and executed by the robot. Once we're there, it will trigger "pick up the toy," then "go to the basket," and drop it off. And so by having a few of those primitives, walking to things (and the walking, the locomotion, as we discussed, is itself fairly complicated, so it can handle stairs and a bunch of different complex terrain), picking things up from the ground, and dropping them somewhere else, we can recombine them in many different ways without any retraining, just by prompting an off-the-shelf VLM.
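In pseudocode terms, that orchestration loop can look like the sketch below. The `ask_vlm` function is a stub standing in for an off-the-shelf VLM with function calling, and the skill names and the whole API are invented for illustration; they are not Flexion's actual interface.

```python
# Sketch of VLM-as-orchestrator: a high-level model repeatedly picks one of a
# few low-level robot skills until the task is done. `ask_vlm` is a stub; a
# real system would send the prompt and image to a VLM and parse its tool call.
from typing import Callable

SKILLS: dict[str, Callable[[str], None]] = {
    "walk_to": lambda arg: print(f"[skill] walking to {arg}"),
    "pick_up": lambda arg: print(f"[skill] picking up {arg}"),
    "drop_in": lambda arg: print(f"[skill] dropping into {arg}"),
}

def ask_vlm(prompt: str, image, step: int):
    """Stubbed planner: scripted toy sequence instead of a real VLM call."""
    plan = [("walk_to", "toy"), ("pick_up", "toy"),
            ("walk_to", "basket"), ("drop_in", "basket")]
    return plan[step] if step < len(plan) else None

prompt = "Pick up the toys in front of you and drop them in the basket."
step = 0
while (call := ask_vlm(prompt, image=None, step=step)) is not None:
    skill, arg = call
    SKILLS[skill](arg)          # trigger the corresponding low-level policy
    step += 1
```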
When I hear tool use, I hear separate process or module or model. Is that the case? Then, if you've got a bunch of these tools, a bunch of these separate models or modules, does this architecture imply that they are in fact separate and trained separately, or are they more universal somehow?
That's another great question. In our case, specifically today, they are separate, but we are actively working on merging them together into one single, more general model. And the hope with that, again, is that you see some generalization, some interpolation between those different models. The way we would do that is actually by still training those different primitives separately, and then using them as data generators to collect a massive amount of data in simulation, to then train one of those larger VLAs across the whole dataset, such that it can perform everything. We're seeing early results of that in our company. Things are going in that direction, but we still need to prove that this actually leads to the generalization we were talking about before.
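A minimal sketch of that distillation idea, under toy assumptions: each separately trained primitive is stood in for by a fixed linear "expert" that generates state-action data for its task in simulation, and a single task-conditioned student is then trained by supervised learning on the merged dataset.

```python
# Toy distillation: expert primitives (stubbed as fixed linear maps) generate
# per-task data; one task-conditioned student is trained on the merged set.
import torch

torch.manual_seed(0)
obs_dim, act_dim, n_tasks = 16, 6, 3
experts = [torch.randn(obs_dim, act_dim) for _ in range(n_tasks)]   # stand-ins

obs, task_ids, actions = [], [], []
for task, W in enumerate(experts):              # "roll out" each expert in sim
    o = torch.randn(5000, obs_dim)
    obs.append(o); actions.append(o @ W)
    task_ids.append(torch.full((5000,), task))
obs, actions = torch.cat(obs), torch.cat(actions)
task_onehot = torch.nn.functional.one_hot(torch.cat(task_ids), n_tasks).float()

student = torch.nn.Sequential(torch.nn.Linear(obs_dim + n_tasks, 128),
                              torch.nn.ReLU(), torch.nn.Linear(128, act_dim))
optim = torch.optim.Adam(student.parameters(), lr=1e-3)
inputs = torch.cat([obs, task_onehot], dim=1)
for _ in range(50):
    loss = torch.nn.functional.mse_loss(student(inputs), actions)
    optim.zero_grad(); loss.backward(); optim.step()
print(loss.item())
```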
And hearing you describe that, often when I hear these kinds of student-teacher approaches, I think of distillation and trying to get to smaller models. You're...