
Author: Semi Doped | Date: October 2023
Quick Insight: Nvidia is moving the AI supercomputer from the hyperscaler cloud to your desk. This summary breaks down the Vera Rubin architecture and why local agentic compute is the next frontier for builders.
Austin Lyons and Vic Shaker analyze Nvidia’s CES presentation where the focus moved from gaming to physical AI and local supercomputing. The duo explores how the Vera Rubin platform transitions AI from a centralized cloud service to a distributed hardware reality.
"You can take the AI supercomputer and put it in your building."
"The more agentic the workload becomes, the more horsepower you need in CPU land."
"Treat storage as a first-class citizen."
Podcast Link: Click here to listen

Hello listeners and welcome to the third episode of the Semi Doped podcast. I am Austin Lyons with Chipstrat, and with me is Vic Shaker from Vic's newsletter.
All right. So, this week was CES and of course we had a big long presentation from Nvidia as always and we want to unpack that and just dive deep, give you our thoughts, some interesting strategy insights and technical insights.
I'm going to start with Vic, our technical guru. So, Vic, what stood out to you?
Hey, hey, Austin. Yeah, this CES was like, "Where's the consumer equipment here? There's nothing that we can use at home. Why is it a consumer electronics show? You're showing us all these Vera Rubin platforms that I can't put in my house. I can't even buy one of those Grace Blackwell chips or Vera Rubin chips. I mean, why are you showing this to me?"
Anyway, it was the first time in 5 years they haven't introduced any GPUs at CES. So that was interesting. However, what I took away from the whole thing was when Jensen brought on those cute robots, and I'm like, how much for one of those things? I just want one so it can wander around my house, super cute, you know?
They're like Jensen pets. Nvidia Envy pets, you know. I want some envy pets for myself.
Yeah. There there's consumer electronics for you.
I mean, and then he goes on to declare that this is the ChatGPT moment for physical AI. And I'm like, yes, give me a cute robot and I'm good to go.
Anyway, before we get into talking about all this enterprise Vera Rubin stuff, what did you see that was even remotely consumer-based?
So, something interesting that I've been noodling on, and I just wrote about this: the DGX Spark is within the consumer price range, or at least within reach of the home tinkerer. And for anyone who hasn't been following along super closely, you may remember Project Digits a year back or so, where Nvidia said, "Hey, we're going to take a server-grade, cloud-grade AI supercomputer like Grace Blackwell and we're going to bring it into desktop form so it can actually sit on your desk."
They had a naming contest and ended up naming it the DGX Spark, and it uses what they call the GB10 chip, Grace Blackwell 10. So it's definitely slimmed down from the GB100, but it is truly bringing Nvidia's server architecture to the desktop.
And so this could be thought of a little bit as consumer electronics, in that it may be the first time a true AI PC, really a desktop server, is actually within reach. So that's the closest I can get to a consumer electronics play.
What do you think a person who is a consumer, an enthusiast, would want an AI PC at home for?
In the CES presentation, Nvidia had a video that I thought was really awesome: a dev sitting down at his desk, hacking on a personal project. By the way, the first thing that stood out to me is that, if you look closely, he was on a MacBook, and the DGX Spark is sitting on his desk, plugged in.
Of course, it's air cooled and it's connected to the internet, so it's not like a huge liquid-cooled beast. But he was using his laptop as a thin client, and he was running a server, basically a little chat interface server, so he could have his own little local chatbot.
And then he set up a router where he could direct queries to different models, like text queries to a fast text model. But he also connected it to a robot, which of course has a camera, so he could ask questions about his environment, like, "Hey, where's my dog at? Is it on the couch?"
And what he did was set up this router on the DGX Spark that could say, "Oh, that's an environmental question that needs visual information. I'll send it to the robot, have it grab a picture, and then give that image to a vision language model, a VLM, running on the DGX Spark." And then it could answer the question: yes, your dog is on the couch, and he shouldn't be.
And then, of course, you could give it a command like, "Hey, tell my dog to get off the couch," right? So again, the DGX Spark could route that to the robot, and the robot could use its speaker to essentially talk to the dog.
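For readers who want to tinker, here's a minimal sketch of that routing idea in Python. This is not Nvidia's demo code; every function name here (`route_query`, `answer_with_robot_vlm`, and so on) is a hypothetical stand-in for the local models and robot hooks described above.

```python
# Minimal sketch of the DGX Spark "router" idea from the demo (hypothetical names,
# not Nvidia's actual code). A local server classifies each query and dispatches it
# either to a fast text model or to a robot-camera-plus-VLM path.

def needs_visual_context(query: str) -> bool:
    """Crude stand-in for a classifier: does the query ask about the physical environment?"""
    visual_cues = ("where", "see", "look", "couch", "room", "dog")
    return any(cue in query.lower() for cue in visual_cues)

def answer_with_text_llm(query: str) -> str:
    # Placeholder for a small, fast local text model served on the Spark.
    return f"[text-LLM answer to: {query!r}]"

def answer_with_robot_vlm(query: str) -> str:
    # Placeholder: ask the robot for a camera frame, then run a local VLM over it.
    image = "robot_camera_frame.jpg"          # robot grabs a picture
    return f"[VLM answer to: {query!r} using {image}]"

def route_query(query: str) -> str:
    """Send environmental questions to the robot+VLM path, everything else to the text LLM."""
    if needs_visual_context(query):
        return answer_with_robot_vlm(query)
    return answer_with_text_llm(query)

if __name__ == "__main__":
    print(route_query("Summarize my meeting notes"))
    print(route_query("Is the dog on the couch?"))
```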
And what I thought was really cool about this is, oh, this could be what the AI PC actually is: you're hacking around, creating really interesting things. Honestly, it reminds me of IoT, when everyone was like, "Hey, now I can make my lights turn on and my garage door close when I wake up or when I go to bed."
Um, but you can imagine it being even more interesting and even easier to program because you can just use language models to talk to it.
Anyway, I thought this was actually a really cool demo of where I could imagine people hacking at home again, whether it's IoT or other interesting things, or definitely having access to this at work and being able to say, I truly want to automate some of my workflow. How can I use a DGX Spark or something like that to help me do that?
So yeah, what did you think about the DGX Spark?
It was cool. I saw the pictures on X. There was a picture of this chip with liquid cooling in it and all that. It's not the server-grade thing, but it was a very sleek, black-looking thing. That was awesome.
And I was like, wait, what is the power consumption? Do you know any numbers for how much power this would draw at home? Because this is going to be a power-bill nightmare otherwise.
So, okay, good question. There are two different products here. There's the DGX Spark, which is this tiny mini-PC form factor, and then there's the DGX Station, which is what you're referring to. Dell had one, and there are pictures on X where the side panel is taken off; go to Michael Dell's account and you can find it. It literally looks like a desktop computer of old, but you can see copper cold plates in there and what appears to be tubing, so it might be liquid cooled. That product, by the way, has a GB300 chip in it.
There weren't many details. Nvidia's DGX Station is sort of a reference design; Dell took that and made the Dell Pro Max. And for listeners, this is your GB300 chip. So straight up, this is a server-grade Grace CPU and a Blackwell GPU with HBM. In Dell's instantiation, it was 784 GB of unified memory: 288 GB of HBM3 plus 496 GB of LPDDR5, tied together with NVLink chip-to-chip coherent memory.
You know, if you're a developer who used to have to do all of your "local" development by remoting into a cloud to access a GB300, now you literally can have that at your desk.
But I will say, to your question, Vic: if this is like 1,500 watts or something crazy of power dissipation (and by the way, the DGX Spark is an order of magnitude less, on the order of 140 watts), and if that's truly liquid cooled, I'm not sure how that's going to work at my desk at home.
You know, I looked it up while you were explaining this. I saw all the specs of this thing, and you hit the nail on the head. The system power for the DGX Station is actually 1,600 W.
So it is quite the beast, and I think it's aimed more toward an enterprise-level deployment, where you could have a smallish inference server running out of HBM and serving multiple users and things like that. Maybe it's not really meant for the home hobbyist; that's what the DGX Spark is for.
Yes. Okay, I was writing about this too, and I've been thinking about it, and I think you're exactly right. The DGX Spark is going to be on the order of $4,000 or something. That's definitely high-end hobbyist, or still totally useful in the enterprise.
The DGX Station, the way I'm framing and thinking about it, reminds me of the 1960s era, when people had big IBM mainframes that cost a million dollars plus. At the time, and remember this is the 1960s, there weren't that many enterprises who could actually afford it, but the ones who could, wow, you got access to frontier compute. That's amazing.
Then we started to see the rise of the minicomputer, which was 10x cheaper, mainframe-class compute, but now all of a sudden it's $100K. That actually expanded the total compute TAM, because suddenly large and midsize enterprises could buy one. These were still big, not mainframe big, but as the form factor shrunk over time and the cost came down with competition, you could literally have mainframe-class compute at your desk.
You could literally have individual teams with their own minicomputer instead of one big mainframe shared across the whole company. So when I've been thinking about the DGX Station, I've been thinking of that as the analogy: today, if you want a rack of AI compute, I hope you have $3 million plus, and maybe also a liquid-cooled environment to put it in, like hyperscalers and some very large customers have, but a lot of people don't.
So they're still stuck renting GPUs, or they're just locked out of having AI supercompute, but surely there's latent demand there. And when I think about the DGX Station, I think: you can literally get a Grace Blackwell, and instead of needing a whole rack of them, you can have one.
You can have one, and I don't know how much it's going to cost. Is it $50,000 for the Dell Pro Max? Is it $100,000? I don't know. But you can unleash the supercomputer from the hyperscaler cloud and actually put it in your building.
And even if it's liquid cooled, to the point that the form factor looks like a desktop you could put at your desk, maybe you're still actually putting it in your on-premises environment, where you already have some local compute racks with CPU servers or something. So maybe it's in a closet or a back room somewhere, but you're taking the AI supercomputer, unleashing it, and putting it in your own business.
So now all of a sudden, maybe midsize enterprises can afford one, or multiple. And to your point, I envision, why not run a bunch of local inference on it? I don't think of it as necessarily only a very beefy, very expensive single-dev ML researcher workstation.
I actually think these, you could call them AI appliances, local AI appliances, could sit in a room and serve all sorts of interesting inference for local use cases. Especially if you want to keep your own private data, or you don't want it accessing the outside world, or you just want control over it, why not run that locally?
And by the way, of course, I think CFOs would be super interested in this because now they're thinking opex of renting GPUs converts into capex. Like there's all sorts of interesting benefits here.
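To make that opex-versus-capex point concrete, here's a back-of-the-envelope comparison. Every number below is made up purely for illustration (the rental rate, the purchase price, the utilization); the point is the shape of the math a CFO would run, not the specific answer.

```python
# Back-of-the-envelope opex vs. capex comparison (all numbers hypothetical,
# just to illustrate the kind of math a CFO might run).

rental_rate_per_gpu_hour = 10.00     # assumed cloud price, $/GPU-hour
hours_per_month = 24 * 30
utilization = 0.5                    # fraction of the month the box is actually busy

monthly_opex = rental_rate_per_gpu_hour * hours_per_month * utilization

station_price = 75_000               # assumed DGX Station street price, $
breakeven_months = station_price / monthly_opex

print(f"Monthly rental spend: ${monthly_opex:,.0f}")
print(f"Break-even vs. buying: {breakeven_months:.1f} months (ignoring power, support, resale)")
```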
Yeah, absolutely. And that's the nice thing about mainframes becoming the computer on your desk: the phones we have today are an extension of that, because that mainframe-class computing power is now in our pockets.
So in that same analogy, maybe in the future we'll have phones that can run complete inference or watches that can run complete inference. Maybe that is the use case for edge AI and maybe we're not just there yet and it takes time to get there.
But you see how your argument, stretched to the extreme on the AI inference front, extends to edge AI devices in a very nice way.
Totally. And you know, in the interim, someone's going to ask, "Oh, okay. Well, the counterargument is like, oh, are you going to actually put HBM in your watch or your phone? That sounds pretty expensive." But I do think we could live in a world right now where maybe you don't need to because maybe if we have access to these like inference servers at work, I mean, maybe we can again use our laptop or our phone as a thin client.
And as they showed in this demo, which everyone should go watch, there's like a 2-minute video you can find on YouTube. And until it becomes like cost acceptable or the thermals get figured out and all those implications, maybe the AI that we run locally on our laptops and on our phones is still more of like the real time speech recognition, real time stuff.
And then maybe this heavier agentic stuff, you can still launch it from your phone or your laptop, but it's actually running on some plugged-in compute that just happens to be cheaper and locally accessible.
Yeah, that's awesome. Time will tell.
Yes, folks, check out the DGX Spark. Check out the DGX Station. Go read my article on Chipstrat. Noodle on it, because it is super interesting to think about this additional compute TAM from bringing the supercomputer to the desk.
And you might ask, is this going to cannibalize some cloud use cases? In the aggregate, I would say no. Of course, yes, there will be some, but I actually think there are a lot of interesting new workflows that could run locally that are just accretive to everything you're already going to do in the cloud.
Because think about it, the cloud is always going to be frontier compute and it's going to run the frontier model. And what we run at the edge is actually like the frontier model from 5 years ago as it has trickled all the way down. So to do the most interesting, most demanding stuff, long form video generation, whatever, you're surely going to still need racks of GPUs to do it.
But over time, we will continue to improve in the amount of AI that can be done locally. And so I think there's just new interesting use cases waiting to be unlocked.
Now, enough there. Let's move on.
So, consumer electronics. One other thing you could start to call consumer electronics would be the car, right? Is a car like a Rivian or a Tesla not just a computer on wheels?
So, I'm stretching here a little bit, but was there anything from Nvidia around autonomy or automobiles?
Yes, Alpamayo. So, Alpamayo is Nvidia's answer to fully self-driving vehicles, kind of like what you find on Tesla vehicles. It's what they describe as an open-source autonomous vehicle platform that uses something called a vision-language-action model, or VLA.
The way a vision-language-action model works is that it takes sensor inputs from the environment and processes them through what is essentially a language model, in the sense that it does reasoning, chain-of-thought reasoning (and I'll explain a little more about what that looks like), and then it takes action.
The action is basically in the form of trajectories, a list of possible actions it could take next. Those are the kinds of decisions it needs to make based on the vision inputs it gets from its sensors.
Right? So the VLA model is actually a 10-billion-parameter model. The way it works is that it takes all the sensor inputs and uses the LLM under the hood to think through the situation the way a human would. There were a couple of videos of this at the keynote that were really nice. It basically thinks out loud the way you and I might while driving, or the way we'd coach our kids once they're old enough to drive: hey, look out for the dog on the left.
Make sure that you are ready for when it runs across. Or hey, slow down. There's somebody crossing the road. Or accelerate because the light just turned yellow and it's safer to accelerate than slow down.
So, these are the instructions you would probably give in language to somebody you're teaching, or maybe even to yourself, if that's how you drive. That's what the VLA model does.
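To make the sense-reason-act loop a bit more concrete, here's a schematic sketch in Python. This is not Alpamayo's real API; the classes, method names, and numbers below are invented for illustration of the structure described above (sensor inputs in, chain-of-thought reasoning in the middle, a trajectory out).

```python
# Schematic of the vision-language-action loop described above (hypothetical names,
# not Nvidia's actual interfaces).

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SensorFrame:
    camera_images: list          # raw frames from the surround cameras
    speed_mps: float             # current vehicle speed

@dataclass
class Trajectory:
    waypoints: List[Tuple[float, float]]   # (x, y) points the planner proposes
    rationale: str                         # the chain-of-thought "why" behind the action

class ToyVLA:
    """Stand-in for a ~10B-parameter vision-language-action model."""

    def reason(self, frame: SensorFrame) -> str:
        # In the real model this is chain-of-thought over fused sensor tokens.
        return "Pedestrian near crosswalk on the right; prepare to slow down."

    def act(self, frame: SensorFrame, thought: str) -> Trajectory:
        # The action head emits a candidate trajectory conditioned on the reasoning.
        return Trajectory(waypoints=[(0.0, 0.0), (2.0, 0.1), (4.0, 0.2)], rationale=thought)

def control_step(model: ToyVLA, frame: SensorFrame) -> Trajectory:
    thought = model.reason(frame)       # "think out loud" about the scene
    return model.act(frame, thought)    # turn that reasoning into a drivable trajectory

traj = control_step(ToyVLA(), SensorFrame(camera_images=[], speed_mps=12.0))
print(traj.rationale, traj.waypoints)
```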
This whole thing is going to be tried out first on the Mercedes-Benz CLA and is apparently hitting the roads in Q1 2026. So if you see one, let me know. It's kind of cool.
In response to all of this, Elon said something like, "Oh, this is great." And actually, to Jensen's credit, he did acknowledge Tesla as having the most advanced driving model on the road today. So Elon was like, "Yes, this is good."
But getting to 99% is the easy part; the hard part is handling all the long tails of self-driving. If people aren't familiar with the phrase, the long tail is all the unknown unknowns: you just don't know what's going to go wrong on the road. There's a whole set of scenarios you simply cannot account for, and that's the hardest part of self-driving. Elon says that's the part that has taken the most time to get right, and that it's at least a few years away. So people are saying, okay, Alpamayo is still maybe a few years away from where Tesla's self-driving is today.
Yeah, there's definitely a ton we'll have to unpack here in its own episode. But I was definitely interested to hear about this Mercedes-Benz car hitting the road. Oh, and one other thing: I think it was in the CES presentation, Jensen said something like a quarter of the company is working on this, some large number, maybe 7,000 people, a very large number of people working on autonomous vehicles.
I'll have to go find the exact number, but it was a surprising amount of Nvidians, if you will, working on autonomous vehicles. So I think we'll only continue to hear more on this front, especially if Jensen thinks it's the ChatGPT era of physical AI, and self-driving is nothing but physical AI.
Yeah, absolutely. I think we'll hear a lot more about this going forward. And the interesting thing is that all of this is open source. I was reading the website, and they have this entire dataset of driving scenarios and vehicle dynamics, and all of this stuff is actually on GitHub.
So if you go to nvab/alpasim, you'll see there's a whole lot of stuff you can use to do your own work on autonomous driving, if you're into that kind of research. Not saying I could particularly do it, but it's amazing that they put all this out there for people to work on and improve.
Obviously, it's a pretty hard problem.
Yep. Get yourself a DGX Station, download this GitHub repo, and off you go.
Done. All right. Should we move on to the Vera Rubin platform even though it's not consumer electronics?
Yes, we should, because the bulk of the keynote was the Vera Rubin platform, and it was pretty amazing, honestly, I have to say. The Vera Rubin platform actually consists of six chips: the Vera CPU, the Rubin GPU, and a data processing unit, or DPU, which is now on its fourth generation of BlueField DPUs.
They also have an NVLink switch, now on version six, for scale-up networking; a NIC, the network interface card, for networking within and outside the rack and between the chips; and then the Spectrum-6 photonics switch, their first photonics Ethernet switch, for scale-out networking between racks.
So all six of these things make up the Vera Rubin platform, and the fundamental claim is that it provides a cost per token of one-tenth that of Grace Blackwell.
Let me quickly add one thing. What's interesting is that Nvidia used to say they made seven chips. They'd say, hey, we're not just a GPU company, we make seven chips and write software, of course. Now they're saying six chips, and the one that's missing is the InfiniBand switch.
Normally it was CPU, GPU, DPU, NIC, and then two switches: an Ethernet one, Spectrum-X, and an InfiniBand one. Here they specifically did not call out the InfiniBand switch, and I just found that interesting. I don't necessarily know that it means they're going to stop making InfiniBand, but I think it does show that their customers have all been opting for Ethernet-based switches for scale-out, because people are familiar with Ethernet, and so at a minimum, from a marketing perspective, they're going to focus the conversation there. So, just an interesting observation.
That's a great one actually because maybe this is a subtle hint of where things are going in the future.
Totally. Yeah. Let's talk about some of these chips. What do you think?
Yeah. I mean, let's dive in for people. We'll get a little technical here, but we can explain why it matters. What did you hear about Vera and Rubin, and how does that compare to Grace Blackwell?
Yeah, the Vera CPU actually has more cores than the Grace CPU. And interestingly, this is the first CPU from Nvidia that uses their custom Olympus cores, which have been designed in-house. It's Arm-compatible, but it's not an off-the-shelf Arm core like the previous-generation Grace CPU, which used 72 Arm Neoverse cores.
So there are 88 custom cores now versus the 72 Arm cores in Grace. That's one big difference. And the Olympus core supports simultaneous multithreading, which means it runs two threads per core; this is basically the multithreading we saw arrive on CPUs back in the 1990s and 2000s. So you now effectively have 176 threads running. The real jump is from 72 to 176, which is pretty significant when it comes to compute on the Vera CPU.
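As a quick sanity check on those numbers, the jump in available hardware threads works out like this (just arithmetic on the figures quoted above):

```python
# Thread-count arithmetic from the core counts discussed above.
grace_threads = 72 * 1          # 72 Arm Neoverse cores, one thread each
vera_threads = 88 * 2           # 88 custom Olympus cores with 2-way SMT
print(vera_threads, f"threads, ~{vera_threads / grace_threads:.1f}x Grace")   # 176, ~2.4x
```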
Interesting. Why does that matter for a GPU system? Because at the end of the day, all the matrix multiplications are happening on the GPU. So what's going on on the CPU, and why is even more simultaneous compute useful there?
So the CPU is used for a lot more than just feeding matrix multiplications, because in the era of AI agents, and 2026 is supposed to be all about agents, there's going to be a lot more CPU-based work. For example, if you give an agent a task, the agent has to decide which tool to invoke, how to collect data from that tool, how to feed it into the next tool, and so on.
It has to access a database if required, pull data, and decide when to send things to the GPU for matrix multiplication. There's all of this orchestration, plus compression, plus security, so you need to encrypt. All of these functions are handled by the CPU, and the more agentic the workload becomes, the more CPUs you're going to need and the more horsepower you need in CPU land. That's why this is important.
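Here's a toy sketch of what one agent step looks like when you split it along those lines. None of these function names come from a real framework; they're placeholders meant only to show which pieces are CPU work (tool routing, database I/O, compression, a stand-in for encryption) versus the GPU's forward pass.

```python
# Toy agent step illustrating which parts of an agentic workload land on the CPU
# versus the GPU. All function names are hypothetical placeholders.

import json, zlib, hashlib

def pick_tool(task: str) -> str:
    # CPU: decide which tool the agent should call next.
    return "sql_query" if "report" in task else "web_search"

def fetch_from_database(tool: str) -> dict:
    # CPU: I/O-bound work pulling context for the model.
    return {"tool": tool, "rows": [{"region": "EMEA", "revenue": 1.2e6}]}

def prepare_for_gpu(context: dict) -> bytes:
    # CPU: serialize and compress before shipping to the accelerator; the hash is a
    # crude stand-in for the real confidential-computing / encryption machinery.
    blob = zlib.compress(json.dumps(context).encode())
    tag = hashlib.sha256(blob).hexdigest()[:8]
    return blob + tag.encode()

def gpu_inference(payload: bytes) -> str:
    # GPU: the matrix-multiply-heavy part. Stubbed out here.
    return "[model output]"

def agent_step(task: str) -> str:
    tool = pick_tool(task)                 # CPU
    context = fetch_from_database(tool)    # CPU
    payload = prepare_for_gpu(context)     # CPU
    return gpu_inference(payload)          # GPU

print(agent_step("Build the quarterly revenue report"))
```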
Yes, that's exactly where my head was at. If you go way back, in maybe the Hopper era, the CPU was a bit more of a glorified memory controller. It was just about getting the right data preprocessed, shaped, and ready to keep the GPU fed. Then we started to see the rise of RAG.
And so it became, how do I get all my internal documents into a database so that, when we're querying a model that could be hallucinating, we can ground it with actual relevant context from that database? That started putting more onus on the CPUs. And to your point, as we get into an agentic era, there's going to be all sorts of orchestration and security, and you even have to ask where user management and all that stuff is going to live.
Is it going to live on that head-node CPU, or somewhere else? TBD on that, but I think your point is still right: with the rise of agents, there's going to be more and more demand put on CPUs. And of course, zooming way out, we're already talking about tons of GPU demand and not enough capacity from TSMC.
And we're already starting to see a rise in demand for CPUs. We've heard from Lisa Su and others that they're surprised by how much hyperscaler demand for CPUs has increased, and not just general cloud cores, but demand actually related to AI. I think we'll continue to see pressure on CPUs, because at the end of the day, agents are like having a hundred interns at your disposal.
If you hired interns at your company, what would you do? You'd give them a laptop, which has a CPU in it, and say, "Go do some work." This is conceptually the same thing. So we'll continue to see CPU-based compute TAM and demand increase.
So, I'll let you carry on, but I think that means, to your point, we need less of an off-the-shelf CPU and more of an ongoing evolution of these CPUs as the workloads and demands of the generative AI era increase.
Yeah, absolutely. And because of all the stuff you just explained, people are already warning about an upcoming CPU crunch. I'm like, where in the AI universe is there not a crunch? There's a crunch for lasers, a crunch for memory, a crunch for CPUs, GPUs, packaging.
And now SSDs too. Nobody should talk about shortages anymore. Everything is in short supply; it's not news. Okay? Everything's in short supply.
Totally. Anyway, now that you mention SSDs, we'll need to talk a bit about that stuff. But before that: this Vera CPU supports LPDDR, but it's in 12 SOCAMM slots, which means it's not soldered onto the board, so you can upgrade it.
So, if you throw 128 GB into each SOCAMM slot, you get about 1.5 TB of LPDDR memory that the CPU can use, which is awesome, because the last generation had 480 GB and it was soldered; you couldn't upgrade it or anything. That's a very big and important feature, because now you have so much capacity available on the CPU DRAM side.
So why is that so important? Here's the thing: when you're doing inference and you want a long context window, every token you generate keeps piling onto the KV cache, the key-value cache. And this key-value cache grows really fast; the longer your context window, the bigger it gets.
Now, HBM: we just spoke about how everything is in shortage, and HBM especially; you cannot get enough of it. We'll talk more about it, but the Rubin GPU has 288 GB of HBM4, and that's all you've got. That's not enough for the KV cache, which can grow really big if you have very long context. Right?
So the beauty of having DRAM in the form of LPDDR on the Vera CPU, up to a terabyte and a half of it, is that you can offload this KV cache to the DRAM. And you might think, wait, that's terrible, because we need to get it back quickly. But the chip-to-chip bandwidth between the CPU and GPU on Vera Rubin is 1.8 terabytes per second, right?
So that's 2x faster: 900 gigabytes per second was the last generation's bandwidth, and now you have double that. What that means is the GPU can very quickly offload its KV cache into DRAM on the LPDDR5, keep it there, and bring it back when it needs it. Now the question is what happens when the KV cache gets even bigger than the 1.5 terabytes you can store, and we'll get to that point later.
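To see why this matters, here's some rough KV-cache math. The model shape and batch size below are assumptions picked for illustration (a hypothetical 70B-class model with grouped-query attention), not any specific Nvidia spec; the 288 GB HBM, 12 x 128 GB SOCAMM, 900 GB/s, and 1.8 TB/s figures are the ones from the conversation.

```python
# Rough KV-cache sizing and offload-time math (model shape and batch numbers are
# assumptions for illustration only).

layers, kv_heads, head_dim = 80, 8, 128     # hypothetical 70B-class model with GQA
bytes_per_value = 2                         # fp16/bf16 cache entries
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V

context_tokens = 1_000_000                  # a long-context session
batch_users = 4
kv_total_gb = kv_bytes_per_token * context_tokens * batch_users / 1e9
print(f"KV cache: {kv_total_gb:.0f} GB")    # ~1,311 GB, far bigger than 288 GB of HBM

lpddr_capacity_gb = 12 * 128                # 12 SOCAMM slots x 128 GB = 1,536 GB
print(f"Fits in {lpddr_capacity_gb} GB of LPDDR:", kv_total_gb < lpddr_capacity_gb)

for name, bw_gb_per_s in [("Grace NVLink-C2C", 900), ("Vera NVLink-C2C", 1800)]:
    print(f"{name}: {kv_total_gb / bw_gb_per_s:.2f} s to move the whole cache")
```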
Yes. One additional feature the CPU adds is that it encrypts data and provides what is called confidential computing support. That's important because, like you were saying, unless you have your own DGX Station at home where you're confident nobody else is going to be on it, you're renting GPUs and renting hardware.
And if you're moving sensitive data around within this compute system, you need it to be encrypted; that provides security for applications like medical and military and things like that. That feature is introduced here for the first time. So you can see how the AI ecosystem is really getting polished for general enterprise use. Up to now it was all kind of raw, but now it's getting the sheen on top.
That's awesome. So yeah, we can go on to the Rubin GPU now, because it's kind of well known by this point. It's built on TSMC 3 nanometer, contains around 336 billion transistors, and is about twice reticle size. So it's a big chip, a really big chip. You can see Jensen holding it up; anywhere you find a picture of Jensen, he's holding up these GPUs.
And the interesting thing is that it holds 288 GB of HBM4 this time. HBM4 is important because of its extended bandwidth. But if you've been tracking these memory capacities, you'll be like, "Wait, wasn't the GB300 also 288 GB?" Yes, it was, but that was 288 GB of HBM3e, and the GB200 was more like 192 GB of HBM3e.
So the difference with HBM4 is that instead of 1,024 parallel I/O lanes for pulling data in and out of the HBM memory, HBM4 doubles the number of interface lines to 2,048, which allows a much wider, if individually slower, movement of data. That provides a total of something like 22 terabytes per second of bandwidth, which is fantastic, because you're cutting down the memory bandwidth problem, the memory wall issue, even further. You're breaking down that wall.
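Here's the stack-level arithmetic behind that wider interface. The per-pin data rates below are assumed round numbers (real parts vary by vendor and speed bin), so treat this as a feel for the scaling, not a spec sheet.

```python
# Stack-level bandwidth arithmetic for the wider HBM4 interface
# (per-pin data rates are assumed round numbers, not vendor specs).

def stack_bandwidth_gb_per_s(io_pins: int, gbit_per_pin: float) -> float:
    return io_pins * gbit_per_pin / 8          # bits/s across the bus -> bytes/s

hbm3 = stack_bandwidth_gb_per_s(io_pins=1024, gbit_per_pin=6.4)    # ~819 GB/s per stack
hbm4 = stack_bandwidth_gb_per_s(io_pins=2048, gbit_per_pin=8.0)    # ~2,048 GB/s per stack

print(f"HBM3 stack: ~{hbm3:.0f} GB/s, HBM4 stack: ~{hbm4:.0f} GB/s")
print(f"Package total scales with stack count, e.g. 8 stacks -> ~{8 * hbm4 / 1000:.0f} TB/s")
```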
So all of this is kind of the Rubin GPU, and it's supposed to deliver a multiple of what Blackwell can do on inference and 3x on training. Most interestingly, at least I found this interesting, Jensen explained something about NVFP4. Maybe this is well known and I just didn't know it, but the whole idea of this numeric format is that they dynamically decide, within the whole inference and compute system, which part of the compute should use which numeric precision.
Where precision matters, you use more of it; where it doesn't, you use less. What ends up happening is that inference gets much faster. So even without doubling the hardware, by doing things like this you can squeeze out something like 5x the inference throughput compared to Blackwell. And they're proposing that this should become the industry standard. I thought that was cool. What do you think?
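To give a feel for the "fewer bits where it doesn't hurt" idea, here's a toy block-scaled 4-bit quantizer in NumPy. This is not Nvidia's NVFP4 implementation: real NVFP4 uses an FP4 (E2M1) element format with hardware per-block scale factors, whereas this sketch just uses signed 4-bit integers with one float scale per 16-element block to show why a scheduler would keep the sensitive parts of the computation at higher precision.

```python
import numpy as np

# Toy block-scaled 4-bit quantization (illustrative only; not the NVFP4 format itself).

BLOCK = 16

def quantize_blocks(x: np.ndarray):
    x = x.reshape(-1, BLOCK)
    scales = np.abs(x).max(axis=1, keepdims=True) / 7.0     # signed 4-bit range is [-8, 7]
    scales = np.where(scales == 0, 1.0, scales)              # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.round(x / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s = quantize_blocks(w)
w_hat = dequantize_blocks(q, s)
# Error is small but nonzero, which is why precision-sensitive pieces of the compute
# would be kept at higher precision while the rest drops to 4-bit.
print(f"Mean abs error at 4-bit: {np.abs(w - w_hat).mean():.4f}")
```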
Yeah, it's super cool. I love these optimizations that come in later and say, okay, we made it beefy, now how do we make it faster? Let's use interesting optimizations.
Yeah, absolutely. We should go back quickly to what context memory storage means and why it's important for long-context inferencing, because that has created ripples in the industry. Although I'm not sure why it's such news, since we've always known we needed more storage. Okay, but I'll connect the dots and we'll see how this goes.
So