a16z
August 16, 2025

Google DeepMind Lead Researchers on Genie 3 & the Future of World-Building

Google DeepMind lead researchers Shlomi Fruchter and Jack Parker-Holder break down their new interactive world model, Genie 3. They explore the model's surprisingly effective spatial memory, its role in accelerating embodied AI, and why interactive world generation is diverging from passive video.

The Magic of Real-Time World Generation

  • "The persistence part for me...was when I kind of sat up in my chair and I was like, ‘How did that happen?’”
  • "There is something magical about the real-time aspect...when it responds immediately."
  • Genie 3 is not just a video generator; it’s an interactive world model. Users can navigate and control a character in real-time within a generated environment, a key differentiator from passive video models.
  • The model exhibits "spatial memory," maintaining object persistence even when the camera looks away and back. This capability was a specific design goal, but its effectiveness was still "mind-blowing" to the research team.
  • The current model balances real-time performance with a one-minute memory limit, a deliberate trade-off to enable high-resolution, interactive world generation at speed.

Bridging the Gap to Reality

  • "From Genie 2 to 3, the real-world capabilities really increased. On the physics side, some of the water simulations...and lighting are really breathtaking. It's at the point where a human who is not an expert will watch it and think it looks real."
  • Genie 3 represents a massive leap in physical realism. It demonstrates an emergent understanding of how characters should interact with different terrains—like swimming in water or skiing downhill—properties that arise from the breadth of its training data rather than explicit programming.
  • Unlike its predecessors that relied on image prompts, Genie 3’s direct text-to-world generation grants far greater controllability. It can generate highly specific or even "arbitrary silly things" with impressive text adherence.

Unlocking Embodied AI and Robotics

  • "In robotics...we think with Genie 3 it's the best of both worlds, because you're taking a real-world data-driven approach, but then you've got the ability to learn in simulation."
  • The researchers see Genie 3 as the "fastest path" to capable embodied agents. It addresses a core bottleneck in robotics: the difficulty of collecting safe, diverse, and scalable training data in the physical world.
  • Genie 3 is designed as a composable "environment model." It can generate limitless interactive simulations for other agents (like Google’s SIMA) to learn from experience, bridging the critical "sim-to-real" gap.

Key Takeaways:

  • World Models Are a New Modality. Genie 3 is not just better video; it's an interactive environment generator. This divergence from passive, cinematic models like Veo signals a new frontier focused on agency and simulation, creating a distinct discipline within generative AI.
  • Simulation Is the Key to Embodied AI. The biggest hurdle for robotics is the lack of realistic training environments. Genie 3 tackles this "sim-to-real" gap head-on, providing a scalable way to train agents on infinite experiences before they ever touch physical hardware.
  • Emergent Properties Will Drive the Future. Capabilities like nuanced physics and terrain-aware agent behavior weren't explicitly programmed; they emerged from the scale and breadth of the training data. Even spatial memory, though a deliberate design goal, works without any explicit 3D representation. The next breakthroughs in world models will come from discovering these unexpected capabilities, not just refining existing ones.

For further insights, watch the full discussion here: Link

This episode unpacks Google DeepMind's Genie 3, revealing how real-time, interactive world models are creating a new frontier for AI agent training, robotics, and the very definition of digital environments.

Introduction to Genie 3: A New Class of World Model

  • The conversation begins with Jack and Shlomi, lead researchers at Google DeepMind, reflecting on the massive public reaction to Genie 3. They felt they were onto something significant with real-time environment generation, but the response exceeded their expectations. The model's key breakthrough is its interactivity, moving beyond passive video generation to create dynamic, controllable worlds.
  • Mark's Perspective: Mark, one of the hosts, highlights the model's game-changing features: spatial memory and frame-to-frame consistency, which together enable a truly interactive experience with generated video for the first time.
  • Core Innovation: The ability to generate and navigate environments in real-time from simple prompts is the central innovation that underpins all potential applications.

The Genesis of Genie 3: Combining Ambitious Goals

  • Jack explains that Genie 3 was born from the synthesis of several distinct but complementary research threads within Google DeepMind. The project aimed to combine the most ambitious elements from previous work into a single, powerful model.
  • Building on Past Work: Genie 3 integrates learnings from Genie 2 (which focused on 3D environments), Veo (a state-of-the-art cinematic video model), and GameNGen (known for its real-time interactive Doom simulation).
  • The Real-Time "Awe Moment": Shlomi emphasizes the magical feeling of real-time interaction. This was a core goal, pushing the model to the edge of what was considered possible to create an immediate, responsive experience for the user.
  • Quote from Shlomi: "There is something when it responds immediately that is really magical... we really wanted to push it to somewhere we weren't sure it's going to work."

Unlocking "Spatial Memory": A Planned Surprise

  • A standout feature that captivated users is Genie 3's "spatial memory": its ability to keep objects persistent even when they leave the frame. A compelling example discussed is a character painting a wall, moving away, and returning to find the original paint stroke still there.
  • Planned but Surprising: Jack clarifies that this was a primary design goal, not an accidental emergent property. The team aimed for "minute-plus memory" alongside real-time generation and high resolution, a set of conflicting technical challenges.
  • Technical Approach: Shlomi notes the team deliberately avoided using explicit 3D representations like NeRFs (Neural Radiance Fields) or Gaussian Splatting, which build a static 3D model of a scene. Instead, Genie 3 generates the world frame by frame, which they believe is key to its generalization capabilities.
  • Current Limitations: The model's memory is currently designed for around one minute, a trade-off made to balance performance with other capabilities; the sliding-context intuition behind this trade-off is sketched below.
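
To make that trade-off concrete, here is a minimal, purely illustrative sketch of frame-by-frame generation with a bounded context window. Everything in it (the WorldModel class, the generate_frame signature, the 24 fps figure) is an assumption for illustration rather than Genie 3's actual architecture or API; the point is only that frames older than the window fall out of context, which is why persistence is bounded rather than unlimited.

```python
# Illustrative sketch only: frame-by-frame world generation with a
# bounded memory window. Class names, signatures, and constants are
# assumptions, not Genie 3's published design.
from collections import deque

FPS = 24                       # assumed frame rate, for illustration
MEMORY_SECONDS = 60            # the roughly one-minute window described above
MAX_CONTEXT = FPS * MEMORY_SECONDS

class WorldModel:
    """Stand-in for a learned next-frame predictor."""
    def generate_frame(self, prompt, context_frames, action):
        # A real model would sample the next frame conditioned on the
        # text prompt, the recent frames, and the user's action.
        return {"prompt": prompt, "t": len(context_frames), "action": action}

def interactive_loop(model, prompt, get_user_action, num_steps):
    # Sliding window: once full, appending a new frame evicts the oldest,
    # so anything last seen more than MAX_CONTEXT frames ago can no
    # longer condition generation.
    context = deque(maxlen=MAX_CONTEXT)
    for _ in range(num_steps):
        action = get_user_action()                     # e.g. move/turn input
        frame = model.generate_frame(prompt, list(context), action)
        context.append(frame)                          # condition future frames
        yield frame                                    # render in real time

# Tiny usage example with a fixed action:
for frame in interactive_loop(WorldModel(), "a snowy mountain village",
                              get_user_action=lambda: "forward", num_steps=3):
    print(frame)
```

This also clarifies the contrast with NeRF- or splatting-style approaches: nothing in the loop builds a persistent geometric scene model, so consistency has to come from the learned predictor itself.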

Emergent Behaviors and Scaling Laws

  • The leap from Genie 2 to Genie 3 brought significant improvements in realism and world understanding, demonstrating clear scaling properties. The model now generates physics, lighting, and environmental interactions that are convincingly realistic to a non-expert observer.
  • Improved World Understanding: The model can infer logical actions, such as an agent opening a door it approaches. It also demonstrates a deeper understanding of physics, seen in realistic water simulations and lighting effects.
  • Quote from Jack: "On the physics side, some of the water simulations you can see, some of the lighting as well, are really breathtaking... it's at the point where like a human who is not an expert will watch it and think it looks real."
  • Agent-Environment Interaction: The model correctly interprets how agents should interact with different terrains. For example, a character running into a blue, wavy area will start swimming, and a skier will slow down when trying to go uphill. This is an emergent property resulting from the scale and breadth of the training data.

The Power of Text-to-World Generation

  • A major advancement from Genie 2 is the model's direct text-to-world capability, which provides far greater control and alignment than the previous image-prompting method. This allows users to describe highly specific or arbitrary scenes with remarkable accuracy.
  • From Image to Text: Genie 2 relied on an initial image to start the world, which created a "transfer issue." Genie 3's direct text adherence allows for more precise and imaginative world creation.
  • Leveraging Internal Expertise: Jack attributes this success to leveraging knowledge from other teams within Google DeepMind, particularly the Veo project, which excelled at text alignment. This collaborative environment allowed them to "turbocharge progress."

Genie 3 vs. Veo: Differentiated Modalities

  • Shlomi clarifies why Genie 3 is a distinct project from Veo, Google's high-fidelity video generation model. While they share some underlying technology, their goals and capabilities are fundamentally different.
  • Interactivity vs. Cinematic Quality: Genie 3 is designed for navigation and action within an environment, a capability Veo lacks. Conversely, Veo produces higher-quality, cinematic video and includes audio, which Genie 3 does not.
  • Strategic Divergence: The team believes that video generation and interactive world models may remain separate disciplines for the near future, each optimized for different use cases (e.g., filmmaking vs. agent training). The vast design space requires making trade-offs, and combining all features into one model is not necessarily the most effective next step.

The Roadmap for Future World Models

  • Looking ahead, the researchers are focused on building more capable models rather than targeting specific applications. The goal is to push the core technology forward, enabling others to discover novel uses.
  • Research-Driven Development: Shlomi explains that the primary driver is not a specific application but the challenge of achieving high quality, real-time generation, and controllability. Applications are expected to follow as the technology matures.
  • Future Directions: While not committing to a specific roadmap, Jack mentions that his personal motivation remains tied to AGI (Artificial General Intelligence) and embodied agents—AI systems that can interact with the physical world. He believes interactive world models are the fastest path to developing these agents.
  • The Path to True Simulation: Shlomi acknowledges that while impressive, the models are still far from accurately simulating the full complexity of reality. The ultimate goal is to create worlds that users can step into and experience immersively.

Genie 3's Role in Embodied AI and Robotics

  • The conversation highlights Genie 3's potential to solve one of the biggest challenges in robotics: the sim-to-real gap, which is the difficulty of transferring AI behaviors learned in simulation to the real world.
  • An Environment, Not an Agent: Jack frames Genie 3 as a general-purpose simulator for training other agents, like Google DeepMind's SIMA. It provides a rich, dynamic environment where agents can learn from experience through RL (Reinforcement Learning), similar to how AlphaGo learned by playing against itself.
  • Bridging Simulation and Reality: Traditional robotics simulations are often too simplistic, while collecting data in the real world is expensive, slow, and potentially unsafe. Genie 3 offers the best of both: a data-driven, realistic environment where agents can learn safely and efficiently (a minimal sketch of this setup follows this list).
  • Quote from Jack: "What we think with Genie 3 is it's the best of both, right? Because you're taking a real-world data-driven approach, but then you've got the ability to learn in simulation."
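
To illustrate "an environment, not an agent," the hedged sketch below wraps the stand-in WorldModel from the earlier example in a Gym-style reset/step interface so a policy can collect episodes of experience in prompted worlds. GeneratedWorldEnv, the zero-reward placeholder, and the rollout helper are all hypothetical names for illustration; neither Genie 3 nor SIMA exposes such a public API.

```python
# Hedged sketch: exposing a prompted world model through a Gym-style
# reset/step interface for agent training. All names and the reward
# logic are hypothetical, not a real Genie 3 or SIMA API.
class GeneratedWorldEnv:
    """Wraps a text-prompted world model as an RL environment."""
    def __init__(self, world_model, prompt):
        self.world_model = world_model   # e.g. the WorldModel stand-in above
        self.prompt = prompt
        self.history = []

    def reset(self):
        # Generate the opening frame of a fresh world as the first observation.
        self.history = [self.world_model.generate_frame(self.prompt, [], None)]
        return self.history[-1]

    def step(self, action):
        obs = self.world_model.generate_frame(self.prompt, self.history, action)
        self.history.append(obs)
        reward, done = 0.0, False        # a task-specific evaluator would
        return obs, reward, done, {}     # supply real rewards and termination

def rollout(env, policy, max_steps=100):
    # One episode of experience: the agent acts, the world model responds.
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        trajectory.append((obs, reward))
        if done:
            break
    return trajectory
```

Because a new text prompt yields a new world, the distribution of training environments is effectively unbounded, which is the scalable-data argument the researchers make for robotics.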

Future Access and the Development Curve

  • While the team is excited to get Genie 3 into the hands of more developers, there is no concrete timeline for a public release. The researchers believe we are still at the beginning of the development curve for world models.
  • Early but Compelling: Jack suggests that while current capabilities are already powerful, there is a "huge gap to close" to achieve true real-world fidelity and add new functionalities.
  • Potential for Breakthroughs: He compares the current state to language models before major architectural shifts unlocked new levels of performance, suggesting more breakthroughs are likely.

Conclusion

Genie 3 marks a strategic shift from passive content generation to active, interactive simulation. For investors and researchers, this signals an emerging market for AI training environments and embodied agent platforms, where scalable data generation for robotics and AGI becomes a primary value driver.
