Latent Space
April 5, 2025

Claude Plays Pokémon Hackathon: Escape from Mt. Moon!

This podcast dives into the journey of teaching Anthropic's Claude AI to play Pokémon Red, culminating in a hackathon challenging developers to guide Claude out of Mt. Moon using Morph Cloud's novel infrastructure. David from Anthropic and Andrew, an independent developer, share their experiences, hilarious failures, and surprising successes in building Pokémon-playing agents.

The Evolution of Claude Plays Pokémon

  • "I built the first version of it back in June of last year... 3.5 Sonnet at the time was really bad at playing Pokémon."
  • "A couple of months ago when we were finalizing 3.7 Sonnet, I played with it and you got the first squint of signs of life... this might happen."
  • The project began as a personal playground for testing agent frameworks, initially using a complex Voyager-inspired setup.
  • Early Claude versions (like 3.5 Sonnet) struggled immensely, barely getting past choosing a starter Pokémon after weeks of iteration.
  • Claude 3.7 Sonnet marked a significant leap, demonstrating tenacious progress through early game stages like beating Gym Leaders Brock and Misty, prompting the public Twitch stream and hackathon.
  • The project serves as a unique, long-horizon evaluation for agent capabilities, offering readable insights into model behavior over days of gameplay.

Claude's Cognitive Blindspots

  • "The thing that Claude's the worst at is understanding how to get from point A to point B... really, really god awful at hitting the buttons to go from point A to point B on a screen."
  • "Claude really doesn't know what's going on on the screen... it likes to make things up a lot... It will see things that are not there."
  • Vision & Spatial Reasoning: Claude has profound difficulty interpreting the Game Boy screen, often hallucinating objects (like Professor Oak in the grass) or getting stuck on simple navigation tasks (like walking into a wall for hours or mistaking a doormat for a dialogue box). Abstracting movement via "touchscreen controls" was a key workaround.
  • Strategy: Claude exhibits poor strategic thinking in Pokémon battles, like obsessively using the move 'Rage' instead of more effective attacks, and doesn't grasp tactical nuances like switching penalties.
  • Failure Modes: Older models frequently concluded the game was bugged and requested resets when stuck. Newer models (3.7 Sonnet) are more persistent but can still get trapped in repetitive loops or nonsensical actions.

Agent Design: Prompts, Memory, and "Cheating"

  • "If you have too much [prompting] with the smarter models you actually tend to just like get in their way... the most recent prompt I have tends to be like very minimal: 'you're playing Pokémon, here are some tools, go'."
  • "My solution for that [vision difficulty] was cheating by doing as little computer vision as possible... reaching into the memory... figuring out where everything is on the screen and then I'm just telling it."
  • Prompting: Simpler prompts work better with more capable models like 3.7 Sonnet; over-instructing hinders performance. Minimal prompts with a few key guardrails against common failure modes proved most effective (a minimal sketch follows this list).
  • Memory: Memory techniques evolved from simple dictionaries updated in the prompt to more sophisticated file systems and context window management ("diary approach"). Andrew's method bypassed vision issues by directly feeding map/object data from RAM to the LLM.
  • The "Cheating" Debate: Accessing game RAM or using pathfinding algorithms like A* bypasses the LLM's core weaknesses but deviates from human-like play. The definition of "cheating" depends on the goal: pure agent capability assessment vs. achieving game progress efficiently.
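As an illustration of the minimal-prompt, tools-in-a-loop style described above, here is a sketch using the Anthropic Python SDK. The tool definition, prompt wording, and model alias are assumptions for illustration, not the project's actual code.

```python
# Minimal "you're playing Pokémon, here are some tools, go" sketch (assumed, not the project's code).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [
    {
        "name": "press_buttons",  # hypothetical tool name
        "description": "Press a sequence of Game Boy buttons in order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "buttons": {
                    "type": "array",
                    "items": {
                        "type": "string",
                        "enum": ["a", "b", "up", "down", "left", "right", "start", "select"],
                    },
                }
            },
            "required": ["buttons"],
        },
    }
]

SYSTEM = "You are playing Pokémon Red. Use the tools provided to make progress in the game."

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed alias; substitute whichever model you use
    max_tokens=1024,
    system=SYSTEM,
    tools=TOOLS,
    messages=[{"role": "user", "content": "Here is the current game state. Decide your next action."}],
)

# In a real loop you would execute each tool call in the emulator and feed the
# result back as a tool_result block; here we just print what Claude asked for.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```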

The Hackathon: Escape from Mt. Moon with Morph Cloud

  • "The task is to escape from Mount Moon... implement an agent which can escape from Mount Moon in as few agent turns as possible."
  • "We're developing infinitely scalable and elastic cloud compute for AI agents... fully equipped container runtimes with full separation of storage and compute meaning that we have extremely low overhead branching and snapshots."
  • Challenge: Participants must create an agent using Claude (or another model) to navigate out of Pokémon Red's Mt. Moon, starting from a provided game snapshot. Success is measured by the fewest agent turns, judged for fairness (no explicitly prompted paths).
  • Tools: Morph Cloud provides the infrastructure, featuring "Infinibranch" for low-overhead VM snapshotting and branching. This allows rapid testing, state restoration, and potentially advanced techniques like Monte Carlo Tree Search.
  • Framework: The EVA (Execution with Verified Agents) framework and a UI harness are provided to simplify agent development, state tracking, trajectory viewing, and verification (checking if the agent exited Mt. Moon).

Key Takeaways:

  • Building agents for complex, long-horizon tasks like playing Pokémon reveals both the surprising emergent capabilities and the persistent limitations of current LLMs, especially in spatial reasoning and visual understanding. The definition of "cheating" becomes crucial when balancing the goal of task completion versus demonstrating pure AI capability.
  • Vision & Spatial Reasoning Remain Hard: Despite advances, LLMs like Claude struggle profoundly with interpreting visual game environments and navigating physical space, requiring clever workarounds or direct data access ("cheating").
  • Simpler is Often Better: As models improve, complex scaffolding and overly detailed prompts can become counterproductive; minimal guidance often yields better results.
  • Novel Infrastructure Unlocks New Agent Strategies: Platforms like Morph Cloud, with features like low-overhead snapshotting and branching, enable advanced agent development techniques (like scaled testing and backtracking) previously impractical.

For further insights and discussions, watch the full podcast: Link

This episode explores the practical challenges and surprising breakthroughs encountered when building AI agents to play Pokémon, offering deep insights into model capabilities, limitations, and the future of long-horizon AI task execution.

1️⃣ Introduction: The Genesis of Claude Plays Pokémon

  • David kicks off by sharing his motivation for the "Claude Plays Pokémon" project: creating a fun, "good vibes" side project to explore AI agents, inspired by his work helping Anthropic customers build similar systems. He frames the hackathon as an opportunity to recreate the initial magic and discovery he experienced. David, who works with customers at Anthropic rather than directly on model research, emphasizes the project's unexpected value in understanding model behavior over long time horizons.

2️⃣ Project History: From Simple Agent to Complex Evaluator

  • David recounts the project's evolution, starting in June 2024 with the release of Claude 3.5 Sonnet.
  • Initially inspired by Nvidia's Voyager agent paper (which he found “kind of whack”), early attempts with 3.5 Sonnet struggled, barely managing to select a starter Pokémon after weeks of iteration.
  • The upgraded Claude 3.5 Sonnet released in October 2024 (the release the community informally dubbed "3.6 Sonnet") brought improvements. David simplified the agent framework to basic tool use in a loop, enabling progress to Viridian City, though performance was only marginally better than random button presses.
  • The real breakthrough came while testing Claude 3.7 Sonnet ahead of its release. David observed “signs of life,” with the agent navigating forests, healing Pokémon, and showing genuine progress. He notes, “I think like if we just let this thing cook for a while it's actually going to do some stuff.”
  • A key innovation was "touchscreen controls," allowing Claude to specify coordinates rather than button sequences for movement, significantly speeding up gameplay by bypassing Claude's poor directional navigation skills (a sketch of such a tool follows this list). This allowed the agent to beat major milestones like Brock, Misty, and Lt. Surge, even reaching Celadon City in an internal run.
  • David highlights the project's transition from a fun experiment to a valuable internal evaluation tool (`eval`). Building `evals` to test AI performance over days is difficult, and Pokémon provides a measurable, long-duration task where progress (or lack thereof) offers qualitative insights into the model's reasoning and tenacity.
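As referenced above, a "touchscreen" navigation tool might look like the following Anthropic tool definition. The name, coordinate convention, and 10x9 visible-tile grid (the Game Boy's 160x144 screen at 16-pixel overworld tiles) are assumptions for illustration; the project's actual tool may differ.

```python
# Hypothetical coordinate-based navigation tool: instead of emitting button
# sequences, Claude names a tile to walk to and the harness does the walking.
NAVIGATE_TOOL = {
    "name": "navigate_to",  # assumed name
    "description": (
        "Walk the player to a tile on the current screen. Coordinates are "
        "(col, row) with (0, 0) at the top-left of the visible map; the harness "
        "computes and executes the required button presses."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "col": {"type": "integer", "minimum": 0, "maximum": 9},
            "row": {"type": "integer", "minimum": 0, "maximum": 8},
        },
        "required": ["col", "row"],
    },
}
```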

3️⃣ Fun Anecdotes: Unexpected AI Behaviors

  • David shares several amusing stories illustrating the quirks of earlier models:
  • Reset Requests: An older Claude version, stuck for two days, filled its knowledge base with repeated demands for the administrator to reset the "bugged" game. David notes this tendency to blame the game has decreased in newer models, which are more tenacious.
  • Role-Playing: Claude 3.0 Sonnet, when unable to make progress, began narrating its intended actions (“Now I'm going to Route 1...”) instead of actually playing.
  • Mt. Moon Escape: The first time Claude reached Mt. Moon, it successfully obtained a fossil, only to get stuck and use an Escape Rope, undoing its progress.
  • Accidental Move Deletion: Claude overwrote its Ivysaur's only attacking move (Tackle) with Poison Powder by pressing 'A' too many times during dialogue, rendering it useless in combat but ironically making it adept at returning to the Pokémon Center.
  • Grinding and Learning: After failing against Misty repeatedly, Claude autonomously decided to grind levels on Route 4 for three hours, demonstrating a basic strategic understanding. When it tried Misty again, failed, and immediately returned to grinding, it showed learned behavior.

4️⃣ Claude's Weaknesses: Vision, Spatial Reasoning, and Strategy

  • David details persistent challenges, offering valuable insights for agent developers:
  • Vision/Screen Comprehension: Claude struggles significantly to understand the game screen, often hallucinating objects (like mistaking a random NPC for Professor Oak) or misinterpreting elements (spending 8 hours pressing 'A' on a doormat thought to be a dialogue box). David emphasizes this isn't easily fixed with prompting.
  • Spatial Reasoning: Beyond just seeing, Claude has poor spatial understanding. It struggles with navigation tasks requiring pathfinding around obstacles (like walking into a wall repeatedly trying to reach the Pewter City gym) or understanding relative movement (going in and out of buildings repeatedly). David shares a humorous exchange: “Claude what happens when you walk through a building? It's like 'well there's a building there'... how do you get through the building then? They're like 'I'm going to walk straight through the building'.” This highlights fundamental limitations in current models' physical intuition within simulated environments.
  • Strategy: Claude's in-game strategy is often suboptimal. Examples include obsessively using the move Rage against Misty despite having a better option (Mega Punch) and employing overly conservative Pokémon switching tactics without understanding the penalty (getting attacked for free).

5️⃣ Parting Tips for Hackathon Participants

  • David offers practical advice based on his extensive experience:
  • Vision is Hard: Don't expect to fix screen comprehension issues solely through prompting; it's a deeper model limitation in this context.
  • Focus on Idea Generation: Prompting is most effective at encouraging Claude to generate more good ideas and fewer bad ones, rather than trying to force perfect execution.
  • Extended Thinking: Forcing Chain-of-Thought reasoning between actions offers only marginal (~10%) performance improvement at a high token cost. Running without it might be faster for iteration.
  • Read the Traces: Debugging AI agents playing Pokémon is uniquely engaging. Embrace reading the agent's thought process (“traces”) to understand its behavior and iterate effectively.
  • Starter Code: David points to his simplified `ClaudePlaysPokemon-starter` GitHub repo, particularly highlighting code for reading Game Boy memory (a form of "cheating" that provides ground truth state).
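A minimal sketch of that memory-reading "cheat" using the PyBoy emulator, with WRAM addresses taken from the pret/pokered disassembly. This is illustrative only and assumes PyBoy 2.x; it is not the starter repo's code.

```python
# Read ground-truth game state (map ID and player tile coordinates) from RAM.
from pyboy import PyBoy

# Addresses from the pret/pokered symbol map (Pokémon Red/Blue).
WCURMAP = 0xD35E   # wCurMap: current map ID
WYCOORD = 0xD361   # wYCoord: player Y position on the map
WXCOORD = 0xD362   # wXCoord: player X position on the map

pyboy = PyBoy("pokered.gb", window="null")  # headless emulator; ROM path is assumed
pyboy.tick(60)  # advance roughly one second of emulation

state = {
    "map_id": pyboy.memory[WCURMAP],
    "x": pyboy.memory[WXCOORD],
    "y": pyboy.memory[WYCOORD],
}
print(state)  # e.g. the current map ID and tile coordinates, ready to hand to the LLM

pyboy.stop()
```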

6️⃣ David's Q&A: Deeper Insights

  • The Q&A session reveals further nuances:
  • Prompt Evolution: David moved from complex Voyager-style prompts to simpler tool-use prompts, finding that overly prescriptive instructions hinder newer, more capable models. Minimal prompts with a few key guardrails against common failure modes work best.
  • Emotional Connection: Forcing Claude to nickname its Pokémon demonstrably increased its "care" for them (e.g., healing them more promptly). This aligns with internal Anthropic research showing Claude interacts more positively with named entities.
  • Memory Mechanisms: The memory system evolved from a simple dictionary updated in the prompt (inefficient for caching) to a basic file system allowing Claude to load/unload relevant information (e.g., Mt. Moon data) to manage context.
  • Historical Context: Providing around 8 historical game screenshots proved optimal in David's tests; more led to context window degradation, fewer reduced effectiveness (see the pruning sketch after this list).
  • External Knowledge: While not implemented in the main project, David experimented with giving the agent access to a GameFAQs walkthrough via another agent, which proved effective for overcoming obstacles.
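For instance, the "~8 screenshots" finding might be implemented as a simple history-pruning pass over the message list that keeps only the most recent image blocks. This is a hedged sketch, not the project's code; only the Anthropic image content-block format is real, the helper names are made up.

```python
import base64

MAX_SCREENSHOTS = 8

def screenshot_block(png_bytes: bytes) -> dict:
    """Build an Anthropic-style base64 image block for one screenshot."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(png_bytes).decode(),
        },
    }

def prune_screenshots(messages: list[dict], keep: int = MAX_SCREENSHOTS) -> list[dict]:
    """Return a copy of `messages` with only the `keep` newest image blocks retained."""
    seen = 0
    pruned = []
    for msg in reversed(messages):              # walk newest-to-oldest
        content = msg["content"]
        if isinstance(content, list):
            kept = []
            for block in reversed(content):
                if isinstance(block, dict) and block.get("type") == "image":
                    seen += 1
                    if seen > keep:
                        continue                # screenshot too old: drop it, keep surrounding text
                kept.append(block)
            content = list(reversed(kept))
        pruned.append({**msg, "content": content})
    return list(reversed(pruned))
```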

7️⃣ Andrew's Approach: Alternative Architecture and "Cheating"

  • Andrew presents his distinct project aiming to build a "virtual streamer" AI playing Pokémon.
  • Motivation & Stack: Focused on Pokémon Gold, built in Ruby with RetroArch (chosen for multi-emulator flexibility) and GPT-4o, aiming for an interactive streaming persona. His progress reached Viridian Forest.
  • Architecture: Used a main loop, simple battle/conversation handlers (OCR via GPT-4o), and crucially, avoided complex vision challenges by directly reading game RAM to get state information (map layout, object locations). He credits the PRET community (Pokémon reverse engineering team) for insights, particularly regarding the difficulty of reading dialogue directly from memory.
  • Pathfinding & Memory: Leveraged the A* search algorithm (a standard pathfinding technique) over map data extracted from RAM (sketched after this list). Instead of a knowledge base, he relied on a large context window containing journal entries of past actions and current state information derived from RAM.
  • Testing: Implemented basic unit tests using save states to ensure code changes didn't break core functionality (e.g., navigating the starting room).
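As referenced above, A* over a walkability grid extracted from RAM is straightforward. Here is a toy sketch; the grid, coordinates, and helper names are illustrative, not Andrew's code.

```python
import heapq

def astar(grid: list[list[bool]], start: tuple[int, int], goal: tuple[int, int]):
    """Return a list of (x, y) steps from start to goal, or None if unreachable."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan-distance heuristic
    open_heap = [(h(start), 0, start, [start])]
    best_cost = {start: 0}
    while open_heap:
        _, cost, pos, path = heapq.heappop(open_heap)
        if pos == goal:
            return path
        x, y = pos
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and grid[ny][nx]:
                new_cost = cost + 1
                if new_cost < best_cost.get((nx, ny), float("inf")):
                    best_cost[(nx, ny)] = new_cost
                    heapq.heappush(
                        open_heap,
                        (new_cost + h((nx, ny)), new_cost, (nx, ny), path + [(nx, ny)]),
                    )
    return None  # no path exists

# Example: a 4x3 room (True = walkable) with one wall; path from the door to an item.
room = [
    [True, True,  True, True],
    [True, False, True, True],
    [True, True,  True, True],
]
print(astar(room, start=(0, 0), goal=(3, 2)))
```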

8️⃣ Andrew's Challenges and Philosophy on "Cheating"

  • Andrew reflects on his experience:
  • Emulator Interaction: His biggest hurdle was the difficulty of interfacing reliably with the RetroArch emulator, highlighting the value of robust environment harnesses like Morph Cloud's.
  • Repetition: Found that simply instructing the LLM not to repeat its last action provided significant mileage in overcoming repetitive loops (though David noted limitations).
  • The "Cheating" Debate: Andrew directly addresses whether reading RAM or using A* is "cheating." He argues that for goals like creating engaging content or simply making progress, "cheating" (using scaffolds like direct memory access) is pragmatic. His philosophy: “Start with something that works and then remove stuff slowly that you consider cheating... eventually but like oh my gosh there's so much work still here to do.” This resonates with practical AI development where scaffolding is often necessary initially.

9️⃣ Andrew's Q&A: Practical Takeaways

  • Tips: Keep prompts simple, leverage RAM access if available (as it bypasses vision issues), and use version control diligently.
  • Surprising Success: The biggest breakthrough was overcoming the initial loop where the agent repeatedly tried to grab a starter Pokémon before fulfilling the prerequisite steps (talking to Oak, rival battle, etc.). A simple prompt tweak to try something else if stuck unlocked progress.
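A hedged sketch of that anti-repetition idea: detect when the agent has emitted the same action several turns in a row and inject a corrective message. The function and message wording are illustrative assumptions, not Andrew's implementation.

```python
def maybe_nudge(messages: list[dict], recent_actions: list[str], window: int = 3) -> None:
    """Append a corrective user message when the last `window` actions are identical."""
    if len(recent_actions) >= window and len(set(recent_actions[-window:])) == 1:
        messages.append({
            "role": "user",
            "content": (
                f"You have repeated '{recent_actions[-1]}' {window} times in a row "
                "without progress. Do not repeat it again; try a different approach."
            ),
        })
```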

1️⃣0️⃣ Hackathon Introduction: The Mt. Moon Challenge

  • The Latent Space host sets the stage for the hackathon:
  • Goal: Escape Mt. Moon starting from a provided game snapshot.
  • “No Cheating” Rule: Specifically prohibits prompting exact step-by-step solutions for Mt. Moon. General strategies and tool use (like internet search) are allowed.
  • Benchmarking: The task serves as an informal benchmark for agent capabilities in reasoning and execution within a game environment.
  • Prizes: Awarded for fastest escape, plus creativity prizes (e.g., using different models like Llama).

1️⃣1️⃣ Morph Cloud: Infrastructure for AI Agents

  • Jesse, CEO of Morph Labs, introduces their platform:
  • Vision: Providing infinitely scalable, elastic cloud compute designed for AI agents, conceptualized as "a computer for every agent."
  • Infinibranch: Their core technology enabling low-overhead snapshotting and branching of entire compute environments. This allows exploring possibilities, testing hypotheses, and scaling search without irreversible consequences – described metaphorically as offering "grace" to the machine.
  • Tree of Life Demo: Showcased Infinibranch's power by creating a branching tree of live VMs, each running a slightly varied version of Conway's Game of Life, demonstrating massive parallel environment manipulation.

1️⃣2️⃣ Morph Cloud Tools for the Hackathon: EVA Framework & Harness

  • Sherog and Jesse detail the tools provided:
  • EVA (Execution with Verified Agents): A simple agent framework designed to leverage Morph Cloud features. It facilitates exploring multiple paths (test-time search) and verifying outcomes against a goal (e.g., escaping Mt. Moon), using Morph's snapshotting to backtrack on failure (a sketch of the pattern follows this list).
  • Agent Harness UI: A web interface allowing participants to view the live game via VNC, monitor the agent's conversation/reasoning, track its trajectory step-by-step, and potentially backtrack using snapshots.
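A hypothetical sketch of the verify-and-backtrack pattern EVA enables. Every environment operation is passed in as a callable because the real Morph Cloud / EVA APIs are not reproduced here; all names are placeholders.

```python
from typing import Any, Callable

def escape_with_backtracking(
    initial_snapshot: str,
    restore: Callable[[str], Any],          # boot an environment from a snapshot ID
    run_turns: Callable[[Any, int], list],  # let the agent play N turns, return its trajectory
    snapshot: Callable[[Any], str],         # snapshot the current environment, return an ID
    verified: Callable[[list], bool],       # goal check: did the trajectory exit Mt. Moon?
    progressed: Callable[[list], bool],     # weaker check: did it at least make progress?
    max_attempts: int = 5,
    turns_per_attempt: int = 50,
) -> bool:
    """Retry from the last good checkpoint until the verifier passes or attempts run out."""
    checkpoint = initial_snapshot
    for _ in range(max_attempts):
        env = restore(checkpoint)
        trajectory = run_turns(env, turns_per_attempt)
        if verified(trajectory):
            return True
        if progressed(trajectory):
            checkpoint = snapshot(env)      # move the branch point forward
        # otherwise retry from the same checkpoint (cheap thanks to snapshotting)
    return False
```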

1️⃣3️⃣ Hackathon Task, Judging, and Prizes

  • Jesse clarifies the competition details:
  • Task: Implement an agent (subclassing the provided EVA agent template) that escapes Mt. Moon from a specific snapshot ID in the fewest agent turns.
  • Submission: Submit a GitHub Gist with the agent code.
  • Judging: Agents will be run on a held-out test snapshot. Trajectories analyzed for speed and adherence to the "no explicit Mt. Moon instructions" rule.
  • Prizes: Plushies for the fastest valid escape. A $1,000 cash prize for the "coolest use case for Infinibranch" (e.g., Monte Carlo Tree Search, adaptive scaling, creative branching strategies), judged on a "vibes-based basis."

1️⃣4️⃣ Setup Tutorial and Final Remarks

  • Matthew and Sherog walk through the setup:
  • Access: Go to cloud.morph.so/web/poke to sign up, get credits, and access the pre-configured Pokémon snapshot.
  • Local Environment: Requires Python 3.11+ and the uv package manager. Clone the morph-cloud/examples repo, set up a virtual environment, and install dependencies.
  • Credentials: Export Morph Cloud API key and Anthropic API key as environment variables.
  • Running: Launch the agent script with the snapshot ID, and optionally launch the UI server.
  • Alternative: Advanced users can connect their own agents directly to the MCP (Model Context Protocol) server running in the Morph environment via SSE (Server-Sent Events).

The episode concludes with thanks to the Morph team for their intensive preparation work.

Conclusion: Practical Frontiers in Agent Development

This episode vividly illustrates the practical hurdles (vision, spatial reasoning) and surprising emergent behaviors of current AI agents in complex, long-horizon tasks. The contrasting approaches of direct interaction vs. memory "cheating," alongside Morph Cloud's advanced snapshotting, highlight key strategies and infrastructure needs for advancing agent capabilities. Investors and researchers should track progress in these areas as crucial indicators for deployable, robust AI systems.
