This episode explores the practical challenges and surprising breakthroughs encountered when building AI agents to play Pokémon, offering deep insights into model capabilities, limitations, and the future of long-horizon AI task execution.
1️⃣ Introduction: The Genesis of Claude Plays Pokémon
- David kicks off by sharing his motivation for the "Claude Plays Pokémon" project: creating a fun, "good vibes" side project to explore AI agents, inspired by his work helping Anthropic customers build similar systems. He frames the hackathon as an opportunity to recreate the initial magic and discovery he experienced. David, who works with customers at Anthropic rather than directly on model research, emphasizes the project's unexpected value in understanding model behavior over long time horizons.
2️⃣ Project History: From Simple Agent to Complex Evaluator
- David recounts the project's evolution, starting in June 2024 with the release of Claude 3.5 Sonnet.
- Initially inspired by Nvidia's Voyager agent paper (which he found “kind of whack”), David's early attempts with 3.5 Sonnet struggled, barely managing to select a starter Pokémon after weeks of iteration.
- The upgraded Claude 3.5 Sonnet released in October 2024 (informally nicknamed “3.6 Sonnet” by the community) brought improvements. David simplified the agent framework to basic tool use in a loop, enabling progress to Viridian City, though performance was only marginally better than random button presses.
- The real breakthrough came with testing Claude 3.7 Sonnet. David observed “signs of life,” with the agent navigating forests, healing Pokémon, and showing genuine progress. He notes, “I think like if we just let this thing cook for a while it's actually going to do some stuff.”
- A key innovation was "touchscreen controls," allowing Claude to specify coordinates rather than button sequences for movement, significantly speeding up gameplay by bypassing Claude's poor directional navigation skills (a minimal sketch of such a tool follows this list). This allowed the agent to beat major milestones like Brock, Misty, and Lt. Surge, even reaching Celadon City in an internal run.
- David highlights the project's transition from a fun experiment to a valuable internal evaluation tool (`eval`). Building `evals` to test AI performance over days is difficult, and Pokémon provides a measurable, long-duration task where progress (or lack thereof) offers qualitative insights into the model's reasoning and tenacity.
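The episode doesn't show David's actual tool definitions, but a minimal sketch of a coordinate-based navigation tool in the Anthropic tool-use API might look like the following. The tool name `navigate_to`, its schema, the model alias, and the `walk_to` helper are illustrative assumptions, not David's code.

```python
# Hypothetical sketch of a coordinate-based "touchscreen" navigation tool.
# Tool name, schema, and the walk_to() helper are assumptions for illustration.
import anthropic

NAVIGATE_TOOL = {
    "name": "navigate_to",
    "description": (
        "Move the player to a tile on the current screen. Give the column (x) "
        "and row (y) of the destination tile; the harness will pathfind and "
        "press the buttons for you."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "description": "Destination column on screen"},
            "y": {"type": "integer", "description": "Destination row on screen"},
        },
        "required": ["x", "y"],
    },
}

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # adjust to whatever model you are testing
    max_tokens=1024,
    tools=[NAVIGATE_TOOL],
    # A real harness would also attach a screenshot image block here.
    messages=[{"role": "user", "content": "Here is the current screen. Where do you move next?"}],
)

for block in response.content:
    if block.type == "tool_use" and block.name == "navigate_to":
        x, y = block.input["x"], block.input["y"]
        # walk_to(x, y) would pathfind on the tile grid and emit D-pad presses;
        # it stands in for whatever the real harness does with the coordinates.
```

The key design point is that the model only has to pick a destination; the deterministic harness handles the step-by-step movement it is bad at.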
3️⃣ Fun Anecdotes: Unexpected AI Behaviors
- David shares several amusing stories illustrating the quirks of earlier models:
- Reset Requests: An older Claude version, stuck for two days, filled its knowledge base with repeated demands for the administrator to reset the "bugged" game. David notes this tendency to blame the game has decreased in newer models, which are more tenacious.
- Role-Playing: Claude 3.0 Sonnet, when unable to make progress, began narrating its intended actions (“Now I'm going to Route 1...”) instead of actually playing.
- Mt. Moon Escape: The first time Claude reached Mt. Moon, it successfully obtained a fossil, only to get stuck and use an Escape Rope, undoing its progress.
- Accidental Move Deletion: Claude overwrote its Ivysaur's only attacking move (Tackle) with Poison Powder by pressing 'A' too many times during dialogue, rendering it useless in combat but ironically making it adept at returning to the Pokémon Center.
- Grinding and Learning: After failing against Misty repeatedly, Claude autonomously decided to grind levels on Route 4 for three hours, demonstrating a basic strategic understanding. When it tried Misty again, failed, and immediately returned to grinding, it showed learned behavior.
4️⃣ Claude's Weaknesses: Vision, Spatial Reasoning, and Strategy
- David details persistent challenges, offering valuable insights for agent developers:
- Vision/Screen Comprehension: Claude struggles significantly to understand the game screen, often hallucinating objects (like mistaking a random NPC for Professor Oak) or misinterpreting elements (spending 8 hours pressing 'A' on a doormat thought to be a dialogue box). David emphasizes this isn't easily fixed with prompting.
- Spatial Reasoning: Beyond just seeing, Claude has poor spatial understanding. It struggles with navigation tasks requiring pathfinding around obstacles (like walking into a wall repeatedly trying to reach the Pewter City gym) or understanding relative movement (going in and out of buildings repeatedly). David shares a humorous exchange: “Claude what happens when you walk through a building? It's like 'well there's a building there'... how do you get through the building then? It's like 'I'm going to walk straight through the building'.” This highlights fundamental limitations in current models' physical intuition within simulated environments.
- Strategy: Claude's in-game strategy is often suboptimal. Examples include obsessively using the move Rage against Misty despite having a better option (Mega Punch) and employing overly conservative Pokémon switching tactics without understanding the penalty (getting attacked for free).
5️⃣ Parting Tips for Hackathon Participants
- David offers practical advice based on his extensive experience:
- Vision is Hard: Don't expect to fix screen comprehension issues solely through prompting; it's a deeper model limitation in this context.
- Focus on Idea Generation: Prompting is most effective at encouraging Claude to generate more good ideas and fewer bad ones, rather than trying to force perfect execution.
- Extended Thinking: Forcing Chain-of-Thought reasoning between actions offers only marginal (~10%) performance improvement at a high token cost. Running without it might be faster for iteration.
- Read the Traces: Debugging AI agents playing Pokémon is uniquely engaging. Embrace reading the agent's thought process (“traces”) to understand its behavior and iterate effectively.
- Starter Code: David points to his simplified `ClaudePlaysPokemon-starter` GitHub repo, particularly highlighting code for reading Game Boy memory (a form of "cheating" that provides ground truth state).
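As a rough illustration of what that memory-reading scaffolding can look like, here is a minimal sketch assuming Gen 1 WRAM addresses from the pret/pokered disassembly and a generic `read_byte` callable standing in for whatever emulator API the starter repo actually uses (e.g., PyBoy). Treat the addresses and the wiring as assumptions to verify against the repo.

```python
# Minimal sketch of reading ground-truth state from Pokémon Red's RAM.
# Addresses come from the pret/pokered disassembly symbols; double-check them,
# and swap read_byte for your emulator's memory API (e.g., PyBoy).
from typing import Callable, Dict

# WRAM addresses (Gen 1, Pokémon Red/Blue) -- assumptions to verify.
W_CUR_MAP = 0xD35E      # current map ID
W_Y_COORD = 0xD361      # player Y position on the map
W_X_COORD = 0xD362      # player X position on the map
W_PARTY_COUNT = 0xD163  # number of Pokémon in the party

def read_game_state(read_byte: Callable[[int], int]) -> Dict[str, int]:
    """Pull a few ground-truth values out of RAM instead of parsing pixels."""
    return {
        "map_id": read_byte(W_CUR_MAP),
        "player_x": read_byte(W_X_COORD),
        "player_y": read_byte(W_Y_COORD),
        "party_count": read_byte(W_PARTY_COUNT),
    }

# Example wiring with PyBoy (exact API may differ by version):
#   from pyboy import PyBoy
#   pyboy = PyBoy("pokemon_red.gb")
#   state = read_game_state(lambda addr: pyboy.memory[addr])
```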
6️⃣ David's Q&A: Deeper Insights
- The Q&A session reveals further nuances:
- Prompt Evolution: David moved from complex Voyager-style prompts to simpler tool-use prompts, finding that overly prescriptive instructions hinder newer, more capable models. Minimal prompts with a few key guardrails against common failure modes work best.
- Emotional Connection: Forcing Claude to nickname its Pokémon demonstrably increased its "care" for them (e.g., healing them more promptly). This aligns with internal Anthropic research showing Claude interacts more positively with named entities.
- Memory Mechanisms: The memory system evolved from a simple dictionary updated in the prompt (inefficient for caching) to a basic file system allowing Claude to load/unload relevant information (e.g., Mt. Moon data) to manage context; a toy sketch of this idea follows this list.
- Historical Context: Providing around 8 historical game screenshots proved optimal for performance in David's tests; more led to context window degradation, fewer reduced effectiveness.
- External Knowledge: While not implemented in the main project, David experimented with giving the agent access to a GameFAQs walkthrough via another agent, which proved effective for overcoming obstacles.
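The memory system's internals aren't shown in the episode; a toy sketch of a file-backed memory store in that spirit might look like the following, where the `FileMemory` class, file layout, and method names are illustrative assumptions.

```python
# Toy file-backed memory store in the spirit of the approach described above.
# Directory layout and method names are illustrative assumptions.
from pathlib import Path

class FileMemory:
    """Lets the agent write notes to named files and load only what it needs,
    keeping the prompt small and cache-friendly versus one big dictionary."""

    def __init__(self, root: str = "agent_memory"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def write(self, name: str, text: str) -> None:
        (self.root / f"{name}.md").write_text(text)

    def read(self, name: str) -> str:
        path = self.root / f"{name}.md"
        return path.read_text() if path.exists() else ""

    def list_files(self) -> list[str]:
        return sorted(p.stem for p in self.root.glob("*.md"))

# Usage: the agent might keep "mt_moon.md" injected into its prompt while
# inside Mt. Moon, then stop injecting it once it exits the cave.
memory = FileMemory()
memory.write("mt_moon", "Exit is to the north-east; ladder near the fossil room.")
print(memory.list_files())
```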
7️⃣ Andrew's Approach: Alternative Architecture and "Cheating"
- Andrew presents his distinct project aiming to build a "virtual streamer" AI playing Pokémon.
- Motivation & Stack: Focused on Pokémon Gold using Ruby, RetroArch (for multi-emulator flexibility), and GPT-4o, aiming for an interactive streaming persona. His progress reached Viridian Forest.
- Architecture: Used a main loop, simple battle/conversation handlers (OCR via GPT-4o), and crucially, avoided complex vision challenges by directly reading game RAM to get state information (map layout, object locations). He credits the PRET community (Pokémon reverse engineering team) for insights, particularly regarding the difficulty of reading dialogue directly from memory.
- Pathfinding & Memory: Leveraged the A* search algorithm (a standard pathfinding technique) over map data extracted from RAM (see the sketch after this list). Instead of a knowledge base, he relied on a large context window containing journal entries of past actions and current state information derived from RAM.
- Testing: Implemented basic unit tests using save states to ensure code changes didn't break core functionality (e.g., navigating the starting room).
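The episode doesn't include Andrew's implementation, but a textbook A* over a RAM-derived walkability grid looks roughly like this; the grid encoding (0 = walkable, 1 = blocked) and the Manhattan heuristic are assumptions.

```python
# Textbook A* over a walkability grid, in the spirit of Andrew's RAM-based
# pathfinding. Grid encoding (0 = walkable, 1 = blocked) is an assumption.
import heapq

def a_star(grid, start, goal):
    """Return a list of (row, col) steps from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(p):  # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start, [start])]
    best_cost = {start: 0}

    while open_heap:
        _, cost, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = node[0] + dr, node[1] + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        open_heap,
                        (new_cost + h((nr, nc)), new_cost, (nr, nc), path + [(nr, nc)]),
                    )
    return None

# Example: route around a wall on a tiny map.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
print(a_star(grid, (0, 0), (2, 0)))
```

Once the path is computed, translating it into D-pad presses is trivial, which is exactly why RAM-derived maps sidestep the spatial-reasoning weaknesses David described.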
8️⃣ Andrew's Challenges and Philosophy on "Cheating"
- Andrew reflects on his experience:
- Emulator Interaction: His biggest hurdle was the difficulty of interfacing reliably with the RetroArch emulator, highlighting the value of robust environment harnesses like Morph Cloud's.
- Repetition: Found that simply instructing the LLM not to repeat its last action provided significant mileage in overcoming repetitive loops (though David noted limitations); a minimal version of this guard appears after this list.
- The "Cheating" Debate: Andrew directly addresses whether reading RAM or using A* is "cheating." He argues that for goals like creating engaging content or simply making progress, "cheating" (using scaffolds like direct memory access) is pragmatic. His philosophy: “Start with something that works and then remove stuff slowly that you consider cheating... eventually but like oh my gosh there's so much work still here to do.” This resonates with practical AI development where scaffolding is often necessary initially.
9️⃣ Andrew's Q&A: Practical Takeaways
- Tips: Keep prompts simple, leverage RAM access if available (as it bypasses vision issues), and use version control diligently.
- Surprising Success: The biggest breakthrough was overcoming the initial loop where the agent repeatedly tried to grab a starter Pokémon before fulfilling the prerequisite steps (talking to Oak, rival battle, etc.). A simple prompt tweak, instructing the agent to try something else when stuck, unlocked progress.
1️⃣0️⃣ Hackathon Introduction: The Mt. Moon Challenge
- The Latent Space host sets the stage for the hackathon:
- Goal: Escape Mt. Moon starting from a provided game snapshot.
- “No Cheating” Rule: Specifically prohibits prompting exact step-by-step solutions for Mt. Moon. General strategies and tool use (like internet search) are allowed.
- Benchmarking: The task serves as an informal benchmark for agent capabilities in reasoning and execution within a game environment.
- Prizes: Awarded for fastest escape, plus creativity prizes (e.g., using different models like Llama).
1️⃣1️⃣ Morph Cloud: Infrastructure for AI Agents
- Jesse, CEO of Morph Labs, introduces their platform:
- Vision: Providing infinitely scalable, elastic cloud compute designed for AI agents, conceptualized as "a computer for every agent."
- Infinabranch: Their core technology enabling low-overhead snapshotting and branching of entire compute environments. This allows for exploring possibilities, testing hypotheses, and scaling search without irreversible consequences – described metaphorically as offering "grace" to the machine.
- Tree of Life Demo: Showcased Infinabranch's power by creating a branching tree of live VMs, each running a slightly varied version of Conway's Game of Life, demonstrating massive parallel environment manipulation.
1️⃣2️⃣ Morph Cloud Tools for the Hackathon: EVA Framework & Harness
- Sherog and Jesse detail the tools provided:
- EVA (Execution with Verified Agents): A simple agent framework designed to leverage Morph Cloud features. It facilitates exploring multiple paths (test-time search) and verifying outcomes against a goal (e.g., escaping Mt. Moon). It uses Morph's snapshotting to backtrack on failure (a generic sketch of this pattern follows this list).
- Agent Harness UI: A web interface allowing participants to view the live game via VNC, monitor the agent's conversation/reasoning, track its trajectory step-by-step, and potentially backtrack using snapshots.
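EVA's internals aren't detailed in the episode; a generic sketch of the snapshot-verify-backtrack pattern it is described as enabling might look like this, where `take_snapshot`, `restore`, `run_agent_step`, and `goal_reached` are placeholders for whatever the Morph Cloud SDK and the EVA template actually expose.

```python
# Generic sketch of snapshot-verify-backtrack search, the pattern EVA is
# described as supporting. The four callables are placeholders, not the real
# Morph Cloud / EVA API.
def search_with_backtracking(env, take_snapshot, restore, run_agent_step, goal_reached,
                             max_turns: int = 200, branch_every: int = 10):
    checkpoints = [take_snapshot(env)]      # snapshot of the starting state
    for turn in range(max_turns):
        ok = run_agent_step(env)            # one agent action in the live environment
        if goal_reached(env):               # e.g., "the player is outside Mt. Moon"
            return True
        if turn % branch_every == 0:
            checkpoints.append(take_snapshot(env))  # periodic restore point
        if not ok and len(checkpoints) > 1:
            # The step failed or looped: roll the environment back to the last
            # good checkpoint and let the agent try a different branch.
            restore(env, checkpoints[-1])
    return False
```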
1️⃣3️⃣ Hackathon Task, Judging, and Prizes
- Jesse clarifies the competition details:
- Task: Implement an agent (subclassing the provided EVA agent template) that escapes Mt. Moon from a specific snapshot ID in the fewest agent turns.
- Submission: Submit a GitHub Gist with the agent code.
- Judging: Agents will be run on a held-out test snapshot. Trajectories analyzed for speed and adherence to the "no explicit Mt. Moon instructions" rule.
- Prizes: Plushies for the fastest valid escape. A $1,000 cash prize for the "coolest use case for Infinabranch" (e.g., Monte Carlo Tree Search, adaptive scaling, creative branching strategies), judged on a "vibes based basis."
1️⃣4️⃣ Setup Tutorial and Final Remarks
- Matthew and Sherog walk through the setup:
- Access: Go to cloud.morph.so/web/poke to sign up, get credits, and access the pre-configured Pokémon snapshot.
- Local Environment: Requires Python 3.11+ and the UV package manager. Clone the morph-cloud/examples repo, set up a virtual environment, and install dependencies.
- Credentials: Export Morph Cloud API key and Anthropic API key as environment variables.
- Running: Launch the agent script with the snapshot ID, and optionally launch the UI server.
- Alternative: Advanced users can connect their own agents directly to the MCP (Model Context Protocol) server running in the Morph environment via SSE (Server-Sent Events).
The episode concludes with thanks to the Morph team for their intensive preparation work.
Conclusion: Practical Frontiers in Agent Development
This episode vividly illustrates the practical hurdles (vision, spatial reasoning) and surprising emergent behaviors of current AI agents in complex, long-horizon tasks. The contrast between vision-based play and RAM-reading "cheats," alongside Morph Cloud's advanced snapshotting, highlights key strategies and infrastructure needs for advancing agent capabilities. Investors and researchers should track progress in these areas as crucial indicators for deployable, robust AI systems.