- Overview
- Architecture
- Features
- Directory Structure
- Requirements
- Installation
- VLM Backend Setup (run.py)
- CLI Agent Backend Setup (run_cli.py)
- Running the Agent
- Command Line Options
- Customizing Agent Behavior
- Advanced Configuration
- Troubleshooting
- Submission Instructions
- Citation
- License
This project implements an AI agent capable of playing Pokémon Emerald on a Game Boy Advance emulator. PokeAgent uses a vision-language model (VLM) to analyze game frames, understand the current game state, and make intelligent decisions to progress through the game via a series of MCP tools that we expose. PokeAgent is designed to be easily customizable for different VLMs and agent behaviors.
The system uses a headless server: the game and emulator run in a server process; agents and UIs run as clients. The server exposes HTTP REST and MCP endpoints; clients poll for state and submit actions.
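The poll-and-act cycle described above can be sketched from the client side. This is a minimal illustration, not the project's client: the `/state` and `/action` paths are taken from the server layout in this README, while the JSON body shape (`{"buttons": [...]}`) and the helper names are assumptions made for the example. See server/README.md for the actual schemas.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default --port


def build_state_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """GET request for the current game state."""
    return urllib.request.Request(f"{base_url}/state", method="GET")


def build_action_request(buttons, base_url: str = BASE_URL) -> urllib.request.Request:
    """POST request submitting button presses.

    The {"buttons": [...]} body is an assumed payload shape, not the
    server's documented schema.
    """
    body = json.dumps({"buttons": list(buttons)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/action",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def step(choose_buttons, opener=urllib.request.urlopen, base_url: str = BASE_URL):
    """One poll-decide-act cycle: fetch state, pick buttons, submit them."""
    with opener(build_state_request(base_url)) as resp:
        state = json.load(resp)
    buttons = choose_buttons(state)
    with opener(build_action_request(buttons, base_url)) as resp:
        return json.load(resp)
```

Passing the opener as a parameter keeps the decision logic testable without a running server; a real client would call `step` in a loop with its own policy as `choose_buttons`.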
For module-level detail, see the README in each area:
- server/README.md — Game server, frame streaming, MCP proxy, ports and endpoints.
- agents/README.md — PokeAgent, prompts, objectives, prompt optimization, local subagents.
- pokemon_env/README.md — Emulator, memory reader, Porymap map data.
- utils/README.md — Mapping, persistence, VLM backends, metrics.
Optional maintainer notes: System-Design/README.md (folder is often gitignored; create locally if missing).
- Multiple VLM backends: OpenAI, OpenRouter, Google Gemini, Anthropic (via utils/agent_infrastructure/vlm_backends.py)
- Vision-based perception: VLMs analyze game frames and state
- Agent scaffolds: PokeAgent (optional trajectory-based prompt optimization via --enable-prompt-optimization; separate from the in-agent subagent_reflect tool), vision-only
- PokeAgent local subagents: subagent_reflect, subagent_verify, subagent_gym_puzzle, and subagent_summarize are one-step local VLM calls; subagent_battler is a delegated battle loop that consumes real global steps but returns only a compacted battle summary to the orchestrator. Logged interaction names remain readable (Subagent_Reflect, Subagent_Verify, Subagent_Summarize, Gym_Puzzle_Analysis, Subagent_Battler). Recent trajectory text comes from run_data/{run_id}/prompt_evolution/trajectories/trajectories.jsonl (RunDataManager.log_trajectory).
- MCP support: External CLI agents (Claude Code/Codex CLI/Gemini CLI) interact with the game via pokemon_mcp_server.py. Containerization limits non-tool HTTP to the game server. The HTTP game server does not implement local subagents such as subagent_reflect; CLI agents use a reduced MCP surface (see server/cli/pokemon_mcp_server.py).
- Checkpoints & backups: Save/resume runs; backups in backups/; analysis data in run_data/
- Metrics & logging: Per-step and cumulative tokens, cost, and actions, along with run initialization settings, are stored in .pokeagent_cache/{run_id}/cumulative_metrics.json. LLM logs (llm_logs/) and other session logs are also kept, but cumulative_metrics.json is the single source of truth. One-step local subagents (reflect, verify, summarize, gym puzzle) record a synthetic tool_calls row on their step so the interaction name is visible next to token usage (they do not invoke MCP tools).
- Map system: Porymap integration, NPC display, movement preview, portal tracking
- Web interface: Real-time stream at http://localhost:8000/stream by default. The port can be set with the --port flag on both run.py and run_cli.py.
- Video recording: Optional MP4 recording of gameplay saved to run_data/{run_id}/end_state/videos/
- Customizable prompts: Edit prompt assets under agents/prompts/ to directly steer agent behavior.
pokeagent-speedrun/
├── README.md
├── pyproject.toml # Project config and dependencies (uv/pip)
├── uv.lock # Locked dependency versions (uv sync uses this)
├── requirements.txt # Pip fallback (frozen from env)
├── run.py # Multiprocess entry: starts server + in-repo agent client
├── run_cli.py # Entry for external CLI agents (MCP); spawns server + MCP proxy
├── server/
│ ├── app.py # FastAPI game server (emulator, /state, /action, /mcp/*, etc.)
│ ├── agent_thinking.txt # Runtime file (gitignored); server writes latest thinking for UI
│ ├── frame_server.py # Frame streaming
│ ├── stream.html # Web UI for streaming
│ └── cli/
│ └── pokemon_mcp_server.py # MCP proxy: stdio ↔ HTTP to game server
├── agents/
│ ├── __init__.py # Package exports (PokeAgent, VisionOnlyAgent)
│ ├── PokeAgent.py # Main benchmark agent
│ ├── vision_only_agent.py
│ ├── subagents/ # reflect, verify, summarize, battler, gym_puzzle, verify helpers; utils/ = registry, runtime, context, trajectory_window, puzzle_solver
│ ├── utils/ # prompt_optimizer, etc.
│ ├── objectives/ # Direct objectives, types, categorization
│ └── prompts/ # Canonical prompt assets and path helpers
├── utils/
│ ├── mapping/ # ascii_map_loader, map_formatter, map_stitcher, map_stitcher_singleton,
│ │ # pathfinding, pokeemerald_parser, porymap_json_builder, porymap_state
│ ├── data_persistence/ # backup_manager, run_data_manager, llm_logger
│ ├── agent_infrastructure/ # cli_agent_backends, vlm_backends
│ ├── metric_tracking/ # session readers (claude, gemini, codex), server_metrics
│ ├── state_formatter.py # Facade; re-exports from utils.mapping.porymap_state
│ ├── knowledge_base.py # Shared by agents and server
│ ├── anticheat.py, error_handler.py, json_utils.py, ocr_dialogue.py
│ └── ...
├── pokemon_env/
│ ├── emulator.py # EmeraldEmulator (mGBA, input, frame advance)
│ ├── memory_reader.py # PokemonEmeraldReader (DO NOT MODIFY for submissions)
│ ├── emerald_utils.py, enums.py, types.py, utils.py
│ ├── porymap_paths.py # Centralized path resolution for porymap data
│ ├── porymap/ # Pokeemerald decompilation data (data/maps, data/tilesets)
│ └── ...
├── tests/
│ ├── run_tests.py, states/, ground_truth/, test_*.py
│ └── ...
├── Emerald-GBAdvance/ # rom.gba (not included), *.state
├── .pokeagent_cache/ # Runtime cache per run (checkpoints, metrics, maps)
├── backups/ # Backup archives
├── run_data/ # Per-run analysis data
└── llm_logs/ # LLM interaction logs (auto-generated)
- Python 3.10–3.11
- Pokémon Emerald ROM (not included; obtain legally)
- An API key for one of the supported VLM backends (see VLM Backend Setup)
- mGBA system library for Python bindings
git clone https://github.com/sethkarten/pokeagent-speedrun
cd pokeagent-speedrun
Option A – uv (recommended):
uv uses pyproject.toml and uv.lock for reproducible installs.
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create .venv and install dependencies from uv.lock
uv sync
# Activate the environment (prompt will show (pokeagent-speedrun))
source .venv/bin/activate
To run without activating: uv run python run.py ... (uv uses the project venv automatically). For dev tools (pytest, ruff, mypy): uv sync --group dev.
Option B – Conda:
Create a conda env (e.g. conda create -n pokeagent python=3.10), then install a compatible libffi in the env (e.g. conda install libffi) so the mGBA Python bindings work, and install Python deps: pip install -r requirements.txt.
Required for the mGBA Python bindings. Example (Ubuntu 20.04):
wget https://github.com/mgba-emu/mgba/releases/download/0.10.5/mGBA-0.10.5-ubuntu64-focal.tar.xz
tar -xf mGBA-0.10.5-ubuntu64-focal.tar.xz
sudo dpkg -i mGBA-0.10.5-ubuntu64-focal/libmgba.deb
macOS (x86_64): brew install mgba
- uv: Already done in step 2 (uv sync). Re-run uv sync if pyproject.toml or uv.lock change.
- pip: pip install -r requirements.txt (e.g. inside your conda env).
Place your Pokémon Emerald ROM at Emerald-GBAdvance/rom.gba. US English SHA-1: f3ae088181bf583e55daf962a92bb46f4f1d07b7.
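To confirm the ROM matches the checksum above before launching, a short standard-library check is enough. This is a sketch, not part of the repository; the function names are made up for the example, and only the path and the SHA-1 value come from this README.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA1 = "f3ae088181bf583e55daf962a92bb46f4f1d07b7"  # US English ROM


def sha1_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so the whole ROM is never held in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rom_matches(path="Emerald-GBAdvance/rom.gba", expected=EXPECTED_SHA1) -> bool:
    """True if the file's SHA-1 equals the expected hex digest."""
    return sha1_of_file(path) == expected.lower()


if __name__ == "__main__":
    rom = Path("Emerald-GBAdvance/rom.gba")
    if not rom.exists():
        print(f"ROM not found at {rom}")
    else:
        print("ROM OK" if rom_matches(rom) else "SHA-1 mismatch")
```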
Set the required env var(s) for your backend, then run with the template below. You can vary flags (e.g. --load-state, --headless, --record) as needed.
Default template:
python run.py --backend {backend} --model-name {name} --port 8000 --agent-auto --scaffold pokeagent --direct-objectives categorized_full_game --direct-objectives-start 0 --direct-objectives-battling-start 0

| Backend | Env var(s) | Example (replace {backend} and {name} in template) |
|---|---|---|
| OpenAI | OPENAI_API_KEY | --backend openai --model-name gpt-5 |
| Anthropic | ANTHROPIC_API_KEY | --backend anthropic --model-name claude-sonnet-4.5 |
| OpenRouter | OPENROUTER_API_KEY | --backend openrouter --model-name anthropic/claude-4.5-sonnet |
| Google Gemini | GEMINI_API_KEY or GOOGLE_API_KEY | --backend gemini --model-name gemini-3-flash-preview |
| Vertex | Google Cloud auth (e.g. GOOGLE_APPLICATION_CREDENTIALS) | --backend vertex --model-name gemini-3-flash-preview |
| Auto | Any of the above | --backend auto --model-name <model-id> (backend inferred from model name) |
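
A missing API key usually only surfaces on the first model call, so a pre-flight check can save a wasted launch. The sketch below encodes the env-var column of the table for the key-based backends; the mapping and the `has_credentials` helper are illustrative and do not exist in the repository. Vertex and auto are left out because their credentials are not a single fixed variable.

```python
import os

# Backend -> env vars; any one variable in the tuple is sufficient.
REQUIRED_ENV = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openrouter": ("OPENROUTER_API_KEY",),
    "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
}


def has_credentials(backend: str, environ=None) -> bool:
    """Return True if at least one accepted env var is set and non-empty."""
    environ = os.environ if environ is None else environ
    names = REQUIRED_ENV.get(backend)
    if names is None:
        raise ValueError(f"no env-var rule for backend {backend!r}")
    return any(environ.get(name) for name in names)
```

For example, `has_credentials("gemini")` could be checked before invoking run.py with `--backend gemini`.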
External CLI agents (Claude Code, Codex, Gemini CLI) connect via MCP. Set the required env / auth, then use the template below. First run with a given backend image: add --build so the container is built with your UID/GID.
Default template:
python run_cli.py --backend {backend} --api-gateway openrouter --directive agents/prompts/cli-agent-directives/pokemon_directive.md --port 8000

| Backend | Env / Auth | Example |
|---|---|---|
| Claude | claude auth login (OAuth), or ANTHROPIC_API_KEY; for OpenRouter: OPENROUTER_API_KEY + --api-gateway openrouter | --backend claude; OpenRouter: --backend claude --api-gateway openrouter |
| Gemini | GEMINI_API_KEY | --backend gemini |
| Codex | codex login or OPENAI_API_KEY; for OpenRouter: OPENROUTER_API_KEY + --api-gateway openrouter | --backend codex; OpenRouter: --backend codex --api-gateway openrouter |
CLI agents run in Docker for isolation. Use --build on first run (e.g. python run_cli.py --backend claude --build --directive agents/prompts/cli-agent-directives/pokemon_directive.md), then omit --build for later runs.
run.py (in-repo agent): Starts the game server, then runs the selected agent client. Use the VLM Backend Setup template and swap in your --backend and --model-name. Examples of common variants:
# Load a specific state or resume from checkpoint
python run.py --backend gemini --model-name gemini-2.5-flash --load-state Emerald-GBAdvance/splits/01_tutorial/01_tutorial.state --port 8000 --agent-auto --scaffold pokeagent --direct-objectives categorized_full_game --direct-objectives-start 0 --direct-objectives-battling-start 0
python run.py --backend gemini --model-name gemini-2.5-flash --load-checkpoint --port 8000 --agent-auto --scaffold pokeagent --direct-objectives categorized_full_game --direct-objectives-start 0 --direct-objectives-battling-start 0
# Headless with recording
python run.py --backend gemini --model-name gemini-2.5-flash --port 8000 --agent-auto --scaffold pokeagent --headless --record --direct-objectives categorized_full_game --direct-objectives-start 0 --direct-objectives-battling-start 0
run_cli.py (external CLI agents via MCP): Starts the game server and MCP proxy; the CLI agent in the container talks to the game via MCP tools. Use the CLI Agent Backend Setup template; set the required env/auth for your backend and add --build on first run.
Debug controls (with display): M = state overlay, Shift+M = map, S = screenshot, Tab = cycle mode, Space = one agent step, 1/2 = save/load state, arrows/WASD = move, Z/X = A/B.
Web UI: http://localhost:8000/stream (or your --port).
Choose behavior with --scaffold (default: pokeagent).
| Scaffold | Description |
|---|---|
| pokeagent | Default. Main benchmark agent with direct objectives, knowledge, and prompt optimization. |
| autonomous_cli | Legacy alias for pokeagent. |
| vision_only | Vision-only agent (no map info, no pathfinding, button sequences). |
Example:
python run.py --scaffold pokeagent --agent-auto

| Flag | Description |
|---|---|
| --rom PATH | Path to the ROM file (default: Emerald-GBAdvance/rom.gba). |
| --port INT | Port for the game server and web interface (default: 8000). The frame server and MCP server listen on the next two ports (port + 1 and port + 2, respectively). |
| --load-state PATH | Load a saved state file on startup. |
| --load-checkpoint | Load from checkpoint files in the run cache. |
| --backup-state PATH | Load from a backup zip; extracts to cache and loads checkpoint, metrics, and persistent knowledge (preferred for resuming a run). |
| --backend NAME | VLM backend: openai, gemini, openrouter, anthropic, or auto (default: gemini). |
| --model-name TEXT | Model name for the backend (default: gemini-2.5-flash). |
| --scaffold NAME | Agent scaffold: pokeagent, autonomous_cli, or vision_only (default: pokeagent). |
| --headless | Run without the pygame display. |
| --agent-auto | Run the agent in automatic mode (no manual stepping). |
| --manual | Start in manual mode instead of agent mode. |
| --record | Record video of gameplay to run_data/{run_id}/end_state/videos/. |
| --no-ocr | Disable OCR dialogue detection (default: on). |
| --direct-objectives NAME | Load a direct objective sequence (e.g. categorized_full_game, autonomous_objective_creation). |
| --direct-objectives-start INT | Start index for story objectives (default: 0). |
| --direct-objectives-battling-start INT | Start index for battling objectives in categorized mode (default: 0). |
| --clear-knowledge-base | Clear knowledge_base.json before starting. |
| --run-name TEXT | Optional suffix for the run directory name. |
| --enable-prompt-optimization | Enable reflective prompt optimization from trajectory analysis. |
| --optimization-frequency INT | Steps between prompt optimization runs (default: 10). |
| --allow-walkthrough | Enable get_walkthrough tool (vision_only scaffold). |
| --allow-slam | Enable SLAM / map building (vision_only scaffold). |
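
Because the frame server and MCP server sit at fixed offsets from --port, it helps to know all three ports when opening firewall rules or running several instances side by side. A minimal sketch of that arithmetic follows; `derived_ports` is a hypothetical helper name, and only the +1 and +2 offsets come from the flag description.

```python
def derived_ports(base_port: int = 8000) -> dict:
    """Map each service to its port, given the value passed to --port."""
    if not 1 <= base_port <= 65535 - 2:
        raise ValueError("base_port must leave room for the +1 and +2 offsets")
    return {
        "game": base_port,        # HTTP game server and web UI
        "frames": base_port + 1,  # frame server
        "mcp": base_port + 2,     # MCP server
    }
```

Two concurrent runs therefore need base ports at least three apart, for example 8000 and 8003.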
| Flag | Description |
|---|---|
| --backend NAME | CLI agent backend: claude, gemini, or codex (default: claude). |
| --api-gateway NAME | Auth: login (OAuth/subscription, default) or openrouter (uses OPENROUTER_API_KEY). |
| --login | Run backend-specific auth login before starting (e.g. claude auth login). |
| --directive PATH | Path to system prompt/directive file for the CLI agent (default: repo CLI directive). |
| --port INT | Port for the game server (default: 8000). |
| --load-state PATH | Load a saved state file on startup. |
| --load-checkpoint | Load from checkpoint files in the run cache. |
| --backup-state PATH | Load from a backup zip; extracts to cache and enables checkpoint load. |
| --termination-condition NAME | Condition type to stop the run (default: gym_badge_count). |
| --termination-threshold INT | Threshold for termination (e.g. 1 = first badge; default: 1). |
| --poll-interval INT | Seconds between termination checks (default: 10). |
| --graceful-timeout INT | Seconds to wait for graceful shutdown before force kill (default: 30). |
| --record | Record video of gameplay. |
| --no-ocr | Disable OCR dialogue detection (default: on). |
| --direct-objectives NAME | Load a specific direct objective sequence. |
| --direct-objectives-start INT | Start index for direct objectives (default: 0). |
| --run-name TEXT | Optional name for the run directory. |
| --build | Build the container image before running (recommended so files are owned by your user). |
| --mcp-sse-port INT | Port for MCP SSE server (default: game port + 2). |
| --agent-thinking-effort LEVEL | Reasoning/thinking effort for CLI agent: low, medium, or high (Claude: --thinking-budget; Codex: -c model_reasoning_effort; Gemini: modelConfigs override). |
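
The termination flags interact as a simple polling loop: every --poll-interval seconds the run checks whether the condition has reached --termination-threshold. The sketch below illustrates that idea for the default gym_badge_count condition. It is not the code in run_cli.py; `wait_for_badges`, `read_badge_count`, and `max_polls` are hypothetical names introduced for the example.

```python
import time


def wait_for_badges(read_badge_count, threshold=1, poll_interval=10,
                    sleep=time.sleep, max_polls=None):
    """Poll until the badge count reaches the threshold.

    Returns the number of polls taken, or None if max_polls was exhausted.
    read_badge_count is any zero-argument callable returning an int.
    """
    polls = 0
    while True:
        polls += 1
        if read_badge_count() >= threshold:
            return polls
        if max_polls is not None and polls >= max_polls:
            return None
        sleep(poll_interval)  # wait before the next termination check
```

Injecting `sleep` keeps the loop testable without real waiting. Once such a loop returns, a runner would start the graceful shutdown and fall back to a force kill after the --graceful-timeout window.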
- Prompt files: agents/prompts/ holds pokeagent-directives/ and cli-agent-directives/; paths are repo-root-relative.
- Main benchmark agent: agents/PokeAgent.py.
- Vision-only variant: agents/vision_only_agent.py.
Edit the prompts in those files and restart the agent. Use --debug-state for detailed state in logs. For Nuzlocke-style behavior, change the system prompt and action/memory logic accordingly.
- Environment: OPENAI_API_KEY, OPENROUTER_API_KEY, GEMINI_API_KEY, GOOGLE_API_KEY; optional PYTHONPATH for development.
- Persistence: Checkpoints and run data are under .pokeagent_cache/{run_id}/ and run_data/{run_id}/. Backups of .pokeagent_cache/{run_id}/ are created on objective or major milestone completion. See utils/README.md for layout.
- Metrics: cumulative_metrics.json (in cache) and LLM logs; see utils/README.md.
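
For ad hoc analysis, the metrics file can be read directly from the run cache. The sketch below assumes only what is stated here: the file is JSON and lives at .pokeagent_cache/{run_id}/cumulative_metrics.json. The helper names are made up, and no particular keys are assumed, so the example just lists whatever top-level keys are present.

```python
import json
from pathlib import Path


def metrics_path(run_id: str, cache_root: str = ".pokeagent_cache") -> Path:
    """Location of the cumulative metrics file for a run."""
    return Path(cache_root) / run_id / "cumulative_metrics.json"


def load_metrics(run_id: str, cache_root: str = ".pokeagent_cache"):
    """Parse the metrics JSON; raises FileNotFoundError if the run has none."""
    with open(metrics_path(run_id, cache_root), encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(metrics) -> list:
    """Sorted top-level keys, without assuming any particular schema."""
    return sorted(metrics) if isinstance(metrics, dict) else []
```

See utils/README.md for the documented layout before relying on specific fields.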
- Module not found: Ensure deps are installed (uv sync or pip install -r requirements.txt) and PYTHONPATH includes the repo root if needed.
- Web UI: Ensure the server is running and the port (default 8000) is free; open http://localhost:8000/stream. You may need to forward the port to your local machine if you are connected via SSH.
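To check whether a port is free before starting the server, a bind attempt with the standard library is usually enough. This is a generic sketch rather than project code, and `port_is_free` is a name chosen for the example.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; failure means something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True
```

Since the frame server and MCP server use the next two ports, checking `port`, `port + 1`, and `port + 2` together gives a more complete picture.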
Allowed: Changing agent behavior (prompts, planning, memory), adding or changing VLM backends in utils/agent_infrastructure/vlm_backends.py, improving logging, tests, docs, performance, UI, and utilities.
Not allowed (for competitive submissions): Modifying pokemon_env/memory_reader.py or memory-reading logic, changing how game state is extracted, altering emulator core or anti-cheat, or manipulating game memory outside normal button input.
Submission requirements, how to submit, evaluation criteria, and tips for success are coming soon. All submission infrastructure will live at pokeagentchallenge.com.
If you use this codebase in your research, please cite:
@misc{karten2026pokeagentchallengecompetitivelongcontext,
title={The PokeAgent Challenge: Competitive and Long-Context Learning at Scale},
author={Seth Karten and Jake Grigsby and Tersoo Upaa Jr and Junik Bae and Seonghun Hong and Hyunyoung Jeong and Jaeyoon Jung and Kun Kerdthaisong and Gyungbo Kim and Hyeokgi Kim and Yujin Kim and Eunju Kwon and Dongyu Liu and Patrick Mariglia and Sangyeon Park and Benedikt Schink and Xianwei Shi and Anthony Sistilli and Joseph Twin and Arian Urdu and Matin Urdu and Qiao Wang and Ling Wu and Wenli Zhang and Kunsheng Zhou and Stephanie Milani and Kiran Vodrahalli and Amy Zhang and Fei Fang and Yuke Zhu and Chi Jin},
year={2026},
eprint={2603.15563},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.15563},
}
This project is licensed under the MIT License - see the LICENSE file for details. Make sure to comply with the terms of service of any VLM APIs you use.

