Model-Based Reinforcement Learning in Pokémon Red
An experimental research project applying Dreamer-style model-based RL to play Pokémon Red on a Game Boy emulator. Built from scratch - iterating from a simple VAE+GRU dynamics model to a discrete Recurrent State-Space Model trained on native-resolution pixels.
PokéDreamer is a research project that teaches an agent to build an internal world model of Pokémon Red - then reason, plan, and act inside that imagined world rather than the real emulator.
A neural network that learns the transition dynamics of the game: given the current pixel observation and an action, predict the next frame - entirely in a compressed latent space.
Instead of repeatedly stepping the emulator to explore, the agent "dreams" future trajectories inside the world model at high speed. The RSSM prior enables pure latent rollouts.
Architecturally inspired by Hafner et al.'s DreamerV2/V3 - using a Recurrent State-Space Model (RSSM) with discrete categorical latents and Gumbel-Softmax straight-through gradients.
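To make the discrete-latent idea concrete, here is a minimal, framework-free sketch of Gumbel-Softmax sampling for a 32×32 categorical latent. It is an illustration under assumptions, not the project's code: the function names, the temperature of 1.0, and the uniform placeholder logits are all hypothetical, and only the forward pass is shown.

```python
import math
import random

NUM_VARS, NUM_CLASSES = 32, 32  # 32 categorical variables x 32 classes, as in the text


def gumbel_softmax_sample(logits, temperature, rng):
    """Return (soft, hard) samples for one categorical variable."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard one-hot vector used in the forward pass.
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return soft, hard


def sample_latent(logits_grid, temperature=1.0, seed=0):
    """Sample every categorical variable and flatten the one-hots."""
    rng = random.Random(seed)
    flat = []
    for logits in logits_grid:
        _, hard = gumbel_softmax_sample(logits, temperature, rng)
        flat.extend(hard)
    return flat


# Uniform logits stand in for the encoder output (hypothetical placeholder).
latent = sample_latent([[0.0] * NUM_CLASSES for _ in range(NUM_VARS)])
```

In an autodiff framework, the straight-through part is usually written as `hard + soft - stop_gradient(soft)`: the forward value is the one-hot vector, while gradients flow through the soft sample.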
Built from scratch to understand every component - from the VAE bottleneck to KL-balancing in the RSSM. Not a wrapper around existing frameworks.
From symbolic state probes to pixel-level imagination - three generations of world models.
The first world model - a Variational Autoencoder compressing frames to 32-dim latents, with a GRU dynamics model predicting the next latent. An MPC planner imagines action sequences and picks the best path using linear probes for map position.
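The planning loop for v1 can be sketched as random-shooting MPC: sample candidate action sequences, imagine each one in the model, score the imagined end state with the position probe, execute only the first action of the best sequence, then replan. Everything below is a toy stand-in: `toy_dynamics` replaces the learned GRU, `toy_probe` replaces the linear probe, and random shooting is an assumption about the search strategy, which the text does not specify.

```python
import random

ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def toy_dynamics(state, action):
    """Stand-in for the learned dynamics model: moves an (x, y) 'latent'."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def toy_probe(state):
    """Stand-in for the linear position probe: here the latent is the position."""
    return state


def plan(state, goal, horizon=8, num_candidates=64, seed=0):
    """Imagine random action sequences and return (first action, cost) of the best."""
    rng = random.Random(seed)
    names = sorted(ACTIONS)
    best_cost, best_seq = float("inf"), None
    for _ in range(num_candidates):
        seq = [rng.choice(names) for _ in range(horizon)]
        imagined = state
        for action in seq:
            imagined = toy_dynamics(imagined, action)  # no emulator step
        x, y = toy_probe(imagined)
        cost = abs(x - goal[0]) + abs(y - goal[1])  # Manhattan distance in tiles
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0], best_cost


def run_mpc(start, goal, steps=10):
    """Receding-horizon control: act on the first planned action, then replan."""
    trajectory = [start]
    state = start
    for step in range(steps):
        action, _ = plan(state, goal, seed=step)
        state = toy_dynamics(state, action)
        trajectory.append(state)
    return trajectory
```

Because only the first action of each plan is executed, errors in long imagined rollouts matter less than they would for open-loop execution.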
The major leap - native 160×144 resolution, a 4-layer Residual CNN encoder, and a full Recurrent State-Space Model with 32×32 discrete categorical latents trained via Gumbel-Softmax. The RSSM's prior enables pure imagination rollouts without any emulator steps.
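For reference, the standard RSSM formulation from the Dreamer papers can be written as follows. The symbol $e_t$ for the 512-dim frame embedding is notation introduced here, and the project's exact wiring may differ from this sketch.

$$
\begin{aligned}
h_t &= \mathrm{GRU}(h_{t-1},\, s_{t-1},\, a_{t-1}) && \text{deterministic recurrent state} \\
s_t &\sim q(s_t \mid h_t, e_t) && \text{posterior, conditioned on the frame embedding} \\
\hat{s}_t &\sim p(\hat{s}_t \mid h_t) && \text{prior, no observation needed} \\
\hat{x}_t &= \mathrm{dec}(h_t, s_t) && \text{pixel decoder}
\end{aligned}
$$

During imagination only the first and third lines are used: the sampled $\hat{s}_t$ is fed back into the GRU in place of $s_t$, so no frames and no emulator steps are required.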
The vision: System 1 (a PPO/GRPO actor trained inside RSSM imagination) directed by System 2 (a multimodal LLM strategic planner). The first version where "policy trained inside imagination" becomes the explicit target - approaching DIAMOND-style pixel-fidelity world modeling.
Quantitative results from training and evaluation across both completed versions.
Measured over 14,564 validation trajectories of length 29 steps. Scheduled sampling (SS) is critical: the SS model's tile error stays roughly flat, between 3.30 and 3.72 tiles, out to 29 imagined steps, while the pure teacher-forcing (TF) ablation drifts from 4.06 to 10.44 tiles - a 3× degradation relative to SS at step 29.
| Rollout Step | SS Latent MSE | TF Latent MSE | SS Tile Error | TF Tile Error | Improvement (TF / SS tile error) |
|---|---|---|---|---|---|
| Step 1 | 0.09268 | 0.12240 | 3.72 tiles | 4.06 tiles | 1.09× |
| Step 5 | 0.08751 | 0.23061 | 3.32 tiles | 5.08 tiles | 1.53× |
| Step 10 | 0.08842 | 0.40189 | 3.30 tiles | 6.47 tiles | 1.96× |
| Step 15 | 0.09238 | 0.59876 | 3.33 tiles | 7.81 tiles | 2.35× |
| Step 20 | 0.09830 | 0.78455 | 3.36 tiles | 9.13 tiles | 2.72× |
| Step 25 | 0.10606 | 0.92376 | 3.42 tiles | 9.72 tiles | 2.84× |
| Step 29 | 0.11197 | 1.04063 | 3.47 tiles | 10.44 tiles | 3.01× |
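The difference between the two training regimes can be sketched as follows. With scheduled sampling, the model is sometimes fed its own previous prediction instead of the ground-truth latent, so it learns to cope with its own errors. The function name, the scalar toy latents, and the biased `predict` function are illustrative assumptions, not the project's implementation.

```python
import random


def rollout_with_scheduled_sampling(true_latents, actions, predict, sample_prob, rng):
    """Unroll `predict` over a trajectory, mixing true and predicted inputs.

    sample_prob = 0.0 is pure teacher forcing; 1.0 is a fully autoregressive rollout.
    """
    predictions = []
    current = true_latents[0]
    for t, action in enumerate(actions):
        pred = predict(current, action)
        predictions.append(pred)
        # Choose the next input: the model's own output or the ground-truth latent.
        use_own = rng.random() < sample_prob
        current = pred if use_own else true_latents[t + 1]
    return predictions


# Toy setup: the true latent advances by the action; the model has a +0.1 bias.
actions = [1.0] * 5
true_latents = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]


def biased_predict(z, a):
    return z + a + 0.1


rng = random.Random(0)
tf_preds = rollout_with_scheduled_sampling(true_latents, actions, biased_predict, 0.0, rng)
ar_preds = rollout_with_scheduled_sampling(true_latents, actions, biased_predict, 1.0, rng)
tf_errors = [abs(p - z) for p, z in zip(tf_preds, true_latents[1:])]
ar_errors = [abs(p - z) for p, z in zip(ar_preds, true_latents[1:])]
```

In this toy, the per-step error under teacher forcing stays at the bias, while the autoregressive rollout accumulates it step by step. A model trained only with teacher forcing never sees that accumulation during training, which is one common explanation for the drift in the TF columns.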
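Trained for 4 epochs on 20 NPZ files (~16,000 native-resolution transitions). Batch size 64, sequence length 15. Training reconstruction loss decreases every epoch (0.1379 to 0.1015); validation reconstruction is not monotonic but reaches its minimum at epoch 4. The best checkpoint (epoch 4, val recon = 0.1003) demonstrates pixel-level world modeling at native Game Boy resolution, although validation KL and total validation loss are at their highest in that epoch.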
| Epoch | Train Loss | Train Recon | Train KL | Val Loss | Val Recon | Val KL |
|---|---|---|---|---|---|---|
| 1 | 0.1476 | 0.1379 | 0.0078 | 0.1266 | 0.1256 | 0.0010 |
| 2 | 0.1207 | 0.1144 | 0.0063 | 0.1172 | 0.1110 | 0.0062 |
| 3 | 0.1490 | 0.1068 | 0.0422 | 0.1228 | 0.1142 | 0.0086 |
| 4 | 0.1021 | 0.1015 | 0.0005 | 0.1651 | 0.1003 ★ | 0.0648 |
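The starred entry is consistent with selecting the checkpoint on validation reconstruction rather than on total validation loss. A small sketch using the numbers from the table (the selection criterion itself is an assumption inferred from the star, and `best_epoch` is a hypothetical helper):

```python
# (epoch, val_loss, val_recon) copied from the table above.
VAL_METRICS = [
    (1, 0.1266, 0.1256),
    (2, 0.1172, 0.1110),
    (3, 0.1228, 0.1142),
    (4, 0.1651, 0.1003),
]


def best_epoch(rows, column):
    """Return the epoch whose value in `column` is smallest."""
    return min(rows, key=lambda row: row[column])[0]


by_recon = best_epoch(VAL_METRICS, column=2)  # select on validation reconstruction
by_total = best_epoch(VAL_METRICS, column=1)  # select on total validation loss
```

The two criteria disagree here: reconstruction picks epoch 4, while total loss picks epoch 2, because the validation KL term jumps in the final epoch.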
How the world model evolved across versions - from continuous latents and teacher forcing to discrete RSSM with KL-balancing.
| Component | v1 (VAE + GRU) | v2 (Discrete RSSM) |
|---|---|---|
| Resolution | 40×36 pixels (PWhiddy downsampled) | 160×144 pixels (native Game Boy) |
| Encoder | Variational Autoencoder (VAE) | 4-layer Residual CNN → 512-dim embed |
| Latent Space | Continuous R³² (reparameterization) | Discrete 32×32 categorical (1024-dim) |
| Dynamics | Autoregressive GRU (scheduled sampling) | RSSM: GRU h_t (512) + Stochastic s_t |
| Gradient Estimator | Reparameterization trick | Gumbel-Softmax straight-through |
| KL Balancing | Standard ELBO | 80% prior / 20% posterior balancing |
| Imagination | Prior-only rollout | Full posterior + prior with KL balancing |
| Decoders | Pixel decoder only | Pixel + Reward predictor + Continue predictor |
| Controller | Lookahead MPC planner (coordinate probe) | Actor-Critic trained in imagination |
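The 80% / 20% split in the KL Balancing row matches the KL-balancing scheme published with DreamerV2, which can be written as below. Here $q$ is the posterior, $p$ is the prior, $e_t$ is the frame embedding (notation introduced here), and $\mathrm{sg}(\cdot)$ is the stop-gradient operator. That the project uses exactly this form is an assumption.

$$
\mathcal{L}_{\mathrm{KL}} = \alpha \, \mathrm{KL}\big[\, \mathrm{sg}\big(q(s_t \mid h_t, e_t)\big) \,\Vert\, p(s_t \mid h_t) \,\big] + (1 - \alpha) \, \mathrm{KL}\big[\, q(s_t \mid h_t, e_t) \,\Vert\, \mathrm{sg}\big(p(s_t \mid h_t)\big) \,\big], \qquad \alpha = 0.8
$$

Both terms have the same numerical value; they differ only in where gradients flow. The first term (weight 0.8) trains the prior toward the posterior, and the second (weight 0.2) regularizes the posterior toward the prior.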
Get PokéDreamer running in minutes. You'll need a legally-obtained Pokémon Red ROM.
git clone https://github.com/xoTEMPESTox/PokeDreamer.git
cd PokeDreamer
conda env create -f environment.yml
conda activate pokemon-rl
Copy your legally-obtained Pokemon - Red Version (USA, Europe).gb to the project root. Alternatively, download the pre-collected dataset from Hugging Face.
# Collect data (optional - dataset available on HF)
python scripts/collect_data.py --episodes 20 --out-dir data
# Train the RSSM world model
python scripts/train_rssm.py \
    --data-dir data \
    --epochs 12 \
    --batch-size 64 \
    --out-dir checkpoints/rssm_v2
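Training on fixed-length sequences means each recorded trajectory has to be cut into windows. The sketch below shows one simple way to do that with the sequence length of 15 reported in the results. How the project's loader actually slices the NPZ files is not stated, so `sequence_windows`, its `stride` parameter, and the 800-step example are all hypothetical.

```python
def sequence_windows(num_steps, seq_len=15, stride=15):
    """Return (start, end) index pairs for fixed-length training sequences."""
    return [(s, s + seq_len) for s in range(0, num_steps - seq_len + 1, stride)]


# ~16,000 transitions over 20 files is roughly 800 per file if spread evenly.
non_overlapping = sequence_windows(800)
overlapping = sequence_windows(800, stride=1)
```

Non-overlapping windows give fewer, independent sequences; a stride of 1 reuses every transition in many windows, at the cost of highly correlated samples within a batch.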
Renders a side-by-side video comparing the real emulator vs. the RSSM's imagined frames.
python scripts/generate_demo_video_v2.py \
--checkpoint checkpoints/rssm_v2/best_world_model.pt \
--save-state saves/intro_done.state \
--out-video checkpoints/rssm_v2/side_by_side_demo_v2.mp4
RSSM v2 best checkpoint (best_world_model.pt), v1 VAE and dynamics checkpoints - all on Hugging Face.
20 NPZ files of native-resolution (160×144) gameplay transitions with full RAM state annotations. ~340 MB total. HuggingFace Dataset →
Full Python source - models, dataset loader, game state extractor, training scripts, and demo video generator. GitHub Repository →
Detailed documentation of every module: models.py, dataset.py, game_state.py, ram_addresses.py, and all scripts.