PokéDreamer

Model-Based Reinforcement Learning in Pokémon Red

An experimental research project applying Dreamer-style model-based RL to play Pokémon Red on a Game Boy emulator. Built from scratch, iterating from a simple VAE+GRU dynamics model to a discrete Recurrent State-Space Model (RSSM) trained on native-resolution pixels.

98.7% Map ID Accuracy
1.23 Tiles Coord Error
3× Drift Reduction (SS vs TF)
160×144 Native Resolution

What is PokéDreamer?

PokéDreamer is a research project that teaches an agent to build an internal world model of Pokémon Red, then reason, plan, and act inside that imagined world rather than in the real emulator.

🧠 World Model

A neural network that learns the transition dynamics of the game: given the current pixel observation and an action, it predicts the next state in a compressed latent space, which a decoder can map back to a frame.

🎮 Imagination-Based Planning

Instead of repeatedly stepping the emulator to explore, the agent "dreams" future trajectories inside the world model at high speed. The RSSM prior enables pure latent rollouts.
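To make "pure latent rollouts" concrete, here is a minimal NumPy sketch of prior-only imagination, assuming an RSSM with a GRU deterministic state h_t (512-dim, as in v2) and a stochastic state of 32 categoricals with 32 classes each. The weights are random placeholders, the 8-action button set and the names `imagine` and `sample_prior` are hypothetical, and this is not the project's implementation. The point is that each imagined step is a handful of matrix multiplies with no emulator call.

```python
import numpy as np

H_DIM = 512                 # deterministic GRU state size (v2 uses 512)
GROUPS, CLASSES = 32, 32    # 32 categorical variables x 32 classes
S_DIM = GROUPS * CLASSES    # 1024-dim flattened stochastic state
NUM_ACTIONS = 8             # hypothetical: size of the discrete button set

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init(shape):
    # Small random weights stand in for trained parameters.
    return rng.normal(0.0, 0.02, size=shape)

# The GRU input is the concatenation [s_t, one_hot(a_t)].
IN_DIM = S_DIM + NUM_ACTIONS
Wz, Wr, Wn = (init((H_DIM, IN_DIM)) for _ in range(3))
Uz, Ur, Un = (init((H_DIM, H_DIM)) for _ in range(3))
W_prior = init((S_DIM, H_DIM))  # h_t -> prior logits for all 32 categoricals

def gru_step(h, x):
    """GRU cell update (Cho et al. form, biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    n = np.tanh(Wn @ x + Un @ (r * h))
    return (1.0 - z) * n + z * h

def sample_prior(h, rng):
    """Draw one class per categorical from the prior p(s_t | h_t)."""
    logits = (W_prior @ h).reshape(GROUPS, CLASSES)
    # Gumbel-max trick: argmax(logits + Gumbel noise) is an exact categorical sample.
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    idx = np.argmax(logits + gumbel, axis=-1)
    return np.eye(CLASSES)[idx].reshape(S_DIM)

def imagine(h0, s0, actions, rng):
    """Roll the model forward on actions alone: no observations, no emulator."""
    h, s, states = h0, s0, []
    for a in actions:
        x = np.concatenate([s, np.eye(NUM_ACTIONS)[a]])
        h = gru_step(h, x)
        s = sample_prior(h, rng)
        states.append(np.concatenate([h, s]))  # model state the decoders would read
    return np.stack(states)

traj = imagine(np.zeros(H_DIM), np.zeros(S_DIM), actions=[0, 3, 3, 1], rng=rng)
print(traj.shape)  # (4, 1536)
```

During training the posterior (which also sees the encoded frame) replaces the prior sample; at imagination time only the prior is available, which is why it has to be trained to match the posterior.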

📡 Dreamer Lineage

Architecturally inspired by Hafner et al.'s DreamerV2/V3 - using a Recurrent State-Space Model (RSSM) with discrete categorical latents and Gumbel-Softmax straight-through gradients.
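In an autodiff framework the straight-through Gumbel-Softmax estimator is usually written as `hard + y - stop_gradient(y)`, and PyTorch ships it as `torch.nn.functional.gumbel_softmax(logits, tau=..., hard=True)`. The framework-free NumPy sketch below spells out both halves explicitly: the forward pass emits a hard one-hot sample per categorical, and the backward pass reuses the Jacobian of the relaxed softmax sample. The function names and τ = 1.0 are illustrative assumptions, not the project's code.

```python
import numpy as np

def st_gumbel_softmax(logits, tau, rng):
    """Forward pass of straight-through Gumbel-Softmax over the last axis.

    Returns the hard one-hot sample (used in the forward pass) and the
    relaxed sample y (whose Jacobian is used in the backward pass).
    """
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)      # numerically stable softmax
    y = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    hard = np.eye(logits.shape[-1])[y.argmax(axis=-1)]
    return hard, y

def st_backward(grad_hard, y, tau):
    """Straight-through backward: treat d(hard)/d(logits) as d(y)/d(logits).

    For y = softmax((logits + g) / tau), the vector-Jacobian product is
    (1 / tau) * y * (grad - sum(y * grad)) along the class axis.
    """
    inner = (y * grad_hard).sum(axis=-1, keepdims=True)
    return y * (grad_hard - inner) / tau

rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 32))            # 32 categoricals x 32 classes
hard, y = st_gumbel_softmax(logits, tau=1.0, rng=rng)
latent = hard.reshape(-1)                     # 1024-dim vector fed to the RSSM
grad_logits = st_backward(rng.normal(size=hard.shape), y, tau=1.0)
print(latent.shape, latent.sum())             # (1024,) 32.0
```

Because the forward value is a true one-hot sample, the dynamics model always sees discrete codes, while gradients still flow to the logits through the soft sample.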

🔬 Research-Grade

Built from scratch to understand every component - from the VAE bottleneck to KL-balancing in the RSSM. Not a wrapper around existing frameworks.

The Research Journey

From symbolic state probes to pixel-level imagination: three generations of world models, two completed and one in planning.

v1
Completed ✓

VAE + GRU MPC Planner

The first world model - a Variational Autoencoder compressing frames to 32-dim latents, with a GRU dynamics model predicting the next latent. An MPC planner imagines action sequences and picks the best path using linear probes for map position.

40×36 Resolution
R³² Latent Space
98.7% Map Accuracy
1.23 tiles Coord Error
Full v1 Documentation →
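The v1 planning loop can be sketched as exhaustive lookahead MPC: imagine every action sequence up to a short horizon, score the final latent with a linear coordinate probe, and execute only the first action before replanning. Everything here is a toy assumption for illustration: four directional actions, a horizon of 4, Manhattan distance as the cost, a hand-built probe, and a stand-in dynamics function in place of the trained GRU. The source does not say whether v1 enumerates or samples sequences.

```python
import itertools
import numpy as np

LATENT_DIM = 32
ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # hypothetical: up, down, left, right

# Hypothetical linear probe: coords = W @ z + b. In this toy, the probe simply
# reads latent dims 0 and 1 as (x, y) tile coordinates.
W_probe = np.zeros((2, LATENT_DIM))
W_probe[0, 0] = W_probe[1, 1] = 1.0
b_probe = np.zeros(2)

def toy_dynamics(z, a):
    """Stand-in for the trained GRU: moves the probed coordinates by one tile."""
    z_next = z.copy()
    z_next[:2] += ACTIONS[a]
    return z_next

def plan(z0, target_xy, dynamics, horizon=4):
    """Imagine every action sequence, score the final latent with the probe,
    and return the first action of the best sequence plus its cost."""
    best_seq, best_cost = None, np.inf
    for seq in itertools.product(ACTIONS, repeat=horizon):
        z = z0
        for a in seq:
            z = dynamics(z, a)
        coords = W_probe @ z + b_probe
        cost = np.abs(coords - target_xy).sum()  # Manhattan distance in tiles
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0], best_cost

first_action, cost = plan(np.zeros(LATENT_DIM), np.array([4.0, 0.0]), toy_dynamics)
print(first_action, cost)  # 3 0.0
```

Exhaustive search costs 4^horizon imagined rollouts, so a longer horizon would typically switch to sampled sequences (random shooting or CEM) instead.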
v2
Completed ✓

Discrete RSSM World Model

The major leap: native 160×144 resolution, a 4-layer Residual CNN encoder, and a full Recurrent State-Space Model whose stochastic state is 32 categorical variables with 32 classes each (1024 dims), trained with straight-through Gumbel-Softmax gradients. The RSSM's prior enables pure imagination rollouts without any emulator steps.

160×144 Resolution
1024-dim Discrete Latent
0.1003 Best Val Recon
4 epochs Trained
Full v2 Documentation →
v3
In Planning 🚀

Dual-Agent SOTA System

The vision: System 1 (a PPO/GRPO actor trained inside RSSM imagination) directed by System 2 (a multimodal LLM strategic planner). This is the first version in which a "policy trained inside imagination" is the explicit target, with the world model moving toward DIAMOND-style pixel-fidelity modeling.

≥2 Target Badges
<5 tiles Imagination Drift
<200MB Target Model Size
LLM+RL Dual Agent
Full v3 Roadmap →

Key Experimental Findings

Quantitative results from training and evaluation across both completed versions.

📈 v1: Scheduled Sampling vs. Teacher Forcing - Rollout Drift

Measured over 14,564 validation trajectories of 29 steps each. Scheduled sampling (SS) is critical: the SS model's rollout error stays roughly flat, between 3.30 and 3.72 tiles, across all 29 imagined steps, while the pure teacher-forcing (TF) ablation climbs steadily to 10.44 tiles, about 3× the SS error by step 29.

| Rollout Step | SS Latent MSE | TF Latent MSE | SS Tile Error (tiles) | TF Tile Error (tiles) | Improvement (TF ÷ SS tile error) |
|---|---|---|---|---|---|
| 1 | 0.09268 | 0.12240 | 3.72 | 4.06 | 1.09× |
| 5 | 0.08751 | 0.23061 | 3.32 | 5.08 | 1.53× |
| 10 | 0.08842 | 0.40189 | 3.30 | 6.47 | 1.96× |
| 15 | 0.09238 | 0.59876 | 3.33 | 7.81 | 2.35× |
| 20 | 0.09830 | 0.78455 | 3.36 | 9.13 | 2.72× |
| 25 | 0.10606 | 0.92376 | 3.42 | 9.72 | 2.84× |
| 29 | 0.11197 | 1.04063 | 3.47 | 10.44 | 3.01× |
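Scheduled sampling can be sketched as follows: while unrolling the dynamics model during training, each next input is the true latent with probability `p_teacher` and the model's own prediction otherwise, with `p_teacher` annealed toward 0. The linear schedule, the function names, and the placeholder dynamics are assumptions; the source does not state which schedule v1 used. Setting `p_teacher = 0` gives a fully open-loop rollout of the kind the drift table evaluates, and `p_teacher = 1` recovers pure teacher forcing.

```python
import numpy as np

def teacher_forcing_prob(step, total_steps, start=1.0, end=0.0):
    """Hypothetical linear decay for the probability of feeding ground truth."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def unroll_with_scheduled_sampling(z_true, dynamics, p_teacher, rng):
    """Unroll a latent dynamics model along a trajectory of true latents.

    z_true: (T, D) encoder latents. At each step the next input is the true
    latent with probability p_teacher, otherwise the model's own previous
    prediction, so training exposes the model to its own errors.
    Returns the (T - 1, D) predictions for z_true[1:].
    """
    preds = []
    z_in = z_true[0]
    for t in range(len(z_true) - 1):
        z_pred = dynamics(z_in)
        preds.append(z_pred)
        use_truth = rng.random() < p_teacher
        z_in = z_true[t + 1] if use_truth else z_pred
    return np.stack(preds)

rng = np.random.default_rng(0)
z_true = rng.normal(size=(30, 32))          # 30 latents -> 29 predicted steps

def placeholder_dynamics(z):
    return 0.9 * z                          # stand-in for the GRU

p = teacher_forcing_prob(step=5_000, total_steps=10_000)   # 0.5 halfway through
preds = unroll_with_scheduled_sampling(z_true, placeholder_dynamics, p, rng)
latent_mse = np.mean((preds - z_true[1:]) ** 2)            # per-step MSE, as in the table
```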

🌟 v2: RSSM World Model Training Progression

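4 epochs on 20 NPZ files (~16,000 native-resolution transitions), batch size 64, sequence length 15. Training reconstruction loss decreases every epoch (0.1379 → 0.1015), and validation reconstruction reaches its best value at epoch 4 (0.1003) after a small rise at epoch 3. The best checkpoint (epoch 4) is selected on validation reconstruction; total validation loss is actually highest at that epoch (0.1651) because validation KL rises to 0.0648 even as training KL falls to 0.0005.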

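| Epoch | Train Loss | Train Recon | Train KL | Val Loss | Val Recon | Val KL |
|---|---|---|---|---|---|---|
| 1 | 0.1476 | 0.1379 | 0.0078 | 0.1266 | 0.1256 | 0.0010 |
| 2 | 0.1207 | 0.1144 | 0.0063 | 0.1172 | 0.1110 | 0.0062 |
| 3 | 0.1490 | 0.1068 | 0.0422 | 0.1228 | 0.1142 | 0.0086 |
| 4 | 0.1021 | 0.1015 | 0.0005 | 0.1651 | 0.1003 | 0.0648 |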
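Training on "batch size 64, sequence length 15" implies cutting fixed-length windows of consecutive transitions out of the per-file episodes. A minimal sketch, assuming uniform random windows and episodes stored as dicts of equal-length arrays (for real files, something like `[dict(np.load(p)) for p in paths]`). The array names `frames` and `actions` and the function name are hypothetical; ~16,000 transitions over 20 files is about 800 per file, which the synthetic demo mimics with shrunk frames.

```python
import numpy as np

def episode_length(ep):
    """Length of the shared leading (time) axis of an episode dict."""
    return len(next(iter(ep.values())))

def sample_sequence_batch(episodes, batch_size=64, seq_len=15, rng=None):
    """Sample batch_size windows of seq_len consecutive transitions.

    episodes: list of dicts of arrays sharing a leading time axis.
    Returns a dict mapping each key to a (batch_size, seq_len, ...) array.
    """
    if rng is None:
        rng = np.random.default_rng()
    eligible = [ep for ep in episodes if episode_length(ep) >= seq_len]
    if not eligible:
        raise ValueError("no episode is long enough for the requested seq_len")
    batch = {key: [] for key in eligible[0]}
    for _ in range(batch_size):
        ep = eligible[rng.integers(len(eligible))]
        start = rng.integers(0, episode_length(ep) - seq_len + 1)
        for key, arr in ep.items():
            batch[key].append(arr[start:start + seq_len])
    return {key: np.stack(vals) for key, vals in batch.items()}

rng = np.random.default_rng(0)
# Synthetic stand-ins for loaded .npz episodes; real frames would be
# 144x160 (height x width), shrunk here to keep the example light.
episodes = [
    {"frames": rng.integers(0, 256, size=(800, 36, 40, 3), dtype=np.uint8),
     "actions": rng.integers(0, 8, size=800)}
    for _ in range(2)
]
batch = sample_sequence_batch(episodes, batch_size=64, seq_len=15, rng=rng)
print(batch["frames"].shape, batch["actions"].shape)  # (64, 15, 36, 40, 3) (64, 15)
```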

v1 vs v2 Architecture Comparison

How the world model evolved across versions: from continuous latents with scheduled sampling to a discrete RSSM with KL balancing.

| Component | v1 (VAE + GRU) | v2 (Discrete RSSM) |
|---|---|---|
| Resolution | 40×36 pixels (PWhiddy downsampled) | 160×144 pixels (native Game Boy) |
| Encoder | Variational Autoencoder (VAE) | 4-layer Residual CNN → 512-dim embed |
| Latent Space | Continuous R³² (reparameterization) | Discrete 32×32 categorical (1024-dim) |
| Dynamics | Autoregressive GRU (scheduled sampling) | RSSM: GRU h_t (512) + stochastic s_t |
| Gradient Estimator | Reparameterization trick | Gumbel-Softmax straight-through |
| KL Balancing | Standard ELBO | 80% prior / 20% posterior balancing |
| Imagination | Prior-only rollout | Full posterior + prior with KL balancing |
| Decoders | Pixel decoder only | Pixel + Reward predictor + Continue predictor |
| Controller | Lookahead MPC planner (coordinate probe) | Actor-Critic trained in imagination |
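The "80% prior / 20% posterior" row matches DreamerV2-style KL balancing, loss = α·KL(sg(q) ‖ p) + (1 − α)·KL(q ‖ sg(p)) with α = 0.8, where q is the posterior, p the prior, and sg a stop-gradient. Both terms have the same value, so balancing only changes how the gradient is split. The NumPy sketch below makes that split explicit by returning the analytic gradients an autodiff framework would compute; the function names are illustrative and this is not the project's code.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def balanced_kl(post_logits, prior_logits, alpha=0.8):
    """KL balancing for factorised categoricals (last axis = classes).

    Returns the loss value (equal to KL(q || p) summed over groups) and the
    balanced gradients w.r.t. the posterior and prior logits.
    """
    log_q, log_p = log_softmax(post_logits), log_softmax(prior_logits)
    q, p = np.exp(log_q), np.exp(log_p)
    log_ratio = log_q - log_p
    kl_per_group = (q * log_ratio).sum(axis=-1, keepdims=True)
    loss = kl_per_group.sum()
    # alpha * KL(sg(q) || p): only the prior moves; dKL/d(prior logits) = p - q.
    grad_prior = alpha * (p - q)
    # (1 - alpha) * KL(q || sg(p)): only the posterior moves;
    # dKL/d(post logits) = q * (log q - log p - KL).
    grad_post = (1.0 - alpha) * q * (log_ratio - kl_per_group)
    return loss, grad_post, grad_prior

rng = np.random.default_rng(0)
post, prior = rng.normal(size=(32, 32)), rng.normal(size=(32, 32))
loss, g_post, g_prior = balanced_kl(post, prior, alpha=0.8)
```

With α = 0.8 the prior is pulled toward the posterior four times harder than the posterior is regularized toward the prior, which helps keep the encoder's representations informative while the prior learns to predict them.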

Quick Start Guide

Set up PokéDreamer in four steps. You'll need a legally obtained Pokémon Red ROM.

01

Clone & Install

```bash
git clone https://github.com/xoTEMPESTox/PokeDreamer.git
cd PokeDreamer
conda env create -f environment.yml
conda activate pokemon-rl
```
02

Place ROM & Collect Data

Copy your legally obtained `Pokemon - Red Version (USA, Europe).gb` ROM to the project root. Alternatively, download the pre-collected dataset from Hugging Face.

```bash
# Collect data (optional - dataset available on HF)
python scripts/collect_data.py --episodes 20 --out-dir data
```
03

Train the RSSM World Model

```bash
python scripts/train_rssm.py \
    --data-dir data \
    --epochs 12 \
    --batch-size 64 \
    --out-dir checkpoints/rssm_v2
```
04

Generate Imagination Demo

Renders a side-by-side video comparing real emulator frames with the RSSM's imagined frames.

```bash
python scripts/generate_demo_video_v2.py \
    --checkpoint checkpoints/rssm_v2/best_world_model.pt \
    --save-state saves/intro_done.state \
    --out-video checkpoints/rssm_v2/side_by_side_demo_v2.mp4
```

Models, Datasets & References

📖 References & Credits

  • DreamerV2/V3: Hafner et al. (2021, 2023), Mastering Atari with Discrete World Models; Mastering Diverse Domains through World Models. Primary RSSM architecture inspiration.
  • PWhiddy PPO: PokemonRedExperiments. Used as the data collection policy for gathering gameplay transitions.
  • PyBoy: Baekalfen/PyBoy. Game Boy emulator used for environment interaction and screen capture.
  • DIAMOND: Diffusion As a Model Of eNvironment Dreams. Long-term architectural reference point for v3.