Motivation & Framing
v1 establishes the core thesis of PokéDreamer: that an agent can reason about Pokémon Red using a learned dynamics model rather than requiring access to the real emulator at every decision step. This is the fundamental distinction between model-based and model-free RL.
A useful test is to ask: "Was (s_t, a_t) → s_{t+1} learned from data, and did the planner use its rollouts, not the real emulator, to decide?" If the answer is yes at every decision point, you have a world model project.
The v1 approach uses a symbolic intermediate representation: the VAE compresses pixel frames to a compact latent z ∈ ℝ³², the GRU dynamics model propagates those latents autoregressively, and linear/MLP probes read out player position from frozen latents for the planner's objective function.
Architecture Overview
The v1 system is composed of four distinct learned components working in sequence:
| Component | Input | Output | Description |
|---|---|---|---|
| VAE Encoder | 40×36×3 pixels | z ∈ ℝ³² | Compresses screen frames to compact latents |
| GRU Dynamics | (z_t, a_t) | z_{t+1} | Predicts next latent given current state+action |
| RAM Probes | z ∈ ℝ³² | (x, y, map_id) | Decode player position from latents |
| MPC Planner | current z | action sequence | Imagines K-step futures, scores by probe output |
The observation space uses the downsampled 40×36×3 pixel format of the PWhiddy PPO agent. This allows reusing the frozen PPO checkpoint as the data-collection policy and executor, while the world model is trained on the same observations.
Variational Autoencoder (VAE)
The VAE is the first piece of the world model pipeline. It learns to compress Game Boy screen frames into a low-dimensional latent space z ∈ ℝ³², from which the decoder can reconstruct the original image.
Architecture Details
- Input: 40×36×3 RGB frames (PWhiddy downsampled resolution)
- Encoder: Convolutional layers → mean + log-variance heads → reparameterization
- Latent dim: 32 (continuous Gaussian)
- Decoder: Transposed convolutions reconstructing 40×36×3
- Loss: negative ELBO = reconstruction (MSE) + β × KL divergence
- β: 1.0 (standard VAE)
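The loss in the list above can be written out for a single frame. This is a minimal sketch in plain Python, assuming a diagonal Gaussian posterior with the standard closed-form KL against N(0, I) and a squared error summed over pixels. The function names and the sum reduction are illustrative assumptions, not taken from the project code.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def vae_loss(frame, recon, mu, logvar, beta=1.0):
    """Reconstruction (summed squared error, an assumption) + beta * KL."""
    recon_term = sum((x - r) ** 2 for x, r in zip(frame, recon))
    kl_term = kl_to_standard_normal(mu, logvar)
    return recon_term + beta * kl_term, recon_term, kl_term


# Toy usage: 32-dim latent, flattened 40*36*3 frame, perfect reconstruction.
rng = random.Random(0)
mu = [0.0] * 32
logvar = [0.0] * 32
z = reparameterize(mu, logvar, rng)
frame = [0.5] * (40 * 36 * 3)
total, recon_term, kl_term = vae_loss(frame, frame, mu, logvar)
```

With β = 1.0 the two terms are weighted equally, which is why the results tables can report the total as the plain sum of a reconstruction and a KL component.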
Training Configuration
Checkpoint: checkpoints/vae/best_vae.pt
Epochs: 15
latent_dim: 32
beta (KL): 1.0
lr: 1e-3
batch_size: 128
GRU Dynamics Model
The dynamics model learns the transition function z_{t+1} ≈ f(z_t, a_t) - predicting the next latent from the current latent plus the chosen action. This is what enables the agent to imagine future states without running the emulator.
Scheduled Sampling
A critical training detail: scheduled sampling gradually shifts the model from using ground-truth previous latents (teacher forcing) to using its own predicted latents. This mitigates exposure bias, where the model sees perfect inputs at train time but must consume its own imperfect predictions at inference time.
Architecture Details
- Input: concat(z_t ∈ ℝ³², action_embedding ∈ ℝ¹⁶) = ℝ⁴⁸
- Hidden state: GRU with 256-dim hidden vector
- Output: z_{t+1} ∈ ℝ³²
- Sequence length: 30 steps (BPTT rollout)
- Scheduled sampling: Linear decay over 15 epochs to 20% teacher forcing
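The schedule in the last bullet can be sketched as follows. This is a minimal illustration in plain Python, assuming the teacher-forcing ratio starts at 1.0 and that the choice between ground truth and prediction is made independently at every step. Both are assumptions, and `step_fn` is a hypothetical stand-in for one step of the GRU dynamics model.

```python
import random


def teacher_forcing_ratio(epoch, decay_epochs=15, min_ratio=0.2, start_ratio=1.0):
    """Linearly decay from start_ratio to min_ratio over decay_epochs, then hold."""
    progress = min(max(epoch, 0) / decay_epochs, 1.0)
    return start_ratio - (start_ratio - min_ratio) * progress


def rollout_with_sampling(step_fn, true_latents, actions, ratio, rng):
    """Unroll step_fn, feeding back the ground-truth latent with probability `ratio`.

    true_latents must hold len(actions) + 1 entries (z_0 .. z_T).
    """
    predictions = []
    z_in = true_latents[0]
    for t, action in enumerate(actions):
        z_pred = step_fn(z_in, action)
        predictions.append(z_pred)
        # Teacher forcing uses the true next latent; otherwise reuse the prediction.
        use_truth = rng.random() < ratio
        z_in = true_latents[t + 1] if use_truth else z_pred
    return predictions


# Toy usage with scalar "latents" and a step function that just adds the action.
rng = random.Random(0)
toy_step = lambda z, a: z + a
forced = rollout_with_sampling(toy_step, [0, 10, 20], [1, 1], ratio=1.0, rng=rng)
free = rollout_with_sampling(toy_step, [0, 10, 20], [1, 1], ratio=0.0, rng=rng)
```

At ratio 1.0 every step restarts from the true latent, and at ratio 0.0 the rollout is fully autoregressive, which is the regime the planner runs in.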
RAM State Probes
To use the world model for planning, the agent needs a way to extract meaningful information from the latent space z. We train lightweight probes on frozen latents - they never modify the VAE or dynamics model.
Probe Architecture
- Input: frozen z ∈ ℝ³²
- Shared base: 64-dim hidden layer
- Coordinate head: Linear regression for (x, y) tile position (MAE loss)
- Map ID head: 248-class classification (cross-entropy loss)
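The two heads above imply a combined objective. The sketch below shows one way to compute it for a single example in plain Python. The equal weighting of the two terms and the summing of |Δx| and |Δy| are assumptions, since the source does not state how the losses are reduced or combined.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]


def probe_loss(pred_xy, true_xy, map_logits, true_map_id, coord_weight=1.0):
    """Absolute coordinate error plus map-ID cross-entropy.

    coord_weight is a hypothetical knob; the source does not give the weighting.
    """
    coord_l1 = sum(abs(p - t) for p, t in zip(pred_xy, true_xy))  # Manhattan error in tiles
    map_ce = cross_entropy(map_logits, true_map_id)
    return coord_weight * coord_l1 + map_ce, coord_l1, map_ce


# Toy usage: an uninformative map head over 248 classes.
NUM_MAPS = 248
uniform_logits = [0.0] * NUM_MAPS
total, coord_l1, map_ce = probe_loss((5.0, 7.5), (4.0, 7.0), uniform_logits, true_map_id=12)
```

With uniform logits the cross-entropy equals log(248), which gives a baseline for what an untrained map head would score.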
MPC Planner
The Model Predictive Control (MPC) planner is the brain of the v1 agent. Rather than reacting to observations, it looks ahead by simulating multiple action sequences through the dynamics model and selecting the best-scoring imagined future.
Planning Loop
for each candidate action sequence:
    imagined_state = dynamics.rollout(z_current, sequence, k steps)
    (x_pred, y_pred) = probe.decode(imagined_state)
    score = evaluate(x_pred, y_pred, target_map_id)
best sequence = argmax(score)
execute first action of best sequence via frozen PPO
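The evaluation function is deliberately simple: the score is the negative Manhattan distance to a target tile/map, so the highest-scoring sequence is the one predicted to end closest to the goal. The contribution is the world model and its imagination capability, not the sophistication of the planner.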
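The planning loop can be made concrete with a toy stand-in for the learned components. In this sketch, `toy_step` plays the role of one dynamics step and `toy_decode` plays the role of the coordinate probe. Both are hypothetical placeholders operating directly on (x, y) tuples rather than latents, and enumerating every sequence is only one assumed way of generating candidates.

```python
import itertools

UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
# Toy convention: y grows downward, so UP decreases y.
MOVES = {UP: (0, -1), DOWN: (0, 1), LEFT: (-1, 0), RIGHT: (1, 0)}


def toy_step(state, action):
    """Stand-in for the dynamics model: move one tile in the chosen direction."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def toy_decode(state):
    """Stand-in for the probe: the toy state already is (x, y)."""
    return state


def score(xy, target_xy):
    """Negative Manhattan distance, so higher is better."""
    return -(abs(xy[0] - target_xy[0]) + abs(xy[1] - target_xy[1]))


def plan(z_current, target_xy, candidates, step_fn=toy_step, decode_fn=toy_decode):
    """Roll out every candidate in imagination and return the best one."""
    best_seq, best_score = None, float("-inf")
    for seq in candidates:
        z = z_current
        for action in seq:
            z = step_fn(z, action)
        s = score(decode_fn(z), target_xy)
        if s > best_score:
            best_seq, best_score = seq, s
    return best_seq, best_score


# All 4^3 movement sequences of horizon 3, starting three tiles left of the target.
candidates = list(itertools.product(MOVES, repeat=3))
best_seq, best_score = plan((0, 0), (3, 0), candidates)
first_action = best_seq[0]  # only this action is executed before replanning
```

Only the first action of the winning sequence is executed, after which the planner re-encodes the new observation and plans again. This receding-horizon pattern is what limits the impact of model error late in a rollout.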
Results: VAE Training
| Metric | Value | Notes |
|---|---|---|
| Train Loss (Total) | 1258.41 | Recon: 1227.76, KL: 30.66 |
| Val Loss (Total) | 1255.95 | Recon: 1225.83, KL: 30.13 |
| Epochs | 15 | Converged before 15 |
| Latent Dim | 32 | Continuous Gaussian |
Note: Loss values are in raw MSE pixel space (not normalized). The train/val parity indicates no significant overfitting.
Results: RAM State Probes
| Task | Metric | Value |
|---|---|---|
| Map ID Classification | Validation Accuracy | 98.7% |
| Player Coordinate Decoding | Manhattan Distance (MAE) | 1.23 tiles |
| Both heads (combined) | Validation Loss | 8.3260 |
Results: Dynamics Model
| Model | Val AR Loss | Val TF Loss | Notes |
|---|---|---|---|
| Scheduled Sampling (Primary) | 0.10314 | 0.03675 | 20% min TF ratio |
| Pure Teacher Forcing (Ablation) | 0.72547 | 0.01913 | TF=1.0 always |
Autoregressive (AR) loss is the meaningful metric: it measures performance when the model uses its own predictions as input, as it must during imagination. The pure teacher-forcing model achieves the lower teacher-forced loss (0.019 vs. 0.037), but its autoregressive loss is about 7× higher (0.725 vs. 0.103), so it degrades badly in rollout.
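The difference between the two metrics comes down to what is fed back at each step. The sketch below evaluates the same slightly biased toy model both ways. The `true_step` and `biased_step` functions are hypothetical stand-ins, not the project's GRU, chosen only to show how a small per-step error compounds under autoregressive rollout.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def rollout_loss(step_fn, true_latents, actions, autoregressive):
    """Average per-step MSE under teacher forcing or autoregressive feedback."""
    errors = []
    z_in = true_latents[0]
    for t, action in enumerate(actions):
        z_pred = step_fn(z_in, action)
        errors.append(mse(z_pred, true_latents[t + 1]))
        # AR feeds the prediction back in; TF resets to the ground truth.
        z_in = z_pred if autoregressive else true_latents[t + 1]
    return sum(errors) / len(errors)


def true_step(z, action):
    """Toy ground-truth dynamics."""
    return [v + action for v in z]


def biased_step(z, action):
    """Toy learned model with a 5% multiplicative error."""
    return [1.05 * v + action for v in z]


actions = [1.0] * 5
true_latents = [[1.0]]
for a in actions:
    true_latents.append(true_step(true_latents[-1], a))

tf_loss = rollout_loss(biased_step, true_latents, actions, autoregressive=False)
ar_loss = rollout_loss(biased_step, true_latents, actions, autoregressive=True)
```

Here `ar_loss` exceeds `tf_loss` even though the model is identical, because each autoregressive step starts from an already-wrong latent.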
Results: Rollout Drift Analysis
This is the key ablation. Over 14,564 validation trajectories of length 29, we compare the scheduled sampling model vs. the pure teacher forcing ablation as both are rolled out autoregressively.
| Step k | SS Latent MSE | TF Latent MSE | SS Tile Error | TF Tile Error | Tile Error Ratio (TF/SS) |
|---|---|---|---|---|---|
| 1 | 0.09268 | 0.12240 | 3.72 tiles | 4.06 tiles | 1.09× |
| 5 | 0.08751 | 0.23061 | 3.32 tiles | 5.08 tiles | 1.53× |
| 10 | 0.08842 | 0.40189 | 3.30 tiles | 6.47 tiles | 1.96× |
| 15 | 0.09238 | 0.59876 | 3.33 tiles | 7.81 tiles | 2.35× |
| 20 | 0.09830 | 0.78455 | 3.36 tiles | 9.13 tiles | 2.72× |
| 25 | 0.10606 | 0.92376 | 3.42 tiles | 9.72 tiles | 2.84× |
| 29 | 0.11197 | 1.04063 | 3.47 tiles | 10.44 tiles | 3.01× |
The SS model's tile error stays within a narrow band (3.30–3.72 tiles) across all 29 steps, with its highest value at step 1. The TF ablation degrades from 4.06 tiles at step 1 to 10.44 tiles at step 29, a 2.6× increase that leaves it 3× worse than the SS model at the final step.
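The ratio column can be reproduced from the tile errors in the table. The short check below divides the TF tile error by the SS tile error at each step and also computes how much the TF error grows between step 1 and step 29. The variable names are illustrative.

```python
# step -> (SS tile error, TF tile error), copied from the table above
drift = {
    1: (3.72, 4.06),
    5: (3.32, 5.08),
    10: (3.30, 6.47),
    15: (3.33, 7.81),
    20: (3.36, 9.13),
    25: (3.42, 9.72),
    29: (3.47, 10.44),
}

# TF error relative to SS error at the same step.
ratios = {step: round(tf / ss, 2) for step, (ss, tf) in drift.items()}

# Growth of the TF error itself from the first to the last step.
tf_growth = round(drift[29][1] / drift[1][1], 2)
```

The per-step ratio compares the two models at the same horizon, while `tf_growth` tracks the TF model against its own starting point. The two are different quantities.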
Full Hyperparameter Registry
# VAE
vae:
  latent_dim: 32
  beta: 1.0
  lr: 1.0e-3
  batch_size: 128
  epochs: 15
  input_resolution: [40, 36] # HxW
# Dynamics (Scheduled Sampling)
dynamics:
  hidden_dim: 256 # GRU hidden size
  action_embed_dim: 16
  seq_len: 30 # BPTT rollout steps
  decay_epochs: 15 # linear SS decay
  min_teacher_forcing: 0.2
  lr: 1.0e-3
  batch_size: 128
  epochs: 20
# Probes
probes:
  input_dim: 32 # frozen latent dim
  hidden_dim: 64 # per-task head
  lr: 1.0e-3
  batch_size: 128
  epochs: 10
State & Action Schema
The symbolic state representation extracted from WRAM at each emulator tick:
state = {
    'map_id': int,              # raw map ID byte (248 locations)
    'x': int,                   # player tile X coordinate
    'y': int,                   # player tile Y coordinate
    'facing': int,              # 0=down, 4=up, 8=left, 12=right
    'in_battle': bool,          # True when battle is active
    'dialog_open': bool,        # True when text box is showing
    'badges': int,              # bitmask of 8 gym badges
    'party_hp': list[int],      # current HP per party slot
    'party_max_hp': list[int],  # max HP per party slot
}
# Action Space
action ∈ {UP=0, DOWN=1, LEFT=2, RIGHT=3, A=4, B=5, START=6, SELECT=7}
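The schema above maps naturally onto typed Python definitions. This is a sketch only: the class names `Action`, `Facing`, and `State`, the helper `badge_count`, and the example values are illustrative and not taken from the project code. The field names and integer codes are copied from the schema.

```python
from enum import IntEnum
from typing import List, TypedDict


class Action(IntEnum):
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3
    A = 4
    B = 5
    START = 6
    SELECT = 7


class Facing(IntEnum):
    DOWN = 0
    UP = 4
    LEFT = 8
    RIGHT = 12


class State(TypedDict):
    map_id: int
    x: int
    y: int
    facing: int
    in_battle: bool
    dialog_open: bool
    badges: int
    party_hp: List[int]
    party_max_hp: List[int]


def badge_count(state: State) -> int:
    """Count set bits in the 8-bit badge mask."""
    return bin(state["badges"] & 0xFF).count("1")


# Made-up example values, for illustration only.
example = State(
    map_id=0, x=5, y=6, facing=Facing.UP, in_battle=False, dialog_open=False,
    badges=0b00000101, party_hp=[20, 0], party_max_hp=[22, 18],
)
```

Using `IntEnum` keeps the values interchangeable with the raw integers stored in the dataset while giving readable names in logs and planner code.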