Long-Term Vision
The eventual goal of this roadmap is something architecturally closer to DIAMOND (DIffusion As a Model Of eNvironment Dreams) - a world model where the RL policy is trained entirely inside imagined rollouts, at pixel fidelity good enough that the imagined environment is genuinely playable.
Everything in v1 and v2 was preparation. v3 is where the research turns into game-playing: can the agent earn Gym badges, catch Pokémon, and progress through the story, driven by a policy whose RL updates come only from imagined rollouts, with the real emulator used solely to collect data that refreshes the world model?
System 1 - Fast Reactive Controller
System 1 is a neural network policy π(a_t | h_t, s_t) - an Actor conditioned on the RSSM latent state. It executes actions at emulator speed and is responsible for low-level reactive control.
Behaviour Cloning Warm-Start
Before RL fine-tuning, the actor is pre-trained on 10–50h of curated human demonstrations via Behaviour Cloning (BC). This gives the policy a reasonable prior over useful action sequences before it starts exploring.
BC Phase:
Input: (h_t, s_t) ← RSSM latent at time t
Target: expert action a_t from human gameplay
Loss: cross-entropy(actor_logits, expert_actions)
Data: 10–50h curated human demonstrations
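The BC objective above can be sketched in plain Python as a mean cross-entropy over a batch. This is a minimal illustration, not the project's training code: the function names `log_softmax` and `bc_loss` are hypothetical, and the 8-way action space is taken from the actor architecture listed later in this document.

```python
import math

NUM_ACTIONS = 8  # assumption: matches the 8 actor outputs listed in the architecture table


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bc_loss(batch_logits, expert_actions):
    """Mean cross-entropy between actor logits and expert action indices."""
    if len(batch_logits) != len(expert_actions):
        raise ValueError("batch size mismatch between logits and actions")
    total = 0.0
    for logits, action in zip(batch_logits, expert_actions):
        # Negative log-probability the actor assigns to the expert's action.
        total -= log_softmax(logits)[action]
    return total / len(expert_actions)


# An untrained actor with uniform logits scores log(NUM_ACTIONS) per sample.
uniform_loss = bc_loss([[0.0] * NUM_ACTIONS], [3])
```

A loss near `log(8)` therefore means the actor has learned nothing yet; BC training should push it well below that value on held-out demonstrations.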
PPO + GRPO RL Fine-tuning
After BC warm-start, the actor is fine-tuned with PPO (Proximal Policy Optimization) - and potentially GRPO (Group Relative Policy Optimization) - trained entirely inside the RSSM's imagined rollouts.
Imagination Training Loop:
1. Sample starting states from replay buffer
2. Roll out RSSM prior for H=15 steps using actor π
3. Compute rewards from reward_predictor and RAM probes
4. Compute λ-returns: V_λ(t) = r_t + γ(1-λ)v(h_{t+1},s_{t+1}) + γλV_λ(t+1), bootstrapped at the horizon with V_λ(H) = v(h_H,s_H)
5. Update Actor via policy gradient on V_λ
6. Update Critic via MSE(v(h_t,s_t), V_λ)
7. Periodically collect real environment data to refresh world model
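Step 4 of the loop above is a backward recursion over the imagined trajectory. The sketch below assumes the rollout supplies H rewards and H+1 critic values (the extra value is the bootstrap at the horizon); the function name `lambda_returns` and the list-based interface are illustrative, not part of the existing codebase.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute λ-returns for t = 0..H-1 by backward recursion.

    rewards: [r_0, ..., r_{H-1}] predicted along the imagined rollout.
    values:  [v_0, ..., v_H] critic estimates, one more entry than rewards.
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must contain exactly one more entry than rewards")
    returns = [0.0] * horizon
    next_return = values[-1]  # bootstrap: the return at the horizon is the critic value
    for t in reversed(range(horizon)):
        # r_t + gamma * ((1 - lam) * v_{t+1} + lam * V_lambda(t+1))
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


def critic_mse(value_estimates, targets):
    """Mean squared error between critic outputs and (fixed) λ-return targets."""
    return sum((v - g) ** 2 for v, g in zip(value_estimates, targets)) / len(targets)
```

Setting `lam=0` reduces each target to the one-step TD target `r_t + γ·v_{t+1}`, while `lam=1` gives the full discounted sum of imagined rewards plus the bootstrapped final value.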
Multi-Task Actor-Critic
The actor is conditioned on a goal embedding from System 2 - enabling the same policy to pursue different objectives (navigation, battle, exploration) depending on the current macro-goal.
| Component | Architecture | Description |
|---|---|---|
| Actor | MLP(1536+goal_dim → 256 → 256 → 8) | Policy network outputting action logits |
| Critic | MLP(1536+goal_dim → 256 → 256 → 1) | Value function for λ-return targets |
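As a quick sizing check, the layer widths in the table imply the parameter counts below. `goal_dim` is not specified in this document, so the value of 32 used here is purely an assumption for illustration.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected MLP with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


LATENT_DIM = 1536  # RSSM latent width from the table above
GOAL_DIM = 32      # hypothetical: the document does not fix goal_dim

actor_params = mlp_param_count([LATENT_DIM + GOAL_DIM, 256, 256, 8])
critic_params = mlp_param_count([LATENT_DIM + GOAL_DIM, 256, 256, 1])
total_mb_fp32 = (actor_params + critic_params) * 4 / 1e6

print(actor_params, critic_params, round(total_mb_fp32, 2))
```

Under that assumption both heads together come to under 4 MB in float32, so the actor and critic are small next to the world model itself.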
System 2 - Slow Strategic Planner
System 2 is a Multimodal LLM (Gemini 2.0 Flash or GPT-4o) that receives game screenshots and RAM state JSON, and produces high-level macro goals for System 1.
Interface
// System 2 Input (per planning call):
{
  "screenshot": "<base64 160×144 Game Boy frame>",
  "state": {
    "map_name": "Pallet Town",
    "x": 5, "y": 7,
    "badges": 0,
    "party_hp": [45, 0],
    "in_battle": false,
    "dialog_open": false,
    "steps_without_progress": 312
  },
  "current_goal": "Reach Viridian City"
}
// System 2 Output (macro goal):
{
  "goal": "Navigate to Route 1 entrance",
  "goal_embedding": [0.23, -0.14, ...], // embedded for System 1 conditioning
  "rationale": "Need to head north to reach Viridian City"
}
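Because System 2's output comes from an LLM, it is worth validating before it reaches System 1. The sketch below parses a comment-free version of the output JSON into a typed record. The `MacroGoal` class, the `parse_macro_goal` helper, and the fixed `embed_dim` check are assumptions for illustration; only the three field names come from the interface above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MacroGoal:
    goal: str
    goal_embedding: tuple
    rationale: str


def parse_macro_goal(raw, embed_dim):
    """Parse and validate a System 2 reply; raise ValueError on malformed output."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("planner output must be a JSON object")
    missing = {"goal", "goal_embedding", "rationale"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    embedding = data["goal_embedding"]
    if len(embedding) != embed_dim:
        raise ValueError(f"expected {embed_dim} embedding values, got {len(embedding)}")
    return MacroGoal(
        goal=str(data["goal"]),
        goal_embedding=tuple(float(x) for x in embedding),
        rationale=str(data["rationale"]),
    )


# Usage with a 2-dimensional embedding (the real dimension is not specified here).
raw_reply = json.dumps({
    "goal": "Navigate to Route 1 entrance",
    "goal_embedding": [0.23, -0.14],
    "rationale": "Need to head north to reach Viridian City",
})
macro_goal = parse_macro_goal(raw_reply, embed_dim=2)
```

Rejecting malformed replies at this boundary lets System 1 keep executing its previous goal instead of conditioning on a truncated or wrongly sized embedding.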
Replanning Trigger
System 2 is called when any of these conditions are met:
- A story progress flag advances (new badge, new area reached)
- System 1 has taken 500+ steps without story progress
- System 1 requests a goal update explicitly
- A battle ends (success or failure)
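The four conditions above reduce to a single predicate evaluated once per step. The `ReplanSignals` container and its field names are hypothetical; the 500-step threshold is the one stated in the list.

```python
from dataclasses import dataclass

STUCK_STEP_LIMIT = 500  # from the trigger list: 500+ steps without story progress


@dataclass
class ReplanSignals:
    progress_flag_advanced: bool = False  # new badge or new area reached
    steps_without_progress: int = 0
    goal_update_requested: bool = False   # explicit request from System 1
    battle_just_ended: bool = False       # win or loss


def should_replan(signals):
    """Return True if any replanning condition holds on this step."""
    return (
        signals.progress_flag_advanced
        or signals.steps_without_progress >= STUCK_STEP_LIMIT
        or signals.goal_update_requested
        or signals.battle_just_ended
    )
```

With the example state shown in the interface (`steps_without_progress` of 312 and no other signal), this predicate returns `False`, so System 1 keeps its current goal.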
Dual-Agent Integration
The two systems form a hierarchical architecture - System 2 acts as a manager that sets objectives, System 1 acts as a worker that executes them:
┌─────────────────────────────────────────────────────┐
│                   SYSTEM 2 (LLM)                    │
│  Input: screenshot + RAM JSON                       │
│  Output: macro goal (text + embedding)              │
│  Speed: ~1–5 calls/min (slow)                       │
└───────────────────┬─────────────────────────────────┘
                    │ goal embedding
                    ▼
┌─────────────────────────────────────────────────────┐
│               SYSTEM 1 (Actor-Critic)               │
│  Input: RSSM latent (h_t, s_t) + goal embedding     │
│  Output: action logits → sampled action             │
│  Speed: ~60 Hz (real-time)                          │
└───────────────────┬─────────────────────────────────┘
                    │ action
                    ▼
             PyBoy Emulator
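The manager/worker split in the diagram can be sketched as a control loop in which the planner is called rarely and the actor is called every step. All callables here (`observe`, `plan`, `act`, `step_env`, `needs_replan`) are hypothetical stand-ins for the real components, and the loop is synchronous for simplicity, whereas the plan calls for async System 2 requests.

```python
def run_hierarchical_loop(observe, plan, act, step_env, needs_replan, max_steps):
    """Run System 1 every step; call System 2 only when a replan is triggered.

    Returns the number of planner (System 2) calls made.
    """
    goal = plan(observe())  # initial macro goal from System 2
    planner_calls = 1
    for _ in range(max_steps):
        obs = observe()
        if needs_replan(obs):
            goal = plan(obs)  # slow path: refresh the macro goal
            planner_calls += 1
        step_env(act(obs, goal))  # fast path: goal-conditioned action
    return planner_calls


# Toy usage: a step counter stands in for the emulator state.
state = {"steps": 0}
calls = run_hierarchical_loop(
    observe=lambda: dict(state),
    plan=lambda obs: "goal@%d" % obs["steps"],
    act=lambda obs, goal: 0,
    step_env=lambda action: state.update(steps=state["steps"] + 1),
    needs_replan=lambda obs: obs["steps"] > 0 and obs["steps"] % 500 == 0,
    max_steps=1200,
)
```

In this toy run the planner is invoked three times across 1200 steps (once at start, then at steps 500 and 1000), which illustrates how little of the step budget System 2 consumes.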
Improved World Model
v3 requires a more capable world model than v2 to support policy training inside imagination. Key improvements:
- More data: Scale from ~16k to 100k+ transitions with diverse coverage (all gym routes, battle sequences, buildings)
- Longer training: 20–50 epochs vs. 4 epochs in v2
- More RSSM capacity: Explore det_dim=1024, larger stochastic space
- Compression target: <200MB, down from ~650MB currently - essential for practical deployment
- Better KL scheduling: Free-bit training to prevent posterior collapse
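For the last item, free bits clip the KL term from below so that the loss stops pressuring the posterior once it is already within a small budget of the prior. The sketch assumes categorical latents and a budget of 1 nat; neither is stated in this document, and the function names are illustrative.

```python
import math


def categorical_kl(posterior, prior, eps=1e-8):
    """KL(posterior || prior) in nats for two discrete distributions."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(posterior, prior)
    )


def free_bits_kl(posterior, prior, free_nats=1.0):
    """KL loss term clipped from below at free_nats (assumed budget).

    Below the threshold the term is constant, so it contributes no gradient
    and does not push the posterior further toward the prior.
    """
    return max(categorical_kl(posterior, prior), free_nats)


matched = free_bits_kl([0.5, 0.5], [0.5, 0.5])            # KL is 0, clipped up to the budget
mismatched = free_bits_kl([0.999, 0.001], [0.001, 0.999])  # large KL passes through unchanged
```

The design choice is that the world model is only penalised for KL above the budget, which leaves the latent some capacity to carry information instead of collapsing onto the prior.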
Phase 1: Behaviour Cloning
Detailed plan for the BC warm-start phase:
| Step | Task | Details |
|---|---|---|
| 1 | Collect human demos | 10–50h of expert gameplay covering all major game segments |
| 2 | Encode to RSSM latents | Pass demo frames through encoder + RSSM to get (h_t, s_t) sequences |
| 3 | BC training | Minimize cross-entropy between actor logits and expert actions |
| 4 | Validate | Test actor on PyBoy - should exhibit basic navigation and battle behavior |
Phase 2: Imagination-Based RL
The core contribution of v3 - training the actor entirely inside imagined rollouts:
| Component | Details |
|---|---|
| Rollout horizon | H = 15 imagination steps per update |
| λ-return parameters | Discount γ = 0.99, trace weight λ = 0.95 |
| Actor loss | Policy gradient on λ-returns (PPO clip) |
| Critic loss | MSE between value estimates and λ-returns |
| World model refresh | Every N training epochs, collect real transitions and retrain world model |
| Entropy bonus | Small entropy term to prevent premature policy collapse |
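The actor loss and entropy bonus rows can be combined into one PPO-style objective. In this sketch the advantage is assumed to be the λ-return minus the critic value, and `clip_eps=0.2` and `entropy_coef=0.01` are common defaults chosen for illustration; none of these are specified in the table.

```python
import math


def ppo_actor_loss(new_logp, old_logp, advantages, entropies,
                   clip_eps=0.2, entropy_coef=0.01):
    """Clipped-surrogate PPO loss with an entropy bonus, averaged over samples.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the rollout-time policy. advantages: assumed here to be
    lambda-return minus critic value. entropies: policy entropy per sample.
    """
    total = 0.0
    for lp_new, lp_old, adv, ent in zip(new_logp, old_logp, advantages, entropies):
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate objectives.
        surrogate = min(ratio * adv, clipped_ratio * adv)
        total += -surrogate - entropy_coef * ent
    return total / len(advantages)
```

Clipping matters more than usual here: the advantages come from a learned model, so bounding how far one update can move the policy also bounds how far it can exploit model errors.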
Reward Design
The reward function must balance multiple objectives for comprehensive game progress. Rewards are predicted by the world model's reward head, with sparse ground-truth signals from RAM at real-environment collection steps.
| Signal | Type | Value | Description |
|---|---|---|---|
| Δ Badges | Sparse | +10.0 | Earning a new Gym badge |
| Δ Pokédex species | Sparse | +2.0 | First time catching a species |
| Δ Unique map IDs | Moderate | +0.5 | Entering a new map location |
| Battle win | Moderate | +1.0 | Successfully winning a battle |
| XP gain | Dense | +0.01×Δ | Small dense progress signal from experience gained |
| HP lost | Penalty | -0.05×Δ | Discourage reckless behavior |
| Stuck penalty | Penalty | -0.001 | Per step with no progress signal |
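The table translates directly into a per-step reward computed from RAM deltas. The `StepDelta` container is hypothetical, and treating "no progress signal" as "every positive term is zero" is an assumption about how the stuck penalty is meant to apply.

```python
from dataclasses import dataclass


@dataclass
class StepDelta:
    """Per-step changes read from RAM (field names are illustrative)."""
    badges: int = 0
    new_species: int = 0
    new_maps: int = 0
    battles_won: int = 0
    xp_gained: float = 0.0
    hp_lost: float = 0.0


def step_reward(delta):
    """Combine the reward-table terms for a single environment step."""
    progress = (
        10.0 * delta.badges
        + 2.0 * delta.new_species
        + 0.5 * delta.new_maps
        + 1.0 * delta.battles_won
        + 0.01 * delta.xp_gained
    )
    reward = progress - 0.05 * delta.hp_lost
    if progress == 0.0:
        reward -= 0.001  # stuck penalty (assumed to apply when no positive term fired)
    return reward


idle = step_reward(StepDelta())
won_battle = step_reward(StepDelta(battles_won=1, xp_gained=100.0, hp_lost=20.0))
```

The second example shows the balance between terms: a battle win with 100 XP gained and 20 HP lost nets roughly +1.0, so winning still pays even after the damage penalty.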
Target Metrics
| Metric | v2 Achieved | v3 Target | Notes |
|---|---|---|---|
| Badges (no human intervention) | 0 (policy not trained) | ≥ 2 | Brock + Misty |
| Pokédex species caught | N/A | ≥ 10 | Early Routes |
| 50-step imagination drift | ~5–8 tiles (est.) | < 5 tiles | Better world model |
| World model size | ~650MB | < 200MB | Compression required |
| Inference speed | N/A | ≥ 60 Hz | Real-time playback |
Risks & Mitigations
| Risk | Severity | Mitigation |
|---|---|---|
| World model hallucination at long horizons | High | Cap imagination rollout to H=15; refresh world model with real data every N epochs |
| Critic overfitting to imagined rollouts | High | Alternating world-model and policy update phases; use fresh trajectories for each |
| Discrete gradient estimation instability | Medium | Temperature annealing for Gumbel-Softmax; fall back to straight-through if needed |
| LLM System 2 latency bottleneck | Low | System 2 replanning is infrequent (event-driven, or after 500+ steps without progress); use async calls |
| Reward sparsity - no badge for thousands of steps | High | Dense auxiliary rewards (XP gain, map exploration) to maintain training signal |
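For the discrete-gradient mitigation, temperature annealing means starting with a high, smooth temperature and decaying it toward a sharper one. The exponential schedule and its endpoints (1.0 down to 0.1 over 10,000 steps) are assumptions for illustration, as are the function names.

```python
import math
import random


def annealed_temperature(step, tau_start=1.0, tau_end=0.1, decay_steps=10_000):
    """Exponentially interpolate the temperature from tau_start to tau_end."""
    frac = min(step / decay_steps, 1.0)
    return tau_start * (tau_end / tau_start) ** frac


def gumbel_softmax(logits, tau, rng=random):
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]


sample = gumbel_softmax([1.0, 0.5, -0.5], tau=annealed_temperature(5_000))
```

Lower temperatures give samples closer to one-hot but with higher-variance gradients, which is why the table lists a straight-through estimator as the fallback.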