In Planning 🚀 v3.md

v3: Dual-Agent SOTA System

System 1 (PPO actor trained inside RSSM imagination) + System 2 (Multimodal LLM strategic planner). The first version where "policy trained inside imagination" becomes the explicit target - approaching DIAMOND-style pixel-fidelity world modeling.

Long-Term Vision

The eventual goal of this roadmap is something architecturally closer to DIAMOND (Diffusion As a Model Of eNvironment Dreams) - a world model where the RL policy is trained entirely inside imagined rollouts, at pixel fidelity good enough that the imagined environment is genuinely playable.

The Progression
v1's continuous latents and v2's discrete RSSM are intentionally scoped well below DIAMOND's bar - they're the steps to learn the concepts (latent dynamics, imagination-based planning, compounding-error mitigation) before attempting anything at that scale. v3 is the first point where "policy trained inside imagination" becomes the explicit target.

Everything from v1 and v2 was preparation. v3 is where research becomes game-playing: can the agent earn Gym badges, catch Pokémon, and progress through the story, all driven by a policy whose RL updates come from imagined rollouts rather than from direct interaction with the real emulator? Real emulator data is still collected periodically, but only to refresh the world model.

System 1 - Fast Reactive Controller

System 1 is a neural network policy π(a_t | h_t, s_t) - an Actor conditioned on the RSSM latent state. It executes actions at emulator speed and is responsible for low-level reactive control.

Behaviour Cloning Warm-Start

Before RL fine-tuning, the actor is pre-trained on 10–50h of curated human demonstrations via Behaviour Cloning (BC). This gives the policy a reasonable prior over useful action sequences before it starts exploring.

```text
BC Phase:
  Input:  (h_t, s_t) ← RSSM latent at time t
  Target: expert action a_t from human gameplay
  Loss:   cross-entropy(actor_logits, expert_actions)
  Data:   10–50h curated human demonstrations
```
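As a minimal sketch of the BC loss above, the snippet below computes mean cross-entropy between actor logits and expert action indices in plain Python. The function names and the toy inputs are illustrative assumptions; a real implementation would most likely use a deep-learning framework's built-in cross-entropy on batched tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def bc_loss(batch_logits, expert_actions):
    """Mean cross-entropy between actor logits and expert action indices."""
    total = 0.0
    for logits, action in zip(batch_logits, expert_actions):
        total -= log_softmax(logits)[action]
    return total / len(expert_actions)

# Toy check over 8 actions (8 matches the actor's output width).
uniform = [0.0] * 8            # untrained actor: no preference
confident = [0.0] * 8
confident[3] = 10.0            # actor strongly prefers the expert's action 3

loss_uniform = bc_loss([uniform], [3])      # equals ln(8)
loss_confident = bc_loss([confident], [3])  # close to zero
```

A uniform policy pays exactly ln(8) per sample, which is a useful reference value when reading early BC training curves.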

PPO + GRPO RL Fine-tuning

After BC warm-start, the actor is fine-tuned with PPO (Proximal Policy Optimization) - and potentially GRPO (Group Relative Policy Optimization) - trained entirely inside the RSSM's imagined rollouts.

```text
Imagination Training Loop:
  1. Sample starting states from replay buffer
  2. Roll out RSSM prior for H=15 steps using actor π
  3. Compute rewards from reward_predictor and RAM probes
  4. Compute λ-returns: V^λ_t = r_t + γ[(1-λ)·v(h_{t+1},s_{t+1}) + λ·V^λ_{t+1}], with V^λ_H = v(h_H,s_H)
  5. Update Actor via policy gradient on V^λ_t
  6. Update Critic via MSE(v(h_t,s_t), V^λ_t)
  7. Periodically collect real environment data to refresh world model
```
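The λ-return recursion in step 4 can be sketched as a single backward pass over an imagined trajectory. This is an illustration under stated assumptions: `rewards` holds the H predicted rewards, `values` holds H+1 critic estimates (the last one is the bootstrap value), and the function name is hypothetical.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one imagined rollout.

    rewards: [r_0, ..., r_{H-1}]
    values:  [v_0, ..., v_H]  (one extra entry used to bootstrap)
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    returns = [0.0] * horizon
    next_return = values[horizon]  # bootstrap from the critic at the final state
    for t in reversed(range(horizon)):
        # r_t + gamma * ((1 - lam) * v_{t+1} + lam * return_{t+1})
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns

# lam = 0 reduces to one-step TD targets; lam = 1 to discounted Monte Carlo
# returns plus a discounted bootstrap from the last value.
td_targets = lambda_returns([1.0, 2.0], [10.0, 20.0, 30.0], gamma=0.5, lam=0.0)
mc_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

The two limiting cases make the role of λ concrete: it interpolates between trusting the critic after one step and trusting the imagined rewards over the whole horizon.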

Multi-Task Actor-Critic

The actor is conditioned on a goal embedding from System 2 - enabling the same policy to pursue different objectives (navigation, battle, exploration) depending on the current macro-goal.

| Component | Architecture | Description |
| --- | --- | --- |
| Actor | MLP(1536+goal_dim → 256 → 256 → 8) | Policy network outputting action logits |
| Critic | MLP(1536+goal_dim → 256 → 256 → 1) | Value function for λ-return targets |
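For a sense of scale, the sketch below counts the weights and biases of the two MLPs, assuming the goal embedding is simply concatenated to the 1536-dimensional latent. `GOAL_DIM = 64` is a purely hypothetical value, since the source does not fix `goal_dim`.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected MLP with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

LATENT_DIM = 1536   # RSSM latent width from the table
GOAL_DIM = 64       # hypothetical; not specified in the source

actor_params = mlp_param_count([LATENT_DIM + GOAL_DIM, 256, 256, 8])
critic_params = mlp_param_count([LATENT_DIM + GOAL_DIM, 256, 256, 1])
```

Under that assumption each head has under half a million parameters, so the actor and critic are small next to the world model.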

System 2 - Slow Strategic Planner

System 2 is a Multimodal LLM (Gemini 2.0 Flash or GPT-4o) that receives game screenshots and RAM state JSON, and produces high-level macro goals for System 1.

Interface

```json
// System 2 Input (per planning call):
{
  "screenshot": "<base64 160×144 Game Boy frame>",
  "state": {
    "map_name": "Pallet Town",
    "x": 5, "y": 7,
    "badges": 0,
    "party_hp": [45, 0],
    "in_battle": false,
    "dialog_open": false,
    "steps_without_progress": 312
  },
  "current_goal": "Reach Viridian City"
}

// System 2 Output (macro goal):
{
  "goal": "Navigate to Route 1 entrance",
  "goal_embedding": [0.23, -0.14, ...],  // embedded for System 1 conditioning
  "rationale": "Need to head north to reach Viridian City"
}
```
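One way to handle this interface on the Python side is sketched below using only the standard library. The helper names (`build_planner_input`, `parse_planner_output`, `MacroGoal`) are hypothetical, the frame is a zero-filled placeholder rather than a real encoded screenshot, and the LLM call itself is omitted.

```python
import base64
import json
from dataclasses import dataclass

@dataclass
class MacroGoal:
    goal: str
    goal_embedding: list
    rationale: str

def build_planner_input(frame_bytes, state, current_goal):
    """Serialize one planning request in the shape shown above."""
    return json.dumps({
        "screenshot": base64.b64encode(frame_bytes).decode("ascii"),
        "state": state,
        "current_goal": current_goal,
    })

def parse_planner_output(raw):
    """Parse and lightly validate the planner's JSON reply."""
    data = json.loads(raw)
    return MacroGoal(
        goal=str(data["goal"]),
        goal_embedding=[float(x) for x in data["goal_embedding"]],
        rationale=str(data.get("rationale", "")),
    )

# Placeholder frame: one byte per pixel of a 160x144 screen.
request = build_planner_input(
    bytes(160 * 144),
    {"map_name": "Pallet Town", "badges": 0, "steps_without_progress": 312},
    "Reach Viridian City",
)
reply = ('{"goal": "Navigate to Route 1 entrance", '
         '"goal_embedding": [0.23, -0.14], '
         '"rationale": "Need to head north to reach Viridian City"}')
macro_goal = parse_planner_output(reply)
```

Parsing into a typed object at the boundary means a malformed reply raises immediately, before a bad goal embedding can reach System 1.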

Replanning Trigger

System 2 is called when any of these conditions are met:

Dual-Agent Integration

The two systems form a hierarchical architecture - System 2 acts as a manager that sets objectives, System 1 acts as a worker that executes them:

```text
┌─────────────────────────────────────────────────────┐
│                   SYSTEM 2 (LLM)                    │
│  Input: screenshot + RAM JSON                       │
│  Output: macro goal (text + embedding)              │
│  Speed: ~1–5 calls/min (slow)                       │
└───────────────────┬─────────────────────────────────┘
                    │ goal embedding
                    ▼
┌─────────────────────────────────────────────────────┐
│               SYSTEM 1 (Actor-Critic)               │
│  Input: RSSM latent (h_t, s_t) + goal embedding     │
│  Output: action logits → sampled action             │
│  Speed: ~60 Hz (real-time)                          │
└───────────────────┬─────────────────────────────────┘
                    │ action
                    ▼
              PyBoy Emulator
```
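The manager/worker control flow in the diagram can be sketched with stubs in place of the LLM, the actor, and the emulator. Everything here is a stand-in: `stub_planner` and `stub_actor` are hypothetical, and the fixed 500-step replanning interval is taken from the risk table rather than from a full trigger specification.

```python
import random

REPLAN_INTERVAL = 500  # steps between System 2 calls (value from the risk table)

def stub_planner(step):
    """Stand-in for the LLM call; returns (goal text, goal embedding)."""
    index = step // REPLAN_INTERVAL
    return f"goal-{index}", [float(index)]

def stub_actor(latent, goal_embedding, rng):
    """Stand-in for System 1; samples one of 8 actions uniformly at random."""
    return rng.randrange(8)

def run_episode(num_steps, seed=0):
    rng = random.Random(seed)
    planner_calls = 0
    actions = []
    goal_embedding = None
    for step in range(num_steps):
        if step % REPLAN_INTERVAL == 0:
            # Slow path: ask System 2 for a fresh macro goal.
            _, goal_embedding = stub_planner(step)
            planner_calls += 1
        latent = [0.0]  # placeholder for the RSSM latent (h_t, s_t)
        # Fast path: System 1 acts every step under the current goal.
        actions.append(stub_actor(latent, goal_embedding, rng))
    return planner_calls, actions

calls, actions = run_episode(1200)
```

In a real system the planner call would run asynchronously so that System 1 keeps acting on the previous goal while the LLM responds.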

Improved World Model

v3 requires a more capable world model than v2 to support policy training inside imagination. Key improvements:

Phase 1: Behaviour Cloning

Detailed plan for the BC warm-start phase:

| Step | Task | Details |
| --- | --- | --- |
| 1 | Collect human demos | 10–50h of expert gameplay covering all major game segments |
| 2 | Encode to RSSM latents | Pass demo frames through encoder + RSSM to get (h_t, s_t) sequences |
| 3 | BC training | Minimize cross-entropy between actor logits and expert actions |
| 4 | Validate | Test actor on PyBoy - should exhibit basic navigation and battle behavior |

Phase 2: Imagination-Based RL

The core contribution of v3 - training the actor entirely inside imagined rollouts:

| Component | Details |
| --- | --- |
| Rollout horizon | H = 15 imagination steps per update |
| λ-return discount | γ = 0.99, λ = 0.95 |
| Actor loss | Policy gradient on λ-returns (PPO clip) |
| Critic loss | MSE between value estimates and λ-returns |
| World model refresh | Every N training epochs, collect real transitions and retrain world model |
| Entropy bonus | Small entropy term to prevent premature policy collapse |
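The actor loss row (PPO clip plus an entropy bonus) can be illustrated with the standard clipped surrogate objective. This is a sketch, not the project's implementation. It assumes advantages are formed as λ-return minus critic value, and `clip_eps=0.2` and `entropy_coef=0.01` are common defaults chosen here for illustration, not values from the source.

```python
import math

def ppo_actor_loss(new_logp, old_logp, advantages, entropies,
                   clip_eps=0.2, entropy_coef=0.01):
    """Mean PPO clipped-surrogate loss with an entropy bonus (to be minimized)."""
    total = 0.0
    for lp_new, lp_old, adv, ent in zip(new_logp, old_logp, advantages, entropies):
        ratio = math.exp(lp_new - lp_old)                  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)        # pessimistic bound
        total += -surrogate - entropy_coef * ent
    return total / len(advantages)

# Unchanged policy, positive advantage: loss is just the negated advantage.
loss_same = ppo_actor_loss([0.0], [0.0], [1.0], [0.0])
# Policy doubled the action's probability: the gain is clipped at 1 + clip_eps.
loss_clipped = ppo_actor_loss([math.log(2.0)], [0.0], [1.0], [0.0])
```

The clip removes any incentive to move the policy far from the one that generated the imagined rollout, which matters when that rollout comes from an imperfect world model.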

Reward Design

The reward function must balance multiple objectives for comprehensive game progress. Rewards are predicted by the world model's reward head, with sparse ground-truth signals from RAM at real-environment collection steps.

| Signal | Type | Value | Description |
| --- | --- | --- | --- |
| Δ Badges | Sparse | +10.0 | Earning a new Gym badge |
| Δ Pokédex species | Sparse | +2.0 | First time catching a species |
| Δ Unique map IDs | Moderate | +0.5 | Entering a new map location |
| Battle win | Moderate | +1.0 | Successfully winning a battle |
| XP gain | Dense | +0.01×Δ | Small continuous exploration signal |
| HP lost | Penalty | -0.05×Δ | Discourage reckless behavior |
| Stuck penalty | Penalty | -0.001 | Per step with no progress signal |
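A ground-truth reward for one real-environment step could be computed from two consecutive RAM snapshots roughly as follows. The coefficients come from the table. The dictionary keys (other than `badges`) are hypothetical, and treating "no progress" as "no positive signal fired this step" is one possible reading of the stuck penalty.

```python
def step_reward(prev, curr):
    """Reward for one transition, given RAM-derived state before and after."""
    progress = (
        10.0 * (curr["badges"] - prev["badges"])
        + 2.0 * (curr["pokedex_species"] - prev["pokedex_species"])
        + 0.5 * (curr["unique_maps"] - prev["unique_maps"])
        + 1.0 * (curr["battles_won"] - prev["battles_won"])
        + 0.01 * max(0, curr["xp"] - prev["xp"])
    )
    hp_penalty = 0.05 * max(0, prev["hp"] - curr["hp"])
    # Assumption: the stuck penalty applies whenever no positive signal fired.
    stuck_penalty = 0.001 if progress == 0 else 0.0
    return progress - hp_penalty - stuck_penalty

base = {"badges": 0, "pokedex_species": 1, "unique_maps": 3,
        "battles_won": 0, "xp": 100, "hp": 40}

reward_idle = step_reward(base, dict(base))                     # nothing changed
reward_badge = step_reward(base, dict(base, badges=1))          # new Gym badge
reward_fight = step_reward(base, dict(base, xp=200, hp=30))     # XP gained, HP lost
```

Keeping each term as a delta between snapshots means the same function works regardless of how many emulator frames separate the two reads.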

Target Metrics

| Metric | v2 Achieved | v3 Target | Notes |
| --- | --- | --- | --- |
| Badges (no human intervention) | 0 (policy not trained) | ≥ 2 | Brock + Misty |
| Pokédex species caught | N/A | ≥ 10 | Early Routes |
| 50-step imagination drift | ~5–8 tiles (est.) | < 5 tiles | Better world model |
| World model size | ~650 MB | < 200 MB | Compression required |
| Inference speed | N/A | ≥ 60 Hz | Real-time playback |

Risks & Mitigations

| Risk | Severity | Mitigation |
| --- | --- | --- |
| World model hallucination at long horizons | High | Cap imagination rollout to H=15; refresh world model with real data every N epochs |
| Critic overfitting to imagined rollouts | High | Alternating world-model and policy update phases; use fresh trajectories for each |
| Discrete gradient estimation instability | Medium | Temperature annealing for Gumbel-Softmax; fall back to straight-through if needed |
| LLM System 2 latency bottleneck | Low | S2 replanning is infrequent (every 500 steps); use async calls |
| Reward sparsity - no badge for thousands of steps | High | Dense auxiliary rewards (XP gain, map exploration) to maintain training signal |
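The Gumbel-Softmax mitigation can be illustrated with a relaxed categorical sample whose temperature decays over training. The exponential schedule and its constants (`start=1.0`, `end=0.1`, `decay_steps=10_000`) are hypothetical choices for this sketch, and the straight-through fallback is not shown.

```python
import math
import random

def annealed_temperature(step, start=1.0, end=0.1, decay_steps=10_000):
    """Exponentially interpolate the temperature from start to end."""
    frac = min(step / decay_steps, 1.0)
    return start * (end / start) ** frac

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample: softmax((logits + Gumbel noise) / T)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)           # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)                              # stabilize the softmax
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
temperature = annealed_temperature(5_000)
sample = gumbel_softmax_sample([0.0] * 8, temperature, rng)
```

Higher temperatures give smoother, lower-variance gradients early on, while lower temperatures push samples toward one-hot vectors later in training.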