PokéDreamer v3 - Dual-Agent Roadmap

Long-Term Vision

The eventual goal of this roadmap is something architecturally closer to DIAMOND (DIffusion As a Model Of eNvironment Dreams) - a world model where the RL policy is trained entirely inside imagined rollouts, at a pixel fidelity high enough that the imagined environment is itself playable.

The Progression

v1's continuous latents and v2's discrete RSSM are intentionally scoped well below DIAMOND's bar - they're the steps to learn the concepts (latent dynamics, imagination-based planning, compounding-error mitigation) before attempting anything at that scale. v3 is the first point where "policy trained inside imagination" becomes the explicit target.

Everything from v1 and v2 was preparation. v3 is where research becomes game-playing: can the agent earn Gym badges, catch Pokémon, and progress through the story - all driven by a policy whose RL updates come entirely from imagined rollouts, with the real emulator used only to collect data for the world model?

System 1 - Fast Reactive Controller

System 1 is a neural network policy π(a_t | h_t, s_t) - an Actor conditioned on the RSSM latent state. It executes actions at emulator speed and is responsible for low-level reactive control.

Behaviour Cloning Warm-Start

Before RL fine-tuning, the actor is pre-trained on 10–50h of curated human demonstrations via Behaviour Cloning (BC). This gives the policy a reasonable prior over useful action sequences before it starts exploring.

text

BC Phase:
  Input:  (h_t, s_t) ← RSSM latent at time t
  Target: expert action a_t from human gameplay
  Loss:   cross-entropy(actor_logits, expert_actions)
  Data:   10–50h curated human demonstrations
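The BC objective above can be sketched in a few lines. This is a minimal, framework-free illustration and not the project's training code: the function names `log_softmax` and `bc_loss` are hypothetical, and the logits are hand-written stand-ins for what the actor would output from (h_t, s_t).

```python
import math

NUM_ACTIONS = 8  # matches the 8-way actor output described in this roadmap


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bc_loss(batch_logits, expert_actions):
    """Mean cross-entropy between actor logits and expert action indices."""
    total = 0.0
    for logits, action in zip(batch_logits, expert_actions):
        total -= log_softmax(logits)[action]
    return total / len(expert_actions)


# Hypothetical batch of two time steps with expert actions 3 and 1.
expert = [3, 1]
uniform = [[0.0] * NUM_ACTIONS for _ in expert]
confident = [[5.0 if i == a else 0.0 for i in range(NUM_ACTIONS)] for a in expert]

loss_uniform = bc_loss(uniform, expert)      # log(8) for an untrained, uniform policy
loss_confident = bc_loss(confident, expert)  # lower, since logits favour the expert action
```

A uniform policy over 8 actions scores log(8) ≈ 2.08 nats, which gives a useful baseline when reading BC training curves.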

PPO + GRPO RL Fine-tuning

After the BC warm-start, the actor is fine-tuned with PPO (Proximal Policy Optimization), and potentially GRPO (Group Relative Policy Optimization), entirely inside the RSSM's imagined rollouts.

text

Imagination Training Loop:
  1. Sample starting states from replay buffer
  2. Roll out RSSM prior for H=15 steps using actor π
  3. Compute rewards from reward_predictor and RAM probes
  4. Compute λ-returns backward from t=H-1: V_λ(t) = r_t + γ[(1-λ)·v(h_{t+1},s_{t+1}) + λ·V_λ(t+1)], with V_λ(H) = v(h_H,s_H)
  5. Update Actor via policy gradient on V_λ
  6. Update Critic via MSE(v(h_t,s_t), V_λ)
  7. Periodically collect real environment data to refresh world model
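Step 4 of the loop is a backward recursion over the imagined trajectory. The sketch below assumes the critic supplies one extra bootstrap value at the horizon; the function name `lambda_returns` and the list-based interface are illustrative, not taken from the codebase.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one imagined rollout.

    rewards: r_0 .. r_{H-1} from the reward predictor.
    values:  v_0 .. v_H from the critic (one extra entry for bootstrapping).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must contain exactly one more entry than rewards")
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # at the horizon the return falls back to the critic value
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


# lam=1 reduces to a discounted sum with a bootstrap at the horizon.
mc_like = lambda_returns([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
# lam=0 reduces to one-step TD targets r_t + gamma * v_{t+1}.
td_like = lambda_returns([1.0, 1.0], [0.0, 2.0, 4.0], gamma=0.5, lam=0.0)
```

The two limiting cases are a quick sanity check: λ=1 trusts the imagined rewards all the way to the horizon, while λ=0 trusts only the critic's next-step estimate.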

Multi-Task Actor-Critic

The actor is conditioned on a goal embedding from System 2 - enabling the same policy to pursue different objectives (navigation, battle, exploration) depending on the current macro-goal.

Component	Architecture	Description
Actor	MLP(1536+goal_dim → 256 → 256 → 8)	Policy network outputting action logits
Critic	MLP(1536+goal_dim → 256 → 256 → 1)	Value function for λ-return targets
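To get a feel for the size of these heads, the layer widths in the table can be turned into parameter counts. The roadmap does not fix `goal_dim`, so the value 32 below is purely an assumption for illustration, and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected stack of the given widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


LATENT_DIM = 1536  # RSSM latent width from the table above
GOAL_DIM = 32      # assumed; the roadmap leaves goal_dim unspecified

actor_params = mlp_param_count([LATENT_DIM + GOAL_DIM, 256, 256, 8])   # 469512
critic_params = mlp_param_count([LATENT_DIM + GOAL_DIM, 256, 256, 1])  # 467713
```

Under that assumption both heads stay below half a million parameters each, so nearly all of the model-size budget goes to the world model rather than the policy.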

System 2 - Slow Strategic Planner

System 2 is a Multimodal LLM (Gemini 2.0 Flash or GPT-4o) that receives game screenshots and RAM state JSON, and produces high-level macro goals for System 1.

Interface

json

// System 2 Input (per planning call):
{
  "screenshot": "<base64 160×144 Game Boy frame>",
  "state": {
    "map_name": "Pallet Town",
    "x": 5, "y": 7,
    "badges": 0,
    "party_hp": [45, 0],
    "in_battle": false,
    "dialog_open": false,
    "steps_without_progress": 312
  },
  "current_goal": "Reach Viridian City"
}

// System 2 Output (macro goal):
{
  "goal": "Navigate to Route 1 entrance",
  "goal_embedding": [0.23, -0.14, ...],  // embedded for System 1 conditioning
  "rationale": "Need to head north to reach Viridian City"
}
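Because LLM output is not guaranteed to match this schema, it may be worth validating each reply before it reaches System 1. The following is a sketch under assumptions: `MacroGoal` and `parse_macro_goal` are hypothetical names, the reply is assumed to be strict JSON (without the `//` comments used above for annotation), and the 2-dimensional embedding is a placeholder for whatever dimensionality the goal encoder ends up using.

```python
import json
from dataclasses import dataclass


@dataclass
class MacroGoal:
    goal: str
    goal_embedding: list
    rationale: str


def parse_macro_goal(raw, expected_dim):
    """Parse and validate one System 2 reply; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = {"goal", "goal_embedding", "rationale"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    embedding = [float(x) for x in data["goal_embedding"]]
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} embedding values, got {len(embedding)}")
    return MacroGoal(str(data["goal"]), embedding, str(data["rationale"]))


# Hypothetical reply with a 2-dimensional embedding, for illustration only.
raw_reply = (
    '{"goal": "Navigate to Route 1 entrance", '
    '"goal_embedding": [0.23, -0.14], '
    '"rationale": "Need to head north to reach Viridian City"}'
)
macro_goal = parse_macro_goal(raw_reply, expected_dim=2)
```

On a `ValueError`, one reasonable choice is to keep the previous goal active and retry the planning call, so a malformed reply never stalls the controller.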

Replanning Trigger

System 2 is called when any of these conditions is met:

A story progress flag advances (new badge, new area reached)
System 1 has taken 500+ steps without story progress
System 1 requests a goal update explicitly
A battle ends (success or failure)
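The four triggers combine with a simple logical OR. A minimal sketch, assuming the emulator wrapper exposes these signals each step; the `ReplanSignals` container and its field names are hypothetical.

```python
from dataclasses import dataclass

STUCK_STEP_LIMIT = 500  # from the "500+ steps without story progress" rule


@dataclass
class ReplanSignals:
    progress_flag_advanced: bool = False  # new badge or new area reached
    steps_without_progress: int = 0
    goal_update_requested: bool = False   # explicit request from System 1
    battle_just_ended: bool = False       # win or loss


def should_replan(signals):
    """Return True when any replanning condition holds."""
    return (
        signals.progress_flag_advanced
        or signals.steps_without_progress >= STUCK_STEP_LIMIT
        or signals.goal_update_requested
        or signals.battle_just_ended
    )


idle = should_replan(ReplanSignals(steps_without_progress=312))
stuck = should_replan(ReplanSignals(steps_without_progress=500))
after_battle = should_replan(ReplanSignals(battle_just_ended=True))
```

With the example state shown in the interface (312 steps without progress), no replanning call would be issued yet.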

Dual-Agent Integration

The two systems form a hierarchical architecture: System 2 acts as a manager that sets objectives, and System 1 acts as a worker that executes them.

text

┌─────────────────────────────────────────────────────┐
│                   SYSTEM 2 (LLM)                    │
│  Input: screenshot + RAM JSON                        │
│  Output: macro goal (text + embedding)               │
│  Speed: ~1–5 calls/min (slow)                        │
└───────────────────┬─────────────────────────────────┘
                    │ goal embedding
                    ▼
┌─────────────────────────────────────────────────────┐
│               SYSTEM 1 (Actor-Critic)               │
│  Input: RSSM latent (h_t, s_t) + goal embedding     │
│  Output: action logits → sampled action              │
│  Speed: ~60 Hz (real-time)                           │
└───────────────────┬─────────────────────────────────┘
                    │ action
                    ▼
              PyBoy Emulator
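The manager/worker split in the diagram can be sketched as a control loop in which the planner runs rarely and the actor runs every step. Everything here is a stand-in: `stub_planner` and `stub_actor` are hypothetical placeholders for the LLM call and the trained policy, the latent is a dummy vector, and a fixed replanning interval substitutes for the event-based triggers.

```python
import random

NUM_ACTIONS = 8


def stub_planner(state):
    """Hypothetical stand-in for the System 2 LLM call."""
    return {"goal": f"goal issued at step {state['step']}", "goal_embedding": [0.0, 1.0]}


def stub_actor(latent, goal_embedding, rng):
    """Hypothetical stand-in for System 1; samples a uniformly random action."""
    return rng.randrange(NUM_ACTIONS)


def run_episode(num_steps, replan_every, seed=0):
    rng = random.Random(seed)
    goal = None
    planner_calls = 0
    actions = []
    for step in range(num_steps):
        # Manager: refresh the macro goal only occasionally.
        if goal is None or step % replan_every == 0:
            goal = stub_planner({"step": step})
            planner_calls += 1
        # Worker: act every step, conditioned on the current goal embedding.
        latent = [0.0] * 4  # placeholder for the RSSM latent (h_t, s_t)
        actions.append(stub_actor(latent, goal["goal_embedding"], rng))
    return planner_calls, actions


planner_calls, actions = run_episode(num_steps=1500, replan_every=500)
```

In this run the planner is consulted 3 times across 1500 actor steps, which is the ratio that keeps LLM latency off the critical path.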

Improved World Model

v3 requires a more capable world model than v2 to support policy training inside imagination. Key improvements:

More data: Scale from ~16k to 100k+ transitions with diverse coverage (all gym routes, battle sequences, buildings)
Longer training: 20–50 epochs vs. 4 epochs in v2
More RSSM capacity: Explore det_dim=1024, larger stochastic space
Compression target: <200MB, down from ~650MB currently - essential for practical deployment
Better KL scheduling: Free-bit training to prevent posterior collapse
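Free-bit training, as commonly used in Dreamer-style models, clamps the KL term from below so the optimizer gains nothing by pushing it under a threshold. The sketch below assumes categorical latents and a threshold of 1 nat; both the threshold and the helper names are assumptions, not values taken from the v2 code.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) in nats for two categorical distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def free_bits_kl(posteriors, priors, free_nats=1.0):
    """Sum the KL over all latent variables, then clamp it from below.

    Once the summed KL is under free_nats the loss is constant, so its
    gradient vanishes and the posterior is not pushed further toward the prior.
    """
    kl = sum(categorical_kl(p, q) for p, q in zip(posteriors, priors))
    return max(kl, free_nats)


matched = free_bits_kl([[0.5, 0.5]], [[0.5, 0.5]])      # raw KL is 0, clamped to 1.0
diverged = free_bits_kl([[0.99, 0.01]], [[0.01, 0.99]])  # raw KL is about 4.5, unchanged
```

The threshold trades off latent informativeness against prior accuracy, so it is worth sweeping rather than fixing in advance.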

Phase 1: Behaviour Cloning

Detailed plan for the BC warm-start phase:

Step	Task	Details
1	Collect human demos	10–50h of expert gameplay covering all major game segments
2	Encode to RSSM latents	Pass demo frames through encoder + RSSM to get (h_t, s_t) sequences
3	BC training	Minimize cross-entropy between actor logits and expert actions
4	Validate	Test actor on PyBoy - should exhibit basic navigation and battle behavior

Phase 2: Imagination-Based RL

The core contribution of v3 - training the actor entirely inside imagined rollouts:

Component	Details
Rollout horizon	H = 15 imagination steps per update
λ-return parameters	γ = 0.99 (discount), λ = 0.95 (return mixing)
Actor loss	Policy gradient on λ-returns (PPO clip)
Critic loss	MSE between value estimates and λ-returns
World model refresh	Every N training epochs, collect real transitions and retrain world model
Entropy bonus	Small entropy term to prevent premature policy collapse
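The actor loss and entropy bonus rows combine into the standard PPO clipped surrogate. This sketch assumes advantages are formed as the λ-return minus the critic value; `clip_eps=0.2` and `entropy_coef=0.01` are common defaults chosen for illustration, not settings from this roadmap.

```python
import math


def ppo_actor_loss(new_logp, old_logp, advantages, entropies,
                   clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate with an entropy bonus, averaged over the batch."""
    n = len(advantages)
    surrogate = 0.0
    for new_lp, old_lp, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two objectives.
        surrogate += min(ratio * adv, clipped * adv)
    mean_entropy = sum(entropies) / n
    # Negated because optimizers minimize; entropy is rewarded.
    return -(surrogate / n) - entropy_coef * mean_entropy


unchanged = ppo_actor_loss([0.0], [0.0], [1.0], [0.0])  # ratio 1 gives loss -1.0
capped = ppo_actor_loss([1.0], [0.0], [1.0], [0.0])     # ratio e is clipped to 1.2
penalised = ppo_actor_loss([1.0], [0.0], [-1.0], [0.0]) # negative advantage is not clipped
```

The asymmetry in the last two cases is the point of the clip: gains from large policy shifts are capped, while losses from them are felt in full.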

Reward Design

The reward function must balance multiple objectives for comprehensive game progress. Rewards are predicted by the world model's reward head, with sparse ground-truth signals from RAM at real-environment collection steps.

Signal	Type	Value	Description
Δ Badges	Sparse	+10.0	Earning a new Gym badge
Δ Pokédex species	Sparse	+2.0	First time catching a species
Δ Unique map IDs	Moderate	+0.5	Entering a new map location
Battle win	Moderate	+1.0	Successfully winning a battle
XP gain	Dense	+0.01×Δ	Small continuous signal from battling and levelling up
HP lost	Penalty	-0.05×Δ	Discourage reckless behavior
Stuck penalty	Penalty	-0.001	Per step with no progress signal
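On real-environment steps, the table translates directly into a function of two consecutive RAM snapshots. The `RamSnapshot` fields are hypothetical names for counters the RAM reader would need to provide, and treating "no progress" as "no positive term fired this step" is one possible reading of the stuck penalty.

```python
from dataclasses import dataclass


@dataclass
class RamSnapshot:
    badges: int
    pokedex_caught: int
    unique_maps: int
    battles_won: int
    total_xp: int
    party_hp: int


def step_reward(prev, curr):
    """Ground-truth reward for one real transition, using the table's weights."""
    progress = (
        10.0 * max(curr.badges - prev.badges, 0)
        + 2.0 * max(curr.pokedex_caught - prev.pokedex_caught, 0)
        + 0.5 * max(curr.unique_maps - prev.unique_maps, 0)
        + 1.0 * max(curr.battles_won - prev.battles_won, 0)
        + 0.01 * max(curr.total_xp - prev.total_xp, 0)
    )
    hp_penalty = 0.05 * max(prev.party_hp - curr.party_hp, 0)
    stuck_penalty = 0.001 if progress == 0.0 else 0.0
    return progress - hp_penalty - stuck_penalty


before = RamSnapshot(badges=0, pokedex_caught=1, unique_maps=3,
                     battles_won=2, total_xp=100, party_hp=45)
new_map = RamSnapshot(badges=0, pokedex_caught=1, unique_maps=4,
                      battles_won=2, total_xp=120, party_hp=45)
hurt = RamSnapshot(badges=0, pokedex_caught=1, unique_maps=3,
                   battles_won=2, total_xp=100, party_hp=35)

reward_new_map = step_reward(before, new_map)  # 0.5 for the map plus 0.2 for 20 XP
reward_idle = step_reward(before, before)      # only the stuck penalty applies
reward_hurt = step_reward(before, hurt)        # HP penalty plus the stuck penalty
```

These ground-truth values would serve as regression targets for the world model's reward head, which is what supplies rewards inside imagination.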

Target Metrics

Metric	v2 Achieved	v3 Target	Notes
Badges (no human intervention)	0 (policy not trained)	≥ 2	Brock + Misty
Pokédex species caught	N/A	≥ 10	Early Routes
50-step imagination drift	~5–8 tiles (est.)	< 5 tiles	Better world model
World model size	~650MB	< 200MB	Compression required
Inference speed	N/A	≥ 60 Hz	Real-time playback

Risks & Mitigations

Risk	Severity	Mitigation
World model hallucination at long horizons	High	Cap imagination rollout to H=15; refresh world model with real data every N epochs
Critic overfitting to imagined rollouts	High	Alternating world-model and policy update phases; use fresh trajectories for each
Discrete gradient estimation instability	Medium	Temperature annealing for Gumbel-Softmax; fall back to straight-through if needed
LLM System 2 latency bottleneck	Low	S2 replanning is infrequent (event-driven, or after 500+ steps without progress); use async calls
Reward sparsity - no badge for thousands of steps	High	Dense auxiliary rewards (XP gain, map exploration) to maintain training signal
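The Gumbel-Softmax mitigation relies on lowering the temperature over training so that relaxed samples approach one-hot vectors. A minimal sketch, assuming an exponential schedule from 1.0 to 0.1; the schedule shape, its endpoints, and the function names are illustrative assumptions.

```python
import math
import random


def annealed_temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Exponentially decay the temperature from t_start to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed categorical sample; lower temperature is closer to one-hot."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]


logits = [1.0, 0.5, -0.5, 0.0]
# Identical seeds give identical Gumbel noise, isolating the effect of temperature.
soft = gumbel_softmax_sample(logits, annealed_temperature(0, 1000), random.Random(0))
sharp = gumbel_softmax_sample(logits, annealed_temperature(1000, 1000), random.Random(0))
```

Comparing `soft` and `sharp` shows the trade-off: high temperatures give smooth but biased gradients, while low temperatures give near-discrete samples with higher gradient variance.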
