Build a Pixel-Based Embodied Agent with Latent MPC
Implement a lightweight VLA-style agent that perceives raw pixels, predicts futures with a learned world model, and plans with MPC, all in PyTorch and NumPy with no external renderers needed.
Pure NumPy Rendering Enables Dependency-Free Pixel Simulations
Create a grid world where the agent sees raw RGB frames, mimicking real embodied settings without Pillow or game engines. The GridWorldRGBNoPIL class renders an 8x8 grid (112x112 pixels) with agent (blue), goal (green), obstacles (red), and gridlines on a light background. Positions spawn randomly with margins to avoid edges.
Key method: _render_u8() fills a uint8 NumPy array with the background color, overlays gridlines every cell_px = 14 pixels, then paints cells (here s = cell_px; the 1-pixel inset preserves the gridlines):
def paint_cell(x, y, color):
    y0, y1 = y * s, (y + 1) * s
    x0, x1 = x * s, (x + 1) * s
    img[y0 + 1:y1 - 1, x0 + 1:x1 - 1] = color
Actions: up (0), down (1), left (2), right (3), stay (4). step() computes a Manhattan-distance shaping reward (0.1 * distance decrease, plus 1.0 on reaching the goal) with bounds and obstacle checks. The state vector normalizes positions to [0, 1].
What this setup teaches: raw pixels force the model to learn perception end to end, avoiding symbolic shortcuts. Trade-off: the fixed size limits scalability, but it is ideal for prototyping MPC.
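A minimal sketch of this rendering approach, assuming cell_px = 14 and an 8x8 grid (the helper name and color values here are illustrative, not the exact class internals):

```python
import numpy as np

GRID, CELL = 8, 14                      # 8x8 cells, 14 px each -> 112x112 image
BG, LINE = (235, 235, 235), (200, 200, 200)

def render(agent, goal, obstacles):
    """Render the grid world to a uint8 RGB frame with pure NumPy."""
    img = np.full((GRID * CELL, GRID * CELL, 3), BG, dtype=np.uint8)
    for k in range(0, GRID * CELL, CELL):          # gridlines every CELL pixels
        img[k, :] = LINE
        img[:, k] = LINE

    def paint_cell(xy, color):
        x, y = xy
        y0, x0 = y * CELL, x * CELL
        img[y0 + 1:y0 + CELL - 1, x0 + 1:x0 + CELL - 1] = color  # 1-px inset

    for ob in obstacles:
        paint_cell(ob, (200, 60, 60))              # obstacles: red
    paint_cell(goal, (60, 180, 90))                # goal: green
    paint_cell(agent, (60, 90, 220))               # agent: blue
    return img

frame = render(agent=(1, 1), goal=(6, 6), obstacles=[(3, 3), (4, 2)])
print(frame.shape, frame.dtype)   # (112, 112, 3) uint8
```

Because everything is plain array slicing, the whole observation pipeline stays dependency-free and trivially portable.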
"We create a fully NumPy-rendered grid world in which the agent observes RGB frames rather than symbolic state variables."
Transition Data Captures Goal-Conditioned Dynamics
Roll out 120 random episodes (45 steps max) to build a dataset of transitions: current image, action, next image, next state, and goal (normalized goal position). Wrap it in a DataLoader with batch_size=64.
Why a random policy? It explores the state-action space efficiently for world modeling rather than for optimal behavior; the model learns prediction first, planning later. Each item:
{
    "img_t": img_t,            # (3, H, W), values in [0, 1]
    "action": torch.tensor(a),
    "img_tp1": img_tp1,
    "state_tp1": st,           # [agent_x, agent_y, goal_x, goal_y]
    "goal": goal,
}
Pitfall avoided: include the goal in every transition for conditioning, which enables directed planning. Roughly 5k transitions suffice for this simple environment; scale up the episode count for real data.
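A hedged sketch of the collection loop; ToyEnv is a stand-in with the same reset/step interface as the grid world, and tensors are left as NumPy arrays for brevity:

```python
import random
import numpy as np

class ToyEnv:
    """Stand-in for GridWorldRGBNoPIL: returns (img, state) observations."""
    def reset(self):
        self.t = 0
        return np.zeros((3, 112, 112), np.float32), np.random.rand(4).astype(np.float32)

    def step(self, a):
        self.t += 1
        img = np.zeros((3, 112, 112), np.float32)
        state = np.random.rand(4).astype(np.float32)
        done = self.t >= 45                        # 45-step episode cap
        return img, state, 0.0, done

def collect(env, n_episodes=120, max_steps=45):
    data = []
    for _ in range(n_episodes):
        img_t, st = env.reset()
        for _ in range(max_steps):
            a = random.randrange(5)                # random policy: explore dynamics
            img_tp1, st_tp1, r, done = env.step(a)
            data.append({
                "img_t": img_t, "action": a, "img_tp1": img_tp1,
                "state_tp1": st_tp1, "goal": st_tp1[2:4],   # goal conditioning
            })
            img_t, st = img_tp1, st_tp1
            if done:
                break
    return data

data = collect(ToyEnv())
print(len(data))   # 120 episodes x 45 steps = 5400 transitions
```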
Compact CNN World Model Predicts Frames and States
VLASimLite (zdim=64): encoder (3 conv layers: 3→24→48→64, stride 2) flattens to zdim. Decoder (transposed convs back to 3×H×W, sigmoid). Dynamics: embed the action (Embedding(5, 16)), pass the goal through an MLP (2→16), concat with z → 128 → zdim. State head: z → 4 (sigmoid).
Forward: encode img_t → z → predict z_next (conditioned on a, goal) → decode img_pred + state_pred.
"We build a CNN encoder to compress visual input into a latent space and condition latent dynamics on actions and goals."
Architecture insight: a small zdim keeps inference fast for MPC rollouts (120 candidates × 6 horizon steps = 720 dynamics evals per planning step, batched into just 6 forward passes). No RNN: pure feedforward for stability.
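A re-sketch of this architecture under the stated shapes (class and method names mirror the article, but the exact layer hyperparameters here are assumptions):

```python
import torch
import torch.nn as nn

class VLASimLiteSketch(nn.Module):
    """Hedged sketch of the compact world model: encoder, decoder, dynamics."""
    def __init__(self, zdim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 24, 3, 2, 1), nn.ReLU(),    # 112 -> 56
            nn.Conv2d(24, 48, 3, 2, 1), nn.ReLU(),   # 56 -> 28
            nn.Conv2d(48, 64, 3, 2, 1), nn.ReLU(),   # 28 -> 14
            nn.Flatten(), nn.Linear(64 * 14 * 14, zdim),
        )
        self.dec = nn.Sequential(
            nn.Linear(zdim, 64 * 14 * 14), nn.Unflatten(1, (64, 14, 14)),
            nn.ConvTranspose2d(64, 48, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(48, 24, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(24, 3, 4, 2, 1), nn.Sigmoid(),
        )
        self.a_emb = nn.Embedding(5, 16)                                # action -> 16
        self.g_mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU())         # goal -> 16
        self.dyn = nn.Sequential(nn.Linear(zdim + 32, 128), nn.ReLU(),
                                 nn.Linear(128, zdim))
        self.state = nn.Sequential(nn.Linear(zdim, 4), nn.Sigmoid())    # [ax, ay, gx, gy]

    def predict_next_latent(self, z, a, goal):
        return self.dyn(torch.cat([z, self.a_emb(a), self.g_mlp(goal)], dim=-1))

    def forward(self, img_t, a, goal):
        z = self.enc(img_t)
        z_next = self.predict_next_latent(z, a, goal)
        return self.dec(z_next), self.state(z_next), z_next

m = VLASimLiteSketch()
pred_img, pred_state, z_next = m(torch.rand(2, 3, 112, 112),
                                 torch.tensor([0, 3]), torch.rand(2, 2))
print(pred_img.shape, pred_state.shape, z_next.shape)
```

Note how the decoder mirrors the encoder's stride-2 downsampling (14 → 28 → 56 → 112), so predicted frames come back at full resolution.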
Multi-Task Loss Ensures Predictive Consistency
Train for 4 epochs with Adam (lr=2e-3) and gradient clipping at 2.0:
loss = F.l1_loss(img_pred, img_tp1) + 3.0*F.mse_loss(st_pred, st_tp1) + 1e-4*z_next.pow(2).mean()
L1 for pixels (robust to outliers), MSE weighted ×3 for states (precise positions), and a tiny L2 penalty on latents (prevents drift).
Why this works: reconstruction anchors the representation, the state loss adds structure, and the dynamics head learns forward simulation. The loss drops steadily over 4 epochs; the lightweight model converges quickly on CPU or GPU.
Common mistake: Over-relying on pixel loss alone leads to blurry predictions; state supervision sharpens goal awareness.
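One training step with this loss can be sketched as follows; the linear enc/dec/head modules are toy stand-ins for the CNN world model, but the loss wiring and clipping match the recipe above:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# toy stand-in modules (the real model uses conv layers); same training recipe
enc = torch.nn.Linear(3 * 112 * 112, 64)
dec = torch.nn.Linear(64, 3 * 112 * 112)
head = torch.nn.Linear(64, 4)
params = list(enc.parameters()) + list(dec.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=2e-3)

# fake batch in place of the DataLoader output
img_t = torch.rand(8, 3 * 112 * 112)
img_tp1, st_tp1 = torch.rand(8, 3 * 112 * 112), torch.rand(8, 4)

z_next = enc(img_t)                      # stand-in for encode + latent dynamics
img_pred = torch.sigmoid(dec(z_next))
st_pred = torch.sigmoid(head(z_next))

loss = (F.l1_loss(img_pred, img_tp1)            # pixel reconstruction (L1)
        + 3.0 * F.mse_loss(st_pred, st_tp1)     # state supervision (x3 weight)
        + 1e-4 * z_next.pow(2).mean())          # latent L2: prevents drift
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, 2.0)     # grad clip = 2.0
opt.step()
print(float(loss))
```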
"We optimize the latent dynamics so that the model learns consistent forward prediction from pixels."
Latent MPC Samples and Scores Trajectories
No gradients, pure Monte Carlo: sample n_candidates=120 sequences of horizon=6 actions (randint(0, 5)), then roll them out from the current z:
z_roll = z.repeat(n_candidates, 1)
for t in range(horizon):
    z_roll = model.predict_next_latent(z_roll, cand[:, t], goal_k)
stT = model.state(z_roll)                                    # predicted [ax, ay, gx, gy]
dist = torch.abs(stT[:, 0:2] - stT[:, 2:4]).sum(dim=-1)      # Manhattan distance to goal
changes = (cand[:, 1:] != cand[:, :-1]).float().mean(dim=1)  # smoothness penalty
score = dist + 0.12 * changes
best = int(cand[score.argmin(), 0])                          # first action of the best sequence
The goal is extracted automatically from the current state prediction, and the smoothness penalty discourages jitter. Planning takes about 1 ms per frame on a modest GPU.
Trade-off: a short horizon is fast but myopic; tune n_candidates against your compute budget. The model also predicts the next frame for visualization.
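The planner wraps up into a single function; in this runnable sketch, predict_next_latent and state are toy linear stand-ins for the learned model heads:

```python
import torch

torch.manual_seed(0)
zdim, n_candidates, horizon = 64, 120, 6

# toy stand-ins for the learned dynamics and state head
W = torch.randn(zdim, zdim) * 0.01
A = torch.randn(5, zdim) * 0.1
S = torch.randn(zdim, 4) * 0.1

def predict_next_latent(z, a, goal):
    return z @ W + A[a]              # placeholder latent dynamics

def state(z):
    return torch.sigmoid(z @ S)      # placeholder state head -> [ax, ay, gx, gy]

def plan(z0, goal):
    """Sample action sequences, roll out in latent space, score, pick best."""
    cand = torch.randint(0, 5, (n_candidates, horizon))
    z_roll = z0.expand(n_candidates, zdim)
    for t in range(horizon):
        z_roll = predict_next_latent(z_roll, cand[:, t], goal)
    stT = state(z_roll)
    dist = torch.abs(stT[:, 0:2] - stT[:, 2:4]).sum(dim=-1)       # distance term
    changes = (cand[:, 1:] != cand[:, :-1]).float().mean(dim=1)   # jitter penalty
    score = dist + 0.12 * changes
    return int(cand[score.argmin(), 0])    # first action of the best sequence

a0 = plan(torch.randn(1, zdim), goal=torch.rand(2))
print(a0)
```

All 120 candidates roll forward in one batched tensor, which is why sampling-based MPC stays cheap despite the brute-force search.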
"We sample multiple action sequences, roll them forward through the learned dynamics, and select the sequence that minimizes predicted distance to the goal."
Closed-Loop Execution Validates End-to-End Loop
run_episode(): perceive (encode the image) → plan with MPC → act → repeat. The agent reaches the goal in under 20 steps with a return around 3.0 or better. Side-by-side visualization of real and predicted frames shows accurate foresight.
Before: the random policy wanders. After: purposeful pathing around obstacles. Quality check: predictions align 1-2 steps ahead with low distance error.
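The closed loop itself reduces to a few lines. In this self-contained sketch a greedy heuristic stands in for the latent MPC planner, and StubEnv stands in for the pixel environment; only the loop structure is the point:

```python
class StubEnv:
    """Minimal stand-in with the grid world's action set and bounds."""
    def __init__(self):
        self.pos, self.goal = [0, 0], [3, 3]

    def reset(self):
        self.pos = [0, 0]
        return tuple(self.pos)

    def step(self, a):
        # up(0), down(1), left(2), right(3), stay(4)
        dx, dy = [(0, -1), (0, 1), (-1, 0), (1, 0), (0, 0)][a]
        self.pos = [min(7, max(0, self.pos[0] + dx)),
                    min(7, max(0, self.pos[1] + dy))]
        done = self.pos == self.goal
        return tuple(self.pos), (1.0 if done else 0.0), done

def greedy_plan(obs, goal):
    """Stand-in for latent MPC: step toward the goal one axis at a time."""
    if obs[0] != goal[0]:
        return 3 if obs[0] < goal[0] else 2
    if obs[1] != goal[1]:
        return 1 if obs[1] < goal[1] else 0
    return 4

def run_episode(env, max_steps=45):
    obs, ret = env.reset(), 0.0
    for t in range(max_steps):
        a = greedy_plan(obs, env.goal)    # replan from the observation every step
        obs, r, done = env.step(a)
        ret += r
        if done:
            return t + 1, ret
    return max_steps, ret

steps, ret = run_episode(StubEnv())
print(steps, ret)   # 6 steps to a goal 6 Manhattan-steps away
```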
"This approach captures the core idea behind modern Vision-Language-Action systems, where perception and decision-making are tightly integrated within a predictive model."
Exercise: increase grid_size=16 and n_obstacles=15, retrain, and observe how MPC scales.
Key Takeaways
- Render pixels with NumPy arrays for portable sims: fill the background, overlay gridlines, paint cells; no library bloat.
- Collect goal-conditioned transitions via random rollouts: 100+ episodes yield robust data.
- Use conv encoder (zdim=64) + action/goal-conditioned dynamics for lightweight world models.
- Train with L1 on pixels + MSE on states + L2 on latents: balances reconstruction, accuracy, and stability.
- MPC in latent space: sample 120×6 trajectories, score by distance + smoothness, and execute the best first action.
- Replan every step from pixels: receding-horizon control adapts to drift without gradient-based optimization.
- Start small (8x8 grid): Prototype VLA agents fast, scale to video/CV2 inputs.
- The full code is copy-paste ready: tweak the config for your environment.