Learning Action and World Dynamics Together with Diffusion — Stannis Zhou's (Google DeepMind) Diffusion Model Predictive Control

YC Paper Club 2026 / talk approx. 12 min

Guangyao "Stannis" Zhou · 20:48 "What we did in D-MPC was use a diffusion model to learn both a 'multi-step action proposal' and a 'multi-step dynamics model.'"

This was the second talk at the first YC Paper Club (2026-05-20, Y Combinator, Mountain View). It runs about 12 minutes, starting at 18:33 in the video. The speaker is Guangyao "Stannis" Zhou, Staff Research Scientist at Google DeepMind. He co-leads world models for robotics, but this paper is work from about two years ago, "before moving into hardcore robotics." The paper is "Diffusion Model Predictive Control" (arXiv 2410.05364, TMLR 2025, Google DeepMind).

The talk covers the earlier work from which that research line grew, and Zhou frames it as a place where the prototype of his later thinking can be seen, worked out on toy problems. The theme is bringing diffusion models into control and learning generatively both "what to do next" and "how the world will move if you do it."

What is Model Predictive Control (MPC)?

Model Predictive Control (MPC) consists of two parts: a dynamics model (= world model) and a planner that selects actions. You assemble an agent that solves diverse tasks by maximizing a known objective function. The idea is straightforward: propose a sequence of actions, roll out the resulting states with the dynamics model, evaluate them with the objective function, pick the best action, and execute it in the environment.
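
The propose, roll out, evaluate, execute loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the paper: the 1-D point "environment", the distance-to-goal objective, and every function name (`dynamics`, `objective`, `propose`, `rollout`, `mpc_step`, `run_mpc`) are hypothetical, and random shooting is just one simple choice of planner.

```python
import random

def dynamics(state, action):
    """Toy stand-in for a dynamics model: a 1-D point that moves by `action`."""
    return state + action

def objective(states, goal=5.0):
    """Known objective: negative total distance to the goal (higher is better)."""
    return -sum(abs(s - goal) for s in states)

def propose(horizon, rng):
    """Propose a random action sequence; each action lies in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def rollout(state, actions):
    """Predict the states that result from applying `actions` in order."""
    states = []
    for action in actions:
        state = dynamics(state, action)
        states.append(state)
    return states

def mpc_step(state, rng, horizon=5, num_samples=64):
    """Propose plans, roll them out, score them, return the best plan's first action."""
    candidates = [propose(horizon, rng) for _ in range(num_samples)]
    best = max(candidates, key=lambda acts: objective(rollout(state, acts)))
    return best[0]

def run_mpc(start=0.0, steps=20, seed=0):
    """Closed loop: plan, execute one action, observe the new state, replan."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = mpc_step(state, rng)
        state = dynamics(state, action)  # the toy "environment" reuses the same rule
    return state

final_state = run_mpc()
```

Only the first action of the best plan is executed before replanning from the new state, which is the usual receding-horizon pattern in MPC.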

Three advantages are noted. It can adapt to new reward functions at inference time. The dynamics model is easier to learn and generalizes better than a policy itself. And by holding "action proposal" and "dynamics" separately (factorization), adapting to new dynamics becomes easy. That last point becomes important later in the "broken ankle" experiment.

Two challenges

To make MPC practical, two problems must be solved. The first is compounding error (the accumulation of error): when you chain together a model that only predicts one step ahead, small inaccuracies snowball and long-horizon prediction collapses. The second is that the planner must be strong enough to select a good action sequence.
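
A small numeric toy makes the snowballing concrete. The sketch assumes a linear ground truth and a hypothetical one-step model that is 2% too large at every step (both are illustrative choices, not from the talk). Chaining that model on its own outputs turns the 2% error into roughly 22% after 10 steps and roughly 169% after 50.

```python
def true_step(x):
    """Ground-truth toy dynamics: the state decays by 10% per step."""
    return 0.9 * x

def learned_step(x, rel_error=0.02):
    """Hypothetical one-step model whose prediction is 2% too large."""
    return 0.9 * (1.0 + rel_error) * x

def relative_error_after(horizon, x0=1.0):
    """Chain the one-step model on its own outputs and compare with the truth."""
    x_true, x_pred = x0, x0
    for _ in range(horizon):
        x_true = true_step(x_true)
        x_pred = learned_step(x_pred)  # feeds on its own earlier prediction
    return abs(x_pred - x_true) / abs(x_true)

# Relative error at horizons 1, 10 and 50; it follows (1.02 ** h) - 1.
errors = {h: relative_error_after(h) for h in (1, 10, 50)}
```

The toy only shows how per-step error compounds when predictions are chained. It does not model how a multi-step generator avoids feeding on its own outputs.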

D-MPC — learning both proposal and dynamics with diffusion models

What D-MPC (Diffusion Model Predictive Control) did was learn both a multi-step action proposal and a multi-step dynamics model with diffusion models. Because it generates the sequence all at once rather than step by step, compounding error is suppressed and the planner can be simplified. In fact, a simple sampling-based planner alone reportedly outperformed many prior methods.

The algorithm itself is plain. From offline data, you learn two diffusion models: a policy that predicts actions from the current observation, and a dynamics model that rolls observations forward in response to actions. At inference time, you sample action proposals, score and rank them, and pick the best. The multi-step action proposal widens coverage of the action space, and the multi-step dynamics can roll out long horizons with less error accumulation than chained one-step prediction.
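
A rough sketch of the inference-time step, under heavy assumptions: the two diffusion models are replaced by hand-written placeholders (`sample_action_proposals` and `predict_trajectory`, both hypothetical names), so only the control flow of sample, predict, score and rank is illustrated. The point of the sketch is the interface: each placeholder returns a whole multi-step sequence per call.

```python
import random
from typing import Callable, List, Sequence, Tuple

def sample_action_proposals(obs: float, horizon: int, num_samples: int,
                            rng: random.Random) -> List[List[float]]:
    """Placeholder for a learned multi-step action proposal.
    Returns whole action sequences; here they are plain Gaussian noise."""
    return [[rng.gauss(0.0, 0.5) for _ in range(horizon)]
            for _ in range(num_samples)]

def predict_trajectory(obs: float, actions: Sequence[float]) -> List[float]:
    """Placeholder for a learned multi-step dynamics model.
    One call returns every future state; toy rule: state_k = obs + sum of first k actions."""
    states, total = [], 0.0
    for action in actions:
        total += action
        states.append(obs + total)
    return states

def reach_goal(states: Sequence[float], goal: float = 2.0) -> float:
    """Illustrative reward: end the trajectory as close to `goal` as possible."""
    return -abs(states[-1] - goal)

def plan(obs: float, reward_fn: Callable[[Sequence[float]], float],
         horizon: int = 8, num_samples: int = 128,
         seed: int = 0) -> Tuple[float, List[float]]:
    """Sample proposals, predict each trajectory, score, rank, return the best."""
    rng = random.Random(seed)
    proposals = sample_action_proposals(obs, horizon, num_samples, rng)
    scored = [(reward_fn(predict_trajectory(obs, acts)), acts) for acts in proposals]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank by score only
    return scored[0]

best_score, best_actions = plan(0.0, reach_goal)
```

Because `reward_fn` is an argument of `plan` rather than something baked into the placeholders, a different objective can be passed in without touching either model.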

A map of diffusion-based agents

Zhou organizes the related work hierarchically: all of these methods assemble the joint distribution of states and actions, but in different ways. Diffusion policy generates actions conditioned on observations; it is strong at complex control but requires expert demos. Diffuser models states and actions jointly. Decision Diffuser generates future observations conditioned on history, then extracts actions with an inverse dynamics model; its advantage is that it can learn from video-only data, which matters a great deal for robotics, where data is the bottleneck. D-MPC follows the flow of action proposal → roll out with dynamics → select with the planner, and can adapt at inference time to both new rewards and new dynamics.
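
One way to write down what each method models, as a notational sketch rather than the paper's own formulation. The symbols are assumptions made here: h_t is the history up to time t, F is the number of future steps, and the exact conditioning each method uses is simplified.

```latex
% Notation assumed for illustration: h_t = history up to time t, F = horizon.
\begin{align*}
\text{diffusion policy:}\quad  & p(a_{t:t+F-1} \mid h_t) \\
\text{Diffuser:}\quad          & p(s_{t+1:t+F},\, a_{t:t+F-1} \mid h_t)
                                 \quad \text{(modeled jointly)} \\
\text{Decision Diffuser:}\quad & p(s_{t+1:t+F} \mid h_t),
                                 \ \text{then}\ a_k \sim p(a_k \mid s_k, s_{k+1}) \\
\text{D-MPC:}\quad             & p(s_{t+1:t+F},\, a_{t:t+F-1} \mid h_t)
                                 = \underbrace{p(a_{t:t+F-1} \mid h_t)}_{\text{action proposal}}
                                   \;\underbrace{p(s_{t+1:t+F} \mid h_t,\, a_{t:t+F-1})}_{\text{dynamics}}
\end{align*}
```

The D-MPC line is just the chain rule applied to the joint distribution. What distinguishes the method is that each factor is learned as its own model, so either one can be replaced or adapted independently.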

Results — on par at fixed reward, and test-time adaptation

In the fixed-reward, single-task setting (the MuJoCo locomotion tasks in D4RL), D-MPC is competitive with the existing state of the art. More interesting are two kinds of adaptability. The first is adaptation to new rewards: a model trained only on locomotion tasks like running shows new behaviors such as jumping simply by changing the reward function at inference time. The second is adaptation to new dynamics, for example when the walker's left ankle breaks and the same actions produce different results. Because D-MPC holds action proposal and dynamics separately, much of the performance can be recovered by adapting only the dynamics model on a small amount of play data collected in the new environment. The ablations showed that the multi-step action proposal and the multi-step dynamics each contribute to performance.
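
The two kinds of test-time adaptation can be illustrated with a toy in which nothing is retrained. This is a sketch under loud assumptions: three fixed candidate plans stand in for samples from a trained action proposal, and adapting the dynamics model on play data is replaced by swapping in a second hand-written function. All names and numbers are hypothetical.

```python
# Each action is a (push, lift) pair; each plan is three actions long.
CANDIDATES = {
    "sprint":  [(1.0, 0.0)] * 3,
    "shuffle": [(0.4, 0.0)] * 3,
    "jump":    [(0.0, 1.0)] * 3,
}

def healthy_dynamics(actions):
    """Toy multi-step dynamics: returns (position, height) after each action."""
    x, trajectory = 0.0, []
    for push, lift in actions:
        x += push
        trajectory.append((x, lift))
    return trajectory

def broken_ankle_dynamics(actions):
    """Same toy world after an injury: hard pushes now cause a stumble backward."""
    x, trajectory = 0.0, []
    for push, lift in actions:
        x += push if push <= 0.5 else -0.2
        trajectory.append((x, 0.5 * lift))
    return trajectory

def forward_reward(trajectory):
    """Reward for locomotion: final position."""
    return trajectory[-1][0]

def jump_reward(trajectory):
    """New reward supplied at inference time: peak height."""
    return max(height for _, height in trajectory)

def choose(dynamics, reward):
    """Score every candidate plan under the given dynamics and reward; pick the best."""
    return max(CANDIDATES, key=lambda name: reward(dynamics(CANDIDATES[name])))

picks = (
    choose(healthy_dynamics, forward_reward),       # original task
    choose(healthy_dynamics, jump_reward),          # new reward, same models
    choose(broken_ankle_dynamics, forward_reward),  # new dynamics, same proposals
)
```

In all three cases the candidate plans are untouched. Only the reward or the dynamics function changes, and the selected plan changes with it, which is the benefit of keeping the two pieces separate.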

Editorial Note

The core of D-MPC is the design decision to "hold them separately." It learns "what to do next (action proposal)" and "how the world will move if you do it (dynamics)" as separate diffusion models. When the body breaks (the broken ankle), all you need to rebuild is the dynamics, and the action policy can be reused as is — much like a driver who, without changing their driving habits, only recalibrates their feel to match a car whose braking has weakened. In 2026, when LLM world models are drawing attention, the structure of "model the world's motion in multiple steps with a generative model, and keep the planner simple" reads as a prehistory of the robotics world models Zhou now leads. Just as he says — "a toy problem from two years ago" — the prototype of the idea is here.

Points of Focus

Severing compounding error with "multi-step generation"

When you chain one-step-ahead predictions repeatedly, the small error at each step accumulates and long-horizon prediction collapses — the same as an image degrading through copies of copies. D-MPC suppresses this snowballing by generating the sequence all at once. It is a diffusion-model-style answer to a challenge that keeps standing in the way of putting world models into practice.

The host's joke about "DeepMind's last public paper"

After Zhou's talk, the host Chaubard joked, "This is the last paper Google DeepMind will publish, good luck." The line captures the mood of frontier labs growing cautious about publishing research. From MEMEX's perspective of preserving primary sources in Japanese, it is also a small piece of testimony that indirectly underscores the value of being able to engage with published research right now.

Video Outline (this segment)

(17:31) The host introduces the next paper, interest moving from diffusion policy to world models
(18:33) Stannis Zhou takes the stage, self-introduction (Google DeepMind, co-lead of robotics world models)
(19:06) What Model Predictive Control (MPC) is — dynamics model + planner
(20:19) The motivation for D-MPC — the twin challenges of accurate dynamics and a strong planner
(20:48) Learning both action proposal and dynamics with diffusion models
(21:36) A map of diffusion-based agents (diffusion policy / Diffuser / Decision Diffuser / D-MPC)
(24:59) The algorithm — offline learning, sampling and selection at inference time
(27:31) Results — on par at fixed reward, test-time adaptation to new rewards (jumping)
(28:36) Adaptation to new dynamics — the broken ankle and the benefit of factorization
(29:04) Ablations — the contribution of each component

Related Links

Guangyao "Stannis" Zhou

Staff Research Scientist, Google DeepMind / robotics world models

Glossary

Model Predictive Control (MPC): A framework for an agent that combines a dynamics model (world model) with a planner that selects actions, solving tasks by maximizing a known objective function. It repeats: propose an action sequence → roll out with the dynamics → evaluate and pick the best → execute in the environment. Swapping the reward at inference time elicits different behaviors.
D-MPC (Diffusion Model Predictive Control): A method that uses diffusion models to learn both a multi-step action proposal and a multi-step dynamics model from offline data (arXiv 2410.05364, TMLR 2025, Google DeepMind). By generating the sequence all at once it suppresses compounding error, and with only a simple sampling-based planner it reportedly outperforms many prior methods. It can adapt to new rewards and new dynamics at inference time.
compounding error: The phenomenon where per-step prediction errors accumulate over a long time horizon and long-horizon prediction breaks down. It is like an image degrading through copies of copies. D-MPC suppresses this snowballing with multi-step prediction (generating the sequence all at once).
factorization (separating action proposal from dynamics): A design that holds "what to do next (action proposal)" and "how the world will move if you do it (dynamics)" as separate models. When the body or environment changes (e.g., a broken ankle), you only need to adapt the dynamics model on a small amount of data, and the action policy can be reused as is. This makes adapting to new dynamics easy.
diffusion policy / Decision Diffuser: A lineage of diffusion-based control methods. diffusion policy generates actions conditioned on observations (strong at complex control but requires expert demos). Decision Diffuser generates future observations conditioned on history and extracts actions via inverse dynamics (can learn from video-only data). D-MPC, contrasted with these, is the configuration of action proposal + dynamics + planner.
