Isaac Ward · 30:47 "Hidden in this presentation is a billion-dollar question. That's no exaggeration. Yann LeCun raised $1.03 billion in March, basically just to train a world model — this talk is about that question."
Isaac Ward has been researching world models for several years, starting before they drew much attention; now, as he puts it, they are having their moment in the sun. The stakes of the talk are clear: according to reports, Yann LeCun raised $1.03 billion in March 2026 for a new company, AMI Labs, built around developing world models. Ward unpacks the technical substance of that bet by tracing the research lineage of LeCun and Randall Balestriero.
What a world model is
A world model is, in practice, a large neural network that predicts, from the current state (or observation) and the action taken, what state comes next. Writing the observation as S, it predicts how the world changes when you take a given action. For a robot, it is like a flight simulator that replays in its head "if I turn left, where in the room will I be facing." It enables three capabilities: generating imagined consequences, model-based control, and quantifying "surprise" (uncertainty).
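The interface described above can be sketched in a few lines. This is a toy under stated assumptions: the grid world, the `State` layout, the action names, and the function names are all hypothetical, and the hand-written transition rule only stands in for what would be a learned neural network.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int, int]  # (x, y, heading); heading 0=N, 1=E, 2=S, 3=W
Action = str                  # "left", "right", or "forward"

# Unit step for each heading, indexed by heading value.
HEADING_STEPS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def toy_world_model(state: State, action: Action) -> State:
    """Hand-written stand-in for a learned predictor: (state, action) -> next state."""
    x, y, heading = state
    if action == "left":
        return (x, y, (heading - 1) % 4)
    if action == "right":
        return (x, y, (heading + 1) % 4)
    if action == "forward":
        dx, dy = HEADING_STEPS[heading]
        return (x + dx, y + dy, heading)
    raise ValueError(f"unknown action: {action}")


def imagine(model: Callable[[State, Action], State],
            state: State,
            actions: List[Action]) -> List[State]:
    """Roll the model forward without touching the real world."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


# "If I turn left and then drive forward, where will I be?"
imagined = imagine(toy_world_model, (0, 0, 0), ["left", "forward"])
```

Starting at the origin facing north, the imagined rollout ends one cell to the west, still facing west. Nothing was executed in the real environment, which is the sense in which the model "replays in its head."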
Ward cautions that this is not a new idea. Sutton in 1990 described modern world models almost verbatim — a "black box" that takes the situation and the action to be executed as input and outputs a prediction of the situation immediately afterward. What's new is not the idea but its packaging and marketing, as he frames it.
Model-free or model-based
Whether an agent should have an internal model of the world is contested in both research and startups. A model-free agent simply feeds observations into a large neural net and outputs the optimal action, with no explicit representation of "what the future looks like if I take this action." Performance is good, but it is somewhat brittle out-of-distribution. A model-based agent explicitly learns a world model and uses it to predict the consequences of candidate actions. Its advantage is being able to quantify modeling error, which matters when deploying in the real world.
The pitfall of representational collapse
Training a world model means simultaneously learning how to compactly represent high-dimensional observations (images or LiDAR) and how actions change that representation: the co-learning of representation and dynamics. There is a trap here. The optimization landscape has many "do nothing" solutions. A typical local optimum answers "every state is the same." This is representational collapse, the trivial solution that crushes all states into the same latent representation. It is close to a lazy student who answers "the same" every time to "what comes next?", formally consistent but having learned nothing.
Existing world models (PLDM, DINO-WM, Dreamer, TD-MPC, and others) use various tricks to avoid this collapse: heuristics that enforce the soundness of the latent space, repurposed existing autoencoders or diffusion / video models, or privileged data available only at training time. All of them are "tricks for avoiding collapse," and hard to configure.
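A small numerical illustration of why collapse is tempting for the optimizer, assuming a toy 1-D world: the encoders, predictors, and data below are hypothetical stand-ins, not taken from any of the papers. Both the collapsed and the informative solution reach zero latent prediction loss, so the prediction loss alone cannot tell them apart.

```python
import random
import statistics


def latent_prediction_loss(encode, predict, transitions):
    """Mean squared error between predicted and actual next latents."""
    total = 0.0
    for obs, action, next_obs in transitions:
        z_pred = predict(encode(obs), action)
        z_next = encode(next_obs)
        total += sum((p - t) ** 2 for p, t in zip(z_pred, z_next))
    return total / len(transitions)


# Hypothetical toy data: the observation is a 1-D position, the action is a step.
rng = random.Random(0)
transitions = []
for _ in range(100):
    x = rng.uniform(-1.0, 1.0)
    a = rng.choice([-0.1, 0.1])
    transitions.append(([x], a, [x + a]))


def collapsed_encode(obs):
    return [0.0]             # every state maps to the same latent


def collapsed_predict(z, action):
    return z                 # "what comes next?" -> "the same"


def informative_encode(obs):
    return [obs[0]]          # the latent keeps the position


def informative_predict(z, action):
    return [z[0] + action]   # the latent moves with the action


collapsed_loss = latent_prediction_loss(collapsed_encode, collapsed_predict, transitions)
informative_loss = latent_prediction_loss(informative_encode, informative_predict, transitions)

# The difference only shows up in how spread out the latents are.
collapsed_spread = statistics.pvariance(collapsed_encode(o)[0] for o, _, _ in transitions)
informative_spread = statistics.pvariance(informative_encode(o)[0] for o, _, _ in transitions)
```

The two losses are identical, while the latent spread is zero in the collapsed case and clearly positive in the informative one. That gap is what any anti-collapse term has to measure.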
JEPA and SIGReg
JEPA (Joint Embedding Predictive Architecture) is LeCun's central work. It converts observations into latent vectors with an image encoder and, with a predictor conditioned on the action, predicts the "next latent representation." The key is that it predicts the next latent rather than the next image itself. It can decode back to an image, but all the interesting processing happens in the latent space.
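In generic notation, the latent-prediction idea can be written as follows. The symbols are assumptions introduced here for illustration (an encoder f, an action-conditioned predictor g, observation o_t, action a_t), and the squared-error form is one common choice rather than a formula quoted from the talk.

```latex
z_t = f_\theta(o_t), \qquad
\hat{z}_{t+1} = g_\phi(z_t, a_t), \qquad
\mathcal{L}_{\mathrm{pred}} = \bigl\lVert \hat{z}_{t+1} - f_\theta(o_{t+1}) \bigr\rVert_2^2
```

A pixel-space predictor would instead compare a generated image against o_{t+1} directly. Note that the loss above is minimized trivially when f_\theta is constant, which is why a latent-prediction objective needs something extra to rule out collapse.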
What LeJEPA (arXiv 2511.08544) adds is a new regularization term called SIGReg (Sketched Isotropic Gaussian Regularization). The name stands for Sketched (one-dimensional projection of high-dimensional data), Isotropic (looks the same whichever direction you slice it), and Gaussian-distributed. It projects the embedding onto many random one-dimensional directions and tests whether the slice along each direction is a normal distribution. If every one-dimensional slice is normal, one can cheaply judge that the latent space is an isotropic, sound "round cloud" and has not collapsed. It folds the assorted grab-bag of tricks into a single hyperparameter and a single loss term. LeJEPA proves that an isotropic Gaussian distribution is the optimal embedding distribution and then enforces it. Ward calmly positions this as "this too is, after all, just offering a new kind of elegant trick."
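The "project onto random directions, then check each slice" recipe can be sketched as below. This is an illustration under loud assumptions: the per-slice check here is a simple moment-matching penalty against a standard normal, swapped in for readability, and is not the statistical test used in the LeJEPA paper. All function names and constants are hypothetical.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def slice_penalty(values):
    """Assumed stand-in check: compare the first four raw moments of a
    1-D slice with those of a standard normal (0, 1, 0, 3)."""
    n = len(values)
    m1 = sum(values) / n
    m2 = sum(v ** 2 for v in values) / n
    m3 = sum(v ** 3 for v in values) / n
    m4 = sum(v ** 4 for v in values) / n
    return m1 ** 2 + (m2 - 1.0) ** 2 + m3 ** 2 + (m4 - 3.0) ** 2


def sketched_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average the 1-D penalty over many random projections ("sketches")."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        direction = random_unit_direction(dim, rng)
        projected = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        total += slice_penalty(projected)
    return total / num_directions


data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
collapsed = [[0.0] * 4 for _ in range(2000)]

healthy_penalty = sketched_gaussian_penalty(healthy)
collapsed_penalty = sketched_gaussian_penalty(collapsed)
```

An isotropic Gaussian cloud scores close to zero in every direction, while the collapsed embedding is penalized heavily in all of them. In training, a term like this would be added to the prediction loss with a single weight, which matches the "single hyperparameter and a single loss term" framing.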
LeWorldModel — what you get
The application of this lineage is LeWorldModel (arXiv 2603.19312, 2026; Lucas Maes, Quentin Le Lidec, Damien Scieur, LeCun, Balestriero), a paper applying LeJEPA's idea to world models. Small at about 15 million parameters, it runs on a single GPU and is reported to plan up to 48x faster than foundation-model-based world models (this efficiency figure is from LeWorldModel, not LeJEPA itself). In open-loop prediction, the "real" and "imagined" sequences match well on push-T and push-cube. Control is search in the latent space: an MPC that encodes the initial observation and the goal observation and searches in the latent space for the actions that get from start to goal. On small 2D tasks it beats competitors; on 3D, DINO-WM with its large foundation backbone wins.
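A minimal sketch of MPC in latent space, assuming a toy 2-D grid. The affine `encode`, the matching `predict_latent`, and the exhaustive search over a tiny discrete action set are all hypothetical simplifications chosen so the example is deterministic; they are not the encoder, dynamics, or optimizer of LeWorldModel.

```python
from itertools import product

# Hypothetical discrete action set: stay, or move one cell along an axis.
ACTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def encode(obs):
    """Assumed encoder: an affine map standing in for a learned network."""
    return (2.0 * obs[0] + 1.0, 2.0 * obs[1] + 1.0)


def predict_latent(z, action):
    """Assumed latent dynamics, consistent with the encoder above."""
    return (z[0] + 2.0 * action[0], z[1] + 2.0 * action[1])


def plan(obs, goal_obs, horizon=3):
    """Search action sequences entirely in latent space; nothing is decoded."""
    z_start, z_goal = encode(obs), encode(goal_obs)

    def cost(sequence):
        z, total = z_start, 0.0
        for action in sequence:
            z = predict_latent(z, action)
            # Accumulate distance at every step so reaching the goal sooner is cheaper.
            total += (z[0] - z_goal[0]) ** 2 + (z[1] - z_goal[1]) ** 2
        return total

    return min(product(ACTIONS, repeat=horizon), key=cost)


def run_mpc(start, goal, steps=5):
    """Receding horizon: execute only the first planned action, then replan."""
    obs, path = start, [start]
    for _ in range(steps):
        first_action = plan(obs, goal)[0]
        obs = (obs[0] + first_action[0], obs[1] + first_action[1])  # toy environment step
        path.append(obs)
    return path


path = run_mpc(start=(0, 0), goal=(2, 1))
```

The agent only ever compares latents: both the start and the goal are encoded once per planning call, and candidate futures are scored by distance to the goal latent. Replanning after every executed action is what makes this model predictive control rather than a one-shot open-loop plan.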
Particularly impressive is the quantification of surprise. When you mischievously perturb the scene the world model is observing (changing the color of the T, teleporting the T to a different location), the model error spikes at that moment. This is detectable, meaning an agent with a model can quantify how far off its predictions are and has a good estimate of uncertainty. The model-free approach does not naturally give you this.
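The surprise signal can be illustrated with a toy 1-D example. Everything here is an assumption for illustration: the model, the teleport at one step, the fixed threshold, and the choice to measure error directly on observations rather than in a latent space.

```python
def prediction_errors(model, observations, actions):
    """One-step error: |model's predicted next observation - what actually happened|."""
    errors = []
    for t, action in enumerate(actions):
        predicted = model(observations[t], action)
        errors.append(abs(predicted - observations[t + 1]))
    return errors


def surprising_steps(errors, threshold):
    """Indices where the model was off by more than the (assumed) threshold."""
    return [t for t, error in enumerate(errors) if error > threshold]


def toy_model(position, action):
    """Hypothetical world model: the object moves by exactly the action."""
    return position + action


# Simulate a trajectory in which the object is teleported at step 5.
actions = [1.0] * 8
observations = [0.0]
for t, action in enumerate(actions):
    next_position = observations[-1] + action
    if t == 5:
        next_position += 5.0  # perturbation the model has no way to anticipate
    observations.append(next_position)

errors = prediction_errors(toy_model, observations, actions)
flagged = surprising_steps(errors, threshold=1.0)
```

The error is zero before and after the teleport and spikes only at the perturbed step, so a simple threshold is enough to flag it. A model-free policy has no prediction to compare against, and therefore no equivalent signal.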
Editorial Observations
The honesty of Ward's talk lies in the sober summary that "this just offers one new trick." World models had their moment in the sun in 2026, but at their core is the old problem that learning representation and dynamics simultaneously invites collapse, and a competition of ingenuity over how to prevent that collapse elegantly. LeJEPA's SIGReg is the solution that enforces an isotropic Gaussian, a "sound round cloud," with a single loss. LeCun's $1.03 billion sits directly on top of this technical lineage, as a declaration of betting on world models that learn from reality rather than language. The contrast between Ward's point that "the idea goes back to Sutton 1990" and the stakes of "the billion-dollar question" is what makes this talk valuable as a primary source.
Points of Focus
Rearranging world models along the axis of "tricks to prevent collapse"
The sharpness of Ward's framing is in lining up PLDM, DINO-WM, Dreamer, TD-MPC, and LeJEPA along the single axis of "how to avoid representational collapse." By centering not a flashy comparison of capabilities but the mundane failure mode all methods share (the trivial solution that learns nothing), it becomes clear that SIGReg's novelty lies in "folding a hard-to-configure grab-bag into a single loss."
The practical value of being able to "measure uncertainty yourself"
The core advantage of the model-based approach is that the agent itself can quantify how far off its predictions are. The experiment where adding a perturbation makes the model error spike gives a machine a self-awareness close to "a driver who slows down because the road feels unfamiliar." In control deployed to the real world, this property of "being able to measure your own confusion" is a safety margin that the model-free approach lacks.
Video Outline (this segment)
- (29:54) Host's introduction — "the most world-model-obsessed person"
- (30:26) Isaac Ward takes the stage, introduces LeJEPA / world models
- (30:47) The billion-dollar question — LeCun's bet on world models
- (31:14) What a world model is — state + action → next observation, Sutton 1990
- (33:51) Model-free or model-based, quantifying error
- (36:11) Representational collapse and co-learning, existing methods' collapse-avoidance tricks
- (38:26) JEPA and SIGReg — latent prediction + isotropic Gaussian regularization
- (40:09) LeWorldModel's capabilities — open-loop prediction, latent-space MPC, speed
- (42:00) Quantifying surprise — perturbations and model-error spikes
- (42:38) Discussion — model-based vs. model-free, how to elegantly prevent collapse
Related Links
- Paper "LeJEPA" (arXiv 2511.08544, Balestriero / LeCun, 2025-11)
- Paper "LeWorldModel" (arXiv 2603.19312, 2026)
- YC Paper Club video (this segment from 30:26)
Glossary
- world model
- A model that predicts "given the current state (observation) and the action taken, what state comes next." Close to a flight simulator that replays the consequences of actions in its head. It enables generating imagined consequences, model-based control, and quantifying uncertainty. An old idea going back to Sutton 1990.
- JEPA (Joint Embedding Predictive Architecture)
- The architecture Yann LeCun is centrally advancing. It converts observations into latent vectors and, conditioned on the action, predicts the "next latent representation." The key is that it predicts the next latent rather than the next image itself, and the important processing happens in the latent space.
- SIGReg (Sketched Isotropic Gaussian Regularization)
- The regularization term introduced by LeJEPA (arXiv 2511.08544). It projects the latent embedding onto many random one-dimensional directions and tests whether the distribution along each direction is a normal distribution. If it's normal in every direction, the latent space can be regarded as isotropic and sound. It replaces existing methods' assorted collapse-avoidance tricks with a single loss term. It enforces this after proving that an isotropic Gaussian is the optimal embedding distribution.
- representational collapse
- In world model training, the trivial local optimum that crushes all states into the same latent representation. Like a lazy student who answers "the same" every time to "what comes next?"—formally consistent but learning nothing. SIGReg prevents this elegantly.
- model-free / model-based
- Model-free directly outputs the optimal action from observations and has no explicit representation of the future (somewhat brittle out-of-distribution). Model-based explicitly learns a world model and predicts the consequences of candidate actions. The core advantage of the latter is that it can quantify modeling error itself and measure uncertainty.
- LeWorldModel
- A paper applying LeJEPA's idea to world models (arXiv 2603.19312, 2026). About 15 million parameters, runs on a single GPU, and is reported to plan up to 48x faster than foundation-model-based world models. It can quantify "surprise" via error spikes against perturbations. The efficiency figure is from this paper, not LeJEPA itself.