Pre-training in an Age When Data Runs Out — Konwoo Kim (Stanford) on Pre-training under Infinite Compute

YC Paper Club 2026 / talk ~15 min

Konwoo Kim · 53:16 "When you are constrained by data and not at all constrained by compute, how should you approach pre-training?"

The 5th (and final) talk of the first YC Paper Club (May 20, 2026, Y Combinator, Mountain View). The talk runs approximately 15 minutes (from 51:24 in the video). The presenter is Konwoo Kim (Stanford PhD candidate). The paper, "Pre-training under infinite compute," is co-authored with Suhas Kotha, Percy Liang, and Tatsunori Hashimoto (arXiv 2509.14786, Stanford, 2025-09).

Over the past 6–7 years, pre-training has dramatically expanded model capabilities: in-context learning with GPT-3 in 2020, alignment via RLHF at Anthropic in 2022, and the emergence of reasoning with o1 and DeepSeek R1 in 2024. Because pre-training is expensive, research has focused on compute efficiency. The question Konwoo Kim poses points the other way: if data is about to become the binding constraint, what should you do when compute is unlimited?

Why "data-constrained" now

Compute-efficient training requires scaling the number of model parameters and the number of data points together (the Chinchilla scaling law). The problem is that data will soon be the binding constraint. By published estimates, human-generated text on the internet grows only about 3% per year, while the compute poured into pre-training grows roughly 4–5x per year. The compute you can spend per data point therefore keeps growing at roughly 4x per year. This is an entirely different algorithmic regime from the "compute efficiency" world we have grown accustomed to. It is closer to questions in classical statistics, or to older benchmarks like MNIST and Penn Treebank, which had few data points and were implicitly data-constrained.
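The "roughly 4x per year" figure follows from dividing the two growth rates. A minimal sketch of that arithmetic, assuming the low end (4x) of the quoted compute range; the constant and function names are illustrative only:

```python
# Growth rates quoted in the talk; 4.0 is the low end of the 4-5x range.
DATA_GROWTH_PER_YEAR = 1.03     # human-generated web text: about +3% per year
COMPUTE_GROWTH_PER_YEAR = 4.0   # pre-training compute: about 4x per year


def compute_per_token_growth(years: int) -> float:
    """Factor by which compute available per data point grows after `years`."""
    return (COMPUTE_GROWTH_PER_YEAR / DATA_GROWTH_PER_YEAR) ** years


for years in (1, 3, 5):
    factor = compute_per_token_growth(years)
    print(f"after {years} year(s): {factor:.1f}x compute per data point")
```

Because data grows so slowly, the ratio stays close to the compute growth rate itself, and it compounds quickly over a few years.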

The asymptote as a yardstick

This paper brings in the modern toolkit of scaling laws. It looks for recipes whose IID validation loss (that is, in-distribution generalization) falls monotonically as compute grows, and shows that such recipes lie on a clean power law. The asymptote becomes the yardstick: if you can fit a power law, its asymptote estimates the recipe's best loss, the point reached under infinite compute. The goal is to find algorithms that lower the asymptote. The setup is plain: it reproduces the data-constrained world by restricting training to just 200M tokens (general web data from DCLM), trains progressively larger models, and tracks IID validation loss. The naive approach comes first. The "standard recipe" of repeatedly training on the same data (epoching) while making the model larger overfits sooner the more over-parameterized the model is, and beyond a certain point the loss increases.
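To make "reading off the asymptote" concrete, here is a minimal sketch that fits loss = asymptote + scale / size^alpha to a handful of points. The functional form, the grid search over alpha, and the synthetic data points are assumptions for illustration; they are not the paper's fitting procedure or its measurements.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b, squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    sse = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def fit_power_law(sizes, losses, alphas):
    """Fit loss = asymptote + scale / size**alpha by grid search over alpha."""
    best = None
    for alpha in alphas:
        # For a fixed alpha the model is linear in x = size**-alpha.
        xs = [size ** -alpha for size in sizes]
        scale, asymptote, sse = fit_linear(xs, losses)
        if best is None or sse < best[3]:
            best = (asymptote, scale, alpha, sse)
    return best[:3]


# Synthetic, noise-free points (model sizes in billions of parameters).
sizes = [0.15, 0.3, 0.6, 1.4]
losses = [3.40 + 0.05 / size for size in sizes]

asymptote, scale, alpha = fit_power_law(sizes, losses, [k / 10 for k in range(5, 16)])
print(f"asymptote={asymptote:.3f} scale={scale:.3f} alpha={alpha:.1f}")
```

The intercept of the fit is the asymptote: the loss the curve approaches as model size goes to infinity. A recipe whose loss turns upward, like the standard recipe described above, cannot be fit this way, which is why it has no measurable asymptote.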

Strong regularization

The natural reaction to this curve is to ask how to fix it. One answer is strong (aggressive) regularization via weight decay. When weight decay is tuned optimally for each parameter count, the loss lies on a clean power law as parameters increase. The regularization is very strong: about 30x the weight decay used in compute-optimal pre-training. The power law has an exponent of 1 in the parameter count N (as data-constrained theory predicts) and an asymptote of 3.43. The standard recipe, which overfits early, does not even have a measurable asymptote.
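As a rough illustration of what a 30x larger weight decay does mechanically, the sketch below applies a plain SGD update with weight decay and zero data gradient, so only the shrinkage term acts. The baseline value of 0.1, the learning rate, and the use of plain SGD are assumptions made for this example; the source does not state the optimizer or the baseline coefficient.

```python
def sgd_step(weights, grads, lr, weight_decay):
    """One SGD step with weight decay: w <- w - lr * (g + weight_decay * w)."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of a weight vector."""
    return sum(w * w for w in weights) ** 0.5


BASELINE_WEIGHT_DECAY = 0.1                        # assumed value, not from the talk
STRONG_WEIGHT_DECAY = 30 * BASELINE_WEIGHT_DECAY   # "about 30x" the baseline

start = [1.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]  # zero data gradient isolates the decay term

baseline, strong = list(start), list(start)
for _ in range(100):
    baseline = sgd_step(baseline, zero_grads, lr=0.01, weight_decay=BASELINE_WEIGHT_DECAY)
    strong = sgd_step(strong, zero_grads, lr=0.01, weight_decay=STRONG_WEIGHT_DECAY)

print(f"baseline norm: {l2_norm(baseline):.3f}, strong norm: {l2_norm(strong):.3f}")
```

With real gradients the data term pushes back against this shrinkage; the stronger the decay, the harder it is for the model to hold the large weights needed to memorize a small dataset.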

Ensembling

Opening the classical machine-learning toolbox, one of the best-known tools is ensembling: combining the predictions of multiple independently trained models. When you ensemble 300M-parameter models and increase the number of members (5 members is 1.5B parameters in total), the loss again lies on a clean power law, with an exponent of 1 in the number of members and a measurable asymptote. Crucially, the ensemble's asymptote is far lower than the regularization recipe's asymptote, a true win in data efficiency under infinite compute. The ensemble also beats the regularization recipe when compared at matched compute. If you want the best 1.5B model under data constraints, you are better off forming a committee of small models than training a single large model. Regularization (which lets you keep growing the model) and ensembling (a new axis of compute scaling: adding more models) can be combined into "joint scaling." Estimated in a double limit (first the member count K, then the parameter count N), the loss improves substantially.
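A minimal sketch of why a committee can beat its members: if the ensemble averages the members' predicted probabilities, its cross-entropy on any target is never worse than the members' average cross-entropy (Jensen's inequality). Averaging probabilities is an assumed combination rule here, and the logits are hypothetical; the source does not specify how the paper combines members.

```python
import math


def softmax(logits):
    """Convert logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(member_logits):
    """Average the members' probability distributions (assumed combination rule)."""
    member_probs = [softmax(logits) for logits in member_logits]
    k = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / k for i in range(vocab)]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token id."""
    return -math.log(probs[target])


# Hypothetical next-token logits from three members over a 4-token vocabulary.
member_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.3, 1.8, 0.2, -0.5],
    [1.5, 1.4, -0.2, 0.0],
]
target = 0

member_losses = [cross_entropy(softmax(logits), target) for logits in member_logits]
mean_member_loss = sum(member_losses) / len(member_losses)
ensemble_loss = cross_entropy(ensemble_probs(member_logits), target)

print(f"mean member loss: {mean_member_loss:.3f}, ensemble loss: {ensemble_loss:.3f}")
```

The inequality holds for any set of members; how large the gap is depends on how much the members disagree, which is why independently trained members matter.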

5x, then shrink back down with distillation

To check whether the recipe scales, they draw a data scaling law across four token counts (up to 1.7B). The joint scaling recipe gives roughly 5x (precisely 5.17x) the data efficiency of the standard recipe. This win is roughly constant with respect to token count and is expected to hold when extrapolated to the scale of 10 trillion tokens. It can also be realized with finite models: for example, about 3.7x with a 5-member ensemble of 1B models.

The recipe costs training compute, but inference compute can be reduced with distillation, in which a small single model imitates a large model or ensemble. Distilling an 8-member ensemble (about 2.4B total) into a single 300M model retains about 83% of the gain. Even more surprisingly, self-distillation (distilling a 300M teacher into a new 300M student of the same configuration) lowers the loss, even surpassing the asymptote of the regularization recipe. This connects to the view of self-distillation as implicitly learning a 2-member ensemble.

Although the paper optimizes only IID loss, the trend carries over directly to downstream benchmarks (about 9% improvement) and holds in settings beyond pre-training, such as continued pre-training. Using only 4B math tokens out of 73B reaches performance equivalent to training on all 73B, roughly 17x (precisely 17.5x) the data efficiency.
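For readers unfamiliar with distillation, the sketch below shows one generic objective: the KL divergence between a teacher's next-token distribution and the student's. This soft-target form is an assumption chosen for illustration and may differ from the procedure used in the paper; the teacher probabilities are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs, student_logits):
    """KL(teacher || student) for a single next-token distribution."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


# Hypothetical teacher: an ensemble's averaged next-token distribution.
teacher_probs = [0.55, 0.30, 0.10, 0.05]

untrained_student = [0.0, 0.0, 0.0, 0.0]                # uniform prediction
matched_student = [math.log(p) for p in teacher_probs]  # reproduces the teacher

loss_untrained = distillation_loss(teacher_probs, untrained_student)
loss_matched = distillation_loss(teacher_probs, matched_student)

print(f"untrained: {loss_untrained:.3f}, matched: {loss_matched:.3f}")
```

The loss reaches zero only when the student reproduces the teacher's full distribution, which is how a single small model can absorb part of an ensemble's gain without the ensemble's inference cost.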

Editorial notes

The core of this presentation lies in the idea of turning the "data wall" to advantage. Until now, research has optimized in the direction of "saving compute," but since compute grows about 4x per year while data grows only about 3% per year, we will eventually enter a regime where "compute is plentiful and data is scarce." There, the classical tools of regularization, ensembling, and distillation are revived as weapons of data efficiency. And with the asymptote as a yardstick, recipes are re-evaluated by "how far they can reach with infinite compute." Akshay Vegesna's talk at the same Paper Club raised "the gap in sample efficiency between AI and humans"; this paper's question of "how to learn under data scarcity" is the other side of the same coin. The interest of the Chris Ré lab, to which host Chaubard belongs, in "how far can you generalize with fixed data and infinite compute" connects directly to this final talk, which can be read as the through-line of the whole event.

Points of focus

The structure of a "comeback" for classical methods

Regularization, ensembling, and distillation are techniques that have been around for decades. What makes this paper interesting is that, rather than inventing a novel algorithm, it re-measures these methods in the new regime where "compute is plentiful and data is scarce." In particular, the conclusion that "under data constraints, a committee of small models is better than a single large model" conditionally reverses the recent intuition toward large single models.

The mystery of self-distillation surpassing the asymptote

Simply distilling itself into a model of the same configuration lowers the loss, even surpassing the asymptote of the regularization recipe — the paper explains this counterintuitive result by connecting it to the prior-work view that "self-distillation is implicitly learning a 2-member ensemble." It becomes an emblematic phenomenon showing that the ensemble's gain can be folded into a single model without increasing inference cost.

Video outline (this segment)

  • (50:37) Host introduction — the obsession with sample efficiency, the interest of the Chris Ré lab
  • (51:24) Konwoo Kim takes the stage, introduces the co-authors (Suhas, Percy, Tatsu)
  • (51:38) The history of pre-training expanding capabilities (in-context learning → alignment → reasoning)
  • (52:41) Data at 3% per year, compute at 4–5x per year — the data wall
  • (53:16) The central question — data-constrained, compute-unconstrained pre-training
  • (54:14) Scaling laws and the asymptote, the 200M-token DCLM setup
  • (55:59) The standard recipe overfits
  • (56:28) 30x weight decay, asymptote 3.43
  • (57:44) Ensembling — a lower asymptote, the double limit of joint scaling
  • (1:02:28) Data scaling law — 5.17x data efficiency
  • (1:04:06) Distillation and self-distillation, downstream benchmarks, 17.5x for math CPT

Glossary

data-constrained, compute-unconstrained
A regime in which you have only a small fixed amount of data but can use unlimited compute. Because human-generated internet text grows at 3% per year while pre-training compute grows 4–5x per year, the compute per data point keeps growing at 4x per year, so you eventually enter this regime. It is likened to a small cookbook and infinite cooking time.
asymptote
The flat lower bound to which a scaling law's power law converges. It represents the best loss a recipe can reach with infinite compute, corresponding to the recipe's "ceiling." The lower a recipe's asymptote, the fundamentally better it is. This paper introduced the asymptote as a yardstick for evaluation.
regularization / weight decay
A technique that suppresses a model from memorizing (overfitting) the data. In this paper, weight decay roughly 30x the compute-optimal value is tuned optimally per parameter count, putting the loss on a clean power law (asymptote 3.43). It is like a "strict diet" that prevents memorizing a small amount of data.
ensembling
A classical technique that combines the predictions of multiple independently trained models. When data is scarce, a committee of small independent experts generalizes better than a single large expert. Its asymptote is lower than the regularization recipe's, and it wins even at matched compute. Combined with regularization, joint scaling gives about 5.17x the data efficiency.
distillation / self-distillation
A technique that compresses the behavior of a large model or ensemble by having a small single model imitate it. An 8-member ensemble (about 2.4B total) is distilled into a single 300M, retaining about 83% of the gain. Even self-distillation (distilling into a student of the same configuration) lowers the loss, which connects to the view of implicitly learning a 2-member ensemble.
continued pre-training
Additionally training an existing model on data from a specific domain. In this paper, a 3B model is continued-pre-trained using only 4B math tokens out of all 73B tokens, and through data-efficiency tricks reaches performance equivalent to training on all 73B — demonstrating roughly 17.5x the data efficiency.