Akshay Vegesna · 44:12 "We know that scaling up models makes generalization better. But we don't have a mechanistic understanding of why that happens."
Akshay Vegesna summarizes the current state of machine learning as follows: we know that scaling up models improves generalization, but we have no mechanistic understanding of why. If we understood generalization, we might be able to optimize for it directly, so the payoff of understanding is enormous. The talk traces Andrew Gordon Wilson's argument and lays out the position that "deep learning is not as mysterious or different as it is made out to be."
The Three "Mysteries"
People in the field often describe generalization as a mystery and cite three phenomena as evidence: overparameterization, benign overfitting, and double descent. These are brought up as grounds for the claim that generalization may simply be impossible to understand. Wilson's work resolves this "mystery" by using classical generalization theory that had not previously been applied to explaining overparameterization.
Measuring with PAC-Bayes, a Classic
The first classical theory is PAC-Bayes. It bounds test loss (that is, generalization) from above by the sum of the training loss and a compression term. In the past, overparameterizing a model made this compression term dominant, so the bound became loose and vacuous (meaningless). Wilson points out that this was a misuse of the bound: if the compression term is computed differently, the bound becomes meaningful again.
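As a rough illustration of how the two terms trade off, the sketch below evaluates a textbook finite-hypothesis (Occam) bound, a simple relative of PAC-Bayes in which a hypothesis that can be encoded in a given number of bits receives prior mass 2^(-bits). The function name `occam_bound`, the choice of this particular bound, and all numbers are assumptions made for illustration; this is not the bound computed in the talk or the paper. The loss is assumed to lie in [0, 1].

```python
import math

def occam_bound(train_loss, code_length_bits, n_samples, delta=0.05):
    """Upper bound on test loss that holds with probability >= 1 - delta.

    Assumes a loss bounded in [0, 1] and a prior of 2**(-code_length_bits)
    on the hypothesis (hypothetical setup, chosen for illustration).
    """
    # Compression term: description length in nats plus the confidence penalty.
    complexity = code_length_bits * math.log(2) + math.log(1 / delta)
    return train_loss + math.sqrt(complexity / (2 * n_samples))

n = 1_000_000  # hypothetical number of training examples

# A solution that compresses to far fewer bits than there are examples.
tight = occam_bound(train_loss=0.10, code_length_bits=1_000, n_samples=n)

# A solution that needs far more bits than there are examples.
vacuous = occam_bound(train_loss=0.10, code_length_bits=10_000_000, n_samples=n)

print(f"compressible solution:   bound = {tight:.3f}")
print(f"incompressible solution: bound = {vacuous:.3f}")
```

A bound above 1 says nothing about a loss that lives in [0, 1], which is the vacuous regime described above. The bound only becomes informative when the code length is small relative to the number of examples.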
Why Does Overparameterization Work
The PAC-Bayes framework explains the success of overparameterization well. First, the empirical risk (the training loss) goes down as you add parameters, because the model fits the data better. Second, Wilson's work shows that the more parameters you add, the more "easily compressible solutions" you find (in the work of Lotfi et al., the number of bits needed to encode the training set is negatively correlated with the number of parameters). In other words, the second term of the bound, the compression term, also goes down. Another way to see it is flatness. As you add parameters, the volume of flat minima in parameter space grows exponentially, while the volume of sharp minima does not grow much. Flat minima are easier to compress than sharp minima, so this is consistent with the compression view too. As a result, overparameterization fits within existing theory, and a useful generalization bound can be obtained even at the scale of a billion parameters.
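The link between flatness and compressibility can be made concrete with a toy calculation. The sketch below assumes a one-dimensional quadratic loss around the minimum and asks how coarsely a weight can be rounded before the loss rises by more than a fixed tolerance. The function `bits_per_weight`, the quadratic model, and the curvature values are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import math

def bits_per_weight(curvature, tolerance, weight_range=1.0):
    """Bits needed to store one weight without raising the loss by more than `tolerance`.

    Toy model (an assumption): near the minimum the loss is the 1-D quadratic
    0.5 * curvature * (w - w_star)**2, and the weight lies within
    [-weight_range, weight_range].
    """
    # Rounding to a grid of spacing `step` moves w by at most step / 2,
    # so the loss rises by at most curvature * step**2 / 8.
    step = math.sqrt(8 * tolerance / curvature)
    levels = max(2.0, 2 * weight_range / step)  # at least two grid levels (one bit)
    return math.ceil(math.log2(levels))

flat_bits = bits_per_weight(curvature=0.01, tolerance=1e-3)    # flat minimum
sharp_bits = bits_per_weight(curvature=100.0, tolerance=1e-3)  # sharp minimum

print("bits per weight at the flat minimum: ", flat_bits)
print("bits per weight at the sharp minimum:", sharp_bits)
```

In this toy setting the flat minimum tolerates a much coarser grid, so it needs fewer bits per weight, which is one way to read the statement that flat minima are easier to compress.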
"Benign Overfitting" and Soft Inductive Bias
The next mystery is benign overfitting. A deep net can fit completely random noise, yet it generalizes well on structured data. The puzzle is how it can have the inductive bias that enables generalization when it can also fit random data. A regularized polynomial model gives the intuition: given enough parameters it fits random data, but on structured data the regularization makes it select the lower-order terms. In this way you get both flexibility and inductive bias.
Generalizing, a neural network can be seen as an expressive model that has a soft inductive bias. A hard-constrained hypothesis space fails to model reality, and an unconstrained hypothesis space overfits. The answer is the middle ground: having a highly expressive hypothesis space while preferring solutions that are likely to generalize (for example, easily compressible ones). It is close to a habit of having a large toolbox while reaching for the simplest tool that fits.
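The regularized polynomial intuition above can be sketched in a few lines. The talk does not specify the regularizer, so the code below assumes one particular choice: among all polynomials that interpolate the data, pick the one minimizing sum_j growth**(2*j) * w_j**2, which penalizes higher-order coefficients more heavily (the limit of an order-dependent ridge penalty as its overall strength goes to zero). The function `fit_min_norm_poly`, the `growth` parameter, and the data are hypothetical.

```python
import numpy as np

def fit_min_norm_poly(x, y, degree=7, growth=3.0):
    """Interpolating polynomial that minimizes sum_j growth**(2*j) * w_j**2."""
    powers = np.arange(degree + 1)
    scale = growth ** powers                      # square root of the penalty on w_j
    # Rescale each monomial so that a plain minimum-norm solve applies the penalty.
    features = np.vander(x, degree + 1, increasing=True) / scale
    u = np.linalg.pinv(features) @ y              # minimum-norm interpolating solution
    return u / scale                              # back to ordinary coefficients w_j

def train_error(x, y, w):
    """Largest absolute residual on the training points."""
    return np.max(np.abs(np.vander(x, len(w), increasing=True) @ w - y))

x = np.linspace(-1.0, 1.0, 6)                     # 6 points, 8 coefficients
rng = np.random.default_rng(0)
y_structured = 1.0 + 2.0 * x                      # data with simple structure
y_noise = rng.normal(size=x.shape)                # pure noise

w_structured = fit_min_norm_poly(x, y_structured)
w_noise = fit_min_norm_poly(x, y_noise)

print("train error, structured:", train_error(x, y_structured, w_structured))
print("train error, noise:     ", train_error(x, y_noise, w_noise))
print("largest coefficient of order >= 2, structured:", np.max(np.abs(w_structured[2:])))
print("largest coefficient of order >= 2, noise:     ", np.max(np.abs(w_noise[2:])))
```

Both fits reach essentially zero training error, so the model is flexible enough to fit noise. On the structured data, however, the preference for low-order terms leaves the higher-order coefficients close to zero, which is the soft inductive bias at work in this toy setting.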
No Free Lunch and Sample Efficiency
In conclusion, the so-called "mysteries" of deep learning are consistent with, and partly explained by, existing theory such as soft inductive bias and PAC-Bayes. The question Akshay leaves open lies in sample efficiency. If we could find the right inductive bias, we might be able to optimize directly for it. According to the no free lunch theorem, improvements in learning efficiency can only be obtained through inductive bias. Considering the enormous gap in sample efficiency between AI and humans, he concludes that working on this problem is a good bet.
Editorial Notes
The core of this talk is a stance aimed at "dispelling the mystery." Overparameterization, benign overfitting, and double descent tend to be discussed as a set of three that mystifies deep learning, but Wilson re-explains them with the classical tools of PAC-Bayes and soft inductive bias. The key phrase is "soft inductive bias": the posture of having a large hypothesis space while preferring simple solutions. Akshay and Q Labs position generalization as "AI's central unsolved problem" and hold up a practical approximation of Solomonoff induction. Read in that context, the talk resonates as the declaration of a bet that "capability grows with scale, but if we can understand why it grows, we can engineer sample efficiency." Placed alongside Konwoo Kim's talk at the same Paper Club (pre-training under data constraints), a shared concern surfaces: "the gap in sample efficiency with humans."
Points of Focus
Replacing "Mystery" with "Misuse"
What makes Wilson's argument work is a reframing: the PAC-Bayes bound was "loose and meaningless" in the past not because of a limit of the theory but because of a "misuse of the bound." The point that a useful bound can be obtained even at the scale of a billion parameters, once the compression term is computed differently, dismantles from the inside the received wisdom that "deep learning lies outside classical theory." The framing that this was a matter of application, not the supernatural, sets the tone of the whole talk.
The Geometry Where Flat Minima "Increase"
To explain why overparameterization helps generalization, the talk presents a geometric perspective: "as you add parameters, the volume of flat minima grows exponentially, while sharp minima do not grow much." The flatter the minimum, the easier it is to compress and the better it generalizes. The picture is that adding dimensions creates far more "broad, flat valleys," which gradient descent then falls into more easily, and this gives a physical, tangible feel to the empirical rule of the scaling laws.
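One simple way to see why relative volume can behave this way is to model each basin as a box with the same width in every parameter direction, so that its volume is the width raised to the number of parameters. This box model, the function `log10_volume_ratio`, and the widths used are assumptions for illustration only, not the argument as given in the talk.

```python
import math

def log10_volume_ratio(width_flat, width_sharp, n_params):
    """log10 of (volume of flat basin / volume of sharp basin).

    Toy model (an assumption): each basin is an axis-aligned box whose side
    length is the same in every parameter direction, so volume = width ** n_params.
    Working in log space avoids overflow for large parameter counts.
    """
    return n_params * (math.log10(width_flat) - math.log10(width_sharp))

for n_params in (10, 1_000, 1_000_000_000):
    ratio = log10_volume_ratio(width_flat=1.0, width_sharp=0.5, n_params=n_params)
    print(f"{n_params:>13,} parameters: flat basin is 10^{ratio:.1f} times larger")
```

Even when the flat basin is only twice as wide per direction, the volume ratio doubles with every added parameter, so at large parameter counts almost all of the low-loss volume in this toy model belongs to the flat basin.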
Video Outline (this segment)
- (43:20) Host's introduction: Akshay Vegesna of Q Labs
- (43:54) Takes the stage to cover Andrew Gordon Wilson's paper; mentions collaborating with Andrew at Q Labs
- (44:12) The central question: generalization improves with scale, but why is not understood
- (44:46) The three "mysteries": overparameterization, benign overfitting, double descent
- (45:04) PAC-Bayes: training loss + compression term, the past misuse
- (45:42) Explanation of overparameterization: empirical risk goes down + easily compressible solutions (Lotfi et al.)
- (46:39) The volume of flat minima grows exponentially
- (47:50) Benign overfitting and the intuition from the regularized polynomial model
- (48:46) Soft inductive bias: expressiveness + bias toward simple solutions
- (49:43) No free lunch and the bet on sample efficiency
Related Links
- Paper "Deep Learning is Not So Mysterious or Different" (arXiv 2503.02113, ICML 2025)
- Andrew Gordon Wilson (NYU) profile
- Q Labs official site
- YC Paper Club video (this segment from 43:54)
Glossary
- generalization
- Performance on unseen data rather than on training data. The quantity machine learning ultimately cares about. The recognition that "scaling improves generalization, but mechanistically why is not understood" is the starting point of this talk and of Wilson's paper.
- PAC-Bayes
- A classical framework that bounds test loss (generalization) from above by the sum of the training loss and a compression term. In the past, overparameterization made the compression term dominant and rendered the bound meaningless, but Wilson shows that if the compression term is computed differently, a useful bound can be obtained even at the scale of a billion parameters.
- overparameterization
- Having far more parameters than data points. Classically it should overfit, but in practice generalization improves. The explanations are (1) the empirical risk goes down, (2) more easily compressible solutions are found, (3) the volume of flat minima grows exponentially.
- benign overfitting
- The phenomenon where a model can fit random noise (can overfit) yet generalizes well on structured data. A regularized polynomial model gives the intuition — it fits random data given enough parameters, but on structured data, regularization makes it select the lower-order terms.
- soft inductive bias
- The property of having a large, highly expressive hypothesis space while preferring simpler, more compressible solutions. The middle ground — neither hard-constraining (failing to model reality) nor unconstrained (overfitting). Close to a habit of having a large toolbox while reaching for the simplest tool.
- no free lunch theorem
- The theorem that there is no learner that is universally best across all problems, and that every improvement in learning efficiency can only come from assumptions (inductive biases) about "what kind of data will arrive." Here lies the key to closing the gap in sample efficiency between AI and humans.