Inference Is Not a 'Cost' but a 'Capability' — Tanishq Kumar (Stanford) on Speculative Speculative Decoding

YC Paper Club 2026 / talk approx. 14 min

Tanishq Kumar · 05:53 "Inference today is seen as a lever for cost or convenience. But in one, two, three years, inference will come to be seen as a capability."

This was the first of five talks at the inaugural YC Paper Club (2026-05-20, Y Combinator, Mountain View). It runs approximately 14 minutes (in the video, from 03:49). The speaker is Tanishq Kumar, a Stanford CS PhD student. The paper, "Speculative Speculative Decoding," is co-authored with Tri Dao (Princeton / Together AI) and Avner May (Together AI) (arXiv 2603.03251, ICLR 2026).

Tanishq Kumar is a researcher who has specialized in training, and he admits that he initially thought of inference as "you just hand over the weights and multiply matrices. Why would you need a team for that?" That view was overturned. The talk ties his revised view to a single thesis: inference speed is not merely a matter of efficiency; it is itself the ceiling on the intelligence you can reach.

Inference Is Not a "Cost" but a "Capability"

Two points are often made: inference cost surpasses training cost, and RL, which is a wrapper around inference, is overtaking pre-training in compute. The third point, which Kumar emphasizes, comes from a different angle. If an algorithm's performance scales with "the amount of thinking," then the number of tokens you can emit per second caps the peak intelligence you can draw out in a fixed amount of time. Hence the relationship: the faster the inference, the smarter the model. Kumar half-jokingly offers a future vision of "lining up 20,000 B200s to work solely on the Riemann hypothesis," recasting inference as a problem of capability.

How Speculative Decoding Works

Kumar first carefully illustrates the prerequisite, speculative decoding. A small draft model (tiny llama) reads ahead and drafts several tokens one at a time, and a large target model (big llama) verifies them all together in a single forward pass.

The reason it's fast lies in an asymmetry: "verifying is easier than generating." A transformer can obtain the probabilities of many tokens in a sequence in parallel in one pass, but generation can only proceed one token at a time. The slow sequential generation is delegated to the small, fast model, and the large model uses one pass to check, probabilistically, "would I have emitted this token myself?" Tokens are accepted for as long as they are plausible; at the first one that is not, that token and everything drafted after it are rejected. At the rejected position, you can draw one "bonus token" for free, with no additional forward pass, because the same pass already gave the target's distribution there. This free token pays off later.
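The accept/reject step described above can be sketched as follows. This is a minimal sketch of the standard speculative-sampling rule, not code from the talk or the paper; the function name `verify_draft` and the plain-list representation of distributions are assumptions made for illustration.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """One verification round of standard speculative sampling (illustrative).

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists over the vocab) the draft sampled from.
    target_probs: k + 1 distributions from the target's single forward pass.
    Returns the accepted draft tokens plus exactly one extra token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: draw the replacement from the residual max(0, p - q).
        # No extra forward pass is needed because p is already known.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out  # everything drafted after the rejection is discarded
    # Every draft token was accepted: sample the extra token from the
    # target's distribution at the position after the draft.
    last = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

Either way the round yields one token beyond the accepted drafts, which is why even a round that rejects immediately still makes progress.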

The Sequential-Dependency Bottleneck

Kumar describes speculative decoding as a currency exchange that "trades flops for latency." If a prediction computed in advance turns out to be correct, you can fast-forward through time, the same deep idea as speculative execution on a CPU. But ordinary speculative decoding cannot be pushed indefinitely. Draft too many tokens and the acceptance rate drops. The biggest bottleneck is the sequential dependency between draft and target: until the verification of round T is finished, the drafting of round T+1 cannot begin, because the next draft must be conditioned on a prefix that includes the previous verification result.
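Why longer drafts stop paying off can be seen in a toy model. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, a common simplification that real models only approximate; the function names and the timings in the usage note are hypothetical and do not come from the talk.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per round when k tokens are drafted and each
    is accepted independently with probability alpha (a simplification).
    Counts the accepted drafts plus the one extra token from the target."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def sequential_tokens_per_sec(alpha: float, k: int,
                              t_draft: float, t_verify: float) -> float:
    """Ordinary speculative decoding: drafting and verification alternate,
    so one round costs k draft steps plus one verification pass."""
    return expected_tokens(alpha, k) / (k * t_draft + t_verify)
```

With made-up timings such as 2 ms per draft step and 20 ms per verification pass, throughput in this model rises with k at first and then falls again, because each added draft token costs time in every round but is less and less likely to survive verification.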

SSD — Running Drafting and Verification Simultaneously

The high-level idea of SSD (Speculative Speculative Decoding) is simple — parallelize the sequential operations. Make drafting and verification happen simultaneously. Normally they run alternately on the same hardware, but in SSD the two are split across separate hardware (in the paper, the target is placed on 4 H100s and the draft on a separate single H100) and run simultaneously.

While the target is verifying the current round, the draft immediately reads ahead to the most likely consequence of that verification result, and starts drafting the next round ahead of time on top of it. If it's correct, the answer is ready the moment the target next asks for a draft — you can hide the drafting latency entirely. Furthermore, because verification takes time, more tokens can be drafted during that interval, raising the expected number of tokens per round and making it even faster.
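The effect of hiding the drafting latency can be illustrated with a rough timing model. This is a sketch under loud assumptions rather than the paper's analysis: it ignores communication between devices, and it assumes that on a wrong guess the draft is simply redone after verification, which is only one possible backup strategy. All names and parameters are hypothetical.

```python
def sequential_round_time(k: int, t_draft: float, t_verify: float) -> float:
    """Ordinary speculative decoding: draft k tokens, then verify them."""
    return k * t_draft + t_verify

def ssd_round_time(k: int, t_draft: float, t_verify: float,
                   hit_rate: float) -> float:
    """Toy model of a round when drafting overlaps with verification.

    On a hit, the next draft is already waiting when verification ends, so
    the round costs only as much as the slower of the two sides.
    On a miss, this model assumes the draft is redone after verification.
    """
    overlapped = max(t_verify, k * t_draft)  # both sides run concurrently
    miss_penalty = k * t_draft               # assumed fallback: draft again
    return overlapped + (1.0 - hit_rate) * miss_penalty
```

In this model a hit rate of 1.0 removes the drafting time from the round entirely, while a hit rate of 0.0 falls back to the sequential cost whenever drafting fits inside the verification window.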

Guessing the Verification Result

The design challenge is "can you predict the result of verification in advance?" Verification is the step that uses the large model's intelligence, so it ought to be hard to predict. The key lies in information the draft itself holds. When the draft generates a token (the "blue token" in the talk's illustration), other candidate tokens that it did not adopt remain in its distribution. Those become candidate bonus tokens for verification. In other words, you use the draft model's token distribution to anticipate the likely outcomes on the target side. The candidate space is broad, since it spans the whole vocabulary (tens of thousands to over a hundred thousand tokens), but in practice the guesses hit with up to 90% accuracy, which is enough for a speedup. The multiple guessed sequences are simply decoded in parallel on top of the shared prefix.
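One way to picture the guessing step is as a cache of pre-drafted continuations keyed by the guessed verification outcome. The sketch below is an illustration under assumptions, not the paper's implementation: the names `guess_outcomes` and `predraft`, the `fanout` parameter, and the `draft_fn` callback are all hypothetical, and the loop stands in for what would be a single batched decode on real hardware.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Outcome = Tuple[int, int]  # (number of accepted draft tokens, bonus token id)

def guess_outcomes(draft_tokens: Sequence[int],
                   draft_probs: Sequence[Sequence[float]],
                   fanout: int) -> List[Outcome]:
    """Enumerate likely verification outcomes from the draft's own
    distributions. draft_probs has len(draft_tokens) + 1 entries; the last
    one is the draft's distribution after the full draft (all accepted)."""
    guesses: List[Outcome] = []
    for n_accepted, probs in enumerate(draft_probs):
        drafted: Optional[int] = (
            draft_tokens[n_accepted] if n_accepted < len(draft_tokens) else None
        )
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        # At a rejected position the bonus token is never the rejected token
        # itself, so only the draft's non-adopted candidates are considered.
        alternatives = [t for t in ranked if t != drafted][:fanout]
        guesses.extend((n_accepted, t) for t in alternatives)
    return guesses

def predraft(prefix: Sequence[int], draft_tokens: Sequence[int],
             guesses: Sequence[Outcome],
             draft_fn: Callable[[List[int]], List[int]]
             ) -> Dict[Outcome, List[int]]:
    """Draft the next round for every guessed outcome. All guesses share
    `prefix`, so a real system could decode them together as one batch."""
    cache: Dict[Outcome, List[int]] = {}
    for n_accepted, bonus in guesses:
        guessed = list(prefix) + list(draft_tokens[:n_accepted]) + [bonus]
        cache[(n_accepted, bonus)] = draft_fn(guessed)
    return cache
```

When the real verification result arrives, looking up `(n_accepted, bonus)` in the cache either returns a ready-made draft (a hit) or nothing (a miss), in which case some fallback is needed.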

Results — Getting Both Latency and Throughput

Kumar self-deprecatingly holds up "numbers going up" as the "north star of AI research." Compared against open-source engines (vLLM's speculative decoding, and SGLang, which was the fastest), SSD comes out faster. Speculative decoding usually helps latency, and whether it helps throughput is unclear, but in this setting SSD helped both. The paper states its average modestly, as roughly 30% faster than optimized speculative decoding, but in the talk Kumar closed with: "Next time I'm watching people dance at a San Francisco house party, I can stand in the corner thinking, I know how to run Llama 3 70B at 300 tokens per second on 4 H100s." He added, jokingly, that this is sensitive information.

Editorial Notes

The value of this talk distills into one line: "redefining inference as a capability." Inference optimization usually stays confined to the operational story of "cheaper, faster, more convenient." Kumar brings to that the equation "if performance scales with the amount of thinking, then inference speed = the ceiling on intelligence," and backs that claim with a concrete algorithm in SSD. In an era where test-time scaling (the trend of stacking compute at inference time to get smarter) is becoming a premise, the framing of "speed is not a luxury but a capability" lifts speculative decoding's position from the operational layer to the frontier of research.

Points to Watch

The transformer asymmetry that "verification is easier than generation"

What underpins all of speculative decoding, including SSD, is the structural asymmetry whereby a transformer "can obtain the probabilities of many tokens in a sequence in parallel in one pass, but can only generate one at a time." It can be likened to a division of labor between a fast drafter and a slow proofreader who checks the whole thing at a glance. The counterintuitive property that proofreading is faster than drafting is the foundation of the speedup.

The idea of "parallelizing on separate hardware" for the sequential dependency

The core of SSD is not new mathematics but an engineering decision that solves the sequential dependency through hardware placement. Instead of housing target and draft together in the same box, they are split across separate GPUs and run concurrently. Drafting the "most likely consequence" ahead of time without waiting for verification is a configuration that brings the CPU's branch prediction (speculative execution) straight into LLM inference. If it's right, you can fast-forward through time; if it's wrong, you switch to a backup strategy.

Video Outline (This Segment)

(03:49) Tanishq Kumar takes the stage; the title having "speculative" twice is intentional
(04:00) Self-introduction (Stanford PhD student), joint research with Tri Dao and Avner May
(05:53) The central thesis — inference will be seen as a "capability" in the near future
(06:30) Side-by-side demo of fast inference (autoregressive / vLLM speculative / SSD)
(07:54) Illustration of speculative decoding — draft and target, bonus tokens
(10:48) Speculation as a currency exchange, the sequential-dependency bottleneck
(11:53) The SSD idea — parallelizing drafting and verification, separate-hardware placement
(13:36) Predicting the verification result, leveraging the draft's non-adopted tokens
(16:09) Results — comparison with vLLM / SGLang, getting both latency and throughput

Related Links

Tanishq Kumar

Stanford CS PhD program / inference acceleration (Speculative Speculative Decoding)

Glossary

Speculative Decoding: An LLM inference-acceleration technique in which a small draft model generates tokens one at a time by reading ahead, and a large target model verifies them all together in a single forward pass. Using the transformer asymmetry that "verification is easier than generation," it accepts only tokens that the large model would not find unnatural to emit. At a rejection point, you can draw one bonus token with no additional compute.
SSD (Speculative Speculative Decoding): A technique that eliminates the draft↔target sequential dependency remaining in ordinary speculative decoding (arXiv 2603.03251, ICLR 2026). It runs drafting and verification concurrently on separate hardware; while the target is verifying, the draft predicts the verification result and drafts the next round ahead of time. It hides the drafting latency and improves both latency and throughput. The optimized engine is named Saguaro.
draft model / target model: The draft (small model) is the role that proposes "tokens likely to come next" ahead of time via fast sequential generation. The target (large model, the real one) is the role that verifies the draft's proposals in one pass and accepts only tokens it would not find unnatural to emit. In SSD, the two are split across separate GPUs and run concurrently.
bonus token: In speculative decoding, a token that can be sampled with no additional forward pass at a position where a token was rejected. In SSD, candidate tokens the draft did not adopt while drafting are reused as candidate bonus tokens for verification, and are used to predict the verification result (up to 90% accuracy).
test-time scaling / inference = capability: Refers to the family of techniques whose performance grows the more compute is stacked at inference time. Kumar's central thesis is that if performance scales with "the amount of thinking," the number of tokens per second becomes the very ceiling on the intelligence you can draw out. This makes inference speed a matter of capability rather than of cost or convenience.
