Beyond 'RAG Is Dead' — Kuba Rogut (Turbopuffer) on Agentic Retrieval and 'the Right Million Tokens'

AI Engineer Europe 2026 (London) / talk approx. 11 min

Jeff Dean (quoted by Kuba Rogut) · 10:00 "You don't need trillions of tokens at once. What you need is 'the right million tokens.'"

The talk "RAG is dead, right??" was given at AI Engineer Europe 2026 (London); the video was published on 2026-06-09 on the AI Engineer official channel. The speaker is Kuba Rogut, a deployed engineer at Turbopuffer, a database for full-text and vector search built from first principles on top of object storage. Starting from the "RAG is dead" meme on X (formerly Twitter), the talk traces how search moved from a one-shot vector DB call to agentic retrieval, in which an agent iteratively hunts for context, and illustrates the shift with real examples from Cursor and Claude Code.

"RAG is dead, right?" — Kuba Rogut holds up this meme, which flooded X from late 2025 into early 2026, with a touch of irony. But look at Google search volume, and RAG's search interest actually spiked in mid-2025. "Twitter, look at this." The talk's claim is simple — RAG is not dead. The naive form of RAG, the "one-shot vector search," has merely transformed into agentic retrieval, where you search iteratively using tools.

What RAG is, what agentic search is

First, let's sort out the terminology. RAG stands for Retrieval-Augmented Generation. Most people take it to mean the simple vector-search form: "embed a corpus, do a vector search, and just hand it to the LLM." Kuba, speaking for Turbopuffer, takes retrieval more broadly: it covers not only vector search but also full-text search (BM25), grep / glob, regular expressions, and basic filters. The augmented generation is the part that hands those results to the LLM.

Agentic search gets the same treatment. Many people use the term for behavior like Claude Code / Codex "grepping the filesystem," but Kuba's definition is broader: give an agent a set of tools and have it find context and reason step by step, iteratively. Claude Code, for instance, reads files, judges whether it found what it needed, and if not, searches again, repeating until satisfied.
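The search, read, judge, repeat loop can be sketched in a few lines. This is a minimal illustration, not how Claude Code or Cursor is implemented: the in-memory `FILES` corpus, the `grep` and `read` tools, and the `found` / `next_pattern` callbacks (which stand in for the model's judgment) are all hypothetical.

```python
import re

# Hypothetical in-memory "filesystem": path -> file contents.
FILES = {
    "api/query.py": "from filters import apply_metadata_filter\n\n"
                    "def run_query(q):\n    return apply_metadata_filter(q)\n",
    "filters.py": "def apply_metadata_filter(q):\n"
                  "    return [r for r in q.rows if q.where(r)]\n",
    "README.md": "Query engine. See api/query.py for the entry point.\n",
}

def grep(pattern):
    """Tool: paths whose contents match the regular expression."""
    return sorted(p for p, text in FILES.items() if re.search(pattern, text))

def read(path):
    """Tool: return the contents of one file."""
    return FILES[path]

def agentic_search(start_pattern, found, next_pattern, max_steps=5):
    """Search, read, judge, and refine the query until `found` is satisfied."""
    pattern, context, seen = start_pattern, [], set()
    for step in range(1, max_steps + 1):
        for path in grep(pattern):
            if path not in seen:
                seen.add(path)
                context.append((path, read(path)))
        if found(context):                # stand-in for the model judging "is this enough?"
            return context, step
        pattern = next_pattern(context)   # stand-in for the model choosing the next query
    return context, max_steps

context, steps = agentic_search(
    start_pattern=r"run_query",
    found=lambda ctx: any("def apply_metadata_filter" in text for _, text in ctx),
    next_pattern=lambda ctx: r"def apply_metadata_filter",
)
print(steps, [path for path, _ in context])
```

In this toy run the first query only reaches the caller, so the loop issues a second, narrower query before it stops. In a real agent both callbacks would be model decisions, and each step would cost tokens.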

The Cursor case — the impact of the Merkle tree and semantic search

As an example of a Turbopuffer customer doing agentic search well, Kuba cites Cursor (one of Turbopuffer's earliest customers). When Cursor opens a new codebase, it chunks, parses, and embeds the code to make it semantically searchable. The clever part is detecting duplicate codebases within a team using a Merkle tree (a cryptographic hash tree). A 100-person team usually opens the same one or few codebases, and re-chunking, re-embedding, and re-uploading every time is expensive. So Cursor computes codebase similarity with the Merkle tree and, if two codebases are similar enough, copies the existing data and re-chunks / re-embeds only the files that changed. Turbopuffer supports doing this safely.
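To make the Merkle tree idea concrete, here is a minimal sketch of hashing a codebase bottom-up and then walking two trees to find only the files that differ. It is an illustration under stated assumptions, not Cursor's implementation: the nested-dict representation of a codebase and the function names `merkle` and `changed_files` are hypothetical, and the "similar enough" threshold and the handling of deleted files are left out.

```python
import hashlib

def merkle(node):
    """Hash a codebase given as nested dicts (directories) and strings (file contents).

    Returns (hash, children); children is None for a file.
    """
    if isinstance(node, str):
        return hashlib.sha256(b"file:" + node.encode()).hexdigest(), None
    children = {name: merkle(child) for name, child in sorted(node.items())}
    digest = hashlib.sha256(b"dir:")
    for name, (child_hash, _) in children.items():
        digest.update(f"{name}={child_hash};".encode())
    return digest.hexdigest(), children

def changed_files(old, new, path=""):
    """Paths in `new` that are new or modified relative to `old`.

    Subtrees with equal hashes are skipped without being visited.
    """
    if old is not None and old[0] == new[0]:
        return []                      # identical subtree: reuse the existing embeddings
    new_children = new[1]
    if new_children is None:
        return [path]                  # a file that must be re-chunked and re-embedded
    old_children = (old[1] or {}) if old is not None else {}
    out = []
    for name, child in new_children.items():
        child_path = f"{path}/{name}" if path else name
        out.extend(changed_files(old_children.get(name), child, child_path))
    return out

indexed = merkle({"src": {"a.py": "print(1)", "b.py": "print(2)"}, "README.md": "hi"})
opened = merkle({"src": {"a.py": "print(1)", "b.py": "print(3)", "c.py": "x"}, "README.md": "hi"})
print(changed_files(indexed, opened))  # ['src/b.py', 'src/c.py']
```

Because a directory's hash depends on every hash beneath it, a single comparison at the root is enough to tell that two codebases are identical, and the walk only descends into directories where something actually changed.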

The impact is also drawn from Cursor's blog (internal benchmark figures that Kuba presents). Giving the model semantic search yields, on average across models, a 12.5–13.5% improvement in answer accuracy, and about 24% improvement for the Composer model (pre–Composer 2). In online A/B tests, large codebases saw about a 2.6% improvement in code retention and about a 2.2% reduction in unsatisfactory requests. The numbers look small because semantic search doesn't fire on every query (queries where it doesn't fire are still in the denominator).
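The dilution effect in the last sentence can be written out as simple arithmetic. The sketch below assumes, for illustration only, that requests where semantic search does not fire see no change; the function name and the input values are hypothetical and are not Cursor's figures.

```python
def diluted_lift(fire_rate, lift_when_fired):
    """Overall lift across all requests when only a fraction can benefit.

    Assumes requests where the feature does not fire see zero change.
    """
    return fire_rate * lift_when_fired

# Hypothetical inputs: the feature fires on 25% of requests and improves those by 10%.
print(f"{diluted_lift(0.25, 0.10):.1%}")
```

Under these assumed inputs, a sizeable improvement on the affected requests shows up as a much smaller number once it is averaged over every request.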

Claude Code doesn't use semantic search — the "cache compute" framing

By contrast, Claude Code doesn't use vector search. According to Boris Cherny (the creator of Claude Code), early Claude Code tried RAG and a local vector DB, but it didn't work well. A key view Turbopuffer has internalized is that embeddings and semantic search are a kind of cache compute: an investment you compute and cache ahead of time. Claude Code–style exploration is per-session: every time you ask "how does the metadata filtering work?" it re-finds the files via grep, read, evaluate, and iterate. If 10 agents × 10 days × 10 developers repeat the same question, they retread the same steps each time and eat tokens (even 6,000 tokens per substep adds up).

The Cursor style is the reverse: pay the up-front indexing cost once, and at runtime just query with lightweight tools — "how is the metadata filtered?" You get the result instantly and save tokens, time, and cost. Internally at Turbopuffer, Kuba reveals, people who had been heavy Claude Code users began switching to Cursor for the speed (Composer 2 + semantic understanding).
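The token accounting behind this comparison can be sketched as follows. Only the 10 × 10 × 10 repetition count and the 6,000 tokens per substep come from the talk; the four substeps per session, the one-time index cost, and the per-query cost are assumptions chosen for illustration, and treating indexing cost as if it were measured in the same tokens is a simplification.

```python
import math

def exploration_tokens(sessions, substeps, tokens_per_substep):
    """Per-session exploration: every session repeats every substep."""
    return sessions * substeps * tokens_per_substep

def indexed_tokens(sessions, index_cost, tokens_per_query):
    """Up-front indexing: pay once, then a cheap query per session."""
    return index_cost + sessions * tokens_per_query

def break_even_sessions(index_cost, substeps, tokens_per_substep, tokens_per_query):
    """Smallest number of sessions at which indexing is no more expensive."""
    saved_per_session = substeps * tokens_per_substep - tokens_per_query
    if saved_per_session <= 0:
        return None  # indexing never pays off under these inputs
    return math.ceil(index_cost / saved_per_session)

sessions = 10 * 10 * 10        # 10 agents x 10 days x 10 developers (from the talk)
SUBSTEPS = 4                   # assumption: grep, read, evaluate, iterate
TOKENS_PER_SUBSTEP = 6_000     # from the talk
INDEX_COST = 2_000_000         # assumption: one-time cost expressed in tokens
TOKENS_PER_QUERY = 2_000       # assumption: cost of one indexed lookup

print(exploration_tokens(sessions, SUBSTEPS, TOKENS_PER_SUBSTEP))   # 24000000
print(indexed_tokens(sessions, INDEX_COST, TOKENS_PER_QUERY))       # 4000000
print(break_even_sessions(INDEX_COST, SUBSTEPS, TOKENS_PER_SUBSTEP, TOKENS_PER_QUERY))  # 91
```

With these assumed inputs the index pays for itself after 91 repeated sessions. For a question asked only a handful of times, the same formulas favor per-session exploration.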

From RAG to agentic retrieval — "the right million"

Sophisticated customers no longer do the naive RAG of "search once with vectors and dump it into the context." The agent calls out many times, reasons over multiple steps, uses semantic search or full-text search as needed, and pulls only what the use case requires. Retrieval is no longer a one-shot VectorDB call; it has become iterative. Kuba quotes Jeff Dean on Google's staged retrieval: even when the context window reaches trillions of tokens, what you need is not trillions at once, but to narrow down in stages with a lightweight mechanism to "the right million." Turbopuffer's customers store trillions of tokens, but what matters, he concludes, is extracting "the right hundred thousand, ten thousand, or million" and handing it to the window.
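One way to picture staged narrowing is a pipeline in which each stage is cheap and shrinks the candidate set before the next. The sketch below is an assumption-laden toy, not Google's or Turbopuffer's pipeline: the `Doc` type, the `repo` metadata field, the term-overlap score, and the whitespace token count are all stand-ins for real filters, rankers, and tokenizers.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    id: str
    repo: str
    text: str

def count_tokens(text):
    """Crude token proxy: whitespace-separated words."""
    return len(text.split())

def staged_retrieve(docs, query, repo, budget, top_k=10):
    """Narrow a corpus in three cheap stages down to a token budget."""
    # Stage 1: metadata filter, the cheapest cut.
    candidates = [d for d in docs if d.repo == repo]
    # Stage 2: lexical score (shared terms); drop documents with no overlap.
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d) for d in candidates]
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: pair[0], reverse=True)[:top_k]
    # Stage 3: pack the best documents that still fit in the budget.
    picked, used = [], 0
    for _, doc in ranked:
        cost = count_tokens(doc.text)
        if used + cost <= budget:
            picked.append(doc)
            used += cost
    return picked, used

DOCS = [
    Doc("a", "engine", "metadata filter applies where clause to rows"),
    Doc("b", "engine", "filter pushdown notes"),
    Doc("c", "engine", "release checklist"),
    Doc("d", "website", "metadata filter docs page"),
]
picked, used = staged_retrieve(DOCS, "metadata filter", repo="engine", budget=8)
print([d.id for d in picked], used)  # ['a'] 7
```

The point of the shape is that the expensive resource, the context window, only ever sees what survived the earlier and cheaper stages.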

Editorial Notes

The value of this talk is that it replaces the inflammatory "RAG is dead" meme with an unglamorous technical fact: retrieval has evolved from a one-shot call to iterative agentic retrieval. What works in Kuba's framework is the "embeddings = cache compute" framing, which contrasts Claude Code's per-session exploration (grepping from scratch every time) with Cursor's up-front indexing in terms of token accounting. The talk is a vendor's position (Turbopuffer sells a vector DB), yet it openly acknowledges that Claude Code doesn't use semantic search (Boris Cherny). Its intellectual honesty lies in presenting this not as "which is right" but as a trade-off: do you prepay for the cache, or pay each time at runtime? Placed next to the harness argument and other Claude Code–family articles, it becomes one piece, from the search side, of the central 2026 question of "how to carry context."

Points of Focus

Search volume as counter-evidence

Against the claim that "RAG is dead," Kuba cites not the meme volume on X but Google search volume: RAG's search interest spiked in mid-2025. Setting the felt sense of the discourse (social media) against an indicator of real demand (search volume) to challenge the conventional wisdom is a fine example of cooling hype with primary data.

"Cache compute" decides tool choice

Whether Claude Code (per-session exploration) or Cursor (up-front indexing) is faster depends on how many times the same understanding has to be retrieved. In settings where the same question is repeated across many people over many days, indexing once pays off; for one-off tasks, per-session exploration is enough. The on-the-ground testimony that people inside Turbopuffer began switching from Claude Code to Cursor backs up this accounting.

Video Outline

  • (00:00) Self-introduction (Kuba, deployed engineer at Turbopuffer), what Turbopuffer is
  • (00:46) The "RAG is dead" meme vs. the paradox of Google search volume
  • (01:31) Definition of RAG — retrieval is not only vector search but full-text search, grep, regex, filters
  • (02:17) Definition of agentic search — searching step by step and iteratively with tools, and reasoning
  • (03:02) The Cursor case — indexing the codebase, duplicate detection with a Merkle tree
  • (04:41) The impact of Cursor's semantic search — answer accuracy +12.5–13.5% / Composer +24% / A/B retention and unsatisfactory rate
  • (06:08) Claude Code doesn't use vector search (Boris Cherny)
  • (06:29) Embeddings = cache compute, a trace comparison of per-session exploration vs. up-front indexing
  • (08:43) From RAG to agentic retrieval — iterative, taking only what's needed
  • (10:00) Jeff Dean "the right million tokens" / staged retrieval

Glossary

RAG (Retrieval-Augmented Generation)
A method of handing context retrieved by search to an LLM to generate. Many people use it to mean the simple form of "vector search → LLM," but Kuba takes retrieval broadly, including full-text search (BM25), grep / glob, regular expressions, and filters. What is "dead" is the naive one-shot form, not retrieval itself.
agentic search / agentic retrieval
Search where an agent is given a set of tools and finds context and reasons step by step, iteratively. Not a one-shot VectorDB call, but repeating semantic search / full-text search as needed and taking only what the task requires. The exploration in Claude Code / Cursor is an example.
cache compute
A view that treats embeddings / semantic search as "an investment you compute and cache ahead of time." Claude Code–style per-session exploration recomputes the same understanding from scratch each time and spends tokens on it, while the Cursor style indexes up front and queries with lightweight tools at runtime. The upstream cost is paid once.
Merkle tree
A cryptographic hash tree. Cursor computes the similarity of the multiple codebases a team opens using a Merkle tree, and if they are similar enough, copies the existing data and only re-chunks / re-embeds the changed files. It is a mechanism to avoid the cost of a full rebuild, done safely on top of Turbopuffer.
staged retrieval / "the right million"
The idea that even with a huge context window, instead of handing over trillions of tokens at once, you narrow down in stages to "the right million (or hundred thousand, or ten thousand) tokens." Kuba agrees with Jeff Dean's "not trillions at once, but the right million." Turbopuffer's customers store trillions of tokens, but what matters is extracting the right subset.