You can't just one-shot it — a Granola Product Engineer on the production feedback loop for LLM features

AI Engineer Code Summit (NYC) · May 10, 2026

Mehedi Hassan · 08:30 The answer isn't "one-shot better." It's how you build a feedback loop where you're playing tennis with the LLM. Do that, and the final product doesn't feel like a black box — it feels like magic.

A technical talk at the AI Engineer Code Summit (NYC, May 2026). Around 9 minutes. The speaker is Mehedi Hassan, Product Engineer at Granola.

A 9-minute lightning session, but the content is an on-the-ground reinforcement of the vibe coding trilogy — Karpathy, Schluntz, Boris Cherny — from a startup actually running these features in production. Mehedi describes himself as a veteran web developer "coding since jQuery was cool" who has lived through both major shifts (React, and now LLMs); from that position he spends 9 minutes on "what it takes to actually run AI features in production."

Granola was founded in 2024 in London by Chris Pedregal (former Google PM) and Sam Stephenson, building an AI meeting-notes app. The distinctive UX: "sits in the Dock; while the user takes notes by hand, AI records system audio + microphone audio in the background and produces a summary when the meeting ends." It raised $20M in October 2024, and $125M at a $1.5B valuation in March 2026 — a high-growth startup expanding "from meeting notes to enterprise AI apps." Mehedi, as Product Engineer there, runs LLM features in production every day.

The message of the video can be summarized in a line: "You can't just one-shot it." In the LLM world, you might assume that a single line of code calling an API delivers a magical feature — in production, that illusion breaks. What's needed instead, according to Mehedi: build a "feedback loop where you're playing tennis with the LLM." He shows two concrete implementations in 9 minutes — (1) an in-house tracing tool, (2) turning the Electron app into a Web Shell.

From the MEMEX editorial perspective, what matters is that this is the "field Product Engineer translation" of the vibe coding trilogy:

Karpathy's "verifiability is the new bottleneck" → Mehedi's "in-house tracing tool that surfaces the whole agent loop"
Schluntz's "become the Claude PM" → Mehedi's "tennis-rally feedback loop with the LLM"
Boris Cherny's "Claude Code writes 80% of the code at Anthropic" → Mehedi's "Cursor tests the PR and uploads screenshots"

After the theory (Karpathy), the inside-the-organization view (Boris), and the production best practices (Schluntz), Mehedi's session adds the final piece to the industry landscape — "the tactics of a field engineer who actually ships every day."

Key observations

How "add the Web Search tool in one line" breaks in production (02:00 – 03:30)

Mehedi's first concrete example. The "Web Search tool" that Anthropic and OpenAI provide via their official SDKs looks like a one-line addition. "Just add the web search tool and you expect it to just work. That's what the labs want you to believe" (02:30). In production, the reality is different.

Three problems that come up in production:

Token cost blow-up: complex queries balloon the context. "10 pence per chat" is possible. Doesn't hold at millions-of-users scale
Provider fragility: "one day the model gets updated and Web Search quality regresses. Out of our control" (03:00). A lab-side update silently breaks the UX
Competition at a different order of magnitude: "Web Search is the core business of billion-dollar companies. Adding one line to your LLM pipeline isn't going to compete" (03:25)
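
To put the first problem in numbers, here is a back-of-the-envelope sketch in TypeScript. The only figure taken from the talk is "10 pence per chat"; the user count, the chats-per-user figure, and the function name `monthlyCostGbp` are hypothetical and chosen purely for illustration.

```typescript
// Back-of-the-envelope cost projection.
// Only the 10p-per-chat figure comes from the talk; all volumes are hypothetical.
const PENCE_PER_CHAT = 10;

/** Monthly spend in pounds for a given number of users and chats per user. */
function monthlyCostGbp(users: number, chatsPerUserPerMonth: number): number {
  const totalPence = users * chatsPerUserPerMonth * PENCE_PER_CHAT;
  return totalPence / 100; // convert pence to pounds
}

// Hypothetical scale: one million users, five web-search chats each per month.
const projectedMonthlyGbp = monthlyCostGbp(1000000, 5);
```

Under those assumed volumes the projection is 1,000,000 × 5 × £0.10 = £500,000 per month, which is the sense in which a per-chat cost that looks small stops holding at scale.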

The observation isn't simply a Granola war story — it's the structural point that "integrating an LLM API isn't differentiation." The lab SDK does 80% of the work, and the remaining 20% is the production-quality wall — a reality the industry broadly feels, articulated in two minutes.

"One prompt can't satisfy everyone" — the output style problem (03:30 – 04:30)

The second problem is output. In Granola's meeting summary feature, Mehedi's concrete examples:

Sales rep → wants a dual-focus summary (the other company + the sales team's perspective)
Engineer → action items, blockers, Linear ticket candidates
HR → a completely different format
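
One way to picture the role split above is a prompt builder that appends a role-specific style instruction to a shared base. This is a minimal sketch, not Granola's implementation: the `Role` type, the `STYLE_BY_ROLE` table, and the wording of every instruction are assumptions, and the HR entry is a placeholder because the talk does not describe that format.

```typescript
// Hypothetical role-specific style instructions. The roles mirror the talk's
// examples; the instruction wording is an illustrative assumption.
type Role = "sales" | "engineer" | "hr";

const STYLE_BY_ROLE: Record<Role, string> = {
  sales: "Summarize from two angles: the other company's position and our sales team's perspective.",
  engineer: "List action items, blockers, and candidate tickets.",
  hr: "Placeholder: use the format agreed with the HR team.", // format not specified in the talk
};

const BASE_INSTRUCTION = "Summarize the meeting transcript below.";

/** Builds the prompt for one role by adding its style instruction to a shared base. */
function buildSummaryPrompt(role: Role, transcript: string): string {
  return [BASE_INSTRUCTION, STYLE_BY_ROLE[role], "Transcript:", transcript].join("\n\n");
}
```

Keeping the styles in one table means a new role is a one-entry change, and each variant can be inspected and tuned separately instead of stretching a single prompt to cover everyone.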

"One prompt normally can't satisfy everyone. And the LLM is stubborn. You have to look inside and find a way to get the behavior you want" (04:10). "LLM behavior is largely seen as a black box, but we want to go very deep into the details" (04:25) — he asks Product Engineers to commit to opening the black box.

This observation is the mirror of Amanda Askell on Claude's character design. Amanda is on the "bake a good character into the model" side; Mehedi is on the "find room to tune the baked-in character from the product side." The same black box, seen from both the trainer's and the integrator's angles.

The in-house tracing tool — "where one-shotting is OK" (04:30 – 06:00)

Mehedi's first fix. "Thanks to LLMs, this kind of thing actually can be one-shotted. This is where one-shotting is fine" (04:45). They built their own tracing tool in-house.

Feature requirements:

Tool-call visibility — fully traceable from start to finish
Individual tool calls, search trails, reasoning trails, costs
Data structures designed "the way we want them"
The UI is built so that not just engineers but also product, data, and CX (Customer Experience) staff can use it
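
The requirements above can be sketched as a small trace schema. The talk does not show Granola's actual data structures, so every type and field name here (`TraceEvent`, `costMicroUsd`, `describeEvent`, and so on) is an assumption made for illustration.

```typescript
// Hypothetical trace schema; names are assumptions, not Granola's structures.
type TraceEvent =
  | { kind: "tool_call"; tool: string; input: string; output: string; costMicroUsd: number }
  | { kind: "search"; query: string; resultCount: number; costMicroUsd: number }
  | { kind: "reasoning"; text: string; costMicroUsd: number };

interface Trace {
  id: string;
  events: TraceEvent[]; // in execution order, from start to finish
}

/** Total cost of a trace in millionths of a dollar (integers avoid float rounding). */
function totalCostMicroUsd(trace: Trace): number {
  return trace.events.reduce((sum, event) => sum + event.costMicroUsd, 0);
}

/** One plain-language line per event, so non-engineers can follow the trail. */
function describeEvent(event: TraceEvent): string {
  switch (event.kind) {
    case "tool_call":
      return `Called tool "${event.tool}"`;
    case "search":
      return `Searched for "${event.query}" (${event.resultCount} results)`;
    case "reasoning":
      return `Reasoned: ${event.text}`;
  }
}

// Illustrative trace with made-up values.
const exampleTrace: Trace = {
  id: "trace-1",
  events: [
    { kind: "reasoning", text: "Need current pricing.", costMicroUsd: 500 },
    { kind: "search", query: "pricing page", resultCount: 8, costMicroUsd: 300 },
    { kind: "tool_call", tool: "fetch_page", input: "result 1", output: "page text", costMicroUsd: 1200 },
  ],
};
```

A discriminated union keeps each event kind explicit, which makes it straightforward to render a readable timeline for product, data, and CX staff while still summing costs across the whole loop.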

"In the past, you'd hand this off to a SaaS provider — there was no time to build it yourself. But today, we actually have time to build a tracing tool tailored to our own needs" (05:30) — a structural observation that the drop in software construction cost in the LLM era has changed the economics of in-house tooling. Granola chose to build instead of using a SaaS option like Raindrop, and the reasoning is summarized here.

The decisive example: "our founder himself fully traces the agent loop from front to back and identifies what broke" (05:50). A design goal of "a tool the CEO uses to read API traces." Place this alongside Boris Cherny's account of Anthropic's "Member of Technical Staff" culture and the picture becomes clear — the organizational philosophy of "regardless of title, everyone looks at the internals" is becoming a new standard in the LLM era.

Turning the Electron app into a Web Shell — physically solving the parallel test problem (06:00 – 08:15)

Mehedi's second fix. This is the most technically interesting part.

The problem: Granola is an Electron desktop app. Desktop apps carry the structural constraint of "only one instance can run at a time." When you want to A/B/C/D parallel-test four feature flag variants, a web app can spin them up in separate tabs simultaneously — but an Electron app is one instance.

A concrete example of the testing friction: "to get a colleague to test a new feature, they have to run the Electron app locally, install dependencies, and try it. You don't get the luxury of a web app" (06:30).

The fix: turn the Electron app's frontend into a Web Shell and deploy it online. CI generates a preview link for every PR. "This dramatically increased our development velocity" (07:15).
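
Once the frontend is reachable by URL, running variants side by side can be as simple as opening several links. The sketch below assumes, purely for illustration, that flag overrides travel in a `flags` query parameter on the preview link; the talk does not say how Granola selects variants, and the host name and function names are hypothetical.

```typescript
/** Extracts flag names from a URL's `flags` query parameter (assumed convention). */
function parseFlagOverrides(url: string): Set<string> {
  const queryStart = url.indexOf("?");
  if (queryStart === -1) return new Set<string>();
  // Drop any #fragment, then scan the key=value pairs.
  const query = url.slice(queryStart + 1).split("#")[0] ?? "";
  for (const pair of query.split("&")) {
    const [key, value = ""] = pair.split("=");
    if (key === "flags") {
      const names = decodeURIComponent(value).split(",").filter((flag) => flag.length > 0);
      return new Set<string>(names);
    }
  }
  return new Set<string>();
}

/** Builds one preview URL per variant so each can be opened in its own tab. */
function variantUrls(previewBase: string, variants: string[][]): string[] {
  return variants.map((flags) =>
    flags.length === 0
      ? previewBase
      : `${previewBase}?flags=${encodeURIComponent(flags.join(","))}`
  );
}

// Hypothetical preview host; two variants: the baseline and one with two flags on.
const tabs = variantUrls("https://preview.example.invalid/app", [[], ["summaryV2", "newChat"]]);
```

Each tab then reads its own overrides on load, which is the kind of parallelism a single-instance desktop build does not offer.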

Technical implementation details (07:30 – 08:15):

Electron is a two-process structure — main process (system APIs) and renderer process (frontend)
They abstracted the IPC API (Inter-Process Communication) and built a layer that falls back to web standards in the web environment
React APIs (router, session, query layer) are also designed to be replaceable by web-standard equivalents
The result: the renderer process no longer depends on Electron, and runs as a web app as-is
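
The abstraction described above can be sketched as one interface with two implementations. This is a minimal illustration under stated assumptions: the `PlatformBridge` interface, the channel names `settings:read` and `settings:write`, and the idea of injecting the IPC object are all hypothetical. Only the `invoke(channel, ...args)` call shape mirrors Electron's `ipcRenderer.invoke`, and the fallback uses the `getItem`/`setItem` subset of the Web Storage API.

```typescript
// Hypothetical bridge the renderer codes against instead of Electron directly.
interface PlatformBridge {
  readSetting(key: string): Promise<string | null>;
  writeSetting(key: string, value: string): Promise<void>;
}

// Assumed shape of an IPC object exposed to the renderer by a preload script.
interface ElectronIpc {
  invoke(channel: string, ...args: unknown[]): Promise<unknown>;
}

// Minimal subset of the Web Storage API used by the web fallback.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

/** Desktop implementation: forwards each call to the main process over IPC. */
function electronBridge(ipc: ElectronIpc): PlatformBridge {
  return {
    async readSetting(key) {
      const value = await ipc.invoke("settings:read", key);
      return typeof value === "string" ? value : null;
    },
    async writeSetting(key, value) {
      await ipc.invoke("settings:write", key, value);
    },
  };
}

/** Web implementation: same interface, backed by a web-standard storage object. */
function webBridge(storage: StorageLike): PlatformBridge {
  return {
    async readSetting(key) {
      return storage.getItem(key);
    },
    async writeSetting(key, value) {
      storage.setItem(key, value);
    },
  };
}

/** Uses IPC when it is available, otherwise falls back to the web implementation. */
function createBridge(ipc: ElectronIpc | undefined, storage: StorageLike): PlatformBridge {
  return ipc ? electronBridge(ipc) : webBridge(storage);
}
```

In a browser preview, a caller would pass `undefined` for the IPC object and something like `localStorage` for storage; UI code depends only on `PlatformBridge`, so it does not need to know which environment it is running in.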

The best part: "Cursor automatically tests the PR and uploads screenshots to the PR" (07:30). The AI coding agent uses the web preview to test itself — a physical implementation of Karpathy's "verifiability" concept inside Granola. A concrete example of resolving the verifiability bottleneck by "having AI do its own testing."

The "tennis rally" metaphor — the essence of the feedback loop (08:15 – 09:00)

Mehedi's conclusion. "The answer isn't to one-shot better. It's how you build a feedback loop where you're playing tennis with the LLM. Do that, and the final product doesn't feel like a black box — it feels like magic" (08:25).

The tennis rally metaphor is the right one: don't try to win the point on one shot; trade shots until it's decided. The same structure as Schluntz's "Claude PM" concept and Karpathy's "human-in-the-loop verification" workflow.

The "magic vs black box" contrast is decisive: once the team has surfaced the internals, the UX feels magical to the user. "Expecting magic without opening the black box" amounts to abandoning engineering; "making magic after opening the black box" is feedback loop design. An ethical and implementational compass for engineers building LLM products, crystallized into 9 minutes.

Q&A: considering a move to Tauri (09:18 – 09:47)

One question from the floor: "Are you considering replatforming from Electron to Tauri?" Mehedi's answer: "We've considered it several times. But Electron is currently working well for us. The APIs change frequently. We tried Tauri, but we didn't see dramatic improvements on performance, which is what we care most about" (09:20 – 09:40).

The pragmatism is what's interesting. "Don't jump on the newest stack" — Granola's engineering culture. Underneath the investor-attracting $125M round, the internal judgment stays grounded.

Industry context

Granola was founded in 2024 in London. CEO Chris Pedregal is a former Google PM and the founder of Socratic (an AI tutoring app). Co-founder Sam Stephenson came from Ideaflow. They started with the straightforward idea of "use AI to generate meeting notes," and attracted investors with the app's UX design (sitting in the Dock, the user writing handwritten notes while AI records in the background — a distinctive combination).

Funding history:

Mid-2024: launch coverage in TechCrunch
Oct 2024: $20M round — TechCrunch's tongue-in-cheek headline: "VCs invested $20M because they use this product themselves"
Mar 2026: $125M round at a $1.5B valuation — announced expansion into enterprise AI apps

AI Engineer Code Summit (NYC, May 2026) is the latest event in the AI Engineer series. Together with AI Engineer Europe 2026 (Amsterdam, April 2026), it's the major industry-practitioner conference. Mehedi's session is one of the shortest at 9 minutes, but the "Product Engineer perspective" — distinct from other sessions at the CTO/CEO level — is valuable as a field-level view.

Granola's choices (build a tracing tool in-house, turn Electron into a Web Shell, consider Tauri) cover the typical decision axes for LLM-era startup implementation. SaaS (= outsource) vs. in-house, keep the existing stack vs. try a new one, hit the LLM API directly vs. wrap it — Mehedi presents these decision axes concisely from the field. A rare source for "how a Product Engineer inside a $1.5B-valuation startup actually decides."

Where it sits among related videos

Mehedi's session reinforces the "vibe coding" lineage covered on MEMEX with a production-operations perspective. Laid out as a four-tier landscape:

Concept layer: Karpathy, AI Ascent 2026 (Software 3.0, the verifiability bottleneck, "jagged intelligence")
Inside-the-organization layer: Boris Cherny, Anthropic (Claude Code writing 80% of the code, the printing press metaphor, no lines written by hand)
Operations best-practices layer: Schluntz, Anthropic (the Claude PM concept, leaf node strategy, the 22,000-line PR)
This piece, the field Product Engineer layer: Mehedi Hassan, Granola ("don't one-shot, play tennis," in-house tracing, the Web Shell-ification of Electron)

Lined up, the four show the industry arriving at the shared recognition — "depending on LLM capability alone doesn't get you to production quality" — from four different angles. A four-step hierarchy: concept (Karpathy), lab organizational philosophy (Boris), recommended patterns (Schluntz), and the field tactics of teams shipping to customers (Mehedi).

Related pieces in the same AI Engineer lineage:

Agent Observability — Raindrop — people solving the same tracing problem from the SaaS side; useful as a comparison with Granola's in-house choice
Vibe Engineering Effect Apps — Michael Arnaldi tackles the same abstraction problem with Effect
Playground in Prod — Samuel Colvin / Pydantic — another perspective on optimizing agents in production

Implementation implications

For technologists running LLM products in production, tactics drawn from Mehedi's 9 minutes:

First, understand the trap of "add a Web Search tool in one line". The built-in tools provided by LLM lab SDKs (Web Search, Code Interpreter, File Search, and so on) are convenient, but carry structural limits: (a) token costs are hard to predict, (b) quality fluctuates with lab-side updates, (c) you're compared on features against billion-dollar competitors (Google Search, Bing). Evaluate these risks before letting your product's core feature depend on a lab's built-in tool.

Second, reassess the economics of building tracing tools in-house. With software construction cost down in the LLM era, "things you used to have to do via SaaS" are now buildable in-house. Tools with product-specific data structures, used by non-engineers (PM, data, CX), reward in-house construction. SaaS options like LangSmith, Helicone, Arize, Phoenix are excellent, but "we could build this in 1–2 weeks" is now always on the table.

Third, treat "opening the LLM black box" as a required Product Engineer skill. Mehedi's line — "LLM behavior is largely seen as a black box, but we want to go very deep into the details" — applies to hiring criteria and performance reviews. The organizational design is to develop engineers who "track LLM tool calls, read reasoning trails, and diagnose failures," not engineers who "call the LLM API and display the result."

Fourth, treat parallel-test capability as a constraint in product design. Granola's move to Electron-as-Web-Shell came from the recognition that "without A/B/C/D parallel testing on feature flags, the improvement velocity of LLM features drops." Reassess your product's technology choices (Electron, Tauri, native, web) on the axis of "can we iterate fast on LLM features?"

Fifth, bring AI coding agents into QA. Mehedi's implementation — "Cursor tests the PR and uploads screenshots" — resolves the verifiability bottleneck from a different angle. Using Claude Code Hooks or an equivalent mechanism, you can build per-PR automated testing → screenshot attachment in-house.
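
The screenshot-attachment step in the fifth point can be sketched as a function that turns check results into a Markdown comment body for the pull request. The talk gives no implementation detail, so the `CheckResult` shape, the function name, and the example URLs are all hypothetical; running the checks and posting the comment are deliberately left out.

```typescript
// Hypothetical result of one automated check run against a PR preview.
interface CheckResult {
  label: string;         // human-readable check name
  passed: boolean;
  screenshotUrl: string; // where the uploaded screenshot can be viewed
}

/** Renders check results as a Markdown comment body for a pull request. */
function renderPrComment(previewUrl: string, results: CheckResult[]): string {
  const passedCount = results.filter((result) => result.passed).length;
  const lines = [
    `Preview: ${previewUrl}`,
    `Checks passed: ${passedCount}/${results.length}`,
    "",
    ...results.map(
      (result) =>
        `- ${result.passed ? "PASS" : "FAIL"} ${result.label}: ![${result.label}](${result.screenshotUrl})`
    ),
  ];
  return lines.join("\n");
}
```

Keeping the rendering step pure makes it easy to test on its own, whichever agent or CI job ends up producing the results.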

Critical perspective

Caveats on this 9-minute concentrate.

First, the talk shows no alternative to the lab-provided Web Search tool. Mehedi points out its limits, but doesn't disclose what Granola actually uses instead (SerpAPI, Tavily, custom scraping, etc.). There's a gap between "billion-dollar companies do this" and "what we chose." Partly a function of the 9-minute format, but the piece most useful to implementers is missing.

Second, no open-sourcing or sharing of the in-house tracing tool. Compared to Anthropic (the Claude Code team open-sourcing much of what they build) and its "open the session" stance, Granola's in-house tool stays closed as an internal asset. A reasonable commercial decision, but from an ecosystem standpoint that aims to lift industry-wide productivity, it's a missed opportunity.

Third, the "tennis rally" metaphor stays abstract. The detail of "what concrete structure makes the rally work" — where humans intervene and where things are automated, what metrics drive each iteration — is not shown. Also a function of the 9-minute format, but a concrete example would help: "we watched ____, changed ____, iterated ____ times, and shipped the feature."

Fourth, no concrete implementation for "Cursor tests the PR and uploads screenshots". The most exciting part is also the least specified — "how do you configure it," "how do you handle failures," "what's the cost." Not clear whether they use Cursor's GitHub Integration or run it themselves.

Caveats aside, as a 9-minute concentrate of "Product Engineer tactics for shipping LLM features to production," it's high-value as the kind of "9 minutes you share with your boss" for junior and mid-level engineers in the industry.

Reader takeaways

Don't underestimate LLM features as "one line of code." Lab SDK tools (Web Search, Code Interpreter, etc.) work for 80%; getting the remaining 20% to production quality is the Product Engineer's job
Don't give up with "the LLM is a black box." Build observability that surfaces the tool calls and reasoning trails inside. Use the decision axes of "can non-engineers use it" and "is the data structure product-specific" to choose between SaaS (LangSmith, Helicone, Arize) and in-house
Don't try to satisfy everyone with one prompt. Design from the start to branch output style by role (sales, engineer, HR, and so on)
Make "tennis rally with the LLM" an internal shorthand. A culture of testing many variants quickly, not winning on one shot
Make parallel-test capability an evaluation axis for product technology choices. Electron, Tauri, native, web — your choice shapes the iteration speed of LLM features
Bring AI coding agents (Cursor, Claude Code, etc.) into the QA pipeline. Automated PR testing + screenshot attachment is a practical way to resolve the verifiability bottleneck
Keep the pragmatism of "don't jump on the newest stack." Granola, at a $1.5B valuation, still runs on Electron — the value of "use what works"

Video outline

(00:00) Self-introduction — Mehedi, Product Engineer at Granola, coder since the jQuery era
(00:46) Granola introduction — Dock-resident, system audio + microphone audio, real-time transcription, users take notes
(01:20) Live demo — record the previous session and generate a summary
(01:51) "What happens when you put an AI feature into production" — using Granola's chat feature as an example
(02:30) The Web Search pitfall — the labs say "one line will do"
(02:50) Token cost, provider dependence, billion-dollar competition
(03:30) Output style — differences by role (sales / engineer / HR)
(04:15) "The LLM is stubborn — open the black box"
(04:30) Granola's in-house tracing tool — where one-shotting is fine
(04:45) Feature requirements — tool call visibility, search trails, reasoning trails, costs
(05:11) The UI is designed for the whole company (engineers, PM, data, CX) to use
(05:30) From SaaS to in-house — "in the past there wasn't time; now we can build it"
(05:50) The founder himself traces the agent loop end-to-end
(06:00) Granola's Electron problem — parallel testing is hard
(06:30) The friction of having a colleague test — "run Electron locally, install dependencies"
(07:00) The fix — turn the Electron app's frontend into a Web Shell
(07:30) Cursor auto-tests the PR and uploads screenshots
(07:45) Technical detail — main process and renderer process structure
(08:00) Abstract the IPC API, fall back to web standards
(08:15) The renderer process becomes Electron-independent
(08:25) Conclusion — "don't one-shot, play tennis"
(09:00) "Make magic with the feedback loop — not a black box"
(09:08) Q&A begins
(09:18) Q: Considering Tauri? — A: Considered it, but staying on Electron for API stability and performance
(09:47) Close

Key quotes

"I've been coding since jQuery was cool. I've seen React change frontend engineering, and now experiencing LLMs change engineering and everything else" (Mehedi, 00:30, intro)
"Just add the web search tool and you expect it to just work — that's what the labs want you to believe. The reality is different" (Mehedi, 02:30)
"One day the model gets updated and Web Search quality regresses. Out of our control" (Mehedi, 03:00)
"Web Search is the core business of billion-dollar companies; adding one line to your LLM pipeline isn't going to compete" (Mehedi, 03:25)
"One prompt can't satisfy everyone. The LLM is stubborn" (Mehedi, 04:00, on the output style problem)
"LLM behavior is largely seen as a black box, but we want to go very deep into the details" (Mehedi, 04:25)
"Thanks to LLMs, this kind of tool can actually be one-shotted. This is where one-shotting is fine" (Mehedi, 04:45, on the in-house tracing tool)
"The UI is built so not just engineers but product, data, and CX can use it" (Mehedi, 05:11)
"In the past you'd hand this to a SaaS provider — there was no time to build it yourself. Now we can build it" (Mehedi, 05:30)
"Our founder himself fully traces the agent loop from front to back and identifies what broke" (Mehedi, 05:50)
"Cursor automatically tests the PR and uploads screenshots to the PR" (Mehedi, 07:30, the implementation of verifiability)
"The renderer process no longer depends on Electron, and runs as a web app as-is" (Mehedi, 08:15)
"The answer isn't to one-shot better. It's how you build a feedback loop where you're playing tennis with the LLM" (Mehedi, 08:25, the core message)
"The final product doesn't feel like a black box — it feels like magic" (Mehedi, 08:55, conclusion)
"We tried Tauri, but didn't see dramatic improvements on performance — which is what we care most about" (Mehedi, 09:35, Q&A)

Sources

You can't just one shot it — Mehedi Hassan, Granola (YouTube, AI Engineer official channel)

Related resources:

Mehedi Hassan

Granola Product Engineer / "Don't one-shot it, build a feedback loop"

Glossary

Granola: An AI meeting-notes app founded in 2024 in London. CEO: Chris Pedregal (former Google PM, founder of Socratic). Co-founder: Sam Stephenson (formerly Ideaflow). Sits in the Dock, records system audio + microphone audio, and generates a summary with AI; the distinctive UX has the user take handwritten notes to personalize the AI output. Raised $20M in October 2024, $125M at a $1.5B valuation in March 2026, expanding from "meeting notes to enterprise AI app."
Mehedi Hassan: Product Engineer at Granola. Queen Mary University of London (2018–2021), a self-described web developer "coding since jQuery was cool." Presented "You can't just one-shot it" at AI Engineer Code Summit 2026 (NYC).
One-shotting: The approach of asking the LLM to generate a complete output in a single prompt. Effective for simple tasks or lightweight tools (e.g. code for an internal tracing tool), but insufficient for features aiming at production quality (meeting summaries, Web Search integration, user dialog). Mehedi's argument: "don't one-shot — build a feedback loop."
Electron: A cross-platform desktop app framework originally developed at GitHub and now hosted by the OpenJS Foundation. Builds desktop apps from web tech via Chromium + Node.js. Adopted by many major apps, including VS Code, Slack, Discord, and Granola. Characterized by a two-process structure: main process (Node.js, system API access) and renderer process (Chromium, UI).
Tauri: A Rust-based cross-platform desktop app framework, watched as an Electron alternative. Built on Rust + the system's standard Webview (Chromium not bundled), so it is lightweight, with stronger security and a smaller build size. A mainstream option, but it can lag Electron on API maturity and web-tech compatibility. Mehedi's assessment: "we tried Tauri, but didn't see dramatic improvements on performance."
IPC API (Inter-Process Communication API): The API for communication between Electron's main process and renderer process. Used for system API access, file I/O, OS event handling, and so on. Granola abstracted this IPC API and built a layer that falls back to web standards (Fetch API, LocalStorage, etc.) in the web environment.
Web Shell-ification (Electron → web): The implementation pattern Granola adopted. By abstracting the IPC API, the Electron app's renderer process (frontend) can also run as a web app. CI generates a preview link per PR, enabling parallel testing and automated testing by Cursor. A technique that lowers desktop-app development friction to web-app levels.
Agent loop: The processing loop in which an LLM agent repeats tool call → result → reasoning → next tool call. Debugging requires tracing that captures the full chain — which tool was called when, what was returned, what the agent reasoned. Mehedi notes that Granola's founder himself traces this loop front to back to debug it.
Tracing tool: A tool that visualizes the execution of an LLM agent. Records and displays tool calls, reasoning steps, token usage, costs, search queries, and so on. Industry-standard SaaS / OSS options include LangSmith (LangChain), Helicone, Arize Phoenix, and OpenTelemetry. Granola chose to build in-house; a design goal is "a UI usable by non-engineers (PM, data, CX) as well."
Feedback loop: The structure of "implement → observe → fix" repeated quickly when developing LLM features. Mehedi's metaphor: "a tennis rally with the LLM." Works when multiple elements combine — parallel-test capability, internal observability, automated testing by AI coding agents. The core design philosophy of "don't expect one-shot perfection."
Cursor (PR test automation): An AI code editor and coding agent made by Anysphere. At Granola, Cursor runs tests against the Web-Shell PR preview and uploads screenshots to the PR. A concrete implementation of Karpathy's "verifiability is the bottleneck": the AI does its own testing to reduce the human verification load.
AI Engineer Code Summit (NYC, May 2026): A major conference in the AI Engineer series, held in NYC in May 2026. Alongside AI Engineer Europe 2026 (Amsterdam, April 2026), it's a major industry-practitioner event. Speakers include implementers from startups running LLM products in production — Granola, Trigger.dev, Arize, OpenCode (Tavoon), and others.
