Voice AI: When Will the 'Her' Moment Arrive? — Neil Zeghidour / Gradium (AI Engineer Europe)

AI Engineer Europe · May 9, 2026

Neil Zeghidour · 11:43 "Most voice AI demos are filmed in a quiet room next to a phone."

AI Engineer channel (published May 9, 2026, approximately 19 minutes). A keynote at AI Engineer Europe 2026 in London (April 8–10).

Many people feel that 2026 is the year voice AI "arrived." Neil opens by noting that the demo videos from industry leaders are filmed in unnatural silence. Noise, overlapping speech, and backchanneling, the ordinary phenomena of real conversation, still defeat today's voice AI almost entirely. The "Her" moment (natural conversation between human and AI, in the spirit of the film) has not yet arrived. This 19-minute keynote starts from that diagnosis of the current state.

The speaker is Neil Zeghidour, founding CEO of Paris-based Gradium. He was formerly at Google DeepMind, where he led work on speech and audio models including SoundStream and AudioLM. In late 2023 he co-founded Kyutai (an open-science AI lab funded by Eric Schmidt, Xavier Niel, and others), where he led Moshi (presented as the world's first full-duplex speech-text foundation model, open-sourced in 2024). He spun out Gradium in September 2025 and completed a $70M (about ¥10B) seed round in December.

The diagnosis has three parts. (1) Cascade systems (STT → LLM → TTS in series) have hit their limits. (2) Tool call latency is now the biggest bottleneck. (3) Full-duplex S2S is the ultimate answer, but the only such models that currently exist are Moshi (Kyutai) and its derivatives (NVIDIA PersonaPlex). He then presents his company's on-device TTS, Gradium Phonon: a small model under 100 million parameters that runs on an iPhone CPU. It doubles as a cost-structure proposal that could rescue "startups burning their funding on TTS bills."

The closing line is a pointed declaration: "I strongly disagree with some competitors who claim 'voice has been commoditized.' That's a complete lie. Voice is the most challenging domain, and the last remaining mile is the hardest to solve." His claim is that it is too soon to treat voice as a commodity.

Key Observations

The diagnosis: "Most voice AI demos are filmed in a quiet room next to a phone" (11:43)

Neil's one-line critique of the current state. Half-duplex models (the model is either listening or speaking, never both at once) break under noise, coughs, backchannels, and overlapping speech, so industry demos are shot in silence. He gives a concrete example: in Japanese culture, active backchanneling ("mm, mm, ah") can account for up to 20% of a conversation. The line is skillfully constructed: it draws the listener from a meta angle (the filming environment of demo videos) into the technical core (half-duplex vs full-duplex), and a single sentence reframes the question "has voice AI really arrived?"
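The difference can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not how Moshi or any production system works: the names UserSound, half_duplex_spoken_ms, and full_duplex_spoken_ms are hypothetical, and each user sound arrives already labeled, whereas a real full-duplex model has to infer from the audio whether a sound is a backchannel or a genuine interruption.

```python
from dataclasses import dataclass


@dataclass
class UserSound:
    t_ms: int  # when the sound starts, relative to the start of the agent's utterance
    kind: str  # "backchannel", "noise", or "interruption" (pre-labeled: a toy assumption)


def half_duplex_spoken_ms(utterance_ms: int, sounds: list[UserSound]) -> int:
    """Toy half-duplex agent: it cannot listen while speaking, so ANY
    detected user audio is treated as a turn change and cuts the utterance."""
    cut_points = [s.t_ms for s in sounds if s.t_ms < utterance_ms]
    return min(cut_points, default=utterance_ms)


def full_duplex_spoken_ms(utterance_ms: int, sounds: list[UserSound]) -> int:
    """Toy full-duplex agent: it keeps listening while speaking and yields
    the floor only to a genuine interruption."""
    cut_points = [
        s.t_ms
        for s in sounds
        if s.t_ms < utterance_ms and s.kind == "interruption"
    ]
    return min(cut_points, default=utterance_ms)


if __name__ == "__main__":
    # A 3-second agent utterance with an "mm" at 0.8 s and a cough at 1.5 s.
    sounds = [UserSound(800, "backchannel"), UserSound(1500, "noise")]
    print("half-duplex spoke for", half_duplex_spoken_ms(3000, sounds), "ms")
    print("full-duplex spoke for", full_duplex_spoken_ms(3000, sounds), "ms")
```

In this toy setup a single "mm" is enough to cut the half-duplex agent off mid-sentence, which is the failure mode that a silent filming room hides.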

Tool calls = the biggest latency bottleneck; the solution is "fillers" (07:25)

He first establishes the premise: TTS latency is 200 ms, LLM latency is 200 ms, and together that already misses human conversational timing (cumulative understanding plus a response within 200 ms). Then he shows the real bottleneck. Tool calls can take more than 4 seconds, and their duration is unpredictable. Countering this by shaving 10 ms or 20 ms off TTS is pointless. The solution is filler design: split the LLM into two, so that while one waits for the tool result, the other carries the conversation naturally. He demonstrates this live with a vibe-coded agent, "Colin from Wonderlust Travel." When the user says "I want to go to Tokyo," Colin keeps talking while the search runs in the background ("Tokyo is a lovely choice, a fusion of ultra-modern high-rises and beautiful shrines …"), filling the silence. The demo makes the idea concrete.
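The filler pattern can be sketched with Python's asyncio. This is a minimal illustration under stated assumptions, not the implementation shown in the talk: slow_tool_call, filler_lines, and respond are hypothetical names, the "second LLM" is replaced by canned lines, and the multi-second tool latency is scaled down to a fraction of a second so the example runs quickly.

```python
import asyncio
from typing import AsyncIterator, Callable


async def slow_tool_call(destination: str) -> str:
    """Hypothetical stand-in for a flight search; the sleep simulates tool latency."""
    await asyncio.sleep(0.25)
    return f"Found 3 flights to {destination}."


async def filler_lines(destination: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for the second LLM that keeps the conversation going."""
    yield f"{destination} is a lovely choice."
    yield "Let me look at what is available for you."
    yield "It should only take a moment."
    yield "Still searching, thanks for your patience."


async def respond(destination: str, speak: Callable[[str], None]) -> None:
    # Start the tool call in the background instead of waiting on it in silence.
    tool_task = asyncio.create_task(slow_tool_call(destination))

    # Speak filler lines until the tool result is ready.
    async for line in filler_lines(destination):
        if tool_task.done():
            break
        speak(line)
        await asyncio.sleep(0.1)  # simulated time needed to speak the line

    # Hand over to the real answer once the tool has returned.
    speak(await tool_task)


if __name__ == "__main__":
    asyncio.run(respond("Tokyo", print))
```

The point of the design is that perceived latency is set by the first filler line rather than by the tool call, which is why TTS micro-optimizations matter far less than keeping the conversation moving.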

What still stands before the "Her" moment is an S2S scalability problem (15:30 - 18:00)

Neil grants that full-duplex S2S has been technically demonstrated by Moshi, then names the next wall: scalability. In the world of Her, the protagonist talks to the AI all day. Hyperscalers are already running at a loss, so if "always-on" voice reaches consumer-app scale, API fees become unpayable. The cost of LLMs has fallen close to zero, but TTS alone remains expensive; he cites startups "burning their funding on TTS bills" as concrete examples. Gradium's answer is Phonon, a lightweight TTS under 100 million parameters that runs on a smartphone CPU. "No cloud GPUs needed; no API fees at all" is the pitch, a design aimed at consumer scale. The industry's "last mile," he argues, is a structural shift from "cloud-assumed scale" to "small models that finish on the device."

Video Outline

  • (00:00) Self-introduction and the framing question — when will the "Her" moment arrive
  • (03:55) ElevenLabs' latest demo (gym buddy Logan) — getting natural, but still not there
  • (05:42) The structure and latency of cascade systems (STT → LLM → TTS)
  • (06:15) TTS alone is 200 ms; with the LLM, not within human-conversation latency
  • (06:54) Tool calls = 4 seconds and up, unpredictable; the real bottleneck
  • (07:25) The filler solution — split the LLM, fill the wait with natural conversation
  • (07:48) Wonderlust Travel demo — running the vibe-coded agent
  • (09:05) Explanation of Speech-to-Speech (S2S) architecture
  • (09:30) The limits of half-duplex models — everything other than Moshi is half-duplex
  • (10:32) Half-duplex demo — backchanneling collapses the conversation
  • (11:43) "Most voice AI demos are filmed in a quiet room"
  • (11:58) Moshi demo — a two-person conversation with co-founder Alex
  • (13:10) Paralinguistic understanding — recognizing tone and emotion
  • (13:43) Evaluating Moshi — flow is unbeatable, but it's dumb as an agent
  • (14:16) NVIDIA's PersonaPlex (a Moshi derivative), no observability → not deployable in production
  • (15:30) Scalability — the consumer-scale problem when it becomes "always on"
  • (15:48) Cost structure — startups burning their funding on TTS bills
  • (16:36) Privacy — on-device is comfortable for users
  • (17:00) Announcing Gradium Phonon — under 100 million parameters, runs on iPhone CPU
  • (17:38) Phonon live demo
  • (18:30) Rebutting "voice has been commoditized" — a complete lie
  • (18:50) Conclusion — the last remaining mile is the hardest to solve

Sources

Voice AI: when is the 'Her' moment? — Neil Zeghidour, Gradium AI (AI Engineer)
