The State of AI Audio — Thor Schaeff (Google DeepMind) on an Audio Stack Built on 'Audio Understanding'

AI Engineer Europe 2026 (London) / talk approximately 19 minutes

Thorsten "Thor" Schaeff · 12:56 "Here the intelligence is baked directly into the audio model. That's different from a cascading pipeline that goes through text and runs it through an LLM to get intelligence."

A talk at AI Engineer Europe 2026 (London) (official title "From Transcription to Live Music: Gemini's Audio Stack"; on the slide "What's new in AI audio"; talk approximately 19 minutes; video published 2026-06-09; AI Engineer official channel). The speaker is Thorsten "Thor" Schaeff (Developer Relations Engineer at Google DeepMind, working on the Gemini API + Google AI Studio, formerly ElevenLabs). Building on Gemini 3's audio understanding, it's a DevRel demo that runs through audio understanding (Echo Script), audio generation (Voice Library), real-time conversation (Gemini 3.1 Flash Live), and music generation (Lyria 3).

Thor Schaeff is in DevRel at Google DeepMind, working on the Gemini API and Google AI Studio (previously DevRel for AI audio at ElevenLabs). He introduces himself as having "joined the team the day before Gemini 3 was released." The talk's central thread is simple: it shows, through demos, how everything in DeepMind's audio work — understanding, generation, real-time, music — is stacked on a single foundation, Gemini 3's audio understanding.

The foundation is Gemini 3's audio understanding

The first thing Schaeff establishes is the foundation: audio understanding. Gemini 3 doesn't "just transcribe" audio; it grasps speaker, language, emotion, pacing, and context, even in situations where speakers talk over one another. He sets the goal as "understand deeply, transcribe richly, and reason robustly across audio. Handle many languages, dialects, accents, and modalities seamlessly." This audio understanding, which supports both downstream audio generation and real-time conversation, is the backbone of the whole talk.

Echo Script — extract everything in one request

The first demo is Echo Script, an audio-analysis demo app Schaeff built on Google AI Studio (you can try it in the AI Studio gallery). Unlike a pure transcription model, it pulls a lot of information out of audio in a single request. Schaeff passes Gemini 3 Flash a response schema (structured output) and asks, in one instruction, to "distinguish speakers and label them by name where there's context, give accurate timestamps, the language, an English translation if it's not English, identify emotion from happy/sad/angry/neutral, and an overall summary up front." The result comes back structured and ready to feed straight into the UI. The demo switches among German, French, Japanese (where it missed, as he admits with a wry laugh), and Chinese, showing the language and emotion labeling at work.
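
The one-request extraction described above can be sketched roughly as follows. This is a minimal illustration, not the Echo Script source: the field names, the `Segment` class, and `parse_response` are all hypothetical, and the actual Gemini API call that would receive the schema is deliberately left out. It only shows what a response schema covering summary, speakers, timestamps, language, emotion, and translation might look like, and how the structured result could be turned into objects for a UI.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The four emotion labels named in the talk.
EMOTIONS = ["happy", "sad", "angry", "neutral"]

# Hypothetical JSON-schema-style response schema; field names are assumptions.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "speaker": {"type": "string"},
                    "timestamp": {"type": "string"},
                    "language": {"type": "string"},
                    "text": {"type": "string"},
                    "translation": {"type": "string"},
                    "emotion": {"type": "string", "enum": EMOTIONS},
                },
                "required": ["speaker", "timestamp", "language", "text", "emotion"],
            },
        },
    },
    "required": ["summary", "segments"],
}


@dataclass
class Segment:
    speaker: str
    timestamp: str
    language: str
    text: str
    emotion: str
    translation: Optional[str] = None  # expected only for non-English audio


def parse_response(raw_json: str) -> tuple[str, list[Segment]]:
    """Turn the model's JSON text into a summary plus typed segments."""
    data = json.loads(raw_json)
    segments = []
    for item in data["segments"]:
        # Reject labels outside the four emotions the prompt asked for.
        if item["emotion"] not in EMOTIONS:
            raise ValueError(f"unexpected emotion label: {item['emotion']!r}")
        segments.append(Segment(**item))
    return data["summary"], segments
```

Because the schema constrains the output up front, the UI code only needs a thin parsing layer like `parse_response` rather than free-text post-processing.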

Audio generation — "directing" about 30 voices

Audio generation takes a different approach from other TTS, Schaeff says. Rather than narrowing down from a huge voice library by gender or accent, Gemini has about 30 base voices, and you steer "how they perform" with a director's note: you provide the scene setting (audio profile), accent, and emotion the same way you'd direct a human voice actor. Because audio understanding is the foundation, you can transform a voice by accent and performance and produce the voice you want from just a few base voices. In the AI Studio gallery's Voice Library app, you assemble an audio profile (scene), director's note, sample context, and the transcript you want read. As examples, he generates a voice with a "strong, authentic Irish accent" set in a bustling County Clare pub, then a Singapore hawker center-style voice (the tone of someone recommending "chicken rice"), showing how to nudge from a standard American English base voice toward the target voice.
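
As a rough sketch of how those four pieces might be combined into one prompt, assuming nothing about how the Voice Library app does it internally: the `VoicePrompt` class, the section labels, and the sample strings below are hypothetical and only illustrate the idea of directing a base voice with text.

```python
from dataclasses import dataclass


@dataclass
class VoicePrompt:
    """The four inputs named in the talk; the layout below is an assumption."""
    audio_profile: str    # scene setting
    directors_note: str   # accent, emotion, how to perform
    sample_context: str   # surrounding context for the line
    transcript: str       # the text to be read aloud

    def render(self) -> str:
        # Section titles are illustrative, not an official prompt format.
        sections = [
            ("AUDIO PROFILE", self.audio_profile),
            ("DIRECTOR'S NOTE", self.directors_note),
            ("SAMPLE CONTEXT", self.sample_context),
            ("TRANSCRIPT", self.transcript),
        ]
        # Skip empty sections so optional inputs can be left blank.
        return "\n\n".join(
            f"## {title}\n{body.strip()}" for title, body in sections if body.strip()
        )


prompt = VoicePrompt(
    audio_profile="A bustling pub in County Clare.",
    directors_note="Strong, authentic Irish accent; warm and unhurried.",
    sample_context="",
    transcript="Welcome in, find yourself a seat.",
).render()
```

Keeping the performance direction separate from the transcript makes it easy to reuse one base voice across scenes by swapping only the profile and the note.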

Gemini 3.1 Flash Live — sound-to-sound real-time

Gemini 3.1 Flash Live, Google DeepMind's real-time conversation model (released 2026; Schaeff is a co-author of the official blog post), which Schaeff says launched a few weeks ago, is a native sound-to-sound (audio-to-audio) real-time multimodal model. It takes in text, audio, and video in real time over WebSocket and returns real-time audio and a transcription. The design difference Schaeff emphasizes is that the intelligence is baked directly into the audio model, rather than a cascading pipeline (text → LLM → TTS) that converts to text and runs it through an LLM to get intelligence. He cautions that benchmarks can't be fully trusted in the audio domain, but cites having reasoning and thinking inside the model as an advantage.

The demo runs at ai.studio/live, which Schaeff points out you can try for free (no credit card needed). With a system instruction telling it to "speak with a friendly Irish accent," he brings in the camera feed and asks "can you see me?", and it comments on his outfit (a Gemini shirt and a backwards cap) in an Irish accent. Then he asks for a German poem, and the Irish accent gets applied to the German too: a moment that makes the behavior very clear and shows that the system instruction needs tuning. Screen capture is also possible, and video is taken in at up to 1 fps.
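
Since video is taken in at up to 1 fps, a client capturing from a camera at a higher frame rate would plausibly drop frames before sending. The following is a minimal sketch of that throttling step under that assumption; `select_frames` is a hypothetical helper and says nothing about how AI Studio or the Live API handle frames internally.

```python
def select_frames(timestamps, max_fps=1.0):
    """Return indices of frames to forward, keeping at most `max_fps` per second.

    `timestamps` are capture times in seconds, in ascending order.
    """
    if max_fps <= 0:
        raise ValueError("max_fps must be positive")
    min_gap = 1.0 / max_fps  # minimum spacing between forwarded frames
    selected = []
    last_sent = None
    for index, ts in enumerate(timestamps):
        # Forward the first frame, then only frames at least min_gap later.
        if last_sent is None or ts - last_sent >= min_gap:
            selected.append(index)
            last_sent = ts
    return selected
```

For example, `select_frames([0.0, 0.5, 1.0, 1.2, 2.5])` keeps the frames at 0.0, 1.0, and 2.5 seconds and drops the two in between.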

Coding agent skills, Lyria 3, Live Jukebox

For developers, Schaeff recommends the published coding-agent skill covering the entire Gemini API, including the Live API. Implementing real-time audio has many hard spots, and dropping skills like these into a coding agent steers it in the right direction — a fittingly DevRel closing note.

Last comes music. Lyria 3, Google DeepMind's music generation model, can generate songs with lyrics, and comes in two variants: Lyria 3 clip (Fast) for roughly 30-second jingles and Lyria 3 Pro for full-length songs. Schaeff shows a Live Jukebox app modeled on "the old days of calling a radio station to request a song": he gives the Gemini Live model a tool to compose with Lyria, and when he improvises a request for "a German techno-schlager about the UK startup scene," a song is generated along with a DJ-style response. It's a demo that bundles real-time conversation, tool calling, and music generation into one.
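
The tool-calling pattern behind such a jukebox can be sketched as below. Everything here is an assumption for illustration: the `compose_song` tool name, its parameters, and the stub that stands in for a Lyria request are hypothetical, and no real Gemini or Lyria API is called. The sketch only shows the general shape: the conversational model is given a tool declaration, and the application routes the model's tool call to a handler.

```python
# Hypothetical tool declaration in JSON-schema style; names are assumptions.
COMPOSE_SONG_TOOL = {
    "name": "compose_song",
    "description": "Compose a song from a listener's spoken request.",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {"type": "string"},
            "variant": {"type": "string", "enum": ["clip", "pro"]},
        },
        "required": ["prompt"],
    },
}


def compose_song(prompt, variant="clip"):
    """Stub standing in for a music-generation request (no real API call)."""
    if variant not in ("clip", "pro"):
        raise ValueError(f"unknown variant: {variant!r}")
    return {"status": "queued", "variant": variant, "prompt": prompt}


TOOL_HANDLERS = {"compose_song": compose_song}


def handle_tool_call(name, args):
    """Route a tool call emitted by the conversational model to its handler."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Return an error payload so the model can recover in conversation.
        return {"error": f"unknown tool: {name}"}
    return handler(**args)


result = handle_tool_call(
    "compose_song",
    {"prompt": "A German techno-schlager about the UK startup scene"},
)
```

In a setup like this, the handler's result would be returned to the live model, which can then respond in its DJ persona while the song is prepared.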

Editorial Notes

The backbone of this talk comes down to a single sentence: "audio understanding is the foundation for everything." The run of demos lets you feel how Gemini 3's audio understanding (grasping emotion, pacing, overlapping speakers, even multiple languages) underpins understanding (Echo Script), generation (Voice Library's directing), and real-time conversation (Flash Live). What's most technically telling is the design decision to "bake intelligence into the audio model rather than a cascading pipeline (audio → text → LLM → audio)." Through the Irish accent and the over-camera conversation, it concretely shows the shift from a relay scheme that loses nuance and adds latency at every handoff to a single model that hears and speaks directly. Fittingly for DevRel, it consistently pushes a low barrier to entry ("you can try it for free in AI Studio," "put skills into your coding agent"), so its archive value is less as breaking news and more as "a map of the audio AI you can try at hand right now."

Points of Interest

From "transcription" to "structured extraction in one request"

What the Echo Script demo shows is that audio processing has moved from the single function of "transcribing to text" to "returning speaker, language, emotion, translation, and summary structured in one API call." Just by passing a response schema (structured output), you can feed the result straight into the UI. It's completed with one request to a single model rather than a pipeline stitching together multiple models — and that greatly changes the developer experience.

From cascading to baked-in — a structural shift in audio AI

The core of Flash Live is the design that "bakes intelligence into the audio model." Traditional voice assistants used an audio → text → LLM → audio relay scheme, dropping pauses and the nuance of accent at each handoff while latency piled up. A single sound-to-sound model understands the sound it hears as-is and speaks. The "failure" demo where the Irish accent gets applied even to German is, conversely, evidence that accent and tone are handled consistently inside the model.

Video Outline

  • (00:00) Multilingual greeting demo + overview of DeepMind's AI audio
  • (01:31) Recent releases — Gemini 3, Gemma 4 (on-device multimodal audio understanding)
  • (02:14) Gen Media is Veo 3.1 Lite, audio is Gemini 3.1 Flash Live
  • (03:10) Gemini 3's audio understanding — beyond transcription: emotion, pacing, overlapping speakers, multilingual
  • (04:14) Echo Script demo (Gemini 3 Flash, AI Studio gallery)
  • (04:46) Extract summary, speaker, timestamp, language, emotion, translation in one request
  • (07:05) Structured output via one API call + response schema
  • (08:02) Stack dedicated audio models on top of Gemini 3 as the foundation
  • (08:38) Audio generation — direct about 30 base voices with director's note
  • (09:01) Voice Library app (AI Studio gallery)
  • (10:25) Irish accent example (a County Clare pub)
  • (11:08) Singapore Hawker center-style example
  • (11:58) Gemini 3.1 Flash Live — sound-to-sound real-time multimodal
  • (12:56) Intelligence baked into the model (difference from a cascading pipeline)
  • (13:09) ai.studio/live for free, Irish accent + camera demo
  • (14:07) Irish accent gets applied to a German poem
  • (15:00) Screen capture, video at up to 1 fps
  • (16:31) Gemini coding agent skills including the Live API
  • (15:54) Lyria 3 music generation (clip 30 seconds / Pro full songs)
  • (16:43) Live Jukebox demo — German techno about the UK startup scene

Glossary

audio understanding
The ability not just to transcribe audio but to grasp speaker, language, emotion, pacing, and context, even overlapping speakers. Gemini 3 excels at this, and it becomes the foundation for audio generation and real-time conversation — the backbone of the talk.
Echo Script
An audio-analysis demo app Schaeff built on AI Studio. With a single API call to Gemini 3 Flash, it passes a response schema and produces structured output for summary, speaker identification, timestamps, language, emotion, and English translation, all at once. It shows the difference from a pure transcription model.
director's note
In Gemini's audio generation, an instruction that steers about 30 base voices by "how to perform." You give scene, accent, and emotion the same way you'd direct a voice actor. Demonstrated in the Voice Library app in the AI Studio gallery.
Gemini 3.1 Flash Live
Google DeepMind's real-time conversation model. A sound-to-sound multimodal model that takes in text, audio, and video over WebSocket and returns real-time audio and its transcription. The intelligence is baked into the audio model — unlike a text → LLM → TTS cascading pipeline. You can try it for free at ai.studio/live.
Lyria 3
Google DeepMind's music generation model. It can generate songs with lyrics, in two variants: Lyria 3 clip for 30-second jingles and Lyria 3 Pro for full songs. In the Live Jukebox demo, Gemini Live was given a tool to call Lyria and improvised a song from a request.