FLUX, Open Research, and the Future of Visual AI — Stephen Batifol / Black Forest Labs (AI Engineer Europe)

AI Engineer Europe · May 8, 2026

Stephen Batifol · 09:55 External encoders — that's the Frankenstein setup.

AI Engineer channel (published May 8, 2026, around 22 minutes). The final-day keynote at AI Engineer Europe 2026 (London, April 8–10).

A 22-minute keynote on where Black Forest Labs (BFL, based in Freiburg, Germany) stands today and where it sees itself going. BFL was founded in 2024 by the original Stable Diffusion / Latent Diffusion team after they went independent. The talk is dense: it runs from a review of the company's main model lineage (Flux 1 / Flux Kontext / Flux 2), to the Self-Flow research paper published about 1.5 months before the talk, to a real-time editing demo of the Klein model, and ends on the future vision of "Visual Intelligence" = world models + robotics + autonomous driving.

The presenter is Stephen Batifol, a Developer Relations engineer at BFL. BFL was founded by a team with over 200,000 academic citations, and counts Microsoft, Adobe, Canva, and Mistral among its customers. Flux 1 (August 2024), the first breakthrough, was an open-source text-to-image model that ran on a laptop. At the time it was "the most popular model on Hugging Face," an anecdote Stephen relays from Clem Delangue (Hugging Face's co-founder and CEO).

The core of the talk is the Self-Flow research paper (openly published about 1.5 months before the talk). Conventional approaches to training multimodal generative models (image + video + audio + action) depend on external pre-trained encoders (e.g. DINOv2) to learn representations. These encoders hit a scaling ceiling, are modality-specific, and have objectives so mismatched with generation that "Frankenstein setup" is the only honest description. Self-Flow resolves this with a two-stage student/teacher structure that unifies representation learning and generation in the same model, eliminating the need for external encoders.

Four demos. (1) Text accuracy in image generation: even a word as simple as "World" used to come out with two Ls, but Self-Flow learns adjacency relationships. (2) Simultaneous video and audio generation: a demo speaking the company's own name, "Hello from the Black Forest," contrasts the baseline's flicker against Self-Flow's stability. (3) Robot action prediction: the same Self-Flow model also generates the action of picking up a can. (4) Real-time editing with Klein 4B/9B: an edit is generated in 0.5 seconds, against 15–20 seconds for the Qwen comparison. He closes on the future picture: "visual intelligence = world models + robotics + autonomous driving."

Key observations

"External encoders are the Frankenstein setup" — a diagnosis of the current state (09:55)

A critique of the "standard configuration" for training multimodal generative models. Bolting individually optimized encoders onto a single generative model (DINOv2 for images, a different encoder for video, another for audio) produces "the Frankenstein setup," with mismatched objectives: the encoders aim at segmentation, while the generative model aims at content generation. He also presents an empirical observation: upgrading to a higher-tier encoder (DINOv3) makes generation worse. "DINOv3 is a better model than DINOv2, but for some reason, when you train it for generation, it gets worse. We don't even know the rules." A candid diagnosis.

Self-Flow = unifying representation and generation with a student/teacher pair (11:00 – 14:30)

The substance of the fix. Place a "student" and a "teacher" inside the same model: (a) hand the student a heavily noised image and have it denoise (generation loss); (b) hand the teacher a lightly noised image, and have the student learn to match the teacher's representation (representation loss). Optimizing both losses in a single model removes the need for external encoders. Results: every modality (image, video, audio) improves, and convergence is faster (the baseline saturates by 2M steps, while Self-Flow keeps reducing loss). The paper signals a major turning point: the Stable Diffusion lineage is moving toward unifying representation learning and generation.
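The two losses described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: the tiny two-layer `tiny_model`, the linear data-to-noise interpolation, the cosine form of the representation loss, the noise levels `t_student` and `t_teacher`, and the weight `rep_weight` are all hypothetical choices made for the example. The talk only specifies that the student sees a heavily noised input, the teacher a lightly noised one, and that both live in the same model.

```python
import numpy as np


def init_params(rng, dim=8, hidden=16):
    """Random weights for a toy two-layer network (hypothetical stand-in)."""
    return {
        "w_in": rng.standard_normal((dim + 1, hidden)) * 0.1,
        "w_out": rng.standard_normal((hidden, dim)) * 0.1,
    }


def tiny_model(params, x_noisy, t):
    """Return (predicted velocity, hidden features) for a noised input."""
    # Append the noise level so the model can condition on it.
    t_col = np.full((x_noisy.shape[0], 1), t)
    inp = np.concatenate([x_noisy, t_col], axis=1)
    features = np.tanh(inp @ params["w_in"])  # internal representation
    velocity = features @ params["w_out"]     # generation output
    return velocity, features


def noise_input(x, noise, t):
    """Linear path between clean data (t=0) and pure noise (t=1)."""
    return (1.0 - t) * x + t * noise


def self_flow_losses(params, x, rng, t_student=0.8, t_teacher=0.2, rep_weight=0.5):
    noise = rng.standard_normal(x.shape)

    # (a) Student: heavily noised input, trained to denoise.
    v_pred, student_feat = tiny_model(params, noise_input(x, noise, t_student), t_student)
    v_target = noise - x  # derivative of the linear path with respect to t
    gen_loss = np.mean((v_pred - v_target) ** 2)

    # (b) Teacher: same weights, lightly noised input. In a real training
    # loop its features would be treated as a fixed target (no gradient).
    _, teacher_feat = tiny_model(params, noise_input(x, noise, t_teacher), t_teacher)

    # Representation loss (assumed form): 1 - cosine similarity per sample.
    dot = np.sum(student_feat * teacher_feat, axis=1)
    norms = np.linalg.norm(student_feat, axis=1) * np.linalg.norm(teacher_feat, axis=1)
    rep_loss = np.mean(1.0 - dot / (norms + 1e-8))

    return {
        "generation": gen_loss,
        "representation": rep_loss,
        "total": gen_loss + rep_weight * rep_loss,
    }


rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.standard_normal((4, 8))  # a batch of 4 toy "images" with 8 values each
losses = self_flow_losses(params, x, rng)
print(losses)
```

The point of the sketch is structural: both losses come from one set of weights, so no external encoder appears anywhere in the computation. A real training loop would backpropagate `total` through the student path only.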

The nested joke of "Hello from the Black Forest" (17:00)

The demo shows Self-Flow generating video and audio simultaneously, with a prompt built on the company name itself: "Hello from the Black Forest." On the baseline video+audio generator, the video flickers and the lip-sync distorts. Self-Flow trains both modalities in the same model, so the flicker is largely gone and "Hello from the Black Forest" is spoken cleanly. Company name (Black Forest Labs) → demo prompt → demonstration of the model's capability: a nested structure that makes it stick.

Real-time editing with Klein 4B/9B (0.5 seconds) (18:00)

A real-time editing demo built on BFL's own Klein model. Klein 9B delivers quality at least equivalent to other open-source models (Qwen and others) at 0.5 seconds of latency, against 15–20 seconds for Qwen, which works out to roughly 30 to 40 times faster. "Real-time guidance," in which the image changes the moment the user issues an edit instruction, makes an interactive experience real. BFL's future picture of "an interactive visual engine for games and films that renders actual movies on prompt" starts to take concrete shape here.

Video outline

  • (00:00) Self-introduction; an overview of BFL — the team that built Stable Diffusion / Latent Diffusion
  • (00:50) Customers — Microsoft, Adobe, Canva, Mistral, and others; over 200,000 academic citations
  • (01:14) Flux 1 (Aug 2024) — the first breakthrough; open source; runs on a laptop; anatomical accuracy
  • (02:13) Flux Kontext: the world's first OSS editing model (text + image input); 7–8 seconds to generate
  • (04:00) Flux 2 (Nov 2025) — BFL's best ever; up to 10 image inputs at once; editing and generation unified
  • (06:00) BFL's philosophy — release frontier models; raise the quality bar every time
  • (09:00) The limits of existing multimodal generative models — scaling ceiling, modality-specific, objective mismatch
  • (09:55) "The Frankenstein setup" — a critique of stacking external encoders
  • (10:30) The observation that DINOv3 is worse than DINOv2 on generation tasks
  • (11:00) Introducing the Self-Flow paper — published about 1.5 months ago
  • (11:30) Self-Flow architecture — student/teacher two-stage, unifying representation learning and generation
  • (13:50) Results — every modality (image, video, audio) improves, convergence is faster
  • (14:50) Text generation accuracy — the "World" with two Ls bug resolved
  • (16:00) Comparison: video flicker is resolved
  • (17:00) "Hello from the Black Forest" — video + audio simultaneous generation demo
  • (17:30) Robot action prediction — the same Self-Flow model also generates the can-pickup action
  • (18:00) Klein 4B/9B real-time editing demo: 0.5 seconds (vs. Qwen's 15–20 seconds)
  • (19:48) The future of Visual Intelligence — real-time generation, interactive visual engines
  • (21:00) World models — robotics, autonomous driving, manufacturing automation
  • (21:30) Q&A — data sources (confidential), how the world is represented (memory in the context window), long-context handling (sliding window)

Sources

FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs (AI Engineer)
