Playground in Prod — Optimizing Agents in Production (Samuel Colvin / Pydantic)

AI Engineer Code Summit — May 7, 2026

Samuel Colvin · 00:54 "I don't really believe in AI observability — sooner or later it gets eaten by either observability or AI."

AI Engineer channel (published May 7, 2026, about 1h 20m). A live workshop by the founding CEO of Pydantic.

An 80-minute workshop with an odd opening: the operator of an AI observability platform (Pydantic Logfire) says in public, of his own category, "I don't really believe in it, it gets eaten sooner or later." On that foundation he builds a larger framing: "observability is the substrate; the real goal is autonomous optimization of agents in production." Two techniques are central: GEPA (Genetic-Pareto optimization) and Managed Variables (managing more than prompts via Pydantic models).

The speaker is Samuel Colvin — the founding CEO of Pydantic, Python's most-installed data validation library. Pydantic is built into the OpenAI Python SDK, the Anthropic Python SDK, FastAPI, LangChain, and nearly every major Python AI framework — a foundation of the Python AI ecosystem. The company is now advancing a three-layer stack — validation (Pydantic), agents (Pydantic AI), observability plus optimization (Pydantic Logfire) — pursuing a strategy that captures the entire development / execution / improvement loop.

The live exercise is concrete and has a sense of humor. He builds an agent that answers the question "what percentage of UK MPs come from political families?" Taking the long-debated theme of political dynasties as the subject, he defines a structured-output schema (`name / role / relationship`) in Pydantic AI and runs the agent. The first run reaches 85% accuracy; from there he iterates on the prompt and then has GEPA optimize it automatically.
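To make the schema idea concrete, here is a minimal sketch of a `name / role / relationship` record and the percentage it feeds into. It is not the code from the talk: the talk defines the schema in Pydantic AI, while this sketch uses a stdlib dataclass as a dependency-free stand-in for a Pydantic model. The field meanings, the `ALLOWED_RELATIONSHIPS` set, the `MPResult` wrapper, the `dynasty_percentage` helper, and the placeholder names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical vocabulary; the talk does not list the allowed values.
ALLOWED_RELATIONSHIPS = {"parent", "grandparent", "sibling", "spouse", "other"}


@dataclass(frozen=True)
class FamilyLink:
    """One political relative of an MP (field meanings are assumed)."""
    name: str          # the relative's name
    role: str          # the political office the relative held
    relationship: str  # how the relative is related to the MP

    def __post_init__(self):
        # Reject malformed records early, as a validating schema would.
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.relationship not in ALLOWED_RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {self.relationship!r}")


@dataclass(frozen=True)
class MPResult:
    """Structured answer for a single MP (hypothetical wrapper)."""
    mp_name: str
    links: tuple[FamilyLink, ...] = ()

    @property
    def from_political_family(self) -> bool:
        return len(self.links) > 0


def dynasty_percentage(results: list[MPResult]) -> float:
    """Share of MPs with at least one political relative, in percent."""
    if not results:
        return 0.0
    hits = sum(r.from_political_family for r in results)
    return 100.0 * hits / len(results)


# Placeholder data only; these are not real MPs.
results = [
    MPResult("MP A", (FamilyLink("Relative A", "former MP", "parent"),)),
    MPResult("MP B"),
    MPResult("MP C"),
    MPResult("MP D"),
]
percentage = dynasty_percentage(results)
```

Forcing the agent to return records of this shape, rather than free text, is what makes the later evaluation step a plain comparison instead of a judgment call.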

The metaphor for GEPA is delightful: "bring the best racehorses, breed them, produce an even better racehorse. But sometimes mix in a very slow horse to see what happens." The horse-racing analogy captures in one line an essential idea of the method, which is to retain candidates that look weak for the sake of Pareto diversity. The talk also opens with a self-deprecating confession ("for family reasons, I wrote most of this presentation overnight"), and that UK directness carries throughout.

Key Observations

"I don't really believe in AI observability" (00:54)

An unusually candid line from an observability vendor's founder: "This is a feature of our category, sooner or later it gets eaten by either observability or AI." It is a grown-up stance that pairs commitment with cold reflection, precisely because he is the one selling the thing. With this self-awareness in mind, it becomes clear why Pydantic Logfire doesn't stop at observability and moves into Managed Variables and GEPA integration: observability is a way station, optimization is the destination.

Explaining Genetic Pareto via the racehorse analogy (02:47)

The analogy compresses the essence of GEPA into a single line: "breed the best racehorse to produce a better racehorse, but mix in a slow horse from time to time." The notion of a Pareto frontier ("if you keep only the best examples, diversity disappears, so retain obviously weak candidates too") is explained with zero jargon. At its core the algorithm optimizes strings (prompts), but what GEPA optimizes here isn't only text; it can be any object defined by a Pydantic model. Combined with the generalization in Managed Variables, "anything" in the agent becomes a target for optimization.
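The "keep a slow horse" idea can be sketched as Pareto selection over per-example scores. This is a toy illustration of the retention rule only, not the full GEPA algorithm and not code from the talk. The candidate names, the scores, and the `dominates` and `pareto_front` helpers are hypothetical.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if score vector a is at least as good as b on every example
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, tuple]) -> list[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other, other_vec in scores.items()
            if other != name
        )
    ]


# Per-example scores (1 = correct, 0 = wrong) on four eval cases.
# Each candidate could stand for a prompt or any other configurable object.
scores = {
    "fast":   (1, 1, 1, 0),  # best average, but fails case 4
    "steady": (1, 1, 0, 0),  # never better than "fast" on any case
    "slow":   (0, 0, 0, 1),  # worst average, yet the only one to solve case 4
}

front = pareto_front(scores)  # candidates kept as parents for the next round
```

Here "steady" is dropped because "fast" matches or beats it on every case, while "slow" survives despite the lowest average, because it solves a case nothing else does. Selecting purely on average score would have discarded it.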

"Letting an LLM be the judge is like running the asylum with the lunatics" (18:38)

A pointed remark about LLM-as-judge in evaluation. For the political-dynasty example, a deterministic eval that compares against a golden dataset (human-verified ground truth) is far more reliable than "let an LLM decide right or wrong." This looks contradictory to the autonomous-optimization story, but it is actually consistent: optimization needs reliable evaluation metrics, and if you rely on an LLM for the metric itself, optimization collapses.
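A deterministic check against a golden dataset can be as small as the following sketch. It is an illustration under assumptions, not the evaluator used in the talk: the `normalize` and `exact_match_accuracy` helpers, the boolean label format, and the placeholder MP names are all hypothetical.

```python
def normalize(name: str) -> str:
    """Case- and whitespace-insensitive key for matching names."""
    return " ".join(name.lower().split())


def exact_match_accuracy(
    predictions: dict[str, bool], golden: dict[str, bool]
) -> tuple[float, list[str]]:
    """Compare predicted labels with human-verified labels.

    Returns the accuracy over the golden cases and the names that failed.
    A missing prediction counts as a failure.
    """
    if not golden:
        raise ValueError("golden dataset must not be empty")
    preds = {normalize(name): label for name, label in predictions.items()}
    failures = [
        name
        for name, expected in golden.items()
        if preds.get(normalize(name)) != expected
    ]
    accuracy = (len(golden) - len(failures)) / len(golden)
    return accuracy, failures


# Placeholder labels only: True means "comes from a political family".
golden = {"MP A": True, "MP B": False, "MP C": True, "MP D": False}
predictions = {"mp a": True, "MP B": False, "MP C": False, "MP D": False}

accuracy, failures = exact_match_accuracy(predictions, golden)
```

Because the score is a pure function of the predictions and the golden labels, two runs over the same output always agree, which is the property an optimizer needs from its metric.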

Video outline

(00:00) Self-introduction, Pydantic's three products (validation / AI / Logfire)
(00:54) The unexpected preamble — "I don't really believe in AI observability"
(01:14) Today's topic — eval, Managed Variables, and the autonomous optimization beyond
(01:56) What GEPA is — an overview of Genetic Pareto Optimization
(02:47) The racehorse breeding analogy explaining Pareto diversity
(03:04) Managed Variables — managing arbitrary objects via Pydantic models, not only prompts
(04:00) The challenge — what percentage of UK MPs come from political families
(05:08) Codebase walk-through — defining the structured output schema in Pydantic AI
(11:00) Designing the golden dataset — generated with Opus 4.6 etc., then human-checked
(15:00) Logfire setup, multi-model connection via Pydantic AI Gateway
(17:00) Structure of the eval function — dataset plus custom evaluators
(18:38) "Letting an LLM be the judge is like running the asylum with the lunatics"
(20:00) Run 1 (simple prompt) — 85% accuracy
(after) Refined prompt plus GEPA auto-optimization, comparison in Logfire

Sources

Playground in Prod: Optimising Agents in Production Environments — Samuel Colvin, Pydantic (AI Engineer)

Pydantic: pydantic.dev · Pydantic AI: ai.pydantic.dev · Pydantic Logfire: pydantic.dev/logfire

GEPA paper (July 2025): arXiv:2507.19457

Samuel Colvin

Founding CEO of Pydantic / Python data validation + AI agents + observability stack