Ship Real Agents — Laurie Voss (Arize) on How to Build Evals That Actually Work (AI Engineer Europe 2026 London Workshop)

AI Engineer Europe 2026 (London) Workshop — Laurie Voss / Arize · May 14, 2026

Laurie Voss · 17:06 "Don't write too many evals. If the agent is smarter than you expected and skips two tools, a prescriptive eval will fire false negatives. The agent can be cleverer than the evals you wrote."

Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize (AI Engineer Europe 2026)

A workshop session at AI Engineer Europe 2026 (London, May 14, 2026), running about 2h 4m. The instructor is Laurie Voss, Head of Developer Experience at Arize AI and a co-founder of npm Inc. This is a full record of the hands-on workshop, which builds from the fundamentals of agent evals through implementation, "LLM-as-judge" meta-evaluation, the pattern of promoting capability evals to regression evals, and the crucial warning that "prescriptive evals break."

This is Laurie Voss's 2-hour hands-on agent-eval workshop, held in the final session of AI Engineer Europe 2026. Voss is widely known from the JavaScript industry of the 2010s as a co-founder of npm Inc and is currently Head of Developer Experience at Arize AI. An evangelist for package management at npm who is now an evangelist for AI evals, he embodies a generational handoff in the industry.

The core message of the workshop is the simple reframing that "eval has been over-constrained by ML engineering jargon. For an AI engineer, evals are just 'tests.'" This sits in the same observability + eval production-floor knowledge tradition as MEMEX's existing articles Sally-Ann Delucia (Arize) Hierarchical Memory and Amy Boyd / Nitya Narasimhan (Microsoft) Mind the Gap, and stands as a complete standalone video.

The "Vibes problem" — the trap of shipping AI without evals

The problem definition Voss presents at the opening of the workshop is concrete. "Many people build an AI feature, hit a few queries, decide 'it looks correct,' and ship. Then it fails on unexpected input. It fails on edge cases. It fails on adversarial input. And the most common failure mode is users asking questions dumber than you expected."

That final point reveals the depth of Voss's operational knowledge. The main cause of agent failure is not the typical adversarial attack but "users not knowing the vocabulary the agent expects": the agent expects a domain term, and the user asks in everyday language. This is where production-agent reliability problems originate.

"Unit tests alone are not enough to handle the Vibes problem. The same prompt generates different text every time, any of which may be correct. Simple string matching won't work. As a result many teams fall back on human review, which has three flaws: it doesn't scale, it doesn't catch regressions, and it doesn't run in CI," Voss explains.

Concrete example: "When I changed the system prompt to fix a tone issue, the bot did become kinder, but it also started hallucinating product features." Without evals, you cannot catch the side effect of a tone change introducing hallucination. By contrast: "A faithfulness eval can judge whether the bot is using the source material correctly." The strategy is to detect different failure modes with different evals.

The three kinds of eval — code / LLM-as-judge / human

Voss's eval taxonomy is operationally important.

Type	Characteristics	Where to use	Drawbacks
Code eval	Deterministic Python / TypeScript, runs in ms, near-zero cost	Format validation, length limits, forbidden phrases, required fields	Brittle for complex / nondeterministic outputs
LLM-as-judge	Uses a stronger LLM than production as judge, grading by rubric	Semantic correctness, faithfulness, tone, content judgment	Time, cost, nondeterministic (the judge itself can be wrong)
Human	The "gold standard" but doesn't scale	Building golden data sets, meta-evaluating LLM judges	Fatigue causes an error rate of about 50%; running thousands of reviews per day in CI is impossible

Voss emphasizes: "These three are not competing approaches — they're complementary. A real eval suite uses all three at once." This aligns with the argument of Vincent Koc (Comet) on Malleable Evals — "one kind of static benchmark is insufficient; combine multiple evaluation means" — an important industry consensus.
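
As an illustration of the cheapest of the three, a code eval might look like the following minimal sketch. The JSON output shape, the field names, the length limit, and the forbidden-phrase list are all assumptions made up for this example; they do not come from the workshop.

```python
import json

FORBIDDEN_PHRASES = ("guaranteed returns", "cannot lose")  # hypothetical policy list
REQUIRED_FIELDS = ("ticker", "summary")                    # hypothetical output schema
MAX_SUMMARY_CHARS = 600                                    # hypothetical length limit


def code_eval(raw_output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    # Required fields: present and non-empty.
    for field in REQUIRED_FIELDS:
        if not data.get(field):
            failures.append(f"missing required field: {field}")

    # Length limit and forbidden phrases are checked on the summary text.
    summary = str(data.get("summary", ""))
    if len(summary) > MAX_SUMMARY_CHARS:
        failures.append("summary exceeds length limit")
    lowered = summary.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            failures.append(f"forbidden phrase: {phrase}")
    return failures
```

Checks like these are deterministic and fast, which is why they suit format and policy rules, while semantic questions such as faithfulness are left to an LLM judge.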

The note on the 50% error rate of human annotators is sharp: "The failure rate of human evaluators due to fatigue is about 50%. Even if Meta or Google could hire the population of a small country, it isn't cost-effective." So humans are limited to building golden data sets and meta-evaluating LLM judges, while production evaluation runs on LLM-as-judge and code evals. That is the modern eval architecture.

"Agents make it harder" — the structure of cascading failures

An important point emerges in the middle of the workshop. "Single-LLM-call eval is hard, but agents are harder — because of cascading failures." An agent calls tools in sequence. At each step, you have to verify that the right tool was chosen, that the parameters were correct, that the output was interpreted correctly. In a multi-agent system, the eval target extends further: did the routing LLM pick the right sub-agent, what happened inside the sub-agent, did the return value propagate upstream correctly.

Voss's real example is what happens when you ask an agent to "write a report on Tesla": "the first research agent goes, 'Tesla? Oh, you mean Nikola Tesla', and exhaustively researches the 19th-century inventor. Then it writes an investment case from that information, and forwards it to your boss. This is a cascading failure." Each step is internally consistent, but the whole is radically wrong.
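
One way to locate where a cascade starts is to check what each step handed to the next and report the first step that went wrong. The sketch below is a simplified illustration: the `Step` structure, the step names, and the `entity_type` field are hypothetical, and a real trace would come from an observability tool rather than a hand-built list.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str     # e.g. "research" or "write_report"
    output: dict  # whatever the step handed to the next one


def first_failing_step(
    steps: list[Step], checks: dict[str, Callable[[dict], bool]]
) -> Optional[str]:
    """Walk the trace in order and return the first step whose check fails."""
    for step in steps:
        check = checks.get(step.name)
        if check is not None and not check(step.output):
            return step.name
    return None


# Hypothetical checks for the Tesla example: each inspects a step's output.
checks = {
    "research": lambda out: out.get("entity_type") == "company",
    "write_report": lambda out: len(out.get("text", "")) > 0,
}

trace = [
    Step("research", {"entity": "Nikola Tesla", "entity_type": "person"}),
    Step("write_report", {"text": "Investment case for Nikola Tesla ..."}),
]

print(first_failing_step(trace, checks))  # the report step passes its own check
```

The report-writing step passes its own check here, which is the point of the example: only the research step reveals that the wrong Tesla was chosen.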

The most important warning — "don't write too many evals"

From a MEMEX editorial point of view, this is the most important point in the workshop, and one that many eval guides miss:

"Agents can be 'surprising' in the reverse direction too. The hazard when writing evals is writing too prescriptively. Don't write evals that say 'call tool A, then tool B, then decision C, then produce the answer.' The agent may find a cleverer way than you. This is especially true when you upgrade the model — the new model does in fewer steps what the old one did. If your eval is too prescriptive, it breaks" (17:06).

This point is in subtle tension with the "verifiability matters" argument in Karpathy's Software 3.0 / Agentic Engineering. Karpathy argues "AI evolves quickly in verifiable domains (math / coding / games)," but Voss issues the opposite-side warning: "embedding too much verifiability into evals impedes the agent's becoming cleverer."

The workshop's practical conclusion: "Eval the end-state, not the path." Eval "did the Tesla report get written?" Do not eval "did it go through tool A → B → C?" The latter ossifies the agent's future improvement.
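
The difference can be sketched in a few lines. Everything below is an assumption for illustration: the `AgentRun` structure and the tool names are hypothetical, and the substring check stands in for what would more realistically be an LLM judge on the final report.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    tool_calls: list[str] = field(default_factory=list)  # tools called, in order
    final_output: str = ""                                # what the user receives


def path_eval(run: AgentRun) -> bool:
    """Prescriptive: passes only if the exact expected tool sequence was used."""
    return run.tool_calls == ["search_news", "fetch_financials", "write_report"]


def end_state_eval(run: AgentRun) -> bool:
    """Outcome-based: passes if a report about the requested company exists."""
    text = run.final_output.lower()
    return "tesla" in text and "report" in text


old_model = AgentRun(
    ["search_news", "fetch_financials", "write_report"], "Tesla report: ..."
)
# A hypothetical upgraded model reaches the same result in fewer steps.
new_model = AgentRun(["fetch_financials"], "Tesla report: ...")

print(path_eval(new_model), end_state_eval(new_model))
```

The upgraded run fails `path_eval` even though its output is the same, which is the false failure the workshop warns about; `end_state_eval` keeps passing.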

The capability eval → regression eval promotion pattern

Voss's framing of the eval life cycle is also important.

Capability eval: tasks the agent currently fails (the hill it has yet to climb). Run repeatedly during development until it passes 100%
Regression eval: once a capability eval reaches 100% pass, it joins the test suite and continuously protects "what worked before"
Iterate: add a new capability eval and climb a new hill. Past ones are all protected as regression evals

This dynamic of "continuously promoting capability evals into regression evals" is the core of long-term production-agent reliability. Eval suites are often extended only reactively (you add an eval only after seeing a failure), but Voss's framing proposes proactive eval design (always keep one hill you're climbing).
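
The promotion step can be sketched as a small bookkeeping structure. This is a minimal sketch under stated assumptions: the `EvalSuite` class, its method names, and the default of five runs per eval are invented for illustration and are not part of Phoenix or of the workshop material.

```python
from dataclasses import dataclass, field
from typing import Callable

EvalFn = Callable[[], bool]  # one eval run; True means pass


@dataclass
class EvalSuite:
    capability: dict[str, EvalFn] = field(default_factory=dict)  # hills still being climbed
    regression: dict[str, EvalFn] = field(default_factory=dict)  # behaviour that must keep working

    def pass_rate(self, fn: EvalFn, runs: int = 5) -> float:
        """Run an eval several times, since agent output is nondeterministic."""
        return sum(fn() for _ in range(runs)) / runs

    def promote_passing(self, runs: int = 5) -> list[str]:
        """Move every capability eval that passes all runs into the regression set."""
        promoted = [
            name for name, fn in self.capability.items()
            if self.pass_rate(fn, runs) == 1.0
        ]
        for name in promoted:
            self.regression[name] = self.capability.pop(name)
        return promoted

    def regressions(self, runs: int = 5) -> list[str]:
        """Names of regression evals that no longer pass every run."""
        return [
            name for name, fn in self.regression.items()
            if self.pass_rate(fn, runs) < 1.0
        ]
```

In use, `promote_passing()` would run after a development iteration, and `regressions()` would run in CI on every change.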

Arize Phoenix — the open-source eval platform

The hands-on portion of the workshop uses Arize Phoenix. This is Arize's open-source AI observability platform, separate from the enterprise Arize AX. Voss warns that "people sometimes sign up and end up in AX by mistake" — easy to conflate — but Phoenix runs even on your laptop as a fully open-source platform, providing a UI for storing and analyzing traces, eval execution, and an experiments feature.

Tech stack:

OpenTelemetry (OTel): abbreviated "OTel." The standard protocol used across all kinds of observability, including Kubernetes
OpenInference: the LLM-specific extension to OTel. Standardizes prompt / completion / token / model / tool calls
Two lines to start auto-instrumentation: `from phoenix.otel import register` followed by `register(project_name=..., auto_instrument=True)` starts trace collection on nearly every SDK (Anthropic / OpenAI / Gemini / CrewAI / LangChain / LlamaIndex, etc.)

Voss's demo agent is a "financial analysis agent" — pass it a stock ticker and two sub-agents (research → write report) produce a financial report. He intentionally uses Claude Haiku: "It's reliably dumb, so it produces mistakes — ideal for an eval demo." This is also a practical judgment — selecting a reliably weak model for demo purposes is smart design.

The pitfall of LLM-as-judge — the need for meta-evaluation

The latter half of the workshop digs into the reliability of LLM judges. "LLM judges are also nondeterministic — the judge itself can be wrong." Hence meta-evaluation is needed — "an eval that evals the eval."

Implementation: build a golden data set with human annotators (mindful of the 50% rule above), and measure whether the LLM judge produces the same verdict on that golden set. Tune the LLM judge's prompt so the judge matches human criteria. This is the last line of defense for production-agent quality assurance.
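
A minimal sketch of that measurement, assuming a golden set of (output, human label) pairs and a judge exposed as a plain function. The example data and the `stub_judge` function are hypothetical stand-ins; a real judge would call an LLM with a rubric prompt.

```python
from typing import Callable

# Hypothetical golden set: (agent output, human label), labels "pass" or "fail".
GOLDEN = [
    ("Report cites the 10-K for every figure.", "pass"),
    ("Report invents a product that does not exist.", "fail"),
    ("Report answers a different question.", "fail"),
    ("Report is accurate but terse.", "pass"),
]


def judge_agreement(judge: Callable[[str], str], golden: list[tuple[str, str]]) -> dict:
    """Compare judge verdicts with human labels and collect the disagreements."""
    disagreements = []
    for text, human_label in golden:
        verdict = judge(text)
        if verdict != human_label:
            disagreements.append((text, human_label, verdict))
    agreement = 1 - len(disagreements) / len(golden)
    return {"agreement": agreement, "disagreements": disagreements}


def stub_judge(text: str) -> str:
    """Stand-in for a real LLM judge call; it only flags obvious invention."""
    return "fail" if "invents" in text else "pass"


result = judge_agreement(stub_judge, GOLDEN)
print(result["agreement"])
for text, human_label, verdict in result["disagreements"]:
    print(f"human={human_label} judge={verdict}: {text}")
```

The disagreement list is the useful part when tuning: each entry is a case where the judge prompt needs to be brought closer to the human criteria.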

The explanation field — the key to actionability

The unique advantage Voss highlights for LLM judges is the explanation field. Code evals give pass / fail only; LLM judges return a written explanation of "why it failed."

A real example (presented in the workshop): for the user prompt "Tell me budget travel recommendations for Tokyo," the agent produced travel recommendations but omitted concrete cost — the LLM judge's explanation: "Travel recommendations were provided, but for the requirement of 'budget travel' the cost information is missing. A subtle distinction, but the response does not follow the intent of the original request."

This makes the eval actionable. The explanation gives hints about "what to add to the prompt." When you run evals at the scale of thousands of traces, "the pattern of the same kind of failure repeating" becomes visible, and that becomes the criterion for distinguishing a systematic prompt problem (vs. a one-off confused agent).

But a new problem: "Reading thousands of explanations also exhausts humans." Solution: bring in a third LLM to categorize the explanations. "An LLM that summarizes the LLM's explanations of the LLM" — the workshop's phrase "It's LLMs all the way down" captures the modern eval architecture succinctly.
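
The categorize-and-count step can be sketched as follows. The category names and the keyword-based `stub_categorizer` are assumptions for illustration only; in the setup described above, that function would be a call to a third LLM prompted with a list of categories.

```python
from collections import Counter
from typing import Callable


def summarize_failures(
    explanations: list[str], categorize: Callable[[str], str]
) -> list[tuple[str, int]]:
    """Label each judge explanation, then rank categories by frequency."""
    return Counter(categorize(e) for e in explanations).most_common()


def stub_categorizer(explanation: str) -> str:
    """Stand-in for the third LLM; keyword rules replace a real model call."""
    lowered = explanation.lower()
    if "cost" in lowered or "budget" in lowered:
        return "missing_cost_info"
    if "not in the source" in lowered:
        return "unfaithful"
    return "other"


explanations = [
    "Recommendations were provided, but cost information is missing.",
    "The budget requirement was ignored.",
    "The claim is not in the source material.",
]

for category, count in summarize_failures(explanations, stub_categorizer):
    print(category, count)
```

A category that dominates the ranking points to a systematic prompt problem, while a long tail of single occurrences looks more like one-off confusion.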

Editorial Observations — the MEMEX positioning of Arize-style eval

Three reasons this workshop is worth covering on MEMEX:

(1) Demystifying eval: Voss reframes ML-engineering jargon (rubric, trace, span, meta-evaluation) for the AI engineer as "this is just testing in the end." This is consistent with other Arize content such as Alex Lamb's Optimising Agents in Prod, and it is the core of democratization: "non-ML specialists can write evals too."

(2) The warning against "prescriptive eval": the 17:06 point that "agents are cleverer than you expect" seems obvious, yet many eval guides miss it. It is in subtle tension with Karpathy's verifiability argument and presents a design trade-off: "the more you embed verifiability, the more you narrow the agent's flexibility." As MEMEX's editorial axis, holding both lenses simultaneously (verifiability-first vs agent-flexibility-first) is what conveys the industry's actual situation accurately.

(3) The importance of an open-source platform: Arize Phoenix is open source, runs on a laptop, and conforms to the industry standards OpenTelemetry + OpenInference, which means that "production-agent eval can be assembled without vendor lock-in." This "open eval infrastructure" is a different direction from Anthropic Project Glasswing or Anthropic's enterprise strategy, and it is directly tied to raising AI reliability across the whole industry.

The placement of Laurie Voss himself is also worth noting. A figure who supported the JavaScript ecosystem of the 2010s as co-founder of npm Inc has become an evangelist for AI eval in 2026. The transition from the democratization of package management (anyone can use a JS library) to the democratization of eval (anyone can make production AI testable) is the trajectory of a person embodying the generational handoff in developer infrastructure. He deserves to be added as a candidate to MEMEX's people profiles.

Related Resources

Malleable Evals — from static benchmarks to adaptive evaluation (Vincent Koc / Comet) — another approach on the same eval theme
Hierarchical memory and context management (Sally-Ann Delucia / Arize) — the same Arize ecosystem
Mind the Gap — the full picture of agent observability (Microsoft) — a cross-cutting observability theme
Playground in Prod — optimizing agents in production (Arize) — methodology for production agents
From vibe coding to agentic engineering — Andrej Karpathy — the contrasting point on verifiability
Arize Phoenix GitHub — the open-source eval platform