Laurie Voss · 17:06 "Don't write too many evals. If the agent is smarter than you expected and skips two tools, a prescriptive eval will fire false negatives. The agent can be cleverer than the evals you wrote."
This is Laurie Voss's 2-hour hands-on agent-eval workshop, held in the final session of AI Engineer Europe 2026. Voss is widely known from the JavaScript industry of the 2010s as a co-founder of npm Inc and is currently Head of Developer Experience at Arize AI. A former evangelist for package management at npm who is now an evangelist for AI evals, he embodies a generational handoff in the industry.
The core message of the workshop is a simple reframing: "eval has been over-constrained by ML engineering jargon. For an AI engineer, evals are just 'tests.'" The workshop sits in the same tradition of production-floor observability and eval knowledge as MEMEX's existing articles Sally-Ann Delucia (Arize) Hierarchical Memory and Amy Boyd / Nitya Narasimhan (Microsoft) Mind the Gap, and it stands on its own as a complete video.
The "Vibes problem" — the trap of shipping AI without evals
The problem definition Voss presents at the opening of the workshop is concrete. "Many people build an AI feature, hit a few queries, decide 'it looks correct,' and ship. Then it fails on unexpected input. It fails on edge cases. It fails on adversarial input. And the most common failure mode is users asking questions dumber than you expected."
That final point reveals the depth of Voss's operational knowledge. The main cause of agent failure is not the typical adversarial attack but "users not knowing the vocabulary the agent expects": the agent expects a domain term, and the user asks in everyday language. Production-agent reliability problems originate here.
"Unit tests aren't enough to handle the Vibes problem. The same prompt generates different text every time, any of which may be correct. Simple string match won't work. As a result many teams escape to human review, which has three flaws: it doesn't scale, it doesn't catch regressions, and it doesn't run in CI," Voss explains. The transition from vibe checks to formal eval is needed; examples include the Anthropic Claude Code team, Descript, and Bolt.
Concrete example: "When I changed the system prompt to fix a tone issue, the bot did become kinder, but it also started hallucinating product features." Without evals, you cannot catch hallucination that a tone change introduces as a side effect. By contrast: "A faithfulness eval can judge whether the bot is using the source material correctly." A faithfulness eval is one of the standard LLM-as-judge patterns: it judges whether a response is based only on the source material (documents passed via RAG, the system prompt, and so on), which makes it the core of hallucination prevention. When a customer-support bot is given the company's docs, for example, another LLM judges whether the bot has invented features that are not in those docs. Arize provides this as a built-in eval. The strategy is to detect different failure modes with different evals.
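A faithfulness check of this kind can be sketched as follows. This is a minimal illustration, not Arize's built-in eval: the prompt wording, the `judge_faithfulness` function, and the `fake_judge` stand-in for a real model call are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical rubric prompt; a real judge prompt would be tuned against human labels.
JUDGE_PROMPT = """You are grading a support bot.
Source material:
{source}

Bot answer:
{answer}

Reply with JSON: {{"label": "faithful" or "unfaithful", "explanation": "..."}}"""


def judge_faithfulness(source: str, answer: str, call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model whether the answer is grounded only in the source."""
    raw = call_llm(JUDGE_PROMPT.format(source=source, answer=answer))
    verdict = json.loads(raw)
    if verdict.get("label") not in {"faithful", "unfaithful"}:
        raise ValueError(f"unexpected judge label: {verdict!r}")
    return verdict


def fake_judge(prompt: str) -> str:
    """Offline stand-in for a real LLM call (assumption: flags one invented feature)."""
    answer_part = prompt.split("Bot answer:")[1]
    if "offline mode" in answer_part:
        return json.dumps({"label": "unfaithful",
                           "explanation": "The docs never mention an offline mode."})
    return json.dumps({"label": "faithful",
                       "explanation": "Every claim appears in the docs."})


docs = "The app syncs notes across devices."
good = judge_faithfulness(docs, "It syncs notes across devices.", fake_judge)
bad = judge_faithfulness(docs, "It also has an offline mode.", fake_judge)
```

Passing the model call in as `call_llm` keeps the harness testable without network access; in practice the stand-in would be replaced by a call to a judge model stronger than the production one.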
The three kinds of eval — code / LLM-as-judge / human
Voss's eval taxonomy is operationally important.
| Type | Characteristics | Where to use | Drawbacks |
|---|---|---|---|
| Code eval | Deterministic Python / TypeScript, runs in ms, near-zero cost | Format validation, length limits, forbidden phrases, required fields | Brittle for complex / nondeterministic outputs |
| LLM-as-judge | Uses a stronger LLM than production as judge, grading by rubric | Semantic correctness, faithfulness, tone, content judgment | Time, cost, nondeterministic (the judge itself can be wrong) |
| Human | The "gold standard" but doesn't scale | Building golden data sets, meta-evaluating LLM judges | Fatigue pushes the error rate to roughly 50%; reviewing thousands of outputs per day in CI is impossible |
Voss emphasizes: "These three are not competing approaches — they're complementary. A real eval suite uses all three at once." This aligns with the argument of Vincent Koc (Comet) on Malleable Evals — "one kind of static benchmark is insufficient; combine multiple evaluation means" — an important industry consensus.
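The code-eval row of the table can be illustrated with a short sketch. The schema, the forbidden phrases, and the length limit below are hypothetical values chosen for the example, not anything specified in the workshop.

```python
import json

FORBIDDEN_PHRASES = ("as an ai language model", "guaranteed returns")  # hypothetical list
REQUIRED_FIELDS = ("ticker", "summary")  # hypothetical output schema
MAX_SUMMARY_CHARS = 500  # hypothetical length limit


def code_eval(raw_output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    for field in REQUIRED_FIELDS:
        if field not in data:
            failures.append(f"missing required field: {field}")
    summary = str(data.get("summary", ""))
    if len(summary) > MAX_SUMMARY_CHARS:
        failures.append("summary exceeds length limit")
    lowered = summary.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            failures.append(f"forbidden phrase: {phrase}")
    return failures
```

Checks like these are deterministic and cheap, so they can run on every trace, but they say nothing about whether the summary is semantically correct. That gap is what the LLM-as-judge row covers.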
The note on the 50% error rate of human annotators is sharp. According to Voss, a full-time human reviewing LLM output all day makes mistakes at roughly a 50% rate due to fatigue, so scaling production-agent eval with humans alone is structurally impossible: "Even if Meta or Google could hire the population of a small country, it isn't cost-effective." So humans are limited to building golden data sets and meta-evaluating LLM judges, while production evaluation runs on LLM-as-judge and code eval. This is the modern eval architecture.
"Agents make it harder" — the structure of cascading failures
An important point emerges in the middle of the workshop. "Single-LLM-call eval is hard, but agents are harder — because of cascading failures." An agent calls tools in sequence. At each step, you have to verify that the right tool was chosen, that the parameters were correct, that the output was interpreted correctly. In a multi-agent system, the eval target extends further: did the routing LLM pick the right sub-agent, what happened inside the sub-agent, did the return value propagate upstream correctly.
Voss's real example is asking an agent to "write a report on Tesla": "the first research agent goes, 'Tesla? Oh, you mean Nikola Tesla', and exhaustively researches the 19th-century inventor. Then it writes an investment case from that information, and forwards it to your boss. This is a cascading failure." A cascading failure is a failure pattern of multi-step agents in which a small early misjudgment is amplified in later steps. Each step is internally consistent, but the whole is radically wrong, which is what makes discovery difficult. In eval architecture, cascading failures are caught by combining step-level eval (was each tool call correct) with end-to-end eval (was the final output correct).
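The combination of step-level and end-to-end checks can be sketched as below. The `Step` record, the tool names, and the `must_mention` heuristic are hypothetical simplifications. A real trace would come from an observability backend, and the end-to-end check would usually be an LLM judge rather than a keyword test.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict
    output: str


def step_level_failures(trace: list[Step], allowed_tools: set[str]) -> list[str]:
    """Check each tool call in isolation: known tool, non-empty arguments and output."""
    failures = []
    for i, step in enumerate(trace):
        if step.tool not in allowed_tools:
            failures.append(f"step {i}: unknown tool {step.tool}")
        if not step.args:
            failures.append(f"step {i}: empty arguments")
        if not step.output.strip():
            failures.append(f"step {i}: empty output")
    return failures


def end_to_end_failures(final_report: str, must_mention: list[str]) -> list[str]:
    """Check the final output against what the user actually asked for."""
    text = final_report.lower()
    return [f"report never mentions {term!r}"
            for term in must_mention if term.lower() not in text]


# Every step looks fine on its own, yet the report is about the wrong Tesla.
trace = [
    Step("web_search", {"query": "Nikola Tesla"}, "Inventor born in 1856, pioneer of AC power."),
    Step("write_report", {"topic": "Tesla"}, "Investment case: Nikola Tesla pioneered AC power."),
]
step_issues = step_level_failures(trace, {"web_search", "write_report"})
final_issues = end_to_end_failures(trace[-1].output, ["TSLA"])
```

Here `step_issues` is empty while `final_issues` is not, which mirrors the Tesla example: only the end-to-end check notices that the chain drifted away from the user's intent.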
The most important warning — "don't write too many evals"
From a MEMEX editorial point of view, this is the most important point in the workshop, and one that many eval guides miss:
"Agents can be 'surprising' in the reverse direction too. The hazard when writing evals is writing too prescriptively. Don't write evals that say 'call tool A, then tool B, then decision C, then produce the answer.' The agent may find a cleverer way than you. This is especially true when you upgrade the model — the new model does in fewer steps what the old one did. If your eval is too prescriptive, it breaks" (17:06).
This point is in subtle tension with the "verifiability matters" argument in Karpathy's Software 3.0 / Agentic Engineering. Karpathy argues that "AI evolves quickly in verifiable domains (math / coding / games)," while Voss issues a warning from the opposite side: "embedding too much verifiability into evals impedes the agent's becoming cleverer."
The workshop's practical conclusion: "Eval the end-state, not the path." Evaluate "did the Tesla report get written?" Do not evaluate "did it go through tool A → B → C?" The latter ossifies the agent and blocks its future improvement.
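The difference between the two styles can be sketched as follows. The tool names, the ticker check, and the word-count threshold are hypothetical; a real end-state eval would more likely be an LLM judge scoring the report.

```python
def prescriptive_eval(tools_called: list[str]) -> bool:
    """Path eval: passes only if the exact tool sequence was followed."""
    return tools_called == ["search_news", "fetch_financials", "write_report"]


def end_state_eval(report: str, ticker: str) -> bool:
    """End-state eval: passes if a non-trivial report about the ticker exists."""
    return ticker in report and len(report.split()) >= 5


# A newer model skips one tool but still produces a valid report.
tools_called = ["fetch_financials", "write_report"]
report = "TSLA report: revenue rose and margins held steady."

path_ok = prescriptive_eval(tools_called)
end_ok = end_state_eval(report, "TSLA")
```

The path eval reports a failure even though the outcome is fine, which is the false negative Voss warns about. The end-state eval keeps passing when the agent finds a shorter route.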
The capability eval → regression eval promotion pattern
Voss's framing of the eval life cycle is also important.
- Capability eval: the first stage of eval that Voss defines. An eval that intentionally surfaces tasks the agent currently fails (the hill it has yet to climb). Run repeatedly during development until the agent passes 100%
- Regression eval: once a capability eval reaches 100% pass, it joins the test suite and guarantees that what the agent could do before keeps working. This is the core of continuous monitoring, which checks new feature additions, model upgrades, and prompt changes for regressions of past capabilities
- Iterate: add a new capability eval and climb a new hill. Past ones are all protected as regression evals
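The promotion cycle in the list above can be sketched as a small bookkeeping structure. `EvalSuite` and the eval names are hypothetical; the sketch only tracks which evals sit in which stage and does not run anything.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSuite:
    # eval name -> pass/fail results from its most recent run
    capability: dict[str, list[bool]] = field(default_factory=dict)
    regression: set[str] = field(default_factory=set)

    def record(self, name: str, results: list[bool]) -> None:
        """Store the latest run of a capability eval."""
        self.capability[name] = results

    def promote_passing(self) -> list[str]:
        """Move every capability eval with a 100% pass rate into the regression suite."""
        promoted = [name for name, results in self.capability.items()
                    if results and all(results)]
        for name in promoted:
            del self.capability[name]
            self.regression.add(name)
        return promoted


suite = EvalSuite()
suite.record("handles_plain_language", [True, True, True])
suite.record("multi_ticker_compare", [True, False])
promoted = suite.promote_passing()
```

After the call, the fully passing eval guards against regressions while the failing one remains the hill still being climbed.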
This dynamic of "continuously promoting capability evals into regression evals" is the core of long-term production-agent reliability. Eval suites are often extended only reactively (you add an eval only after seeing a failure), but Voss's framing proposes proactive eval design (always keep one hill you're climbing).
Arize Phoenix — the open-source eval platform
The hands-on portion of the workshop uses Arize Phoenix. This is Arize's open-source AI observability platform, separate from the enterprise Arize AX. Voss warns that "people sometimes sign up and end up in AX by mistake" — easy to conflate — but Phoenix runs even on your laptop as a fully open-source platform, providing a UI for storing and analyzing traces, eval execution, and an experiments feature.
Tech stack:
- OpenTelemetry: abbreviated "OTel." The industry-standard protocol for observability, used widely from Kubernetes logs to various kinds of cloud monitoring. Arize Phoenix is built on it
- OpenInference: the LLM-specific extension to OTel, built jointly on top of OpenTelemetry by Arize and other industry players. Standardizes AI-specific metadata such as prompt text, completion text, token counts, model name, and invoked tools. Instrumentation packages are provided for all major LLM SDKs (Anthropic / OpenAI / Gemini / various agent frameworks)
- Two lines to start auto-instrumentation (import + register): `phoenix.otel.register(project_name=..., auto_instrument=True)` starts trace collection on nearly every SDK (Anthropic / OpenAI / Gemini / CrewAI / LangChain / LlamaIndex, etc.)
Voss's demo agent is a "financial analysis agent" — pass it a stock ticker and two sub-agents (research → write report) produce a financial report. He intentionally uses Claude Haiku: "It's reliably dumb, so it produces mistakes — ideal for an eval demo." This is also a practical judgment — selecting a reliably weak model for demo purposes is smart design.
The pitfall of LLM-as-judge — the need for meta-evaluation
The latter half of the workshop digs into the reliability of LLM judges. "LLM judges are also nondeterministic, and the judge itself can be wrong." Hence meta-evaluation is needed: "an eval that evals the eval." Meta-evaluation is the process of verifying that the judge in an LLM-as-judge eval is itself judging correctly. Typically you measure whether the LLM judge produces the same verdict as a human-curated golden data set (known good answers), which confirms, before the judge is deployed, that it truly judges by your criteria.
Implementation: build a golden data set with human annotators (mindful of the 50% rule above), and measure whether the LLM judge produces the same verdict on that golden set. Tune the LLM judge's prompt so the judge matches human criteria. This is the last line of defense for production-agent quality assurance.
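Measuring judge-versus-human agreement can be sketched as below. The function name and the "pass" / "fail" labels are hypothetical, and Cohen's kappa is included here as a common chance-corrected agreement measure, not as something the workshop prescribes.

```python
def judge_agreement(golden: list[str], judged: list[str]) -> dict:
    """Compare LLM-judge verdicts with human golden labels."""
    if len(golden) != len(judged) or not golden:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(golden)
    observed = sum(g == j for g, j in zip(golden, judged)) / n
    # Chance agreement for Cohen's kappa, from each rater's label frequencies.
    labels = set(golden) | set(judged)
    expected = sum((golden.count(lab) / n) * (judged.count(lab) / n) for lab in labels)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}


golden = ["pass", "pass", "fail", "fail"]   # human-curated labels
judged = ["pass", "fail", "fail", "fail"]   # verdicts from the LLM judge
scores = judge_agreement(golden, judged)
```

A low score suggests the judge prompt needs another tuning pass before its verdicts are trusted in CI. Which threshold counts as good enough is a team decision that this sketch does not settle.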
The explanation field — the key to actionability
The unique advantage Voss highlights for LLM judges is the explanation field. Code evals give pass / fail only; LLM judges return a written explanation of "why it failed."
A real example (presented in the workshop): for the user prompt "Tell me budget travel recommendations for Tokyo," the agent produced travel recommendations but omitted concrete cost — the LLM judge's explanation: "Travel recommendations were provided, but for the requirement of 'budget travel' the cost information is missing. A subtle distinction, but the response does not follow the intent of the original request."
This makes the eval actionable. The explanation gives hints about "what to add to the prompt." When you run evals at the scale of thousands of traces, "the pattern of the same kind of failure repeating" becomes visible, and that becomes the criterion for distinguishing a systematic prompt problem (vs. a one-off confused agent).
But a new problem: "Reading thousands of explanations also exhausts humans." Solution: bring in a third LLM to categorize the explanations. "An LLM that summarizes the LLM's explanations of the LLM" — the workshop's phrase "It's LLMs all the way down" captures the modern eval architecture succinctly.
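The categorization step can be sketched as below. The `keyword_categorizer` is a deliberately crude offline stand-in for the third LLM, and the category names are hypothetical. The part that carries over to a real pipeline is the aggregation that ranks failure categories by frequency.

```python
from collections import Counter
from typing import Callable


def summarize_failures(explanations: list[str],
                       categorize: Callable[[str], str]) -> list[tuple[str, int]]:
    """Group judge explanations by category and rank categories by frequency."""
    counts = Counter(categorize(text) for text in explanations)
    return counts.most_common()


def keyword_categorizer(text: str) -> str:
    """Offline stand-in for the categorizing LLM; a real one would be a model call."""
    lowered = text.lower()
    if "cost" in lowered or "budget" in lowered:
        return "missing cost information"
    if "not in the source" in lowered or "invented" in lowered:
        return "hallucinated detail"
    return "other"


explanations = [
    "Cost information is missing.",
    "The bot invented a feature.",
    "No budget figures were given.",
]
ranking = summarize_failures(explanations, keyword_categorizer)
```

A category that dominates the ranking points to a systematic prompt problem, while a long tail of single occurrences looks more like one-off confusion.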
Editorial Observations — the MEMEX positioning of Arize-style eval
Three reasons this workshop is worth covering on MEMEX:
(1) Demystifying eval: Voss reframes ML-engineering jargon (rubric, trace, span, meta-evaluation) for the AI engineer as "this is just testing in the end." This is consistent with other Arize content such as Alex Lamb (Arize) Optimising Agents in Prod, and it is the core of democratization: "non-ML specialists can write evals too."
(2) The warning against "prescriptive eval": the 17:06 point that "agents are cleverer than you expect" seems obvious, yet many eval guides miss it. It is in subtle tension with Karpathy's verifiability argument and presents a design trade-off: "the more you embed verifiability, the more you narrow the agent's flexibility." As MEMEX's editorial axis, holding both lenses at once (verifiability-first vs agent-flexibility-first) is what conveys the industry's actual situation accurately.
(3) The importance of an open-source platform: because Arize Phoenix is open source, runs on a laptop, and conforms to the industry standards OpenTelemetry and OpenInference, "production-agent eval can be assembled without vendor lock-in." This "open eval infrastructure" is a different direction from Anthropic Project Glasswing or Anthropic's enterprise strategy, and it is directly tied to raising AI reliability across the whole industry.
The placement of Laurie Voss himself is also worth noting. A figure who supported the JavaScript ecosystem of the 2010s as co-founder of npm Inc has become an evangelist for AI eval in 2026. The transition from the democratization of package management (anyone can use a JS library) to the democratization of eval (anyone can make production AI testable) is the trajectory of a person embodying the generational handoff in developer infrastructure. He deserves to be added as a candidate to MEMEX's people profiles.
Related Resources
- Malleable Evals — from static benchmarks to adaptive evaluation (Vincent Koc / Comet) — another approach on the same eval theme
- Hierarchical memory and context management (Sally-Ann Delucia / Arize) — the same Arize ecosystem
- Mind the Gap — the full picture of agent observability (Microsoft) — a cross-cutting observability theme
- Playground in Prod — optimizing agents in production (Arize) — methodology for production agents
- From vibe coding to agentic engineering — Andrej Karpathy — the contrasting point on verifiability
- Arize Phoenix GitHub — the open-source eval platform