Vincent Koc · 04:21 "Our AI applications are not static, yet we treat them like static software."
Vincent Koc leads evaluation research at Comet, an AI developer platform offering tools for LLM evaluation, observability, and production operations. Comet supports AI evaluation benchmarks for large enterprises including Uber, Netflix, and UK banks (official site: comet.com). He is also a core contributor to OpenCode, an OSS coding agent harness led by SST. He calls himself "a friendly tech canary," and tells self-deprecating stories ("in 2013 I used a VR headset for three hours when it said to stop after five minutes, and vomited for three hours") to embody his "try it first, then report back" style.
The talk's central claim is a provocation: there is a half-truth in the industry joke "evals are dead." Evaluation methodology centered on static benchmarks won't survive into the agentic AI era. The proposed solution is malleable evals: evaluation designed as a live system that evolves with the agent.
1. The trap of applying static-software evaluation methods (unit tests, regression, CI/CD, chaos engineering) directly to AI — AI applications are not static, yet we treat them as if they were. Neither benchmarks nor handcrafted datasets have a clear answer to "how do we update this each week after production?"
2. The evolution from prompt engineering to context engineering to intent engineering: 2023's "throw in random words and pray it helps" stage is over. By 2025, context engineering (RAG, tool calling) had made agents steerable. 2026 is the year of intent engineering, in which machines self-optimize against intent, and evaluation needs to keep up.
3. The 80/20 rule, with the adaptive 20% concentrating the risks that break a business — 80% of behavior can be covered by static eval, but the adaptive 20% (customers using the product strangely, unexpected question patterns) is where production incidents happen. The design challenge is to have evals self-evolve to capture this.
Key Observations
"Our AI applications are not static, yet we treat them like static software" (04:21)
The fundamental mismatch Koc points out. "When you ship software, you sometimes change the unit tests — relatively quickly. But practically, the software itself has become malleable" (05:34).
The example: OpenCode. "The harness itself self-modifies. You want to create skills, you want to do other things — the harness adapts in response" (06:20). In an era when software ships at the speed of light, how is the benchmark supposed to keep up?
"Prompt engineering should have died in 2023, but people are still doing it" (06:25 to 07:30)
Koc's provocative lineage of evaluation. "Prompt engineering — beating random words into the AI and hoping the result improves — is closer to accidental pharmaceutical discovery" (06:30). The same structure as making a liver-disease drug and discovering a painkiller: no systematic path to improvement.
Then the move to context engineering. "With RAG and tool calling, we could make the agent steerable, and decompose the evaluation by part. But even that didn't break through" (07:09).
2026 is intent engineering. "Machines can self-optimize against intent — proven in harnesses like OpenCode" (08:36). At this point, evaluation enters a new stage — each user's experience is different, so a uniform benchmark cannot capture it.
"80% can stay static, but it's the 20% that breaks your business" (13:30 to 14:05)
Koc's closing frame. "80% is static stuff, defined in an intentful manner — but the remaining 20% keeps changing. That 20% is what ruins your business. Someone asks a strange question, uses the agent in an unusual way — and it's absolute hell" (13:30).
The direction of the solution: "Treat evals not as a static dataset but as code, as software, as a living agent. Not a snapshot at one point in time, but a self-optimizing, growing solution" (13:59). Self-curating eval suites from traces, always-on optimization, and self-healing via telemetry-in-the-loop, together with intent-based outcomes, are the components of malleable evals.
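To make "self-curating eval suites from traces" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method shown in the talk or any Comet API: the `Trace` and `EvalSuite` types, the `outcome_ok` telemetry flag, and the whitespace/case normalization used for deduplication are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Trace:
    """One production interaction (hypothetical shape, not a real schema)."""
    user_input: str
    outcome_ok: bool  # assumed telemetry signal: did the user reach their goal?


@dataclass
class EvalSuite:
    """An eval suite that grows from production traces instead of staying fixed."""
    cases: dict[str, Trace] = field(default_factory=dict)

    def curate(self, traces: list[Trace]) -> int:
        """Add failed traces whose input is not yet covered; return how many were added."""
        added = 0
        for trace in traces:
            # Normalize case and whitespace so trivial variants count as one case.
            key = " ".join(trace.user_input.lower().split())
            if not trace.outcome_ok and key not in self.cases:
                self.cases[key] = trace
                added += 1
        return added


suite = EvalSuite()
batch = [
    Trace("Reset my password", outcome_ok=True),
    Trace("Talk like a pirate", outcome_ok=False),
    Trace("talk  like a PIRATE", outcome_ok=False),  # same input after normalization
]
added = suite.curate(batch)
print(added, sorted(suite.cases))
```

Running `curate` on every new batch of traces is one simple reading of "always-on": the suite keeps absorbing the unusual failures that a handcrafted dataset would miss. A real system would likely need a stronger notion of novelty than exact match after normalization.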
The eval calcification problem and the link to Karpathy's auto-research
Koc's coinage: "eval calcification." The term names the phenomenon by which static datasets harden over time and drift away from actual agent behavior. He got a laugh by adding, "I want this as a paper title."
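One rough way to notice calcification is to measure how much recent production traffic has no close counterpart in the static eval set. The sketch below is an assumption-laden illustration rather than anything presented in the talk: the token-overlap (Jaccard) similarity, the 0.5 threshold, and the function names are all hypothetical choices.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a deliberately crude stand-in for real similarity."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def uncovered_fraction(
    eval_inputs: list[str],
    production_inputs: list[str],
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> float:
    """Share of production inputs with no eval case at or above the threshold."""
    if not production_inputs:
        return 0.0
    eval_tokens = [tokens(e) for e in eval_inputs]
    uncovered = 0
    for prod in production_inputs:
        prod_tokens = tokens(prod)
        best = max((jaccard(prod_tokens, et) for et in eval_tokens), default=0.0)
        if best < threshold:
            uncovered += 1
    return uncovered / len(production_inputs)


static_set = ["reset my password", "cancel my subscription"]
this_week = [
    "reset my password please",
    "cancel my subscription",
    "write a poem about my invoice",
    "can you talk like a pirate",
]
drift = uncovered_fraction(static_set, this_week)
print(drift)  # 0.5: half of this week's inputs have no close eval case
```

Tracking this number week over week gives a simple signal: if it keeps rising while the eval set stays fixed, the dataset is hardening relative to how the agent is actually used.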
As a hint at a solution, he cites Karpathy's auto-research concept — "set a goal, set a target, the machine tunes itself" (11:31). Applied to evaluation, the inversion is that the starting point is no longer the evaluation dataset or "answer set" — the end state (what the user wants to achieve) is the eval. Evals draw closer to code, with the machine working in between.
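The inversion described above, where the end state rather than an answer set is the eval, can be sketched as a predicate over the state the agent leaves behind. This is a minimal illustration under assumptions: `IntentEval`, `run_eval`, the `State` dictionary, and the toy refund agent are hypothetical names, not part of OpenCode, Comet, or Karpathy's work.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical representation of "the world after the agent ran".
State = dict[str, object]


@dataclass(frozen=True)
class IntentEval:
    """An eval defined by the end state the user wants, not by an expected answer."""
    intent: str
    reached: Callable[[State], bool]


def run_eval(agent: Callable[[str], State], spec: IntentEval) -> bool:
    """Run the agent on the intent and check whether the desired end state holds."""
    return spec.reached(agent(spec.intent))


def toy_agent(intent: str) -> State:
    """Stand-in agent: pretends to issue a refund whenever the intent mentions one."""
    if "refund" in intent.lower():
        return {"refund_issued": True, "amount": 20}
    return {"refund_issued": False}


refund_eval = IntentEval(
    intent="Refund my last order of 20 dollars",
    reached=lambda s: s.get("refund_issued") is True and s.get("amount") == 20,
)
passed = run_eval(toy_agent, refund_eval)
print(passed)  # True
```

Because the check looks only at the outcome, the agent is free to change its prompts, tools, or skills between runs without invalidating the eval, which is the property a self-modifying harness needs.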
Video outline
- (00:00) Opening, the friendly canary, the VR-vomit anecdote
- (01:15) Speaker introduction, evaluation work at Comet, benchmark operations for large enterprises
- (01:32) The industry joke "evals are dead" and its half-truth
- (01:53) The software engineering evaluation lineage (unit test, regression, CI/CD, chaos engineering)
- (02:49) The current state of AI/DS evaluation — static benchmarks plus handcraft plus offline eval, with chaos engineering absent
- (04:21) "AI apps are not static, but we treat them as if they were"
- (05:00) The Adaptive Testing for LLM Evals paper introduced
- (05:34) Software itself becoming malleable — OpenCode's self-evolution
- (06:30) "Random word pharmacology" — prompt engineering
- (07:30) Context engineering, partial evaluation made possible by RAG plus tool calling
- (08:36) Intent engineering, machines self-optimizing from intent
- (09:50) Difficulty of evaluating the intentful machine — each user's experience is different
- (10:21) Eval needs are higher than ever; a rebuttal to "observability is dead"
- (11:31) Intent-based outcomes, with rubric / self-curating from traces / always-on / telemetry-in-the-loop as the four components
- (12:18) The eval calcification problem, link to Karpathy's auto-research
- (13:30) The 80/20 rule — the adaptive 20% breaks the business
- (13:59) Treat evals as a living agent — not a snapshot but a self-optimizing solution
- (14:30) Closing — not a sales pitch, but a conceptual map to take home
Sources
From the AI Engineer Europe 2026 official YouTube playlist. The video ID is available on the AI Engineer official channel.
Glossary
- Malleable Evals
- The evaluation approach Vincent Koc proposes. Not a static benchmark, but an evaluation design treated as a live system that evolves with the agent. Composed of four elements: self-curating eval suites from traces, always-on optimization, self-healing via telemetry-in-the-loop, and intent-based outcomes.
- Eval calcification
- A term Koc coined for the phenomenon where static datasets harden over time and drift away from actual agent behavior. "I want this as a paper title," he said, getting a laugh.
- Intent Engineering
- The latest stage (2026) in the lineage that runs from prompt engineering (2023) through context engineering (2024 to 25). Machines self-optimize against intent (the state the user wants to reach). Evaluation must follow intent as it evolves.
- 80/20 problem
- Koc's framing. 80% of agent behavior can be adequately captured by static eval, but the remaining 20% — the adaptive behaviors ("customers using it strangely," "unexpected question patterns") — is where production incidents originate. Designing evals that self-evolve to catch that 20% is the essence of malleable evals.
- OpenCode
- An OSS coding agent harness led by SST. Launched in 2026 as the open counterpart to Claude Code, distinguished by an architecture that self-generates skills per task and evolves the harness itself. Vincent Koc is a core contributor. Cited in the talk as a working example of "the harness itself being malleable."