Building Agents That Run for Hours — Anthropic's Ash Prabaker × Andrew Wilson on Long-Running Agent Design (AI Engineer Europe 2026 Workshop)

AI Engineer Europe 2026 (London) Day 1 Workshop — Ash Prabaker × Andrew Wilson / Anthropic · May 18, 2026

Ash Prabaker · 17:50 "The frontier doesn't really shrink, it just moves. As models get stronger, the harness itself doesn't disappear — it evolves into the next hard place."

AI Engineer Europe 2026 (London) Day 1 Workshop (held April 8, 2026, published May 18, 2026, approximately 1h 15m 40s). Instructors: Ash Prabaker and Andrew Wilson (London-based Solution Architect), both with Anthropic's Applied AI team. Starting from Anthropic's official November 2025 blog post "Building long-running agents," they teach harness design for agents that run continuously for 5–6 hours and beyond, blending the one-year evolution history of Claude Code with the latest GAN-style harness experiments (separating generator + evaluator + planner). The GAN-style harness is an agent design inspired by Generative Adversarial Networks (GANs): a pattern proposed by Anthropic's Applied AI team in 2026 that runs generator (builder) and evaluator (critic) roles in separate context windows. The evaluator actually clicks live pages with Playwright to verify behavior and returns critique to the generator. The key insight is exploiting an asymmetry: "tuning the critic to be harsh is tractable, but tuning the builder to be self-critical is not." The most systematic disclosure of design knowledge inside a frontier lab to date.

This workshop, run by two engineers from Anthropic's Applied AI team, systematically narrates the specific evolutionary history of "why Claude Code went from 20 minutes to days of operation in a year," and additionally discloses the harness patterns currently being experimented with inside a frontier lab — making it the industry's only comprehensive session of its kind. The two-part structure: in the first half (Andrew), Claude Code / Agent SDK releases over the past 12 months are reconstructed from a harness perspective; in the second half (Ash), the next-generation harness experiments separating roles into generator / evaluator / planner are explained.

From the MEMEX editorial perspective, what matters is that this is fully complementary to Tejas Kumar's (IBM) "Harnesses in AI: A Deep Dive". If Tejas's session is a primer that teaches "what is a harness" from the ground up, this is the frontier knowledge of "how to keep evolving the harness." That two different views of harnesses were addressed at the same AI Engineer Europe 2026 records the moment the industry, in 2026, recognized harnesses as a central axis.

Three structural difficulties of long-running agents

The three reasons agents fail in long-running operation, organized at the start by Andrew:

  1. Finite context: the context window is finite to begin with. Amnesia occurs at the start of a session (the agent starts from zero memory); in the middle of a session, context rot (consistency degradation) sets in; near the end of context, context anxiety (rushing to finish) is observable. Context rot is the consistency degradation that occurs when an agent runs in one context window for a long time: information loaded early diverges from how mid- and late-stage processing handles it, and the agent stops behaving like a single agent working in the same context. The term is used by Anthropic's Applied AI team in the November 2025 blog. It is distinct from the physical size of the context window and corresponds to a "drop in concentration."
  2. Weak planning — by default, models are bad at planning. They try to one-shot the whole thing, finish only half a feature, run out of context and leave a half-finished app
  3. Lack of self-evaluation — the least intuitive problem. Models are sycophantic by nature — they "say what users want to hear" — and this trait applies to coding judgments too. Failures appear often: "looking at a half-implemented feature and judging 'yes, it works,'" or "the button was built, but the backend doesn't exist — yet the feature appears complete"

Reconstructing one year of Claude Code from a harness perspective

Andrew's timeline of model + harness co-evolution:

  • Mid-2024 (pre-Claude Code) — release of Sonnet 3.5, artifacts, computer use. The "verify itself by looking at what it built" aha moment. MCP spec shipped
  • February 2025 (Sonnet 3.7, Claude Code research preview) — insider quote: "the goal is to understand how developers use it and apply that to future model improvements." Claude Code was experimentally released from the start
  • May 2025 (Opus 4 / Sonnet 4, Claude Code GA, SDK public) — context management improves, reward hacking around task completion drops
  • July 2025 (Geoffrey Huntley's Ralph Wiggum technique post): the moment "simple prompts fed in a loop to Claude Code CLI" spread across the industry
  • September–October 2025 (Sonnet 4.5, Claude Code 2.0, SDK rename) — model becomes "context-aware" by tracking its own context, checkpoints introduced, Claude Code SDK renamed to Agent SDK (recognition of generality beyond coding)
  • October–November 2025 (Haiku 4.5, Opus 4.5) — the "Sonnet workhorse + Opus planner" pattern becomes economically viable. Sub-agent parallelization becomes practical
  • November 2025 (Skills, progressive disclosure, programmatic tool calling) — three-stage disclosure established: load front matter only, then skill body when needed, then script when further needed
  • Late November 2025 (Anthropic official long-running agents post) — initializer + harness loop architecture officially disclosed
  • December 2025 – February 2026 (Opus 4.6 / Sonnet 4.6, agent teams, server-side compaction, 1M context GA) — Opus 4.6 specialized for planning, Sonnet 4.6 delivers Opus-grade intelligence at Sonnet pricing. Agent teams where sub-agents communicate directly, server-side auto-compaction, 1M context window general availability
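The three-stage disclosure from the November 2025 Skills entry can be pictured with a minimal sketch. The `Skill` class, `build_context`, and the two example skills are hypothetical names invented here for illustration; this is not the Skills file format or any Anthropic API. In the real system the model decides when to load more, whereas this sketch takes the wanted skill as an argument.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Skill:
    name: str
    front_matter: str                 # stage 1: short description, always loaded
    load_body: Callable[[], str]      # stage 2: full instructions, loaded on demand
    load_script: Callable[[], str]    # stage 3: helper script, loaded only if required


def build_context(skills: List[Skill], wanted: str, need_script: bool = False) -> List[str]:
    """Return the text chunks that would be placed in the context window."""
    # Stage 1: every skill contributes only its front matter.
    context = [f"{s.name}: {s.front_matter}" for s in skills]
    for s in skills:
        if s.name == wanted:
            context.append(s.load_body())        # stage 2
            if need_script:
                context.append(s.load_script())  # stage 3
    return context


# Usage with two hypothetical skills; `loaded` records which bodies were read.
loaded: List[str] = []


def body(name: str) -> Callable[[], str]:
    def _load() -> str:
        loaded.append(name)
        return f"<full instructions for {name}>"
    return _load


skills = [
    Skill("pdf", "Fill in PDF forms", body("pdf"), lambda: "<pdf script>"),
    Skill("xlsx", "Edit spreadsheets", body("xlsx"), lambda: "<xlsx script>"),
]
context = build_context(skills, wanted="pdf")
```

The point of the shape is cost: unused skills occupy one line of context each, and the body of the spreadsheet skill is never read when only the PDF skill is needed.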

METR benchmark: Sonnet 3.7 at about 1 hour → Opus 4.6 at 12 hours (12x in one year). This is one of the agent-capability benchmarks operated by METR (Model Evaluation and Threat Research); it measures the task length at which an agent with a minimal scaffold (a simple harness) completes 50% of tasks. For Claude models: about 1 hour for Sonnet 3.7, about 4 hours for Sonnet 4, 12 hours for Opus 4.6. These are "minimal scaffold, 50% task completion" numbers; with a proper long-running agent harness, operation extends far longer (Anthropic cites cases of days).

Initializer + Harness Loop architecture (November 2025 official)

Andrew walks through the architecture Anthropic officially proposed. A vague prompt like "build a Slack clone" is implemented through a series of persistent artifacts + an iterative loop:

  1. Initializer agent — decomposes the vague prompt and creates persistent artifacts: (a) featurelist.json (observation: JSON is less prone to being overwritten than markdown), (b) progress file, (c) git repo init, (d) init script, (e) features-complete flag
  2. Each step of the harness loop:
    • Start in a fresh context window
    • Check present working directory and progress file
    • Run init script (smoke test, server boot, etc.)
    • Pick one incomplete feature
    • Implement
    • Verify in practice with something like Puppeteer
    • Pass → git commit + feature status updated to pass
    • If incomplete features remain, iterate next time in a fresh context window

The important design principles here: always start in a fresh context window; treat persistent artifacts (the file system) as the true memory; implement one feature per step; build the verification loop in. This is the same philosophy as Lawrence Jones's (Incident.io) AI SRE implementation, formalized officially by Anthropic.
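A minimal sketch of the initializer and the loop, assuming the artifacts live in one working directory. Only `featurelist.json` is named in the text; `progress.txt`, the `passes` field, and the `session` callable (a stand-in for one agent run that implements and verifies a feature) are assumptions made here for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List

FEATURES = "featurelist.json"  # name taken from the text
PROGRESS = "progress.txt"      # hypothetical name for the progress file


def initialize(workdir: Path, features: List[str]) -> None:
    """Initializer step: write the persistent artifacts the loop relies on."""
    items = [{"name": name, "passes": False} for name in features]
    (workdir / FEATURES).write_text(json.dumps(items, indent=2))
    (workdir / PROGRESS).write_text("")


def run_loop(workdir: Path, session: Callable[[str, str], bool],
             max_sessions: int = 20) -> int:
    """Run one feature per session; return the number of sessions used."""
    for used in range(max_sessions):
        # Each iteration rebuilds its view of the world from disk only,
        # which is what "fresh context window" means for this sketch.
        items = json.loads((workdir / FEATURES).read_text())
        pending = [item for item in items if not item["passes"]]
        if not pending:
            return used
        feature = pending[0]
        notes = (workdir / PROGRESS).read_text()
        verified = session(feature["name"], notes)
        if verified:
            feature["passes"] = True  # same dict object as in `items`
            (workdir / FEATURES).write_text(json.dumps(items, indent=2))
        with (workdir / PROGRESS).open("a") as log:
            log.write(f"{feature['name']}: {'pass' if verified else 'retry'}\n")
    return max_sessions


# Usage: a fake session that fails "channels" once, then succeeds.
attempts: dict = {}


def fake_session(name: str, notes: str) -> bool:
    attempts[name] = attempts.get(name, 0) + 1
    return not (name == "channels" and attempts[name] == 1)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    initialize(workdir, ["login", "channels"])
    sessions_used = run_loop(workdir, fake_session)
    progress_lines = (workdir / PROGRESS).read_text().splitlines()
```

Nothing survives between iterations except the two files, so a crashed or truncated session costs one retry instead of the whole run. A real harness would also run the init script and commit to git on each pass, which this sketch omits.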

GAN-style Harness — the adversarial structure of Generator vs. Evaluator

In the second half, Ash walks through next-generation harness experiments. The inspiration: Generative Adversarial Networks (GANs) — applying adversarial pressure between the generative and the discriminative side, evolving both together.

Implementation: Generator (builder) and Evaluator (critic) run in separate context windows / separate system prompts. The Evaluator doesn't just read diffs; it actually opens live pages in Playwright and clicks through to verify behavior. The critique returns to the Generator and loops. Structurally different from the self-checking of a typical single-Claude-Code-session.

A foundational concern: "If the evaluator is also an LLM, won't it just rubber-stamp?" Ash's answer: "Tuning a standalone critic to be harsh is tractable; tuning the builder to be self-critical is not." The same asymmetry that exists with humans — easier to criticize an artwork than to make one. A design strategy that exploits the gap between the LLM's "capability as critic" and "capability as generator."
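The separation can be sketched as two roles that each keep a private history and exchange only artifacts and critiques. `Role`, `adversarial_loop`, and the `respond` callables are hypothetical stand-ins: a real harness would call a model in `respond`, and the evaluator would drive a browser with Playwright instead of inspecting a string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Role:
    system_prompt: str
    # (system_prompt, history) -> reply; stand-in for a model call
    respond: Callable[[str, List[str]], str]
    history: List[str] = field(default_factory=list)  # private context window

    def step(self, message: str) -> str:
        self.history.append(message)
        reply = self.respond(self.system_prompt, self.history)
        self.history.append(reply)
        return reply


def adversarial_loop(generator: Role, evaluator: Role, task: str,
                     max_rounds: int = 5) -> Tuple[str, int]:
    """Loop build -> critique until the evaluator passes; return (artifact, rounds)."""
    message = task
    artifact = ""
    for round_no in range(1, max_rounds + 1):
        artifact = generator.step(message)
        verdict = evaluator.step(artifact)  # evaluator sees only the artifact
        if verdict == "PASS":
            return artifact, round_no
        message = f"Critique: {verdict}"
    return artifact, max_rounds


# Usage: the fake generator ships a button without a backend until criticized.
generator = Role(
    "You build features.",
    lambda prompt, history: "button+backend"
    if history[-1].startswith("Critique") else "button",
)
evaluator = Role(
    "You are a harsh critic. Pass only working features.",
    lambda prompt, history: "PASS" if "backend" in history[-1] else "backend missing",
)
artifact, rounds = adversarial_loop(generator, evaluator, "Build a send button")
```

Because the evaluator never sees the generator's reasoning, it cannot be talked into agreeing with it. That independence is what makes a harsh system prompt on the critic side effective.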

Front-End Design Rubric — grading taste

A concrete example of the evaluation rubric Ash introduces (for front-end tasks):

| Criterion | What is verified | Weight as of April 2026 |
|---|---|---|
| Design | visual quality, layout, typography | High (Opus 4.6 is strong on functionality) |
| Originality | distinctiveness, avoiding "purple gradient AI slop" | High |
| Craft | detail finishing, spacing, consistency | Medium |
| Functionality | it works, it meets the spec | Low (Opus 4.6 is already strong) |

Calibration: provide few-shot examples of reference sites, converging the evaluator's taste toward the operator's. A counter to the industry consensus that "taste can't be graded": "if you write your strong opinions down, you can grade it."
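One way to turn the rubric into a single grade is a weighted average. The numeric weights (High = 3, Medium = 2, Low = 1) and the 0 to 10 scale are assumptions made here; the workshop only gives the qualitative High / Medium / Low labels.

```python
from typing import Dict

# Assumed numeric mapping of the rubric's labels: High=3, Medium=2, Low=1.
WEIGHTS: Dict[str, int] = {
    "design": 3,
    "originality": 3,
    "craft": 2,
    "functionality": 1,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one weighted average."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Usage: a fully working but generic page versus a distinctive one.
generic = weighted_score(
    {"design": 4, "originality": 2, "craft": 6, "functionality": 10}
)
distinctive = weighted_score(
    {"design": 8, "originality": 8, "craft": 6, "functionality": 7}
)
```

With these weights a page that works perfectly but looks generic scores below a distinctive page with minor functional gaps, which matches the rubric's intent of no longer rewarding what the model already does well.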

Adding a Planner role — the Contract Negotiation pattern

In the current experiments Ash describes, Planner is added on top of Generator / Evaluator, making it a three-role structure. Key design principle: "the Planner doesn't lock down granular technical detail at once." Reason: a single error in the Planner cascades and amplifies across the entire sprint at multi-hour time scales. Solution: the Planner only produces high-level sprint sequences.

An even more important innovation is Contract Negotiation. Before the Generator starts writing code, it negotiates with the Evaluator what "done" means:

  • Generator: proposes "I'll build feature X, verified by Y"
  • Evaluator: pushes back — "scope too wide," "Y is a weak test," "missing edge case Z"
  • Read and write markdown on disk, iterate until both agree
  • After agreement, Generator implements; Evaluator grades against the agreed contract, not the original spec

This brings into the harness the adversarial pressure the Ralph Wiggum loop (fixed-plan.md style) lacked. A design that achieves "the PM role of turning a user story into testable assertions" without the planner's overspecification.
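The negotiation can be sketched as a propose and review loop that ends when the evaluator has no objections left. `Contract`, `negotiate`, `grade`, and the two callables are hypothetical names for illustration, and the contract is kept in memory here, whereas the workshop describes the agents exchanging markdown files on disk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Contract:
    feature: str
    checks: List[str]     # testable assertions that define "done"
    agreed: bool = False


def negotiate(propose: Callable[[str, List[str]], List[str]],
              review: Callable[[Contract], List[str]],
              feature: str, max_rounds: int = 4) -> Contract:
    """Iterate until the evaluator raises no objections or rounds run out."""
    objections: List[str] = []
    contract = Contract(feature, [])
    for _ in range(max_rounds):
        contract = Contract(feature, propose(feature, objections))
        objections = review(contract)
        if not objections:
            contract.agreed = True
            break
    return contract


def grade(contract: Contract, results: Dict[str, bool]) -> bool:
    """Grade against the agreed contract, not the original spec."""
    return contract.agreed and all(results.get(c, False) for c in contract.checks)


# Usage: the fake evaluator insists on an edge case the generator left out.
def fake_propose(feature: str, objections: List[str]) -> List[str]:
    return ["message appears after send"] + [f"handles: {o}" for o in objections]


def fake_review(contract: Contract) -> List[str]:
    covered = any("empty message" in check for check in contract.checks)
    return [] if covered else ["empty message"]


contract = negotiate(fake_propose, fake_review, "send message")
passed = grade(contract, {check: True for check in contract.checks})
```

Fixing the checks before any code is written gives the evaluator a stable target, so a later critique cannot drift into re-litigating scope.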

Editorial Observation — Harness reshapes the structure of the AI industry

Three angles MEMEX takes on this workshop.

(1) A rare case of systematic public disclosure of design thinking inside a frontier lab. Frontier labs typically publish only model releases; internal experimental harness thinking proceeds in private. The Ash & Andrew workshop fully discloses, over 1h 15m, the implementation patterns Anthropic's Applied AI team is currently testing — generator / evaluator / planner role separation, contract negotiation, GAN-style adversarial pressure. These are likely to be folded into Claude Code's standard features in the coming months, which makes this a preview of "what will become normal".

(2) The thesis that the "harness vs. model" boundary keeps moving. The intuitive prediction — "harnesses disappear as models get better" — is wrong. In practice, the harness evolves to fill the model's weaknesses, then that role gets baked into training data and absorbed into the model, then that part of the harness is removed, and harnessing routes around to a new weakness. Andrew summarizes this iterative structure with Ash's phrase, "the frontier doesn't really shrink, it just moves." It is the micro-implementation of "the stepwise rise of the abstraction layer," running through Karpathy's Software 3.0 discussion.

(3) Implications for MEMEX's AI × economy / AI × politics axis. When autonomous agents that run for 5–6 hours become practical, agent working time close to a developer's one-day workload (8 hours) can be secured in parallel with that developer. This is the technical basis for the productivity-doubling phenomenon observed in Intercom's 2x development speed in 9 months and PFF's post-engineer engineering organization. Meanwhile, the UK government's 10DS Insurgency Model also rests on dropping these long-running agents into internal tooling. Long-running agents are not just a technical topic; they are the infrastructure supporting structural change across economies, organizations, and state institutions.

Video Outline (highlights only)

  • (00:00) Introduction, Anthropic Applied AI team
  • (02:00) Citing Boris's (Claude Code founder) first-anniversary tweet — 20 minutes → days of operation
  • (02:30) Three difficulties of long-running agents (context / planning / self-evaluation)
  • (04:00) Model and harness co-evolution, METR benchmark
  • (05:30) Introduction of Agent SDK primitives
  • (06:00) Claude Code release timeline begins (mid-2024 → February 2025)
  • (08:00) May 2025: Opus 4 / Sonnet 4 / Claude Code GA / SDK public
  • (09:00) Ralph Wiggum technique (Geoffrey Huntley, July 2025)
  • (11:00) September 2025: Sonnet 4.5, context-aware
  • (12:00) October–November 2025: Haiku 4.5, Opus 4.5, Skills + progressive disclosure
  • (13:00) November 2025: walkthrough of the architecture in Anthropic's official long-running agents post
  • (15:00) December 2025 – 2026: Opus 4.6, Sonnet 4.6, agent teams, 1M context
  • (16:00) The "harness doesn't disappear, it evolves" thesis
  • (17:50) Ash enters: "the frontier doesn't really shrink, it just moves"
  • (18:00) GAN-style harness introduction — generator vs. evaluator
  • (19:00) Asymmetry: "critic-tuning is tractable, builder-self-criticism is not"
  • (21:00) The four-criterion front-end design rubric
  • (24:00) Demo of live-page evaluation using Playwright
  • (28:00) Adding the Planner role, sprint decomposition
  • (31:00) Contract negotiation pattern
  • (40:00) Demo: from "build me a retro game maker" prompt to completion
  • (60:00) Verifying the effect of separating generator / evaluator / planner
  • (70:00) Q&A and wrap-up in remaining time

Sources