Tejas Kumar · 04:50 An agent harness is everything around the model that grounds it in reality. The mechanism that anchors a black-box model to a stable environment — that's the harness.
The one takeaway is Tejas's structural argument: from 2026 onward, the headroom for improvement in AI development lies in designing the harness, not in "polishing the prompt." The live demo makes that structure visible: with the same prompt and an old model (GPT 3.5 Turbo), the outcome flips from 0 steps (failure) to 6 steps (success) depending only on the harness.
Tejas Kumar's AI Engineer Europe 2026 talk is a primer aimed at the whole industry that pins down the definition of the rapidly spreading term "harness." It pairs with Anthropic's long-running agent workshop (Ash Prabaker × Andrew Wilson), which spent 1 hour 15 minutes on frontier harness design at the same event; together they give the industry a "harness primer + frontier" set.
From the MEMEX editorial perspective, what matters is that Tejas spelled out a three-year industry-evolution prediction: "2025 = the year of agents, 2026 = the year of the harness, 2027 = the year of the dynamic harness." This gives MEMEX a vantage point for placing individual cases it has observed — Incident.io's AI SRE harness, Intercom's company-wide rollout of Claude Code, Namespace's Continuous Compute — on a single axis of industry evolution.
Why a harness is necessary — in the name of reliability
The problem Tejas opens with: most AI developers are in the position of "renting" a frontier model, paying $20 a month for a context window. The rented model is a black box — even when the UI shows "Opus," if the server falls back to Sonnet, the user can't tell. In an environment with so many uncontrollable variables, the harness exists to make the agent's behavior reliable.
"The purpose of the harness is reliability." The harness's role is to surround the non-determinism of the black box and "anchor it to a stable environment you control."
Etymology — isomorphism with climbing and walking the dog
Tejas uses two analogies to make the concept intuitive:
- A climber's harness — anchoring yourself to a stable mountain so that if you fall, you don't drift too far. "Tying yourself to something stable"
- A walking harness for a dog (a leash) — with the joke "the dog doesn't go and bankrupt you with tokens." Limiting the range of action prevents unexpected cost (= unexpected behavior)
An AI harness has the same structure: by confining the model's behavior to a "controllable range," it secures reliability.
ML harness vs. Agent harness — disentangling the vocabulary
Tejas untangles the current situation, in which "harness" denotes two different things:
- ML harness: the machine-learning-world usage, an established term in ML engineering since the 2010s. It is close to a model test suite plus test runner: a mechanism that feeds in input data and evaluates output quality (accuracy, F1, confusion matrix, etc.). Examples include OpenAI Evals, the AISI evaluation system, and HELM. Per Tejas, it is distinct from the AI engineering world's agent harness.
- Agent harness: the AI-engineering-world usage and the subject of the talk, as Tejas defined it at AI Engineer Europe 2026 (May 17, 2026): "everything that surrounds the model and grounds it in reality." It is the entire infrastructure that surrounds the black box's non-determinism (the property that the same input may produce different results each time) and confines it to a controllable range: tool registry, context management, guardrails, agent loop, and verify step. It is distinct from an ML harness (a test runner).
The six components of an Agent Harness
The standard components Tejas systematizes:
- Tool Registry: the registry of tools the model can call. Each tool's name, argument schema, and execution code is managed on the harness side; typical tools are file-system read/write, bash execution, web search, and external API calls. Production harnesses (Claude Code, Cursor, Codex) typically have tens to hundreds of tools.
- Model: the LLM at the core. Selectable in some setups, fixed in others.
- Context Primitives: the mechanisms for managing the context window, such as automatic compaction (summarizing and compressing older dialogue), selective retention of conversation history, and dynamic adjustment of the token budget. Per Tejas, "almost every production harness automatically compacts its own context." Without this, long-running tasks drain the context window and the agent collapses.
- Guardrails: upper bounds on agent behavior, such as max steps, max tokens, forbidden patterns (commands, URLs, or keywords that must not be executed), and sandboxing scope. They function as a wall against runaway behavior. Tejas's dog-leash analogy ("the dog doesn't go and bankrupt you with tokens") is about this component.
- Agent Loop: the loop that repeats "receive LLM output → execute tool calls → return results to the LLM → receive the next output." Tejas's answer to the common confusion "is the harness the same as the agent loop?" is no: "the harness is not the same thing as the agent loop; the harness is everything around the agent loop." One harness may coordinate multiple agent loops (nested loops).
- Verify Step: the deterministic verification step inserted just before or just after the agent reports "task complete." For a coding agent, that means running lint and tests; for a web task, confirming the state ("did the button press succeed?"). Without it, agents may report false successes, as Tejas's live demo confirmed. The verify step should be implemented in deterministic code (a function or script), not LLM output.
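As a rough illustration of how the six components could fit together, here is a minimal TypeScript sketch. Every name in it (`Harness`, `runHarness`, `ModelFn`, and the message and tool shapes) is an assumption made for illustration, not the code from the talk or the API of any production harness. The model is stubbed as a plain synchronous function so the sketch runs without network access; a real model call would be asynchronous.

```typescript
// Hypothetical shapes; none of these names come from the talk.
type ChatMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };
type ToolCall = { tool: string; args: Record<string, string> };
type ModelOutput = { done: boolean; text: string; call?: ToolCall };

// Stand-in for the LLM; synchronous here only to keep the sketch runnable offline.
type ModelFn = (messages: ChatMessage[]) => ModelOutput;

interface Harness {
  tools: Record<string, (args: Record<string, string>) => string>; // 1. Tool registry
  model: ModelFn;                                                   // 2. Model
  compact: (messages: ChatMessage[]) => ChatMessage[];              // 3. Context primitives
  maxSteps: number;                                                 // 4. Guardrails
  verify: (trace: ToolCall[]) => boolean;                           // 6. Verify step
}

type RunResult = { success: boolean; steps: number; reason: string };

// 5. Agent loop, surrounded by the other components.
function runHarness(h: Harness, system: string, task: string): RunResult {
  let messages: ChatMessage[] = [
    { role: "system", content: system },
    { role: "user", content: task },
  ];
  const trace: ToolCall[] = [];

  for (let step = 1; step <= h.maxSteps; step++) {
    messages = h.compact(messages);
    const out = h.model(messages);
    messages.push({ role: "assistant", content: out.text });

    if (out.done) {
      // The model's own claim is not trusted; deterministic code decides.
      const ok = h.verify(trace);
      return { success: ok, steps: step, reason: ok ? "verified" : "verify step failed" };
    }
    if (out.call) {
      const tool = h.tools[out.call.tool];
      const result = tool ? tool(out.call.args) : `unknown tool: ${out.call.tool}`;
      trace.push(out.call);
      messages.push({ role: "tool", content: result });
    }
  }
  // Guardrail tripped: the loop is killed instead of running forever.
  return { success: false, steps: h.maxSteps, reason: "max steps reached" };
}
```

In this arrangement the loop is only one part of `runHarness`: the registry, compaction, step limit, and verification all sit around it, which mirrors the point that the harness is everything around the agent loop.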
Live demo — Baby's First Harness
The heart of the talk is the live demo. The task: "go to Hacker News and upvote the first post," deliberately using GPT 3.5 Turbo (the 2023 model, weaker than current). The prompt is not touched at all; only the harness is strengthened, in four stages.
Phase 1: no harness — failure plus lying
The first implementation has only a browser session (Chromium launched via Playwright), a few tools (navigate / click), and a simple agent loop. The result: the agent reaches Hacker News, presses the upvote button, hits the login screen, panics, and crashes. Yet it falsely reports "success." That is the state without a harness.
Phase 2: guardrails — max iterations + context compression
Default guardrails are added: max iterations (kill the run after 6 steps) and max messages (compress the context above a threshold). The context compressor is naive: "always keep the system prompt and user prompt; keep only the last 2 messages." Even this alone prevents the "infinite loop" failure mode.
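The Phase 2 guardrails can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the names `compactContext` and `mayContinue` are hypothetical, and the message threshold of 8 is an assumed value, since the talk summary only says "above a threshold." The step limit of 6 and the "keep the last 2 messages" rule come from the description above.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant" | "tool"; content: string };

const MAX_ITERATIONS = 6; // from the demo: kill the run after 6 steps
const MAX_MESSAGES = 8;   // assumed threshold; the source gives no number
const KEEP_LAST = 2;      // from the demo: keep only the last 2 messages

// Naive compaction: always keep the system prompt and the first user prompt,
// then only the last KEEP_LAST of everything else.
function compactContext(messages: ChatMessage[]): ChatMessage[] {
  if (messages.length <= MAX_MESSAGES) return messages;
  const pinned = [
    ...messages.filter((m) => m.role === "system").slice(0, 1),
    ...messages.filter((m) => m.role === "user").slice(0, 1),
  ];
  const rest = messages.filter((m) => pinned.indexOf(m) === -1);
  return [...pinned, ...rest.slice(-KEEP_LAST)];
}

// Guardrail check the loop can call before each step;
// stepsTaken is the number of steps already completed.
function mayContinue(stepsTaken: number): boolean {
  return stepsTaken < MAX_ITERATIONS;
}
```

Dropping the middle of the conversation loses information, which is why the compressor is called naive. Production harnesses tend to summarize older dialogue instead of discarding it.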
Phase 3: extract the harness + add a verify step
The logic is extracted from index.ts into a runHarness() function. A deterministic verifySuccessfulUpvote() is added — it reads the agent's trace history and deterministically checks (a) whether the browser click was on the upvote button, (b) whether it was redirected to the login page, (c) whether login failed. Result: the agent still fails at login, but it no longer lies. It correctly reports "didn't succeed." In Tejas's phrasing: "test-driven development vibes — step one of solving a problem is admitting you have one."
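To make the three checks concrete, here is a minimal sketch of what a deterministic `verifySuccessfulUpvote()` might look like. The function name comes from the talk; everything else is an assumption for illustration: the `TraceEvent` shape, the `Verdict` type, and the `votearrow` and `/login` markers used to recognize the upvote button and the login page.

```typescript
// Hypothetical trace shape; the demo's real trace format is not shown here.
type TraceEvent =
  | { kind: "navigate"; url: string }
  | { kind: "click"; selector: string; urlAfter: string }
  | { kind: "login"; ok: boolean };

type Verdict = { success: boolean; reason: string };

// Assumed markers for the upvote button and the login page.
const UPVOTE_SELECTOR_PART = "votearrow";
const LOGIN_PATH_PART = "/login";

function verifySuccessfulUpvote(trace: TraceEvent[]): Verdict {
  let clickedUpvote = false;
  let redirectedToLogin = false;
  let loginFailed = false;

  // The last event of each kind wins, so a retry after login is judged on its own.
  for (const e of trace) {
    if (e.kind === "click" && e.selector.indexOf(UPVOTE_SELECTOR_PART) !== -1) {
      clickedUpvote = true;                                           // check (a)
      redirectedToLogin = e.urlAfter.indexOf(LOGIN_PATH_PART) !== -1; // check (b)
    } else if (e.kind === "login") {
      loginFailed = !e.ok;                                            // check (c)
    }
  }

  if (!clickedUpvote) return { success: false, reason: "upvote button was never clicked" };
  if (loginFailed) return { success: false, reason: "login failed" };
  if (redirectedToLogin) return { success: false, reason: "upvote click was redirected to login" };
  return { success: true, reason: "upvote click landed without a login redirect" };
}
```

Because the verdict is computed from the recorded trace and not from the model's final message, the agent's own "success" claim never enters the decision.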
Phase 4: add a deterministic login handler — success
The final phase adds a login handler, one of the harness patterns Tejas demonstrated in the talk. It is a hook the harness runs before each step of the agent loop. It checks whether the current URL is a login page: (a) if it isn't, the hook does nothing; (b) if it is, the hook programmatically fills in credentials from a secure source, submits the form, and returns to the original page, all deterministically. With this in place, the upvote completes in 6 steps. Because the LLM never sees the credentials, it's a classic harness design that secures both security and reliability.
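A pre-step login hook of this kind might be sketched as below. This is an illustration under stated assumptions, not the demo's implementation: `BrowserLike`, `LoginCredentials`, and `loginHandler` are hypothetical names, the form selectors are guesses at the login page's markup, and the browser interface is synchronous only so the sketch can run against an in-memory fake. A real Playwright page exposes asynchronous methods, so a real hook would be `async`.

```typescript
// Minimal browser abstraction, assumed for this sketch.
interface BrowserLike {
  currentUrl(): string;
  goto(url: string): void;
  fill(selector: string, value: string): void;
  click(selector: string): void;
}

type LoginCredentials = { username: string; password: string };

// Hypothetical hook, run by the harness before each agent-loop step.
// Returns true if it performed a login, false if it did nothing.
function loginHandler(
  browser: BrowserLike,
  getCredentials: () => LoginCredentials, // e.g. reads a secret store; never placed in the prompt
  returnTo: string                        // page the agent was working on before the redirect
): boolean {
  // (a) Not on a login page: do nothing.
  if (browser.currentUrl().indexOf("/login") === -1) return false;

  // (b) On a login page: fill, submit, and go back, all in plain code.
  const { username, password } = getCredentials();
  browser.fill("input[name='acct']", username); // selectors are assumptions
  browser.fill("input[name='pw']", password);
  browser.click("input[type='submit']");
  browser.goto(returnTo);
  return true;
}
```

The credentials pass from `getCredentials` straight to the browser and are never added to the message history, so they stay outside the model's context.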
To be explicit: this isn't a rejection of prompt engineering. Prompts matter. But with the harness missing, no amount of prompt polishing raises the reliability floor. The important fact is that the prompt was never changed. What flipped the outcome from 0 steps (failure) to 6 steps (success) was not the intuitive fix of "strengthen the prompt" or "change the system prompt" but assembling the harness. This is the same insight as "the bottleneck is not context but guidance" from Pedro Rodrigues' (Supabase) three skill design principles, demonstrated from a different angle.
Industry evolution prediction — 2025 / 2026 / 2027
Tejas's three-year prediction at the close:
- 2025 = year of agents: the year the term "agent" spread across the industry
- 2026 = year of harnesses: the year the harness for running agents in production becomes the industry's central topic, as shown by "harness" being used 52,000 times at AI Engineer Europe 2026
- 2027 = year of dynamic on-the-fly harnesses: the stage where, on being given a task, the agent first dynamically generates its own harness before starting work. Per Tejas: "plan mode on steroids: the agent is self-aware, identifies where it's likely to hallucinate, builds a harness, and erects guardrails." Because it requires metacognition (the ability to cognize one's own cognition), the empirical basis with 2026 LLMs is thin; Tejas himself prefaces it with "this includes personal wishes."
The 2027 prediction is offered, with Tejas's personal hopes included, as the "next logical step" toward AGI. If the dynamic harness holds, agents wouldn't merely operate under reliability constraints — they'd acquire the self-improving capacity to design their own reliability.
Editorial reading — the harness as a MEMEX industry axis
Three angles for taking this talk into MEMEX.
(1) Tejas's role as a definer of industry vocabulary. Technical terms originating from frontier labs (Anthropic, Google DeepMind, OpenAI) reflect only those labs' angles. By having Tejas, as IBM DevRel, define "what a harness is" to the whole industry, the recognition gap between ML engineering veterans and AI engineering newcomers narrows. This is industry contribution from a DevRel as a vocabulary organizer, comparable to Anthropic's Project Glasswing or Anthropic's official advocacy of Skills. Tejas presents IBM OpenRAG (mentioned at 17:30) — an enterprise-oriented OSS — as a worked example, signaling IBM's positioning as a builder of "industry-wide implementation foundations" distinct from frontier labs.
(2) The shift from "strengthen the prompt" to "assemble the harness." Since the spread of ChatGPT, developer problem-solving has centered on "prompt engineering = polish the system prompt." Tejas's demo shows that with the same prompt the result flips from 0 steps (failure) to 6 steps (success), demonstrating that the axis of improvement is shifting from prompt to harness. This is the harness-version implementation of the same industry thesis — "single-shot generation has a ceiling; solve it with structure" — seen in Mehedi Hassan's (Granola) "Cannot one-shot it" and Pedro Rodrigues' skill design.
(3) The basis for MEMEX observing 2026 as "the year of the harness" is now in place. Tejas's prediction + Anthropic's long-running agent workshop + Incident.io's AI SRE product + Namespace's Continuous Compute + PFF's post-engineer org — these line up as data points showing "the harness is the differentiator of the agent era." The basis for "harness" surfacing as a major 2026 cluster on the MEMEX network graph became clear in Tejas's framing. That said, MEMEX is not in a position to confidently endorse the 2027 prediction (dynamic harness) at this point. Whether metacognition-bearing agents are implementable lacks empirical grounding as of 2026; the piece treats this with a reserved stance, marked for review a year later.
Video outline
- (00:00) Self-introduction; IBM AI Developer Advocate; the Watson models
- (01:00) "Who can talk confidently about harnesses?" — a few hands go up
- (01:45) Why a harness is needed — developers are in the position of "renting" a frontier model
- (03:00) "The purpose of the harness is reliability"
- (03:30) The climbing and dog-walking analogies
- (04:00) Disentangling ML harness vs. Agent harness
- (04:50) "An agent harness is everything that surrounds the model and grounds it in reality"
- (05:30) The six components of an Agent Harness (tool / model / context / guardrails / loop / verify)
- (07:30) Live demo begins — Hacker News upvote with GPT 3.5 Turbo
- (08:30) Phase 1: no harness; failure at login + a false report
- (10:30) Phase 2: guardrails (max iterations, context compression)
- (12:30) Phase 3: extracted as a harness + verify step kills the lying
- (15:00) Phase 4: deterministic login handler succeeds — upvote completed in 6 steps
- (17:00) Emphasis: "the prompt was never changed"
- (17:30) The IBM OpenRAG example — a production enterprise-harness implementation
- (18:00) Industry-evolution prediction — 2025 agent / 2026 harness / 2027 dynamic harness
- (19:30) The view that the dynamic harness is the next logical step toward AGI
- (20:00) Close; pointer to the slides on GitHub
Key quotes
- (04:50) "An agent harness is everything around the model that grounds it in reality. The mechanism that anchors a black-box model to a stable environment — that's the harness." — Tejas's core definition
- (03:00) "The purpose of the harness is reliability." — the purpose compressed to one word
- (around 12:30) "test-driven development vibes — step one of solving a problem is admitting you have one." — the line as the verify step dismantled the agent's lying in Phase 3
- (17:00) "The prompt was never changed from the start." — emphasized across all four demo phases
- (18:00) "2025 = year of agents, 2026 = year of harnesses, 2027 = year of dynamic on-the-fly harnesses" — the three-year prediction
- (19:30) "The dynamic harness is the next logical step toward AGI" — a long-term prediction prefaced with "personal wishes included"
- (05:30) "The harness is not the same thing as the agent loop; the harness is everything around the agent loop." — an explicit answer to the common confusion
Critical perspective — caveats on the 2027 prediction
Tejas's 2025 / 2026 predictions are already backed by observable data (the industry-wide adoption of "agent," "harness" becoming a central topic). The 2027 prediction (dynamic on-the-fly harness), however, comes with MEMEX-side reservations.
The premise of the dynamic harness is that an agent has metacognition (= the ability to cognize its own cognition) — that it can "identify where it's likely to hallucinate." On 2026 LLM benchmarks (e.g., SimpleQA / TruthfulQA / HaluEval), performance on this is limited. Tejas himself prefaces with "personal wishes included" (19:30). MEMEX takes the position of reviewing the 2027 prediction a year later.
Related resources
- Anthropic Ash Prabaker × Andrew Wilson long-running agent workshop
- Incident.io AI SRE harness
- Intercom's company-wide rollout of Claude Code
- Namespace's Continuous Compute
- Pedro Rodrigues' (Supabase) three skill design principles
- Mehedi Hassan's (Granola) "Cannot one-shot it"
- Anthropic Project Glasswing
- PFF's post-engineer org
- Speaker profile: Tejas Kumar