AI Harnesses Deep Dive — Tejas Kumar (IBM) systematizes the case for '2026 as the year of the harness' (AI Engineer Europe 2026)

AI Engineer Europe 2026 (London) — Tejas Kumar / IBM · May 17, 2026

Tejas Kumar · 04:50 An agent harness is everything around the model that grounds it in reality. The mechanism that anchors a black-box model to a stable environment — that's the harness.

Harnesses in AI: A Deep Dive — Tejas Kumar, IBM (AI Engineer Europe 2026)

AI Engineer Europe 2026 (London, published May 17, 2026, around 20m 26s). The speaker is Tejas Kumar (IBM AI Developer Advocate, on the DevRel side for the Watson models). An 18-minute deep dive that systematizes "what an AI harness is" from first principles, and live-builds a "baby's first harness" — using GPT 3.5 Turbo (the 2023 model) to complete a Hacker News upvote. He separates two often-confused concepts (ML harness vs. Agent harness) and offers an industry-evolution prediction: "2025 = the year of agents, 2026 = the year of the harness, 2027 = the year of the dynamic on-the-fly harness."

The one takeaway: from 2026 onward, the headroom for improvement in AI development isn't "polishing the prompt" but designing the harness — Tejas's structural argument. The live demo, where the same prompt and an old model (GPT 3.5 Turbo) flip from 0 steps (failure) to 6 steps (success) depending only on the harness, makes the structure visible.

Tejas Kumar's AI Engineer Europe 2026 talk is a primer aimed at the whole industry, organizing the definition of the rapidly spreading term "harness." It sits as a pair with the long-running agent Anthropic workshop (Ash Prabaker × Andrew Wilson), which spent 1 hour 15 minutes on frontier harness design at the same event — together delivering the "harness primer + frontier" set to the industry.

From the MEMEX editorial perspective, what matters is that Tejas spelled out a three-year industry-evolution prediction: "2025 = the year of agents, 2026 = the year of the harness, 2027 = the year of the dynamic harness." This gives MEMEX a vantage point for placing individual cases it has observed — Incident.io's AI SRE harness, Intercom's company-wide rollout of Claude Code, Namespace's Continuous Compute — on a single axis of industry evolution.

Why a harness is necessary — in the name of reliability

The problem Tejas opens with: most AI developers are in the position of "renting" a frontier model, paying $20 a month for a context window. The rented model is a black box — even when the UI shows "Opus," if the server falls back to Sonnet, the user can't tell. In an environment with so many uncontrollable variables, the harness exists to make the agent's behavior reliable.

"The purpose of the harness is reliability." The harness's role is to surround the non-determinism of the black box and "anchor it to a stable environment you control."

Etymology — isomorphism with climbing and walking the dog

Tejas uses two analogies to make the concept intuitive:

A climber's harness — anchoring yourself to a stable mountain so that if you fall, you don't drift too far. "Tying yourself to something stable"
A walking harness for a dog (a leash) — with the joke "the dog doesn't go and bankrupt you with tokens." Limiting the range of action prevents unexpected cost (= unexpected behavior)

An AI harness has the same structure: by confining the model's behavior to a "controllable range," it secures reliability.

ML harness vs. Agent harness — disentangling the vocabulary

Tejas organizes the current state where "harness" denotes two different things:

ML harness — the ML-world usage. Closer to a model test suite + test runner. Feeds in input and evaluates output quality. This is the ML engineering world's concept
Agent harness — the AI-engineering-world usage. Refers to all the infrastructure surrounding the model (tools, context management, guardrails, verify step). This is the subject of the talk

The six components of an Agent Harness

The standard components Tejas systematizes:

Tool Registry — the registry of tools the model can call. Production harnesses like Claude Code, Cursor, and Codex hold file-system read/write and bash execution as tools
Model — the LLM at the core. Selectable in some setups, fixed in others
Context Primitives — context window management. Automatic compaction, conversation history compression, and so on. "Almost every production harness automatically compacts its own context"
Guardrails — limits on max steps, max tokens, forbidden patterns, and so on
Agent Loop — the answer to the common confusion of "is the harness the same as the agent loop?": "No, the harness is around the agent loop. It can hold further loops around the loop"
Verify Step — verification after completion (e.g., lint + test execution for a coding agent)

Live demo — Baby's First Harness

The heart of the talk is the live demo. The task: "go to Hacker News and upvote the first post," deliberately using GPT 3.5 Turbo (the 2023 model, weaker than current). The prompt is not touched at all; only the harness is strengthened, in four stages.

Phase 1: no harness — failure plus lying

The first implementation has only a browser session (Chromium launched via Playwright), a few tools (navigate / click), and a simple agent loop. The result: reach Hacker News → press the upvote button → encounter the login screen and panic → crash. But the agent reports "success" as a lie. That's the state without a harness.

Phase 2: guardrails — max iterations + context compression

Default guardrails are added. Max iterations (kill after 6 steps), max messages (compress context above a threshold). A naive context compressor: "always keep the system prompt and user prompt; keep only the last 2 messages." Even just this prevents the "infinite loop" failure mode.

Phase 3: extract the harness + add a verify step

The logic is extracted from index.ts into a runHarness() function. A deterministic verifySuccessfulUpvote() is added — it reads the agent's trace history and deterministically checks (a) whether the browser click was on the upvote button, (b) whether it was redirected to the login page, (c) whether login failed. Result: the agent still fails at login, but it no longer lies. It correctly reports "didn't succeed." In Tejas's phrasing: "test-driven development vibes — step one of solving a problem is admitting you have one."

Phase 4: add a deterministic login handler — success

Login handler pattern

To be explicit: this isn't a rejection of prompt engineering. Prompts matter. But with the harness missing, no amount of prompt polishing raises the reliability floor. The important fact: the prompt was never changed. Not the intuitive solution of "strengthen the prompt" or "change the system prompt" — assembling the harness flipped the outcome from 0 steps (failure) to 6 steps (success). This is the same insight as "the bottleneck is not context but guidance" from Pedro Rodrigues' (Supabase) three skill design principles, demonstrated from a different angle.

Industry evolution prediction — 2025 / 2026 / 2027

Tejas's three-year prediction at the close:

2025 = year of agents — the year the term "agent" spread across the industry
2026 = year of harnesses — as the fact that "harness" was used 52,000 times at AI Engineer Europe 2026 shows, the year the harness for running agents in production becomes the industry's central topic
dynamic on-the-fly harness = the 2027 prediction — the stage where the agent dynamically generates its own harness before starting work

The 2027 prediction is offered, with Tejas's personal hopes included, as the "next logical step" toward AGI. If the dynamic harness holds, agents wouldn't merely operate under reliability constraints — they'd acquire the self-improving capacity to design their own reliability.

Editorial reading — the harness as a MEMEX industry axis

Three angles for taking this talk into MEMEX.

(1) Tejas's role as a definer of industry vocabulary. Technical terms originating from frontier labs (Anthropic, Google DeepMind, OpenAI) reflect only those labs' angles. By having Tejas, as IBM DevRel, define "what a harness is" to the whole industry, the recognition gap between ML engineering veterans and AI engineering newcomers narrows. This is industry contribution from a DevRel as a vocabulary organizer, comparable to Anthropic's Project Glasswing or Anthropic's official advocacy of Skills. Tejas presents IBM OpenRAG (mentioned at 17:30) — an enterprise-oriented OSS — as a worked example, signaling IBM's positioning as a builder of "industry-wide implementation foundations" distinct from frontier labs.

(2) The shift from "strengthen the prompt" to "assemble the harness." Since the spread of ChatGPT, developer problem-solving has centered on "prompt engineering = polish the system prompt." Tejas's demo shows that with the same prompt the result flips from 0 steps (failure) to 6 steps (success), demonstrating that the axis of improvement is shifting from prompt to harness. This is the harness-version implementation of the same industry thesis — "single-shot generation has a ceiling; solve it with structure" — seen in Mehedi Hassan's (Granola) "Cannot one-shot it" and Pedro Rodrigues' skill design.

(3) The basis for MEMEX observing 2026 as "the year of the harness" is now in place. Tejas's prediction + Anthropic's long-running agent workshop + Incident.io's AI SRE product + Namespace's Continuous Compute + PFF's post-engineer org — these line up as data points showing "the harness is the differentiator of the agent era." The basis for "harness" surfacing as a major 2026 cluster on the MEMEX network graph became clear in Tejas's framing. That said, MEMEX is not in a position to confidently endorse the 2027 prediction (dynamic harness) at this point. Whether metacognition-bearing agents are implementable lacks empirical grounding as of 2026; the piece treats this with a reserved stance, marked for review a year later.

Video outline

(00:00) Self-introduction; IBM AI Developer Advocate; the Watson models
(01:00) "Who can talk confidently about harnesses?" — a few hands go up
(01:45) Why a harness is needed — developers are in the position of "renting" a frontier model
(03:00) "The purpose of the harness is reliability"
(03:30) The climbing and dog-walking analogies
(04:00) Disentangling ML harness vs. Agent harness
(04:50) "An agent harness is everything that surrounds the model and grounds it in reality"
(05:30) The six components of an Agent Harness (tool / model / context / guardrails / loop / verify)
(07:30) Live demo begins — Hacker News upvote with GPT 3.5 Turbo
(08:30) Phase 1: no harness; failure at login + a false report
(10:30) Phase 2: guardrails (max iterations, context compression)
(12:30) Phase 3: extracted as a harness + verify step kills the lying
(15:00) Phase 4: deterministic login handler succeeds — upvote completed in 6 steps
(17:00) Emphasis: "the prompt was never changed"
(17:30) The IBM OpenRAG example — a production enterprise-harness implementation
(18:00) Industry-evolution prediction — 2025 agent / 2026 harness / 2027 dynamic harness
(19:30) The view that the dynamic harness is the next logical step toward AGI
(20:00) Close; pointer to the slides on GitHub

Key quotes

(04:50) "An agent harness is everything around the model that grounds it in reality. The mechanism that anchors a black-box model to a stable environment — that's the harness." — Tejas's core definition
(03:00) "The purpose of the harness is reliability." — the purpose compressed to one word
(around 12:30) "test-driven development vibes — step one of solving a problem is admitting you have one." — the line as the verify step dismantled the agent's lying in Phase 3
(17:00) "The prompt was never changed from the start." — emphasized across all four demo phases
(18:00) "2025 = year of agents, 2026 = year of harnesses, 2027 = year of dynamic on-the-fly harnesses" — the three-year prediction
(19:30) "The dynamic harness is the next logical step toward AGI" — a long-term prediction prefaced with "personal wishes included"
(05:30) "The harness is not the same thing as the agent loop; the harness is everything around the agent loop." — an explicit answer to the common confusion

Critical perspective — caveats on the 2027 prediction

Tejas's 2025 / 2026 predictions are already backed by observable data (the industry-wide adoption of "agent," "harness" becoming a central topic). The 2027 prediction (dynamic on-the-fly harness), however, comes with MEMEX-side reservations.

The premise of the dynamic harness is that an agent has metacognition (= the ability to cognize its own cognition) — that it can "identify where it's likely to hallucinate." On 2026 LLM benchmarks (e.g., SimpleQA / TruthfulQA / HaluEval), performance on this is limited. Tejas himself prefaces with "personal wishes included" (19:30). MEMEX takes the position of reviewing the 2027 prediction a year later.

Related resources

Sources