Debugging AI with AI — the internal tools behind Incident.io's AI SRE product, by Lawrence Jones (AI Engineer Europe 2026)

AI Engineer Europe 2026 (London) — Lawrence Jones / Incident.io · May 17, 2026

Lawrence Jones · 16:23 File systems are exceptionally good agent context. Downloading everything and handing it over as a filesystem was overwhelmingly more effective than putting MCP on top, or using a Computer Use agent.

The talk was given at AI Engineer Europe 2026 in London; the recording was published May 17, 2026 and runs about 17m 29s. The speaker, Lawrence Jones, is a founding engineer at Incident.io and a former platform engineer at GoCardless; he also spoke at LDX3 London 2025 on "Becoming AI engineers". Incident.io has been building its AI SRE product (full automation of production investigations) for 1.5–2 years, with customers including Netflix, Etsy, and Skyscanner. The talk is a technical deep dive into Incident.io's internal tools, organized around three axes, and into the design idea behind them: "managing the complexity of AI products with AI itself".

Lawrence Jones's AI Engineer Europe 2026 talk is an on-the-ground answer to a problem facing any company that seriously operates production agents: "humans can no longer debug AI systems". Incident.io's AI SRE product started from parts of incident response and now aims for full automation of production investigations. It is a complex system in which a single investigation runs hundreds to thousands of prompts and multi-step agents in coordination, cross-referencing logs, metrics, traces, and past incident data, and reaching into the codebase. The talk covers the internal tool design that lets the team "evaluate and improve while production is running."

From the MEMEX editorial perspective, what matters is that this is a complete implementation example, built over 1.5 years by a single product team, of the "eval architecture" discussed in Laurie Voss (Arize) — Ship Real Agents. The "agent parallel execution" discussion from Hugo Santos × Madison Faulkner — CI/CD is Dead also takes concrete shape here on the backtest-analysis side.

Evals are AI unit tests — but production evals break

Lawrence's framing: "Evals are AI unit tests. You take a prompt, give it input, produce output, and decide pass/fail with grading criteria." At Incident.io they live as YAML files next to the Go prompt code. They are the mechanism that lets the team prove "this really does what I intended" before a prompt change ships.
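The "prompt in, output out, pass/fail against criteria" shape can be sketched in a few lines. This is a minimal illustration in Python, not Incident.io's code (theirs is Go with YAML eval files); `EvalCase`, `run_eval`, the substring-based `must_contain` criteria, and the stand-in `fake_summarise_prompt` are all hypothetical names chosen for the example, and a real grader would likely be richer than a substring check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One focused test case: an input plus the criteria the output must meet."""
    name: str
    prompt_input: str
    must_contain: list[str]  # hypothetical grading criteria: required substrings


def run_eval(prompt_fn: Callable[[str], str], case: EvalCase) -> bool:
    """Run the prompt on the case input and grade the output pass/fail."""
    output = prompt_fn(case.prompt_input).lower()
    return all(term.lower() in output for term in case.must_contain)


def fake_summarise_prompt(text: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    first_sentence = text.split(".")[0]
    return f"Summary: {first_sentence}."


case = EvalCase(
    name="mentions-the-failing-service",
    prompt_input="Checkout API returned 500s after the 14:02 deploy. Rollback fixed it.",
    must_contain=["checkout api"],
)
print(case.name, "PASS" if run_eval(fake_summarise_prompt, case) else "FAIL")
```

The point of the shape is that each case states one thing it checks, which is what makes it usable as a regression test when the prompt changes.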

Early on, the team built a "steal an eval from production" button: the moment the AI starts misbehaving, that case can be downloaded and added to the eval suite. But a problem surfaced: "production evals are not good evals." An ideal unit test is focused, with a clear statement of what it is testing, whereas a production case drags in the entire incident report (megabytes of YAML). As a result, the coding agent overflows its context limit immediately and can no longer work with the eval suite.

The fix is an eval-tool CLI: a small utility that lets the agent operate on the eval suite through commands such as "list test cases," "edit a specific case," "add," and "replace," without loading whole files into context. Combined with a runbook (a skill), it means that handing a prompt issue to the coding agent is enough for the agent to run the TDD red-green cycle on its own: (1) add the failing eval, (2) fix the prompt, (3) iterate until the eval passes, (4) re-run the full suite to check that no existing eval is broken, (5) consolidate the prompt (to prevent bloat). This is the on-the-ground implementation of "encode the opinionated workflow into a skill" from Pedro Rodrigues' (Supabase) three skill design principles.
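A rough sketch of both halves, the narrow suite operations and the red-green loop, might look like the following. Everything here is an assumption made for illustration: `EvalSuite`, `red_green`, and the toy grader and reviser are hypothetical, the real tool is a CLI over YAML files, and step (5), consolidating the prompt, is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalSuite:
    """In-memory stand-in for a YAML eval file: case name -> case body."""
    cases: dict[str, str] = field(default_factory=dict)

    def list_cases(self, preview_chars: int = 40) -> list[str]:
        """One short line per case, so a listing never floods the agent's context."""
        return [f"{n}: {b[:preview_chars]}" for n, b in sorted(self.cases.items())]

    def add(self, name: str, body: str) -> None:
        if name in self.cases:
            raise KeyError(f"case already exists: {name}")
        self.cases[name] = body

    def replace(self, name: str, body: str) -> None:
        if name not in self.cases:
            raise KeyError(f"no such case: {name}")
        self.cases[name] = body


def red_green(suite: EvalSuite, name: str, body: str, prompt: str,
              run_case: Callable[[str, str], bool],
              revise: Callable[[str, str], str], max_rounds: int = 5) -> str:
    """Add a failing case, revise the prompt until it passes, then re-run the suite."""
    suite.add(name, body)                      # (1) add the failing eval
    rounds = 0
    while not run_case(prompt, body):          # (3) iterate until green
        if rounds == max_rounds:
            raise RuntimeError(f"{name} still failing after {max_rounds} revisions")
        prompt = revise(prompt, body)          # (2) fix the prompt
        rounds += 1
    broken = [n for n, b in suite.cases.items() if not run_case(prompt, b)]
    if broken:                                 # (4) no existing eval may regress
        raise RuntimeError(f"revision broke existing evals: {broken}")
    return prompt


def toy_run_case(prompt: str, body: str) -> bool:
    """Toy grader: a case passes when the prompt mentions its keyword."""
    return body in prompt


def toy_revise(prompt: str, body: str) -> str:
    """Toy 'fix': append the missing instruction to the prompt."""
    return f"{prompt} Always mention {body}."


suite = EvalSuite({"existing": "severity"})
final_prompt = red_green(suite, "new-regression", "rollback",
                         "Report severity.", toy_run_case, toy_revise)
print(final_prompt)
print(suite.list_cases())
```

In practice `run_case` and `revise` would be the eval runner and the coding agent; the loop and the full-suite check are the parts a runbook can pin down.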

Turning the UI into a downloadable filesystem — the biggest unlock

This is the most important insight of the talk. Behind a single chatbot interaction at Incident.io sit 10+ agents, 50+ prompts, and hundreds of tool calls. The investigations product is even bigger: one investigation expands to hundreds or thousands of prompts. The team initially built a trace UI, an internal tool that lets humans follow those prompts and tool calls by visualizing each prompt's input/output, the chain of tool calls, and the hierarchy of sub-agents. At agent scale, though, humans simply do not have the time to trace everything by hand, so the team needed to delegate debugging to an agent.

When Anthropic released Claude Code, it became clear that agents were extraordinarily good with a filesystem + bash tool. Lawrence's team's judgment: "Then let's make all the UI information downloadable as a filesystem." Trace, prompts, tool calls, sub-agent hierarchy — all handed to a single sandboxed Claude Code as markdown and text. The agent then traces a self-documenting structure, cross-references back into the codebase, and can even propose "fix this prompt this way."
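As a minimal sketch of what "downloadable as a filesystem" could mean, the snippet below writes a nested trace to a directory tree of markdown and text files. The trace shape (`name`, `prompt`, `tool_calls`, `sub_agents`), the file names, and `export_trace` are assumptions made for this example; the talk does not specify Incident.io's actual layout.

```python
import tempfile
from pathlib import Path


def export_trace(node: dict, root: Path) -> None:
    """Write one agent node as a directory, recursing into its sub-agents."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "prompt.md").write_text(f"# {node['name']}\n\n{node['prompt']}\n")
    for i, call in enumerate(node.get("tool_calls", []), start=1):
        (root / f"tool_call_{i:03d}.txt").write_text(
            f"{call['tool']}({call['args']})\n-> {call['result']}\n"
        )
    for child in node.get("sub_agents", []):
        export_trace(child, root / "sub_agents" / child["name"])


# Hypothetical trace data, for illustration only.
trace = {
    "name": "investigation",
    "prompt": "Find the likely cause of the checkout errors.",
    "tool_calls": [
        {"tool": "search_logs", "args": "service=checkout", "result": "412 errors"},
    ],
    "sub_agents": [
        {"name": "deploy-checker", "prompt": "List deploys in the last hour.", "tool_calls": []},
    ],
}

out_dir = Path(tempfile.mkdtemp()) / "trace"
export_trace(trace, out_dir)
for path in sorted(out_dir.rglob("*")):
    if path.is_file():
        print(path.relative_to(out_dir))
```

Because the directory hierarchy mirrors the sub-agent hierarchy, an agent with a bash tool can navigate it with ordinary commands such as `ls`, `cat`, and `grep`.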

"Layering MCP on top, or using Computer Use, wasn't even half as effective," Lawrence says flatly. The filesystem is the strongest context you can hand an agent in 2026. This practical finding lines up exactly with Barry Zhang × Mahesh Murag's (Anthropic) "skills = folder" philosophy. "If it can be expressed in ASCII, whether a trace, a UI, or anything else, the agent will reach it."

The backtest pipeline — thousands of investigations analyzed in parallel

The third axis is the backtest analysis pipeline. Incident.io runs thousands of investigations a day across hundreds of customer accounts. A backtest yields a headline number such as "86% accuracy", but without understanding why it went up or down, the team can't improve it.

The fix: a structured analysis pipeline built in a repo called "Scrapbook." It runs in Claude Code, with a markdown playbook defining the agent's steps. Key design points:

  • 25 sub-agents launched in parallel — each investigation is analyzed individually, producing "failure type" and "what could be improved"
  • Cohort clustering stage — at a meta level, "the same type of failure" is clustered, organizing the trends across a customer account as a whole
  • Saved as files in stages — resumable if stopped at any point
  • Codebase integration — when an issue is found, the agent proposes "fix this part of the code this way," and the implementation plus eval red-green verification happens in the same session
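The staged structure in the list above (parallel per-investigation analysis, results saved as files, then cohort clustering) can be sketched as follows. This is an illustration under stated assumptions rather than the Scrapbook implementation: the real pipeline runs Claude Code sub-agents from a markdown playbook, whereas here a thread pool stands in for the 25 parallel workers, and `analyse`, the failure labels, and the sample data are hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def analyse(investigation: dict) -> dict:
    """Stand-in for one sub-agent: label the failure type of one investigation."""
    if investigation["correct"]:
        failure = "none"
    elif not investigation["found_deploy"]:
        failure = "missed-deploy"
    else:
        failure = "wrong-root-cause"
    return {"id": investigation["id"], "failure_type": failure}


def analyse_stage(investigations: list[dict], out_dir: Path, workers: int = 25) -> None:
    """Stage 1: analyse in parallel, one result file each; existing files are skipped."""
    out_dir.mkdir(parents=True, exist_ok=True)
    todo = [inv for inv in investigations if not (out_dir / f"{inv['id']}.json").exists()]

    def work(inv: dict) -> None:
        (out_dir / f"{inv['id']}.json").write_text(json.dumps(analyse(inv)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, todo))  # list() surfaces any worker exception


def cohort_stage(out_dir: Path) -> dict[str, list[str]]:
    """Stage 2: group the per-investigation results by failure type."""
    cohorts: dict[str, list[str]] = defaultdict(list)
    for path in sorted(out_dir.glob("*.json")):
        result = json.loads(path.read_text())
        cohorts[result["failure_type"]].append(result["id"])
    return dict(cohorts)


# Hypothetical backtest data, for illustration only.
investigations = [
    {"id": "inv-1", "correct": True, "found_deploy": True},
    {"id": "inv-2", "correct": False, "found_deploy": False},
    {"id": "inv-3", "correct": False, "found_deploy": False},
    {"id": "inv-4", "correct": False, "found_deploy": True},
]
results_dir = Path(tempfile.mkdtemp()) / "stage1"
analyse_stage(investigations, results_dir)
analyse_stage(investigations, results_dir)  # a re-run finds every file and does no work
cohorts = cohort_stage(results_dir)
accuracy = len(cohorts.get("none", [])) / len(investigations)
print(f"accuracy={accuracy:.0%}", cohorts)
```

Writing one file per investigation is what makes the run resumable, and the cohort view is what turns a single accuracy figure into a ranked list of failure types to work on.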

This is the concrete example, on the evaluation side, of the "agent parallel execution breaks PR-serializing CI" discussion in Hugo Santos × Madison Faulkner — CI/CD is Dead. A structure where 25 agents create one customer's worth of analysis in parallel and aggregate at a cohort stage is the archetype of microservices-with-agents.

Editorial reading — "AI-ify the internal tools first"

Three angles for taking this talk into MEMEX.

(1) A reference implementation for production agent operations. Incident.io is among the most mature examples of "how to scale an AI SRE," with 1.5–2 years of practice behind it. The trio of eval-tool CLI, filesystem-as-context, and backtest pipeline is a template for every AI product team facing similar complexity.

(2) The leading edge of the AI security defender. The title "Debugging AI with AI" is a precise summary of the incident response domain: detecting attacks and failures with AI, and responding with AI. It is the flip side of the warning at Anthropic's Project Glasswing that "LLMs can exploit cyber vulnerabilities." In 2026, with attackers' AI use accelerating, if defenders don't AI-ify at the same speed, the equilibrium tips. Incident.io is on the front line of that defender side.

(3) Internal tools as a second product — an organizational shift. "Rewriting internal tools to be agent-ready" isn't just an efficiency win; it's a redefinition of engineering organization architecture itself. This is the on-the-ground manifestation of "organize your environment for agents" — the same philosophy that runs through Mike Spitz's (PFF) post-engineer engineering org and Brian Scanlan's (Intercom) 2x in nine months.

Video outline

  • (00:00) Self-introduction — the scale of Incident.io's AI SRE product
  • (01:30) Why "telling good reports from bad ones" is not humanly tractable
  • (03:30) Known building blocks (prompts / evals / scorecards / traces / datasets / backtests)
  • (04:00) Evals = AI unit tests, YAML structure, "steal an eval" from production
  • (06:40) Why production evals break — context overflow
  • (07:30) Eval-tool CLI + runbook = agents self-iterating on prompts
  • (09:00) The real complexity of the chatbot — 10+ agents, multi-tier sub-agents
  • (11:00) The unlock of converting UI into a filesystem
  • (12:30) "The filesystem is the strongest context for an agent"
  • (13:00) The problem with backtest analysis — you can't improve from "86% accuracy" alone
  • (13:30) Scrapbook — 25 sub-agents in parallel + cohort clustering pipeline
  • (15:00) The flow where the agent auto-proposes a PR
  • (16:00) Closing — "AI-ify the internal tools first"
  • (17:00) Incident.io is hiring

Sources