Sally-Ann Delucia · 14:00 "Agents don't fail because of the prompt — they fail because of the context. Early on it was all prompt engineering. Now it's context engineering."
Sally-Ann is Head of Product at Arize, formerly a data scientist. She introduces herself as "a PM, but also a part-time AI engineer." A session in which she frankly shares — failures included — the implementation patterns for context management she discovered through the recursive process of "building Alex with Alex."
The core of the talk is a three-stage trial: Naive Truncation → Summarization → Smart Truncation + Memory. Plus an additional pattern: "offloading the heavy work to sub-agents." There's also an industry anecdote — when Claude Code leaked in early 2026, the strategy turned out to be the same strategy independently discovered by Anthropic Claude Code's internal team.
From the MEMEX perspective, what matters is that this is the production version of Karpathy's "context engineering" concept. Karpathy proposed in mid-2025 on X that "context engineering > prompt engineering," and Sally-Ann reinforces it a year later (2026) with a production case — "we experienced exactly that."
Key Observations
The vicious cycle of "building Alex with Alex" (03:45 – 04:45)
Sally-Ann's team made the design call: "If we can build an agent that helps construct our own applications, users will want it too." So they built Alex using Alex. And ran into a vicious cycle.
- Alex runs on trace + span data → span data balloons → context ceiling → Alex fails → adds its own failure data to the span → re-runs → fails again → ...
- "The system that analyzes the data is constrained by that data" (04:30)
- At that rate Alex would never succeed
The discovery led to Sally-Ann's core thesis: "context management is not an engineering problem; it's a product / UX problem." "If the agent doesn't have the right data and context, it gives a bad answer. If it gives a bad answer, nobody uses the product" (03:30).
Three failures — Truncation, Summarization, and... (05:00 – 07:50)
A frank record of Arize's trials:
Attempt 1: Naive Truncation (05:15). "Take just the first 100 characters of the context blob, throw the rest away." Worked on simple queries — broke completely on follow-ups. "'Which input is the most?' → answer → 'Tell me more about B' → Alex has no idea what we're talking about" (05:45). Over-truncation destroys reasoning.
Attempt 2: Summarization (06:17). "LLMs are good at summarization — summarize the whole context and pass that in." Consistency too low. "No control over what's important — entirely up to the LLM" (06:30). Failed.
Attempt 3: Smart Truncation + Memory (07:03). The strategy that stuck:
- Keep Head 100 characters + Tail 100 characters (first and last)
- Cut the middle and save it to a memory store
- Alex can retrieve from memory if it decides it needs to
- "Context decides what the model sees, memory decides what survives" (07:49)
Sally-Ann's evaluation: "Haven't touched it in months — it's working." Simple implementation, but the design that "lets Alex access memory at its own discretion" is decisive.
Long Session Evals — detecting failures in long conversations (08:00 – 08:50)
An important supporting mechanism. "Users don't tend to restart the chat — with Claude, with Cursor, they do everything in a single chat." And "the longer the conversation, the later failures show up" (08:20). Sally-Ann's finding: Alex loses memory late in conversations, and the team didn't notice until users reported it.
The solution: Long Session Evals. "Load 10 turns and test the 11th." This turns "memory loss in a long conversation" into a testable bug. It can be wired into automated test suites. "You don't have to wait for users to report it" (08:40).
Sub-agents — "Don't put all the context in one place" (09:00 – 10:30)
Alex's final architecture. "Offload the heavy work to a separate agent."
A concrete example: search tasks. "Alex searches Arize data. Hundreds of spans in one trace, multiple queries, intermediate reasoning. None of that needs to live in the main conversation" (09:30).
Architecture:
- Main agent: lightweight, chat + context only
- Sub-agent: holds heavy data, handles search and processing
- Main delegates to sub, only results pass back
- Memory store also available on demand
A parallel architecture to Anthropic's Skills (Barry Zhang × Mahesh Murag). The industry's converging pattern of "don't cram everything into one — split into specialized sub-units."
The Claude Code leak — same strategy, independently discovered (12:57)
An industry anecdote in the video. "When the Claude Code code leaked (early 2026), everyone could read it. We were expecting to see secrets. Instead — the same truncation + compression strategy" (13:00). A (slightly disappointing) confirmation that "our own implementation independently arrived at the same landscape as the industry's top."
The same structural phenomenon as the independent convergence Hinton/Sejnowski, Karpathy, and Boris arrived at. The whole industry is converging on the same solutions out of real constraints — LLM context windows plus reasoning capability.
"Context engineering > prompt engineering" (00:55 – 02:30)
Sally-Ann opens the video by quoting Karpathy's 2025 post on X. "Context engineering over prompt engineering."
Sally-Ann's own formulation: "The best context strategy is one in which the agent remembers what it needs and forgets what it doesn't" (02:30). "Not 'how many characters fit,' but 'what do we show strategically'" (02:00). A framing that operationalizes Karpathy's concept from a product-manager perspective.
Related Articles
- Granola: Don't try to one-shot it — same Code Summit, in-house tracing tool
- Trigger.dev: Durable Agents — same Code Summit, a different angle on state management
- Karpathy: Software 3.0 — origin of the "context engineering" concept
- Anthropic Skills — a parallel pattern to sub-agents
- Raindrop Agent Observability — Arize's (?) competitor in observability platforms
Key Quotes
- "Agents don't fail because of the prompt — they fail because of the context" (14:00)
- "The best context strategy is one in which the agent remembers what it needs and forgets what it doesn't" (02:30)
- "Context management is not an engineering problem; it's a product / UX problem" (03:30)
- "The system that analyzes the data is constrained by that data" (04:30, the core of the vicious cycle)
- "Context decides what the model sees, memory decides what survives" (07:49)
- "Long Session Evals — load 10 turns, test the 11th, the bug becomes testable" (08:30)
- "When Claude Code leaked, we expected secrets — instead, the same truncation + compression strategy" (13:00)
- "Summarization not working was surprising — no control over consistency" (06:30)
Sources
Hierarchical Memory — AI Engineer official (YouTube)
Related resources:
Glossary
- Arize
- An AI observability platform company. Provides tracing, evaluation, and debugging tools for LLM applications. Develops Arize Phoenix (OSS observability library) and Arize Alex (AI agent harness). One of the main players in the AI observability space, alongside Raindrop.
- Arize Alex
- Arize's AI agent harness. Runs on top of their own observability platform with 40+ skills, prompt optimization, data generation, annotation, and other workflows. Arize built it through the recursive development of "building Alex with Alex."
- Context Engineering
- A concept Karpathy proposed in mid-2025 on X. The industry's shift in recognition that, more than "prompt engineering" (optimizing a single instruction string), what matters is "strategically choosing what to show the model." Sally-Ann's formulation: "a strategy in which the agent remembers what it needs and forgets what it doesn't."
- Smart Truncation + Memory
- The context-management strategy adopted in Arize Alex. (1) Keep Head 100 characters + Tail 100 characters of the conversation, (2) cut the middle and save it to a memory store, (3) let the agent itself retrieve from memory on demand. Simple, but "giving the agent discretion" is decisive.
- Long Session Evals
- A test methodology Arize introduced. "Load 10 turns and test the 11th." A mechanism to automatically test outputs that "lose memory over a long conversation" — without waiting for user reports. A new standard pattern for QA on LLM products.
- Sub-agent
- The design pattern of "offload heavy work to a separate agent." The main agent is a lightweight chat; the sub-agent handles heavy-data search and processing. Only results return to the main. A converging industry pattern adopted in Anthropic's Skills, Claude Code, Cursor, and more.