Durable Agents — Replay vs Snapshot (Trigger.dev Eric Allam)

AI Engineer Code Summit (NYC) · May 10, 2026

Eric Allam · 11:00 "For 30 years, stateless compute has been the core of backend infrastructure. Agents are forcing the transition to stateful compute."

Approximately 16 minutes. A session by Trigger.dev co-founder Eric Allam, who summarizes 30 years of backend infrastructure history and proposes a new paradigm for the agent era.

The video moves through 30 years of backend infrastructure history → today's challenges → a proposed solution, and systematizes two approaches (Replay vs Snapshot) for implementing "durable agents." It reads as a declaration of an infrastructure paradigm shift addressed to the whole industry, and it matters both conceptually and for implementers.

Eric's argument in one line: Agent = Session ≠ Transaction. The backends of the past 30 years were "stateless compute + state in the DB," but agents are long-running sessions (days → weeks) that need to retain local machine state. This is the first fundamental paradigm shift for backend infrastructure in 30 years.

Key Observations

30 years of backend history — from CGI (1993) to agents (00:42 - 04:30)

Eric's historical summary is excellent:

  • 1993: CGI — fork a new process for each HTTP request, completely stateless
  • PHP / LAMP stack: processes are reused, but the principle is "request + DB = response," state lives in the DB
  • "Shared Nothing" architecture: the compute layer is stateless, state only in the DB. Ruby on Rails, Node.js, serverless — all the same
  • 10–15 years ago: workflow / durable execution engines invent the Replay model. Each side effect is wrapped in a step and its result is cached, making send-email, charge-credit-card, and the like idempotent
  • 2023: LLMs arrive — initially they fit as one step in a workflow
  • 2024+: Tool calling, agent loop — a reversal from "code controls the LLM" to "the LLM controls the code"
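
The Replay model from the list above can be sketched as follows. This is a minimal, synchronous illustration under stated assumptions: `ReplayContext`, `Journal`, `World`, and `runWithRetry` are hypothetical names, not the API of Temporal, Inngest, or Trigger.dev, and a real engine would persist the journal and run steps asynchronously.

```typescript
// Hypothetical in-memory journal: step key -> cached result.
type Journal = Map<string, unknown>;

class ReplayContext {
  private journal: Journal;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // Run a side effect once; on a later replay, return the journaled result instead.
  step<T>(key: string, effect: () => T): T {
    if (this.journal.has(key)) {
      return this.journal.get(key) as T;
    }
    const result = effect();
    this.journal.set(key, result);
    return result;
  }
}

// Counters standing in for the outside world (emails sent, charge attempts).
interface World {
  emailsSent: number;
  chargeAttempts: number;
}

function workflow(ctx: ReplayContext, world: World): string {
  const receipt = ctx.step("send-email", () => {
    world.emailsSent += 1;
    return `email-${world.emailsSent}`;
  });
  ctx.step("charge-credit-card", () => {
    world.chargeAttempts += 1;
    if (world.chargeAttempts === 1) {
      throw new Error("transient failure"); // simulated crash on the first attempt
    }
    return "charged";
  });
  return receipt;
}

// On failure, re-execute the whole function; journaled steps are skipped.
function runWithRetry(journal: Journal, world: World): string {
  try {
    return workflow(new ReplayContext(journal), world);
  } catch {
    return workflow(new ReplayContext(journal), world);
  }
}
```

In this sketch the retry re-runs `workflow` from the top, but the email is sent only once because its result is already in the journal.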

Having summarized 30 years of accumulation in five minutes, he then asks: "So can the Replay model make an agent loop durable?"

"The Replay model doesn't work for agents" (05:53 - 07:15)

Eric's central claim. If you try to make an agent durable with the Replay model:

  • Each LLM call is recorded as a step in the journal; each tool call as a step
  • On resume, the entire function re-executes, and steps are retrieved from cache
  • As the agent loop grows, the log grows and grows
  • You hit a fundamental system limit — number of entries, or entry size
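
The growth described in the list above can be made concrete with a little arithmetic. The sketch below assumes two journal entries per loop iteration (one LLM call, one tool call) and one resume per iteration; both figures and the function names are illustrative assumptions, not numbers from the talk.

```typescript
// Assumption: each agent loop iteration journals one LLM call and one tool call.
const STEPS_PER_TURN = 2;

// Journal entries accumulated after a given number of loop iterations.
function journalSize(turns: number): number {
  return turns * STEPS_PER_TURN;
}

// If the function is resumed once per turn, each resume reads back every
// entry journaled so far, so total replay work grows quadratically.
function entriesReplayedAcrossResumes(turns: number): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += journalSize(t - 1);
  }
  return total;
}

// Number of turns a session can run before a hypothetical entry-count limit is hit.
function turnsUntilLimit(maxEntries: number): number {
  return Math.floor(maxEntries / STEPS_PER_TURN);
}
```

Under these assumptions the journal grows linearly with the session, the replay cost grows quadratically, and any fixed entry limit caps how long a session can live.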

The decisive observation: "Replay gives you durable transactions, but an agent isn't a transaction; it's a session" (06:58 - 07:15). A multi-step workflow has a start and an end, whereas a session continues for as long as the user wants. That is a structural difference. "METR says the time over which agents can do meaningful work doubles every 4–7 months. Today it's a few hours; soon it will be days, then weeks" (06:33).
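
As a rough check on the doubling claim, the time to go from hours to weeks can be computed directly. The sketch assumes "a few hours" means 4 hours and takes one week (168 hours) as the target; both values, and the function name `monthsToReach`, are assumptions made for illustration.

```typescript
// Months needed for a time horizon that doubles every `doublingMonths`
// to grow from `startHours` to `targetHours`.
function monthsToReach(
  startHours: number,
  targetHours: number,
  doublingMonths: number,
): number {
  const doublings = Math.log2(targetHours / startHours);
  return doublings * doublingMonths;
}

const START_HOURS = 4; // assumption: "a few hours" taken as 4
const ONE_WEEK_HOURS = 7 * 24;

const fastest = monthsToReach(START_HOURS, ONE_WEEK_HOURS, 4);
const slowest = monthsToReach(START_HOURS, ONE_WEEK_HOURS, 7);
console.log(`${fastest.toFixed(1)} to ${slowest.toFixed(1)} months`);
```

With these inputs the result is roughly 22 to 38 months, that is, about two to three years.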

Two kinds of state — Context + Execution (07:27 - 09:30)

Eric's framing for the solution. "An agent has two halves":

(1) Context: system messages, user messages, tool calls, tool results, assistant responses — everything that goes in and out of the LLM. This can be made durable as an append-only log, handled with any primitive (DB, object storage, distributed file system).
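
A minimal sketch of the Context half as an append-only log. The `ContextLog` class, the `Role` names, and the newline-delimited JSON layout are assumptions chosen for illustration; the in-memory array stands in for whichever durable primitive is used.

```typescript
type Role = "system" | "user" | "assistant" | "tool_call" | "tool_result";

interface ContextEntry {
  seq: number; // position in the log, assigned at append time
  role: Role;
  content: string;
}

class ContextLog {
  // Stand-in for durable storage: one JSON document per line, never rewritten.
  private lines: string[] = [];

  append(role: Role, content: string): ContextEntry {
    const entry: ContextEntry = { seq: this.lines.length, role, content };
    this.lines.push(JSON.stringify(entry));
    return entry;
  }

  // What would be written to a DB row, an object, or a file.
  serialize(): string {
    return this.lines.join("\n");
  }

  // Rebuild the log from its serialized form, for example after a restart.
  static restore(serialized: string): ContextLog {
    const log = new ContextLog();
    log.lines = serialized === "" ? [] : serialized.split("\n");
    return log;
  }

  entries(): ContextEntry[] {
    return this.lines.map((line) => JSON.parse(line) as ContextEntry);
  }
}
```

Because entries are only ever appended, restoring the context is a matter of reading the log back in order; nothing has to be re-executed.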

(2) Execution: the state of having a GitHub repo cloned, packages installed, a dataset in memory, a dev server running, a sandboxed sub-process. "You can't make this durable with a log" (08:53).

The solution: Snapshot and Restore. "Instead of recreating execution state, snapshot the machine, shut it down, save it to disk; when a user message comes, restore" (09:20). Durability between turns — no need to keep a machine running while a user is at lunch. It also becomes a means of error recovery.
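
The turn-by-turn lifecycle can be sketched with a toy simulation. Everything here is hypothetical: `SessionHost` and `MachineState` are invented names, and a JSON copy of a plain object stands in for a real machine snapshot. The point is only the control flow: restore on a message, run the turn, snapshot, shut down.

```typescript
// Toy stand-in for execution state that a log cannot capture.
interface MachineState {
  files: Record<string, string>; // e.g. a cloned repo, installed packages
  memory: Record<string, number>; // e.g. a dataset loaded in memory
}

class SessionHost {
  private snapshots = new Map<string, string>(); // stand-in for snapshots on disk
  private running = new Set<string>();

  // Restore (or boot) the machine, run one turn, snapshot it, then shut it down.
  handleTurn(sessionId: string, turn: (machine: MachineState) => void): void {
    const saved = this.snapshots.get(sessionId);
    const machine: MachineState =
      saved === undefined
        ? { files: {}, memory: {} }
        : (JSON.parse(saved) as MachineState);
    this.running.add(sessionId);
    try {
      turn(machine);
      // Only a successful turn replaces the snapshot, so a crash rolls back.
      this.snapshots.set(sessionId, JSON.stringify(machine));
    } finally {
      this.running.delete(sessionId); // nothing stays running between turns
    }
  }

  runningCount(): number {
    return this.running.size;
  }

  inspect(sessionId: string): MachineState | undefined {
    const saved = this.snapshots.get(sessionId);
    return saved === undefined ? undefined : (JSON.parse(saved) as MachineState);
  }
}
```

In this sketch a failed turn leaves the previous snapshot untouched, which is one way snapshots can double as error recovery.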

The Snapshot/Restore lineage — IBM 1966 → CRIU (2011) → Firecracker (10:30 - 14:30)

Eric's technical history: "This isn't new. IBM mainframes (1966) already had checkpoint/restore. They didn't want to restart expensive multi-hour jobs from scratch on failure" (11:08).

The stages:

  • 2011: CRIU (Checkpoint/Restore In Userspace): suspend and restore a process from user space by injecting a "parasite" to dump memory. Trigger.dev introduced this in 2024 and ran millions of snapshot/restores
  • 2025+: Firecracker microVM: snapshot the whole machine, not just a process, and pick up where it left off. Trigger.dev migrated in 2025; a default 512 MB machine becomes a 14 MB compressed snapshot

A technical refinement: "Seekable compression — on restore, don't decompress all memory pages, only the ones you need. Plus stratified snapshots" (13:42). Result: snapshots under one second, restores in a few hundred milliseconds.
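
The idea behind seekable compression can be illustrated by compressing each memory page independently and keeping an index of where each page lives in the blob. This is a sketch under stated assumptions, not Trigger.dev's format: the run-length encoder is a toy stand-in for a real compressor, the 4096-byte page size and all names are hypothetical, and memory is assumed to be a whole number of pages.

```typescript
const PAGE_SIZE = 4096;

interface PageRef {
  offset: number; // where the compressed page starts in the blob
  length: number; // compressed length in bytes
}

interface SeekableSnapshot {
  blob: Uint8Array;
  index: PageRef[];
}

// Toy run-length encoding: (count, value) byte pairs, count capped at 255.
function compressPage(page: Uint8Array): Uint8Array {
  const out: number[] = [];
  let i = 0;
  while (i < page.length) {
    const value = page[i];
    let run = 1;
    while (i + run < page.length && page[i + run] === value && run < 255) run++;
    out.push(run, value);
    i += run;
  }
  return Uint8Array.from(out);
}

function decompressPage(data: Uint8Array): Uint8Array {
  const out = new Uint8Array(PAGE_SIZE);
  let pos = 0;
  for (let i = 0; i < data.length; i += 2) {
    out.fill(data[i + 1], pos, pos + data[i]);
    pos += data[i];
  }
  return out;
}

// Compress page by page and record each page's location in the blob.
function snapshotMemory(memory: Uint8Array): SeekableSnapshot {
  const chunks: Uint8Array[] = [];
  const index: PageRef[] = [];
  let offset = 0;
  for (let start = 0; start < memory.length; start += PAGE_SIZE) {
    const chunk = compressPage(memory.subarray(start, start + PAGE_SIZE));
    index.push({ offset, length: chunk.length });
    chunks.push(chunk);
    offset += chunk.length;
  }
  const blob = new Uint8Array(offset);
  chunks.forEach((chunk, i) => blob.set(chunk, index[i].offset));
  return { blob, index };
}

// Decompress only the requested page; every other page stays compressed.
function readPage(snap: SeekableSnapshot, pageNumber: number): Uint8Array {
  const { offset, length } = snap.index[pageNumber];
  return decompressPage(snap.blob.subarray(offset, offset + length));
}
```

A restore built this way can serve a page on first access by calling `readPage` for that page alone, instead of inflating the entire memory image up front.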

"fcrun" — a soon-to-be-open-sourced Firecracker CLI (14:28 - 15:30)

Trigger.dev's in-house fcrun (also fc-run). A Docker-style CLI that runs containers in Firecracker VMs and supports snapshot/restore. "Alpine boots extremely fast, snapshots are extremely fast, and forking a VM is extremely fast" (15:10).

Benchmark: "Time for a VM to be ready to talk to the internet (TTI), 15,000 VM starts per minute, on a par with 30 FPS video rendering" (15:30). An open-source release is coming soon. This makes fcrun one of the central pieces in the industry's race to claim "the standard agent execution environment."

The industry paradigm shift from "stateless to stateful compute" (11:00)

Eric's biggest claim: "For 30 years, stateless compute has been the core of backend infrastructure. And agents are forcing the transition — to stateful compute" (11:00).

This stands alongside Karpathy's Software 3.0 and Boris Cherny's "printing press analogy" as a declaration of an industry paradigm shift. Karpathy: "a change in programming language." Boris: "a change in profession." Eric: "a change in infrastructure." Three independent voices, all arguing, "we are at the entrance to a new era."

Key Quotes

  • "For 30 years, stateless compute has been the core of backend infrastructure. Agents are forcing it to stateful compute" (11:00)
  • "Replay gives you durable transactions, but an agent isn't a transaction — it's a session" (06:58)
  • "An agent has two halves: Context (LLM input/output) and Execution (machine state)" (07:27)
  • "Context is durable as an append-only log; Execution is durable via snapshot/restore" (10:00)
  • "IBM mainframes (1966) already had checkpoint/restore — they didn't want to restart expensive jobs on failure" (11:08)
  • "METR says the time over which agents can do meaningful work doubles every 4–7 months" (06:33)
  • "A 512 MB Firecracker microVM becomes a 14 MB compressed snapshot" (13:42)
  • "15,000 VM starts per minute, on a par with 30 FPS video rendering" (15:30)

Sources

Two Roads to Durable Agents — AI Engineer official (YouTube)

Glossary

Trigger.dev
A platform for deploying and operating background jobs, workflows, and agents in production. Provides "durable execution" for the Node.js / TypeScript ecosystem. The company has watched the transition from "workflows → AI agents" from the inside over the past several years.
Replay model
A workflow execution model adopted by Temporal, Inngest, Trigger.dev v1, and others. Each side effect is wrapped in a "step," and the execution result of each step is cached in a journal. On retry after a failure, already-executed steps are skipped (deterministic replay). Implements durable execution on top of stateless compute.
Snapshot / Restore model
The alternative paradigm Eric proposes. Snapshot the entire machine (memory, files, processes), stop it, and later restore it to pick up where it left off. Unlike Replay, it "saves all execution state," so heavy state such as a cloned GitHub repo or installed packages can be preserved.
CRIU (Checkpoint/Restore In Userspace)
A Linux process checkpoint/restore tool developed in 2011. Dumps a process's memory, file descriptors, and state from user space and restores them later. Originated in the OpenVZ project; later used for container live migration. Trigger.dev introduced it in 2024 and ran "millions of snapshot/restores."
Firecracker microVM
A lightweight virtual machine for serverless, developed by AWS. The foundation of Lambda and Fargate. Fast startup (tens of ms), strong isolation (KVM + minimal OS), memory-snapshot capable. Trigger.dev moved from CRIU because it can snapshot "the whole machine, not just a process."
fcrun
An in-house Docker-style CLI for Firecracker VMs built by Trigger.dev. Scheduled to be open-sourced. Runs containers in Firecracker VMs and supports snapshot/restore. "Time to internet-reachable" reaches 15,000 VMs/minute.
METR
A research organization measuring agent capabilities. Published the observation that "the time horizon over which AI agents can do meaningful work" is doubling every 4–7 months. Currently (May 2026) on the order of hours; projected to extend to days → weeks within a few years.