Reinforcement Learning Industrializes Production — Alessandro Cappelli (Adaptive ML)

AI Engineer Europe 2026 (London) · May 12, 2026

Alessandro Cappelli · 01:38 "95% of GenAI pilots never reach production. The cause is the 'myth of the last mile.'"

Lessons from Trillion Token Deployments at Fortune 500s — Alessandro Cappelli, Adaptive ML

AI Engineer Europe 2026 (Queen Elizabeth II Centre, London, April 8–10, 2026), Abbey track, the April 10 11:40–12:00 session. About 18 minutes. A talk by Alessandro Cappelli (Adaptive ML co-founder, Chief Customer Officer).

Alessandro Cappelli is co-founder and Chief Customer Officer of Adaptive ML. About three years ago he was part of the training team for Falcon (TII), one of the most widely adopted OSS models of the time. The core insight he gained there, that "the decisive gap when bringing OSS models to production is reinforcement learning (RL)," became the starting point for founding Adaptive ML.

The talk's thesis is plain: RL is not one of many post-training algorithms; it is the only algorithm that gets a model to production. The talk dismantles the industry illusion of the "myth of the last mile" (the idea that the hard part is reaching MVP, with production being easy) and argues the case through implementations at large enterprises (AT&T, Manulife, and CCS, a medical-supply company).

1. Counter-evidence to the "myth of the last mile": building an MVP is only the first mile; the real marathon is the road from MVP to production and beyond. Neither prompt tuning nor SFT (supervised fine-tuning) can systematically close that distance. Only RL has the structure to mathematically integrate arbitrary feedback.

2. The economic unlock of RL's outsized performance — the same level of performance can be reached with a far smaller model than SFT requires, producing huge business impact across three axes: inference cost, latency, and ownership. Just for AT&T's call-transcript summarization, the bill runs into millions of dollars a year; if you can handle it with a 10B-class model instead of a Sonnet-class one, you save by an order of magnitude.

3. With an environment, synthetic data is a by-product of the pipeline — the data needed for agent training does not exist on the web, but once an RL environment is in place, "trajectories that the reward signal scored well" become an auto-generated synthetic dataset. Bootstrap via rejection sampling and stream directly into model training.

Key Observations

"95% of GenAI pilots never reach production" — the myth of the last mile (01:38 – 03:30)

Cappelli's starting point. "95% of GenAI pilots can't make it to production. Why? Because of the 'myth of the last mile'" (01:38).

The content of the myth: the industry orthodoxy that "building the MVP is the hard part — production is the final push." This is wrong, the talk argues. "An MVP can be assembled with proprietary models or with SFT on OSS, but neither can systematically improve the solution" (02:18).

Concretely: "With proprietary models, all you can do after finding a defect is change the system prompt — fix one thing and another defect appears, with no systematic means of improvement" (02:38). "With SFT, all you can do is iterate the dataset — expensive — and after production, do you build a new dataset every week?" (02:55).

The reality Cappelli presents: "MVP is only the first mile — the real marathon is MVP → production → beyond. The secret to accelerating that stretch is RL — the only means of systematically integrating arbitrary client feedback / business metrics / environmental rewards into the model lifecycle" (03:13).

"RL is not an equal alternative to SFT — it unlocks outsized performance" (04:22 – 05:33)

Cappelli's case for RL. "Prompt engineering, SFT, and RL all aim at the same goal — steering model behavior — but they are not equally effective" (04:22).

RL's unique value: "the same level of performance with a much smaller model than SFT requires" (04:48). This is decisive for large enterprises:

Economics at scale: AT&T spends millions of dollars a year just on call-transcript summarization. Handling it with a 10B-class model rather than Sonnet / GPT scale saves by orders of magnitude
Latency constraints: customer-support speech-to-speech requires sub-0.5-second response (even half a second feels off), which large models cannot reach
Ownership: trained on your own business data, with no risk of behavior shifting suddenly on a model update
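
The cost argument reduces to simple arithmetic. The following is a minimal sketch, not a figure from the talk: the yearly token volume and both per-million-token prices are hypothetical placeholders chosen only to show how a tenfold price gap turns into a tenfold saving at fixed volume.

```python
def annual_inference_cost(tokens_per_year: float, usd_per_million_tokens: float) -> float:
    """Yearly spend for a workload billed per million tokens."""
    return tokens_per_year / 1_000_000 * usd_per_million_tokens


def savings_factor(large_price: float, small_price: float) -> float:
    """How many times cheaper the small model is for the same token volume."""
    if small_price <= 0:
        raise ValueError("small_price must be positive")
    return large_price / small_price


# Hypothetical inputs, NOT figures from the talk.
TOKENS_PER_YEAR = 1e12     # assumed: one trillion tokens per year
LARGE_MODEL_PRICE = 5.0    # assumed USD per million tokens
SMALL_MODEL_PRICE = 0.5    # assumed USD per million tokens

large = annual_inference_cost(TOKENS_PER_YEAR, LARGE_MODEL_PRICE)
small = annual_inference_cost(TOKENS_PER_YEAR, SMALL_MODEL_PRICE)
factor = savings_factor(LARGE_MODEL_PRICE, SMALL_MODEL_PRICE)
print(f"large: ${large:,.0f}  small: ${small:,.0f}  factor: {factor:.0f}x")
```

Swapping in real contract prices and measured volume is the only change needed to reuse the sketch; the latency and ownership axes are not captured by this calculation.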

"In the agent era, RL's lead widens further" (06:48 – 09:04)

Cappelli's structural argument. "Agents consume more tokens, are more complex, and have less tolerance for error (the agent touches the DB). Production bars rise, and tokenomics become a bigger issue" (07:21).

Here, RL's historical advantage comes through. "RL was originally born to train robots and agents inside an environment — the environment is where the agent's behavior is made coherent. So RL is a natural fit for training agents" (07:54).

Two scenarios: "If you already have an agent workflow (like Manulife), plug your trained model into it directly. If you don't have an environment, build one with mocked tools + an LLM-as-mock-user. The reward combines business outcomes / KPIs / LLM-as-judge" (08:42).
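
The second scenario (mocked tools plus a mock user) can be sketched in a few lines. This is an illustration under stated assumptions, not Adaptive Engine's API: `MOCK_ORDERS`, `lookup_order`, `MockUser`, `run_episode`, and both toy agents are hypothetical names, and the mock user is scripted here where a real setup would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable

# Mocked tool: a fake order database standing in for a real backend (hypothetical).
MOCK_ORDERS = {"A100": "shipped", "B200": "delayed"}


def lookup_order(order_id: str) -> str:
    return MOCK_ORDERS.get(order_id, "not_found")


@dataclass
class MockUser:
    """Scripted stand-in for an LLM-as-mock-user; a real one would call a model."""
    order_id: str

    def opening_message(self) -> str:
        return f"Where is my order {self.order_id}?"

    def is_satisfied(self, agent_reply: str) -> bool:
        # Satisfied only if the reply states the true order status.
        return lookup_order(self.order_id) in agent_reply


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    reward: float = 0.0


Agent = Callable[[str, Callable[[str], str]], str]


def run_episode(agent: Agent, user: MockUser) -> Trajectory:
    """One single-turn episode: the user asks, the agent may call the tool, the outcome is scored."""
    traj = Trajectory()
    message = user.opening_message()
    traj.steps.append(("user", message))
    reply = agent(message, lookup_order)
    traj.steps.append(("agent", reply))
    # Business-outcome reward: 1.0 if the request was resolved, else 0.0.
    traj.reward = 1.0 if user.is_satisfied(reply) else 0.0
    return traj


def tool_using_agent(message: str, tool: Callable[[str], str]) -> str:
    order_id = message.rstrip("?").split()[-1]
    return f"Your order is {tool(order_id)}."


def guessing_agent(message: str, tool: Callable[[str], str]) -> str:
    return "Your order is on its way."


print(run_episode(tool_using_agent, MockUser("B200")).reward)
print(run_episode(guessing_agent, MockUser("B200")).reward)
```

The agent that actually calls the mocked tool earns the reward and the one that guesses does not, which is the signal an RL loop would then optimize against.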

"With an environment, synthetic data springs out as a by-product" (09:38 – 10:34)

The answer to the data problem. "The biggest concern in client conversations: there is no data for agent training" (09:38). Agent training data does not exist in the wild — it cannot be scraped from the web.

RL's structural answer: "Once an environment + reward are set up, a synthetic-dataset pipeline forms automatically. Run the environment and trajectories are generated — extract the ones the reward judged good via rejection sampling, and you have a dataset for bootstrapping the model" (10:08).
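
Rejection sampling as described here amounts to "generate many, keep the well-scored ones." A minimal sketch, assuming a generator and a reward function are already available; `toy_generate`, `toy_reward`, and `CANDIDATES` are hypothetical stand-ins for a real policy and a real environment reward.

```python
import random
from typing import Callable


def rejection_sample(
    generate: Callable[[random.Random], str],
    reward: Callable[[str], float],
    n_samples: int,
    threshold: float,
    seed: int = 0,
) -> list[tuple[str, float]]:
    """Generate n_samples trajectories and keep those scoring at or above threshold."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        trajectory = generate(rng)
        score = reward(trajectory)
        if score >= threshold:
            kept.append((trajectory, score))
    return kept


# Toy stand-ins (hypothetical): a "policy" that picks a canned reply at random
# and a reward that prefers replies confirming the action taken.
CANDIDATES = [
    "Refund issued.",
    "I cannot help.",
    "Refund issued and confirmed by email.",
]


def toy_generate(rng: random.Random) -> str:
    return rng.choice(CANDIDATES)


def toy_reward(trajectory: str) -> float:
    return 1.0 if "Refund issued" in trajectory else 0.0


dataset = rejection_sample(toy_generate, toy_reward, n_samples=20, threshold=1.0)
print(len(dataset), "of 20 trajectories kept")
```

The kept pairs are what would be written out as the bootstrap dataset; the threshold controls the trade-off between dataset size and quality.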

Existing data can also be reused. "Real customer–agent conversation transcripts can be used to train the mock user. In the medical-supply company case, even panicked customers (situations where 'call an ambulance' is the right response) can be reproduced as mocks" (10:34 – 11:01).

"Human-in-the-loop isn't an annotation campaign — it's rubric design" (11:25 – 13:10)

Cappelli's realism. "RLHF (= Reinforcement Learning from Human Feedback) became famous through ChatGPT, but behind the nice name 'human in the loop' lies a high-cost annotation campaign. In our experience, nobody wants to do annotation — it's either expensive or useless" (11:25).

The Adaptive Engine approach: "The human role is limited to defining the rubric, designing the LLM-as-judge system prompt, and tuning scenarios. Minutes to hours — not weeks of repetition" (12:55).
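
To make the "humans only define the rubric" division of labor concrete, here is a minimal sketch. It assumes a judge that answers with one `name: score` line per criterion; the rubric contents, the prompt wording, and `stub_judge` (a stand-in for a real LLM call) are all hypothetical and not taken from Adaptive Engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str
    weight: float


# Human-authored rubric (contents are illustrative, not from the talk).
RUBRIC = [
    Criterion("tone", "Polite and calm throughout.", 1.0),
    Criterion("guidelines", "Follows the stated business guidelines.", 2.0),
]


def build_judge_prompt(rubric: list[Criterion], transcript: str) -> str:
    """Render the rubric into a prompt asking for one 0-10 score per criterion."""
    lines = [
        "Score the transcript from 0 to 10 on each criterion.",
        "Answer with one 'name: score' line per criterion.",
        "",
    ]
    lines += [f"- {c.name}: {c.description}" for c in rubric]
    lines += ["", "Transcript:", transcript]
    return "\n".join(lines)


def parse_scores(rubric: list[Criterion], judge_output: str) -> dict[str, float]:
    """Parse 'name: score' lines; missing or malformed criteria score 0."""
    scores = {c.name: 0.0 for c in rubric}
    for line in judge_output.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in scores:
            try:
                scores[name] = min(10.0, max(0.0, float(value)))
            except ValueError:
                pass  # leave the default of 0.0 for unparseable scores
    return scores


def rubric_reward(rubric: list[Criterion], transcript: str,
                  judge: Callable[[str], str]) -> float:
    """Weighted mean of judge scores, normalized to [0, 1]."""
    scores = parse_scores(rubric, judge(build_judge_prompt(rubric, transcript)))
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / (10.0 * total_weight)


# Stub judge standing in for a real LLM call.
def stub_judge(prompt: str) -> str:
    return "tone: 8\nguidelines: 5"


print(rubric_reward(RUBRIC, "Agent: Happy to help.", stub_judge))
```

Changing behavior then means editing `RUBRIC` or the prompt text, which is the minutes-to-hours task the talk describes, rather than labeling examples one by one.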

How rewards are assembled:

Systematic reward: does the code execute, is the syntax correct — machine-checkable
Direct KPIs: maximize CCS's containment rate (= the percentage of calls completed end-to-end inside the agent) directly
Open-ended qualitative evaluation: tone correctness, adherence to business guidelines — substitute with LLM-as-judge
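
The three reward sources above can be combined into a single scalar. The talk does not say how they are combined, so the weighted sum and its default weights below are assumptions, as are the function names; the syntax check uses Python's own parser as one example of a machine-checkable signal.

```python
import ast


def syntax_ok(code: str) -> bool:
    """Machine-checkable reward: does the snippet parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def containment_rate(call_outcomes: list[bool]) -> float:
    """Share of calls completed inside the agent (True = no human escalation)."""
    if not call_outcomes:
        return 0.0
    return sum(call_outcomes) / len(call_outcomes)


def combined_reward(
    systematic: float,
    kpi: float,
    judge: float,
    weights: tuple[float, float, float] = (0.2, 0.5, 0.3),  # assumed weights
) -> float:
    """Weighted sum of three signals, each expected in [0, 1]."""
    for value in (systematic, kpi, judge):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each signal must lie in [0, 1]")
    w_sys, w_kpi, w_judge = weights
    return w_sys * systematic + w_kpi * kpi + w_judge * judge


reward = combined_reward(
    systematic=float(syntax_ok("x = 1")),
    kpi=containment_rate([True, True, False, True]),
    judge=0.6,  # e.g. a normalized LLM-as-judge score
)
print(round(reward, 3))
```

Keeping each signal in [0, 1] before weighting makes the weights directly comparable, which is one simple way to express business priorities in the reward.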

"PPO orchestrates 4 LLMs at once" — the complexity of RL (13:35 – 14:38)

Cappelli's honest caveat. "The catch with RL is that RL is genuinely hard. It's not changing a prompt or building an SFT dataset — the complexity is on another order" (13:35).

Concrete example: "PPO (a representative RL algorithm) requires orchestrating four LLMs at once, not one" (13:55). This is the reason for Adaptive Engine — "leave decisions like the rubric to the client; hide the complexity of RL behind pre-built recipes" (14:18).
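
Why four models? In the standard RLHF-style PPO setup, each update needs outputs from a policy, a frozen reference model, a value model, and a reward model. The sketch below works on plain lists of per-token numbers to show where each one enters. It is a simplified illustration, not Adaptive Engine's recipe: advantages use undiscounted returns rather than GAE, sequences are assumed non-empty, and `beta` and `clip_eps` are conventional but assumed defaults.

```python
import math


def shaped_rewards(policy_logps, ref_logps, reward_model_score, beta=0.1):
    """Per-token reward: a KL penalty against the reference model on every token,
    plus the reward model's sequence score added on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    rewards[-1] += reward_model_score
    return rewards


def advantages(rewards, values):
    """Undiscounted return-to-go minus the value model's estimate.
    (Real PPO implementations usually use GAE; this is the simplest variant.)"""
    out, running = [], 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        running += r
        out.append(running - v)
    return list(reversed(out))


def ppo_clip_loss(new_logps, old_logps, advs, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over tokens."""
    total = 0.0
    for new, old, adv in zip(new_logps, old_logps, advs):
        ratio = math.exp(new - old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advs)


# Toy numbers standing in for the four models' outputs on a two-token response.
policy_logps = [-1.0, -2.0]   # policy model (being trained)
ref_logps = [-1.0, -1.5]      # frozen reference model
values = [0.5, 0.5]           # value model
score = 1.0                   # reward model

rewards = shaped_rewards(policy_logps, ref_logps, score)
advs = advantages(rewards, values)
print(ppo_clip_loss(policy_logps, policy_logps, advs))
```

Even in this toy form, one gradient step depends on four separate forward passes, which is the orchestration burden the talk points to.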

In contrast to the "vibe coding from agents" direction Karpathy presented at AI Ascent, Adaptive ML's focus is industrializing RL in enterprise production environments. Individual-developer productivity and Fortune 500 production operations require infrastructure on different orders of magnitude.

Video Outline

(00:00) Self-introduction, overview of Adaptive ML, cases like AT&T / Manulife / CCS
(00:45) The argument that RL is not merely a post-training algorithm but the algorithm for getting a model to production
(01:08) Experience on the Falcon training team — discovering that "the gap from OSS → production = RL"
(01:38) "95% of GenAI pilots never reach production"; "the myth of the last mile"
(02:18) Limits of MVP construction (proprietary / SFT) and the absence of an improvement path
(03:13) The real marathon = MVP → production → beyond, with RL the way to accelerate it
(04:22) RL vs SFT vs prompts — same goal, different effectiveness
(04:48) Outsized performance — equivalent performance with smaller models
(05:30) Economics at scale (AT&T millions-of-dollars example), latency constraints (0.5 seconds), ownership
(06:48) The agent era = more tokens, more complexity, less error tolerance — RL's historical advantage
(07:54) RL was originally built for agent / robot training; the environment is where it lives
(08:42) Existing agent vs building a new environment, the Manulife case
(09:38) The data problem (not present on the web); the synthetic dataset = a by-product of the environment
(10:34) Repurposing real customer-conversation transcripts to train the mock user, the medical-supply company's "call an ambulance" example
(11:25) The reality of RLHF = an annotation campaign, high cost
(12:00) Assembling reward signals (systematic / KPI / LLM-as-judge)
(12:55) Human role is rubric definition only — minutes to hours
(13:35) RL is hard; PPO orchestrates 4 LLMs at once
(14:18) Adaptive Engine hides the complexity behind pre-built recipes
(15:00 –) Q&A: implicit feedback for Cursor's tab completion, scaling reward models

Sources

From the AI Engineer Europe 2026 official YouTube playlist. The video ID can be confirmed on the AI Engineer official channel.

Alessandro Cappelli

Co-founder and Chief Customer Officer, Adaptive ML / formerly on the Falcon (TII) team

Glossary

Adaptive ML / Adaptive Engine: An RL Ops (Reinforcement Learning Operations) platform for large enterprises. A holistic platform through which large enterprises like AT&T / Manulife / CCS build, evaluate, and run their own specialized LLMs in production. Official: adaptive-ml.com
The myth of the last mile: Cappelli's critique term. An argument that the industry orthodoxy — "the hard part is reaching MVP, and production is the final push" — is wrong. In reality, MVP is the first mile; the real marathon is MVP → production → beyond. Neither prompt tuning nor SFT carries a systematic improvement path, so this is where projects get stuck.
Falcon: An open-source large language model announced by the UAE's Technology Innovation Institute (TII) in 2023. At the time it was one of the most widely adopted OSS models, alongside Llama-2. Cappelli was on its training team and there gained the insight that "to bring OSS into production, RL is indispensable" — which led to founding Adaptive ML.
Outsized performance: Cappelli's term for RL's unique value over SFT. Equivalent performance can be achieved with a far smaller model than SFT requires. As a result, RL unlocks business impact across three axes: inference cost, latency, and ownership.
PPO (Proximal Policy Optimization): A representative RL algorithm. Requires orchestrating four LLMs at once, not one (policy / reference / value / reward model). A symbol of RL's implementation complexity. Adaptive Engine hides this behind pre-built recipes.
LLM-as-judge: A method that has an LLM rate open-ended qualitative dimensions (tone, compliance with business guidelines, etc.). One component of the reward signal. The division of labor — humans define the rubric → LLM rates each case → ratings convert to reward — minimizes the human-in-the-loop burden.
Containment rate: A customer-support KPI at CCS (the medical-supply company). It measures "what percentage of calls were completed end-to-end inside the agent (without escalating to a human)." A good example of a metric being directly maximized as RL reward.