Nitya Narasimhan · 17:30 "Penetration testing is like hiring a contractor to 'break into our place.' Evaluations are more like a building inspector showing up to find code violations. To defend something, you first have to let someone try to seriously break it."
A live demo workshop by two of Microsoft's Foundry Microsoft's cloud agent platform (formerly Azure AI Foundry). An enterprise agent platform that integrates build / host / observe / safeguard end-to-end. Development, operations, observation, and safety are handled in a single control plane. Includes OpenTelemetry-based tracing, red teaming via PyRIT, and Azure Monitor integration. Docs: learn.microsoft.com/azure/foundry/ DevRel team on the enterprise agent platform. The London Underground's "Mind the Gap" announcement is used as a metaphor for observability — a state in which there's a gap between the train (implementation) and the platform (the requirements spec) means you need both guardrails and announcements.
Amy Boyd is Microsoft's Principal Cloud Advocate / AI Foundry Advocacy Lead. Nitya Narasimhan holds a Ph.D., has 25-plus years of software research and development experience, and is the Senior AI Advocate who serves as Foundry's "developer 0" — the person who's the first to feed feedback back to the product. She came to Microsoft after 10 years of research at Motorola Labs and founding the NY Google Developer Groups. The "Mind the Gap" framing came from a personal moment when Nitya moved from New York to London and noticed, "oh, over here it's not 'Watch the Gap,' it's 'Mind the Gap.'"
1. Three axes of reliability — (a) Evaluate (benchmark quality, safety, and adaptation), (b) Monitor (observe behavioral change over time), (c) Optimize (run an improvement loop from data). These are folded into Microsoft Foundry's build / host / observe / manage planes. Not a single SDK, but framed as "a development flow with observation built in from the start."
2. Evaluation is hierarchical — intent resolution (did the agent interpret the user's intent correctly) → tool call accuracy (did it call the right tool) → response completeness (did the response meet the requirements) → task adherence (is quality holding up over time). The four points evaluate the quality of the entire agent workflow, not "a single LLM output." The DevRel team notes from production experience that task adherence (quality maintained over time) is the most tuning-intensive.
3. Tracing on OTel plus Azure Monitor — not a Microsoft-proprietary instrumentation, but the OpenTelemetry A vendor-neutral telemetry collection standard. Specifies three signals: trace, metric, and log. An incubating project at the CNCF (Cloud Native Computing Foundation). Microsoft Foundry's design emits agent internal LLM calls, tool calls, and steps all as OTel spans. Official: opentelemetry.io standard. This secures interoperability — agents built with other vendors can be observed in the Foundry control plane. Multi-agent fleet management (integration with the A2A protocol) is also in view.
Key Observations
"Once a human enters the evaluator loop, development speed drops exponentially" (during the demo, Amy)
A claim Boyd made while demoing Foundry's Skill feature. Foundry Skill is a tool that takes an empty agent endpoint and autonomously runs build / evaluate / improve loops on it. They set up a zero-setup environment in Codespaces plus dev container and ran the autonomous loop in a single command.
The direction is the same as Project Glasswing (Anthropic plus Apple plus financial-institution early access) — a design where the evaluation work itself is run by AI. It's an operational version of the same problem as the RLHF (human in the evaluation loop) vs RLVR (verifiable rewards, human-less) discussion. Microsoft is on the democratize line; Anthropic is on the limited-access line — a contrast that sat side by side on Day 1 of AI Engineer Europe.
"Foundry Skill autonomously builds from an empty agent" (33:00, Nitya)
Narasimhan demoed building "Contoso Travel Agent." The standard course assembles the agent step by step via the SDK, but as an optional course there's an autonomous-build mode in which "you register an empty agent endpoint with Foundry and launch Skill." Hackathon participants are literally getting it running with a single command in their local environment.
A case to set alongside Karpathy's agentic coding — automation at the level of "a coding agent builds the agent" implemented in an enterprise context. It's telling that Skill is built into Foundry as a platform feature — the development flow assumes "rather than writing code, hand over a spec and have the agent build it."
"PyRIT (OSS), so attack simulation against the agent is one click" (around 14:00, Amy)
Microsoft's move to standardize red teaming at the product level. PyRIT (Python Risk Identification Toolkit) is Microsoft's OSS framework that automates adversarial prompt injection and jailbreak attempts against LLMs. It's wired into Foundry's build flow, offering "call the contractor who breaks things" as a parallel experience to evaluations.
The design implies that "safeguarding is not a one-time activity — it's a continuous activity woven into the build flow." The "building inspector vs intruder" analogy for evaluations and safeguarding (the lead quote) maps directly onto this design.
"Tracing is OpenTelemetry standard, so agents built with other vendors can be observed in Foundry" (around 13:00)
The choice of OTel as the basis rather than Microsoft-proprietary instrumentation. The implications:
- Agents built with other SDKs (LangChain, AutoGen, LlamaIndex, even Pi or Claude Code) can be observed in the Foundry control plane once OTel instrumentation is in place
- Immediate integration with existing Microsoft cloud operations infrastructure such as Azure Monitor and Application Insights
- For multi-agent fleets (a future in which one organization operates many agents), every agent becomes visible against the same observation standard
Templestein's event-sourced agent harness argues for "events as the common language," while Microsoft argues for "OTel spans as the common language" — the abstraction levels differ, but the direction is the same. Both aim at "connecting the agent ecosystem under a single standard."
Video outline (main sections)
- (00:00) London Underground "Mind the Gap" metaphor introduced — three meanings (quality / safety / adaptation)
- (06:00) Non-determinism is not just a demo problem — the three axes of reliability: evaluate / monitor / optimize
- (09:00) The Microsoft Foundry structure — a cloud agent platform spanning host / observe / manage, not just build
- (13:00) Tracing on OpenTelemetry, Azure Monitor integration, aggregation across multiple development environments
- (17:00) Red teaming and PyRIT (OSS), Nitya's "call the contractor to break in" analogy
- (22:00) Codespaces plus dev container — zero-setup access for workshop participants
- (28:00) Step-by-step Contoso Travel Agent — adding instructions, Bing search, app insights in order
- (33:00) Skill demo — Foundry autonomously builds and evaluates from an empty agent endpoint
- (40:00) Evaluator catalog — intent resolution / tool call accuracy / task adherence / response completeness
- (50:00) Tracing demo — visualizing the agent's internal LLM calls via OTel spans, measuring cost, tokens, and steps
- (1:05:00) Multi-agent fleet management roadmap, A2A protocol interoperability, future direction
Sources
- Workshop video (AI Engineer official YouTube)
- Microsoft Foundry official documentation
- PyRIT (Python Risk Identification Toolkit) on GitHub
- OpenTelemetry
- AI Engineer Europe 2026 official schedule
Glossary
- Microsoft Foundry
- Microsoft's cloud agent platform (formerly Azure AI Foundry). An enterprise agent platform that integrates build / host / observe / safeguard. Development, operations, observation, and safety are handled in a single control plane. Includes OpenTelemetry-based tracing, red teaming via PyRIT, and Azure Monitor integration.
- Three axes of reliability (Evaluate / Monitor / Optimize)
- The Microsoft DevRel team's framing of agent reliability. Evaluate = benchmarking (three perspectives: quality, safety, adaptation). Monitor = observation of behavioral change over time. Optimize = an improvement loop driven by data. A philosophy of designing reliability as a continuous cycle, not a one-off evaluation.
- Four stages of evaluation (Intent Resolution → Tool Call → Response → Task Adherence)
- A design that evaluates an agent workflow at four observation points: intent resolution (interpreting intent), tool call accuracy (selecting the right tool), response completeness (meeting the requirements of the response), task adherence (maintaining quality over time). Measures the quality of the whole workflow rather than a single LLM output.
- Foundry Skill
- A Microsoft Foundry feature. Register an empty agent endpoint, and it autonomously runs build / evaluate / improve. Realizes a development flow where "rather than writing code, hand over a spec and have the agent build it." Combined with a Codespaces plus dev container zero-setup environment, it runs in a single command.
- PyRIT (Python Risk Identification Toolkit)
- Microsoft's OSS red teaming framework. Automates adversarial prompt injection and jailbreak attempts against LLMs. Wired into Foundry's build flow, providing "call the contractor who breaks things" as a parallel experience to evaluations. GitHub: Azure/PyRIT.
- OpenTelemetry (OTel)
- A vendor-neutral telemetry collection standard. Specifies three signals (trace, metric, log). An incubating CNCF project. Microsoft Foundry adopts a design that emits agent internal LLM calls, tool calls, and steps as OTel spans. Agents built with other vendors' SDKs can be observed in Foundry once OTel instrumentation is in place.
- Multi-agent fleet
- The state in which one organization operates many agents. Microsoft offers as future direction a control plane that "uniformly observes the entire organization's agent fleet" — beyond a single multi-agent system. Interoperability with the A2A (Agent-to-Agent) protocol is also in view.