The Era When Agents Train Models — Merve Noyan (Hugging Face)

AI Engineer Europe 2026 (London) · May 13, 2026

Merve Noyan (Hugging Face) · 06:30 "Just say 'fine-tune Qwen2-VL on LLaVA-Instruct-Mix,' and the agent calculates VRAM, asks you questions, and trains it automatically. After six years of machine learning, this looks like science fiction to me."

AI Engineer Europe 2026 (London, published May 13, 2026, around 19 minutes 10 seconds). Japanese subtitles included.

The speaker is Merve Noyan, an ML advocacy engineer on the Hugging Face open-source team. Co-author of the book "Vision Language Models" (O'Reilly, 2025). Title of the talk: "Your Agent Can Now Train Models" — a declaration that the era has arrived in which integrating Hugging Face Hub's MCP server, Skills, and Inference Providers lets Claude Code fine-tune models directly.

The starting point is strong: referencing "Cloud performance degradation" (the industry conversation about Anthropic's Cloud models recently dropping in performance), Merve argues, "if everything is open, your performance can't degrade behind your back." She organizes a three-tier framework — open model (open weights), open source (commercially permissible license), and fully open (code and harness included) — and continues that quantization, distillation, and fine-tuning can be done freely, with privacy guarantees by running in the user's browser on-device.

Key observations

Open models dominate the top of SWE-bench Pro — GLM 5.1 at 58.4 (04:34)

What settles the argument is the Artificial Analysis Intelligence Index plus the SWE-bench Pro leaderboard. Green is open, black is closed. "Lately the open side has fully caught up to closed," Merve says. The recent SWE-bench Pro rankings: GLM 5.1 (Z.ai) at 58.4 on top, followed by MiniMax-M2.5 at 55.4, Kimi K2.5 (Moonshot) at 50.7, Qwen3-Coder-Next at 44.3, and Qwen3-Coder-480B-A35B-Instruct at 38.7 — the top five are all open.

"We caught up, and we'll catch up more," Merve says. Hugging Face Hub now hosts nearly 3 million models, and Merve walks carefully through the flow from "benchmark → audition → local run" all completed on the Hub. The Hub's real value, she argues, is routing through Inference Providers (Groq, Cerebras, Novita, etc.) while filtering for "cheapest, fastest, and tool-use compatible."

Skills made the "VRAM calculator" unnecessary — agent-driven fine-tuning (13:14)

The lead segment of the talk is Hugging Face Skills. Inside Skills there is an LLM trainer skill, and you instruct Claude Code as naturally as "fine-tune Qwen2-VL on LLaVA-Instruct-Mix." Under the hood, the agent:

  • Estimates VRAM (model size × batch size × precision as a memory calculator)
  • Asks for an appropriate instance (from multiple candidates)
  • Asks about validation split, epoch count, and so on
  • Kicks off a job on Hugging Face Infra
  • Uploads the trained model to the Hub upon completion

Merve: "Six years into my ML engineering career, this is science fiction." Coverage extends beyond LLMs and VLMs to object detectors and segmentation models. Even bounding-box format differences are absorbed by Skills. This is the reality of "vibe training."

Hermes Agents — open weights surpassing Claude's memory management (07:51)

Hermes Agents is the project Merve publicly declares she "will die on this hill" for. Open-weight, with a Setup Wizard that handles end-to-end integration into Slack, WhatsApp, and other messaging platforms. Her claim: in memory management, the design takes a step beyond OpenClaw.

A concrete experience: when stuck on a Slack integration, she asked "GLM 5.1 + Hermes, fix the Slack integration" and it diagnosed the cause and fixed it itself. You can run it via Hugging Face Inference Providers, or serve it locally via Llama.cpp. The flexibility of the development experience exceeds a single Anthropic or OpenAI API, she argues.

30,000 papers processed in bulk with Codex plus cheap open OCR (16:25)

A live introduction to a project by colleague Nils Reimers, who, to strengthen Hugging Face Hub's "Papers" section, used OCR to convert 30,000 AI papers into markdown. The method:

  1. Choose a model with the olm OCR Bench (Chandra OCR is on top, but you can ask Skills "what's the best OCR for fine-tuning?")
  2. Have the agent write an OCR script
  3. The agent computes VRAM and cost estimates (Hugging Face Bucket — S3-compatible, but cheap and fast)
  4. Kick off as a parallel job on Hugging Face Infra
  5. Done — markdown is linked to each paper, searchable and RAG-ready via the Hub

A concrete example of "running a scientific data pipeline with prompts alone." OCRing 30,000 papers, the only places a human touched were "writing the instructions and the final check." Merve: "The era of doing the napkin math for OCRing a single paper by hand is over."

Video outline

  • (00:00) Introduction — Merve, Hugging Face open-source team
  • (00:40) Open Weight / Open Source / Fully Open — three tiers
  • (01:23) Anthropic Cloud performance decline and "if everything is open, silent degradation can't happen"
  • (02:09) Strengths of open models — quantization, distillation, fine-tuning, edge deployment
  • (02:35) Open models like GLM 5.1 catching up on the Intelligence Index
  • (03:50) Hugging Face Hub = the open infra layer, 3 million models
  • (04:34) Convergence of vision LM and LLM; Day 0 VLM releases as the new standard
  • (04:58) Benchmark Datasets feature — comparing rankings on SWE-bench Pro
  • (05:17) Routing across Inference Providers — filtering by cheapest, fastest, tool-use compatible
  • (06:31) Overview of the HF Hub MCP server + Skills
  • (06:42) Options for local coding agents — Pi / Llama Agent / Llama.cpp integration
  • (07:51) Pitch for Hermes Agents — memory management surpassing OpenClaw
  • (09:22) Traces dataset repository — upload Claude / Codex / Pi sessions
  • (10:30) List of local app integrations — LM Studio / Jan / Llama CPP
  • (11:24) GGUF + Use This Model — minimum commands for local inference
  • (12:14) Skills overview — HF CLI / LLM trainer / Gradio / Dataset / OCR
  • (13:14) Skills demo — fine-tuning Qwen2-VL on LLaVA-Instruct-Mix
  • (15:18) What to serve over MCP — Spaces search, jobs, semantic search
  • (16:00) Image generation from Spaces (baklava made of yarn)
  • (16:25) Colleague Nils's 30,000-paper OCR project
  • (18:39) Close + slides shared on Twitter

Sources

Your Agent Can Now Train Models — Merve Noyan, Hugging Face (AI Engineer Europe 2026)