Deploying to GPUs Without Leaving Your IDE: Audry Hsu (RunPod) on Flash and Serverless GPUs

AI Engineer Europe 2026 (London) / approximately 20-minute talk

Audry Hsu · 14:11 "Instead of code change → commit → rebuild Docker → upload somewhere → allocate a GPU, all of this happens inside the IDE, and you never have to leave it once."

This is a summary of the talk "GPU Cloud Deployment Without Leaving Your IDE," given at AI Engineer Europe 2026 in London (about 20 minutes; video published 2026-06-09 on the official AI Engineer channel). The speaker is Audry Hsu, a developer advocate (DevRel) at RunPod, a GPU cloud platform for AI workloads. The subject is the company's Python SDK, Flash: a development experience in which adding a single decorator to an async function deploys it to the GPU cloud without leaving your IDE.

RunPod's mission is to build "a foundational platform for developers to scale AI workloads." RunPod provides the hardware (GPUs / compute), and developers bring their code and models to deploy quickly. The framing is that RunPod takes on the pains of "infrastructure setup" — matching CUDA versions, PyTorch combinations, validating new GPU SKUs — so developers can focus on training models and building applications.

What RunPod is — from mining rigs in a basement

RunPod is an AI cloud / GPU infrastructure company founded in 2022 by Zhen Lu (CEO) and Pardeep Singh (CTO). Audry tells its origin story. At the end of 2021 the two co-founders had Ethereum mining rigs stacked in a basement, a venture that did not work out. They built the prototype of RunPod with the leftover GPUs and posted on Reddit, "Want to use GPUs for free in exchange for feedback?" That was the start of the company. They have built in public with the community ever since, and have generated revenue from the very beginning (which is rare, says Audry). Today RunPod reports over 500,000 developers, more than 30 data centers (spanning about 10 countries, including France, Romania, and Iceland, plus the Asia-Pacific region), and $120M in ARR (annual recurring revenue). She describes it as "punching above our weight."

Four shapes — Pods / Serverless / Instant Clusters / Hub

RunPod splits its products by purpose. Pods are persistent VM environments: on-demand with per-second billing, and while you rent one the GPU is reserved and cannot be taken by anyone else. This is the shape for when you need a dedicated GPU; when you are done you can destroy the Pod and start again later. Serverless is the scaling-focused shape: when a workload's frequency or load fluctuates, it auto-scales workers and shrinks to zero when there are no requests, which avoids idle billing. It carries a markup over Pods for that scaling. Instant Clusters are for training and multi-node work. Hub is the place to instantly deploy OSS repositories that RunPod has pre-validated, such as ComfyUI, Stable Diffusion, and vLLM.

Flash — deploying to GPUs without leaving your IDE

The star of the talk is Flash, RunPod's OSS Python SDK. A big pain for developers is the iteration cycle. Every time you tweak inference code and want to try it, you repeat: commit → push to GitHub → build a Docker image → pull it from the registry → load it onto a server → allocate a GPU → finally test. Flash eliminates this. You add a single endpoint decorator (@flash.endpoint in the talk, @Endpoint in the official docs) to an ordinary async Python function, and Flash packages that function and deploys it to the GPU cloud. The main function and helpers run locally, and only the part that needs GPU compute runs in the cloud. With hot file reload, whatever you change is instantly re-packaged and pushed. No Dockerfile is required.

The demo runs live. She runs the generate_image function (generating an image with PyTorch + Stable Diffusion XL Turbo, returning it as base64) with `flash run`, and sends requests to a local FastAPI server. The endpoint decorator specifies the endpoint name, the GPU family (a variation of NVIDIA's H100 line), the max worker count (5), an always-on active worker (1), a timeout, and so on. She asks the audience "what should we generate?" and generates "a cat flying through London's cloudy sky" (an abstract cat appears, to wry laughter). She then swaps the model out for DreamShaper (a fine-tune based on Stable Diffusion 1.5, leaning toward art / illustration) and regenerates — without committing the code or rebuilding Docker, switching it entirely inside the IDE.
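The shape of that decorator can be illustrated with a small local stand-in. This is a sketch only, not the Flash API: the `endpoint` function, the `EndpointConfig` class, every parameter name, the `"H100"` string, and the 60-second timeout are hypothetical, chosen to mirror the settings listed above (endpoint name, GPU family, max workers 5, one active worker, a timeout). Where a real SDK would package the function and run it on a remote GPU worker, this stub just attaches the config and runs the function locally.

```python
import asyncio
import functools
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    """Settings the talk says the decorator carries (field names are hypothetical)."""
    name: str
    gpu_family: str
    max_workers: int
    active_workers: int
    timeout_s: int


def endpoint(**settings):
    """Hypothetical stand-in for an endpoint decorator (not the Flash API).

    A real SDK would ship the function to a remote GPU worker; this stub
    only records the config and runs the function locally.
    """
    config = EndpointConfig(**settings)

    def decorate(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            # Enforce the configured timeout around the (here: local) call.
            return await asyncio.wait_for(fn(*args, **kwargs), timeout=config.timeout_s)

        wrapper.config = config
        return wrapper

    return decorate


@endpoint(name="image-gen", gpu_family="H100", max_workers=5, active_workers=1, timeout_s=60)
async def generate_image(prompt: str) -> dict:
    # Placeholder body: the demo loads SDXL Turbo here and returns base64 image data.
    return {"prompt": prompt, "image_b64": ""}


result = asyncio.run(generate_image("a cat flying through London's cloudy sky"))
```

The point of the sketch is the calling convention: the decorated function is still awaited like any other coroutine, which is why swapping the model inside its body does not change the code around it.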

Pipelines and billing

Finally, Audry argues that the true value of a developer tool comes out not in a single model call but in "the orchestration around it," and demonstrates a multi-stage pipeline. First Qwen3 (a public endpoint) generates a prompt. The prompt is passed to DreamShaper on her own endpoint. Then Nano Banana 2 (the nickname for Google's image model Gemini 3.1 Flash Image, a premium model good at compositing multiple photos) combines the image generated by DreamShaper with a founder's reference photo. Prompt generation → image generation → compositing are chained together without leaving the IDE.
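The chaining itself is ordinary async Python. The following is a minimal sketch of a three-stage pipeline in that style; the function names, the string outputs, and the `founder.jpg` file name are all hypothetical placeholders, and no model or RunPod endpoint is actually called.

```python
import asyncio


async def generate_prompt(topic: str) -> str:
    # Stage 1 stand-in: the talk uses Qwen3 on a public endpoint.
    return f"An illustration of {topic}"


async def generate_image(prompt: str) -> str:
    # Stage 2 stand-in: the talk uses DreamShaper on the speaker's own endpoint.
    return f"image({prompt})"


async def composite(image: str, reference_photo: str) -> str:
    # Stage 3 stand-in: the talk uses Nano Banana 2 to merge in a founder's photo.
    return f"composite({image} + {reference_photo})"


async def run_pipeline(topic: str, reference_photo: str) -> str:
    # Each stage awaits the previous one, so one stage's output feeds the next.
    prompt = await generate_prompt(topic)
    image = await generate_image(prompt)
    return await composite(image, reference_photo)


final = asyncio.run(run_pipeline("a founder in London", "founder.jpg"))
```

In this arrangement the orchestration logic stays in one local function, and each stage could in principle be replaced by a remote call without changing `run_pipeline`.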

Billing is "only for the time a request ran." During the demo, she shows that an H100 is about $0.00116 per second, and shows in the console that 3 of the 5 workers are running (because 3 images were requested). Serverless carries a markup over Pods for the scaling. The recommendation is a division of use: while iterating, keep the worker count low or start from Pods (1–2 GPUs are enough while experimenting); if you want to distribute hundreds of workers / hundreds of GPUs across regions for availability in production, use Serverless.
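To make the per-second figure concrete, here is a small arithmetic sketch. The $0.00116 rate is the approximate number shown in the demo; the request duration of 10 seconds is a hypothetical value, since the talk does not state how long each image took.

```python
H100_USD_PER_SECOND = 0.00116  # approximate rate shown in the demo


def request_cost(seconds_per_request: float, requests: int,
                 rate: float = H100_USD_PER_SECOND) -> float:
    """Cost when billing covers only the seconds in which requests ran."""
    return seconds_per_request * requests * rate


# Hypothetical workload: three image requests of 10 seconds each.
demo_cost = request_cost(seconds_per_request=10, requests=3)

# The same rate expressed per hour of continuous use.
hourly_equivalent = H100_USD_PER_SECOND * 3600
```

Under these assumptions the three requests cost about $0.035 in total, while a worker kept busy for a full hour at the same rate would cost about $4.18.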

Editorial Notes

The core of this talk comes down to a single point of developer experience: "never leaving the IDE." It folds the conventional GPU-deploy iteration (commit → Docker → upload → allocate GPU → test) into a local experience of putting one decorator on a function. Given that RunPod is in the GPU infrastructure business, this is naturally a self-interested pitch, but the honesty of showing the demo "with failures included" (the abstract cat, forgetting to pass the prompt, live stumbles) conveys the real picture of the tool. The founding story, which starts from mining rigs in a basement and free GPUs offered on Reddit and reaches $120M ARR and 500,000 developers by building in public, also reads as continuous with the product philosophy of "taking on the infrastructure setup so developers can focus on compute." Like Google's DevRel, which lets you try things for free in AI Studio, it is one example of developer acquisition centered on "letting people touch it first."

Points of Focus

One decorator = remote GPU, as an abstraction

The idea of Flash is that when you put an endpoint decorator on a function, "just that function" runs in the GPU cloud while the surrounding code stays local. It eliminates the packing-and-shipping steps of Dockerization, pushing to a registry, and securing a server, and changes are reflected instantly with hot reload. The abstraction of sending only the part that needs compute off to a remote machine draws the boundary between local development and GPU execution at the level of a single function.

The economics of Serverless and Pods

Even for the same GPU: if you want to spin up hundreds of workers under fluctuating load, use Serverless (auto-scale + scale-to-zero to avoid idle billing, but with a markup); if you want to steadily hold a small number of GPUs, use Pods (reserved, cheaper). The guidance "Pods or low workers while experimenting, Serverless for the variable load of production" organizes GPU cost as a choice of "secure it constantly, or spin it up each time."
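This trade-off can be framed as a break-even utilization. The sketch below is illustrative only: the $0.00116 per second Serverless rate is the approximate figure shown in the demo, while the $3.00 per hour Pod rate is a hypothetical number, because the talk does not quantify the markup.

```python
def break_even_utilization(pod_usd_per_hour: float,
                           serverless_usd_per_second: float) -> float:
    """Fraction of each hour a worker must be busy before a reserved Pod is cheaper."""
    serverless_usd_per_hour = serverless_usd_per_second * 3600
    return pod_usd_per_hour / serverless_usd_per_hour


def cheaper_option(busy_fraction: float, pod_usd_per_hour: float,
                   serverless_usd_per_second: float) -> str:
    """Pick the cheaper shape for a single GPU at a given busy fraction."""
    threshold = break_even_utilization(pod_usd_per_hour, serverless_usd_per_second)
    return "Pods" if busy_fraction > threshold else "Serverless"


# Hypothetical Pod price; the Serverless rate is the approximate demo figure.
threshold = break_even_utilization(pod_usd_per_hour=3.00,
                                   serverless_usd_per_second=0.00116)
bursty = cheaper_option(0.10, 3.00, 0.00116)   # busy 10% of the time
steady = cheaper_option(0.95, 3.00, 0.00116)   # busy 95% of the time
```

With these assumed prices the break-even sits at roughly 72% utilization: a GPU that is busy only occasionally is cheaper on Serverless, and one that is busy almost constantly is cheaper as a reserved Pod. This compares cost for one GPU only and ignores the availability and multi-region reasons the talk gives for choosing Serverless in production.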

Video Outline

  • (00:00) Self-introduction (Audry, RunPod), interactive intro with the audience
  • (00:45) What RunPod is — AI cloud infrastructure, taking on infrastructure setup
  • (01:38) Why it exists + founding story (Zhen Lu / Pardeep Singh, 2021 mining rigs → Reddit)
  • (03:20) Scale — 500,000 developers / 30+ data centers / $120M ARR
  • (03:54) Four shapes — Pods / Serverless / Instant Clusters / Hub
  • (05:42) Flash's motivation — the pain of the deploy iteration cycle
  • (06:28) Flash = endpoint decorator sends a local function to the GPU, hot reload
  • (07:45) Demo — image generation with SDXL Turbo, flash run + FastAPI
  • (11:28) Inside the endpoint decorator (GPU family, worker count, timeout)
  • (13:03) Swap to DreamShaper and regenerate inside the IDE
  • (14:30) Orchestration is the real strength — pipeline demo
  • (15:20) Prompt generation with Qwen3 → DreamShaper → compositing with Nano Banana 2
  • (16:57) Billing — per-second, dividing use between Serverless and Pods
  • (18:39) The final composited photo, wrap-up

Glossary

RunPod
An AI cloud / GPU infrastructure company founded in 2022 (co-founded by Zhen Lu = CEO, Pardeep Singh = CTO). It provides hardware and compute, and developers bring their code and models to deploy. Publicly stated: over 500,000 developers, $120M ARR, more than 30 data centers. It started from the repurposing of Ethereum mining rigs at the end of 2021 and free GPUs offered on Reddit.
Flash
RunPod's OSS Python SDK. Just add an endpoint decorator (in the talk @flash.endpoint, in the official docs @Endpoint) to an async Python function, and it deploys that function to the GPU cloud. The main function and helpers stay local; only the GPU-compute part runs in the cloud. Changes are reflected instantly with hot file reload. No Dockerfile required.
Pods / Serverless / Instant Clusters / Hub
RunPod's four shapes. Pods = persistent VM, per-second billing, reserved GPU. Serverless = auto-scale + scale-to-zero to avoid idle billing (with a markup). Instant Clusters = for training / multi-node. Hub = instantly deploy pre-validated OSS repositories (ComfyUI / Stable Diffusion / vLLM, etc.).
Nano Banana 2
The nickname for Google's image model Gemini 3.1 Flash Image. A premium model strong at compositing multiple photos. In RunPod's demo, it was used in the final stage of a multi-stage pipeline: Qwen3 generates the prompt → DreamShaper generates the image → Nano Banana 2 composites.
Per-second billing / Serverless markup
RunPod's billing is for the time a request ran. In the demo, an H100 was shown at about $0.00116 per second. Serverless carries a markup over Pods for the scaling. The recommendation: Pods or low workers while iterating, and Serverless — which can distribute hundreds of workers — for the variable load of production.