Transformers Finally Ate Vision — Isaac Robinson / Roboflow (AI Engineer Europe)

AI Engineer Europe · May 8, 2026

Isaac Robinson · 10:24 "We have, in a sense, won."

AI Engineer channel (published May 8, 2026, about 17 minutes). A technical session at AI Engineer Europe 2026 (April 8–10) in London.

In vision AI, CNNs (convolutional neural networks) and Transformers have competed for a long time. Early on, many believed Transformers could not possibly beat CNNs: for an image of side length n, self-attention compute grows as n⁴ (the number of patches grows as n², and attention compares every pair of patches), while convolution grows as n². Yet in 2026 the major vision foundation models are almost all in the ViT family. This 17-minute technical session lays out why an n⁴-scaling model beat the efficient CNN.
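The n² versus n⁴ gap can be checked with a few lines of arithmetic. This is a minimal sketch rather than anything from the talk: the function names, the 3x3 kernel, and the single-channel simplification are assumptions for illustration, and the 16-pixel patch follows the ViT patch size mentioned in the outline.

```python
def conv_cost(n: int, kernel: int = 3) -> int:
    """Multiply-adds for one single-channel kernel x kernel convolution over an n x n image."""
    return n * n * kernel * kernel


def attention_pairs(n: int, patch: int = 16) -> int:
    """Token pairs compared by global self-attention on an n x n image cut into patch x patch tiles."""
    tokens = (n // patch) ** 2  # token count grows as n^2
    return tokens * tokens      # every token attends to every token: n^4


if __name__ == "__main__":
    base = 224
    for n in (224, 448, 896):
        conv_ratio = conv_cost(n) // conv_cost(base)
        attn_ratio = attention_pairs(n) // attention_pairs(base)
        print(f"side={n}: conv x{conv_ratio}, attention x{attn_ratio}")
```

Doubling the side length multiplies the convolution cost by 4 and the attention pair count by 16, which is the gap the early skeptics pointed to.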

The speaker is Isaac Robinson, head of research at Roboflow (a computer-vision models, tools, and platform company). He leads development of the company's RF-DETR (a real-time object detection and segmentation model, set to be presented at ICLR 2026) and the RF100-VL benchmark, and he surveys the evolution of vision foundation models from the front line.

The structure of the talk is plain. (1) CNNs carry an eye-inspired inductive bias (translation invariance, hierarchical structure) at n² compute; (2) ViTs have no such inductive bias and cost n⁴; (3) between them sit intermediate attempts such as Swin (windowed attention), ConvNeXt (a CNN redesigned in a Transformer style), and Hiera (a pyramidal structure with a speed-up claim). The conclusion: ViT won in the end, for three reasons: large-scale pre-training, LLM-derived acceleration (FlashAttention, etc.), and NAS that is compatible with pre-training.
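To see why windowed attention was an attractive middle ground, compare how many token pairs global and windowed attention touch. This is a rough sketch under assumptions: the function names are hypothetical, the window size of 7 tokens is chosen for illustration, and window shifting and all other architectural details are ignored.

```python
def global_pairs(tokens_per_side: int) -> int:
    """Pairs compared when every token attends to every other token."""
    n_tokens = tokens_per_side ** 2
    return n_tokens ** 2


def windowed_pairs(tokens_per_side: int, window: int = 7) -> int:
    """Pairs compared when each token attends only inside its window x window tile."""
    if tokens_per_side % window != 0:
        raise ValueError("tokens_per_side must be a multiple of window")
    n_windows = (tokens_per_side // window) ** 2
    tokens_per_window = window * window
    return n_windows * tokens_per_window ** 2


if __name__ == "__main__":
    for side in (56, 112):
        print(side, global_pairs(side), windowed_pairs(side))
```

With a fixed window, doubling the token grid multiplies windowed pairs by 4 (linear in token count) while global pairs grow by 16.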

The concrete example shown at the end is the lineage of the SAM (Segment Anything Model) series: SAM (ViT pre-trained with MAE) → MobileSAM (backbone swapped for the TinyViT hybrid) → SAM2 (moved to Hiera) → SAM3 (gives up the architectural overhaul and uses a large pre-trained ViT backbone as-is). Across four generations, the shift from "architectural optimization" to "pre-training lock-in" is clearly visible. Roboflow's RF-DETR then applies NAS to such a pre-trained ViT and achieves a 40x speed-up over SAM3 at the same accuracy. The talk closes on that result.

Key Observations

The note hiding in the Hiera paper that "we didn't measure with FlashAttention" (08:50)

Hiera (a hierarchical-attention vision model) was reported to deliver a speed-up over ViT at the same accuracy. But the paper carries a footnote: "we did not measure with FlashAttention." When Isaac re-added FlashAttention and remeasured, Hiera's edge vanished. That footnote marks a crucial pivot: "the infrastructure that exploded in the LLM world (FlashAttention, etc.) is being borrowed directly for Vision, and ViT's n⁴ has become a non-issue in practice." It is the moment the cross-industry dynamic becomes visible: ViT benefits indirectly from the LLM boom.
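For readers unfamiliar with why FlashAttention helps, the core numerical idea is that softmax attention can be computed block by block with a running maximum and running sum, so the full score matrix never has to be stored. The sketch below illustrates only that idea in plain Python for a single query; it is not the FlashAttention kernel, the function names and block size are assumptions, and the usual 1/sqrt(d) scaling is omitted in both functions for brevity.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: compute all scores first, then softmax-weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def blockwise_attention(q, keys, values, block=4):
    """Same output, but keys/values are consumed in blocks with running statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    rng = random.Random(0)
    q = [rng.uniform(-1, 1) for _ in range(8)]
    keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
    values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
    ref = naive_attention(q, keys, values)
    out = blockwise_attention(q, keys, values, block=4)
    print(max(abs(a - b) for a, b in zip(ref, out)))
```

Note that this trick leaves the number of multiply-adds unchanged; the practical gain comes from avoiding reads and writes of the large intermediate score matrix.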

"Pre-training lock-in" visible across SAM generations of backbones (10:40)

The backbone of the SAM (Segment Anything Model) series changes with each generation. SAM = ViT (pre-trained with MAE); MobileSAM = TinyViT (an attempt to swap in a CNN-Transformer hybrid); SAM2 = Hiera + MAE (a pyramidal structure to improve speed); SAM3 = "give up the architectural overhaul and use a large pre-trained backbone as-is." Isaac's reading is that SAM3 effectively declares "this is the best we can do." The price is stated too: SAM3 has 800 million parameters and takes 300 ms on a T4 GPU, which rules out edge devices.

RF-DETR = pre-trained ViT + NAS, 40x faster than SAM3 (12:30)

Roboflow's answer: leave the large pre-trained ViT backbone unchanged and introduce flexible knobs via NAS (Neural Architecture Search) in a drop-in compatible way. This yields a whole family of high-performing models at once, from which one can be chosen to match the target data and target hardware. The result, measured on RF100-VL: a 40x speed-up over a fine-tuned SAM3 at the same accuracy, and 15x over SAM3 itself. This makes Roboflow's positioning concrete: "solve the deployment-flexibility problem of a one-size-fits-all foundation model with NAS." The paper is set to be presented at ICLR 2026.
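The summary does not describe how RF-DETR's search works internally, so the following is only a generic sketch of the last step such a workflow implies: given a family of candidate models, keep the ones on the speed/accuracy Pareto front and pick the most accurate one that fits a hardware latency budget. The `Candidate` type, the helper names, and every number in the example are hypothetical placeholders, not measurements from the talk.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    latency_ms: float
    accuracy: float


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate matches or beats on both axes."""
    front = []
    for c in cands:
        dominated = any(
            o.latency_ms <= c.latency_ms
            and o.accuracy >= c.accuracy
            and (o.latency_ms < c.latency_ms or o.accuracy > c.accuracy)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_ms)


def pick(cands: list[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Most accurate Pareto-optimal candidate within the latency budget, or None."""
    fits = [c for c in pareto_front(cands) if c.latency_ms <= budget_ms]
    return max(fits, key=lambda c: c.accuracy) if fits else None


# Made-up family for illustration only.
FAMILY = [
    Candidate("variant-a", 2.0, 0.50),
    Candidate("variant-b", 4.0, 0.55),
    Candidate("variant-c", 5.0, 0.54),  # slower and less accurate than variant-b
    Candidate("variant-d", 10.0, 0.60),
]

if __name__ == "__main__":
    print([c.name for c in pareto_front(FAMILY)])
    print(pick(FAMILY, budget_ms=6.0))
```

The design point this illustrates is that one search produces many deployable models, and the choice among them is deferred until the target hardware and latency budget are known.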

The three-element prescription: "ViT-specific pre-training + LLM-derived acceleration + NAS" (15:01)

The talk ends by organizing the reasons ViT won into three lines. (1) ViT-specific large-scale pre-training (DINOv2/DINOv3, MAE, etc.); (2) acceleration borrowed from the infrastructure that exploded in the LLM world (FlashAttention, etc.); (3) neural architecture search that is compatible with pre-training. The close, "and that's it; we have, in a sense, won," punctuates the vision community's years-long debate. Stephen Batifol (BFL) argued at the same venue, in his Self-Flow talk, for "generating without an external encoder"; set side by side, the two talks show two different approaches to representation learning: "leverage the pre-trained ViT backbone" versus "do away with the external encoder."

Video Outline

(00:00) Self-introduction, overview of the evolution of Vision backbones
(00:30) CNN characteristics — eye-inspired inductive bias, translation invariance, hierarchical structure, n²
(01:20) The arrival of the Transformer — sets-to-sets, no inductive bias, n⁴
(01:55) ViT (Vision Transformer) — 16x16 patches + positional encoding
(02:50) The competition question — does CNN's n² or ViT's n⁴ win?
(03:00) The conclusion — ViT, thanks to large-scale pre-training and LLM-derived acceleration
(04:00) Swin — windowed attention, shifting windows to approximate convolution
(05:00) ConvNeXt — CNN redesigned in a Transformer style
(06:00) Hiera — pyramidal structure + acceleration claim
(08:00) DINOv2/DINOv3 + ViT-specific pre-training — self-supervised learning approaches fully supervised performance
(08:50) The Hiera paper's footnote: "not measured with FlashAttention"
(09:30) Re-adding FlashAttention → Hiera's speed edge disappears
(10:24) "We have, in a sense, won" — the ViT camp's victory declaration
(10:40) SAM lineage — SAM (ViT MAE) → MobileSAM (TinyViT) → SAM2 (Hiera) → SAM3 (pre-trained ViT as-is)
(12:00) SAM3's cost — 800 million parameters, 300 ms on a T4 GPU, edge devices not feasible
(12:30) Roboflow RF100-VL — measures how well a foundation model transfers to downstream tasks
(13:00) RF-DETR — 40x faster than SAM3 at the same accuracy, to be presented at ICLR 2026
(14:00) NAS generates a full family of models from the same pre-trained backbone
(15:01) The three-element prescription — large-scale pre-training + LLM acceleration + compatible NAS
(15:30) Q&A — multimodal (video / image / text) architecture, evaluation of JEPA / V-JEPA

Sources

How Transformers Finally Ate Vision — Isaac Robinson, Roboflow (AI Engineer)

Isaac Robinson

Research Lead, Roboflow / lead developer of RF-DETR
