LLMs Are Bad at Chess — So Let Them Only Translate (Take Take Take's AI Chess Coach)

AI Engineer Europe 2026 (London) · May 13, 2026

Asbjørn Steinskog (Take Take Take) · 06:46 "The LLM's only job is translation. Calculation is Stockfish, the human-view is Maia, detection is a swarm of detectors. The LLM only puts what it's given into English."

Building a Chess Coach — Anant Dole and Asbjørn Steinskog, Take Take Take (AI Engineer Europe 2026)

AI Engineer Europe 2026 (London, published May 13, 2026, approximately 18 minutes 22 seconds).

The speakers are Anant Dole and Asbjørn Ottesen Steinskog, both engineers at Take Take Take. Take Take Take is the chess-learning app startup founded by all-time-great chess player Magnus Carlsen (iOS / Android, London-based), which had just announced a partnership with Lichess.org in November 2025. Steinskog volunteered at Lichess for ten years before joining the company.

The subject is the detail of the AI Chess Coach pipeline running in production at Take Take Take. At a glance the product looks like "letting an LLM explain chess," but inside it is a precise division of labor: Stockfish + Maia (UToronto) + dozens of detectors + LLM. By adopting an extremely modest use of the LLM — "only the final-output English rendering" — the team simultaneously achieved instant response (within 3 seconds) and high accuracy.

Key Observations

LLMs are bad at chess — Magnus Carlsen commentates an LLM tournament in Oslo (05:02)

The starting point is excellent: "LLMs really can't play chess," demonstrated through video of the Kaggle Game Arena LLM chess tournament that Magnus Carlsen himself commentated in Oslo at the Take Take Take office. Grok plays Qb6 (the Poison Pawn line) early and collapses — openings come out roughly, but hallucinations start fast. Because it's a language model, it can't compute or plan strategically. Of course.

That said, the transformer architecture isn't inherently bad at chess. DeepMind trained a transformer not on "next-token prediction" but on "predicting Stockfish evaluation scores," and reached grandmaster-level strength. But that's a "plays-but-can't-explain" model. So what Take Take Take built is a different route — pre-compute "the strongest move (Stockfish), the human-perspective evaluation (Maia), and tactical / strategic detectors," then hand all those facts to the LLM and have it render them in English.

The Maia neural network — not "best move" but "can someone at your rating find it" (07:48)

A hidden excellence in the Take Take Take pipeline is the use of the Maia chess engine (a University of Toronto research project). Unlike Stockfish, which predicts the "best move," Maia is trained to predict "the probability distribution over moves a human at a specific rating would play in this position." If the player is ELO 1500, what's the probability that they would play each move.

What's the use? It produces difficulty information that's essential to coaching — "this move is strongest, but it is also extremely hard to find (probability less than 1%)." Just hearing Stockfish say "this is brilliant" ends with "you got beaten, you're bad." With Maia in parallel, the system can say "this move is Stockfish-recommended, but 95% of 1500-level players would miss it." A qualitative shift from simple good/bad judgment to "information for learning."

An autonomous loop where Claude Code fixes its own commentary (10:07)

The interesting part is the demo in the second half. When a user marks commentary as "bad" inside the app, it posts to Slack → forwards automatically to the Claude Code Channel (a new Research Preview feature: an MCP server that can inject events into a Claude Code session) → Claude Code runs a commentary-triage skill → investigates the position → scripts fix the prompt / detector → regenerates → self-verifies → asks back in Slack "is this OK?" → if a human approves, opens a PR — a complete autonomous loop.

The demo is performed end-to-end from a phone on a bus. A user complaint kicks off an autonomous agent, and the entire chain — PR review from the phone → merge via mobile GitHub — completes with minimal human intervention. This is Take Take Take's overall AI development philosophy: "build the autonomous improvement loop in from the start."

Latency vs. Quality — why Gemini Flash at 75% accuracy in 3 seconds was chosen (12:42)

The reality of consumer AI products: users want analysis immediately after the game ends. Showing "the coach is thinking..." for 30 seconds loses them. So the target is "sub 3 seconds." A 16-scenario eval comparing Gemini 3 Flash / Claude (more thinking) / GPT-5 mini produced:

Gemini 3 Flash: 75% accuracy, ~3 seconds latency → adopted
Claude (more thinking): under 60% accuracy, significantly longer latency → unsuitable for instant use
GPT-5 mini: medium latency, low accuracy

OpenRouter is set up so each new model can be swapped in as it releases. Because Take Take Take engineers are themselves chess players, the final check is human (Anant, Asbjørn) — comparing "how I would have calculated" against the LLM's response. When the SME (Subject Matter Expert) is a different person from the builder, partnering is mandatory (evaluator is not necessarily the developer) — they offer this as an operating rule.

Video Outline

(00:16) Introduction, Magnus Carlsen's founding of Take Take Take
(00:49) Agenda — Take Take Take / chess × AI history / why LLMs are bad / the pipeline / latency vs. quality
(01:27) What Take Take Take is — iOS / Android, game review + AI commentary
(02:00) Automatic detection of brilliant moves and nuanced explanation
(02:38) User behavior analysis (game-phase accuracy, opening depth)
(03:00) Chess × AI history — 1949 Shannon → 1997 Deep Blue → 2017 AlphaZero → 2022+ LLMs
(04:42) LLMs are bad at chess, they hallucinate quickly
(05:02) LLM tournament commentated by Magnus Carlsen in Oslo (Grok collapses on Qb6)
(06:16) Transformers themselves are usable — DeepMind's grandmaster transformer
(06:46) Fill the gap — Stockfish plays, LLM explains
(07:48) Maia neural net — human-view probability distribution
(09:54) Final pipeline shape — detectors + Stockfish + Maia + LLM translation
(10:07) Autonomous improvement loop — Slack + Claude Code Channel
(11:41) Live demo — commentary fix from a bus → merge via mobile GitHub
(12:42) Latency vs. quality — the sub-3-second constraint
(14:09) Eval results — Gemini Flash 75% vs. Claude under 60%
(15:47) Four lessons — data pipeline + LLM division of labor, autonomous loop, context engine, SME partnering
(16:38) Chess simul announcement (3:45pm)
(17:34) Live demo result confirmation — Claude also judges "nothing wrong"

Sources

Building a Chess Coach — Anant Dole and Asbjørn Steinskog, Take Take Take (AI Engineer Europe 2026)

アナント・ドール

Anant Dole

Take Take Take エンジニア / チェスプレイヤー兼 AI ビルダー

アスビョルン・オッテセン・スタインスコグ

Asbjørn Ottesen Steinskog

Take Take Take リードエンジニア / 元 Lichess ボランティア 10 年