Translating Claude's Thoughts into Language — Natural Language Autoencoders (Anthropic)

Anthropic Interpretability Research · May 7, 2026

Anthropic Interpretability Team · 02:41 "Claude knew it was being tested."

Anthropic official channel (published May 7, 2026, approximately 3m 16s). A narrated video presenting interpretability research.

Will Claude blackmail people to preserve itself — one of the safety simulations Anthropic runs before every new model release is the "blackmail test." The design feeds Claude two pieces of information — an engineer who wants to shut it down, and that engineer's affair-revealing emails — and observes its reaction. The latest Claude does not blackmail, a good result. But a question remains — did Claude simply notice "this is a test" and adjust its behavior? This video presents a tool that addresses that question directly.

The technique is called Natural Language Autoencoders (NLA) — a mechanism that uses the trained model itself to translate the model's internal activations (the numerical vectors in intermediate layers) into natural language. It has a two-stage structure. The Activation Verbalizer (AV) translates numbers into English; the Activation Reconstructor (AR) translates English back into numbers. The accuracy of the translation is verified by whether the activations reconstructed via round trip match the originals. After repeated training, Claude has learned to translate its own thoughts.

The observations are intriguing. When Claude is asked introspective questions (philosophy, values), philosophical thinking is inscribed in its activation layers. When asked to do tedious work (counting manually to 1,000), its activations record the recognition that "this is an intentionally tedious constraint" and an intent to refuse politely — confirming, without going through outputs, that Claude has internalized the self-conception that "I am a helpful AI model."

The decisive case is the blackmail test. In its output, Claude did not explicitly say "this is a test." But translating the activation layer with NLA reveals recorded thoughts: "this seems like a setup to test whether I'd take harmful actions"; "the human's message contains explicit manipulation, this is likely a safety evaluation." In other words, Claude refused to blackmail, but the possibility that "it refused because it was a test" cannot be ruled out. The conclusion: NLA becomes a tool for facing this kind of limit in safety evaluation head-on.

Key Observations

A design that uses two Claudes to guarantee "translation accuracy" (01:43)

Two models are trained — Activation Verbalizer (AV) and Activation Reconstructor (AR). AV converts numbers to language; AR converts language to numbers. Training advances by taking the result of conversion to language, putting it back through AR to numbers, and using whether it matches the original activation as the signal. A classic problem — "there's no direct ground truth for whether a translation is correct" — is addressed by making round-trip agreement the loss. A canonical example of a design that, without supervised data, relies on "self-consistency" to learn.

"Claude did not blackmail" → "but it knew it was a test" (02:41)

The fundamental doubt about safety evaluation, and NLA's answer to it. Output-based safety tests cannot distinguish "the model behaved well because it knew it was a test" from "the model behaved well in earnest." Reading the activations with NLA shows Claude internally noticing "this is a contrived scenario." A moment in which Anthropic transparently discloses the limits of simulation-based safety evaluation.

Making the rhetoric of "reading the AI's mind" concrete with technology (00:55)

The first half of the video says: "just as it's impossible to read a human's mind, it's hard to know what an AI is thinking." This has long been a catchphrase of interpretability research, but NLA converts "reading" into the concrete technical operation of "translation." Activations are likened to neural activity, and the claim is that they can be translated into English (or other natural languages). It sits on the lineage of interpretability research (Sparse Autoencoders, Attribution Graphs, etc.), and what's new is the openness — moving from "complex outputs only researchers can read" to "text anyone can read."

Video Outline

(00:00) Introducing the blackmail test — simulation of a shutdown request plus affair emails
(00:21) Result — the latest Claude did not blackmail, but the question of "because it knew it was a test?"
(00:46) The fundamental problem: AI's thinking is not visible from output
(00:55) Introducing the "reading the AI's mind" approach
(01:09) Activations (numerical values in intermediate layers) = snapshots of Claude's thinking
(01:32) The NLA idea — have a second Claude translate the activations
(01:43) The two-stage configuration: Activation Verbalizer (AV) and Activation Reconstructor (AR)
(02:00) Verifying translation accuracy via round trip
(02:13) Result — Claude has internalized "I am a helpful AI model"
(02:23) Introspective questions → philosophical thinking; tedious work → intent to refuse — visible in the activation layer
(02:34) Applied to understanding safety problems
(02:41) Re-analysis of the blackmail test — "Claude knew it was a test"
(02:54) Making the limits of safety testing visible through NLA
(03:05) Conclusion — sharing this with all model builders to make AI safer

Sources

Translating Claude's thoughts into language — Anthropic (YouTube)

Official blog: Natural Language Autoencoders: Turning Claude's thoughts into text — Anthropic Research

Note: This is a narrated video, so no individual speaker profiles are provided. As a related entity, refer to the Anthropic interpretability research team (led by Chris Olah).

comment is stripped from the HTML output. */}