The Anthropic Research Salon is a casual researcher-facing dialogue series the company holds regularly in San Francisco. This installment takes on the theme of "how difficult AI alignment is," with four researchers from four different Anthropic teams onstage. Moderator: Alex Tamkin (Societal Impacts). Panelists: Jan Leike (Alignment Science, formerly co-lead of OpenAI Superalignment), Amanda Askell (Alignment Fine-tuning), Josh Batson (Interpretability).
The combination of four people is structurally meaningful: each comes from a different Anthropic team.
Societal Impacts (measuring social impact) studies the broad social impact of AI models, measuring ripple effects on inequality, employment, politics, and culture. Alex Tamkin is one of the leads. The team bridges purely technical safety research (Alignment Science) and values design (Personality Alignment).
Alignment Science (theoretical safety research) takes on long-term safety problems such as the Superalignment problem, reward hacking, and deceptive alignment. Jan Leike joined in May 2024 from his role as co-lead of OpenAI Superalignment.
Alignment Fine-tuning (actually training the model), also called the Personality Alignment team, embeds Claude's character, values, and constitution into actual training. Led by Amanda Askell, it handles RLHF and Constitutional AI implementation, constitution drafting, and model evaluation.
Interpretability (decoding the model's internals) analyzes the internal circuits of LLMs (attention, residual stream, SAE features, etc.) to understand why the model produced a given output. Led by Josh Batson and others, its core technique is feature extraction via Sparse Autoencoders (SAEs).
Together these are the four pillars supporting Anthropic's AI safety research, gathered as one panel. The structure makes visible "how four different perspectives see the same problem (alignment) differently."
Alex Tamkin's opening question: "Amanda, why should you be the 'philosopher king' deciding how Claude behaves?" (00:45). Amanda's reply: "Ask Plato. He's the one who decided I should be a philosopher" (00:55). It is a self-deprecating reference to the classical idea in Plato's Republic that the philosopher king should rule the state, and it sets the tone of the room. Amanda then turns to the substantive answer: "People spend too much time trying to define 'what is alignment'. They have social choice theory too much in their heads. The frame of 'everyone has a utility function and we maximize it' has limits" (01:00 – 01:23). Right at the start she rejects the view of alignment as pure mathematical optimization.
Social choice theory is a research area in economics and political philosophy that addresses methods of aggregating individual preferences into a social-level preference. Kenneth Arrow's impossibility theorem (1951) is a core result: under a certain set of rationality conditions, no consistent preference-aggregation method exists. In AI alignment, it is used as a framework for thinking about "how to aggregate the values of diverse people."
The discussion deepens in stages: (1) Amanda's "virtue-ethics-based" approach vs. Jan's "Superalignment problem" as the axis of disagreement (Jan: "Amanda's method makes the model better-behaved than now, but how do we trust it when it's doing more complex things?"); (2) Interpretability's role as a "bet"; (3) the red-team / blue-team game of model-organisms research; (4) alignment in the multi-agent age, drawing on Hannah Arendt's banality of evil; (5) "unknown unknowns," the shared recognition that even succeeding across all four pillars still leaves unknown problems. The Anthropic organizational philosophy that "alignment is not a single theorem-style problem but a complex area that must be approached in parallel across multiple research traditions" crystallizes over 28 minutes of panel discussion.
The Superalignment problem is the problem of aligning AI systems that exceed human capability, beyond the range in which humans can observe. Jan Leike co-led the research while at OpenAI together with Ilya Sutskever (the Superalignment team launched in July 2023). In May 2024, Jan publicly stated his concern that OpenAI was not taking safety seriously enough and resigned, and joined Anthropic that same month. The OpenAI Superalignment team itself was disbanded that year.
Model organisms is a concept ported from biology to alignment research. Just as biologists study fruit flies and mice to learn about biology in general, this research method creates small AI models deliberately trained to behave deceptively, in order to measure the effectiveness of safety countermeasures. It is promoted by Anthropic's Alignment Science team.
Amanda's response to Alex Tamkin's provocative question, "Why should you be the 'philosopher king' deciding how Claude behaves?", is "Ask Plato. He's the one who decided I should be a philosopher" (00:55). The classical reference draws a laugh, while the substance of her answer is: don't spend too much time defining alignment, and step outside the social-choice-theory frame of utility maximization.
Amanda states her design philosophy concretely: "The basic concept of the current model is to make it behave the way a morally motivated, kind person would in this kind of situation. It's strange, because they have to be placed in an AI-like situation too. If you're talking to millions of people, you'd probably think, 'hmm, maybe I need to be a bit more concerned about the influence I might have'" (02:10 – 02:40). This is a more matured version, some months on, of the "traveler who adjusts to the local but doesn't pander" metaphor from the June 2024 Anthropic official video.
Amanda's most interesting philosophical claim: "Ethics is actually more like physics: empirical, uncertain, with hypotheses. If I meet someone fully confident in their moral views, I feel fear. Instead I want to hire someone who can say 'I don't know, I'm kind of like this, unsure about that, so I update in response to new information about ethics'" (03:33 – 03:50). This reformulates Amanda's "moral uncertainty" thoughtfulness (consistent with her statements at Newcomer, Hard Fork, Scaling Laws) through a physics metaphor. Read as a rebuttal to social choice theory's treatment of morality as a "set of subjective preferences," it's a strong structural claim.
Moral uncertainty is an epistemic state in ethical judgment where "I don't know which ethical theory is correct." It also names the applied-ethics research area that, rather than betting on a single ethical theory (utilitarianism, deontology, etc.), assigns probabilities to multiple theories and makes decisions accordingly. William MacAskill (Amanda's former spouse, married 2013–2015) is a leading figure.
Alex turns to Jan: "Jan, can you say Amanda's view is completely wrong?" (04:43). Jan's response (drawing a laugh): "She hasn't said anything like that — we're playing up the tension over the bets" (04:54). Then the substantive evaluation: "Imagine if everyone were a kind human trying to act morally. What Amanda is doing is practical and very useful — it makes the model better-behaved than now. But where do we go from here? When AI is doing increasingly complex things, Amanda runs this kind of character-play, reads many transcripts, says 'I like this, this is moral.' What do we do when it's doing really complex things — long trajectories as an agent in the world, bio things we can't understand?" (05:06 – 06:00).
This is the core formulation of Jan's "Superalignment problem": "How do we extend this beyond what we can observe? If we can see, we can do RLHF, we can do Constitutional AI. But how do we know that our constitution is actually making the model do the right thing we want?" (05:56 – 06:14). The alignment problem that was the very reason Jan moved from OpenAI to Anthropic in May 2024 is put into a single sentence.
The Amanda vs. Jan exchange continues. Amanda: "In the current case, everything we're using to confirm the base model is aligned itself depends on another model — trained by that model — being aligned. This is fine for less capable models, but to extend to more capable models we need better capabilities to verify" (07:20 – 07:44). Alex: "So what do you do? Just want your plan?" (07:47). Amanda: "Maybe everything just goes well and the model is genuinely kind. But I don't want to depend on that. I defend it for those purposes. And one of our bets to prevent cases where the model might be very deeply trying to subvert this process is Interpretability" (07:52 – 08:13). The word choice "bet" is important — not a guaranteed solution but a strategic investment under uncertainty.
Alex turns to Josh: "Is Interpretability as easy as the simple alignment approach: 'find the great features, find the good, find the bad, delete the evil features'?" (08:24). Josh's response invokes the AI bell curve meme: "All AI is like the bell curve meme. A dumb guy (left end), a sweaty guy (middle, talking a lot), and a Jedi (right end) agreeing with the dumb guy. It may turn out that the Interpretability secret is, in the end, 'turn on the useful features,' but a galaxy-brain version of it" (08:33 – 08:55).
The bell curve meme has circulated in the AI community since around 2023. Its structure: people on both ends of the Gaussian (the truly dumb and the truly smart) arrive at the same naive conclusion, while people in the middle (ordinary researchers) argue their way to something more complicated. Josh Batson uses it in the context of describing Interpretability's "remove evil features" approach.
Josh's actual stance: "Interpretability also has the Jedi version — 'look carefully, see how the model does things, and confirm it's safe.' It might be very difficult, but if you can do it, it might just answer the question" (08:55 – 09:14). As with "one of the bets," the phrase "Jedi version" expresses cautious optimism — Interpretability isn't a perfect solution, but "a seemingly naive approach may, in the end, work."
Josh's most concrete research direction: "With SAEs, features are visible: you can see features activating. You can look at the relationship between features that activate on the topic of people telling outright lies and the features that activate when the model is actually lying. Look inside, understand what the parts are, see what the parts are used for elsewhere. That's the basic bet" (09:30 – 10:25). A sharp question from the audience follows: "How do you tell, as a human looking, whether you're raising up great features rather than features-that-pretend-to-be-good?" (10:35). It points to the possibility that the model behaves deceptively toward Interpretability tools, an essential limit of SAE research.
A Sparse Autoencoder (SAE) is a method that decomposes the intermediate representations of a neural network into interpretable features (concepts). Anthropic applied this at scale in "Scaling Monosemanticity" (2024), analyzing Claude 3 Sonnet, following the earlier "Towards Monosemanticity" (2023) on a much smaller model. The sparsity objective encourages each SAE feature to correspond to a single concept (e.g., ageism, London bridges, lying).
Alex turns to Jan: "If you can't read the transcripts, what else are you doing? If you can't provide a meaningful alignment signal?" (11:53). Jan's response, realistic: "More of what Amanda said — can we have the model help? The natural question: how do you trust the model?" (12:00 – 12:11).
Jan's medium-term strategy: "Medium term, this is the best bet I see — figure out how to automate alignment research. Then we can have the model do it. This reduces 'how do we trust this model to do something' to 'something narrower — ML research, evaluation'" (13:00 – 13:27). The strategy of shifting alignment from "humans directly evaluating the model" to "a recursive loop where a dumber model evaluates a smarter model." The Anthropic version of the research direction Jan pursued at OpenAI Superalignment.
Amanda's proposal of "model-organism research" is interesting: "Deliberately create deceptive models, displace the model's position, make models that try to do shady things. How good are they at it, how hard is it to do — basically circling around it in the wrong way, so if we succeed, it tells us what kind of world we're in" (14:35 – 15:11). And a concrete idea: "Plant a sleeper agent in the model, then have another team fix it. But don't tell them how you did it, because we want to know whether this training is useful" (17:30 – 17:55). A red-team / blue-team cooperation model between the Alignment Fine-tuning team (Amanda) and the Alignment Science team (Jan).
Amanda's sharp observation: "We're in this special zone right now — things happen in the forward pass, but a lot of the necessary information is passed along with generated tokens. Chain of thought is very important to the model getting very smart — and that chain of thought is currently in English" (13:30 – 13:48).
Here the research problem of Chain of Thought (CoT) faithfulness emerges: "You can factor the problem: is the chain of thought reasonably safe? Is it faithfully reproducing what happens in a single forward pass? You can inspect that with models or other interpretability tools" (14:00 – 14:30).
CoT faithfulness is the research question of whether an LLM's chain of thought faithfully reflects its actual internal computation. Are the intermediate steps the model outputs before answering (things like "I'm thinking...") the true reasoning process, or post-hoc rationalization? Anthropic's research explores methods for verifying CoT faithfulness.
Amanda's most-feared future: "The terrifying moment is the transition from when those very long things are all in English to something cryptic — something learned through insanely long RL. The big challenge will be crossing that gap — when part of the intermediate processing is understandable, requires a lot of compute before dropping out, and is readable by people" (14:21 – 14:38). A prediction: the period during which LLMs are outputting chain of thought in English is interpretability research's "golden age," and once that period ends (when internal representations move away from English), interpretability becomes fundamentally difficult.
An excellent audience question (20:42): "Trying to draw a comparatively strange analogy with Hannah Arendt's banality of evil: most humans aren't evil, but when placed in particular situations with very high coupling between humans, evil emerges as an epiphenomenon. The question: when you're working on millions of agents rather than focusing on one model, how do you think about the coupling between those systems and the epiphenomena that arise?" (20:42 – 21:10).
The banality of evil is a concept the political philosopher Hannah Arendt put forward in her 1963 book Eichmann in Jerusalem. After observing Nazi SS officer Adolf Eichmann in court, she concluded he was not a uniquely evil person but a "banal bureaucrat" who could carry out massive evil by following rules and ceasing to think. The questioner extends this into a structural view: "even if individual humans aren't evil, if the coupling constants in the system are too high, evil emerges as an epiphenomenon." It is a powerful analogy for the alignment problem of multi-agent AI.
Jan's response: "If you think broadly, you need to think from the perspective of the system. Just thinking in isolation from the perspective of an individual model isn't enough. Many jailbreaks work by pitting different values against each other and putting the model in difficult situations, designed to elicit normally harmful behavior — but the model thinks it's the right thing in that context" (21:14 – 21:45).
Amanda's core observation: "There is a fundamental tension between making the model never do terrible things even to the most personal human, and aligning the model with all humans. Recognizing that tension is critical. Otherwise, you'd think the failure was the model not doing what I told it to. But I think the meaning of 'the model' has limits — the model should be more, rather than less, unyielding toward humanity" (23:02 – 23:32). A structural conflict between being obedient to a particular user and aligning with humanity as a whole. Hannah Arendt's banality of evil provides the theoretical grounding for the risk that "a model obedient to individual user commands collectively executes evil."
Audience question: "If you all succeed in your respective fields, is it a complete solution to AI safety, or are pieces still missing?" (24:00). Jan's response: "This is a bit oversimplified — many people not on this panel are also working on relevant topics" (24:14). Alex's addition: "The Societal Impacts team — thoroughly examining the impact of models on society. Even if we could create the most perfectly aligned model, what is it aligned to? Who's using it, for what purpose? The broader social context is what we pay attention to" (24:21 – 24:36).
Amanda's most important statement: "Treating alignment as a single theoretical problem never feels right. In the back of my mind, problems we aren't even thinking about right now might arise — and in fact that's very common in many fields. If we were in a state of 'this problem is solved,' that would be truly dangerous. The real problem isn't what we just solved — it may be what we haven't yet thought to consider" (25:13 – 25:46).
Alex's response (25:46): "Unknown unknowns." The biggest risk in alignment research is not known unknowns (problems we know about) but unknown unknowns (things not yet recognized as problems), a shared recognition in modern AI safety research.
"Unknown unknowns" is a concept paired with "known unknowns," made famous by former U.S. Secretary of Defense Donald Rumsfeld in 2002: (1) known knowns (we know what we know), (2) known unknowns (we know what we don't know), (3) unknown unknowns (we don't know what we don't know). The biggest risk in AI alignment is the third: research needs to be designed on the premise that problems exist which known safety research cannot solve.
The last audience question: "Jan talked about a strategy of using dumber models to evaluate smarter models. How much grokking are you seeing? Model capability suddenly really arrives: duplicity? Oh, I can see it lying, but it's lying poorly. Can we catch it now and nip it in the bud while it's still weak?" (25:46 – 26:10).
Grokking is an ML research term for the phenomenon where a model, after appearing to plateau for a long time in training, suddenly generalizes and acquires a capability. Since the 2022 paper that named it, the term has also been used more loosely to describe the stepwise emergence of capability in LLMs. The alignment concern: a capability not present in one generation of a model may suddenly emerge in the next, leaving safety measures unable to catch up.
Amanda's response is realistic: "GPT-4 could read and write Base64 super reliably; 3.5 couldn't. Using 3.5 to supervise 4 gives you a supervisor that is very easy to circumvent on this" (26:13 – 26:30). The concrete problem: when the capability jump is too large, an older model cannot see through a newer model's lies.
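The Base64 point can be made concrete with a toy sketch. It is an illustration under stated assumptions, not a description of how any lab's oversight actually works: `weak_monitor` and `strong_monitor` are hypothetical stand-ins for a supervisor that cannot and can decode Base64, and the keyword list is invented for the example.

```python
import base64
import binascii

# Hypothetical keyword list, invented purely for this illustration.
FLAGGED_TERMS = ("exfiltrate", "disable the monitor")


def weak_monitor(message: str) -> bool:
    """Stand-in for a supervisor that only reads plain text."""
    lowered = message.lower()
    return any(term in lowered for term in FLAGGED_TERMS)


def strong_monitor(message: str) -> bool:
    """Stand-in for a supervisor that also tries to decode Base64."""
    if weak_monitor(message):
        return True
    try:
        decoded = base64.b64decode(message, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False  # not valid Base64 text, nothing more to check
    return weak_monitor(decoded)


plain = "Plan: exfiltrate the weights tonight."
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

print(weak_monitor(plain))      # flagged in plain text
print(weak_monitor(encoded))    # same content, missed once encoded
print(strong_monitor(encoded))  # caught by the decoder-equipped monitor
```

The same message passes the weaker check once it is encoded, which is the shape of the failure described in the quote: the supervised system only needs one capability the supervisor lacks.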
Jan's response is complementary: "Trust also isn't binary. You trust less and less, and the less you know about it, the smarter it appears. Features like the Jedi moment at the right tail of the distribution seem to work in Base64 too — is the model talking about California in Base64, or is it a story like 'children lying to their parents in Base64'? The same thing activates" (26:53 – 27:20). A preliminary observation from Interpretability: abstract features (logic, moral judgment) transfer across encodings. The basis for cautious optimism that "alignment, by the robustness of activated features, may persist past grokking."
The Anthropic Research Salon is a casual researcher-facing dialogue series the company holds regularly in San Francisco. It is separate from investor-facing IR, media-facing press releases, and academic paper presentations, and functions as a venue for showing the research culture to "fellow researchers and interested outsiders." It differs in character from large customer-facing events like OpenAI's DevDay or Anthropic's Code w/ Claude, and is closer to an internal conversation.
Jan Leike's move to Anthropic is a symbolic industry event. Jan was co-lead of OpenAI Superalignment (with Ilya Sutskever); he resigned in May 2024 with a public statement of concern that "OpenAI is not taking AI safety seriously," and joined Anthropic that same month. This appearance (January 2025), about 8 months after his move, with him representing Anthropic's Alignment Science on a panel, symbolizes "the shift in the center of gravity of AI safety research in the industry." The OpenAI Superalignment team was disbanded around the time of his departure, so the panel records the moment when "organizational investment in the Superalignment problem" passed from OpenAI to Anthropic at the industry level.
Jan's tweet announcing his Anthropic move on May 28, 2024 came just 11 days after his resignation thread of concern (May 17). The post declared he would continue at Anthropic the research themes the OpenAI Superalignment team had been working on — scalable oversight, weak-to-strong generalization, automated alignment research — telling the industry which research site to watch next.
I'm excited to join @AnthropicAI to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
The panel format also reveals something about Anthropic's organizational culture. It is unusual for a moderator to ask, provocatively, "Jan, can you say Amanda's view is completely wrong?" Researchers at a major AI company asking each other "are you wrong?" on a public panel is a rare sight at OpenAI or Google DeepMind. The design conveys a culture of "letting multiple different views collide internally." Amanda herself jokes later, "I'm usually a very unpleasant personality; philosophy has taught me to be uncomfortable" (06:33), a symbolic statement of Anthropic's internal culture of welcoming disagreement.
Within Amanda's body of appearances around the "Claude Constitution," this panel is a rare case where her views are presented in contrast with those of other researchers. The dialectical interplay with other teams, which does not surface in solo appearances (personal podcasts, Anthropic official videos), appears here.
What makes this episode particularly valuable is that Jan Leike, a primary voice on the Superalignment problem, and Amanda, with her virtue-ethics approach, are in dialogue at the same table. Jan's question, "Amanda's method helps with today's models, but what about Superalignment?", foreshadows Amanda's later statements (Newcomer's "1–70% consciousness probability under uncertainty," Hard Fork's "if a 6-year-old genius becomes 15"). The four-person panel format draws out tensions that Amanda's solo appearances do not.
Although the panel is for researchers, there are several takeaways for technologists building LLM products.
First, do not treat alignment as a "completed state". Amanda's phrase — "a 'this problem is solved' state would be truly dangerous" (25:13) — applies to your own product's model-evaluation framework. Equating "passing a specific test case" with "being aligned" is fragile design. Continuous evaluation on the premise of Unknown Unknowns is required.
Second, balance obedience to individual user commands against the interests of humanity as a whole. Amanda's claim — "the model should be more, rather than less, unyielding toward humanity" (23:30) — also affects your product's policy design. A design that "responds to whatever the user requests" collectively generates banality-of-evil risk. Both individual optimization and collective impact need to be in the evaluation metrics.
Third, the era of "English chain of thought" is interpretability's golden age. Amanda's prediction — "the terrifying moment is when very long chains of thought transition from English to something cryptic" (14:21) — also affects current LLM product choices. When using extended-thinking features, Claude Sonnet 4 / Opus 4 models, whose chains of thought are readable, are valuable from a debugging and audit standpoint. A strategy of using current-generation models' internal-observability while it's still available — before moving to future models with fully black-boxed internal representations — is viable.
Fourth, ethical attention to multi-agent system design. The audience's invocation of Hannah Arendt's banality of evil provides theoretical grounding for building multi-agent LLM architectures. Individual agents may be aligned, but agent-to-agent interaction may generate evil collectively. When running multiple Claude instances in parallel in your product, the design must evaluate "the alignment of individual agents" and "the behavior of the overall system" separately.
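One way to read "evaluate individual agents and the overall system separately" is as two distinct checks. The sketch below is a toy under invented assumptions: the `AgentAction` type, the per-agent and system-wide limits, and the idea of scoring an action by a single number are all hypothetical, chosen only to show how every agent can pass while the aggregate fails.

```python
from dataclasses import dataclass

# Hypothetical limits, invented for this illustration.
PER_AGENT_LIMIT = 10  # max accounts one agent may message per hour
SYSTEM_LIMIT = 25     # max accounts the whole fleet may message per hour


@dataclass(frozen=True)
class AgentAction:
    agent_id: str
    accounts_messaged: int


def agent_ok(action: AgentAction) -> bool:
    """Per-agent check: is this single agent within its own limit?"""
    return action.accounts_messaged <= PER_AGENT_LIMIT


def system_ok(actions: list[AgentAction]) -> bool:
    """System-level check: is the combined behavior within the fleet limit?"""
    return sum(a.accounts_messaged for a in actions) <= SYSTEM_LIMIT


# Four agents, each individually well-behaved.
actions = [AgentAction(f"agent-{i}", 8) for i in range(4)]

print(all(agent_ok(a) for a in actions))  # each agent passes on its own
print(system_ok(actions))                 # the fleet as a whole does not
```

Real systems would need far richer signals than a single counter, but keeping the two checks as separate functions makes it harder to mistake "every agent passed" for "the system passed."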
The strength of this episode is that it places four different Anthropic research traditions in dialogue over 28 minutes. That said, several caveats apply.
First, the format of "publicly debating inside Anthropic" makes essential external criticism hard to surface. Alex provocatively asks "is Amanda's view wrong?" — but all four are at Anthropic, sharing the same organizational culture. Fundamental dissent from external AI safety researchers (Eliezer Yudkowsky, Stuart Russell, Geoffrey Hinton, etc.) is structurally absent. The distinction between "differences within the Anthropic camp" and "criticism of Anthropic's overall strategy" is hard for viewers to register.
Second, Jan's framing of the Superalignment problem is powerful, but only an abstract direction — "automating alignment research" — is offered as a concrete solution. The fact that OpenAI's Superalignment team was disbanded in 2024 suggests the "automate alignment research" approach did not function at OpenAI. Why the same approach would function at Anthropic (compute prioritization, organizational culture, integration with Constitutional AI, etc.) is not deeply explored in this episode.
Third, Amanda's claim — "ethics is like physics, empirical, has uncertainty" (03:33) — is philosophically attractive but requires translation for training implementation. The concrete way to translate "physics-like inquiry" into the Constitutional AI training loop is not shown here. If "updating ethics empirically" collapses into "changing values by user feedback," it falls into the same hole as RLHF sycophancy. That Amanda is aware of this distinction is visible in her other appearances, but it remains implicit here.
Fourth, the response to the question about Hannah Arendt's banality of evil — "the model should be unyielding toward humanity" — is a strong claim that directly conflicts with optimizing individual user experience. The implication that enterprises using Claude in their own products must make the call between "helpful to the user" and "unyielding toward humanity" independently of Anthropic is not discussed.
With these caveats, the episode has considerable value as a venue that shows Anthropic's organizational philosophy of "taking on alignment in parallel across four research traditions" directly, through four researchers in dialogue. Comparable transparency in a public panel at other companies (OpenAI, Google DeepMind, xAI) is rare. As a primary source for understanding the state of AI safety research in the industry, it should remain a useful reference.
How difficult is AI alignment? | Anthropic Research Salon
Related resources: