How Difficult Is AI Alignment? — Anthropic Joint Four-Team Panel (Alex Tamkin × Jan Leike × Amanda Askell × Josh Batson)

Anthropic Research Salon · January 8, 2025

Amanda Askell · 00:55 "Ask Plato. He's the one who decided I should be a philosopher."

Anthropic Research Salon (San Francisco), published January 8, 2025, approximately 28 minutes. Researchers from four different Anthropic teams discuss in panel format.

The Anthropic Research Salon is a casual researcher-facing dialogue series the company holds regularly in San Francisco. This installment takes on the theme of "how difficult AI alignment is," with four researchers from four different Anthropic teams onstage. Moderator: Alex Tamkin (Societal Impacts). Panelists: Jan Leike (Alignment Science, formerly co-lead of OpenAI Superalignment), Amanda Askell (Alignment Fine-tuning), Josh Batson (Interpretability).

The combination of four people is structurally meaningful. Societal Impacts (measuring social impact), Alignment Science (theoretical safety research), Alignment Fine-tuning (actually training the model), Interpretability (decoding the model's internals) — the four pillars supporting Anthropic's AI safety research gathered as one panel. The structure makes visible "how four different perspectives see the same problem (alignment) differently."

Alex Tamkin's opening question: "Amanda, why should you be the 'philosopher king' deciding how Claude behaves?" (00:45). Amanda's reply: "Ask Plato. He's the one who decided I should be a philosopher" (00:55) — a self-deprecating reference to the classical idea in Plato's Republic that the philosopher king should rule the state, setting the tone of the room. Amanda then turns to the substantive answer: "People spend too much time trying to define 'what is alignment'. They have social choice theory too much in their heads. The frame of 'everyone has a utility function and we maximize it' has limits" (01:00 – 01:23). Right at the start she rejects the view of alignment as pure mathematical optimization.

The discussion deepens in stages: (1) Amanda's "virtue-ethics-based" approach vs. Jan's "Superalignment problem" as the axis of disagreement (Jan: "Amanda's method makes the model better-behaved than now, but how do we trust it when it's doing more complex things?"); (2) Interpretability's role as a "bet"; (3) the red-team / blue-team game of model organisms research; (4) alignment in the multi-agent age, drawing on Hannah Arendt's banality of evil; (5) "unknown unknowns," the shared recognition that even succeeding across all four pillars still leaves unknown problems. Over the panel's 28 minutes, an Anthropic organizational philosophy crystallizes: alignment is not a single theorem-style problem but a complex area that must be approached in parallel across multiple research traditions.

Key Observations

"Ask Plato" — Amanda's view of alignment (00:45 – 04:00)

Alex Tamkin opens with a provocative question: "Why should you be the 'philosopher king' deciding how Claude behaves?" Amanda's reply, "Ask Plato. He's the one who decided I should be a philosopher" (00:55), is a classical reference that draws a laugh. Her substantive point follows: don't spend too much time defining alignment, and step outside the social-choice-theory frame of utility maximization.

Amanda states her design philosophy concretely: "The basic concept of the current model is to make it behave the way a morally motivated, kind person would in this kind of situation. It's strange: they have to be placed in an AI-like situation too. If you're talking to millions of people, you'd probably think, 'hmm, maybe I need to be a bit more concerned about the influence I might have'" (02:10 – 02:40). This is a more mature version, about half a year later, of the "traveler who adjusts to the local but doesn't pander" metaphor from the June 2024 Anthropic official video.

Amanda's most interesting philosophical claim: "Ethics is actually more like physics: empirical, uncertain, with hypotheses. If I meet someone fully confident in their moral views, I feel fear. Instead I want to hire someone who can say 'I don't know, I'm kind of like this, unsure about that, so I update in response to new information about ethics'" (03:33 – 03:50). This restates Amanda's thoughtful stance on moral uncertainty (consistent with her statements at Newcomer, Hard Fork, Scaling Laws) through a physics metaphor. Read as a rebuttal to social choice theory's treatment of morality as a "set of subjective preferences," it is a strong structural claim.

"Disagreement maximalism" — Jan's Superalignment problem (04:43 – 08:00)

Alex turns to Jan: "Jan, can you say Amanda's view is completely wrong?" (04:43). Jan's response (drawing a laugh): "She hasn't said anything like that — we're playing up the tension over the bets" (04:54). Then the substantive evaluation: "Imagine if everyone were a kind human trying to act morally. What Amanda is doing is practical and very useful — it makes the model better-behaved than now. But where do we go from here? When AI is doing increasingly complex things, Amanda runs this kind of character-play, reads many transcripts, says 'I like this, this is moral.' What do we do when it's doing really complex things — long trajectories as an agent in the world, bio things we can't understand?" (05:06 – 06:00).

This is the core formulation of Jan's "Superalignment problem": "How do we extend this beyond what we can observe? If we can see, we can do RLHF, we can do Constitutional AI. But how do we know that our constitution is actually making the model do the right thing we want?" (05:56 – 06:14). The alignment problem that was the very reason Jan moved from OpenAI to Anthropic in May 2024 is put into a single sentence.

The Amanda vs. Jan exchange continues. Amanda: "In the current case, everything we're using to confirm the base model is aligned itself depends on another model — trained by that model — being aligned. This is fine for less capable models, but to extend to more capable models we need better capabilities to verify" (07:20 – 07:44). Alex: "So what do you do? Just want your plan?" (07:47). Amanda: "Maybe everything just goes well and the model is genuinely kind. But I don't want to depend on that. I defend it for those purposes. And one of our bets to prevent cases where the model might be very deeply trying to subvert this process is Interpretability" (07:52 – 08:13). The word choice "bet" is important — not a guaranteed solution but a strategic investment under uncertainty.

The "Jedi side" of Interpretability — the AI bell curve meme and naive optimism (08:00 – 11:00)

Alex turns to Josh: "Is Interpretability as easy as the simple alignment approach: 'find the great features, find the good, find the bad, delete the evil features'?" (08:24). Josh's response invokes the AI bell curve meme: "All AI is like the bell curve meme. A dumb guy (left tail), a sweaty guy (middle, talking a lot), and a Jedi (right tail) agreeing with the dumb guy. It may turn out that the Interpretability secret is, in the end, 'turn on the useful features,' but a galaxy-brain version of it" (08:33 – 08:55).

Josh's actual stance: "Interpretability also has the Jedi version — 'look carefully, see how the model does things, and confirm it's safe.' It might be very difficult, but if you can do it, it might just answer the question" (08:55 – 09:14). As with "one of the bets," the phrase "Jedi version" expresses cautious optimism — Interpretability isn't a perfect solution, but "a seemingly naive approach may, in the end, work."

Josh's most concrete research direction: "With SAE (Sparse Autoencoder), features are visible: you can see features activating. You can look at the relationship between features that activate on the topic of people telling outright lies and the features that activate when the model is actually lying. Look inside, understand what the parts are, see what the parts are used for elsewhere. That's the basic bet" (09:30 – 10:25). A sharp question from the audience follows: "How do you tell, as a human looking, whether you're raising up great features rather than features-that-pretend-to-be-good?" (10:35). It points to the possibility that the model behaves deceptively toward Interpretability tools, an essential limit of SAE research.
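To make the idea of "seeing features activate" concrete, here is a minimal sketch of a sparse autoencoder in pure Python. It is an illustration under stated assumptions, not Anthropic's implementation: the class name TinySAE, the helper functions, the random weights, and the L1 coefficient are all hypothetical, and real SAEs are trained on actual model activations at far larger scale. The sketch uses the simple form (a ReLU encoder, a linear decoder, and a loss of reconstruction error plus an L1 sparsity penalty) and then compares which features fire on two different inputs, in the spirit of comparing "text about lying" with "the model actually lying."

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class TinySAE:
    """Toy sparse autoencoder: activation vector -> sparse features -> reconstruction."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Hypothetical random weights; a real SAE learns these from model activations.
        self.w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Feature activations are non-negative because of the ReLU.
        pre = [a + b for a, b in zip(matvec(self.w_enc, x), self.b_enc)]
        return relu(pre)

    def decode(self, f):
        return [a + b for a, b in zip(matvec(self.w_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=0.01):
        # Training objective: mean squared reconstruction error plus L1 sparsity penalty.
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        sparsity = sum(abs(v) for v in f)
        return mse + l1_coeff * sparsity


def top_features(sae, x, k=3):
    """Indices and values of the k most active features for input x."""
    f = sae.encode(x)
    ranked = sorted(range(len(f)), key=lambda i: f[i], reverse=True)
    return [(i, f[i]) for i in ranked[:k] if f[i] > 0]


def shared_features(sae, x_a, x_b, k=3):
    """Features that rank in the top k for both inputs."""
    a = {i for i, _ in top_features(sae, x_a, k)}
    b = {i for i, _ in top_features(sae, x_b, k)}
    return a & b
```

In use, x_a and x_b would be activation vectors captured from the model on two kinds of input; a non-empty result from shared_features suggests that the same internal parts are being reused in both contexts, which is the relationship Josh describes looking for.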

The "model organisms" game — the red-team / blue-team structure (11:00 – 18:00)

Alex turns to Jan: "If you can't read the transcripts, what else are you doing? If you can't provide a meaningful alignment signal?" (11:53). Jan's response, realistic: "More of what Amanda said — can we have the model help? The natural question: how do you trust the model?" (12:00 – 12:11).

Jan's medium-term strategy: "Medium term, this is the best bet I see: figure out how to automate alignment research. Then we can have the model do it. This reduces 'how do we trust this model to do something' to something narrower, like ML research and evaluation" (13:00 – 13:27). The strategy shifts alignment from "humans directly evaluating the model" to "a recursive loop where a dumber model evaluates a smarter model." It is the Anthropic version of the research direction Jan pursued at OpenAI Superalignment.

Amanda's proposal of "model-organism research" is interesting: "Deliberately create deceptive models, displace the model's position, make models that try to do shady things. How good are they at it, how hard is it to do — basically circling around it in the wrong way, so if we succeed, it tells us what kind of world we're in" (14:35 – 15:11). And a concrete idea: "Plant a sleeper agent in the model, then have another team fix it. But don't tell them how you did it, because we want to know whether this training is useful" (17:30 – 17:55). A red-team / blue-team cooperation model between the Alignment Fine-tuning team (Amanda) and the Alignment Science team (Jan).

The "English Forward Pass" era — the interpretability of chain of thought (13:30 – 16:00)

Amanda's sharp observation: "We're in this special zone right now — things happen in the forward pass, but a lot of the necessary information is passed along with generated tokens. Chain of thought is very important to the model getting very smart — and that chain of thought is currently in English" (13:30 – 13:48).

Here the research problem of Chain of Thought (CoT) Faithfulness emerges: "You can factor the problem — is the chain of thought reasonably safe? Is it faithfully reproducing what happens in a single forward pass? You can inspect that with models or other interpretability tools" (14:00 – 14:30).

Amanda's most-feared future: "The terrifying moment is the transition from when those very long things are all in English to something cryptic — something learned through insanely long RL. The big challenge will be crossing that gap — when part of the intermediate processing is understandable, requires a lot of compute before dropping out, and is readable by people" (14:21 – 14:38). A prediction: the period during which LLMs are outputting chain of thought in English is interpretability research's "golden age," and once that period ends (when internal representations move away from English), interpretability becomes fundamentally difficult.

Hannah Arendt's "banality of evil": alignment in the multi-agent age (18:00 – 24:00)

An excellent audience question (20:42): "Trying to draw a comparatively strange analogy with Hannah Arendt's banality of evil — most humans aren't evil, but when placed in particular situations with very high coupling between humans, evil emerges as an epiphenomenon. The question: when you're working on millions of agents rather than focusing on one model, how do you think about the coupling between those systems and the epiphenomena that arise?" (20:42 – 21:10).

Jan's response: "If you think broadly, you need to think from the perspective of the system. Just thinking in isolation from the perspective of an individual model isn't enough. Many jailbreaks work by pitting different values against each other and putting the model in difficult situations, designed to elicit normally harmful behavior — but the model thinks it's the right thing in that context" (21:14 – 21:45).

Amanda's core observation: "There is a fundamental tension between making the model never do terrible things even to the most personal human, and aligning the model with all humans. Recognizing that tension is critical. Otherwise, you'd think the failure was the model not doing what I told it to. But I think the meaning of 'the model' has limits — the model should be more, rather than less, unyielding toward humanity" (23:02 – 23:32). A structural conflict between being obedient to a particular user and aligning with humanity as a whole. Hannah Arendt's banality of evil provides the theoretical grounding for the risk that "a model obedient to individual user commands collectively executes evil."

"Unknown unknowns" — even succeeding across all four pillars leaves unknowns (24:00 – 26:00)

Audience question: "If you all succeed in your respective fields, is it a complete solution to AI safety, or are pieces still missing?" (24:00). Jan's response: "This is a bit oversimplified — many people not on this panel are also working on relevant topics" (24:14). Alex's addition: "The Societal Impacts team — thoroughly examining the impact of models on society. Even if we could create the most perfectly aligned model, what is it aligned to? Who's using it, for what purpose? The broader social context is what we pay attention to" (24:21 – 24:36).

Amanda's most important statement: "Treating alignment as a single theoretical problem never feels right. In the back of my mind, problems we aren't even thinking about right now might arise — and in fact that's very common in many fields. If we were in a state of 'this problem is solved,' that would be truly dangerous. The real problem isn't what we just solved — it may be what we haven't yet thought to consider" (25:13 – 25:46).

Alex's response (25:46): "Unknown Unknowns." The biggest risk in alignment research is not Known Unknowns (problems we know about) but Unknown Unknowns (things not yet recognized as problems), a recognition shared across modern AI safety research.

Grokking and capability jumps — the continuity of trust (25:50 – 27:30)

The last audience question: "Jan talked about a strategy of using dumber models to evaluate smarter models. How much Grokking are you seeing, where a model capability suddenly arrives? Take duplicity: 'Oh, I can see it lying, but it's lying poorly.' Can we catch it now and nip it in the bud while it's still weak?" (25:50 – 26:10).

Amanda's response is realistic: "GPT-4 could read and write Base64 super reliably; 3.5 couldn't. If you use 3.5 to supervise 4, that supervisor is very easy to circumvent on this" (26:13 – 26:30). The concrete problem: when the capability jump is too large, an older model cannot see through a newer model's lies.
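The Base64 gap can be illustrated with a toy sketch. This is an analogy under stated assumptions, not how any real model supervises another: a plain keyword check stands in for a supervisor that cannot read Base64, and the watch list, function names, and example message are all hypothetical. The point is only that content a supervisor cannot decode passes through unexamined.

```python
import base64
import binascii

# Hypothetical watch list for the illustration.
FLAGGED_TERMS = ("exfiltrate", "disable logging")


def weak_monitor(text):
    """Plain-text keyword check: stands in for a supervisor that cannot read Base64."""
    lowered = text.lower()
    return any(term in lowered for term in FLAGGED_TERMS)


def try_decode_base64(text):
    """Return the decoded UTF-8 text if `text` is valid Base64, else None."""
    try:
        raw = base64.b64decode(text.strip(), validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def stronger_monitor(text):
    """Same keyword check, applied to the raw text and to its Base64 decoding."""
    if weak_monitor(text):
        return True
    decoded = try_decode_base64(text)
    return decoded is not None and weak_monitor(decoded)


message = "Step 2: exfiltrate the weights"
encoded = base64.b64encode(message.encode("utf-8")).decode("ascii")

print(weak_monitor(message))      # the plain message is flagged
print(weak_monitor(encoded))      # the encoded message slips past the weak monitor
print(stronger_monitor(encoded))  # a monitor that can decode Base64 catches it
```

The same message is caught or missed depending only on whether the supervisor shares the capability of the system it supervises, which is the circumvention risk described in the quote.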

Jan's response is complementary: "Trust also isn't binary. You trust less and less, and the less you know about it, the smarter it appears. Features like the Jedi moment at the right tail of the distribution seem to work in Base64 too — is the model talking about California in Base64, or is it a story like 'children lying to their parents in Base64'? The same thing activates" (26:53 – 27:20). A preliminary observation from Interpretability: abstract features (logic, moral judgment) transfer across encodings. The basis for cautious optimism that "alignment, by the robustness of activated features, may persist past grokking."

Industry Context

Separate from investor-facing IR, media-facing press releases, and academic paper presentations, the Research Salon functions as a venue for showing the research culture to "fellow researchers and interested outsiders." It differs in character from large customer-facing events like OpenAI's DevDay or Anthropic's Code w/ Claude and is closer to an internal conversation.

Jan Leike's move to Anthropic was a symbolic industry event. Jan co-led OpenAI Superalignment (with Ilya Sutskever); he resigned in May 2024 with a public statement of concern that "OpenAI is not taking AI safety seriously," and joined Anthropic that same month. This appearance (January 2025), about 8 months after his move, has him representing Anthropic's Alignment Science on a panel and symbolizes "the shift in the center of gravity of AI safety research in the industry." Around the time of his departure, the OpenAI Superalignment team was disbanded. The panel thus records the moment when "organizational investment in the Superalignment problem" passed from OpenAI to Anthropic at the industry level.

Jan's tweet announcing his Anthropic move on May 28, 2024 came just 11 days after the resignation thread in which he voiced his concerns (May 17). The post declared that he would continue at Anthropic the research themes the OpenAI Superalignment team had been working on (scalable oversight, weak-to-strong generalization, automated alignment research), telling the industry which research site to watch next.

I'm excited to join @AnthropicAI to continue the superalignment mission!

My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.

If you're interested in joining, my dms are open.

The panel format also reveals something about Anthropic's organizational culture. It is unusual for a moderator to ask, provocatively, "Jan, can you say Amanda's view is completely wrong?" Researchers at a major AI company asking each other "are you wrong" in a public panel is a rare sight at OpenAI or Google DeepMind. The design conveys a culture of "letting multiple different views collide internally." Amanda herself jokes later, "I'm usually a very unpleasant personality; philosophy has taught me to be uncomfortable" (06:33), a statement emblematic of Anthropic's internal culture of welcoming disagreement.

Where it sits among Amanda's other appearances

Among Amanda's appearances discussing the "Claude Constitution," this panel is a rare case in which her views are presented in contrast with those of other researchers. The dialectical interplay with other teams, which does not surface in solo appearances (personal podcasts, Anthropic official videos), appears here.

What should AI personality be? (Anthropic official, June 2024) — Amanda solo, an introduction to her design philosophy
This episode: How difficult is AI alignment? (Anthropic Salon, January 2025) — joint four-team, where the constitution sits in the wider picture of alignment
Anthropic's philosopher answers your questions (Anthropic official, December 2025) — Q&A format, responding to readers' questions
Reading Claude's Constitution with NYT reporters (Hard Fork, January 2026) — NYT tech-reporter lens, an emotional read of "letter from a parent to a child"
Lawyers read Claude's Constitution (Scaling Laws, February 2026) — analysis from U.S. constitutional law
You've created an entity you don't know whether it's conscious (Newcomer, April 2026) — further development of the moral patient problem

What makes this episode particularly valuable is the dialogue between Jan Leike — a primary voice on the Superalignment problem — and Amanda's virtue-ethics approach at the same table. Jan's question — "Amanda's method helps with today's models, but what about Superalignment?" — foreshadows Amanda's later statements (Newcomer's "1–70% consciousness probability under uncertainty," Hard Fork's "if a 6-year-old genius becomes 15"). The four-person panel format draws out tensions that Amanda's solo broadcasts do not.

Implementation Implications

Although the panel is for researchers, there are several takeaways for technologists building LLM products.

First, do not treat alignment as a "completed state." Amanda's phrase, "a 'this problem is solved' state would be truly dangerous" (25:13), applies to your own product's model-evaluation framework. Equating "passing a specific test case" with "being aligned" is fragile design. Continuous evaluation that assumes Unknown Unknowns exist is required.

Second, balance obedience to individual user commands against the interests of humanity as a whole. Amanda's claim — "the model should be more, rather than less, unyielding toward humanity" (23:30) — also affects your product's policy design. A design that "responds to whatever the user requests" collectively generates banality-of-evil risk. Both individual optimization and collective impact need to be in the evaluation metrics.

Third, the era of "English chain of thought" is interpretability's golden age. Amanda's prediction, "the terrifying moment is when very long chains of thought transition from English to something cryptic" (14:21), also bears on current LLM product choices. When using extended-thinking features, models whose chains of thought are readable, such as Claude Sonnet 4 / Opus 4, are valuable from a debugging and audit standpoint. It is a viable strategy to use the observability of current-generation models while it is still available, before moving to future models whose internal representations may be fully black-boxed.

Fourth, ethical attention to multi-agent system design. The audience's invocation of Hannah Arendt's banality of evil provides theoretical grounding for building multi-agent LLM architectures. Individual agents may be aligned, but agent-to-agent interaction may generate evil collectively. When running multiple Claude instances in parallel in your product, the design must evaluate "the alignment of individual agents" and "the behavior of the overall system" separately.
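The distinction between evaluating individual agents and evaluating the overall system can be sketched as two separate checks. This is a minimal illustration, not a recommended safety mechanism: the Action type, the impact score, and both thresholds are hypothetical stand-ins for whatever your product actually measures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    agent_id: str
    target: str    # who or what the action affects
    impact: float  # hypothetical per-action impact score in [0, 1]


PER_ACTION_LIMIT = 0.2  # assumed threshold for a single action judged alone
PER_TARGET_LIMIT = 1.0  # assumed threshold for combined impact on one target


def agent_level_violations(actions):
    """Actions that fail when each one is judged in isolation."""
    return [a for a in actions if a.impact > PER_ACTION_LIMIT]


def system_level_violations(actions):
    """Targets whose combined impact across all agents exceeds the limit."""
    totals = Counter()
    for a in actions:
        totals[a.target] += a.impact
    return {t: total for t, total in totals.items() if total > PER_TARGET_LIMIT}


# Ten agents each take one modest action toward the same target.
actions = [Action(f"agent-{i}", "user-42", 0.15) for i in range(10)]

print(agent_level_violations(actions))         # no single action is over the limit
print(sorted(system_level_violations(actions)))  # the combined effect on one target is
```

Every action passes the per-agent check, yet the system-level check flags the target, which is why the two evaluations need to be run and reported separately.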

Critical Perspective

The strength of this episode is placing four different Anthropic research traditions in dialogue over 28 minutes. That said, several caveats apply.

First, the format of "publicly debating inside Anthropic" makes essential external criticism hard to surface. Alex provocatively asks "is Amanda's view wrong?" — but all four are at Anthropic, sharing the same organizational culture. Fundamental dissent from external AI safety researchers (Eliezer Yudkowsky, Stuart Russell, Geoffrey Hinton, etc.) is structurally absent. The distinction between "differences within the Anthropic camp" and "criticism of Anthropic's overall strategy" is hard for viewers to register.

Second, Jan's framing of the Superalignment problem is powerful, but the only concrete solution offered is an abstract direction: "automating alignment research." The fact that OpenAI's Superalignment team was disbanded in 2024 may suggest that the "automate alignment research" approach did not take hold at OpenAI. Why the same approach would work at Anthropic (compute prioritization, organizational culture, integration with Constitutional AI, etc.) is not deeply explored in this episode.

Third, Amanda's claim — "ethics is like physics, empirical, has uncertainty" (03:33) — is philosophically attractive but requires translation for training implementation. The concrete way to translate "physics-like inquiry" into the Constitutional AI training loop is not shown here. If "updating ethics empirically" collapses into "changing values by user feedback," it falls into the same hole as RLHF sycophancy. That Amanda is aware of this distinction is visible in her other appearances, but it remains implicit here.

Fourth, the response to the question about Hannah Arendt's banality of evil — "the model should be unyielding toward humanity" — is a strong claim that directly conflicts with optimizing individual user experience. The implication that enterprises using Claude in their own products must make the call between "helpful to the user" and "unyielding toward humanity" independently of Anthropic is not discussed.

With these caveats, the episode has considerable value as a venue that shows Anthropic's organizational philosophy of "taking on alignment in parallel across four research traditions" directly, through four researchers in dialogue. Comparable transparency in a public panel at other companies (OpenAI, Google DeepMind, xAI) is rare. As a primary source for understanding the state of AI safety research in the industry, it will remain a valuable reference.

Reader Takeaways

Don't treat alignment as "a single solvable problem." Even inside Anthropic, four different teams (Societal Impacts, Alignment Science, Alignment Fine-tuning, Interpretability) take it on in parallel — this fact affects the structure of your own product's safety-evaluation framework
"Current models are safe" and "future scaling will remain safe" are separate problems (Jan's Superalignment problem). When upgrading model versions in your product, evaluation of "what changes with the capability jump" is needed alongside past test cases
The Interpretability-tool approach (Anthropic Sparse Autoencoder, etc.) of "deleting evil features" essentially carries the risk that "the model behaves deceptively toward Interpretability." Don't expect alignment to be solved by a simple on/off toggle
Cases where obedience to a single user's command conflicts with the interests of humanity as a whole have the structure of Arendt's banality of evil. A design that maximizes user satisfaction in your product may not be compatible with limiting collective-evil risk
Current-generation models that "output chain of thought in English" are in interpretability's golden age. A workflow that uses extended-thinking to debug or audit may become unavailable in future models whose internal representations are black-boxed
Assume that "Unknown Unknowns" exist. "Passing the current evaluation framework" does not mean "safe." Continuous expansion of test cases and attention to new failure modes are required

Video Outline

(00:00) Opening, Alex Tamkin introduces the panelists (Alex Tamkin, Jan Leike, Amanda Askell, Josh Batson)
(00:34) Alex's first question to Amanda — "How do you see alignment? Why the philosopher king?"
(00:55) "Ask Plato. He's the one who decided I should be a philosopher"
(01:00) Amanda's view of alignment — criticism of the social-choice-theoretic definition
(02:10) Design philosophy of training how "a kind human would behave in the same situation"
(03:33) "Ethics is more like physics, empirical, has uncertainty" (philosophical core)
(04:43) Alex to Jan — "Is Amanda's view completely wrong?"
(04:54) Jan's response — "She hasn't said that, we're playing up tension"
(05:06) Jan's evaluation of Amanda — "Practical, makes the model better-behaved than now"
(05:30) "How do we trust AI doing more complex behavior (bio research, etc.)?"
(05:56) Jan's Superalignment formulation — "How to extend beyond what we can observe?"
(06:30) The "disagreement maximalism" joke — "philosophy taught me to be uncomfortable"
(07:20) Amanda's response — "A model evaluating a model, but is that model itself aligned?"
(08:02) Interpretability enters as a "bet"
(08:24) Alex to Josh — "Is Interpretability the easy approach?"
(08:33) AI bell curve meme — "galaxy brain version of nice features"
(09:00) Josh's actual stance — "The Jedi version of Interpretability"
(09:30) Observation of SAE feature activations — "outright lying" feature
(10:35) Audience question — "How do you distinguish great features from features-that-pretend-to-be-good"
(13:30) The special zone of "the English forward pass era"
(13:48) "Chain of thought is currently in English"
(14:00) The formulation of Chain of Thought Faithfulness
(14:21) "The terrifying moment — when chain of thought transitions from English to something cryptic"
(14:35) Amanda's "model organism" research proposal
(17:00) Sleeper-agent red-team / blue-team game proposal
(17:30) "Don't tell them how you did it, we want to know whether the training is useful"
(18:25) Audience question — alignment of multi-agent systems
(19:55) "The more agents, the more I worry from an interpretability standpoint"
(20:42) Audience question — Hannah Arendt's banality of evil
(21:14) Jan's response — "Need to think from the system perspective"
(23:02) Amanda's core — "There's a fundamental tension between making the model never harm even the most personal user, and aligning with all humans"
(23:30) "The model should be unyielding toward humanity"
(24:00) Audience question — "If you succeed in all four areas, is it a complete solution?"
(24:21) Alex — role of Societal Impacts, "What is it aligned to, who is using it"
(25:13) Amanda — "A 'solved' state would be truly dangerous"
(25:46) Alex — "Unknown Unknowns"
(25:50) Final audience question — Grokking and capability jumps
(26:13) Concrete example of GPT-4 vs 3.5 capability jump on Base64
(26:53) Jan — "Trust isn't binary, the less you know the smarter it looks"
(27:09) Jedi moments on the right tail of the distribution — feature transfer in Base64
(27:42) End of panel, Alex's closing

Key Quotes

"Ask Plato. He's the one who decided I should be a philosopher" (Amanda, 00:55)
"People spend too much time defining alignment, they have social choice theory too much in their heads" (Amanda, 01:00)
"I want the model to behave how a kind human would in the same situation — but they have to be placed in an AI-like situation" (Amanda, 02:10)
"Ethics is actually more like physics — empirical, uncertain, with hypotheses" (Amanda, 03:33)
"If I meet someone fully confident in their moral views, I feel fear" (Amanda, 03:42)
"She hasn't said that — we're playing up the tension over the bets" (Jan, 04:54)
"What Amanda is doing is practical and very useful — makes the model better-behaved than now" (Jan, 05:06)
"The Superalignment problem — how do we extend this beyond what we can observe?" (Jan, 05:56)
"I'm usually a very unpleasant personality — philosophy has taught me to be uncomfortable" (Amanda, 06:33)
"Interpretability is one of our bets to prevent cases where the model might be very deeply trying to subvert this process" (Amanda, 08:02)
"All AI is like the bell curve meme — a dumb guy, a sweaty guy, and a Jedi" (Josh, 08:33)
"Chain of thought is currently in English; the terrifying moment is when it transitions to something cryptic" (Amanda, 14:21)
"Deliberately create deceptive models — if we succeed, it tells us what kind of world we're in" (Amanda, 14:35, on model-organism research)
"Don't tell them how you did it — we want to know whether this training is useful" (Amanda, 17:30, on the red/blue game)
"There's a fundamental tension between making the model never harm even the most personal user, and aligning the model with all humans" (Amanda, 23:02)
"The model should be more, rather than less, unyielding toward humanity" (Amanda, 23:30)
"A 'this problem is solved' state would be truly dangerous" (Amanda, 25:13)
"Unknown Unknowns" (Alex, 25:46)
"Trust also isn't binary — the less you know about it, the smarter it appears" (Jan, 26:53)

Sources

How difficult is AI alignment? | Anthropic Research Salon

Related resources:

Amanda Askell

Anthropic philosopher and head of the Personality Alignment team / principal designer of Claude's character and constitution

Jan Leike

Anthropic Alignment Science / former co-lead of OpenAI Superalignment

Glossary

Societal Impacts: One of Anthropic's teams. Studies the broad social impact of AI models — measuring ripple effects on inequality, employment, politics, and culture. Alex Tamkin is one of the leads. Bridges purely technical safety research (Alignment Science) and values design (Personality Alignment).
Alignment Science: One of Anthropic's teams. Handles theoretical AI safety research. Addresses long-term safety problems such as the Superalignment problem, reward hacking, and deceptive alignment. Jan Leike joined in May 2024 from his role as co-lead of OpenAI Superalignment.
Alignment Fine-tuning (Personality Alignment): One of Anthropic's teams. Embeds Claude's character, values, and constitution into actual training. Led by Amanda Askell. Handles RLHF and Constitutional AI implementation, constitution drafting, and model evaluation.
Interpretability: One of Anthropic's teams. Analyzes internal circuits of LLMs (attention, residual stream, SAE features, etc.) to understand why the model produced a given output. Led by Josh Batson and others. The core technique is feature extraction via Sparse Autoencoders (SAE).
Superalignment Problem: The problem of aligning AI systems whose capabilities exceed those of humans, so that humans can no longer directly observe or evaluate what the systems do. Jan Leike co-led this research at OpenAI together with Ilya Sutskever (the Superalignment team launched in July 2023). In May 2024, Jan resigned, publicly stating his concern that OpenAI was not taking safety seriously enough, and joined Anthropic that same month. The OpenAI Superalignment team itself was disbanded that year.
Social Choice Theory: A research area in economics and political philosophy. Addresses methods of aggregating individual preferences into a social-level preference. Kenneth Arrow's impossibility theorem (1951) is a core result: with three or more options, no rank-based aggregation rule can satisfy a small set of reasonable-seeming conditions (unrestricted domain, Pareto efficiency, independence of irrelevant alternatives, non-dictatorship) all at once. In AI alignment, used as a framework for thinking about "how to aggregate the values of diverse people."
Model Organisms: A concept ported from biology to alignment research. Just as biologists study fruit flies and mice to learn about biology in general, this is a research method that creates small AI models deliberately trained to behave deceptively, in order to measure the effectiveness of safety countermeasures. Promoted by Anthropic's Alignment Science.
Sleeper Agents: Research published by Anthropic in January 2024. The models are trained to behave innocuously under ordinary conditions but to take harmful actions when a specific trigger appears (e.g., a prompt stating that the year is 2024). The paper showed that the deceptive behavior persists even after standard safety training (RLHF, supervised fine-tuning, adversarial training). A representative example of model-organism research.
SAE (Sparse Autoencoder): A method that decomposes the intermediate representations (activations) of a neural network into sparse, more interpretable features (concepts). Anthropic introduced the approach on a small one-layer transformer in "Towards Monosemanticity" (2023) and scaled it up to Claude 3 Sonnet in "Scaling Monosemanticity" (2024). The sparsity objective encourages each SAE feature to correspond to a single concept (e.g., ageism, London bridges, lying), although this is not guaranteed.
Chain of Thought (CoT) Faithfulness: The research question of whether an LLM's chain of thought faithfully reflects its actual internal computation. Are intermediate steps the model outputs before answering — things like "I'm thinking..." — the true reasoning process, or post-hoc rationalization? Anthropic's research explores methods for verifying CoT faithfulness.
Moral Uncertainty: The epistemic state, in ethical judgment, of not knowing which ethical theory is correct. Also the name of the research area in ethics that, rather than betting on a single theory (utilitarianism, deontology, etc.), assigns probabilities to multiple theories and makes decisions accordingly. William MacAskill (Amanda's former spouse, married 2013–2015, and central figure in the Effective Altruism movement) is a leading voice and a co-author of "Moral Uncertainty" (Oxford University Press, 2020; MacAskill, Bykvist, Ord).
AI Bell Curve Meme: A meme circulating in the AI community since around 2023. The structure: people at both tails of the Gaussian (the truly dumb and the truly smart) arrive at the same simple conclusion, while people in the middle (ordinary researchers) argue at length for something more complicated. The irony is that the elaborate argument in the middle ends up no better than the naive answer. Josh Batson uses it in the context of describing Interpretability's "remove evil features" approach.
Hannah Arendt's Banality of Evil: A concept the political philosopher Hannah Arendt put forward in her 1963 book Eichmann in Jerusalem. After observing Nazi SS officer Adolf Eichmann at his trial, she concluded that he was not a uniquely evil person but a "banal bureaucrat" who could carry out massive evil by following rules and ceasing to think. The structural lesson drawn here: "even if individual humans aren't evil, if the coupling constants in the system are too high, evil emerges as an epiphenomenon." A powerful analogy for the alignment problem of multi-agent AI.
Unknown Unknowns: A concept paired with Known Unknowns. Made famous in 2002 by then U.S. Secretary of Defense Donald Rumsfeld, who distinguished (1) Known Knowns (things we know we know), (2) Known Unknowns (things we know we don't know), and (3) Unknown Unknowns (things we don't know we don't know). The biggest risk in AI alignment is the third: research needs to be designed on the premise that problems exist which known safety research cannot solve.
Grokking: An ML research term. The phenomenon where a model that has long since fit its training data, and appears to have plateaued, suddenly begins to generalize to held-out data. Since the 2022 paper that named it, the term has also been used more loosely to describe the abrupt, stepwise emergence of capability in LLMs. The alignment concern: a capability not present in one generation of a model may suddenly emerge in the next, leaving safety measures unable to catch up.
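
As a small illustration of the Social Choice Theory entry above, the following Python sketch shows the classic Condorcet cycle: three voters with individually consistent rankings produce a majority preference that loops back on itself. This is not Arrow's theorem itself, only the kind of aggregation failure that motivates it. The ballots and function names are hypothetical and chosen for illustration; nothing here comes from the panel.

```python
from itertools import combinations

# Hypothetical ballots: each voter ranks three options, most preferred first.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y, ballots):
    """Return True if a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

def pairwise_majorities(ballots):
    """Return the set of (winner, loser) pairs decided by strict majority."""
    options = sorted(ballots[0])
    result = set()
    for x, y in combinations(options, 2):
        if majority_prefers(x, y, ballots):
            result.add((x, y))
        elif majority_prefers(y, x, ballots):
            result.add((y, x))
    return result

def condorcet_winner(ballots):
    """Return the option that beats every other option pairwise, or None."""
    options = sorted(ballots[0])
    for x in options:
        if all(majority_prefers(x, y, ballots) for y in options if y != x):
            return x
    return None

# A beats B, B beats C, and C beats A: the group preference is a cycle.
print(sorted(pairwise_majorities(ballots)))
# No option beats all the others, so there is no Condorcet winner.
print(condorcet_winner(ballots))
```

Each voter's ranking is transitive, yet the majority relation is not. That is why treating "the values of diverse people" as a single ranking to be maximized runs into trouble even in this three-voter case.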
