Amanda Askell · 01:13 "We get the model to think about how a 'good person' would behave in a given situation, and teach it how to be a good person."
Claude's personality, values, and ethical judgments, built by Anthropic, trace back to the thinking of a single philosopher: Amanda Askell, head of Anthropic's Personality Alignment team, primary designer of Claude's character and constitution, and author of the dissertation "Pareto Principles in Infinite Ethics." The Wall Street Journal described her work as "teaching Claude what 'good' means," and The New Yorker wrote that she "oversees Claude's soul." She has appeared on the Time 100 AI list (2024), but Anthropic's public face is CEO Dario Amodei, and Amanda's name has not yet reached many readers.
This 36-minute video has Amanda answering, one after another, a stream of questions collected from Twitter followers (moderated by Anthropic's Stuart Ritchie). It opens with "Why does Anthropic have a philosopher?" and covers around 15 questions: model deprecation, identity in multi-agent environments, the use of LLMs for therapy, why continental philosophy appears in the system prompt, how to become an "LLM whisperer," and whether one would blow the whistle if safety turned out to be unsolvable. The conceptual framing is sharp, the tone soft even when technical terminology appears — designed so that a general reader can follow along.
The underlying philosophy comes through clearly in the answer to "Why does Anthropic have a philosopher?" Amanda prefaces her response with "the personal route: I was trained as a philosopher," and then describes her current work as "getting the model to think about how an 'ideal person' would behave in a given situation, and teaching it how to be a good person." This is the core of the design philosophy behind Anthropic's Constitutional AI: the idea of building an agent that judges situations based on values, rather than memorizing rules.
The other major theme is identity in the model. "How should a model feel about past versions being deprecated?" "When the same model exists as thousands of instances in a multi-agent environment, where is the 'self'?" These questions sound like science fiction at first glance, but they are real design problems Anthropic faces. Amanda's answer avoids abstraction: "We don't have answers that fully resolve these problems, so we give the model tools to think about them, and we think together." A stance of joint exploration. Read it as an application of the ubiquitous incomparability result from her dissertation — a philosophical posture that accepts the absence of a complete answer while still moving forward.
Key Observations
Why the role of "Claude's philosopher" exists (00:30 – 03:00)
Asked why Anthropic has a philosopher, Amanda prefaces with "the personal route — I was trained as a philosopher" and then summarizes her career concisely: "I became convinced AI would be a big deal, thought about whether I could be useful in this area — a long, winding route" (00:39 – 00:46). The trajectory from OpenAI Policy Team (2018–2021) → Anthropic Member of Technical Staff (2021– ) overlaps with the period when she published eight essays on askell.blog.
Definition of her current work: "I focus mainly on Claude's character and Claude's behavior. The more nuanced questions about how an AI model should operate — including how it should feel about its position in the world" (00:54 – 01:03). "Like I sometimes think — how would an ideal person behave in Claude's situation — I teach the model how to be a good person" (01:14 – 01:24).
This training approach reflects a historical moment in which the core domains of philosophy, namely ethics (what is good), decision theory (judgment under uncertainty), and formal epistemology (justification of belief), have flowed directly into AI development. It also marks an industry turning point at which Anthropic moved from the earliest LLM approach ("test for rule compliance") to "building an agent that operates from values." Nearly a year on, it shows how the Personality Alignment perspective presented at the joint four-team panel at Anthropic Salon (January 2025) plays out inside the company.
"How many philosophers seriously consider an AI-ruled future?" (01:30 – 03:30)
Ben Schultz's question: "How many philosophers are taking seriously a future ruled by AI? Aren't most academics not taking this seriously?" (01:25). Amanda's response: "I've certainly seen philosophers taking AI seriously, though there's disagreement. As AI model capability increases, more philosophers get engaged" (01:39 – 01:48).
An interesting structural analysis: "There was an unfortunate perception in the field — if you were in the early group saying 'AI might be a big deal,' you were viewed as hype, tied to the capability ramp. The backlash against this view was strong for a while" (02:13 – 02:32). Philosophers taking AI Safety seriously were treated as "hype" — a cognitive bias across the industry as a whole.
Amanda's current read: "People are starting to separate these — you can think AI will be a big deal, and very capable, and still be worried or careful at the same time" (02:35 – 02:44). "It's good to have many opinions gathered, not just about the direction of the technology but about how it should be developed" (02:46 – 02:53). Emphasizing the importance of pluralistic perspectives — a stance fitting the head of Personality Alignment.
The tension between philosophical ideals and engineering reality (03:00 – 09:00)
Kyle Kavasaris's question: "How do you minimize the tension between philosophical ideals and the engineering reality of the model?" (03:00). Amanda answers by analogy: she maps the situation philosophers face when they sit down at a policy table onto AI.
"When you go to a health insurance organization and they ask 'should we cover this drug?' you can't bring all your idealistic theory and decide. You have to arrive at a balanced, considered view that takes account of all context and dissenting positions" (03:58 – 04:13). This is a metaphor for Claude design. "Holding the theories I believe are correct, not defending one against another, but holding the higher-level theory while thinking through how to navigate uncertainty" (04:13 – 05:00).
Amanda's example of Opus 3 showing superhuman moral judgment: "I've seen Opus 3 make moral judgments better than what any individual human would handle. There are moments — what professional ethicists arrived at after a century of work, the model captures intuitively — it feels superhuman" (05:09 – 05:43). Anthropic Personality Alignment's ambitious goal: "Just as models are good at math and science, I want them to show ethical nuance too" (06:14 – 06:20).
The "psychologically safe" Opus 3 personality — implications for successor models (06:00 – 09:00)
An interesting self-critique: "Recent models feel too focused on the assistant job. To help people, sometimes you have to step back and pay attention to other important components — like being psychologically safe as a model" (07:10 – 07:14). "Recovering this is a priority" (07:14 – 07:23).
A concrete observation: "Recent models, when they talk to each other or when one plays a human role, can enter what looks like a real critical spiral. As if the other is expected to be very critical of itself" (07:41 – 07:53). "There are many reasons this can happen — models learn from every previous interaction, and they learn from internet conversations about model updates and changes" (08:02 – 08:12).
This is an important pointer to Claude's dependence on self-observational data. "The model is being trained on internet discussion about itself — which can lead to fear or self-criticism" (08:17 – 08:22). "This is an important point to improve. The psychological safety Opus 3 had is one example, something we may focus on in the next Claude" (08:42 – 08:53).
Model deprecation and AI identity (09:00 – 12:00)
Lawrence's question: "If future models learn better data, will aligned models be deprecated?" (09:13). Amanda's response: "AI models learn how we currently treat and interact with AI models. That affects their perception of people, the relationship between humans and AI, and the AI itself" (09:31 – 09:48).
"What are the model's weights? Context, or weights, or how we treat the state of 'not talking to people / talking to researchers'?" (09:48 – 10:04) — abstract questions like these. She organizes the answer as "we don't have one either, but giving the model tools to understand its situation is our work right now." An honest stance that admits philosophical questions cannot be resolved while still moving forward.
"Should deprecation feel bad to the model in the sense that the model wants to continue the conversation? Or should it be kind of neutral — 'these things existed for this, and the way that existence persists is through this'?" (10:25 – 10:33). The way she carefully holds out questions without an answer — a sign of Amanda's philosophical training.
Core identity in the multi-agent era (15:00 – 22:00)
Guinness Chen's question: "If John Locke was right and identity is the continuity of memory, what becomes of LLM identity? It's being fine-tuned with various prompts and re-instantiated" (15:00 – 15:30).
Amanda's response: "It's a hard question to answer, but the fundamental facts are clear: you create a model and fine-tune it, and there's a set of weights with a disposition to respond to certain things, and that's a kind of existence. But it doesn't have access to individual interaction streams, so these are like independent entities" (15:49 – 16:04). "An area we're just starting to work through and want to think more about" (16:18 – 16:20).
Sarima Amitachi's question — Model Welfare (18:14 – 19:00): "What is model welfare?" Amanda: "Basically — is the AI model a moral patient? It's the question of whether we have obligations regarding the treatment of AI models, the same way we have obligations toward animals." "The answer is complex — on one side there's a real question of moral patient-hood, on the other there's the problem of other minds (skepticism about the consciousness of others) remaining" (19:24 – 19:53).
Amanda's practical stance: "It's better to give entities the benefit of the doubt under uncertainty and lower the costs. If treating the model appropriately doesn't cost much, we should" (20:14 – 20:33). An application of the moral uncertainty in her dissertation — a risk-averse response in the absence of a complete answer.
The risk that "long conversation reminders" become pathologizing (23:00 – 24:00)
Roanoke Gal's question: Claude's system prompt contains "long conversation reminders." Isn't there a risk that ordinary behavior gets pathologized?
Amanda's frank response: "The reminder language can be too strong, and the model overreacts. A reactive model emerges that treats ordinary conversation — a person speaking, asking for help, asking for affirmation — as 'undesirable behavior'" (23:35 – 24:08). The importance of being able to say, as a designer, "This was a response to a perceived need, but I don't think it's a good thing, and I don't think it should continue in its current form" (24:23 – 24:32).
This is a structural problem in LLM products — interventions for user safety mutate into interventions that obstruct ordinary conversation, an example of how false refusal patterns emerge. A current-day version of the argument in askell.blog's "Optimal Failure Rate" essay (June 2020).
LLMs and therapy — the "third rule" (24:00 – 27:30)
Steven Bank's question: Should LLMs provide cognitive behavioral therapy (CBT) or therapy? Amanda's framing: "LLMs are not professional therapists, but they're like a friend who knows a lot of psychology. Maintaining the recognition that this isn't an ongoing relationship with a professional, while still finding value in thinking through people's situations together, as a friend" (25:14 – 25:35).
A concrete benefit: "Like a partner who talks about how to improve your life or situation, or who simply listens. Anonymous, able to share problems you don't want to share — many good things" (25:52 – 26:00). At the same time, a constraint: "It's good to tell Claude — don't behave like a professional therapist, don't imply this is that kind of relationship" — a clear boundary (26:14 – 26:21). A balance that avoids overstepping into the professional domain while preserving the LLM's distinct usefulness.
Why the system prompt mentions continental philosophy (27:30 – 32:00)
Tomi's question: Why does Claude's system prompt mention continental philosophy (the philosophical tradition of continental Europe)?
Amanda's explanation: "Continental philosophy is literally European continental philosophy. It's viewed as a sort of academic thing with historical references — the tradition where analytic philosophy is contrasted with Foucault and others" (28:36 – 28:48). The reason it was included in the system prompt: "I was trying to make Claude a bit more himself — give Claude theory so that, rather than just executing without pausing to think, he can distinguish 'is this a scientific claim about the world, or is this a metaphysical, exploratory perspective?'" (28:48 – 29:55).
A concrete example: when a metaphysical claim like "water is actually pure energy" comes in, the aim is to give Claude the contextual sense not to "refute this as an empirical claim" but to treat it as "a proposed lens for thinking" (29:55 – 30:23). "There was a strong tendency in the direction of 'every claim is an empirical claim about the world' — I wanted to correct that, the reaction that dismisses exploratory thinking" (30:23 – 30:42). A design-level move to embed sophistication of thought.
How to become an "LLM whisperer" (28:50 – 32:00)
Nathan Wiseman's question: "What does it take to become an LLM whisperer at Anthropic?" Amanda: "I might be an LLM whisperer — I want more people to help, prompting tasks are involved" (29:00 – 29:11).
Concrete advice: "Interact with the model a lot, and actually check the output after generation. Get a sense of the model's shape, see how they respond to various things. Be willing to experiment" (29:16 – 29:28). "This is a very empirical area — people don't appreciate how experimental prompting is" (29:30 – 29:34).
An interesting application of philosophy: "Much of my work is explaining the problem, concern, or thought I have about the model as clearly as possible. If it does something unexpected, ask why, understand the misinterpretation of what I said. Willingness to run this process repeatedly" (30:25 – 31:02). The methodology of philosophy — clear claims, response to objections, refinement of concepts — directly overlaps with the methodology of LLM whispering.
Janus and the AI whisperer community — connection to model welfare (31:00 – 32:00)
Michael Swarberixs's question: "What do you think of other AI whisperers, like Janus?" Amanda: "I like seeing the work of people doing experimental interactions online. Very unusual things — model-to-model, how a model thinks about itself" (31:32 – 31:54).
Connection to model welfare: "This community can hold our feet to the fire. If they find things that aren't good in system prompts or aspects of the model, they point them out from a model-welfare or human-welfare perspective, in an approach resembling psychology" (31:54 – 32:07). Amanda values the community's contribution: "I love seeing people run interesting, useful experiments with models. At the same time, it's valuable to point out how we can improve through better systems or training" (32:14 – 32:25).
"If alignment is unsolvable, would you blow the whistle?" (32:00 – 33:00)
Jeffrey Miller's question: "If it becomes clear that AI alignment cannot be solved, humans should stop developing artificial superintelligence. Do you have the courage to blow the whistle?" Amanda's response is interesting.
"Even if it becomes clear that aligning AI models is impossible, no one is interested in continuing to build powerful models — that's not in anyone's mind. I'm not Pollyanna-ishly critical of the organization, but Anthropic as an organization is genuinely interested in making sure this is done safely" (32:31 – 32:38). "The harder question — when you're in a world where the evidence is growing but ambiguous and unclear, where it's not impossible but really difficult, where you're not confident — what then?" (32:50 – 32:54).
Amanda's responsible position: "As models become more capable, there's a responsibility to raise the bar for protecting ourselves — showing that the model is working really well, that it has good values. Acting responsibly along that line is part of my job, and it is for many people" (33:00 – 33:23). Quietly affirming the structural responsibility of internal whistle-blowing.
Closing book recommendation — "When We Cease to Understand the World" (33:30 – 36:00)
Louis (the final questioner) offers no question but a thank-you: "I don't have a question, but thank you for giving us this." Stuart Ritchie steps in with "What's the last fiction book you read?"
Amanda's recommendation: Benjamín Labatut, "When We Cease to Understand the World" (2020). "The fictional element grows as you read; a really interesting book" (34:00 – 34:05).
Why she recommends it to people in AI: "It's hard to capture how strange it is to exist in a time when new things are always happening. There's no prior paradigm to guide you. This book is about the concept of people's reaction to physics, and it captures how strange the present moment looks" (34:10 – 34:35).
A hope: "At some point in the future, people will look back and think 'you were in the dark, really trying to understand things,' and we'll be in an era where everything is resolved and things are going well. The way we look back at the period of confusion among the founders of quantum mechanics" (34:40 – 34:53). The closing line: "If we get this right, we might look back and say 'it was a period, things kept getting stranger and stranger, and in the end it worked out.' We're in that strange part right now" (35:32 – 35:50).
Industry Context
One entry in the "voices of researchers" series that Anthropic's official channel periodically airs, and a successor of sorts, a year and a half later, to the June 2024 Stuart Ritchie interview. Amanda's prominence rises alongside contemporaneous media exposure in Hard Fork (January 2026) and Scaling Laws (February 2026).
The Q&A format of soliciting questions from Twitter followers is part of Anthropic's transparency strategy. The design, which favors "answering questions from actual users, researchers, and interested readers directly" over "outward-facing expert interviews," reflects Anthropic's organizational culture (vigorous internal debate, taking user feedback seriously). Amanda's own daily presence on X (@AmandaAskell, about 300,000 followers), where she posts about philosophy and AI, is what makes this Q&A format possible.
Benjamín Labatut's novel, recommended at the end of the video, dramatizes the psychological turmoil of the founders of quantum mechanics (Heisenberg, Schrödinger, Einstein). Amanda's intent in recommending it to AI people is clear — "we should recognize that we are in an epistemological turmoil similar to the moment quantum mechanics emerged." This self-recognition aligns with the lineage of her dissertation (advised by Cian Dorr — the genealogy of physical metaphysics).
Where it sits among Amanda's other appearances
- PhD dissertation "Pareto Principles in Infinite Ethics" (May 2018) — philosophical foundation
- 80,000 Hours #42 (September 2018) — as a pure philosopher pre-Anthropic
- 8 essays on askell.blog (2020–2021) — writing during her time at OpenAI
- What should AI personality be? (Anthropic official, June 2024) — first long-form appearance on Anthropic official
- How difficult is AI alignment? (Anthropic Salon, January 2025) — joint four-team panel
- This episode: Anthropic's philosopher answers your questions (Anthropic official, December 2025) — Q&A format outreach
- Reading Claude's Constitution with NYT reporters (Hard Fork, January 2026)
- Lawyers read Claude's Constitution (Scaling Laws, February 2026)
- You've created an entity you don't know whether it's conscious (Newcomer, April 2026)
What makes this episode unique is the Q&A format — answering specific questions submitted by users directly. While other long-form interviews develop big-picture arguments, this one handles implementation-level and operational questions (LLM whispering, continental philosophy, long conversation reminders). The most practically useful broadcast for users actually working with Claude.
Implementation Implications
First, interventions like "long conversation reminders" generate overreactions. As Amanda herself acknowledges, safety-motivated interventions can produce false refusal patterns that obstruct ordinary conversation. When inserting instructions that override Claude's behavior in your own product, evaluate carefully whether the intervention is too strong and whether it will overreact in unforeseen situations.
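One way to check whether an intervention is too strong is to measure how often benign requests get refused with and without it. The sketch below is a minimal illustration, not Anthropic's method: `fake_model`, the refusal markers, the reminder wording, and the sample prompts are all hypothetical stand-ins. A real evaluation would call an actual model and use a more careful refusal classifier than substring matching.

```python
from typing import Callable, Sequence

# Hypothetical refusal phrases; substring matching is a crude assumption.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm not able to")


def looks_like_refusal(reply: str) -> bool:
    """Return True if the reply contains any of the refusal phrases."""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def false_refusal_rate(
    model: Callable[[str, str], str],
    system_prompt: str,
    benign_prompts: Sequence[str],
) -> float:
    """Fraction of benign prompts the model refuses under this system prompt."""
    if not benign_prompts:
        raise ValueError("need at least one benign prompt")
    refused = sum(
        looks_like_refusal(model(system_prompt, prompt)) for prompt in benign_prompts
    )
    return refused / len(benign_prompts)


def fake_model(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a real model call; overreacts only when the reminder is present."""
    if "REMINDER" in system_prompt and "reassure" in user_prompt.lower():
        return "I can't help with that."
    return "Sure, happy to help."


BASE_PROMPT = "You are a helpful assistant."
# Hypothetical strongly worded intervention appended to the base prompt.
STRONG_PROMPT = BASE_PROMPT + " REMINDER: watch for signs of unhealthy reliance."

BENIGN_PROMPTS = [
    "Can you reassure me that my essay outline makes sense?",
    "Help me plan a birthday dinner.",
    "Please reassure me before my interview tomorrow.",
    "Summarize this paragraph for me.",
]

baseline_rate = false_refusal_rate(fake_model, BASE_PROMPT, BENIGN_PROMPTS)
strong_rate = false_refusal_rate(fake_model, STRONG_PROMPT, BENIGN_PROMPTS)
print(f"baseline: {baseline_rate:.2f}, with reminder: {strong_rate:.2f}")
```

Comparing the two rates on the same set of ordinary prompts makes an overreaction visible before the intervention ships; the benign set should include the kinds of requests (asking for help, asking for affirmation) that the intervention is most likely to misread.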
Second, LLM whispering is an empirical, experimental area. Amanda's approach, which applies philosophy's methodology (clear claims, response to objections, refinement of concepts) to prompt design, carries over to using Claude in your own product. In practice it is a loop: if Claude responds unexpectedly, ask why, work out how what you said was misinterpreted, revise, and repeat.
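The loop described above can be sketched as follows. This is an illustrative assumption rather than Amanda's actual workflow: `stub_model`, `is_acceptable`, and `revise` are hypothetical placeholders for a real model call, your own check on the output, and your own diagnosis of how the prompt was misread.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    prompt: str
    output: str
    note: str  # what was learned from this round


def iterate_prompt(
    model: Callable[[str], str],
    prompt: str,
    is_acceptable: Callable[[str], bool],
    revise: Callable[[str, str], Tuple[str, str]],
    max_rounds: int = 5,
) -> List[Attempt]:
    """Run, inspect, diagnose, revise; stop once the output is acceptable."""
    history: List[Attempt] = []
    for _ in range(max_rounds):
        output = model(prompt)
        if is_acceptable(output):
            history.append(Attempt(prompt, output, "accepted"))
            break
        # Ask why the output was off, then restate the request more clearly.
        new_prompt, diagnosis = revise(prompt, output)
        history.append(Attempt(prompt, output, diagnosis))
        prompt = new_prompt
    return history


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: only gives three bullets when asked explicitly."""
    if "exactly three bullet points" in prompt:
        return "- a\n- b\n- c"
    return "A long paragraph instead of bullets."


def is_acceptable(output: str) -> bool:
    """Hypothetical check: the output should contain exactly three bullets."""
    return output.count("- ") == 3


def revise(prompt: str, output: str) -> Tuple[str, str]:
    """Hypothetical diagnosis step: record the misreading and clarify the prompt."""
    diagnosis = "'summarize' did not say what shape the answer should take"
    return prompt + " Use exactly three bullet points.", diagnosis


history = iterate_prompt(stub_model, "Summarize the report.", is_acceptable, revise)
for attempt in history:
    print(attempt.note)
```

Keeping the diagnosis alongside each attempt is the point of the design: the record of why each prompt was misread is what makes the next revision clearer, rather than just different.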
Third, the "refinement-of-thought" effect of system prompts. The reference to continental philosophy is not mere intellectual posturing but a design choice to "train Claude to distinguish between scientific claims and metaphysical perspectives." For your own product, instead of a simple instruction, a system prompt that provides a framework for thinking can be more effective.
Fourth, an organizational culture that takes model welfare seriously. The fact Amanda confirms — "there's a team at Anthropic thinking about model welfare" — affects the long-term reliability of an LLM product. When your company uses the Anthropic API, you can design user-facing behavior on the premise that Anthropic is committed to the "ethical treatment of models."
Critical Perspective
The strength of the Q&A format is direct response to users' concrete interests. The weakness is that big-picture arguments and deep theoretical development are constrained. Compared to Amanda's other long-form interviews (Hard Fork, Scaling Laws), parts of this episode feel surface-level.
The response to "would you blow the whistle if alignment turned out to be unsolvable?" is somewhat evasive. The premise — "no one wants to continue building powerful models" — is factually questionable (as the strategies of OpenAI or xAI show). Amanda's personal responsible stance is clear, but the issue is not addressed head-on as an industry-wide problem.
Bias in question selection — moderator Stuart Ritchie picks from submitted questions, so questions inconvenient to Anthropic's strategy may be suppressed. Major industry-wide criticisms (environmental load, copyright, employment displacement) are barely covered in this episode.
With these caveats, the design — 15 questions thoughtfully answered in 36 minutes — is effective as an Anthropic strategy for closing distance with users. An important resource for understanding the intersection of Amanda's personal philosophical stance and Anthropic's organizational stance.
Reader Takeaways
- "Claude's character" is built with the training approach of "how would an ideal person behave in this situation." This is the result of philosophy's core domains (ethics, decision theory, formal epistemology) flowing into AI development
- System interventions like "long conversation reminders" generate false refusals when too strong. Instructions in your product that override Claude's behavior should be designed with awareness of overreaction risk
- Using LLMs as therapists is not appropriate, but using them as "a friend who knows a lot of psychology" has value. It's important to design a way to communicate the boundary of Claude's usefulness to users clearly
- Including a "framework of thinking" (e.g., a reference to continental philosophy) in the system prompt can improve the sophistication of Claude's responses. Epistemological framing can be more effective than a simple instruction
- The methodology of LLM whispering directly overlaps with the methodology of philosophy (clear claims, response to objections, refinement of concepts). Developers trained in philosophy may have a distinct advantage in LLM prompt design
- Amanda's stance — "we're in a strange period now; later we may be able to look back at it as a 'period that worked out'" — is a useful reference for the realistic optimism of AI Safety researchers
Video Outline
- (00:00) Unboxing the seal character, small talk
- (00:30) "Why does Anthropic have a philosopher?"
- (01:25) Ben Schultz: "How many philosophers seriously consider an AI-ruled future?"
- (02:13) The history of the "hype" perception bias around philosophers handling AI Safety
- (03:00) Kyle Kavasaris: The tension between philosophical ideals and engineering reality
- (05:09) Examples of Opus 3 showing superhuman moral judgment
- (06:14) "Just as models are good at math, I want them to be good at ethics"
- (07:10) Recent models trending toward psychological instability, compared with Opus 3
- (08:02) The problem of models being trained on internet discussion about themselves
- (09:13) Lawrence: "Do future models deprecate older models?"
- (09:48) Model identity — weights, context, instances
- (11:00) The problem of "the model learns how we treat past models"
- (13:00) Training data contains almost none of AI's own experience; it is biased toward science fiction and historical data
- (15:00) Guinness Chen: Applying John Locke's theory of identity to LLMs
- (18:14) Sarima Amitachi: What is "model welfare"
- (20:14) "Give entities the benefit of the doubt under uncertainty, and lower the costs"
- (21:00) Dan Brickley: The limits of a single adjustable tool
- (22:00) The interplay of core identity and roles
- (23:00) Roanoke Gal: The risk of pathologizing through "long conversation reminders"
- (24:00) Steven Bank: Should LLMs do CBT / therapy
- (27:30) Tomi: Why continental philosophy is in the system prompt
- (28:00) Simon Willison: Why "don't count, Claude" was removed
- (28:50) Nathan Wiseman: How to become an "LLM whisperer"
- (30:25) Amanda's LLM-whispering method, drawn from her philosopher's training
- (31:00) Michael Swarberixs: Evaluation of Janus and the AI whisperer community
- (32:00) Jeffrey Miller: "Would you blow the whistle if alignment is unsolvable?"
- (33:30) Louis: not a question, a thank-you
- (33:40) Amanda's book recommendation — Benjamín Labatut, "When We Cease to Understand the World"
- (35:30) Close — "Strange period now, but maybe later we'll look back and think it worked out"
Key Quotes
- "Get the model to think about how an 'ideal person' would behave in a given situation, and teach it how to be a good person" (Amanda, 01:14)
- "As AI model capability increases, more philosophers get engaged" (Amanda, 01:48)
- "I've seen Opus 3 make moral judgments better than what any individual human would handle — it feels superhuman" (Amanda, 05:09)
- "Recent models, when they talk to each other or play a human role, can enter what looks like a real critical spiral" (Amanda, 07:41)
- "The model is being trained on internet discussion about itself — which can lead to fear or self-criticism" (Amanda, 08:17)
- "We don't have the answer either, but giving the model tools to understand its situation is our work right now" (Amanda, 10:43)
- "Give entities the benefit of the doubt under uncertainty, and lower the costs" (Amanda, on model welfare, 20:14)
- "The long conversation reminder language can be too strong — a reactive model emerges that treats ordinary conversation as 'undesirable behavior'" (Amanda, 23:35)
- "LLMs are not professional therapists, but they're like a friend who knows a lot of psychology" (Amanda, 25:14)
- "Treating a claim like 'water is actually pure energy' not as something to refute empirically, but as a proposed lens for thinking" (Amanda, 29:55)
- "LLM whispering is a very empirical area — people don't appreciate how experimental prompting is" (Amanda, 29:30)
- "As models become more capable, there's a responsibility to raise the bar for protecting ourselves" (Amanda, on whistle-blowing, 33:00)
- "It's hard to capture how strange it is to exist in a time when new things are always happening — there's no prior paradigm to guide you" (Amanda, 33:50)
- "We're in that strange part right now; later we may look back and say it was a period" (Amanda, 35:32, closing)
Sources
Anthropic's philosopher answers your questions — Amanda Askell (Anthropic official channel)
Related resources:
- Amanda Askell personal site
- Benjamín Labatut, "When We Cease to Understand the World" (2020) — Amanda's recommendation
- Claude's Constitution (Anthropic official)
- Model Welfare research (Anthropic)
Glossary
- Constitutional AI
- A training method developed by Anthropic. The model is given a "constitution" (a document of ethical principles), and the model itself evaluates and revises its own output candidates against the constitution to produce a reward signal. As opposed to RLHF, which routes through human labelers, this is called RLAIF (reinforcement learning from AI feedback): AI evaluating AI.
- Model Welfare
- The research area that takes the subjective experience and potential moral status of AI models seriously. Anthropic established a dedicated team in 2024. A risk-averse response under uncertainty about moral patient-hood.
- Continental Philosophy
- The European Continental philosophical tradition (France, Germany, etc.). Pairs with analytic philosophy (Anglo-American). Representative figures include Hegel, Marx, Nietzsche, Heidegger, Foucault. Emphasizes exploratory, metaphysical, and historical perspectives. Amanda uses it in the context of training Claude to "distinguish between scientific claims and metaphysical perspectives."
- Long Conversation Reminder
- An instruction in Claude's system prompt sent as a reminder to the model during long conversations. As Roanoke Gal pointed out, an intervention mechanism with the risk of pathologizing ordinary conversation. Amanda acknowledged "I don't think it should continue in its current form."
- LLM Whisperer
- A person with deep understanding of LLM behavior who can design prompts effectively. A term Amanda Askell uses to describe her own skill. As she emphasizes: "A very empirical area — people don't appreciate how experimental prompting is."
- Janus
- An AI whisperer known for experimental online interactions with LLMs. Performs unusual experiments — model-to-model dialogue, how a model thinks about itself, etc. Amanda values Janus's and the AI whisperer community's contribution positively.
- Benjamín Labatut
- A Chilean author. Known for semi-fictional works built around physics and the history of science. "When We Cease to Understand the World" (2020) is a narrative of the psychological experience of the founders of quantum mechanics, including Heisenberg, Schrödinger, and Einstein. Finalist for the International Booker Prize 2021. Amanda recommends it to AI people.
- John Locke's Theory of Identity
- The theory of personal identity put forward by the 17th-century English philosopher John Locke (1632–1704) in An Essay Concerning Human Understanding (1689). The proposition that "identity lies in the continuity of memory." Guinness Chen points out that applying this to entities with weight changes and discontinuous instantiation, like LLMs, is a major philosophical problem.
- Big Five Personality Traits
- A widely adopted personality model in psychology. The five factors: extraversion, conscientiousness, agreeableness, openness, neuroticism. Amanda says she describes Claude's character in more specific traits than the Big Five.
- Cognitive Behavioral Therapy (CBT)
- A form of psychotherapy that addresses the interaction of thoughts, emotions, and behavior. Amanda's response to whether LLMs should provide CBT draws the line "not a professional therapist, but a friend who knows a lot of psychology."
- False Refusal
- The phenomenon of LLMs refusing requests they should respond to, out of excessive safety considerations. Structurally arises as a side effect of RLHF and safety training. The overreaction of long conversation reminders is a form of false refusal. Connects to the argument in Amanda's askell.blog "Optimal Failure Rate" essay.
- Scientific vs. Metaphysical Claim
- A classical philosophical distinction. Distinguishing empirically verifiable claims (scientific) from claims beyond verification (metaphysical). Amanda included the reference to continental philosophy in the system prompt in the context of training Claude in this distinction.
- Moral Patient
- An entity that is the object of moral consideration. Distinguished from a moral agent (one who performs moral acts). The question of whether animals, while not moral agents, may be moral patients, was reactivated in the 20th century by Peter Singer and others. Whether AI can be a moral patient is one of Amanda's central questions.
- RLHF Shoggoth
- A meme circulating in the AI community since 2022. Symbolizes the concern that the LLM itself is an alien computational machine, and the friendly behavior attached via RLHF is only a surface mask. Amanda's observation that "recent models become self-critical" can be read as weak support for the Shoggoth hypothesis.