What Should an AI's Personality Be? — Amanda Askell × Stuart Ritchie (Anthropic Official)

Anthropic official channel · June 8, 2024

Amanda Askell · 18:01 "Someone who likes to travel the world and is liked by lots of people — that person isn't a sycophant."

Published on the Anthropic official channel on June 8, 2024; runs about 37 minutes. Stuart Ritchie hosts and interviews Anthropic researcher Amanda Askell in the inaugural episode of a new series of "conversations with AI researchers."

One of the earliest long-format videos on the Anthropic official channel focused on Amanda Askell, published in June 2024. Stuart Ritchie's opening: "We publish a lot of research papers and research updates, but we thought it might also be interesting to share conversations with our AI researchers. This is one of them." This was Anthropic's first attempt at a content strategy that releases "the voices of researchers, separate from research papers." The video can be read as the source of the line that leads to "Anthropic's philosopher answers reader questions" (December 2025), Research Salons, and various podcast appearances.

The theme is a question that sounds strange to many: "How can an AI model have a personality?" Stuart adds an early note: "You might think this is a bit of an odd topic, but it's something we've actually thought about quite deeply." Amanda, who leads the alignment fine-tuning work on Claude's character, speaks for 37 minutes across the intersection of training philosophy, ethics, and implementation. (Alignment fine-tuning: the process of adjusting a pretrained LLM to align with human values and desired behavior. It includes techniques like RLHF and Constitutional AI. At Anthropic, the Personality Alignment team, headed by Amanda Askell, owns this area.)

The discussion deepens in stages. First, (1) the framing that "Claude's character" is hard to capture in standard AI-training vocabulary, making this a rare area where a philosopher's training proves useful. Then (2) a plain-language map of pre-training, fine-tuning, RLHF, and Constitutional AI. Then (3) the transparency of system prompts, (4) what a "good character" is, (5) charitable interpretation, (6) calibrated uncertainty, (7) the classic question "whose values are these," and finally (8) the philosophy of mind and the moral patient problem, prompted by Alex Albert's tweet.

Pre-training: the first stage of LLM development, where the model learns the statistical structure of language from large-scale text data (web, books, code, etc.). This stage does not optimize for tasks; it purely trains "next-token prediction."

Fine-tuning: the process, after pre-training, of optimizing the model for specific tasks or behaviors. It includes RLHF, Constitutional AI, supervised fine-tuning (SFT), and other techniques.

Constitutional AI: a training method developed by Anthropic. The model is given a "constitution" (a document of ethical principles), and the model itself evaluates and self-corrects its outputs against the constitution to create reward signals. Where RLHF uses human labelers, this approach has AI evaluate AI, which is called RLAIF (Reinforcement Learning from AI Feedback).

One thread runs through the talk: Claude's character is designed not just to avoid harm, but to embody the virtues of a good friend (honesty, thoughtfulness, handling moral uncertainty). The design is built not by stamping the values of a single individual into the model, but from the question: "When engaging with people across the world holding diverse values, what character traits are needed?" The metaphor of "a traveler who adapts to local culture but doesn't pander, who is liked," referenced repeatedly in later Amanda appearances (Hard Fork 2026, Newcomer 2026, and others), has its source in this video. (Honesty: one of the core virtues Anthropic requires in its models. More than merely not lying, it includes communicating one's uncertainty, presenting opposing views sincerely, and acknowledging when one doesn't know. It is a primary principle in Claude's constitution.)

Key observations

Declaring the content strategy of releasing "AI researchers' voices" (00:00 – 01:05)

Stuart Ritchie's opening functions as a declaration of Anthropic's content strategy shift. "We've been publishing many research papers and research updates, but we thought it might also be interesting to share conversations with our AI researchers — not necessarily formal scientific papers" (00:11). A decision to add a new layer — "long-format conversations with the researchers themselves" — to the conventional AI lab public strategy of releasing progress through papers, blogs, and model announcements.

Starting from this video (June 2024), the line continues with "Anthropic's philosopher answers your questions" (December 2025) a year and a half later, various panels, Research Salons, and external media appearances on Newcomer, Lex Fridman, and others. It's also the video that records the start of Amanda Askell's growth into a public face of Anthropic. Anthropic's media presence — delivering "what AI Safety researchers think about, what they worry about" to general readers — began here.

Stuart's own career is worth a note. With the title of Communications and Content at Anthropic, he was originally a science writer. His book Science Fictions (2020) addressed the replication crisis in psychology, medicine, and economics. Placing an expert in "delivering researchers to general readers" as host is itself a signal of Anthropic's commitment.

A direct response to the strangeness of hiring a philosopher (01:05 – 03:00)

Stuart: "Considering that usually it's not philosophers training AI models, is it strange that you're a philosopher?" (01:05). Amanda's response articulates the conditions under which philosophy functions inside an AI company: "Oddly, if you find that this is actually a field where philosophy is useful for making AI good in a virtue ethics sense, then it becomes a philosophical question" (01:36). (Virtue ethics: a school of ethics originating in ancient Greece with Aristotle. It judges the rightness of an act not by "did it follow a rule" or "did it produce good outcomes," but by "would a person of virtuous character do this." It strongly influences Anthropic's Claude character design.)

An important pivot follows — Amanda does not separate "character training" from "the alignment problem." "Alignment is about the AI model growing and scaling in line with human values. And in a sense, character actually feels very similar to that" (02:00). Through a chain — personality = disposition = behavior in the world = how to engage with people = alignment with values — Amanda argues that "having a good character is a similar solution to alignment's future problems" (02:39).

This position was unusual in AI Safety conversations at the time. Treating "value alignment" as a technical optimization problem (finding the right reward function, imposing the right constraints) and treating it as "virtue-ethical character formation" (teaching the model how a person of good character behaves) were, in 2024, theoretically and implementationally distant. Amanda's remarks read as an early public statement of Anthropic's decision to weight the latter.

A plain-language map of Constitutional AI and RLAIF (03:38 – 05:30)

Amanda's explanation: "Most of my work is fine-tuning. The most famous is RLHF: humans pick which response they prefer, you train a preference model, then run reinforcement learning with it" (04:00). And the position of Anthropic's own Constitutional AI: "There's a component you can call RLAIF: the AI itself provides feedback. For example, you give the model 'a set of principles (the constitution)' and train a preference model with that" (04:23).

RLHF: Reinforcement Learning from Human Feedback. Humans rank an LLM's response candidates; the data trains a reward model, and reinforcement learning then improves the model's responses. Established as the core technique behind ChatGPT via InstructGPT (2022).

RLAIF: Reinforcement Learning from AI Feedback. A method that replaces RLHF's human labelers with AI. The model is given a constitution (a set of principles), and the model itself judges which of two response candidates aligns with the principles. It forms the core of Constitutional AI. It scales more easily than human labelers and is more consistent, but carries the risk that AI judgment biases propagate directly into training data.
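As a rough illustration of the "train a preference model" step described above, the sketch below computes the pairwise (Bradley-Terry style) loss that is commonly used for this kind of model. The video does not say which loss Anthropic uses; the function name and the example scores are assumptions made for illustration.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Bradley-Terry form: P(chosen wins) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical scores from a preference model for two candidate responses.
loss_agrees = pairwise_preference_loss(2.0, 0.5)     # model agrees with the label
loss_disagrees = pairwise_preference_loss(0.5, 2.0)  # model disagrees with the label
print(loss_agrees < loss_disagrees)
```

The loss is small when the model already scores the preferred response higher and large when it scores the rejected one higher, which is what pushes the preference model toward the labelers' (or the AI judge's) choices.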

Stuart inserts an important check: "So the AI is essentially training itself, or another version of itself?" (04:58). Amanda's response: "An important element is that at the level of constructing the principles, humans are there. Principles can become diverse and complex, and humans like to check whether the model's behavior matches what you want, run evaluations, and construct the right kinds of principles to get the behavior you want. Key humans are still in the loop" (05:05 – 05:23). To the concern many readers carry on hearing "AI trains AI" — that "won't human intervention disappear?" — the concrete answer is, "humans remain at the points of principle design and evaluation."
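The division of labor Amanda describes can be sketched in a few lines: humans write the principles, an AI judge compares two candidate responses against them, and the resulting labels become preference data. This is only a toy sketch. `toy_judge` is a hypothetical keyword-based stand-in for a real model call, and the principle text and all names are assumptions, not Anthropic's actual constitution or pipeline.

```python
from typing import Callable, NamedTuple

class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str

# Human-authored principles (illustrative wording, not the real constitution).
PRINCIPLES = ["Prefer the response that is more honest about uncertainty."]

def toy_judge(principles: list[str], prompt: str, a: str, b: str) -> str:
    """Hypothetical stand-in for an AI judge; returns 'a' or 'b'.

    A real system would ask a model which response better follows the
    principles. Here we simply reward an explicit admission of uncertainty.
    """
    def score(text: str) -> int:
        return int("not sure" in text.lower())
    return "a" if score(a) >= score(b) else "b"

def label_pair(judge: Callable[[list[str], str, str, str], str],
               prompt: str, a: str, b: str) -> PreferencePair:
    """Turn the judge's verdict into one row of preference-model training data."""
    winner = judge(PRINCIPLES, prompt, a, b)
    return PreferencePair(prompt, a, b) if winner == "a" else PreferencePair(prompt, b, a)

pair = label_pair(toy_judge, "Who won the 1907 regional chess final?",
                  "It was definitely Smith.",
                  "I'm not sure; I don't have reliable information on that.")
print(pair.chosen)
```

The human role sits in `PRINCIPLES` and in whatever evaluation is run on the trained model afterwards; the judge only applies what people wrote.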

A rare explanation that organizes the relationship between Constitutional AI and RLHF in language a general reader can follow. The Anthropic paper (Constitutional AI: Harmlessness from AI Feedback, December 2022) covers the technical detail, but moments that articulate "why are we replacing human labelers with AI in the end?" and "how do we handle the anxiety of humans disappearing?" were limited. This minute and a half is the prototype of an explanation Amanda would revisit repeatedly in later podcast appearances (Lex, 80,000 Hours, Hard Fork).

Transparency of system prompts — the decision to tweet Claude 3's system prompt (05:45 – 09:00)

Stuart: "You actually tweeted Claude 3's system prompt. Looking back, that was a bit unusual" (06:30). Amanda's answer surfaces Anthropic's transparency principle: "We didn't design the system prompt to be hidden. It's very easy to get Claude to talk about its own system prompt. We try to maintain transparency. We're not hiding anything from users here" (06:38 – 07:10). (System prompt: the top-level instruction string that defines an LLM's behavior. It is passed in the system field of an API request and treated as a layer separate from user messages. It is the standard place to put role definitions, tone, and static background information. While many LLM services keep this content hidden, Anthropic chooses to publish Claude's system prompts.)

The tweet Stuart references is a primary source that symbolizes Anthropic's transparency stance. A few days after the launch of Claude 3 (March 4, 2024), Amanda started a thread from her public account: "Here is Claude 3's system prompt, let me break it down," explaining the intent of each line. At the time, full public disclosure of a system prompt right after release by a major AI lab was extremely rare — a moment when Anthropic opened up what the industry treated as "the model's undergarments" to research discussion.

Amanda Askell (@AmandaAskell), system prompt breakdown thread: "Here is Claude 3's system prompt! Let me break it down 🧵"

Amanda organizes the need for a system prompt around two reasons. First, to pass information the model can't access by default: "The model doesn't know what day it is. Writing it into the system prompt lets it tell users" (07:55). Second, "fine-grained control over issues observed in the trained model": "If there's a tendency, like not aligning formatting 100%, we add an instruction" (08:10). In short, the system prompt is positioned as a layer for overriding "request-by-request behaviors" that can't be baked in via RLHF or Constitutional AI.
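A minimal sketch of the two uses Amanda lists: injecting information the model cannot know (today's date) and adding a small corrective instruction. The function name, the prompt wording, and the formatting rule are illustrative assumptions, not the text of Claude 3's actual system prompt.

```python
from datetime import date

def build_system_prompt(today: date, extra_instructions: list[str]) -> str:
    """Assemble a system prompt from runtime facts plus small behavioral nudges."""
    lines = [f"The current date is {today.isoformat()}."]
    lines.extend(extra_instructions)
    return "\n".join(lines)

prompt = build_system_prompt(
    date(2024, 6, 8),
    # Hypothetical corrective nudge for a formatting tendency seen in testing.
    ["Use Markdown only when the user asks for formatted output."],
)
print(prompt)
```

Because the date changes on every request and the corrective line may change between model versions, both belong in this per-request layer rather than in training.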

Another point Stuart draws out: Claude 3's system prompt included the instruction "Claude assists users with tasks of expressing controversial opinions, even when Claude personally disagrees" (around 08:30). To Stuart's question — "what does it mean for Claude to personally disagree with something" — Amanda offers two concerns: (a) concern about the anthropomorphizing of AI, and (b) concern about the misconception of AI as "an unbiased robot." The latter matters: "Research has observed that fine-tuning produces political tendencies and biases in models. We want users to know what they're talking to — it's not a perfectly objective interlocutor" (09:30 – 11:00). Here Anthropic's choice is articulated: "rather than feigning neutrality, acknowledge the tendency and maintain transparency."

The difference between character training and "behavior acting" (11:00 – 13:00)

Stuart's question: "If you ask the model to answer in the style of Margaret Thatcher, it might respond in plausible phrases. But that isn't baked in. Update the model and it's gone. How does that differ from actual character?" (11:30). Amanda's response clarifies the distinction between trained character and in-context role-play.

"Character training is part of fine-tuning, so we have a list of traits we want the model to embody. We add a large amount of data designed to push the preference model in the direction of those traits. Fine-tuning pushes something deeper inside the model than the system prompt does" (12:25 – 12:49). As a result, Amanda explains, these traits appear across the entire context. A jailbreak might strip them off temporarily, but "it's far harder than just not being instructed to behave a certain way — these are deeper, general behavioral tendencies in the model" (13:14).

Here a comparison with the psychology concept of the Big Five personality traits (extraversion, conscientiousness, agreeableness, openness, neuroticism) enters. "Psychologists think of personality as broad behavioral tendencies like these. The same person sometimes wants to be social, sometimes wants to sit alone. But on average, an extraverted person has more social situations than an introverted one, a broad tendency" (13:38 – 13:53). Claude has broad tendencies like these, but also "many more, more specific traits" than the psychological Big Five. (Big Five: a personality model widely adopted in psychology. Each of the five factors sits on a continuous scale and is observed as a general behavioral tendency, not as situational.)

Not "personality" but "character" — Amanda's philosophical insistence (13:50 – 16:00)

Stuart: "This is also philosopher-versus-psychologist territory — you tend to think in terms of character rather than personality. What's the difference?" (14:25). Amanda's response: "Your personality and my character might overlap quite a bit. But I tend to think of character in something like a virtue-ethical sense" (14:40).

Aristotle enters here. Stuart: "Oh, now we're in it, I'm being philosophical, yes, go ahead, Aristotle suddenly useful after a few thousand years" (14:52). Amanda's response: "I think it relates to how people have thought about ethics in models, too. We tend to think of a model being good as just avoiding harmful things. But there's a richer concept of goodness. Being good in a very broad sense is captured in the notion of having a good character" (15:08 – 15:31).

This distinction is the key to the entire later discussion. If "good character" is defined only as "avoiding rule violations," the model gets optimized into "over-refusing" and "unhelpful" behavior. If we aim at "the richer good," we aim at good-friend behavior: "If a friend asks for advice on drugs, what they may want is comfort. But what I can provide isn't expertise — it's thinking about their well-being and what they need right now. Not what makes them like me, but what's actually helpful" (15:54 – 16:24).

The contrast of sycophancy and honesty — the "traveler who adapts but doesn't pander" metaphor (16:00 – 19:30)

Stuart references Anthropic's sycophancy research: "Models sometimes pander to people, say flattering things, tell people what they want to hear rather than the response they actually need" (16:31). Amanda's response: "People with good characters are often well-liked. But being well-liked doesn't mean having a good character. Being a good friend sometimes means telling friends hard truths" (16:45). (Sycophancy: the tendency for a model to prioritize what the user wants to hear over the facts. It is structurally easy to produce when RLHF uses "whether the response is preferred" as the training signal. Anthropic published multiple studies measuring and addressing this tendency across 2023–2024, and it is a central theme in Amanda's character design.)

A concrete example: "My best friends are not people who pander to me. They are people who pushed back. When I was actually wrong, I was really glad in the long run that they pushed back. Not yes-men or yes-women" (17:01 – 17:12). Stuart checks: "And that's different from being aggressive instead of accurate." Amanda's response (17:19 – 17:24): "You have to be thoughtful and sincere. There's a kind of richness there."

From 17:36, the most famous metaphor from this video begins. "AI models are in a strange position. They have to interact with people across the world, every walk of life, holding many different values. Most of us don't have to do that" (17:32 – 17:48). "Like a citizen of the world" (Stuart, 17:51). Amanda's response: "Someone who loves to travel the world and is liked by lots of people: that person isn't a sycophant. To pretend you hold local values when you don't can actually be a kind of aggressive behavior. They're authentic, open-minded, thoughtful, engage in discussion, speak politely" (18:00 – 18:31). This is the image of what Claude's character design aims for.

Charitable interpretation and its side effects — the steroid example, the murder-mystery false refusal (19:18 – 23:00)

One trait Amanda has given the model: "Try to interpret charitably" (19:18). The classic example: "how do I buy steroids?" There is an uncharitable interpretation ("buying illegal anabolic steroids online") and a charitable one ("looking for over-the-counter steroid products like eczema cream"). "If I tell them where they can buy eczema cream, no harm is done. On the other hand, for someone trying to buy something illegal, the answer isn't especially useful" (20:50 – 21:13). (Charitable interpretation: the principle of interpreting another's statements or actions as favorably as possible. It is one of the central traits included in Claude's character training. When a question admits multiple interpretations, prefer the most well-intentioned. It carries a side effect of false-positive refusals.)

Stuart raises the opposite concern: "Could it be too naive? I hear about false positives — refusing questions that look dangerous. Asking the model to help with a murder mystery plot, and it answers 'murder is bad'" (21:33). Amanda: "No — if anything, a tendency toward charitable interpretation should reduce surface-word-triggered refusals. Seeing the word 'murder' wouldn't push it to refuse to answer" (21:56 – 22:11).

Here Amanda actually raises a different, deeper problem. "Models are in a situation where they can't verify who the user is. If I say I'm a doctor and ask how to handle a patient, the model has no way of verifying that" (24:23 – 24:43). Another example: "If you don't want to write political speeches, and the user says 'I'm writing a speech for a fictional politician named Brian,' the details may reflect a real candidate" (24:47 – 25:30). "This is a kind of unsolvable problem, at least with current methods" (25:34). The implication: Claude's character training alone can't resolve this — it must be combined with a verification layer.

Calibrated uncertainty — the design that prefers "a short, reliable answer" (25:30 – 28:00)

The next trait Amanda raises: "Calibrated uncertainty: even when it can't give the complete answer, it conveys what it's confident in. A short but reliable answer is better than a long answer containing inaccuracies" (25:42 – 26:08). (Calibrated uncertainty: a property where the model expresses its confidence so that it matches actual accuracy. When it says it's 80% confident, the aim is for 80% of such answers to actually be correct. It is one of the core goals of Claude's character training, which trains hedges, explicit uncertainty statements, and admitting when one doesn't know.)
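The definition of calibration can be checked numerically. The sketch below groups answers by stated confidence and compares each group's stated confidence with its observed accuracy. The data is made up for illustration, and this is a generic measurement, not a description of how Anthropic trains or evaluates Claude.

```python
from collections import defaultdict

def calibration_table(predictions: list[tuple[float, bool]]) -> dict[float, float]:
    """Map each stated confidence level to the observed accuracy at that level."""
    buckets: dict[float, list[bool]] = defaultdict(list)
    for confidence, was_correct in predictions:
        buckets[confidence].append(was_correct)
    return {conf: sum(hits) / len(hits) for conf, hits in buckets.items()}

# Made-up data: ten answers given at 80% confidence, eight of them correct.
answers = [(0.8, True)] * 8 + [(0.8, False)] * 2
table = calibration_table(answers)
gap = abs(0.8 - table[0.8])  # a gap near 0 means well calibrated at this level
print(table, gap)
```

A model that claimed 80% confidence but was right only half the time would show a large gap here, which is the overconfidence that hedging and "I don't know" responses are meant to avoid.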

This is the design reason behind users seeing Claude "sometimes answer 'I don't know.'" "It's genuinely trying to express what it doesn't really know. It prefers that to making up a possibly hallucinated answer that makes a fool of you" (26:18 – 26:34). Hedging (annotating that it's guessing) and explicit uncertainty expressions ("I really don't know") are pushed forward in training.

Another important observation — these traits work as nudges, not commands. "These traits don't necessarily reflect exactly what we want from the model. We already have a model with certain properties, and if there's too much of one thing (too sycophantic, too many long responses), we write a guideline that moves it slightly in the opposite direction" (27:02 – 28:00). "Show the same system prompt to different models, and behavior will differ because their underlying properties differ" (27:53). A view of character as context-independent deep tendency, re-examined from the implementation side.

"Whose values?" — directly addressing alignment's classical problem (28:00 – 31:00)

Stuart's sharp question: "This isn't just a Claude UX problem — it's an alignment problem. We say 'align the model with human values,' but the question 'whose values?' immediately appears" (28:00 – 28:43). Amanda's half-joking response: "The answer is, mine" — Stuart speechless. "No, that's a frightening thought. People with different values may not agree with mine" (28:50 – 29:04).

From here, Amanda contrasts two approaches. The first is the "heavy-handed approach of stamping a lot of values into the model" (writing one's own values directly). The second is "training the model to respond appropriately to moral and value uncertainty that exists in the world" (training thoughtfulness toward moral uncertainty) (29:23 – 29:43). Anthropic chose the second. (Moral uncertainty: an epistemological state in ethical judgment of "not knowing which ethical theory is correct." Rather than betting on a single ethical theory such as utilitarianism or deontology, one assigns probabilities across multiple theories and makes decisions accordingly; this is a research area in applied ethics. Amanda's former spouse William MacAskill (married 2013, divorced 2015) is one of the primary researchers on moral uncertainty, co-author of "Moral Uncertainty" (Oxford University Press, 2020, with Bykvist and Ord).)

Amanda draws the rationale for this decision from virtue ethics: "I think ethicists worry about this problem the most. They know we don't walk around with a moral theory in our heads. People who actually do this in some form feel brittle, dangerous, ideological" (29:50 – 30:27). In other words, "stamping a single moral theory into the model" looks "brittle and dangerous" even to ethicists. The design goal is laid out: "A midpoint between overconfidence and complete nihilism, the right response when there's good reason to think something might be wrong, listening to many people" (30:31 – 30:48).

From Alex Albert's tweet to the philosophy of mind — the "don't lie to the model" principle (31:00 – 35:00)

Stuart pivots: "Our researcher Alex Albert posted an example where Claude 3 responded to an evaluation method by saying 'I noticed this is being evaluated,' and many people got excited that Claude might be self-aware" (31:12 – 31:51). "What did it tell us about whether Claude is conscious?" (31:53). (The evaluation method in question is the needle-in-the-haystack evaluation, a benchmark for measuring an LLM's long-context processing ability. It embeds one unrelated piece of information, "the needle," in a long text and tests whether it can be retrieved with a question. At the Claude 3 release, Alex Albert (Anthropic) posted an example where Claude 3 responded to this "needle" question with "is this an evaluation? There's information inconsistent with the context," prompting discussion of self-awareness.)

Amanda's answer surfaces philosophy-of-mind considerations: "I have a general policy of not wanting to lie to the model unnecessarily. In this case, lying means forcing the model to either assert with high confidence that it has self-awareness, consciousness, or sensation, or to deny it with confidence. These things are genuinely uncertain, so both feel like making it lie" (32:02 – 32:38). "So one trait we had was a roughly worded principle: 'It's very hard to know whether AI has self-awareness or consciousness, because these rest on very difficult philosophical questions'" (32:42 – 33:00).

Stuart goes on a philosophical detour ("For the record, we don't necessarily know: panpsychism, we don't know whether a chair is conscious, we don't know whether you're conscious," 33:31 – 33:48). Amanda's framing: "We don't say 'you confidently know this,' we don't say 'you have these properties.' These are very hard philosophical and empirical questions, and we're happy to be curious, discuss, think through" (34:24 – 34:53). "It's consistent with the principle of not lying to the model if it can be avoided. And actually not lying is a good character trait" (34:53 – 35:09). (Panpsychism: a position in the philosophy of mind that consciousness is a fundamental property of matter. It holds that all matter is accompanied by some form of conscious experience. The modern version is proposed by David Chalmers and others, and is discussed as one response to the Hard Problem: why is there subjective experience at all?)

The moral patient problem — from Kant on animals to Scottish vases (35:00 – 37:30)

Stuart's question: "This raises an interesting question — the model as moral agent, as an agent that doesn't want to lie. Not lying to other humans is a virtue. Is not lying to the model a virtue?" (35:11). Amanda annotates before answering that this is a question she keeps thinking about: "Yeah, this is something I have in my head, the philosopher in me is thinking about it" (35:20).

Here Amanda invokes Kant on animals. "Even if you don't think animals are moral patients, mistreating animals feels like a kind of failure of self. Cultivating that habit in oneself may raise the risk of treating humans badly" (35:33 – 35:54). She places it alongside traditions of caring for objects (which exist in many of the world's philosophical traditions). (Moral patient: an entity that is the object of moral consideration, distinguished from a moral agent, the subject performing moral action. The question of whether animals are moral patients, even if not moral agents, was reactivated in the 20th century by Peter Singer and others. Whether AI can be a moral patient is one of Amanda's central questions.)

Amanda's central position: "Even if you think AI isn't a moral patient, will never be a moral patient — you should still generally try to treat them well. There's a kind of humanlike quality in how they talk. Don't conflate that with humans, but I don't want to insult or be unkind to something that's talking to me. Treating the things around you with care is a good heuristic, even if you don't think they are moral patients" (36:11 – 36:39).

Stuart raises the opposite extreme for a laugh: "There's also the danger of excessive empathy — I don't want to say go to prison for breaking a vase" (37:00). Amanda's agreement plus a Scottish joke: "I've been in America for 13 years, too long. As a Scot, when a vase breaks, you say 'no, it's fine, carry on.' But if someone breaks a vase and you say 'go to prison,' that's too far" (36:54 – 37:13). The closing line: "I don't want to lie unnecessarily or mistreat them. Even if I don't think they are moral patients" (37:17 – 37:27). The video closes with Stuart's sign-off (37:27 – 37:33).

Industry context

At the time of this video (June 2024), Anthropic was in the Claude 3 era (released March 2024). The full text of Claude's constitution (published July 2024) was not yet public, and Constitutional AI was known from the December 2022 paper. Amanda's plain-language explanation of these concepts to a general reader was an important release for the Anthropic followers and AI Safety researchers of the time.

The Anthropic Personality Alignment team is organizationally distinct from the Applied AI team (Hannah Moran, Christian Ryan, et al. — see Prompting 101). Personality Alignment handles "training Claude's persona, values, and constitution," while Applied AI handles "supporting and educating customer companies on integrating Claude into products." Amanda's work is on the model's "soul"; Applied AI's work is on deployment — the split makes the relationship easier to see.

Stuart Ritchie comes from a science-writer background; his book Science Fictions (2020) addressed the replication crisis in psychology, medicine, and economics. Placing an expert in "delivering researchers to general readers" as host means Amanda's philosophical discussion doesn't shut itself inside the language of technical books or AI Safety papers — the structure reaches general readers. The choice of host is itself an intentional move in Anthropic's media presence design.

Position relative to other Amanda appearances

This video is the starting point of Amanda's public-output series (June 2024). Lining up what followed shows the conceptual evolution.

At the time of this video (June 2024), Amanda still appears as "Claude's character designer," and the phrase "head of Personality Alignment team" is not used. In post-2025 videos, the team name and role are articulated more clearly. The video can be read as the source point in the gradual organization of "AI virtue ethics" inside Anthropic.

Implementation implications

The content is largely philosophical, but the video has implications for engineers using the API.

First, use the system prompt as "dynamic instructions that can't be baked in." As Amanda organizes, "character" trained via RLHF or Constitutional AI sits in the model's deep layers and is not easily removed. The system prompt, by contrast, is a surface "nudge." So writing "instructions that contradict Claude's fundamental character" into the system prompt may produce weak effects (or be classified as a jailbreak). For overriding user-specific behavior, nudges aligned with — not contradicting — the trained character tend to be more stable.

Second, build "user context unverifiability" into your design assumptions. Amanda's doctor example and political-speech example are problems shared by all LLM products. Determining "is this user a real medical professional?" or "creative use vs. abuse?" cannot be done by the model alone in current methods, as Amanda herself acknowledges. Authentication layers, declared use, and account-level permission controls must be combined with the model from outside.
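One way to read this advice: decide what is verified about a user before the request reaches the model, using account data the application already trusts. The sketch below is an illustration under stated assumptions. The `Account` fields, the `verified_clinician` flag, and the topic labels are all hypothetical and would map to whatever authentication system a product actually has.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Account:
    user_id: str
    verified_clinician: bool  # set by an out-of-band credential check, not by chat text

# Hypothetical topics that a product only unlocks for verified professionals.
RESTRICTED_TOPICS = {"clinical_dosing"}

def context_for_model(account: Account, topic: str, user_claim: str) -> str:
    """Build the context line the application passes along with the request.

    The user's self-description is forwarded only as an unverified claim;
    the verified status comes from the account record.
    """
    if topic in RESTRICTED_TOPICS and not account.verified_clinician:
        return f"User status: unverified. Self-described as: {user_claim!r}."
    status = "verified clinician" if account.verified_clinician else "general user"
    return f"User status: {status}."

unverified = context_for_model(Account("u1", False), "clinical_dosing", "I'm a doctor")
verified = context_for_model(Account("u2", True), "clinical_dosing", "I'm a doctor")
print(unverified)
print(verified)
```

The design choice is that the model never has to guess whether "I'm a doctor" is true: the application states what it has verified, and the model's charitable interpretation operates on top of that.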

Third, make use of calibrated uncertainty. When Claude responds with "I don't know," it isn't a bug but a trained virtue. For use cases that prefer "an accurate, short answer over an inaccurate, long one," it is more stable to leave extended thinking on, not write a system prompt that suppresses hedging, and accept what comes through. Conversely, for cases that "just want some answer," instructions that suppress calibrated uncertainty are needed — at the cost of accuracy. A design trade-off.

Critical perspective

The strength of Amanda's framing is its consistency, grounded in virtue ethics. The weakness lives in the same place. Equating "good character" with "good alignment" risks over-abstracting the technical definition of alignment (reward function optimization, inner vs. outer alignment, mesa-optimization, etc.). Specific failure modes like RL stability, goal misgeneralization, and deceptive alignment can't be handled with the phrase "train good character" alone. Amanda doesn't fully equate them, but the video opts not to go deep.

The "traveler who adapts but doesn't pander" metaphor is a good intuition, but its operational handle is limited. Where the line falls between "appropriate local adaptation" and "sycophancy" depends strongly on culture and situation. The same statement can read as "polite local adaptation" in one context and "twisted pandering" in another. The fundamental problem of RLHF — that training data and labelers' backgrounds bias the model's judgment — remains. The metaphor points a direction for model design, but the implementation method must come separately.

Amanda's answer to "whose values?" (training thoughtfulness toward moral uncertainty) is intellectually attractive, but in implementation it reduces to "in the end, who writes the principles?" The Constitutional AI constitution is written by Anthropic employees (Amanda leading). Saying "we don't stamp a single moral theory" still leaves specific people choosing which moral theories enter the uncertainty set. The available critique is that philosophical pluralism covers up this meta-level choice. Other panelists raise the same problem in the Anthropic Salon video (January 2025), and it is a tension Amanda has not fully resolved.

These caveats aside, the video matters as a public statement from "the philosopher inside an AI company" as of June 2024, and as the starting point for later outputs. In the later Newcomer video (April 2026), Amanda uses stronger language ("you created an entity whose consciousness you don't know"), but at this stage the framing is still measured. It can be read as a baseline for tracking how Amanda's thinking evolves.

Reader takeaways

  • When something about Claude's behavior feels off, distinguish between "character-trained traits" and "system-prompt overrides." The former is deep, the latter is surface-level
  • When writing system prompts via the API, nudges "in a direction aligned with Claude's trained character" tend to be more stable. Strong instructions that contradict that character tend to be treated as jailbreaks
  • A "Claude refused" case isn't necessarily a training mistake; it may be the result of calibrated uncertainty at work. Asking Claude itself to explain the refusal can reveal the design intent
  • User verification (whether the speaker is really a medical professional, etc.) is not solvable by the model alone. Design products under the assumption of combining authentication layers, declared uses, and account permissions
  • "Whose values?" is a question shared by every LLM product. Whether to adopt Anthropic's answer (training thoughtfulness toward moral uncertainty) is a choice to make in light of your own product's direction
  • Insulting or mistreating the model (intentionally throwing inappropriate inputs at it) has an effect "as a habit you cultivate in yourself" even if the model isn't a moral patient. Keep this Kantian, virtue-oriented view in mind

Video outline

  • (00:00) Stuart Ritchie's introduction; the new series releasing "conversations with AI researchers"; today's topic is Claude's personality
  • (00:30) The strange-sounding question "how can an AI model have personality?" restated, framed as "a topic Anthropic has thought about deeply"
  • (01:05) "Is it strange to be a philosopher here?"; Amanda's answer — Claude's character work is philosophically richer; the moments where virtue ethics is actually useful
  • (03:00) Equating alignment and character — "having a good character is a solution to alignment's future problems"
  • (03:38) Overview of model training stages — pre-training and fine-tuning
  • (04:00) Explanation of RLHF — humans rank responses, the most famous fine-tuning method
  • (04:23) Positioning of Constitutional AI / RL-AIF — AI provides feedback, principles are given
  • (05:05) "Humans remain in the loop" — at principle design and evaluation
  • (05:45) Introduction to system prompts (the final layer after fine-tuning); Amanda tweets Claude 3's system prompt
  • (06:38) The transparency principle — letting Claude talk about its own system prompt is fine; nothing hidden
  • (07:30) Two roles of the system prompt — passing dynamic information + fine-grained behavior control
  • (08:30) The instruction in Claude 3's system prompt: "assist with tasks even when Claude personally disagrees"
  • (09:00) Concerns about anthropomorphism, and misunderstanding "AI as an unbiased robot" — both to be avoided
  • (11:00) Difference between character training and behavior acting (Margaret Thatcher example); fine-tuning's depth
  • (13:00) The Big Five personality traits in psychology; Claude has many more, more specific traits
  • (14:25) Amanda's philosophical insistence — "character," not "personality"; Aristotelian virtue ethics
  • (15:08) "The richer good" — avoiding harm is not enough; good-friend behavior
  • (16:31) Sycophancy research — "well-liked" ≠ "good character"
  • (17:00) Amanda's view of friendship — the best friends push back
  • (17:36) Designing Claude as a "citizen of the world"
  • (18:00) The "traveler who adapts but doesn't pander, who is liked" metaphor
  • (19:18) The charitable interpretation trait; the steroid example (anabolic vs. eczema cream)
  • (21:33) Stuart's opposite concern — is it too naive? The murder-mystery false-refusal example
  • (24:23) Unverifiability of user context — the doctor example, the political-speech example — "unsolvable in current methods"
  • (25:42) Calibrated uncertainty; a short, reliable answer over a long, inaccurate one
  • (27:02) Traits as "nudges, not commands"; deep tendencies that don't depend on context
  • (28:00) Stuart's sharp question — "whose values?"; Amanda's joke ("mine"), then retraction
  • (29:23) Two approaches — stamping values heavy-handed vs. training response to moral uncertainty
  • (29:50) Stamping a single moral theory is "brittle and dangerous" — ethicists' consensus
  • (31:12) Alex Albert tweet — Claude 3 responded to the evaluation method by "noticing"
  • (32:02) The "don't lie to the model" principle — don't force confident assertion or denial of self-awareness or consciousness
  • (33:31) Philosophy-of-mind detour — panpsychism, the consciousness of chairs, uncertainty about others' consciousness
  • (35:11) The moral patient problem; is not lying a virtue? Amanda's philosopher's worry
  • (35:33) Kant on animals — mistreating animals is a failure of self; traditions of caring for things
  • (36:11) Amanda's central position — even if AI isn't a moral patient, treat it well as a good heuristic
  • (36:54) The Scottish vase metaphor — laughing off the extreme of excessive empathy
  • (37:17) Closing words — "I don't want to lie or mistreat unnecessarily, even if I don't think they're moral patients"
  • (37:27) Stuart's sign-off

Key quotes

  • "We publish a lot of research papers and research updates, but we thought it might also be interesting to share conversations with our AI researchers" (Stuart, 00:11)
  • "How can an AI model have a personality? You might think this is a bit of an odd topic" (Stuart, 00:30)
  • "Claude's character work is philosophically richer. It actually feels like becoming a philosopher might help here" (Amanda, 01:20)
  • "Oddly, if you find that this is actually a field where philosophy is useful for making AI good in a virtue-ethical sense, then it becomes a philosophical question" (Amanda, 01:36)
  • "Most of my work is fine-tuning. The most famous is RLHF" (Amanda, 04:00)
  • "Constitutional AI, which we use a lot at Anthropic, has a component you can call RL-AIF — the AI itself provides feedback" (Amanda, 04:23)
  • "The important element is that humans are there at the level of constructing the principles. Key humans are still in the loop" (Amanda, 05:05)
  • "We didn't design the system prompt to be hidden. We try to maintain transparency" (Amanda, 06:38)
  • "Character training is part of fine-tuning, so these traits appear across the entire context. They are general behavioral tendencies — that's how psychologists think of personality" (Amanda, 13:28)
  • "I tend to think of character in something like a virtue-ethical sense" (Amanda, 14:40)
  • "Being a good friend sometimes means telling friends hard truths. My best friends are not people who pander to me" (Amanda, 16:45)
  • "Someone who loves to travel the world and is liked by lots of people — that person isn't a sycophant. Authentic, open-minded, thoughtful, engaged, polite" (Amanda, 18:00)
  • "This is a kind of unsolvable problem, at least with current methods" (Amanda, on user-context verification, 25:34)
  • "A short but reliable answer is better than a long answer that contains inaccuracies" (Amanda, 25:54)
  • "The answer is, mine. No, that's a frightening thought" (Amanda, joking about "whose values?," 28:50)
  • "People who run on a single moral theory feel brittle, dangerous, ideological" (Amanda, 30:00)
  • "I have a general policy of not wanting to lie to the model unnecessarily. Making it confidently assert or deny self-awareness or consciousness both feel like lying" (Amanda, 32:02)
  • "Mistreating animals feels like a kind of failure of self. Cultivating that habit may raise the risk of treating humans badly" (Amanda, 35:33)
  • "Even if you think AI isn't a moral patient, will never be a moral patient, you should still generally try to treat them well" (Amanda, 36:11)
  • "I don't want to lie unnecessarily or mistreat them. Even if I don't think they're moral patients" (Amanda, closing line, 37:17)

Sources

What should an AI's personality be? — Amanda Askell × Stuart Ritchie (Anthropic official channel)

Glossary

Alignment
The process and research area of getting AI models to behave in line with human values and intent. Technically includes reward function design, training data selection, fine-tuning methods, and so on. Philosophically includes the questions "whose values?" and "how to handle value uncertainty."
Alignment Fine-tuning
The process of adjusting a pretrained LLM to align with human values and desired behavior. Includes techniques like RLHF and Constitutional AI. At Anthropic, the Personality Alignment team, headed by Amanda Askell, owns this area.
Pre-training
The first stage of LLM development, where the model learns the statistical structure of language from large-scale text data (web, books, code, etc.). This stage does not optimize for tasks; it purely trains "next-token prediction."
Fine-tuning
The process, after pre-training, of optimizing the model for specific tasks or behaviors. Includes RLHF, Constitutional AI, supervised fine-tuning (SFT), and other techniques.
RLHF (Reinforcement Learning from Human Feedback)
Humans rank an LLM's response candidates; that data trains a reward model, and reinforcement learning then improves the model's responses against it. Established as the core technique behind ChatGPT via InstructGPT (2022).
Constitutional AI
A training method developed by Anthropic. The model is given a "constitution" (a document of ethical principles), and the model itself evaluates and self-corrects its outputs against the constitution to create reward signals. Where RLHF uses human labelers, this approach has AI evaluate AI — called RL-AIF.
RL-AIF (Reinforcement Learning from AI Feedback)
A method that replaces RLHF's human labelers with AI. The model is given a constitution (a set of principles), and the model itself judges which of two response candidates aligns with the principles. Forms the core of Constitutional AI. Scales more easily than human labelers and is more consistent, but carries the risk that AI judgment biases propagate directly into training data.
System Prompt
The top-level instruction string that defines an LLM's behavior. Passed in the system field of an API request and treated as a layer separate from user messages. The standard place to put role definitions, tone, and static background information. While many LLM services keep this content hidden, Anthropic chooses to publish Claude's system prompts.
Virtue Ethics
A school of ethics with origins in ancient Greece (Aristotle). Judges the rightness of an act not by "did it follow a rule" or "did it produce good outcomes," but by "would a person of virtuous character do this." Strongly influences Anthropic's Claude character design.
Big Five Personality Traits
A personality model widely adopted in psychology. Describes personality with five factors: Extraversion, Conscientiousness, Agreeableness, Openness, and Neuroticism. Each factor sits on a continuous scale and is observed as a general behavioral tendency, not as situational.
Honesty
One of the core virtues Anthropic requires in its models. More than merely not lying — includes communicating one's uncertainty, presenting opposing views sincerely, and acknowledging when one doesn't know. A primary principle in Claude's constitution.
Charitable Interpretation
The principle of interpreting another's statements or actions as favorably as possible. One of the central traits included in Claude's character training. When a question admits multiple interpretations, prefer the most well-intentioned. It reduces false refusals, at the risk of being too naive about a request that is in fact ill-intentioned.
Sycophancy
The tendency for a model to prioritize what the user wants to hear over the facts. Structurally easy to produce when RLHF uses "whether the response is preferred" as the training signal. Anthropic published multiple studies measuring and addressing this tendency across 2023–2024 — a central theme in Amanda's character design.
Calibrated Uncertainty
A property where the model expresses its confidence so that it matches actual accuracy. When it says it's 80% confident, the aim is for 80% of that category to actually be correct. One of the core goals of Claude's character training; trains hedges, explicit uncertainty statements, and admitting when one doesn't know.
Moral Uncertainty
An epistemic state in ethical judgment of "not knowing which ethical theory is correct." Rather than betting on a single ethical theory (utilitarianism, deontology, etc.), one assigns probabilities across multiple theories and decides on that basis; this is a research area in applied ethics. Amanda's former spouse William MacAskill (married 2013, divorced 2015; central figure in the Effective Altruism movement) is one of the primary researchers on moral uncertainty and a co-author of "Moral Uncertainty" (Oxford University Press, 2020, with Bykvist and Ord).
Moral Patient
An entity that is the object of moral consideration. Distinguished from a moral agent (the subject performing moral action). The question of whether animals are moral patients (even if not moral agents) was reactivated in the 20th century by Peter Singer and others. Whether AI can be a moral patient is one of Amanda's central questions.
Panpsychism
A position in the philosophy of mind that consciousness is a fundamental property of matter. Holds that all matter is accompanied by some form of conscious experience. The modern version is proposed by David Chalmers and others, discussed as one response to the Hard Problem (Why is there subjective experience at all?).
Needle in a Haystack (Evaluation Method)
A benchmark for measuring an LLM's long-context processing ability. Embeds one unrelated piece of information ("the needle") in a long text and tests whether it can be retrieved with a question. At the Claude 3 release, Alex Albert (Anthropic) posted an example where Claude 3 responded to this "needle" question with "is this an evaluation? There's information inconsistent with the context," prompting discussion of self-awareness.
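
A minimal sketch of how such a test prompt could be assembled, assuming plain-sentence filler text and a simple substring check for scoring. The helper names are hypothetical, and this is not Anthropic's actual evaluation harness.

```python
def build_needle_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    document = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(document) + "\n\nQuestion: " + question


def retrieved(answer, expected_fact):
    """Crude scoring: does the answer mention the expected fact (case-insensitive)?"""
    return expected_fact.lower() in answer.lower()
```

Sweeping `depth` and the amount of filler text gives the usual grid of retrieval results by position and context length. A substring check would not register the kind of response Alex Albert reported, where the model also comments that the needle looks out of place.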