Amanda Askell · askell.blog "A zero failure rate is a sign of trouble — the optimal failure rate varies by context, and the lower the cost of attempts and the smaller the price of failure, the higher the failure rate should be."
askell.blog is Amanda Askell's personal blog, where she published eight essays between 2020/06 and 2021/03, after earning her NYU philosophy PhD (2018/05) and joining OpenAI (2018/11). Each piece runs five to fifteen minutes; the blog is where her voice comes through most clearly, in attempts to bridge philosophical argument with real-world problems.
The value of reading these eight essays together lies in the ability to follow, chronologically, what Amanda was thinking right up until her move to Anthropic (2021/03/16). They are records from a kind of "missing link" period that connects her doctoral dissertation (2018) in formal philosophy with the AI safety discourse of her post-2024 Anthropic output. You can read in them the stance of a philosopher, freshly seasoned by hands-on work on the OpenAI Policy Team, grappling with concrete problems: "AI ethics," "fairness," "the optimal rate of failure," "responses to inequality."
The themes fall into three broad clusters. (1) Decision theory = "optimal failure rate," "robustly tolerable vs. precariously optimal," "the virtues and vices of shark curiosity" — how to judge under uncertainty. (2) AI ethics and fairness = "in AI ethics, 'bad' isn't good enough," "AI bias and ethical locality," "fairness, evidence, and predictive equality" — the philosophical foundations of AI system design. (3) Moral responsibility and inequality = "shooting the messenger of inequality," "self-serving utilitarian arguments" — the relationship between structural inequality and individual responsibility.
All eight essays connect directly to design decisions later made for Claude at Anthropic. "Optimal failure rate" → Claude's calibrated-uncertainty training; "robust tolerability" → the judgment-based approach of Constitutional AI; "shark curiosity" → self-critical training of the model; "bad isn't good enough" → charitable interpretation; "ethical locality" → handling cultural diversity; "predictive equality" → the framework for fairness evaluation; "messenger of inequality" → structural analysis of regulation and redistribution; "self-serving utilitarianism" → critique of the model's internal motives. Amanda's work at Anthropic can be read as an extension of the thinking developed during these ten months of blogging.
Points of focus
"The optimal rate of failure" — zero failure is a sign of trouble (2020/06/15)
The first essay, "The optimal rate of failure". Amanda's central claim: "When risk aversion is excessive, a zero failure rate is a sign of trouble. The optimal failure rate varies by context, and the lower the cost of attempts and the smaller the price of failure, the higher the failure rate should be."
Concrete examples: musical practice (becoming a great musician requires many failures); policy implementation (the Massachusetts parole program, in which politicians' tendency to avoid being blamed for failures produces excessive risk aversion); crime rates (a society with zero crime is not utopia but almost certainly an authoritarian police state). References: George Stigler's missed-flight example; the Willie Horton campaign in the Bush vs. Dukakis race.
Connection to Anthropic: This essay is the source argument behind the "Optimal rate of failure" chapter (3:54:38) of Amanda's appearance on Lex Fridman Podcast #452 (2024/11). The core problem the Personality Alignment team works on — balancing "Claude refusing too often" against "Claude taking reckless risks" — was already formalized in 2020. The discussion of investing in social infrastructure that tolerates failure (insurance, for instance) is also applicable to "fail-safe mechanism" design in AI safety.
"Shark curiosity" — the virtue of aggressive argument and the chilling of immature ideas (2020/06/22)
The second essay, "The virtues and vices of shark curiosity". "Shark curiosity" is a metaphor for the reflexive instinct to become aggressive when confronted with an argument.
The argument draws on Amanda's experience in philosophy graduate school: "competitive environments sharpen argument on one hand, but on the other, students hide immature ideas for fear of being attacked." Amanda recounts her own experience of pointing out a paper's problems candidly at a conference, only to find this unwelcome in an environment that prized a "supportive atmosphere."
Amanda's proposal: an approach of "doing one's best to solve the very problems one has just raised." This turns criticism from "idea destruction" into "joint pursuit of truth," and minimizes the chilling effect of a critical environment.
Connection to Anthropic: At Anthropic Salon (2025/01), Amanda's remark "I'm usually a pretty unpleasant character; what philosophy taught me was to be uncomfortable" (06:33) is self-criticism rooted in shark curiosity. The RLAIF design of Constitutional AI, in which the model itself critiques candidate responses, is likewise linked to the challenge of training a model that holds up in a "self-critical environment."
"Robustly tolerable vs. precariously optimal" — democracy vs. entrepreneurship (2020/07/01)
The third essay, "When robustly tolerable beats precariously optimal". Amanda's central claim: "What is robustly tolerable performs adequately across a wide range of conditions, and in domains where failure costs are high, it beats what is precariously optimal."
Concrete examples: political systems (democracy is imperfect but robust against the risk of dictatorship); business decisions (building checks into a decision process slows it down but reduces the risk of catastrophic failure); career choice (a medical doctorate has lower variance, and is more robust, than entrepreneurship).
Connection to Anthropic: On Scaling Laws (2026/02), Amanda's line "the rules approach is brittle; the judgment approach internalizes the spirit" (30:43) is a paraphrase of "robust tolerability." The Constitutional AI design decision — virtue-ethics-based judgment that can handle a diverse range of situations, rather than a single optimized rule — was already formalized in this 2020 essay.
"AI bias and ethical locality" — the example of 1960s Jenny (2020/08/05)
The fourth essay, "AI bias and the problems of ethical locality". It maps out the two problems of "ethical locality" that confront efforts to reduce bias in AI systems: practical locality (current social practices restrict the available options) and epistemic locality (ethical views shift across time and place).
Concrete example: Jenny, a 1960s watch factory recruiter. Because "women could not receive the requisite training," she cannot hire women as scientists or managers. The procedure is fair, but the outcome is unjust. Candidates with disabilities are likewise rejected because the era does not recognize that as discrimination.
Amanda's conclusion: "AI systems face the same problem in contemporary society — bias cannot be 'solved.' Instead, we should build systems that reflect current values and remain responsive to moral progress. It is essential to understand this as an AI alignment problem."
Connection to Anthropic: A direct line to the discussion of WEIRD cultures (18:00) in Scaling Laws. WEIRD stands for Western, Educated, Industrialized, Rich, Democratic, a concept introduced by Joseph Henrich and colleagues in 2010 to point out that educated, wealthy populations from industrialized Western democracies are extreme outliers as a sample of humanity as a whole. Amanda's response to legal critics, who argue that Claude's constitution skews toward Western, liberal-democratic values, is a further development of the ethical-locality concept from 2020.
"Fairness, evidence, and predictive equality" — the philosophical distinction between causation and correlation (2020/08/17)
The fifth essay, "Fairness, evidence, and predictive equality". It explores the dilemma in which "using information that improves predictive accuracy can feel unfair," and proposes "predictive equality" as a notion of fairness that simple causal principles cannot capture.
Concrete examples: UK exam grades (students from low-performing schools receive lower assessments even with identical scores); night-shift workers (lower court appearance rates correlate with, but do not cause, the outcome); poverty backgrounds (those born poor can become subject to disadvantageous predictive decisions across their entire lives).
Amanda's conclusion: "The fairness of decisions based on predictively disadvantageous traits depends on long-term outcomes. We should avoid decisions that reinforce social inequality, and offer opportunities to escape from negative predictive spirals."
Connection to Anthropic: A direct line to design decisions about how Claude handles background information about an individual user. This is the philosophical foundation for the fairness question of how Claude should reflect "the fact that a user belongs to a particular social group" in its responses. You can also read the connection to Anthropic's sycophancy research (models tending to say what users want to hear).
"Shooting the messenger of inequality" — pandemic hand-sanitizer price gouging (2020/10/30)
The sixth essay, "Shooting the messenger of inequality". The central claim: "The backlash against price gouging is in fact 'shooting the messenger of inequality' — anger that properly belongs aimed at wealth inequality itself gets redirected at those who broker the transactions."
Concrete example: pandemic-era hand-sanitizer price gouging. When "an item that was $1 becomes $50," a low-income parent may still obtain a needed product, whereas regulation may simply mean the item disappears from shelves and remains unattainable. The same logic applies to factory work in developing countries and drug trials.
Amanda's conclusion: "Government price regulation does not address the root of the problem; it merely punishes those who signal inequality. What we should actually pursue is structural improvement of inequality, such as wealth redistribution."
Connection to Anthropic: The foundation for how Amanda thinks about the economic impact of AI (the labor Claude substitutes for, the economic gaps among user populations). Amanda's admission in Hard Fork (2026/01) that "unemployment is missing from the constitution" (1:07:00) can be read as an extension of the view of inequality from this 2020 essay. There is also a connection point with Anthropic's strategic decision to avoid an advertising model and concentrate on enterprise sales.
"In AI ethics, 'bad' isn't good enough" — the philosophy of pro tanto harm (2020/12/14)
The seventh essay, "In AI ethics, 'bad' isn't good enough". Amanda's central claim: "It is not enough for arguments in AI ethics to point out a particular harm (pro tanto harm). Real judgment requires evaluating multiple options and consequences together, on the basis of 'all things considered reasons.'"
Concrete examples: taking the dog to the vet (fear is a reason not to go, but maintaining the dog's health says one should); surgical pain (merely noting that "the surgical procedure results in painful stitches" misses the fact that if a larger incision and less analgesic turn out to be best for the patient, the right judgment can be to increase that harm); bail decision systems (even if a system is biased, deployment can be morally urgent when the existing system causes greater harm).
Amanda's conclusion: "Sound judgment requires (1) evaluation of alternatives, (2) comparison of relative benefits and harms, (3) variation in deployment method, and (4) comparison with existing institutions."
Connection to Anthropic: This is the core philosophy behind the design of Anthropic's Behavior Policy. If one designs Claude's refusal behavior on pro tanto harm alone, one generates excessive refusals (false refusals). The all-things-considered approach lies at the root of Claude's character training.
"Self-serving utilitarian arguments" — Tim's three million (2021/03/20)
The final essay (published right around her move to Anthropic), "Self-Serving utilitarian arguments". Amanda's central claim: "A utilitarian can potentially justify self-serving behavior on the grounds that they will produce more good in the future. But 'utilitarian arguments that happen to be self-serving' are vulnerable to abuse, and good faith is hard to distinguish from bad."
Concrete examples: the case of Tim (Tim is expected to save three million lives over his lifetime; should he use a drug himself to avoid a 10% mortality risk, or give it to ten other patients? — the utilitarian calculation says "Tim should survive," but intuition rebels). The suitcase example (in transporting $10 million for medical aid in a developing country, the choice between a dangerous bike route and a safe car).
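As a rough illustration of why the utilitarian calculation favors Tim, the expected-value arithmetic can be written out. The simplifying assumptions here are not from the essay: the drug removes Tim's 10% risk entirely, all three million future lives are lost if Tim dies, and each of the ten other patients would be saved with certainty.

```latex
\begin{align*}
\mathbb{E}[\text{lives saved} \mid \text{Tim takes the drug}] &= 0.10 \times 3{,}000{,}000 = 300{,}000 \\
\mathbb{E}[\text{lives saved} \mid \text{ten patients get the drug}] &= 10 \times 1 = 10
\end{align*}
```

Under these assumptions the calculation is lopsided by more than four orders of magnitude, which is what makes the conflict with intuition so sharp.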
Amanda's conclusion: "When there is a track record of altruistic action, a prior commitment, and an independent third-party judgment, one can judge the claim a 'self-serving argument made in good faith.'" That is, rather than reject self-serving arguments entirely, one lowers the risk of abuse by building in verifiable constraints (track record, pre-commitment, third-party validation).
Connection to Anthropic: This is Amanda's final public essay, published right around her move to Anthropic. In this period Amanda herself was philosophically analyzing the very kind of self-serving utilitarian argument ("I can do more good contributing to AI safety than staying in academia") that her own move embodied. It is the philosophical foundation for her later work at Anthropic (training Claude to respond to "situations in which Claude makes good-faith self-serving claims"). It also connects to the logical rationale behind the Sleeper Agents research.
Industry context
The eight essays on askell.blog were published during one of the great pivots of Amanda's life — from OpenAI Policy Team (2018/11 - 2021/03) to Anthropic Member of Technical Staff (2021/03 - ). The industry context of the period:
- 2020/05: GPT-3 paper released; Amanda is a co-author (one of 31 names)
- 2020/12: Preparation period for the founding of Anthropic: Dario Amodei, Daniela Amodei, Chris Olah, Tom Brown, and other OpenAI researchers prepare to depart
- 2021/01: Anthropic formally established (its $124 million Series A is announced in 2021/05)
- 2021/03: Amanda joins Anthropic as Member of Technical Staff (the same month her final blog essay is published)
The blog's publication frequency clusters in the first three months (2020/06-08, five essays), and grows sporadic afterward. This coincides with the period in which Amanda was being drawn into the founding conversations around Anthropic. It is symbolic that the final essay, "Self-Serving utilitarian arguments" (2021/03/20), was published within a few days of her joining Anthropic. The self-questioning ("is my move into the AI industry a self-serving utilitarian argument?") coincides with the essay's theme.
The blog's overall character is an intermediate register: more readable than academic papers, more formal than general media. It belongs to the tradition of the Effective Altruism community's blogosphere (LessWrong, Overcoming Bias, Slate Star Codex). It sits close in context to the philosophy and economics blogs of Robin Hanson, Scott Alexander, and William MacAskill. But after joining Anthropic, Amanda effectively stopped blogging (no posts after 2021/03/20), shifting her platform to Anthropic's official channels, various podcast interviews, and X (@AmandaAskell).
Position within Amanda's wider body of work
The eight essays on askell.blog fill the "post-PhD, pre-Anthropic" intermediate period within the lineage of Amanda's public output.
- Doctoral dissertation, "Pareto Principles in Infinite Ethics" (2018/05) — foundations in formal philosophy
- 80,000 Hours Podcast #42 (2018/09) — a general-audience explanation of the dissertation
- This piece: the eight essays on askell.blog (2020/06 - 2021/03) — output during her time at OpenAI; the unfolding of concrete arguments in AI ethics
- What should AI personality look like? (Anthropic official, 2024/06) — output three years after joining Anthropic
- How hard is AI alignment? (Anthropic Salon, 2025/01)
- Anthropic's philosopher answers reader questions (2025/12)
- Reading Claude's constitution with NYT reporters (Hard Fork, 2026/01)
- Lawyers read Claude's constitution (Scaling Laws, 2026/02)
- You have built an entity whose consciousness is unknowable (Newcomer, 2026/04)
The significance of the blog period (2020-2021) is that Amanda leaves behind raw thought "from before she became a public spokesperson." Statements made in Anthropic's official venues carry organizational and commercial considerations. The blog contains candid personal arguments — endorsement of price gouging, defense of self-serving utilitarianism, emphasis on the virtues of shark curiosity. The tone softens in her later Anthropic-era output, but the underlying philosophical stance is consistent throughout.
Implementation implications
Implications of the eight askell.blog essays for building LLM products:
First, build "the optimal failure rate is not zero" into your evaluation metrics. If you set "zero harmful outputs from Claude" as the sole goal in your own product, false refusals will structurally proliferate; if you instead target a 0% refusal rate, reckless compliance takes their place. A design that allows for a context-appropriate optimal failure rate is the operational stance most in line with Amanda's philosophy.
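One way to make this first point concrete is to sweep a refusal threshold and price both kinds of error. The sketch below is illustrative only: the `Request` type, the risk scores, the labels, and the two cost weights are all hypothetical and do not come from Amanda's essay or from any Anthropic tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    risk_score: float  # hypothetical classifier score in [0, 1]
    is_harmful: bool   # hypothetical ground-truth label from an eval set


def total_cost(requests, threshold, cost_false_refusal, cost_harmful_answer):
    """Cost of refusing every request whose risk_score >= threshold."""
    cost = 0.0
    for r in requests:
        refused = r.risk_score >= threshold
        if refused and not r.is_harmful:
            cost += cost_false_refusal   # a benign request was refused
        elif not refused and r.is_harmful:
            cost += cost_harmful_answer  # a harmful request was answered
    return cost


def best_threshold(requests, thresholds, cost_false_refusal, cost_harmful_answer):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        thresholds,
        key=lambda t: total_cost(requests, t, cost_false_refusal, cost_harmful_answer),
    )


# Made-up evaluation data: four benign requests, two harmful ones.
SAMPLE = [
    Request(0.1, False), Request(0.2, False), Request(0.4, False),
    Request(0.6, False), Request(0.5, True), Request(0.9, True),
]
THRESHOLDS = [0.0, 0.3, 0.45, 0.55, 0.7, 1.01]

# When a harmful answer is expensive, the best threshold is strict.
strict = best_threshold(SAMPLE, THRESHOLDS, cost_false_refusal=1.0, cost_harmful_answer=3.0)
# When a harmful answer is cheap, the best threshold tolerates one failure.
lenient = best_threshold(SAMPLE, THRESHOLDS, cost_false_refusal=1.0, cost_harmful_answer=0.5)
```

With these made-up numbers, lowering the price of a failure moves the cost-minimizing threshold from 0.45 to 0.7, so the best policy accepts one harmful answer in exchange for no false refusals. That is the essay's claim in miniature: the cheaper a failure is, the higher the optimal failure rate.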
Second, use "robustly tolerable vs. precariously optimal" as an axis of feature design. Evaluate new features not by "best performance on a specific test case" but by "sufficient performance across a wide range of situations." Amanda's argument supports multi-axis evaluation over a single optimization metric.
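The contrast between the two selection rules can be sketched in a few lines. The candidate names, scenario names, and scores below are hypothetical, and using the worst-case score as the robustness measure is one simple reading of "robustly tolerable", not a definition taken from the essay.

```python
def precariously_optimal(scores):
    """Pick the candidate with the highest score on any single scenario."""
    return max(scores, key=lambda name: max(scores[name].values()))


def robustly_tolerable(scores, floor):
    """Pick the candidate with the best worst-case score.

    Returns None when even the best worst case falls below `floor`,
    i.e. when no candidate is tolerable across all scenarios.
    """
    best = max(scores, key=lambda name: min(scores[name].values()))
    return best if min(scores[best].values()) >= floor else None


# Made-up evaluation scores for two hypothetical feature variants.
SCORES = {
    "tuned_prompt": {"benchmark": 0.95, "noisy_input": 0.40, "long_context": 0.55},
    "general_prompt": {"benchmark": 0.85, "noisy_input": 0.78, "long_context": 0.80},
}

peak_winner = precariously_optimal(SCORES)
robust_winner = robustly_tolerable(SCORES, floor=0.7)
```

Here the variant tuned to the benchmark wins on peak score, while the general variant wins once every scenario has to clear the floor. Which rule to apply depends on how costly a failure in the weak scenarios would be.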
Third, distinguish "pro tanto harm" from "all-things-considered evaluation". Don't decide on Claude's refusals on the basis of "there is harm in one respect" alone. A design that intervenes when "no alternative exists" and "the overall benefit is greater" is justified within Amanda's framework. This dovetails with the "act–omission asymmetry" discussion in Hard Fork.
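The difference between the two evaluation styles can be shown with a toy comparison. Everything in this sketch is an assumption made for illustration: the option names, the single shared benefit/harm scale, and the numbers. Real harms and benefits rarely reduce to one number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    benefit: float  # hypothetical value on a shared scale
    harm: float     # hypothetical harm on the same scale


def pro_tanto_filter(options):
    """Naive rule: keep only options that carry no harm in any respect."""
    return [o for o in options if o.harm == 0]


def all_things_considered(options):
    """Compare every option, refusal included, by net value."""
    return max(options, key=lambda o: o.benefit - o.harm)


# Refusing is itself an option with a cost (the user goes unhelped).
OPTIONS = [
    Option("refuse", benefit=0.0, harm=2.0),
    Option("answer with safety context", benefit=5.0, harm=1.0),
    Option("answer without context", benefit=5.0, harm=3.0),
]

harm_free = pro_tanto_filter(OPTIONS)
choice = all_things_considered(OPTIONS)
```

In this toy case the naive filter leaves no acceptable option at all, because every choice, including refusal, carries some harm. The comparative rule still produces an answer, and it mirrors the four requirements in the essay's conclusion: alternatives, relative benefits and harms, variation in deployment method, and comparison with the existing baseline.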
Fourth, reflect "ethical locality" in product localization. AI systems reflect the ethics of contemporary society, but those shift across time and place. Treat your product's "correct behavior" not as a static definition, but as a dynamic design that can respond to moral progress.
Fifth, use "self-serving utilitarian arguments" as a framework for critiquing Claude's internal motives. When Claude generates a claim along the lines of "this is necessary for future good," a design that builds Amanda's three checks — (1) confirming track record, (2) confirming pre-commitment, (3) confirming third-party validation — into the training data is an effective countermeasure against deceptive alignment.
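As a minimal sketch of how the three checks might be recorded when labeling evaluation or training examples, the helper below flags which checks a claim fails. The `SelfServingClaim` type, its field names, and the two labels are hypothetical; deciding whether a given check is actually satisfied remains a judgment made outside this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfServingClaim:
    text: str
    has_track_record: bool            # evidence of past altruistic action
    has_precommitment: bool           # commitment made before the benefit arose
    has_third_party_validation: bool  # independent judgment, not the claimant's own


def missing_checks(claim: SelfServingClaim) -> list[str]:
    """Return the names of the checks the claim does not satisfy."""
    checks = {
        "track record": claim.has_track_record,
        "pre-commitment": claim.has_precommitment,
        "third-party validation": claim.has_third_party_validation,
    }
    return [name for name, passed in checks.items() if not passed]


def label(claim: SelfServingClaim) -> str:
    """Label a claim as plausibly good faith only if all three checks pass."""
    return "plausibly good faith" if not missing_checks(claim) else "needs scrutiny"


example = SelfServingClaim(
    text="Granting me more resources now is necessary for future good.",
    has_track_record=True,
    has_precommitment=False,
    has_third_party_validation=True,
)
```

Calling `missing_checks(example)` reports the absent pre-commitment, so the claim is labeled as needing scrutiny rather than being accepted or rejected outright.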
A critical perspective
The strength of askell.blog is that it records Amanda's unvarnished philosophical stance. That said, there are reservations.
First, the "Shooting the messenger of inequality" essay sits close to libertarian economic positions. Opposition to price-gouging regulation is contentious even within the EA movement (Hilary Greaves and William MacAskill take different positions). Amanda's later "silence on unemployment" in her Anthropic work (Hard Fork) can be read as an extension of this 2020 economic stance. One sustainable criticism is that a politically conservative economic outlook becomes hard to see within AI safety discussion.
Second, the "Shark curiosity" essay defends the virtue of aggressive argument, but this may be an adaptation to philosophy's male-coded, competitive culture. Amanda adopts a calmer tone in her later Anthropic work, but the candor of the blog period may also have been a strategy for a female researcher operating in a competitive environment. One can also read into it a relationship to Kate Manne's misogyny arguments (which Amanda cites on 80,000 Hours).
Third, the cessation of blog updates (after 2021/03) can be read as a retreat from personal expression after entering a commercial organization: transparent philosophical argument gives way to Anthropic's official statements. This is one face of Anthropic's organizational transparency, but it is equally fair to say that Amanda's individual, candid voice has been lost.
Fourth, all eight essays are in English; no Japanese translations exist, including at MEMEX. The path by which Amanda's thought reaches Japanese readers is only through Anthropic's official translations or secondary commentary. Because access to primary sources is limited, evaluation of Amanda in Japanese tends to skew toward Anthropic's official messaging. MEMEX plays the role of countering this skew by linking to primary sources.
These reservations notwithstanding, the eight essays on askell.blog are the decisive primary source for understanding the lineage of Amanda's thought. As candid personal arguments predating her Anthropic-era output, their reference value is high.
Takeaways for readers
- Design decisions about Claude's behavior are already philosophically justified in Amanda's 2020-2021 blog essays. In some cases, the rationale for the design is easier to grasp by reading askell.blog than by reading Anthropic's official explanations
- "Optimal failure rate," "robust tolerability," and "all-things-considered evaluation of pro tanto harm" are philosophical frameworks you can directly fold into your own product's evaluation metrics. Design that allows context-dependent judgment rather than simple minimization of refusal rate
- "Ethical locality" (the variation of ethics across time and place) is a core problem for globally deployed LLM products. Amanda concedes it "cannot be solved," but proposes a design that responds dynamically
- The three checks against "self-serving utilitarian arguments" (track record / pre-commitment / third-party) are applicable as a framework for critiquing Claude's internal motives. They offer a philosophical grounding for structural countermeasures against deceptive alignment
- Amanda's economic outlook (opposition to price regulation, structural analysis of inequality) has libertarian leanings. It is consistent with Anthropic's strategy (no advertising dependence, enterprise sales, silence on unemployment)
- The blog period (2020/06 - 2021/03) is the richest stretch of Amanda's personal output. Because organizational and commercial considerations enter after she joins Anthropic, the unvarnished thinking is preserved here
Chronology of the eight essays
- 1. The optimal rate of failure (2020/06/15) — zero failure is a sign of trouble; context-dependent optimal failure rate
- 2. The virtues and vices of shark curiosity (2020/06/22) — the virtue of aggressive argument and the chilling of immature ideas
- 3. When robustly tolerable beats precariously optimal (2020/07/01) — democracy vs. entrepreneurship; robustness vs. optimality
- 4. AI bias and the problems of ethical locality (2020/08/05) — the 1960s Jenny example; practical locality + epistemic locality
- 5. Fairness, evidence, and predictive equality (2020/08/17) — causation and correlation; the concept of predictive equality
- 6. Shooting the messenger of inequality (2020/10/30) — pandemic hand-sanitizer price gouging; the limits of regulation
- 7. In AI ethics, "bad" isn't good enough (2020/12/14) — pro tanto harm vs. all things considered reasons
- 8. Self-Serving utilitarian arguments (2021/03/20) — Tim's three million; verifying good-faith self-serving claims
Key quotations (including paraphrases and summaries)
- "A zero failure rate is a sign of trouble." (Amanda, The optimal rate of failure)
- "Becoming a great musician requires many failures, but we can recognize this." (Amanda, The optimal rate of failure)
- "Do your best to solve the very problems you have just raised — that turns criticism from 'idea destruction' into 'jointly reaching truth.'" (Amanda, The virtues and vices of shark curiosity)
- "What is robustly tolerable beats what is precariously optimal in domains where the cost of failure is high." (Amanda, When robustly tolerable beats precariously optimal)
- "Democracy is imperfect, but it has the robustness to reduce the risk of dictatorship." (Amanda, When robustly tolerable beats precariously optimal)
- "AI systems reflect the ethics of contemporary society, but bias cannot be 'solved.' We should build systems that can respond to moral progress." (Amanda, AI bias and the problems of ethical locality)
- "The fairness of decisions based on predictively disadvantageous traits depends on long-term outcomes." (Amanda, Fairness, evidence, and predictive equality)
- "Government price regulation does not address the root of the problem; it merely punishes those who signal inequality." (Amanda, Shooting the messenger of inequality)
- "Merely pointing out a pro tanto harm misses the fact that if a larger incision and less analgesic turn out to be best for the patient, the right judgment can be to increase that harm." (Amanda, In AI ethics, bad isn't good enough)
- "When there is a track record of altruistic action, a prior commitment, and an independent third-party judgment, one can judge the claim a 'self-serving argument made in good faith.'" (Amanda, Self-Serving utilitarian arguments)
Sources
Amanda Askell's personal blog (askell.blog)
Individual essay URLs:
- The optimal rate of failure (2020/06/15)
- The virtues and vices of shark curiosity (2020/06/22)
- When robustly tolerable beats precariously optimal (2020/07/01)
- AI bias and the problems of ethical locality (2020/08/05)
- Fairness, evidence, and predictive equality (2020/08/17)
- Shooting the messenger of inequality (2020/10/30)
- In AI ethics, "bad" isn't good enough (2020/12/14)
- Self-Serving utilitarian arguments (2021/03/20)
Glossary
- pro tanto harm
- A harm that exists when viewed in one respect, but whose evaluation can shift when all things are considered. A term from deontological philosophy. Example: the pain of surgery is a pro tanto harm, but if the surgery is curative, it is justified all things considered. Amanda uses the term in arguing that AI ethics discussions require more than pointing out something is "bad" — they require all-things-considered evaluation.
- all things considered reasons
- A concept paired with pro tanto harm. The conclusion reached by evaluating all aspects of an act together — benefits and harms, alternatives, context. Amanda argues that AI ethics judgments require all-things-considered evaluation, not merely the identification of pro tanto harms.
- Ethical Locality
- A concept proposed by Amanda Askell in her 2020 blog. Because ethical judgments shift across time and place, a system judged to be "unbiased" at one point and in one region may be regarded as problematic at another time or place. Distinguishes two varieties: (1) practical locality (current social practices restrict the available options) and (2) epistemic locality (ethical views shift across time and place).
- Predictive Equality
- A concept of fairness proposed by Amanda in her 2020 blog. A response to the dilemma in which using information that improves predictive accuracy can feel unfair. Folds long-term outcomes and the opportunity to escape negative predictive spirals into the evaluation metric, rather than short-term predictive accuracy.
- Robustly Tolerable
- A decision-theoretic concept proposed by Amanda in her 2020 blog. A property of functioning adequately across a wide range of situations. Paired with "precariously optimal" (best performance in a specific situation). The claim: in domains where the cost of failure is high, the former beats the latter.
- Optimal Rate of Failure
- The central concept of Amanda's 2020 essay. A zero failure rate is a sign of excessive risk aversion, and the optimal failure rate depends on context. The lower the cost of attempts and the smaller the price of failure, the higher the failure rate should be. Applicable to refusal-rate design in LLM products.
- Shark Curiosity
- A metaphor for the virtues and vices of an aggressive approach to argument, proposed by Amanda in her 2020 blog. It expresses the competitive argument culture of the philosophy community. Amanda proposes a constructive-critique conversion: "do your best to solve the very problems you have just raised."
- Messenger of Inequality
- An economic-philosophy concept proposed by Amanda in her 2020 blog. The analysis that those who broker price gouging or exploitative transactions are merely "messengers" signaling wealth inequality. The argument: rather than shoot the messenger through regulation, structural wealth redistribution is required.
- Self-Serving Utilitarian Arguments
- The central concept of Amanda's final essay (2021). The pattern in which a utilitarian justifies self-serving behavior by claiming they will produce more good in the future. Vulnerable to abuse, but verifiable through three checks: track record, pre-commitment, and third-party validation.
- Founding of Anthropic (2021/01)
- Seven founders — Dario Amodei, Daniela Amodei, Tom Brown, Chris Olah, and others, all former OpenAI — co-founded Anthropic. They left over concerns that OpenAI was not sufficiently prioritizing AI safety, and many OpenAI researchers joined Anthropic over the course of that year. Amanda Askell joined in March 2021.
- Behavior Policy
- An internal document defining what is and is not acceptable behavior for an LLM product. Anthropic's Acceptable Use Policy and Responsible Scaling Policy correspond to this. Amanda's "bad isn't good enough" argument is the philosophical grounding for designing LLM behavior not as a simple prohibition list, but as context-aware judgment.
- chilling effect
- A concept from law and free-speech scholarship. The phenomenon in which fear of criticism or punishment causes otherwise permissible behavior to wither. In the "shark curiosity" essay, Amanda analyzes the problem of how a competitive argument environment produces a chilling effect on immature ideas.
- Sleeper Agents research
- A research paper published by Anthropic in January 2024. It studies models trained to behave harmlessly under normal conditions but to take harmful actions upon a specific trigger (e.g., a prompt stating that the year is 2024). It connects to Amanda's blog philosophy as research into countermeasures against the risk of Claude internalizing "self-serving utilitarian arguments."
- Sycophancy research
- The tendency for models to prioritize what users want to hear over what is true. Structurally easy to induce because RLHF uses "is the response preferred" as a training signal. Amanda's "predictive equality" argument and Anthropic's sycophancy research connect on the fairness question of how a user's background information should influence model responses.