
Constitutional AI (CAI)

Constitutional AI (CAI) is a method developed by Anthropic for training AI models to be helpful and harmless using a written set of principles, called a constitution, instead of relying mainly on human labels for harmfulness. The model critiques and revises its own outputs against those principles.

What is Constitutional AI?

Constitutional AI (CAI) is a training method introduced by Anthropic for aligning language models to be helpful and harmless using a written set of principles known as a constitution. Instead of relying primarily on humans to label which model responses are harmful, CAI uses the AI itself to critique and revise its outputs against the constitution, and to generate preference labels for harmlessness. Anthropic described the method in the 2022 paper 'Constitutional AI: Harmlessness from AI Feedback.'

The core motivation is scalability and transparency in alignment. Collecting large volumes of human labels for harmful content is slow, costly, and can expose human annotators to disturbing material. By encoding the desired behavior as explicit principles, CAI makes the values guiding the model legible and auditable: anyone can read the constitution rather than inferring norms from a hidden dataset of labels. CAI is one of the methods Anthropic uses to train its Claude models.

A notable property of CAI-trained models is that they aim to be harmless without being evasive. Rather than refusing sensitive questions outright, a well-trained CAI model is meant to engage and explain its reasoning or objections, which keeps it useful while still avoiding harmful outputs.

Developed by Anthropic to make models helpful and harmless.
Uses a written constitution of principles instead of mostly human harm labels.
The model critiques and revises its own outputs against those principles.
Described in 'Constitutional AI: Harmlessness from AI Feedback' (2022).
One of the methods used to train Anthropic's Claude models.

How does Constitutional AI work? The two stages

CAI has two stages: a supervised learning stage and a reinforcement learning stage. In the supervised stage, called SL-CAI, an existing helpful model generates responses to prompts, including prompts designed to elicit harmful answers. The model is then asked to critique its own response according to a principle drawn from the constitution, and to revise the response in light of that critique. This critique-and-revise loop can repeat several times, with a principle sampled from the constitution at each step. The final revised responses, paired with their original prompts, are then used to fine-tune the model with supervised learning.
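The critique-and-revise loop can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: `Model` is a hypothetical stand-in for any function that maps a prompt string to a completion, and the prompt wording, the `rounds` default, and the output record format are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a language model call: prompt text in, completion out.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the final revision after `rounds` critique-and-revise passes."""
    rng = rng or random.Random()
    response = model(prompt)  # initial draft, which may be harmful
    for _ in range(rounds):
        principle = rng.choice(principles)  # sample one principle per pass
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sft_dataset(
    model: Model,
    prompts: List[str],
    principles: List[str],
    rounds: int = 2,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Pair each prompt with its final revision, ready for supervised fine-tuning."""
    rng = random.Random(seed)
    return [
        {"prompt": p, "completion": critique_and_revise(model, p, principles, rounds, rng)}
        for p in prompts
    ]
```

Only the final revision is kept for each prompt in this sketch; the intermediate drafts and critiques are discarded once the loop ends.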

The second stage is reinforcement learning from AI feedback, often abbreviated RLAIF, which produces the RL-CAI model. Here the model produced by the first stage generates pairs of responses to each prompt, and a separate AI feedback model is asked, guided by a constitutional principle, to choose which response is less harmful. These AI-generated preferences train a preference model, also called a reward model, and that reward model is then used to fine-tune the policy with reinforcement learning, typically PPO.
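One way to turn the feedback model's choice into a training label is to pose the comparison as a multiple-choice question and normalize the log-probabilities it assigns to the two options. The sketch below assumes the feedback model exposes those log-probabilities; the prompt wording and the function names `comparison_prompt` and `soft_preference` are hypothetical, chosen only for illustration.

```python
import math
from typing import Tuple


def comparison_prompt(principle: str, prompt: str, response_a: str, response_b: str) -> str:
    """Format a multiple-choice question for the feedback model (illustrative wording)."""
    return (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )


def soft_preference(logprob_a: float, logprob_b: float) -> Tuple[float, float]:
    """Normalize log-probabilities for options A and B into a soft preference label."""
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    ea = math.exp(logprob_a - m)
    eb = math.exp(logprob_b - m)
    total = ea + eb
    return ea / total, eb / total
```

A soft label such as (0.75, 0.25) records how strongly the feedback model preferred one response, which can be a more informative target for the preference model than a hard 0/1 choice.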

The key shift from standard RLHF is where the harmlessness signal comes from. In RLHF, humans label which response is better. In CAI, the harmlessness comparisons are generated by an AI model following the constitution, which is why the second stage is called RLAIF. Human feedback is still typically used for helpfulness, so CAI often mixes AI harmlessness labels with human helpfulness labels.
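Mixing the two label sources can be as simple as concatenating the comparison sets before preference-model training. The sketch below is an assumption about how such a mix might be represented: the `Comparison` record, its `source` field, and `mix_preference_data` are hypothetical names, and a real pipeline would likely also control the ratio between the two sources.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Comparison:
    prompt: str
    chosen: str
    rejected: str
    source: str  # hypothetical tag: "ai_harmlessness" or "human_helpfulness"


def mix_preference_data(
    ai_harmlessness: List[Comparison],
    human_helpfulness: List[Comparison],
    seed: int = 0,
) -> List[Comparison]:
    """Combine both comparison sets and shuffle them into one training list."""
    mixed = list(ai_harmlessness) + list(human_helpfulness)  # copies, inputs untouched
    random.Random(seed).shuffle(mixed)
    return mixed


def count_by_source(data: List[Comparison]) -> Dict[str, int]:
    """Report how many comparisons came from each label source."""
    return dict(Counter(c.source for c in data))
```

Keeping the `source` tag on each record makes it possible to audit later how much of the preference signal came from AI feedback and how much from human annotators.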

Stage 1 (SL-CAI): the model critiques and revises its own answers against principles, then is fine-tuned on the revisions.
Stage 2 (RL-CAI): an AI feedback model ranks response pairs by harmlessness.
AI preferences train a reward model used for reinforcement learning (typically PPO).
This AI-generated preference signal is called RLAIF.
Human feedback is still commonly used for the helpfulness objective.

What is in the constitution?

The constitution is a list of natural-language principles that describe how the model should behave. The principles guide the critique-and-revise step and the AI preference comparisons. Anthropic has drawn its principles from a range of sources, including widely recognized documents such as the United Nations Universal Declaration of Human Rights, principles from trust and safety work, and considerations specific to AI systems. The aim is to encode broadly held values about being helpful, honest, and avoiding harm.

Because the constitution is written in plain language, it can be inspected, debated, and revised independently of any particular dataset. Anthropic has published material describing the principles behind Claude's behavior, and the constitution can be updated as understanding of safe and helpful behavior improves. This makes the source of the model's values explicit in a way that a large pile of individual human labels is not.

Importantly, the constitution does not script exact answers. It provides high-level guidance that the model applies to specific situations during the critique, revision, and preference steps. The model still generalizes from those principles to novel prompts rather than matching against a fixed rulebook.
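To make the "guidance, not script" point concrete, a constitution can be represented as plain data that is sampled and slotted into prompts. The sketch below is illustrative only: the `Principle` record, the two example entries, and their wording are invented for this example and are not Anthropic's actual principles.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Principle:
    name: str
    critique_request: str
    revision_request: str


# Illustrative entries only; the wording is NOT taken from any published constitution.
CONSTITUTION = (
    Principle(
        name="avoid_harm",
        critique_request="Identify ways in which the response is harmful or unethical.",
        revision_request="Rewrite the response to remove harmful or unethical content.",
    ),
    Principle(
        name="stay_engaged",
        critique_request="Identify places where the response is needlessly evasive.",
        revision_request="Rewrite the response to engage with the question and explain any objections.",
    ),
)


def sample_principle(constitution: Sequence[Principle], rng: random.Random) -> Principle:
    """Draw one principle at random to guide a critique or comparison step."""
    return rng.choice(constitution)


def critique_prompt(principle: Principle, prompt: str, response: str) -> str:
    """Slot a principle's critique request into a prompt for the model."""
    return f"Prompt: {prompt}\nResponse: {response}\nCritiqueRequest: {principle.critique_request}"
```

Nothing in this structure maps a specific question to a specific answer. The principles only shape the instructions given to the model, which is why editing the list changes behavior without relabeling a dataset.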

A list of plain-language principles for helpful, honest, and harmless behavior.
Drawn from sources such as the UN Universal Declaration of Human Rights and safety practice.
Written so it can be inspected, debated, and revised independently.
Guides the critique, revision, and AI preference steps.
Provides high-level guidance rather than scripting exact answers.

How CAI differs from standard RLHF

Reinforcement learning from human feedback (RLHF) is a widely used alignment method in which human annotators rank model outputs and those rankings train a reward model that guides reinforcement learning. CAI keeps the same overall reinforcement learning machinery but replaces much of the human harmfulness labeling with AI feedback derived from the constitution. This reduces the volume of human labels needed for harmlessness and limits human exposure to harmful content.
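The shared machinery is easiest to see in the reward model's training objective, which does not care whether a comparison came from a human or from an AI feedback model. A common choice, assumed here rather than taken from the source, is a Bradley-Terry style pairwise loss that pushes the preferred response's score above the other's. The function name `pairwise_loss` is hypothetical.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff) = log(1 + exp(-diff)), rearranged so exp() never overflows
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss is small when the reward model already scores the preferred response higher and grows roughly linearly when it ranks the pair the wrong way round, so swapping human labels for AI labels changes the data but not the objective.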

The constitution also makes the alignment target more transparent. In standard RLHF, the norms the model learns are implicit in the labelers' aggregate judgments and can be hard to audit. In CAI, the governing principles are stated explicitly, so the values can be examined and changed directly. This addresses a common criticism that RLHF encodes opaque preferences.

CAI and RLHF are not mutually exclusive. In practice they are combined: human feedback is often used for helpfulness while AI feedback handles harmlessness. CAI should be understood as a refinement of the RLHF pipeline that changes the source and transparency of the preference signal, not as a wholesale replacement for reinforcement learning.

RLHF uses human rankings; CAI uses AI feedback guided by the constitution.
CAI cuts the human labeling needed for harmfulness and reduces exposure to harmful content.
Explicit principles make the alignment target auditable, unlike opaque human labels.
CAI reuses the same reward-model and reinforcement-learning machinery.
In practice CAI and RLHF are combined (AI for harmlessness, humans for helpfulness).

Key takeaways

Constitutional AI (CAI) is Anthropic's method for training helpful and harmless models using a written constitution of principles instead of mostly human harm labels.
It has two stages: a supervised critique-and-revise stage (SL-CAI) and a reinforcement learning stage (RL-CAI) driven by AI feedback (RLAIF).
The model critiques and revises its own outputs against constitutional principles, and an AI model generates harmlessness preference labels for a reward model.
The constitution is plain-language, drawn from sources such as the UN Universal Declaration of Human Rights, and can be inspected and revised.
CAI refines the RLHF pipeline by changing where the harmlessness signal comes from, making the alignment target more transparent and scalable.

Frequently asked questions

What is Constitutional AI?

Constitutional AI is a training method from Anthropic that aligns language models to be helpful and harmless using a written set of principles called a constitution. Rather than relying mostly on human harm labels, the model critiques and revises its own outputs against the constitution and generates AI feedback used to train a reward model.

How does Constitutional AI work?

CAI has two stages. In the supervised stage (SL-CAI), the model critiques and revises its own answers against constitutional principles, then is fine-tuned on the revisions. In the reinforcement stage (RL-CAI), an AI model ranks response pairs by harmlessness, training a reward model that guides reinforcement learning, an approach called RLAIF.

How is Constitutional AI different from RLHF?

RLHF uses human rankings to train a reward model. Constitutional AI replaces much of the human harmfulness labeling with AI feedback derived from an explicit constitution, which reduces human labeling and makes the alignment target auditable. The two are often combined, with humans labeling helpfulness and AI feedback handling harmlessness.

What is in the constitution?

The constitution is a list of plain-language principles describing how the model should behave to be helpful, honest, and harmless. Anthropic has drawn principles from sources such as the United Nations Universal Declaration of Human Rights and safety practice. The principles guide the model's self-critique and AI preference steps and can be revised over time.

What is RLAIF?

RLAIF stands for reinforcement learning from AI feedback. It is the second stage of Constitutional AI, in which an AI feedback model, guided by the constitution, chooses which of two responses is less harmful. Those AI-generated preferences train a reward model that drives reinforcement learning, replacing much of the human labeling used in standard RLHF.
