Reward Model (RLHF)

A reward model is a neural network trained to predict how good a language model's output is, producing a scalar score that reflects human preferences. It is the component of RLHF that converts pairwise human comparisons into a learned reward signal used to fine-tune the main model.

What is a reward model?

A reward model is a neural network trained to assign a scalar score to a language model's output, where a higher score means the output better matches human preferences. It is the central component of reinforcement learning from human feedback (RLHF) that turns subjective human judgments into a numerical signal a model can optimize against. The reward model was a key piece of OpenAI's InstructGPT work, described in the 2022 paper 'Training language models to follow instructions with human feedback.'

The reason a reward model is needed is that human preferences are hard to express as a simple loss function. People can reliably say which of two responses they prefer, but they cannot easily assign an absolute numeric quality to a single response. A reward model learns from many such pairwise comparisons and then generalizes, producing a score for any new response, including ones no human ever rated. That learned score is what reinforcement learning then maximizes.

Architecturally, a reward model is usually built from a pretrained language model with its final word-prediction layer removed and replaced by a single linear head that outputs one number. It is often initialized from a supervised fine-tuned version of the same base model, so it already understands language and only needs to learn to map text to a preference score.
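
As a rough illustration of the scalar head described above, the sketch below uses plain Python lists in place of a real network. The class name `ScalarRewardHead`, the four-dimensional hidden state, and the weight initialization are all hypothetical; in a real system the hidden state would come from the final layer of the pretrained transformer.

```python
import random


class ScalarRewardHead:
    """Toy linear head: maps one hidden-state vector to a single score."""

    def __init__(self, hidden_size, seed=0):
        rng = random.Random(seed)
        # Small random weights stand in for a freshly initialized head.
        self.weights = [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        self.bias = 0.0

    def __call__(self, hidden_state):
        if len(hidden_state) != len(self.weights):
            raise ValueError("hidden_state has the wrong size")
        # A word-prediction layer would return one logit per vocabulary
        # entry; this head returns exactly one number.
        return sum(w * h for w, h in zip(self.weights, hidden_state)) + self.bias


# Hypothetical final-layer hidden state for one (prompt, response) pair.
hidden_state = [0.5, -1.0, 0.25, 2.0]
head = ScalarRewardHead(hidden_size=4)
score = head(hidden_state)
print(type(score).__name__)  # float
```

The only new parameters relative to the base model are the head's weights and bias, which is why initializing from an already fine-tuned model leaves relatively little to learn from scratch.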

- Outputs a scalar score representing how well an output matches human preferences.
- The component of RLHF that converts human judgments into a learnable signal.
- Central to OpenAI's InstructGPT (2022).
- Built from a language model with a single scalar output head.
- Often initialized from a supervised fine-tuned base model.

How is a reward model trained?

Training data for a reward model comes from human comparisons. Annotators are shown a prompt and several candidate responses and asked to rank them. These rankings are cheaper and more consistent than asking humans for absolute scores. InstructGPT collected rankings over roughly 33,000 prompts, sampling K = 4 to 9 completions per prompt, which yields C(K, 2) = 6 to 36 pairwise comparisons per prompt and therefore on the order of hundreds of thousands of pairwise comparisons in total.
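
The arithmetic behind those comparison counts can be checked with a short sketch. It assumes a ranking is given as a best-to-worst list with no ties; the function name `ranking_to_pairs` and the response labels are hypothetical.

```python
from itertools import combinations
from math import comb


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the response that was ranked higher.
    return list(combinations(ranked_responses, 2))


ranking = ["resp_a", "resp_b", "resp_c", "resp_d"]  # hypothetical, best first
pairs = ranking_to_pairs(ranking)
print(len(pairs))              # 6
print(comb(4, 2), comb(9, 2))  # 6 36
```

A single ranking of K responses therefore yields many training pairs for one round of annotation effort, which is part of why rankings are an economical labeling format.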

The reward model is then trained so that it assigns a higher score to the preferred response than to the rejected one. The standard objective is derived from the Bradley-Terry model of pairwise preference: the probability that response A is preferred over response B is the sigmoid of the difference between their reward scores. Training minimizes the negative log-likelihood of the observed human preferences under this model. In InstructGPT, all C(K, 2) comparisons from a single prompt are treated as one batch element rather than shuffled across the dataset, which the paper reports reduces overfitting, and the loss is scaled by 1/C(K, 2) so that prompts with more completions do not dominate it.
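
A minimal sketch of this objective for one prompt follows. It assumes the reward scores are listed in the annotators' best-to-worst order and that the loss is averaged over the prompt's C(K, 2) pairs; the function names and score values are hypothetical.

```python
import math
from itertools import combinations


def preference_probability(score_a, score_b):
    """Bradley-Terry probability that A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def prompt_loss(scores_best_first):
    """Mean negative log-likelihood over all pairs from one ranked prompt."""
    pairs = list(combinations(scores_best_first, 2))
    total = sum(-math.log(preference_probability(w, l)) for w, l in pairs)
    return total / len(pairs)  # divide by C(K, 2)


print(preference_probability(1.0, 1.0))    # 0.5
good = prompt_loss([2.0, 1.0, 0.0, -1.0])  # scores agree with the ranking
bad = prompt_loss([-1.0, 0.0, 1.0, 2.0])   # scores reverse the ranking
print(good < bad)                          # True
```

Dividing by the number of pairs means a prompt with nine completions and a prompt with four contribute on the same scale.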

A useful property of this scalar-score formulation is transitivity. Because every response is mapped to a single number, the model induces a total ordering: if it scores A above B and B above C, it will also prefer A over C, even on pairs it never saw during training. This lets a reward model generalize preferences across the enormous space of possible outputs.
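
The transitivity argument can be checked mechanically. The sketch below uses three hypothetical scores and verifies that every chain of two preferences implies the third.

```python
from itertools import permutations

# Hypothetical reward scores for three responses to the same prompt.
scores = {"A": 1.7, "B": 0.4, "C": -0.9}


def prefers(first, second):
    """True if the reward model scores `first` strictly above `second`."""
    return scores[first] > scores[second]


# Sorting by score gives the ordering the model implies.
ordering = sorted(scores, key=scores.get, reverse=True)
print(ordering)  # ['A', 'B', 'C']

# Check every triple: a > b and b > c must imply a > c.
transitive = all(
    prefers(a, c)
    for a, b, c in permutations(scores, 3)
    if prefers(a, b) and prefers(b, c)
)
print(transitive)  # True
```

The check passes for any set of real-valued scores, because comparison of numbers is itself transitive. Raw human pairwise labels carry no such guarantee.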

- Trained on human rankings of several candidate responses per prompt.
- InstructGPT used ~33,000 prompts with K = 4 to 9 completions each, giving 6 to 36 comparisons per prompt (hundreds of thousands in total).
- Objective: score the preferred response higher than the rejected one.
- Uses the Bradley-Terry model with a sigmoid on the score difference.
- The scalar score induces a transitive total ordering over outputs.

The reward model loss

The reward model is trained with a pairwise loss based on the Bradley-Terry preference model. Given a prompt x with a human-preferred response y_w and a rejected response y_l, and a reward model r_θ that outputs a scalar, the loss is shown below.

L(θ) = − E_{(x, y_w, y_l)} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

σ(z) = 1 / (1 + e^{−z})

The loss is the negative log of the sigmoid of the score gap between the preferred response y_w and the rejected response y_l. Minimizing it pushes the reward model to score preferred outputs higher. The gradient with respect to the gap has magnitude 1 − σ(gap), so a pair the model ranks wrongly by a wide margin produces a gradient close to its maximum, while a pair it already ranks correctly by a wide margin produces almost none.
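
To make the behavior of this loss concrete, the sketch below evaluates it and its derivative with respect to the score gap for a single comparison. The function names are hypothetical, and the log-sigmoid is written in a branch form to avoid overflow for large gaps.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # For large negative z, exp(-z) would overflow, so branch on the sign.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_loss(score_chosen, score_rejected):
    """-log sigmoid(r(x, y_w) - r(x, y_l)) for a single comparison."""
    return -log_sigmoid(score_chosen - score_rejected)


def loss_gradient_wrt_gap(gap):
    """d(loss)/d(gap) = -(1 - sigmoid(gap)) = -sigmoid(-gap)."""
    return -math.exp(log_sigmoid(-gap))


# Gap < 0: model ranks the pair wrongly; gap > 0: model ranks it correctly.
for gap in (-4.0, 0.0, 4.0):
    print(gap, pairwise_loss(gap, 0.0), loss_gradient_wrt_gap(gap))
```

At a gap of zero the loss equals ln 2 and the gradient is −0.5. The gradient magnitude approaches 1 as the gap becomes very negative and approaches 0 as it becomes very positive.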

How the reward model drives RLHF

Once trained, the reward model acts as an automated stand-in for human judgment during reinforcement learning. The language model being aligned, called the policy, generates a response, the reward model scores it, and a reinforcement learning algorithm, most commonly Proximal Policy Optimization (PPO), updates the policy to produce higher-scoring outputs. This is the optimization stage of the RLHF pipeline that follows supervised fine-tuning and reward model training.

A critical detail is the KL penalty. If the policy optimizes the reward score freely, it tends to drift into degenerate text that exploits quirks of the reward model, a failure called reward hacking or over-optimization. To prevent this, RLHF adds a penalty for diverging too far from the original supervised model, usually measured as KL divergence. This keeps the policy close to fluent, sensible language while still improving along the reward signal.
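
One common way to apply such a penalty is to subtract it from the reward model's score before the reinforcement learning update. The sketch below assumes per-token log-probabilities of the sampled response are available from both the policy and the frozen supervised model; the function name `kl_shaped_reward` and the `beta` values are hypothetical.

```python
def kl_shaped_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.02):
    """Reward model score minus a KL-style penalty on the sampled response.

    policy_logprobs / reference_logprobs: per-token log-probabilities of the
    same sampled response under the policy and under the frozen supervised
    model. beta is a hypothetical penalty coefficient.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Sum of log(pi / pi_ref) over tokens: a single-sample estimate of the KL.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


# Policy that has not drifted: the penalty is zero.
print(kl_shaped_reward(1.5, [-1.0, -2.0], [-1.0, -2.0]))  # 1.5
# Policy that assigns the response far more probability than the reference.
print(kl_shaped_reward(1.5, [-0.1, -0.2], [-3.0, -4.0], beta=0.1))
```

A larger `beta` holds the policy closer to the supervised model at the cost of slower reward improvement, so the coefficient is a tuning decision rather than a fixed constant.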

The quality of the final aligned model is bounded by the quality of its reward model. A reward model that misjudges preferences, is poorly calibrated, or is easy to game will steer the policy toward bad behavior. This is why reward modeling is an active research area, and why alternatives such as Direct Preference Optimization (DPO) have been proposed that optimize preferences without training a separate explicit reward model.
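
For comparison, the DPO objective for a single preference pair can be sketched as follows. It reuses the same Bradley-Terry form, but the score of each response is an implicit reward computed from policy and reference log-probabilities instead of a separate network. The function name, the `beta` default, and the log-probability values are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # The implicit reward of a response is beta * log(pi / pi_ref).
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    gap = reward_chosen - reward_rejected
    # Same form as the reward model loss: -log sigmoid(gap).
    # (Fine for small toy gaps; very negative gaps would need a stable form.)
    return math.log1p(math.exp(-gap))


# Policy identical to the reference: gap is 0, so the loss is ln 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy that has shifted probability toward the chosen response.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```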

- The reward model scores the policy's outputs during reinforcement learning.
- PPO is the most common algorithm used to update the policy.
- A KL penalty keeps the policy near the supervised model to prevent reward hacking.
- Final model quality is bounded by reward model quality.
- DPO is an alternative that skips training a separate explicit reward model.

Key takeaways

- A reward model is a neural network that scores a language model's output with a single number reflecting human preference, central to RLHF.
- It is trained on pairwise human comparisons using the Bradley-Terry loss, which minimizes the negative log-sigmoid of the score gap between preferred and rejected responses.
- The scalar score induces a transitive total ordering, letting the reward model generalize preferences to outputs no human rated.
- During RLHF the reward model scores the policy's outputs, and PPO updates the policy, with a KL penalty preventing reward hacking.
- Final aligned model quality is bounded by reward model quality, which motivates alternatives like DPO that avoid an explicit reward model.

Frequently asked questions

What is a reward model?

A reward model is a neural network that takes a language model's output and returns a scalar score predicting how much humans would prefer it. In RLHF it converts pairwise human comparisons into a learned reward signal, which a reinforcement learning algorithm then uses to fine-tune the main model toward more preferred responses.

How is a reward model trained?

It is trained on human comparisons, where annotators rank several responses to a prompt. The model learns to score the preferred response higher using the Bradley-Terry loss, which is the negative log of the sigmoid of the difference between two responses' scores. InstructGPT used about 33,000 prompts with 4 to 9 completions each, yielding on the order of hundreds of thousands of pairwise comparisons.

What is the Bradley-Terry model?

The Bradley-Terry model expresses the probability that one item is preferred over another as a function of their scores. In reward modeling, the probability that a response is preferred equals the sigmoid of the difference between its reward score and the other response's score. Training minimizes the negative log-likelihood of the observed human preferences.

Why is a reward model needed?

Human preferences cannot be written directly as a loss function, and collecting a human judgment for every output the model generates during training is infeasible. A reward model learns from a fixed set of human comparisons and then scores any new output automatically, providing the dense, fast feedback that reinforcement learning needs.

What is reward hacking?

Reward hacking, or over-optimization, happens when the policy exploits flaws in the reward model to get high scores without genuinely better outputs, such as producing degenerate or manipulative text. RLHF counters it with a KL divergence penalty that keeps the policy close to the original supervised model while still improving along the reward signal.
