Proximal Policy Optimization (PPO)

By Arpit Tripathi, Founder

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm that improves a model using a clipped objective, keeping each update close to the previous policy for stable training. It became the default reinforcement learning algorithm for RLHF fine-tuning of large language models.

What is Proximal Policy Optimization (PPO)?

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm that updates a policy to maximize expected reward while constraining how far each update can move from the current policy. Introduced by Schulman and colleagues at OpenAI in 2017, it belongs to the family of policy-gradient methods, where the model directly learns a probability distribution over actions rather than deriving actions from estimated values, as value-based methods such as Q-learning do.

The central idea is the word proximal: each update is meant to stay close to the previous policy, inside an approximate trust region. PPO encourages this with a clipped surrogate objective, a simple first-order substitute for the KL-divergence constraint that its predecessor, Trust Region Policy Optimization (TRPO), solves with second-order methods. Clipping does not strictly guarantee that the policy stays within the region; it removes the incentive to leave it. This combination of stability and implementation simplicity is why PPO became the standard choice for Reinforcement Learning from Human Feedback (RLHF), which was used to align models such as InstructGPT and ChatGPT.

  • Family: on-policy, policy-gradient reinforcement learning.
  • Goal: maximize reward without destabilizing the policy through overly large updates.
  • Key mechanism: a clipped probability ratio that bounds each update.

The clipped surrogate objective

PPO computes the probability ratio between the new and old policy for each action, then multiplies it by an advantage estimate that says whether the action was better or worse than expected. To prevent the ratio from pushing the policy too far in one step, PPO clips the ratio to a narrow band around 1 and takes the minimum of the clipped and unclipped terms. Once the ratio leaves the band in the direction the advantage favors, the objective goes flat and that sample contributes no further gradient, which caps the incentive to over-update when an action looks very good or very bad.

The hyperparameter epsilon, commonly set near 0.1 to 0.2, sets the width of the clipping band. A smaller epsilon makes training more conservative, while a larger one allows bigger steps at the cost of stability.

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]

PPO's clipped surrogate objective, which training maximizes (implementations usually minimize its negative as a loss). r_t(θ) is the ratio of new to old policy probabilities for an action, Â_t is the estimated advantage, and ε bounds how far the ratio may move before the gain is clipped.

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

The probability ratio compares the current policy to the policy that collected the data. A ratio near 1 means the policy has changed little for that action.

  • The advantage is often estimated with Generalized Advantage Estimation (GAE).
  • Clipping removes the need for TRPO's expensive second-order optimization.
  • Epsilon controls the size of the trust region per update.
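
The formula above can be made concrete with a small sketch. The function below evaluates L^CLIP for a batch of samples in plain Python, working from log-probabilities because that is what language models typically expose. The function name, argument names, and the default epsilon of 0.2 are illustrative assumptions, not part of any particular library.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same actions.
    All names here are hypothetical, chosen for illustration.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # r_t(theta): probability ratio, computed from log-probabilities.
        ratio = math.exp(lp_new - lp_old)
        # Clip the ratio to the band [1 - epsilon, 1 + epsilon].
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # A good action (advantage +1) whose probability rose by 50%:
    # the gain is capped at the top of the clipping band.
    print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
    # A bad action (advantage -1) whose probability rose by 50%:
    # the minimum keeps the full unclipped penalty.
    print(clipped_surrogate([math.log(1.5)], [0.0], [-1.0]))
```

The two calls show the asymmetry that the minimum introduces: gains from moving the ratio outside the band are capped, while penalties for moving the wrong way are kept in full.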

PPO in the RLHF pipeline

In RLHF, PPO is the third stage after supervised fine-tuning and reward model training. The language model acts as the policy, generating responses to prompts. A separate reward model scores each response based on learned human preferences, and PPO updates the policy to produce higher-scoring responses.

To stop the policy from drifting into degenerate text that games the reward model, RLHF adds a Kullback-Leibler (KL) penalty that keeps the trained policy close to the original supervised model. PPO therefore juggles two pressures at once: climb the reward signal, and stay near the reference distribution.

  • The policy model, reward model, reference model, and value model are all held in memory during training.
  • A KL penalty against the reference model curbs reward hacking.
  • Reward comes from a learned preference model rather than a hand-coded function.
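
As a rough illustration of how the two pressures are combined, the sketch below builds a per-token reward for one sampled response. It assumes one common formulation: each token is penalized by the log-probability gap between the policy and the reference model (a sample-based estimate of the KL term), and the reward model's score is added at the final token. The function name, the `kl_coef` parameter, and its default value are hypothetical choices for this example; real pipelines differ in the details.

```python
def kl_shaped_rewards(logp_policy, logp_ref, reward_model_score, kl_coef=0.1):
    """Per-token rewards for one response under a KL-penalized RLHF setup.

    logp_policy / logp_ref: log-probabilities of each generated token under
    the policy being trained and under the frozen reference model.
    reward_model_score: scalar score for the whole response.
    kl_coef: weight of the KL penalty (an assumed, illustrative default).
    """
    # Penalize each token by how much more likely the policy makes it than
    # the reference model does.
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # Assumption: the sequence-level score is credited to the last token.
    rewards[-1] += reward_model_score
    return rewards


if __name__ == "__main__":
    policy_logps = [-1.0, -2.0, -0.5]
    reference_logps = [-1.5, -2.0, -1.5]
    print(kl_shaped_rewards(policy_logps, reference_logps, reward_model_score=1.0))
```

A larger `kl_coef` in this sketch keeps the policy closer to the reference model at the cost of slower reward gains; a smaller one does the opposite.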

The value model and its cost

Standard PPO trains a critic, also called a value model, alongside the policy. The critic estimates the expected future reward of a state and is used to compute the advantage, which lowers the variance of the gradient. The critic is typically the same size as the policy, so PPO for LLMs keeps several large models resident in memory simultaneously.
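
To show how critic values become advantages, here is a minimal sketch of Generalized Advantage Estimation (GAE), the estimator mentioned earlier. It assumes a single finished trajectory with one reward and one critic value per step; the parameter names `gamma`, `lam`, and `last_value`, and their defaults, are illustrative assumptions rather than settings taken from this article.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation for a single trajectory.

    rewards[t]: reward received at step t.
    values[t]: critic's value estimate for the state at step t.
    last_value: value of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated future terms.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error: how much better the outcome
        # was than the critic predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.5, 0.6]))
```

In this sketch, `lam` trades bias against variance: at 0 each advantage is just the one-step error, which leans heavily on the critic, while at 1 it becomes the full discounted return minus the critic's estimate.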

This memory and compute overhead is the main practical drawback of PPO at LLM scale, and it motivated later critic-free variants such as Group Relative Policy Optimization (GRPO), which estimates advantages from groups of sampled outputs instead of a learned value network.

  • The critic reduces gradient variance but adds a second trainable network of roughly policy size, with its own gradients and optimizer state.
  • Tuning PPO involves several interacting hyperparameters: epsilon, KL coefficient, learning rate, and batch size.
  • Critic-free successors trade the value model for group-based baselines.
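
The group-based baseline can be sketched in a few lines. The example below takes the reward model's scores for several responses sampled from the same prompt and normalizes them by the group's mean and standard deviation, so each response is judged relative to its siblings rather than against a critic's prediction. The function name, the use of the population standard deviation, and the small `eps` guard against division by zero are assumptions made for this illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for a group of responses sampled from the same prompt.

    rewards: one scalar reward per sampled response.
    eps: small constant to avoid dividing by zero when all rewards match.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # Responses above the group average get positive advantages,
    # those below it get negative ones.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four responses to one prompt, scored by a reward model.
    print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```

These values would then stand in for Â_t in the clipped objective, with no value network to train or hold in memory.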

Key takeaways

  • PPO is a policy-gradient RL algorithm that constrains each update with a clipped objective for stable training.
  • Its clipped surrogate loss caps the policy-update incentive using a probability ratio bounded by epsilon.
  • PPO became the default reinforcement learning algorithm for RLHF, pairing a reward model with a KL penalty toward a reference policy.
  • Standard PPO trains a separate value model (critic), which makes it memory-heavy at LLM scale.
  • Memory cost motivated critic-free alternatives such as GRPO.

Frequently asked questions

What is PPO used for?
PPO is a reinforcement learning algorithm used to train policies toward higher reward. In large language models it is the standard optimizer for RLHF, updating the model so its responses better match human preferences scored by a reward model.

Why is PPO so widely used?
PPO combines stable, trust-region-style updates with simple first-order optimization. Its clipped objective prevents destabilizing policy jumps without the second-order math of TRPO, making it easier to implement and tune at scale.

What does clipping do in PPO?
Clipping bounds the ratio between the new and old policy probabilities within a small band around 1, set by epsilon. This caps how much a single update can change the policy, even when an action's advantage is large.

What is the value model in PPO?
The value model, or critic, estimates expected future reward for a state and is used to compute advantages with lower variance. It is usually as large as the policy, which is the main source of PPO's memory cost.

How does GRPO differ from PPO?
GRPO removes PPO's separate value model. Instead of a learned critic, GRPO estimates advantages by normalizing rewards across a group of sampled responses to the same prompt, cutting memory use while keeping a clipped objective.