Group Relative Policy Optimization (GRPO)

By Arpit Tripathi, Founder

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that removes PPO's separate value model. It estimates each response's advantage by comparing its reward to the average reward of a group of responses sampled for the same prompt.

What is Group Relative Policy Optimization (GRPO)?

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning language models that estimates advantages from groups of sampled outputs rather than from a learned value model. It was introduced in the DeepSeekMath paper in 2024 and later used to train DeepSeek-R1, a reasoning model released in early 2025.

GRPO keeps the clipped objective and KL regularization familiar from Proximal Policy Optimization (PPO) but discards the critic. For each prompt, GRPO samples a group of responses, scores each one with a reward model or a rule-based reward function, and uses the group's mean and standard deviation to compute each response's relative advantage. A response that scores above the group average is reinforced; one below it is discouraged.

  • Origin: DeepSeekMath (2024), later central to DeepSeek-R1 (2025).
  • Core change from PPO: the value model is removed entirely.
  • Baseline: the average reward of a group of responses to the same prompt.

How GRPO removes the critic

PPO uses a value model, a second large network trained alongside the policy, to estimate expected reward and reduce gradient variance. Because this critic is typically about as large as the policy and needs its own gradients and optimizer state, it can roughly double the memory devoted to trainable models. GRPO replaces it with a statistical baseline computed on the fly from the group of sampled responses.

For a prompt, GRPO samples several outputs, scores each, then normalizes the scores within the group. The normalized score becomes the advantage, so a response whose reward is above the group mean gets a positive advantage and one whose reward is below it gets a negative one. Because the baseline is computed from samples rather than a learned network, no critic weights need to be stored or trained.

Â_i = ( r_i − mean(r_1, ..., r_G) ) / std(r_1, ..., r_G)
GRPO's group-relative advantage. For a group of G responses to one prompt, each reward r_i is standardized against the group mean and standard deviation, giving an advantage with no learned critic.
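
The standardization above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages` and the small constant `eps` are assumptions added here. The formula itself has no `eps`; it is included only so that a group whose rewards are all identical (standard deviation zero) yields zero advantages instead of a division error. The choice of population rather than sample standard deviation is also an assumption.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group's mean and std.

    `rewards` holds the G scalar rewards for responses to ONE prompt.
    `eps` is an illustrative guard against a zero standard deviation.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; a sample std is equally defensible
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four responses with binary correctness rewards.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response earns the same reward, there is nothing to compare.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

In the mixed group the two correct responses receive advantages close to +1 and the two incorrect ones close to −1. In the uniform group every advantage is zero, so that prompt contributes no learning signal through the advantage term.
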
J(θ) = E[ (1/G) Σ_i min( ρ_i Â_i , clip(ρ_i, 1−ε, 1+ε) Â_i ) − β · D_KL(π_θ || π_ref) ]
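GRPO's objective keeps PPO's clipped ratio and adds a KL penalty toward a reference policy, averaged across the group of sampled responses. Here ρ_i is the ratio of response i's probability under the current policy to its probability under the policy that sampled it, ε is the clipping range, and β weights the KL penalty.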
  • No value network means fewer model copies in GPU memory.
  • The group baseline replaces the learned advantage estimate.
  • Simpler pipeline with fewer interacting components to tune.
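
A sequence-level sketch of the objective J(θ) shown above may make the clipping behaviour concrete. Everything here is an assumption for illustration: the function names, the default values `eps=0.2` and `beta=0.04`, and the use of one log-probability per whole response (practical implementations usually work per token). The KL term uses the non-negative estimator r − log r − 1 with r = π_ref / π_θ, which is one common choice rather than the only one.

```python
import math


def clipped_term(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single response."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_new, logp_ref):
    """Per-sample, non-negative estimate of KL(pi_theta || pi_ref)."""
    log_r = logp_ref - logp_new  # log(pi_ref / pi_theta)
    return math.exp(log_r) - log_r - 1.0


def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   eps=0.2, beta=0.04):
    """Average clipped surrogate minus KL penalty over one group.

    Each list has one entry per response in the group; the log-probs
    are for the whole response under the current, sampling, and
    reference policies respectively.
    """
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)  # rho_i
        total += clipped_term(ratio, adv, eps) - beta * kl_estimate(ln, lr)
    return total / len(advantages)
```

With a positive advantage, a ratio of 1.5 is clipped to 1.2 when `eps=0.2`, so the policy gains nothing from moving further in that direction. With a negative advantage the unclipped term is the smaller one and is kept, which preserves the penalty.
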

GRPO and reasoning models

GRPO became prominent because DeepSeek used it to train DeepSeek-R1, where reinforcement learning on verifiable tasks such as math and code produced strong chain-of-thought reasoning. When a task has an automatic correctness check, the reward can be rule-based, for example whether a final answer matches the ground truth, which pairs well with GRPO's group comparison.
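
As a rough illustration of a rule-based reward, the sketch below gives 1.0 when a response's final answer matches the ground truth and 0.0 otherwise. The function name `exact_match_reward` and the `"Answer:"` marker are hypothetical conventions chosen for this example, not a format taken from any particular model or dataset.

```python
def exact_match_reward(response: str, ground_truth: str,
                       marker: str = "Answer:") -> float:
    """Return 1.0 if the text after the LAST marker equals the ground truth.

    The marker convention is an assumption for illustration; a real
    checker would match whatever output format the task prescribes.
    """
    _, found, tail = response.rpartition(marker)
    if not found:
        return 0.0  # no final answer given, so no reward
    return 1.0 if tail.strip() == ground_truth.strip() else 0.0


# Hypothetical group of responses to "What is 2 + 2?"
group = [
    "Two plus two makes four. Answer: 4",
    "I think it is five. Answer: 5",
    "Not sure.",
]
rewards = [exact_match_reward(r, "4") for r in group]
```

Rewards produced this way can be passed straight into the group standardization, since only their relative values within the group matter.
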

Sampling many responses per prompt naturally produces a spread of outcomes, some correct and some not. GRPO turns that spread into a learning signal by pushing the policy toward the better members of each group, which suits exploratory reasoning where the model must discover useful solution paths.

  • Works well with verifiable, rule-based rewards for math and code.
  • Group sampling yields a spread of rewards per prompt, which is what makes the relative comparison informative.
  • Used in DeepSeek-R1's reasoning-focused training.

Trade-offs versus PPO

GRPO's main advantage is efficiency: removing the critic lowers memory use and simplifies the training stack. Its main cost is that it relies on sampling a sufficiently large and diverse group per prompt to get a stable baseline, which increases generation work during training.
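
A back-of-the-envelope calculation can put rough numbers on the memory side of this trade-off. The sketch below rests on several loudly stated assumptions: every model has the same parameter count, trainable models use mixed-precision Adam at about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision master weights and the two Adam moments), frozen models hold half-precision weights only, and activations, generation caches, and sharding are ignored. The 7-billion-parameter size is a hypothetical example.

```python
BYTES_TRAINABLE = 16  # assumed: fp16 weights + fp16 grads + fp32 master, Adam m, v
BYTES_FROZEN = 2      # assumed: fp16 weights only


def model_state_gb(params_billion, trainable_copies, frozen_copies):
    """Approximate model-state memory in GB under the assumptions above."""
    bytes_per_param = (trainable_copies * BYTES_TRAINABLE
                       + frozen_copies * BYTES_FROZEN)
    return params_billion * 1e9 * bytes_per_param / 1e9


# PPO: trainable policy + critic, frozen reference + reward model.
ppo_gb = model_state_gb(7, trainable_copies=2, frozen_copies=2)

# GRPO with a learned reward model: trainable policy, frozen reference + reward.
grpo_gb = model_state_gb(7, trainable_copies=1, frozen_copies=2)

# GRPO with a rule-based reward: no reward model in memory at all.
grpo_rule_gb = model_state_gb(7, trainable_copies=1, frozen_copies=1)
```

Under these assumptions the figures come out to about 252 GB for PPO, 140 GB for GRPO with a reward model, and 126 GB for GRPO with a rule-based reward. The saving is substantial, though the total is reduced by somewhat less than half because the frozen models remain.
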

GRPO is not always superior to PPO; the better choice depends on the reward structure, compute budget, and task. Both share the same lineage of clipped, KL-regularized policy-gradient updates, so GRPO is best understood as a critic-free simplification of PPO rather than a wholly different method.

  • Pro: lower memory and a simpler pipeline than PPO.
  • Con: needs enough samples per prompt for a reliable group baseline.
  • Both methods share clipped, KL-regularized policy-gradient updates.

Key takeaways

  • GRPO is a reinforcement learning algorithm that fine-tunes language models without PPO's separate value model.
  • It estimates advantage by standardizing each response's reward against a group of responses to the same prompt.
  • Removing the critic lowers memory use and simplifies the training pipeline.
  • GRPO was introduced in DeepSeekMath and used to train the DeepSeek-R1 reasoning model.
  • It pairs naturally with verifiable, rule-based rewards for math and code tasks.

Frequently asked questions

What is GRPO?
GRPO, or Group Relative Policy Optimization, is a reinforcement learning algorithm for fine-tuning language models. It removes PPO's value model and instead estimates each response's advantage by comparing its reward to a group of responses sampled for the same prompt.

How does GRPO differ from PPO?
PPO trains a separate value model to estimate advantages, while GRPO computes advantages from the mean and standard deviation of rewards across a group of sampled responses. This removes a large network from memory and simplifies training.

Why does GRPO remove the critic?
The critic in PPO is usually as large as the policy, so it roughly doubles the memory needed for trainable models. By replacing it with a sampled group baseline, GRPO cuts memory use and removes a component that requires its own training and tuning.

Where was GRPO introduced and where is it used?
GRPO was introduced in DeepSeekMath in 2024 and used to train DeepSeek-R1 in 2025. It has since been adopted in open-source RL libraries for training reasoning-focused language models.

Is GRPO always better than PPO?
No. GRPO is more memory-efficient but depends on sampling a large enough group per prompt for a stable baseline, which adds generation cost. The better choice depends on the reward structure, task, and compute budget.