Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that removes PPO's separate value model. It estimates each response's advantage by comparing its reward to the average reward of a group of responses sampled for the same prompt.

What is Group Relative Policy Optimization (GRPO)?

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning language models that estimates advantages from groups of sampled outputs rather than from a learned value model. It was introduced in the DeepSeekMath paper in 2024 and later used to train DeepSeek-R1, a reasoning model released in early 2025.

GRPO keeps the clipped objective and KL regularization familiar from Proximal Policy Optimization (PPO) but discards the critic. For each prompt, GRPO samples a group of responses, scores each one with a reward model or a rule-based check, and uses the group's mean and standard deviation to compute each response's relative advantage. A response that scores above the group average is reinforced; one below it is discouraged.

Origin: DeepSeekMath (2024), later central to DeepSeek-R1 (2025).
Core change from PPO: the value model is removed entirely.
Baseline: the average reward of a group of responses to the same prompt.

How GRPO removes the critic

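PPO uses a value model, a second large network trained alongside the policy, to estimate expected reward and reduce gradient variance. Because this critic is typically comparable in size to the policy and must also be trained, it adds a second set of weights, gradients, and optimizer state, which substantially increases the memory footprint of training. GRPO replaces it with a statistical baseline computed on the fly from the group of sampled responses.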

For a prompt, GRPO samples several outputs, scores each, then normalizes the scores within the group. The normalized score becomes the advantage, so a reward that is better than its peers yields a positive advantage and one that is worse yields a negative one. Because the baseline is computed from samples rather than a learned network, no critic weights need to be stored or trained.

Â_i = ( r_i − mean(r_1, ..., r_G) ) / std(r_1, ..., r_G)

GRPO's group-relative advantage. For a group of G responses to one prompt, each reward r_i is standardized against the group mean and standard deviation, giving an advantage with no learned critic.
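
As a minimal sketch of the standardization above: the function name, the example rewards, the use of the population standard deviation, and the small epsilon that guards against division by zero are all assumptions made for illustration, not details specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    # Population std is an assumption; the formula does not say which variant.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for G = 4 responses sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses scoring above the group mean receive a positive advantage and those below it a negative one, with no value network involved.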

J(θ) = E[ (1/G) Σ_i min( ρ_i Â_i , clip(ρ_i, 1−ε, 1+ε) Â_i ) − β · D_KL(π_θ || π_ref) ]

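GRPO's objective keeps PPO's clipped ratio ρ_i, the probability of response i under the current policy π_θ divided by its probability under the policy that generated the samples, with ε setting the clipping range. It subtracts a KL penalty toward a reference policy π_ref, weighted by β, and averages across the group of sampled responses.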
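
The objective can be sketched for a single group as follows. This is a simplified, sequence-level illustration: it assumes one log-probability per response rather than per-token values, estimates the KL term per response with a non-negative estimator of the form π_ref/π_θ − log(π_ref/π_θ) − 1, and uses illustrative values for ε and β. All function and parameter names are hypothetical.

```python
import math


def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one response."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_new, logp_ref):
    """Non-negative estimate of KL(pi_theta || pi_ref) from one sample."""
    log_ratio = logp_ref - logp_new  # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.04):
    """Average clipped term minus beta * KL over one group (to be maximized)."""
    group_size = len(advantages)
    if not (len(logp_new) == len(logp_old) == len(logp_ref) == group_size):
        raise ValueError("all inputs must have one entry per response")
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)  # rho_i = pi_theta / pi_old
        total += clipped_term(ratio, adv, clip_eps) - beta * kl_estimate(ln, lr)
    return total / group_size
```

When the current, old, and reference policies assign identical log-probabilities, every ratio is 1 and the KL estimate is 0, so the objective reduces to the mean advantage of the group.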

No value network means fewer model copies in GPU memory.
The group baseline replaces the learned advantage estimate.
Simpler pipeline with fewer interacting components to tune.

GRPO and reasoning models

GRPO became prominent because DeepSeek used it to train DeepSeek-R1, where reinforcement learning on verifiable tasks such as math and code produced strong chain-of-thought reasoning. When a task has an automatic correctness check, the reward can be rule-based, for example whether a final answer matches the ground truth, which pairs well with GRPO's group comparison.
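
A rule-based reward of this kind might look like the sketch below. The `Answer:` marker, the exact-match comparison, and the function names are hypothetical choices made for illustration; real pipelines define their own output format and answer normalization.

```python
import re


def extract_final_answer(response):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response, ground_truth):
    """1.0 if the extracted final answer matches the ground truth, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0


# Hypothetical model outputs for the prompt "What is 2 + 2?"
print(rule_based_reward("2 + 2 = 4\nAnswer: 4", "4"))
print(rule_based_reward("I think it is five.\nAnswer: 5", "4"))
```

Because the check is deterministic, every response in a group is scored by the same rule, and the resulting 0/1 rewards feed directly into the group comparison.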

Sampling many responses per prompt naturally produces a spread of outcomes, some correct and some not. GRPO turns that spread into a learning signal by pushing the policy toward the better members of each group, which suits exploratory reasoning where the model must discover useful solution paths.

Works well with verifiable, rule-based rewards for math and code.
Group sampling supplies the spread of outcomes that the relative comparison needs.
Used in DeepSeek-R1's reasoning-focused training.
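
The role of that spread can be made concrete with a small calculation. If every response in a group receives the same reward, each reward equals the group mean and all advantages are zero, so the group contributes no learning signal. The sketch below assumes binary rewards and independent samples with a fixed success probability, which is a simplification; the function name is hypothetical.

```python
def prob_mixed_group(p_correct, group_size):
    """Probability that a group holds both correct and incorrect responses."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_correct = p_correct ** group_size
    all_wrong = (1.0 - p_correct) ** group_size
    return 1.0 - all_correct - all_wrong


# Hypothetical hard prompt that the policy solves 10% of the time.
for group_size in (2, 4, 8, 16):
    print(group_size, round(prob_mixed_group(0.1, group_size), 3))
```

Under these assumptions, larger groups are more likely to contain at least one correct and one incorrect response, which is what gives the comparison something to work with.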

Trade-offs versus PPO

GRPO's main advantage is efficiency: removing the critic lowers memory use and simplifies the training stack. Its main cost is that it relies on sampling a sufficiently large and diverse group per prompt to get a stable baseline, which increases generation work during training.

GRPO is not always superior to PPO; the better choice depends on the reward structure, compute budget, and task. Both share the same lineage of clipped, KL-regularized policy-gradient updates, so GRPO is best understood as a critic-free simplification of PPO rather than a wholly different method.

Pro: lower memory and a simpler pipeline than PPO.
Con: needs enough samples per prompt for a reliable group baseline.
Both methods share clipped, KL-regularized policy-gradient updates.
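
The sampling trade-off can be illustrated with a back-of-the-envelope sketch. It assumes binary rewards with a fixed success probability and independent samples, so the group mean has the usual binomial standard error; the function names and the batch numbers are hypothetical.

```python
import math


def baseline_standard_error(p_correct, group_size):
    """Standard error of the group-mean reward for binary rewards."""
    return math.sqrt(p_correct * (1.0 - p_correct) / group_size)


def responses_per_step(prompts_per_batch, group_size):
    """Number of responses that must be generated for one training step."""
    return prompts_per_batch * group_size


# Hypothetical batch of 128 prompts at a 50% success rate.
for group_size in (4, 16, 64):
    print(group_size,
          baseline_standard_error(0.5, group_size),
          responses_per_step(128, group_size))
```

Under these assumptions, quadrupling the group size halves the noise in the baseline while quadrupling the number of responses to generate, which is the cost side of removing the critic.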

Key takeaways

GRPO is a reinforcement learning algorithm that fine-tunes language models without PPO's separate value model.
It estimates advantage by standardizing each response's reward against a group of responses to the same prompt.
Removing the critic lowers memory use and simplifies the training pipeline.
GRPO was introduced in DeepSeekMath and used to train the DeepSeek-R1 reasoning model.
It pairs naturally with verifiable, rule-based rewards for math and code tasks.

Frequently asked questions

What is GRPO?
GRPO, or Group Relative Policy Optimization, is a reinforcement learning algorithm for fine-tuning language models. It removes PPO's value model and instead estimates each response's advantage by comparing its reward to a group of responses sampled for the same prompt.

How does GRPO differ from PPO?
PPO trains a separate value model to estimate advantages, while GRPO computes advantages from the mean and standard deviation of rewards across a group of sampled responses. This removes a large network from memory and simplifies training.

Why does GRPO remove the critic?
The critic in PPO is usually about as large as the policy and must be trained alongside it, so it adds substantially to training memory. By replacing it with a sampled group baseline, GRPO cuts memory use and removes a component that requires its own training and tuning.

Where has GRPO been used?
GRPO was introduced in DeepSeekMath in 2024 and used to train DeepSeek-R1 in 2025. It has since been adopted in open-source RL libraries for training reasoning-focused language models.

Is GRPO always better than PPO?
No. GRPO is more memory-efficient but depends on sampling a large enough group per prompt for a stable baseline, which adds generation cost. The better choice depends on the reward structure, task, and compute budget.
