Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that removes PPO's separate value model. It estimates each response's advantage by comparing its reward to the average reward of a group of responses sampled for the same prompt.
What is Group Relative Policy Optimization (GRPO)?
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for fine-tuning language models that estimates advantages from groups of sampled outputs rather than from a learned value model. It was introduced in the DeepSeekMath paper in 2024 and later used to train DeepSeek-R1, a reasoning model released in early 2025.
GRPO keeps the clipped objective and KL regularization familiar from Proximal Policy Optimization (PPO) but discards the critic. For each prompt, GRPO samples a group of responses and scores each one with a reward model or a rule-based reward function. It then subtracts the group's mean reward from each score and divides by the group's standard deviation to get that response's relative advantage. A response that scores above the group average is reinforced; one below it is discouraged.
- Origin: DeepSeekMath (2024), later central to DeepSeek-R1 (2025).
- Core change from PPO: the value model is removed entirely.
- Baseline: the average reward of a group of responses to the same prompt.
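As a worked form of the description above, the group-relative advantage can be written as a standardized reward. The symbols are assumptions introduced here for illustration: G is the number of responses sampled for one prompt, r_i is the reward of the i-th response, and \hat{A}_i is its advantage. This reflects the outcome-reward case, where each response gets a single score.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad i = 1, \dots, G
```

In practice a small constant is often added to the denominator so the expression stays defined when every reward in the group is identical.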
How GRPO removes the critic
PPO uses a value model, a second large network trained alongside the policy, to estimate expected reward and reduce gradient variance. Because this critic is typically comparable in size to the policy, it adds substantially to the memory and compute cost of training. GRPO replaces it with a statistical baseline computed on the fly from the group of sampled responses.
For a prompt, GRPO samples several outputs, scores each, then normalizes the scores within the group. The normalized score becomes the advantage, so a reward that is better than its peers yields a positive advantage and one that is worse yields a negative one. Because the baseline is computed from samples rather than a learned network, no critic weights need to be stored or trained.
- No value network means fewer model copies in GPU memory.
- The group baseline replaces the learned value estimate that PPO uses to compute advantages.
- Simpler pipeline with fewer interacting components to tune.
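The normalization step described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are all assumptions of this sketch.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of one prompt's group of sampled responses.

    rewards: one scalar reward per sampled response to the same prompt.
    eps: small constant (an assumption of this sketch) that avoids
         division by zero when all rewards in the group are equal.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two responses per group")
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; frameworks may differ here
    # No learned critic is involved: the baseline is just the group mean.
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, two judged correct and two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Here the two higher-scoring responses receive positive advantages and the two lower-scoring ones receive negative advantages of the same magnitude. If all four rewards were equal, every advantage would be zero.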
GRPO and reasoning models
GRPO became prominent because DeepSeek used it to train DeepSeek-R1, where reinforcement learning on verifiable tasks such as math and code produced strong chain-of-thought reasoning. When a task has an automatic correctness check, the reward can be rule-based, for example whether a final answer matches the ground truth, which pairs well with GRPO's group comparison.
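Sampling many responses per prompt naturally produces a spread of outcomes, some correct and some not. GRPO turns that spread into a learning signal by pushing the policy toward the better members of each group, which suits exploratory reasoning where the model must discover useful solution paths. The spread matters: if every response in a group receives the same reward, all relative advantages are zero and that prompt contributes no advantage signal.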
- Works well with verifiable, rule-based rewards for math and code.
- Group sampling supplies the spread of outcomes that the relative comparison needs.
- Used in DeepSeek-R1's reasoning-focused training.
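A rule-based reward of the kind described above can be sketched as an exact-match check on the final answer. Everything specific here is hypothetical and chosen only for illustration: the `Answer:` marker, the helper names `final_answer` and `rule_based_reward`, the binary 1.0/0.0 scoring, and the sample responses.

```python
def final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' marker.

    The marker is a hypothetical output convention for this sketch;
    a response without it yields an empty string.
    """
    marker = "Answer:"
    idx = response.rfind(marker)
    if idx == -1:
        return ""
    return response[idx + len(marker):].strip()


def rule_based_reward(response: str, ground_truth: str) -> float:
    """Score 1.0 if the extracted answer matches the ground truth, else 0.0."""
    return 1.0 if final_answer(response) == ground_truth.strip() else 0.0


# A group of sampled responses to the prompt "(2 + 2) * 3 = ?".
group = [
    "2 + 2 is 4, times 3 is 12. Answer: 12",
    "2 + 2 * 3 = 8. Answer: 8",
    "I am not sure.",
    "First 2 + 2 = 4, then 4 * 3 = 12. Answer: 12",
]
rewards = [rule_based_reward(r, "12") for r in group]
```

The resulting mix of correct and incorrect scores within one group is what the group comparison then standardizes into advantages.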
Trade-offs versus PPO
GRPO's main advantage is efficiency: removing the critic lowers memory use and simplifies the training stack. Its main cost is that it relies on sampling a sufficiently large and diverse group per prompt to get a stable baseline, which increases generation work during training.
GRPO is not always superior to PPO; the better choice depends on the reward structure, compute budget, and task. Both share the same lineage of clipped, KL-regularized policy-gradient updates, so GRPO is best understood as a critic-free simplification of PPO rather than a wholly different method.
- Pro: lower memory and a simpler pipeline than PPO.
- Con: needs enough samples per prompt for a reliable group baseline.
- Both methods share clipped, KL-regularized policy-gradient updates.
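The shared clipped, KL-regularized update can be sketched for a single token as follows. This is a simplified illustration under stated assumptions: the function name `grpo_token_objective` is hypothetical, the defaults `clip_eps=0.2` and `beta=0.04` are illustrative values, and the KL term uses a non-negative estimator computed from the reference-to-policy probability ratio. A real trainer would average this quantity over tokens and responses and maximize it with automatic differentiation.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Per-token clipped surrogate minus a KL penalty (to be maximized).

    logp_new: log-probability of the token under the policy being updated.
    logp_old: log-probability under the policy that sampled the response.
    logp_ref: log-probability under the frozen reference policy.
    advantage: group-relative advantage of the response the token belongs to.
    """
    # Probability ratio between the updated policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # PPO-style pessimistic bound: take the smaller of the two surrogates.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate: x - log(x) - 1 with x = pi_ref / pi_new.
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl


# With identical policies the ratio is 1 and the KL term is 0,
# so the objective equals the advantage itself.
same = grpo_token_objective(-1.0, -1.0, -1.0, advantage=0.5)
```

The only structural difference from a PPO update in this sketch is where `advantage` comes from: a group statistic rather than a learned critic.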
Key takeaways
- GRPO is a reinforcement learning algorithm that fine-tunes language models without PPO's separate value model.
- It estimates advantage by standardizing each response's reward against a group of responses to the same prompt.
- Removing the critic lowers memory use and simplifies the training pipeline.
- GRPO was introduced in DeepSeekMath and used to train the DeepSeek-R1 reasoning model.
- It pairs naturally with verifiable, rule-based rewards for math and code tasks.