Background
The release of DeepSeekMath[1] and DeepSeek-R1[2] brought Group Relative Policy Optimization (GRPO) into the spotlight, and it quickly became one of the most widely adopted post-training algorithms in the open-source LLM community.
GRPO's significance lies in making Reinforcement Learning with Verifiable Rewards (RLVR) practical at scale. In domains like math reasoning and code generation, correctness can be checked directly: a solution is either right or wrong, and a unit test either passes or fails. GRPO turns these discrete, verifiable signals into an effective training signal for large language models, and the strong reasoning results from DeepSeek-R1 made it one of the de facto recipes for RL-based post-training.
How GRPO works

As shown in the figure, GRPO training can be split into three steps:
- Sampling. For each question q, the policy from the previous update samples a group of G candidate answers {o₁, …, o_G}. This group is what gives GRPO its name: every subsequent quantity is computed relative to it.
- Reward / Advantage. Each answer is scored independently by a reward model or a rule-based reward function, producing per-sample rewards {r₁, …, r_G}. These rewards are then normalized within the group (subtracting the group mean and dividing by the group standard deviation) to produce the advantages {A₁, …, A_G}. The group itself serves as the baseline, so no separate value network is required.
- Update. The advantages drive the policy gradient term, while a KL divergence between the current policy and a frozen reference model keeps the update from drifting too far. These two terms are combined into the GRPO loss, which is then used to update the policy for the next iteration.
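The Reward / Advantage step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `group_advantages` and the small `eps` added to the denominator (to avoid division by zero when every reward in the group is equal) are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    `eps` is an assumed numerical safeguard, not part of the description above.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some codebases use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one question, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards yields all zeros.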
Different variants
Original GRPO[1,2]
Published by DeepSeek as part of DeepSeekMath and later adopted in DeepSeek-R1, GRPO is the foundation on which the variants below build. For each prompt q, the old policy samples a group of G candidate outputs. Each output is scored by a reward function, and its advantage is computed by normalizing the reward against the group's mean and standard deviation, so the group replaces the learned value network that PPO would need. The training objective applies token-level importance-sampling ratios with PPO-style clipping, averaged first within each response (by 1/|o_i|) and then across the group (by 1/G).
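As a rough sketch of that objective, the snippet below computes the clipped surrogate from per-token log-probabilities. The function names, the nested-list input layout, and the clip range `eps=0.2` are assumptions for illustration; the KL term toward the reference model is left out to keep the example short.

```python
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)            # token-level importance ratio
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip ratio to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

def grpo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Surrogate objective to maximize (KL penalty omitted in this sketch).

    logp_new, logp_old: one list of per-token log-probs per response.
    advantages: one scalar per response, shared by all of its tokens.
    """
    per_response = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        terms = [clipped_term(n, o, adv, eps) for n, o in zip(new, old)]
        per_response.append(sum(terms) / len(terms))  # average within response: 1/|o_i|
    return sum(per_response) / len(per_response)      # average across group: 1/G

# Two responses of different lengths; the policy has not moved yet, so every ratio is 1.
old = [[-1.0, -2.0], [-0.5, -0.5, -0.5]]
objective = grpo_objective(old, old, advantages=[1.0, -1.0])
```

With unchanged log-probabilities each response contributes exactly its advantage, regardless of its length, which is the per-response averaging that later variants revisit.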

[Figure: GRPO]
Dr. GRPO[3]
Published by Sea AI Lab, the National University of Singapore, and Singapore Management University, Dr. GRPO ("GRPO Done Right") observes that the original GRPO objective introduces two systematic biases. The first is a response-level length bias from the 1/|o_i| normalization: shorter responses receive larger per-token gradients, so short correct answers are favored while long incorrect ones are penalized less, which lets wrong answers grow longer. The second is a question-level difficulty bias from the std denominator in the advantage, which over-weights questions with low reward variance, that is, those that are too easy or too hard. Dr. GRPO removes both terms, dropping 1/|o_i| from the loss and std from the advantage; the rest of the algorithm is left unchanged.
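The two removals can be illustrated with a small sketch. The function names are hypothetical, and `token_terms` is assumed to already hold the clipped per-token surrogate values for each response, computed with the same PPO-style clipping as in the original objective.

```python
from statistics import mean

def dr_grpo_advantages(rewards):
    """Mean-centered advantages: the std denominator is dropped."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def dr_grpo_objective(token_terms):
    """Sum token terms per response (no 1/|o_i|), then average over the group.

    token_terms: one list of clipped per-token surrogate values per response.
    """
    return sum(sum(terms) for terms in token_terms) / len(token_terms)

# Same rewards as a typical binary-reward group of four answers.
advantages = dr_grpo_advantages([1.0, 0.0, 0.0, 1.0])

# A 2-token response and a 1-token response: each token now carries equal weight.
objective = dr_grpo_objective([[1.0, 1.0], [-1.0]])
```

Because the per-response division is gone, a token's weight no longer depends on the length of the response it belongs to.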

[Figure: Dr. GRPO]
DAPO[4]
Published by ByteDance Seed together with Tsinghua University, DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) refines GRPO with a set of practical fixes discovered while scaling RL training to long chain-of-thought reasoning. It introduces four key techniques:
- Clip-Higher. The upper and lower clipping bounds are decoupled into ε_low and ε_high, with a larger upper bound. This gives low-probability tokens more room to grow, promoting exploration and avoiding the entropy collapse commonly seen in long-CoT training.
- Dynamic Sampling. Groups in which all responses are either fully correct or fully incorrect produce zero advantage and waste compute. DAPO filters out such degenerate groups and resamples until the batch contains enough informative groups, improving both training efficiency and stability.
- Token-Level Policy Gradient Loss. Instead of averaging the loss per response and then across the group (as in GRPO), DAPO sums over all tokens in the group and divides by the total token count. Every token then carries the same weight, so long responses are no longer down-weighted, which matters in long-CoT scenarios where response lengths vary widely.
- Overlong Reward Shaping. Responses that exceed the maximum length are no longer assigned an arbitrary penalty. DAPO instead applies a soft, length-aware reward shaping that reduces reward noise and stabilizes training near the length limit.
Together, these changes make DAPO substantially more stable than vanilla GRPO at scale, and it has become a common starting point for open-source long-CoT RL pipelines.
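The four techniques above can each be sketched as a small function. Everything here is illustrative: the function names, the default values `eps_low=0.2` and `eps_high=0.28`, and in particular the linear ramp used for the overlong penalty are assumptions chosen for this example, not a description of any specific codebase.

```python
def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: asymmetric clip range [1 - eps_low, 1 + eps_high]."""
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

def is_informative(rewards):
    """Dynamic Sampling: keep a group only if its rewards are not all identical."""
    return len(set(rewards)) > 1

def token_level_objective(token_terms):
    """Token-level loss: sum over all tokens in the group / total token count."""
    total_tokens = sum(len(terms) for terms in token_terms)
    return sum(sum(terms) for terms in token_terms) / total_tokens

def soft_overlong_penalty(length, max_len, buffer_len):
    """Overlong shaping (assumed linear ramp): 0 below the buffer, down to -1 at max_len."""
    soft_start = max_len - buffer_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer_len
    return -1.0

# A ratio of 1.25 would be clipped at 1.2 with a symmetric eps of 0.2,
# but passes through unchanged under the wider upper bound.
term = clip_higher_term(1.25, advantage=1.0)

# Filter out groups that are fully correct or fully incorrect.
groups = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0]]
kept = [g for g in groups if is_informative(g)]

# A 3-token response and a 1-token response share one denominator of 4 tokens.
objective = token_level_objective([[1.0, 1.0, 1.0], [-1.0]])

# Halfway through a 200-token buffer below a 1000-token limit.
penalty = soft_overlong_penalty(900, max_len=1000, buffer_len=200)
```

In a training loop, the groups rejected by `is_informative` would be replaced by freshly sampled ones until the batch is full.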

[Figure: DAPO]
GSPO[5]
Published by Alibaba, GSPO (Group Sequence Policy Optimization) argues that GRPO's token-level importance ratio is ill-justified: applying off-policy correction independently at every token accumulates high-variance noise across long sequences, and PPO-style clipping makes the instability worse. GSPO replaces it with a sequence-level importance ratio s_i(θ), defined as the geometric mean of the per-token ratios over the full response. Clipping is then applied once per sequence rather than per token, which lowers gradient variance and proves especially valuable for MoE models where routing changes amplify token-level noise.
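A minimal sketch of the sequence-level ratio follows. The geometric mean of the per-token ratios equals the exponential of the mean per-token log-ratio, which is how it is computed here. The function names and the clip range `eps=0.2` are placeholders for illustration only, not recommended settings.

```python
import math

def sequence_ratio(logp_new, logp_old):
    """s_i: geometric mean of per-token ratios = exp(mean of per-token log-ratios)."""
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clip once per sequence, then average over the group."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        s = sequence_ratio(new, old)
        clipped = max(1.0 - eps, min(s, 1.0 + eps))
        terms.append(min(s * adv, clipped * adv))
    return sum(terms) / len(terms)

# Per-token ratios of 2.0 and 0.5 cancel out at the sequence level.
old = [-1.0, -1.0]
new = [-1.0 + math.log(2.0), -1.0 - math.log(2.0)]
s = sequence_ratio(new, old)

objective = gspo_objective([new], [old], advantages=[1.0])
```

In this example a token-level clip would treat both tokens as far outside the trust region, while the sequence-level ratio stays at 1 and the response is not clipped at all.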

[Figure: GSPO]
Conclusion
With the success of GRPO and RLVR, RL post-training has become a core part of the LLM training pipeline. Given a well-designed reward, an LLM can be continually improved through a simple sample–reward–update loop. In practice, however, the high variance inherent to sampling makes GRPO training prone to instability, and a wave of variants (Dr. GRPO, DAPO, GSPO, and others) has emerged to address different facets of this problem. As reward design and training algorithms continue to co-evolve, RL post-training is likely to remain one of the most active and impactful directions in LLM development.
References
[1] Shao, Zhihong, et al. "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).
[2] Guo, Daya, et al. "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
[3] Liu, Zichen, et al. "Understanding R1-Zero-like training: A critical perspective." arXiv preprint arXiv:2503.20783 (2025).
[4] Yu, Qiying, et al. "DAPO: An open-source LLM reinforcement learning system at scale." arXiv preprint arXiv:2503.14476 (2025).
[5] Zheng, Chujie, et al. "Group sequence policy optimization." arXiv preprint arXiv:2507.18071 (2025).