Today I’d like to share an article from DeepSeek, titled DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

This article introduces DeepSeekMath 7B, which continues pre-training from DeepSeek-Coder-Base-v1.5 7B on a collection of 120B math-related tokens, together with natural language and code data.

The model achieves an impressive 51.7% on the competition-level MATH benchmark without relying on external toolkits or voting techniques, approaching the performance of Gemini-Ultra and GPT-4.

DeepSeekMath 7B’s mathematical reasoning ability is attributed to two key factors: First, through a carefully designed data selection pipeline, high-quality mathematics-related data is iteratively mined from publicly available web data.

Second, group relative policy optimization (GRPO) is introduced, which is a variant of proximal policy optimization (PPO) that can enhance mathematical reasoning ability while optimizing the memory usage of PPO.

The method's contributions are summarized as follows:

  1. A high-quality mathematical pre-training corpus was constructed, using a carefully designed pipeline to mine high-quality mathematical data from Common Crawl.
  2. The GRPO algorithm was proposed, which reduces the resources required for training while improving the model's mathematical reasoning ability.
  3. State-of-the-art performance was achieved on multiple mathematical reasoning benchmarks.

Overview

Title: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL: click here

Authors: Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

Code: click here

Motivation

The most advanced models, such as GPT-4 and Gemini-Ultra, are powerful but not publicly available, so there is significant room for improvement in the performance of open-source models.

Complexity and structure: Mathematical reasoning poses a significant challenge to language models due to the complexity and structured nature of mathematics.

Potential of public data: Publicly available web data may contain rich mathematical information that has yet to be mined and utilized.

Methods

Data collection: A DeepSeekMath corpus of 120B tokens was constructed by collecting high-quality math-related web data from Common Crawl through an iterative pipeline.

Model training: The corpus was used to continue pre-training on top of DeepSeek-Coder-Base-v1.5 7B, followed by mathematical instruction fine-tuning and reinforcement learning with the group relative policy optimization (GRPO) algorithm.

GRPO algorithm: GRPO is an improved reinforcement learning algorithm that removes the Critic model in PPO and estimates the baseline from the group score, thereby significantly reducing training resources.

Detailed methods and procedures:

Data collection and processing:

Build the DeepSeekMath Corpus: A fastText-based classifier is used to extract 120B math-related tokens from Common Crawl, building a large-scale, high-quality pre-training corpus, the DeepSeekMath Corpus.

Iterative data filtering: An iterative strategy is used: OpenWebMath serves as seed data to train an initial classifier, which is then used to mine more positive examples from Common Crawl; these are manually annotated to continuously improve the classifier's performance.
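To make the filtering loop concrete, here is a minimal sketch of one round of such a classifier pass, assuming the open-source `fasttext` Python package; the file names, labels, threshold, and hyperparameters are illustrative, not the paper's exact settings.

```python
# Sketch of one round of the iterative fastText filtering loop (hypothetical
# file names; hyperparameters are illustrative, not the paper's exact values).
import fasttext

# 1. Train a classifier on labelled seed data (e.g. OpenWebMath pages as
#    positives, random Common Crawl pages as negatives), one example per line
#    in fastText format: "__label__math <page text>" / "__label__other <page text>".
model = fasttext.train_supervised(
    input="seed_train.txt",
    lr=0.1,
    dim=256,
    wordNgrams=3,
    epoch=3,
)

# 2. Score unlabeled Common Crawl pages and keep high-confidence math pages.
def is_math(page_text: str, threshold: float = 0.8) -> bool:
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold

with open("common_crawl_sample.txt") as pages, open("mined_positives.txt", "w") as out:
    for page in pages:
        if is_math(page):
            out.write(page)

# 3. Manually annotate a subset of the mined positives, add them to the
#    training set, retrain the classifier, and repeat.
```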

Multilingual features: DeepSeekMath Corpus contains multilingual data, which improves the model’s performance on Chinese math benchmarks.

Decontamination: The training data is decontaminated to avoid any overlap with the test benchmarks.
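A minimal sketch of what n-gram-based decontamination can look like; the exact n-gram length and matching rules used in the paper may differ.

```python
# Sketch of n-gram-based decontamination: drop any training document that
# shares an exact 10-gram with a benchmark question or answer.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 10) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 10) -> Set[Tuple[str, ...]]:
    index: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(document: str, benchmark_index: Set[Tuple[str, ...]], n: int = 10) -> bool:
    return not ngrams(document, n).isdisjoint(benchmark_index)

# Usage: keep only documents that share no 10-gram with the benchmarks.
# benchmark_index = build_benchmark_index(benchmark_questions_and_answers)
# clean_docs = [d for d in corpus if not is_contaminated(d, benchmark_index)]
```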

Pretraining:

Code-based model initialization: Initialization using the DeepSeek-Coder-Base-v1.5 7B model was found to be more effective than initialization from a general LLM.

Pretraining data composition: 56% DeepSeekMath Corpus, 4% AlgebraicStack, 10% arXiv, 20% Github code, 10% Common Crawl natural language data.

Pretraining parameters: The AdamW optimizer is used, with a learning rate of 4.2e-4, a batch size of 10M tokens, and training on 500B tokens.
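For orientation, a minimal PyTorch sketch of this optimizer setup; only the peak learning rate and the token budget come from the summary above, while the betas, weight decay, and stand-in model are illustrative assumptions.

```python
# Minimal PyTorch sketch of the stated pre-training optimizer setup.
# Betas and weight decay are illustrative assumptions, not the paper's values.
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the 7B transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4.2e-4,          # peak learning rate stated above
    betas=(0.9, 0.95),  # assumption: a common LLM setting
    weight_decay=0.1,   # assumption
)

# Each optimizer step consumes a global batch of ~10M tokens; training runs
# until ~500B tokens have been seen, i.e. roughly 500B / 10M = 50,000 steps.
total_steps = 500_000_000_000 // 10_000_000
```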

Instruction fine-tuning:

Construct an instruction fine-tuning dataset: A mathematical instruction fine-tuning dataset of 776K samples is constructed, covering a variety of mathematical fields and difficulty levels, with solution steps in chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated reasoning formats.
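To make the solution formats concrete, here is a hypothetical program-of-thought (PoT) style sample, where the solution is a short program whose output is the final answer; the exact formatting used in the dataset may differ.

```python
# Hypothetical PoT-style solution for a training question such as:
# "A store sells pencils at 3 for $0.51. How much do 24 pencils cost?"
# The model is trained to emit a program like this; executing it yields the answer.

price_per_group = 0.51
pencils_per_group = 3
pencils_needed = 24

groups = pencils_needed / pencils_per_group
total_cost = groups * price_per_group
print(total_cost)  # 4.08
```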

Training parameters: Batch size 256, learning rate 5e-5, train for 500 steps.

Reinforcement learning – Group Relative Policy Optimization (GRPO):

Propose GRPO algorithm: Propose a PPO variant algorithm GRPO, which avoids the need for a Critic model by using group-wise scores to estimate the baseline, thereby reducing training resources.

Objective function: GRPO optimizes the policy model by maximizing an objective function that takes into account the relative advantage of in-group outputs and directly adds the KL divergence as a regularization term.

Advantage calculation: GRPO calculates the advantage through in-group relative rewards, avoiding cross-group comparisons and better conforming to the comparative nature of the reward model.

Supports both outcome and process supervision: GRPO supports both outcome supervision and process supervision; the latter supervises the policy more effectively by providing a reward at the end of each reasoning step.
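A minimal sketch of how the two supervision modes translate into per-token advantages, following the paper's description; the function names and shapes are illustrative.

```python
# Sketch of GRPO advantage computation under the two supervision modes.
# Normalization uses the group mean and standard deviation, as in the paper.
import numpy as np

def outcome_advantages(rewards, output_lengths):
    """One scalar reward per output; every token of output i gets the same
    normalized advantage (r_i - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    norm = (r - r.mean()) / (r.std() + 1e-8)
    return [np.full(length, a) for a, length in zip(norm, output_lengths)]

def process_advantages(step_rewards, step_end_indices, output_lengths):
    """One reward per reasoning step; the advantage of token t is the sum of
    normalized rewards of all steps that end at or after t."""
    flat = np.concatenate([np.asarray(r, dtype=float) for r in step_rewards])
    mean, std = flat.mean(), flat.std() + 1e-8
    advantages = []
    for rewards, ends, length in zip(step_rewards, step_end_indices, output_lengths):
        norm = (np.asarray(rewards, dtype=float) - mean) / std
        adv = np.zeros(length)
        for step_reward, end in zip(norm, ends):
            adv[: end + 1] += step_reward  # steps ending at or after token t contribute to t
        advantages.append(adv)
    return advantages
```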

Iterative RL: An iterative RL strategy is used: new training sets are generated from the policy model's sampling results, the reward model is continually retrained on them, and the updated reward model is used to further update the policy model.

Training data: Uses the CoT format problems related to GSM8K and MATH in the SFT data, about 144K problems.

Training parameters: The learning rate of the policy model is 1e-6, the KL coefficient is 0.04, 64 outputs are sampled for each problem, the maximum length is 1024, and the training batch size is 1024.

Conclusion

Conclusion 1: DeepSeekMath 7B outperforms all open-source models in mathematical reasoning. On the competition-level MATH benchmark, DeepSeekMath 7B achieves 51.7% accuracy, which is close to the performance level of Gemini-Ultra and GPT-4.

Conclusion 2: Well-designed pretraining data and the GRPO algorithm are key to the model's success. The combination of a high-quality mathematical corpus and GRPO enables the model to achieve significant performance gains on mathematical reasoning tasks.

Conclusion 3: Code training helps improve mathematical reasoning ability. Adding code data in the pretraining stage improves the model's ability to solve mathematical problems, both with and without tool use.

Conclusion 4: Limited usefulness of arXiv data: Contrary to previous beliefs, the arXiv data was found to be of limited help in improving mathematical reasoning.

Limitation

Geometry and proof capabilities are relatively weak: Although DeepSeekMath excels in quantitative reasoning, its capabilities in geometry and proof are still inferior to closed-source models. This may be due to the biased data selection in the pretraining and fine-tuning stages.

Weakness in few-shot learning: DeepSeekMath is inferior to GPT-4 at few-shot learning, which may be due to its limited model size.

More efficient reinforcement learning methods are needed: Although the reinforcement learning methods proposed in the paper are effective, there is still room for improvement, for example, how to make more effective use of the feedback from the reward model and how to deal with noisy reward signals.

Details

Reinforcement Learning Exploration and Analysis

Overview:

Introduction of Group Relative Policy Optimization (GRPO): The paper proposes a new reinforcement learning algorithm, GRPO, as a variant of Proximal Policy Optimization (PPO). The main feature of GRPO is that it abandons the Critic model commonly used in PPO and estimates the baseline through group scores, thereby greatly reducing the computational resources required for training.

GRPO effectiveness demonstration: The paper experimentally demonstrates that GRPO can effectively improve the performance of instruction-tuned models on both in-domain and out-of-domain mathematical tasks.

Unified framework for reinforcement learning methods: The paper proposes a unified framework for understanding different reinforcement learning methods, such as Rejection Sampling Fine-Tuning (RFT), Direct Preference Optimization (DPO), PPO and GRPO. The framework treats these methods as direct or simplified reinforcement learning techniques.

In-depth exploration of the elements of reinforcement learning: Through detailed experiments, the paper explores key elements of reinforcement learning, such as online vs. offline training, outcome vs. process supervision, and single-round vs. iterative reinforcement learning, and summarizes potential directions for improving RL effectiveness.

GRPO (Group Relative Policy Optimization) algorithm

Limitations of PPO: PPO is a commonly used reinforcement learning algorithm, but it requires training an additional Critic model to estimate the value function, which imposes extra computational and memory burden. In addition, in the LLM setting the reward is usually given only at the final token, which makes it harder to train a value function that is accurate at every token.

GRPO core idea: The core idea of GRPO is to abandon the Critic model and instead use the average score of a set of outputs for the same problem as a baseline. This baseline can be used to estimate the advantage function and for policy optimization. This approach significantly reduces the complexity of training.

Advantage function calculation: GRPO calculates the advantage from the relative rewards of the outputs within the same group (normalized by the group mean and standard deviation), rather than relying on a separate value function as in PPO.

KL divergence penalty: Unlike PPO, GRPO does not fold the KL penalty into the reward; instead it adds the KL divergence between the policy model and the reference model directly to the loss function, which keeps the advantage calculation simple.

The core idea of GRPO

No Critic (value function) required: GRPO avoids the need for a value function and uses within-group scores to estimate the baseline, thereby reducing training resources.

Intra-group relative advantage: For each question q, GRPO samples a group of outputs {o_1, o_2, …, o_G} from the old policy π_θold and then optimizes the policy model by maximizing the following objective.
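Restating the GRPO objective from the paper (outcome-supervision form; notation lightly adapted):

```latex
J_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\Big( \min\!\big( r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\mathrm{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_{i,t} \big)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \Big) \Bigg],

\quad \text{where } r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}
\;\; \text{(outcome supervision)}.
```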

Specifically:

The key quantity is Â_{i,t}, the advantage, which is calculated from the relative rewards of the outputs within the group rather than from a separate value function as in PPO.

The objective also adds the KL divergence to the reference model directly as a regularization term to control the magnitude of policy updates.

Alignment with the comparative nature of the reward model: GRPO uses relative intra-group rewards to calculate the advantage, which is consistent with how reward models are typically trained, namely on pairwise comparisons.
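A compact PyTorch-style sketch of this loss, assuming per-token log-probabilities have already been gathered for one question's sampled group; this is an illustrative sketch, not the authors' implementation (the 0.04 default matches the KL coefficient quoted earlier).

```python
# Illustrative sketch of the GRPO per-token loss: clipped ratio times the
# group-normalized advantage, minus a KL penalty against a frozen reference
# model (using the estimator ratio - log(ratio) - 1).
import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, mask, clip_eps=0.2, beta=0.04):
    """
    logp, logp_old, logp_ref: [G, T] per-token log-probs under the current,
        old, and reference policies for G sampled outputs of one question.
    rewards: [G] scalar rewards (outcome supervision).
    mask:    [G, T] 1 for real tokens, 0 for padding.
    """
    # Group-relative advantage, broadcast to every token of each output.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)                                    # [G, 1]

    ratio = torch.exp(logp - logp_old)                        # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_term = torch.min(ratio * adv, clipped * adv)

    # Per-token KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    ref_ratio = torch.exp(logp_ref - logp)
    kl = ref_ratio - (logp_ref - logp) - 1

    per_token = policy_term - beta * kl
    # Average over tokens of each output, then over the group; negate to minimize.
    per_output = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_output.mean()
```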

How can the reward model for GRPO be designed (referring to DeepSeek-R1)?

Features:

Format reward: encourages the generation of long chain-of-thought outputs, which pushes the model to produce an explicit reasoning process and improves its reasoning performance.

Accuracy reward: for mathematics, the final answer can be checked against the ground truth; for code, compiler feedback can be used.
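A hedged sketch of what such rule-based rewards might look like; the tags, regular expressions, and score values below are illustrative assumptions, not taken from DeepSeek-R1.

```python
# Hypothetical rule-based rewards in the spirit of the description above.
import re

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning and final answer in the
    expected structure, encouraging an explicit chain of thought."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """For math, compare the extracted final answer with the ground truth.
    (For code, one would instead run the compiler / unit tests.)"""
    match = re.search(r"<answer>(.+?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```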

Advantages of GRPO

Smaller memory footprint: no Critic model is required, reducing memory requirements.

More efficient training: calculation using intra-group relative advantage simplifies the training process.

More compatible with the nature of reward models: improves training stability and efficiency.

RL Unified Paradigm Summary

Unified Paradigm Proposed

The authors propose a unified paradigm to understand different training methods such as SFT (Supervised Fine-Tuning), RFT (Rejection Sampling Fine-Tuning), DPO (Direct Preference Optimization), PPO, and GRPO. Key RL elements: the unified framework comprises data sources, reward functions, and algorithms.

  • Data source: This refers to the data used for training, which can be derived from manual labeling, SFT models, or real-time policy models.
  • Reward function: This refers to the function used to evaluate the quality of the output, which can be a rule or a model.
  • Algorithm: This refers to the method used to process the data and reward signal and update the model parameters.

Analysis of different methods based on a unified paradigm

Table 10 summarizes the similarities and differences between SFT, RFT, DPO, Online RFT, PPO and GRPO in terms of data sources, reward functions and gradient coefficients.

| Method | Training data | Reward function | Gradient coefficient | Training method | Advantages/features | Applicable scenarios |
| --- | --- | --- | --- | --- | --- | --- |
| SFT | Manually labeled SFT data | Manually selected (implicit reward) | Fixed to 1 | Supervised learning | Simple and stable; depends on high-quality labeled data | Base model training, initial alignment tasks |
| RFT | SFT dataset questions + SFT model sampled outputs | Based on answer correctness (rule judgment) | 0 (wrong) or 1 (correct) | Offline policy optimization | Efficient to compute; uses rule feedback directly | Mathematical/logical tasks with clear rules |
| DPO | SFT dataset questions + paired model outputs | Human preference labeling or rule comparison | Based on preference probability (e.g., Bradley-Terry model) | Comparison learning | Avoids explicit reward modeling; directly optimizes preferences | Human preference alignment tasks (e.g., dialogue generation) |
| Online RFT | Question-output pairs sampled from the real-time policy model | Based on answer correctness (rule judgment) | 0 (wrong) or 1 (correct) | Online policy optimization | Dynamically updates the policy with real-time feedback | Scenarios requiring online interaction (e.g., game AI) |
| PPO | SFT dataset questions + policy model sampled outputs | Trained reward model (RM) | Advantage function (based on reward estimation) | Policy gradient method | Efficient and stable; supports multi-step optimization | Complex tasks (e.g., text generation, robot control) |
| GRPO | SFT dataset questions + policy model sampled outputs | Trained reward model (RM) | Intra-group relative reward (normalized comparison) | Group policy optimization | Reduces reward variance; improves intra-group comparison | High-variance tasks (e.g., long text generation) |

Observations on data sources

Online vs offline training: Online training refers to using the output of the real-time policy model as training data, while offline training refers to using the output of a fixed model (such as the SFT model) as training data. Experimental results show that online training is generally better than offline training.

Outcome supervision vs process supervision: Outcome supervision refers to only rewarding the final step of the output, while process supervision refers to rewarding each step of the reasoning process. Experimental results show that process supervision is more effective in complex tasks.

Single-round vs iterative reinforcement learning: Single-round RL performs a single round of policy optimization, while iterative RL continually updates the reward model and re-optimizes the policy over multiple rounds. Experimental results show that iterative RL can significantly improve performance, especially in the first iteration.

Observations on gradient coefficients

Rule-based vs. model-based: Rule refers to determining the reward based on the correctness of the answer, and Model refers to training a reward model to score.

Difference in gradient coefficients: The key difference between GRPO and Online RFT is that GRPO adjusts its gradient coefficients based on the reward values provided by the reward model, while Online RFT does not.

GRPO advantages: Experiments show that GRPO is superior to Online RFT, demonstrating the benefit of adjusting the magnitude and sign of the gradient coefficients according to the reward values. GRPO+PS is superior to GRPO+OS, demonstrating the benefit of using fine-grained, step-aware gradient coefficients.
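To make the distinction concrete, a small sketch of the two gradient coefficients under the unified paradigm (outcome supervision, ignoring clipping and KL terms; names and numbers are illustrative):

```python
# Gradient coefficients under the unified paradigm (outcome supervision).
# Online RFT: 1 for correct outputs, 0 otherwise -- wrong outputs neither
# help nor hurt. GRPO: the group-normalized reward, so below-average outputs
# receive a negative coefficient and are actively penalized.
import numpy as np

def rft_gradient_coefficients(is_correct):
    return np.asarray(is_correct, dtype=float)            # 0 or 1

def grpo_gradient_coefficients(rewards):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)               # can be negative

rewards = [0.9, 0.2, 0.4, 0.8]           # reward-model scores for one group
correct = [1, 0, 0, 1]                    # rule-based correctness for the same group
print(rft_gradient_coefficients(correct))   # [1. 0. 0. 1.]
print(grpo_gradient_coefficients(rewards))  # roughly [ 1.14 -1.31 -0.61  0.79]
```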

RL effectiveness and directions for improvement

Why is RL effective?

Experimental results: RL improves Maj@K performance but not Pass@K.

Explanation: RL improves overall performance by making the output distribution more robust, i.e., it raises the probability that correct answers appear among the Top-K samples, rather than enhancing the model's fundamental capabilities.
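A small sketch of the two metrics makes this interpretation concrete (final answers are assumed to be already extracted as strings):

```python
# Pass@K: does at least one of the K sampled answers match the ground truth?
# Maj@K:  does the majority-voted answer among the K samples match?
# RL shifting probability mass toward correct answers improves Maj@K even if
# the set of answers the model can ever produce (Pass@K) stays the same.
from collections import Counter

def pass_at_k(sampled_answers, ground_truth):
    return any(a == ground_truth for a in sampled_answers)

def maj_at_k(sampled_answers, ground_truth):
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == ground_truth

samples = ["42", "41", "42", "7", "42"]   # K = 5 sampled final answers
print(pass_at_k(samples, "42"))  # True
print(maj_at_k(samples, "42"))   # True ("42" is the majority answer)
```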

How can more effective RL be achieved?

Based on the unified paradigm, the authors propose future directions for improving RL in three aspects: data sources, algorithms, and reward functions.

  • Data sources:
    • Explore questions beyond those used in the SFT stage.
    • Use more advanced sampling (decoding) strategies, such as tree search-based methods.
    • Use efficient inference techniques to improve the exploration efficiency of the policy model.
  • Algorithm:
    • Explore reinforcement learning algorithms that are more robust to noisy reward signals.
    • Study weak-to-strong alignment methods.
  • Reward function:
    • Enhance the generalization ability of the reward model to handle out-of-distribution problems and advanced decoded outputs.
    • Reflect the uncertainty of the reward model and use it as a bridge to connect weak reward models and weak-to-strong learning algorithms.
    • Efficiently construct high-quality process reward models to provide fine-grained training signals for the inference process.

Summary

DeepSeekMath significantly improves the mathematical reasoning ability of open-source language models by constructing a large-scale mathematical corpus and proposing a new reinforcement learning algorithm. The highlights of the paper are:

  • The construction and validation of the DeepSeekMath Corpus, a large-scale, high-quality, multilingual mathematical corpus.
  • An efficient reinforcement learning algorithm, GRPO, which reduces memory usage while improving the model's mathematical reasoning ability.
  • An in-depth study of the impact of code training on mathematical reasoning, finding that arXiv data has limited effect.

The value of DeepSeekMath:

  • It provides the open-source community with a powerful mathematical reasoning model and promotes the development of mathematical AI.
  • It provides valuable experience and methods for building mathematical corpora and training mathematical reasoning models.
  • The proposed GRPO algorithm provides new ideas for reinforcement-learning training in other fields.
