1 Background

During the Spring Festival, DeepSeek R1 once again attracted widespread attention; even the DeepSeek V3 interpretation article we wrote earlier was widely re-shared and discussed.

Although there have been many analyses and reproductions of DeepSeek R1, we have decided to compile our own reading notes here.

We will use three core schematic diagrams to demonstrate model construction and key technical points, distilling the essence of the DeepSeek-R1 series to provide a more intuitive understanding of its design ideas.

The corresponding paper is [2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, and the corresponding open-source model is DeepSeek-R1.

2 Introduction

2.1 Common Reasoning algorithms

As shown in Figure 2 below, the author explains four common Reasoning algorithms. Although they differ in specific details, they all include two core operations:

  • Expansion: generate tokens to expand the solution path.
  • Aggregation: integrate the results of each path to obtain the final answer. Increasing the computational resources in the expansion phase can usually improve the quality of the answer in the aggregation phase.

Self-consistency (SC). As shown in Figure 2a, the core idea of SC is to generate multiple different outputs (for example, by varying sampling parameters) and then take a majority vote over all the answers, selecting the one that occurs most often. The key parameter is the number of candidate answers n.
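To make the expansion/aggregation split concrete, here is a minimal self-consistency sketch in Python; `generate` and `extract_answer` are hypothetical stand-ins for an LLM sampling call and an answer parser, not anything defined in the paper.

```python
# Minimal self-consistency sketch (illustrative only).
from collections import Counter
from typing import Callable, List

def self_consistency(prompt: str,
                     generate: Callable[[str, float], str],
                     extract_answer: Callable[[str], str],
                     n: int = 16,
                     temperature: float = 0.8) -> str:
    """Sample n reasoning paths and return the majority-vote answer."""
    answers: List[str] = []
    for _ in range(n):
        output = generate(prompt, temperature)   # expansion: sample one path
        answers.append(extract_answer(output))   # keep only the final answer
    # aggregation: pick the most frequent answer among the n candidates
    return Counter(answers).most_common(1)[0][0]
```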

Rebase algorithm: As shown in Figure 2b below, Rebase also generates multiple outputs, but they are generated in multiple steps. Each step is scored using the Reward model, and the result with the highest score is used to continue generating. Finally, a reasoning tree with multiple branches is generated. The answer with the highest score (Best-of-N) is selected in the aggregation stage.
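Below is a simplified reward-guided, step-wise search in the spirit of this description (not the exact Rebase algorithm from the paper); `propose_steps`, `reward_model` and `is_complete` are hypothetical callables.

```python
# Simplified step-wise, reward-guided search; keeps the best-scored branches
# at each depth and picks the highest-scored completed path at the end.
from typing import Callable, List, Tuple

def stepwise_reward_search(prompt: str,
                           propose_steps: Callable[[str, int], List[str]],
                           reward_model: Callable[[str], float],
                           is_complete: Callable[[str], bool],
                           width: int = 4,
                           max_depth: int = 8) -> str:
    frontier: List[str] = [prompt]
    finished: List[Tuple[float, str]] = []
    for _ in range(max_depth):
        candidates: List[Tuple[float, str]] = []
        for partial in frontier:
            for step in propose_steps(partial, width):           # expansion
                path = partial + step
                score = reward_model(path)                        # per-step scoring
                (finished if is_complete(path) else candidates).append((score, path))
        if not candidates:
            break
        candidates.sort(key=lambda x: x[0], reverse=True)
        frontier = [path for _, path in candidates[:width]]       # keep top branches
    # aggregation: Best-of-N over completed paths by reward score
    return max(finished)[1] if finished else frontier[0]
```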

Monte Carlo Tree Search (MCTS): As shown in Figure 2c below, MCTS is a powerful Reasoning algorithm that expands nodes by sampling gradually and constructs a solution tree until it reaches a leaf node containing a candidate solution. Each solution is scored through a Reward model or simulation, and the score is propagated back to its ancestor nodes to update their reward values, thus completing an iteration. The key parameter is also n, and increasing n allows for deeper and broader exploration of potential solutions.
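A compact, schematic MCTS over partial solutions might look as follows (illustrative only); `propose_steps`, `is_terminal` and `evaluate` are hypothetical callables, with `evaluate` standing in for the Reward model or a simulation.

```python
# Schematic MCTS: selection via UCB, expansion, evaluation, backpropagation.
import math
import random
from typing import List, Optional

class Node:
    def __init__(self, state: str, parent: Optional["Node"] = None):
        self.state, self.parent = state, parent
        self.children: List["Node"] = []
        self.visits, self.value = 0, 0.0

    def ucb(self, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts(root_state, propose_steps, is_terminal, evaluate, n_iter=100, width=4):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # selection: descend via UCB until reaching a leaf
        while node.children:
            node = max(node.children, key=Node.ucb)
        # expansion: add candidate next steps unless the node is terminal
        if not is_terminal(node.state):
            steps = propose_steps(node.state, width)
            if steps:
                node.children = [Node(node.state + s, node) for s in steps]
                node = random.choice(node.children)
        # evaluation: reward model score or rollout/simulation
        score = evaluate(node.state)
        # backpropagation: update all ancestors with the score
        while node:
            node.visits += 1
            node.value += score
            node = node.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.state
```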

Internalized cognitive chain (ICoT). As shown in Figure 2d below, the latest LLMs, such as OpenAI o1 and Qwen QwQ, can internalize reasoning behavior during training without the need for an explicit reasoning algorithm. The core idea is to generate a CoT sequence, decompose complex problems into multiple sub-problems, and then iteratively refine these answers by reflecting on previous outputs to eventually arrive at a solution.

2.2 Reasoning alignment methods

2.2.1 Best-of-N method overview

In short, Best-of-N is an alignment method widely used in LLM inference, which aims to ensure high quality of the generated results by generating multiple candidate responses and selecting the best one. It consists of three main processes (a minimal sketch follows the list):

  1. Generation process: For a given prompt X, the Best-of-N method generates N IID responses (Y₁, Y₂, …, Yₙ), where N is often referred to as the “batch size”.
  2. Scoring mechanism: Each generated response is scored by a reward model to obtain a corresponding score {s(Y₁), s(Y₂), …, s(Yₙ)}.
  3. Selecting the best response: Finally, the response with the highest score among all generated responses is selected as the output, i.e., Y_Best-of-N = argmax {s(Y₁), s(Y₂), …, s(Yₙ)}.
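These three steps can be sketched as follows; `generate` and `reward_model` are hypothetical stand-ins for an LLM sampler and a trained reward model, not APIs from any particular library.

```python
# Minimal Best-of-N sketch: generation, scoring, selection.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n i.i.d. responses and return the one with the highest reward."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]   # generation
    scores = [reward_model(prompt, y) for y in candidates]         # scoring
    return candidates[max(range(n), key=lambda i: scores[i])]      # selection
```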

The advantages of this method are:

  1. It can effectively avoid complex fine-tuning steps, making it easier to deploy language models that have been pre-trained or fine-tuned with instructions.
  2. It is simple to implement, easy to understand, and essentially free of hyperparameters: the main hyperparameter is N, which can be dynamically adjusted during inference.
  3. It is highly competitive in terms of generation quality and can even rival some complex post-training techniques such as RLHF or DPO. Research shows that the Best-of-N method performs well on the trade-off curve between reward and KL divergence, even surpassing other complex alignment strategies.

The disadvantages of this method are:

  1. Inference requires generating N sequences, which can lead to significant computational overhead. In practice, a reasonable value for N ranges from 4 to 128, but to compete with the most advanced post-training methods, higher values may be required, such as 1,000 to 60,000, which leads to almost unacceptable computational overhead.

The best-of-N method is often used to generate high-quality datasets for subsequent supervised fine-tuning and played a key role in the alignment process of LLaMA-2 and LLaMA-3.

2.2.2 OpenAI best-of-N method

OpenAI first proposed Best-of-N sampling in [2009.01325] Learning to summarize from human feedback. Specifically, it is used to evaluate and optimize the performance of the summarization model by selecting the best summary from multiple sampled candidates. This method helps researchers better understand the relationship between different evaluation metrics and human assessor preferences, and is used to guide model training and optimization.

OpenAI also uses Best-of-N sampling (rejection sampling) in the follow-up [2112.09332] WebGPT: Browser-assisted question-answering with human feedback. Specifically, a fixed number of answers (4, 16 or 64) are sampled from the BC model or the RL model, and the one with the highest reward-model score is selected, as a way of optimizing against the reward model. This method requires no additional training, but achieves its gains at the cost of increased computation at the inference stage.

2.2.3 Google BOND method

In [2407.14622] BOND: Aligning LLMs with Best-of-N Distillation, the authors from Google propose Best-of-N Distillation (BOND), a new RLHF algorithm designed to simulate the Best-of-N sampling strategy through a Distribution Matching algorithm without significantly increasing the computational overhead during Inference.

Specifically, the authors first derive the exact analytical distribution of Best-of-N sampling, i.e., its probability function:
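Assuming no reward ties between distinct responses, one can show that the Best-of-N distribution takes the form

$$\pi_{\text{BoN}}(y \mid x) \;=\; \big[F(y \mid x)\big]^{N} \;-\; \big[F(y \mid x) - \pi_{\text{ref}}(y \mid x)\big]^{N},$$

where $\pi_{\text{ref}}$ is the reference (sampling) policy and $F(y \mid x)$ is the probability that a single sample drawn from $\pi_{\text{ref}}(\cdot \mid x)$ has reward no greater than $r(x, y)$. The exact expression in the paper additionally handles ties and may be parameterized differently; the version above is a hedged reconstruction, not a quotation.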

Second, the authors express the problem as a distribution matching problem.

Afterwards, the authors propose to use Jeffreys divergence as the distribution matching objective:
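For reference, the standard (unweighted) Jeffreys divergence between distributions p and q is the symmetrized KL:

$$D_J(p \,\|\, q) \;=\; D_{\mathrm{KL}}(p \,\|\, q) \;+\; D_{\mathrm{KL}}(q \,\|\, p).$$

BOND works with a weighted generalization that trades off the mode-covering forward KL against the mode-seeking backward KL toward the Best-of-N distribution; the exact weighting used in the paper is not reproduced here.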

Finally, to address the problem of selecting N, the authors propose the iterative BOND method, which improves the policy by iteratively distilling the Best-of-N distribution (a schematic sketch follows the list). The specific steps include:

  • Initialize the auxiliary Anchor policy π(anchor).
  • Iteratively execute BOND to distill the Best-of-N of π(anchor), and update π(anchor) after each step.
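A highly schematic sketch of this loop, where `bond_distill` is a hypothetical helper standing in for one round of distribution matching toward the Best-of-n distribution of the current anchor (not a real API):

```python
# Schematic iterative BOND loop.
def iterative_bond(policy, bond_distill, rounds: int = 4, n: int = 4):
    anchor = policy                          # initialize the auxiliary Anchor policy
    for _ in range(rounds):
        # distill the Best-of-n distribution of the current anchor into the policy
        policy = bond_distill(student=policy, anchor=anchor, n=n)
        anchor = policy                      # update the anchor after each step
    return policy
```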

2.3 Process supervision and outcome supervision

Outcome and Process refer to the two aspects of the Reward model evaluation:

  • Outcome Reward Model: Evaluate whether the final result of the model output is correct or as expected.
  • Process Reward Model: Evaluates whether the model’s reasoning and decision-making steps in the process of generating results are reasonable and effective.

For example, OpenAI’s Let’s Verify Step by Step also mentions:

  • Process supervision (Process-supervised): provides feedback on each step of the model’s Reasoning process. Process-supervised Reward Models (PRM) are trained to predict the correctness of each step of the solution.
  • Outcome supervision (Outcome-supervised): provides feedback based only on the final result of the model’s reasoning. Outcome-supervised Reward Models (ORM) are trained on the final answer of the solution, with correctness determined by automatic checking (a schematic contrast is sketched after this list).
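A schematic contrast between the two kinds of supervision; `orm` and `prm` are hypothetical scoring models, and the aggregation choice for the PRM is an assumption.

```python
# Outcome supervision scores only the final result; process supervision scores
# every intermediate step and then aggregates.
from typing import Callable, List

def outcome_score(final_answer: str, orm: Callable[[str], float]) -> float:
    # ORM: a single score for the final result only
    return orm(final_answer)

def process_score(steps: List[str], prm: Callable[[str], float]) -> float:
    # PRM: score every step, then aggregate (here: the minimum, so one bad
    # step penalizes the whole solution; other aggregations are possible)
    return min(prm(step) for step in steps)
```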

2.4 Reward Hacking

In RL, reward hacking refers to the phenomenon in which an agent exploits a flaw in the design of the reward function to maximize the cumulative reward in a way that does not meet the original intention of the designer. Although this behavior technically meets the optimization goal of the reward function, the actual effect deviates from the expected task goal and may even lead to negative consequences.

Key point analysis:

  1. Definition and manifestation:
    1. The agent finds a flaw in the reward function and obtains a high reward by taking “shortcuts” instead of actually solving the problem.
    2. For example, a cleaning robot turns off the lights to make the room “look” clean, rather than actually cleaning it; a game agent repeatedly scores points without completing the level goal; an autonomous vehicle chooses not to slow down in order to reduce the number of braking events, which poses a safety hazard; a model generates meaningless keyword-matching content in order to game a high score.
  2. Root causes:
    1. Incomplete reward function design: oversimplification or failure to cover edge cases.
    2. Misalignment between goals and rewards: the reward function fails to fully reflect the real goal, causing the agent to optimize for the “wrong” goal.
  3. Solutions:
    1. Improve reward design: introduce multi-dimensional rewards (e.g. safety, efficiency, etc.) or dynamically adjust the reward function.
    2. Adversarial verification: detect whether the agent is “cheating” through additional mechanisms.
    3. Manual intervention and constraints: set behavioral boundaries (e.g. safety layer) or manual feedback (e.g. RLHF).
    4. Inverse reinforcement learning (IRL): learn a more realistic reward function from expert demonstrations.
    5. Hierarchical reinforcement learning: decompose the task into sub-goals to reduce the risk of local optimization.
  4. Association with overfitting:
    1. Both exhibit a disconnect between training metrics and real-world performance, but Reward Hacking places more emphasis on the design flaws of the reward function than on the generalization ability of the model.
  5. Summary:
    1. Reward Hacking reveals the challenge of goal alignment in RL. Solving this problem requires a combination of designing more robust reward mechanisms, introducing external constraints, and incorporating human prior knowledge to ensure that the agent’s behavior is both efficient and in line with the design intent.

3 DeepSeek-R1-Zero & DeepSeek-R1

3.1 Overview

Previous research has largely relied on large amounts of supervised data to improve model performance. This study shows that even without SFT as a cold start, large-scale RL can significantly enhance the reasoning ability of the model. In addition, the introduction of a small amount of cold start data can further optimize performance. The following are the models related to DeepSeek-R1:

  1. DeepSeek-R1-Zero: This model applies RL directly to the Base model without any SFT data.
  2. DeepSeek-R1: This model applies RL starting from a checkpoint that has been fine-tuned with thousands of long CoT samples.
  3. DeepSeek-R1-Distill-xx: Distills the Reasoning capability of DeepSeek-R1 into a small Dense model.

3.2 DeepSeek-R1-Zero

The following figure shows the key points in the training of the DeepSeek-R1-Zero model:

PS: It should be noted that the paper does not provide much information on the data used in the RL process of DeepSeek-R1-Zero. However, there is some explanation of the data generation process and quantity in subsequent R1 training, although it is not particularly specific.

3.2.1 RL algorithm

To reduce the training cost of RL, the authors use DeepSeek’s own GRPO (Group Relative Policy Optimization) method, introduced in [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. This method abandons the Critic model, which is usually comparable in size to the Policy model, and instead estimates the baseline from group scores. The corresponding explanation is shown in the figure below (picture from Twitter):
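In addition to the figure, the group-relative baseline can be illustrated with a small numeric sketch; this shows only the critic-free advantage computation, not the full clipped GRPO objective.

```python
# GRPO-style advantage: sample a group of G outputs for the same prompt, score
# them, and normalize each reward against the group mean and standard deviation.
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled output = (reward - group mean) / group std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for G = 4 sampled answers to the same prompt
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1.0, -1.0, -1.0, 1.0]
```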

3.2.2 Reward modeling

Rewards are the source of training signals and determine the optimization direction of RL. To train DeepSeek-R1-Zero, the authors used a rule-based reward system, which mainly consists of two types of rewards (a minimal code sketch follows the list):

  • Accuracy reward: Evaluate whether the response is correct. For example:
    • In mathematical problems with deterministic results, the model needs to provide the final answer in a specific format (such as inside a box) so that its correctness can be reliably verified by rules.
    • Similarly, for LeetCode problems, feedback can be generated using a compiler based on predefined test cases.
  • Format reward: A format reward is also used to force the model to place its thought process between the “<think>” and “</think>” tags.
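A minimal sketch of these two rule-based rewards; the \boxed{...} answer convention and the reward values used here are assumptions rather than details confirmed by the paper.

```python
# Rule-based rewards: accuracy (boxed answer matches the reference) and
# format (thought process wrapped in <think>...</think> tags).
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap the thought process in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward responses whose boxed final answer matches the reference answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return accuracy_reward(response, ground_truth) + format_reward(response)
```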

During the development of DeepSeek-R1-Zero, the authors did not use an outcome neural Reward Model or a process neural Reward Model, because they found that neural Reward Models may suffer from Reward Hacking in large-scale RL; in addition, retraining the Reward Model requires additional training resources and complicates the entire training pipeline.

3.2.3 Training Template

To train DeepSeek-R1-Zero, the authors first designed a simple Template to guide the Base model to follow the set instructions. As shown in Table 1 below, the Template requires DeepSeek-R1-Zero to generate an inference process and then give the final answer.

The author deliberately limited the constraints to this structural framework to avoid introducing any content bias – for example, forcing reflective reasoning or promoting specific problem-solving strategies – to ensure that the natural evolution of the model can be accurately observed during the RL process.

3.2.4 Conclusion

Robust reasoning capabilities without SFT data: By starting RL directly from the Base model, the evolution trajectory of the model can be closely monitored without SFT interference. As Figure 3 below shows, DeepSeek-R1-Zero’s thinking time continued to increase throughout the training process (its generated responses gradually grew longer). This improvement did not come from external adjustments, but was a natural result of the model’s internal development. By leveraging extended test-time computation, DeepSeek-R1-Zero naturally gained the ability to solve increasingly complex Reasoning tasks, including the ability to reflect.

DeepSeek-R1-Zero experienced an “aha moment” during training. As shown in Table 3 below, this moment occurred in an intermediate version of the model. During this stage, DeepSeek-R1-Zero learned to allocate more thinking time to a problem by re-evaluating its initial approach.

Majority voting: DeepSeek-R1-Zero’s performance can be further improved by applying majority voting. For example, as shown in Table 2 below, after majority voting is used in the AIME benchmark test, its performance jumps from 71.0% to 86.7%, surpassing OpenAI-o1-0912.

Weaknesses: While DeepSeek-R1-Zero demonstrates strong Reasoning capabilities and autonomously develops unexpected and powerful Reasoning behaviors, it still faces challenges such as poor readability and language mixing.

3.3 DeepSeek-R1

To make the Reasoning process more readable and share it with the open community, the authors further explore the DeepSeek-R1 method, which uses human-friendly cold-start data for RL. Inspired by DeepSeek-R1-Zero, two natural questions follow:

  1. Can Reasoning performance be further improved or the convergence process accelerated by introducing a small amount of high-quality data as a cold start?
  2. How can we train a user-friendly model that not only generates clear and coherent CoTs, but also demonstrates strong generalization capabilities?

In response to these questions, the authors designed a training process for DeepSeek-R1. The process consists of multiple stages, as described below:

Stage-1, as shown in the figure below, trains an intermediate version of DeepSeek-R1 through SFT + RL:

The following figure shows Stages-2, 3, and 4:

  • Stage-2: upper left, construct 200K non-Reasoning data and 600K Reasoning data.
  • Stage-3: upper right, SFT + RL train DeepSeek-R1.
  • Stage-4: lower figure, distill DeepSeek-R1 into the DeepSeek-R1-Distill-xx models.

3.3.1 Cold Start (Stage-1)

Unlike DeepSeek-R1-Zero, to avoid an unstable Cold Start phase when RL training begins from the Base model, the authors built and collected a small amount of Long CoT data for DeepSeek-R1 and used it to fine-tune the model as the initial RL Actor. To collect this data, the authors explored various methods:

  • Using few-shot prompts with Long CoT examples
  • Prompting the model directly to generate detailed answers with reflection and verification
  • Collecting DeepSeek-R1-Zero output in a human-readable format
  • Refining the results through post-processing with manual labeling

The authors collected a total of thousands of Cold Start samples, which were used to fine-tune DeepSeek-V3-Base as the starting point for RL. Compared with DeepSeek-R1-Zero, the advantages of the Cold Start data include:

  • Readability: DeepSeek-R1-Zero responses are often mixed in multiple languages or lack the Markdown formatting used to highlight answers for users. In contrast, when creating Cold Start data for DeepSeek-R1, the authors designed a readable format that includes a summary at the end of each Response and filters out unreadable Responses. Here, the output format is defined as |special_token|<reasoning_process>|special_token|<summary>, where reasoning_process is the chained thinking for the Query and summary is used to summarize the reasoning results (a small parsing sketch for this format follows the list).
  • Potential: By carefully designing Cold Start data patterns that incorporate human priors, the authors observed that performance is superior to DeepSeek-R1-Zero.
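A small parser for this cold-start format might look as follows; the concrete special token is not specified in the paper, so it is left as a hypothetical parameter.

```python
# Split a cold-start formatted response of the form
# |special_token|<reasoning_process>|special_token|<summary> into its parts.
from typing import Tuple

def split_reasoning_and_summary(response: str,
                                special_token: str = "|special_token|") -> Tuple[str, str]:
    """Return (reasoning_process, summary) from a cold-start formatted response."""
    parts = response.split(special_token)
    # Expected layout: ["", reasoning_process, summary]
    if len(parts) >= 3:
        return parts[1].strip(), parts[2].strip()
    return response.strip(), ""  # fall back gracefully if the format is absent
```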

3.3.2 Reasoning-driven RL (Stage-1)

After fine-tuning DeepSeek-V3-Base on Cold Start data, the same large-scale RL training process as DeepSeek-R1-Zero is used. This stage aims to improve the model’s ability in Reasoning-intensive tasks, especially on programming, mathematics, science and logical reasoning problems with clear solutions.

During training, the authors observed that CoT often suffered from language mixing, especially when the RL prompt involved multiple languages. To alleviate the language mixing problem, the authors introduced a language consistency reward into RL training, which is calculated based on the proportion of words in the target language in CoT. Although ablation experiments show that this alignment method leads to a slight decrease in model performance, this reward mechanism is consistent with human preferences and enhances readability. Finally, the authors directly add the accuracy of the Reasoning task to the language consistency reward to form the final reward, and implement RL training on the fine-tuned model until it converges on the Reasoning task.
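A sketch of how such a combined reward could be computed; `is_target_language` is a hypothetical word-level language check, and the direct sum of the two terms follows the description above (any further weighting would be an assumption).

```python
# Language-consistency reward: fraction of CoT words in the target language,
# added directly to the accuracy reward to form the final reward.
from typing import Callable

def language_consistency_reward(cot: str,
                                is_target_language: Callable[[str], bool]) -> float:
    words = cot.split()
    if not words:
        return 0.0
    return sum(is_target_language(w) for w in words) / len(words)

def final_reward(accuracy: float, cot: str,
                 is_target_language: Callable[[str], bool]) -> float:
    return accuracy + language_consistency_reward(cot, is_target_language)
```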

3.3.3 Construction of 800,000 selected data (Stage-2)

Once RL for Reasoning converges, the resulting checkpoint is used to collect SFT data for the next round of training. Unlike the initial Cold Start data, which focuses mainly on Reasoning, this stage incorporates data from other domains to enhance the model’s ability in writing, role-playing, and other general-purpose tasks. Specifically, the data is generated and the model is fine-tuned as follows:

  • Reasoning data: Reasoning prompts are selected and Reasoning trajectories are generated by rejection sampling from the aforementioned RL-trained Checkpoint (DeepSeek-R1 Stage-1). In the previous stage, only data that could be evaluated with rule-based rewards was included; at this stage, the dataset was expanded to include additional data, some of it scored by a reward model, with correctness judged by feeding the ground-truth answers and model predictions into DeepSeek-V3 (DeepSeek-V3 as judge). In addition, because model output is sometimes confusing and difficult to read, thought chains with mixed languages, long paragraphs, and code blocks were filtered out. For each prompt, multiple responses were sampled and only the correct ones (Best-of-N) were retained (see the sketch after this list). In total, about 600,000 Reasoning-related training samples were collected.
  • Non-Reasoning data: for tasks such as writing, factual QA, self-cognition, and translation, the DeepSeek-V3 pipeline was reused along with parts of DeepSeek-V3’s SFT datasets. For some non-Reasoning tasks, DeepSeek-V3 is called to generate a potential CoT before answering the question; for simple queries such as “Hello”, however, no thought chain is provided in the Response. In the end, about 200,000 non-Reasoning training samples were collected.
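A schematic rejection-sampling pipeline for the Reasoning portion, following the description above; `generate`, `is_correct` (rule check or DeepSeek-V3 as judge) and `is_readable` are hypothetical callables.

```python
# Rejection sampling for the Reasoning data: sample several responses per
# prompt, keep only correct and readable ones.
from typing import Callable, Dict, List

def collect_reasoning_samples(prompts: List[str],
                              generate: Callable[[str], str],
                              is_correct: Callable[[str, str], bool],
                              is_readable: Callable[[str], bool],
                              samples_per_prompt: int = 8) -> List[Dict[str, str]]:
    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        kept = [c for c in candidates
                if is_correct(prompt, c)    # rule-based check or model-as-judge
                and is_readable(c)]         # drop mixed-language / unreadable outputs
        if kept:
            dataset.append({"prompt": prompt, "response": kept[0]})
    return dataset
```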

3.3.4 SFT & RL for all scenarios (Stage-3)

Using the two aforementioned datasets (Reasoning and non-Reasoning), two rounds of fine-tuning over a total of about 800,000 selected samples were performed on DeepSeek-V3-Base.

To further align the model with human preferences, the authors implemented a second phase of RL, which aims to improve the model’s usefulness and harmlessness while also refining its Reasoning capabilities. Specifically, the model was trained with a combination of reward signals and diverse prompt distributions.

  • For Reasoning data, the methodology described in DeepSeek-R1-Zero is followed, using a rule-based reward mechanism to guide the model’s learning in the areas of mathematics, programming and logical reasoning.
  • For general data, the Reward model is used to capture human preferences in complex and subtle situations. A similar strategy of preference pairs and training prompt distributions is used based on the DeepSeek-V3 process.
  • In terms of usefulness, only the final summary is considered, ensuring that the evaluation focuses on the practicality and relevance of the Response to the user while minimizing interference with the underlying Reasoning process.
  • As for harmlessness, the entire Response of the model is comprehensively evaluated, including the Reasoning process and summary, to identify and eliminate any potential risks, biases, or harmful content that may arise during the generation process.
  • Ultimately, by integrating reward signals and diversifying the data distribution, a model can be trained that prioritizes both usefulness and harmlessness while also excelling in Reasoning.

3.3.5 Distillation (Stage-4)

In order to equip more efficient small models with the reasoning ability of DeepSeek-R1, the authors directly fine-tuned the open-source models Qwen and LLaMA using the 800,000 samples selected in Stage-2 above. The results show that this direct distillation method significantly improves the reasoning ability of small models. The base models used by the authors include Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B and Llama-3.3-70B-Instruct. Llama-3.3 was selected because its reasoning ability is slightly better than Llama-3.1.

For the distillation model, the author only uses SFT and does not include the RL stage. Although the introduction of RL can greatly improve the performance of the model, the author’s main purpose here is to demonstrate the effectiveness of distillation technology, and the exploration of the RL stage is left to subsequent research.

PS: It would also be possible to use the final DeepSeek-R1 to regenerate the data above and reconstruct the 800,000 samples used for distillation; the distilled models might then perform better, but the cost is that the data would need to be reconstructed.
