Breaking news! A DeepSeek researcher reveals online: R1 training took only two to three weeks, and a powerful evolution of R1-Zero was observed during the Chinese New Year holiday
Just now, we noticed that DeepSeek researcher Daya Guo responded to netizens’ questions about DeepSeek R1 and the company’s plans going forward. We can only say that DeepSeek R1 is just the beginning, and internal research is still advancing rapidly. DeepSeek’s researchers didn’t even take a break during the Chinese New Year holiday; they have been working tirelessly to push the research forward. DeepSeek has some big moves coming up.
Here’s the background: on February 1, Daya Guo posted a tweet revealing what excited him most during the Chinese New Year holiday: witnessing the “continuous growth” of the R1-Zero model’s performance curve, and feeling the powerful force of reinforcement learning (RL)!
DeepSeek AI researcher Daya Guo talks to netizens
Here is a reconstruction of Daya Guo’s conversation with netizens:
User A @PseudoProphet: “Big shot, I want to ask how long this continuous improvement in performance will last. Is this still in the early stages? Does it feel like DeepSeek’s RL model is just getting started, like GPT-2 among language models? Or has it reached a more mature stage like GPT-3.5, and is about to hit a bottleneck?”
This is a very sharp question, which directly relates to the potential of DeepSeek’s RL technology! Daya Guo’s response is also very honest:
Daya Guo: “I think we are still in a very early stage, and there is still a long way to go in the field of RL. But I believe we will see significant progress this year.”
Highlight the key points! “Very early stage”, “a long way to go”, “significant progress this year”! These keywords are packed with information: DeepSeek believes it still has plenty of room for improvement in RL, and the current R1 results may be just the tip of the iceberg, so the future looks promising!
Immediately afterwards, another netizen @kaush_trip (Cheeku Tripathi) asked a more professional question that goes straight to the heart of model capabilities:
User B @kaush_trip: “Based on the performance of R1-Zero, how do you assess whether the model really has generalization ability, or whether it just memorizes state transitions and rewards?”
This question gets right to the point! After all, many models seem very powerful but are really just ‘rote learning’ from the training data, and they fall apart in a different environment. Is DeepSeek R1 really up to scratch?
Daya Guo: “We use a benchmark for domains not covered by the RL prompts to evaluate generalization ability. At present, it seems to have generalization ability.”
The phrase “domains not covered by the RL prompts” is the key! It means DeepSeek is not “cheating” the evaluation with training data; the model is tested on new scenarios it has never seen before, which genuinely reflects its level of generalization. Daya Guo’s rigorous wording, “seems to have”, also makes the claim more realistic and credible.
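To make that idea concrete, here is a minimal sketch of what a held-out generalization check could look like: score the model on benchmarks whose domains appeared in the RL prompts and on benchmarks from domains that did not, then compare. The benchmark names and the evaluate() stub below are hypothetical illustrations, not DeepSeek’s actual evaluation pipeline.

```python
# Minimal sketch of a held-out generalization check.
# Benchmark names and the evaluate() stub are hypothetical, for illustration only.

import random
from statistics import mean

# Domains whose prompts were used during RL training vs. domains that were not.
HELD_IN = ["math_word_problems", "code_contests"]
HELD_OUT = ["logic_puzzles", "open_domain_qa"]

def evaluate(model: str, benchmark: str) -> float:
    """Return accuracy on one benchmark. Placeholder: swap in a real eval harness."""
    random.seed(hash((model, benchmark)) & 0xFFFF)  # deterministic dummy score
    return random.uniform(0.5, 0.9)

def generalization_report(model: str) -> dict:
    held_in = {b: evaluate(model, b) for b in HELD_IN}
    held_out = {b: evaluate(model, b) for b in HELD_OUT}
    # A small gap between held-in and held-out domains suggests genuine
    # generalization rather than memorized prompt/reward patterns.
    gap = mean(held_in.values()) - mean(held_out.values())
    return {"held_in": held_in, "held_out": held_out, "gap": gap}

print(generalization_report("r1-zero-checkpoint"))
```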
Next, a netizen with the ID @teortaxesTex, a big fan of DeepSeek (his comment even included the words “DeepSeek whale cheerleading team”), started from the DeepSeek V3 technical report and asked about model training time:
User C @teortaxesTex: “If it’s not a secret: how long did the RL training take this time? It feels like you already had R1 or at least R1-Zero as early as December 10th, because the V3 technical report mentions that the V2.5 model used R1 knowledge distillation, and the score of V2.5-1210 is the same as the current model. Is this one a continuation of that training?”
This netizen has amazing powers of observation! He was able to extract so many details from the technical report. Daya Guo also patiently explained the iterative process of the model:
Daya Guo: “The 660B-parameter R1-Zero and R1 only started training after the release of V3, and the training took about 2-3 weeks. The R1 model we mentioned before (such as in the V3 technical report) is actually R1-Lite or R1-Lite-Zero.”
So that’s it! The R1-Zero and R1 we see now are the new, upgraded versions, and the earlier R1-Lite series were smaller preliminary versions. It seems DeepSeek has quietly iterated through many versions behind the scenes.
On the subject of training speed, netizen @jiayi_pirate (Jiayi Pan) and User B @kaush_trip followed up with a probing back-of-the-envelope question:
User D @jiayi_pirate: ”10,000 RL steps in 3 weeks, each gradient propagation (grpo) step takes ~3 minutes 🤔”
User B @kaush_trip: ”If each gradient propagation (grpo) step takes ~3 minutes, that’s about 5 steps per hour, 120 steps per day, which is indeed very slow.”
This is a really meticulous calculation! (The “grpo” in the tweets refers to GRPO, Group Relative Policy Optimization, the RL algorithm DeepSeek introduced in its DeepSeekMath work.) By the netizens’ reckoning, the training speed of DeepSeek R1 is indeed not fast, which also shows how large the cost and time investment behind such a high-performance RL model are. “Slow work produces fine work” seems a pretty apt description of AI model training; a quick arithmetic check of the quoted figures follows.
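As a sanity check on those back-of-the-envelope numbers (pure arithmetic on the publicly quoted totals, not any internal DeepSeek data): 10,000 steps spread over roughly 3 weeks does work out to about 3 minutes per GRPO step, matching @jiayi_pirate’s estimate, though the implied throughput is closer to 20 steps per hour than the 5 quoted above.

```python
# Back-of-the-envelope check of the publicly quoted figures:
# ~10,000 RL steps completed in roughly 3 weeks of training.

total_steps = 10_000
weeks = 3
total_minutes = weeks * 7 * 24 * 60          # 30,240 minutes in 3 weeks

minutes_per_step = total_minutes / total_steps
steps_per_hour = 60 / minutes_per_step
steps_per_day = steps_per_hour * 24

print(f"{minutes_per_step:.1f} min per GRPO step")  # ~3.0 minutes
print(f"{steps_per_hour:.0f} steps per hour")       # ~20 steps
print(f"{steps_per_day:.0f} steps per day")         # ~480 steps
```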
Finally, a netizen named @davikrehalt (Andy Jiang) asked a question from a more cutting-edge application perspective:
User E @davikrehalt: “Have you tried applying RL to formal proof environments, instead of just answering questions? It would be great if an open-source model could win a gold medal at the IMO (International Mathematical Olympiad) this year! (And there’s more to hope for!)”
Formal proofs! An IMO gold medal! This netizen is ambitious indeed! And applying AI to the hardcore field of mathematical proof really is where things are headed. Daya Guo’s reply is once again surprising:
Daya Guo: “We are also trying to apply R1 to formal proof environments such as Lean. We hope to release better models to the community soon.”
From Daya Guo’s words, it seems that they have already made progress in this area, and there may be even more impressive models released in the future!
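For readers who have never seen a formal proof environment: in a system like Lean, a statement only counts as proved when the proof checker accepts it, which gives an unambiguous pass/fail signal that is attractive as an RL reward. A trivial, purely illustrative Lean 4 example (not DeepSeek-specific) looks like this:

```lean
-- Trivial Lean 4 theorem: commutativity of natural-number addition.
-- The kernel either accepts this proof term or rejects it outright;
-- there is no partial credit, which is what makes formal proof a clean
-- reward signal for reinforcement learning.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```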
In closing
Three key signals can be distilled from Daya Guo’s response:
Technical positioning: RL is still in its early stages, and performance improvements are far from reaching their limits;
Verification logic: generalization is verified through cross-domain testing, ruling out mere memorization;
Application boundaries: from language models to mathematical proofs, RL is moving towards higher-order reasoning.