What Is Behind The DeepSeek Hype?
The tech industry has been rocked by the DeepSeek R1 model since its January 2025 release. Is the hype justified? This series of posts examines the technology behind the R1 model.
DeepSeek R1, an open-source large language model released by a Chinese startup in January 2025, has garnered significant attention. Developed on a tight budget and timeline, it rivals the industry's top LLMs. While concerns persist around security, safety, veracity, and accuracy, the technical innovations behind the model are undeniable. This three-part series focuses exclusively on these technical advancements.
This is the first post in a three-part series. The other two posts are Model Architecture Behind DeepSeek R1 and DeepSeek's Model Training Methodology.
First, Some Context
To better understand DeepSeek's innovations and contributions, it is crucial to first examine the broader context surrounding the rapid rise of Large Language Models (LLMs) and current industry practices and trends. The way for the GenAI revolution was paved by a long list of innovations in deep learning and NLP research over the past few decades. However, for the sake of time, we will keep our highly opinionated review short and focus only on a few recent innovations, starting with the transformer architecture.
Transformer Architecture
The 2017 paper "Attention Is All You Need" introduced the Transformer, an architecture built around the Attention mechanism, which allows the model to focus on different parts of a sequence to effectively complete the task. For example, given the sentence "he walked to her home" and asked to classify the gender of the main character, the Attention mechanism would give more weight to the word "he". The diagram below shows the encoder part of the encoder-decoder transformer architecture from that paper.

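To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core computation from the paper. The tokens and dimensions are illustrative, not taken from any real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                        # weighted average of value vectors

# Toy example: 5 tokens ("he walked to her home"), embedding size 4
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)                              # (5, 4)
```

Each output row is a weighted average of the value vectors, with the weights determined by how relevant each key is to that query.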
Mixture of Experts (MoE) Architecture
The Mixture of Experts (MoE) architecture then emerged, creating a sparser network with specialized segments called experts. Rather than passing every token through one large feedforward network, each token activates only a subset of many smaller feedforward networks. This decreased the computational requirements during both training and inference, allowing larger networks to be trained on more data. The figure below shows how the feedforward network in the transformer's decoder (highlighted in blue) was broken down into multiple smaller FFNs in MoE.

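Here is a minimal sketch of the routing idea, assuming a simple softmax gate and top-k expert selection; the expert and gate shapes are made up for illustration.

```python
import numpy as np

def moe_layer(x, experts, gate_W, k=2):
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    logits = x @ gate_W                              # (tokens, n_experts) routing scores
    topk = np.argsort(logits, axis=-1)[:, -k:]       # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max()); w /= w.sum()    # softmax over the selected experts only
        for weight, e in zip(w, topk[t]):
            out[t] += weight * experts[e](x[t])      # only k of the experts run per token
    return out

# Toy setup: 8 tiny "expert" FFNs, of which 2 are active per token
rng = np.random.default_rng(0)
d, n_experts = 4, 8
mats = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
experts = [(lambda v, m=m: np.maximum(v @ m, 0)) for m in mats]
gate_W = rng.standard_normal((d, n_experts))
print(moe_layer(rng.standard_normal((3, d)), experts, gate_W).shape)  # (3, 4)
```

Because only k experts execute per token, total parameter count can grow without a proportional increase in per-token compute.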
Two Stage Training Pipeline - GPT, SFT and RLHF
The next big step was in the training methodology used for LLMs. The 2018 GPT paper introduced a two-stage training pipeline, consisting of unsupervised Generative Pre-Training (GPT) on massive datasets followed by Supervised Fine Tuning (SFT). In some sense, this paper marked the true birth of the generative AI revolution. The 2022 InstructGPT paper then added a third step: Reinforcement Learning from Human Feedback (RLHF). This step uses human annotators to score the model's output and then fine-tunes the model further, aligning it with human expectations.

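Both stages optimize essentially the same next-token prediction objective; what changes is the data. A minimal PyTorch-style sketch, where `model` and `format_prompt_response` are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Shift-by-one language-modeling loss used in both pre-training and SFT."""
    logits = model(tokens[:, :-1])                 # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

# Stage 1: unsupervised pre-training on raw web text
#   loss = next_token_loss(model, web_text_batch)
# Stage 2: supervised fine-tuning on (instruction, response) pairs
#   loss = next_token_loss(model, format_prompt_response(batch))
```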
Chain of Thought
The focus then shifted to post-training and test-time improvements via prompt engineering, specifically Chain of Thought reasoning. Chain of Thought prompting decomposes complex tasks into a series of logical sub-tasks, guiding the model to reason in a human-like manner. Finally, ChatGPT was introduced as a user-friendly chat interface.
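As an illustration, compare a standard prompt with a chain-of-thought prompt that includes a worked example; the wording here is our own, not from any specific paper.

```python
# Standard prompt: asks for the answer directly
standard = "Q: A store had 23 apples, sold 9, and received 12 more. How many now? A:"

# Chain-of-thought prompt: demonstrates intermediate reasoning steps
cot = (
    "Q: A store had 23 apples, sold 9, and received 12 more. How many now?\n"
    "A: Let's think step by step. "
    "The store starts with 23 apples. Selling 9 leaves 23 - 9 = 14. "
    "Receiving 12 more gives 14 + 12 = 26. The answer is 26."
)
# Including worked examples like `cot` in the prompt guides the model to
# produce its own intermediate reasoning before committing to a final answer.
```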
Scaling Laws
The last stop on our tour is scaling laws. This theoretical framework drove the industry's progression from pre-training scaling to post-training scaling and, finally, to test-time scaling.
The 2020 publication Scaling Laws for Neural Language Models showed that LLM performance improves predictably, following a power law, as model size, dataset size, and training compute increase (a toy sketch of this power-law form follows the list below). This spurred a focus on scaling:
Initially, pre-training scaling was the focus, with efforts to increase the size of models, pre-training datasets, and compute clusters.
When the limits of internet data were reached, the focus shifted to post-training scaling, using Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine Tuning (SFT). High-quality, human-annotated, task-specific data for fine tuning became a key differentiator for many companies.
As post-training gains diminished, test-time scaling came into focus, using prompt engineering and Chain of Thought reasoning.
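As promised above, here is a toy sketch of the power-law form from the paper. The constants are approximately the values fitted for model size in the original publication, but this is an illustration, not a reproduction of its results.

```python
# L(N) ~ (Nc / N)^alpha: loss falls predictably as parameter count N grows.
# Nc and alpha below are roughly the model-size constants reported in
# "Scaling Laws for Neural Language Models" (Kaplan et al., 2020).
def loss_vs_params(N, Nc=8.8e13, alpha=0.076):
    return (Nc / N) ** alpha

for N in [1e8, 1e9, 1e10, 1e11]:
    print(f"{N:.0e} params -> predicted loss {loss_vs_params(N):.3f}")
```

The key practical implication was predictability: labs could forecast the benefit of a bigger model or dataset before spending the compute.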
Enter DeepSeek
The previous section highlighted that, over the past few years, the industry focused more on scaling than on model architecture or training methodology. This changed with DeepSeek's announcement of its R1 model on January 20, 2025. US export restrictions on China seemingly limited scaling as an option for improving LLMs, forcing innovation across multiple aspects of model building, which can be grouped into three areas.
Model Architecture
DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
Only 37 billion parameters activated out of 671 billion for each token
MLA - Multi-Head Latent Attention
Compressing key/value vectors using down-projection and up-projection matrices for more efficient memory and compute use (a toy sketch follows this list)
Multi-Token Prediction Training Objective
Predicting more than one token at a time, further optimizing compute
Enables speculative decoding to speed up inference
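Referring back to the MLA item above, here is a toy NumPy sketch of the down-/up-projection idea: a small latent vector is cached per token instead of the full keys and values. All dimensions are made up for illustration.

```python
import numpy as np

# Cache a small latent c per token; reconstruct keys/values from it on the fly.
d_model, d_latent, d_head = 512, 64, 128
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # shared down-projection
W_up_k = rng.standard_normal((d_latent, d_head)) * 0.02    # up-projection for keys
W_up_v = rng.standard_normal((d_latent, d_head)) * 0.02    # up-projection for values

h = rng.standard_normal((10, d_model))   # hidden states for 10 tokens
c = h @ W_down                           # (10, 64) latent -- this is what gets cached
k = c @ W_up_k                           # keys reconstructed from the latent
v = c @ W_up_v                           # values reconstructed from the latent
print(c.nbytes, k.nbytes + v.nbytes)     # cached bytes vs. full K/V bytes for one head
```

The savings compound across attention heads, since one shared latent replaces per-head key and value caches.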
Training Methodology
Direct Reinforcement Learning on the Base Model
No supervised fine tuning (SFT)
Surfaced emergent CoT behaviors such as self-verification and reflection
New Group Relative Policy Optimization (GRPO)
Estimates the advantage baseline from group scores instead of using a critic model (a minimal sketch follows this list)
Rule-based rewards combining accuracy and format rewards
New Four Stage Training Pipeline for Final Model
Cold start data
Reasoning oriented RL
Rejection sampling and SFT
RL across all scenarios for alignment
Distillation of Reasoning Patterns
Data generated by DeepSeek-R1 used to fine-tune smaller dense models such as Qwen and Llama
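Referring back to the GRPO item above, here is a minimal sketch of the group-relative advantage computation described in the DeepSeek papers: rewards within a group of sampled answers are normalized against the group's own mean and standard deviation, removing the need for a learned critic. The reward values below are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style baseline: normalize each sample's reward by its group's statistics."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # no critic model needed

# One prompt, a group of 4 sampled answers scored by rule-based rewards
# (e.g., 1.0 if the final answer is correct and well formatted, else 0.0)
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))      # answers above the group mean get positive advantage
```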
Training Framework
FP8 Mixed Precision Training Framework
Earlier quantization approaches converted weights from FP32 to FP8 after model training; DeepSeek instead uses FP8 during training itself (a simplified sketch follows this list)
DualPipe Algorithm for Pipeline Parallelism
Bidirectional pipeline scheduling and overlapping communication with computation
Reduces pipeline bubbles and communication overhead introduced by cross-node expert parallelism
Near-zero communication overhead while scaling the model and employing fine-grained experts across nodes
Better Utilization of InfiniBand and NVLink Bandwidths
Improved cross-node communication
20 of the 132 streaming multiprocessors (SMs) on each H800 dedicated to managing cross-node communication
Memory Optimizations
Selective compression and caching
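Referring back to the FP8 item above, here is a heavily simplified sketch of the mixed-precision idea: master weights stay in full precision while the expensive matrix multiplies run in low precision. NumPy has no FP8 type, so float16 stands in, and the "gradient" is a toy placeholder.

```python
import numpy as np

def low_precision_matmul(a, b):
    """Run the costly matmul in low precision, accumulate the result in FP32."""
    return (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # master weights kept in FP32
x = rng.standard_normal((8, 64)).astype(np.float32)

y = low_precision_matmul(x, W)          # forward pass in low precision
grad_W = x.T @ (y * 0.01)               # toy "gradient" for illustration only
W -= 1e-3 * grad_W                      # update applied to the FP32 master copy
print(W.dtype, y.dtype)                 # float32 float32
```

Keeping a high-precision master copy bounds the error that accumulates from many low-precision updates, which is what makes training (not just inference) in reduced precision viable.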
Conclusion
This concludes the first post in our three-part series on DeepSeek. We explored the past few years, when scaling laws drove the industry to focus on scaling datasets, compute, and model size to enhance LLMs. We briefly discussed key innovations - the Transformer architecture, the Mixture of Experts architecture, the two-stage training pipeline, prompt engineering, and Chain of Thought - that facilitated this scaling. We also took a quick look at DeepSeek's innovations, driven by the scaling limitations imposed by export restrictions. The technical details of these innovations are explored in the next two posts: Model Architecture Behind DeepSeek R1 and DeepSeek's Model Training Methodology.