Native name | 杭州深度求索人工智能基础技术研究有限公司 |
---|---|
Company type | Private |
Industry | Information technology; artificial intelligence |
Founded | 17 July 2023 [1] |
Founder | Liang Wenfeng |
Headquarters | Hangzhou, Zhejiang, China |
Key people | Liang Wenfeng (CEO) |
Owner | High-Flyer |
Number of employees | 160 (2025) [2] |
Website | www |
Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., [3] [4] [5] [a] doing business as DeepSeek, [b] is a Chinese artificial intelligence company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, it is owned and funded by the Chinese hedge fund High-Flyer. DeepSeek was founded in July 2023 by Liang Wenfeng, the co-founder of High-Flyer, who also serves as the CEO for both companies. [7] [8] [9] The company launched an eponymous chatbot alongside its DeepSeek-R1 model in January 2025.
Released under the MIT License, DeepSeek-R1 provides responses comparable to those of other contemporary large language models, such as OpenAI's GPT-4o and o1. [10] Its training cost is reported to be significantly lower than that of other LLMs. The company claims that it trained its V3 model for US$6 million, compared to $100 million for OpenAI's GPT-4 in 2023, [11] and with approximately one-tenth of the computing power used for Meta's comparable model, Llama 3.1. [11] [12] [13] [14] DeepSeek's success against larger and more established rivals has been described as "upending AI". [15] [16]
DeepSeek's models are "open weight", which provides less freedom for modification than true open-source software. [17] [18] The company reportedly recruits AI researchers from top Chinese universities [15] and hires from outside the computer science field to diversify its models' knowledge and abilities. [12]
The DeepSeek-R1 model was trained at a significantly lower cost than other models by using techniques such as mixture of experts to reduce expenses. [19] The model was also trained during ongoing trade restrictions on AI chip exports to China, forcing it to be trained on weaker AI chips made for export to China [13] and on fewer chips than other models. [15] This breakthrough in reducing expenses while maintaining the model's performance and quality sent "shockwaves" through the market. It threatened the dominance of AI leaders like Nvidia and contributed to the largest drop for a single company in US stock market history, as Nvidia lost $600 billion in market value. [20] [21]
In February 2016, High-Flyer was co-founded by AI enthusiast Liang Wenfeng, who had been trading since the 2007–2008 financial crisis while attending Zhejiang University. [22] The company began stock trading using a GPU-dependent deep learning model on 21 October 2016. Prior to this, it used CPU-based models, mainly linear models. Most trading was driven by AI by the end of 2017. [23]
In 2019, Liang established High-Flyer as a hedge fund focused on developing and using AI trading algorithms. By 2021, High-Flyer exclusively used AI in trading, [24] often using Nvidia chips. [25]
The initial computing cluster, Fire-Flyer, began construction in 2019 and was completed in 2020 at a cost of 200 million yuan. It contained 1,100 GPUs interconnected at a rate of 200 Gbit/s and was retired after 1.5 years in operation. [23]
In 2021, Liang began stockpiling Nvidia GPUs for an AI project. [25] According to 36Kr, Liang acquired 10,000 Nvidia A100 GPUs [26] before the United States restricted chip sales to China. [24] Computing cluster Fire-Flyer 2 began construction in 2021 with a budget of 1 billion yuan. [23]
It was reported that in 2022, Fire-Flyer 2's capacity had been used at over 96%, totaling 56.74 million GPU hours. 27% was used to support scientific computing outside the company. [23]
During 2022, Fire-Flyer 2 had 5000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. It exclusively used PCIe A100s rather than the DGX version because, at the time, the models it trained could fit within the 40 GB VRAM of a single GPU, so the higher bandwidth of DGX was unnecessary (i.e., training required only data parallelism, not model parallelism). [27] Later, it incorporated NVLinks and NCCL to train larger models that required model parallelism. [28] [29]
On 14 April 2023, [30] High-Flyer announced the start of an artificial general intelligence lab dedicated to research and development of AI tools, separate from High-Flyer's financial business. [31] [32] Incorporated on 17 July 2023, [1] with High-Flyer as the investor and backer, the lab became its own company, DeepSeek. [24] [33] [32] Venture capital firms were reluctant to provide funding, as they considered it unlikely that the venture would be able to quickly generate an "exit". [24]
On 16 May 2023, Beijing DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. was incorporated. It was later brought under 100% control of Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., which was incorporated two months later.[ citation needed ]
On 2 November 2023, DeepSeek released its first model, DeepSeek Coder. On 29 November 2023, DeepSeek released the DeepSeek-LLM series of models. [34] : section 5 On 9 January 2024, it released two DeepSeek-MoE models (Base and Chat), [35] and in April 2024, three DeepSeek-Math models: Base, Instruct, and RL. [36]
DeepSeek-V2 was released in May 2024, followed the next month by the DeepSeek-Coder V2 series. [37] DeepSeek-V2.5 was released in September and updated in December. [38] On 20 November, DeepSeek-R1-Lite-Preview became accessible via API and chat. [39] [40] In December, the company released the base model DeepSeek-V3-Base and the chat model DeepSeek-V3. [28]
On 20 January 2025, DeepSeek released the DeepSeek chatbot, based on the DeepSeek-R1 model, free of charge for iOS and Android. By 27 January, DeepSeek had surpassed ChatGPT as the most downloaded freeware app on the iOS App Store in the United States, [15] causing Nvidia's share price to drop by 18%. [41] [42]
Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by the Chinese hedge fund High-Flyer, co-founded by Liang Wenfeng, who also serves as DeepSeek's CEO. As of May 2024, Liang owned 84% of DeepSeek through two shell corporations. [note 1] [43]
DeepSeek is focused on research and has no detailed plans for commercialization. [44] This strategy allows its technology to avoid the most stringent provisions of China's AI regulations, such as requiring consumer-facing technology to comply with government controls on information. [12]
DeepSeek's hiring preferences target technical abilities rather than work experience; most new hires are either recent university graduates or developers whose AI careers are less established. [32] [12] Likewise, the company recruits individuals without a computer science background to help its technologists cover additional knowledge areas, [15] such as poetry and China's notoriously difficult college admissions exam, the Gaokao. [12]
High-Flyer/DeepSeek operates at least two computing clusters, Fire-Flyer (萤火一号) and Fire-Flyer 2 (萤火二号). Fire-Flyer 2 consists of co-designed software and hardware architecture. On the hardware side, Nvidia GPUs use 200 Gbps interconnects. The cluster is divided into two "zones", and the platform supports cross-zone tasks. The network topology was two fat trees, chosen for high bisection bandwidth. On the software side are: [29] [23]
- 3FS (Fire-Flyer File System): A distributed parallel file system, specifically designed for asynchronous random reads. It uses Direct I/O and RDMA Read. In contrast to standard Buffered I/O, Direct I/O does not cache data; caching is useless in this case, since each data read is random and is not reused. [45] [46]
- hfreduce: A library for asynchronous communication, originally designed to replace Nvidia Collective Communication Library (NCCL). [27] It is mainly used for allreduce, especially of gradients during backpropagation. It runs asynchronously on the CPU to avoid blocking kernels on the GPU, [29] and uses two-tree broadcast like NCCL. [27]
- hfai.nn: A software library of commonly used operators for neural network training, similar to torch.nn in PyTorch.
- HaiScale Distributed Data Parallel (DDP): A parallel training library that implements various forms of parallelism such as Data Parallelism (DP), Pipeline Parallelism (PP), Tensor Parallelism (TP), Experts Parallelism (EP), Fully Sharded Data Parallel (FSDP) and Zero Redundancy Optimizer (ZeRO). It is similar to PyTorch DDP, which uses NCCL on the backend (see the sketch below).
- HAI Platform: Various applications such as task scheduling, fault handling, and disaster recovery. [47]

As of 2022, Fire-Flyer 2 had 5000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. [27] It later incorporated NVLinks and NCCL to train larger models that required model parallelism. [28] [29]
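HaiScale DDP is described above as similar to PyTorch's DistributedDataParallel, which allreduces gradients over NCCL during the backward pass. The following is a minimal sketch of that standard PyTorch pattern, not DeepSeek's HaiScale or hfreduce code; the toy model and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # NCCL backend, as in PyTorch DDP
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun: one process per GPU
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()         # toy stand-in for a real model
    ddp_model = DDP(model, device_ids=[local_rank])    # gradients are allreduced automatically

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    x = torch.randn(8, 4096, device="cuda")
    loss = ddp_model(x).pow(2).mean()                  # dummy loss
    loss.backward()                                    # allreduce overlaps with the backward pass
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with torchrun (one process per GPU), each process drives one device while gradient allreduce overlaps with backward computation, which is the general behavior the comparison above refers to.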
Major versions | Release date | Major variants | Remarks |
---|---|---|---|
DeepSeek Coder | 2 Nov 2023 | Base (pretrained); Instruct (instruction-finetuned) | The architecture is essentially the same as Llama. |
DeepSeek-LLM | 29 Nov 2023 | Base; Chat (with SFT) | |
DeepSeek-MoE | 9 Jan 2024 | Base; Chat | Developed a variant of mixture of experts (MoE). |
DeepSeek-Math | Apr 2024 | Base (initialized from DS-Coder-Base-v1.5); Instruct (with SFT); RL (using a process reward model) | Developed Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). |
DeepSeek V2 | May 2024 | DeepSeek-V2, DeepSeek-V2-Chat; DeepSeek-V2-Lite, DeepSeek-V2-Lite-Chat; DeepSeek-Coder-V2; DeepSeek-V2.5 | Developed multi-head latent attention (MLA). Also used mixture of experts (MoE). Implemented KV caching. |
DeepSeek V3 | Dec 2024 | DeepSeek-V3-Base; DeepSeek-V3 (a chat model) | The architecture is essentially the same as V2. |
DeepSeek R1 | 20 Nov 2024 | DeepSeek-R1-Lite-Preview | Accessible only through API and a chat interface. |
DeepSeek R1 | 20 Jan 2025 | DeepSeek-R1; DeepSeek-R1-Zero | Initialized from DeepSeek-V3-Base and sharing the V3 architecture. |
DeepSeek R1 | 20 Jan 2025 | Distilled models | Initialized from other models, such as Llama and Qwen; distilled from data synthesized by R1 and R1-Zero. [48] |
The first DeepSeek models were essentially the same as Llama, [34] being dense decoder-only transformers. Later models incorporated multi-head latent attention (MLA), mixture of experts (MoE), and KV caching. [35] [37] [ verification needed ]
A decoder-only transformer consists of multiple identical decoder layers. Each of these layers features two main components: an attention layer and a FeedForward network (FFN) layer. [37] In the attention layer, the traditional multi-head attention mechanism has been enhanced with multi-head latent attention. This update introduces compressed latent vectors to boost performance and reduce memory usage during inference. [37] [ citation needed ]
Meanwhile, the FFN layer adopts a variant of the mixture of experts (MoE) approach, effectively doubling the number of experts compared to standard implementations. It distinguishes between two types of experts: shared experts, which are always active to encapsulate general knowledge, and routed experts, only a select few of which are activated to capture specialized information. [35] [ citation needed ]
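As a rough illustration of this shared/routed split, the following is a minimal PyTorch sketch of an MoE feedforward layer. The layer sizes, number of experts, and top-k routing value are arbitrary placeholders, not DeepSeek's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList([expert() for _ in range(n_shared)])  # always active
        self.routed = nn.ModuleList([expert() for _ in range(n_routed)])  # sparsely activated
        self.router = nn.Linear(d_model, n_routed)                        # per-token routing scores
        self.top_k = top_k

    def forward(self, x):                                # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)             # shared experts capture general knowledge
        weights = F.softmax(self.router(x), dim=-1)      # (batch, seq, n_routed)
        top_w, top_i = weights.topk(self.top_k, dim=-1)  # only the top-k routed experts fire
        for slot in range(self.top_k):
            for j, e in enumerate(self.routed):
                mask = (top_i[..., slot] == j).unsqueeze(-1)        # tokens routed to expert j
                out = out + mask * top_w[..., slot:slot+1] * e(x)   # weighted routed contribution
        return out

y = MoEFeedForward()(torch.randn(2, 16, 512))   # output shape matches the input: (2, 16, 512)
```

For clarity, this sketch evaluates every routed expert on every token and masks the results; efficient implementations dispatch only the selected tokens to each expert.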
Consider the current sequence of n tokens as input. To predict the next token based on the current input, the attention mechanism involves extensive calculations of matrices, including query (Q), key (K), and value (V) matrices. The dimensions of Q, K, and V are determined by the current number of tokens and the model’s embedding size. Once the new token is generated, the autoregressive procedure appends it to the end of the input sequence, and the transformer layers repeat the matrix calculation for the next token. A mathematical analysis reveals that the new token introduces a new query, key, and value vector, appended to Q, K, and V, respectively. Appending these new vectors to the K and V matrices is sufficient for calculating the next token prediction. Consequently, storing the current K and V matrices in memory saves time by avoiding the recalculation of the attention matrix. This feature is known as K-V caching. [37] [ verification needed ] This technique effectively reduces computational cost during inference.
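A minimal sketch of this caching pattern, using single-head attention and illustrative dimensions (not DeepSeek's implementation): only the new token's key and value vectors are computed and appended each step.

```python
import torch

def attend_with_cache(q_new, k_new, v_new, cache):
    """q_new, k_new, v_new: (1, d) vectors for the newly generated token;
    cache holds the K and V matrices of all previously seen tokens."""
    cache["K"] = torch.cat([cache["K"], k_new], dim=0)   # append the new key instead of recomputing K
    cache["V"] = torch.cat([cache["V"], v_new], dim=0)   # append the new value
    scores = (q_new @ cache["K"].T) / cache["K"].shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)              # attention over all cached tokens
    return weights @ cache["V"]                          # attention output for the new token only

d = 64
cache = {"K": torch.zeros(0, d), "V": torch.zeros(0, d)}
for _ in range(5):                                       # toy autoregressive decode loop
    q, k, v = (torch.randn(1, d) for _ in range(3))      # in a real model these come from the new token
    out = attend_with_cache(q, k, v, cache)
```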
DeepSeek's models are "open weight", which provides less freedom for modification than true open source software. [17] [49]
DeepSeek Coder is a series of eight models, four pretrained (Base) and four instruction-finetuned (Instruct). All have 16K context lengths. The models were made source-available under the DeepSeek License, which includes "open and responsible downstream usage" restrictions. [50]
The training program proceeded in stages, producing first the Base models and then the Instruct models via instruction finetuning. [51] [52] [53]

They were trained on clusters of A100 and H800 Nvidia GPUs, connected by InfiniBand, NVLink, and NVSwitch. [51]
Params. | Layers | Model dim. | FFN dim. | Attention heads | KV heads |
---|---|---|---|---|---|
1.3B | 24 | 2048 | 5504 | 16 | 16 |
5.7B | 32 | 4096 | 11008 | 32 | 1 [note 2] |
6.7B | 32 | 4096 | 11008 | 32 | 32 |
33B | 62 | 7168 | 19200 | 56 | 7 [note 2] |
The DeepSeek-LLM series was released in November 2023. It has 7B and 67B parameters in both Base and Chat forms. DeepSeek's accompanying paper claimed benchmark results higher than Llama 2 and most open-source LLMs at the time. [34] : section 5 The model code is under the source-available DeepSeek License. [55]
The architecture was essentially the same as the Llama series. They used the pre-norm decoder-only Transformer with RMSNorm as the normalization, SwiGLU in the feedforward layers, rotary positional embedding (RoPE), and grouped-query attention (GQA). Both had vocabulary size 102,400 (byte-level BPE) and context length of 4096. They trained on 2 trillion tokens of English and Chinese text obtained by deduplicating the Common Crawl. [34]
Params. | Layers | Model dim. | FFN dim. | Attention heads | KV heads |
---|---|---|---|---|---|
7B | 30 | 4096 | 11008 | 32 | 32 |
67B | 95 | 8192 | 22016 | 64 | 8 [note 2] |
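The components named above, RMSNorm and the SwiGLU feedforward block, can be sketched as follows. This is a generic PyTorch illustration rather than DeepSeek's code; the 4096/11008 widths are taken from the 7B row of the table above as an example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features (no mean subtraction, unlike LayerNorm).
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: a SiLU-gated linear unit, as used in Llama-style feedforward layers.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 4096)
y = SwiGLU(4096, 11008)(RMSNorm(4096)(x))   # 11008 matches the 7B FFN width in the table above
```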
The Chat versions of the two Base models were released concurrently, obtained by training Base with supervised finetuning (SFT) followed by direct preference optimization (DPO). [34]
The DeepSeek-MoE models (Base and Chat) each have 16B parameters (2.7B activated per token, 4K context length). The training was essentially the same as for DeepSeek-LLM 7B, using a part of its training dataset. DeepSeek claimed that the 16B MoE performed comparably to a 7B non-MoE model. It is a variant of the standard sparsely-gated MoE, with "shared experts" that are always queried and "routed experts" that might not be. They found this to help with expert balancing: in standard MoE, some experts can become overused while others are rarely used, wasting capacity, and attempting to balance expert usage causes different experts to learn the same capabilities. They proposed the shared experts to learn core capabilities that are often used, letting the routed experts learn peripheral capabilities that are rarely used. [35]
DeepSeek-Math includes 3 models: Base, Instruct, and RL. Base was initialized from DeepSeek-Coder-Base-v1.5; Instruct was obtained by supervised finetuning (SFT); and RL was trained using Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), with a process reward model. [36]
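A core difference between GRPO and PPO is that GRPO estimates each sampled answer's advantage relative to a group of answers for the same prompt, rather than using a learned value model. Below is a minimal sketch of that group-relative normalization only; the full objective (clipped policy ratio, KL penalty) is omitted, and the example rewards are made up.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for completions sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)   # normalize within the group

# Example: five sampled answers to one problem, scored by a reward model or a rule-based check.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])))
```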
In May 2024, DeepSeek released the DeepSeek-V2 series. The series includes 4 models: 2 base models (DeepSeek-V2, DeepSeek-V2 Lite) and 2 chatbots (Chat). The two larger models were pretrained and then aligned with supervised finetuning (SFT) followed by a two-stage reinforcement learning (RL) process. [57]
They opted for 2-staged RL, because they found that RL on reasoning data had "unique characteristics" different from RL on general data. For example, RL on reasoning could improve over more training steps. [57]
The two V2-Lite models were smaller, and trained similarly. DeepSeek-V2 Lite-Chat underwent only SFT, not RL. They trained the Lite version to help "further research and development on MLA and DeepSeekMoE". [57]
Architecturally, the V2 models were significantly different from the DeepSeek LLM series. They replaced the standard attention mechanism with a low-rank approximation called multi-head latent attention (MLA), and used the previously published mixture of experts (MoE) variant. [35]
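A rough sketch of the low-rank idea behind MLA follows: keys and values are reconstructed from a small cached latent vector instead of being cached at full width. The dimensions are illustrative, and many details of the published design (such as the handling of positional embeddings) are omitted.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # compress hidden state into a latent vector
        self.up_k = nn.Linear(d_latent, d_model)   # reconstruct keys from the latent
        self.up_v = nn.Linear(d_latent, d_model)   # reconstruct values from the latent

    def forward(self, h):                          # h: (batch, seq, d_model)
        c = self.down(h)                           # only c needs to be cached at inference time
        return self.up_k(c), self.up_v(c)          # full-width K and V, recovered from the latent

layer = LatentKV()
h = torch.randn(1, 8, 4096)
k, v = layer(h)   # cache per token: 512 floats instead of 2 x 4096
```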
Name | Params. | Active params | Layers | Context length | Shared experts | Routed experts |
---|---|---|---|---|---|---|
V2-Lite | 15.7B | 2.4B | 27 | 32K | 2 | 64 |
V2 | 236B | 21B | 60 | 128K | 2 | 160 |
The Financial Times reported that it was cheaper than its peers with a price of 2 RMB for every million output tokens. The University of Waterloo Tiger Lab's leaderboard ranked DeepSeek-V2 seventh on its LLM ranking. [33]
The DeepSeek-Coder V2 series included V2-Base, V2-Lite-Base, V2-Instruct, and V2-Lite-Instruct. [37] [note 3]
DeepSeek-V2.5 was made by combining DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct. [38]
DeepSeek-V3-Base and DeepSeek-V3 (a chat model) use essentially the same architecture as V2, with the addition of multi-token prediction, which (optionally) decodes extra tokens faster but less accurately. [28]
Name | Params. | Active params | Layers | Context length | Shared experts | Routed experts |
---|---|---|---|---|---|---|
V3 | 671B | 37B | 61 | 128K | 1 | 256 |
The DeepSeek team performed extensive low-level engineering to improve efficiency. They used mixed-precision arithmetic. Much of the forward pass was performed in 8-bit floating point numbers (5E2M: 5-bit exponent and 2-bit mantissa) rather than the standard 32-bit, requiring special GEMM routines to accumulate accurately. They used a custom 12-bit float (E5M6) only for the inputs to the linear layers after the attention modules. Optimizer states were in 16-bit (BF16). They minimized communication latency by extensively overlapping computation and communication, such as dedicating 20 streaming multiprocessors out of 132 per H800 for only inter-GPU communication. They lowered communication by rearranging (every 10 minutes) the exact machine each expert was on so as to avoid querying certain machines more often than others, adding auxiliary load-balancing losses to the training loss function, and other load-balancing techniques. [28]
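As a loose illustration of the storage-versus-accumulation trade-off described above (not DeepSeek's custom GEMM kernels), recent PyTorch releases expose an 8-bit float storage dtype; the example casts tensors to E5M2 for storage and upcasts before the matrix multiply so accumulation happens in 32-bit.

```python
import torch

w = torch.randn(256, 256)
x = torch.randn(16, 256)

# Low-precision storage: 5-bit exponent, 2-bit mantissa (torch.float8_e5m2, PyTorch >= 2.1).
w8 = w.to(torch.float8_e5m2)
x8 = x.to(torch.float8_e5m2)

# Upcast before the matmul so accumulation happens in higher precision;
# real FP8 GEMMs instead use specialized kernels with high-precision accumulators.
y = x8.to(torch.float32) @ w8.to(torch.float32).T

print(y.dtype, (y - x @ w.T).abs().max())   # error introduced by the FP8 round-trip
```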
After training, it was deployed on clusters of H800 GPUs. The 8 H800 GPUs within a cluster were connected by NVLink, and the clusters were connected by InfiniBand. [28]
Stage | Cost (thousands of GPU hours) | Cost (millions of US$) |
---|---|---|
Pre-training | 2,664 | 5.328 |
Context extension | 119 | 0.24 |
Fine-tuning | 5 | 0.01 |
Total | 2,788 | 5.576 |
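The dollar figures in the table correspond to the GPU-hour figures at an implied rate of about US$2 per GPU hour, a rate implied by the table itself rather than stated in the text:

$$2{,}788{,}000 \text{ GPU-hours} \times \$2/\text{GPU-hour} = \$5{,}576{,}000 \approx \$5.58 \text{ million}$$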
The cost has been discussed [62] [63] [64] and called misleading, because it covers only parts of the true cost. [65]
Benchmark tests show that V3 outperformed Llama 3.1 and Qwen 2.5 while matching GPT-4o and Claude 3.5 Sonnet. [32] [66] [67] [68]
In January 2025, DeepSeek released the DeepSeek-R1 model under the MIT License. [69]
DeepSeek-R1-Lite-Preview [39] [40] [note 4] was trained for logical inference, mathematical reasoning, and real-time problem-solving. DeepSeek claimed that it exceeded performance of OpenAI o1 on benchmarks such as American Invitational Mathematics Examination (AIME) and MATH. [70] However, The Wall Street Journal reported that on 15 problems from the 2024 edition of AIME, the o1 model reached a solution faster. [71]
DeepSeek-R1 and DeepSeek-R1-Zero [72] were initialized from DeepSeek-V3-Base and share its architecture. DeepSeek-R1-Distill models were instead initialized from other pretrained open-weight models, including LLaMA and Qwen, then fine-tuned on synthetic data generated by R1. [48]
DeepSeek-R1-Zero was trained with the following prompt template:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: <prompt>. Assistant:
DeepSeek-R1-Zero was trained exclusively using GRPO RL without SFT. Unlike previous versions, it used no model-based reward. All reward functions were rule-based, "mainly" of two types (other types were not specified): accuracy rewards and format rewards. Accuracy reward was checking whether a boxed answer is correct (for math) or whether a code passes tests (for programming). Format reward was checking whether the model puts its thinking trace within a <think>...</think> tag. [48]
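As a hypothetical illustration of such rule-based rewards (the exact checks and answer format DeepSeek used are not specified here; the \boxed{} convention and the regexes below are assumptions):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the model wrapped its reasoning in <think>...</think>, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the final boxed answer matches the reference (math-style check), else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

completion = r"<think>2+2=4</think> The answer is \boxed{4}."
print(format_reward(completion), accuracy_reward(completion, "4"))  # 1.0 1.0
```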
R1-Zero has issues with readability and mixing languages. R1 was trained to address these issues and further improve reasoning, in part by enforcing an output template of the form |special_token|<reasoning_process>|special_token|<summary>, designed to improve model output readability. [48]

Distilled models were trained by SFT on 800K data samples synthesized from DeepSeek-R1, in a similar way to R1's own supervised finetuning stage. They were not trained with RL. [48]
R2, the successor to R1, was originally planned for release in early May 2025, but the release schedule was accelerated. [73]
DeepSeek's success against larger and more established rivals has been described as "upending AI". [15] [74]
The DeepSeek-R1 model provides responses comparable to other contemporary large language models, such as OpenAI's GPT-4o and o1. [75] Its training cost is reported to be significantly lower than other LLMs.
The company claims that it trained V3, a predecessor of R1, for US$6 million, compared to $100 million for OpenAI's GPT-4 in 2023, [11] and with approximately one-tenth of the computing power used for Meta's comparable model, Llama 3.1. [11] [12] [13] [14]
After the January 2025 release of the R1 model, which offered significantly lower costs than competing models, some investors anticipated a price war in the American AI industry. [76] DeepSeek was dubbed the "Pinduoduo of AI", and other Chinese tech giants such as ByteDance, Tencent, Baidu, and Alibaba cut the prices of their AI models. Despite its low price, it was profitable compared to its money-losing rivals. [44]
Note 3: DeepSeek-Coder-V2 Chat in the paper was released as DeepSeek-Coder-V2-Instruct on HuggingFace.
Note 4: R1-Lite-Preview required selecting "Deep Think enabled", and every user could use it only 50 times a day.