DeepSeek-R1
Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., doing business as DeepSeek, is a Chinese artificial intelligence company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by the Chinese hedge fund High-Flyer. DeepSeek was founded in July 2023 by Liang Wenfeng, the co-founder of High-Flyer, who also serves as CEO of both companies. The company launched an eponymous chatbot alongside its DeepSeek-R1 model in January 2025.

Released under the MIT License, DeepSeek-R1 provides responses comparable to other contemporary large language models, such as OpenAI's GPT-4 and o1. Its training cost was reported to be significantly lower than that of other LLMs. The company claims that it trained its V3 model for US$6 million, far less than the US$100 million cost of OpenAI's GPT-4 in 2023, and using approximately one-tenth the computing power consumed by Meta's comparable model, Llama 3.1. DeepSeek's success against larger and more established rivals has been described as "upending AI".

DeepSeek's models are described as "open weight", meaning the exact parameters are openly shared, although certain usage conditions differ from typical open-source software. The company reportedly recruits AI researchers from top Chinese universities and also hires from outside traditional computer science fields to broaden its models' knowledge and capabilities.

DeepSeek significantly reduced training expenses for its R1 model by incorporating techniques such as mixture of experts (MoE) layers. The company also trained its models during ongoing trade restrictions on AI chip exports to China, using weaker AI chips intended for export and employing fewer units overall. Observers say this breakthrough sent "shock waves" through the industry, threatening established AI hardware leaders such as Nvidia; Nvidia's share price dropped sharply, losing US$600 billion in market value, the largest single-company decline in U.S. stock market history.


History


Founding and early years (2016–2023)

In February 2016, High-Flyer was co-founded by AI enthusiast Liang Wenfeng, who had been trading since the 2008 financial crisis while attending Zhejiang University. The company began stock trading using a GPU-dependent deep learning model on 21 October 2016; before then, it had used CPU-based linear models. By the end of 2017, most of its trading was driven by AI. Liang established High-Flyer as a hedge fund focused on developing and using AI trading algorithms, and by 2021 the firm was using AI exclusively, often relying on Nvidia chips.

In 2019, the company began constructing its first computing cluster, Fire-Flyer, at a cost of 200 million yuan; it contained 1,100 GPUs interconnected at 200 Gbit/s and was retired after 1.5 years in operation. By 2021, Liang had started buying large quantities of Nvidia GPUs for an AI project, reportedly obtaining 10,000 Nvidia A100 GPUs before the United States restricted chip sales to China. Construction of the second cluster, Fire-Flyer 2, began in 2021 with a budget of 1 billion yuan. In 2022, Fire-Flyer 2's capacity was reportedly used at over 96%, totaling 56.74 million GPU hours, of which 27% supported scientific computing outside the company. During 2022, Fire-Flyer 2 had 5,000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. At the time, it used the PCIe version of the A100 exclusively rather than the DGX version, because the models it trained fit within a single GPU's 40 GB of VRAM, so only data parallelism (not model parallelism) was required and the higher bandwidth of DGX was unnecessary. Later, it incorporated NVLinks and NCCL (Nvidia Collective Communications Library) to train larger models that required model parallelism.

On 14 April 2023, High-Flyer announced the launch of an artificial general intelligence (AGI) research lab, stating that the new lab would focus on developing AI tools unrelated to the firm's financial business. On 17 July 2023, that lab was spun off into an independent company, DeepSeek, with High-Flyer as its principal investor and backer. Venture capital investors were reluctant to provide funding, as they considered it unlikely that the venture would be able to quickly generate an "exit".


Model releases (2023–present)

DeepSeek released its first model, DeepSeek Coder, on 2 November 2023, followed by the DeepSeek-LLM series on 29 November 2023. In January 2024, it released two DeepSeek-MoE models (Base and Chat), and in April three DeepSeek-Math models (Base, Instruct, and RL). DeepSeek-V2 was released in May 2024, followed a month later by the DeepSeek-Coder V2 series. In September 2024, DeepSeek V2.5 was introduced and revised in December. On 20 November 2024, the preview of DeepSeek-R1-Lite became available via API and chat. In December, DeepSeek-V3-Base and DeepSeek-V3 (chat) were released.

On 20 January 2025, DeepSeek launched the DeepSeek chatbot, based on the DeepSeek-R1 model, free of charge for iOS and Android. By 27 January, DeepSeek surpassed ChatGPT as the most downloaded freeware app on the iOS App Store in the United States, triggering an 18% drop in Nvidia's share price. On 24 March 2025, DeepSeek released DeepSeek-V3-0324 under the MIT License. In February 2025, Singaporean authorities arrested several individuals for illegally exporting advanced Nvidia chips to DeepSeek. In April 2025, it was reported that the Trump administration was considering penalties that would attempt to block DeepSeek from buying U.S. technology. On 28 May 2025, DeepSeek released DeepSeek-R1-0528 under the MIT License.


Company operation

DeepSeek is headquartered in Hangzhou, Zhejiang, and is owned and funded by High-Flyer. Its co-founder, Liang Wenfeng, serves as CEO. As of May 2024, Liang personally held an 84% stake in DeepSeek through two shell corporations.


Strategy

DeepSeek states that it focuses on research and does not have immediate plans for commercialization. This posture also allows it to skirt certain provisions of China's AI regulations aimed at consumer-facing technologies.

DeepSeek's hiring approach emphasizes skills over lengthy work experience, resulting in many hires fresh out of university. The company likewise recruits individuals without computer science backgrounds to expand the range of expertise incorporated into its models, for instance in poetry or advanced mathematics. According to ''The New York Times'', dozens of DeepSeek researchers have, or previously had, affiliations with People's Liberation Army laboratories and the Seven Sons of National Defence universities.


Training framework

High-Flyer/DeepSeek operates at least two primary computing clusters: Fire-Flyer (萤火一号) and Fire-Flyer 2 (萤火二号). Fire-Flyer 2 consists of co-designed software and hardware architecture. On the hardware side, Nvidia GPUs use 200 Gbps interconnects. The cluster is divided into two "zones", and the platform supports cross-zone tasks. The network topology was two fat trees, chosen for high bisection bandwidth. On the software side are:
* 3FS (Fire-Flyer File System): A distributed parallel file system specifically designed for asynchronous random reads. It uses Direct I/O and RDMA Read. In contrast to standard Buffered I/O, Direct I/O does not cache data; caching would be useless here, since each data read is random and not reused.
* hfreduce: A library for asynchronous communication, originally designed to replace the Nvidia Collective Communications Library (NCCL). It is mainly used for allreduce, especially of gradients during backpropagation. It runs asynchronously on the CPU to avoid blocking kernels on the GPU, and uses two-tree broadcast like NCCL (a minimal sketch of this idea appears at the end of this section).
* hfai.nn: A software library of commonly used operators for neural network training, similar to torch.nn in PyTorch.
* HaiScale Distributed Data Parallel (DDP): A parallel training library that implements various forms of parallelism such as Data Parallelism (DP), Pipeline Parallelism (PP), Tensor Parallelism (TP), Expert Parallelism (EP), Fully Sharded Data Parallel (FSDP), and the Zero Redundancy Optimizer (ZeRO). It is similar to PyTorch DDP, which uses NCCL on the backend.
* HAI Platform: Various applications such as task scheduling, fault handling, and disaster recovery.
As of 2022, Fire-Flyer 2 had 5,000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. It later incorporated NVLinks and NCCL to train larger models that required model parallelism.
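The following is a minimal sketch, not DeepSeek's actual code, of the idea behind hfreduce described above: gradient reduction runs asynchronously on the CPU so that the backward pass can keep going on the GPU. The worker count, layer sizes, and threading scheme are illustrative assumptions.

```python
# Hypothetical sketch of asynchronous CPU-side gradient all-reduce overlapping
# with backpropagation. All names and numbers are illustrative.
import threading
import numpy as np

NUM_WORKERS = 4          # simulated data-parallel workers
NUM_LAYERS = 6           # simulated model depth
LAYER_SIZE = 1024

def allreduce(grads):
    """Average one layer's gradients across simulated workers (CPU-side)."""
    return np.mean(grads, axis=0)

def reduce_layer(layer, grads, out):
    out[layer] = allreduce(grads)

def backward_pass():
    reduced = [None] * NUM_LAYERS
    threads = []
    # Backprop proceeds from the last layer to the first.
    for layer in reversed(range(NUM_LAYERS)):
        # Pretend each worker produced a gradient for this layer.
        grads = np.random.randn(NUM_WORKERS, LAYER_SIZE)
        # Launch the reduction asynchronously so compute can continue.
        t = threading.Thread(target=reduce_layer, args=(layer, grads, reduced))
        t.start()
        threads.append(t)
        # ... the main thread would compute the next layer's gradients here.
    for t in threads:
        t.join()
    return reduced

if __name__ == "__main__":
    out = backward_pass()
    print("reduced gradient shapes:", [g.shape for g in out])
```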


Development and release history

The first DeepSeek models were essentially the same as Llama: dense decoder-only transformers. Later models incorporated multi-head latent attention (MLA), mixture of experts (MoE), and KV caching.

A decoder-only transformer consists of multiple identical decoder layers. Each of these layers features two main components: an attention layer and a feedforward network (FFN) layer. In the attention layer, the traditional multi-head attention mechanism has been enhanced with multi-head latent attention, which introduces compressed latent vectors to boost performance and reduce memory usage during inference. Meanwhile, the FFN layer adopts a variant of the mixture of experts (MoE) approach, effectively doubling the number of experts compared to standard implementations. It distinguishes between two types of experts: shared experts, which are always active to encapsulate general knowledge, and routed experts, only a select few of which are activated to capture specialized information.

Consider the current sequence of ''n'' tokens as input. To predict the next token, the attention mechanism computes query (Q), key (K), and value (V) matrices, whose dimensions are determined by the current number of tokens and the model's embedding size. Once the new token is generated, the autoregressive procedure appends it to the end of the input sequence, and the transformer layers repeat the matrix calculation for the next token. A mathematical analysis shows that the new token introduces only one new query, key, and value vector, appended to Q, K, and V respectively, and that appending these new vectors to the cached K and V matrices is sufficient for calculating the next-token prediction. Consequently, storing the current K and V matrices in memory saves time by avoiding the recalculation of the attention matrix over the whole sequence. This feature, known as KV (key-value) caching, effectively reduces computational cost during inference.
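To make the KV caching description concrete, here is a minimal single-head attention sketch in NumPy; the dimensions, weights, and softmax details are illustrative assumptions rather than DeepSeek's implementation. Only the new token's key and value rows are appended to the cache at each decoding step.

```python
# Minimal sketch of KV caching with single-head attention in NumPy.
import numpy as np

D = 16                      # embedding / head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

K_cache = np.zeros((0, D))  # cached keys for all previous tokens
V_cache = np.zeros((0, D))  # cached values for all previous tokens

def attend_next_token(x):
    """Process one new token embedding x, reusing cached K and V."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Only the new key/value rows are appended; old ones are not recomputed.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    scores = softmax(q @ K_cache.T / np.sqrt(D))
    return scores @ V_cache   # attention output for the new token

for _ in range(5):                            # autoregressive decoding loop
    token_embedding = rng.standard_normal((1, D))
    out = attend_next_token(token_embedding)
print("cached keys:", K_cache.shape)          # (5, D): one row per token
```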


Overview of models

DeepSeek's models are "open weight", which provides less freedom for modification than true
open source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use and view the source code, design documents, or content of the product. The open source model is a decentrali ...
software.


DeepSeek Coder

DeepSeek Coder is a series of eight models, four pretrained (Base) and four instruction-finetuned (Instruct), all with 16K context lengths. The models were made source-available under the DeepSeek License, which includes "open and responsible downstream usage" restrictions. The training program was:
# Pretraining: 1.8T tokens (87% source code, 10% code-related English from GitHub markdown and Stack Exchange, and 3% code-unrelated Chinese); a toy sketch of sampling from this mixture appears below.
# Long-context pretraining: 200B tokens, extending the context length from 4K to 16K. This produced the Base models.
# Supervised finetuning (SFT): 2B tokens of instruction data. This produced the Instruct models.
The models were trained on clusters of A100 and H800 Nvidia GPUs, connected by InfiniBand, NVLink, and NVSwitch.
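As an illustration of the reported pretraining mixture (not DeepSeek's data pipeline), the toy sketch below samples a corpus name per document according to those proportions; the corpus names and the sampling scheme are assumptions.

```python
# Illustrative sketch only: sampling a pretraining batch according to the
# stated corpus mixture (87% code, 10% code-related English, 3% Chinese).
import random

MIXTURE = {"source_code": 0.87, "code_english": 0.10, "chinese": 0.03}

def sample_corpus(rng=random.random):
    r, cum = rng(), 0.0
    for name, weight in MIXTURE.items():
        cum += weight
        if r < cum:
            return name
    return name  # floating-point edge case: fall back to the last corpus

counts = {k: 0 for k in MIXTURE}
for _ in range(10_000):
    counts[sample_corpus()] += 1
print(counts)   # roughly proportional to the mixture weights
```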


DeepSeek-LLM

The DeepSeek-LLM series was released in November 2023. It has 7B and 67B parameters in both Base and Chat forms. DeepSeek's accompanying paper claimed benchmark results higher than Llama 2 and most open-source LLMs at the time. The model code is under the source-available DeepSeek License.

The architecture was essentially the same as the Llama series: a pre-norm decoder-only Transformer with RMSNorm as the normalization, SwiGLU in the feedforward layers, rotary positional embedding (RoPE), and grouped-query attention (GQA). Both sizes had a vocabulary of 102,400 tokens (byte-level BPE) and a context length of 4,096. They were trained on 2 trillion tokens of English and Chinese text obtained by deduplicating the Common Crawl.

The Chat versions of the two Base models were released concurrently, obtained by training Base with supervised finetuning (SFT) followed by direct preference optimization (DPO).
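A minimal NumPy sketch of two of the building blocks named above, RMSNorm and a SwiGLU feedforward layer, follows; the dimensions and random weights are illustrative assumptions, not the actual model configuration.

```python
# Minimal sketch of RMSNorm and a SwiGLU feedforward layer.
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale activations by the reciprocal of their root-mean-square."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU: SiLU(x W_gate) elementwise-times (x W_up), projected by W_down."""
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))      # SiLU / swish activation
    return (silu * (x @ W_up)) @ W_down

d_model, d_ff = 64, 172                       # illustrative sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))         # 4 tokens
x = rms_norm(x, gain=np.ones(d_model))        # pre-norm before the FFN
y = swiglu_ffn(x,
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_ff, d_model)))
print(y.shape)                                # (4, 64)
```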


MoE

The DeepSeek-MoE models (Base and Chat) each have 16B parameters (2.7B activated per token, 4K context length). The training was essentially the same as for DeepSeek-LLM 7B, using a part of its training dataset. DeepSeek claimed that the 16B MoE performed comparably to a 7B non-MoE model. The architecture is a variant of the standard sparsely-gated MoE, with "shared experts" that are always queried and "routed experts" that might not be. They found this to help with expert balancing: in standard MoE, some experts can become overused while others are rarely used, wasting capacity, and attempting to balance expert usage causes experts to replicate the same capabilities. The shared experts are meant to learn core capabilities that are used often, leaving the routed experts to learn peripheral capabilities that are rarely used.
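Below is a minimal NumPy sketch of an MoE feedforward layer with always-active shared experts and top-k routed experts, as described above; the expert counts, sizes, and gating are illustrative assumptions rather than the DeepSeek-MoE configuration.

```python
# Minimal sketch of an MoE layer with shared and routed experts.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 32, 64
N_SHARED, N_ROUTED, TOP_K = 2, 8, 2

def make_expert():
    return (rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)

shared = [make_expert() for _ in range(N_SHARED)]
routed = [make_expert() for _ in range(N_ROUTED)]
W_gate = rng.standard_normal((d_model, N_ROUTED)) * 0.02  # router weights

def expert_forward(x, expert):
    w1, w2 = expert
    return np.maximum(x @ w1, 0.0) @ w2          # simple ReLU FFN expert

def moe_layer(x):
    """x: (tokens, d_model). Shared experts always fire; top-k routed ones do."""
    out = sum(expert_forward(x, e) for e in shared)
    logits = x @ W_gate                           # (tokens, N_ROUTED)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    topk = np.argsort(-probs, axis=-1)[:, :TOP_K]
    for t in range(x.shape[0]):                   # route each token separately
        for j in topk[t]:
            out[t] += probs[t, j] * expert_forward(x[t:t+1], routed[j])[0]
    return out

print(moe_layer(rng.standard_normal((4, d_model))).shape)   # (4, 32)
```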


Math

DeepSeek-Math includes three models: Base, Instruct, and RL. Math was trained as follows:
# Initialize with the previously pretrained DeepSeek-Coder Base v1.5 7B.
# Further pretrain with 500B tokens (56% DeepSeekMath Corpus, 4% AlgebraicStack, 10% arXiv, 20% GitHub code, 10% Common Crawl). This produced Base.
# Train an instruction-following model by SFT of Base on 776K math problems with tool-use-integrated step-by-step solutions. This produced Instruct.
# Reinforcement learning (RL): The reward model was a process reward model (PRM) trained from Base according to the Math-Shepherd method. This reward model was then used to train Instruct using Group Relative Policy Optimization (GRPO) on a dataset of 144K math questions "related to GSM8K and MATH" (a minimal sketch of GRPO's group-relative advantages appears after this list). The reward model was continuously updated during training to avoid reward hacking. This resulted in RL.
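The core of GRPO is that rewards for a group of sampled answers to the same question are normalized within the group, replacing a learned value function. The sketch below shows only that advantage computation, with made-up reward numbers; the full objective (clipped policy-gradient ratios and a KL penalty) is noted in comments but not implemented.

```python
# Minimal sketch of the group-relative advantage computation used by GRPO.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: per-sample rewards for one group of answers to one question."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 sampled solutions to one math question, scored by a reward model.
rewards = [0.1, 0.9, 0.4, 0.4, 0.0, 1.0, 0.3, 0.7]
adv = group_relative_advantages(rewards)
print(np.round(adv, 2))   # above-average answers get positive advantages
# In full GRPO these advantages weight a PPO-style clipped policy-gradient
# objective, with a KL penalty to a reference model (not shown here).
```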


V2

In May 2024, DeepSeek released the DeepSeek-V2 series. The series includes four models: two base models (DeepSeek-V2, DeepSeek-V2 Lite) and two chatbots (Chat). The two larger models were trained as follows:
# Pretrain on a dataset of 8.1T tokens, using 12% more Chinese tokens than English ones.
# Extend the context length from 4K to 128K using YaRN. This resulted in DeepSeek-V2.
# SFT with 1.2M instances for helpfulness and 0.3M for safety. This resulted in Chat (SFT), which was not released.
# RL using GRPO in two stages. The first stage was trained to solve math and coding problems, using one reward model trained on compiler feedback (for coding) and ground-truth labels (for math). The second stage was trained to be helpful, safe, and rule-following, using three reward models: helpfulness and safety reward models trained on human preference data, and a rule-based reward model that was manually programmed. All trained reward models were initialized from Chat (SFT). This resulted in the released version of Chat.
They opted for two-stage RL because they found that RL on reasoning data had "unique characteristics" different from RL on general data; for example, RL on reasoning could keep improving over more training steps. The two V2-Lite models were smaller and trained similarly, except that DeepSeek-V2 Lite-Chat underwent only SFT, not RL. They trained the Lite version to help "further research and development on MLA and DeepSeekMoE".

Architecturally, the V2 models differed significantly from the DeepSeek-LLM series. They replaced the standard attention mechanism with a low-rank approximation called multi-head latent attention (MLA), and used the previously published mixture of experts (MoE) variant (a minimal sketch of MLA's key-value compression appears at the end of this section). The ''Financial Times'' reported that it was cheaper than its peers, at a price of 2 RMB per million output tokens. The University of Waterloo Tiger Lab's leaderboard ranked DeepSeek-V2 seventh on its LLM ranking.

The DeepSeek-Coder V2 series included V2-Base, V2-Lite-Base, V2-Instruct, and V2-Lite-Instruct. Training proceeded as follows:
# Base models were initialized from corresponding intermediate checkpoints of DeepSeek-V2 after pretraining on 4.2T tokens (not the version at the end of pretraining), then pretrained further for 6T tokens, then context-extended to 128K.
# DeepSeek-Coder and DeepSeek-Math were used to generate 20K code-related and 30K math-related instruction data, which were combined with an instruction dataset of 300M tokens and used for SFT.
# RL with GRPO. The reward for math problems was computed by comparing with the ground-truth label. The reward for code problems was generated by a reward model trained to predict whether a program would pass the unit tests.
DeepSeek-V2.5 was made by combining DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.
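The sketch below illustrates, under illustrative dimensions, the key-value compression idea behind multi-head latent attention mentioned above: only a small latent vector is cached per token, and keys and values are reconstructed from it by up-projection. RoPE handling, multiple heads, and the query-side compression of the real design are omitted, and none of the weights are DeepSeek's.

```python
# Minimal sketch of MLA-style key-value compression via a cached latent vector.
import numpy as np

d_model, d_latent, d_head = 64, 8, 64        # latent is much smaller than d_model
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.1   # compress hidden state
W_up_k = rng.standard_normal((d_latent, d_head)) * 0.1    # reconstruct keys
W_up_v = rng.standard_normal((d_latent, d_head)) * 0.1    # reconstruct values

latent_cache = []                             # only this small vector is cached

def cache_token(x):
    latent_cache.append(x @ W_down)           # (d_latent,) per token

def keys_and_values():
    c = np.stack(latent_cache)                # (tokens, d_latent)
    return c @ W_up_k, c @ W_up_v             # reconstructed K and V

for _ in range(5):
    cache_token(rng.standard_normal(d_model))
K, V = keys_and_values()
print(K.shape, V.shape)                       # (5, 64) each, from a (5, 8) cache
```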


V3

DeepSeek-V3-Base and DeepSeek-V3 (a chat model) use essentially the same architecture as V2, with the addition of multi-token prediction, which (optionally) decodes extra tokens faster but less accurately. The training process was:
# Pretraining on 14.8T tokens of a multilingual corpus, mostly English and Chinese, containing a higher ratio of math and programming than the pretraining dataset of V2.
# Extend the context length twice, from 4K to 32K and then to 128K, using YaRN. This produced DeepSeek-V3-Base.
# SFT for 2 epochs on 1.5M samples of reasoning (math, programming, logic) and non-reasoning (creative writing, roleplay, simple question answering) data. Reasoning data was generated by "expert models"; non-reasoning data was generated by DeepSeek-V2.5 and checked by humans.
#* The "expert models" were trained by starting with an unspecified base model, then SFT on both data and synthetic data generated by an internal DeepSeek-R1-Lite model. The system prompt asked R1 to reflect and verify during thinking. The expert models were then trained with RL using an undisclosed reward function.
#* Each expert model was trained to generate synthetic reasoning data in one specific domain (math, programming, logic).
#* Expert models were used instead of R1 itself because R1's output suffered from "overthinking, poor formatting, and excessive length".
# Model-based reward models were made by starting from an SFT checkpoint of V3, then finetuning on human preference data containing both the final reward and the chain of thought leading to it. The reward model produced reward signals both for questions with objective but free-form answers and for questions without objective answers (such as creative writing).
# An SFT checkpoint of V3 was trained by GRPO using both the reward models and rule-based rewards. The rule-based reward was computed for math problems with a final answer (put in a box), and for programming problems by unit tests. This produced DeepSeek-V3.
DeepSeek released DeepSeek-V3-0324, which used the same architecture as V3, on 24 March 2025 under the MIT License.

The DeepSeek team performed extensive low-level engineering to improve efficiency. They used mixed-precision arithmetic: much of the forward pass was performed in 8-bit floating point numbers (5E2M: 5-bit exponent and 2-bit mantissa) rather than the standard 32-bit, requiring special GEMM routines to accumulate accurately (a minimal sketch of this idea appears at the end of this section). They used a custom 12-bit float (E5M6) only for the inputs to the linear layers after the attention modules. Optimizer states were kept in 16-bit (BF16). They minimized communication latency by extensively overlapping computation and communication, such as dedicating 20 streaming multiprocessors out of 132 per H800 solely to inter-GPU communication. They reduced communication by rearranging (every 10 minutes) the exact machine each expert was on so as to avoid querying certain machines more often than others, by adding auxiliary load-balancing losses to the training loss function, and by other load-balancing techniques. After training, it was deployed on clusters of H800 GPUs. The 8 H800 GPUs within a cluster were connected by NVLink, and the clusters were connected by InfiniBand.

The reported training cost has been called misleading, because it covers only part of the true cost. Benchmark tests show that V3 outperformed Llama 3.1 and Qwen 2.5 while matching GPT-4o and Claude 3.5 Sonnet.
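The sketch below only illustrates the mixed-precision idea mentioned above (low-precision operands, wider accumulation), not DeepSeek's kernels: it crudely quantizes the operands to stand in for an 8-bit float format and accumulates the matrix product in float32.

```python
# Illustrative sketch: quantized operands with float32 accumulation.
import numpy as np

def quantize(x, step=0.25):
    """Crude stand-in for an 8-bit float: round values to a coarse grid."""
    return np.round(x / step) * step

def gemm_mixed_precision(A, B):
    """Multiply quantized operands, accumulating partial sums in float32."""
    Aq = quantize(A).astype(np.float32)
    Bq = quantize(B).astype(np.float32)
    return Aq @ Bq                       # accumulation happens in float32

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
B = rng.standard_normal((256, 64))
C = gemm_mixed_precision(A, B)
print("max deviation from full precision:", np.abs(C - A @ B).max())
```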


R1

In January 2025, DeepSeek released the DeepSeek-R1 model under the MIT License. DeepSeek-R1-Lite-Preview was trained for logical inference, mathematical reasoning, and real-time problem-solving. DeepSeek claimed that it exceeded the performance of OpenAI o1 on benchmarks such as the American Invitational Mathematics Examination (AIME) and MATH. However, ''The Wall Street Journal'' reported that on 15 problems from the 2024 edition of AIME, the o1 model reached a solution faster.

DeepSeek-R1 and DeepSeek-R1-Zero were initialized from DeepSeek-V3-Base and share its architecture. DeepSeek-R1-Distill models were instead initialized from other pretrained open-weight models, including LLaMA and Qwen, then fine-tuned on synthetic data generated by R1.

DeepSeek-R1-Zero was trained exclusively using GRPO RL without SFT. Unlike previous versions, it used no model-based reward: all reward functions were rule-based, "mainly" of two types (other types were not specified), accuracy rewards and format rewards. The accuracy reward checked whether a boxed answer is correct (for math) or whether code passes tests (for programming). The format reward checked whether the model placed its thinking trace inside the designated thinking tags (a minimal sketch of such rule-based rewards appears at the end of this section). R1-Zero had issues with readability and mixing languages. R1 was trained to address these issues and further improve reasoning:
# SFT DeepSeek-V3-Base on "thousands" of "cold-start" data, all in a standard format that separates the reasoning process and the summary with special tokens, designed to improve model output readability.
# Apply the same GRPO RL process as for R1-Zero, adding a "language consistency reward" to encourage the model to respond monolingually. This produced an unreleased internal model.
# Synthesize 600K reasoning data items from the internal model, with rejection sampling (i.e., if the generated reasoning had a wrong final answer, it was removed). Synthesize 200K non-reasoning data items (writing, factual QA, self-cognition, translation) using DeepSeek-V3.
# SFT DeepSeek-V3-Base on the 800K synthetic data for 2 epochs.
# Apply the same GRPO RL process as for R1-Zero with rule-based reward (for reasoning tasks), but also model-based reward (for non-reasoning tasks, helpfulness, and harmlessness). This produced DeepSeek-R1.
Distilled models were trained by SFT on 800K data synthesized from DeepSeek-R1, in a similar way as step 3; they were not trained with RL.

There were reports that R2, the intended successor to R1, was originally planned for release in early May 2025. However, on 28 May 2025, R1 was instead updated to version R1-0528.
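Below is a minimal sketch of what such rule-based rewards can look like; the tag names, the boxed-answer convention, and the all-or-nothing test scoring are illustrative assumptions, not DeepSeek's exact rules.

```python
# Minimal sketch of rule-based accuracy and format rewards.
import re

def format_reward(output: str) -> float:
    """1.0 if the model wrapped its reasoning in the expected tags, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", output, re.DOTALL) else 0.0

def math_accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the boxed final answer matches the reference answer exactly."""
    m = re.search(r"\\boxed\{([^}]*)\}", output)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def code_accuracy_reward(passed_tests: int, total_tests: int) -> float:
    """1.0 only if every unit test passes (an all-or-nothing assumption)."""
    return 1.0 if total_tests > 0 and passed_tests == total_tests else 0.0

sample = "<think>2+2 is 4</think> The answer is \\boxed{4}."
print(format_reward(sample), math_accuracy_reward(sample, "4"))   # 1.0 1.0
```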


Significance

DeepSeek's success against larger and more established rivals has been described as "upending AI". The DeepSeek-R1 model provides responses comparable to other contemporary large language models, such as OpenAI's GPT-4o and o1. Its training cost is reported to be significantly lower than that of other LLMs. The company claims that it trained V3, a predecessor of R1, for US$6 million, compared to $100 million for OpenAI's GPT-4 in 2023, and approximately one tenth of the computing power used for Meta's comparable model, LLaMA 3.1.

After the January 2025 release of the R1 model, which offered significantly lower costs than competing models, some investors anticipated a price war in the American AI industry. It was dubbed the "Pinduoduo of AI", and other Chinese tech giants such as ByteDance, Tencent, Baidu, and Alibaba cut the prices of their AI models. Despite its low price, it was profitable compared to its money-losing rivals.


See also



Notes


References


External links

* DeepSeek on GitHub
* DeepSeek on Hugging Face
* Official API documentation
* Anthology of DeepSeek papers
* Research blog of High-Flyer