惊鸿 (2025-02-15 00:02):
#paper DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model Pub Date: 2024-05-07 DOI: arxiv-2405.04434 We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality, multi-source corpus of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at "https://github.com/deepseek-ai/DeepSeek-V2".
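Not from the paper itself: a minimal PyTorch sketch of generic top-k sparse expert routing, just to make concrete how an MoE layer can hold far more total parameters than it activates per token (the 21B-of-236B idea). The class name, hidden sizes, expert count, and k below are illustrative assumptions, not DeepSeekMoE's actual configuration (which uses fine-grained and shared experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative sparse MoE layer: only the top-k experts chosen by the
    router run for each token, so activated parameters are a small fraction
    of total parameters. Placeholder dimensions, not DeepSeekMoE's."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)  # normalize over the selected experts only
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # (token, slot) pairs routed to expert e
            token_ids, slot_ids = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot_ids].unsqueeze(-1) * expert(x[token_ids])
        return out

x = torch.randn(16, 512)           # 16 tokens
print(TopKMoE()(x).shape)          # torch.Size([16, 512])
```

Per-token compute scales with k, not with n_experts, which is why total parameter count can grow cheaply while inference cost stays tied to the activated subset.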
arXiv, 2024-05-07T15:56:43Z. DOI: 10.48550/arXiv.2405.04434
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, T. Wang, Tian Pei, Tian Yuan, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Liu, Xin Xie, Xingkai Yu, Xinnan Song, Xinyi Zhou, Xinyu Yang, Xuan Lu, Xuecheng Su, Y. Wu, Y. K. Li, Y. X. Wei, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Zheng, Yichao Zhang, Yiliang Xiong, Yilong Zhao, Ying He, Ying Tang, Yishi Piao, Yixin Dong, Yixuan Tan, Yiyuan Liu, Yongji Wang, Yongqiang Guo, Yuchen Zhu, Yuduan Wang, Yuheng Zou, Yukun Zha, Yunxian Ma, Yuting Yan, Yuxiang You, Yuxuan Liu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhewen Hao, Zhihong Shao, Zhiniu Wen, Zhipeng Xu, Zhongyu Zhang, Zhuoshu Li, Zihan Wang, Zihui Gu, Zilin Li, Ziwei Xie
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
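To illustrate what "compressing the Key-Value (KV) cache into a latent vector" means in practice, here is a minimal PyTorch sketch of low-rank KV compression during incremental decoding: the cache stores one small per-token latent instead of full per-head keys and values. The class, projection names, and dimensions are simplifying assumptions for illustration, and pieces such as causal masking and the paper's decoupled rotary-position keys are omitted; this is not MLA's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Illustrative low-rank KV compression in the spirit of Multi-head Latent
    Attention: the cache holds a d_latent vector per token, and keys/values are
    reconstructed by up-projections at attention time. Placeholder shapes."""

    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # reconstruct keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # reconstruct values
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # x: (batch, new_tokens, d_model); kv_cache: (batch, past_tokens, d_latent)
        b, t, _ = x.shape
        latent = self.kv_down(x)                                  # (b, t, d_latent)
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)         # cache only the latent
        s = latent.size(1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)            # causal mask omitted
        out = attn.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent                         # latent is the KV cache

layer = LatentKVAttention()
y1, cache = layer(torch.randn(1, 4, 512))                        # prefill 4 tokens
y2, cache = layer(torch.randn(1, 1, 512), kv_cache=cache)        # decode 1 more token
print(cache.shape)                                               # torch.Size([1, 5, 64])
```

In this sketch the per-token cache is d_latent floats rather than 2 * n_heads * d_head, which is the kind of reduction that makes long-context inference and large batch decoding cheaper.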