惊鸿 (2025-02-15 00:02):
#paper DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model Pub Date: 2024-05-07 DOI: arxiv-2405.04434 We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality, multi-source corpus of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at "https://github.com/deepseek-ai/DeepSeek-V2".
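Not from the paper itself: a minimal PyTorch sketch of generic top-k sparse expert routing, just to make concrete how an MoE layer can hold far more total parameters than it activates per token (the 21B-of-236B idea). The class name, hidden sizes, expert count, and k below are illustrative assumptions, not DeepSeekMoE's actual configuration (which uses fine-grained and shared experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative sparse MoE layer: only the top-k experts chosen by the
    router run for each token, so activated parameters are a small fraction
    of total parameters. Placeholder dimensions, not DeepSeekMoE's."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)  # normalize over the selected experts only
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # (token, slot) pairs routed to expert e
            token_ids, slot_ids = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot_ids].unsqueeze(-1) * expert(x[token_ids])
        return out

x = torch.randn(16, 512)           # 16 tokens
print(TopKMoE()(x).shape)          # torch.Size([16, 512])
```

Per-token compute scales with k, not with n_experts, which is why total parameter count can grow cheaply while inference cost stays tied to the activated subset.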
arXiv, 2024-05-07T15:56:43Z. DOI: 10.48550/arXiv.2405.04434
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, T. Wang, Tian Pei, Tian Yuan, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Liu, Xin Xie, Xingkai Yu, Xinnan Song, Xinyi Zhou, Xinyu Yang, Xuan Lu, Xuecheng Su, Y. Wu, Y. K. Li, Y. X. Wei, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Zheng, Yichao Zhang, Yiliang Xiong, Yilong Zhao, Ying He, Ying Tang, Yishi Piao, Yixin Dong, Yixuan Tan, Yiyuan Liu, Yongji Wang, Yongqiang Guo, Yuchen Zhu, Yuduan Wang, Yuheng Zou, Yukun Zha, Yunxian Ma, Yuting Yan, Yuxiang You, Yuxuan Liu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhewen Hao, Zhihong Shao, Zhiniu Wen, Zhipeng Xu, Zhongyu Zhang, Zhuoshu Li, Zihan Wang, Zihui Gu, Zilin Li, Ziwei Xie
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
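To illustrate what "compressing the Key-Value (KV) cache into a latent vector" means in practice, here is a minimal PyTorch sketch of low-rank KV compression during incremental decoding: the cache stores one small per-token latent instead of full per-head keys and values. The class, projection names, and dimensions are simplifying assumptions for illustration, and pieces such as causal masking and the paper's decoupled rotary-position keys are omitted; this is not MLA's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Illustrative low-rank KV compression in the spirit of Multi-head Latent
    Attention: the cache holds a d_latent vector per token, and keys/values are
    reconstructed by up-projections at attention time. Placeholder shapes."""

    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # reconstruct keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # reconstruct values
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # x: (batch, new_tokens, d_model); kv_cache: (batch, past_tokens, d_latent)
        b, t, _ = x.shape
        latent = self.kv_down(x)                                  # (b, t, d_latent)
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)         # cache only the latent
        s = latent.size(1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)            # causal mask omitted
        out = attn.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), latent                         # latent is the KV cache

layer = LatentKVAttention()
y1, cache = layer(torch.randn(1, 4, 512))                        # prefill 4 tokens
y2, cache = layer(torch.randn(1, 1, 512), kv_cache=cache)        # decode 1 more token
print(cache.shape)                                               # torch.Size([1, 5, 64])
```

In this sketch the per-token cache is d_latent floats rather than 2 * n_heads * d_head, which is the kind of reduction that makes long-context inference and large batch decoding cheaper.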