前进 (2023-12-27 15:11):
#paper arXiv:2312.11514v1, 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Large language models (LLMs) are central to modern natural language processing, but their heavy computational and memory demands pose a challenge for devices with limited memory. To efficiently run LLMs that exceed the available DRAM capacity, this paper stores the model parameters on flash memory and brings them into DRAM on demand. The approach builds an inference cost model that harmonizes with flash memory behavior, and optimizes in two critical areas: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks. Within this framework, two principal techniques are introduced: a "windowing" strategy that reduces data transfer by reusing previously activated neurons, and "row-column bundling", which exploits the sequential-access strengths of flash memory by increasing the size of the data chunks read from it. Together, these methods make it possible to run models up to twice the size of the available DRAM, with inference speedups of 4-5x on CPU and 20-25x on GPU compared to naive loading.
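To make the "windowing" idea concrete, here is a minimal Python sketch; it is not the authors' code, and the class name `NeuronWindow` and the reference-counting bookkeeping are my own illustration. The idea: keep the union of neurons activated for the last k tokens resident in DRAM, and for each new token load from flash only the neurons not already resident, evicting those that drop out of the window.

```python
# Hypothetical sketch of the "windowing" strategy (illustrative only):
# DRAM holds the neurons active for the last `window_size` tokens, so most
# of each new token's active set is already resident and needs no flash I/O.
from collections import deque

class NeuronWindow:
    def __init__(self, window_size: int):
        self.window_size = window_size  # number of recent tokens tracked
        self.history = deque()          # per-token sets of active neuron ids
        self.in_dram = {}               # neuron id -> reference count

    def step(self, active_neurons: set[int]) -> tuple[set[int], set[int]]:
        """Return (neurons to load from flash, neurons to evict from DRAM)."""
        to_load = set()
        for n in active_neurons:
            if n not in self.in_dram:
                to_load.add(n)          # flash -> DRAM transfer needed
                self.in_dram[n] = 0
            self.in_dram[n] += 1
        self.history.append(active_neurons)

        to_evict = set()
        if len(self.history) > self.window_size:
            for n in self.history.popleft():   # oldest token leaves window
                self.in_dram[n] -= 1
                if self.in_dram[n] == 0:       # no recent token uses it now
                    del self.in_dram[n]
                    to_evict.add(n)
        return to_load, to_evict
```

Because consecutive tokens activate heavily overlapping neuron sets in sparse FFN layers, `to_load` is typically a small fraction of the active set, which is exactly where the data-transfer savings come from.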
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
Abstract:
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
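A minimal sketch of "row-column bundling" as I understand it; the storage layout and function names here are assumptions, not the paper's exact on-disk format. The observation is that an active FFN neuron i needs both column i of the up-projection and row i of the down-projection, so storing the two back to back lets one contiguous flash read fetch everything that neuron needs.

```python
# Hypothetical sketch of "row-column bundling" (layout assumed, not the
# authors' exact format): bundle up-projection column i with down-projection
# row i so loading neuron i is one sequential read, not two scattered ones.
import numpy as np

def bundle_ffn_weights(w_up: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """w_up: (d_model, d_ffn), w_down: (d_ffn, d_model).
    Returns (d_ffn, 2 * d_model): row i = [up column i | down row i]."""
    return np.concatenate([w_up.T, w_down], axis=1)

def load_neuron(bundled: np.ndarray, i: int, d_model: int):
    """Fetch both weight vectors for neuron i from one contiguous row."""
    chunk = bundled[i]                        # contiguous in the flat file
    return chunk[:d_model], chunk[d_model:]   # (up column i, down row i)
```

Doubling the size of each read unit this way plays to flash memory's strength: sequential throughput is far higher than the throughput of many small random reads.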