前进 (2023-12-27 15:11):
#paper arXiv:2312.11514v1, 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Large language models (LLMs) are central to modern natural language processing, but their heavy computational and memory demands pose a challenge for devices with limited memory. To efficiently run LLMs that exceed the available DRAM capacity, this paper stores the model parameters on flash memory and brings them into DRAM on demand. The approach builds an inference cost model that harmonizes with flash memory behavior, and optimizes in two critical areas: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks. Within this framework, two principal techniques are introduced: a "windowing" strategy that reduces data transfer by reusing previously activated neurons, and "row-column bundling", which exploits the sequential-access strengths of flash memory by increasing the size of the data chunks read from it. Together, these methods make it possible to run models up to twice the size of the available DRAM, with inference speedups of 4-5x on CPU and 20-25x on GPU compared to naive loading.
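To make the "windowing" idea concrete, here is a minimal Python sketch; it is not the authors' code, and the class name `NeuronWindow` and the reference-counting bookkeeping are my own illustration. The idea: keep the union of neurons activated for the last k tokens resident in DRAM, and for each new token load from flash only the neurons not already resident, evicting those that drop out of the window.

```python
# Hypothetical sketch of the "windowing" strategy (illustrative only):
# DRAM holds the neurons active for the last `window_size` tokens, so most
# of each new token's active set is already resident and needs no flash I/O.
from collections import deque

class NeuronWindow:
    def __init__(self, window_size: int):
        self.window_size = window_size  # number of recent tokens tracked
        self.history = deque()          # per-token sets of active neuron ids
        self.in_dram = {}               # neuron id -> reference count

    def step(self, active_neurons: set[int]) -> tuple[set[int], set[int]]:
        """Return (neurons to load from flash, neurons to evict from DRAM)."""
        to_load = set()
        for n in active_neurons:
            if n not in self.in_dram:
                to_load.add(n)          # flash -> DRAM transfer needed
                self.in_dram[n] = 0
            self.in_dram[n] += 1
        self.history.append(active_neurons)

        to_evict = set()
        if len(self.history) > self.window_size:
            for n in self.history.popleft():   # oldest token leaves window
                self.in_dram[n] -= 1
                if self.in_dram[n] == 0:       # no recent token uses it now
                    del self.in_dram[n]
                    to_evict.add(n)
        return to_load, to_evict
```

Because consecutive tokens activate heavily overlapping neuron sets in sparse FFN layers, `to_load` is typically a small fraction of the active set, which is exactly where the data-transfer savings come from.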
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
Abstract:
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
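A minimal sketch of "row-column bundling" as I understand it; the storage layout and function names here are assumptions, not the paper's exact on-disk format. The observation is that an active FFN neuron i needs both column i of the up-projection and row i of the down-projection, so storing the two back to back lets one contiguous flash read fetch everything that neuron needs.

```python
# Hypothetical sketch of "row-column bundling" (layout assumed, not the
# authors' exact format): bundle up-projection column i with down-projection
# row i so loading neuron i is one sequential read, not two scattered ones.
import numpy as np

def bundle_ffn_weights(w_up: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """w_up: (d_model, d_ffn), w_down: (d_ffn, d_model).
    Returns (d_ffn, 2 * d_model): row i = [up column i | down row i]."""
    return np.concatenate([w_up.T, w_down], axis=1)

def load_neuron(bundled: np.ndarray, i: int, d_model: int):
    """Fetch both weight vectors for neuron i from one contiguous row."""
    chunk = bundled[i]                        # contiguous in the flat file
    return chunk[:d_model], chunk[d_model:]   # (up column i, down row i)
```

Doubling the size of each read unit this way plays to flash memory's strength: sequential throughput is far higher than the throughput of many small random reads.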