响马读paper

An academic exchange community whose members are required to read at least one paper a month and check in

2023, arXiv. DOI: 10.48550/arXiv.2312.11514. arXiv ID: 2312.11514
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
Abstract:
Large language models (LLMs) are central to modern natural language
processing, delivering exceptional performance in various tasks. However, their
intensive computational and memory requirements present challenges, especially
for devices with limited DRAM capacity. This paper tackles the challenge of
efficiently running LLMs that exceed the available DRAM capacity by storing the
model parameters on flash memory but bringing them on demand to DRAM. Our
method involves constructing an inference cost model that harmonizes with the
flash memory behavior, guiding us to optimize in two critical areas: reducing
the volume of data transferred from flash and reading data in larger, more
contiguous chunks. Within this flash memory-informed framework, we introduce
two principal techniques. First, "windowing" strategically reduces data
transfer by reusing previously activated neurons, and second, "row-column
bundling", tailored to the sequential data access strengths of flash memory,
increases the size of data chunks read from flash memory. These methods
collectively enable running models up to twice the size of the available DRAM,
with a 4-5x and 20-25x increase in inference speed compared to naive loading
approaches in CPU and GPU, respectively. Our integration of sparsity awareness,
context-adaptive loading, and a hardware-oriented design paves the way for
effective inference of LLMs on devices with limited memory.
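
As a rough illustration of the "windowing" idea described in the abstract, the sketch below keeps the FFN neurons activated over the last few tokens resident in DRAM and fetches from flash only the newly needed ones. This is a minimal sketch, not the paper's code: the class name `NeuronWindow`, the `load_from_flash` callback, and the predictor interface are all illustrative assumptions.

```python
# Minimal sketch of "windowing": reuse neurons activated for recent tokens so
# that each new token only triggers an incremental flash -> DRAM transfer.
from collections import deque

class NeuronWindow:
    def __init__(self, window_size, load_from_flash):
        self.window_size = window_size          # number of recent tokens tracked
        self.load_from_flash = load_from_flash  # callback: neuron id -> weights (assumed)
        self.recent = deque()                   # per-token sets of active neuron ids
        self.resident = {}                      # neuron id -> weights kept in DRAM

    def step(self, predicted_active):
        """Update the DRAM-resident neurons for one new token.

        `predicted_active` is the set of neuron ids a sparsity predictor
        expects to fire for this token; returns the ids fetched from flash.
        """
        to_load = [n for n in predicted_active if n not in self.resident]
        for n in to_load:                        # incremental flash -> DRAM transfer
            self.resident[n] = self.load_from_flash(n)

        self.recent.append(set(predicted_active))
        if len(self.recent) > self.window_size:  # oldest token leaves the window
            self.recent.popleft()

        still_needed = set().union(*self.recent)
        for n in list(self.resident):            # evict neurons no recent token uses
            if n not in still_needed:
                del self.resident[n]
        return to_load
```

Because consecutive tokens tend to activate overlapping sets of neurons, most of the predicted-active set is usually already resident, so only a small delta is read from flash at each step, which is what the abstract means by reusing previously activated neurons to reduce data transfer.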
2023-12-27 15:11:00
#paper arXiv:2312.11514v1, 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Large language models (LLMs) are central to modern natural language processing, but their heavy compute and memory demands are a challenge for devices with limited memory. To run LLMs that exceed the available DRAM capacity efficiently, this paper stores the model parameters on flash memory and loads them into DRAM on demand. The approach builds an inference cost model aligned with flash memory behavior and optimizes along two key axes: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks. Within this framework, two main techniques are introduced: "windowing" cuts data transfer by reusing previously activated neurons, while "row-column bundling" exploits flash memory's strength at sequential access to enlarge the chunks read from flash. Together, these methods make it possible to run models up to twice the size of the available DRAM, with inference speedups of 4-5x on CPU and 20-25x on GPU compared with naive loading approaches.
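
To make "row-column bundling" concrete, here is a minimal sketch under an assumed FFN layout y = W_down · act(W_up · x) with W_up of shape (d_ff, d_model) and W_down of shape (d_model, d_ff): neuron i then owns row i of W_up and column i of W_down, and storing those two vectors back-to-back lets one larger sequential flash read fetch everything that neuron needs. The file layout and function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of "row-column bundling": pack each neuron's up-projection row
# and down-projection column into one contiguous record so a single sequential
# read retrieves both.
import numpy as np

def write_bundled(path, w_up, w_down):
    """Store each neuron's up-row and down-column as one contiguous record."""
    d_ff, d_model = w_up.shape
    bundles = np.empty((d_ff, 2 * d_model), dtype=w_up.dtype)
    bundles[:, :d_model] = w_up                  # row i of W_up
    bundles[:, d_model:] = w_down.T              # column i of W_down
    bundles.tofile(path)
    return bundles.itemsize * 2 * d_model        # bytes per neuron record

def read_neuron(path, neuron_id, d_model, dtype=np.float16):
    """One contiguous read returns both halves of neuron `neuron_id`."""
    record = 2 * d_model                         # elements per neuron record
    bundle = np.fromfile(path, dtype=dtype, count=record,
                         offset=neuron_id * record * np.dtype(dtype).itemsize)
    return bundle[:d_model], bundle[d_model:]    # (up-row, down-column)
```

Doubling the size of each contiguous read in this way plays to the sequential-access strength of flash that the paper highlights: fewer, larger reads recover more bandwidth than many small scattered ones.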