Vincent (2025-12-31 20:29):
#paper https://arxiv.org/abs/1706.03762 arXiv 2017. Attention Is All You Need. This classic paper introduces the Transformer, a newly designed sequence transduction model based entirely on attention mechanisms, dispensing with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Through self-attention and multi-head attention it models dependencies between arbitrary positions in a sequence, allowing training to be parallelized at scale without the constraint of step-by-step sequential computation. The Transformer uses a standard encoder-decoder architecture in which both the encoder and decoder are stacks of attention layers and feed-forward layers, and positional encodings inject position information to compensate for the order information lost without a recurrent structure. Experiments show the model significantly outperforms recurrent and convolutional baselines on the WMT 2014 English-to-German and English-to-French translation tasks while training faster, demonstrating strong long-range dependency modeling and laying the groundwork for later large language models and multimodal Transformer architectures.
arXiv, 2017-06-12T17:57:34Z. DOI: 10.48550/arXiv.1706.03762
Attention Is All You Need
Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
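For reference, here is a minimal sketch of the scaled dot-product attention that underlies the multi-head attention described above, using only NumPy; the function name, tensor shapes, and example data are illustrative assumptions, not code from the paper.

```python
# Sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = q.shape[-1]
    # Similarity scores between every query and every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights over the key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ v

# Tiny illustrative example: 3 query positions, 4 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(q, k, v).shape)  # (3, 8)
```

In the paper, multi-head attention runs several such attention functions in parallel on learned linear projections of the queries, keys, and values, then concatenates and projects the results.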