Vincent (2025-01-31 14:05):
#paper https://doi.org/10.48550/arXiv.2111.06377 arXiv, 2021. Masked Autoencoders Are Scalable Vision Learners. A classic paper in computer vision that proposes a simple, fast, and effective model, the Masked Autoencoder (MAE). The core idea is to randomly mask regions of an image and train the model to reconstruct the masked regions. MAE consists of an asymmetric encoder and decoder: the encoder maps only the visible regions of the image into a latent space, and the decoder reconstructs the original image from the latent representations together with mask tokens. Remarkably, even when 75% of the image is masked, the reconstruction stays close to the original, which suggests that the information in natural images is highly redundant. Moreover, because the encoder processes only a fraction of the image, MAE greatly accelerates training, and thanks to self-supervised learning and stronger learned representations, it also performs better on downstream tasks. Notably, this "predict the masked region" technique has long been used in language models; this paper brings it to CV, showing that computer vision can also advance by borrowing research ideas from NLP.
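The random-masking step described above can be sketched in a few lines. This is a minimal illustration using numpy, not the paper's actual code; `random_masking` is a hypothetical helper, and the shapes assume a ViT-style 14×14 patch grid for a 224×224 image:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly keep a subset of patches, as in MAE's masking step.

    patches: (num_patches, dim) array of patch embeddings.
    Returns the visible patches, their indices, and a binary mask
    (1 = masked) that the decoder would use to place mask tokens.
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=np.int64)
    mask[keep_idx] = 0                  # 0 = visible, 1 = masked
    return patches[keep_idx], keep_idx, mask

# 196 patches: a 14x14 grid of 16x16 patches from a 224x224 image
patches = np.random.randn(196, 768)
visible, keep_idx, mask = random_masking(patches, mask_ratio=0.75)
# The encoder runs only on `visible` (49 of 196 patches), which is
# where MAE's large training speedup comes from.
```

With a 75% mask ratio the encoder sees only a quarter of the tokens, which is what makes the asymmetric design cheap to train at scale.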
arXiv, 2021-11-11T18:46:40Z. DOI: 10.48550/arXiv.2111.06377
Masked Autoencoders Are Scalable Vision Learners
Abstract:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.