Vincent (2025-01-31 14:05):
#paper https://doi.org/10.48550/arXiv.2111.06377 arXiv, 2021. Masked Autoencoders Are Scalable Vision Learners. A classic paper in computer vision that proposes a simple, fast, and effective model, the Masked Autoencoder (MAE). The core idea is to randomly mask regions of an image and train the model to reconstruct the masked regions. MAE consists of an asymmetric encoder and decoder: the encoder maps only the visible regions of the image into a latent space, and the decoder reconstructs the original image from the latent representations together with mask tokens. Remarkably, even when 75% of the image is masked, the reconstruction stays close to the original, which suggests that the information in natural images is highly redundant. Moreover, because the encoder processes only a fraction of the image, MAE greatly accelerates training, and thanks to self-supervised learning and stronger learned representations, it also performs better on downstream tasks. Notably, this "predict the masked region" technique has long been used in language models; this paper brings it to CV, showing that computer vision can also advance by borrowing research ideas from NLP.
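The random-masking step described above can be sketched in a few lines. This is a minimal illustration using numpy, not the paper's actual code; `random_masking` is a hypothetical helper, and the shapes assume a ViT-style 14×14 patch grid for a 224×224 image:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly keep a subset of patches, as in MAE's masking step.

    patches: (num_patches, dim) array of patch embeddings.
    Returns the visible patches, their indices, and a binary mask
    (1 = masked) that the decoder would use to place mask tokens.
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=np.int64)
    mask[keep_idx] = 0                  # 0 = visible, 1 = masked
    return patches[keep_idx], keep_idx, mask

# 196 patches: a 14x14 grid of 16x16 patches from a 224x224 image
patches = np.random.randn(196, 768)
visible, keep_idx, mask = random_masking(patches, mask_ratio=0.75)
# The encoder runs only on `visible` (49 of 196 patches), which is
# where MAE's large training speedup comes from.
```

With a 75% mask ratio the encoder sees only a quarter of the tokens, which is what makes the asymmetric design cheap to train at scale.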
arXiv, 2021-11-11T18:46:40Z. DOI: 10.48550/arXiv.2111.06377
Masked Autoencoders Are Scalable Vision Learners
Abstract:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.