前进 (2024-12-31 20:09):
#paper DOI 10.48550/arXiv.2111.06377 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. B. (2021). Masked Autoencoders Are Scalable Vision Learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). This paper proposes a novel self-supervised learning framework, the masked autoencoder (MAE). Its core idea is a random masking strategy: 75% of the image patches are masked, and the model must reconstruct the full image from the remaining 25% of visible patches, forcing it to learn more effective visual features. In addition, MAE uses an asymmetric encoder-decoder architecture: the encoder processes only the unmasked patches, while a lightweight decoder reconstructs the original image from the encoder's output together with positional information for the masked patches. This design greatly reduces computational cost and improves training efficiency. Experiments show that MAE pre-training generalizes well, transfers to a variety of downstream tasks, and scales favorably.
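The random masking step described above can be sketched as follows. This is a minimal NumPy illustration of keeping 25% of patches, not the paper's actual PyTorch implementation; the function name and shapes are my own assumptions.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=None):
    """Randomly keep (1 - mask_ratio) of the patches, MAE-style.

    patches: (num_patches, patch_dim) array of flattened image patches.
    Returns the visible patches, the kept indices, and a binary mask
    (1 = masked) marking the reconstruction targets.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    # Shuffle patch indices and keep the first n_keep as "visible".
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=np.int64)
    mask[keep_idx] = 0
    return patches[keep_idx], keep_idx, mask

# Example: a 224x224 image split into 16x16 patches gives 196 patches;
# with a 75% mask ratio the encoder sees only 49 of them.
patches = np.zeros((196, 16 * 16 * 3))
visible, keep_idx, mask = random_masking(patches, mask_ratio=0.75, seed=0)
```

Because the encoder runs on only a quarter of the patches, its cost per image drops sharply, which is what makes pre-training large ViTs practical.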
arXiv, 2021-11-11T18:46:40Z. DOI: 10.48550/arXiv.2111.06377
Masked Autoencoders Are Scalable Vision Learners
Abstract:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.