林海onrush
(2026-03-31 20:08):
#paper, Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation, DOI: 10.48550/arXiv.2412.08139. The paper proposes replacing KL Divergence, long the default in knowledge distillation, with the Wasserstein Distance (WD). The authors argue that KL only aligns probabilities category by category and cannot explicitly exploit the similarity structure among categories, and that for intermediate-layer feature distillation it handles high-dimensional, sparse, non-overlapping distributions poorly. They therefore design WKD-L, based on discrete WD, for logit distillation, and WKD-F, based on continuous WD, for feature distillation. On ImageNet, CIFAR-100, Self-KD, and MS-COCO, both methods outperform a range of KL-based methods and strong baselines, showing that WD is not only usable for knowledge distillation but in many settings even better than KL divergence.
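To make the logit side concrete, here is a minimal sketch (not the authors' implementation) of a discrete-WD logit distillation loss in PyTorch, using entropy-regularized Sinkhorn iterations. The class-to-class cost matrix `cost`, the temperature `tau`, and the regularization `eps` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sinkhorn_wd(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularized discrete Wasserstein distance between probability
    vectors p, q of shape [B, K], under a class-to-class ground cost [K, K].
    Returns a per-sample distance of shape [B]."""
    kernel = torch.exp(-cost / eps)                 # Gibbs kernel, [K, K]
    u = torch.ones_like(p)
    for _ in range(n_iters):
        v = q / (u @ kernel + 1e-9)                 # enforce column marginals
        u = p / (v @ kernel.T + 1e-9)               # enforce row marginals
    plan = u.unsqueeze(2) * kernel.unsqueeze(0) * v.unsqueeze(1)   # [B, K, K]
    return (plan * cost.unsqueeze(0)).sum(dim=(1, 2))

def wd_logit_loss(student_logits, teacher_logits, cost, tau=4.0):
    """Logit distillation by cross-category transport instead of
    category-wise KL: teacher probability mass is moved onto the student's
    distribution at the price given by `cost`."""
    p = F.softmax(teacher_logits / tau, dim=1)
    q = F.softmax(student_logits / tau, dim=1)
    return sinkhorn_wd(p, q, cost).mean()
```

One plausible (assumed, not from the paper) choice of `cost` is one minus the cosine similarity between the teacher's classifier weight vectors, so that confusing a class with a semantically close one is cheaper than confusing it with a distant one.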
arXiv, 2024/12/11.
Wasserstein Distance Rivals Kullback-Leibler Divergence for Knowledge Distillation
Abstract:
Since pioneering work of Hinton et al., knowledge distillation based on Kullback-Leibler Divergence (KL-Div) has been predominant, and recently its variants have achieved compelling performance. However, KL-Div only compares probabilities of the corresponding category between the teacher and student while lacking a mechanism for cross-category comparison. Besides, KL-Div is problematic when applied to intermediate layers, as it cannot handle non-overlapping distributions and is unaware of geometry of the underlying manifold. To address these downsides, we propose a methodology of Wasserstein Distance (WD) based knowledge distillation. Specifically, we propose a logit distillation method called WKD-L based on discrete WD, which performs cross-category comparison of probabilities and thus can explicitly leverage rich interrelations among categories. Moreover, we introduce a feature distillation method called WKD-F, which uses a parametric method for modeling feature distributions and adopts continuous WD for transferring knowledge from intermediate layers. Comprehensive evaluations on image classification and object detection have shown (1) for logit distillation WKD-L outperforms very strong KL-Div variants; (2) for feature distillation WKD-F is superior to the KL-Div counterparts and state-of-the-art competitors. The source code is available at https://peihuali.org/WKD
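For the feature side, a hedged sketch of what "parametric modeling plus continuous WD" can look like: fit a Gaussian over the spatial positions of each intermediate feature map and match student to teacher with the closed-form 2-Wasserstein distance. The diagonal-covariance simplification and the helper names (`wd_feature_loss`, `gaussian_w2_diag`) are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def gaussian_w2_diag(mu_s, var_s, mu_t, var_t, eps=1e-6):
    """Squared 2-Wasserstein distance between diagonal Gaussians:
    ||mu_s - mu_t||^2 + ||sqrt(var_s) - sqrt(var_t)||^2."""
    mean_term = (mu_s - mu_t).pow(2).sum(dim=-1)
    std_term = (torch.sqrt(var_s + eps) - torch.sqrt(var_t + eps)).pow(2).sum(dim=-1)
    return mean_term + std_term

def wd_feature_loss(feat_s, feat_t):
    """Fit one Gaussian per sample over the spatial positions of an
    intermediate feature map and match student to teacher via closed-form W2.
    feat_*: [B, C, H, W]; the student map is assumed already projected to the
    teacher's channel count C."""
    s = feat_s.flatten(2)          # [B, C, HW]
    t = feat_t.flatten(2)
    mu_s, var_s = s.mean(dim=2), s.var(dim=2)
    mu_t, var_t = t.mean(dim=2), t.var(dim=2)
    return gaussian_w2_diag(mu_s, var_s, mu_t, var_t).mean()
```

Unlike KL, this distance stays well defined even when the two fitted distributions barely overlap, which is the failure mode the abstract attributes to KL-Div at intermediate layers.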