Literature Collection and Sharing Platform

尹志 (2022-06-28 22:16):

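#paper doi:10.1093/nar/gkac010 Nucleic Acids Research, Volume 50, Issue 8, 6 May 2022, AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks

Learning from omics-based biomedical data usually means high-dimensional features and small sample sizes, which is a challenge for today's mainstream deep learning methods. The paper first proposes AggMap, an unsupervised feature aggregation technique that uses the intrinsic correlations among omics features to aggregate them and map them into multi-channel, 2D, spatially correlated feature maps (Fmaps). On a benchmark dataset, AggMap shows strong feature reconstruction ability compared with existing algorithms. The paper then feeds AggMap's multi-channel Fmaps into a multi-channel deep learning model, AggMapNet, which outperforms the SOTA on 18 low-sample omics benchmark datasets. AggMapNet also shows good robustness on noisy data and on disease classification. On the interpretability side, AggMapNet's explanation module, Simply-explainer, can identify key metabolites and proteins for COVID-19 detection and severity prediction. Overall, the paper proposes a pipeline for modeling low-sample omics data: the feature reconstruction ability of the unsupervised AggMap algorithm, plus the supervised, explainable AggMapNet deep learning model.

A few takeaways: this work learns from low-sample omics data through a pipeline that can be read as feature re-representation (AggMap) + a DL network (AggMapNet). The process is not end-to-end; instead it leans heavily on re-representing the features and exploiting the representational power of the new feature space. It feels a bit like going back to basics, but because the data are high-dimensional and features are hard to construct by hand, the feature stage relies on several unsupervised methods: for example, the manifold learning method UMAP, driven by pairwise correlation distances, embeds the omics data points into 2D space, while agglomerative hierarchical clustering groups them into multiple feature clusters. Interestingly, these are all existing, general-purpose unsupervised algorithms. My impression is that manifold-based methods of this kind can reduce dimensionality while largely preserving the metric, extracting useful features for downstream tasks. For small samples, they seem to work fairly well. So one idea: could we synthesize data generatively, learn the embedding representation instead, and then run the downstream task? I'm tempted to try it, though the thought of competing on 18 benchmark datasets is a bit exhausting.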
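The feature stage described in this post (pairwise correlation distance, a 2D embedding, agglomerative clustering into channels) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not AggMap's actual code: classical MDS stands in for UMAP so that the example needs only NumPy, the clustering is a naive average-linkage loop, and the rank-based grid placement, the function names, and the random data are all made up for the sketch.

```python
import numpy as np

def correlation_distance(X):
    """Feature-by-feature distance 1 - Pearson r, from a (samples x features) matrix."""
    return 1.0 - np.corrcoef(X, rowvar=False)

def embed_2d(D):
    """Classical MDS to 2D. Assumption: a simple stand-in for UMAP."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:2]             # two largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))

def agglomerate(D, n_clusters):
    """Naive average-linkage agglomerative clustering on a distance matrix."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)                 # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(D.shape[0], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def to_fmaps(X, coords, labels, n_clusters, side):
    """Put each feature in its own grid cell; one channel per feature cluster."""
    n_samples, n_feat = X.shape
    assert side * side >= n_feat, "grid too small for the number of features"
    rows = np.zeros(n_feat, dtype=int)
    cols = np.zeros(n_feat, dtype=int)
    order = np.argsort(coords[:, 0])             # sort by x to pick columns
    for col in range(side):
        group = order[col * side:(col + 1) * side]
        group = group[np.argsort(coords[group, 1])]  # sort by y within the column
        rows[group] = np.arange(len(group))
        cols[group] = col
    fmaps = np.zeros((n_samples, side, side, n_clusters))
    fmaps[:, rows, cols, labels] = X             # each feature lands in one cell/channel
    return fmaps

# Toy data: 20 samples, 16 features (purely synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 16))

D = correlation_distance(X)
coords = embed_2d(D)
labels = agglomerate(D, n_clusters=3)
fmaps = to_fmaps(X, coords, labels, n_clusters=3, side=4)
print(fmaps.shape)
```

The resulting array has one image-like map per sample, with correlated features placed near each other and split across channels by cluster, which is the kind of input a multi-channel CNN can consume. The heuristics here are chosen for brevity; the real tool may place and cluster features differently.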

IF: 16.600 (Q1). Nucleic Acids Research, 2022-05-06. DOI: 10.1093/nar/gkac010 PMID: 35100418

AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks

Wan Xiang Shen, Yu Liu, Yan Chen, Xian Zeng, Ying Tan, Yu Yang Jiang, Yu Zong Chen

Abstract:

Omics-based biomedical learning frequently relies on data of high-dimensions (up to thousands) and low-sample sizes (dozens to hundreds), which challenges efficient deep learning (DL) algorithms, particularly for low-sample omics investigations. Here, an unsupervised novel feature aggregation tool AggMap was developed to Aggregate and Map omics features into multi-channel 2D spatial-correlated image-like feature maps (Fmaps) based on their intrinsic correlations. AggMap exhibits strong feature reconstruction capabilities on a randomized benchmark dataset, outperforming existing methods. With AggMap multi-channel Fmaps as inputs, newly-developed multi-channel DL AggMapNet models outperformed the state-of-the-art machine learning models on 18 low-sample omics benchmark tasks. AggMapNet exhibited better robustness in learning noisy data and disease classification. The AggMapNet explainable module Simply-explainer identified key metabolites and proteins for COVID-19 detections and severity predictions. The unsupervised AggMap algorithm of good feature restructuring abilities combined with supervised explainable AggMapNet architecture establish a pipeline for enhanced learning and interpretability of low-sample omics data.

Related Links:

https://academic.oup.com/nar/article-pdf/50/8/e45/43547099/gkac010.pdf