尹志
(2022-06-28 22:16):
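#paper doi:10.1093/nar/gkac010 Nucleic Acids Research, Volume 50, Issue 8, 6 May 2022, AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks. Omics-based biomedical learning typically relies on high-dimensional features and small sample sizes, which is a challenge for current mainstream deep learning methods. The paper first proposes AggMap, an unsupervised feature aggregation technique that uses the intrinsic correlations among omics features to aggregate them and map them into multi-channel, spatially correlated 2D feature maps (Fmaps). On a benchmark dataset, AggMap shows strong feature reconstruction ability compared with existing algorithms. The paper then feeds AggMap's multi-channel Fmaps into a multi-channel deep learning model, AggMapNet, which outperforms the SOTA on 18 low-sample omics benchmark datasets. AggMapNet also shows good robustness on noisy data and in disease classification. On the interpretability side, AggMapNet's explanation module, Simply-explainer, can identify key metabolites and proteins for COVID-19 detection and severity prediction.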
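Overall, the paper proposes a pipeline for modeling low-sample omics data: the feature reconstruction ability of the unsupervised AggMap algorithm, plus the explainable, supervised AggMapNet deep learning model.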
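A few takeaways: this work learns from low-sample omics data through a pipeline that can be read as feature re-representation (AggMap) + DL network (AggMapNet). The process is not end-to-end; instead it leans heavily on re-representing the features and on the representational power of the new feature space. It feels a bit like a return to basics, but because the data are high-dimensional and features are hard to craft by hand, the feature stage relies on several unsupervised methods: UMAP, a manifold learning method based on pairwise correlation distances, embeds the omics data points into 2D space, and agglomerative hierarchical clustering groups the omics data points into multiple feature clusters. Interestingly, these are all existing, general-purpose unsupervised algorithms. My impression is that manifold-based methods of this kind can reduce dimensionality while roughly preserving the metric, extracting useful features for downstream tasks. For small samples, such methods seem to work fairly well. One idea, then: could we synthesize data generatively, learn this embedding representation from it, and then do the downstream task? I'm somewhat tempted to try, though the thought of competing on 18 benchmark datasets is a bit exhausting.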
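To make the "feature re-representation" step more concrete, here is a minimal sketch of the general idea, not AggMap's actual implementation. It uses only the Python standard library: features are compared by correlation distance, grouped by average-linkage agglomerative clustering (one channel per cluster), and laid out on a square grid. The UMAP embedding is replaced by a much cruder stand-in (features are placed on the grid in cluster order), and all names (`fit_layout`, `to_fmap`, the toy data) are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def correlation_distances(X):
    """Pairwise feature distances d = 1 - r over the columns of X (rows are samples)."""
    cols = list(zip(*X))
    p = len(cols)
    return [[0.0 if i == j else 1.0 - pearson(cols[i], cols[j]) for j in range(p)]
            for i in range(p)]


def agglomerate(D, k):
    """Naive average-linkage agglomerative clustering of features down to k clusters."""
    clusters = [[i] for i in range(len(D))]

    def link(a, b):
        return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(k, 1):
        # Merge the two clusters with the smallest average pairwise distance.
        _, i, j = min((link(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i].extend(clusters.pop(j))
    return clusters


def fit_layout(X, k):
    """Assign each feature a grid cell and a channel (ASSUMPTION: cluster order, not UMAP)."""
    clusters = agglomerate(correlation_distances(X), k)
    order = [f for members in clusters for f in members]
    side = math.isqrt(len(order) - 1) + 1  # smallest square grid that fits all features
    position = {f: divmod(rank, side) for rank, f in enumerate(order)}
    channel = {f: c for c, members in enumerate(clusters) for f in members}
    return position, channel, side, len(clusters)


def to_fmap(x, position, channel, side, n_channels):
    """Scatter one sample's feature values into a [channel][row][col] feature map."""
    fmap = [[[0.0] * side for _ in range(side)] for _ in range(n_channels)]
    for f, value in enumerate(x):
        row, col = position[f]
        fmap[channel[f]][row][col] = float(value)
    return fmap


# Toy data: 8 samples, 4 features; features 0 and 2 follow `a`, features 1 and 3 follow `b`.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, -1, 1, -1, 1, -1, 1, -1]
X = [[ai, bi, 2 * ai + 1, 3 * bi] for ai, bi in zip(a, b)]

position, channel, side, n_channels = fit_layout(X, k=2)
fmap = to_fmap(X[0], position, channel, side, n_channels)
print(channel)  # correlated features share a channel
print(fmap)
```

In this toy run the two correlated feature pairs end up in separate channels and in adjacent grid cells, so a convolutional model would see related features as neighbors. A real pipeline would use an actual 2D embedding (UMAP in the paper's case) to decide the grid positions.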
AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks
Abstract:
Omics-based biomedical learning frequently relies on data of high-dimensions (up to thousands) and low-sample sizes (dozens to hundreds), which challenges efficient deep learning (DL) algorithms, particularly for low-sample omics investigations. Here, an unsupervised novel feature aggregation tool AggMap was developed to Aggregate and Map omics features into multi-channel 2D spatial-correlated image-like feature maps (Fmaps) based on their intrinsic correlations. AggMap exhibits strong feature reconstruction capabilities on a randomized benchmark dataset, outperforming existing methods. With AggMap multi-channel Fmaps as inputs, newly-developed multi-channel DL AggMapNet models outperformed the state-of-the-art machine learning models on 18 low-sample omics benchmark tasks. AggMapNet exhibited better robustness in learning noisy data and disease classification. The AggMapNet explainable module Simply-explainer identified key metabolites and proteins for COVID-19 detections and severity predictions. The unsupervised AggMap algorithm of good feature restructuring abilities combined with supervised explainable AggMapNet architecture establish a pipeline for enhanced learning and interpretability of low-sample omics data.