Vincent
(2023-06-30 15:00):
#paper https://www.nature.com/articles/s41467-018-04608-8, Nature Communications 2018, Exploring patterns enriched in a dataset with contrastive principal component analysis
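PCA (principal component analysis) maps high-dimensional data to a low-dimensional space and is the most commonly used tool for data exploration and visualization. However, PCA (like other dimensionality-reduction methods such as t-SNE and UMAP) can only analyze one dataset at a time. When working with several datasets, especially when looking for signal specific to one of them, using PCA means manually comparing the projections of the different datasets to try to spot their similarities and differences. This paper proposes a simple and effective dimensionality-reduction method for this kind of problem: contrastive PCA (cPCA). The method looks for a projection that makes the difference between a target dataset and a background dataset as large as possible, thereby enriching signal that is specific to the target dataset. Its principle and implementation are similar to those of PCA, and the experiments show that it effectively uncovers target-specific signal that PCA misses. The paper also details the method's theoretical foundation and geometric interpretation, and points out that it can be applied in many of the settings where PCA is used.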
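To make the "target vs. background" idea concrete, here is a minimal NumPy sketch, assuming the usual cPCA formulation: take the top eigenvectors of the target covariance minus `alpha` times the background covariance. The function name `cpca`, the fixed `alpha` values, and the synthetic data are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np


def cpca(target, background, alpha=1.0, n_components=2):
    """Illustrative contrastive PCA: top eigenvectors of C_target - alpha * C_background."""
    # np.cov centers each dataset on its own mean; rowvar=False treats columns as features.
    cov_target = np.cov(target, rowvar=False)
    cov_background = np.cov(background, rowvar=False)
    # The contrast matrix is symmetric, so eigh applies (eigenvalues come back ascending).
    eigvals, eigvecs = np.linalg.eigh(cov_target - alpha * cov_background)
    order = np.argsort(eigvals)[::-1][:n_components]
    directions = eigvecs[:, order]
    # Project the mean-centered target data onto the contrastive directions.
    projected = (target - target.mean(axis=0)) @ directions
    return projected, directions


# Synthetic data (made up for illustration): feature 0 carries large variance
# shared by both datasets; feature 1 carries a two-group split only in the target.
rng = np.random.default_rng(0)
scales = np.array([5.0, 1.0, 1.0, 1.0, 1.0])
background = rng.normal(size=(2000, 5)) * scales
target = rng.normal(size=(2000, 5)) * scales
labels = rng.integers(0, 2, size=2000)
target[:, 1] += np.where(labels == 1, 4.0, -4.0)

_, pca_dirs = cpca(target, background, alpha=0.0)   # alpha = 0 reduces to ordinary PCA
_, cpca_dirs = cpca(target, background, alpha=1.0)  # alpha = 1 is an arbitrary choice here

print("PCA  top direction:", np.round(pca_dirs[:, 0], 2))
print("cPCA top direction:", np.round(cpca_dirs[:, 0], 2))
```

In this toy setup, ordinary PCA's first direction should line up with the shared high-variance feature 0, while the contrastive direction should line up with feature 1, where the target-only group structure lives. The contrast strength `alpha` is a hyperparameter, so in practice it would likely need to be explored over a range of values rather than fixed by hand as done here.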
Exploring patterns enriched in a dataset with contrastive principal component analysis
Abstract:
Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.