(2022-12-31 23:50):
#paper, Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data (2018), DOI: 10.1093/bioinformatics/bty026.
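Sharing an algorithm/tool paper on FSQN (feature specific quantile normalization). The method addresses normalization between transcriptome data from RNA-seq platforms and gene expression data from microarray platforms. This problem matters most when analyzing public data. Common approaches such as taking log2, z-scoring, or median-based correction can pull the data into a common range to some extent, but the distributions still differ, so machine learning models often perform poorly across platforms. The paper discusses where batch effects between platforms come from and, taking an applied angle, compares the weaknesses of existing methods and proposes FSQN. On the test datasets, using common classifiers trained on microarray data, the method reached up to 98% accuracy on the BRCA data and 97% accuracy on the CRC data when assigning subtypes to RNA-seq samples. The authors provide an R package: https://github.com/jenniferfranks/FSQN. I tested it myself: PCA shows that the batch effect is removed fairly well, but I could not reproduce the high machine learning accuracy reported in the paper. Batch correction between platforms and cross-platform use of machine learning models therefore remain open research directions. Thinking more broadly, batch-correction tools could be developed from the same data-correction angle for RNA-seq versus NanoString, RNA-seq versus single-cell sequencing, and microarray versus NanoString.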
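To make the idea concrete, here is a minimal Python sketch of per-feature quantile normalization. It is my own illustration, not the authors' R implementation, and the package may differ in details such as tie handling and interpolation. The function name `fsqn_sketch` is hypothetical, and I assume samples are in rows and genes in columns, with the same genes in the same order in both matrices. For each gene, the RNA-seq values are ranked across samples and replaced by the matching quantiles of that gene's microarray distribution.

```python
import numpy as np

def fsqn_sketch(test, target):
    """Map each gene (column) of `test` onto the distribution of the
    same gene in `target`. Hypothetical helper for illustration only."""
    test = np.asarray(test, dtype=float)
    target = np.asarray(target, dtype=float)
    if test.shape[1] != target.shape[1]:
        raise ValueError("test and target must have the same genes (columns)")

    n_test = test.shape[0]
    # Mid-point plotting positions, one per test sample rank (an assumption).
    probs = (np.arange(n_test) + 0.5) / n_test

    out = np.empty_like(test)
    for j in range(test.shape[1]):
        # Sample indices of gene j ordered from lowest to highest value.
        order = np.argsort(test[:, j], kind="stable")
        # Quantiles of the target distribution for the same gene.
        ref_quantiles = np.quantile(target[:, j], probs)
        # The k-th smallest test value receives the k-th reference quantile.
        out[order, j] = ref_quantiles
    return out

# Toy data: 50 "microarray" samples and 30 "RNA-seq" samples, 3 genes.
rng = np.random.default_rng(0)
microarray = rng.normal(5, 2, size=(50, 3))
rnaseq = rng.normal(100, 30, size=(30, 3))
normalized = fsqn_sketch(rnaseq, microarray)
```

After this step each gene in `normalized` keeps the sample ordering of the RNA-seq data but takes values from the microarray distribution, which is what lets a classifier trained on microarray data be applied to it. Ties are broken by sample order here, which is a simplification.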
Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data
Abstract:
Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).
Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite different platforms used for gene expression profiling.
Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on Github (https://github.com/jenniferfranks/FSQN).
Contact: michael.l.whitfield@dartmouth.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.