Papers from the journal Genome Biology.
17 shared papers found so far.
1.
白鸟 (2024-07-31 22:52):
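#paper, DOI: 10.1186/s13059-020-02116-x, Integrative analyses of single-cell transcriptome and regulome using MAESTRO. A tool paper published by Xiaole Shirley Liu's lab in 2020. I read it mainly to see what is novel in its scATAC analysis and how it compares with other software. 1. The MAESTRO workflow supports end-to-end analysis of single-cell transcriptome plus ATAC data, accommodates different single-cell platforms, and connects upstream and downstream analysis; 2. Chromatin accessibility: accessibility is modeled at the gene level, with strong prediction of transcriptional regulators; 3. Automatic cell-type annotation, an optimized differential gene analysis step, and inference of transcriptional regulators; 4. The workflow runs through Snakemake, and several analysis steps are worth borrowing; I have not yet read the scATAC code. Weaknesses: the software was not maintained later on, and the paper's citation count is low. When studying the code, note that the software calls various other packages, which also need to be understood.
IF: 10.100 Q1 Genome Biology, 2020-08-07. DOI: 10.1186/s13059-020-02116-x PMID: 32767996 PMCID: PMC7412809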
Abstract:
We present Model-based AnalysEs of Transcriptome and RegulOme (MAESTRO), a comprehensive open-source computational workflow ( http://github.com/liulab-dfci/MAESTRO ) for the integrative analyses of single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data from multiple platforms. MAESTRO provides functions for pre-processing, alignment, quality control, expression and chromatin accessibility quantification, clustering, differential analysis, and annotation. By modeling gene regulatory potential from chromatin accessibilities at the single-cell level, MAESTRO outperforms the existing methods for integrating the cell clusters between scRNA-seq and scATAC-seq. Furthermore, MAESTRO supports automatic cell-type annotation using predefined cell type marker genes and identifies driver regulators from differential scRNA-seq genes and scATAC-seq peaks.
2.
徐炳祥 (2024-04-30 13:16):
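#paper doi: 10.1186/s13059-020-02167-0 Genome Biology, 2020, Mustache: multi-scale detection of chromatin loops from Hi-C and Micro-C maps using scale-space representation. Chromatin loops are an important component of the spatial conformation of chromatin and the physical basis of promoter-enhancer interactions. Detecting chromatin loops from Hi-C data is a central problem in current 3D genomics. Building on scale-space blob detection from computer vision, this paper develops a highly sensitive and highly stable loop-detection algorithm that works on chromatin interaction maps. The algorithm is the latest work in the local-maximum-search line of approaches, and it greatly increases sensitivity while maintaining the accuracy of loop detection. Its overall performance is the best among algorithms of this kind.
IF: 10.100 Q1 Genome Biology, 2020-09-30. DOI: 10.1186/s13059-020-02167-0 PMID: 32998764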
Abstract:
We present MUSTACHE, a new method for multi-scale detection of chromatin loops from Hi-C and Micro-C contact maps. MUSTACHE employs scale-space theory, a technical advance in computer vision, to detect blob-shaped objects in contact maps. MUSTACHE is scalable to kilobase-resolution maps and reports loops that are highly consistent between replicates and between Hi-C and Micro-C datasets. Compared to other loop callers, such as HiCCUPS and SIP, MUSTACHE recovers a higher number of published ChIA-PET and HiChIP loops as well as loops linking promoters to regulatory elements. Overall, MUSTACHE enables an efficient and comprehensive analysis of chromatin loops. Available at: https://github.com/ay-lab/mustache .
3.
徐炳祥 (2023-11-28 11:05):
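#paper doi: 10.1186/s13059-023-03088-4 Genome Biology, 2023, CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. This paper presents a complete annotation of human genes and transcripts that also covers the latest complete human genome sequence (the T2T genome). The authors collected and analyzed nearly 10,000 RNA-seq datasets from 54 body sites and assembled all possible transcripts. They then filtered these transcripts using coding-potential prediction models based on sequence features and on machine learning, the tissue specificity of transcript expression, and the plausibility of the encoded protein structure (based on AlphaFold2 predictions), arriving at 41,356 genes and 158,377 transcripts. The result is an important foundational resource for genomics, and the methodology is a useful reference for RNA-seq-based research.
IF: 10.100 Q1 Genome Biology, 2023-10-30. DOI: 10.1186/s13059-023-03088-4 PMID: 37904256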
Abstract:
CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess .
4.
徐炳祥 (2023-09-24 14:05):
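#paper doi: 10.1186/s13059-023-03019-3 Genome Biology, 2023, The relationship between regulatory changes in cis and trans and the evolution of gene expression in humans and chimpanzees. Comparative genomics in primates has long progressed slowly because of ethical constraints and the difficulty of obtaining material. Embryoid bodies (EBs) derived from iPSCs are good material for such studies. The authors derived EBs from human and chimpanzee iPSCs and compared gene expression using single-cell RNA-seq. The results show that the EBs already contain a large number of known cell types. Human-chimpanzee differential expression analysis within each cell type identified a set of differentially expressed genes. Comparison with sequence conservation and gene function annotation showed that genes differing in multiple cell types are associated with lower sequence conservation and with human-chimpanzee differences, whereas genes with conserved expression across cell types are concentrated in basic life processes. The authors further built EBs from a human-chimpanzee fused cell line and used them to decompose expression differences into cis and trans components, showing that trans differences are related to differences between the gene regulatory networks of the two species. As a study combining wet-lab and computational work, this paper has only 3 authors, and the amount of newly generated data is modest. It is a good example of how a group with limited resources or scale can address a major biological question by playing to its strengths and optimizing experimental design rather than piling up data.
IF: 10.100 Q1 Genome Biology, 2023-09-11. DOI: 10.1186/s13059-023-03019-3 PMID: 37697401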
Abstract:
BACKGROUND: Comparative gene expression studies in apes are fundamentally limited by the challenges associated with sampling across different tissues. Here, we used single-cell RNA sequencing of embryoid bodies to collect transcriptomic data from over 70 cell types in three humans and three chimpanzees. RESULTS: We find hundreds of genes whose regulation is conserved across cell types, as well as genes whose regulation likely evolves under directional selection in one or a handful of cell types. Using embryoid bodies from a human-chimpanzee fused cell line, we also infer the proportion of inter-species regulatory differences due to changes in cis and trans elements between the species. Using the cis/trans inference and an analysis of transcription factor binding sites, we identify dozens of transcription factors whose inter-species differences in expression are affecting expression differences between humans and chimpanzees in hundreds of target genes. CONCLUSIONS: Here, we present the most comprehensive dataset of comparative gene expression from humans and chimpanzees to date, including a catalog of regulatory mechanisms associated with inter-species differences.
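The cis/trans decomposition mentioned in this entry can be illustrated with a minimal sketch. It follows a convention commonly used in allele-specific expression studies and is not necessarily the statistical model used by the authors: in the fused cell line both species' alleles share one trans environment, so the allelic log-ratio is read as the cis effect, and whatever remains of the between-species difference is attributed to trans. The function name and all numbers below are hypothetical.

```python
import math

def cis_trans(parent_a, parent_b, hybrid_a, hybrid_b):
    """Split a log2 expression difference into cis and trans parts.

    parent_a, parent_b: expression of one gene in each parental species.
    hybrid_a, hybrid_b: allele-specific expression of each species' allele
                        in the fused (hybrid) cell line.
    """
    total = math.log2(parent_a / parent_b)  # full between-species difference
    cis = math.log2(hybrid_a / hybrid_b)    # alleles share one trans environment
    trans = total - cis                     # remainder attributed to trans factors
    return total, cis, trans

# Hypothetical expression values for a single gene (not from the paper).
total, cis, trans = cis_trans(parent_a=80.0, parent_b=20.0,
                              hybrid_a=60.0, hybrid_b=30.0)
print(f"total={total:.2f} cis={cis:.2f} trans={trans:.2f}")
```

A real analysis would fit this per gene and per cell type with replicates and a statistical test; the sketch only shows the arithmetic behind the decomposition.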
5.
哪有情可长 (2023-06-30 16:14):
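#paper Characterization of novel loci controlling seed oil content in Brassica napus by marker metabolite-based multi-omics analysis, Genome Biology, 19 June 2023, doi.org/10.1186/s13059-023-02984-z. Using widely targeted metabolite profiling, the paper detected 2172 metabolites in total, correlated them with seed oil content in rapeseed, and identified 131 metabolites highly correlated with oil content, which serve as marker metabolites for oil content. A genome-wide association study on these 131 marker metabolites identified 446 mQTL loci; combined with a population-level transcriptome-wide association analysis, 7316 genes significantly associated with the oil-content markers were identified. The authors then picked out a gene that catalyzes the first step of flavonoid biosynthesis and validated it experimentally. For now, data integration in crops still has to end with finding genes and validating them.
IF: 10.100 Q1 Genome Biology, 2023-06-19. DOI: 10.1186/s13059-023-02984-z PMID: 37337206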
Abstract:
BACKGROUND: Seed oil content is an important agronomic trait of Brassica napus (B. napus), and metabolites are considered as the bridge between genotype and phenotype for physical traits. RESULTS: Using a widely targeted metabolomics analysis in a natural population of 388 B. napus inbred lines, we quantify 2172 metabolites in mature seeds by liquid chromatography mass spectrometry, in which 131 marker metabolites are identified to be correlated with seed oil content. These metabolites are then selected for further metabolite genome-wide association study and metabolite transcriptome-wide association study. Combined with weighted correlation network analysis, we construct a triple relationship network, which includes 21,000 edges and 4384 nodes among metabolites, metabolite quantitative trait loci, genes, and co-expression modules. We validate the function of BnaA03.TT4, BnaC02.TT4, and BnaC05.UK, three candidate genes predicted by multi-omics analysis, which show significant impacts on seed oil content through regulating flavonoid metabolism in B. napus. CONCLUSIONS: This study demonstrates the advantage of utilizing marker metabolites integrated with multi-omics analysis to dissect the genetic basis of agronomic traits in crops.
6.
徐炳祥 (2023-06-25 09:39):
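#paper doi:10.1186/s13059-023-02970-5 Genome Biology, 2023, Genomic and epigenomic determinants of heat stress‑induced transcriptional memory in Arabidopsis. Heat stress is an environmental pressure that plant cells frequently face, and it triggers a large-scale transcriptional response. Heat stress-induced transcriptional memory is an important regulatory mode within that response, but how it forms is still unclear. This paper addresses the question at the epigenetic level by combining HSFA2 and HSFA3 binding sites, H3K4me3 signal distribution, ATAC-seq chromatin accessibility, and gene expression profiles measured after two successive heat stress treatments. The results show that genes with heat stress-induced memory behavior have a characteristic heat shock factor binding pattern, low expression at normal temperature despite an accessible promoter region, and H3K4me3 enrichment after stress. The paper offers a paradigm that other epigenetic studies of stimulus responses can borrow from.
IF: 10.100 Q1 Genome Biology, 2023-05-30. DOI: 10.1186/s13059-023-02970-5 PMID: 37254211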
Abstract:
BACKGROUND: Transcriptional regulation is a key aspect of environmental stress responses. Heat stress induces transcriptional memory, i.e., sustained induction or enhanced re-induction of transcription, that allows plants to respond more efficiently to a recurrent HS. In light of more frequent temperature extremes due to climate change, improving heat tolerance in crop plants is an important breeding goal. However, not all heat stress-inducible genes show transcriptional memory, and it is unclear what distinguishes memory from non-memory genes. To address this issue and understand the genome and epigenome architecture of transcriptional memory after heat stress, we identify the global target genes of two key memory heat shock transcription factors, HSFA2 and HSFA3, using time course ChIP-seq. RESULTS: HSFA2 and HSFA3 show near identical binding patterns. In vitro and in vivo binding strength is highly correlated, indicating the importance of DNA sequence elements. In particular, genes with transcriptional memory are strongly enriched for a tripartite heat shock element, and are hallmarked by several features: low expression levels in the absence of heat stress, accessible chromatin environment, and heat stress-induced enrichment of H3K4 trimethylation. These results are confirmed by an orthogonal transcriptomic data set using both de novo clustering and an established definition of memory genes. CONCLUSIONS: Our findings provide an integrated view of HSF-dependent transcriptional memory and shed light on its sequence and chromatin determinants, enabling the prediction and engineering of genes with transcriptional memory behavior.
7.
徐炳祥 (2023-02-27 14:05):
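#paper doi: 10.1186/gb-2012-13-10-r98, 2012, CHANCE: comprehensive software for quality control and validation of ChIP-seq data. ChIP-seq is currently the most popular high-throughput method for mapping where a given protein binds on the genome, and it is a routine technique in epigenetics. This older paper reviews the common sources of experimental error in ChIP-seq, including antibody activity, immunoprecipitation efficiency, bias introduced by PCR, and errors introduced during library preparation and sequencing. It gives a practical assessment strategy for each, and the accompanying tool is useful for checking ChIP-seq library quality and for diagnosing why a library failed.
IF: 10.100 Q1 Genome Biology, 2012-Oct-15. DOI: 10.1186/gb-2012-13-10-r98 PMID: 23068444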
Abstract:
ChIP-seq is a powerful method for obtaining genome-wide maps of protein-DNA interactions and epigenetic modifications. CHANCE (CHip-seq ANalytics and Confidence Estimation) is a standalone package for ChIP-seq quality control and protocol optimization. Our user-friendly graphical software quickly estimates the strength and quality of immunoprecipitations, identifies biases, compares the user's data with ENCODE's large collection of published datasets, performs multi-sample normalization, checks against quantitative PCR-validated control regions, and produces informative graphical reports. CHANCE is available at https://github.com/songlab/chance.
8.
Vincent (2023-01-31 14:45):
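#paper doi:https://doi.org/10.1186/s13059-021-02388-x Gene set enrichment analysis for genome-wide DNA methylation data. Genome Biology 2021. Methylation arrays cost less than WGBS and are widely used to measure DNA methylation. Earlier work focused mainly on array data processing and differential methylation analysis and paid less attention to gene set enrichment analysis. This paper proposes gene set enrichment methods that operate on differential methylation results: GOmeth (for probe-level differential analysis) and GOregion (for region-level differential analysis). Specifically, CpG sites are not evenly distributed across the genome, and the number of CpG sites near each gene varies. As a result, when nearby genes are selected for enrichment analysis based on differential methylation, genes with more CpGs are more likely to be picked, which biases the enrichment analysis. In addition, a single CpG site can lie near several genes (about 8% of all sites), so the differentially methylated genes are not obtained independently, which introduces a second bias. The proposed approach adjusts the CpG weights and the statistical distribution used in the enrichment test. Using simulation and resampling, the paper examines how the two biases affect gene set enrichment analysis, and shows that the proposed methods control the false discovery rate (FDR) well while yielding more biologically meaningful pathway results.
IF: 10.100 Q1 Genome Biology, 2021-06-08. DOI: 10.1186/s13059-021-02388-x PMID: 34103055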
Abstract:
DNA methylation is one of the most commonly studied epigenetic marks, due to its role in disease and development. Illumina methylation arrays have been extensively used to measure methylation across the human genome. Methylation array analysis has primarily focused on preprocessing, normalization, and identification of differentially methylated CpGs and regions. GOmeth and GOregion are new methods for performing unbiased gene set testing following differential methylation analysis. Benchmarking analyses demonstrate GOmeth outperforms other approaches, and GOregion is the first method for gene set testing of differentially methylated regions. Both methods are publicly available in the missMethyl Bioconductor R package.
9.
Vincent (2022-12-31 17:51):
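#paper DNA methylation aging clocks: challenges and recommendations, Genome Biology, 2019, https://doi.org/10.1186/s13059-019-1824-y Aging usually comes with disease, and understanding why and how humans age is an important topic in biology. Aging is accompanied by molecular changes. Over the past decade, many studies have found that the methylation levels of a subset of CpG sites in the genome can predict age accurately; such a set of CpG sites is called an epigenetic clock. The prediction error of epigenetic clocks has also been linked to disease incidence and mortality, which has drawn broad interest from researchers. This review summarizes seven major challenges in the epigenetic clock field and, for each, describes the state of research, the uncertainties, and recommended directions: 1. Separating the chronological and biological components of epigenetic clocks; 2. Functional studies of tissue-specific or disease-specific clocks; 3. Integrating epigenetics into large longitudinal population studies; 4. Genome-wide analyses of aging and exploration of other epigenetic marks; 5. Single-cell omics analyses of aging and disease; 6. Robust generation of aging data in other species; 7. Integrating epigenetics into the ethical and legal frameworks of genetics. Personally, I found the paper's quality mediocre.
IF: 10.100 Q1 Genome Biology, 2019-11-25. DOI: 10.1186/s13059-019-1824-y PMID: 31767039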
Abstract:
Epigenetic clocks comprise a set of CpG sites whose DNA methylation levels measure subject age. These clocks are acknowledged as a highly accurate molecular correlate of chronological age in humans and other vertebrates. Also, extensive research is aimed at their potential to quantify biological aging rates and test longevity or rejuvenating interventions. Here, we discuss key challenges to understand clock mechanisms and biomarker utility. This requires dissecting the drivers and regulators of age-related changes in single-cell, tissue- and disease-specific models, as well as exploring other epigenomic marks, longitudinal and diverse population studies, and non-human models. We also highlight important ethical issues in forensic age determination and predicting the trajectory of biological aging in an individual.
10.
徐炳祥 (2022-12-31 14:45):
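#paper doi: 10.1186/s13059-022-02835-3 Genome Biology, 2022, NetAct: a computational platform to construct core transcription factor regulatory networks using gene activity. How to construct gene regulatory networks has always been a major problem in systems biology. Current methods are mostly based on gene expression data, yet a transcription factor's function often shows up in ways other than its own expression. In addition, the mathematical and statistical methods commonly used to build regulatory networks capture correlation rather than causation. These shortcomings limit the quality of current network reconstructions. Starting from literature-curated databases of transcription factors and their target genes, and using gene expression data and GSEA, this paper proposes a new strategy for estimating TF activity in a given process. Based on that estimate, it then uses mutual information to construct the gene regulatory network. The way the paper quantifies transcription factor activity by combining database TF-target relationships with gene expression data is worth borrowing.
IF: 10.100 Q1 Genome Biology, 2022-12-27. DOI: 10.1186/s13059-022-02835-3 PMID: 36575445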
Abstract:
A major question in systems biology is how to identify the core gene regulatory circuit that governs the decision-making of a biological process. Here, we develop a computational platform, named NetAct, for constructing core transcription factor regulatory networks using both transcriptomics data and literature-based transcription factor-target databases. NetAct robustly infers regulators' activity using target expression, constructs networks based on transcriptional activity, and integrates mathematical modeling for validation. Our in silico benchmark test shows that NetAct outperforms existing algorithms in inferring transcriptional activity and gene networks. We illustrate the application of NetAct to model networks driving TGF-β-induced epithelial-mesenchymal transition and macrophage polarization.
11.
小擎子 (2022-11-30 23:57):
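#paper doi:10.1186/s13059-016-0997-x; Genome Biol. 2016, Mash: fast genome and metagenome distance estimation using MinHash. The Mash tool uses MinHash to quickly measure distances between genomes and metagenomes. Mash mainly provides two functions, sketch and dist. sketch converts a sequence or a collection of sequences into a MinHash sketch, which greatly reduces the memory footprint. dist estimates the Jaccard index from the sketches and converts it into a distance that approximates 1 - ANI within a controllable error, with much higher computational efficiency. The key choices are the k-mer size and s (the sketch size), both of which affect the error. A characteristic of Mash is that most of the computation goes into building the sketches; once they exist, similarity comparison and clustering of tens of thousands of genomes is almost instantaneous.
IF: 10.100 Q1 Genome Biology, 2016-06-20. DOI: 10.1186/s13059-016-0997-x PMID: 27323842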
Abstract:
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license ( https://github.com/marbl/mash ).
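The sketch-then-dist idea in this entry can be illustrated with a small Python sketch. It is a simplified stand-in rather than Mash's own code: it hashes plain (non-canonical) k-mers with `hashlib.blake2b`, whereas the real tool has its own hashing and k-mer handling, and the function names and toy sequences are hypothetical. The distance formula D = -(1/k) ln(2j / (1 + j)) is the Mash distance defined in the paper.

```python
import hashlib
import math

def kmers(seq, k):
    """Set of all k-mers in a sequence (no reverse-complement handling here)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest 64-bit hashes of the k-mers."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:s]

def jaccard_estimate(sk_a, sk_b, s):
    """Estimate the Jaccard index from the s smallest hashes of the merged sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:s]
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j and k-mer size k."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)

# Toy sequences; real genomes would use larger k and s.
a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
b = "ACGTTGCATGCCGATTGCTAGGCTAACGTTAGC"
k, s = 5, 16
j = jaccard_estimate(sketch(a, k, s), sketch(b, k, s), s)
print(f"jaccard~{j:.3f} distance~{mash_distance(j, k):.4f}")
```

Building each sketch scans the whole sequence once, while comparing two sketches touches at most 2s hashes, which is why the comparison step is so cheap once sketches exist.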
12.
笑对人生 (2022-10-05 00:01):
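#paper doi: 10.1186/s13059-016-0893-4. DeconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biol. 2016 Feb 22;17:31. Mutational signatures were first proposed in a study by Alexandrov LB, et al. (Nature, 2013), which used the non-negative matrix factorization (NMF) algorithm to discover 21 mutational signatures, each defined over 96 trinucleotide contexts. A recent study in Science reported 58 previously unidentified mutational signatures (Degasperi A, Science. 2022). Compared with earlier work, the deconstructSigs package developed in this study can analyze, in a single tumor sample, the mutational signatures caused by environmental exposures, DNA damage repair abnormalities, mutagenic processes, and so on. The COSMIC website (https://cancer.sanger.ac.uk/signatures/) currently groups signatures into four classes by variant type: SBS signatures (single base substitutions, 95 signatures), DBS signatures (doublet base substitutions, 11 signatures), ID signatures (small insertions and deletions, 18 signatures), and CN signatures (copy number variations, 24 signatures). The deconstructSigs workflow consists of (1) building the input with mut.to.sigs.input and (2) predicting signatures with whichSignatures. The NMF mentioned here is an algorithm for discovering features in data that was previously common in image recognition; unlike PCA or SVD, it guarantees that all matrix elements are non-negative (in most applications negative elements are meaningless). The basic idea of NMF is that, for a given non-negative matrix V, it finds a non-negative matrix W and a non-negative matrix H such that V ≈ W*H, factorizing one non-negative matrix into the product of two. The factorization proceeds iteratively until W and H converge. Each column of V represents an observation and each row a feature, for example the samples (columns) and genes (rows) of an RNA-seq expression matrix. W is called the basis matrix and H the coefficient (or weight) matrix. Using the coefficient matrix H in place of the original matrix reduces its dimensionality, giving a lower-dimensional representation of the data's features and saving storage.
IF: 10.100 Q1 Genome Biology, 2016-Feb-22. DOI: 10.1186/s13059-016-0893-4 PMID: 26899170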
Abstract:
BACKGROUND: Analysis of somatic mutations provides insight into the mutational processes that have shaped the cancer genome, but such analysis currently requires large cohorts. We develop deconstructSigs, which allows the identification of mutational signatures within a single tumor sample. RESULTS: Application of deconstructSigs identifies samples with DNA repair deficiencies and reveals distinct and dynamic mutational processes molding the cancer genome in esophageal adenocarcinoma compared to squamous cell carcinomas. CONCLUSIONS: deconstructSigs confers the ability to define mutational processes driven by environmental exposures, DNA repair abnormalities, and mutagenic processes in individual tumors with implications for precision cancer medicine.
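The NMF description in this entry (a non-negative V factorized into non-negative W and H, refined iteratively) can be sketched in plain Python. The update rule used here is the classic multiplicative update for squared error, chosen as one common way to run the iteration; it is an assumption of this example and not taken from deconstructSigs or the signature papers. The helper names and the toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (features x samples) into W (features x rank)
    and H (rank x samples) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy matrix: 4 features (rows) x 3 samples (columns); values are made up.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
W, H = nmf(V, rank=2)
print(f"reconstruction error: {frob_error(V, W, H):.6f}")
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is the property the entry highlights over PCA or SVD.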
13.
徐炳祥 (2022-09-22 22:58):
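#paper doi: 10.1186/s13059-022-02757-0 Genome Biology, 2022, Genetic regulation of RNA splicing in human pancreatic islets. Non-coding variants present in pancreatic islet cells affect the cellular transcriptome and may therefore play an important role in the development of type 1 and type 2 diabetes. In a cohort of 399 donors, this paper analyzes a particular class of common genomic variants: sQTLs (splicing QTLs, that is, QTLs that affect alternative splicing events). The target genes of sQTLs differ from those of eQTLs, suggesting that the two kinds of QTL may act independently. The authors identify a set of new type 1 and type 2 diabetes risk genes associated with sQTLs. On this basis they argue that alternative splicing events in islet cells are an important diabetes risk factor.
IF: 10.100 Q1 Genome Biology, 2022-09-15. DOI: 10.1186/s13059-022-02757-0 PMID: 36109769 PMCID: PMC9479353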
Abstract:
BACKGROUND: Non-coding genetic variants that influence gene transcription in pancreatic islets play a major role in the susceptibility to type 2 diabetes (T2D), and likely also contribute to type 1 diabetes (T1D) risk. For many loci, however, the mechanisms through which non-coding variants influence diabetes susceptibility are unknown. RESULTS: We examine splicing QTLs (sQTLs) in pancreatic islets from 399 human donors and observe that common genetic variation has a widespread influence on the splicing of genes with established roles in islet biology and diabetes. In parallel, we profile expression QTLs (eQTLs) and use transcriptome-wide association as well as genetic co-localization studies to assign islet sQTLs or eQTLs to T2D and T1D susceptibility signals, many of which lack candidate effector genes. This analysis reveals biologically plausible mechanisms, including the association of T2D with an sQTL that creates a nonsense isoform in ERO1B, a regulator of ER-stress and proinsulin biosynthesis. The expanded list of T2D risk effector genes reveals overrepresented pathways, including regulators of G-protein-mediated cAMP production. The analysis of sQTLs also reveals candidate effector genes for T1D susceptibility such as DCLRE1B, a senescence regulator, and lncRNA MEG3. CONCLUSIONS: These data expose widespread effects of common genetic variants on RNA splicing in pancreatic islets. The results support a role for splicing variation in diabetes susceptibility, and offer a new set of genetic targets with potential therapeutic benefit.
14.
颜林林 (2022-07-21 00:29):
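#paper doi:10.1186/s13059-022-02726-7 Genome Biology, 2022, Integration of single-cell multi-omics data by regression analysis on unpaired observations. Because of technical limitations, the vast majority of single-cell multi-omics studies cannot actually measure several omics layers in the same cell. To address this, the paper starts from the intuitive assumption that target genes with similar expression also have similar regulators, and uses regression analysis to link and infer the relationship between scRNA-seq and ATAC-seq data. In unpaired scRNA-seq and ATAC-seq experiments (where the two assays were run on different cells rather than the same cell), one data type (such as chromatin accessibility from ATAC-seq) can then be used to infer the expression of the corresponding regulated genes. The method was evaluated on simulated and real data and reaches high accuracy (its results agree closely with eQTL mapping). This gives methodological support for making better use of the large amount of unpaired single-cell data accumulated so far.
IF: 10.100 Q1 Genome Biology, 2022-07-19. DOI: 10.1186/s13059-022-02726-7 PMID: 35854350 PMCID: PMC9295346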
Abstract:
Despite recent developments, it is hard to profile all multi-omics single-cell data modalities on the same cell. Thus, huge amounts of single-cell genomics data of unpaired observations on different cells are generated. We propose a method named UnpairReg for the regression analysis on unpaired observations to integrate single-cell multi-omics data. On real and simulated data, UnpairReg provides an accurate estimation of cell gene expression where only chromatin accessibility data is available. The cis-regulatory network inferred from UnpairReg is highly consistent with eQTL mapping. UnpairReg improves cell type identification accuracy by joint analysis of single-cell gene expression and chromatin accessibility data.
15.
颜林林 (2022-07-07 07:41):
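#paper doi:10.1186/s13059-022-02699-7 Genome Biology, 2022, Storing and analyzing a genome on a blockchain. Several years ago I heard many people talk about applying blockchain technology to genomics, but those attempts ended badly and seemed to go nowhere. With the crypto market recently falling nonstop amid widespread despair, it is remarkable and eye-catching that Genome Biology would publish a paper like this. The paper comes from Yale, its full text and source code are open access, and it is worth a careful read for anyone interested in blockchain technology. It envisions a network of sequencers, data owners, clinicians, and researchers, each taking part in synchronizing VCFchain or SAMchain to form distributed data sharing, with data analysis interleaved with the extension of the chain. Storing huge genomic data in the limited extra bytes available on a blockchain does require some tricks (such as splitting the data and reassembling it at query time), and the paper did real work on this. Overall, though, it still feels like "blockchain for the sake of blockchain." Permission management and tamper resistance may be its distinguishing strengths, but the paper does not fully demonstrate them. This sets it apart from two other blockchain-related papers shared earlier (DOIs 10.1038/s41591-022-01768-5 and 10.1038/s41586-021-03583-3, published in Nature Medicine and Nature respectively), which were more about AI algorithms and the value of data sharing, whereas this paper focuses on the implementation details of the blockchain-related software. With this paper paving the way, perhaps similar articles could later make a run at NBT.
IF: 10.100 Q1 Genome Biology, 2022-06-29. DOI: 10.1186/s13059-022-02699-7 PMID: 35765079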
Abstract:
There are major efforts underway to make genome sequencing a routine part of clinical practice. A critical barrier to these is achieving practical solutions for data ownership and integrity. Blockchain provides solutions to these challenges in other realms, such as finance. However, its use in genomics is stymied due to the difficulty in storing large-scale data on-chain, slow transaction speeds, and limitations on querying. To overcome these roadblocks, we developed a private blockchain network to store genomic variants and reference-aligned reads on-chain. It uses nested database indexing with an accompanying tool suite to rapidly access and analyze the data.
16.
Vincent (2022-03-31 11:11):
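#paper doi: 10.1186/s13059-021-02443-7 Genome Biol 2021 Technology dictates algorithms: recent developments in read alignment. Read alignment is a basic step in the analysis of sequencing data in bioinformatics. This paper reviews 107 read alignment tools in detail and experimentally evaluates the speed and efficiency of 11 of them. It argues that alignment algorithms and sequencing technologies co-evolve: a new technology triggers the development of a series of tools, while the underlying core algorithms usually do not change in a revolutionary way (they are merely tailored for the new technology). The survey finds that indexing the genome with a hash table is the most common approach, though it has the drawback of requiring more storage, and that suffix-tree-based indexing also tends to be fast and is increasingly widely used. The paper also finds that local alignment methods typically use the Hamming distance and the Smith-Waterman algorithm to determine the exact position of a read in the genome. In addition, it reviews how long reads have influenced the development of alignment methods, among other topics.
IF: 10.100 Q1 Genome Biology, 2021-08-26. DOI: 10.1186/s13059-021-02443-7 PMID: 34446078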
Abstract:
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
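The two verification measures named in this entry, Hamming distance and Smith-Waterman, can be sketched in a few lines of Python. This is a textbook version for illustration only: the scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions, and production aligners use heavily optimized variants that this sketch does not attempt to reproduce.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Keeps only two rows of the dynamic-programming table, so memory is
    O(len(b)); the score is returned without a traceback.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment restart anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

read = "ACGTTGCA"
ref_window = "ACGTAGCA"   # candidate location, same length as the read
print(hamming(read, ref_window))
print(smith_waterman(read, "TTTTACGTTGCATTTT"))
```

Hamming distance only handles substitutions at a fixed offset, so it is cheap but blind to indels; Smith-Waterman tolerates gaps at a cost of O(len(a) * len(b)) time, which is why aligners restrict it to candidate locations found through the index.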
17.
思考问题的熊 (2022-03-20 16:35):
#paper Li, Yumei, Xinzhou Ge, Fanglue Peng, Wei Li, and Jingyi Jessica Li. “Exaggerated False Positives by Popular Differential Expression Methods When Analyzing Human Population Samples.” Genome Biology 23, no. 1 (March 15, 2022): 79. https://doi.org/10.1186/s13059-022-02648-4. A paper published in Genome Biology a few days ago. It makes a fairly rigorous case that for RNA-seq differential expression analysis with large sample sizes, even leaving speed aside, one should drop DESeq2 and edgeR in favor of the plain Wilcoxon rank-sum test. I have written up the details in a separate post; take a look if you are interested.
IF: 10.100 Q1 Genome Biology, 2022-03-15. DOI: 10.1186/s13059-022-02648-4 PMID: 35292087 PMCID: PMC8922736
Abstract:
When identifying differentially expressed genes between two conditions using human population RNA-seq samples, we found a phenomenon by permutation analysis: two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates. Expanding the analysis to limma-voom, NOISeq, dearseq, and Wilcoxon rank-sum test, we found that FDR control is often failed except for the Wilcoxon rank-sum test. Particularly, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for population-level RNA-seq studies with large sample sizes, we recommend the Wilcoxon rank-sum test.
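The recommended approach in this entry, a per-gene Wilcoxon rank-sum test followed by FDR control, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses the normal approximation for the p-value (reasonable only for large groups, which matches the large-sample setting discussed here), applies Benjamini-Hochberg adjustment as one common FDR procedure, and skips normalization entirely. Function names and the toy data are hypothetical.

```python
import math

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation,
    using average ranks for ties and a tie-corrected variance."""
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2      # Mann-Whitney U for group x
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):            # walk from largest p to smallest
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

# Toy data: gene -> (condition A values, condition B values); numbers are made up.
genes = {
    "geneA": (list(range(10)), list(range(10, 20))),
    "geneB": ([5, 7, 6, 8, 5, 7, 6, 8, 5, 7], [6, 5, 8, 7, 6, 5, 8, 7, 6, 5]),
}
pvals = [rank_sum_p(a, b) for a, b in genes.values()]
for name, q in zip(genes, benjamini_hochberg(pvals)):
    print(name, f"adjusted p = {q:.4g}")
```

Because the test depends only on ranks, it makes no assumption about the count distribution, which is the usual explanation for why it is robust where parametric models are misspecified.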