Papers shared by user Vincent.
27 paper shares found in total; this page shows numbers 21 to 27.
21.
Vincent (2022-10-31 15:22):
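#paper Obtaining genetics insights from deep learning via explainable artificial intelligence, Nature Reviews Genetics https://doi.org/10.1038/s41576-022-00532-2 Deep-learning-based AI models play an important role in genomic functional prediction and are regarded as the current state of the art. Because of their complexity, however, they are usually treated as black boxes whose predictions and underlying mechanisms are hard to explain, and in genomics the mechanism (the process) is often more valuable than the prediction (the result). This review paper summarizes recent progress in applying the emerging field of explainable AI (xAI) to genomics and looks ahead to its potential for revealing biological mechanisms. Using regulatory genomics as the main example, it groups model-interpretation techniques into four categories: model-based interpretation (inspecting hidden-layer neuron activity, attention mechanisms), mathematical propagation of influence (forward and backward propagation), identification of feature interactions, and transparent models built on prior knowledge. It also discusses the assumptions behind each technique and its limitations when applied to high-throughput sequencing data.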
Abstract:
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
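As a rough illustration of the forward-propagation style of interpretation mentioned in the note above, here is a minimal sketch of in-silico mutagenesis: mutate each base in turn and record how much the model output drops. The "model" is a toy motif scorer and `MOTIF` is a hypothetical motif chosen for illustration; neither comes from the review, and a real study would perturb a trained deep network instead.

```python
# Toy stand-in for a trained model: score a DNA sequence by its best match
# to a hypothetical motif (an assumption made purely for illustration).
MOTIF = "TATA"


def model_score(seq):
    """Best number of matching positions between MOTIF and any window of seq."""
    best = 0
    for start in range(len(seq) - len(MOTIF) + 1):
        window = seq[start:start + len(MOTIF)]
        matches = sum(a == b for a, b in zip(window, MOTIF))
        best = max(best, matches)
    return best


def in_silico_mutagenesis(seq):
    """Per-position importance: largest score drop over all single-base substitutions."""
    baseline = model_score(seq)
    importance = []
    for i, ref in enumerate(seq):
        drops = []
        for alt in "ACGT":
            if alt == ref:
                continue  # skip the reference base itself
            mutated = seq[:i] + alt + seq[i + 1:]
            drops.append(baseline - model_score(mutated))
        importance.append(max(drops))
    return importance


scores = in_silico_mutagenesis("GGTATAGG")
print(scores)
```

Only the four positions covered by the motif receive a non-zero importance here, which is the kind of per-nucleotide attribution map these methods aim to produce.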
22.
Vincent (2022-09-30 14:56):
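#paper doi: https://doi.org/10.1038/s43586-021-00056-9 Genome-wide association studies. Nature Reviews Methods Primers. 2021. GWAS aim to find associations between genotypes and phenotypes. To date, more than 5,700 GWAS covering 3,300 traits have been conducted. This review gives a very good introduction to genome-wide association studies from several angles: statistical principles, study design, practical execution, interpretation of results, and downstream applications. On the statistical side, it covers the linear mixed models commonly used for hypothesis testing, control of the false discovery rate (FDR control), and downstream fine-mapping methods. On study design, it discusses in detail the choice of study population (population-based, family-based, and isolated populations) and the pros and cons of the genotyping and sequencing technologies (microarray, WES, WGS). On applications, it presents two major uses of GWAS: disease risk prediction (polygenic risk scores, PRS) and uncovering the genetic basis of biological traits. The article closes with the current limitations of GWAS and expectations for future development. Overall, a very good introductory article on GWAS.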
Abstract:
Genome-wide association studies (GWAS) test hundreds of thousands of genetic variants across many genomes to find those statistically associated with a specific trait or disease. This methodology has generated a myriad of robust associations for a range of traits and diseases, and the number of associated variants is expected to grow steadily as GWAS sample sizes increase. GWAS results have a range of applications, such as gaining insight into a phenotype’s underlying biology, estimating its heritability, calculating genetic correlations, making clinical risk predictions, informing drug development programmes and inferring potential causal relationships between risk factors and health outcomes. In this Primer, we provide the reader with an introduction to GWAS, explaining their statistical basis and how they are conducted, describe state-of-the art approaches and discuss limitations and challenges, concluding with an overview of the current and future applications for GWAS results.
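Two of the ideas in the note above, FDR control and polygenic risk scores, are small enough to sketch directly. The following is a minimal illustration, not the Primer's own procedure: it assumes the Benjamini-Hochberg step-up rule as the FDR method and a plain weighted sum of risk-allele dosages as the PRS, and all p-values, dosages and effect sizes are made up.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1 or 2) and per-variant effect sizes."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Made-up p-values for five variants.
pvals = [0.001, 0.045, 0.02, 0.20, 0.008]
significant = benjamini_hochberg(pvals, alpha=0.05)

# Made-up genotype dosages and effect sizes for one individual.
prs = polygenic_risk_score([0, 1, 2], [0.10, -0.05, 0.20])
print(significant, round(prs, 2))
```

In practice GWAS often also report a fixed genome-wide significance threshold, and PRS construction involves variant selection and LD handling that this sketch leaves out.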
23.
Vincent (2022-08-31 13:52):
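#paper https://doi.org/10.1038/s41580-021-00407-0, Nat Rev Mol Cell Biol, 2021, A guide to machine learning for biologists. This review paper gives an accessible introduction to the main families of machine learning algorithms and their applications in biology. It starts by laying out key ML concepts (for example, the taxonomy of machine learning algorithms, overfitting/underfitting, and the bias-variance tradeoff). It then covers traditional machine learning methods (PCA, k-means, SVM, ridge regression, random forest, etc.) and deep-learning-based methods (CNN, RNN, transformer, autoencoder, etc.), describes the strengths and weaknesses of each, and discusses best practices for applying machine learning to biological data. It ends with the challenges machine learning faces in biology, such as data availability, data leakage, model interpretability, and privacy protection. Worth a look if you are interested; it is an excellent reference.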
Abstract:
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
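To make the link between ridge regression and overfitting in the note above concrete, here is a minimal sketch of ridge regression in its simplest form: one feature, no intercept, where the penalized least-squares slope has the closed form sum(x*y) / (sum(x*x) + lambda). The data and the penalty values are made up for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for y ~ w * x (single feature, no intercept)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_ols = ridge_slope(xs, ys, lam=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_slope(xs, ys, lam=14.0)  # penalty shrinks the slope toward zero
print(w_ols, w_ridge)
```

A larger penalty trades some bias for lower variance, which is the bias-variance tradeoff the review describes; with several features one would normally use a library implementation rather than this one-dimensional formula.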
24.
Vincent (2022-07-31 17:30):
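#paper doi: 10.1093/bioinformatics/btab083 DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. The gene regulatory code is highly complex because of polysemy and long-range semantic relationships in the sequence. In recent years, studies have repeatedly found that DNA sequences, especially non-coding regions, resemble natural language in their alphabet, grammar, and semantics, while BERT, a machine learning tool built on the transformer attention mechanism, has been very successful in natural language processing. This paper follows the same line of thinking to develop DNABERT, a pre-trained model that represents DNA features based on sequence context. To show what the model is good for, the authors try several classic computational tasks: promoter prediction, splice site prediction, and transcription factor binding site prediction. They first use the model to encode DNA sequences and then fine-tune it for each specific task, and report that it comfortably outperforms other algorithms in accuracy. To address the poor interpretability of deep learning, the method also offers visualization options that show the importance of individual positions and their relationships to other positions (via the attention mechanism). The work further finds that a model pre-trained on the human genome also performs well on other organisms, which suggests that the encoding is transferable (it does not memorize, but genuinely captures sequence-level features).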
Abstract:
MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. RESULTS: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fined tuned to many other sequence analyses tasks. AVAILABILITY AND IMPLEMENTATION: The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
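Treating DNA as a language requires splitting a sequence into "words" first. Neither the note nor the abstract above says how this is done; the sketch below assumes overlapping k-mer tokens, which is how DNABERT is commonly described, and the function name and example sequence are hypothetical. It is not taken from the DNABERT code base.

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokenize("atgcgtac", k=3)
print(tokens)
```

Each token would then be mapped to a vocabulary index before being fed to a transformer encoder; that step is omitted here.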
25.
Vincent (2022-04-30 21:26):
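#paper https://doi.org/10.1038/s41467-020-17678-4 A deep learning model to predict RNA-Seq expression of tumours from whole slide images. Nature Communications (2020) Deep learning models (CNNs) are widely used in medical imaging, and recent studies have shown that DNA mutations and mutation counts can be predicted from pathology images, but no study had examined whether gene expression can be predicted from pathology images; this paper fills that gap. It proposes HE2RNA, a multi-task, weakly supervised deep learning model trained on TCGA data from different cancer types (WSI + RNA-seq). The authors find that the number of genes that can be predicted accurately depends mainly on the size of the training set, and enrichment analysis of the accurately predicted genes shows that they concentrate in pathways for immunity and T-cell regulation, the cell cycle, and cancer hallmarks. Finally, the paper shows that HE2RNA can be used for spatial visualization of gene expression (predicting where on the slide a gene is expressed) and to improve prediction of microsatellite instability (MSI).
IF: 14.700 (Q1). Nature Communications, 2020-08-03. DOI: 10.1038/s41467-020-17678-4 PMID: 32747659 PMCID: PMC7400514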
Abstract:
Deep learning methods for digital pathology analysis are an effective way to address multiple clinical questions, from diagnosis to prediction of treatment outcomes. These methods have also been used to predict gene mutations from pathology images, but no comprehensive evaluation of their potential for extracting molecular features from histology slides has yet been performed. We show that HE2RNA, a model based on the integration of multiple data modes, can be trained to systematically predict RNA-Seq profiles from whole-slide images alone, without expert annotation. Through its interpretable design, HE2RNA provides virtual spatialization of gene expression, as validated by CD3- and CD20-staining on an independent dataset. The transcriptomic representation learned by HE2RNA can also be transferred on other datasets, even of small size, to increase prediction performance for specific molecular phenotypes. We illustrate the use of this approach in clinical diagnosis purposes such as the identification of tumors with microsatellite instability.
26.
Vincent (2022-03-31 11:11):
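#paper doi: 10.1186/s13059-021-02443-7 Genome Biol 2021 Technology dictates algorithms: recent developments in read alignment. Read alignment is a basic step in the analysis of sequencing data in bioinformatics. This paper reviews 107 alignment tools in detail and experimentally evaluates the speed and computational efficiency of 11 of them. It argues that alignment algorithms and sequencing technologies co-evolve: the arrival of a new technology triggers the development of a series of tools, while the underlying core algorithms rarely change in a revolutionary way (they are merely tailored to the new technology). The survey finds that indexing the genome with a hash table is the most common approach, with the drawback of a larger storage requirement, while suffix-tree-based indexing also tends to be fast and is increasingly widely used. The paper also finds that local alignment methods typically use the Hamming distance and the Smith-Waterman algorithm to locate the exact position of a read in the genome. In addition, it reviews how long-read sequencing has influenced the development of alignment methods, among other topics.
IF: 10.100 (Q1). Genome Biology, 2021-08-26. DOI: 10.1186/s13059-021-02443-7 PMID: 34446078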
Abstract:
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
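The combination described in the note above, a hash-table index over the genome followed by a Hamming-distance check, can be sketched in a few lines. This is a toy seed-and-verify aligner under simplifying assumptions (a single seed taken from the start of the read, substitutions only, no indels); the sequences and function names are made up and do not correspond to any of the surveyed tools.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash-table index: each k-mer maps to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, k, max_mismatches):
    """Seed with the read's first k-mer, then verify each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run past the end of the reference
        distance = hamming_distance(read, candidate)
        if distance <= max_mismatches:
            hits.append((pos, distance))
    return hits


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
hits = align_read("ACGTTAGG", reference, index, k=4, max_mismatches=1)
print(hits)
```

The seed `ACGT` occurs at positions 0 and 4; only position 4 survives verification, with one mismatch. The index stores every k-mer position, which hints at the storage cost the paper attributes to hash-based indexing.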
27.
Vincent (2022-02-28 15:50):
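#paper What are the most important statistical ideas of the past 50 years? #Link: https://arxiv.org/abs/2012.00174 Overview: the author, Andrew Gelman (writing with Aki Vehtari), is a professor of statistics at Columbia University and a senior statistical consultant for publications such as The Economist; he was elected to the American Academy of Arts and Sciences in 2020. He posted this paper, which drew wide attention from statisticians, on arXiv in December 2020. It summarizes what he considers the eight most important ideas in statistics over the past 50 years: 1. causal inference; 2. the bootstrap and simulation-based inference; 3. overparameterized models and regularization; 4. multilevel (hierarchical) models; 5. generic computation algorithms; 6. adaptive decision analysis; 7. robust inference; 8. exploratory data analysis. Personally I find the first and third points especially apt. The third essentially covers a large share of machine learning algorithms. The first directly shapes how people decide and reason: much of the time we mistake correlation for causation (especially in the social sciences). If you happen to come across the various arguments online, try looking at them from this angle and check whether the reasoning makes this elementary mistake.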
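For readers who have not met idea 2, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean: resample the data with replacement many times, recompute the statistic each time, and read off the empirical quantiles. The data values, the number of resamples and the fixed seed are arbitrary choices for illustration, not anything taken from the paper.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)])  # one resample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements.
data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
low, high = bootstrap_ci(data)
print(low, statistics.mean(data), high)
```

The appeal of the method is that the same few lines work for statistics with no convenient closed-form standard error, such as a median or a ratio, by passing a different `stat` function.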