Vincent (2022-07-31 17:30):
#paper doi: 10.1093/bioinformatics/btab083 DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. The gene regulatory code is highly complex due to sequence polysemy and distant semantic relationships. Recent studies have found that DNA sequences, especially non-coding regions, resemble natural language in their alphabet, grammar, and semantics, and BERT, a machine-learning tool built on the transformer attention mechanism, has been remarkably successful in natural language processing. Applying a similar line of thinking, this paper develops DNABERT, a pre-trained model that learns context-dependent representations of DNA sequence. To demonstrate its utility, the authors take on several classic computational tasks: promoter prediction, splice-site prediction, and transcription factor binding site prediction. They first use the model to encode DNA sequences and then fine-tune it for each specific task, finding that it comfortably outperforms other algorithms in accuracy. To counter the poor interpretability typical of deep learning, the method also provides visualization showing position-level importance and relationships to other positions (via the attention mechanism). The authors further find that the model pre-trained on the human genome also works well on other organisms, showing that this encoding is transferable (not memorization, but a genuine capture of sequence-level features).
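For anyone who wants to poke at it: below is a minimal sketch of the encode-then-fine-tune workflow under stated assumptions. The seq_to_kmers helper, the k=6 choice, and the Hugging Face checkpoint name zhihan1996/DNA_bert_6 are my own illustration, not from the paper; the official pipeline lives in the GitHub repo linked in the abstract below.

```python
# Minimal sketch of the encode-then-fine-tune workflow (assumptions noted above).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def seq_to_kmers(seq: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers, the token format DNABERT expects."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Assumed checkpoint name; substitute the weights you actually downloaded.
model_name = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 for a binary task such as promoter vs. non-promoter.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer(seq_to_kmers("ATGCGTACGTTAGCAGGAA"), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # untrained head: fine-tune on labeled data first
print(logits.shape)  # torch.Size([1, 2])
```

Fine-tuning itself is then just standard supervised training of that classification head on the task's labeled sequences, which is why small task-specific datasets suffice.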
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
Abstract:
MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

RESULTS: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy, and efficiency. We show that the single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites, and transcription factor binding sites after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.

AVAILABILITY AND IMPLEMENTATION: The source code and pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
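On the interpretability point: attention weights can be read straight out of any BERT-style model via transformers' output_attentions=True. A crude sketch of an importance readout follows; the checkpoint name is again an assumption, and using the attention flowing out of [CLS] as a per-k-mer score is my simplification, not the paper's DNABERT-viz tool.

```python
# Hedged sketch: per-k-mer attention scores as a rough importance readout.
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "zhihan1996/DNA_bert_6"  # assumed checkpoint, as in the sketch above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

seq = "ATGCGTACGTTAGCAGGAA"
kmers = " ".join(seq[i:i + 6] for i in range(len(seq) - 5))
inputs = tokenizer(kmers, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # per layer: (batch, heads, seq, seq)

# Average heads in the last layer; row 0 is the attention [CLS] pays to each token.
importance = attentions[-1].mean(dim=1)[0][0]
for tok, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), importance):
    print(f"{tok}\t{score.item():.3f}")
```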