Papers from the journal Bioinformatics (Oxford, England).
7 shared papers found.
1.
李翛然
(2023-10-31 13:21):
#paper doi:10.1093/bioinformatics/btad596 DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data
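A new framework that uses scRNA-seq data to explain cell-cell interactions. In my view the biggest problem is that, judging from its training set and datasets, the model is still trained on preliminarily processed scRNA-seq data, that is, data taken only as far as UMAP dimensionality reduction and clustering. This is a fairly rudimentary idea, and at this granularity the explanation of the real mechanisms of cell-cell interaction will be poor. Still, it counts as a cross-disciplinary application and deserves encouragement.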
Abstract:
MOTIVATION: Cell-cell interactions (CCIs) play critical roles in many biological processes such as cellular differentiation, tissue homeostasis, and immune response. With the rapid development of high throughput single-cell RNA sequencing (scRNA-seq) technologies, it is of high importance to identify CCIs from the ever-increasing scRNA-seq data. However, limited by the algorithmic constraints, current computational methods based on statistical strategies ignore some key latent information contained in scRNA-seq data with high sparsity and heterogeneity.
RESULTS: Here, we developed a deep learning framework named DeepCCI to identify meaningful CCIs from scRNA-seq data. Applications of DeepCCI to a wide range of publicly available datasets from diverse technologies and platforms demonstrate its ability to predict significant CCIs accurately and effectively. Powered by the flexible and easy-to-use software, DeepCCI can provide the one-stop solution to discover meaningful intercellular interactions and build CCI networks from scRNA-seq data.
AVAILABILITY AND IMPLEMENTATION: The source code of DeepCCI is available online at https://github.com/JiangBioLab/DeepCCI.
2.
na na na
(2022-12-31 23:50):
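#paper, Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data (2018), DOI: 10.1093/bioinformatics/bty026.
Sharing an algorithm/tool paper on FSQN (feature specific quantile normalization). The method addresses normalization between transcriptome data from RNA-seq platforms and from microarray platforms. This matters especially in public-data analysis: common approaches such as taking log2, z-scoring, or median correction can pull the data into a common range to some extent, but the distributions still differ, so machine learning models often perform poorly across platforms. The paper discusses why batch effects arise between platforms and, from an application standpoint, both compares the weaknesses of existing methods and introduces FSQN. With a support vector machine trained only on microarray data, FSQN-normalized RNA-seq data were assigned molecular subtypes with up to 98% accuracy on the BRCA dataset and 97% on the CRC dataset. The authors provide an R package: https://github.com/jenniferfranks/FSQN. I have tested it: PCA shows that batch-effect removal works well, but I could not reproduce the high accuracy of the machine learning models reported in the paper. Batch correction between platforms and cross-platform use of machine learning therefore remain open research directions. Extending the idea, batch-correction tools could be developed from a data-correction perspective between RNA-seq and NanoString, between RNA-seq and single-cell sequencing, and between microarrays and NanoString.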
Abstract:
Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).
Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite different platforms used for gene expression profiling.
Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on Github (https://github.com/jenniferfranks/FSQN).
Contact: michael.l.whitfield@dartmouth.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
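As a rough illustration of the idea behind FSQN in this entry, the sketch below quantile-normalizes each feature (gene) of a target dataset onto the distribution of the same feature in a reference dataset. It is a minimal pure-Python sketch, not the authors' R package: the function names (`fsqn_sketch`, `normalize_feature`, `quantile_of_sorted`), the dict-of-lists data layout, and the simple rank handling (ties are broken by input order) are all assumptions made for illustration.

```python
def quantile_of_sorted(sorted_vals, q):
    """Linearly interpolated quantile q in [0, 1] of an already sorted list."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def normalize_feature(target_vals, reference_vals):
    """Map one feature's target values onto the reference distribution by rank.

    Assumption: ties in the target are broken by input order, which a
    production implementation would likely handle more carefully.
    """
    ref_sorted = sorted(reference_vals)
    n = len(target_vals)
    order = sorted(range(n), key=lambda i: target_vals[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.5
        out[idx] = quantile_of_sorted(ref_sorted, q)
    return out


def fsqn_sketch(target, reference):
    """Normalize every feature of `target` against the same feature in `reference`.

    Both arguments are hypothetical dicts: gene name -> list of per-sample values.
    """
    return {gene: normalize_feature(vals, reference[gene])
            for gene, vals in target.items()}


# Hypothetical toy data: one gene measured on two platforms.
rnaseq = {"G1": [100.0, 10.0, 50.0]}
microarray = {"G1": [1.0, 2.0, 3.0, 4.0, 5.0]}
normalized = fsqn_sketch(rnaseq, microarray)  # {"G1": [5.0, 1.0, 3.0]}
```

The sample ranks within the target are preserved; only the values are replaced by the corresponding quantiles of the reference feature, which is why the distributions line up afterwards.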
3.
颜林林
(2022-11-20 22:18):
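#paper doi:10.1093/bioinformatics/btac018 Bioinformatics, 2022, StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. This paper introduces a small tool from Evan Eichler's group at the University of Washington. It computes sequence identity (or similarity) at genome scale and displays the resulting identity relationships in the triangular layout that genome browsers commonly use for linkage disequilibrium (LD). This is little more than a routine task in day-to-day analysis, with no great novelty or significance. As an Application Note meant to help others reuse and quickly implement similar functionality, however, it does stand out for practicality: the authors package the functionality as a Snakemake module and borrow HiGlass, a tool published in 2018, for interactive display of the results with fast switching between resolutions.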
Bioinformatics (Oxford, England),
2022-03-28.
DOI: 10.1093/bioinformatics/btac018
PMID: 35020798
PMCID:PMC8963321
Abstract:
SUMMARY: The visualization and analysis of genomic repeats is typically accomplished using dot plots; however, the emergence of telomere-to-telomere assemblies with multi-megabase repeats requires new visualization strategies. Here, we introduce StainedGlass, which can generate publication-quality figures and interactive visualizations that depict the identity and orientation of multi-megabase tandem repeat structures at a genome-wide scale. The tool can rapidly reveal higher-order structures and improve the inference of evolutionary history for some of the most complex regions of genomes.
AVAILABILITY AND IMPLEMENTATION: StainedGlass is implemented using Snakemake and available open source under the MIT license at https://mrvollger.github.io/StainedGlass/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
4.
林李泽强
(2022-08-31 22:57):
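#paper doi:10.1093/bioinformatics/btz128, Bioinformatics, 2019, Brain annotation toolbox: exploring the functional and genetic associations of neuroimaging results. The results of most neuroimaging studies (for example activated clusters/regions, or functional connectivity between brain regions) often cannot be interpreted conveniently and systematically, which leaves their biological meaning unclear. In this study the authors developed a brain annotation toolbox that automatically generates functional and genetic annotations for neuroimaging results. The toolbox uses voxel-level functional descriptions from the Neurosynth database and gene expression profiles from the Allen Human Brain Atlas to generate functional/genetic information for region-level neuroimaging results. It is a free, open-source MATLAB toolbox that can help provide functional/genetic annotations for newly discovered regions of unknown function.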
Abstract:
MOTIVATION: Advances in neuroimaging and sequencing techniques provide an unprecedented opportunity to map the function of brain regions and identify the roots of psychiatric diseases. However, the results from most neuroimaging studies, i.e. activated clusters/regions or functional connectivities between brain regions, frequently cannot be conveniently and systematically interpreted, rendering the biological meaning unclear.
RESULTS: We describe a brain annotation toolbox that generates functional and genetic annotations for neuroimaging results. The voxel-level functional description from the Neurosynth database and gene expression profile from the Allen Human Brain Atlas are used to generate functional/genetic information for region-level neuroimaging results. The validity of the approach is demonstrated by showing that the functional and genetic annotations for specific brain regions are consistent with each other; and further the region by region functional similarity network and genetic similarity network are highly correlated for major brain atlases. One application of brain annotation toolbox is to help provide functional/genetic annotations for newly discovered regions with unknown functions, e.g. the 97 new regions identified in the Human Connectome Project. Importantly, this toolbox can help understand differences between psychiatric patients and controls, and this is demonstrated using schizophrenia and autism data, for which the functional and genetic annotations for the neuroimaging changes in patients are consistent with each other and help interpret the results.
AVAILABILITY AND IMPLEMENTATION: BAT is implemented as a free and open-source MATLAB toolbox and is publicly available at http://123.56.224.61:1313/post/bat.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
5.
颜林林
(2022-08-01 01:02):
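#paper doi:10.1093/bioinformatics/btac528 Bioinformatics, 2022, The K-mer File Format: a standardized and compact disk representation of sets of k-mers. A short string of k consecutive characters is called a k-mer. Many bioinformatics tools and analyses, such as building de Bruijn graphs (for genome assembly) and creating sequence indexes (for short-read alignment), rely on this concept, counting the occurrences of each k-mer and recording other related information (such as positions in the genome and relationships to other k-mers). As k increases, the number of possible k-mers grows exponentially (4^k for DNA), which imposes heavy computation and storage costs. The paper therefore develops a file format for storing k-mer information that compresses the data while keeping reads and writes efficient. Frankly, this is not complicated work; anyone with some C++ and Rust could do it, and there is no shortage of similar needs.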
Abstract:
SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools.
AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
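To make the k-mer notions in this entry concrete, here is a small Python sketch that counts k-mers in a sequence, packs a k-mer into an integer at 2 bits per base, and shows how quickly the space of possible k-mers grows. It does not implement the K-mer File Format itself; the function names (`count_kmers`, `encode_kmer`, `kmer_space`) and the A/C/G/T to 0/1/2/3 mapping are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed 2-bit encoding; real tools may use a different base-to-bits mapping.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every overlapping k-mer (as a string) in `seq`."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts


def encode_kmer(kmer):
    """Pack a k-mer over A/C/G/T into an integer, 2 bits per base.

    Raises KeyError on any other character (for example N).
    """
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def kmer_space(k):
    """Number of distinct DNA k-mers of length k."""
    return 4 ** k


counts = count_kmers("ACGTACGT", 3)
# counts == {"ACG": 2, "CGT": 2, "GTA": 1, "TAC": 1}
packed = encode_kmer("ACGT")  # 0b00011011 == 27
space_31 = kmer_space(31)     # 4611686018427387904 possible 31-mers
```

Packing at 2 bits per base is one simple reason a dedicated binary representation can be far smaller than storing k-mers as text, although the actual format in the paper should be checked against its specification.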
6.
Vincent
(2022-07-31 17:30):
#paper doi: 10.1093/bioinformatics/btab083
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.
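Because of sequence polysemy and distant semantic relationships, the gene regulatory code is highly complex. In recent years, studies have found that DNA sequences, especially non-coding regions, resemble natural language in alphabet, grammar, and semantics, while BERT, a machine learning tool built on the transformer attention mechanism, has excelled in natural language processing. This paper applies a similar approach to develop DNABERT, a pre-trained model that represents DNA features based on sequence context. To show its usefulness and performance, the paper tackles several classic computational tasks: promoter prediction, splice site prediction, and transcription factor binding site prediction. The model is first used to encode DNA sequences and is then fine-tuned for each specific task, and it is reported to outperform other algorithms in accuracy. To address the poor interpretability of deep learning, the method also provides visualization options that show nucleotide-level importance and relationships with other positions (via the attention mechanism). The work further finds that a model pre-trained on the human genome performs well on other organisms, which suggests that the encoding is transferable (it is not memorizing but actually capturing some sequence-level features).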
Abstract:
MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios.
RESULTS: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fined tuned to many other sequence analyses tasks.
AVAILABILITY AND IMPLEMENTATION: The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
7.
颜林林
(2022-07-28 08:50):
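#paper doi:10.1093/bioinformatics/btac137 Bioinformatics, 2022, BWA-MEME: BWA-MEM emulated with a machine learning approach. When I saw Heng Li (李恒) retweet this paper on Twitter, I assumed he had upgraded bwa-mem2 again; it turned out to be someone else's work that simply had his endorsement. A successor to a well-known tool has to improve substantially in some respect, and here the improvement is mainly in performance. Short-read aligners for high-throughput sequencing data usually first find exact-match seeds (almost always by table lookup in constant time) and then extend the matches. Choosing the seed length takes some skill: too short and there may be too many repeated matches (hits); too long and many words that never occur in the genome still occupy the dictionary, making it too large. Some earlier algorithms therefore adopted variable-length seeds (I once considered this strategy too but, to my embarrassment, never put it into practice). Variable-length seeds, however, suffer from variable memory block sizes and frequent memory accesses, which create performance bottlenecks. This paper uses machine learning to preprocess during seed index construction so that the index adapts to the genome sequence data and the number of memory accesses for seeds of different lengths is fixed, which yields the performance gain. In the final evaluation, BWA-MEME produces the same output as BWA-MEM2 while achieving up to 3.45× higher seeding throughput. The algorithm in this paper deserves a closer, more detailed study.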
Abstract:
MOTIVATION: The growing use of next-generation sequencing and enlarged sequencing throughput require efficient short-read alignment, where seeding is one of the major performance bottlenecks. The key challenge in the seeding phase is searching for exact matches of substrings of short reads in the reference DNA sequence. Existing algorithms, however, present limitations in performance due to their frequent memory accesses.
RESULTS: This article presents BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding. BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase. Our evaluation shows that BWA-MEME achieves up to 3.45× speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60×, memory accesses by 8.77× and LLC misses by 2.21×, while ensuring the identical SAM output to BWA-MEM2.
AVAILABILITY AND IMPLEMENTATION: The source code and test scripts are available for academic use at https://github.com/kaist-ina/BWA-MEME/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
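The general idea of a learned index over a suffix array, which this entry and its abstract describe, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not BWA-MEME's actual algorithm: a single linear model stands in for whatever model the paper uses, seeds have one fixed length, and the class and function names (`LearnedSeedIndex`, `seed_key`) are hypothetical. The model predicts where a seed should sit in the sorted suffix array, and the search is then confined to a window whose size is the model's worst-case error.

```python
from bisect import bisect_left

CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 is reserved for padding


def seed_key(s, width):
    """Encode the first `width` bases of `s` as a base-5 integer (0-padded)."""
    key = 0
    for i in range(width):
        key = key * 5 + (CODE[s[i]] if i < len(s) else 0)
    return key


class LearnedSeedIndex:
    """Toy learned index; assumes a non-empty reference over A/C/G/T."""

    def __init__(self, reference, width):
        self.width = width
        # Naive suffix array: suffix start positions in lexicographic order.
        self.sa = sorted(range(len(reference)), key=lambda i: reference[i:])
        # Keys are non-decreasing along the suffix array.
        self.keys = [seed_key(reference[i:i + width], width) for i in self.sa]
        self.kmin, self.kmax = self.keys[0], self.keys[-1]
        # Worst-case distance between predicted and true position.
        self.err = max(abs(self._predict(k) - pos)
                       for pos, k in enumerate(self.keys))

    def _predict(self, key):
        """Linear model: interpolate the key between the smallest and largest key."""
        if self.kmax == self.kmin:
            return 0
        key = min(max(key, self.kmin), self.kmax)
        return (key - self.kmin) * (len(self.keys) - 1) // (self.kmax - self.kmin)

    def lookup(self, seed):
        """Return sorted reference positions where `seed` (length == width) occurs."""
        key = seed_key(seed, self.width)
        guess = self._predict(key)
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        i = bisect_left(self.keys, key, lo, hi)  # search only the error window
        hits = []
        while i < len(self.keys) and self.keys[i] == key:
            hits.append(self.sa[i])
            i += 1
        return sorted(hits)


index = LearnedSeedIndex("ACGTACGTTT", width=4)
hits_acgt = index.lookup("ACGT")  # [0, 4]
hits_cgtt = index.lookup("CGTT")  # [5]
hits_none = index.lookup("GGGG")  # []
```

The point of the design is that the number of positions examined depends on the model's error bound rather than on the size of the index, which is the kind of property that reduces memory accesses during seeding.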