Literature Collection and Sharing Platform

1.

李翛然 (2023-10-31 13:21):

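#paper doi:10.1093/bioinformatics/btad596 DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data. A new framework that uses scRNA-seq data to explain cell-cell interactions. The biggest problem, in my view, is that, judging from its training set and datasets, the model is trained on only preliminarily processed scRNA-seq data, that is, data taken right after UMAP dimensionality reduction and clustering. That is still a very rudimentary idea, and at this granularity the real mechanisms of cell-cell interaction will be explained poorly. Still, as a cross-disciplinary application it deserves encouragement.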

Bioinformatics (Oxford, England), 2023-10-03. DOI: 10.1093/bioinformatics/btad596 PMID: 37740953

DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data

Abstract:

MOTIVATION: Cell-cell interactions (CCIs) play critical roles in many biological processes such as cellular differentiation, tissue homeostasis, and immune response. With the rapid development of high throughput single-cell RNA sequencing …

2.

na na na (2022-12-31 23:50):

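#paper, Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data (2018), DOI: 10.1093/bioinformatics/bty026. Sharing an algorithm/tool paper on FSQN (feature specific quantile normalization). The method addresses normalization between transcriptome data from RNA-seq platforms and transcriptome data from microarray platforms. This problem matters especially when analyzing public data. The usual approaches, such as log2 transformation, z-scores, or median-based correction, can pull the data into a common range to some extent, but the distributions still differ, so machine learning models often perform poorly across platforms. The paper discusses where cross-platform batch effects come from and, taking an application-oriented view, both compares the weaknesses of existing methods and proposes FSQN. On the test datasets, with common classifier models, the method reached 98% accuracy on the RNA-seq platform and 97% accuracy on the microarray platform. The authors provide an R package for the method: https://github.com/jenniferfranks/FSQN. I have tested it: PCA shows that the batch effect is removed fairly well, but I could not reproduce the high machine learning accuracy reported in the paper. Cross-platform batch correction and cross-platform use of machine learning models therefore remain open research directions. Thinking more broadly, batch-correction tools could be developed from the data-correction angle between RNA-seq and NanoString, between RNA-seq and single-cell sequencing, and between microarray and NanoString.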
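The idea of normalizing each feature separately can be sketched as follows. This is a minimal illustration, not the code of the FSQN R package: the function name `fsqn_sketch` is hypothetical, rows are assumed to be samples and columns genes, and both matrices are assumed to share the same genes in the same order. For each gene, the values from one platform are replaced by the quantiles of that gene's distribution on the target platform, keeping the sample ranks.

```python
import numpy as np

def fsqn_sketch(to_normalize, target):
    """Map each column of `to_normalize` onto the empirical distribution
    of the same column in `target` (hypothetical helper, illustration only)."""
    to_normalize = np.asarray(to_normalize, dtype=float)
    target = np.asarray(target, dtype=float)
    if to_normalize.shape[1] != target.shape[1]:
        raise ValueError("both matrices need the same features (columns)")

    n_samples = to_normalize.shape[0]
    # Probability assigned to each rank; midpoints avoid the exact 0 and 1 ends.
    probs = (np.arange(n_samples) + 0.5) / n_samples
    out = np.empty_like(to_normalize)

    for j in range(to_normalize.shape[1]):
        # Rank of every sample within this gene.
        order = np.argsort(to_normalize[:, j], kind="stable")
        ranks = np.empty(n_samples, dtype=int)
        ranks[order] = np.arange(n_samples)
        # Quantiles of the same gene on the target platform.
        target_quantiles = np.quantile(target[:, j], probs)
        out[:, j] = target_quantiles[ranks]
    return out
```

In a cross-platform setting, `target` would typically be the training platform (for example microarray) and `to_normalize` the platform the classifier is applied to (for example RNA-seq). Whether this restores classifier accuracy on real data is exactly the open question raised in the post above.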

Bioinformatics (Oxford, England), 2018-06-01. DOI: 10.1093/bioinformatics/bty026 PMID: 29360996

Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data

Jennifer M Franks, Guoshuai Cai, Michael L Whitfield

Abstract:

Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene …

3.

颜林林 (2022-11-20 22:18):

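#paper doi:10.1093/bioinformatics/btac018 Bioinformatics, 2022, StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. This paper introduces a small tool from Evan Eichler's group at the University of Washington. It computes identity (or similarity) between sequences at genome scale and displays these identity relationships in the triangular layout that genome browsers typically use for linkage disequilibrium (LD). This is little more than a routine task in everyday analysis, with no great novelty or significance. As an Application Note meant to help others reuse and quickly implement similar functionality, however, the authors packaged the feature as a snakemake module and borrowed HiGlass, another tool published in 2018, to display the results interactively and allow quick switching between resolutions, which does highlight its practicality.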
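A toy version of the identity matrix behind such a heatmap might look like the sketch below. It is only an illustration under simplifying assumptions: the function name `window_identity_matrix` is hypothetical, the sequence is cut into non-overlapping fixed-size windows, and an ungapped position-by-position comparison stands in for whatever alignment-based identity the actual tool computes.

```python
import numpy as np

def window_identity_matrix(seq, window):
    """Percent identity between all pairs of fixed-size windows of `seq`.

    Ungapped comparison only; a stand-in for alignment-based identity.
    """
    seq = seq.upper()
    # Non-overlapping windows; a trailing partial window is dropped.
    windows = [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]
    n = len(windows)
    identity = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            matches = sum(a == b for a, b in zip(windows[i], windows[j]))
            # The matrix is symmetric, so fill both triangles at once.
            identity[i, j] = identity[j, i] = 100.0 * matches / window
    return identity
```

Because the matrix is symmetric, only the upper triangle carries information, which is why a triangular display like the LD-style layout mentioned above is sufficient.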

Bioinformatics (Oxford, England), 2022-03-28. DOI: 10.1093/bioinformatics/btac018 PMID: 35020798 PMCID:PMC8963321

StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps

Mitchell R Vollger, Peter Kerpedjiev, Adam M Phillippy, Evan E Eichler

Abstract:

SUMMARY: The visualization and analysis of genomic repeats is typically accomplished using dot plots; however, the emergence of telomere-to-telomere assemblies with multi-megabase repeats requires new visualization strategies. Here, we introduce …

4.

林李泽强 (2022-08-31 22:57):

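#paper doi: 10.1093/bioinformatics/btz128, Bioinformatics, 2019, Brain annotation toolbox: exploring the functional and genetic associations of neuroimaging results. Until now, the results of most neuroimaging studies (such as activated clusters/regions or functional connectivity between brain regions) have been hard to interpret conveniently and systematically, leaving their biological meaning unclear. In this study, the authors developed a brain annotation toolbox that automatically generates functional and genetic annotations for neuroimaging results. The toolbox draws on voxel-level functional descriptions from the Neurosynth database and gene expression profiles from the Allen Human Brain Atlas, and uses them to produce functional/genetic information for region-level neuroimaging results. It is a free, open-source, MATLAB-based toolbox that can help provide functional/genetic annotations for newly discovered regions of unknown function.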

Bioinformatics (Oxford, England), 2019-10-01. DOI: 10.1093/bioinformatics/btz128 PMID: 30854545

Brain annotation toolbox: exploring the functional and genetic associations of neuroimaging results

Abstract:

MOTIVATION: Advances in neuroimaging and sequencing techniques provide an unprecedented opportunity to map the function of brain regions and identify the roots of psychiatric diseases. However, the results from most …

5.

颜林林 (2022-08-01 01:02):

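#paper doi:10.1093/bioinformatics/btac528 Bioinformatics, 2022, The K-mer File Format: a standardized and compact disk representation of sets of k-mers. A short string of k consecutive characters is called a k-mer. Many bioinformatics tools and analyses, such as building de Bruijn graphs (for genome assembly) and creating sequence indexes (for short-read alignment), rely on this concept, counting how often each k-mer occurs and recording other related information (such as positions in the genome or relationships with other k-mers). As k increases, the number of possible k-mers grows exponentially (4^k for DNA), which makes both computation and storage very costly. This paper therefore develops a file format for storing k-mer information that keeps the data compressed while still allowing efficient reads and writes. Honestly, this is not complicated work; anyone who knows some C++ and Rust could do it, and there is plenty of demand for this kind of thing.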
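To make the k-mer concept and the storage pressure concrete, here is a minimal sketch of counting k-mers and packing each one into an integer at 2 bits per base. It does not implement the K-mer File Format itself; the helper names `count_kmers`, `pack_kmer` and `unpack_kmer` are hypothetical, and only the alphabet A, C, G, T is assumed.

```python
from collections import Counter

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"

def count_kmers(seq, k):
    """Count every k-mer in `seq`, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in _CODE for base in kmer):
            counts[kmer] += 1
    return counts

def pack_kmer(kmer):
    """Encode a k-mer as an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value

def unpack_kmer(value, k):
    """Decode an integer produced by `pack_kmer` back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

With this encoding a 31-mer fits in a single 64-bit integer instead of 31 bytes of text, which hints at why a compact on-disk representation is worthwhile once billions of k-mers are involved.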

Bioinformatics (Oxford, England), 2022-09-15. DOI: 10.1093/bioinformatics/btac528 PMID: 35904548

The K-mer File Format: a standardized and compact disk representation of sets of k-mers

Yoann Dufresne, Teo Lemane, Pierre Marijon, Pierre Peterlongo, Amatur Rahman, Marek Kokot, Paul Medvedev, Sebastian Deorowicz, Rayan Chikhi

Abstract:

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a …

6.

Vincent (2022-07-31 17:30):

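#paper doi: 10.1093/bioinformatics/btab083 DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Because of polysemy and distant semantic relationships in sequence, the gene regulatory code is highly complex. In recent years, studies have found that DNA sequences, especially non-coding regions, resemble natural language in alphabet, grammar, and semantics, while BERT, a machine learning tool based on the transformer attention mechanism, has excelled in natural language processing. This paper follows a similar line of thinking to develop DNABERT, a pre-trained model that represents DNA features based on sequence context. To demonstrate its usefulness and performance, the paper tries several classic computational tasks: promoter prediction, splice site prediction, and transcription factor binding site prediction. The model is first used to encode DNA sequences and is then fine-tuned for each specific task, and it is found to easily surpass other algorithms in accuracy. To address the poor interpretability of deep learning, the method also offers visualization options that show the importance of individual sites and their relationships to other sites (the attention mechanism). The work further finds that a model pre-trained on the human genome also performs well on other organisms, which further shows that this encoding is transferable (it is not memorizing, but genuinely capturing some sequence-level features).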
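Before a BERT-style model can encode DNA, the sequence has to be turned into tokens. The sketch below shows one common way to do this, with overlapping k-mers as the "words". It is an assumption-laden illustration rather than DNABERT's actual tokenizer: the function names, the vocabulary layout, and the order of the special tokens are all hypothetical.

```python
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def kmer_tokens(seq, k):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(k):
    """Toy vocabulary: special tokens first, then all 4**k k-mers in ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + kmers)}

def encode(seq, vocab, k):
    """Turn a sequence into token ids, wrapped in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + kmer_tokens(seq, k) + ["[SEP]"]
    # k-mers containing characters outside ACGT fall back to [UNK].
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]
```

The resulting id list is what a transformer encoder would consume; pre-training and fine-tuning, as described in the post above, operate on such token sequences.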

Bioinformatics (Oxford, England), 2021-08-09. DOI: 10.1093/bioinformatics/btab083 PMID: 33538820

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri

Abstract:

MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant …

7.

颜林林 (2022-07-28 08:50):

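#paper doi:10.1093/bioinformatics/btac137 Bioinformatics, 2022, BWA-MEME: BWA-MEM emulated with a machine learning approach. I saw Heng Li (李恒) repost this paper on Twitter and assumed the master had upgraded bwa-mem2 again, then realized it was someone else's work that had simply received his endorsement. As the successor to a well-known piece of software, it has to improve substantially in some respect, and here the improvement is mainly in performance. Short-read alignment algorithms for high-throughput sequencing data usually first find exact-match seeds (almost always by table lookup in constant time) and then extend the match. Choosing the seed length takes some skill: too short, and there may be too many repeated matches (hits); too long, and many words with no match (the sequence does not occur in the genome) still occupy the dictionary, making it too large. Some earlier algorithms therefore used variable-length seeds to address this (I once considered this strategy too but, to my embarrassment, never put it into practice). Variable-length seeds, however, bring problems such as variable memory block sizes and frequent memory accesses, which create performance bottlenecks. This paper uses machine learning to preprocess during seed index construction, so that the index adapts to the genome sequence data and seeds of different lengths need a fixed number of memory accesses, which yields the performance gain. In the final benchmarks, bwa-meme produces the same output as bwa-mem2 while running 3.45 times faster. The algorithm in this paper deserves closer study.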
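The general idea of a learned index, where a model predicts where a key sits in a sorted table and a small bounded search corrects the guess, can be sketched as below. This is a generic illustration, not BWA-MEME's actual index structure: the class name `LearnedIndexSketch` is hypothetical, a single linear fit stands in for whatever model the paper uses, and keys are assumed to be distinct integers (for example seeds packed at 2 bits per base), with at least two of them.

```python
import bisect
import numpy as np

class LearnedIndexSketch:
    """Sorted-key lookup guided by a linear model (illustration only)."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        x = np.asarray(self.keys, dtype=float)
        y = np.arange(len(self.keys), dtype=float)
        # Fit position ~ slope * key + intercept over the sorted keys.
        slope, intercept = np.polyfit(x, y, 1)
        self.slope, self.intercept = float(slope), float(intercept)
        predicted = np.rint(self.slope * x + self.intercept)
        # Worst-case prediction error, plus one position of slack for rounding.
        self.max_err = int(np.max(np.abs(predicted - y))) + 1

    def lookup(self, key):
        """Return the position of `key` in the sorted keys, or -1 if absent."""
        n = len(self.keys)
        guess = int(round(self.slope * float(key) + self.intercept))
        # Search only inside the error window around the model's guess.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return -1
```

The error bound recorded at build time caps the size of the search window, so the number of probes per lookup is bounded in advance. That is the same flavor of guarantee as the fixed number of memory accesses described in the post above, although how the paper achieves it is not shown here.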

Bioinformatics (Oxford, England), 2022-04-28. DOI: 10.1093/bioinformatics/btac137 PMID: 35253835

BWA-MEME: BWA-MEM emulated with a machine learning approach

Youngmok Jung, Dongsu Han

Abstract:

MOTIVATION: The growing use of next-generation sequencing and enlarged sequencing throughput require efficient short-read alignment, where seeding is one of the major performance bottlenecks. The key challenge in the seeding …