颜林林
(2022-08-01 01:02):
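#paper doi:10.1093/bioinformatics/btac528 Bioinformatics, 2022, The K-mer File Format: a standardized and compact disk representation of sets of k-mers. A short string of k consecutive characters is called a k-mer. Many bioinformatics tools and analyses rely on this concept, for example building de Bruijn graphs (for genome assembly) and creating sequence indexes (for short-read alignment). They typically count how often each k-mer occurs and record related information, such as its positions in the genome and its relationships to other k-mers. As k increases, the number of possible k-mers grows exponentially (up to 4^k for the four-letter DNA alphabet), which imposes a heavy cost on both computation and storage. To address this, the paper develops a file format for storing k-mer information that keeps the data compressed while still allowing efficient reads and writes. Honestly, this is not complicated work: anyone who knows a bit of C++ and Rust could do it, and there are plenty of similar needs out there.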
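To make the k-mer idea concrete, here is a minimal Python sketch that counts k-mers in a sequence and packs each one into an integer at 2 bits per nucleotide. It is only an illustration of why compact encodings help: the function names and the A/C/G/T to 0/1/2/3 mapping are assumptions chosen for this example, and the code is not the on-disk layout defined by the K-mer File Format.

```python
from collections import Counter

# Hypothetical 2-bit codes for the DNA alphabet (an arbitrary choice for this
# illustration, not taken from the K-mer File Format specification).
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every k-mer (substring of length k) that occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pack_kmer(kmer):
    """Pack a k-mer into one integer, using 2 bits per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def possible_kmers(k):
    """Number of distinct k-mers over a 4-letter alphabet."""
    return 4 ** k


counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n, pack_kmer(kmer))
```

Stored as text, a k-mer takes k bytes; packed this way it takes 2k bits, a fourfold saving before any further compression. The count of possible k-mers, `possible_kmers(k)`, still grows as 4^k, which is why a dedicated storage format is worth having.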
The K-mer File Format: a standardized and compact disk representation of sets of k-mers
Abstract:
SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools.
AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.