小擎子
(2022-11-30 23:57):
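# paper doi:10.1186/s13059-016-0997-x; Genome Biol. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Mash is a tool that uses MinHash to quickly estimate distances between genomes and metagenomes. It provides two main functions, sketch and dist. sketch converts a sequence or a collection of sequences into a MinHash sketch, which greatly reduces memory usage. dist estimates the Jaccard index from two sketches and derives a distance that approximates 1 - ANI within a controllable error, at a much lower computational cost. The key choices are the k-mer size and s (the sketch size), both of which affect the error. A notable property of Mash is that most of the computation goes into building the sketches; once they exist, similarity comparison and clustering of more than ten thousand genomes is almost instantaneous.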
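To make the sketch/dist idea concrete, here is a minimal Python sketch of a bottom-s MinHash and the Mash distance D = -1/k · ln(2j / (1 + j)) from the paper. It is an illustration under assumptions, not Mash's actual implementation: the function names are hypothetical, BLAKE2b from hashlib stands in for the MurmurHash3 that Mash uses, and ambiguous bases such as N are not handled. The defaults k=21 and s=1000 mirror Mash's defaults.

```python
import hashlib
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def hash64(kmer: str) -> int:
    """64-bit hash of a k-mer (assumption: BLAKE2b as a stand-in for MurmurHash3)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(seq: str, k: int = 21, s: int = 1000) -> list[int]:
    """Bottom-s sketch: the s smallest distinct hashes over all canonical k-mers."""
    hashes = {hash64(canonical(seq[i:i + k])) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def jaccard_estimate(a: list[int], b: list[int], s: int) -> float:
    """Estimate the Jaccard index from two sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both sketches.
    """
    sa, sb = set(a), set(b)
    merged = sorted(sa | sb)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sa and h in sb)
    return shared / len(merged)


def mash_distance(j: float, k: int) -> float:
    """Mash distance from a Jaccard estimate j and k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    if j >= 1.0:
        return 0.0
    return min(1.0, -math.log(2.0 * j / (1.0 + j)) / k)
```

Usage would look like `mash_distance(jaccard_estimate(sketch(x), sketch(y), 1000), 21)` for two sequences `x` and `y`. Building each sketch scans the whole sequence, while comparing two sketches only touches at most 2s hashes, which is why comparison is cheap once the sketches exist. A larger s tightens the error of the Jaccard estimate; a smaller k is more sensitive to divergent sequences but raises the chance of k-mers matching by chance.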
Mash: fast genome and metagenome distance estimation using MinHash
Abstract:
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
Keywords: