白鸟 (2026-03-31 21:49):
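#paper DOI:10.1101/2025.01.29.635579, A SNP Foundation Model: Application in Whole-Genome Haplotype Phasing and Genotype Imputation. SNPBag is a Transformer-based foundation model designed for whole-genome-scale SNP analysis. Its tasks include genotype imputation, haplotype phasing, genome embedding, ancestry inference, and kinship inference. It addresses the scalability, efficiency, and reference-dependence problems of traditional tools, achieving a 10-100x speedup. SNPBag demonstrates the potential of foundation models in SNP analysis and offers a unified, efficient framework. Advantages: no reference panel is needed, because pretraining lets it model global genetic patterns directly; analysis is faster; and storage is compressed. Limitations: it relies on simulated data and may not fully capture real variation; performance is lower in high-diversity populations such as African ones; kinship inference has limited recall for distant relatives. Future work could extend it to more tasks (e.g. GWAS, PRS), integrate multimodal data, and fine-tune on larger real datasets.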
Towards a universal foundation model for biobank-scale human genome variation
Abstract:
Millions of human genomes have been genotyped by national biobanks worldwide. Training large language models (LLM) with this data may lead to a universal model of human genome with tremendous potential. Yet the quadrillions (10^15) of nucleotides, resulting from genome length multiplied by population size, pose formidable challenges for modeling. In this study, we propose a novel AI framework designed to scale with this data and support diverse analytical tasks. To demonstrate this scheme, we developed SNPBag, a foundation model focusing on single nucleotide polymorphism (SNP). With 0.8 billion parameters, it is trained on one million synthesized human genomes, corresponding to a total of 6 trillion SNP tokens. SNPBag showed superior performance in benchmarking of multiple tasks. In genotype imputation, it achieves state-of-the-art (SOTA) accuracy. In haplotype phasing, it rivals the best method with a 72-fold speedup. By encoding 6 million SNPs per genome into a 0.75 MB embedding, SNPBag enables efficient storage, transfer and downstream applications. In particular, the genome embeddings facilitate rapid ancestry inference across global populations and detection of genetic relationships up to 12th-degree relatives. Collectively, SNPBag introduces a new paradigm for scalable, unified and multitask analysis of the ever-growing human variation data.
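A quick sanity check on the numbers quoted in the abstract, as a back-of-envelope sketch rather than anything taken from the paper's code. It confirms that 6 million SNPs times one million genomes gives the stated 6 trillion tokens, and works out what the 0.75 MB embedding amounts to per SNP on average. The variable names are made up for illustration, and 1 MB is assumed to mean 10^6 bytes.

```python
# Back-of-envelope check of the figures quoted in the abstract.
# Assumption: 1 MB = 10**6 bytes (the abstract does not say which convention it uses).
SNPS_PER_GENOME = 6_000_000    # "6 million SNPs per genome"
TRAINING_GENOMES = 1_000_000   # "one million synthesized human genomes"
EMBEDDING_BYTES = 750_000      # "0.75 MB embedding"

# Total SNP tokens seen in training: SNPs per genome times number of genomes.
total_tokens = SNPS_PER_GENOME * TRAINING_GENOMES

# Average storage budget per SNP implied by the embedding size.
bits_per_snp = EMBEDDING_BYTES * 8 / SNPS_PER_GENOME

print(f"training tokens: {total_tokens:.1e}")
print(f"embedding budget: {bits_per_snp:.2f} bits per SNP")
```

Under that assumption the embedding works out to an average of about one bit per SNP. This is only an average over the whole embedding and says nothing about how the model actually lays out the information.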