颜林林
(2022-07-11 00:41):
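#paper doi:10.1101/2022.07.09.499321 bioRxiv, 2022, A Draft Human Pangenome Reference. This looks like another landmark paper, posted ahead of publication as a bioRxiv preprint. It is a collaboration among more than thirty leading institutions, and even after being condensed under the name "Human Pangenome Reference Consortium" the author list is still long. It includes many familiar names that have appeared again and again on major genomics papers over the past years, among them Heng Li (李恒), who is listed as one of the corresponding authors. The manuscript runs to 97 pages (not counting another 39 pages of supplementary material), which reflects the scale of the work. As is well known, the human reference genome we have all been using was built from only a small number of early donors, and it is hard to believe that their genomes adequately represent the gene pool of all humanity. So over the years, as genomic data accumulated, the reference has been revised repeatedly, with one patch after another. The "pangenome reference" proposed in this paper can be seen as another major improvement and a new release, possibly even a key step toward a near "once and for all" solution. It integrates 47 individual genomes, each of them a phased, diploid assembly. The individuals were chosen to be genetically diverse, building on earlier human population genomics efforts such as HapMap and the 1000 Genomes Project. According to the abstract, the assemblies cover more than 99% of the expected sequence and are more than 99% accurate at the structural and base-pair levels. The lengthy manuscript lays out the full construction process for this new reference in detail, down to the exact command lines and parameters, and is well worth careful study.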
A Draft Human Pangenome Reference. bioRxiv, 2022. DOI: 10.1101/2022.07.09.499321
Abstract:
The Human Pangenome Reference Consortium (HPRC) presents a first draft human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence and are more than 99% accurate at the structural and base-pair levels. Based on alignments of the assemblies, we generated a draft pangenome that captures known variants and haplotypes, reveals novel alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,529 gene duplications relative to the existing reference, GRCh38. Roughly 90 million of the additional base pairs derive from structural variation. Using our draft pangenome to analyze short-read data reduces errors when discovering small variants by 34% and boosts the detected structural variants per haplotype by 104% compared to GRCh38-based workflows, and by 34% compared to using previous diversity sets of genome assemblies.
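To put the abstract's headline numbers in perspective, here is a minimal back-of-the-envelope sketch in Python. It uses only figures quoted in the abstract. The variable names are illustrative, and the derived values (haplotype count, share of added sequence from structural variation, fold changes) are this reader's arithmetic, not numbers reported by the authors.

```python
# Figures quoted in the abstract (the "roughly 90 million" value is approximate).
N_INDIVIDUALS = 47              # phased, diploid assemblies
ADDED_BP = 119_000_000          # euchromatic polymorphic sequence added vs. GRCh38
ADDED_BP_FROM_SV = 90_000_000   # portion attributed to structural variation
SV_GAIN_PERCENT = 104           # more SVs detected per haplotype vs. GRCh38 workflows
SMALL_VARIANT_ERROR_DROP = 34   # percent fewer small-variant errors vs. GRCh38 workflows

# Assumption: each phased diploid assembly contributes two haplotypes.
PLOIDY = 2
haplotypes = N_INDIVIDUALS * PLOIDY

# Share of the newly added sequence that comes from structural variation.
sv_share = ADDED_BP_FROM_SV / ADDED_BP

# A "+104%" boost is a bit more than a doubling; "-34%" leaves about two thirds.
sv_fold_change = 1 + SV_GAIN_PERCENT / 100
remaining_error_fraction = 1 - SMALL_VARIANT_ERROR_DROP / 100

print(f"haplotype assemblies: {haplotypes}")
print(f"share of added sequence from SVs: {sv_share:.1%}")
print(f"SVs detected per haplotype: x{sv_fold_change:.2f}")
print(f"small-variant errors remaining: {remaining_error_fraction:.0%}")
```

Read this way, about three quarters of the added sequence is attributed to structural variation, and the number of structural variants detected per haplotype roughly doubles relative to GRCh38-based workflows.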