Papers shared by user 颜林林.
120 shared papers found in total; this page shows items 1 - 20.
1.
颜林林 (2025-01-18 12:14):
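#paper doi:10.1016/j.ajhg.2024.12.013, The American Journal of Human Genetics, 2025, HiFi long-read genomes for difficult-to-detect, clinically relevant variants. This paper comes from PacBio. The authors selected 100 cases, each carrying a known pathogenic germline variant that is hard to detect with short-read sequencing but had already been confirmed by other methods such as PCR or MLPA. With long-read sequencing (LRS, PacBio HiFi), 93% of the variants were detected: 83% were reported by the analysis pipeline, and for a further 10% the corresponding signal was visible on manual review of the data. On this basis, the authors argue that LRS is a promising single technology for diagnosing rare diseases.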
2.
颜林林 (2024-12-22 16:33):
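#paper doi:10.1038/s41593-024-01812-2, Nature Neuroscience, The cell-type underpinnings of the human functional cortical connectome. Published in Nature Neuroscience last month, this paper combines fMRI with single-cell transcriptome sequencing data to study how functional gradients of the cerebral cortex relate to the distribution of cell types. fMRI is usually used to study brain function at the macroscopic level, whereas transcriptomics and other high-throughput sequencing technologies are so microscopic that they are hard to map directly onto macroscopic function. A project I am involved in requires combining genomics with brain imaging, so I surveyed the literature in this direction and found that the vast majority of papers only scratch the surface when relating fMRI to omics data. Fortunately, the analysis in this paper goes comparatively deep, and it gave me some useful ideas. The fMRI data come from the Human Connectome Project (HCP), and the single-nucleus RNA sequencing (snRNA-seq) data come from the Allen Human Brain Atlas (AHBA). The functional networks derived from the fMRI data are mapped to functional gradients, which capture the continuous transition from unimodal (e.g. visual) to transmodal (e.g. default-mode network) regions. A deconvolution method is then applied to the transcriptome data to infer the cell-type composition of each cortical parcel. Finally, the two are aligned and linked through spatial coordinates, which makes it possible to discuss in detail how cell-type composition varies along the different functional gradients. The data and methods are all public, and the methodological details are worth studying closely.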
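The last step described above, relating a functional gradient to cell-type composition across cortical parcels, can be illustrated with a rank correlation. This is only a minimal sketch under stated assumptions: the parcel values, the cell-type names, and the choice of Spearman correlation are all hypothetical and are not taken from the paper, whose actual statistics may differ.

```python
from statistics import mean

def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))

# Hypothetical toy data: one gradient value and one inferred
# cell-type fraction per cortical parcel (six parcels).
gradient = [-0.9, -0.5, -0.1, 0.3, 0.6, 0.8]
fraction_by_cell_type = {
    "cell_type_A": [0.30, 0.26, 0.22, 0.15, 0.12, 0.10],
    "cell_type_B": [0.05, 0.08, 0.07, 0.12, 0.15, 0.20],
}

for name, fractions in fraction_by_cell_type.items():
    print(name, round(spearman(gradient, fractions), 3))
```

In real data, neighbouring parcels are spatially autocorrelated, so a plain correlation like this would overstate significance; a spatially informed null model would be needed before drawing conclusions.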
3.
颜林林 (2024-11-15 23:02):
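#paper doi:10.1101/2024.01.18.24301478, medRxiv, Connecting genomic results for psychiatric disorders to human brain cell types and regions reveals convergence with functional connectivity. This paper was posted on medRxiv in January this year (and has not yet been formally published) and comes from Karolinska Institutet in Sweden. In research on psychiatric disorders, MRI is mainly used to localize brain regions associated with phenotypes or functions, while genomics localizes genes and variants through large-scale population genotyping and GWAS, but the two have not yet been linked directly. Existing human brain single-cell sequencing data come mainly from healthy subjects; they can connect brain regions to molecular mechanisms but lack disease-related data. This study uses a series of methods (such as TDEP and S-LDSC) to link GWAS data with human single-nucleus transcriptome data, connecting phenotypes such as psychiatric disorders to specific brain regions and specific cell types. In addition, the study used fMRI data to validate the functional connectivity properties of key brain regions (such as the hippocampus, amygdala, and prefrontal cortex), and the results support the regional localization found by genetics and transcriptomics. This integrative approach opens new possibilities for understanding the polygenic mechanisms of psychiatric disorders and where they act in the brain.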
Abstract:
Understanding the temporal and spatial brain locations etiological for psychiatric disorders is essential for targeted neurobiological research. Integration of genomic insights from genome-wide association studies with single-cell transcriptomics is a powerful approach although past efforts have necessarily relied on mouse atlases. Leveraging a comprehensive atlas of the adult human brain, we prioritized cell types via the enrichment of SNP-heritabilities for brain diseases, disorders, and traits, progressing from individual cell types to brain regions. Our findings highlight specific neuronal clusters significantly enriched for the SNP-heritabilities for schizophrenia, bipolar disorder, and major depressive disorder along with intelligence, education, and neuroticism. Extrapolation of cell-type results to brain regions reveals important patterns for schizophrenia with distinct subregions in the hippocampus and amygdala exhibiting the highest significance. Cerebral cortical regions display similar enrichments despite the known prefrontal dysfunction in those with schizophrenia highlighting the importance of subcortical connectivity. Using functional MRI connectivity from cases with schizophrenia and neurotypical controls, we identified brain networks that distinguished cases from controls that also confirmed involvement of the central and lateral amygdala, hippocampal body, and prefrontal cortex. Our findings underscore the value of single-cell transcriptomics in decoding the polygenicity of psychiatric disorders and offer a promising convergence of genomic, transcriptomic, and brain imaging modalities toward common biological targets.
4.
颜林林 (2024-10-27 08:10):
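#paper doi:10.1126/science.add7046, Science, 2023, Transcriptomic diversity of cell types across the adult human brain. This paper comes from Karolinska Institutet in Sweden (one of the world's leading medical research institutions, which awards the Nobel Prize in Physiology or Medicine) and the Allen Institute for Brain Science (a major center for neuroscience research that aims to advance understanding of brain structure and function through large-scale studies and open data sharing). Using brain tissue from three adult donors, the study sampled roughly 100 dissections covering the telencephalon, midbrain, brainstem, and other regions, and performed single-nucleus sequencing to obtain expression data for more than three million cells. Hierarchical clustering and graph-based analysis were used to group the cells into 31 superclusters, 461 clusters, and more than 3,000 subclusters, and to reveal how each cluster is distributed across regions and how it relates to function. One supercluster stands out: named "Splatter" neurons, it has complex molecular properties and a broad distribution, reflects the high heterogeneity of neurons and their functional networks, and is one of the focal points of the study. The paper also describes the hierarchical organization of excitatory and inhibitory neurons in the cortex, the heterogeneity of neurons across brain regions, and the diversity and regional heterogeneity of non-neuronal astrocytes, extending our understanding of the relationship between gene expression and the brain's functional networks at single-cell resolution.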
Science, 2023-10-13. DOI: 10.1126/science.add7046
Abstract:
The human brain directs complex behaviors, ranging from fine motor skills to abstract intelligence, but the diversity of cell types that support these skills has not been fully described. In this work, we used single-nucleus RNA sequencing to systematically survey cells across the entire adult human brain. We sampled more than three million nuclei from approximately 100 dissections across the forebrain, midbrain, and hindbrain in three postmortem donors. Our analysis identified 461 clusters and 3313 subclusters organized largely according to developmental origins and revealing high diversity in midbrain and hindbrain neurons. Astrocytes and oligodendrocyte-lineage cells also exhibited regional diversity at multiple scales. The transcriptomic census of the entire human brain presented in this work provides a resource for understanding the molecular diversity of the human brain in health and disease.
5.
颜林林 (2024-09-28 00:05):
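#paper doi:10.1126/science.1255905, Science, 2015, Correlated gene expression supports synchronous activity in brain networks. I recently joined a brain-science project, so I have started digging into some of the historically important papers in the area. This is probably the first paper to link fMRI (functional magnetic resonance imaging) data with transcriptome data. The authors first applied independent component analysis (ICA) to resting-state fMRI data to construct 14 functional networks, then used MNI coordinates to map these onto more than a thousand transcriptome samples provided by the Allen Institute, determining which functional network each sample belongs to. They then used permutation tests to confirm that the gene expression signal within functional networks is statistically significant. A stability-selection procedure picked out 136 key genes, which are related to ion channels and synaptic function and are notably implicated in Alzheimer's disease and schizophrenia. The relationship between gene expression and functional networks was then validated in the adolescent IMAGEN dataset, a mouse dataset, other rodent studies, and disease-association analyses, and the stable role of these genes was confirmed across species and cell types. The study was the first to reveal, at the molecular level, the relationship between gene expression and the brain's functional networks.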
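The permutation-test idea mentioned above can be sketched as follows. This is a simplified illustration, not the paper's actual statistic: the similarity matrix, the network labels, and the use of mean within-network similarity as the test statistic are all assumptions made for this example.

```python
import random
from itertools import combinations

def within_network_mean(similarity, labels):
    """Mean pairwise similarity over sample pairs that share a network label."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    return sum(similarity[i][j] for i, j in pairs) / len(pairs)

def permutation_p_value(similarity, labels, n_perm=2000, seed=0):
    """Shuffle network labels to build a null distribution for the statistic."""
    rng = random.Random(seed)
    observed = within_network_mean(similarity, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if within_network_mean(similarity, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: six tissue samples assigned to two networks, with
# higher expression similarity inside a network than between networks.
labels = ["net1", "net1", "net1", "net2", "net2", "net2"]
similarity = [
    [0.8 if labels[i] == labels[j] else 0.1 for j in range(6)]
    for i in range(6)
]

observed, p_value = permutation_p_value(similarity, labels)
print(observed, p_value)
```

With only six samples, few distinct label arrangements exist, so the p-value cannot become very small; the study had on the order of a thousand samples, which gives a permutation test far more resolution.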
Science, 2015-6-12. DOI: 10.1126/science.1255905
Jonas Richiardi, Andre Altmann, Anna-Clare Milazzo, Catie Chang, M. Mallar Chakravarty, Tobias Banaschewski, Gareth J. Barker, Arun L.W. Bokde, Uli Bromberg, Christian Büchel, Patricia Conrod, Mira Fauth-Bühler, Herta Flor, Vincent Frouin, Jürgen Gallinat, Hugh Garavan, Penny Gowland, Andreas Heinz, Hervé Lemaître, Karl F. Mann, Jean-Luc Martinot, Frauke Nees, Tomáš Paus, Zdenka Pausova, Marcella Rietschel, Trevor W. Robbins, Michael N. Smolka, Rainer Spanagel, Andreas Ströhle, Gunter Schumann, Mike Hawrylycz, Jean-Baptiste Poline, Michael D. Greicius, Lisa Albrecht, Chris Andrew, Mercedes Arroyo, Eric Artiges, Semiha Aydin, Christine Bach, Tobias Banaschewski, Alexis Barbot, Gareth Barker, Nathalie Boddaert, Arun Bokde, Zuleima Bricaud, Uli Bromberg, Ruediger Bruehl, Christian Büchel, Arnaud Cachia, Anna Cattrell, Patricia Conrod, Patrick Constant, Jeffrey Dalley, Benjamin Decideur, Sylvane Desrivieres, Tahmine Fadai, Herta Flor, Vincent Frouin, Jürgen Gallinat, Hugh Garavan, Fanny Gollier Briand, Penny Gowland, Bert Heinrichs, Andreas Heinz, Nadja Heym, Thomas Hübner, James Ireland, Bernd Ittermann, Tianye Jia, Mark Lathrop, Dirk Lanzerath, Claire Lawrence, Hervé Lemaitre, Katharina Lüdemann, Christine Macare, Catherine Mallik, Jean-François Mangin, Karl Mann, Jean-Luc Martinot, Eva Mennigen, Fabiana Mesquita de Carvahlo, Xavier Mignon, Ruben Miranda, Kathrin Müller, Frauke Nees, Charlotte Nymberg, Marie-Laure Paillere, Tomas Paus, Zdenka Pausova, Jean-Baptiste Poline, Luise Poustka, Michael Rapp, Gabriel Robert, Jan Reuter, Marcella Rietschel, Stephan Ripke, Trevor Robbins, Sarah Rodehacke, John Rogers, Alexander Romanowski, Barbara Ruggeri, Christine Schmäl, Dirk Schmidt, Sophia Schneider, MarkGunter Schumann, Florian Schubert, Yannick Schwartz, Michael Smolka, Wolfgang Sommer, Rainer Spanagel, Claudia Speiser, Tade Spranger, Alicia Stedman, Sabina Steiner, Dai Stephens, Nicole Strache, Andreas Ströhle, Maren Struve, Naresh Subramaniam, Lauren Topper, Robert Whelan, Steve Williams, Juliana Yacubian, Monica Zilbovicius, C Peng Wong, Steven Lubbe, Lourdes Martinez-Medina, Alinda Fernandes, Amir Tahmasebi
Abstract:
Cooperating brain regions express similar genes. When the brain is at rest, a number of distinct areas are functionally connected. They tend to be organized in networks. Richiardi et al. compared brain imaging and gene expression data to build computational models of these networks. These functional networks are underpinned by the correlated expression of a core set of 161 genes. In this set, genes coding for ion channels and other synaptic functions such as neurotransmitter release dominate. Science, this issue p. 1241
6.
颜林林 (2024-08-18 05:49):
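#paper doi:10.1038/s41597-024-03701-6, Scientific data, 2024, ChineseMPD: A Semantic Segmentation Dataset of Chinese Martial Arts Classic Movie Props. Data cleaning and curation alone, resulting in a publicly available dataset, can also be published as a paper, and the journal Scientific Data carries many such articles. The data shared here are quite fun: drawn from a large collection of Chinese martial arts films, they use semantic segmentation to pick out martial arts props such as guns, swords, sticks, knives, hooks, and arrows. A team of 21 people, including 11 undergraduates, spent half a year on manual annotation and review, filling a gap left by existing semantic segmentation datasets with respect to action-movie props. The dataset is released under a CC BY 4.0 license and is described as available for non-commercial redistribution, modification, adaptation, and building upon the work. Download: https://www.scidb.cn/en/anonymous/SlpaelFy
IF: 5.800 (Q1). Scientific data, 2024-Aug-14. DOI: 10.1038/s41597-024-03701-6 PMID: 39143093 PMCID: PMC11325024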
Abstract:
Recent advances in computer vision and deep learning techniques have facilitated significant progress in video scene understanding, thus helping film and television practitioners achieve accurate video editing. However, so far, publicly available semantic segmentation datasets are mostly limited to indoor scenes, city streets, and natural images, often ignoring example objects in action movies, which is a research gap that needs to be urgently filled. In this paper, we introduce a large-scale, high-precision semantic segmentation dataset of props in Chinese martial arts movie clips, named ChineseMPD. Specifically, this dataset first establishes segmentation rules and general review criteria for audiovisual data, and then provides semantic segmentation annotations for six weapon props (Gun, Sword, Stick, Knife, Hook, and Arrow) with a summary of 32,992 objects. To the best of our knowledge, this dataset is the largest semantic segmentation dataset for movie props to date. ChineseMPD dataset not only significantly expands the application of traditional tasks of computer vision such as object detection and scene understanding, but also opens up new avenues for interdisciplinary research.
7.
颜林林 (2024-07-20 14:59):
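#paper doi:10.1371/journal.pcbi.1012232, PLOS Computational Biology, Ten simple rules for building and maintaining a responsible data science workflow. "Ten simple rules" is a very popular series of commentary articles in PLOS Comp. Bio. Each piece is short, addresses one topic, and offers ten "rules" that are explained one by one. The advice usually comes from experienced people in the field, so it tends to be concise and to the point, and well worth reading. This one is about how to build and maintain a "responsible" data science workflow, which in short is "how not to do evil". Its suggestions include thinking ahead about the harmful outcomes a study might lead to, watching for bias in data sources, regularly reviewing and re-examining the work, iteratively updating evaluation methods and criteria, and maintaining transparency. For those of us who work with data every day, using this paper as a routine checklist is not a bad choice.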
8.
颜林林 (2024-06-19 06:01):
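#paper doi:10.1038/s41559-024-02420-w, Nature Ecology & Evolution, 2024, African elephants address one another with individually specific name-like calls. This is a fascinating study. By analyzing the vocalizations and behavior of African elephants, the authors show for the first time that African elephants can use individually specific calls to identify and address other elephants. The study was motivated by the author's observation, as an ecologist, that African elephants have extensive vocal communication and rich social relationships, which led him to hypothesize that elephants may well give each other names. The team recorded elephant calls and used machine learning methods such as random forests to link each call to its intended receiver, reaching a prediction accuracy of 27.5%, significantly better than the random control. Even more interestingly, they played calls back to the elephants and observed their reactions, confirming that when elephants heard "their own name" they called back more loudly and moved toward the loudspeaker more quickly. The authors consider the study "a very promising start" that will "lead to a whole range of other questions that can be studied", such as whether elephants also name places, or even talk about each other in the third person. This social need to address individual conspecifics may well be a precursor to the origin of language.
IF: 13.900 (Q1). Nature ecology & evolution, 2024-Jun-10. DOI: 10.1038/s41559-024-02420-w PMID: 38858512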
Abstract:
Personal names are a universal feature of human language, yet few analogues exist in other species. While dolphins and parrots address conspecifics by imitating the calls of the addressee, human names are not imitations of the sounds typically made by the named individual. Labelling objects or individuals without relying on imitation of the sounds made by the referent radically expands the expressive power of language. Thus, if non-imitative name analogues were found in other species, this could have important implications for our understanding of language evolution. Here we present evidence that wild African elephants address one another with individually specific calls, probably without relying on imitation of the receiver. We used machine learning to demonstrate that the receiver of a call could be predicted from the call's acoustic structure, regardless of how similar the call was to the receiver's vocalizations. Moreover, elephants differentially responded to playbacks of calls originally addressed to them relative to calls addressed to a different individual. Our findings offer evidence for individual addressing of conspecifics in elephants. They further suggest that, unlike other non-human animals, elephants probably do not rely on imitation of the receiver's calls to address one another.
9.
颜林林 (2024-05-25 18:44):
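#paper doi:10.1126/science.adh2602, Science, 2024, Molecular cascades and cell type-specific signatures in ASD revealed by single-cell genomics. This week Science published a large batch of single-cell and spatial transcriptomics papers on the brain, and this is one of them. It combines single-nucleus RNA-seq, single-nucleus ATAC-seq, and spatial transcriptomics to study how genetic variation affects the human brain in health and disease, particularly in autism spectrum disorder (ASD). It is the largest single-cell analysis of an ASD cohort to date, with 33 ASD cases and 31 controls (64 in total), using postmortem cerebral cortex tissue from these subjects. The analysis reveals a transition of cell states from homeostatic to reactive in ASD, identifies expression differences and transcription factor networks involving thousands of genes, and provides strong causal anchors and molecular phenotypes for understanding the molecular changes of ASD in the human brain.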
Abstract:
Genomic profiling in postmortem brain from autistic individuals has consistently revealed convergent molecular changes. What drives these changes and how they relate to genetic susceptibility in this complex condition are not well understood. We performed deep single-nucleus RNA sequencing (snRNA-seq) to examine cell composition and transcriptomics, identifying dysregulation of cell type-specific gene regulatory networks (GRNs) in autism spectrum disorder (ASD), which we corroborated using single-nucleus assay for transposase-accessible chromatin with sequencing (snATAC-seq) and spatial transcriptomics. Transcriptomic changes were primarily cell type specific, involving multiple cell types, most prominently interhemispheric and callosal-projecting neurons, interneurons within superficial laminae, and distinct glial reactive states involving oligodendrocytes, microglia, and astrocytes. Autism-associated GRN drivers and their targets were enriched in rare and common genetic risk variants, connecting autism genetic susceptibility and cellular and circuit alterations in the human brain.
10.
颜林林 (2024-04-28 14:25):
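#paper doi:10.1101/2022.11.29.518309, bioRxiv, 2024, NanoTrans: an integrated computational framework for comprehensive transcriptome analysis with Nanopore direct-RNA sequencing. This preprint presents NanoTrans, an analysis pipeline for Nanopore direct RNA sequencing (DRS) data that performs comprehensive transcriptome analysis, including clustering and quantification of genes and their transcripts, poly(A) tail length profiling, RNA modification analysis, and fusion gene detection. The paper is not particularly innovative technically, but it integrates the various analysis steps fairly comprehensively, packages them as a one-stop solution, and outputs the results as a single HTML report, which is user-friendly and genuinely useful. The pipeline was also tested on several real datasets (yeast, Arabidopsis, human embryonic kidney, and cancer cell lines) to show that it applies to different biological scenarios. Personally, I think this kind of pipeline development work is hard to publish well (yet we often have to spend a great deal of time on it). To increase its value, one needs to improve and optimize more deeply for specific scenarios rather than simply aiming for completeness. But optimizing for data from a specific scenario further narrows the pipeline's applicability, and if the results are then unremarkable (for example, no novel findings), the eventual value is again very limited.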
Abstract:
Nanopore direct RNA sequencing (DRS) provides the direct access to native RNA strands with full-length information, shedding light on rich qualitative and quantitative properties of gene expression profiles. Here with NanoTrans, we present an integrated computational framework that comprehensively covers all major DRS-based application scopes, including isoform clustering and quantification, poly(A) tail length estimation, RNA modification profiling, and fusion gene detection. In addition to its merit in providing such a streamlined one-stop solution, NanoTrans also shines in its workflow-orientated modular design, batch processing capability, all-in-one tabular and graphic report output, as well as automatic installation and configuration supports. Finally, by applying NanoTrans to real DRS datasets of yeast, Arabidopsis, as well as human embryonic kidney and cancer cell lines, we further demonstrated its utility, effectiveness, and efficacy across a wide range of DRS-based application settings.
11.
颜林林 (2024-03-13 05:35):
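#paper doi:10.1101/2024.02.18.580107, 2024, FECDO-Flexible and Efficient Coding for DNA Odyssey. This paper proposes a new encoding method for DNA data storage, FECDO (short for Flexible and Efficient Coding for DNA Odyssey), which aims to reduce DNA synthesis cost through efficient data compression and a flexible coding strategy, thereby making DNA data storage more practical. The method first uses deep learning (trying both standalone neural networks with no prior knowledge and pre-trained language models) to extract features from the data, converting the data to be stored from a one-hot encoded tensor into a sequence of marginal probabilities, which is where the compression happens. The probability sequence is then mapped to a sequence over the four bases (A, C, G, T), and a hierarchical finite state machine is used to exclude encodings unsuitable for DNA storage (such as runs of identical bases or special secondary structures). Tested on text and image data, the method improves compression efficiency by 12%-26% compared with bzip2. This translates into a substantial reduction in DNA synthesis cost, which is a key issue for DNA storage. The authors also synthesized the encoded result for one set of text as actual DNA (for storage), amplified the target fragments by PCR, sequenced them with Nanopore, and decoded them to recover the original data, validating the method across the whole workflow. Because the paper is still a bioRxiv preprint (submitted version v2) that provides only the main text and main figures, without supplementary materials, method descriptions, or source code, many implementation details and results remain unpublished. I am personally rather skeptical about the method's error tolerance and real-world performance, and the compression shown in the main figures for non-English text and images does not look very good either. These questions will have to wait for answers in the formally published version.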
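To make the "exclude runs of identical bases" constraint concrete, here is a minimal sketch of a much simpler, well-known alternative: a rotating base-3 code in which each digit selects one of the three bases that differ from the previous base. It is not FECDO's hierarchical finite state machine (whose details are not available from the preprint) and it performs no compression; the function names and the fixed six-digit-per-byte layout are assumptions made for illustration.

```python
BASES = "ACGT"

def bytes_to_trits(data):
    """Convert bytes to base-3 digits, six per byte (3**6 = 729 >= 256)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits

def trits_to_dna(trits, previous="A"):
    """Each digit picks one of the three bases that differ from the last base."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != previous]
        previous = choices[t]
        out.append(previous)
    return "".join(out)

def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna by replaying the same state transitions."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits

def trits_to_bytes(trits):
    """Invert bytes_to_trits: every six digits form one byte."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for t in trits[i:i + 6]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)

message = b"DNA"
dna = trits_to_dna(bytes_to_trits(message))
decoded = trits_to_bytes(dna_to_trits(dna))
print(dna, decoded)
```

The price of the constraint is density: this scheme stores log2(3), roughly 1.58 bits per base, instead of 2, which is why the compression step that precedes encoding matters so much for synthesis cost.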
Abstract:
DNA has been pursued as a compelling medium for digital data storage during the past decade. While large-scale data storage and random access have been achieved in artificial DNA, the synthesis cost keeps hindering DNA data storage from popularizing into daily life. In this study, we proposed a more efficient paradigm for digital data compressing to DNA, while excluding arbitrary sequence constraints. Both standalone neural networks and pre-trained language models were used to extract the intrinsic patterns of data, and generated probabilistic portrayal, which was then transformed into constraint-free nucleotide sequences with a hierarchical finite state machine. Utilizing these methods, a 12%-26% improvement of compression ratio was realized for various data, which directly translated to up to 26% reduction in DNA synthesis cost. Combined with the progress in DNA synthesis, our methods are expected to facilitate the realization of practical DNA data storage.
12.
颜林林 (2024-02-29 09:02):
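#paper doi:10.1038/s41592-024-02201-0. Nature Methods, 2024, scGPT: toward building a foundation model for single-cell multi-omics using generative AI. This paper applies a large generative AI model to the analysis of single-cell sequencing data. The authors did not collect samples or sequence anything themselves; they relied entirely on published public data or data from public databases for model training, tool development, and performance validation, making this a typical pure bioinformatics paper. Riding the popularity of generative AI, and with good performance results, it was published in Nature Methods and is well worth studying and emulating for bioinformatics practitioners. The paper was posted as a preprint on bioRxiv more than nine months earlier, when it integrated data from 10 million cells. In the formally published version the number of integrated cells has grown to 33 million and model performance has improved further. The model, named scGPT, is a single-cell foundation model based on the generative pretrained transformer architecture, designed to process and interpret large-scale single-cell data. scGPT shows strong performance on a range of downstream tasks, such as cell type annotation, genetic perturbation response prediction, multi-batch integration, and multi-omic integration. The innovation lies in applying the foundation model concept to single-cell biology for the first time, using self-supervised pretraining and task-specific fine-tuning to capture the complex biological relationships between cells and genes. scGPT uses this learning capacity to reveal gene-gene interactions under specific conditions, and demonstrates scaling and context effects in transfer learning. Compared with traditional machine learning models, large models can capture finer and more comprehensive biological features, especially long-range dependencies and complex relationships in the data, such as unknown cell types or cell-cell interactions hidden in the data, which is probably an important motivation for applying them to single-cell data analysis.
IF: 36.100 (Q1). Nature methods, 2024-Aug. DOI: 10.1038/s41592-024-02201-0 PMID: 38409223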
Abstract:
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
13.
颜林林 (2024-01-30 09:58):
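#paper doi:10.1101/2023.12.20.570816. bioRxiv, 2024, Neoantigen Cancer Vaccines and Different Immune Checkpoint Therapies Each Utilize Both Converging and Distinct Mechanisms that in Combination Enable Synergistic Therapeutic Efficacy. This study used methylcholanthrene (MCA, a chemical carcinogen) to induce tumors in genetically engineered mice, creating a sarcoma animal model, and used it to compare the efficacy of neoantigen vaccines with anti-CTLA-4/anti-PD-1 therapy. Both kinds of treatment promote the expansion of specific intratumoral CD8 T cells, with the neoantigen vaccine having the stronger effect. Using single-cell transcriptome sequencing and single-cell immune repertoire sequencing, the paper analyzes the changes that the different therapies induce in the immune microenvironment and shows that the clonotype expansion of these cells relates to phenotypes and functional states associated with the specific immunotherapy. Combining the neoantigen vaccine with ICT was more effective than either treatment alone, providing evidence in support of combining cancer vaccines with immune checkpoint therapy.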
Abstract:
The goal of therapeutic cancer vaccines and immune checkpoint therapy (ICT) is to eliminate cancer by expanding and/or sustaining T cells with anti-tumor capabilities. However, whether cancer vaccines and ICT enhance anti-tumor immunity by distinct or overlapping mechanisms remains unclear. Here, we compared effective therapeutic tumor-specific mutant neoantigen (NeoAg) cancer vaccines with anti-CTLA-4 and/or anti-PD-1 ICT in preclinical models. Both NeoAg vaccines and ICT induce expansion of intratumoral NeoAg-specific CD8 T cells, though the degree of expansion and acquisition of effector activity was much more substantial following NeoAg vaccination. Further, we found that NeoAg vaccines are particularly adept at inducing proliferating and stem-like NeoAg-specific CD8 T cells. Single cell T cell receptor (TCR) sequencing revealed that TCR clonotype expansion and diversity of NeoAg-specific CD8 T cells relates to their phenotype and functional state associated with specific immunotherapies employed. Effective NeoAg vaccines and ICT required both CD8 and CD4 T cells. While NeoAg vaccines and anti-PD-1 affected the CD4 T cell compartment, it was to less of an extent than observed with anti-CTLA-4, which notably induced ICOS+Bhlhe40+ Th1-like CD4 T cells and, when combined with anti-PD-1, a small subset of Th2-like CD4 T cells. Although effective NeoAg vaccines or ICT expanded intratumoral M1-like iNOS+ macrophages, NeoAg vaccines expanded rather than suppressed (as observed with ICT) M2-like CX3CR1+CD206+ macrophages, associated with the vaccine adjuvant. Further, combining NeoAg vaccination with ICT induced superior efficacy compared to either therapy in isolation, highlighting the utility of combining these modalities to eliminate cancer.
14.
颜林林 (2023-12-27 12:42):
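#paper doi:10.1101/2023.10.04.560604. bioRxiv, 2023, Federated Learning for multi-omics: a performance evaluation in Parkinson's disease. This paper is based on two datasets from Parkinson's disease studies (PPMI and PDBP). Each enrolled several hundred patients and healthy controls and performed both WGS and RNA-seq, yielding multi-omics features. PPMI was split into K folds; one fold was held out and the remaining K-1 folds were used for model training, and the model was then tested and evaluated on the held-out PPMI fold and on PDBP. Models were built both with centralized machine learning and with federated learning, in which the data were split across multiple nodes (sites), using several different federated learning strategies. The results show that, although the dispersion of samples across sites and the choice of federated strategy both affect final performance, the best federated result is comparable to centralized training. The paper also evaluated training time, finding that federated learning takes at least an order of magnitude longer than the centralized approach. Even so, because federated learning avoids sharing and transferring large-scale data between sites, it still has an advantage for integrating broader data and improving model performance. The paper provides an in-depth analysis of federated learning applied to multi-omics, and in particular to Parkinson's disease prediction, showing both its potential and its challenges as a collaborative tool for handling large-scale heterogeneous data.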
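For readers unfamiliar with federated learning, the core aggregation step of one common strategy, federated averaging (FedAvg), can be sketched in a few lines. Whether and how the paper used this particular strategy is not stated above, so treat it as an assumed example; the site weights and sample counts are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Average per-site model parameters, weighted by each site's sample count.

    Only parameters and sample counts leave a site; raw data never does.
    """
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[k] * n for w, n in zip(site_weights, site_sizes)) / total
        for k in range(n_params)
    ]

# Hypothetical example: three sites, each holding a locally trained
# two-parameter model and a different number of samples.
site_weights = [[0.2, -1.0], [0.4, -0.5], [1.0, 0.0]]
site_sizes = [100, 300, 100]

global_weights = federated_average(site_weights, site_sizes)
print(global_weights)
```

In a full training loop, the averaged parameters would be sent back to the sites for another round of local training, and this repeated communication is one plausible source of the longer training time noted above.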
Abstract:
While machine learning (ML) research has recently grown more in popularity, its application in the omics domain is constrained by access to sufficiently large, high-quality datasets needed to train ML models. Federated Learning (FL) represents an opportunity to enable collaborative curation of such datasets among participating institutions. We compare the simulated performance of several models trained using FL against classically trained ML models on the task of multi-omics Parkinson's Disease prediction. We find that FL model performance tracks centrally trained ML models, where the most performant FL model achieves an AUC-PR of 0.876 ± 0.009, 0.014 ± 0.003 less than its centrally trained variation. We also determine that the dispersion of samples within a federation plays a meaningful role in model performance. Our study implements several open source FL frameworks and aims to highlight some of the challenges and opportunities when applying these collaborative methods in multi-omics studies.
15.
颜林林 (2023-11-25 09:20):
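#paper doi:10.1016/S2589-7500(23)00219-4. The Lancet Digital Health. 2023, Operational greenhouse-gas emissions of deep learning in digital pathology: a modelling study. Given how widely deep learning is now used in digital pathology, this paper analyzes the energy consumption of different models and analysis strategies. To assess the impact, the authors used the server's IP address to look up the corresponding local electricity supply, combined this with information such as the share of renewable energy in the region's supply, and converted energy use into carbon dioxide emissions as a measure of environmental sustainability impact. The results suggest that widespread use of deep learning in pathology worldwide could produce large greenhouse-gas emissions, up to 16 megatons of CO2 equivalent, which would require about 86,590 km2 of forest (0.22% of the world's forest) to absorb, and the authors call for attention to the issue. As for the comparison between models, the result is, unsurprisingly, that upgraded or simpler models reduce energy use and thus help with environmental sustainability. Personally I cannot agree with extrapolating energy consumption to carbon emissions in this way: there are too many factors in between, and oversimplifying them makes it easy to manipulate and steer the conclusion. That said, the part of the paper that compares energy consumption does have merit.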
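The conversion chain discussed above is simple arithmetic, which also shows how sensitive the headline numbers are to the assumed factors. The sketch below uses hypothetical energy and grid-intensity values; only the 16 Mt, 86,590 km2, and 0.22% figures come from the paper's abstract, and the "implied" quantities are derived from those three numbers rather than reported by the authors.

```python
def co2_eq_kg(energy_kwh, grid_intensity_g_per_kwh):
    """Convert energy use to kg CO2 eq, given grid carbon intensity in g/kWh."""
    return energy_kwh * grid_intensity_g_per_kwh / 1000.0

# Hypothetical inputs: 1200 kWh of compute on a grid emitting 400 g CO2 eq/kWh.
emissions_kg = co2_eq_kg(1200, 400)

# Figures quoted in the abstract: 16 Mt CO2 eq per year would need
# 86,590 km2 of forest, stated to be 0.22% of the world's forest.
max_emissions_t = 16e6
forest_km2 = 86_590
forest_share = 0.0022

# Quantities implied by those figures (derived here, not reported).
implied_uptake_t_per_km2 = max_emissions_t / forest_km2
implied_world_forest_km2 = forest_km2 / forest_share

print(emissions_kg)
print(round(implied_uptake_t_per_km2, 1), round(implied_world_forest_km2))
```

Halving the assumed grid intensity halves the result, and every factor in the chain is an input of this kind; changing the assumed sequestration rate rescales the forest area in the same way.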
Abstract:
BACKGROUND: Deep learning is a promising way to improve health care. Image-processing medical disciplines, such as pathology, are expected to be transformed by deep learning. The first clinically applicable deep-learning … >>>
BACKGROUND: Deep learning is a promising way to improve health care. Image-processing medical disciplines, such as pathology, are expected to be transformed by deep learning. The first clinically applicable deep-learning diagnostic support tools are already available in cancer pathology, and their number is increasing. However, data on the environmental sustainability of these tools are scarce. We aimed to conduct an environmental-sustainability analysis of a theoretical implementation of deep learning in patient-care pathology.METHODS: For this modelling study, we first assembled and calculated relevant data and parameters of a digital-pathology workflow. Data were breast and prostate specimens from the university clinic at the Institute of Pathology of the Rheinisch-Westfälische Technische Hochschule Aachen (Aachen, Germany), for which commercially available deep learning was already available. Only specimens collected between Jan 1 and Dec 31, 2019 were used, to omit potential biases due to the COVID-19 pandemic. Our final selection was based on 2 representative weeks outside holidays, covering different types of specimens. To calculate carbon dioxide (CO2) or CO2 equivalent (CO2 eq) emissions of deep learning in pathology, we gathered relevant data for exact numbers and sizes of whole-slide images (WSIs), which were generated by scanning histopathology samples of prostate and breast specimens. We also evaluated different data input scenarios (including all slide tiles, only tiles containing tissue, or only tiles containing regions of interest). To convert estimated energy consumption from kWh to CO2 eq, we used the internet protocol address of the computational server and the Electricity Maps database to obtain information on the sources of the local electricity grid (ie, renewable vs non-renewable), and estimated the number of trees and proportion of the local and world's forests needed to sequester the CO2 eq emissions. 
We calculated the computational requirements and CO2 eq emissions of 30 deep-learning models that varied in task and size. The first scenario represented the use of one commercially available deep-learning model for one task in one case (1-task), the second scenario considered two deep-learning models for two tasks per case (2-task), the third scenario represented a future, potentially automated workflow that could handle 7 tasks per case (7-task), and the fourth scenario represented the use of a single potential, large, computer-vision model that could conduct multiple tasks (multitask). We also compared the performance (ie, accuracy) and CO2 eq emissions of different deep-learning models for the classification of renal cell carcinoma on WSIs, also from Rheinisch-Westfälische Technische Hochschule Aachen. We also tested other approaches to reducing CO2 eq emissions, including model pruning and an alternative method for histopathology analysis (pathomics).FINDINGS: The pathology database contained 35 552 specimens (237 179 slides), 6420 of which were prostate specimens (10 115 slides) and 11 801 of which were breast specimens (19 763 slides). We selected and subsequently digitised 140 slides from eight breast-cancer cases and 223 slides from five prostate-cancer cases. Applying large deep-learning models on all WSI tiles of prostate and breast pathology cases would result in yearly CO2 eq emissions of 7·65 metric tons (t; 95% CI 7·62-7·68) with the use of a single deep-learning model per case; yearly CO2 eq emissions were up to 100·56 t (100·21-100·99) with the use of seven deep-learning models per case. CO2 eq emissions for different deep-learning model scenarios, data inputs, and deep-learning model sizes for all slides varied from 3·61 t (3·59-3·63) to 2795·30 t (1177·51-6482·13. 
For the estimated number of overall pathology cases worldwide, the yearly CO2 eq emissions varied, reaching up to 16 megatons (Mt) of CO2 eq, requiring up to 86 590 km2 (0·22%) of world forest to sequester the CO2 eq emissions. Use of the 7-task scenario and small deep-learning models on slides containing tissue only could substantially reduce CO2 eq emissions worldwide by up to 141 times (0·1 Mt, 95% CI 0·1-0·1). Considering the local environment in Aachen, Germany, the maximum CO2 eq emission from the use of deep learning in digital pathology only would require 32·8% (95% CI 13·8-76·6) of the local forest to sequester the CO2 eq emissions. A single pathomics run on a tissue could provide information that was comparable to or even better than the output of multitask deep-learning models, but with 147 times reduced CO2 eq emissions.INTERPRETATION: Our findings suggest that widespread use of deep learning in pathology might have considerable global-warming potential. The medical community, policy decision makers, and the public should be aware of this potential and encourage the use of CO2 eq emissions reduction strategies where possible.FUNDING: German Research Foundation, European Research Council, German Federal Ministry of Education and Research, Health, Economic Affairs and Climate Action, and the Innovation Fund of the Federal Joint Committee. <<<
16.
颜林林 (2023-10-27 12:22):
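#paper doi:10.1038/s41592-023-02043-2. Nature Methods, 2023, Comprehensive benchmarking and guidelines of mosaic variant calling strategies. This is a methods benchmarking paper that systematically evaluates 11 mosaic variant callers (including MosaicHunter, which I contributed to during my PhD). Mosaic variants are somatic mutations that arise early in development, after sperm and egg fuse to form the zygote; as development and organogenesis proceed, they are carried into and distributed across different parts of the body. The authors take cell lines whose germline variants are already known and mix them in stages to mimic mosaic variants arising at different early developmental stages. This yields a set of reference samples with mosaic variants at known frequencies (ground truth) for testing and evaluating each caller. (Over the past few years we have used the same reference-material preparation approach in developing cancer genetic testing products, to technically validate somatic variant calling.) The evaluation shows that mosaic variant calling depends heavily on the research purpose (and the assumptions that follow from it): the tools and parameters chosen for a given purpose can substantially affect the results. Based on the evaluation, the paper characterizes each tool and offers guidance and recommendations for future mosaic variant studies and for developers of analysis tools.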
IF: 36.100 (Q1). Nature methods, 2023-Dec. DOI: 10.1038/s41592-023-02043-2 PMID: 37828153
Abstract:
Rapid advances in sequencing and analysis technologies have enabled the accurate detection of diverse forms of genomic variants represented as heterozygous, homozygous and mosaic mutations. However, the best practices for mosaic variant calling remain disorganized owing to the technical and conceptual difficulties faced in evaluation. Here we present our benchmark of 11 feasible mosaic variant detection approaches based on a systematically designed whole-exome-level reference standard that mimics mosaic samples, supported by 354,258 control positive mosaic single-nucleotide variants and insertion-deletion mutations and 33,111,725 control negatives. We identified not only the best practice for mosaic variant detection but also the condition-dependent strengths and weaknesses of the current methods. Furthermore, feature-level evaluation and their combinatorial usage across multiple algorithms direct the way for immediate to prolonged improvements in mosaic variant detection. Our results will guide researchers in selecting suitable calling algorithms and suggest future strategies for developers.
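The reference-standard idea in this entry, mixing cell lines with known germline genotypes so that the resulting "mosaic" allele fractions are known in advance, can be illustrated with a small calculation. This is a sketch under simplifying assumptions (diploid cell lines, equal DNA contribution per cell, no copy-number changes); the function name and the example proportions are hypothetical and are not taken from the paper.

```python
def expected_vaf(mix_fractions, alt_allele_counts, ploidy=2):
    """Expected variant allele fraction (VAF) at one site in a cell-line mixture.

    mix_fractions: share of cells contributed by each cell line (sums to 1).
    alt_allele_counts: number of alternate alleles each line carries at the
        site (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alt).
    """
    if len(mix_fractions) != len(alt_allele_counts):
        raise ValueError("one genotype is needed per cell line")
    if abs(sum(mix_fractions) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    weighted_alt = sum(f * a for f, a in zip(mix_fractions, alt_allele_counts))
    return weighted_alt / ploidy


# Hypothetical example: a line heterozygous for a variant is spiked into a
# reference-homozygous background at decreasing proportions.
for spike_in in (0.5, 0.1, 0.02):
    vaf = expected_vaf([spike_in, 1.0 - spike_in], [1, 0])
    print(f"spike-in {spike_in:.0%} -> expected VAF {vaf:.1%}")
```

Because the germline genotype of every line is known, each mixing ratio fixes the expected VAF of every site, which is what makes the mixture usable as ground truth for low-frequency calls.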
17.
颜林林 (2023-09-27 09:43):
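#paper doi:10.1038/s41467-023-41690-z. 2023, Nature Communications, Genome-wide enhancer-gene regulatory maps link causal variants to target genes underlying human cancer risk. This paper applies a computational method called Activity-by-Contact (ABC) to published multi-omics sequencing data from 20 cancer types and identifies more than 540,000 enhancer-gene regulation pairs, providing a basis for interpreting the function of the non-coding variants within them. The authors then enrolled 10 colorectal cancer (CRC) clinical samples, performed multi-omics profiling and the same identification of regulatory pairs, and validated the findings in large population cohorts of tens of thousands of individuals. They further functionally validated rs4810856, a regulatory-region variant associated with CRC risk, at the gene-expression and protein-expression levels using cell lines, mouse models, and other assays. The paper is not especially coherent logically, but it represents a large amount of work. It reads as though the authors started by enrolling 10 cancer patients and running multi-omics sequencing, then kept extending and refining the analysis, and finally pieced the results together into a story.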
IF: 14.700 (Q1). Nature communications, 2023-09-25. DOI: 10.1038/s41467-023-41690-z PMID: 37749132
Abstract:
Genome-wide association studies have identified numerous variants associated with human complex traits, most of which reside in the non-coding regions, but biological mechanisms remain unclear. However, assigning function to the non-coding elements is still challenging. Here we apply Activity-by-Contact (ABC) model to evaluate enhancer-gene regulation effect by integrating multi-omics data and identified 544,849 connections across 20 cancer types. ABC model outperforms previous approaches in linking regulatory variants to target genes. Furthermore, we identify over 30,000 enhancer-gene connections in colorectal cancer (CRC) tissues. By integrating large-scale population cohorts (23,813 cases and 29,973 controls) and multipronged functional assays, we demonstrate an ABC regulatory variant rs4810856 associated with CRC risk (Odds Ratio = 1.11, 95%CI = 1.05-1.16, P = 4.02 × 10) by acting as an allele-specific enhancer to distally facilitate PREX1, CSE1L and STAU1 expression, which synergistically activate p-AKT signaling. Our study provides comprehensive regulation maps and illuminates a single variant regulating multiple genes, providing insights into cancer etiology.
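For readers unfamiliar with the Activity-by-Contact idea, the scoring step can be sketched as follows. This assumes the commonly cited formulation of the ABC model, in which an element's score for a gene is its activity times its contact frequency with the gene, normalized by the same product summed over all candidate elements near that gene, with activity taken as the geometric mean of a chromatin accessibility signal and an H3K27ac signal. The function names and all numeric values are hypothetical; this is not the pipeline used in the paper.

```python
from math import sqrt


def activity(accessibility: float, h3k27ac: float) -> float:
    """Enhancer activity as the geometric mean of two signal values."""
    return sqrt(accessibility * h3k27ac)


def abc_scores(elements):
    """ABC score of every candidate element for one target gene.

    elements: dict mapping element name -> (activity, contact_with_gene).
    Each score is activity * contact divided by the sum of that product over
    all candidate elements, so the scores for one gene sum to 1.
    """
    products = {name: a * c for name, (a, c) in elements.items()}
    total = sum(products.values())
    if total == 0:
        return {name: 0.0 for name in elements}
    return {name: p / total for name, p in products.items()}


# Hypothetical candidate elements near one gene (all values are made up).
candidates = {
    "enhancer_1": (activity(9.0, 4.0), 0.10),  # strong signal, moderate contact
    "enhancer_2": (activity(4.0, 1.0), 0.20),  # weaker signal, closer contact
    "enhancer_3": (activity(1.0, 1.0), 0.02),  # weak and rarely in contact
}
scores = abc_scores(candidates)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ABC = {score:.3f}")
```

An enhancer-gene pair is then typically called when its score exceeds a chosen threshold; the threshold is a tuning decision and is deliberately left out of this sketch.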
18.
颜林林 (2023-08-30 08:09):
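#paper doi:10.1016/j.crmeth.2023.100547. Cell Reports Methods, 2023, An introduction to representation learning for single-cell data analysis. The performance of machine learning methods often depends on data quality and on the chosen features (that is, how the data are represented). Representation learning lets the model itself learn the representation of the data, which works well when enough data are available, and single-cell sequencing analysis is exactly such a setting. This paper reviews how representation learning methods are applied at each stage of single-cell sequencing data analysis (including data pre-processing, hyperparameter optimization, downstream analysis, and biological validation) and the key points to watch for at each stage.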
IF: 4.300 (Q2). Cell reports methods, 2023-08-28. DOI: 10.1016/j.crmeth.2023.100547 PMID: 37671013 PMCID: PMC10475795
Abstract:
Single-cell-resolved systems biology methods, including omics- and imaging-based measurement modalities, generate a wealth of high-dimensional data characterizing the heterogeneity of cell populations. Representation learning methods are routinely used to analyze these complex, high-dimensional data by projecting them into lower-dimensional embeddings. This facilitates the interpretation and interrogation of the structures, dynamics, and regulation of cell heterogeneity. Reflecting their central role in analyzing diverse single-cell data types, a myriad of representation learning methods exist, with new approaches continually emerging. Here, we contrast general features of representation learning methods spanning statistical, manifold learning, and neural network approaches. We consider key steps involved in representation learning with single-cell data, including data pre-processing, hyperparameter optimization, downstream analysis, and biological validation. Interdependencies and contingencies linking these steps are also highlighted. This overview is intended to guide researchers in the selection, application, and optimization of representation learning strategies for current and future single-cell research applications.
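The phrase "projecting them into lower-dimensional embeddings" can be made concrete with the simplest statistical case, principal component analysis (PCA), applied after a typical pre-processing step. The sketch below is an illustration only: the count matrix is made up, the function names are hypothetical, and the pure-Python power iteration is used just to keep the example dependency-free. A real analysis would rely on a dedicated library and far more cells and genes.

```python
import math


def log_normalize(counts, scale=10_000):
    """Scale each cell to the same total count, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        normalized.append([math.log1p(scale * x / total) for x in cell])
    return normalized


def first_principal_component(rows, iterations=200):
    """Leading principal axis of `rows` (cells x genes) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between genes (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axis = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        axis = [v / norm for v in nxt]
    # One-dimensional embedding: projection of each centered cell on the axis.
    embedding = [sum(c[j] * axis[j] for j in range(d)) for c in centered]
    return axis, embedding


# Made-up counts: 6 cells x 4 genes, two groups with different marker genes.
counts = [
    [90, 80, 5, 3], [70, 95, 2, 6], [85, 60, 4, 4],   # group A
    [4, 6, 75, 90], [3, 2, 88, 70], [6, 5, 60, 95],   # group B
]
axis, embedding = first_principal_component(log_normalize(counts))
print([round(x, 2) for x in embedding])
```

In this toy case the single embedding coordinate already separates the two groups of cells. The manifold-learning and neural-network methods covered by the review pursue the same goal with nonlinear mappings.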
19.
颜林林 (2023-07-25 00:17):
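#paper doi:10.1038/s41588-023-01452-5. Nature Genetics, 2023, Crosstalk between RNA m6A and DNA methylation regulates transposable element chromatin activation and cell fate in human pluripotent stem cells. This paper develops a method called CARGO-BioID, based on CRISPR and protein proximity labeling, that captures proteins bound to the sequences of specific transposable element (TE) regions in the genome and then identifies and quantifies them through experiments such as mass spectrometry and ChIP-seq. The paper targets LTR7/HERV-H, a primate-specific TE, and uses this approach to identify the proteins bound to it, among them YTHDC2 and TET1: the former is a reader of RNA m6A methylation, and the latter is a demethylase for DNA 5mC methylation. A series of cell experiments then confirms the biological roles of these two proteins at this genomic region, including the crosstalk between the corresponding RNA methylation and DNA methylation, their regulation of TE activity, and their effect on the differentiation fate of hPSCs (human pluripotent stem cells).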
IF: 31.700 (Q1). Nature genetics, 2023-08. DOI: 10.1038/s41588-023-01452-5 PMID: 37474847
Abstract:
Transposable elements (TEs) are parasitic DNA sequences accounting for over half of the human genome. Tight control of the repression and activation states of TEs is critical for genome integrity, development, immunity and diseases, including cancer. However, precisely how this regulation is achieved remains unclear. Here we develop a targeted proteomic proximity labeling approach to capture TE-associated proteins in human embryonic stem cells (hESCs). We find that the RNA N6-methyladenosine (m6A) reader, YTHDC2, occupies genomic loci of the primate-specific TE, LTR7/HERV-H, specifically through its interaction with m6A-modified HERV-H RNAs. Unexpectedly, YTHDC2 recruits the DNA 5-methylcytosine (5mC)-demethylase, TET1, to remove 5mC from LTR7/HERV-H and prevent epigenetic silencing. Functionally, the YTHDC2/LTR7 axis inhibits neural differentiation of hESCs. Our results reveal both an underappreciated crosstalk between RNA m6A and DNA 5mC, the most abundant regulatory modifications of RNA and DNA in eukaryotes, and the fact that in hESCs this interplay controls TE activity and cell fate.
20.
颜林林 (2023-06-24 21:59):
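#paper doi:10.1093/nar/gkad526 Nucleic Acids Research, 2023, Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv. This is a bioinformatics paper. The authors developed a tool, nanomonsv, that identifies structural variations (SVs) from paired tumor and control long-read (third-generation) sequencing data. The program has two modules, the Canonical SV module and the Single breakend SV module. The former looks for multiple supporting reads that span a breakpoint; the latter first assembles the sequence on one side of the breakpoint and then uses the soft-clipped portion to search for the sequence on the other side (which may be missing from the genome or hard to place). Implementing, optimizing, and integrating these two strategies improves SV detection performance. The tool is tested and evaluated on long-read data from three tumor cell lines (and their matched controls), and some of the results are validated by PCR. The paper also analyzes sequence features such as methylation, repeats, mobile elements, and viral sequence integration to further round out its content.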
IF: 16.600 (Q1). Nucleic acids research, 2023-08-11. DOI: 10.1093/nar/gkad526 PMID: 37336583
Abstract:
We present our novel software, nanomonsv, for detecting somatic structural variations (SVs) using tumor and matched control long-read sequencing data with a single-base resolution. The current version of nanomonsv includes two detection modules, Canonical SV module, and Single breakend SV module. Using tumor/control paired long-read sequencing data from three cancer and their matched lymphoblastoid lines, we demonstrate that Canonical SV module can identify somatic SVs that can be captured by short-read technologies with higher precision and recall than existing methods. In addition, we have developed a workflow to classify mobile element insertions while elucidating their in-depth properties, such as 5' truncations, internal inversions, as well as source sites for 3' transductions. Furthermore, Single breakend SV module enables the detection of complex SVs that can only be identified by long-reads, such as SVs involving highly-repetitive centromeric sequences, and LINE1- and virus-mediated rearrangements. In summary, our approaches applied to cancer long-read sequencing data can reveal various features of somatic SVs and will lead to a better understanding of mutational processes and functional consequences of somatic SVs.
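The general idea behind a tumor/control comparison of breakpoint-spanning reads can be sketched in a few lines. To be clear, this is not nanomonsv's implementation: the input format, function names, clustering rule, and thresholds below are all hypothetical and serve only to illustrate the strategy of keeping junctions supported by several tumor reads and by no control reads.

```python
from collections import defaultdict


def cluster_breakpoints(reads, tolerance=50):
    """Group reads whose junction ends lie within `tolerance` bp of a cluster.

    reads: list of (sample, chrom1, pos1, chrom2, pos2) tuples, where sample
        is "tumor" or "control" and the two positions are the junction ends
        reported by one split-aligned read.
    Returns a list of dicts holding a representative junction and the number
    of supporting reads per sample.
    """
    by_chroms = defaultdict(list)
    for sample, c1, p1, c2, p2 in reads:
        by_chroms[(c1, c2)].append((p1, p2, sample))

    clusters = []
    for (c1, c2), items in by_chroms.items():
        found = []
        for p1, p2, sample in sorted(items):
            for cl in found:
                if (abs(p1 - cl["pos1"]) <= tolerance
                        and abs(p2 - cl["pos2"]) <= tolerance):
                    cl["support"][sample] += 1
                    break
            else:  # no existing cluster is close enough: start a new one
                found.append({
                    "chrom1": c1, "pos1": p1, "chrom2": c2, "pos2": p2,
                    "support": defaultdict(int, {sample: 1}),
                })
        clusters.extend(found)
    return clusters


def somatic_candidates(clusters, min_tumor_reads=3, max_control_reads=0):
    """Keep junctions well supported in tumor and absent from control."""
    return [cl for cl in clusters
            if cl["support"]["tumor"] >= min_tumor_reads
            and cl["support"]["control"] <= max_control_reads]


# Made-up split-read junctions: one tumor-only event and one shared event.
reads = [
    ("tumor", "chr1", 1000, "chr5", 20000),
    ("tumor", "chr1", 1004, "chr5", 20010),
    ("tumor", "chr1", 998, "chr5", 19995),
    ("tumor", "chr2", 500, "chr2", 9000),
    ("tumor", "chr2", 505, "chr2", 9003),
    ("tumor", "chr2", 498, "chr2", 8990),
    ("control", "chr2", 501, "chr2", 9001),
]
calls = somatic_candidates(cluster_breakpoints(reads))
for call in calls:
    print(call["chrom1"], call["pos1"], call["chrom2"], call["pos2"],
          dict(call["support"]))
```

Here the chr2 junction is discarded because one control read supports it, so it is more likely germline or an artifact, while the chr1 to chr5 junction is kept as a somatic candidate.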