Papers shared by user 颜林林.
116 shared papers found in total; this page shows items 61 to 80.
61.
颜林林 (2022-07-21 00:29):
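#paper doi:10.1186/s13059-022-02726-7 Genome Biology, 2022, Integration of single-cell multi-omics data by regression analysis on unpaired observations. Because of technical limitations, the vast majority of single-cell multi-omics studies cannot actually measure several omics layers in the same cell. To address this, the paper starts from the intuitive assumption that target genes with similar expression also have similar regulators, and uses regression analysis to link scRNA-seq and ATAC-seq data and infer the relationship between them. In unpaired scRNA-seq and ATAC-seq experiments (where the two assays were run on different cells rather than the same cell), one modality (for example, chromatin accessibility from ATAC-seq) can then be used to infer the expression of the corresponding regulated genes. The method was evaluated on simulated and real data and reached high accuracy (its results were highly consistent with eQTL mapping). This gives methodological support for making better use of the large body of unpaired single-cell data accumulated so far.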
IF: 10.100, Q1. Genome biology, 2022-07-19. DOI: 10.1186/s13059-022-02726-7 PMID: 35854350 PMCID: PMC9295346
Abstract:
Despite recent developments, it is hard to profile all multi-omics single-cell data modalities on the same cell. Thus, huge amounts of single-cell genomics data of unpaired observations on different cells are generated. We propose a method named UnpairReg for the regression analysis on unpaired observations to integrate single-cell multi-omics data. On real and simulated data, UnpairReg provides an accurate estimation of cell gene expression where only chromatin accessibility data is available. The cis-regulatory network inferred from UnpairReg is highly consistent with eQTL mapping. UnpairReg improves cell type identification accuracy by joint analysis of single-cell gene expression and chromatin accessibility data.
62.
颜林林 (2022-07-20 07:49):
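#paper doi:10.1101/2022.07.17.500374 bioRxiv, 2022, Genozip Dual-Coordinate VCF format enables efficient genomic analyses and alleviates liftover limitations. This is an example of "taking a small thing seriously". In genome analysis we often agonize over whether to use hg19 or hg38, and sometimes have to run two nearly identical analyses in parallel against both reference genomes to avoid the information loss caused by coordinate conversion tools such as liftOver. The paper targets this small (and perhaps not even very common) pain point: while remaining compatible with the existing VCF format, it lets a single result file carry two sets of genomic coordinates, which does not interfere with existing tools and allows either coordinate system to be extracted at any time. The idea is simple and the implementation is not hard, yet it does solve a real practical problem.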
Abstract:
We introduce Dual Coordinate VCF (DVCF), a file format that records genomic variants against two different reference genomes simultaneously and is fully compliant with the current VCF specification. As implemented in the Genozip platform, DVCF enables bioinformatics pipelines to seamlessly operate across two coordinate systems by leveraging the system most advantageous to each pipeline step, simplifying bioinformatics workflows and reducing file generation and associated data storage burden. Moreover, our benchmarking of Genozip DVCF shows that it produces more complete, less erroneous, and less biased translations across coordinate systems than two widely used alternative tools (i.e., LiftoverVcf and CrossMap).
63.
颜林林 (2022-07-19 00:21):
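#paper doi:10.1002/humu.24440 Human Mutation, 2022, Multi-omics analysis reveals multiple mechanisms causing Prader-Willi like syndrome in a family with a X;15 translocation. This paper reports a family affected by PWS (Prader-Willi syndrome) and the process of finding and confirming the genetic cause. PWS is a neurodevelopmental disorder and a textbook genetic disease, because it is caused by variation in an imprinted gene region. Genomic imprinting means that an allele "remembers" whether it came from the father or the mother, and the gene is expressed only from the chromosome of one parental origin. PWS maps to the 15q11.2 region, and the disease usually results from loss of the paternal copy of the genes there. In the reported family, both daughters show disease-related symptoms (obesity, intellectual disability, and so on), and the mother is a carrier (of a translocation between chromosome 15 and the X chromosome). The study used karyotyping, FISH (fluorescence in situ hybridization), methylation-sensitive MLPA, short-read WGS, 10x linked-read WGS, transcriptome sequencing, ddPCR and other methods, each answering a specific question in the genetic work-up. Together they confirmed the genetic cause and led to an explanation of different disease mechanisms in the two daughters: one shows uniparental disomy (UPD) of this region of chromosome 15, and the other has lost imprinting at the imprinted genes, so that SNRPN is expressed from both chromosomes. For researchers in genetic disease and for people working in genetic counseling, the study as a whole covers many techniques, follows a clear line of reasoning, and is well worth studying.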
IF: 3.300, Q2. Human mutation, 2022-11. DOI: 10.1002/humu.24440 PMID: 35842787
Abstract:
Prader-Willi syndrome (PWS; MIM# 176270) is a neurodevelopmental disorder caused by the loss of expression of paternally imprinted genes within the PWS region located on 15q11.2. It is usually caused by either maternal uniparental disomy of chromosome 15 (UPD15) or 15q11.2 recurrent deletion(s). Here, we report a healthy carrier of a balanced X;15 translocation and her two daughters, both with the karyotype 45,X,der(X)t(X;15)(p22;q11.2),-15. Both daughters display symptoms consistent with haploinsufficiency of the SHOX gene and PWS. We explored the architecture of the derivative chromosomes and investigated effects on gene expression in patient-derived neural cells. First, a multiplex ligation-dependent probe amplification methylation assay was used to determine the methylation status of the PWS-region revealing maternal UPD15 in daughter 2, explaining her clinical symptoms. Next, short read whole genome sequencing and 10X genomics linked read sequencing was used to pinpoint the exact breakpoints of the translocation. Finally, we performed transcriptome sequencing on neuroepithelial stem cells from the mother and from daughter 1 and observed biallelic expression of genes in the PWS region (including SNRPN) in daughter 1. In summary, our multi-omics analysis highlights two different PWS mechanisms in one family and provide an example of how structural variation can affect imprinting through long-range interactions.
64.
颜林林 (2022-07-18 06:00):
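#paper doi:10.1101/2022.07.14.500036 bioRxiv, 2022, Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics. In single-cell transcriptome analysis, batch effects have to be removed. This is usually done by reducing the originally high-dimensional data to a low-dimensional space that reflects the structure of the data more readily, and correcting the data there using the batch labels. The process can easily damage biologically meaningful features of the data, and those biological differences are exactly what single-cell sequencing sets out to study. Removing batch effects and preserving biological variation are two conflicting goals, and single-cell analysis tools usually pick one of them as the priority to optimize, depending on their own strategy. Here the authors introduce a multi-objective optimization technique called Pareto multi-task learning (Pareto MTL) to jointly evaluate and weigh the metrics for both goals and reach a better overall result. They also propose measuring batch effect with a neural-network-based estimator, Mutual Information Neural Estimation (MINE), which helps in choosing the balance point. The method was evaluated on public datasets such as TM-MARROW and MACAQUE-RETINA, and MINE was shown to outperform the commonly used MMD measure.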
Abstract:
Single-cell RNA sequencing (scRNA-seq) technology has contributed significantly to diverse research areas in biology, from cancer to development. Since scRNA-seq data is high-dimensional, a common strategy is to learn low dimensional latent representations better to understand overall structure in the data. In this work, we build upon scVI, a powerful deep generative model which can learn biologically meaningful latent representations, but which has limited explicit control of batch effects. Rather than prioritizing batch effect removal over conservation of biological variation, or vice versa, our goal is to provide a bird eye view of the trade-offs between these two conflicting objectives. Specifically, using the well established concept of Pareto front from economics and engineering, we seek to learn the entire trade-off curve between conservation of biological variation and removal of batch effects. A multi-objective optimisation technique known as Pareto multi-task learning (Pareto MTL) is used to obtain the Pareto front between conservation of biological variation and batch effect removal. Our results indicate Pareto MTL can obtain a better Pareto front than the naive scalarization approach typically encountered in the literature. In addition, we propose to measure batch effect by applying a neural-network based estimator called Mutual Information Neural Estimation (MINE) and show benefits over the more standard Maximum Mean Discrepancy (MMD) measure. The Pareto front between conservation of biological variation and batch effect removal is a valuable tool for researchers in computational biology. Our results demonstrate the efficacy of applying Pareto MTL to estimate the Pareto front in conjunction with applying MINE to measure the batch effect.
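To make the "Pareto front" idea above concrete, here is a minimal sketch. It is not Pareto MTL and not the paper's code; it only filters a set of already-trained models down to the non-dominated ones. The score pairs are invented for illustration, and I assume both scores are "higher is better".

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    Each point is (bio_conservation, batch_removal); both scores are
    assumed to be "higher is better" (an assumption for this sketch).
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical scores for five trained models (made-up numbers).
scores = [(0.90, 0.40), (0.80, 0.60), (0.70, 0.55), (0.60, 0.80), (0.85, 0.35)]
front = pareto_front(scores)
print(front)
```

Every model on the front is a defensible trade-off; choosing among them is the prioritization the paper tries to leave to the user rather than bake into the tool.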
65.
颜林林 (2022-07-15 00:05):
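#paper doi:10.3390/ijms23137446 International Journal of Molecular Sciences, 2022, Identification of Spliceogenic Variants beyond Canonical GT-AG Splice Sites in Hereditary Cancer Genes. Point mutations near exon boundaries can alter how a gene is spliced, which is important information in genetic diagnosis and counseling. In most cases, however, such variants can only be assessed from earlier reports and computational predictions, and under the variant classification guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP), computational evidence alone usually leaves them as variants of uncertain significance (VUS). The study evaluated the novel VUS found in 732 consecutive patients and selected 12 that were predicted to affect RNA splicing, in the APC, ATM, FH, LZTR1, MSH6, PALB2, RAD51C and TP53 genes. These were tested at the RNA level with a multiplex PCR approach to verify their effect. The authors then analyzed and interpreted the functional consequence of each result to decide whether the variant is pathogenic. In the end, 50% of the investigated VUS were reclassified: 25% were downgraded to likely benign and 25% upgraded to likely pathogenic.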
Abstract:
Pathogenic/likely pathogenic variants in susceptibility genes that interrupt RNA splicing are a well-documented mechanism of hereditary cancer syndromes development. However, if RNA studies are not performed, most of the variants beyond the canonical GT-AG splice site are characterized as variants of uncertain significance (VUS). To decrease the VUS burden, we have bioinformatically evaluated all novel VUS detected in 732 consecutive patients tested in the routine genetic counseling process. Twelve VUS that were predicted to cause splicing defects were selected for mRNA analysis. Here, we report a functional characterization of 12 variants located beyond the first two intronic nucleotides using RNAseq in APC, ATM, FH, LZTR1, MSH6, PALB2, RAD51C, and TP53 genes. Based on the analysis of mRNA, we have successfully reclassified 50% of investigated variants. 25% of variants were downgraded to likely benign, whereas 25% were upgraded to likely pathogenic leading to improved clinical management of the patient and the family members.
66.
颜林林 (2022-07-14 21:57):
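#paper doi:10.1126/science.abl9283 Science, 2022, Substitution mutational signatures in whole-genome–sequenced cancers in the UK population. This paper, published in Science in April this year, was highlighted in the latest issue of Cancer Cell (doi:10.1016/j.ccell.2022.05.011). Large-cohort whole-genome sequencing (WGS) papers have not been rare in recent years, so when one still makes it into a top journal today, its novelty and significance are probably worth a look. The samples come from the Genomics England (GEL) 100,000 Genomes Project (100kGP): WGS of 12,222 tumor samples (from 11,585 individuals). After deriving the mutational signatures associated with tumorigenesis, the authors validated them in two other large independent cohorts (3001 primary cancers from the International Cancer Genome Consortium (ICGC) and 3417 metastatic cancers from the Hartwig Medical Foundation). The paper focuses on the single-base substitution (SBS) and double-base substitution (DBS) signatures obtained from WGS and builds a computational framework called Signature Fit Multi-Step (FitMS). The method distinguishes signatures that are common across the various cancer types from those that are rare and appear only in particular cancer types or organs. Tissue-specific signatures are clustered and combined into a set of reference signatures to help interpret mechanism and etiology. Judging by the problem and the methods, there seems to be no especially major innovation, so my tentative guess is that its place in a top journal owes much to the very large cohort and data volume and the corresponding workload (see the 94-page supplementary material).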
Abstract:
Whole-genome sequencing (WGS) permits comprehensive cancer genome analyses, revealing mutational signatures, imprints of DNA damage and repair processes that have arisen in each patient's cancer. We performed mutational signature analyses on 12,222 WGS tumor-normal matched pairs, from patients recruited via the UK National Health Service. We contrasted our results to two independent cancer WGS datasets, the International Cancer Genome Consortium (ICGC) and Hartwig Foundation, involving 18,640 WGS cancers in total. Our analyses add 40 single and 18 double substitution signatures to the current mutational signature tally. Critically, we show for each organ, that cancers have a limited number of 'common' signatures and a long tail of 'rare' signatures. We provide a practical solution for utilizing this concept of common versus rare signatures in future analyses.
67.
颜林林 (2022-07-13 00:46):
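#paper doi:10.1093/bib/bbac221 Briefings in Bioinformatics, 2022, A comprehensive benchmarking of WGS-based deletion structural variant callers. This is a methods paper comparing tools that call structural variants (SVs) from whole-genome sequencing data, restricted to deletion-type SVs. It uses the genome-in-a-bottle SV set and a PCR-validated SV set from mouse models as gold standards, so that performance measures such as sensitivity and precision can be computed accurately for each tool. The results echo earlier work of this kind: performance does vary widely between tools, and some tools do strike a good balance between sensitivity and precision. The paper ends with recommendations on which tools to use for deletions of different lengths. It is a solid, conventional paper, done with reasonable care. The more entertaining part is the complaint buried in the tool selection: after excluding tools that require matched samples, tools that only detect very small variants, and tools that only support long reads, 61 suitable tools remained, yet only 15 and 14 were actually tested (on the mouse and human data respectively), simply because the others could not be installed! I sympathize. Leaving aside the tools whose authors will not release source code for others to use, even many open-source tools are hard to get running, and tools that only work after reading the source and debugging by hand are not rare.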
IF: 6.800, Q1. Briefings in bioinformatics, 2022-07-18. DOI: 10.1093/bib/bbac221 PMID: 35753701 PMCID: PMC9294411
Abstract:
Advances in whole-genome sequencing (WGS) promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance as the SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
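As a rough illustration of how precision and sensitivity can be computed for deletion calls against a gold standard, here is a small sketch. The matching rule (50% reciprocal overlap) and the coordinates are my own assumptions, not the criterion stated in the paper, and all intervals are assumed to lie on one chromosome.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def benchmark(calls, truth, min_ro=0.5):
    """Return (precision, sensitivity) for deletion calls against a truth set.

    min_ro=0.5 is an assumed threshold; each truth deletion can be
    matched by at most one call.
    """
    used = set()
    true_positives = 0
    for call in calls:
        for i, gold in enumerate(truth):
            if i not in used and reciprocal_overlap(call, gold) >= min_ro:
                used.add(i)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


# Made-up (start, end) coordinates for illustration only.
truth = [(100, 200), (500, 900), (2000, 2100)]
calls = [(110, 205), (480, 700), (3000, 3100), (2000, 2100)]
precision, sensitivity = benchmark(calls, truth)
print(precision, sensitivity)
```

Precision is only meaningful when the gold standard is complete, which is the point the abstract makes about its PCR-confirmed set.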
68.
颜林林 (2022-07-12 00:03):
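#paper doi:10.1016/j.gpb.2022.04.009 Genomics, Proteomics & Bioinformatics, 2022, N6-methyladenosine and Its Implications in Viruses. This is a review of m6A. m6A is the most common base modification on mammalian mRNA, and the review concentrates on m6A research related to viruses. It first outlines the basics: the proportion and distribution of m6A-modified bases, the regulatory proteins that add or remove m6A, and the functions of m6A in the organism (such as effects on mRNA splicing, nuclear export, translation and degradation). It then covers, from a technical angle, the experimental methods for detecting m6A. After that it gets to the main topic, the m6A studies carried out on various viruses in recent years, including SV40, hepatitis B virus, herpesviruses, HIV, hepatitis C virus, Zika virus, dengue virus and SARS-CoV-2. The reviewed results show m6A taking part in all kinds of biological activity, and in different viruses it sometimes even has opposite functions. m6A thus looks more like part of a low-level mechanism that shows different functions depending on where and when it sits in the gene regulatory network, and almost everything seems to be connected to it. m6A has been a research hotspot in recent years, with a steady stream of related data-mining studies, probably because of this "low-level" and "ubiquitous" character. Studying m6A in depth helps us understand its effect on viral replication and other life-cycle processes and provides basic research support for developing drugs against viral diseases, which suits the needs of the current pandemic era.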
Abstract:
N6-methyladenine (m6A) is the most abundant RNA modification in mammalian messenger RNAs (mRNAs), which participates in and regulates many important biological activities, such as tissue development and stem cell differentiation. Due to an improved understanding of m6A, researchers have discovered that the biological function of m6A can be linked to many stages of mRNA metabolism and that m6A can regulate a variety of complex biological processes. In addition to its location on mammalian mRNAs, m6A has been identified on viral transcripts. m6A also plays important roles in the life cycle of many viruses and in viral replication in host cells. In this review, we briefly introduce the detection methods of m6A, the m6A-related proteins, and the functions of m6A. We also summarize the effects of m6A-related proteins on viral replication and infection. We hope that this review provides researchers with some insights for elucidating the complex mechanisms of the epitranscriptome related to viruses, and provides information for further study of the mechanisms of other modified nucleobases acting on processes such as viral replication. We also anticipate that this review can stimulate collaborative research from different fields, such as chemistry, biology, and medicine, and promote the development of antiviral drugs and vaccines.
69.
颜林林 (2022-07-11 00:41):
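#paper doi:10.1101/2022.07.09.499321 bioRxiv, 2022, A Draft Human Pangenome Reference. This looks like another landmark paper, released ahead of publication on bioRxiv. More than thirty leading institutions collaborated, and the author list is still long even after being condensed under "Human Pangenome Reference Consortium". It contains many familiar names that have appeared again and again in major genomics papers over the years, including Heng Li (李恒), who is listed as one of the corresponding authors. The main text runs to 97 pages (not counting another 39 pages of supplementary material), which also reflects the scale of the work. As is well known, the human reference genome we have been using actually derives from the first seven or eight individuals, whose genomes can hardly be considered representative of the whole human gene pool. So as genome data accumulated over the years, the reference genome has been updated and patched again and again. The "pangenome reference" proposed here can be seen as another major improvement and new release, and possibly a key step toward a near once-and-for-all solution. It integrates 47 individual genomes as phased, diploid assemblies. Drawing on earlier human population genomics projects such as HapMap and the 1000 Genomes Project, the cohort was chosen to be genetically diverse; the assemblies cover more than 99% of the expected sequence and are more than 99% accurate at the structural and base-pair levels. The lengthy text details the complete construction of the new reference, down to the exact command lines and parameters, and is well worth careful study.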
Abstract:
The Human Pangenome Reference Consortium (HPRC) presents a first draft human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence and are more than 99% accurate at the structural and base-pair levels. Based on alignments of the assemblies, we generated a draft pangenome that captures known variants and haplotypes, reveals novel alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,529 gene duplications relative to the existing reference, GRCh38. Roughly 90 million of the additional base pairs derive from structural variation. Using our draft pangenome to analyze short-read data reduces errors when discovering small variants by 34% and boosts the detected structural variants per haplotype by 104% compared to GRCh38-based workflows, and by 34% compared to using previous diversity sets of genome assemblies.
70.
颜林林 (2022-07-10 09:00):
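#paper doi:10.1109/TR.2022.3171220 IEEE Transactions on Reliability, 2022, Detecting C++ Compiler Front-End Bugs via Grammar Mutation and Differential Testing. This paper from Dalian University of Technology designs a framework called Ccoft that automatically finds bugs in the front-end of C++ compilers. A compiler is usually divided into two stages, front-end and back-end: the front-end parses the C++ source, analyzes its semantics and translates it into an intermediate representation, and the back-end generates machine code from that intermediate representation. The paper targets only the front-end. The framework first converts the C++ grammar into a structured format and then uses "mutation" to generate large numbers of C++ programs, both grammatical and ungrammatical, aiming to cover as many code scenarios as possible and check whether the compiler handles them as expected. The programs are then fed to the compilers, and the compilers' outputs are compared to judge whether each program was handled correctly. This exposes a range of bugs: wrongly rejecting valid code, wrongly accepting invalid code, mishandling code semantics, crashing during compilation, and timing out because compilation takes too long. Tested on the mainstream compilers GCC and Clang, the framework found 136 compiler bugs in three months, and it detected 135% and 111% more bugs than the existing tools Dharma and Grammarinator, respectively.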
Abstract:
C++ is a widely used programming language and the C++ front-end is a critical part of a C++ compiler. Although many techniques have been proposed to test compilers, few studies are devoted to detecting bugs in C++ compiler. In this study, we take the first step to detect bugs in C++ compiler front-ends. To do so, two main challenges need to be addressed, namely, the acquisition of test programs that are more likely to trigger bugs in compiler front-ends and the bug identification from complicated compiler outputs. In this article, we propose a novel framework named Ccoft to detect bugs in C++ compiler front-ends. To address the first challenge, Ccoft implements a practical program generator. The generator first transforms C++ grammars into a flexible structured format and then utilizes an equal-chance selection (ECS) strategy to conduct structure-aware grammar mutation to generate diverse C++ programs. Next, Ccoft employs a set of differential testing strategies to identify various kinds of bugs in C++ compiler front-ends by comparing complex outputs emitted by C++ compilers, thus tackling the second challenge. Empirical evaluation results over two mainstream compilers (i.e., GCC and Clang) show that Ccoft greatly improves two state-of-the-art approaches (i.e., Dharma and Grammarinator) by 135% and 111% in terms of the numbers of detected bugs, respectively. By running Ccoft for three months, we have successfully reported 136 bugs for two C++ compilers, of which 78 (57 confirmed, assigned, or fixed) for GCC and 58 (10 confirmed or fixed) for Clang.
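The differential-testing step can be pictured with a toy triage function. This is only a sketch of the general idea, not Ccoft's actual strategy set: the outcome labels ("accept", "reject", "crash", "timeout") and the function name `triage` are hypothetical, and in a real harness the outcomes would come from actually running each compiler on the generated program.

```python
def triage(outcomes):
    """Flag suspicious results for one generated test program.

    outcomes maps a compiler name to one of the hypothetical labels
    'accept', 'reject', 'crash' or 'timeout'.
    """
    # Crashes and timeouts are suspicious regardless of the other compiler.
    findings = [
        (name, result)
        for name, result in sorted(outcomes.items())
        if result in ("crash", "timeout")
    ]
    # If the compilers disagree on accept/reject, at least one is wrong.
    verdicts = {n: r for n, r in outcomes.items() if r in ("accept", "reject")}
    if len(set(verdicts.values())) > 1:
        accepted = sorted(n for n, r in verdicts.items() if r == "accept")
        rejected = sorted(n for n, r in verdicts.items() if r == "reject")
        findings.append(
            ("disagreement", f"accepted by {accepted}, rejected by {rejected}")
        )
    return findings


print(triage({"gcc": "accept", "clang": "reject"}))
print(triage({"gcc": "crash", "clang": "reject"}))
```

A disagreement alone does not say which compiler is at fault; deciding between "wrongly rejected" and "wrongly accepted" still needs the language standard or a human.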
71.
颜林林 (2022-07-09 07:36):
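#paper doi:10.1186/s13073-022-01079-x Genome Medicine, 2022, Identification of a cytokine-dominated immunosuppressive class in squamous cell lung carcinoma with implications for immunotherapy resistance. This is a pure data-mining paper that tries to answer how resistance to immune checkpoint inhibitors arises in lung squamous cell carcinoma. The authors collected transcriptome data for 624 lung squamous cell carcinoma samples from TCGA and GEO, used unsupervised clustering to identify expression patterns associated with T cell exhaustion signatures, immunosuppressive cells, clinical characteristics and immunotherapy response, and defined an immunosuppressive group of patients, the exhausted immune class (EIC). These patients make up 28% to 36% of cases. Although they show a high density of tumour-infiltrating lymphocytes, they appear resistant to immune checkpoint blockade (ICB) because of significant enrichment of T cell exhaustion signatures, a high fraction of immunosuppressive cells, and simultaneous upregulation of multiple immune checkpoint genes. The corresponding expression signature was also confirmed in melanoma patients resistant to ICB therapy. The paper further examined genomic and epigenomic data and found that these patients have a lower chromosomal alteration burden and a distinctive methylation pattern. The authors also built a web application that brings together the data and analysis methods, so that researchers can use the multi-omics data to investigate potential associations with ICB resistance. In terms of methods, the approach is a fairly common one and not particularly novel, but integrating data for a single disease and offering the analysis as a website, so that it generalizes and others can use it, is, I suppose, a workable kind of "original" bioinformatics work.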
IF: 10.400, Q1. Genome medicine, 2022-07-08. DOI: 10.1186/s13073-022-01079-x PMID: 35799269
Abstract:
BACKGROUND: Immune checkpoint blockade (ICB) therapy has revolutionized the treatment of lung squamous cell carcinoma (LUSC). However, a significant proportion of patients with high tumour PD-L1 expression remain resistant to immune checkpoint inhibitors. To understand the underlying resistance mechanisms, characterization of the immunosuppressive tumour microenvironment and identification of biomarkers to predict resistance in patients are urgently needed. METHODS: Our study retrospectively analysed RNA sequencing data of 624 LUSC samples. We analysed gene expression patterns from tumour microenvironment by unsupervised clustering. We correlated the expression patterns with a set of T cell exhaustion signatures, immunosuppressive cells, clinical characteristics, and immunotherapeutic responses. Internal and external testing datasets were used to validate the presence of exhausted immune status. RESULTS: Approximately 28 to 36% of LUSC patients were found to exhibit significant enrichments of T cell exhaustion signatures, high fraction of immunosuppressive cells (M2 macrophage and CD4 Treg), co-upregulation of 9 inhibitory checkpoints (CTLA4, PDCD1, LAG3, BTLA, TIGIT, HAVCR2, IDO1, SIGLEC7, and VISTA), and enhanced expression of anti-inflammatory cytokines (e.g. TGFβ and CCL18). We defined this immunosuppressive group of patients as exhausted immune class (EIC). Although EIC showed a high density of tumour-infiltrating lymphocytes, these were associated with poor prognosis. EIC had relatively elevated PD-L1 expression, but showed potential resistance to ICB therapy. The signature of 167 genes for EIC prediction was significantly enriched in melanoma patients with ICB therapy resistance. EIC was characterized by a lower chromosomal alteration burden and a unique methylation pattern. We developed a web application ( http://lilab2.sysu.edu.cn/tex & http://liwzlab.cn/tex ) for researchers to further investigate potential association of ICB resistance based on our multi-omics analysis data. CONCLUSIONS: We introduced a novel LUSC immunosuppressive class which expressed high PD-L1 but showed potential resistance to ICB therapy. This comprehensive characterization of immunosuppressive tumour microenvironment in LUSC provided new insights for further exploration of resistance mechanisms and optimization of immunotherapy strategies.
72.
颜林林 (2022-07-08 07:19):
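#paper doi:10.1038/s41540-022-00233-w npj Systems Biology and Applications, Adaptive coding for DNA storage with high storage density and low coverage. Large-scale data storage in biological macromolecules such as DNA is one of the directions I personally find interesting. Many good papers have suddenly appeared in this field over the past few years, probably thanks to advances in high-throughput sequencing and the related progress in synthetic biology. This paper from Dalian University of Technology is one such case. It proposes an adaptive coding DNA storage system that applies different coding schemes to different coding region locations, stores 698 KB of image, video and PDF files in DNA, and then decodes them losslessly back to the original data. Compared with earlier work of this kind, the encoding takes DNA molecular properties and constraints into account in more detail, so that it keeps the bases balanced and avoids non-specific hybridization while tolerating sequencing-error noise at the lowest possible sequencing depth. The original content is split into fragments with index segments attached, so that stored content can be read by random access through specific amplification and sequencing. The pity is that the paper only presents theoretical simulation and discussion, without actual DNA synthesis and sequencing, which greatly weakens its persuasiveness.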
Abstract:
The rapid development of information technology has generated substantial data, which urgently requires new storage media and storage methods. DNA, as a storage medium with high density, high durability, and ultra-long storage time characteristics, is promising as a potential solution. However, DNA storage is still in its infancy and suffers from low space utilization of DNA strands, high read coverage, and poor coding coupling. Therefore, in this work, an adaptive coding DNA storage system is proposed to use different coding schemes for different coding region locations, and the method of adaptively generating coding constraint thresholds is used to optimize at the system level to ensure the efficient operation of each link. Images, videos, and PDF files of size 698 KB were stored in DNA using adaptive coding algorithms. The data were sequenced and losslessly decoded into raw data. Compared with previous work, the DNA storage system implemented by adaptive coding proposed in this paper has high storage density and low read coverage, which promotes the development of carbon-based storage systems.
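For readers new to DNA storage, the sketch below shows the most basic version of the idea: two bits per base, plus the kind of constraint checks (GC balance, homopolymer length) that coding schemes have to satisfy. It is a generic textbook-style mapping, not the adaptive coding scheme of the paper, and the thresholds in `passes_constraints` are assumptions chosen only for illustration.

```python
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T (an arbitrary mapping)


def encode(data: bytes) -> str:
    """Map every byte to four bases, two bits per base."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def decode(seq: str) -> bytes:
    """Inverse of encode: every four bases become one byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    longest = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def passes_constraints(seq, gc_range=(0.4, 0.6), max_run=3):
    """Assumed thresholds: 40-60% GC and no run longer than 3 bases."""
    return gc_range[0] <= gc_content(seq) <= gc_range[1] and max_homopolymer(seq) <= max_run


strand = encode(b"A")
print(strand, passes_constraints(strand), decode(strand))
```

A naive mapping like this often violates the constraints (long runs of zero bytes become poly-A), which is why practical schemes add constraint-aware coding and error correction on top.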
73.
颜林林 (2022-07-07 07:41):
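#paper doi:10.1186/s13059-022-02699-7 Genome Biology, 2022, Storing and analyzing a genome on a blockchain. Years ago I heard many people talk about applying blockchain technology to genomics, but those efforts ended badly and nothing seemed to come of them. With the cryptocurrency market sliding and wailing all around lately, it is surprising and eye-catching that Genome Biology would publish a paper like this. The paper comes from Yale, its full text and source code are openly accessible, and it is worth a careful read for anyone interested in blockchain. It envisions a network of sequencers, data owners, clinicians and researchers, each participating in synchronizing a VCFchain or SAMchain, so that data are shared in a distributed way and analysis steps are interleaved with the extension of the chain. Storing huge genomic data in the limited extra bytes available on a blockchain does take some tricks (such as splitting the data and reassembling it at query time), and the paper does do real work there. Overall, though, it still feels like "blockchain for blockchain's sake". Permission management and tamper resistance may be its distinguishing strengths, but the paper does not really bring them out. That differs from two other papers mentioning blockchain that I shared earlier (DOIs 10.1038/s41591-022-01768-5 and 10.1038/s41586-021-03583-3, published in Nature Medicine and Nature respectively, which are more about AI algorithms and the value of data sharing); the focus here is on the implementation details of the blockchain software. With this paper paving the way, perhaps a follow-up will make a run at NBT (Nature Biotechnology).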
IF: 10.100, Q1. Genome biology, 2022-06-29. DOI: 10.1186/s13059-022-02699-7 PMID: 35765079 PMCID: PMC9241283
Abstract:
There are major efforts underway to make genome sequencing a routine part of clinical practice. A critical barrier to these is achieving practical solutions for data ownership and integrity. Blockchain provides solutions to these challenges in other realms, such as finance. However, its use in genomics is stymied due to the difficulty in storing large-scale data on-chain, slow transaction speeds, and limitations on querying. To overcome these roadblocks, we developed a private blockchain network to store genomic variants and reference-aligned reads on-chain. It uses nested database indexing with an accompanying tool suite to rapidly access and analyze the data.
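The two ideas mentioned above, splitting records across blocks and tamper evidence, can be shown with a toy hash chain. This is not the paper's VCFchain or SAMchain implementation; the block layout, the chunk size and the function names are all hypothetical, and a real private blockchain adds consensus, permissions and indexing on top.

```python
import hashlib
import json


def block_hash(index, prev_hash, payload):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps(
        {"index": index, "prev": prev_hash, "payload": payload}, sort_keys=True
    )
    return hashlib.sha256(body.encode()).hexdigest()


def build_chain(records, chunk_size=2):
    """Split variant records into small chunks and chain them by hash."""
    chain = []
    prev = "0" * 64
    for i in range(0, len(records), chunk_size):
        payload = records[i:i + chunk_size]
        digest = block_hash(len(chain), prev, payload)
        chain.append({"index": len(chain), "prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def verify(chain):
    """Recompute every hash; any edit to an earlier block breaks the chain."""
    prev = "0" * 64
    for block in chain:
        expected = block_hash(block["index"], block["prev"], block["payload"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


def query(chain, chrom):
    """Reassemble all records for one chromosome from the scattered blocks."""
    return [rec for block in chain for rec in block["payload"] if rec[0] == chrom]


# Made-up variant records: [chrom, pos, ref, alt].
variants = [["chr1", 1000, "A", "T"], ["chr1", 2000, "G", "C"], ["chr2", 500, "T", "G"]]
chain = build_chain(variants)
print(verify(chain), query(chain, "chr1"))
```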
74.
颜林林 (2022-07-06 00:02):
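#paper doi:10.1186/s12864-022-08717-z BMC Genomics, 2022, The effects of sequencing depth on the assembly of coding and noncoding transcripts in the human genome. It is well known that sequencing depth affects the results of data analysis. How large the effect is and how it arises, though, depends on the research goal and the object of study; it has to be analyzed case by case and is worth studying. This paper systematically examines how sequencing depth affects transcript assembly from transcriptome data. It includes RNA-seq data from 150 human stem cell samples across different cells and tissues, and in addition to short-read platforms it includes four PacBio long-read data sets. Two of the samples were sequenced to as many as 200M NGS reads, so they could be subsampled at different fractions to simulate different sequencing depths. The analysis shows that coding and noncoding transcripts behave differently: the former saturate quickly as depth increases, while the latter almost never reached saturation in the data analyzed. This may relate to how hard each class is to assemble. In addition, long reads help assemble transcripts containing transposable elements. The more interesting result concerns single-cell RNA-seq (scRNA-seq): noncoding transcripts appear lowly expressed because fewer cells express them, while in the cells that do express them their levels are similar to those of coding transcripts. This finding owes much to long-read sequencing, and the paper concludes that long-read sequencing is better suited to scRNA-seq. Personally I still suspect that these conclusions may depend on the analysis and evaluation methods, so it may be worth reproducing the paper's analysis.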
IF: 3.500, Q2. BMC genomics, 2022-07-04. DOI: 10.1186/s12864-022-08717-z PMID: 35787153
Abstract:
Investigating the functions and activities of genes requires proper annotation of the transcribed units. However, transcript assembly efforts have produced a surprisingly large variation in the number of transcripts, and especially so for noncoding transcripts. This heterogeneity in assembled transcript sets might be partially explained by sequencing depth. Here, we used real and simulated short-read sequencing data as well as long-read data to systematically investigate the impact of sequencing depths on the accuracy of assembled transcripts. We assembled and analyzed transcripts from 671 human short-read data sets and four long-read data sets. At the first level, there is a positive correlation between the number of reads and the number of recovered transcripts. However, the effect of the sequencing depth varied based on cell or tissue type, the type of read and the nature and expression levels of the transcripts. The detection of coding transcripts saturated rapidly with both short and long-reads, however, there was no sign of early saturation for noncoding transcripts at any sequencing depth. Increasing long-read sequencing depth specifically benefited transcripts containing transposable elements. Finally, we show how single-cell RNA-seq can be guided by transcripts assembled from bulk long-read samples, and demonstrate that noncoding transcripts are expressed at similar levels to coding transcripts but are expressed in fewer cells. This study highlights the impact of sequencing depth on transcript assembly.
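The subsampling idea (draw a fraction of the reads, then count what is still recovered) can be sketched as below. This is a deliberate simplification and not the paper's pipeline: it counts a transcript as "detected" when at least `min_reads` sampled reads are assigned to it, instead of running a real assembler, and the read labels and counts are invented.

```python
import random
from collections import Counter


def detected_transcripts(read_assignments, fraction, min_reads=2, seed=0):
    """Subsample reads and count transcripts supported by >= min_reads reads.

    read_assignments holds one transcript label per read; min_reads and
    seed are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    sample_size = int(len(read_assignments) * fraction)
    sample = rng.sample(read_assignments, sample_size)
    counts = Counter(sample)
    return sum(1 for n in counts.values() if n >= min_reads)


# Toy data: two highly expressed coding and two lowly expressed noncoding transcripts.
reads = ["coding_A"] * 50 + ["coding_B"] * 30 + ["noncoding_X"] * 3 + ["noncoding_Y"] * 2
curve = [(f, detected_transcripts(reads, f)) for f in (0.1, 0.25, 0.5, 1.0)]
print(curve)
```

With data shaped like this, the highly expressed transcripts tend to be detected at low fractions while the rare ones need nearly all the reads, which is the saturation pattern described above.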
75.
颜林林 (2022-07-05 00:03):
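#paper doi:10.1093/database/baac049 Database, 2022, dbBIP: a comprehensive bipolar disorder database for genetic research. As the journal name suggests, this paper presents a database. Its subject is bipolar disorder (BIP, also known as manic depression). It integrates earlier large-scale omics data on the disease, including two array-based GWAS cohorts (PGC2 and PGC3, contributing 20352 BIP cases with 31358 controls and 41917 BIP cases with 371549 controls, respectively), WGS/WES data from several later studies, and large-scale brain tissue transcriptome (expression profile) data. Using a range of omics analysis methods, it provides functional annotation, linkage and association, protein-protein interaction, and spatiotemporal expression pattern information for these data. All of this is offered through a website with query and online analysis functions. It is a typical piece of bioinformatics work and an effective way to open up a research direction in depth.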
Abstract:
Bipolar disorder (BIP) is one of the most common hereditary psychiatric disorders worldwide. Elucidating the genetic basis of BIP will play a pivotal role in mechanistic delineation. Genome-wide association studies … >>>
Bipolar disorder (BIP) is one of the most common hereditary psychiatric disorders worldwide. Elucidating the genetic basis of BIP will play a pivotal role in mechanistic delineation. Genome-wide association studies (GWAS) have successfully reported multiple susceptibility loci conferring BIP risk, thus providing insight into the effects of its underlying pathobiology. However, difficulties remain in the extrication of important and biologically relevant data from genetic discoveries related to psychiatric disorders such as BIP. There is an urgent need for an integrated and comprehensive online database with unified access to genetic and multi-omics data for in-depth data mining. Here, we developed the dbBIP, a database for BIP genetic research based on published data. The dbBIP consists of several modules, i.e.: (i) single nucleotide polymorphism (SNP) module, containing large-scale GWAS genetic summary statistics and functional annotation information relevant to risk variants; (ii) gene module, containing BIP-related candidate risk genes from various sources and (iii) analysis module, providing a simple and user-friendly interface to analyze one's own data. We also conducted extensive analyses, including functional SNP annotation, integration (including summary-data-based Mendelian randomization and transcriptome-wide association studies), co-expression, gene expression, tissue expression, protein-protein interaction and brain expression quantitative trait loci analyses, thus shedding light on the genetic causes of BIP. Finally, we developed a graphical browser with powerful search tools to facilitate data navigation and access. The dbBIP provides a comprehensive resource for BIP genetic research as well as an integrated analysis platform for researchers and can be accessed online at http://dbbip.xialab.info. Database URL: http://dbbip.xialab.info. <<<
76.
颜林林 (2022-07-04 20:59):
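#paper doi:10.1038/s41467-022-31236-0, Nature Communications, 2022, A convolutional neural network highlights mutations relevant to antimicrobial resistance in Mycobacterium tuberculosis. This paper builds a CNN (convolutional neural network) model that predicts drug resistance from sequencing data for more than 20,000 Mycobacterium tuberculosis isolates. The input is the full sequence of 18 loci selected from prior knowledge as being associated with resistance. The CNN outperformed existing approaches based on traditional machine learning and on conventional non-convolutional neural networks. Moreover, because the deep learning model extracts latent features from the sequences, it can help predict the effect of previously unknown mutations on resistance.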
IF:14.700Q1 Nature communications, 2022-07-02. DOI: 10.1038/s41467-022-31236-0 PMID: 35780211 PMCID:PMC9250494
Abstract:
Long diagnostic wait times hinder international efforts to address antibiotic resistance in M. tuberculosis. Pathogen whole genome sequencing, coupled with statistical and machine learning models, offers a promising solution. However, generalizability and clinical adoption have been limited by a lack of interpretability, especially in deep learning methods. Here, we present two deep convolutional neural networks that predict antibiotic resistance phenotypes of M. tuberculosis isolates: a multi-drug CNN (MD-CNN), that predicts resistance to 13 antibiotics based on 18 genomic loci, with AUCs 82.6-99.5% and higher sensitivity than state-of-the-art methods; and a set of 13 single-drug CNNs (SD-CNN) with AUCs 80.1-97.1% and higher specificity than the previous state-of-the-art. Using saliency methods to evaluate the contribution of input sequence features to the SD-CNN predictions, we identify 18 sites in the genome not previously associated with resistance. The CNN models permit functional variant discovery, biologically meaningful interpretation, and clinical applicability.
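Feeding whole locus sequences to a CNN requires turning each nucleotide string into a numeric matrix. The sketch below shows one common way to do that, one-hot encoding. It is an illustration only: the channel order, the handling of unknown characters, and the function name `one_hot` are assumptions, not details taken from the paper.

```python
ALPHABET = "ACGT"  # assumed channel order, not taken from the paper


def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element rows.

    Characters outside ALPHABET (for example N or a gap) become all-zero
    rows; this is an assumption made for the sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded
```

Each locus would then be a (length x 4) matrix, and a 1D convolution slides along the length axis.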
77.
颜林林 (2022-07-03 00:04):
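#paper doi:10.1002/ajmg.c.31987 American Journal of Medical Genetics, 2022, Genetic testing and glomerular hematuria - A nephrologist's perspective. This review describes how the diagnosis and early treatment of Alport syndrome (a hereditary nephritis) have evolved. The disease presents with hematuria that is not caused by acute trauma but is associated with chronic inflammation, and it is heritable. It was discovered in 1920, but drug treatment was not reported until 2003 (before that, dialysis and kidney transplantation were the only options). Long-term accumulation of clinical cases and observational studies established that the disease is heritable and linked it to three genes: COL4A3, COL4A4 and COL4A5. Because hematuria has many causes, and Alport syndrome itself spans a spectrum of symptom severity, diagnosis also requires testing these three genes for variants. Genetic testing initially relied on Sanger (first-generation) sequencing and later moved to NGS (next-generation, or second-generation, sequencing); both approaches have drawbacks such as high cost. From the clinical nephrologist's perspective, microscopic examination of features such as the morphology of blood cells in urine helps determine whether the hematuria is glomerular in origin. This is weighed together with patient-specific factors to decide between genetic testing and kidney biopsy. No test is perfect, and the methods need to complement one another to confirm the diagnosis. For example, in the NGS era the three genes can be tested by whole-exome sequencing, which may uncover not only previously unreported variants of uncertain pathogenicity in these genes but also variants in other genes related to the disease; interpreting such variants depends on support from genetic counselors. The clinical diagnostic and treatment pathways (and their evolution) presented in this review reflect the integrated use of this information, and show in what order the different tests should be combined to benefit the patient.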
Abstract:
Alport syndrome is an inherited disorder of the kidneys that results from variants in three collagen IV genes-COL4A3, COL4A4, and COL4A5. Early diagnosis and pharmacologic intervention can delay the progression of chronic kidney disease and the onset of kidney failure in patients with Alport syndrome. This article describes the evolution of approaches to the diagnosis and early treatment of Alport syndrome.
78.
颜林林 (2022-07-02 00:24):
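#paper doi:10.1186/s12859-022-04798-5 BMC Bioinformatics, 2022, DeepPN: a deep parallel neural network based on convolutional neural network and graph convolutional network for predicting RNA-protein binding sites. Identifying the binding sites of RNA-binding proteins (RBPs) is an important part of studying gene regulation. Traditionally, high-throughput screening and measurement have relied on methods such as immunoprecipitation, but experimental approaches have many limitations, so many computational tools have been developed to predict RBP binding sites, most of them based on mathematical calculations over sequence and structure information. Deep learning, which can automatically learn important and complex hidden features from data, has gradually been applied to this problem as well. For its deep learning component, this study adopts ChebNet, a type of graph convolutional network (GCN). ChebNet is a spectral graph convolution method, and recent studies have also reported good results with it in areas such as fMRI and image semantic segmentation. The authors therefore built a parallel deep neural network named DeepPN from a CNN and ChebNet and tested it on 24 real datasets, where it performed better than other comparable methods. The improved performance is presumably because the method uses statistical frequencies to supplement the features.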
IF:2.900Q1 BMC bioinformatics, 2022-06-29. DOI: 10.1186/s12859-022-04798-5 PMID: 35768792
Abstract:
BACKGROUND: Addressing the laborious nature of traditional biological experiments by using an efficient computational approach to analyze RNA-binding proteins (RBPs) binding sites has always been a challenging task. RBPs play a vital role in post-transcriptional control. Identification of RBPs binding sites is a key step for the anatomy of the essential mechanism of gene regulation by controlling splicing, stability, localization and translation. Traditional methods for detecting RBPs binding sites are time-consuming and computationally-intensive. Recently, the computational method has been incorporated in researches of RBPs. Nevertheless, lots of them not only rely on the sequence data of RNA but also need additional data, for example the secondary structural data of RNA, to improve the performance of prediction, which needs the pre-work to prepare the learnable representation of structural data. RESULTS: To reduce the dependency of those pre-work, in this paper, we introduce DeepPN, a deep parallel neural network that is constructed with a convolutional neural network (CNN) and graph convolutional network (GCN) for detecting RBPs binding sites. It includes a two-layer CNN and GCN in parallel to extract the hidden features, followed by a fully connected layer to make the prediction. DeepPN discriminates the RBP binding sites on learnable representation of RNA sequences, which only uses the sequence data without using other data, for example the secondary or tertiary structure data of RNA. DeepPN is evaluated on 24 datasets of RBPs binding sites with other state-of-the-art methods. The results show that the performance of DeepPN is comparable to the published methods. CONCLUSION: The experimental results show that DeepPN can effectively capture potential hidden features in RBPs and use these features for effective prediction of binding sites.
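For readers unfamiliar with ChebNet-style graph convolution, the following is a minimal sketch of the underlying filter: the output is a weighted sum of Chebyshev polynomials of a rescaled graph Laplacian applied to a node signal, computed with the recurrence T_k = 2 L T_(k-1) - T_(k-2). This is not DeepPN's code. The function names, the plain-list matrix representation, and the assumption that the Laplacian has already been rescaled so its eigenvalues lie in [-1, 1] are all illustrative.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cheb_filter(l_scaled, x, theta):
    """Return sum_k theta[k] * T_k(l_scaled) applied to x.

    l_scaled: rescaled graph Laplacian (assumed eigenvalues in [-1, 1]).
    x:        one scalar signal value per graph node.
    theta:    filter coefficients, one per polynomial order.
    """
    t_prev = list(x)                      # T_0(L) x = x
    y = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return y
    t_curr = matvec(l_scaled, x)          # T_1(L) x = L x
    y = [a + theta[1] * b for a, b in zip(y, t_curr)]
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k = 2 L T_(k-1) - T_(k-2)
        lt = matvec(l_scaled, t_curr)
        t_next = [2 * a - b for a, b in zip(lt, t_prev)]
        y = [a + theta[k] * b for a, b in zip(y, t_next)]
        t_prev, t_curr = t_curr, t_next
    return y
```

In a trained network the coefficients in `theta` are the learned parameters, and a filter of order K mixes information from nodes up to K hops apart.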
79.
颜林林 (2022-07-01 07:57):
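#paper doi:10.1101/2022.06.27.497710 bioRxiv, 2022, PaliDIS: A tool for fast discovery of novel insertion sequences. This is a paper about a bioinformatics tool, with the corresponding author from the Wellcome Sanger Institute. The tool searches metagenomic data for sequences that share the same repeat fragments, aligns them to assembled microbial genomes, selects repeat fragments that are linked on the same assembled sequence and are reverse complements of each other, and applies a series of quality-control filters. In this way it identifies mobile elements bounded by inverted repeats in microbial genomes, which supports research on resistance genes and their spread between bacterial species. Similar pipelines are not uncommon in human genome analysis and are generally implemented directly from the genomic event and its sequence signature, so the method itself is not particularly innovative. Still, applying it to a specific dataset in a specific setting (here, data from the HMP, the Human Microbiome Project) and then describing and analyzing the results statistically (with respect to these mobile elements) is a feasible and common research pattern.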
Abstract:
The diversity of microbial insertion sequences, crucial mobile genetic elements in generating diversity in microbial genomes, needs to be better represented in current microbial databases. Identification of these sequences in microbiome communities presents some significant problems that have led to their underrepresentation. Here, we present a software tool called PaliDIS that recognises insertion sequences in metagenomic sequence data rapidly by identifying inverted terminal repeat regions from mixed microbial community genomes. Applying this software to 266 human metagenomes identifies 11,681 unique insertion sequences. Querying this catalogue against a large database of isolate genomes reveals evidence of horizontal gene transfer events of clinically relevant antimicrobial resistance genes between classes of bacteria. We will continue to apply this tool more widely, building the Insertion Sequence Catalogue, a valuable resource for researchers wishing to query their microbial genomes for insertion sequences.
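The sequence signature the abstract relies on, inverted terminal repeats, can be illustrated with a small check: the first k bases of a candidate element should equal the reverse complement of its last k bases. The sketch below is not PaliDIS code. It uses exact matching on a single sequence, and the function names and the parameter `k` are hypothetical; a real pipeline works from read alignments and tolerates mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_inverted_terminal_repeat(seq, k):
    """True if the first k bases equal the reverse complement of the last k.

    Exact matching only; the two repeat arms must not overlap.
    """
    if k <= 0 or len(seq) < 2 * k:
        return False
    seq = seq.upper()
    return seq[:k] == reverse_complement(seq[-k:])
```

For example, a sequence that starts with GGCAT and ends with ATGCC passes the check with k = 5, because the reverse complement of ATGCC is GGCAT.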
80.
颜林林 (2022-06-30 00:17):
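#paper doi:10.1038/s41597-022-01450-y Scientific Data, 2022, HunCRC: annotated pathological slides to enhance deep learning applications in colorectal cancer screening. Scientific Data, a Nature-family journal, really is a treasure trove. This paper from Hungary shares a very useful dataset. The authors took 200 H&E-stained colorectal cancer tumor tissue slides, scanned each whole slide at 40x high resolution, and had pathologists annotate them; image patches of multiple classes were then cut from the slides for use in downstream colorectal cancer pathology image analysis. Commendably, the whole process from sample collection to data processing is described in detail, and the data processing code, the annotated original images, and the processed, class-labeled patches are all openly available for direct download. Code: https://github.com/qbeer/qupath-binarymask-extension https://github.com/patbaa/crc_data_paper Original image data: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=91357370 Processed data: https://figshare.com/articles/dataset/patches_and_local_annotations_slide_200_zoom_124x124_um2/19500266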
IF:5.800Q1 Scientific data, 2022-06-28. DOI: 10.1038/s41597-022-01450-y PMID: 35764660
Abstract:
Histopathology is the gold standard method for staging and grading human tumors and provides critical information for the oncoteam's decision making. Highly-trained pathologists are needed for careful microscopic analysis of the slides produced from tissue taken from biopsy. This is a time-consuming process. A reliable decision support system would assist healthcare systems that often suffer from a shortage of pathologists. Recent advances in digital pathology allow for high-resolution digitalization of pathological slides. Digital slide scanners combined with modern computer vision models, such as convolutional neural networks, can help pathologists in their everyday work, resulting in shortened diagnosis times. In this study, 200 digital whole-slide images are published which were collected via hematoxylin-eosin stained colorectal biopsy. Alongside the whole-slide images, detailed region level annotations are also provided for ten relevant pathological classes. The 200 digital slides, after pre-processing, resulted in 101,389 patches. A single patch is a 512 × 512 pixel image, covering 248 × 248 μm tissue area. Versions at higher resolution are available as well. Hopefully, HunCRC, this widely accessible dataset will aid future colorectal cancer computer-aided diagnosis and research.
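The patch geometry stated in the abstract (512 × 512 pixels covering 248 × 248 μm) implies a resolution of 248 / 512 ≈ 0.484 μm per pixel. The sketch below works through that arithmetic and shows how a region could be cut into patches of that size. The non-overlapping tiling, the dropping of partial edge patches, and the function names are assumptions for illustration, not the authors' pre-processing pipeline.

```python
PATCH_PX = 512      # patch side length in pixels (from the abstract)
PATCH_UM = 248.0    # tissue covered by one patch side, in micrometres


def microns_per_pixel(patch_px=PATCH_PX, patch_um=PATCH_UM):
    """Physical size of one pixel implied by the patch geometry."""
    return patch_um / patch_px


def patch_grid(width_px, height_px, patch_px=PATCH_PX):
    """Top-left (x, y) corners of non-overlapping patches that fit fully.

    Partial patches at the right and bottom edges are dropped
    (an assumption made for this sketch).
    """
    return [
        (x, y)
        for y in range(0, height_px - patch_px + 1, patch_px)
        for x in range(0, width_px - patch_px + 1, patch_px)
    ]
```

A 1024 × 1536 pixel region, for instance, yields a 2 × 3 grid of six patches under these assumptions.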