Papers shared by user 颜林林.
116 shared papers found in total; this page shows items 41 - 60.
41.
颜林林 (2022-08-26 23:18):
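#paper doi:10.1101/2022.08.24.505159 bioRxiv, 2022, A genome-wide atlas of recurrent repeat expansions in human cancer. This preprint comes from Michael Snyder's group at Stanford University. By reanalyzing whole-genome sequencing data from 2,622 cancer genomes in ICGC and TCGA, covering 29 cancer types, the authors identified 160 recurrent repeat expansions (rREs), the vast majority of which (155/160 according to the abstract) are specific to particular cancer subtypes. The genomic regions harboring these repeats are enriched near candidate cis-regulatory elements, suggesting a possible role in gene regulation. One of them, a GAAA repeat in an intron of UGT2B7, was observed in 34% of renal cell carcinoma samples; the authors then enrolled 12 kidney cancer cases through the Stanford Cancer Center and sequenced the samples with both short-read (Illumina NovaSeq) and long-read (PacBio) platforms, validating this rRE.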
Abstract:
Expansion of a single repetitive DNA sequence, termed a tandem repeat (TR), is known to cause more than 50 diseases. However, repeat expansions are often not explored beyond neurological and neurodegenerative disorders. In some cancers, mutations accumulate in short tracts of TRs (STRs), a phenomenon termed microsatellite instability (MSI); however larger repeat expansions have not been systematically analyzed in cancer. Here, we identified TR expansions in 2,622 cancer genomes, spanning 29 cancer types. In 7 cancer types, we found 160 recurrent repeat expansions (rREs); most of these (155/160) were subtype specific. We found that rREs were non-uniformly distributed in the genome with an enrichment near candidate cis-regulatory elements, suggesting a role in gene regulation. One rRE located near a regulatory element in the first intron of UGT2B7 was detected in 34% of renal cell carcinoma samples and was validated by long-read DNA sequencing. Moreover, targeting cells harboring this rRE with a rationally designed, sequence-specific DNA binder led to a dose-dependent decrease in cell proliferation. Overall, our results demonstrate that rREs are an important but unexplored source of genetic variation in human cancers, and we provide a comprehensive catalog for further study.
42.
颜林林 (2022-08-18 00:34):
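#paper doi:10.1186/s12859-022-04876-8 BMC Bioinformatics, 2022, IMSE: interaction information attention and molecular structure based drug drug interaction extraction. Having machines automatically read large numbers of papers and distill useful information from them is a dream for many, and models such as BERT are gradually making it a reality. This paper builds on BioBERT, which is pretrained on PubMed abstracts and PMC full texts, to improve performance on the DDIExtraction 2013 task, which aims to extract drug-drug interactions (DDI) from free text in the biomedical domain. Much work already exists on this task; this paper introduces an interaction attention vector and adds drug molecular structure (to exploit its feature-space information), among other changes, to improve model performance and interpretability, with good results.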
IF: 2.900 (Q1). BMC bioinformatics, 2022-Aug-14. DOI: 10.1186/s12859-022-04876-8 PMID: 35965308
Abstract:
BACKGROUND: Extraction of drug drug interactions from biomedical literature and other textual data is an important component to monitor drug-safety and this has attracted attention of many researchers in healthcare. Existing works are more pivoted around relation extraction using bidirectional long short-term memory networks (BiLSTM) and BERT model which does not attain the best feature representations. RESULTS: Our proposed DDI (drug drug interaction) prediction model provides multiple advantages: (1) The newly proposed attention vector is added to better deal with the problem of overlapping relations, (2) The molecular structure information of drugs is integrated into the model to better express the functional group structure of drugs, (3) We also added text features that combined the T-distribution and chi-square distribution to make the model more focused on drug entities and (4) it achieves similar or better prediction performance (F-scores up to 85.16%) compared to state-of-the-art DDI models when tested on benchmark datasets. CONCLUSIONS: Our model that leverages state of the art transformer architecture in conjunction with multiple features can bolster the performances of drug drug interaction tasks in the biomedical domain. In particular, we believe our research would be helpful in identification of potential adverse drug reactions.
43.
颜林林 (2022-08-17 23:55):
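#paper doi:10.1016/j.xgen.2022.100168 Cell Genomics, 2022, Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. The development, falling cost, and spread of high-throughput sequencing have driven a large wave of human population genomics studies. This paper is another release of whole-exome data and analysis results from a large population, here the UK Biobank, with more than 390,000 participants. The authors developed and applied a mixed-model analysis framework, SAIGE-GENE, which considers single variants, gene-level mutation burden, and the combination of the two, to analyze rare variants associated with 4,529 diseases or phenotypes (including type 2 diabetes, cardiometabolic traits, and others). On top of this, the paper provides an online browser, Genebass, for displaying these phenotype-associated rare variants. As an example, the results section highlights one gene, SCRIB, and its relationship to MRI brain-imaging features. Large-scale population genomics papers like this keep appearing, each with its own analytical emphasis; if possible, it would be worth examining how their methods differ and whether those differences could affect the reported results.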
IF: 11.100 (Q1). Cell genomics, 2022-Sep-14. DOI: 10.1016/j.xgen.2022.100168 PMID: 36778668
Abstract:
Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variations in human disease has not been explored at scale. Exome-sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variations across a wide range of phenotypes to discover genes and allelic series relevant to human health and disease. Here, we present results from systematic association analyses of 4,529 phenotypes using single-variant and gene tests of 394,841 individuals in the UK Biobank with exome-sequence data. We find that the discovery of genetic associations is tightly linked to frequency and is correlated with metrics of deleteriousness and natural selection. We highlight biological findings elucidated by these data and release the dataset as a public resource alongside the Genebass browser for rapidly exploring rare-variant association results.
44.
颜林林 (2022-08-13 23:36):
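#paper doi:10.1038/s41586-022-04774-2 Nature, 2022, Stromal changes in the aged lung induce an emergence from melanoma dormancy. Age is well known to be one of the most important risk factors for cancer. In this study, cultured melanoma cells (some cell lines overexpressing WNT-pathway genes via a plasmid system) were injected into young and aged mice to observe tumor formation and phenotypic changes, interleaved with interventions such as intraperitoneal injections; lung tissue was then sampled for immunohistochemistry, proteomics (mass spectrometry), and other assays, to reveal the relationship between aging and tumorigenesis. The study found that the aged lung microenvironment provides a permissive niche for efficient outgrowth of dormant disseminated melanoma cells, in contrast to aged skin, where age-related changes suppress melanoma growth but drive dissemination. The paper examines the role of the WNT pathway in detail: WNT5A acts as an activator of dormancy in disseminated cells in the lung, which initially lets melanoma cells disseminate and seed efficiently in metastatic niches, while aged lung fibroblasts secrete more of the WNT antagonist sFRP1, which inhibits WNT5A and enables metastatic outgrowth. It also identifies the tyrosine kinase receptors AXL and MER as promoting a dormancy-to-reactivation axis. These results provide important information for follow-up work on tumor dormancy and the lung microenvironment, and suggest that age deserves attention during cancer treatment.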
IF: 50.500 (Q1). Nature, 2022-06. DOI: 10.1038/s41586-022-04774-2 PMID: 35650435
Abstract:
Disseminated cancer cells from primary tumours can seed in distal tissues, but may take several years to form overt metastases, a phenomenon that is termed tumour dormancy. Despite its importance in metastasis and residual disease, few studies have been able to successfully characterize dormancy within melanoma. Here we show that the aged lung microenvironment facilitates a permissive niche for efficient outgrowth of dormant disseminated cancer cells, in contrast to the aged skin, in which age-related changes suppress melanoma growth but drive dissemination. These microenvironmental complexities can be explained by the phenotype switching model, which argues that melanoma cells switch between a proliferative cell state and a slower-cycling, invasive state. It was previously shown that dermal fibroblasts promote phenotype switching in melanoma during ageing. We now identify WNT5A as an activator of dormancy in melanoma disseminated cancer cells within the lung, which initially enables the efficient dissemination and seeding of melanoma cells in metastatic niches. Age-induced reprogramming of lung fibroblasts increases their secretion of the soluble WNT antagonist sFRP1, which inhibits WNT5A in melanoma cells and thereby enables efficient metastatic outgrowth. We also identify the tyrosine kinase receptors AXL and MER as promoting a dormancy-to-reactivation axis within melanoma cells. Overall, we find that age-induced changes in distal metastatic microenvironments promote the efficient reactivation of dormant melanoma cells in the lung.
45.
颜林林 (2022-08-12 07:42):
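#paper doi:10.1016/j.ccell.2022.07.002 Cancer Cell, 2022, Integrative analysis of drug response and clinical outcome in acute myeloid leukemia. This is a real-world clinical study of AML (acute myeloid leukemia) spanning 10 years, collecting 805 patients (942 specimens) from multiple centers. Samples underwent genomic and transcriptomic sequencing, ex vivo cell culture was used for drug-response experiments, and NLP was used to organize and analyze patients' medical records. On the analysis side, deconvolution was used to infer the cell-type composition of each sample from transcriptomic data, and this was combined with clinical information and omics results to identify factors affecting drug response (such as age, gene expression, and cell differentiation state). The resulting model reveals a single gene, PEAR1, as one of the strongest predictors of patient survival. The dataset also comes with an online interactive website for analysis and display. The analyses largely follow the routine of many bioinformatics data-mining papers and are not particularly novel, but thanks to the long-accumulated cohort and its complete clinical information, the work is still valuable as an important dataset resource and as an example of a single-disease real-world study. In addition, the cell-based drug-response experiments are relatively independent, and linking them to patient prognosis is not easy; they were probably included partly to add weight to the paper.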
IF: 48.800 (Q1). Cancer cell, 2022-08-08. DOI: 10.1016/j.ccell.2022.07.002 PMID: 35868306
Abstract:
Acute myeloid leukemia (AML) is a cancer of myeloid-lineage cells with limited therapeutic options. We previously combined ex vivo drug sensitivity with genomic, transcriptomic, and clinical annotations for a large cohort of AML patients, which facilitated discovery of functional genomic correlates. Here, we present a dataset that has been harmonized with our initial report to yield a cumulative cohort of 805 patients (942 specimens). We show strong cross-cohort concordance and identify features of drug response. Further, deconvoluting transcriptomic data shows that drug sensitivity is governed broadly by AML cell differentiation state, sometimes conditionally affecting other correlates of response. Finally, modeling of clinical outcome reveals a single gene, PEAR1, to be among the strongest predictors of patient survival, especially for young patients. Collectively, this report expands a large functional genomic resource, offers avenues for mechanistic exploration and drug development, and reveals tools for predicting outcome in AML.
46.
颜林林 (2022-08-08 07:54):
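#paper doi:10.1038/s41596-022-00728-0 Nature Protocols, 2022, I-TASSER-MTD: a deep-learning-based platform for multi-domain protein structure and function prediction. Most current protein structure prediction tools handle only single-domain proteins. Yet most proteins in nature have multiple domains that work cooperatively, so algorithms for predicting the structure and function of such proteins are badly needed. This paper provides a pipeline, I-TASSER-MTD, for structure and function prediction of multi-domain proteins. It integrates the following steps: sequence-based domain parsing, single-domain structure folding, inter-domain structure assembly, and structure-based function annotation. Deep learning is incorporated into every step, and experimental data such as protein cross-linking and cryo-EM can be integrated, to raise the accuracy of each step and thereby the overall prediction quality; the whole workflow is packaged as a fully automated pipeline.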
IF: 13.100 (Q1). Nature protocols, 2022-10. DOI: 10.1038/s41596-022-00728-0 PMID: 35931779
Abstract:
Most proteins in cells are composed of multiple folding units (or domains) to perform complex functions in a cooperative manner. Relative to the rapid progress in single-domain structure prediction, there are few effective tools available for multi-domain protein structure assembly, mainly due to the complexity of modeling multi-domain proteins, which involves higher degrees of freedom in domain-orientation space and various levels of continuous and discontinuous domain assembly and linker refinement. To meet the challenge and the high demand of the community, we developed I-TASSER-MTD to model the structures and functions of multi-domain proteins through a progressive protocol that combines sequence-based domain parsing, single-domain structure folding, inter-domain structure assembly and structure-based function annotation in a fully automated pipeline. Advanced deep-learning models have been incorporated into each of the steps to enhance both the domain modeling and inter-domain assembly accuracy. The protocol allows for the incorporation of experimental cross-linking data and cryo-electron microscopy density maps to guide the multi-domain structure assembly simulations. I-TASSER-MTD is built on I-TASSER but substantially extends its ability and accuracy in modeling large multi-domain protein structures and provides meaningful functional insights for the targets at both the domain- and full-chain levels from the amino acid sequence alone.
47.
颜林林 (2022-08-05 21:59):
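#paper doi:10.1038/s41586-022-05028-x Nature, 2022, A physical wiring diagram for the human immune system. This paper develops a method named SAVEXIS (scalable arrayed multi-valent extracellular interaction screen) for high-throughput screening of interacting cell-surface protein pairs, and validates the findings with multiple experimental methods, literature support, single-cell data, and more, yielding a high-quality wiring map of interactions between immune cells.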
IF: 50.500 (Q1). Nature, 2022-08. DOI: 10.1038/s41586-022-05028-x PMID: 35922511
Abstract:
The human immune system is composed of a distributed network of cells circulating throughout the body, which must dynamically form physical associations and communicate using interactions between their cell-surface proteomes. Despite their therapeutic potential, our map of these surface interactions remains incomplete. Here, using a high-throughput surface receptor screening method, we systematically mapped the direct protein interactions across a recombinant library that encompasses most of the surface proteins that are detectable on human leukocytes. We independently validated and determined the biophysical parameters of each novel interaction, resulting in a high-confidence and quantitative view of the receptor wiring that connects human immune cells. By integrating our interactome with expression data, we identified trends in the dynamics of immune interactions and constructed a reductionist mathematical model that predicts cellular connectivity from basic principles. We also developed an interactive multi-tissue single-cell atlas that infers immune interactions throughout the body, revealing potential functional contexts for new interactions and hubs in multicellular networks. Finally, we combined targeted protein stimulation of human leukocytes with multiplex high-content microscopy to link our receptor interactions to functional roles, in terms of both modulating immune responses and maintaining normal patterns of intercellular associations. Together, our work provides a systematic perspective on the intercellular wiring of the human immune system that extends from systems-level principles of immune cell connectivity down to mechanistic characterization of individual receptors, which could offer opportunities for therapeutic intervention.
48.
颜林林 (2022-08-04 23:48):
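#paper doi:10.1016/j.cell.2022.06.036 Cell, 2022, A cross-disorder dosage sensitivity map of the human genome. With a background in bioinformatics, I often wonder how good a journal a purely computational paper can reach, and whether such work has a shot at the top flagship journals. Put differently: can a researcher who switched over from programming, with little funding and only a computer (and the internet behind it), still do "top-tier" biology? This lack of confidence comes mainly from the many doubts raised by traditional researchers and the research paradigm they follow: the common view is that "pure computation" is not trustworthy by itself, and that only "self-generated biological data" makes work credible and meaningful. Yet this paper in Cell is exactly such a "purely computational" case. It does carry the Harvard and Broad Institute brand, but all of the genomic data it integrates, from 17 sources, come from earlier studies, covering 54 disorders and nearly one million enrolled individuals. The authors reanalyzed and manually curated rare CNVs and assessed their phenotypic impact through dosage effects in the context of the corresponding disorders. The integrated data, the analysis methods, and the results are in no way inferior in quality to most work that "directly generates biological data". The figures (including supplementary ones such as Fig. S3) are also a pleasure to look at.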
IF: 45.500 (Q1). Cell, 2022-08-04. DOI: 10.1016/j.cell.2022.06.036 PMID: 35917817
Abstract:
Rare copy-number variants (rCNVs) include deletions and duplications that occur infrequently in the global human population and can confer substantial risk for disease. In this study, we aimed to quantify the properties of haploinsufficiency (i.e., deletion intolerance) and triplosensitivity (i.e., duplication intolerance) throughout the human genome. We harmonized and meta-analyzed rCNVs from nearly one million individuals to construct a genome-wide catalog of dosage sensitivity across 54 disorders, which defined 163 dosage sensitive segments associated with at least one disorder. These segments were typically gene dense and often harbored dominant dosage sensitive driver genes, which we were able to prioritize using statistical fine-mapping. Finally, we designed an ensemble machine-learning model to predict probabilities of dosage sensitivity (pHaplo & pTriplo) for all autosomal genes, which identified 2,987 haploinsufficient and 1,559 triplosensitive genes, including 648 that were uniquely triplosensitive. This dosage sensitivity resource will provide broad utility for human disease research and clinical genetics.
49.
颜林林 (2022-08-03 00:15):
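#paper doi:10.1016/j.molmet.2022.101556 Molecular Metabolism, Tryptophan Metabolism is a Physiological Integrator Regulating Circadian Rhythms. This is a study of the circadian clock, a fairly classic animal-model physiology experiment. Mice were housed under defined light conditions (a 12-hour light/12-hour dark cycle, or 24-hour constant darkness) until they adapted to that rhythm. Their diet was then controlled to reduce or remove essential amino acids, and the light conditions were changed, to study the effect on rhythm recovery. Rhythms, that is, the outcome of clock regulation, were assessed from recordings of the mice's activity times. Blood, liver tissue, and other samples were collected for mass spectrometry, liquid chromatography, transcriptome sequencing, and other assays. The study concludes that tryptophan metabolism is a key circadian regulator: it is itself regulated by light and affects circadian regulation in mice.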
Abstract:
OBJECTIVE: The circadian clock aligns physiology with the 24-hour rotation of Earth. Light and food are the main environmental cues (zeitgebers) regulating circadian rhythms in mammals. Yet, little is known about the interaction between specific dietary components and light in coordinating circadian homeostasis. Herein, we focused on the role of essential amino acids. METHODS: Mice were fed diets depleted of specific essential amino acids and their behavioral rhythms were monitored and tryptophan was selected for downstream analyses. The role of tryptophan metabolism in modulating circadian homeostasis was studied using isotope tracing as well as transcriptomic and metabolomic analyses. RESULTS: Dietary tryptophan depletion alters behavioral rhythms in mice. Furthermore, tryptophan metabolism was shown to be regulated in a time- and light-dependent manner. A multi-omics approach and combinatory diet/light interventions demonstrated that tryptophan metabolism modulates temporal regulation of metabolism and transcription programs by buffering photic cues. Specifically, tryptophan metabolites regulate central circadian functions of the suprachiasmatic nucleus and the core clock machinery in the liver. CONCLUSIONS: Tryptophan metabolism is a modulator of circadian homeostasis by integrating environmental cues. Our findings propose tryptophan metabolism as a potential point for pharmacologic intervention to modulate phenotypes associated with disrupted circadian rhythms.
50.
颜林林 (2022-08-02 23:38):
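#paper doi:10.1101/2020.02.16.951657 bioRxiv, 2022, APA-Scan: Detection and Visualization of 3'-UTR Alternative Polyadenylation with RNA-seq and 3'-end-seq Data. Eukaryotes have a mechanism called APA (alternative polyadenylation): during pre-mRNA processing, different polyadenylation (cleavage) sites can be used, so that transcripts of the same gene carry shorter or longer 3'-UTRs, which allows fine-grained regulation of gene expression (including mRNA degradation). This paper develops a computational tool, APA-Scan, which works from RNA-seq data, takes full account of read coverage in the relevant regions, identifies APA events, annotates them, and provides graphical display, addressing gaps and shortcomings of earlier tools. The authors evaluated it on simulated and real RNA-seq datasets against two widely used baseline tools (DaPars and APAtrap), and validated it with 3'-end-seq data and qPCR experiments.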
Abstract:
Background: The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3’-untranslated region (3’-UTR) of mRNA produces transcripts with shorter or longer 3’-UTR. Often, 3’-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3’-UTR APA is known to modulate translation and provides a means to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3’-UTR APA events due to incomplete annotations and a low-resolution analyzing power: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but simulate 3’-UTR APA only using RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3’-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations. Methods: APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3’-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3’-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events among two biological conditions; (iii) graphical representation of user specific event with 3’-UTR annotation and read coverage on the 3’-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user’s manual are freely available at https://github.com/compbiolabucf/APA-Scan. Result: APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines DaPars and APAtrap. In simulation APA-Scan significantly improved the accuracy of 3’-UTR APA identification compared to the other baselines. The performance of APA-Scan was also validated by 3’-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3’-UTR APA events and improve genome annotation. Conclusion: APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3’-UTR APA events. The pipeline integrates both RNA-seq and 3’-end-seq data information and can efficiently identify the significant events with a high-resolution short reads coverage plots.
51.
颜林林 (2022-08-01 01:02):
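#paper doi:10.1093/bioinformatics/btac528 Bioinformatics, 2022, The K-mer File Format: a standardized and compact disk representation of sets of k-mers. A short string of k consecutive characters is called a k-mer. Many bioinformatics tools and analyses, such as building de Bruijn graphs (for genome assembly) and creating sequence indexes (for short-read alignment), rely on this concept, counting how often each k-mer occurs and recording other information (such as positions in the genome and relationships to other k-mers). As k increases, the number of possible k-mers grows exponentially (4^k for DNA), which imposes huge computational and storage costs. To address this, the paper develops a file format for storing k-mer information that keeps the data compressed while still supporting efficient reads and writes. Frankly, this is not complicated work; anyone who knows some C++ and Rust could do it, and there is plenty of demand for this kind of thing.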
Abstract:
SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools. AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
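To make the k-mer counting mentioned in the comment above concrete, here is a minimal Python sketch. It is not the K-mer File Format or its C++/Rust API; `count_kmers` and `kmer_space_size` are hypothetical helper names, and the four-letter DNA alphabet is an assumption made for illustration.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every length-k substring of seq (no reverse-complement merging)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_space_size(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct k-mers that can exist over the alphabet (4^k for DNA)."""
    return alphabet_size ** k


# Toy sequence, chosen only for illustration.
counts = count_kmers("ACGTACGTAC", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)

# The space of possible k-mers grows exponentially with k.
for k in (3, 11, 21, 31):
    print(k, kmer_space_size(k))
```

An in-memory `Counter` like this becomes impractical for genome-scale data and large k, which is the storage problem a compact on-disk representation is meant to address.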
52.
颜林林 (2022-07-31 07:26):
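#paper doi:10.1016/j.ccell.2022.07.003 Cancer Cell, 2022, Dark genome, bright ideas: Recent approaches to harness transposable elements in immunotherapies. Transposable elements (TEs), which make up nearly half of the human genome, remain something that calls for deeper study. This commentary quickly reviews the relationship between TEs and immunity: TEs are immunogenic, can activate DNA or RNA sensors, and can trigger immune responses, which may lead to new immunotherapies. The article describes in turn how TE expression affects antitumor immunity, and how tumors might be treated immunologically by modulating TE expression, modulating TE immunogenicity, assisting CAR-T cells, and so on. A personal note: studying repeats at the DNA level has always been quite hard, which is why these regions are often called the "dark genome". The difficulty resembles trying to infer a large number of objects floating in the air from their shadows on the ground, where many shadows overlap and cannot be told apart. Fortunately, new technologies let us start peeling back the fog layer by layer, from the angles of long reads, multi-omics, and more.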
IF: 48.800 (Q1). Cancer cell, 2022-08-08. DOI: 10.1016/j.ccell.2022.07.003 PMID: 35907399
Abstract:
Transposable elements (TEs), which make up almost half of the human genome, often display altered expression in cancers. Here, we review recent progress in elucidating the role of TEs as mediators of immune responses in cancer and discuss how novel therapeutic strategies can harness TE immunogenicity for cancer immunotherapy.
53.
颜林林 (2022-07-30 01:17):
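#paper doi:10.15252/msb.202211017 Molecular Systems Biology, 2022, Computational estimation of quality and clinical relevance of cancer cell lines. This is a review of cancer cell lines that mainly examines the quality of public, widely used lines. It first surveys current public cell-line resources for different cancer types, including the associated multi-omics data. It then introduces factors that can affect cell-line quality, such as cross-contamination, accumulation of mutations over passages, lack of microenvironmental factors, and heterogeneity at the molecular and cell-state levels. Next, it reviews the computational methods (and tools) available for assessing these problems. Finally, the discussion looks ahead to future improvements, such as multi-omics integration, transfer learning, use of single-cell data, and better interpretability. Cell lines are an important system in cancer research, and this paper systematically summarizes both resource selection and methods for analysis and evaluation.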
Abstract:
Immortal cancer cell lines (CCLs) are the most widely used system for investigating cancer biology and for the preclinical development of oncology therapies. Pharmacogenomic and genome-wide editing screenings have facilitated the discovery of clinically relevant gene-drug interactions and novel therapeutic targets via large panels of extensively characterised CCLs. However, tailoring pharmacological strategies in a precision medicine context requires bridging the existing gaps between tumours and in vitro models. Indeed, intrinsic limitations of CCLs such as misidentification, the absence of tumour microenvironment and genetic drift have highlighted the need to identify the most faithful CCLs for each primary tumour while addressing their heterogeneity, with the development of new models where necessary. Here, we discuss the most significant limitations of CCLs in representing patient features, and we review computational methods aiming at systematically evaluating the suitability of CCLs as tumour proxies and identifying the best patient representative in vitro models. Additionally, we provide an overview of the applications of these methods to more complex models and discuss future machine-learning-based directions that could resolve some of the arising discrepancies.
54.
颜林林 (2022-07-29 08:21):
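#paper doi:10.1093/nar/gkac586 Nucleic Acids Research, 2022, De novo assembly of human genome at single-cell levels. The authors previously developed a technique called SMOOTH-seq. Roughly: Tn5 transposase inserts into genomic DNA to fragment it randomly; barcoded primers are then used for strand displacement and amplification of the fragments; a sequence is ligated to each end of the duplex to circularize it; and rolling-circle amplification yields long molecules suitable for long-read sequencing. Each long molecule carries multiple copies of the original fragment, so bases can be corrected accurately. This paper improves on that method, using two platforms, PacBio HiFi and Oxford Nanopore Technologies (ONT), to perform single-cell sequencing of two cell lines, K562 and HG002, and reports the first highly contiguous human genome assembly at the single-cell level. Results include: 95 K562 cells with a total sequencing depth of about 37x (if I understand correctly, about 37/95 ≈ 0.4x per cell) and an NG50 of about 2 Mb; and 30 HG002 cells with about 1 Gb of sequence per cell (roughly 0.33x) and an NG50 of about 1.3 Mb. In the words of the abstract, the study "opened a new chapter on the practice of single-cell genome de novo assembly". The topic looks highly novel, but closer scrutiny raises some doubts. Single-cell genome sequencing can hardly distinguish cell populations, so assembly should in principle be done cell by cell; pooling many cells from different populations for assembly would defeat the original purpose. But genome coverage of a single cell can never be very complete (the paper reports an average coverage of about 41.7%, and I suspect more sequencing would not improve this much), which severely limits the assembly itself, so in the end one can only focus on the structural variants identified. Moreover, single-cell genome results are hard to validate: it is difficult to use results from other cells to judge whether the result for the cell at hand is accurate, which seems a logical weak point. So the paper's contribution, beyond the single-cell genome sequencing data for the two cell lines, is probably mainly the optimization of the experimental protocol and the establishment of the method, though its data analysis workflow is also worth consulting.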
IF: 16.600 (Q1). Nucleic acids research, 2022-07-22. DOI: 10.1093/nar/gkac586 PMID: 35819189 PMCID: PMC9303314
Abstract:
Genome assembly has been benefited from long-read sequencing technologies with higher accuracy and higher continuity. However, most human genome assembly require large amount of DNAs from homogeneous cell lines without keeping cell heterogeneities, since cell heterogeneity could profoundly affect haplotype assembly results. Herein, using single-cell genome long-read sequencing technology (SMOOTH-seq), we have sequenced K562 and HG002 cells on PacBio HiFi and Oxford Nanopore Technologies (ONT) platforms and conducted de novo genome assembly. For the first time, we have completed the human genome assembly with high continuity (with NG50 of ∼2 Mb using 95 individual K562 cells) at single-cell levels, and explored the impact of different assemblers and sequencing strategies on genome assembly. With sequencing data from 30 diploid individual HG002 cells of relatively high genome coverage (average coverage ∼41.7%) on ONT platform, the NG50 can reach over 1.3 Mb. Furthermore, with the assembled genome from K562 single-cell dataset, more complete and accurate set of insertion events and complex structural variations could be identified. This study opened a new chapter on the practice of single-cell genome de novo assembly.
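The per-cell depth estimates in the comment above are simple divisions; the sketch below just spells them out. The 3.0 Gb genome size is an assumed round figure (not taken from the paper), "1G" is read here as about 1 Gb of sequence per cell, and the function names are hypothetical.

```python
# Assumed approximate haploid human genome size in base pairs (round figure).
GENOME_SIZE_BP = 3.0e9


def per_cell_depth_from_total(total_depth_x: float, n_cells: int) -> float:
    """Average depth per cell when a pooled depth is split evenly across cells."""
    return total_depth_x / n_cells


def depth_from_bases(bases_sequenced: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Sequencing depth (x) = bases sequenced / genome size."""
    return bases_sequenced / genome_size_bp


k562_depth = per_cell_depth_from_total(37.0, 95)  # 95 K562 cells, about 37x in total
hg002_depth = depth_from_bases(1.0e9)             # about 1 Gb of sequence per HG002 cell

print(f"K562 per-cell depth:  {k562_depth:.2f}x")
print(f"HG002 per-cell depth: {hg002_depth:.2f}x")
```

Both values come out well below 1x, consistent with the comment's point that any single cell covers only part of the genome.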
55.
颜林林 (2022-07-28 08:50):
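#paper doi:10.1093/bioinformatics/btac137 Bioinformatics, 2022, BWA-MEME: BWA-MEM emulated with a machine learning approach. I saw Heng Li share this paper on Twitter and assumed he had upgraded bwa-mem2 again, only to find it is someone else's work that he had endorsed. A successor to well-known software has to improve substantially on some front, and here the improvement is mainly in performance. Short-read aligners for high-throughput sequencing data usually first find exact-match seeds (almost always by table lookup in constant time) and then extend the match. Choosing the seed length takes some skill: too short and there may be too many repeated hits; too long and many words with no match (absent from the genome) still occupy the dictionary, making it too large. Some earlier algorithms therefore use variable-length seeds (I had considered this strategy myself but, to my regret, never put it into practice). Variable-length seeds, however, suffer from variable memory block sizes and frequent memory accesses, which create performance bottlenecks. This paper uses machine learning to preprocess during seed-index construction so that the index adapts to the genome sequence and the number of memory accesses for seeds of different lengths is fixed, which improves performance. In the final benchmark, BWA-MEME produces SAM output identical to BWA-MEM2 while achieving up to 3.45× speedup in seeding throughput. The algorithm deserves closer study.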
Abstract:
MOTIVATION: The growing use of next-generation sequencing and enlarged sequencing throughput require efficient short-read alignment, where seeding is one of the major performance bottlenecks. The key challenge in the seeding phase is searching for exact matches of substrings of short reads in the reference DNA sequence. Existing algorithms, however, present limitations in performance due to their frequent memory accesses. RESULTS: This article presents BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding. BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase. Our evaluation shows that BWA-MEME achieves up to 3.45× speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60×, memory accesses by 8.77× and LLC misses by 2.21×, while ensuring the identical SAM output to BWA-MEM2. AVAILABILITY AND IMPLEMENTATION: The source code and test scripts are available for academic use at https://github.com/kaist-ina/BWA-MEME/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
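The seed-then-extend idea described in the comment above can be sketched with a toy example. This is a deliberately simplified illustration using a plain hash table of fixed-length seeds and ungapped extension; it is not BWA-MEME's learned index, nor the suffix-array/SMEM search used by BWA-MEM2. All function names, the toy reference, and the parameters `k` and `max_mismatches` are assumptions.

```python
from collections import defaultdict


def build_seed_index(reference: str, k: int) -> dict:
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read: str, reference: str, index: dict, k: int,
                    max_mismatches: int = 1) -> list:
    """Look up exact k-mer seeds, then check the whole read without gaps."""
    hits = set()
    for offset in range(len(read) - k + 1):
        # Exact-match seed lookup (expected constant time per seed).
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset  # implied start of the read on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


# Toy reference and reads, for illustration only.
reference = "GATTACAGATTTCAGG"
index = build_seed_index(reference, k=3)
exact_hits = seed_and_extend("CAGATT", reference, index, k=3)
fuzzy_hits = seed_and_extend("CAGCTT", reference, index, k=3)
print(exact_hits, fuzzy_hits)
```

With a short seed such as k=3, the same seed can occur at several reference positions, which is the "too many hits" side of the seed-length trade-off mentioned above; real aligners use much longer or variable-length seeds.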
56.
颜林林 (2022-07-26 23:37):
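#paper doi:10.1002/jbio.202100389 Journal of Biophotonics, 2022, Skin's green autofluorescence at dorsal centremetacarpus may become a novel biomarker for diagnosis of lung cancer. Early cancer screening is one of the hottest R&D directions right now, so overheated that layoffs seem to have begun, because everyone is following similar, homogeneous routes (such as methylation sequencing). This paper from Shanghai Jiao Tong University takes a different path, measuring skin autofluorescence and trying to use it for early screening and diagnosis of lung cancer. It is a truly non-invasive approach, based on the idea that a keratin molecule in the spinous layer of the epidermis fluoresces under blue light, and that the fluorescence intensity correlates with disease state. The study included real clinical cases and a transplanted-tumor mouse model; distinguishing lung cancer from pulmonary infection and from healthy controls gave AUCs of 0.871 and 0.813, respectively, suggesting a potential biomarker for early screening and diagnosis of lung cancer.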
IF: 2.000 (Q3). Journal of Biophotonics, 2022-05. DOI: 10.1002/jbio.202100389. PMID: 35075788
Abstract:
It is critical to discover novel biomarkers of lung cancer for establishing economical technology for diagnosis of lung cancer. Our study has suggested that the autofluorescence (AF) of the skin may become a novel biomarker of this type: First, development of lung cancer led to a significant increase in the skin's green AF in a mouse model of lung cancer; second, lung cancer patients had significantly higher skin's green AF at certain positions compared with healthy volunteers and pulmonary infection patients; and third, using the skin's green AF intensity at dorsal centremetacarpus as the variable, the areas under curve (AUC) for differentiating lung cancer patients and pulmonary infection patients and for differentiating lung cancer patients and healthy volunteers was 0.871 and 0.813, respectively. Collectively, our study has indicated that the skin's green AF at dorsal centremetacarpus may become a novel biomarker for establishing a ground-breaking diagnostic strategy for lung cancer.
57.
颜林林 (2022-07-25 07:28):
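#paper doi:10.1038/s41380-022-01661-0 Molecular Psychiatry, 2022, The serotonin theory of depression: a systematic umbrella review of the evidence. This is a meta-analysis, and a report of negative results at that; many "insiders" would dismiss this kind of "filler paper" as beneath notice or too embarrassing to mention. The paper examines whether serotonin (5-hydroxytryptamine, 5-HT) is involved in the etiology of depression. The view that lowered serotonin is linked to depression is widespread among both the general public and professional researchers. The authors used an umbrella review, bringing together a large body of research on the serotonin system from several different fields so that the conclusions rest on the highest level of evidence available. The six areas covered are: (1) whether serotonin and its metabolite 5-HIAA (5-hydroxyindoleacetic acid) are lower in the body fluids of patients with depression; (2) whether serotonin receptor levels are lower in patients with depression; (3) whether serotonin transporter (SERT) levels are higher in patients with depression; (4) whether depleting tryptophan (the precursor of serotonin) leads to depression; (5) whether the SERT gene is associated with depression; (6) whether the SERT gene interacts with stress in patients with depression. The review was registered with PROSPERO (CRD42020207203) and included 17 studies: 12 systematic reviews and meta-analyses, 1 collaborative meta-analysis, 1 meta-analysis of large cohort studies, 1 systematic review and narrative synthesis, 1 genetic association study, and 1 umbrella review. Across the six areas, each assessed at the largest sample size available (from hundreds to tens of thousands), the review rejects an association between markers of serotonin activity and depression, and suggests that "it is time to acknowledge that the serotonin theory of depression is not empirically substantiated". Clearly, reaching a definitive negative conclusion is no easy feat either.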
IF: 9.600 (Q1). Molecular Psychiatry, 2023-Aug. DOI: 10.1038/s41380-022-01661-0. PMID: 35854107
Abstract:
The serotonin hypothesis of depression is still influential. We aimed to synthesise and evaluate evidence on whether depression is associated with lowered serotonin concentration or activity in a systematic umbrella review of the principal relevant areas of research. PubMed, EMBASE and PsycINFO were searched using terms appropriate to each area of research, from their inception until December 2020. Systematic reviews, meta-analyses and large data-set analyses in the following areas were identified: serotonin and serotonin metabolite, 5-HIAA, concentrations in body fluids; serotonin 5-HT receptor binding; serotonin transporter (SERT) levels measured by imaging or at post-mortem; tryptophan depletion studies; SERT gene associations and SERT gene-environment interactions. Studies of depression associated with physical conditions and specific subtypes of depression (e.g. bipolar depression) were excluded. Two independent reviewers extracted the data and assessed the quality of included studies using the AMSTAR-2, an adapted AMSTAR-2, or the STREGA for a large genetic study. The certainty of study results was assessed using a modified version of the GRADE. We did not synthesise results of individual meta-analyses because they included overlapping studies. The review was registered with PROSPERO (CRD42020207203). 17 studies were included: 12 systematic reviews and meta-analyses, 1 collaborative meta-analysis, 1 meta-analysis of large cohort studies, 1 systematic review and narrative synthesis, 1 genetic association study and 1 umbrella review. Quality of reviews was variable with some genetic studies of high quality. Two meta-analyses of overlapping studies examining the serotonin metabolite, 5-HIAA, showed no association with depression (largest n = 1002). One meta-analysis of cohort studies of plasma serotonin showed no relationship with depression, and evidence that lowered serotonin concentration was associated with antidepressant use (n = 1869). 
Two meta-analyses of overlapping studies examining the 5-HT receptor (largest n = 561), and three meta-analyses of overlapping studies examining SERT binding (largest n = 1845) showed weak and inconsistent evidence of reduced binding in some areas, which would be consistent with increased synaptic availability of serotonin in people with depression, if this was the original, causal abnormality. However, effects of prior antidepressant use were not reliably excluded. One meta-analysis of tryptophan depletion studies found no effect in most healthy volunteers (n = 566), but weak evidence of an effect in those with a family history of depression (n = 75). Another systematic review (n = 342) and a sample of ten subsequent studies (n = 407) found no effect in volunteers. No systematic review of tryptophan depletion studies has been performed since 2007. The two largest and highest quality studies of the SERT gene, one genetic association study (n = 115,257) and one collaborative meta-analysis (n = 43,165), revealed no evidence of an association with depression, or of an interaction between genotype, stress and depression. The main areas of serotonin research provide no consistent evidence of there being an association between serotonin and depression, and no support for the hypothesis that depression is caused by lowered serotonin activity or concentrations. Some evidence was consistent with the possibility that long-term antidepressant use reduces serotonin concentration.
58.
颜林林 (2022-07-24 05:55):
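#paper doi:10.1186/s12864-022-08762-8 BMC Genomics, 2022, Poly(a) selection introduces bias and undue noise in direct RNA-sequencing. In whole-transcriptome sequencing experiments, poly-A selection is often used after the initial RNA extraction step to enrich for mRNA. This paper performs direct RNA sequencing on the ONT platform, processing the same sample in parallel with and without poly-A selection. The results indicate that omitting this step is appropriate: although doing so may slightly reduce library complexity, it more effectively avoids other drawbacks of the selection step, such as requiring more input RNA, preferentially capturing mRNAs with longer poly-A tails, and making differential expression results less stable.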
IF: 3.500 (Q2). BMC Genomics, 2022-Jul-22. DOI: 10.1186/s12864-022-08762-8. PMID: 35869428
Abstract:
BACKGROUND: Genome-wide RNA-sequencing technologies are increasingly critical to a wide variety of diagnostic and research applications. RNA-seq users often first enrich for mRNA, with the most popular enrichment method being poly(A) selection. In many applications it is well-known that poly(A) selection biases the view of the transcriptome by selecting for longer tailed mRNA species. RESULTS: Here, we show that poly(A) selection biases Oxford Nanopore direct RNA sequencing. As expected, poly(A) selection skews sequenced mRNAs toward longer poly(A) tail lengths. Interestingly, we identify a population of mRNAs (> 10% of genes' mRNAs) that are inconsistently captured by poly(A) selection due to highly variable poly(A) tails, and demonstrate this phenomenon in our hands and in published data. Importantly, we show poly(A) selection is dispensable for Oxford Nanopore's direct RNA-seq technique, and demonstrate successful library construction without poly(A) selection, with decreased input, and without loss of quality. CONCLUSIONS: Our work expands the utility of direct RNA-seq by validating the use of total RNA as input, and demonstrates important technical artifacts from poly(A) selection that inconsistently skew mRNA expression and poly(A) tail length measurements.
59.
颜林林 (2022-07-23 22:05):
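#paper doi:10.1101/2022.07.21.500999 bioRxiv, 2022, High-resolution de novo structure prediction from primary sequence. This preprint presents a tool, OmegaFold, that predicts tertiary structure from the primary sequence of a single protein. Current mainstream methods all rely on evolutionary information, using multiple sequence alignments as an aid when predicting protein folding. The authors argue that a protein, once translated, folds spontaneously from its primary sequence into its tertiary structure, so this evolutionary information is not essential for structure prediction. The deep model relies on a set of pretrained models to help identify which amino acids in the primary sequence matter more (that is, it assigns them different attention weights), and uses BERT-based language-model techniques to help train the folding model. The resulting method effectively addresses structure prediction for orphan proteins (those with no similar proteins in current structure databases to serve as a reference). According to the abstract, it outperforms RoseTTAFold and achieves accuracy similar to AlphaFold2.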
Abstract:
Recent breakthroughs have used deep learning to exploit evolutionary information in multiple sequence alignments (MSAs) to accurately predict protein structures. However, MSAs of homologous proteins are not always available, such as with orphan proteins and fast-evolving proteins like antibodies, and a protein typically folds in a natural setting from its primary amino acid sequence into its three-dimensional structure, suggesting that evolutionary information and MSAs should not be necessary to predict a protein's folded form. Here, we introduce OmegaFold, the first computational method to successfully predict high-resolution protein structure from a single primary sequence alone. Using a new combination of a protein language model that allows us to make predictions from single sequences and a geometry-inspired transformer model trained on protein structures, OmegaFold outperforms RoseTTAFold and achieves similar prediction accuracy to AlphaFold2 on recently released structures. OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and antibodies that tend to have noisy MSAs due to fast evolution. Our study fills a much-needed structure prediction gap and brings us a step closer to understanding protein folding in nature.
60.
颜林林 (2022-07-22 00:00):
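#paper doi:10.1056/NEJMe2207902 The New England Journal of Medicine, 2022, Setting the Benchmark for KRAS(G12C)-Mutated NSCLC. This is an Editorial introducing the report, in the same issue, of results from the phase 2 KRYSTAL-1 clinical trial (doi:10.1056/NEJMoa2204619). The star of that trial is a KRAS G12C inhibitor, adagrasib, which performed well: in patients carrying the KRAS G12C mutation who had previously received chemotherapy and immunotherapy, its efficacy endpoints (ORR, PFS, OS, etc.) were very close to those of sotorasib, another drug that had already been approved. From this, the editorial infers that the two drugs may overlap substantially in mechanism. In addition, differences between the two in metabolism and pharmacokinetics (such as crossing the blood-brain barrier and in vivo half-life) point to ways of differentiating between them when choosing a drug in the future.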