Papers shared by user Vincent.
27 shared papers found in total; this page shows items 1 - 20.
1.
Vincent
(2024-06-30 16:11):
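#paper doi:https://doi.org/10.1038/s41556-020-00620-7, Nat Cell Biol, 2021, CRISPR technologies for precise epigenome editing.
This paper reviews the applications of and progress in CRISPR/Cas systems for epigenome editing. It introduces the basic principles of the CRISPR/Cas9 system and explains how catalytically dead Cas9 (dCas9) can recruit epigenetic modifying enzymes to activate or repress transcription of specific genes. It surveys example applications in cell and animal models, showing the potential of CRISPR technology for studying gene function and treating disease, and it lays out the current technical challenges and optimization strategies, including editing efficiency, off-target effects, and the dynamic nature of epigenetic modifications. It closes with an outlook on future directions for CRISPR-based epigenome editing, stressing that further work is needed to improve specificity and stability.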
Abstract:
The epigenome involves a complex set of cellular processes governing genomic activity. Dissecting this complexity necessitates the development of tools capable of specifically manipulating these processes. The repurposing of prokaryotic CRISPR systems has allowed for the development of diverse technologies for epigenome engineering. Here, we review the state of currently achievable epigenetic manipulations along with corresponding applications. With future optimization, CRISPR-based epigenomic editing stands as a set of powerful tools for understanding and controlling biological function.
2.
Vincent
(2024-05-31 15:19):
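#paper https://doi.org/10.1016/j.cell.2022.12.027 Cell. 2023 Loss of epigenetic information as a cause of mammalian aging. Aging is accompanied by a loss of information. Genetic information (the DNA sequence) and epigenetic information (DNA and histone modifications, among others) can be likened to an organism's hardware and software, and loss at either level could contribute to aging. This paper induces DNA double-strand breaks that do not cause mutations and relies on the cell's own DNA repair process to disrupt the epigenetic landscape. It shows that, with only epigenetic information lost, mammalian cells display hallmarks of aging such as loss of cellular identity and cellular senescence. Follow-up experiments use epigenetic reprogramming to return cells to a younger state, supporting the claim that loss of epigenetic information can drive aging and that the change is reversible to some extent.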
Abstract:
No abstract available.
3.
Vincent
(2024-04-30 23:17):
#paper https://doi.org/10.1214/23-AOAS1780 Ann. Appl. Stat. 2024 Bayesian multiple instance classification based on hierarchical probit regression
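Multiple instance learning (MIL) is widely used in areas such as drug-efficacy prediction and pathology image analysis. Unlike standard supervised learning, where every instance has its own label, MIL groups instances into bags and assigns one label per bag; which instances are the primary instances, and how they determine the label, is unknown. Past MIL research has come mostly from computer science and has emphasized prediction, with comparatively little work on statistical inference and model interpretability. This paper tries to fill that gap. It proposes a Bayesian hierarchical probit regression model (a nested probit model): the inner regression learns the relationship between instance features and primary-instance status, and the outer regression learns the relationship between the primary instances and the label. Compared with other models, this parametric model is competitive on simulated and real data while offering better interpretability and more intuitive statistical inference.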
The Annals of Applied Statistics,
2024.
DOI: 10.1214/23-AOAS1780
Abstract:
No abstract available.
4.
Vincent
(2024-03-31 16:59):
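#paper Clarifying the biological and statistical assumptions of cross-sectional biological age predictors: an elaborate illustration using synthetic and real data. BMC Medical Research Methodology. 2024. https://doi.org/10.1186/s12874-024-02181-x. Biological age represents an individual's true physiological state and may differ from chronological age (an individual may be biologically younger or older than their chronological age). The gap between biological and chronological age (aging divergence) has attracted broad research interest; it is commonly held that when biological age exceeds chronological age, the individual has a shorter life expectancy and a higher risk of death or disease. Biological age is usually predicted from biochemical or molecular markers, and in practice these data tend to be cross-sectional (collected at a single time point, as opposed to longitudinal data). The paper points out that, with cross-sectional data, studying whether aging divergence is associated with particular traits rests on an implicit assumption (the identical-association assumption): the markers most strongly associated with chronological age must also be the ones most strongly associated with aging divergence. Whether this holds determines whether the results are biologically meaningful. Unfortunately, the assumption cannot be tested from cross-sectional data (it is untestable). The paper's main contribution is to expose this often overlooked assumption explicitly, using simulated and real data, which serves as a caution for aging research and for interpreting aging mechanisms.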
IF: 3.900, Q1
BMC medical research methodology,
2024-Mar-08.
DOI: 10.1186/s12874-024-02181-x
PMID: 38459475
Abstract:
BACKGROUND: There is divergence in the rate at which people age. The concept of biological age is postulated to capture this variability, and hence to better represent an individual's true global physiological state than chronological age. Biological age predictors are often generated based on cross-sectional data, using biochemical or molecular markers as predictor variables. It is assumed that the difference between chronological and predicted biological age is informative of one's chronological age-independent aging divergence ∆.
METHODS: We investigated the statistical assumptions underlying the most popular cross-sectional biological age predictors, based on multiple linear regression, the Klemera-Doubal method or principal component analysis. We used synthetic and real data to illustrate the consequences if this assumption does not hold.
RESULTS: The most popular cross-sectional biological age predictors all use the same strong underlying assumption, namely that a candidate marker of aging's association with chronological age is directly informative of its association with the aging rate ∆. We called this the identical-association assumption and proved that it is untestable in a cross-sectional setting. If this assumption does not hold, weights assigned to candidate markers of aging are uninformative, and no more signal may be captured than if markers would have been assigned weights at random.
CONCLUSIONS: Cross-sectional methods for predicting biological age commonly use the untestable identical-association assumption, which previous literature in the field had never explicitly acknowledged. These methods have inherent limitations and may provide uninformative results, highlighting the importance of researchers exercising caution in the development and interpretation of cross-sectional biological age predictors.
5.
Vincent
(2024-02-29 17:06):
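#paper Transfer learning enables predictions in network biology. Nature. 2023. doi: https://doi.org/10.1038/s41586-023-06139-9. Learning gene interaction networks usually requires large amounts of data. For biological studies with limited data, transfer learning and pretrained models can substantially reduce that requirement. This paper proposes Geneformer, a transformer-based deep learning model pretrained on a large corpus of single-cell datasets by self-supervised learning. During training, Geneformer does not use raw gene expression values; it uses gene expression ranks (which acts as a form of denoising) to learn gene networks. For downstream tasks, fine-tuning on a small amount of data is enough to boost predictive accuracy. The paper gives examples in gene dosage, chromatin dynamics, and gene network dynamics, where predictive accuracy improves clearly over conventional machine learning models.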
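To make the rank idea concrete, here is a minimal sketch of turning one cell's expression profile into an ordered list of gene tokens. It is an illustration only, not Geneformer's actual preprocessing: the function name `rank_encode`, the `max_len` parameter, the tie-breaking rule, and the toy expression values are all assumptions made for this example.

```python
def rank_encode(expression, max_len=None):
    """Order a cell's detected genes from highest to lowest expression.

    `expression` maps gene name -> expression value for one cell.
    Hypothetical helper for illustration; ties are broken alphabetically.
    """
    # Drop undetected genes, then sort by descending expression.
    detected = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(detected, key=lambda gv: (-gv[1], gv[0]))
    genes = [gene for gene, _ in ranked]
    return genes if max_len is None else genes[:max_len]


# Two toy cells whose expression differs only by a scale factor
# (for example, different sequencing depth).
cell_a = {"GATA4": 12.0, "TTN": 250.0, "ACTB": 90.0, "MYH7": 0.0}
cell_b = {"GATA4": 1.2, "TTN": 25.0, "ACTB": 9.0, "MYH7": 0.0}

tokens_a = rank_encode(cell_a)
tokens_b = rank_encode(cell_b)
print(tokens_a)
print(tokens_b)
```

Both toy cells yield the same token order, which is one way to see why ranks are less sensitive to scale and noise than raw values.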
Abstract:
Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
6.
Vincent
(2024-01-31 15:43):
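#paper doi:https://www.jstor.org/stable/30047444 Journal of the American Statistical Association, 2006, Prediction by Supervised Principal Components. When the feature dimension is high, regression tends to perform poorly, partly because the data are noisy and partly because the features are highly correlated. This paper proposes a simple and effective framework for supervised dimension reduction: threshold the features by their regression coefficients against the response, reduce the small set of retained features with principal component analysis, and use the resulting principal components in a regression or generalized regression model. The main theoretical contribution is a proof of the method's asymptotic consistency in the regression and survival-analysis settings, together with a comparison against other methods (such as ridge regression, the lasso, and partial least squares). The paper also notes limitations: for example, the method cannot handle the case where each feature is marginally independent of the response but several features are jointly dependent on it.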
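The screen-then-reduce-then-regress pipeline described above can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `supervised_pc`, the fixed threshold (the paper's procedure would normally tune it), the power-iteration PCA, and the simulated data are all choices made for this example.

```python
import math
import random


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def supervised_pc(columns, y, threshold, n_iter=200):
    """Return (kept feature indices, PC loadings, slope of y on the first PC)."""
    yc = center(y)
    cols = [center(c) for c in columns]
    # Step 1: standardized univariate regression coefficient for each feature
    # (assumes no feature is constant).
    scores = [dot(c, yc) / math.sqrt(dot(c, c)) for c in cols]
    # Step 2: keep only features whose score passes the threshold.
    kept = [j for j, s in enumerate(scores) if abs(s) > threshold]
    if not kept:
        raise ValueError("threshold removed every feature")
    sub = [cols[j] for j in kept]
    # Step 3: first principal component of the kept features, found by power
    # iteration on their Gram matrix (assumes the start vector is not
    # orthogonal to the leading eigenvector).
    gram = [[dot(a, b) for b in sub] for a in sub]
    v = [1.0] * len(sub)
    for _ in range(n_iter):
        w = [dot(row, v) for row in gram]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    pc = [sum(v[k] * sub[k][i] for k in range(len(sub))) for i in range(len(y))]
    # Step 4: simple linear regression of the response on the first PC score.
    slope = dot(pc, yc) / dot(pc, pc)
    return kept, v, slope


# Simulated data: features 0 and 1 track a latent factor that drives y;
# features 2..5 are pure noise.
rng = random.Random(0)
n = 200
latent = [rng.gauss(0, 1) for _ in range(n)]
columns = [[u + rng.gauss(0, 0.3) for u in latent] for _ in range(2)]
columns += [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [2 * u + rng.gauss(0, 0.5) for u in latent]

kept, loadings, slope = supervised_pc(columns, y, threshold=10.0)
print(kept, loadings, slope)
```

The limitation mentioned in the note is visible in step 1: a feature that matters only jointly with others gets a small univariate score and is dropped before PCA ever sees it.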
Abstract:
No abstract available.
7.
Vincent
(2023-12-31 21:15):
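#paper doi: 10.1126/science.adi6000 Prediction-powered inference, Science, 2023. In many fields labeled (gold-standard) data are scarce while unlabeled data are plentiful, and drawing rigorous statistical conclusions from such data remains challenging. The classical approach uses only the few gold-standard observations for inference; the results are valid, but the small sample size limits how many discoveries can be made. Another approach labels the unlabeled data with a predictive model and runs inference on the imputed and gold-standard data together; the sample size is large, but the approach assumes the model is perfect, which often fails, and accumulated prediction error and bias can invalidate the conclusions. This paper proposes a general framework that uses a predictive model while keeping the inference valid. It has three steps: 1. choose the parameter to estimate; 2. estimate a measure of fit from the unlabeled data and a rectifier from the labeled data; 3. combine the measure of fit and the rectifier into a confidence interval for the parameter. The paper proves that, for any prediction algorithm and any data distribution, the resulting confidence interval covers the true value with probability at least the nominal level. Because the method can draw on a larger sample, the data analyses also show narrower confidence intervals and more powerful p-values than the classical approach.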
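For the simplest target, a population mean, the three steps can be sketched as follows. This is a hedged illustration using a normal-approximation interval; the function names, the simulated data, and the deliberately biased toy "model" are assumptions for this example and are not taken from the paper's software.

```python
import random
from statistics import NormalDist, mean, variance


def ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Prediction-powered confidence interval for a mean (normal approximation)."""
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: average prediction error, estimated on the labeled data.
    residuals = [f - y for f, y in zip(pred_labeled, y_labeled)]
    rectifier = mean(residuals)
    # Measure of fit from unlabeled predictions, corrected by the rectifier.
    estimate = mean(pred_unlabeled) - rectifier
    se = (variance(pred_unlabeled) / big_n + variance(residuals) / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se


def classical_mean_ci(y_labeled, alpha=0.05):
    """Interval that uses only the labeled observations."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = (variance(y_labeled) / len(y_labeled)) ** 0.5
    return mean(y_labeled) - z * se, mean(y_labeled) + z * se


rng = random.Random(0)


def predict(y_true):
    # Hypothetical model: biased by +0.5 and noisy, so naive imputation fails.
    return y_true + 0.5 + rng.gauss(0, 0.3)


y_labeled = [rng.gauss(5, 2) for _ in range(100)]
pred_labeled = [predict(v) for v in y_labeled]
pred_unlabeled = [predict(rng.gauss(5, 2)) for _ in range(10000)]

ppi_lo, ppi_hi = ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled)
cls_lo, cls_hi = classical_mean_ci(y_labeled)
print("prediction-powered:", ppi_lo, ppi_hi)
print("classical:", cls_lo, cls_hi)
```

The rectifier removes the model's bias, and the interval narrows because the residual variance is much smaller than the variance of the outcome itself; a worse model would give a wider interval but, under the framework's guarantee, still a valid one.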
Abstract:
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients without making any assumptions about the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference were demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.
8.
Vincent
(2023-11-30 16:34):
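#paper Contrastive Variational Autoencoder Enhances Salient Features, arXiv, 2019 https://arxiv.org/abs/1902.04601 The recent contrastive PCA adopts a contrastive learning approach: it captures the differences between a target dataset and a background dataset, giving unsupervised dimension reduction that preserves the contrastive signal. Like PCA, however, contrastive PCA reduces dimension only through linear combinations of the variables and cannot capture nonlinear relationships among them. This paper extends contrastive PCA with a variational autoencoder (VAE) to capture nonlinear relationships; the method is called the contrastive VAE (cVAE). The cVAE explicitly models both the features shared between the datasets and the features enriched in the target data, which lets it isolate and enhance the salient latent features of the target. Its run time is similar to that of a standard VAE, and it is robust to noise and dataset purity. The paper validates the method on several datasets (for example the MNIST handwritten digits), where it captures salient latent features more effectively than a standard VAE. As a generative model, once trained it can also generate new data from these salient latent features.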
arXiv,
2019.
DOI: 10.48550/arXiv.1902.04601
Abstract:
Variational autoencoders are powerful algorithms for identifying dominant latent structure in a single dataset. In many applications, however, we are interested in modeling latent structure and variation that are enriched in a target dataset compared to some background, e.g. enriched in patients compared to the general population. Contrastive learning is a principled framework to capture such enriched variation between the target and background, but state-of-the-art contrastive methods are limited to linear models. In this paper, we introduce the contrastive variational autoencoder (cVAE), which combines the benefits of contrastive learning with the power of deep generative models. The cVAE is designed to identify and enhance salient latent features. The cVAE is trained on two related but unpaired datasets, one of which has minimal contribution from the salient latent features. The cVAE explicitly models latent features that are shared between the datasets, as well as those that are enriched in one dataset relative to the other, which allows the algorithm to isolate and enhance the salient latent features. The algorithm is straightforward to implement, has a similar run-time to the standard VAE, and is robust to noise and dataset purity. We conduct experiments across diverse types of data, including gene expression and facial images, showing that the cVAE effectively uncovers latent structure that is salient in a particular analysis.
9.
Vincent
(2023-10-31 14:27):
#paper https://doi.org/10.1038/s41576-022-00477-6 Nat Rev Genet 2022 Making sense of the ageing methylome
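Aging has attracted considerable research interest in recent years. This review summarizes recent methylome studies of aging. It covers several statistical approaches, and the corresponding tools, for finding age-associated sites: most commonly linear models for differentially methylated positions, hypothesis tests for variably methylated positions, and statistics such as entropy and correlation networks for more complex patterns of change. It also presents notable evidence linking DNA methylation to aging and discusses strategies that aim to extend lifespan by intervening on methylation patterns and machinery. Finally, it discusses theories about the mechanisms behind epigenetic aging.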
Abstract:
Over time, the human DNA methylation landscape accrues substantial damage, which has been associated with a broad range of age-related diseases, including cardiovascular disease and cancer. Various age-related DNA methylation changes have been described, including at the level of individual CpGs, such as differential and variable methylation, and at the level of the whole methylome, including entropy and correlation networks. Here, we review these changes in the ageing methylome as well as the statistical tools that can be used to quantify them. We detail the evidence linking DNA methylation to ageing phenotypes and the longevity strategies aimed at altering both DNA methylation patterns and machinery to extend healthspan and lifespan. Lastly, we discuss theories on the mechanistic causes of epigenetic ageing.
10.
Vincent
(2023-09-30 23:59):
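#paper https://doi.org/10.1038/s41592-018-0213-x Identification of differentially methylated cell types in epigenome-wide association studies. Nature Methods, 2018. Epigenome-wide association studies often include cell-type proportions as covariates in a linear model to find differentially methylated positions associated with the phenotype, but such methods struggle to pinpoint which cell type drives a given differentially methylated position. This paper introduces a simple and effective method for detecting differential methylation: by adding interaction terms between the phenotype and the cell-type proportions within the existing statistical framework, it can identify the specific cell type responsible for a methylation change. In simulation studies the method performs well, reaching over 90% sensitivity and specificity.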
Abstract:
An outstanding challenge of epigenome-wide association studies (EWASs) performed in complex tissues is the identification of the specific cell type(s) responsible for the observed differential DNA methylation. Here we present a statistical algorithm called CellDMC ( https://github.com/sjczheng/EpiDISH ), which can identify differentially methylated positions and the specific cell type(s) driving the differential methylation. We validated CellDMC on in silico mixtures of DNA methylation data generated with different technologies, as well as on real mixtures from epigenome-wide association and cancer epigenome studies. CellDMC achieved over 90% sensitivity and specificity in scenarios where current state-of-the-art methods did not identify differential methylation. By applying CellDMC to an EWAS performed in buccal swabs, we identified smoking-associated differentially methylated positions occurring in the epithelial compartment, which we validated in smoking-related lung cancer. CellDMC may be useful in the identification of causal DNA-methylation alterations in disease.
11.
Vincent
(2023-08-31 23:50):
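#paper https://doi.org/10.48550/arXiv.2306.03301. arXiv 2023, Estimating Conditional Mutual Information for Dynamic Feature Selection. Dynamic feature selection involves learning a feature selection policy and making predictions from arbitrary feature subsets; learning the policy is usually the hard part. This paper prioritizes features by their conditional mutual information with the prediction target. It trains a neural network to estimate, given the features already selected, the predictive value (conditional mutual information) of each remaining feature, and at each step adds the most informative feature to the current set. The procedure iterates until a stopping criterion is met (for example a feature budget, an uncertainty threshold, or a cost limit). The framework can also incorporate prior information. The paper shows good results on both tabular and image datasets.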
arXiv,
2023.
DOI: 10.48550/arXiv.2306.03301
Abstract:
Dynamic feature selection, where we sequentially query features to make accurate predictions with a minimal budget, is a promising paradigm to reduce feature acquisition costs and provide transparency into the prediction process. The problem is challenging, however, as it requires both making predictions with arbitrary feature sets and learning a policy to identify the most valuable selections. Here, we take an information-theoretic perspective and prioritize features based on their mutual information with the response variable. The main challenge is learning this selection policy, and we design a straightforward new modeling approach that estimates the mutual information in a discriminative rather than generative fashion. Building on our learning approach, we introduce several further improvements: allowing variable feature budgets across samples, enabling non-uniform costs between features, incorporating prior information, and exploring modern architectures to handle partial input information. We find that our method provides consistent gains over recent state-of-the-art methods across a variety of datasets.
12.
Vincent
(2023-07-31 14:42):
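#paper Deep learning-based prediction of the T cell receptor–antigen binding specificity https://doi.org/10.1038/s42256-021-00383-2 2021 Nature Machine Intelligence. Tumor neoantigens play an important role in T cell recognition of tumor cells, and predicting the binding and interaction between neoantigens and T cell receptors has long attracted attention; however, existing experimental and computational methods have many shortcomings and are hard to validate. This paper develops pMTnet, a transfer learning-based machine learning method that predicts the binding between peptide-MHC complexes and T cell receptors. Applying pMTnet to human tumor genomics data, the authors find that neoantigens are generally more immunogenic than self-antigens, and that patients whose T cell clones bind neoantigens strongly have better prognosis and treatment response under immunotherapy.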
IF: 18.800, Q1
Nature machine intelligence,
2021-Oct.
DOI: 10.1038/s42256-021-00383-2
PMID: 36003885
PMCID: PMC9396750
Abstract:
Neoantigens play a key role in the recognition of tumor cells by T cells. However, only a small proportion of neoantigens truly elicit T cell responses, and fewer clues exist as to which neoantigens are recognized by which T cell receptors (TCRs). We built a transfer learning-based model, named pMHC-TCR binding prediction network (pMTnet), to predict TCR-binding specificities of neoantigens, and T cell antigens in general, presented by class I major histocompatibility complexes (pMHCs). pMTnet was comprehensively validated by a series of analyses, and showed advance over previous work by a large margin. By applying pMTnet in human tumor genomics data, we discovered that neoantigens were generally more immunogenic than self-antigens, but HERV-E, a special type of self-antigen that is re-activated in kidney cancer, is more immunogenic than neoantigens. We further discovered that patients with more clonally expanded T cells exhibiting better affinity against truncal, rather than subclonal, neoantigens, had more favorable prognosis and treatment response to immunotherapy, in melanoma and lung cancer but not in kidney cancer. Predicting TCR-neoantigen/antigen pairs is one of the most daunting challenges in modern immunology. However, we achieved an accurate prediction of the pairing only using the TCR sequence (CDR3β), antigen sequence, and class I MHC allele, and our work revealed unique insights into the interactions of TCRs and pMHCs in human tumors using pMTnet as a discovery tool.
13.
Vincent
(2023-06-30 15:00):
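#paper https://www.nature.com/articles/s41467-018-04608-8, Nature Communications 2018, Exploring patterns enriched in a dataset with contrastive principal component analysis
PCA (principal component analysis) maps high-dimensional data to a low-dimensional space and is the most widely used tool for data exploration and visualization. However, PCA (like other dimension reduction methods such as t-SNE and UMAP) analyzes only one dataset at a time. With several datasets, and especially when looking for signal specific to one of them, PCA forces the analyst to compare projections of the different datasets by hand to find similarities and differences. This paper proposes a simple and effective dimension reduction method for this problem: contrastive PCA (cPCA). It seeks a projection along which the target dataset differs as much as possible from the background dataset (high variance in the target, low variance in the background), thereby enriching the target-specific signal. Its principle and implementation are similar to PCA, and the experiments show that it recovers target-specific signals that PCA misses. The paper also details the method's theoretical foundations and geometric interpretation, and notes that it can be used in many settings where PCA is used today.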
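A minimal sketch of the idea: take the leading eigenvector of the target covariance minus a multiple `alpha` of the background covariance, where `alpha = 0` reduces to ordinary PCA on the target. The contrast parameter `alpha`, the helper names, the shifted power iteration, and the toy two-dimensional data are assumptions of this example; it is not the authors' released implementation.

```python
import math
import random


def covariance(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def top_eigenvector(matrix, n_iter=500):
    """Eigenvector of the largest (algebraic) eigenvalue of a symmetric matrix."""
    d = len(matrix)
    # Shift by a Gershgorin bound so every eigenvalue is non-negative; power
    # iteration then converges to the top algebraic eigenvalue's eigenvector.
    shift = max(sum(abs(x) for x in row) for row in matrix)
    shifted = [[matrix[i][j] + (shift if i == j else 0.0) for j in range(d)]
               for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(a * b for a, b in zip(row, v)) for row in shifted]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def cpca_direction(target, background, alpha):
    ct, cb = covariance(target), covariance(background)
    d = len(ct)
    contrast = [[ct[i][j] - alpha * cb[i][j] for j in range(d)] for i in range(d)]
    return top_eigenvector(contrast)


# Axis 0: large variation shared by both datasets (uninteresting).
# Axis 1: two subgroups that exist only in the target dataset.
rng = random.Random(0)
background = [[rng.gauss(0, 3), rng.gauss(0, 0.1)] for _ in range(500)]
target = [[rng.gauss(0, 3), rng.choice([-1, 1]) + rng.gauss(0, 0.1)]
          for _ in range(500)]

pca_dir = cpca_direction(target, background, alpha=0.0)   # plain PCA on target
cpca_dir = cpca_direction(target, background, alpha=2.0)  # contrastive direction
print("PCA direction:", pca_dir)
print("cPCA direction:", cpca_dir)
```

In this toy setup plain PCA follows the shared high-variance axis, while the contrastive direction follows the axis that separates the target-only subgroups.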
Abstract:
Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.
14.
Vincent
(2023-05-31 13:56):
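#paper doi: https://doi.org/10.1111/j.1467-9868.2008.00674.x Journal of the Royal Statistical Society, 2008, Sure independence screening for ultrahigh dimensional feature space. High-dimensional data face two major difficulties: accuracy of parameter estimation and computational burden. Earlier methods (such as the Dantzig selector) are still not effective enough for ultrahigh-dimensional data, where the dimensionality p can grow exponentially with the sample size n. This paper proposes a feature screening method based on correlation learning that reduces the dimension from ultrahigh to a moderate scale (below n). It shows that, in a fairly general asymptotic framework, correlation learning has the sure screening property. As an extension, the paper also proposes an iterative version of the screening that improves accuracy in finite samples. Once the dimension has been reduced, other variable selection methods such as the lasso can be applied, giving more accurate and faster variable selection.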
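The screening step itself is short: rank features by the absolute marginal correlation with the response and keep the top few, fewer than the sample size. The sketch below illustrates that step only; the function names, the simulated data, and the choice of keeping n / log(n) features are assumptions of this example, and the iterative variant is not shown.

```python
import math
import random


def abs_correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def sure_independence_screening(columns, y, keep):
    """Indices of the `keep` features most correlated (marginally) with y."""
    scores = [abs_correlation(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# p = 500 features but only n = 100 samples; only features 7 and 42 matter.
rng = random.Random(1)
n, p = 100, 500
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2 * columns[7][i] - 2 * columns[42][i] + rng.gauss(0, 0.5) for i in range(n)]

keep = int(n / math.log(n))  # assumed screening size, below the sample size
selected = sure_independence_screening(columns, y, keep)
print(keep, selected)
```

After screening, the surviving features are few enough that a penalized method such as the lasso could be fitted on them, which is the two-stage use the note describes.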
Abstract:
Summary. Variable selection plays an important role in high dimensional statistical modelling which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, accuracy of estimation and computational cost are two top concerns. Recently, Candes and Tao have proposed the Dantzig selector using L1-regularization and showed that it achieves the ideal risk up to a logarithmic factor log(p). Their innovative procedure and remarkable result are challenged when the dimensionality is ultrahigh as the factor log(p) can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method that is based on correlation learning, called sure independence screening, to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, iterative sure independence screening is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be accomplished by a well-developed method such as smoothly clipped absolute deviation, the Dantzig selector, lasso or adaptive lasso. The connections between these penalized least squares methods are also elucidated.
15.
Vincent
(2023-04-30 15:13):
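#paper doi: https://www.nature.com/articles/s41576-023-00586-w Best practices for single-cell analysis across modalities. Nature Reviews Genetics, 2023. This review comes from Fabian Theis's group and is an excellent guide to single-cell analysis. It covers several technologies (scRNA-seq, scATAC-seq, scTCR/BCR, spatial transcriptomics) and, for each, describes the full analysis workflow and the best current methods. For scRNA-seq, for example, it covers raw data processing, filtering and removal of contamination, normalization and batch-effect correction, dimension reduction and clustering with cell-type annotation, pseudotime (trajectory) and RNA velocity analysis, differential expression analysis, compositional analysis, and cell-cell communication analysis. For each step it summarizes current best practice (where independent benchmarks exist) or offers recommendations (where they do not). With new single-cell methods appearing constantly, the paper is a very useful guide and summary, and I highly recommend it to anyone doing single-cell analysis.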
Abstract:
Recent advances in single-cell technologies have enabled high-throughput molecular profiling of cells across modalities and locations. Single-cell transcriptomics data can now be complemented by chromatin accessibility, surface protein expression, adaptive immune receptor repertoire profiling and spatial information. The increasing availability of single-cell data across modalities has motivated the development of novel computational methods to help analysts derive biological insights. As the field grows, it becomes increasingly difficult to navigate the vast landscape of tools and analysis steps. Here, we summarize independent benchmarking studies of unimodal and multimodal single-cell analysis across modalities to suggest comprehensive best-practice workflows for the most common analysis steps. Where independent benchmarks are not available, we review and contrast popular methods. Our article serves as an entry point for novices in the field of single-cell (multi-)omic analysis and guides advanced users to the most recent best practices.
16.
Vincent
(2023-03-31 15:34):
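#paper https://doi.org/10.48550/arXiv.1904.10098 ICML 2019 DAG-GNN: DAG Structure Learning with Graph Neural Networks. Structure learning for directed acyclic graphs (DAGs) is very challenging because the search space grows super-exponentially with the number of nodes. A common strategy is to recast structure learning as a score optimization problem. To keep the problem tractable, traditional methods usually assume a linear structural equation model (linear SEM). Building on the linear SEM framework, this paper develops a DAG learning method based on a variational autoencoder (VAE) and a graph neural network (GNN). Thanks to the nonlinear capacity of neural networks, the method is designed to do at least as well as linear SEM while also handling some nonlinear problems. On simulated and real data, the paper shows that the method is more accurate than linear SEM and has a lower false discovery rate.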
arXiv,
2019.
DOI: 10.48550/arXiv.1904.10098
Abstract:
Learning a faithful directed acyclic graph (DAG) from samples of a joint distribution is a challenging combinatorial problem, owing to the intractable search space superexponential in the number of graph nodes. A recent breakthrough formulates the problem as a continuous optimization with a structural constraint that ensures acyclicity (Zheng et al., 2018). The authors apply the approach to the linear structural equation model (SEM) and the least-squares loss function that are statistically well justified but nevertheless limited. Motivated by the widespread success of deep learning that is capable of capturing complex nonlinear mappings, in this work we propose a deep generative model and apply a variant of the structural constraint to learn the DAG. At the heart of the generative model is a variational autoencoder parameterized by a novel graph neural network architecture, which we coin DAG-GNN. In addition to the richer capacity, an advantage of the proposed model is that it naturally handles discrete variables as well as vector-valued ones. We demonstrate that on synthetic data sets, the proposed method learns more accurate graphs for nonlinearly generated samples; and on benchmark data sets with discrete variables, the learned graphs are reasonably close to the global optima. The code is available at \url{this https URL}.
17.
Vincent
(2023-02-28 19:08):
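#paper DOI: https://doi.org/10.1038/s41592-021-01205-4 DOME: recommendations for supervised machine learning validation in biology. Nat Methods 2021. Machine learning is becoming increasingly important in biology. Ideally, machine learning predictions would be validated by biological experiments, but most papers today include no experimental validation and report only computational metrics of model performance. Such metrics are sensitive to many choices (for example dataset selection, the train/test split, and class balance), so the conclusions may not be stable or reliable. This commentary calls on the field to establish a standard for writing up and reporting machine learning studies, to make communication about machine learning applications more efficient. It lists many factors that affect model performance under four headings (data, optimization (the algorithm), model, and evaluation) and recommends that authors of machine learning papers address the questions under each heading and describe their methods in detail, to make peer review more efficient and improve transparency and reproducibility.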
Abstract:
No abstract available.
18.
Vincent
(2023-01-31 14:45):
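#paper doi:https://doi.org/10.1186/s13059-021-02388-x Gene set enrichment analysis for genome-wide DNA methylation data. Genome Biology 2021. Methylation arrays cost less than WGBS and are widely used to measure DNA methylation. Past work has focused on array data processing and differential methylation analysis, with less attention to gene set enrichment analysis. This paper proposes gene set enrichment methods that build on differential methylation results: GOmeth (for probe-level results) and GOregion (for region-level results). Specifically, CpG sites are not evenly distributed across the genome, and different genes have different numbers of nearby CpG sites. As a result, when genes are selected for enrichment analysis by their proximity to differentially methylated CpGs, genes with more CpGs are more likely to be selected, which biases the enrichment analysis. In addition, a single CpG site can lie near several genes (about 8% of all sites), so the differentially methylated genes are not obtained independently, which introduces further bias. The proposed methods adjust the CpG weights and the statistical distribution used in the enrichment test. Using simulation and resampling, the paper examines how these two biases affect gene set enrichment analysis and shows that the proposed methods control the false discovery rate (FDR) well while yielding more biologically meaningful pathway results.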
Abstract:
DNA methylation is one of the most commonly studied epigenetic marks, due to its role in disease and development. Illumina methylation arrays have been extensively used to measure methylation across the human genome. Methylation array analysis has primarily focused on preprocessing, normalization, and identification of differentially methylated CpGs and regions. GOmeth and GOregion are new methods for performing unbiased gene set testing following differential methylation analysis. Benchmarking analyses demonstrate GOmeth outperforms other approaches, and GOregion is the first method for gene set testing of differentially methylated regions. Both methods are publicly available in the missMethyl Bioconductor R package.
19.
Vincent
(2022-12-31 17:51):
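#paper DNA methylation aging clocks: challenges and recommendations, Genome Biology, 2019, https://doi.org/10.1186/s13059-019-1824-y Aging usually comes with disease, and understanding why and how humans age is a major question in biology. Aging is accompanied by molecular changes. Over the past decade, many studies have found that methylation levels at a subset of CpG sites in the genome can predict age accurately; such a set of CpG sites is called an epigenetic clock. The prediction error of epigenetic clocks has also been linked to disease incidence and mortality, which has drawn wide interest. This review summarizes seven major challenges in the field and, for each, describes the state of research, the open uncertainties, and recommendations for future work: 1. separating the chronological and biological components of epigenetic clocks; 2. functional studies of tissue-specific or disease-specific clocks; 3. integrating epigenetics into large longitudinal population studies; 4. genome-wide analyses of aging and exploration of other epigenetic marks; 5. single-cell omics analyses of aging and disease; 6. robust generation of aging data in other species; 7. integrating epigenetics into the ethical and legal frameworks that exist for genetics. Personally, I found the paper of average quality.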
Abstract:
Epigenetic clocks comprise a set of CpG sites whose DNA methylation levels measure subject age. These clocks are acknowledged as a highly accurate molecular correlate of chronological age in humans and other vertebrates. Also, extensive research is aimed at their potential to quantify biological aging rates and test longevity or rejuvenating interventions. Here, we discuss key challenges to understand clock mechanisms and biomarker utility. This requires dissecting the drivers and regulators of age-related changes in single-cell, tissue- and disease-specific models, as well as exploring other epigenomic marks, longitudinal and diverse population studies, and non-human models. We also highlight important ethical issues in forensic age determination and predicting the trajectory of biological aging in an individual.
20.
Vincent
(2022-11-30 19:09):
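#paper https://doi.org/10.1038/s41467-020-15298-6 Nature Communications, 2020, Integrative differential expression and gene set enrichment analysis using summary statistics for scRNA-seq studies. Differential expression (DE) analysis and gene set enrichment (GSE) analysis are the two most common analyses in single-cell studies, but they are usually run separately. Because single-cell data are noisy, separate analysis reduces statistical power and leads to inconsistent results across datasets (or across methods applied to the same data). The two analyses are in fact tightly linked: DE results are the basis for GSE analysis, and GSE analysis can in turn inform DE analysis (genes are not independent, so if one gene is differentially expressed, related genes may be too). Analyzing them jointly can therefore raise power and make results more robust and reproducible. This paper proposes iDEA, which uses a hierarchical Bayesian model to integrate DE and GSE analysis. In simulations and real data analyses, iDEA shows higher power than existing DE or GSE methods, more consistent DE results, and more accurate GSE conclusions.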
Abstract:
Differential expression (DE) analysis and gene set enrichment (GSE) analysis are commonly applied in single cell RNA sequencing (scRNA-seq) studies. Here, we develop an integrative and scalable computational method, iDEA, to perform joint DE and GSE analysis through a hierarchical Bayesian framework. By integrating DE and GSE analyses, iDEA can improve the power and consistency of DE analysis and the accuracy of GSE analysis. Importantly, iDEA uses only DE summary statistics as input, enabling effective data modeling through complementing and pairing with various existing DE methods. We illustrate the benefits of iDEA with extensive simulations. We also apply iDEA to analyze three scRNA-seq data sets, where iDEA achieves up to five-fold power gain over existing GSE methods and up to 64% power gain over existing DE methods. The power gain brought by iDEA allows us to identify many pathways that would not be identified by existing approaches in these data.