白鸟
(2025-11-21 11:38):
#paper DOI: 10.1126/science.ade2574, Evolutionary-scale prediction of atomic-level protein structure with a language model, 2023.
The paper introduces the ESM-2 and ESMFold models. Key features: single-sequence input and fast prediction. Scaling a Transformer protein language model up to 15 billion parameters, the authors built the sequence-to-structure predictor ESMFold, an order of magnitude faster than prior high-resolution methods, and used it to construct the ESM Metagenomic Atlas database.
Underlying idea: proteins have accumulated vast numbers of mutations over evolution, so a sequence encodes "tolerable mutation patterns"; from these evolutionary patterns the model can infer 3D structure and function.
How it works: a Transformer learns the internal rules of protein sequences via unsupervised learning. The training objective is masked language modeling (MLM): by predicting masked amino acids, the model indirectly learns evolutionary patterns. As the language model scales to 15 billion parameters and the data grows, atomic-level structural information gradually emerges in its representations.
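The MLM objective described above can be sketched minimally: corrupt a sequence by masking random residues, and train the model to recover them. This toy function (mask rate, token name, and example sequence are all illustrative, not the paper's exact tokenizer) shows only the masking step, which produces the prediction targets:

```python
import random

def mask_sequence(seq, mask_rate=0.15, mask_token="<mask>", rng=None):
    """BERT-style masking for a protein sequence.

    Returns the corrupted token list and a dict mapping each masked
    position to its original residue -- the targets the language
    model is trained to predict.
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    tokens = list(seq)
    targets = {}
    for i, aa in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = aa          # remember the true residue
            tokens[i] = mask_token   # hide it from the model
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# a language model sees `tokens` and is scored on recovering `targets`
```

The structural signal arises because recovering a masked residue well requires modeling which substitutions its structural context tolerates.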
Reflections on the model:
1. Calibration matters: no one can manually inspect every prediction, so the model's confidence calibration becomes critical.
2. Algorithmic limits: learning structure and function only "indirectly" from sequence is the ceiling of unsupervised learning; the model carries no explicit 3D information, "guessing" structure from sequence statistics and outputting only embeddings.
3. Data bias: most of ESM-2's training data comes from MGnify, so microbial sequences dominate, and marine microbiome sequences far outnumber mammalian ones.
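The calibration point is operational, not academic: the atlas retains a prediction only if the model's own confidence score clears a threshold (the abstract's ">225 million predicted with high confidence"). A minimal sketch, assuming a mean-pLDDT filter; the threshold, IDs, and scores below are illustrative, not the paper's exact pipeline:

```python
def mean_plddt(per_residue_plddt):
    """Average per-residue pLDDT -- the model's self-reported confidence."""
    return sum(per_residue_plddt) / len(per_residue_plddt)

def filter_high_confidence(predictions, threshold=0.7):
    """Keep predictions whose mean pLDDT meets the threshold.

    `predictions` maps a sequence ID to its per-residue pLDDT list.
    The entries here are toy data, not real atlas records.
    """
    return {
        seq_id: scores
        for seq_id, scores in predictions.items()
        if mean_plddt(scores) >= threshold
    }

toy = {
    "seq_A": [0.92, 0.88, 0.95],  # confident prediction -> kept
    "seq_B": [0.41, 0.38, 0.52],  # low confidence -> dropped
}
kept = filter_high_confidence(toy)
```

A filter like this is only trustworthy if the confidence score is calibrated, i.e. high pLDDT actually correlates with low structural error.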
Author's vision:
1. Eventually understand the structure of every protein discovered in gene sequencing experiments.
2. The model has not yet saturated in parameters, sequence data, or compute; higher-order structural information may yet emerge at larger scale.
Science, 2023-03-17. DOI: 10.1126/science.ade2574. Evolutionary-scale prediction of atomic-level protein structure with a language model.
Abstract:
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.