James
(2023-04-21 10:41):
#paper Ali Madani, Ben Krause, Eric R Greene, Subu Subramanian, Benjamin P Mohr, James M Holton, Jose Luis Olmos Jr, Caiming Xiong, Zachary Z Sun, Richard Socher, James S Fraser, Nikhil Naik. Large language models generate functional protein sequences across diverse families. PMID: 36702895 DOI: 10.1038/s41587-022-01618-2. The paper builds ProGen, an LLM-like deep-learning model trained on 280 million protein sequences from more than 19,000 families. It can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. On the same day the paper appeared in Nature Biotechnology, Profluent Bio, the company founded by first author Ali Madani, announced a $9 million seed round led by Insight Partners. The funding will be used to build a wet lab in Berkeley, California, enabling Profluent to create a tight feedback loop between experimentally generated data and its AI systems, providing strong validation for designing any protein and continuously improving their AI.
Large language models generate functional protein sequences across diverse families
Abstract:
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.