小W (2023-08-31 22:27):
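#paper doi:https://doi.org/10.1038/s41586-023-06291-2 Large language models encode clinical knowledge. This is a Google paper introducing a medical LLM (large language model). The authors make the following contributions: 1. They present MultiMedQA, a medical question-answering benchmark that combines medical exam, research, and doctor-patient Q&A data. 2. They propose a framework for human evaluation of medical LLMs along axes including scientific grounding, comprehension and reasoning, answer accuracy and completeness, and harm from misdiagnosis. 3. Starting from Flan-PaLM, they use instruction prompt tuning to adapt the model to new knowledge, producing Med-PaLM. 4. They evaluate PaLM, Flan-PaLM, and Med-PaLM; on several of these metrics Med-PaLM substantially narrows the gap to clinicians. I have not yet found a way to try it out.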
IF: 50.500 (Q1). Nature, 2023-Aug. DOI: 10.1038/s41586-023-06291-2 PMID: 37438534
Large language models encode clinical knowledge
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
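The abstract describes instruction prompt tuning only as "a parameter-efficient approach for aligning LLMs to new domains using a few exemplars". As a rough illustration of why prompt tuning in general is parameter-efficient, here is a minimal sketch. It assumes the usual soft-prompt setup, where a small set of learned vectors is prepended to the frozen model's input embeddings and only those vectors are trained. All sizes, names (`build_model_input`, `soft_prompt`), and token ids are hypothetical, and no training loop is shown; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (assumption, not from the paper).
VOCAB_SIZE, EMBED_DIM, PROMPT_LEN = 100, 16, 4

# Frozen token-embedding table, standing in for the pretrained LLM's weights.
frozen_embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
frozen_embeddings.setflags(write=False)  # never updated during tuning

# The only trainable parameters: PROMPT_LEN soft prompt vectors.
soft_prompt = rng.normal(scale=0.01, size=(PROMPT_LEN, EMBED_DIM))


def build_model_input(token_ids, soft_prompt):
    """Prepend the learned soft prompt to the frozen embeddings of the text."""
    token_vectors = frozen_embeddings[np.asarray(token_ids)]
    return np.concatenate([soft_prompt, token_vectors], axis=0)


# Hypothetical token ids for "instruction + few-shot exemplars + question".
hard_prompt_ids = [5, 17, 42, 8, 99, 3]
model_input = build_model_input(hard_prompt_ids, soft_prompt)

trainable = soft_prompt.size
frozen = frozen_embeddings.size
print(model_input.shape)
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

In this toy setup only the 64 soft-prompt values would receive gradient updates, while the 1,600 embedding values (and, in a real model, every other weight) stay fixed. That is the sense in which a handful of exemplars can be enough to adapt the model.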