Vincent (2025-03-31 16:09):
#paper doi: https://doi.org/10.48550/arXiv.2503.00096 BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology — Large language models show significant potential for accelerating scientific discovery, but LLM-based agents applied to bioinformatics have so far lacked systematic evaluation. This paper compiles over 50 real-world scenarios with nearly 300 open-ended questions to measure how well LLM-based agents solve complex bioinformatics problems. The authors tested two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) and found that both achieve low accuracy on open-ended questions, and perform no better than random guessing on multiple-choice questions. The paper's contribution is the test cases and evaluation framework, laying the groundwork for building better-performing agents.
arXiv, 2025-02-28T18:47:57Z. DOI: 10.48550/arXiv.2503.00096
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
Abstract:
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge tasks, towards more practical work such as literature review and experimental planning. Bioinformatics is a domain where fully autonomous AI-driven discovery may be near, but no extensive benchmarks for measuring progress have been introduced to date. We therefore present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis with nearly 300 associated open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses. We evaluate the performance of two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) using a custom agent framework we open source. We find that even the latest frontier models only achieve 17% accuracy in the open-answer regime, and no better than random in a multiple-choice setting. By exposing the current limitations of frontier models, we hope BixBench can spur the development of agents capable of conducting rigorous bioinformatic analysis and accelerate scientific discovery.