Vincent (2025-03-31 16:09):
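#paper doi: https://doi.org/10.48550/arXiv.2503.00096 BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology. Large language models have shown significant potential for accelerating scientific discovery, but the use of LLM-based agents in bioinformatics has so far lacked systematic evaluation. This paper assembles over 50 real-world scenarios with nearly 300 open-answer questions to measure how well LLM-based agents solve complex bioinformatics problems. The authors tested two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) and found that both had low accuracy on the open-answer questions and did no better than random guessing on the multiple-choice questions. The paper's contribution is the set of test cases and the evaluation framework, which lay the groundwork for building better-performing agents.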
arXiv, 2025-02-28T18:47:57Z. DOI: 10.48550/arXiv.2503.00096
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology
Ludovico Mitchener, Jon M Laurent, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, Samuel G Rodriques
Abstract:
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. Existing benchmarks for measuring this potential and guiding future development continue to evolve from pure recall and rote knowledge tasks, towards more practical work such as literature review and experimental planning. Bioinformatics is a domain where fully autonomous AI-driven discovery may be near, but no extensive benchmarks for measuring progress have been introduced to date. We therefore present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis with nearly 300 associated open-answer questions designed to measure the ability of LLM-based agents to explore biological datasets, perform long, multi-step analytical trajectories, and interpret the nuanced results of those analyses. We evaluate the performance of two frontier LLMs (GPT-4o and Claude 3.5 Sonnet) using a custom agent framework we open source. We find that even the latest frontier models only achieve 17% accuracy in the open-answer regime, and no better than random in a multiple-choice setting. By exposing the current limitations of frontier models, we hope BixBench can spur the development of agents capable of conducting rigorous bioinformatic analysis and accelerate scientific discovery.