Vincent (2025-02-28 18:53):
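#paper https://doi.org/10.1038/s41586-024-08328-6 Nature. 2025. Accurate predictions on small data with a tabular foundation model. For the past two decades, prediction on tabular data has been dominated by gradient-boosted decision trees. This paper develops a tabular foundation model based on a generative transformer. The model uses a unified embedding to represent numerical and categorical features, captures complex interactions between features through self-attention, and is pretrained at scale on millions of synthetic datasets, which markedly improves its ability to adapt to new tasks. Experiments show that on multiple real-world small datasets the model beats traditional gradient-boosted decision trees and other common deep learning baselines in both predictive accuracy and training efficiency. The study also uses quantitative, qualitative and interpretability analyses to verify the model's multi-task capabilities in fine-tuning, data generation, density estimation and representation learning. Although the model shows a clear advantage in small-data settings, questions such as the diversity of real-world data distributions, scaling to higher-dimensional data, and understanding the model's theoretical foundations remain open for further research.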
IF: 50.500 (Q1). Nature, 2025-1-9. DOI: 10.1038/s41586-024-08328-6. PMID: 39780007. PMCID: PMC11711098
Accurate predictions on small data with a tabular foundation model
Abstract:
Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science [1,2]. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories [3–5], gradient-boosted decision trees [6–9] have dominated tabular data for the past 20 years. Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. In 2.8 s, TabPFN outperforms an ensemble of the strongest baselines tuned for 4 h in a classification setting. As a generative transformer-based foundation model, this model also allows fine-tuning, data generation, density estimation and learning reusable embeddings. TabPFN is a learning algorithm that is itself learned across millions of synthetic datasets, demonstrating the power of this approach for algorithm development. By improving modelling abilities across diverse fields, TabPFN has the potential to accelerate scientific discovery and enhance important decision-making in various domains.