Vincent
(2023-12-31 21:15):
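#paper doi: 10.1126/science.adi6000 Prediction-powered inference, Science 2023. In many fields, labeled (gold-standard) data are scarce while unlabeled data are abundant, and drawing rigorous statistical conclusions from such data remains challenging. The traditional approach runs inference on the small gold-standard set alone: the results are valid, but the small sample size limits how many discoveries can be made. An alternative is to label the unlabeled data with a predictive model and run inference on the imputed data together with the gold-standard data. This gives a large sample but implicitly assumes the model is perfect, which often does not hold; accumulated prediction error and bias can lead to invalid conclusions. This paper proposes a general framework that uses a predictive model while still guaranteeing valid statistical conclusions. The framework has three steps: 1. choose the parameter to estimate; 2. estimate a measure of fit from the unlabeled data and a correction term (the rectifier) from the labeled data; 3. combine the measure of fit and the correction term to obtain a confidence interval for the parameter. The paper proves mathematically that, for any prediction algorithm and any data distribution, the resulting confidence interval covers the true value with probability reaching the specified confidence level. Because the method can draw on a larger sample, the paper's data analyses also show that it yields narrower confidence intervals and more powerful p-values than the traditional approach.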
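To make the three steps concrete, here is a minimal sketch for the simplest case, estimating a population mean. It is not the authors' code: the function names `ppi_mean_ci` and `classical_mean_ci`, the normal-approximation interval, and the synthetic data (labels with mean 2.0, predictions biased by +0.3) are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Prediction-powered confidence interval for a mean (illustrative sketch)."""
    y = np.asarray(y_labeled, dtype=float)
    f_lab = np.asarray(pred_labeled, dtype=float)
    f_unl = np.asarray(pred_unlabeled, dtype=float)
    n, N = len(y), len(f_unl)

    # Step 2a: measure of fit, here the mean prediction on the unlabeled data.
    fit = f_unl.mean()
    # Step 2b: correction term, the mean prediction error on the labeled data.
    errors = f_lab - y
    correction = errors.mean()

    # Step 3: corrected point estimate plus a normal-approximation interval.
    estimate = fit - correction
    se = np.sqrt(f_unl.var(ddof=1) / N + errors.var(ddof=1) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return float(estimate - z * se), float(estimate + z * se)


def classical_mean_ci(y_labeled, alpha=0.05):
    """Baseline: interval from the gold-standard labels only."""
    y = np.asarray(y_labeled, dtype=float)
    se = np.sqrt(y.var(ddof=1) / len(y))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return float(y.mean() - z * se), float(y.mean() + z * se)


# Synthetic data, made up for illustration: predictions carry a +0.3 bias.
rng = np.random.default_rng(0)
y_lab = rng.normal(2.0, 1.0, size=200)
y_unl = rng.normal(2.0, 1.0, size=20_000)  # true labels, never seen in practice
pred_lab = y_lab + 0.3 + rng.normal(0.0, 0.3, size=200)
pred_unl = y_unl + 0.3 + rng.normal(0.0, 0.3, size=20_000)

print("classical (labels only):", classical_mean_ci(y_lab))
print("prediction-powered     :", ppi_mean_ci(y_lab, pred_lab, pred_unl))
print("naive imputation mean  :", pred_unl.mean())  # inherits the model's bias
```

The correction term is what removes the model's bias: the naive imputation mean is shifted by roughly the injected 0.3, while the prediction-powered estimate subtracts the error measured on the labeled set.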
Prediction-powered inference
Abstract:
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients without making any assumptions about the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference were demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.
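The claim that more accurate predictions give smaller intervals can be checked with a small calculation for the mean-estimation case. This is a sketch under stated assumptions, not something taken from the paper: it assumes predictions equal the true label plus independent noise with standard deviation `sigma_eps`, and the helper names `ppi_half_width` and `classical_half_width` are hypothetical.

```python
import math
from statistics import NormalDist


def ppi_half_width(sigma_y, sigma_eps, n, N, alpha=0.05):
    """Half-width of a prediction-powered interval for a mean.

    Assumes prediction = label + independent noise, so that
    Var(prediction) = sigma_y**2 + sigma_eps**2 and
    Var(prediction - label) = sigma_eps**2.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt((sigma_y**2 + sigma_eps**2) / N + sigma_eps**2 / n)


def classical_half_width(sigma_y, n, alpha=0.05):
    """Half-width of the interval built from the n labeled points only."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(sigma_y**2 / n)


n, N = 200, 20_000  # labeled and unlabeled sample sizes (illustrative values)
print("classical:", classical_half_width(1.0, n))
for sigma_eps in (0.0, 0.3, 1.0, 2.0):
    print("sigma_eps =", sigma_eps, "->", ppi_half_width(1.0, sigma_eps, n, N))
```

Under these assumptions the half-width grows with the prediction noise. In this plain form, very noisy predictions can even produce an interval wider than the classical one, although it stays valid.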