Literature Collection and Sharing Platform

龙海晨 (2026-03-08 23:17):

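#paper Scancar B, Byrne JA, Causeur D, Barnett AG. Machine learning based screening of potential paper mill publications in cancer research: methodological and cross sectional study. BMJ. 2026 Jan 29;392:e087581. doi: 10.1136/bmj-2025-087581. PMID: 41611528; PMCID: PMC12853418.

This paper was published in BMJ, a major journal, but I think its study design has an obvious flaw. I am not the only one: from what I found, scholars in many countries think the same. Let me share the paper first and then explain what I see as the flaw. The authors built a machine learning model and trained it on cancer research papers from paper mills and on genuine cancer research papers. The reported result is that more than 170 000 cancer research papers from China were flagged as paper mill products, which is 36% of Chinese cancer research articles. That is the paper's result; here is my rebuttal. In their results, the more a country is native English-speaking, the lower its paper mill proportion, and the more a country is non-native, the higher its proportion. Where would paper mills get that much output? What actually happens is that when those of us in non-English-speaking countries write a paper, we find a few papers by leading researchers and imitate their tone and structure. The AI used by paper mills imitates as well. Failing to rule out this factor is laughable.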

BMJ, 2026-1-29. DOI: 10.1136/bmj-2025-087581

Machine learning based screening of potential paper mill publications in cancer research: methodological and cross sectional study

Baptiste Scancar, Jennifer A Byrne, David Causeur, Adrian G Barnett

Abstract:

Objectives: To train and validate a machine learning model to distinguish paper mill publications from genuine cancer research articles, and to screen the cancer research literature to assess the prevalence of papers that have textual similarities to paper mill papers.

Design: Methodological and cross sectional study applying a BERT (bidirectional encoder representations from transformers) based, text classification model to article titles and abstracts.

Setting: Retracted paper mill publications listed in the Retraction Watch database were used for model training. The cancer research corpus was screened by the model using the PubMed database restricted to original cancer research articles published between 1999 and 2024.

Population: The model was trained on 2202 retracted paper mill papers and validated on independent data collected by image integrity experts. 2.6 million cancer research papers were screened.

Main outcome measures: Classification performance of the model. Prevalence of papers flagged as similar to retracted paper mill publications with 95% confidence intervals and their distribution over time, by country, publisher, cancer type, research area, and within high impact journals (top 10%).

Results: The model achieved an accuracy of 0.91. When applied to the cancer research literature, it flagged 261 245 of 2 647 471 papers (9.87%, 95% confidence interval 9.83 to 9.90) and revealed a large increase in flagged papers from 1999 to 2024, both across the entire corpus and in the top 10% of journals by impact factor. More than 170 000 papers affiliated with Chinese institutions were flagged, accounting for 36% of Chinese cancer research articles. Most publishers had published substantial numbers of flagged papers. Flagged papers were overrepresented in fundamental research and in gastric, bone, and liver cancer.

Conclusions: Paper mills are a large and growing problem in the cancer literature and are not restricted to low impact journals. Collective awareness and action will be crucial to address the problem of paper mill publications.
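As a quick check on the headline figure, the flagged proportion and its 95% confidence interval can be recomputed from the two counts in the Results. This is only a sketch: the abstract does not say which interval method the authors used, so the normal approximation (Wald) interval below is an assumption, and the function name `wald_interval` is made up for illustration.

```python
import math

def wald_interval(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval.

    Assumption: the abstract does not state the interval method,
    so Wald is used here purely as a plausibility check.
    """
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return p, p - z * se, p + z * se

# Counts taken from the Results section of the abstract.
flagged, screened = 261_245, 2_647_471

p, low, high = wald_interval(flagged, screened)
print(f"{100 * p:.2f}% ({100 * low:.2f} to {100 * high:.2f})")
# prints "9.87% (9.83 to 9.90)"
```

The interval is narrow only because the corpus is very large. It describes sampling uncertainty in the share of flagged papers, not whether a flagged paper actually came from a paper mill.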

Related Links:

https://doi.org/10.1136/bmj-2025-087581