Baptiste Scancar1, Jennifer A. Byrne2,3, David Causeur1 and Adrian G. Barnett4*
1IRMAR UMR 6625 CNRS, L’Institut Agro, Rennes, France
2NSW Health Statewide Biobank, NSW Health Pathology, Camperdown, Australia
3School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Camperdown, Australia
4School of Public Health & Social Work, Queensland University of Technology, Kelvin Grove, Australia
*Corresponding author; email: a.barnett{at}qut.edu.au
bioRxiv preprint DOI: https://doi.org/10.1101/2025.08.29.673016
Posted: September 03, 2025, Version 1
Copyright: This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/
Abstract
Paper mills are fraudulent organisations that publish fake manuscripts for profit. Paper mill-supported papers have been identified in the cancer research literature, where some papers show features of template use, with seemingly pre-formulated sentences. We trained a machine learning model to distinguish between paper mill papers and genuine cancer research papers published between 1999 and 2024. Using titles and abstracts, the model achieved a prediction accuracy of 0.91. When applied to the cancer research literature, it flagged 9.87% (95% CI 9.83 to 9.90) of papers and revealed a large increase in flagged papers from 1999 to 2024, both across the entire corpus and in the top 10% of journals by impact factor. Most publishers had substantial numbers of flagged papers. Over 170,000 papers by authors from Chinese institutions were flagged. These results indicate that paper mills are a large and growing problem in the cancer literature and are not restricted to low-impact journals.
Introduction
Research paper mills are ‘contract-cheating organisations which provide undeclared services to support research manuscripts and publications’1. Research paper mills fabricate and submit manuscripts for their customers. Research paper mill activity was first reported in the 2010s2 and has since increased in volume and sophistication. According to Nature, more than 400,000 papers suspected to have originated from paper mills have been published in the last 20 years3, with the paper mills earning tens of millions of dollars annually4. This issue gained visibility when Wiley – after acquiring Hindawi – retracted nearly 11,000 suspected paper mill papers and shut down 19 journals over two years5.
Research paper mills maximise their earnings by quickly producing industrial quantities of research papers6,7. To produce manuscripts at scale, fabrication has likely relied on templates with pre-made sentences where domain-specific terms vary8. Suspect papers can include incorrect reagents9, fabricated data and experiments2,10, and photoshopped or re-used figures11. Paper mill papers are often generic, poorly written, lack coherence between sections8,12, and may offer superficial research justifications7. Paper mills sell manuscripts to researchers eager to increase their number of publications10, creating author groups who have never worked together or made any intellectual input13. Paper mills may even bribe editors and manipulate peer-review to facilitate publication, as shown by online discussions between researchers and likely paper mill contacts4.
Paper mill manuscripts can be simultaneously submitted to multiple journals until acceptance, wasting the time of editors and reviewers2,11,14. The reported percentage of paper mill submissions to journals ranges from 2 to 46%2. Paper mills likely target journals where fabricated manuscripts have already been accepted2, increasing their chances of success. They may also focus on high-impact journals, as the prices they charge – according to A. Abalkina’s investigations of a paper mill14 – can be directly linked to the journal’s impact factor15.
The overall prevalence of paper mill papers in biology and medicine is estimated as 3%3 but cancer research and particularly molecular oncology could be more affected1. This can be explained by high publication pressure7, a specialised field with simple-to-fake data and techniques1, and limited peer-review capacity6 – making fake papers easier to produce and harder to detect.
The rise of AI will likely exacerbate the paper mill problem, via automated image and text generation, making detection more challenging16. Some publishers are using screening tools to detect manuscripts from paper mills17. Integrity sleuths have developed independent detection methods, such as identifying awkward rewording of scientific terms, known as ‘tortured phrases’18, or nucleotide sequence reagent verification9. Paper mill manuscripts may have missing or unusual acknowledgements, funding or ethics statements11.
To our knowledge, and despite evidence of template use8,9, detection methods based on text-structure understanding have not been publicly developed. The use of templates by paper mills creates the potential to systematically screen papers and to detect paper mill use in cancer research. Previous research has focused more on predictive performance than on developing scalable screening approaches, using data from Retraction Watch – a non-profit group that records paper retractions19 – to detect retractions20,21 and paper mill products22.
We hypothesise that paper mills use manuscript templates with recurring features that can often be detected automatically. We believe the templates are being used to create whole papers, including the title and abstract. Thus, we aim to use cancer research titles and abstracts from retracted paper mill papers as input to a BERT-based (Bidirectional Encoder Representations from Transformers)23 machine learning pipeline for a text classification task. BERT learns from examples to recognise patterns in text, enabling it to identify when new papers share similarities with retracted paper mill papers. Similar techniques are commonly used, for example, to distinguish genuine emails from spam24.
The first objective was to train and evaluate our model’s ability to reliably classify retracted papers attributed to suspected paper mill activity and genuine cancer research papers. The second objective was to use our model to screen millions of cancer research papers to assess the prevalence of flagged papers (those classified by the model as similar to retracted paper mill papers) over time, across countries, publishers, and cancer research subdomains, and to examine how suspected paper mill papers have evolved in high–impact factor journals (defined as Decile 1, or D1 – top 10% journals in this study). This research aims not only to assess the potential of machine learning for detecting paper mill manuscripts, but also to raise awareness of potentially fabricated papers among stakeholders in cancer research.
Material and methods
Cancer research corpus
PubMed baseline pre-processing
To create a comprehensive cancer research dataset, the entire biomedical research corpus from the 2025 PubMed (https://pubmed.ncbi.nlm.nih.gov/) baseline was downloaded in March 2025. The following data were extracted from each of the over 38 million papers: PubMed Identifier (PMID), title, abstract, original language, journal name, journal ISSN (International Standard Serial Number), publication date, first author’s affiliation, publication type and MeSH terms (Medical Subject Headings). The data were pre-processed following the method used by González-Márquez et al.25 as follows: we excluded abstracts that were non-English, empty, truncated or unpunctuated to avoid these features influencing the language model. The text was transformed into standardised tokens (units of text such as words or punctuation marks) for analysis. Abstracts of fewer than 250 tokens or more than 4,000 tokens were removed, as these were generally non-standard abstracts and were also rare (< 1%). After this initial filtering, the research dataset contained 24.8 million papers published between 1975 and 2025.
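As an illustration of the abstract-length rule, the minimal sketch below applies the 250 to 4,000 token window using a simple word-and-punctuation tokeniser; the exact tokenisation used in the study (following González-Márquez et al.25) may differ.

```python
import re

MIN_TOKENS, MAX_TOKENS = 250, 4000  # token window used for abstract filtering

def tokenize(text: str) -> list[str]:
    # Simple tokeniser: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def keep_abstract(abstract: str) -> bool:
    """Return True if the abstract is non-empty and within the token window."""
    if not abstract:
        return False
    return MIN_TOKENS <= len(tokenize(abstract)) <= MAX_TOKENS

print(keep_abstract("Too short to be a genuine abstract."))  # False
```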
Next, the dataset was filtered to retain only papers published within the target time frame (1999 to 2024): papers prior to 1999 or after 2024 were excluded, and duplicates were removed, reducing the dataset to 20.2 million papers. Only original research papers were retained: papers listed as ‘Journal Article’ were kept, while other publication types, such as literature reviews and clinical trials, were excluded. These publication types are also targeted by paper mills but would require a separate model, as we expect paper mills to use manuscript type-specific templates. All retraction notices, corrections and expressions of concern were also removed. After this second filtering step, 17.4 million papers remained.
Cancer research filtering
The cancer research corpus was derived from the remaining papers using a two-level keyword filtering strategy. Box 1 shows the keywords searched for in titles and abstracts of the 17.4 million papers. These keywords were adapted from MeSH terms and National Cancer Institute26 terminology. The keyword matching was designed to be specific to cancer whilst also retaining the broadest coverage of cancer research papers. We acknowledge that some non-cancer-related papers may be included and that not all cancer-specific terms have been used.
Box 1:
Cancer-related keywords
astrocytoma, carcinoembryonic antigen, carcinoid, carcinogen, carcinogenesis, carcinoma, cancer, checkpoint inhibitor, chemotherapy, chordoma, ependymoma, glioblastoma, glioma, leukaemia, leukemia, lymphoma, macroglobulinaemia, macroglobulinemia, medulloblastoma, melanoma, mesothelioma, metastasis, metastatic, myelodysplastic syndrome, myeloma, myeloproliferative neoplasm, neuroblastoma, nsclc, oncogene, oncogenesis, oncology, pheochromocytoma, radiation therapy, radiotherapy, retinoblastoma, sarcoma, seminoma, tumor and tumour.
Substring matching was used, meaning that terms such as osteosarcoma were captured under broader categories like sarcoma. Papers matching multiple keywords were included only once to avoid duplicates. Both UK and US spellings of each cancer-related term were used. All levels of the search were combined to produce a final cancer research dataset of 2,647,471 papers, published across 11,632 journals. A flowchart detailing the filtering strategy from the initial PubMed baseline to the final cancer research corpus is provided in Figure 1.
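A minimal sketch of the substring-based keyword matching is shown below, with the keyword list abbreviated; lowercasing the combined title and abstract means that, for example, osteosarcoma is captured by the broader keyword sarcoma.

```python
# Abbreviated keyword list; the full list is given in Box 1.
CANCER_KEYWORDS = [
    "carcinoma", "cancer", "chemotherapy", "glioma", "leukaemia", "leukemia",
    "lymphoma", "melanoma", "metastasis", "metastatic", "myeloma", "nsclc",
    "oncogene", "oncology", "radiotherapy", "sarcoma", "tumor", "tumour",
]

def is_cancer_paper(title: str, abstract: str) -> bool:
    """Substring matching over the lowercased title and abstract."""
    text = f"{title} {abstract}".lower()
    return any(keyword in text for keyword in CANCER_KEYWORDS)

# A paper matching several keywords is counted once, as the check returns a single boolean.
print(is_cancer_paper("Osteosarcoma progression", "We studied tumour growth."))  # True
```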

Figure 1:
Flow diagram showing all preprocessing steps and exclusions applied to derive the final cancer research corpus. Cancer keywords are provided in Box 1.
Data extracted for visualisation purposes are the first author’s country of affiliation, the publisher, the type of cancer investigated, the main cancer research areas, and the SCImago Journal Impact Factor27. The type of cancer investigated and the main cancer research areas were inferred from the content of the abstract using AI labelling. All extraction methods are detailed in Supplementary File 1.
Paper mill datasets
We developed our model using two sources of paper mill papers: 1) Papers tagged as originating from paper mills in the Retraction Watch Database19; 2) Online lists compiled by research integrity experts – also called integrity sleuths – where evidence of image manipulation was found. A compilation of paper mill papers is available online in the ‘Spreadsheet of spreadsheets’ thanks to anonymous PubPeer contributors28. PubPeer29 is a website (https://pubpeer.com/) that allows users to leave post-publication comments concerning potential research integrity issues or other concerns. It has been used in research on integrity30 and has played an important role in high-profile retractions31.
We used the Retraction Watch dataset during the model training stage while the experts’ dataset was kept for further validation of the model’s performance. From the 64,457 total retractions recorded in the Retraction Watch Database as of June 2025, we removed papers without a PMID, and those that were not retractions (e.g., Expressions of Concern, Corrections, or Reinstatements). The ‘Paper Mill’ tag in the retraction reason field was used to identify 5,657 retracted publications. Only those whose PMIDs matched entries in our cancer research corpus were retained, reducing the number to 2,270 retracted papers. We excluded papers for which the original text had been replaced by the retraction notice, reducing the number of retracted paper mill papers to 2,202.
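The sketch below illustrates this filtering of the Retraction Watch export, assuming a pandas workflow; the file and column names (for example OriginalPaperPubMedID, RetractionNature and Reason) are assumptions based on the public CSV export and may need adjusting.

```python
import pandas as pd

retractions = pd.read_csv("retraction_watch.csv")             # assumed file name
corpus_pmids = set(pd.read_csv("cancer_corpus.csv")["pmid"])  # PMIDs of the cancer research corpus

paper_mill = retractions[
    retractions["OriginalPaperPubMedID"].notna()                   # must have a PMID
    & retractions["RetractionNature"].eq("Retraction")             # drop EoCs, corrections, reinstatements
    & retractions["Reason"].str.contains("Paper Mill", na=False)   # 'Paper Mill' retraction reason
]

# Keep only retracted paper mill papers that appear in the cancer research corpus.
paper_mill = paper_mill[paper_mill["OriginalPaperPubMedID"].isin(corpus_pmids)]
print(len(paper_mill))
```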
To test the model’s ability on new data, we used 3,094 papers from the integrity experts’ dataset for later model testing, excluding those that overlapped with the Retraction Watch set.
Visualisations of data used at the training stage from the Retraction Watch and the integrity experts’ sets are presented in Supplementary file 2. These show the distribution of publishers, countries of the first authors’ institutions, cancer types, and research areas among paper mill papers, as well as title unigrams, bigrams, and trigrams, which are the single words, two-word and three-word combinations most prevalent in the titles of known or suspected paper mill papers.
Model selection and training
Controls selection
Our training dataset was chosen to be balanced, with 50% paper mill papers and 50% presumed genuine papers (controls). Controls were selected from the cancer research corpus with the aim of including as few paper mill papers as possible to minimise bias and enhance training performance. Given the difficulty of assessing the genuine status of cancer research papers in large samples, control papers were selected from high impact factor journals (Decile 1) and countries underrepresented in the Retraction Watch database (using the country of the first author’s institution).
To reduce the risk of the model learning the English diction of authors from Chinese institutions rather than paper mill features, we included 101 papers (5%) in the control set authored by researchers from Chinese institutions that were published in four high-impact journals: Cell, Cancer Cell, Molecular Cell, and The EMBO Journal. We also included 600 of the most cited Taiwanese cancer research papers (28%) listed in OpenAlex32, as Taiwan is a predominantly Mandarin-speaking country that is under-represented in the Retraction Watch database (with only one recorded retraction due to paper mill involvement in the cancer research corpus). Another 33% of the control papers were randomly selected from Swedish, Finnish and Norwegian institutions in cancer research, as these countries have no recorded instances of paper mill retractions in the Retraction Watch database. The remaining 33% consisted of a random selection of papers published in Cell, Cancer Cell, Molecular Cell, and The EMBO Journal, from countries other than China, Taiwan, Norway, Sweden and Finland.
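As a sketch of this composition, the snippet below draws controls from pre-built candidate pools according to the stated shares; the pool names and the sampling function are illustrative only, not the procedure used in the study.

```python
import random

random.seed(1)

# Shares of the control set as described above (they sum to approximately 1);
# the pool names are hypothetical labels for pre-built lists of candidate PMIDs.
COMPOSITION = {
    "china_four_high_impact_journals": 0.05,
    "taiwan_most_cited": 0.28,
    "nordic_institutions": 0.33,
    "four_high_impact_journals_other_countries": 0.33,
}

def sample_controls(pools: dict[str, list[str]], n_controls: int) -> list[str]:
    """Draw control PMIDs from each candidate pool according to the target shares."""
    controls: list[str] = []
    for group, share in COMPOSITION.items():
        controls += random.sample(pools[group], round(n_controls * share))
    return controls
```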
For the testing dataset, papers from the integrity experts’ dataset were combined with an equal number of control papers, randomly sampled from authors from Swedish, Finnish and Norwegian institutions, and other papers published in the four high-impact factor journals mentioned above.
All papers in the control sets were verified as free of research integrity concerns on PubPeer. We assumed that all papers in the paper mill datasets are indeed paper mill products, and that the controls are legitimate scientific papers. However, we acknowledge that the ground truth is unknown and that, despite our efforts, some papers may have been mislabelled.
Model selection and training
We chose to use only titles and abstracts to train the model, as these data were always available (full texts are frequently behind paywalls). Each paper’s title and abstract were combined. We framed the detection of paper mill papers as a binary text classification problem – either Authentic or Fraudulent – to provide supervision signals to the model. The data were split into 75% training, 20% validation, and 5% test sets.
We selected BERT for its strong performance and relatively low computational cost, due to its moderate number of parameters23. To validate this choice, we conducted preliminary experiments with several other state-of-the-art models, including RoBERTa, BioBERT, PubMedBERT, Longformer, and Clinical Longformer33–37. These alternatives were selected to assess whether biomedical-specific pretraining or extended input capacity (up to 4,096 tokens, compared to BERT’s 512-token input limit) could enhance classification performance. Although some of these models offered theoretical advantages – such as domain-specific training or the ability to process longer text sequences – they did not outperform BERT in our empirical experiments (Supplementary file 3). We fine-tuned all of BERT’s parameters, rather than training only the classifier head.
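A minimal sketch of this fine-tuning setup, using the Hugging Face transformers library, is shown below; the base checkpoint, learning rate and batch contents are illustrative assumptions rather than the settings used in the study.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")   # assumed base checkpoint
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)           # all layers are updated

# One illustrative training step on a toy batch of labelled sentences
# (1 = resembles a retracted paper mill paper, 0 = authentic).
sentences = ["MiR-1234 promotes proliferation and invasion of cancer cells.",
             "We report a randomised trial of adjuvant therapy."]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
batch["labels"] = torch.tensor([1, 0])

model.train()
loss = model(**batch).loss   # cross-entropy loss over the two classes
loss.backward()
optimizer.step()
optimizer.zero_grad()
```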
All journal-specific formatting was removed from the abstracts. BERT’s input requirements made it necessary to split each title and abstract into individual sentences. Each sentence was labelled and fed into BERT during the training phase. During inference, predictions were made at the sentence level and final classification probabilities were obtained by averaging the positive class probabilities across the title and abstract. After optimisation, the Receiver Operating Characteristic (ROC) curve was used to find a suitable threshold for large-scale inference. The optimisation method is described in Supplementary file 1. A logistic regression model was also tested on BERT’s output probabilities to incorporate sentence ordering; however, it did not outperform the simple averaging approach and so was not used.
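The sketch below illustrates the sentence-level inference and averaging, reusing a fine-tuned model and tokenizer such as those in the previous sketch; the commented threshold is a placeholder for the cut-off selected from the ROC curve.

```python
import torch

def paper_probability(model, tokenizer, sentences: list[str]) -> float:
    """Average the positive-class (paper mill) probability over the sentences of one title + abstract."""
    model.eval()
    probs = []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
            logits = model(**inputs).logits                           # shape (1, 2)
            probs.append(torch.softmax(logits, dim=-1)[0, 1].item())  # P(paper mill) for this sentence
    return sum(probs) / len(probs)

# threshold = 0.5  # placeholder; the study chose the threshold from the ROC curve
# flagged = paper_probability(model, tokenizer, sentences) >= threshold
```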
The accuracy, sensitivity and specificity of the final model were calculated on the Retraction Watch testing set and on the integrity experts’ dataset.
Results inference and visualisation
All 2.6 million papers in the cancer research corpus published between 1999 and 2024 were screened with our fine-tuned version of BERT (a version of the model trained on a specific downstream task – in this case, classifying retracted paper mill papers). We will refer to papers that were classified as resembling retracted paper mill papers as ‘flagged papers’. The 95% confidence interval for the proportion of flagged papers was estimated via bootstrapping (1,000 resamples with replacement). Data were visualised using the ggplot2 R library38. A diagram of the study design is available in Figure 2.
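A minimal sketch of the bootstrap interval for the flagged proportion is shown below, assuming flags is a 0/1 vector with one entry per screened paper; the toy data are simulated, not the study’s results.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_ci(flags: np.ndarray, n_boot: int = 1000) -> tuple[float, float]:
    """Percentile 95% confidence interval for the proportion of flagged papers."""
    props = [rng.choice(flags, size=len(flags), replace=True).mean() for _ in range(n_boot)]
    return float(np.percentile(props, 2.5)), float(np.percentile(props, 97.5))

# Toy example: 10,000 papers with a 10% flag rate.
flags = rng.binomial(1, 0.10, size=10_000)
print(bootstrap_ci(flags))
```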

Figure 2:
Study design and workflow for large-scale detection of problematic papers in the cancer research corpus. Filtering method (*) is shown in Figure 1.
To further evaluate the model, we checked whether 873 problematic cancer research papers reported in three prior studies in which paper mill involvement was suspected39–41 were flagged by the model. None of these papers were included in the training set. No content re-evaluation was performed, and integrity-related information was not disclosed to the model. These papers, involving misidentified and/or non-verifiable nucleotide sequences (primers and other oligonucleotides used for amplification, detection, or gene knockdown) or cell lines, were retrieved from three sources: 193 cancer-related papers identified by Oste et al.39, 113 cancer-related papers in high impact factor journals – Molecular Cancer and Oncogene – reported by Pathmendra et al.40, and 567 cancer-related papers listed by Park et al.41.
Results
Model performance and errors
Training data
Retraction Watch data cover papers from 2007 to 2024, with a peak in 2022, while the integrity experts’ data cover papers from 2010 to 2024, with a peak between 2019 and 2020 (Figure S2.1). In the Retraction Watch dataset, the most frequent publisher is John Wiley & Sons (via Hindawi), followed by Spandidos Publications and Informa (Table S2.2). In the experts’ dataset, Springer Nature has the most paper mill papers, followed by John Wiley & Sons and Spandidos Publications (Table S2.4). Papers from Chinese institutions are highly represented in both datasets (Tables S2.2 and S2.4). Common topics in paper titles for both datasets, based on n-gram analysis, include microRNAs (‘mir’ or ‘microrna’ unigrams), long non-coding RNAs, and lung cancer-related terms (Tables S2.1 and S2.3). Many papers cover topics related to gene and preclinical research. The most frequent cancer types are lung, liver, colorectal, gastric, and brain cancer (Tables S2.2 and S2.4). The most prevalent research fields are cancer biology and cancer therapy, followed by cancer diagnosis (Tables S2.2 and S2.4).
Performance
The classification model achieved an accuracy of 0.91 in detecting paper mill papers in the Retraction Watch dataset (502 out of 551 papers were correctly classified as either fraudulent or genuine), with a sensitivity of 0.87 (239 out of 276 paper mill papers correctly identified) and a specificity of 0.96 (263 out of 275 genuine papers correctly identified). The titles and abstracts of three retracted paper mill papers, all correctly predicted with high probability, are presented as examples in Supplementary File 4.
In testing on unseen data for the integrity experts’ dataset, the model had a classification accuracy of 0.93 (5,771 out of 6,194) with a sensitivity of 0.87 (2,698 out of 3,094) and a specificity of 0.99 (3,073 out of 3,100).
In further test sets, the model flagged 67% (130 out of 193) of problematic papers found by Oste et al.39, flagged 75% (425 out of 567) of problematic papers found by Park et al.41, and 66% (75 out of 113) of problematic papers found by Pathmendra et al.40.
Misclassifications
False positives – control papers incorrectly predicted to resemble paper mill papers – were rare, with only 39 out of 3,375 across both the Retraction Watch and integrity experts’ datasets. This small number of false positives did not allow for the identification of generalisable patterns. In contrast, 433 out of 3,370 paper mill papers were false negatives – paper mill papers predicted as not resembling paper mill papers.
Characteristics of false negative predictions by the model are summarised in Table 1. False negatives are disproportionately associated with first authors affiliated with Chinese institutions, who represent 48% of papers in the combined datasets but account for 94% of false negatives. Gastric cancer (+4 percentage points), liver (+3), colorectal (+4), and lung cancer (+4), as well as publishers such as Rapamycin Press LLC (+4), PLoS (+2), John Wiley & Sons (+2), and Spandidos Publications (+2), are slightly overrepresented among the false negatives. Title unigrams and bigrams do not show overrepresented topics other than general cancer words in the false negatives.

Table 1.
Characteristics of false negative papers (paper mill papers missed by the model, n = 433) by publication year, publisher, first author’s country, cancer type, research area, and title n-grams (unigrams and bigrams). Only categories with a statistically significant Pearson’s chi-squared test of independence are included (p < 0.05). The proportion is shown both within the false negatives and in the overall dataset. The difference represents false negatives minus overall. For each category, the 10 most frequent values were selected for testing. Research area codes: THER – treatment development or evaluation; EPID – epidemiology and population studies.
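As an illustration of the test reported in Table 1, the snippet below runs a Pearson chi-squared test of independence on a 2×2 table comparing the country distribution of false negatives with that of all paper mill papers; the counts are hypothetical, not the study’s values.

```python
import numpy as np
from scipy.stats import chi2_contingency

#                    China  Other   (hypothetical counts, for illustration only)
table = np.array([[  400,    33],   # false negatives
                  [ 2500,  2800]])  # all paper mill papers in the combined datasets

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2g}")  # categories with p < 0.05 were reported
```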
Flagged papers in the cancer literature
After applying our model to the cancer research corpus from 1999 to 2024, there were 261,245 papers flagged as including characteristics of retracted paper mill papers – representing 9.87% (95% CI 9.83 to 9.90) of all original cancer research papers.
Trends in flagged papers
The number of flagged papers increased rapidly between 1999 and 2022, peaking in 2022 before declining slightly in 2023 and 2024 (Figure 3). The annual number of flagged papers followed an exponential trend from 1999 to 2022 (R2 for exponential fit = 0.92). While the percentage of flagged papers remained around 1% in the early 2000s, it progressively rose to exceed 15% of the annual cancer research output by the early 2020s.
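The exponential trend can be summarised by a least-squares fit of log counts against year, as sketched below with simulated counts; the R2 of 0.92 reported above refers to the study’s data, not to this toy example.

```python
import numpy as np

years = np.arange(1999, 2023)
counts = 50 * np.exp(0.20 * (years - 1999))   # simulated annual counts of flagged papers

log_counts = np.log(counts)
slope, intercept = np.polyfit(years, log_counts, deg=1)   # log-linear fit = exponential trend
fitted = intercept + slope * years
r_squared = 1 - np.var(log_counts - fitted) / np.var(log_counts)
print(round(r_squared, 2))   # 1.0 here, because the simulated counts are noise-free
```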

Figure 3:
Number of papers per year in the cancer research corpus, flagged because their titles and abstracts were similar to those of retracted paper mill papers. The percentage of flagged papers among all cancer research papers published each year is shown above the bars. Error bars are 95% confidence intervals estimated using bootstrap resampling.
Countries of flagged papers
The percentages of flagged papers per country show that papers from China were the most frequently flagged, with 35% of Chinese cancer research papers flagged (177,907 papers, Figure 4), followed by Iran, with 20% of papers flagged (6,801 papers). Papers from four other countries were also frequently flagged: Saudi Arabia (16%), Egypt (15%), Pakistan (14%) and Malaysia (13%). The United States was the second country in terms of numbers of flagged papers, with 10,511 flagged papers, representing 2% of US cancer research papers.

Figure 4:
Percentage of papers in the cancer research corpus, flagged by our model because their titles and abstracts were similar to those of retracted paper mill papers, across the 25 countries most frequently flagged based on the first author’s affiliation country. The numbers of flagged cancer research papers per country are given next to each bar.
Publishers and journals of flagged papers
The publisher Verduci Editore had the highest percentage of flagged papers, with approximately 65% in its cancer research journal – The European Review for Medical and Pharmacological Sciences (Figure 5). The second publisher in terms of percentage was International Scientific Literature, with approximately 45% of papers flagged in one journal, Medical Science Monitor (MSM). These were followed by E-Century Publishing Corporation (42%), Spandidos Publications (37%), Ivyspring International Publisher (31%), and IOS Press (30%).

Figure 5:
Percentage of papers in the cancer research corpus, flagged because their titles and abstracts were similar to those of retracted paper mill papers, across the 25 publishers with the highest numbers of flagged papers. The number of flagged papers per publisher is given next to each bar and the corresponding number of journals is shown in parentheses.
The largest publishers – such as Elsevier, Springer Nature, and John Wiley & Sons – have a relatively low percentage of flagged papers (around 8%) but account for the highest absolute numbers of flagged papers across more than 500 journals, with flagged paper numbers of 39,738 (Elsevier), 39,626 (Springer Nature), and 28,200 (John Wiley & Sons).
Cancer types of flagged papers
Among all cancer types, gastric cancer papers show the highest percentage of flagged papers, with 22% of these papers flagged (Figure 6). Bone cancers, such as osteosarcoma, come next with 21% of papers flagged, followed by liver cancer at 19%. Most cancer types fall within a range of 10 to 15% flagged papers. Breast, skin, prostate, and blood cancers show the lowest percentages of flagged papers. In terms of absolute numbers, lung (28,435) and liver (26,730) cancer account for the highest numbers of flagged papers.

Figure 6:
Percentage of cancer research papers according to cancer type, flagged because their titles and abstracts were similar to those of retracted paper mill papers. The number of flagged papers per cancer type is shown next to the bars.
Cancer research area of flagged papers
Flagged papers are largely concentrated in Cancer Biology and Fundamental Research, as well as in Treatment Development or Evaluation and Diagnosis and Prognosis, where percentages exceed 10% (Figure 7). In contrast, areas such as Survivorship, Supportive Care and end-of-life, Epidemiology and population studies, and Health Systems, Policy and Implementation had lower percentages of flagged papers (under 2%).

Figure 7:
Percentage of cancer research papers within each research area, flagged because their titles and abstracts were similar to those of retracted paper mill papers. The number of flagged papers per area is shown next to the bars. As categories were assigned using a multi-labelling tool, each paper may appear in multiple categories, and the sum of all categories does not equal the total number of flagged papers. Percentages represent the proportion of papers flagged as similar to retracted paper mill papers within each research area, calculated as (flagged papers in area / total papers in area) × 100.
Flagged cancer research papers in high impact factor journals
The percentages of flagged papers in the top 10% of journals from the cancer research corpus by impact factor (Decile 1 or D1) show a clear increase over time (Figure 8). While the percentage remained low in the early 2000s, a sustained increase occurred in the following years, reaching around 10% in 2022. The minimum impact factor required to be among the top 10% of journals for each year (cut-off impact factor) also increased, from 3 in 1999 to 7 in 2021.

Figure 8:
Percentages of cancer research papers flagged because their titles and abstracts were similar to those of retracted paper mill papers, in the top 10% of journals by impact factor, according to publication year. The minimum impact factor required to be among the top 10% of journals (cut-off impact factor) in each year is shown by a red dashed line.
Discussion
We have fine-tuned a BERT machine learning model that achieves good accuracy in flagging papers resembling retracted paper mill papers, using combined titles and abstracts from both Retraction Watch and integrity experts’ datasets. The model also flagged 72% of the papers in which incorrect nucleotide sequences or cell lines had been identified in earlier research39–41, without having access to publication text with cell line and nucleotide sequence descriptions, demonstrating the model’s ability to flag suspect papers. Nucleotide sequence information is rarely available in titles or abstracts, ruling out circular validation bias.
Applying our model to 2.6 million papers in the cancer literature shows that both the number and proportion of flagged papers have greatly increased in cancer research, notably also in high impact factor journals. Overall, 9.87% of original cancer research papers share title and abstract features with retracted paper mill papers. This proportion is higher than the previous 3% estimate of paper mill paper prevalence in biomedical research3, although those estimates were derived from different datasets, time periods, and detection criteria. Flagged papers originate from a wide range of countries – with China strongly over-represented – and have appeared in many journals and publishers. They are particularly prevalent in fundamental research and in studies on gastric, bone, and liver cancer.
Applying our model to the cancer research database provided strong indirect validation of the model’s performance. The exponential trend over time in flagged papers coincides with the known development of paper mill papers2. The prominence of China as the leading country for flagged papers is consistent with previous findings on the origins of paper mill papers9,42,43. Publishers with a high percentage of flagged papers in this study have been found to publish problematic papers43,44. Finally, suspected paper mill papers have already been identified in high-impact factor biomedical journals40.
The exponential rise of flagged papers in Figure 3 flattens after 2022. Three potential hypotheses could explain this phenomenon: publishers and the research community fighting back against paper mills; a shift by paper mills to new templates, following the rise of AI; or the known delays before PubMed lists all papers for the most recent years. The relatively low number of flagged papers prior to 2010 may reflect the distribution of the training data, which primarily includes retracted paper mill papers published between 2013 and 2023 (Supplementary file 2), rather than indicating a near complete absence of such features during these years.
Flagged papers were prevalent in studies investigating gastric, bone and liver cancers (Figure 6), and especially in fundamental research (Figure 7). While the high prevalence of gastric and liver cancers in China45 may partly explain the focus on these cancer types, their marked overrepresentation among misidentified cell lines – 25% and 15% of all such lines, respectively46 – is striking. Given that some misidentified cell lines, such as BGC-823 (gastric cancer) and BEL-7402 (liver cancer), appear almost exclusively in publications from Chinese institutions46, this pattern may also reflect vulnerabilities exploited by paper mills, where popular research topics are targeted regardless of data reliability. It may also result from inertia, as early templates were reused and adapted repeatedly in these domains. Flagged papers were more common in fundamental research, which is consistent with evidence from the literature43. However, this could be linked to the nature of the training set, which included mainly fundamental research papers (Supplementary file 2).
Figure 5 indicates that although some publishers have a relatively high percentage of flagged papers, all publishers are affected by this issue. Furthermore, the rise in the percentage of flagged papers in Decile 1 journals suggests that paper mill papers are not just a low-impact journal problem (Figure 8). The concurrent increase in impact factors and the spread of flagged papers suggests that both phenomena may stem from the pressures of the publish-or-perish culture47. The increasing presence of paper mill papers in high impact factor journals highlights an important limitation of using impact factors as proxies for research quality40. This trend should also prompt high-impact journals to invest in models such as the one presented here, or in human checks of submissions, thereby strengthening their capacity to detect paper mill papers.
Our training set of paper mill papers has limitations. The tag ‘paper mill’ in the Retraction Watch Database only reflects the retraction notice provided by the publisher. There is no uniformity in the way publishers investigate fraudulent papers; thus, the ‘paper mill’ qualification likely reflects varied levels of evidence. The papers listed online by research integrity experts include evidence of image manipulation, which can occur within settings beyond paper mills. Additionally, the experts may vary in their methods and transparency. Research on paper mills remains limited1, which could mean that the currently identified paper mill papers represent only a fraction of their actual prevalence in the scientific literature.
The overrepresentation of authors from Chinese institutions among retracted papers suspected of originating from paper mills introduces a potential bias. Despite efforts to balance by language in the control set, a residual risk remains that the model may learn to associate linguistic patterns of Chinese scientific writing with paper mill content, rather than identifying features specific to fraudulent manuscripts. However, analysis of the model’s misclassifications (Table 1) shows few false positives, and an overrepresentation of Chinese papers among false negatives – which does not indicate systematic over-flagging. This pattern may instead reflect blind spots in the training data, where certain textual features present in paper mill papers are underrepresented or absent.
Additional sources of bias may stem from the composition of the control set. Controls were not randomly sampled from the broader cancer research literature to avoid including undetected paper mill papers. The assumption – supported by retraction data – that articles published in selected high impact journals or authored by Taiwanese, Swedish, Norwegian and Finnish research teams can serve as proxies for high-quality controls is open to criticism. While this strategy may enhance contrast between genuine and fabricated texts, it may also limit the model’s ability to detect more nuanced cases of fabrication.
The non-explainability of deep learning models prevents us from directly identifying the features captured by BERT. Flagged papers can include actual paper mill features; other features of misconduct; original work copied by paper mills; original work drawing inspiration from paper mill papers; and mistakes by the model. This research does not aim to directly identify paper mill papers or to accuse anyone of fraud, but rather to identify potentially problematic papers. The classifier is a probabilistic model, not a definitive arbiter of misconduct. As such, all flagged papers represent statistical predictions based on textual features and should be interpreted as signals requiring human judgment and further verification, not as confirmed cases of fraud.
Our model could be continuously improved by updating the training set with the latest confirmed paper mill papers. Since only titles and abstracts were used to train the model, incorporating full-text data or selected sections of the full-text has the potential to further enhance its performance. Future work could explore alternative training strategies, such as tuning only the classification head or selectively freezing model layers. Additionally, experimenting with other aggregation strategies and post hoc calibration methods may help improve the robustness and interpretability of model predictions.
We expect the paper mills to react and innovate, as detection methods like ours threaten their income. The release of OpenAI’s ChatGPT-3.5 in 2022 and the rise of generative AI might further blur the boundaries between genuine and fabricated texts, rendering future automated detection of fraudulent features even more challenging. Our model is currently integrated into the online submission systems of three journals from a major publisher and is being used to screen cancer-related manuscripts. Authors are not informed if their paper is flagged, in order to prevent paper mills from adapting their templates. This approach may serve as an example for other publishers and journals, fostering collective efforts to combat the proliferation of fake papers produced by paper mills.
In conclusion, this study demonstrates that using machine learning to identify papers resembling retracted paper mill papers is both feasible and effective. Our findings reveal concerning trends in cancer research publishing. The rising percentage of flagged papers in high-impact factor journals indicates that paper mills have grown in ambition, and that all journals, reviewers, and researchers need to be alert to their presence. While our model has clear limitations, it provides useful insights and highlights the need for collective awareness to curb the spread of paper mill publications.
Supporting information
Supplementary file 1[supplements/673016_file02.docx]
Supplementary file 2[supplements/673016_file03.docx]
Supplementary file 3[supplements/673016_file04.docx]
Supplementary file 4[supplements/673016_file05.docx]
Author contributions
Conceptualisation: AGB and JAB. Methodology: BS, AGB, DC and JAB. Code development and execution: BS. Analysis: BS, AGB, DC and JAB. Writing: BS and AGB. Reviewing and editing: BS, AGB, DC and JAB. Funding acquisition: JAB and AGB. Supervision: AGB, JAB and DC. The work reported in the paper has been performed by the authors only.
Funding
This study was funded by the National Health and Medical Research Council (NHMRC), Ideas Grant no. 2029249: ‘Problematic Articles and Literature Reviews in Molecular Cancer Research’
Data availability
All data used in this study are publicly available: PubMed annual XML dumps (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/), Retraction Watch database19, and the integrity experts’ dataset28.
Acknowledgement
The authors thank Associate Professor Nathalie Bock for her valuable review of this manuscript. We acknowledge the members of the high-performance computing environment teams of Aqua (Queensland University of Technology, QUT), Bunya (Queensland Cyber Infrastructure Foundation, QCIF, on behalf of the University of Queensland, UQ) and Tesla (Institut de recherche mathématique de Rennes, IRMAR, France) for granting access to their resources and for their technical support. We also acknowledge the teams at PubMed, Retraction Watch, SCImago, and the contributors from PubPeer, especially anonymous user Hoya Camphorifolia, for sharing their data and insights.
Funder Information Declared
National Health and Medical Research Council, https://ror.org/011kf5r70, Ideas Grant no. 2029249:
References
1.Byrne, J. A. et al. Protection of the human gene research literature from contract cheating organizations known as research paper mills. Nucleic Acids Res. 50, 12058–12070 (2022).
2.COPE & STM. Paper mills research report and recommendations. Technical report. vol. 6 (2022).
3.Van Noorden, R. How big is science’s fake-paper problem? Nature 623, 466–467 (2023).
4.Joelving, F. Paper trail. Science 383, 252–255 (2024).
5.Van Noorden, R. More than 10,000 research papers were retracted in 2023 – a new record. Nature 624, 479–481 (2023).
6.Byrne, J. A., Grima, N., Capes-Davis, A. & Labbé, C. The Possibility of Systematic Research Fraud Targeting Under-Studied Human Genes: Causes, Consequences, and Potential Solutions. Biomark. Insights 14, 1–12 (2019).
7.Byrne, J. A. & Christopher, J. Digital magic, or the dark arts of the 21st century-how can journals and peer reviewers detect manuscripts and publications from paper mills? FEBS Lett. 594, 583–589 (2020).
8.Christopher, J. The raw truth about paper mills. FEBS Lett. 595, 1751–1757 (2021).
9.Byrne, J. A. & Labbe, C. Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines. Scientometrics 110, 1471–1493 (2017).
10.Hvistendahl, M. China’s publication bazaar. Science 342, 1035–1039 (2013).
11.Parker, L., Boughton, S., Lawrence, R. & Bero, L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. J. Clin. Epidemiol. 151, 1–17 (2022).
12.Else, H. & Van Noorden, R. The fight against fake-paper factories that churn out sham science. Nature 591, 516–519 (2021).
13.Porter, S. J. & McIntosh, L. D. Identifying Fabricated Networks within Authorship-for-Sale Enterprises. Sci. Rep. 14, 1–21 (2024).
14.Abalkina, A. Publication and collaboration anomalies in academic papers originating from a paper mill: Evidence from a Russia-based paper mill. Learn. Publ. 36, 689–702 (2023).
15.A Sting Inside a Papermill. For Better Science https://forbetterscience.com/2025/05/19/a-sting-inside-a-papermill/ (2025).
16.Gu, J. et al. AI-enabled image fraud in scientific publications. Patterns 3, 100511 (2022).
17.STM Integrity Hub – STM Association. https://stm-assoc.org/what-we-do/strategic-areas/research-integrity/integrity-hub/.
18.Cabanac, G., Labbé, C. & Magazinov, A. Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals. arXiv 2107.06751 (2021).
19.The Retraction Watch Database. http://retractiondatabase.org/.
20.Fletcher, A. H. A. & Stevenson, M. Predicting retracted research: a dataset and machine learning approaches. Res. Integr. Peer Rev. 10, 1–10 (2025).
21.Chen, L. et al. Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations. arXiv 2502.15429 (2025).
22.Bless, C., Waldis, A., Parfenova, A., Andueza Rodriguez, M. & Marfurt, A. Analyzing the Evolution of Scientific Misconduct Based on the Language Of Retracted Papers. Proc. Fifth Work. Sch. Doc. Process. 57–71 (2025).
23.Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. Conf. North Am. Chapter Assoc. Comput. Linguist. 1, 4171–4186 (2019).
24.Sahmoud, T. & Mikki, M. Spam Detection Using BERT. arXiv 2206.02443 (2022).
25.González-Márquez, R., Schmidt, L., Schmidt, B. M., Berens, P. & Kobak, D. The landscape of biomedical research. Patterns 5, 100968 (2024).
26.National Cancer Institute (NCI). https://www.cancer.gov/.
27.SCImago. SJR — SCImago Journal & Country Rank. http://www.scimagojr.com.
28.PubPeer user – Hoya Camphorifolia. Spreadsheet of spreadsheets. https://docs.google.com/spreadsheets/d/1zKxfaqug4ZhwHyGzslF38pFyC8xtU8lzmmOFMGYITDI/edit?gid=1473413779#gid=1473413779.
29.PubPeer – The online Journal club. https://pubpeer.com/.
30.Zhu, H., Jia, Y. & Leung, S. W. Citations of microRNA Biomarker Articles That Were Retracted: A Systematic Review. JAMA Netw. Open 7, e243173 (2024).
31.Schrag, M., Patrick, K. & Bik, E. Academic Research Integrity Investigations Must be Independent, Fair, and Timely. J. Law, Med. Ethics 53, 55–58 (2025).
32.Priem, J., Piwowar, H. & Orr, R. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv 2205.01833 (2022).
33.Liu, Y. et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 1907.11692 (2019).
34.Lee, J. et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
35.Gu, Y. et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 3, 1–23 (2022).
36.Beltagy, I., Peters, M. E. & Cohan, A. Longformer: The Long-Document Transformer. arXiv 2004.05150 (2020).
37.Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H. & Luo, Y. Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences. arXiv 2201.11838 (2022).
38.Wickham, H. ggplot2: Elegant Graphics for Data Analysis. (Springer, 2016).
39.Oste, D. J. et al. Misspellings or “miscellings” – Non-verifiable and unknown cell lines in cancer research publications. Int. J. Cancer 155, 1278–1289 (2024).
40.Pathmendra, P., Park, Y., Enguita, F. J. & Byrne, J. A. Verification of nucleotide sequence reagent identities in original publications in high impact factor cancer research journals. Naunyn. Schmiedebergs. Arch. Pharmacol. 397, 5049–5066 (2024).
41.Park, Y. et al. Identification of human gene research articles with wrongly identified nucleotide sequences. Life Sci. Alliance 5, e202101203 (2022).
42.Chambers, H. Unmasking the fraud: How paper mills are undermining scientific publishing. Dev. Med. Child Neurol. 66, 1262–1263 (2024).
43.Candal-Pedreira, C. et al. Retracted papers originating from paper mills: Cross sectional study. BMJ 379, e071517 (2022).
44.Bik, E. M., Casadevall, A. & Fang, F. C. The prevalence of inappropriate image duplication in biomedical research publications. MBio 7, e00809–16 (2016).
45.Xie, W. et al. Chinese and Global Burdens of Gastrointestinal Cancers From 1990 to 2019. Front. Public Heal. 10, 941284 (2022).
46.Souren, N. Y. et al. Cell line authentication: a necessity for reproducible biomedical research. EMBO J. 41, e111307 (2022).
47.Vasconez-Gonzalez, J., Izquierdo-Condoy, J. S., Naranjo-Lara, P., Garcia-Bereguiain, M. Á. & Ortiz-Prado, E. Integrity at stake: confronting “publish or perish” in the developing world and emerging economies. Front. Med. 11, 1405424 (2024).
