
Assessing the risk of bias within primary AI literature in medicine

In medical literature, systematic reviews are considered the “highest quality of evidence” in the evidence-based medicine (EBM) pyramid 1. A systematic review synthesises primary literature by applying pre-defined eligibility criteria, conducting database searches, and using critical appraisal tools to address a specific research question. Once the relevant studies are identified, they are evaluated for reporting quality and assessed for risk of bias. This rigorous process helps contextualise the results of the studies, ensuring that the conclusions are informed by a thorough and critical analysis of the methodologies used in the primary studies.

EBM Figure

Research on AI technology in medicine has gained momentum over the last two decades, with indexed articles on PubMed increasing from around 1,000 in 2000 to over 50,000 in 2024. In medical imaging, AI models have been reported to be superior or comparable to clinicians in making certain diagnoses. However, many studies comparing AI with clinicians are not prospective, are not randomised, are at high risk of bias, and deviate from reporting standards 2. For appraisal of primary AI literature in medicine, the following tools are commonly used:

Assessing reporting quality within studies:

CONSORT-AI

Provides a set of recommendations for clinical trials evaluating interventions with an AI component Link

CLAIM

Provides a checklist for reporting medical imaging studies that use AI Link

DECIDE-AI

Reporting guideline for clinical evaluation of decision-support systems based on AI Link

STARD-AI (in development)

An AI-centric guideline based on the original STARD 2015 guideline which was developed to improve the reporting quality of studies investigating diagnostic test accuracy Related Publication STARD 2015

SPIRIT-AI

Reporting guideline for clinical trial protocols evaluating interventions with an AI component Link

TRIPOD-AI

Reporting guidance for clinical prediction models that use regression or machine learning methods Link Original TRIPOD 2015

PRISMA-AI (in development)

A steering committee has been established to update PRISMA guidelines for applicability in AI literature Related Publication

Assessing for risk of bias within studies:

QUADAS-AI (in development)

Tool in development to assess risk of bias in primary diagnostic accuracy studies that have an AI component Related Publication

QUADAS-2

Used to assess risk of bias in primary diagnostic accuracy studies Link

CHARMS

Checklist to appraise all types of primary prediction modelling studies including regressions, neural networks, genetic programming and vector machine learning models Link

PROBAST-AI (in development)

A tool based on the original PROBAST tool to assess both quality of reporting and risk of bias in studies developing and evaluating multivariable diagnostic and prognostic prediction models using any AI/ML technique Related Publication Original PROBAST Tool

Successful Application of Tools

Though these tools are becoming more commonly used, the high level of heterogeneity in primary AI literature and poor reporting practices can make them challenging to apply. Examples of successful application of these tools can be found below:

Systematic review of ML-based diagnostic imaging models using CLAIM & QUADAS-2: Lans A, Pierik RJB, Bales JR, Fourman MS, Shin D, Kanbier LN, Rifkin J, DiGiovanni WH, Chopra RR, Moeinzad R, Verlaan JJ, Schwab JH. Quality assessment of machine learning models for diagnostic imaging in orthopaedics: A systematic review. Artif Intell Med. 2022 Oct;132:102396. doi: 10.1016/j.artmed.2022.102396. Epub 2022 Sep 6. PMID: 36207080.

Systematic review and meta-analysis comparing deep learning performance against health-care professionals in detecting diseases from medical imaging, using CHARMS: Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, Mahendiran T, Moraes G, Shamdas M, Kern C, Ledsam JR. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health. 2019 Oct 1;1(6):e271-97.

Systematic review of claims of deep learning studies in medical imaging using TRIPOD, Cochrane risk of bias tool for RCTs, and PROBAST: Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.

References

  1. Figure 1 taken from here 

  2. Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.