A survey study highlights the challenge of delivering timely systematic reviews while maintaining quality in clinical trials.
Systematic reviews are a key element of clinical research. While these reviews must be completed in a timely manner, there is also a crucial demand for quality. Balancing the two can be a challenge, especially when considering risk of bias (ROB) and the process of objectively assessing methodological flaws.1
A survey study recently published in JAMA Network Open sought to determine whether risk of bias in randomized clinical trials (RCTs) can be assessed reliably and efficiently. In particular, the study evaluated the accuracy of large language models (LLMs) in assessing ROB.
“LLMs may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain,” the study authors wrote.
The study used two LLMs, ChatGPT and Claude, to assess ROB in 30 RCTs selected from systematic reviews. Each RCT was assessed twice by each LLM, and the results were compared with a reference assessment produced by three experts to evaluate the validity of the LLMs. The survey study was conducted between August 10, 2023, and October 30, 2023.
A modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University was used to create the prompt given to the LLMs. The tool has 10 domains: random sequence generation; allocation concealment; blinding of patients, health care clinicians, data collectors, outcome assessors, and data analysts; missing outcome data; selective outcome reporting; and other concerns.
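To make the structure concrete, the 10 domains above can be assembled into a domain-by-domain prompt. This is a minimal illustrative sketch only: the domain names come from the modified Cochrane tool described in the article, but the prompt wording, response options, and function name are assumptions, not the authors' actual prompt.

```python
# Domain names follow the modified Cochrane ROB tool described above.
# The prompt wording and response options below are illustrative assumptions.
DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "blinding of patients",
    "blinding of health care clinicians",
    "blinding of data collectors",
    "blinding of outcome assessors",
    "blinding of data analysts",
    "missing outcome data",
    "selective outcome reporting",
    "other concerns",
]

def build_rob_prompt(trial_text: str) -> str:
    """Assemble a structured ROB prompt covering all 10 domains."""
    lines = [
        "Assess the risk of bias in the following randomized clinical trial.",
        "For each domain below, give a judgment and a brief justification.",
        "",
    ]
    # Number each domain so the model's answers can be matched back to it.
    lines += [f"{i}. {domain}" for i, domain in enumerate(DOMAINS, start=1)]
    lines += ["", "Trial report:", trial_text]
    return "\n".join(lines)
```

A prompt built this way keeps every judgment tied to a named domain, which makes the model's output straightforward to compare against expert assessments.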
Both models demonstrated high correct assessment rates. According to the authors, ChatGPT reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and Claude reached a rate of 89.5% (95% CI, 87.0%-91.8%). Consistency rates between the two repeated assessments were 84.0% for ChatGPT and 87.3% for Claude.
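The two reported metrics, a correct assessment rate against the expert reference and a consistency rate between repeated runs, can be sketched as simple agreement fractions. This is an assumption about how such rates are commonly computed; the paper's exact definitions may differ.

```python
# Illustrative sketch (assumption): each list holds one judgment per
# trial-domain pair, e.g. "low risk" / "high risk".

def correct_rate(llm_judgments: list, expert_judgments: list) -> float:
    """Fraction of LLM judgments that match the expert reference."""
    assert len(llm_judgments) == len(expert_judgments)
    matches = sum(a == b for a, b in zip(llm_judgments, expert_judgments))
    return matches / len(expert_judgments)

def consistency_rate(run1: list, run2: list) -> float:
    """Fraction of judgments identical across two repeated assessments."""
    assert len(run1) == len(run2)
    matches = sum(a == b for a, b in zip(run1, run2))
    return matches / len(run1)
```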
“In this survey study, we established a structured and feasible prompt that was capable of guiding LLMs in assessing the ROB in RCTs. The LLMs used in this study produced assessments that were very close to those of experienced human reviewers,” the authors said of the study outcome. “Automated tools in systematic reviews exist but are underused due to difficult operation, poor user experience, and unreliable results. In contrast, both LLMs had high accessibility and user friendliness, demonstrating outstanding reliability and efficiency, thereby showing substantial potential for facilitating systematic review production.”
In most domains of the Cochrane tool, domain-specific correct rates ranged from around 80% to 90%. However, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns).
“To our knowledge, this study is the first to transparently explore the feasibility of applying LLMs to the assessment of ROB in RCTs. The study addressed multiple aspects of the feasibility of LLM use, including accuracy, consistency, and efficiency. A detailed and structured prompt was proposed and performed commendably in practical application,” the authors wrote. “Our findings preliminarily suggest that with an appropriate prompt, LLM 1 (ChatGPT) and LLM 2 (Claude) can be used alongside the modified Cochrane tool to assess the ROB of RCTs accurately and efficiently.”
Among the study's limitations, the authors noted that the LLMs may not be capable of completing the review independently. Providing the LLMs with links to external sources, however, may strengthen their capabilities; because this feature was available only in a beta testing version during the study, the authors did not use it.
“In this survey study of the application of LLMs to the assessment of ROB in RCTs, we found that LLM 1 (ChatGPT) and LLM 2 (Claude) achieved commendable accuracy and consistency when directed by a structured prompt,” the authors concluded. “By scrutinizing the rationale provided and comparing multiple assessments across different models, researchers were able to efficiently identify and correct nearly all errors.”
1. Lai H, Ge L, Sun M, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw Open. 2024;7(5):e2412687. doi:10.1001/jamanetworkopen.2024.12687