Understanding the causes and implementing processes to mitigate preventable sources of discordance.
Blinded independent central review (BICR) is the process by which radiographic exams and selected clinical data, collected as part of a clinical trial protocol, are submitted to a central location for independent review. The Food and Drug Administration (FDA) advocates BICR of radiographic exams for registrational oncology studies when the primary study endpoint is based on tumor measurements, such as progression-free survival (PFS), time to progression (TTP), or objective response rate (ORR).1 Current FDA guidance recommends that multiple independent reviewers evaluate each subject.2 One consequence of a multiple-reviewer paradigm is the potential for discordance between readers on the outcome of a subject. Such discordance is adjudicated by a third reader, who determines the final outcome. Adjudication rates concern sponsors and regulators because the reasons for discordance are poorly understood and few metrics on its causes have been published. This article discusses the causes of discordance between BICR readers, outlines the frequency of contributing factors, and suggests steps to mitigate preventable sources of discordance. These data were previously presented at the 2010 American Society of Clinical Oncology (ASCO) Annual Meeting.3
Based on BICR data from 40 oncology clinical trials across 12 indications, including 12,299 subjects, we determined that two readers disagreed 23 percent of the time when determining best overall response and 31 percent of the time when determining the date of progression. In a subset analysis, data from breast cancer clinical trials alone, involving 876 subjects, were blinded and pooled to identify cases in which the two primary readers were discordant in outcome. Discordance was defined as a difference between the two reviewers in best overall response, response date, or date of progression. We identified 459 instances of discordance. The images were reviewed to determine the cause of discordance between readers and to establish whether the discordance resulted from a justifiable interpretation difference, in which neither reader was incorrect in their assessment, or from an assessment error by one reader. We acknowledge that bias may have been introduced into this process, as the "correctness" of the interpretation was judged by different radiologists from the same facility as the original reviewers.
Table 1. Differences between readers in lesion selection were the leading cause of discordance.
The reasons for discordance were differences in lesion selection (37%), the perception of new lesions (30%), the perception of non-target disease progression (13%), image quality or missing image data (11%), differences in lesion measurements (9%), and missing clinical information (<1%). In total, 77 percent of discordant cases were deemed the result of justifiable perception differences between the two readers, while 23 percent were attributed to an interpretive error by one of the readers. The distribution of reasons for discordance and the classification of justifiable perception differences versus interpretive errors are shown in Table 1 and Figure 1.
Figure 1. The causes of discordance and classification of justifiable versus interpretive error.
Lesion selection. When radiologists function as independent reviewers, they may identify different target and non-target lesions that each considers representative of the subject's overall extent of disease. Reviewers may also identify the same lesions but classify them differently between target and non-target disease. In a study by Hopper et al., two experienced radiologists identified the same lesions only 28 percent of the time.4 Because lesions do not respond or progress at the same rate, differences between reviewers in the response or progression of their selected lesions can be observed. For example, in a RECIST (Response Evaluation Criteria in Solid Tumors) study, the percent change in the sum of target lesion diameters may be an 18 percent increase (stable disease) for one reader and a 20 percent increase (progressive disease) for the second. In this example, both readers are likely correct; the results differ because the lesions chosen by Reader 1 changed at a different rate than the lesions chosen by Reader 2. Nonetheless, the outcomes are discordant, resulting in adjudication. There are many other examples in which adjudication is forced by mapping continuous tumor measurements onto the categorical variables of complete response, partial response, stable disease, and progressive disease. One such example is illustrated in Table 2, and a worked sketch of the threshold arithmetic follows it.
Table 2. Discordance resulted from a justifiable difference between readers in lesion selection.
In Table 2, there was discordance in the date of response between reviewers (R1 and R2) due to a justifiable difference in lesion selection. R2 selected more lung disease, while R1 identified hilar adenopathy. Because the thoracic lesions classified as target disease by R2 responded more rapidly than the liver lesions, R2 confirmed a partial response (PR) one time point (TP) before R1. Best response and date of progression were concordant between reviewers, but a justifiable difference in the number and classification of lesions resulted in a response date discordance of one time point.
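The threshold arithmetic behind such categorical discordance can be made concrete. Below is a minimal sketch in Python, assuming standard RECIST 1.1 target-lesion rules (a decrease of at least 30 percent from baseline for partial response; an increase of at least 20 percent, and at least 5 mm, over the nadir for progression); the function name and inputs are illustrative, not part of any BICR system.

```python
def classify_target_response(baseline_sum: float, nadir_sum: float,
                             current_sum: float) -> str:
    """Classify target-lesion response using RECIST 1.1 thresholds.

    All sums of diameters are in millimeters. Complete response
    (disappearance of all target lesions) is omitted for brevity.
    """
    increase = current_sum - nadir_sum
    if increase >= 0.2 * nadir_sum and increase >= 5.0:
        return "PD"   # >=20% and >=5 mm absolute increase over the nadir
    if current_sum <= 0.7 * baseline_sum:
        return "PR"   # >=30% decrease from the baseline sum
    return "SD"

# Two readers follow different target lesions; both measure correctly,
# yet land on opposite sides of the 20 percent progression threshold.
print(classify_target_response(100.0, 100.0, 118.0))  # Reader 1: SD (+18%)
print(classify_target_response(100.0, 100.0, 120.0))  # Reader 2: PD (+20%)
```

Because the categorical boundaries are hard cutoffs, readers following different lesions that change at even slightly different rates can land in different categories while both are correct.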
Perception differences. For registrational studies, current FDA guidance recommends a sequential unblinding image presentation paradigm, in which each time point is read in sequence, without access to future time points or knowledge of the number of time points available for review.2 At times, this paradigm forces the reviewer to make an assessment based on limited information, such as deciding between progressive disease and stable disease in the face of an equivocal new lesion.
Due to the partially subjective nature of radiographic assessments, the enforcement of the sequential unblinding paradigm, the presence of image artifacts, benign intercurrent diseases, and occasional image quality issues, some radiologists will determine that a new radiographic finding truly represents a new metastatic lesion, and will therefore identify unequivocal progression earlier than others. An example of discordance resulting from a justifiable perception difference in the identification of new lesions is shown in Figure 2. Similarly, the threshold for assessing non-target disease progression may differ between reviewers, as shown in Figure 3. Perception differences are unavoidable and are, in part, the rationale for the two-reviewer-plus-adjudicator paradigm, in which the adjudicator reviews both outcomes and selects the more correct assessment.
Figure 2. Discordance resulted from a justifiable difference in the perception of new lesions.
In Figure 2, there was discordance in the date of progression based on a justifiable difference in the perception of new lesions resulting from image quality issues. At baseline (A), a poor quality color paper bone scan was received. At TP2 (B), a grayscale bone scan with slightly better resolution was received. At TP2, R2 identified increased uptake on the bone scan in three areas where CT correlation was not possible, and assessed progressive disease. R1 did not assess progression because, in their judgment, an adequate comparison with baseline could not be made. This represents a justifiable perception difference, as comparison between time points was difficult due to poor quality images and changes in scan technique.
Figure 3. Discordance resulted from a justifiable difference in the perception of non-target PD.
In Figure 3, there was discordance in the date of progression resulting from a justifiable difference in the perception of non-target progressive disease (PD). R1 assessed progressive disease at TP2 based on an increase in miliary disease (B). R2, however, did not believe that non-target disease progression was unequivocal until TP3 (C). This resulted in discordance of progression date by one time point.
Lesion measurements. Inter-reader measurement variability can contribute to outcome differences between radiologists. As reported by Hopper et al., when the same lesions are chosen by two different reviewers, measurements can differ by up to 15 percent, with the greatest differences occurring for lesions with poorly defined margins.4 The potential for measurement variability depends on many factors, including lesion size, margination, conspicuity, phase of contrast enhancement, and measurement technique (manual, semi-automated, or fully automated).
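To illustrate how measurement variability alone can force adjudication, the sketch below simulates two readers measuring the same lesions near the progression cutoff, each with up to ±7.5 percent measurement spread (roughly consistent with the up-to-15-percent inter-reader differences reported by Hopper et al.). The spread model and the numbers are assumptions for illustration only, not an estimate from our data.

```python
import random

def measured(true_sum: float, spread: float = 0.075) -> float:
    """One reader's measurement: the true sum perturbed by up to +/-7.5%."""
    return true_sum * (1 + random.uniform(-spread, spread))

def is_pd(nadir: float, current: float) -> bool:
    """RECIST 1.1 progression: >=20% and >=5 mm increase over the nadir."""
    return (current - nadir) >= max(0.2 * nadir, 5.0)

random.seed(0)
trials = 100_000
discordant = 0
for _ in range(trials):
    # True tumor burden grows 15% from a 100 mm nadir -- near the PD cutoff.
    r1 = is_pd(measured(100.0), measured(115.0))
    r2 = is_pd(measured(100.0), measured(115.0))
    discordant += r1 != r2
print(f"Discordant category in {discordant / trials:.0%} of simulated cases")
```

Even with both readers measuring the identical lesions, plausible measurement noise around a near-threshold change produces categorically discordant calls in a substantial fraction of simulated cases.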
Figure 4 represents a justifiable interpretation difference in lesion measurements as well as an interpretation difference resulting from a lack of clinical information. Both R1 and R2 identified the same liver lesion as part of the baseline target disease (A-B, assessed as lesion 001 by R1 and 002 by R2). R1 and R2 measured the lesion on different image slices at baseline and subsequent time points (TP3, shown in C-D). At TP4, the lesion appeared to have increased in size (E-F). R1 measured the lesion, resulting in progression of target disease. R2 noted that the lesion had changed in appearance and decreased in density, raising suspicion of local treatment with radiofrequency ablation, and therefore assessed the lesion as unevaluable. Given the lack of clinical information, this represents a justifiable interpretation difference.
Figure 4. Measurement differences and an interpretation difference due to missing clinical data.
Image quality and missing data. Given the global nature of large oncology studies and the limited technical equipment available in some countries, independent reviewers may have to evaluate images they consider suboptimal by their local standards. Image variability can potentially contribute to discordance, as indicated in Figure 5. Because reviewers may identify different lesions at baseline, missing or poor quality image data may impact one reviewer's assessment while having no bearing on the assessment made by the other. Similarly, as illustrated in Figure 4, missing or incomplete clinical data may affect the assessment made by one reviewer but not the other.
Figure 5. Discordance resulted from an inconsistency in scan technique across time points.
In Figure 5, there was a justifiable discordance in response between readers based on inconsistent scan technique across assessment points. Both readers identified the same target lesion (002) in the liver; however, inconsistent contrast administration made comparison between assessment points difficult. At baseline (A), the liver was imaged long after contrast administration, resulting in delayed-phase images. At TP2 (B), the liver was imaged without contrast, making comparison with the baseline exam difficult. At TP3 (C), the contrast timing was similar to baseline; however, the images were reconstructed with a different filter, further complicating the assessment. TP4 (D) was performed correctly, with the liver imaged in the portal venous phase. At TP3 (C), R2 attempted to follow the lesion, whereas R1 did not believe reliable measurements could be made, due to the inconsistency in technique across time points, and assessed the lesion as unevaluable. At TP4 (D), both reviewers measured the lesion and R2 confirmed a partial response (PR). R1 was unable to confirm PR due to the prior unevaluable time point, and therefore assigned a best response of stable disease (SD).
Additional factors affecting adjudication rates. Additional factors that can contribute to discordance between reviewers include the number of adjudication variables, tumor type, drug efficacy, duration of treatment, the number of assessment points, the complexity of the assessment, the subjectivity of the assessment, the types of imaging exams required by the protocol, the precision of the response criteria, and the dating conventions followed for assessing the date of progression or response.5, 6
Reviewer qualifications and training. BICR reviewers should be qualified based on experience in the indication and the response criteria specified in the clinical trial protocol and BICR charter. New BICR reviewers should undergo a certification process that includes training on BICR processes and reading paradigms, followed by completion of qualification test cases. In addition, each reviewer must complete dedicated protocol training, according to BICR standard operating procedures, for each clinical trial to which they are assigned. Finally, reviewers should be continuously qualified at a corporate level through ongoing monitoring of reviewer metrics.
BICR charters. BICR charters should detail the response criteria used, any modifications to those criteria, how the criteria are applied, approaches to qualitative assessments, dating conventions for response and progressive disease, and algorithms for handling missing radiographic and clinical data. Limiting the number of adjudication variables to only those that affect the primary endpoint is also recommended. For example, if a study has a primary endpoint of PFS or TTP, the only adjudication variable should be the date of progression.
Dating conventions. Dating conventions that assign a single date to each time point are recommended to prevent date discordance within the same time point. This accords with current FDA guidance, which advocates using a single time point date for the assessment of progression.1 For example, if Reader 1 identifies progression of target disease on a CT scan and Reader 2 identifies new disease on a bone scan performed one day later, adjudication is required based on a one-day difference in progression date. If the BICR charter instead specifies a single time point date for assessing progression, the dates assessed by both readers would be concordant, negating the need for adjudication.
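As a hedged illustration of such a convention, the sketch below assigns every exam within a time point a single date (here, arbitrarily, the earliest exam date; an actual BICR charter may define the convention differently, and the dates are invented):

```python
from datetime import date

def timepoint_date(exam_dates: list[date]) -> date:
    """Assign one date to the whole time point (convention: earliest exam)."""
    return min(exam_dates)

# One time point (TP2) containing a CT and a bone scan one day apart.
tp2_exams = [date(2009, 3, 2), date(2009, 3, 3)]

reader1_pd_date = timepoint_date(tp2_exams)  # PD called on the CT (day 1)
reader2_pd_date = timepoint_date(tp2_exams)  # PD called on the bone scan (day 2)

# Both readers receive the same progression date, so a one-day difference
# in exam dates within the time point cannot by itself force adjudication.
assert reader1_pd_date == reader2_pd_date
```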
Corrections in the sequential unblinding paradigm. While current FDA guidance indicates that assessments should be locked after each successive time point, more recent feedback in public meetings attended by FDA officers indicates that assessments can be overturned on the basis of data that emerges after the initial assessment. Response assessments at prior time points can be corrected based on information presented subsequently, as long as the correction process is defined in the BICR charter, the process is driven by data that emerges after the initial assessment, and there are adequate audit trails to substantiate the changes.2, 6 Allowing such corrections yields a data set more representative of the actual response of the subjects, while mitigating discordance associated with pseudo-progression, the flare phenomenon of bone disease, and new radiographic findings that subsequently prove to be the result of benign intercurrent diseases. These issues are discussed in the updated RECIST 1.1 guidelines.7
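One way to satisfy the audit trail requirement is an append-only record of assessments, in which a correction adds a new entry rather than overwriting the original. The sketch below is illustrative only; the field names and structure are assumptions, not a prescribed BICR implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimepointAssessment:
    """Append-only audit trail: corrections never overwrite prior entries."""
    timepoint: int
    history: list[dict] = field(default_factory=list)

    def record(self, response: str, reason: str, reader: str) -> None:
        self.history.append({
            "response": response,
            "reason": reason,
            "reader": reader,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    @property
    def current_response(self) -> str:
        return self.history[-1]["response"]

tp3 = TimepointAssessment(timepoint=3)
tp3.record("PD", "equivocal new lesion read as unequivocal", reader="R1")
# Later imaging shows the finding was a benign intercurrent process; the
# correction is appended, preserving the original call and the audit trail.
tp3.record("SD", "finding resolved at TP4; benign etiology", reader="R1")
print(tp3.current_response, len(tp3.history))  # SD 2
```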
Best practices for qualitative assessments. Discordance often results from perception differences in qualitative assessments where detail and guidance are lacking in published response criteria. In our analysis, 43 percent of discordance was based on perception differences between readers in the identification of new lesions and the qualitative assessment of non-target disease progression. One way to mitigate these perception differences is to reach consensus within the reader pool on interpretation approaches and establish best practices for qualitative assessments. This can be accomplished through regular case reviews and training using data from past studies.
As an example, the updated RECIST 1.1 guidelines provide additional detail and guidance regarding the assessment of non-target disease progression in patients with and without measurable disease.7 We presented data at ASCO in 2009 proposing additional rules for assessing non-target disease progression which can be used by the BICR as a supplement to RECIST 1.1.8 In addition, the RECIST 1.1 guidelines provide specific details regarding the handling of equivocal new lesions and an algorithm for using FDG-PET to identify new lesions.7 This additional detail has been particularly helpful in ensuring a consistent approach among readers. Additional areas where discordance can be mitigated through consensus of assessment approaches include: determining the threshold for when subtle findings represent unequivocal new disease, instituting thresholds regarding how a new bone scan lesion is unequivocally confirmed by subsequent modalities, and establishing thresholds to identify when an imaging exam is of inadequate diagnostic quality. The rules governing these approaches should be outlined in BICR charters.
Site compliance. Sponsors should consider contracting directly with imaging facilities to ensure imaging exams are performed according to the scanning parameters outlined in the protocol or BICR imaging manual. Just as investigator payment is tied to receipt of case report form (CRF) pages, payment to imaging facilities could be tied to receipt of images at the BICR and resolution of any quality issues.5
Educating and training the monitoring staff on the complexity of clinical trials with imaging endpoints is imperative for enforcing site compliance and reducing discordance. Understanding the impact that different quality issues have on response assessments, and the probability of resolution at the imaging facilities, will allow monitors to focus their efforts on critical open issues.9
Transfer of clinical data. A well-controlled process for providing the independent review committee with clinical data relevant to BICR can reduce discordance rates. In oncology trials, clinical data such as prior radiation therapy, benign findings, cytology, local therapy, and physical exam findings often contribute to the assessment made by independent reviewers. As illustrated in Figure 4, missing clinical data can result in discordance between reviewers.
It is important for clinical information to be collected in a manner that facilitates future BICR. The independent review committee can assist the sponsor during CRF development to ensure that all clinical data relevant to the BICR is collected consistently and is organized in a manner that can be effectively used during BICR (e.g., on a per-time-point basis).9 A process for providing clean clinical data and managing updates to data previously assessed should be prospectively defined and agreed upon in a BICR Clinical Data Transfer Plan.
Monitoring reviewer metrics. The overall adjudication rate and the causes of discordance should be monitored throughout the trial. In addition, reader-specific metrics (such as adjudication acceptance and rejection rates, error rates, and inter- and intra-reader variability) should be monitored on an ongoing basis. If outliers or significant discrepancies are identified, investigation, re-training and/or other corrective actions must occur. Monitoring reader performance throughout the trial can identify potential issues early in the process and mitigate unnecessary discordance resulting from those issues.
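A minimal sketch of such monitoring, computed from a hypothetical adjudication log, is shown below; the data structure, reader IDs, and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical adjudication log: (reader_a, reader_b, winner) per adjudicated
# case, where "winner" is the reader whose assessment the adjudicator accepted.
adjudications = [("R1", "R2", "R1"), ("R1", "R2", "R2"), ("R1", "R3", "R1"),
                 ("R2", "R3", "R3"), ("R1", "R2", "R1")]
total_cases = 40  # all dual-read cases, concordant or not

adj_rate = len(adjudications) / total_cases
involved = Counter()   # times each reader's assessment went to adjudication
upheld = Counter()     # times the adjudicator sided with that reader

for a, b, winner in adjudications:
    for reader in (a, b):
        involved[reader] += 1
    upheld[winner] += 1

print(f"Overall adjudication rate: {adj_rate:.0%}")
for reader in sorted(involved):
    rate = upheld[reader] / involved[reader]
    print(f"{reader}: adjudicator agreement {rate:.0%} "
          f"({upheld[reader]}/{involved[reader]} adjudicated cases)")
```

A reader whose adjudicator agreement rate drifts well below the pool average would be a candidate for the investigation and re-training described above.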
Some factors that cause discordance are process-driven or due to justifiable interpretation differences, as was observed in 77 percent of adjudicated cases in our review. Examples include lesion selection, image quality issues, inter-reader measurement variability, the perception of new lesions, and the assessment of non-target progression. Separate from these factors are reader assessment errors, such as those identified in 23 percent of adjudicated cases, a finding that compares favorably with radiologist error rates reported in the literature.10 The BICR process of adjudication by a third reader identifies and mitigates both justifiable discordances and interpretive errors. Discordance can be further mitigated by qualifying and training reviewers, detailing response criteria and any modifications in BICR charters, allowing for corrections in the sequential unblinding paradigm, establishing best practices for qualitative assessments, taking steps to increase site compliance, ensuring a well-controlled process for transferring clinical data to the BICR committee, and implementing processes to monitor reviewer metrics throughout the trial.
Kristin Borradaile,* MS, is Director, Medical Affairs (email: kborradaile@corelabpartners.com); Robert Ford, MD, is Chairman, Scientific Advisory Committee, and Chief Medical Officer Emeritus; Michael O'Neal, MD, is Chief Medical Officer; and Kevin Byrne, MD, is Diagnostic Radiologist, all at CoreLab Partners, Inc., 100 Overlook Center, Princeton, NJ 08540.
* To whom all correspondence should be addressed.
1. US Department of Health and Human Services, Food and Drug Administration, Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics, (FDA, Rockville, MD, 2007).
2. US Department of Health and Human Services, Food and Drug Administration, Guidance for Industry: Developing Medical Imaging Drug and Biologic Products Part III: Design, Analysis and Interpretation of Clinical Studies (FDA, Rockville, MD, 2004).
3. K. Borradaile, R. Ford, M. O'Neal, et al., "Analysis of the Cause of Discordance Between Two Radiologists in the Assessment of Radiographic Response and Progression for Subjects Enrolled in Breast Cancer Clinical Trials Employing Blinded Independent Central Review." Poster presented at: 2010 American Society of Clinical Oncology Annual Meeting; June 4-8, 2010; Chicago, IL.
4. K.D. Hopper, C.J. Kasales, M.A. Van Slyke, et al., "Analysis of Interobserver and Intraobserver Variability in CT Tumor Measurements," American Journal of Roentgenology, 167, 851-854 (1996).
5. R. Ford, L. Schwartz, J. Dancey, et al., "Lessons Learned from Independent Central Review," European Journal of Cancer, 45, 268-274 (2009).
6. R. Ford, D. Mozley, "Report of Task Force II: Best Practices in the Use of Medical Imaging Techniques in Clinical Trials," Drug Information Journal, 42, 515-523 (2008).
7. E.A. Eisenhauer, P. Therasse, J. Bogaerts, et al., "New Response Evaluation Criteria in Solid Tumours: Revised RECIST Guideline (Version 1.1)," European Journal of Cancer, 45, 228-247 (2009).
8. K. Borradaile, R. Ford, "Analysis of the Rate of Non-Target Disease Progression in Patients with Stable or Responding Target Disease by the Response Evaluation Criteria in Solid Tumors (RECIST)." Poster presented at: 2009 American Society of Clinical Oncology Annual Meeting; May 29-June 2, 2009; Orlando, FL.
9. S. Bates, K. Williams, "Effective Management of the Independent Imaging Review Process," Applied Clinical Trials, Supplement, May 2007, 6-10.
10. L. Berlin, J. Berlin, "Malpractice and Radiologists in Cook County, IL: Trends in 20 Years of Litigation," American Journal of Roentgenology, 165, 781-788 (1995).