An overview of the statistical methods used in scientific surveillance as part of centralized monitoring.
Although most clinical trial data quality benefits from centralized statistical monitoring, the critical data, processes, and associated risks identified via risk assessment following the Clinical Trials Transformation Initiative (CTTI) framework1 should drive the overall monitoring approach. Including scientific surveillance techniques as part of centralized monitoring2 enhances our ability to monitor critical and non-critical data; these techniques are particularly relevant, and recommended, for trials at higher risk for measurement error, such as those with patient-reported outcomes (PROs) and clinician-reported outcomes (ClinROs).
Methods employed by scientific surveillance that enhance our ability to monitor and mitigate risks are well aligned with regulatory guidance recommending statistical surveillance as a key component of centralized monitoring. These include FDA guidance on risk-based monitoring3; a 2013 European Medicines Agency reflection paper4; FDA’s 2022 Q9(R1) draft guidance5; International Council for Harmonisation (ICH) E6(R2)6 and E8(R1)7; and guidance from the National Medical Products Administration Center for Drug Evaluation (CDE). Some guidance documents call out specific statistical monitoring methods (e.g., the 2022 FDA Q9(R1) and CDE documents), many of which are included in scientific surveillance.
Many factors can degrade data quality8-11 in clinical trials and result in scientifically incompatible data. These factors occur for myriad reasons, such as sloppiness, insufficient training, or lack of engagement, and manifest as too much or too little variability, inconsistencies between related scales, “too-good-to-be-true” values, or implausible temporal shifts in outcomes, causing systematic irregularities and increasing measurement error. In placebo-controlled studies, variability also can be introduced by the subject, caregiver, or investigator, and placebo response rates can influence outcomes profoundly. Additionally, predicting key study-level risks, such as premature treatment discontinuation or loss to follow-up, can facilitate early action to minimize them.
Scientific surveillance includes methods that fall into three main categories: detection of scientifically incompatible data, early prediction of key study-level risks, and blinded monitoring of placebo response.
For many types of endpoints, no systematic differences are expected between sites, so statistical validity is enhanced when sites’ measurements are similar in various respects. For instance, within a given scale (such as a depression rating scale), item scores are expected to exhibit similar correlations at all sites; demonstrating that these correlations are indeed similar between sites supports validity.
Detection of scientific incompatibility begins with the calculation of site-level correlations (among item scores within a single scale or among scores from different scales) and the corresponding overall study correlations. The data are tested for the presence of only random variation between outcomes of interest: for a site to be declared unusual, it must lie atypically far from a central point calculated using data from all sites. Distance metrics and simulation-based statistical tests are applied, the false discovery rate is controlled, and sites of concern are identified.12,13
Subject-level correlations among item scores within a single scale or among scores from different scales and the corresponding overall study correlations also are calculated. Distances between each subject-level correlation and the overall study correlation are calculated and standardized by dividing by the number of non-missing elements in the subject’s correlation matrix.14
Figure 1 below shows two sites flagged (dark blue) based on site-level correlations of absolute values for all items within a PRO questionnaire (e.g., EQ-5D-5L), and four sites flagged based on change-from-baseline values for the same questionnaire, compared to all other sites after controlling the false discovery rate.
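As a rough illustration of this screen, the sketch below computes per-site correlation distances, derives p-values from a subject-permutation null, and applies Benjamini-Hochberg false discovery rate control. The data layout and function names are hypothetical, and the distance metrics and simulation-based tests in the cited methods12,13 are more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_distance(site_scores, overall_corr):
    """Distance between a site's item correlation matrix and the
    overall study correlation matrix (off-diagonal elements only)."""
    c = np.corrcoef(site_scores, rowvar=False)
    mask = ~np.eye(c.shape[0], dtype=bool)
    return np.sqrt(np.mean((c[mask] - overall_corr[mask]) ** 2))

def flag_sites(scores_by_site, n_perm=2000, fdr=0.05):
    """Flag sites whose item correlations lie atypically far from the
    study-wide correlations. P-values come from permuting subjects
    across sites; the false discovery rate is controlled with the
    Benjamini-Hochberg step-up procedure."""
    # Only sites with at least three subjects enter the comparison.
    sites = [s for s in scores_by_site if len(scores_by_site[s]) >= 3]
    pooled = np.vstack([scores_by_site[s] for s in sites])
    overall = np.corrcoef(pooled, rowvar=False)
    obs = {s: corr_distance(scores_by_site[s], overall) for s in sites}

    # Permutation null: shuffle subjects across sites, keeping site sizes.
    sizes = [len(scores_by_site[s]) for s in sites]
    null = {s: [] for s in sites}
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        start = 0
        for s, k in zip(sites, sizes):
            null[s].append(corr_distance(perm[start:start + k], overall))
            start += k
    pvals = {s: (1 + sum(d >= obs[s] for d in null[s])) / (n_perm + 1)
             for s in sites}

    # Benjamini-Hochberg: find the largest rank whose p-value clears
    # its stepped threshold, then flag everything at or below it.
    ranked = sorted(pvals, key=pvals.get)
    m, cutoff = len(ranked), 0.0
    for i, s in enumerate(ranked, start=1):
        if pvals[s] <= fdr * i / m:
            cutoff = pvals[s]
    flagged = [s for s in ranked if cutoff > 0 and pvals[s] <= cutoff]
    return pvals, flagged
```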
Subject-level multivariate mean (calculated using all relevant outcomes of interest) and correlation inliers and outliers are identified using distance measures between each subject and the overall study, applying the standard rules associated with boxplot outliers: observations lying at least 1.5 times the interquartile range above the upper quartile or below the lower quartile are classed as unusual.
Figure 2 below depicts three subjects (shown as bubbles below the lower dotted control line) with a multivariate mean (using two PRO scales) below the lower quartile compared to all other subjects in the study.
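A minimal sketch of the boxplot rule, assuming the per-subject distances have already been computed (the values below are hypothetical):

```python
import numpy as np

def boxplot_outliers(distances):
    """Flag observations outside Tukey's fences: more than 1.5 times
    the interquartile range beyond the upper or lower quartile."""
    d = np.asarray(distances)
    q1, q3 = np.percentile(d, [25, 75])
    iqr = q3 - q1
    return np.where((d < q1 - 1.5 * iqr) | (d > q3 + 1.5 * iqr))[0]

# Hypothetical per-subject distances between each subject's
# multivariate mean and the overall study mean.
subject_distances = [0.40, 0.50, 0.45, 0.52, 2.90, 0.48, 0.01, 0.55]
print(boxplot_outliers(subject_distances))  # flags indices 4 and 6
```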
Control charts, recommended as statistical monitoring tools in FDA’s 2022 Q9(R1) quality risk management draft guidance, are used to compare a site-specific summary measure to the study reference. Outlying sites, identified as having means or variances two or three standard deviations from the study reference, are flagged for further investigation (subject to limitations discussed later in this article).
Figure 3 below shows two sites falling below the lower control limit (shown as bubbles below the lower dotted line) when comparing variance for an outcome of interest, pointing to unusually low variability at these sites. The size of each bubble indicates the number of subjects at that site, with larger bubbles corresponding to more subjects relative to other sites.
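The sketch below illustrates the control-chart idea for site means under a simple normal reference, with limits that widen for smaller sites; comparing site variances, as in Figure 3, would instead use chi-square-based limits. The function name and data layout are hypothetical:

```python
import numpy as np

def control_chart_flags(values_by_site):
    """Compare each site's mean outcome to the study reference on a
    control chart. Sites beyond two (warning) or three (action)
    standard errors of the reference are flagged."""
    pooled = np.concatenate(list(values_by_site.values()))
    ref_mean, ref_sd = pooled.mean(), pooled.std(ddof=1)
    flags = {}
    for site, vals in values_by_site.items():
        se = ref_sd / np.sqrt(len(vals))  # limit widens for small sites
        z = (np.mean(vals) - ref_mean) / se
        flags[site] = ("action" if abs(z) > 3
                       else "warning" if abs(z) > 2
                       else "ok")
    return flags

# Hypothetical outcome values per site.
flags = control_chart_flags({
    "site_01": np.array([4.1, 3.9, 4.3, 4.0, 4.2]),
    "site_02": np.array([4.0, 4.1, 3.8, 4.2]),
    "site_03": np.array([6.5, 6.8, 6.4, 6.9, 6.6, 6.7]),
})
print(flags)  # site_03 sits far above the study reference
```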
Statistical summaries for outcomes of interest (such as those mentioned later in this article) are visually inspected for total score/result inconsistencies, especially for sites and subjects that were flagged using multivariate inliers, outliers, and correlations.15 Subject-level outcomes within flagged sites are then visually inspected to identify the source of the inconsistencies. For example, item scores within a PRO questionnaire, such as EQ-5D-5L, might be seen to repeat (i.e., response propagation) for multiple subjects across multiple visits at a site flagged based on site-level correlations.
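An illustrative check for response propagation of this kind might flag subjects whose full item-response vector repeats verbatim across consecutive visits. This is a simplified, hypothetical rule, not a method prescribed in the article:

```python
import numpy as np

def propagation_flags(visits_by_subject, min_repeats=2):
    """Flag subjects whose full item-response vector repeats verbatim
    across consecutive visits (possible response propagation).
    `visits_by_subject` maps subject -> list of per-visit item arrays."""
    flagged = {}
    for subj, visits in visits_by_subject.items():
        repeats = sum(np.array_equal(a, b)
                      for a, b in zip(visits, visits[1:]))
        if repeats >= min_repeats:
            flagged[subj] = repeats
    return flagged

# Hypothetical EQ-5D-5L item responses (five items, three visits).
flags = propagation_flags({
    "subj_001": [[2, 3, 1, 2, 4], [2, 3, 1, 2, 4], [2, 3, 1, 2, 4]],
    "subj_002": [[1, 2, 2, 3, 3], [1, 2, 1, 3, 2], [2, 2, 1, 3, 2]],
})
print(flags)  # subj_001 repeated identical responses twice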
Control charts also are used to compare site-specific responder rates based on count-type data (e.g., a 30% reduction in monthly migraine days by a defined timepoint) to the study reference value. Poisson regression with an offset is used, as sketched below.
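A minimal example of this comparison using statsmodels, with hypothetical site-level counts; flagging on Pearson residuals beyond two standard deviations is one simple convention for the control-chart step:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical site-level data: responder counts and subject counts.
df = pd.DataFrame({
    "site": ["A", "B", "C", "D"],
    "responders": [12, 3, 18, 9],
    "subjects": [40, 35, 44, 30],
})

# An intercept-only Poisson model with log(subjects) as the offset
# models the responder rate per subject; the fitted intercept is the
# study reference rate against which each site is compared.
X = np.ones((len(df), 1))
fit = sm.GLM(df["responders"], X,
             family=sm.families.Poisson(),
             offset=np.log(df["subjects"])).fit()
print("study reference rate:", float(np.exp(np.asarray(fit.params)[0])))

# Pearson residuals behave like control-chart z-scores per site.
df["z"] = np.asarray(fit.resid_pearson)
print(df[df["z"].abs() > 2])  # sites far from the study reference
```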
Indication-specific consistency checks are implemented as applicable. For example, the primary endpoint for schizophrenia protocols is often the Positive and Negative Syndrome Scale (PANSS) total score. The International Society for CNS Clinical Trials and Methodology (ISCTM) convened an expert working group that established consistency/inconsistency flags for the PANSS in 2017. The general strategy was to define irregular (e.g., excessively or insufficiently variable) scoring patterns, as well as incompatibility in scoring among items within the scale (i.e., cross-sectionally) and between assessments (i.e., longitudinally). The working group identified and classified 24 flags based on the extent to which they suggest error (possibly, probably, very probably, or definitely) within PANSS items or across repeated assessments. Scientific surveillance includes analyzing these flags for consistency to drive targeted actions.
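The published ISCTM flags are specific and not reproduced here; the sketch below shows what two flags of the same general character, one cross-sectional and one longitudinal, might look like. The thresholds are hypothetical, not the published set:

```python
def panss_consistency_flags(items_by_visit):
    """Two illustrative consistency checks in the spirit of the ISCTM
    working group's PANSS flags (thresholds here are hypothetical):
    a cross-sectional flag for implausibly uniform scoring within a
    visit, and a longitudinal flag for an implausible shift in total
    score between consecutive visits."""
    flags = []
    totals = [sum(items) for items in items_by_visit]
    for i, items in enumerate(items_by_visit):
        if len(set(items)) == 1:  # all 30 items given the same score
            flags.append((i, "uniform item scoring"))
    for i, (prev, curr) in enumerate(zip(totals, totals[1:]), start=1):
        if abs(curr - prev) >= 30:  # hypothetical cut-off
            flags.append((i, "implausible total-score shift"))
    return flags

# Hypothetical PANSS assessments (30 items scored 1-7, three visits).
visit_1 = [4] * 30                   # uniform scoring -> flagged
visit_2 = [3, 4, 2, 5] * 7 + [3, 4]  # mixed scoring
visit_3 = [1, 2, 1, 2] * 7 + [1, 2]  # large drop in total -> flagged
print(panss_consistency_flags([visit_1, visit_2, visit_3]))
```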
The statistical monitoring methods summarized previously are applicable to a broad array of trials, spanning different therapeutic areas and clinical indications such as neuroscience, dermatology, immunology, oncology, cardiovascular outcomes, and others. Table 1 below provides examples of outcomes monitored through scientific surveillance across various therapeutic areas and indications. Key findings and recommended actions are then summarized at the study and site levels, based on these analyses.
Ideally, risks are predicted early in studies, before they materialize. Early prediction facilitates adjustments to processes and retraining of involved clinical personnel so that problems do not worsen. If, for example, a misleading statement in a case report form completion guideline leads study sites to conduct an assessment at the wrong time in subjects’ visits, then early identification of the wrong assessment times, using early on-study data, could facilitate identifying and correcting the misleading statement before many visits are affected.
Risks are predicted using Bayesian statistical analysis which, unlike many traditional analyses, can generate probabilities about future (or otherwise unknown) events.
For instance, Bayesian prediction can generate the probability that, once the last subject completes the last visit, the to-be-observed mean deviations per subject will exceed a pre-specified threshold corresponding to some minimally acceptable level of trial quality.
In scientific surveillance, Bayesian predictions about variables of interest consist of probabilities that future, to-be-observed proportions of subjects, or means over subjects, will meet given criteria, assuming a specific number of future subjects will be observed. The criteria usually consist of upper and/or lower thresholds.
While the choices of thresholds may vary, they often correspond to one and two times the 95% upper one-sided asymptotic limits based on the trial sample size and the prior expected value selected.
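For a binary criterion, such a prediction can be sketched with a beta-binomial predictive distribution. The flat Beta(1, 1) prior and the 15% threshold below are hypothetical placeholders, not the asymptotic limits described above:

```python
import numpy as np
from scipy.stats import betabinom

def prob_final_rate_exceeds(y, n, n_future, threshold, a=1.0, b=1.0):
    """Bayesian predictive probability that, once all subjects are
    observed, the overall proportion meeting a criterion (e.g.,
    exceeding a deviation threshold) will reach `threshold` or more.
    Beta(a, b) prior; y events among n subjects so far."""
    post_a, post_b = a + y, b + n - y          # Beta posterior
    need = max(0, int(np.ceil(threshold * (n + n_future))) - y)
    # P(future events >= need) under the beta-binomial predictive law.
    return float(betabinom.sf(need - 1, n_future, post_a, post_b))

# Hypothetical interim: 8 of 60 subjects exceed the deviation
# threshold, 140 subjects remain; chance the final rate reaches 15%?
print(prob_final_rate_exceeds(y=8, n=60, n_future=140, threshold=0.15))
```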
In efficacy analyses of clinical trials of investigational drugs, placebo response refers to a tendency of placebo-treated subjects’ efficacy endpoints to improve due to the psychological effects of trial participation rather than any pharmacological effects.
While trial sponsors usually account for expected placebo response levels when designing trials, the actual level in any trial may exceed those expectations, jeopardizing assay sensitivity, i.e., the ability to distinguish an effective treatment from a less effective or ineffective treatment (as defined in ICH E10 §1.5). Therefore, methods following Hartley (2012) for continuous endpoints16 and Hartley (2015) for binary endpoints17 have been developed and implemented for measuring, without unblinding treatment codes, the Bayesian probabilities that the placebo response exceeds expectations, and that it exceeds expectations by some clinically important amount or more.
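The Hartley methods are beyond a short sketch; as a crude blinded analogue for the binary case, one can compute the posterior probability that the pooled, still-blinded response rate exceeds its design-stage expectation (plus a clinically important margin). This is a much simpler screen than the cited approach, with hypothetical prior and inputs:

```python
from scipy.stats import beta

def prob_pooled_rate_exceeds(y, n, expected, margin=0.0, a=1.0, b=1.0):
    """Posterior probability that the pooled, treatment-blinded
    response rate exceeds its design-stage expectation (plus an
    optional clinically important margin). Beta(a, b) prior on the
    pooled rate; y responders among n blinded subjects. A crude
    screen, not the Hartley methods cited in the article."""
    return float(beta.sf(expected + margin, a + y, b + n - y))

# Hypothetical blinded interim: 55 responders among 120 subjects
# against a design expectation of a 35% pooled response rate.
print(prob_pooled_rate_exceeds(y=55, n=120, expected=0.35))
print(prob_pooled_rate_exceeds(y=55, n=120, expected=0.35, margin=0.10))
```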
The site correlation method described earlier only includes sites with a minimum of three subjects. It cannot incorporate smaller sites.
Similarly, site performance on efficacy measures can be effectively assessed using the methods described in this article only once a minimum number of subjects have completed assessments for the relevant post-baseline visit. Once this is the case, potential actions can be put forward.
The subject-level methods apply only to subjects with data from at least two post-baseline visits. Due to the limitations of blinded data, results are interpreted with caution. Overinterpretation of subject-level responses is avoided.
The Bayesian placebo response assessments apply only to binary and continuous (normally distributed) data in two-group blinded parallel-group trials.
Scientific surveillance is well integrated into the layered approach to protecting data integrity within centralized monitoring, as shown in Figure 4 below.
As with all risks, the process includes documenting risks that are detected via scientific surveillance in the Risk Assessment and Categorization Tool (RACT).18
Each analysis is documented in the centralized monitoring plan, along with a brief description, and is applied at the appropriate cadence. This is typically in alignment with centralized monitoring reviews, after an adequate number of randomized subjects have primary endpoint data available, per the limitations noted above.
Actions might target trial conduct, such as evaluating site processes or improving subject-caregiver engagement, or might target the analysis plan, such as updating analysis set definitions to exclude subjects from sites with serious concerns or adding sensitivity analyses to stress-test the impact on analyses of primary and key secondary endpoints. Data correction may be performed where appropriate.
Incorporating advanced statistical monitoring methods in centralized monitoring significantly improves scientific integrity in clinical trials, especially those with PROs or ClinROs. Implementing scientific surveillance within the layered framework of centralized monitoring facilitates risk identification from multiple angles, ultimately contributing to a holistic risk detection and mitigation process and tighter quality control. Trials of the future need to be more resilient to environmental disruptions and evolve with technological advancements.
Implementing scientific surveillance can be quite effective in detecting data errors that carry the highest potential to jeopardize study integrity. These advanced statistical monitoring methods are particularly useful as the clinical trial landscape shifts toward decentralization, coupled with continuous technological advancements in how we collect subject data.
The authors would like to acknowledge John Hamlet, Amy Kroeplin, Dorota Nieciecka, Christopher Perkins, and Timothy Peters-Strickland for the assistance they provided in developing this article.