How to account and adjust for covariates in clinical trial randomization, and be confident about uncertainty.
The unease that many feel about randomization in clinical trials, in particular the worry about hidden covariates and their possible influence, can often be traced, in my opinion, to misunderstandings about what statistical analysis does. The object of this tutorial is not to explain randomization per se (for such an explanation see a previous paper of mine¹ in the pages of this journal or, for a more in-depth account, the classic text by Rosenberger and Lachin²), but to explain how to account for covariates. In so doing, it will, I hope, shed light on randomization. In fact, an explanation of how and why one adjusts for covariates, apart from being useful for understanding a common approach to analyzing clinical trials, also illuminates what randomization can and cannot achieve.
In discussing the benefits of adjusting for covariates, I shall illustrate the value of adjustment not in terms of significance testing but in terms of estimation. I shall assume that a major purpose of any trial is to estimate the effect of treatment. However, being a statistician means never having to say you are certain, although you are required to say how certain you are. A common way to do this is to calculate a confidence interval for the result. Such intervals may be less familiar to some readers than P-values, but they are a long-established tool of statistics and, in particular since their promotion by Martin Gardner and Doug Altman over 30 years ago in the British Medical Journal,³ have become increasingly popular in reporting results in medical journals. For a guide to their use, see the book by Altman, Gardner, and others.⁴
Just as levels of significance can be set at different values (for example, 5%, 1%, 0.1%) when testing is the object, so intervals corresponding to different levels of confidence (95%, 99%, 99.9%) can be calculated when estimation is the purpose. However, just as 5% is the most common level for significance, 95% is the most commonly used value of confidence, and this is what will be illustrated here. Such 95% intervals have the property that, if the assumptions required by the model on which they are based are true, then, over all randomizations, 95% would include the "true" treatment effect.
In that connection, one needs to be aware of three things. First, confidence intervals are a statement about the true average effect for the patients studied. They say nothing in themselves about the individual patient effects. This can be seen quite simply by considering that, other things being equal, the larger the trial, the narrower the confidence interval; but, obviously, patients do not vary more in smaller trials than in larger ones. Second, the confidence property itself is a long-run average. If further information is available that enables one to see that the property of the average is not relevant to the case one is faced with, then the appropriate probability may no longer be 95%. (This will be illustrated in some detail below. However, another sort of information that will not be included in this piece is prior information. For example, if you knew that all previous drugs in a particular class had failed, this might cause you to believe that confidence intervals that excluded no difference from placebo would be less likely than others to include the true effect.) Third, it is not a valid criticism of statistical estimation approaches to say that they do not deliver certainty. Nothing delivers certainty. It is better to be honest about the uncertainties involved than to pretend one knows the unknowable.
I shall illustrate this by simulating 200 trials in asthma in which a bronchodilator is being compared to placebo for its effect on forced expiratory volume in one second (FEV1), a common measure of lung function. Each trial will involve 50 patients allocated to placebo and 50 patients allocated to the bronchodilator. I will illustrate the use of confidence intervals by carrying out 200 unadjusted analyses, one for each trial, and 200 adjusted analyses, again one for each trial. The unadjusted analysis will be based on a simple comparison of means at outcome using a t-test. The adjusted analysis will adjust for the baseline values by fitting them in a regression model that includes them in addition to treatment. This is what statisticians call carrying out an analysis of covariance. Details of the simulation parameters are given in Table 1 below. Note that, because the treatment effect is set at 300 mL and the expected value at outcome under placebo is set at 2200 mL, the expected value at outcome for the bronchodilator is 2200 mL + 300 mL = 2500 mL. All simulations, analyses, and graphs were produced in Genstat® 19.⁵
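To make the unadjusted analysis concrete, here is a minimal sketch of one simulated trial in Python rather than Genstat. Only the 50 patients per arm, the 2200 mL placebo mean, and the 300 mL treatment effect come from the text. The baseline mean, the standard deviation, the baseline-outcome correlation, and the seed are hypothetical values chosen for illustration and should not be read as the Table 1 settings. The t critical value is hard-coded as an approximation so that the sketch needs only the standard library.

```python
import math
import random
import statistics

# Only the arm size, the 2200 mL placebo mean and the 300 mL effect come
# from the text; every other setting below is an assumption.
N_PER_ARM = 50
PLACEBO_MEAN = 2200.0   # mL, from the text
EFFECT = 300.0          # mL, from the text
BASELINE_MEAN = 2000.0  # mL, assumed
SD = 150.0              # mL, assumed for both baseline and outcome
RHO = 0.7               # assumed baseline-outcome correlation
T_CRIT_98 = 1.984       # approx. 97.5% point of t with 98 df (50 + 50 - 2)


def simulate_arm(rng, n, outcome_mean):
    """Return (baseline, outcome) lists for one arm, correlated via RHO."""
    baseline, outcome = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        baseline.append(BASELINE_MEAN + SD * z1)
        outcome.append(outcome_mean + SD * (RHO * z1 + math.sqrt(1 - RHO**2) * z2))
    return baseline, outcome


def unadjusted_ci(y_placebo, y_active):
    """Difference in outcome means with a pooled-variance 95% t interval.

    The hard-coded critical value is only appropriate for 98 degrees of freedom.
    """
    n0, n1 = len(y_placebo), len(y_active)
    diff = statistics.fmean(y_active) - statistics.fmean(y_placebo)
    pooled_var = ((n0 - 1) * statistics.variance(y_placebo)
                  + (n1 - 1) * statistics.variance(y_active)) / (n0 + n1 - 2)
    se = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    return diff, diff - T_CRIT_98 * se, diff + T_CRIT_98 * se


rng = random.Random(2024)  # arbitrary seed
_, y_placebo = simulate_arm(rng, N_PER_ARM, PLACEBO_MEAN)
_, y_active = simulate_arm(rng, N_PER_ARM, PLACEBO_MEAN + EFFECT)
estimate, lower, upper = unadjusted_ci(y_placebo, y_active)
print(f"estimate {estimate:.0f} mL, 95% CI {lower:.0f} to {upper:.0f} mL")
```

The baseline values are generated but deliberately ignored here, which is exactly what the unadjusted analysis does.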
Table 1. Parameter settings for the simulation of some parallel group trials in asthma. The variable being measured is assumed to be FEV1.
The covariate that I shall use here is the baseline measurement of the value used for outcome. This is the simplest and probably most common example of a covariate. However, the argument is perfectly general and would carry over for other covariates.
The results obtained by carrying out a simple analysis (just comparing means at outcome) are represented in Figure 1 below. The horizontal axis, labelled “sample number,” gives the number, from 1 to 200, of each of the simulated clinical trials. The vertical axis gives the point estimate (either a black diamond or a red circle) and the lower and upper confidence limits (respectively, the lower and upper ends of the accompanying whisker). A horizontal dashed line shows the "true" value of the treatment effect of 300 mL. We are privileged to know this because we have carried out a simulation. It is important to realize that, in practice, we will never know this.
Figure 1. Unadjusted point estimates and confidence intervals for 200 simulated trials in asthma (units are mL of FEV1).
The 95% confidence limits either include this value of 300 mL or they don’t. If they don’t, a red pen has been used and the point estimate is a red circle. If they do, a black pen has been used and the point is a black diamond. If the reader counts carefully, they will see that 10 of the confidence intervals exclude the true value and it thus follows that 190 include the true value. Thus, 95% of the 95% confidence intervals include the true value. This is a property they are constructed to have.
Figure 2 (see below) is a version of Figure 1, showing just the first 10 out of the 200 cases simulated. The first case, sample 1, gives an example of where the confidence interval covers the "true" value of 300 mL: the lower limit is 223 mL, the upper limit is 331 mL, and the point estimate is 277 mL. For sample 7, however, the interval does not cover the true value: the lower limit is 312 mL and, therefore, above the true value. This is, of course, regrettable, but in any given case we could not know that this was so. Some intervals will include the true value and some will not, and we cannot tell which, but we can assign a probability to the value being included and, provided that nothing else is known, this is appropriate. Returning now to Figure 1, 190 out of 200 intervals, that is to say 95%, cover the true value.
Figure 2. Unadjusted point estimates and confidence intervals for the first 10 simulated trials in Figure 1.
The fact that this is so is partly fortuitous. In any given simulation of 200 trials, it would not invariably be the case that 190 of the 95% confidence intervals would include the true value. In any case, for any trial, the confidence interval may or may not include the true value. The point is, however, that if nothing else is known, this long-run confidence would be an appropriate value to attach to any given estimate.
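How fortuitous is a count of exactly 190? As a rough guide, and under the idealized assumption that the 200 intervals are independent and that each covers the true value with probability exactly 0.95, the number of covering intervals X follows a binomial distribution:

```latex
X \sim \mathrm{Binomial}(200,\ 0.95), \qquad
\mathrm{E}[X] = 200 \times 0.95 = 190, \qquad
\mathrm{SD}(X) = \sqrt{200 \times 0.95 \times 0.05} = \sqrt{9.5} \approx 3.1
```

So 190 is the expected count, but a repeat of the whole simulation giving a few more or a few fewer covering intervals would be unremarkable.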
Suppose, in fact, that we have baseline values available. How should this change our attitude to what we have just seen? Figure 3 below shows the 200 confidence intervals given in Figure 1, except that they are now no longer plotted in the order in which they were simulated but against the mean difference at baseline for the simulated data. It can be seen that the confidence intervals that fail to include the true parameter value tend to be at either edge of the graph. Roughly, the intervals are less likely to include the true parameter value when there is an imbalance at baseline, favoring either the placebo group or the bronchodilator group. The theory explaining this is well understood. To use some statistical jargon, the baselines provide a means of identifying a recognizable subset, a group for which the average no longer applies. An analogy from life insurance may be helpful here. A life-table may provide a reasonable assessment of the expectation of life of a male aged 50. However, if it is based on an "average" population, then it will overstate the expectation, other things being equal, for a 50-year-old male who is a smoker and understate it if he is not. Thus, knowledge of an individual’s smoking status will lead to a revised estimate. Such prognostic factors should be taken into account when making probability calculations; otherwise, one may be misled. What applies in life generally also applies to the analysis of clinical trials.
Figure 3. The 200 confidence intervals plotted against the difference at baseline between the two groups.
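The link between baseline imbalance and failure to cover can be checked with a short simulation. The sketch below is not the author's Genstat code: it uses the 300 mL effect, 2200 mL placebo mean, 50 patients per arm, and 200 trials from the text, while the baseline mean, standard deviation, correlation, and seed are assumptions made purely for illustration. It compares the miss rate of unadjusted intervals in the quarter of trials with the largest absolute baseline difference against the miss rate in the remaining three quarters.

```python
import math
import random
import statistics

# Effect, placebo mean, arm size and the 200 trials come from the text;
# the remaining settings are assumptions.
N, EFFECT, PLACEBO_MEAN = 50, 300.0, 2200.0
BASELINE_MEAN, SD, RHO = 2000.0, 150.0, 0.7   # assumed
T_CRIT_98 = 1.984  # approx. 97.5% point of t with 98 df


def simulate_arm(rng, outcome_mean):
    """Simulate N correlated (baseline, outcome) pairs for one arm."""
    xs, ys = [], []
    for _ in range(N):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(BASELINE_MEAN + SD * z1)
        ys.append(outcome_mean + SD * (RHO * z1 + math.sqrt(1 - RHO**2) * z2))
    return xs, ys


def run_trials(n_trials, seed):
    """Return one (baseline difference, interval covers truth) pair per trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        x0, y0 = simulate_arm(rng, PLACEBO_MEAN)
        x1, y1 = simulate_arm(rng, PLACEBO_MEAN + EFFECT)
        diff = statistics.fmean(y1) - statistics.fmean(y0)
        # Equal arm sizes, so the pooled variance is the simple average.
        pooled_var = (statistics.variance(y0) + statistics.variance(y1)) / 2
        half_width = T_CRIT_98 * math.sqrt(pooled_var * 2 / N)
        covers = abs(diff - EFFECT) <= half_width
        results.append((statistics.fmean(x1) - statistics.fmean(x0), covers))
    return results


def miss_rate(group):
    """Proportion of intervals in the group that exclude the true effect."""
    return sum(not covers for _, covers in group) / len(group)


def miss_rates_by_imbalance(results):
    """Miss rate for the most balanced 75% of trials and the other 25%."""
    ordered = sorted(results, key=lambda r: abs(r[0]))
    cut = (3 * len(ordered)) // 4
    return miss_rate(ordered[:cut]), miss_rate(ordered[cut:])


balanced_rate, imbalanced_rate = miss_rates_by_imbalance(run_trials(200, seed=1))
print(f"miss rate, most balanced three quarters: {balanced_rate:.3f}")
print(f"miss rate, most imbalanced quarter:      {imbalanced_rate:.3f}")
```

With only 200 trials the two rates are noisy, so increasing the number of trials gives a steadier picture of the same pattern.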
How should one adjust for covariates? The solution is illustrated in Figure 4 below. This shows the original results at baseline and outcome for the 100 patients, 50 in each group, for trial number 200, the last of the 200 that were simulated. What is done in an analysis of covariance is to find the best-fitting parallel lines that one could use to predict FEV1 at outcome using the baseline value. These are shown on the plot: the brown line for the patients under bronchodilator and the blue one for those under placebo. One allows the model to determine how far apart these lines should be. Here, the software has judged that the difference is about 290 mL. The software can also calculate confidence limits for this form of analysis. In this case, the limits are 232 mL (lower limit) and 347 mL (upper limit), so they include the true value of 300 mL. Remember that this is a statement as to where the true average value lies. It is not a statement about effects for individual patients.
Figure 4. The results for trial number 200. The blue downward pointing triangles are values under placebo. The brown upward pointing triangles are values under the bronchodilator.
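The parallel-lines fit can be written out directly. The sketch below uses the textbook analysis of covariance formulas for two groups and one covariate: a common slope from the pooled within-group sums of squares and cross-products, and an adjusted effect equal to the difference in outcome means minus the slope times the difference in baseline means. It is an illustration, not the Genstat routine used for the article. The function names, the simulated data, and the assumed standard deviation, correlation, and baseline mean are all hypothetical, and the default t critical value is an approximation valid only for 100 patients.

```python
import math
import random
import statistics

T_CRIT_97 = 1.985  # approx. 97.5% point of t with 97 df (100 patients, 3 parameters)


def within_sums(x, y):
    """Means plus sums of squares and cross-products about the group means."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return mx, my, sxx, sxy, syy


def ancova(x0, y0, x1, y1, t_crit=T_CRIT_97):
    """Parallel-lines fit: common slope, adjusted effect and 95% limits."""
    mx0, my0, sxx0, sxy0, syy0 = within_sums(x0, y0)
    mx1, my1, sxx1, sxy1, syy1 = within_sums(x1, y1)
    sxx, sxy, syy = sxx0 + sxx1, sxy0 + sxy1, syy0 + syy1
    slope = sxy / sxx                            # common slope of both lines
    effect = (my1 - my0) - slope * (mx1 - mx0)   # vertical gap between the lines
    resid_df = len(x0) + len(x1) - 3             # two intercepts and one slope
    resid_var = (syy - slope * sxy) / resid_df
    se = math.sqrt(resid_var * (1 / len(x0) + 1 / len(x1) + (mx1 - mx0) ** 2 / sxx))
    return slope, effect, effect - t_crit * se, effect + t_crit * se


# Illustrative data only: SD, correlation, baseline mean and seed are assumptions.
rng = random.Random(200)
SD, RHO = 150.0, 0.7


def simulate_arm(outcome_mean, n=50):
    """Simulate n correlated (baseline, outcome) pairs for one arm."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(2000.0 + SD * z1)
        ys.append(outcome_mean + SD * (RHO * z1 + math.sqrt(1 - RHO**2) * z2))
    return xs, ys


x0, y0 = simulate_arm(2200.0)  # placebo
x1, y1 = simulate_arm(2500.0)  # bronchodilator
slope, effect, lower, upper = ancova(x0, y0, x1, y1)
print(f"common slope {slope:.2f}")
print(f"adjusted effect {effect:.0f} mL, 95% CI {lower:.0f} to {upper:.0f} mL")
```

The last term inside the standard error, the squared baseline difference divided by the within-group sum of squares, is the small price paid for estimating the slope; it is larger when the groups happen to be more imbalanced at baseline.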
What happens if we make such an adjusted analysis for all the trials? Such a situation is illustrated in Figure 5 below. This is the analogous plot to Figure 3 but now using the adjusted analyses. A number of differences are noticeable. First, the confidence intervals are narrower. In fact, whereas the average width of the intervals for the unadjusted case given by Figure 3 is 119 mL, for the adjusted case given by Figure 5 it is 86 mL. Using the baseline information has proved useful. Second, it is still the case that not all 95% confidence intervals include the true value. It is in the nature of confidence intervals that they don’t. In fact, if you count the number of intervals that do not include the true value, you will see that there are, again, 10 such intervals, representing 5% of the 200. This is, of course, again partly fortuitous, but it is a long-run property of adjusted 95% confidence intervals, just as of unadjusted ones, that they will cover the true value with 95% probability, provided that the modeling assumptions are correct. Finally, there is no longer any relationship between baseline imbalance and coverage. The confidence intervals that do not cover the true value are not distributed toward either side of the baseline imbalance measure as they are in Figure 3.
In other words, the gain in information that using baseline has brought to the analysis has been consumed by calculating narrower intervals. This is partly a matter of choice. We could keep the intervals at a similar width to what they were before but post a higher coverage probability; but then we should not describe them as 95% intervals. This choice is rarely made. Another way of putting this is to say that the unadjusted analysis automatically made a provision for the fact that baseline values could differ. This provision was made because it used the variation within groups to predict probable differences between them.³ But the variation within groups reflects all sorts of differences that there are between patients. This is an important point that the critics of randomization regularly overlook: the standard analysis makes an allowance for all types of factors that can affect outcome because they also will affect variation within groups.
Figure 5. Confidence intervals for the 200 simulated trials based on analysis of covariance. The adjusted intervals are plotted against the mean difference at baseline between groups.
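The two average widths quoted above, 119 mL unadjusted and 86 mL adjusted, allow some back-of-envelope arithmetic. The sketch below relies on two approximations that are assumptions rather than statements from the article: that the adjusted standard error is roughly the unadjusted one multiplied by the square root of one minus the squared within-group correlation between baseline and outcome, and that a normal distribution is close enough to the t distribution at this sample size. On that basis it infers the correlation implied by the widths and the coverage one would claim if the adjusted analysis kept the old, wider interval width.

```python
import math
from statistics import NormalDist

UNADJUSTED_WIDTH = 119.0  # mL, average width quoted for Figure 3
ADJUSTED_WIDTH = 86.0     # mL, average width quoted for Figure 5

# Ratio of widths, taken as approximately sqrt(1 - rho^2).
ratio = ADJUSTED_WIDTH / UNADJUSTED_WIDTH
implied_rho = math.sqrt(1 - ratio**2)

# If adjusted intervals were stretched back to the unadjusted width, the
# multiplier on the standard error would grow by a factor of 1 / ratio.
std_normal = NormalDist()
z_95 = std_normal.inv_cdf(0.975)
z_wide = z_95 / ratio
coverage_if_width_kept = 2 * std_normal.cdf(z_wide) - 1

print(f"width ratio: {ratio:.2f}")
print(f"implied baseline-outcome correlation: {implied_rho:.2f}")
print(f"coverage if the unadjusted width were kept: {coverage_if_width_kept:.1%}")
```

The implied correlation is an inference from the reported widths, not a parameter taken from Table 1, and the coverage figure is a normal approximation meant only to show the size of the trade-off.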
The lessons for the analysis of clinical trials are the following. Where we have measured identifiable prognostic covariates, we should adjust for them. Doing so moves us from the situation given in Figure 3 to that in Figure 5: there is a gain in precision and, furthermore, we have appropriately taken the information we have into account.
However, if we have adjusted for what we can see, we should not worry about what we can’t see.⁶ Knowing that we have randomized, we know that, on average, our inferences will be correct. This has to suffice. The practical lesson for those working in drug development is to follow the advice of the International Conference on Harmonization guideline⁷ and the European Medicines Agency (EMA).⁸ Identify useful prognostic covariates before unblinding the data. Say you will adjust for them in the statistical analysis plan. Then do so.
Stephen Senn, PhD, FRSE, is a Consultant Statistician.
References