See this topic in the GRADE handbook: Study limitations (Risk of Bias)
The content below is provided by Gordon Guyatt, co-chair of the GRADE working group
Supplemental reading: GRADE guidelines: 4. Rating the quality of evidence—study limitations (risk of bias)
Both randomized controlled trials (RCTs) and observational studies risk yielding misleading results if they are flawed in their design or conduct – what other publications refer to as problems with “validity”, “internal validity”, or “study limitations”, and what we will refer to as “risk of bias”.
What methodological issues to consider when assessing risk of bias
- Concealment of randomization
Those enrolling patients are aware of the group (or period in a cross-over trial) to which the next enrolled patient will be allocated (a major problem in “pseudo” or “quasi” randomized trials with allocation by day of week, birth date, chart number, etc.)
- Blinding
Patient, caregivers, those recording outcomes, those adjudicating outcomes, or data analysts are aware of the arm to which patients are allocated (or the medication currently being received in a cross-over trial)
- Loss to follow-up
Loss to follow-up and failure to adhere to the intention-to-treat principle in superiority trials; or, in non-inferiority trials, loss to follow-up and failure to conduct both an analysis restricted to those who adhered to treatment and an analysis including all patients for whom outcome data are available
- Selective outcome reporting
Incomplete or absent reporting of some outcomes and not others on the basis of the results
- Use of unvalidated outcome measures (e.g., unvalidated patient-reported outcome instruments)
- Stopping early for benefit
How to do the assessment: practical aspects
- Summarizing risk of bias must be outcome specific
- Summarizing risk of bias requires consideration of all relevant evidence
- Existing systematic reviews are often limited in summarizing study limitations across studies
- What to do when there is only one RCT
- Moving from risk of bias in individual studies to rating confidence in estimates across studies
- Application of principles
What methodological issues to consider when assessing risk of bias
1. Concealment of randomization
Although randomization is a powerful technique, it does not always succeed in creating groups with similar prognosis. Investigators may make mistakes that compromise randomization.
When those enrolling patients are unaware and cannot control the arm to which the patient is allocated, we refer to randomization as concealed. In unconcealed trials, those responsible for recruitment may systematically enroll sicker—or less sick—patients to either treatment or control groups. This behavior will compromise the purpose of randomization and the study will yield a biased result. Careful investigators will ensure that randomization is concealed through strategies such as remote randomization, in which the individual recruiting the patient makes a call to a methods center to discover the arm of the study to which the patient is assigned.
Consider, for instance, a trial of β-blockers vs angiotensin-converting enzyme (ACE) inhibitors for hypertension treatment that used opaque numbered envelopes to conceal randomization(1). At the time the study was conducted, evidence suggested that β-blockers were better for patients with heart disease. Significantly more patients with heart disease were assigned to receive β-blockers (P = .037). Also, evidence suggested that ACE inhibitors were better for patients with diabetes mellitus. Significantly more patients with diabetes were assigned to receive ACE inhibitors (P = .048). It is very possible that clinicians were opening envelopes and violating the randomization to ensure patients received what the clinicians believed was the best treatment. Thus, the prognostic balance that randomization could have achieved was prevented.
2. Blinding
If randomization succeeds, treatment and control groups begin with a similar prognosis. Randomization, however, provides no guarantees that the 2 groups will remain prognostically balanced. Blinding is the optimal strategy for maintaining prognostic balance.
Table 2 describes 5 groups involved in clinical trials that, ideally, will remain unaware of whether patients are receiving the experimental therapy or control therapy. Patients who take a treatment that they believe is effective may feel and perform better than those who do not, even if the treatment has no biologic activity. Investigators interested in determining the biologic impact of a treatment will ensure patients are blind to treatment allocation. Similarly, rigorous research designs will ensure blinding of those caring for participants, as well as those collecting, evaluating, and analyzing data. Demonstrations of bias introduced by unblinding—such as the results of a trial in multiple sclerosis in which a treatment benefit judged by unblinded outcome assessors disappeared when adjudicators of outcome were blinded(2)—highlight the importance of blinding. The more subjectivity involved in judging whether a patient has had a target outcome, the more important blinding becomes. For example, blinding of an outcome assessor is unnecessary when the outcome is all-cause mortality.
Finally, differences in patient care other than the intervention under study—cointerventions—can, if they affect study outcomes, bias the results. Effective blinding eliminates the possibility of either conscious or unconscious differential administration of effective interventions to treatment and control groups. When effective blinding is not possible, documentation of potential cointerventions becomes important.
Table 2. Five Groups That Should, if Possible, Be Blind to Treatment Assignment
- Patients: to avoid placebo effects
- Clinicians: to prevent differential administration of therapies that affect the outcome of interest (cointervention)
- Data collectors: to prevent bias in data collection
- Adjudicators of outcome: to prevent bias in decisions about whether or not a patient has had an outcome of interest
- Data analysts: to avoid bias in decisions regarding data analysis
3. Loss to Follow-up
Ideally, at the conclusion of a trial, investigators will know the status of each patient with respect to the target outcome. The greater the number of patients whose outcome is unknown—patients lost to follow-up—the more a study is potentially compromised. The reason is that patients who are lost often have different prognoses from those who are retained—they may disappear because they have adverse outcomes or because they are doing well and so did not return for assessment. The magnitude of the bias may be substantial. A systematic review suggested that up to a third of positive trials reported in high-impact journals may lose significance given plausible assumptions regarding differential loss to follow-up in treatment and control groups.
When does loss to follow-up pose a serious risk of bias? Although you may run across thresholds such as 20% for a serious risk of bias, such rules of thumb are misleading. Consider 2 hypothetical randomized trials, each of which enters 1000 patients into both treatment and control groups, of whom 30 (3%) are lost to follow-up (Table 3). In trial A, treated patients die at half the rate of the control group (200 vs 400), a relative risk (RR) of 50%. To what extent does the loss to follow-up threaten our inference that treatment reduces the death rate by half? If we assume the worst (ie, that all treated patients lost to follow-up died), the number of deaths in the experimental group would be 230 (23%). If there were no deaths among the control patients who were lost to follow-up, our best estimate of the effect of treatment in reducing the relative risk of death drops from 200/400, or 50%, to 230/400, or 58%. Thus, even assuming the worst makes little difference to the best estimate of the magnitude of the treatment effect. Our inference is therefore secure.
Contrast this with trial B. Here, the RR of death is also 50%. In this case, however, the total number of deaths is much lower; of the treated patients, 30 die, and the number of deaths in control patients is 60. In trial B, if we make the same worst-case assumption about the fate of the patients lost to follow-up, the results would change markedly. If we assume that all patients initially allocated to treatment—but subsequently lost to follow-up—die, the number of deaths among treated patients rises from 30 to 60, which is equal to the number of control group deaths. If this assumption is accurate, we would have 60 deaths in both the treatment and control groups and the effect of treatment would drop to 0. Because of this dramatic change in the treatment effect (50% RR if we ignore those lost to follow-up; 100% RR if we assume all patients in the treatment group who were lost to follow-up died), the 3% loss to follow-up in trial B threatens our inference about the magnitude of the RR.
Of course, this worst-case scenario is unlikely. When a worst-case scenario, were it true, substantially alters the results, you must judge the plausibility of a markedly different outcome event rate in the treatment and control group patients lost to follow-up.
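The arithmetic behind these two scenarios is simple enough to make explicit. The sketch below (plain Python; the counts are those of the hypothetical trials A and B described above) computes each trial's observed relative risk and the worst-case relative risk obtained by assuming that every treated patient lost to follow-up died and every control patient lost to follow-up survived.

```python
# Worst-case sensitivity analysis for loss to follow-up, reproducing the
# arithmetic of hypothetical trials A and B: 1000 patients per arm,
# 30 per arm lost to follow-up.

def relative_risk(deaths_rx, deaths_ctl, n_per_arm=1000):
    """Relative risk of death, treatment vs control (equal arm sizes)."""
    return (deaths_rx / n_per_arm) / (deaths_ctl / n_per_arm)

def worst_case(deaths_rx, deaths_ctl, lost=30):
    """Assume all treated patients lost to follow-up died and all
    control patients lost to follow-up survived."""
    return relative_risk(deaths_rx + lost, deaths_ctl)

for label, deaths_rx, deaths_ctl in [("Trial A", 200, 400),
                                     ("Trial B", 30, 60)]:
    print(f"{label}: observed RR = {relative_risk(deaths_rx, deaths_ctl):.1%}, "
          f"worst-case RR = {worst_case(deaths_rx, deaths_ctl):.1%}")

# Trial A: observed RR = 50.0%, worst-case RR = 57.5%  -> inference secure
# Trial B: observed RR = 50.0%, worst-case RR = 100.0% -> inference threatened
```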
The issue is conceptually identical with continuous outcomes: was the loss to follow-up such that reasonable assumptions about differences in outcomes among those lost to follow-up in intervention and control groups could change the overall results in an important way?
Within the context of a systematic review, one can test, for each study and ultimately for the pooled estimate, a variety of assumptions about rates of events in those lost to follow-up when the outcome is a binary variable(3). One can also conduct such sensitivity analyses when the data are continuous(4). Such approaches represent the ideal way of determining whether to rate down for risk of bias as a result of loss to follow-up.
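As a minimal illustration of such an approach for a binary outcome, the sketch below assumes that patients lost to follow-up experienced events at some multiple of the rate observed among those who completed follow-up, and recomputes the relative risk across a range of such multiples. The trial counts and the multiples examined are assumptions chosen for illustration, not data from any study cited here.

```python
# Sensitivity analysis for a binary outcome: vary the assumed event rate
# among patients lost to follow-up, expressed as a multiple of the event
# rate among those followed. All numbers below are illustrative.

def rr_under_assumption(events, followed, lost, mult_rx, mult_ctl):
    """Relative risk assuming lost patients experience events at
    `mult` times the rate of followed patients in each arm."""
    e_rx, e_ctl = events
    f_rx, f_ctl = followed
    l_rx, l_ctl = lost
    total_rx = e_rx + mult_rx * (e_rx / f_rx) * l_rx
    total_ctl = e_ctl + mult_ctl * (e_ctl / f_ctl) * l_ctl
    return (total_rx / (f_rx + l_rx)) / (total_ctl / (f_ctl + l_ctl))

events, followed, lost = (30, 60), (970, 970), (30, 30)
# Scenarios: lost treated patients fare the same as, or up to five times
# worse than, followed patients; lost control patients fare the same.
for mult_rx in (1.0, 2.0, 3.0, 5.0):
    rr = rr_under_assumption(events, followed, lost, mult_rx, 1.0)
    print(f"assumed event-rate multiple {mult_rx:.0f}: RR = {rr:.2f}")
```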
4. Stopping early for benefit
Theoretical considerations(5), simulations(6), and empirical evidence(7) all suggest that trials stopped early for benefit overestimate treatment effects. The most recent empirical work suggests that in the real world formal stopping rules do not reduce this bias, that it is evident in trials stopped early with fewer than 500 events, and that on average the ratio of relative risks in trials stopped early versus the best estimate of the truth (trials not stopped early) is 0.71(8).
Systematic review authors and guideline developers should consider this important source of bias. Systematic reviews should provide sensitivity analyses of results including and excluding studies that stopped early for benefit; if estimates differ appreciably, those restricted to the trials that did not stop early should be considered the more credible. When evidence comes primarily or exclusively from trials stopped early for benefit, authors should infer that substantial overestimates are likely in trials with fewer than 500 events and that large overestimates are likely in trials with fewer than 200 events(8).
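The mechanism of this overestimation can be demonstrated with a small simulation. In the sketch below, every parameter (a true relative risk of 0.75, a 20% control event rate, interim looks every 200 patients per arm, a naive z > 2.58 stopping threshold) is an assumption chosen for illustration; the point is only that selecting trials that cross a threshold at an interim look also selects random highs, so the average effect in trials stopped early exceeds the truth.

```python
import random

# Simulate trials with interim analyses and a naive stopping rule for
# benefit; compare effect estimates in trials stopped early with the
# true relative risk. All parameters are illustrative assumptions.

def simulate_trial(p_ctl=0.20, true_rr=0.75, look_every=200,
                   max_n=2000, z_stop=2.58):
    p_rx = p_ctl * true_rr
    e_rx = e_ctl = n = 0
    while n < max_n:
        for _ in range(look_every):
            n += 1
            e_rx += random.random() < p_rx
            e_ctl += random.random() < p_ctl
        if n == max_n:
            break  # the planned final analysis is not an early stop
        pooled = (e_rx + e_ctl) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se > 0 and (e_ctl - e_rx) / n / se > z_stop:
            return e_rx / max(e_ctl, 1), True   # stopped early for benefit
    return e_rx / max(e_ctl, 1), False          # ran to planned completion

random.seed(1)
stopped, completed = [], []
for _ in range(2000):
    rr, early = simulate_trial()
    (stopped if early else completed).append(rr)

print("true RR: 0.75")
if stopped:
    print(f"stopped early (n={len(stopped)}): "
          f"mean RR = {sum(stopped) / len(stopped):.2f}")
if completed:
    print(f"ran to completion (n={len(completed)}): "
          f"mean RR = {sum(completed) / len(completed):.2f}")
```

With these settings, the mean relative risk among trials stopped early falls below the true value of 0.75; that is, the apparent benefit is exaggerated, in the same direction as the empirical finding cited above.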
5. Selective outcome reporting
When authors report some outcomes and analyses within a trial but not others, on the basis of the results, critics apply the label “selective outcome reporting”. Recent evidence suggests that selective outcome reporting, which tends to produce overestimates of intervention effects, may be widespread(9-13).
For example, a systematic review of the effects of testosterone on erection satisfaction in men with low testosterone identified four eligible trials(14). The largest trial’s results were reported only as “not significant”, and could not, therefore, contribute to the meta-analysis. Data from the three smaller trials suggested a large treatment effect (1.3 standard deviations, 95% confidence interval 0.2 to 2.3). The review authors ultimately obtained the complete data from the larger trial: after including the less impressive results of the large trial, the magnitude of the effect was smaller and no longer statistically significant (0.8 standard deviations, 95% confidence interval −0.05 to 1.63)(15).
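For readers who want to see the mechanics, the sketch below pools standardized mean differences by fixed-effect inverse-variance weighting and shows how a single large, unimpressive trial can pull a pooled estimate toward the null and render it statistically non-significant. The per-trial effects and standard errors are hypothetical; they are not the data from the testosterone review.

```python
# Fixed-effect inverse-variance pooling of standardized mean differences
# (SMDs). The effects and standard errors below are hypothetical.

def pool(trials):
    """Pool (effect, standard error) pairs by inverse-variance weights."""
    weights = [1 / se ** 2 for _, se in trials]
    total = sum(weights)
    est = sum(w * e for (e, _), w in zip(trials, weights)) / total
    se = total ** -0.5
    return est, (est - 1.96 * se, est + 1.96 * se)

small_trials = [(1.5, 0.55), (1.3, 0.60), (1.1, 0.65)]  # hypothetical SMD, SE
large_trial = (-0.15, 0.25)                             # hypothetical

for label, trials in [("three small trials", small_trials),
                      ("all four trials", small_trials + [large_trial])]:
    est, (lo, hi) = pool(trials)
    print(f"{label}: SMD = {est:.2f}, 95% CI {lo:.2f} to {hi:.2f}")

# three small trials: SMD = 1.32, 95% CI 0.65 to 2.00
# all four trials:    SMD = 0.36, 95% CI -0.04 to 0.76
```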
The Cochrane handbook suggests that definitive evidence that selective reporting has not occurred requires access to a protocol developed before the study was undertaken(16). Selective reporting is present if authors acknowledge pre-specified outcomes that they fail to report, or report outcomes incompletely such that they cannot be included in a meta-analysis. One should suspect reporting bias if the study report fails to include results for a key outcome that one would expect to see in such a study, or if composite outcomes are presented without the individual component outcomes.
Note that within the GRADE framework, which rates the confidence in estimates from a body of evidence, suspicion of publication bias in a number of included studies may lead to rating down of quality of the body of evidence. For instance, in the testosterone example above, had the authors not obtained the missing data, they would have considered rating down the body of evidence for the selective reporting bias suspected in the largest study.
How to do the assessment: practical aspects
1. Summarizing risk of bias must be outcome specific
Sources of bias may vary in importance across outcomes. Thus, within a single study, one may have higher quality evidence for one outcome than for another. For instance, RCTs of steroids for acute spinal cord injury measured both all-cause mortality and, based on a detailed physical examination, motor function (24-26). Blinding of outcome assessors is irrelevant for mortality, but crucial for motor function. Thus, as in this example, if the outcome assessors in the primary studies were not blinded, evidence might be categorized for all-cause mortality as having no serious risk of bias, and rated down for motor function by one level on the basis of serious risk of bias.
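One practical consequence is that risk-of-bias bookkeeping needs an entry per outcome, not per study. The toy sketch below illustrates this with the spinal cord injury example; the field names and the rating logic are assumptions made for the illustration, not part of the GRADE specification.

```python
# Purely illustrative: outcome-specific risk-of-bias judgments for a
# trial in which outcome assessors were not blinded. Whether blinding
# of assessors is crucial depends on the subjectivity of the outcome.

OUTCOMES = {
    "all-cause mortality": {"assessor_blinding_crucial": False},
    "motor function":      {"assessor_blinding_crucial": True},
}

def risk_of_bias(outcome, assessors_blinded):
    if OUTCOMES[outcome]["assessor_blinding_crucial"] and not assessors_blinded:
        return "serious (rate down one level)"
    return "not serious"

for outcome in OUTCOMES:
    print(f"{outcome}: {risk_of_bias(outcome, assessors_blinded=False)}")

# all-cause mortality: not serious
# motor function: serious (rate down one level)
```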
2. Summarizing risk of bias requires consideration of all relevant evidence
Every study addressing a particular outcome will differ, to some degree, in risk of bias. Review authors and guideline developers must make an overall judgment, considering all the evidence, whether quality of evidence for an outcome warrants rating down on the basis of risk of bias.
Individual trials achieve a low risk of bias when most or all key criteria are met, and any violations are not crucial. Studies that suffer from one crucial violation – a limitation of crucial importance with respect to the point estimate (in the context of a systematic review) or the decision (in the context of a guideline) – provide only limited-quality evidence. When one or more crucial limitations substantially lower confidence in a point estimate, a body of evidence provides only weak support for inferences regarding the magnitude of a treatment effect.
High quality evidence is available when most studies in a body of evidence meet bias-minimizing criteria. For example, of the 22 trials addressing the impact of beta blockers on mortality in patients with heart failure, most probably or certainly used concealed allocation, all blinded at least some key groups, and follow-up of randomized patients was almost complete(27).
GRADE considers a body of evidence of moderate quality when the best evidence comes from individual studies of moderate quality. For instance, we cannot be confident that, in patients with falciparum malaria, amodiaquine and sulfadoxine-pyrimethamine together reduce treatment failures compared to sulfadoxine-pyrimethamine alone, because the apparent advantage of the combination was sensitive to assumptions regarding the event rate in those lost to follow-up in two of three studies(28).
Surgery versus conservative treatment in the management of patients with lumbar disc prolapse provides an example of rating down two levels due to risk of bias in RCTs(29). We are uncertain of the benefit of open discectomy in reducing symptoms after one year or longer because of very serious limitations in the one credible trial of open discectomy compared to conservative treatment. That trial suffered from inadequate concealment of allocation and unblinded assessment of outcome by potentially biased raters (surgeons) using unvalidated rating instruments (Table 6).
3. Existing systematic reviews are often limited in summarizing study limitations across studies
To rate overall confidence in estimates with respect to an outcome, review authors and guideline developers must consider and summarize study limitations considering all the evidence from multiple studies. For a guideline developer, using an existing systematic review would be the most efficient way to address this issue.
Unfortunately, systematic reviews usually do not address all important outcomes, typically focusing on benefit and neglecting harm. For instance, one must consult separate reviews to assess the impact of beta blockers on mortality(27) and on quality of life(30). No systematic review has addressed beta-blocker toxicity in heart failure patients.
Review authors’ usual practice of rating the quality of studies across outcomes, rather than separately for each outcome, further limits the usefulness of existing systematic reviews for guideline developers. This approach becomes even more problematic when review authors use summary measures that aggregate across quality criteria (e.g., allocation concealment, blinding, loss to follow-up) to provide a single score. These measures are often limited in that they focus on quality of reporting rather than on the design and conduct of the study(31). Furthermore, they tend to be unreliable and less closely correlated with outcome than individual quality components(32-34). These problems arise, at least in part, because calculating a summary score inevitably involves assigning arbitrary weights to different criteria.
Finally, systematic reviews that address individual components of study limitations are often not comprehensive and fail to make transparent the judgments needed to evaluate study limitations. These judgments are often challenging, at least in part because of inadequate reporting: the fact that a safeguard against bias is not reported does not mean it was neglected(35, 36).
Thus, although systematic reviews are often extremely useful in identifying the relevant primary studies, members of guideline panels or their delegates must often review individual studies if they wish to ensure accurate ratings of study limitations for all relevant outcomes. As review authors increasingly adopt the GRADE approach (and in particular as Cochrane review authors do so in combination with using the Cochrane risk-of-bias tool) the situation will improve.
4. What to do when there is only one RCT
Many people are uncomfortable designating a single RCT as high quality evidence. Given the many instances in which the first positive report has not held up under subsequent investigation, this discomfort is warranted. On the other hand, automatically rating down quality when there is a single study is not appropriate. A single, very large, rigorously planned and conducted multi-centre RCT may provide evidence warranting high confidence. GRADE suggests especially careful scrutiny of all relevant issues (risk of bias, precision, directness, publication bias) when only a single RCT addresses a particular question.
5. Moving from risk of bias in individual studies to rating confidence in estimates across studies
Moving from 6 risk of bias criteria for each individual study to a judgment about rating down for quality of evidence for risk of bias across a group of studies addressing a particular outcome presents challenges.
We suggest the following 5 principles:
- Judicious consideration
In deciding on the overall confidence in estimates, one does not average across studies (for instance, if some studies have no serious limitations, some serious limitations, and some very serious limitations, one does not automatically rate quality down by one level due to an average rating of serious limitations). Rather, judicious consideration of the contribution of each study, with a general guide to focus on the higher-quality studies, is warranted.