Tips for learners of evidence-based medicine: 4. Assessing heterogeneity of primary studies in systematic reviews and whether to combine their results
======================================================================================================================================================

* Rose Hatala
* Sheri Keitz
* Peter Wyer
* Gordon Guyatt

© 2005 CMA Media Inc. or its licensors

Clinicians wishing to answer a clinical question quickly may seek a systematic review rather than searching for primary articles. Such a review is also called a meta-analysis when the investigators have used statistical techniques to combine results across studies. Databases useful for this purpose include the Cochrane Library ([www.thecochranelibrary.com](http://www.thecochranelibrary.com)) and the ACP Journal Club ([www.acpjc.org](http://www.acpjc.org); use the search term “review”), both of which are available through personal or institutional subscription. Clinicians can use systematic reviews to guide clinical practice if they are able to understand and interpret the results.

Systematic reviews differ from traditional reviews in that they are usually confined to a single focused question, which serves as the basis for systematic searching, selection and critical evaluation of the relevant research.1 Authors of systematic reviews use explicit methods to minimize bias and consider using statistical techniques to combine the results of individual studies. When appropriate, such pooling allows a more precise estimate of the magnitude of benefit or harm of a therapy. It may also increase the applicability of the result to a broader range of patient populations.

Clinicians encountering a meta-analysis frequently find the pooling process mysterious. Specifically, they wonder how authors decide whether the ranges of patients, interventions and outcomes are too broad to sensibly pool the results of the primary studies. In this article we present an approach to evaluating potentially important differences in the results of individual studies being considered for a meta-analysis. These differences are frequently referred to as heterogeneity.1 Our discussion focuses on the qualitative, rather than the statistical, assessment of heterogeneity (see Box 1).

**Box 1** ([www.cmaj.ca/content/172/5/661/F2](http://www.cmaj.ca/content/172/5/661/F2))

Two concepts are commonly implied in the assessment of heterogeneity. The first is an assessment of heterogeneity within the 4 key elements of the *design* of the original studies: the patients, interventions, outcomes and methods. This assessment bears on the question of whether pooling the results is at all sensible. The second concept relates to assessing heterogeneity among the *results* of the original studies. Even if the study designs are similar, the researchers must decide whether it is useful to combine the primary studies' results.

Our discussion assumes a basic familiarity with how investigators present the magnitude2,3 and precision4 of treatment effects in individual randomized trials.
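For readers who would like a concrete refresher on those measures, the following minimal sketch (in Python, with invented trial counts; it is a worked illustration of the standard formulas, not material from the original series) computes the relative risk, the absolute risk reduction and a 95% confidence interval for the relative risk from a trial's 2 × 2 table.

```python
import math

def effect_measures(events_tx, n_tx, events_ctl, n_ctl):
    """Relative risk, absolute risk reduction and a 95% CI for the RR,
    computed from the 2 x 2 table of a single randomized trial."""
    risk_tx, risk_ctl = events_tx / n_tx, events_ctl / n_ctl
    rr = risk_tx / risk_ctl      # relative risk: 1 means "no difference"
    arr = risk_ctl - risk_tx     # absolute risk reduction: 0 means "no difference"
    # Standard error of ln(RR); the CI is symmetric on the log scale.
    se_ln_rr = math.sqrt(1/events_tx - 1/n_tx + 1/events_ctl - 1/n_ctl)
    ci = (rr * math.exp(-1.96 * se_ln_rr), rr * math.exp(1.96 * se_ln_rr))
    return rr, arr, ci

# Invented trial: 30/1000 events with treatment vs. 50/1000 with control.
rr, arr, ci = effect_measures(30, 1000, 50, 1000)
print(f"RR {rr:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}); ARR {arr:.3f}; NNT {1/arr:.0f}")
# -> RR 0.60 (95% CI 0.38 to 0.94); ARR 0.020; NNT 50
```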
The tips in this article are adapted from approaches developed by educators with experience in teaching evidence-based medicine skills to clinicians.1,5,6 A related article, intended for people who teach these concepts to clinicians, is available online at [www.cmaj.ca/cgi/content/full/172/5/661/DC1](http://www.cmaj.ca/cgi/content/full/172/5/661/DC1).

## Clinician learners' objectives

### *Qualitative assessment of the design of primary studies*

* Understand the concept of heterogeneity of study design among the individual studies included in a systematic review.

### *Qualitative assessment of the results of primary studies*

* Understand how to qualitatively determine the appropriateness of pooling estimates of effect from the individual studies by assessing (1) the degree of overlap of the confidence intervals around the point estimates of effect and (2) the disparity between the point estimates themselves.
* Understand how to estimate the “true” value of the estimate of effect from a graphic display of the results of individual studies.

## Tip 1: Qualitative assessment of the design of primary studies

Consider the following 3 hypothetical systematic reviews. For which of these systematic reviews does it make sense to combine the primary studies?

* A systematic review of all therapies for all types of cancer, intended to generate a single estimate of the impact of these therapies on mortality.
* A systematic review that examines the effect of different antibiotics, such as tetracyclines, penicillins and chloramphenicol, on improvement in peak expiratory flow rates and days of illness in patients with acute exacerbation of obstructive lung disease, including chronic bronchitis and emphysema.7
* A systematic review of the effectiveness of tissue plasminogen activator (tPA) compared with no treatment or placebo in reducing mortality among patients with acute myocardial infarction.8

Most clinicians would instinctively reject the first of these proposed reviews as overly broad but would be comfortable with the idea of combining the results of trials relevant to the third question. What about the second review? What aspects of the primary studies must be similar to justify combining their results in this systematic review?

Table 1 lists features that would be relevant to the question considered in the second review and categorizes them according to the 4 key elements of study design: the patients, interventions, outcomes and methods of the primary studies. Combining results is appropriate when the biology is such that, across the range of patients, interventions, outcomes and study methods, one can anticipate more or less the same magnitude of treatment effect.

**Table 1** ([www.cmaj.ca/content/172/5/661/T1](http://www.cmaj.ca/content/172/5/661/T1))

In other words, the judgment as to whether the primary studies are similar enough to be combined in a systematic review is based on whether the underlying pathophysiology would predict a similar treatment effect across the range of patients, interventions, outcomes and study methods of the primary studies. If you think back to the first systematic review — all therapies for all cancers — you probably recognize that there is significant variability in the pathophysiology of different cancers (“patients” in Table 1) and in the mechanisms of action of different cancer therapies (“interventions” in Table 1).

If you were inclined to reject pooling the results of the studies to be considered in the second systematic review, you might have reasoned that we would expect substantially different effects with different antibiotics, different infecting agents or different underlying lung pathology. If you were inclined to accept pooling of results in this review, you might argue that the antibiotics used in the different studies are all effective against the most common organisms underlying pulmonary exacerbations. You might also assert that the biology of an acute exacerbation of an obstructive lung disease (e.g., inflammation) is similar, despite variability in the underlying pathology. In other words, we would expect more or less the same effect across agents and across patients. Finally, you probably accepted the validity of pooling results for the third systematic review — tPA for myocardial infarction — because you consider that the mechanism of myocardial infarction is relatively constant across a broad range of patients.

### *The bottom line*

* Similarity in the aspects of primary study design outlined in Table 1 (patients, interventions, outcomes, study methods) guides the decision as to whether it makes sense to combine the results of primary studies in a systematic review.
* The range of characteristics of the primary studies across which it is sensible to combine results is a matter of judgment based on the researcher's understanding of the underlying biology of the disease.
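No program can make the biological judgment described above, but laying the 4 design elements of Table 1 side by side can make the judgment easier to apply. The minimal sketch below (in Python; the fields and example entries are our own hypothetical illustrations, not drawn from the studies in the antibiotic review) simply tabulates each element across candidate studies so that the range to be bridged is explicit.

```python
from dataclasses import dataclass

@dataclass
class StudyDesign:
    patients: str       # underlying condition and severity
    intervention: str   # drug (or drug class) and dose
    outcome: str        # what was measured, and when
    methods: str        # randomization, blinding, follow-up

candidates = [
    StudyDesign("acute exacerbation of chronic bronchitis", "tetracycline",
                "peak expiratory flow at 7 days", "randomized, double-blind"),
    StudyDesign("acute exacerbation of emphysema", "penicillin",
                "days of illness", "randomized, unblinded"),
]

# Scanning each design element across the candidate studies makes explicit
# the range over which a similar treatment effect must be assumed.
for element in ("patients", "intervention", "outcome", "methods"):
    print(f"{element}: {sorted({getattr(s, element) for s in candidates})}")
```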
## Tip 2: Qualitative assessment of the results of primary studies

You should now understand that combining the results of different studies is sensible only when we expect more or less the same magnitude of treatment effects across the range of patients, interventions and outcomes that the investigators have included in their systematic review. However, even when we are confident of the similarity in design among the individual studies, we may still wonder whether the results of the studies should be pooled. The following graphic demonstration shows how to qualitatively assess the results of the primary studies to decide if meta-analysis (i.e., statistical pooling) is appropriate. You can find discussions of quantitative, or statistical, approaches to the assessment of heterogeneity elsewhere (see Box 1 or Higgins and associates9).

Consider the results of the studies in 2 hypothetical systematic reviews (Fig. 1A and Fig. 1B). The central vertical line, labelled “no difference,” represents a treatment effect of 0. This would be equivalent2 to a risk ratio or relative risk of 1 or to an absolute or relative risk reduction of 0. Values to the left of the “no difference” line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies represented in the figures, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect. For which systematic review does it make sense to combine results? Decide on the answer to this question before you read on.

**Fig. 1: Results of the studies in 2 hypothetical systematic reviews** ([www.cmaj.ca/content/172/5/661/F3](http://www.cmaj.ca/content/172/5/661/F3)). The central vertical line represents a treatment effect of 0. Values to the left of this line indicate that the treatment is superior to the control, whereas those to the right of the line indicate that the control is superior to the treatment. For each of the 4 studies in each figure, the dot represents the point estimate of the treatment effect (the value observed in the study), and the horizontal line represents the confidence interval around that observed effect.
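Displays of this kind are straightforward to reproduce for your own reading or teaching. Here is a minimal sketch (Python with matplotlib; the numbers are invented in the spirit of Fig. 1B and come from no real trials) that draws 4 point estimates with horizontal confidence intervals around a vertical “no difference” line.

```python
import matplotlib.pyplot as plt

studies = ["Study 1", "Study 2", "Study 3", "Study 4"]
estimates = [-0.04, -0.03, -0.05, -0.02]   # point estimates (risk differences)
ci_low = [-0.08, -0.07, -0.10, -0.06]      # lower 95% CI bounds
ci_high = [0.00, 0.01, 0.00, 0.02]         # upper 95% CI bounds

fig, ax = plt.subplots()
y = list(range(len(studies)))
# errorbar expects distances from each point estimate, not absolute bounds
xerr = [[e - lo for e, lo in zip(estimates, ci_low)],
        [hi - e for e, hi in zip(estimates, ci_high)]]
ax.errorbar(estimates, y, xerr=xerr, fmt="o", color="black", capsize=3)
ax.axvline(0, linestyle="--")              # the "no difference" line
ax.set_yticks(y)
ax.set_yticklabels(studies)
ax.invert_yaxis()                          # first study at the top
ax.set_xlabel("favours treatment        0        favours control")
plt.show()
```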
You have probably concluded that pooling is appropriate for the studies represented in Fig. 1B but not for those represented in Fig. 1A. Can you explain why? Is it because the point estimates for the studies in Fig. 1A lie on opposite sides of the “no difference” line, whereas those for the studies in Fig. 1B lie on the same side of the “no difference” line?

Before you answer this question, consider the studies represented in Fig. 2. Here, the point estimates of 2 studies are on the “favours new treatment” side of the “no difference” line, and the point estimates of 2 other studies are on the “favours control” side. However, all 4 point estimates are very close to the “no difference” line, and, in this case, investigators doing a systematic review will be satisfied that it is appropriate to pool the results. Therefore, it is not the position of the point estimates relative to the “no difference” line that determines the appropriateness of pooling.

**Fig. 2: Point estimates and confidence intervals for 4 studies** ([www.cmaj.ca/content/172/5/661/F4](http://www.cmaj.ca/content/172/5/661/F4)). Two of the point estimates favour the new treatment, and the other 2 point estimates favour the control. Investigators doing a systematic review with these 4 studies would be satisfied that it is appropriate to pool the results.

There are 2 criteria for not combining the results of studies in a meta-analysis: highly disparate point estimates and confidence intervals with little overlap, both of which are exemplified by Fig. 1A.
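These 2 criteria can be turned into a rough computational screen, although the article proposes no numeric rule; the threshold in the sketch below is an arbitrary choice of ours, for illustration only. The sketch flags pairs of studies whose confidence intervals barely overlap.

```python
def overlap_fraction(ci_a, ci_b):
    """Overlap of two confidence intervals, as a fraction of the shorter one."""
    lo, hi = max(ci_a[0], ci_b[0]), min(ci_a[1], ci_b[1])
    shorter = min(ci_a[1] - ci_a[0], ci_b[1] - ci_b[0])
    return max(0.0, hi - lo) / shorter

def crude_heterogeneity_screen(results, min_overlap=0.25):
    """results: list of (point_estimate, (ci_low, ci_high)).
    Returns the pairs of studies whose CIs overlap less than min_overlap."""
    flagged = []
    for i in range(len(results)):
        for j in range(i + 1, len(results)):
            if overlap_fraction(results[i][1], results[j][1]) < min_overlap:
                flagged.append((i, j))
    return flagged

# A Fig. 1A-like pattern: disparate point estimates, little CI overlap.
fig_1a = [(-0.06, (-0.09, -0.03)), (0.05, (0.02, 0.08)),
          (-0.07, (-0.10, -0.04)), (0.06, (0.03, 0.09))]
print(crude_heterogeneity_screen(fig_1a))
# -> [(0, 1), (0, 3), (1, 2), (2, 3)]: many barely overlapping pairs
```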
When pooling *is* appropriate on the basis of these criteria, where is the best estimate of the underlying magnitude of effect likely to be? Look again at Fig. 1B and make a guess. Now look at Fig. 3.

**Fig. 3: Results of the hypothetical systematic review presented in Fig. 1B** ([www.cmaj.ca/content/172/5/661/F5](http://www.cmaj.ca/content/172/5/661/F5)). The pooled estimate at the bottom of the chart (large diamond) provides the best guess as to the underlying treatment effect. It is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials.

The pooled estimate at the bottom of Fig. 3 is centred on the midpoint of the area of overlap of the confidence intervals around the estimates of the individual trials. It provides our best guess as to the underlying treatment effect. Of course, we cannot actually know the “truth” and must be content with potentially misleading estimates. The intent of a meta-analysis is to include enough studies that the confidence interval around the resulting pooled estimate becomes narrow enough to give us confidence in the estimate of benefit for our patients. Thus, our best estimate of the truth will lie in the area of overlap among the confidence intervals around the point estimates of treatment effect presented in the primary studies.
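Statistically, a pooled estimate is usually a weighted average of the individual studies' results, with more precise studies weighing more heavily. The sketch below shows one standard approach (a fixed-effect, inverse-variance-weighted average; the numbers are invented, and real reviews may use other models). It also illustrates why the pooled confidence interval is narrower than that of any single study.

```python
import math

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance-weighted (fixed-effect) pooled estimate with 95% CI."""
    weights = [1.0 / se**2 for se in std_errors]   # precise studies weigh more
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))      # smaller than any single SE
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# Fig. 1B-like studies: similar estimates, overlapping CIs (invented numbers).
estimates = [-0.04, -0.03, -0.05, -0.02]
std_errors = [0.02, 0.02, 0.025, 0.02]
pooled, ci = fixed_effect_pool(estimates, std_errors)
print(f"pooled {pooled:.3f}, 95% CI ({ci[0]:.3f} to {ci[1]:.3f})")
# -> pooled -0.034, 95% CI (-0.054 to -0.013): narrower than any single study
```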
What is the clinician to do when presented with results such as those in Fig. 1A? If the investigators have done a good job of planning and executing the meta-analysis, they will provide some assistance.6 Before examining the study results in detail, they will have generated a priori hypotheses to explain the heterogeneity in magnitude of effect across studies that they are likely to encounter. These hypotheses will include differences in patients (effects may be larger in sicker patients), in interventions (larger doses may result in larger effects), in outcomes (longer follow-up may diminish the magnitude of effect) and in study design (methodologically weaker studies may generate larger effects). The investigators will then have examined the extent to which these hypotheses can explain the differences in magnitude of effect across studies. These subgroup analyses may be misleading, but if they meet 7 criteria suggested elsewhere10 (see Box 2), they may provide credible and satisfying explanations for the variability in results.

**Box 2** ([www.cmaj.ca/content/172/5/661/F6](http://www.cmaj.ca/content/172/5/661/F6))

### *The bottom line*

* Readers can decide for themselves whether there is clinically important heterogeneity among the results of primary studies through a qualitative assessment of the graphic results. This assessment is based on the amount of disparity among the individual point estimates and the degree of overlap among the confidence intervals.

## Conclusions

Understanding the concept of heterogeneity in a systematic review or meta-analysis is central to a full appreciation of the implications of such reviews for clinical practice. We have presented 2 tips aimed at helping clinical readers overcome commonly encountered difficulties in understanding this concept.

### Articles to date in this series

* Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. *CMAJ* 2004;171(4):353-8.
* Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). *CMAJ* 2004;171(6):611-5.
* McGinn T, Wyer PC, Newman TB, Keitz S, Leipzig R, Guyatt G, et al. Tips for learners of evidence-based medicine: 3. Measures of observer variability (kappa statistic). *CMAJ* 2004;171(11):1369-73.

## Footnotes

* This article has been peer reviewed.

*Contributors:* Rose Hatala modified the original ideas for tips 1 and 2, drafted the manuscript, coordinated input from reviewers and field-testing, and revised all drafts. Sheri Keitz used all of the tips as part of a live teaching exercise and submitted comments, suggestions and the possible variations that are described in the article. Peter Wyer reviewed and revised the final draft of the manuscript to achieve uniform adherence to format specifications. Gordon Guyatt developed the original ideas for tips 1 and 2, reviewed the manuscript at all phases of development, contributed to the writing as a coauthor, and, as general editor, reviewed and revised the final draft of the manuscript to achieve accuracy and consistency of content.

*Competing interests:* None declared.

## References

1. Oxman A, Guyatt G, Cook D, Montori V. Summarizing the evidence. In: Guyatt G, Rennie D, editors. *Users' guides to the medical literature: a manual for evidence-based clinical practice*. Chicago: AMA Press; 2002. p. 155-73.
2. Barratt A, Wyer PC, Hatala R, McGinn T, Dans AL, Keitz S, et al, for the Evidence-Based Medicine Teaching Tips Working Group. Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. *CMAJ* 2004;171(4):353-8.
3. Guyatt G, Cook D, Devereaux PJ, Meade M, Straus S. Therapy. In: Guyatt G, Rennie D, editors. *Users' guides to the medical literature: a manual for evidence-based clinical practice*. Chicago: AMA Press; 2002. p. 55-79.
4. Montori VM, Kleinbart J, Newman TB, Keitz S, Wyer PC, Moyer V, et al, for the Evidence-Based Medicine Teaching Tips Working Group. Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). *CMAJ* 2004;171(6):611-5.
5. Wyer PC, Keitz S, Hatala R, Hayward R, Barratt A, Montori V, et al. Tips for learning and teaching evidence-based medicine: introduction to the series. *CMAJ* 2004;171(4):347-8.
6. Montori V, Hatala R, Guyatt G. Summarizing the evidence: evaluating differences in study results. In: Guyatt G, Rennie D, editors. *Users' guides to the medical literature: a manual for evidence-based clinical practice*. Chicago: AMA Press; 2002. p. 547-52.
7. Saint S, Bent S, Vittinghoff E, Grady D. Antibiotics in chronic obstructive pulmonary disease exacerbations. *JAMA* 1995;273:957-60.
8. Held PH, Teo KK, Yusuf S. Effects of tissue-type plasminogen activator and anisoylated plasminogen streptokinase activator complex on mortality in acute myocardial infarction. *Circulation* 1990;82:1668-74.
9. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. *BMJ* 2003;327:557-60.
10. Oxman A, Guyatt G. When to believe a subgroup analysis. In: Guyatt G, Rennie D, editors. *Users' guides to the medical literature: a manual for evidence-based clinical practice*. Chicago: AMA Press; 2002. p. 553-65.