Study heterogeneity

Last updated September 07, 2024

In statistics, (between-)study heterogeneity is a phenomenon that commonly occurs when attempting to undertake a meta-analysis. In an idealized scenario, the studies whose results are to be combined in the meta-analysis would all be undertaken in the same way and to the same experimental protocols. Differences between outcomes would then be due only to measurement error, and the studies would hence be homogeneous. Study heterogeneity denotes the variability in outcomes that goes beyond what would be expected (or could be explained) due to measurement error alone.^[1]

Introduction

Meta-analysis is a method used to combine the results of different trials in order to obtain a quantitative synthesis. The size of individual clinical trials is often too small to detect treatment effects reliably. Meta-analysis increases the power of statistical analyses by pooling the results of all available trials.

When meta-analysis is used to estimate a combined effect from a group of similar studies, the effects found in the individual studies need to be similar enough that one can be confident that a combined estimate will be a meaningful description of the set of studies. However, the individual estimates of treatment effect will vary by chance; some variation is expected due to observational error. Any excess variation (whether or not it is apparent or detectable) is called (statistical) heterogeneity.^[2] The presence of some heterogeneity is not unusual; analogous effects are commonly encountered even within studies, for example in multicenter trials (between-center heterogeneity).

Reasons for the additional variability are usually differences in the studies themselves, the investigated populations, treatment schedules, endpoint definitions, or other circumstances ("clinical diversity"), or the way data were analyzed, what models were employed, or whether estimates have been adjusted in some way ("methodological diversity").^[1] Different types of effect measures (e.g., odds ratio vs. relative risk) may also be more or less susceptible to heterogeneity.^[3]

Modeling

If the origin of heterogeneity can be identified and attributed to certain study features, the analysis may be stratified (by considering subgroups of studies, which would then hopefully be more homogeneous) or extended to a meta-regression that accounts for (continuous or categorical) moderator variables. Unfortunately, literature-based meta-analysis often does not allow for gathering data on all (potentially) relevant moderators.^[4]
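As a rough illustration of the meta-regression idea, the following sketch fits a weighted straight line through study estimates against a single continuous moderator, weighting each study by its inverse variance. The function name, the data, and the restriction to one moderator without a residual heterogeneity term are all assumptions made for the example; they are not taken from the cited literature.

```python
def weighted_meta_regression(moderator, effects, std_errors):
    """Weighted least-squares line: effect = intercept + slope * moderator.

    Weights are inverse variances (1 / se^2). This hypothetical helper
    ignores residual heterogeneity, so it is a fixed-effect sketch only.
    """
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, moderator)) / total
    y_bar = sum(w * y for w, y in zip(weights, effects)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, moderator))
    if sxx == 0.0:
        raise ValueError("moderator must vary between studies")
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, moderator, effects))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: the moderator could be, e.g., dose or mean patient age.
dose = [10.0, 20.0, 30.0, 40.0]
effects = [0.12, 0.25, 0.33, 0.48]
std_errors = [0.10, 0.15, 0.20, 0.12]
intercept, slope = weighted_meta_regression(dose, effects, std_errors)
print(f"intercept = {intercept:.3f}, slope per unit dose = {slope:.4f}")
```

A slope clearly different from zero would suggest that the moderator explains part of the between-study variation.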

In addition, heterogeneity is usually accommodated by using a random effects model, in which the heterogeneity then constitutes a variance component.^[5] The model represents the lack of knowledge about why treatment effects may differ by treating the (potential) differences as unknowns that are drawn from a symmetric distribution. The centre of this distribution describes the average of the effects, while its width describes the degree of heterogeneity. The obvious and conventional choice of distribution is a normal distribution. It is difficult to establish the validity of any distributional assumption, and this is a common criticism of random effects meta-analyses. However, variations of the exact distributional form may not make much of a difference,^[6] and simulations have shown that methods are relatively robust even under extreme distributional assumptions, both in estimating heterogeneity^[7] and in calculating an overall effect size.^[8]
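One common way of writing the normal random effects model is sketched below. The notation is an assumption of this sketch: y_i is the estimate from study i, s_i its standard error, θ_i the study-specific true effect, μ the average effect, τ the heterogeneity standard deviation, and k the number of studies.

```latex
\begin{align}
  y_i \mid \theta_i &\sim \mathcal{N}(\theta_i,\, s_i^2), \\
  \theta_i &\sim \mathcal{N}(\mu,\, \tau^2), \qquad i = 1, \dots, k.
\end{align}
```

Marginally, each estimate then has variance s_i² + τ², so τ² enters as an additional variance component on top of the sampling error.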

Including a random effect in the model makes the inferences (in a sense) more conservative or cautious, as a (non-zero) heterogeneity leads to greater uncertainty (and avoids overconfidence) in the estimation of overall effects. In the special case of a zero heterogeneity variance, the random-effects model reduces to the common-effect model.^[9]
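The widening effect of a non-zero heterogeneity variance can be seen in a small numerical sketch. It assumes the usual inverse-variance weighting, with the heterogeneity variance (called tau2 here) simply added to each study's squared standard error; the function name and the data are invented for illustration.

```python
import math


def pooled_estimate(effects, std_errors, tau2=0.0):
    """Inverse-variance weighted mean and its standard error.

    With tau2 = 0 this is the common-effect estimate; with tau2 > 0 each
    study's variance is inflated by the heterogeneity variance.
    """
    weights = [1.0 / (se**2 + tau2) for se in std_errors]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    return mean, math.sqrt(1.0 / total)


# Made-up effect estimates and standard errors from four studies.
effects = [0.10, 0.30, 0.35, 0.65]
std_errors = [0.10, 0.15, 0.20, 0.12]

common = pooled_estimate(effects, std_errors, tau2=0.0)
random_fx = pooled_estimate(effects, std_errors, tau2=0.04)
print(f"common-effect:  mean = {common[0]:.3f}, SE = {common[1]:.3f}")
print(f"random-effects: mean = {random_fx[0]:.3f}, SE = {random_fx[1]:.3f}")
```

The standard error of the pooled mean is larger in the second call, and the weights become more equal across studies as tau2 grows.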

Common meta-analysis models should, however, not be applied blindly or naively to collected sets of estimates. If the results to be amalgamated differ substantially (in their contexts or in their estimated effects), a derived meta-analytic average may ultimately not correspond to a reasonable estimand.^[10]^[11] When individual studies exhibit conflicting results, there are likely to be reasons why the results differ; for instance, two subpopulations may experience different pharmacokinetic pathways.^[12] In such a scenario, it would be important to both know and consider relevant covariables in an analysis.

Testing

Statistical testing for a non-zero heterogeneity variance is often done based on Cochran's Q^[13] or related test procedures. This common procedure is, however, questionable for several reasons: such tests have low power,^[14] especially in the very common case of only a few estimates being combined in the analysis,^[15]^[7] and homogeneity is specified as the null hypothesis, which is then rejected only in the presence of sufficient evidence against it.^[16]
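For orientation, Cochran's Q is commonly computed as the weighted sum of squared deviations of the study estimates from their inverse-variance weighted mean. The sketch below assumes that definition; the function name and data are made up for the example.

```python
def cochran_q(effects, std_errors):
    """Return Cochran's Q and its degrees of freedom (k - 1)."""
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    # Common-effect (inverse-variance weighted) mean.
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1


effects = [0.10, 0.30, 0.35, 0.65]
std_errors = [0.10, 0.15, 0.20, 0.12]
q, df = cochran_q(effects, std_errors)
print(f"Q = {q:.2f} on {df} degrees of freedom")
```

Under the null hypothesis of homogeneity, Q is approximately chi-squared distributed with k - 1 degrees of freedom, so a p-value could be obtained with, for example, `scipy.stats.chi2.sf(q, df)`. With only a handful of studies this reference distribution leaves the test with little power, which is the concern raised above.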

Estimation

While the main purpose of a meta-analysis usually is estimation of the main effect, investigation of the heterogeneity is also crucial for its interpretation. A large number of (frequentist and Bayesian) estimators is available.^[17] Bayesian estimation of the heterogeneity usually requires the specification of an appropriate prior distribution.^[9]^[18]

While many of these estimators behave similarly when the number of studies is large, their behaviour differs in particular in the common case of only a few estimates.^[19] An incorrect between-study variance estimate of zero is frequently obtained, leading to a false homogeneity assumption. Overall, it appears that heterogeneity is being consistently underestimated in meta-analyses.^[7]
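How a zero estimate can arise is easy to see with a moment-based estimator. The sketch below uses the DerSimonian-Laird estimator, chosen here only as a familiar example of the frequentist estimators referred to above; the function name and the data are assumptions of the example.

```python
def dersimonian_laird_tau2(effects, std_errors):
    """DerSimonian-Laird moment estimate of the heterogeneity variance.

    The raw moment estimate can be negative; it is truncated at zero,
    which is how exact-zero estimates arise.
    """
    k = len(effects)
    if k < 2:
        raise ValueError("at least two studies are required")
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    scale = total - sum(w**2 for w in weights) / total
    return max(0.0, (q - (k - 1)) / scale)


# Three made-up studies whose estimates happen to lie close together.
tau2_small = dersimonian_laird_tau2([0.20, 0.25, 0.22], [0.10, 0.10, 0.10])
# Four made-up studies with clearly scattered estimates.
tau2_large = dersimonian_laird_tau2([0.10, 0.30, 0.35, 0.65],
                                    [0.10, 0.15, 0.20, 0.12])
print(tau2_small, tau2_large)
```

In the first call Q falls below its degrees of freedom, so the estimate is truncated to exactly zero even though the true heterogeneity need not be zero.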

Quantification

The heterogeneity variance is commonly denoted by τ², or the standard deviation (its square root) by τ. Heterogeneity is probably most readily interpretable in terms of τ, as this is the heterogeneity distribution's scale parameter, which is measured in the same units as the overall effect itself.^[18]

Another common measure of heterogeneity is I², a statistic that indicates the percentage of variance in a meta-analysis that is attributable to study heterogeneity (somewhat similarly to a coefficient of determination).^[20] I² relates the magnitude of the heterogeneity variance to the size of the individual estimates' variances (squared standard errors); with this normalisation, however, it is not obvious what exactly would constitute "small" or "large" amounts of heterogeneity. For a constant heterogeneity (τ), the availability of smaller or larger studies (with correspondingly differing standard errors) would affect the I² measure, so the actual interpretation of an I² value is not straightforward.^[21]^[22]
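The dependence of I² on study size can be illustrated with a simplified calculation. The sketch assumes that all studies share the same standard error, so that I² can be written as τ² / (τ² + s²); real meta-analyses use a "typical" within-study variance instead, and the function name is hypothetical.

```python
def i_squared(tau, std_error):
    """I^2 under the simplifying assumption of equal standard errors."""
    tau2 = tau**2
    return tau2 / (tau2 + std_error**2)


tau = 0.2  # the same heterogeneity in both scenarios
print(f"large studies (s = 0.1): I^2 = {i_squared(tau, 0.1):.0%}")
print(f"small studies (s = 0.4): I^2 = {i_squared(tau, 0.4):.0%}")
```

With τ held at 0.2, I² is 80% when the studies are large (s = 0.1) and 20% when they are small (s = 0.4), although the amount of heterogeneity is identical.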

Considering a prediction interval jointly with a confidence interval for the main effect may help to give a better sense of the contribution of heterogeneity to the uncertainty around the effect estimate.^[5]^[23]^[24]^[25]
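The difference between the two intervals can be sketched numerically. The example assumes that the pooled estimate, its standard error, and τ are already available, and it uses a normal quantile for both intervals; this is a simplification, since published proposals for prediction intervals typically use a t-distribution quantile. All names and numbers are invented for illustration.

```python
import math
from statistics import NormalDist


def intervals(mu, se_mu, tau, level=0.95):
    """Approximate confidence and prediction intervals (normal quantile)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Confidence interval: uncertainty about the average effect only.
    ci = (mu - z * se_mu, mu + z * se_mu)
    # Prediction interval: additionally allows for between-study spread.
    half_width = z * math.sqrt(tau**2 + se_mu**2)
    pi = (mu - half_width, mu + half_width)
    return ci, pi


ci, pi = intervals(mu=0.30, se_mu=0.05, tau=0.20)
print(f"95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The confidence interval describes where the average effect plausibly lies, while the wider prediction interval describes where the effect in a new, comparable study might fall.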

References

[1] Deeks, J.J.; Higgins, J.P.T.; Altman, D.G. (2021), "10.10 Heterogeneity", in Higgins, J.P.T.; Thomas, J.; Chandler, J.; Cumpston, M.; Li, T.; Page, M.J.; Welch, V.A. (eds.), Cochrane Handbook for Systematic Reviews of Interventions (6.2 ed.)
[2] Singh, A.; Hussain, S.; Najmi, A.N. (2017), "Number of studies, heterogeneity, generalisability, and the choice of method for meta-analysis", Journal of the Neurological Sciences, 15 (381): 347, doi:10.1016/j.jns.2017.09.026, PMID 28967410, S2CID 31073171
[3] Deeks, J.J.; Altman, D.G. (2001), "Effect measures for meta-analysis of trials with binary outcomes", in Egger, M.; Davey Smith, G.; Altman, D. (eds.), Systematic reviews in health care: Meta-analysis in context (2nd ed.), BMJ Publishing, pp. 313–335, doi:10.1002/9780470693926.ch16, ISBN 9780470693926
[4] Cooper, Harris; Hedges, Larry V.; Valentine, Jeffrey C. (2019-06-14). The Handbook of Research Synthesis and Meta-Analysis. Russell Sage Foundation. ISBN 978-1-61044-886-4.
[5] Riley, R. D.; Higgins, J. P.; Deeks, J. J. (2011), "Interpretation of random-effects meta-analyses", BMJ, 342: d549, doi:10.1136/bmj.d549, PMID 21310794, S2CID 32994689
[6] Bretthorst, G.L. (1999), "The near-irrelevance of sampling frequency distributions", in von der Linden, W.; et al. (eds.), Maximum Entropy and Bayesian methods, Kluwer Academic Publishers, pp. 21–46, doi:10.1007/978-94-011-4710-1_3, ISBN 978-94-010-5982-4
[7] Kontopantelis, E.; Springate, D. A.; Reeves, D. (2013). "A re-analysis of the Cochrane Library data: The dangers of unobserved heterogeneity in meta-analyses". PLOS ONE. 8 (7): e69930. Bibcode:2013PLoSO...869930K. doi:10.1371/journal.pone.0069930. PMC 3724681. PMID 23922860.
[8] Kontopantelis, E.; Reeves, D. (2012). "Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: A simulation study". Statistical Methods in Medical Research. 21 (4): 409–26. doi:10.1177/0962280210392008. PMID 21148194. S2CID 152379.
[9] Röver, C. (2020), "Bayesian random-effects meta-analysis using the bayesmeta R package", Journal of Statistical Software, 93 (6): 1–51, arXiv:1711.08683, doi:10.18637/jss.v093.i06
[10] Cornell, John E.; Mulrow, Cynthia D.; Localio, Russell; Stack, Catharine B.; Meibohm, Anne R.; Guallar, Eliseo; Goodman, Steven N. (2014-02-18). "Random-Effects Meta-analysis of Inconsistent Effects: A Time for Change". Annals of Internal Medicine. 160 (4): 267–270. doi:10.7326/M13-2886. ISSN 0003-4819. PMID 24727843. S2CID 9210956.
[11] Maziarz, Mariusz (2022-02-01). "Is meta-analysis of RCTs assessing the efficacy of interventions a reliable source of evidence for therapeutic decisions?". Studies in History and Philosophy of Science. 91: 159–167. doi:10.1016/j.shpsa.2021.11.007. ISSN 0039-3681. PMID 34922183. S2CID 245241150.
[12] Borenstein, Michael; Hedges, Larry V.; Higgins, Julian P. T.; Rothstein, Hannah R. (2010). "A basic introduction to fixed-effect and random-effects models for meta-analysis". Research Synthesis Methods. 1 (2): 97–111. doi:10.1002/jrsm.12. ISSN 1759-2887. PMID 26061376. S2CID 1040498.
[13] Cochran, W.G. (1954), "The combination of estimates from different experiments", Biometrics, 10 (1): 101–129, doi:10.2307/3001666, JSTOR 3001666
[14] Hardy, R.J.; Thompson, S.G. (1998), "Detecting and describing heterogeneity in meta-analysis", Statistics in Medicine, 17 (8): 841–856, doi:10.1002/(SICI)1097-0258(19980430)17:8<841::AID-SIM781>3.0.CO;2-D, PMID 9595615
[15] Davey, J.; Turner, R.M.; Clarke, M.J.; Higgins, J.P.T. (2011), "Characteristics of meta-analyses and their component studies in the Cochrane Database of Systematic Reviews: a cross-sectional, descriptive analysis", BMC Medical Research Methodology, 11 (1): 160, doi:10.1186/1471-2288-11-160, PMC 3247075, PMID 22114982
[16] Li, W.; Liu, F.; Snavely, D. (2020), "Revisit of test‐then‐pool methods and some practical considerations", Pharmaceutical Statistics, 19 (5): 498–517, doi:10.1002/pst.2009, PMID 32171048, S2CID 212718520
[17] Veroniki, A.A.; Jackson, D.; Viechtbauer, W.; Bender, R.; Bowden, J.; Knapp, G.; Kuß, O.; Higgins, J.P.T.; Langan, D.; Salanti, G. (2016), "Methods to estimate the between-study variance and its uncertainty in meta-analysis", Research Synthesis Methods, 7 (1): 55–79, doi:10.1002/jrsm.1164, PMC 4950030, PMID 26332144
[18] Röver, C.; Bender, R.; Dias, S.; Schmid, C.H.; Schmidli, H.; Sturtz, S.; Weber, S.; Friede, T. (2021), "On weakly informative prior distributions for the heterogeneity parameter in Bayesian random‐effects meta‐analysis", Research Synthesis Methods, 12 (4): 448–474, arXiv:2007.08352, doi:10.1002/jrsm.1475, PMID 33486828, S2CID 220546288
[19] Friede, T.; Röver, C.; Wandel, S.; Neuenschwander, B. (2017), "Meta-analysis of few small studies in orphan diseases", Research Synthesis Methods, 8 (1): 79–91, arXiv:1601.06533, doi:10.1002/jrsm.1217, PMC 5347842, PMID 27362487
[20] Higgins, J. P. T.; Thompson, S. G.; Deeks, J. J.; Altman, D. G. (2003), "Measuring inconsistency in meta-analyses", BMJ, 327 (7414): 557–560, doi:10.1136/bmj.327.7414.557, PMC 192859, PMID 12958120
[21] Rücker, G.; Schwarzer, G.; Carpenter, J.R.; Schumacher, M. (2008), "Undue reliance on I² in assessing heterogeneity may mislead", BMC Medical Research Methodology, 8 (79): 79, doi:10.1186/1471-2288-8-79, PMC 2648991, PMID 19036172
[22] Borenstein, M.; Higgins, J.P.T.; Hedges, L.V.; Rothstein, H.R. (2017), "Basics of meta-analysis: I² is not an absolute measure of heterogeneity", Research Synthesis Methods, 8 (1): 5–18, doi:10.1002/jrsm.1230, hdl:1983/9cea2307-8e9b-4583-9403-3a37409ed1cb, PMID 28058794, S2CID 4235538
[23] Chiolero, A; Santschi, V.; Burnand, B.; Platt, R.W.; Paradis, G. (2012), "Meta-analyses: with confidence or prediction intervals?", European Journal of Epidemiology, 27 (10): 823–5, doi:10.1007/s10654-012-9738-y, PMID 23070657, S2CID 20413290
[24] Bender, R.; Kuß, O.; Koch, A.; Schwenke, C.; Hauschke, D. (2014), Application of prediction intervals in meta-analyses with random effects, Joint statement of IQWiG, GMDS and IBS-DR
[25] IntHout, J; Ioannidis, J.P.A.; Rovers, M.M.; Goeman, J.J. (2016), "Plea for routinely presenting prediction intervals in meta-analysis", BMJ Open, 6 (7): e010247, doi:10.1136/bmjopen-2015-010247, PMC 4947751, PMID 27406637