We reviewed group-randomized trials (GRTs) published in the American Journal of Public Health and in Preventive Medicine from 1998 through 2002 and estimated the proportion of GRTs that employed appropriate methods for design and analysis.
Of 60 articles, 9 (15.0%) reported evidence of using appropriate methods for sample size estimation. Of 59 articles in the analytic review, 27 (45.8%) reported at least 1 inappropriate analysis and 12 (20.3%) reported only inappropriate analyses. Nineteen (32.2%) reported analyses at an individual or subgroup level, ignoring group, or included group as a fixed effect.
Hence increased vigilance is needed to ensure that appropriate methods for GRTs are employed and that results based on inappropriate methods are not published.
During the past 25 years, increased attention has been devoted to exploring the impact of intraclass correlation (ICC) in the design and analysis of group-randomized trials (GRTs) and to identifying appropriate methods for these trials. Despite this attention, periodic reviews of published GRTs have found that many investigators employed methods that do not account for the ICC properly.
A 1990 review of GRTs published in medical and epidemiological journals between 1979 and 1989 (reference 1) found that only 3 (19%) of the 16 reviewed articles accounted for ICC properly in sample size calculations, and only 8 (50%) accounted for ICC in the analysis. A meta-analysis of evaluations of 8 separate trials of a school-based program to prevent drug use reported that only 2 (25%) accounted for ICC in the analysis.2 Simpson et al.3 reviewed all GRTs published in the American Journal of Public Health and Preventive Medicine between 1990 and 1993; they reported that only 4 (19%) of 21 articles included power calculations and only 12 (57%) included analyses that took ICC into account. A more recent review of community health interventions4 included 8 GRTs; only 1 (12%) reported taking ICC into account properly in sample size calculations, though 7 (88%) accounted for ICC in the analysis.
In the meantime, methodologists have continued to focus attention on valid methods for estimating sample size and analyzing data from GRTs; a summary of the work published in the last 5 years is provided in another article in this issue.5 However, no recent review of published GRTs has examined the effect of this increased attention on the practices of investigators who conduct GRTs.
It is important to continue to monitor the published literature to determine the impact of recent methodological developments. Such reviews enable methodologists to determine the extent to which issues of clustering are recognized among investigators and to identify areas that may need further attention. They also alert investigators to attend more closely to the issues that they are missing. The goals of this study were to review GRTs recently published in the American Journal of Public Health and Preventive Medicine, determine the extent to which authors provided evidence of taking ICC into account properly in the design and analysis, and compare our results to prior reviews.
We searched issues of the American Journal of Public Health and Preventive Medicine published from January 1998 to December 2002, inclusive, selecting all articles reporting the results of GRTs. These 2 journals were chosen for review to make possible a direct comparison to the review by Simpson et al.3 GRTs were defined as studies that randomized intact social groups to study conditions but obtained observations from individuals nested within groups; we use the term group to designate the unit of assignment and condition to designate the experimental condition to which the group is assigned. Articles reporting the results of studies in which groups were not randomly assigned to study conditions were excluded, as were studies involving only observations at the group, rather than the individual, level, because these studies do not involve group-level ICC. We also excluded articles that did not include a clear statement indicating that all groups were randomized to study conditions, as well as articles indicating that some groups were randomized whereas others were nonrandomly assigned to conditions.
Each article was reviewed by the first 2 authors and by at least 1 of the other 2 authors to determine whether the article included sample size calculations and analyses taking ICC into account properly. In addition, because the design of a large GRT is often described in detail in a background article that is cited in subsequent publications, we reviewed any references to sample size in background papers cited in the articles.
We reviewed articles to determine whether authors included evidence of taking clustering into account in arriving at the number of groups assigned to study conditions, such as the expected ICC, the group component of variance, or the variance inflation factor (VIF).6 If no such evidence existed, articles were reviewed to determine whether the authors claimed that variance was inflated to account for the expected ICC, even if they provided no details on how it was done.
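To make concrete what such evidence of sample size estimation involves, the following minimal sketch inflates a sample size planned for individual randomization by the VIF, using the standard design-effect formula VIF = 1 + (m − 1) × ICC for equal group sizes of m members. The function names and the planning values (an ICC of 0.02, 50 members per group, 200 participants per condition under individual randomization) are assumptions chosen for illustration and do not come from any reviewed article. The sketch also ignores the additional df penalty that applies when the number of groups is small.

```python
import math


def variance_inflation_factor(icc: float, members_per_group: float) -> float:
    """Design effect for equal-sized groups: VIF = 1 + (m - 1) * ICC."""
    return 1.0 + (members_per_group - 1.0) * icc


def groups_per_condition(n_individual: int, icc: float, members_per_group: int) -> int:
    """Groups needed per condition after inflating an individually
    randomized sample size by the VIF (hypothetical helper)."""
    inflated_n = n_individual * variance_inflation_factor(icc, members_per_group)
    return math.ceil(inflated_n / members_per_group)


# Illustrative planning values (assumptions, not taken from the review).
vif = variance_inflation_factor(icc=0.02, members_per_group=50)
naive_groups = math.ceil(200 / 50)  # ignoring clustering
needed_groups = groups_per_condition(200, icc=0.02, members_per_group=50)
print(f"VIF = {vif:.2f}; groups per condition: {naive_groups} naive vs {needed_groups} adjusted")
```

Under these assumed values, an apparently small ICC roughly doubles the number of groups required per condition.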
Because many articles presented more than 1 analytic strategy, we reviewed each article to determine whether all the analytic approaches used to evaluate intervention effects were appropriate, some were appropriate, or none were appropriate for GRTs. Also, each of the analytic approaches reported by the authors was recorded, along with any justifications offered by authors who reported inappropriate analytic strategies. Disagreements among reviewers on sample size or analytic ratings were resolved through roundtable discussion.
Murray7 and Donner and Klar8 have provided an extensive review of analytic methods appropriate for GRTs; Murray et al.5 have provided a review of even more recent analytic developments. Table 1 presents the criteria used to judge whether analytic approaches reported in each article were appropriate. Methods considered appropriate for GRTs included but were not limited to mixed-model regression approaches, including analysis of variance/analysis of covariance (ANOVA/ANCOVA) and random-coefficients models; 2-stage analyses (analysis on a summary statistic computed at the level of the group, including randomization-based tests); and generalized estimating equations. Because each of these methods may be applied incorrectly, we established additional criteria for rating analyses as appropriately applied; these depend on the design of the study, the assumptions underlying the analytic method, and the robustness of the method to violations of these assumptions. Mixed-model ANOVA/ANCOVAs were considered appropriate if variation at the condition level was assessed against variation at the group level, with degrees of freedom (df) based on the number of groups, and with 1 or 2 time points included in the analysis. If more than 2 time points are included in the analysis of a GRT, a random-coefficients analysis preserves the nominal type I error rate, whereas mixed-model ANOVA/ANCOVAs may not (see reference 9); thus, mixed-model ANOVA/ANCOVAs were considered inappropriate for GRTs with more than 2 time points, whereas random-coefficients analyses were considered appropriate. Two-stage approaches were considered appropriate if the second stage was conducted at the group level with df based on the number of groups.
A generalized estimating equations approach was considered appropriate if the analysis included 40 groups or more, because the type I error rate is unreliable with fewer groups.9,10 Several articles reporting less common analytic methods referenced papers outlining these methods; we reviewed those articles for evidence that the analytic method described was suitable for the analysis of GRTs.
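The rating criteria described above can be summarized as a small decision function. This is a simplified sketch of the rules stated in the text and in Table 1, not the instrument the reviewers used; the method labels, parameter names, and the choice to return None for cases the criteria do not address are all assumptions made for illustration.

```python
from typing import Optional

# Analyses rated inappropriate regardless of design (labels are hypothetical).
NEVER_APPROPRIATE = {
    "individual_level_ignoring_group",
    "subgroup_level_ignoring_group",
    "group_as_fixed_effect",
}


def rate_analysis(method: str, n_groups: int, n_time_points: int,
                  df_based_on_groups: bool = True,
                  small_sample_correction: bool = False) -> Optional[bool]:
    """Return True (appropriate), False (inappropriate), or None (not covered)."""
    if method in NEVER_APPROPRIATE:
        return False
    if method == "mixed_model_anova":
        # Requires df based on the number of groups and at most 2 time points.
        return df_based_on_groups and n_time_points <= 2
    if method == "random_coefficients":
        # The criteria address this approach only for more than 2 time points.
        return True if n_time_points > 2 else None
    if method == "gee":
        # 40 or more groups, or a small-sample correction with fewer groups.
        return n_groups >= 40 or small_sample_correction
    if method == "two_stage":
        # Second stage must be at the group level with df based on groups.
        return df_based_on_groups
    return None  # Less common methods were judged case by case.


print(rate_analysis("gee", n_groups=24, n_time_points=2))                # False
print(rate_analysis("mixed_model_anova", n_groups=20, n_time_points=3))  # False
print(rate_analysis("random_coefficients", n_groups=20, n_time_points=3))  # True
```

An article reporting several analyses would then be classified by whether all, some, or none of its analyses are rated appropriate.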
A total of 60 articles that met the inclusion criteria were identified in the American Journal of Public Health and Preventive Medicine during the period January 1998 to December 2002, inclusive.11–70 We also identified and reviewed 27 background papers referred to in the sample of reviewed articles.16,30,71–95 Twenty-seven (45.0%) of the 60 reviewed articles were published in the American Journal of Public Health, and 33 (55.0%) were published in Preventive Medicine. Table 2 presents design characteristics of the studies described in the articles. The number of GRTs published per year in these journals more than doubled since the earlier review by Simpson et al.,3 from 5.3 per year to 12 per year.
Overall, only 9 (15.0%) of the 60 papers reported an ICC, group component of variance, or VIF used for estimating the sample size for the trial, either in the reviewed article or in a background paper. Authors of an additional 3 (5.0%) articles claimed, either in the reviewed article or in a background paper, that variance had been inflated to account for the expected ICC but provided no evidence such as an ICC, variance components, or VIF.
We excluded 1 article published in Preventive Medicine from the analytic review because the authors did not provide enough detail to determine whether the analytic strategy was appropriate. Table 3 presents the number and percentage of the 59 remaining articles that reported only appropriate methods; some appropriate and some inappropriate methods; or only inappropriate methods. Note that percentages within subsections of Table 3 may not add to the subsection total, because the categories were not mutually exclusive.
Among the 27 articles selected from the American Journal of Public Health, 18 (66.7%) reported only analyses taking ICC into account properly, 5 (18.5%) reported some analyses that took ICC into account properly and some that did not, and 4 (14.8%) reported only analyses that did not take ICC into account properly. The 32 Preventive Medicine articles reviewed included 14 (43.8%) that reported only analyses that took ICC into account properly, 10 (31.3%) that included some analyses that took ICC into account properly and some that did not, and 8 (25.0%) that reported only analyses that did not take ICC into account properly.
Of the 15 articles that reported a mix of appropriate and inappropriate methods, 2 reported methods judged to be inappropriate that did not fall into any of the previously identified categories. One article reported a test of the intervention effect in a stratified analysis with an error term based on group variance rather than variance attributable to the interaction of the group and the stratum. Another article reported an analysis using df based on the number of groups for tests of the main effects, but used individual df for interaction terms.
Nineteen papers (32.2%) reported using analytic methods that ignored the group entirely, that included the group as a fixed effect, or that were conducted at a subgroup level. Among the 27 articles we reviewed from the American Journal of Public Health, 7 papers (25.9%) reported 1 of these analytic methods, whereas the corresponding figure among the 32 articles reviewed from Preventive Medicine was 12 (37.5%).
It has been 25 years since Cornfield first drew attention in the public health literature to the unique design and analytic issues presented by GRTs.96 Since then, an ever-increasing number of papers have appeared presenting comparisons of analytic methods, methods for sample size estimation, and reviews of published GRTs. Recent books on the design and analysis of GRTs have offered a comprehensive overview of relevant methods.7,8 At the same time, analysis software that accommodates data structures common to GRTs has become more readily available and more user-friendly. Thus, many of the barriers that in the past may have prevented investigators from properly designing and analyzing GRTs have been removed. The purpose of this review was to determine the extent to which investigators have employed appropriate design and analytic strategies in recent publications, to describe the appropriate and inappropriate strategies used, and to compare the results with those of previous reviews.
The results of our review of methods used for sample size estimation were discouraging. In the entire sample of studies, less than one fifth provided clear evidence of sample size estimation taking ICC into account. Although it is possible that many of these studies performed such power calculations but did not report them, 27 (45.8%) of the reviewed studies had fewer than 10 groups per condition. Of the 27, only 1 reported evidence of estimating sample size using methods appropriate for GRTs. Thus, it is also likely that many investigators reporting small GRTs planned their study without considering issues of sample size carefully.
The proportion of articles reporting an ICC, a group component of variance, or the VIF (15.0%) was lower than that reported by Simpson et al.,3 who found that 6 (29%) of their 21 articles reviewed reported that information. It is important that investigators who design and analyze GRTs make available the ICC estimates from their studies because it continues to be difficult for investigators to find estimates of ICCs to use in the planning stages of studies. The reporting of ICC estimates, whether in a main results paper or in a separate design or methods paper, should be common practice. Many such articles that focus on reporting ICC estimates and offering unadjusted and adjusted estimates have been published (see reference 5), but as our review shows, a large proportion of the investigators who could make estimates available do not.
The results of the review of analytic methods were mixed. The percentage of articles reporting only appropriate analytic methods (54.2%) was slightly lower than the 57% found in the review by Simpson et al.3 However, this comparison is not entirely fair because our review employed more stringent criteria for evaluating analytic methods as appropriate, based on the results of methodological work published after the review by Simpson et al.3 In particular, we evaluated as inappropriate the use of generalized estimating equations and other asymptotically robust methods with few groups per condition and mixed-model repeated-measures analysis of variance with more than 2 time points, whereas Simpson et al. did not. Of the 59 papers included in the analytic review, 40 (67.8%) avoided methods considered inappropriate at the time of the review by Simpson et al., suggesting that awareness of the problems associated with those methods improved over the last 10 years. At the same time, we were surprised to find that 19 (32.2%) of the studies we reviewed reported analyses considered inappropriate 10 years ago. Clearly, some investigators, reviewers, and journal editors have still not heard or accepted the long-standing warnings against analysis at an individual or subgroup level that ignores the group-level ICC, or an analysis that includes the group as a fixed effect. And certainly some investigators, reviewers, and journal editors are not familiar with the more recent developments that have identified other inappropriate methods. One method for ensuring a more careful screening of the design and analytic methods reported in published GRTs would be to require a review for the journal by a statistician or methodologist familiar with the unique analytic challenges presented by GRTs. Such measures are clearly needed, as reviews of appropriate methods and of published GRTs have not resulted in substantial change since the review by Simpson et al.3
The articles in our review that employed inappropriate analytic strategies generally did so without mentioning issues of clustering, although some did acknowledge clustering issues and offered justifications for ignoring them. One article that reported appropriate methods as well as inappropriate methods stated that the analyses ignoring group were performed because “individual variation in biological attributes could obscure important clinical changes in the school-level analyses as a result of reduced statistical power.”15 We do not disagree with the statement, but we do disagree with its use as justification for an analysis that is likely to carry an inflated type I error rate. The 2 “penalties” of GRTs, variance inflation and limited df, simply cannot be avoided.96,97 Investigators should consider in advance whether the number of groups is adequate to permit the comparisons of interest, and if it is not, they should adopt methods to improve precision, likely to include adding more groups.
Another article that reported both appropriate and inappropriate methods gave as their justification that ICCs such as 0.024, 0.026, and 0.083 were “very small.”50(p158) This claim ignores the fact that variance inflation in a GRT depends on both the ICC and the average number of members per group.96 Ignoring an ICC as small as 0.001 can be dangerous if the number of members per group is large; similarly, even with as few as 10 members per group, an ICC of 0.08 can inflate type I error rates beyond the nominal level.
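The dependence of variance inflation on both the ICC and group size can be illustrated numerically. The sketch below uses the standard formula VIF = 1 + (m − 1) × ICC and a large-sample normal approximation for the actual type I error rate of a two-sided test that ignores the group. The approximation, the function names, and the choice of 1000 members per group for the "large group" case are assumptions for illustration; the calculation also ignores the df penalty, so it is only a rough guide.

```python
from statistics import NormalDist


def variance_inflation_factor(icc: float, members_per_group: float) -> float:
    """VIF = 1 + (m - 1) * ICC for equal-sized groups."""
    return 1.0 + (members_per_group - 1.0) * icc


def naive_type_i_error(icc: float, members_per_group: float, alpha: float = 0.05) -> float:
    """Approximate true type I error rate of a two-sided z test that ignores
    the group. The naive standard error is too small by a factor sqrt(VIF),
    so the test in effect uses the critical value z / sqrt(VIF)."""
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha / 2.0)
    vif = variance_inflation_factor(icc, members_per_group)
    return 2.0 * (1.0 - z.cdf(critical / vif ** 0.5))


# ICC of 0.08 with 10 members per group (from the text), and ICC of 0.001
# with an assumed 1000 members per group.
for icc, m in [(0.08, 10), (0.001, 1000)]:
    vif = variance_inflation_factor(icc, m)
    rate = naive_type_i_error(icc, m)
    print(f"ICC={icc}, m={m}: VIF={vif:.3f}, approximate type I error={rate:.3f}")
```

In both cases the VIF is well above 1 (about 1.72 and 2.00, respectively), and the approximate type I error rate of the naive test is more than double the nominal 5% level.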
One paper reported an appropriate analysis that yielded a negative estimate for the ICC for the outcome variable of interest. The authors concluded that because the group component of variance was estimated as negative, the “true” value of the group component should be zero, and they then reported analyses that ignored the group, with df calculated based on the number of members. Intuitively, this idea makes sense, because variances are squared quantities and cannot theoretically be negative. However, if the true value is zero and estimates are normally distributed around the true value, the estimate would be expected to be negative approximately half the time. Some argue that negative estimates should be set to zero, but simulation studies have demonstrated that this practice depresses power.9,97,98 A negative variance component will result in a negative ICC and a VIF that is less than 1. The variance for the appropriate analysis will actually be larger, then, if variance components are constrained to be zero than if they are allowed to be estimated as negative. We considered the choice to set the group component of variance to zero to be a conservative strategy but judged the analysis overall to be inappropriate because the investigators used df based on the number of individuals. This choice is appropriate when individuals are randomized to study conditions, but in this case, identifiable social groups were assigned to study conditions, and df should be based on the number of groups.96 That distinction will not materially affect the outcome when there are a large number of groups assigned to each condition so that df based on groups are large, but it will if the number of groups per condition is small.
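The claim that a zero group component of variance yields negative estimates about half the time can be checked with a small simulation. The sketch below generates balanced data with no group effect and applies the one-way ANOVA (method-of-moments) estimator of the group component, (MSB − MSW)/m, which is allowed to be negative. The design (10 groups of 20 members), the number of replications, and the seed are arbitrary assumptions, and this estimator is one common choice rather than necessarily the one used in the article discussed.

```python
import random


def anova_group_variance(groups):
    """Method-of-moments estimate of the group component of variance for a
    balanced design: (MSB - MSW) / m. The estimate may be negative."""
    g, m = len(groups), len(groups[0])
    group_means = [sum(x) / m for x in groups]
    grand_mean = sum(group_means) / g
    msb = m * sum((gm - grand_mean) ** 2 for gm in group_means) / (g - 1)
    msw = sum((y - gm) ** 2
              for x, gm in zip(groups, group_means) for y in x) / (g * (m - 1))
    return (msb - msw) / m


def share_negative(n_groups=10, members=20, reps=2000, seed=1):
    """Proportion of replications with a negative estimate when the true
    group component of variance is zero."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(reps):
        data = [[rng.gauss(0.0, 1.0) for _ in range(members)]
                for _ in range(n_groups)]
        if anova_group_variance(data) < 0:
            negatives += 1
    return negatives / reps


print(f"Share of negative estimates: {share_negative():.2f}")
```

With so few groups the sampling distribution of the estimate is somewhat skewed, so the share is expected to fall near, and typically a little above, one half. Truncating those estimates at zero would therefore bias the group component upward on average.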
Three articles in our sample described studies that randomized only 1 group to each study condition. This design presents an analytic challenge without a satisfactory solution in that the group is confounded with the condition; thus, a proper test of the intervention effect is not possible. The authors of these articles offered no justification for employing such a problematic design.
The number of published GRTs continues to increase, as evidenced by our identification of 60 articles in 5 years, compared with the 21 articles found in 4 years in the review by Simpson et al.3 Given this increase, and the expense and effort required to design and implement GRTs, it is imperative that investigators employ appropriate design and analytic methods. It is also imperative that reviewers and journal editors do a better job of screening submissions to prevent publication of GRTs that employ inappropriate analytic methods.
Table 1. Criteria used to judge whether analytic approaches were appropriate

Method                                                                  Appropriate Application in GRTs
Mixed-model methods
  Repeated measures ANOVA/ANCOVA                                        1 or 2 time points
  Random-coefficients approach                                          > 2 time points
Generalized estimating equations
  With small-sample correction                                          < 40 groups included in analysis
  With no correction                                                    ≥ 40 groups included in analysis
2-stage methods (analysis on group means or other summary statistic)    Applied at level of unit of assignment
Post-hoc correction based on external estimates of ICC                  Validity depends on validity of external estimates
Analysis at subgroup level, ignoring group-level ICC                    Not appropriate for GRTs
Analysis at individual level, ignoring group-level ICC                  Not appropriate for GRTs

Note. GRT = group-randomized trial; ANOVA = analysis of variance; ANCOVA = analysis of covariance; ICC = intraclass correlation.

Table 2. Design characteristics of the studies described in the reviewed articles

                                                            Articles
Characteristic                                              No.       %
Number of study conditions
  2                                                          51    85.0
  3                                                           6    10.0
  ≥ 4                                                         3     5.0
Matching or stratification in design
  Matching                                                   22    36.7
  Stratification                                             18    30.0
  Matching and stratification                                 7    11.7
  Randomization without matching or stratification           13    21.7
Type of group
  Schools or colleges                                        17    28.3
  Worksites                                                  11    18.3
  Medical practices                                           9    15.0
  Communities, neighborhoods, or postal networks              9    15.0
  Housing projects or apartment buildings                     3     5.0
  Churches                                                    3     5.0
  Other                                                       8    13.3
Number of groups per condition
  1 group                                                     3     5.0
  2–3 groups                                                  5     8.3
  4–5 groups                                                  7    11.7
  6–12 groups                                                20    33.3
  13–25 groups                                               18    30.0
  > 25 groups                                                 7    11.7
Number of members per group
  < 10 members                                                8    13.3
  10–50 members                                              19    31.7
  51–100 members                                             16    26.7
  > 100 members                                              17    28.3
Number of time points
  1 time point                                                2     3.3
  2 time points                                              34    56.7
  3 time points                                              17    28.3
  4–9 time points                                             6    10.0
  Number of time points varies within study                   1     1.7
Design
  Cohort                                                     38    63.3
  Cross-sectional                                            13    21.7
  Combination of cohort and cross-sectional                   9    15.0
Primary outcome variables
  Smoking prevention or cessation                            17    28.3
  Dietary variables                                          12    20.0
  Health screening                                            7    11.7
  Alcohol, drug, or combination of alcohol, tobacco, drugs    5     8.3
  Multiple health measures                                    5     8.3
  Sun protection                                              3     5.0
  Preventing physical or sexual abuse                         2     3.3
  Physician preventive practices                              2     3.3
  Workplace health and safety measures                        2     3.3
  Other                                                       5     8.3

Table 3. Number and percentage of articles reporting appropriate and inappropriate analytic methods

Criteria                                                                    No. (%) (n = 59)
Articles reporting only appropriate methods                                 32 (54.2)
  Method
    Mixed-model methods with baseline measurement as covariate              10 (16.9)
    Mixed-model ANOVA/ANCOVA approach with 1 or 2 time points                9 (15.3)
    GEEs with ≥ 40 groups                                                    2 (3.4)
    2-stage analysis (analysis of group means or other summary statistics)  12 (20.0)
Articles reporting some appropriate and some inappropriate methods          15 (25.4)
  Appropriate methods
    Mixed-model methods with baseline measurement as covariate               6 (10.2)
    Mixed-model ANOVA/ANCOVA approach with 1 or 2 time points                4 (6.8)
    GEEs with ≥ 40 groups                                                    2 (3.4)
    2-stage analysis                                                         3 (5.1)
  Inappropriate methods
    Analysis at an individual level, ignoring group-level ICC                9 (15.3)
    Analysis at a subgroup level, ignoring group-level ICC                   1 (1.7)
    GEEs or other asymptotically robust method with < 40 groups              3 (5.1)
    Other                                                                    2 (3.4)
Articles reporting only inappropriate methods                               12 (20.3)
  Method
    Analysis at an individual level, ignoring group-level ICC                6 (10.2)
    Analysis at a subgroup level, ignoring group-level ICC                   3 (5.1)
    Analysis with group as a fixed effect                                    1 (1.7)
    Mixed-model ANOVA/ANCOVA approach with > 2 time points                   1 (1.7)
    GEEs with < 40 groups                                                    5 (8.5)

Note. GEE = generalized estimating equation; ANOVA = analysis of variance; ANCOVA = analysis of covariance; ICC = intraclass correlation.