© 2004 American Public Health Association
Correspondence: Requests for reprints should be sent to Kevin L. Delucchi, PhD, Department of Psychiatry, University of California, San Francisco, Box 0984-TRC, 401 Parnassus Ave, San Francisco, CA 94143-0984 (e-mail: kdelucc{at}itsa.ucsf.edu).
I reviewed sample size estimation methods for research designs involving nonindependent data and a dichotomous response variable to examine the importance of proper sample size estimation and the need to align methods of sample size estimation with planned methods of statistical analysis. Examples and references to published literature are provided in this article. When the method of sample size estimation is not in concert with the method of planned analysis, poor estimates may result. The effects of multiple measures over time also need to be considered. Proper sample size estimation is often overlooked. Alignment of the sample size estimation method with the planned analysis method, especially in studies involving nonindependent data, will produce appropriate estimates.
WHEN DESIGNING A STUDY, whether a program evaluation, a survey, a case-control comparison, or a clinical trial, investigators often overlook sample size estimation. For ethical and practical reasons, it is important to accurately estimate the required sample size when one is testing a hypothesis or estimating the size of an effect in observational research.1–3 I seek to advance the existing literature by examining 3 points: (1) the importance of sample size estimation in research, (2) the need for alignment of sample size estimation with the planned analysis, and (3) the special case of a design involving clustered or correlated data and a dichotomous outcome. This discussion is framed primarily in terms of longitudinal study designs, which are more common and probably more familiar to many researchers than cluster-randomized designs. The broader points, however, apply to all research settings in which sample size is important. The more specific issues and methods apply to any design in which the data are nonindependent, such as studies of members of a household, comparisons of entire communities, and multiple measures of the same person. This topic can be framed from 2 separate perspectives: testing hypotheses and estimating parameters. When testing a hypothesis, one is concerned with estimating the number of study participants required to ensure a minimal probability (power) of detecting an effect if it exists. With many public health applications, the goal is not to test a hypothesis but rather to estimate the size of an effect, such as an odds ratio, a correlation coefficient, or a proportion. The focus is on the variation of the estimate, which is expressed by the size of the confidence interval after one asks the question, "If I have a sample of a given size, how large will the confidence interval around my estimate be?" Proper sample size estimation is equally important in both perspectives.
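The estimation perspective can be made concrete with a small sketch. The Python below is an illustration only, assuming a simple random sample and the normal-approximation (Wald) interval for a single proportion; the function names are hypothetical and do not come from any software cited in this article.

```python
from math import ceil, sqrt
from statistics import NormalDist


def wald_ci_half_width(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation interval around a proportion p."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


def n_for_half_width(p: float, half_width: float, confidence: float = 0.95) -> int:
    """Smallest n whose Wald interval is no wider than +/- half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / half_width**2)


# "If I have a sample of a given size, how large will the interval be?"
print(round(wald_ci_half_width(0.5, 100), 3))
# The reverse question: how many participants for a +/- 0.05 interval?
print(n_for_half_width(0.5, 0.05))
```

This approximation ignores clustering and repeated measures, so it is a starting point for the independent-data case rather than a substitute for the aligned methods discussed later.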
I have assumed that the need for sample size estimation in planning a study is both understood and appreciated. This is not a trivial assumption. Lenth2 pointed out that despite the importance of this topic, a limited body of published literature exists on methods of sample size estimation. Hoenig and Heisey3 demonstrated that some of the basic concepts about statistical power are still misunderstood, and Halpern et al.4 recently discussed the continuing appearance of underpowered medical research. When I reviewed the literature, I found surprisingly little evidence of improvement in applying sample size estimation to design studies despite the publication of numerous articles that have pointed to this problem.1,5 In 1988, Freiman et al. replicated a study they had first published in 1978.6 In this follow-up study (published in 1992; reference 7), they concluded, as they had in the original work, that inadequate attention was being paid to the issue of statistical power in randomized clinical trials. Reviews within specialties have consistently found many studies to be underpowered.8–12 Although most of the literature on this topic is written from the experimental or clinical trials perspective, a few publications have addressed the estimation of sample size for confidence intervals.13–15 Volatier et al.16 discussed sample size estimation principles for a dietary survey, and Brogger et al.,17 Bennett et al.,18 and Panagiotakos et al.19 have provided recent examples of study design for effect size estimation. Additionally, several articles have addressed sample size estimation in the context of estimating gene-environment interactions.20–22
When one plans a research study, several steps are needed to estimate the number of study participants. Brief introductions to this subject can be found in articles by Streiner23 and Clark.24 The procedures for estimating a sample size can be summarized as follows: (1) design the study to meet its specific aims; (2) use pilot data and published study results to estimate the effect size, or neighborhood of effect sizes, for each statistical hypothesis to be tested or effect size to be estimated; (3) set the type I error rate (α, usually .05) and minimal required power (1 − β, usually 80%); (4) compute the required number of study participants, or sets of study participants, for each estimated effect size and each tested hypothesis; and (5) if necessary, revise study parameters to accommodate a smaller number of study participants while retaining adequate power.28–30 It should be noted that in actual practice, the sample size estimation process is often more interactive and adaptive (a slightly different version of the process outlined here is provided by Castelloe and O'Brien,25 Maxwell,26 and Cohen27).
With step 4, it is important to use an estimation method that closely matches the planned analysis method.31 Consider a study designed to compare 2 groups of participants on a dichotomous outcome with a logistic regression model to statistically control for a set of covariates. For instance, when one compares smoking rates, 1 group may have slightly higher levels of depression symptoms and a greater average age. To estimate the required sample size for a logistic regression, one requires an estimate of the expected outcome proportions of the 2 conditions (the effect size) plus the level of correlation
If, however, one is unable to estimate that correlation, it may be tempting to use a simple comparison of the 2 proportions as a test for the basis of estimating the sample size. For the sake of the example, if the proportions of the 2 groups are expected to be 0.20 and 0.35 (
In the end, the sample size must be a compromise between the competing demands of good science and available resources of time and budget.
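For readers who want to see what the simple 2-proportion approximation involves, the following Python sketch applies two textbook normal-approximation formulas to the 0.20 versus 0.35 comparison mentioned above. It assumes a 2-sided type I error rate of .05, power of 80%, equal group sizes, and independent observations; these settings and the function names are illustrative assumptions, and the sketch is not the logistic regression method the planned analysis would call for.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def n_two_proportions(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a 2-sided test of 2 independent proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    # Pooled variance under the null, separate variances under the alternative.
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(numerator**2 / (p1 - p2) ** 2)


def n_two_proportions_arcsine(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n after an arcsine (variance-stabilizing) transformation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))  # effect size on the transformed scale
    return ceil(2 * ((z_a + z_b) / h) ** 2)


print(n_two_proportions(0.20, 0.35))
print(n_two_proportions_arcsine(0.20, 0.35))
```

The two functions give similar but not identical answers, which is one reason to check how sensitive an estimate is to the formula chosen.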
For designs in which the outcome data are continuous and nonindependent, a number of references34–37 and software packages30,38–41 provide resources for estimating sample requirements, depending on the planned analysis (see Muller et al.,31 Hedeker et al.,42 and Rochon43 for more complex models).
To illustrate sample size estimation for a dichotomous longitudinal outcome, consider estimating the sample size for a proposed study of smoking rates in 2 groups measured at 3 time points. The analysis plan is to conduct tests for the 3 main effects: a comparison of the rates between the 2 groups, the change over time, and the interaction of group by time. Set the
Use cross-sectional methods to approximate the sample size.
Estimates may vary slightly for even this simple comparison. For example, 107 study participants per group are required if an arcsine transformation is applied to the proportions first, 118 study participants per group are required if a correction for continuity is used, and 117 study participants per group are required if both are used. Rochon's SAS macro43 estimates 111 study participants per group and O'Brien's UnifyPow38 estimates 109 study participants per group when the Pearson
However, the analysis plan calls for testing for changes across time, and a better approximation may be to compare the proportions of study participants who smoked at each time point. This comparison requires a multiple-testing control, such as a Bonferroni-type correction that sets the testwise
Incorporate the across-assessment correlation.
The data in the example represent three 2 × 2 matrices of proportions of smokers by group and by time. The hypothesis of a common odds ratio can be tested with the Cochran–Mantel–Haenszel test45 for comparing binary outcomes between 2 groups while controlling for 1 or more stratifying variables, such as site in a multisite clinical trial. Zhang and Boos46 extended the Cochran–Mantel–Haenszel test to a case in which the outcomes were correlated, and they derived 2 related tests. They also provided power calculations on the basis of Wittes and Wallenstein's research47 by incorporating the population correlation coefficient (the intraclass correlation) into their formula number 3. This incorporation can be applied directly to the example data, which yield estimates (depending on the correlation, assumed to range from .20 to .80) of 47 to 91 study participants per group.

Another version of a method incorporating the nonindependence among study participants in a power analysis comes from research that used the cluster-randomized design, which was discussed by Donner48 and Donner and Klar49 for the continuous case, while methods of power analysis for clustered binary data are discussed by Lee and Durbin,50 Jung et al.,51 and Pan.52 One can conceptualize a repeated-measures design as a cluster-randomized design by thinking of the set of assessments for each participant as the cluster that will be randomized to a group. In this case, the cluster size is fixed; hence, one should use the average assessment-to-assessment correlation as the estimate of the population correlation coefficient, which determines the variance inflation factor in this context. In the example, if one examines the same range of intraclass correlations, from .20 to .80, and uses the formula provided by Donner and Klar,49 one obtains the same sample size estimates of 47 to 91 per group. (If one uses Rochon's program43 and assumes the same proportions across time, the estimates are 53 and 98.)
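The cluster analogy can be illustrated with the standard design-effect calculation. The sketch below is a generic approximation, not a transcription of the Donner and Klar formula for proportions, and the input of 100 independent observations per group is a hypothetical value chosen only to show the arithmetic.

```python
from math import ceil


def variance_inflation_factor(m: int, rho: float) -> float:
    """Design effect for clusters of fixed size m with intraclass correlation rho."""
    return 1 + (m - 1) * rho


def participants_per_group(n_independent_obs: int, m: int, rho: float) -> int:
    """Participants needed per group when each contributes m correlated assessments.

    n_independent_obs is the number of observations per group that would be
    required if every observation were independent (a hypothetical input here).
    """
    inflated_observations = n_independent_obs * variance_inflation_factor(m, rho)
    return ceil(inflated_observations / m)


for rho in (0.2, 0.5, 0.8):
    print(rho, participants_per_group(100, m=3, rho=rho))
```

As the assumed correlation rises, each additional assessment of the same person adds less independent information, so more participants are needed for the between-group comparison.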
Although such methods allow the investigator to take into account correlation across time, I have had to assume that the correlations are equal from time to time (i.e., compound symmetric) and that the test is a simple comparison of 2 proportions. These methods for calculating across-assessment correlation still do not provide estimates for either the test of change over time or the test of group-by-time interaction. As these estimates and Muller et al.31 demonstrate, such approximations can be risky.
Use a fully aligned method.
Rochon's method43 is based on the Wald With the generalized estimating equation (GEE) approach, the correlation of error terms in a model is assumed to be a nuisance in the sense that error terms must be accounted for if one is to obtain robust estimates of the standard errors in the model, but these error terms are not of direct interest. (Lindsey and Lambert54 have argued that such marginal models are not optimal for this analysis and that a mixed model should be used instead.) While the correct specification of the correlational structure will improve efficiency, the estimates of the mean structure will not be biased if the specification is incorrect.
Table 1
Estimating these additional parameters (correlation and shape of the correlation matrix) places an additional burden on the researcher. Just as one may have multiple estimates of effect, one also may have multiple estimates of the additional parameters, and one should check the extent to which the estimated sample sizes vary as the parameter estimates vary.
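One way to carry out such a check is to tabulate the estimated sample size over a grid of plausible parameter values. The Python sketch below does this with a deliberately crude approximation (a 2-proportion normal formula inflated by a compound-symmetric design effect), so it illustrates the bookkeeping of a sensitivity check rather than the GEE-aligned calculation. All parameter values and function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_independent(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Observations per group for 2 independent proportions (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return numerator**2 / (p1 - p2) ** 2


def sensitivity_grid(p1, p2_values, rho_values, m=3):
    """Participants per group for each (p2, rho) pair, assuming m assessments each."""
    grid = {}
    for p2 in p2_values:
        for rho in rho_values:
            design_effect = 1 + (m - 1) * rho
            grid[(p2, rho)] = ceil(n_independent(p1, p2) * design_effect / m)
    return grid


grid = sensitivity_grid(0.20, (0.30, 0.35, 0.40), (0.2, 0.5, 0.8))
for (p2, rho), n in sorted(grid.items()):
    print(f"p2={p2:.2f}  rho={rho:.1f}  participants per group={n}")
```

A table of this kind makes it easy to see which parameter the estimate is most sensitive to before committing to a single number.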
Before considering the effects of these parameters on the sample size estimates, compare the estimates from the fully aligned analysis with the approximations on the basis of the effects provided in the example data, which are summarized in Table 2
When one focuses on the GEE-based estimates that are aligned with the analysis plan, the interaction test requires many more participants than either of the other 2 effects. Such a difference is quite common unless the interaction is very pronounced. Also notice the increase in sample size that accompanies the increase in assumed level of correlation for the treatment effect and the reduction in sample size for the other 2 effects. The reason is that as the correlation from assessment to assessment rises, less information is available from each assessment for the treatment comparisons, but more information is available about the changes over time. Also, the study in this example would be overpowered if we conservatively assumed that no correlation across time (ρ = 0) existed when in fact such a correlation did exist. A study with too many participants is not desirable, because it is unethical and a waste of limited resources to expose more participants to research than necessary.
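The opposite effects of correlation on between-group and within-person comparisons follow from two elementary variance formulas. The sketch below assumes a compound-symmetric structure with a common variance and uses a continuous-outcome simplification to show the direction of the effect; it is not the GEE calculation itself, and the function names are illustrative.

```python
def var_of_subject_mean(m: int, rho: float, sigma2: float = 1.0) -> float:
    """Variance of one participant's average over m equally correlated assessments.

    Between-group comparisons rest on these averages, so a larger value
    means less information per participant about the treatment effect.
    """
    return sigma2 * (1 + (m - 1) * rho) / m


def var_of_within_change(rho: float, sigma2: float = 1.0) -> float:
    """Variance of the difference between 2 assessments of the same participant.

    Tests of change over time rest on these differences, so a smaller value
    means more information per participant about time-related effects.
    """
    return 2 * sigma2 * (1 - rho)


for rho in (0.0, 0.2, 0.5, 0.8):
    print(rho, round(var_of_subject_mean(3, rho), 3), round(var_of_within_change(rho), 3))
```

As the correlation grows, the first quantity rises and the second falls, which matches the pattern of sample sizes described above.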
The relationship of correlational structure to the number of study participants can be seen in greater detail in Figure 2
The assumed shape of the correlation matrix makes almost no difference in the case of treatment effects and makes only a small difference in the case of time-related effects. The differences can be meaningful, however, in cases where more study participants are needed, such as for the interaction effect. If ρ is equal to .50, 164 study participants per group are required under a compound-symmetric assumption, while 226 study participants are necessary under an autoregressive structure. This approach can be applied to both continuous and categorical data, and it allows for more variations than are discussed in this article, including unequally spaced assessments, differential attrition among samples, and unequal number of subjects per group.43
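To make the two correlation shapes concrete, the following sketch builds both working correlation matrices for 3 assessments. It assumes a first-order autoregressive form for the autoregressive structure and equally spaced assessments; the function names are illustrative.

```python
def compound_symmetric(m: int, rho: float) -> list:
    """Every pair of assessments shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]


def autoregressive(m: int, rho: float) -> list:
    """First-order autoregressive: correlation decays as rho ** lag."""
    return [[rho ** abs(i - j) for j in range(m)] for i in range(m)]


for name, matrix in (("compound symmetric", compound_symmetric(3, 0.5)),
                     ("autoregressive", autoregressive(3, 0.5))):
    print(name)
    for row in matrix:
        print("  ", row)
```

With ρ = .50, the first and third assessments correlate at .50 under compound symmetry but only .25 under the autoregressive form, which is the kind of difference that can change the estimate for the interaction effect.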
Use simulations.
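As a rough illustration of what a simulation-based power check could look like, the Python sketch below generates exchangeable correlated binary assessments and counts how often a test rejects. Several pieces are assumptions made for the sake of a short example: the data-generating scheme (each assessment copies a shared participant-level value with probability equal to the square root of ρ), the parameter values, and above all the analysis, which is a simple z test on each participant's proportion of positive assessments rather than a GEE model. In practice the simulated analysis should be the planned analysis.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def simulate_subject(p: float, m: int, rho: float, rng: random.Random) -> float:
    """Proportion of m binary assessments equal to 1 for one participant.

    Each assessment copies a shared Bernoulli(p) value with probability
    sqrt(rho), otherwise it is an independent Bernoulli(p) draw. This gives
    marginal probability p and pairwise correlation rho.
    """
    shared = rng.random() < p
    copy_probability = sqrt(rho)
    total = 0
    for _ in range(m):
        if rng.random() < copy_probability:
            total += shared
        else:
            total += rng.random() < p
    return total / m


def simulated_power(p1, p2, n_per_group, m=3, rho=0.5, alpha=0.05, reps=500, seed=1):
    """Share of simulated studies in which the group difference is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        g1 = [simulate_subject(p1, m, rho, rng) for _ in range(n_per_group)]
        g2 = [simulate_subject(p2, m, rho, rng) for _ in range(n_per_group)]
        se = sqrt(variance(g1) / n_per_group + variance(g2) / n_per_group)
        if se > 0 and abs(mean(g1) - mean(g2)) / se > z_crit:
            rejections += 1
    return rejections / reps


print(simulated_power(0.20, 0.35, n_per_group=100, reps=300))
```

Running the same function with equal proportions in both groups gives an empirical type I error rate, which is a useful check that the simulated test behaves as intended.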
In addition to the examples presented in this article, studies published by Cohen,27 Sedlmeier and Gigerenzer,5 Freiman et al.,6 Thornley and Adams,9 and Bezeau and Graves10 demonstrate that more careful attention to sample sizes used in research is still needed. A poorly conducted sample size estimation can result in a study with very little chance of demonstrating any meaningful effect. The 2 most important considerations when estimating the required number of participants are to align the sample size estimation with the data analysis and to verify the sensitivity of the resultant estimates. Although modern methods for data analysis seem to be expanding at a rapid rate, methods of sample estimation are not far behind, and user-friendly software for conducting sample size estimation is increasingly available. The impact of aligning sample estimation methods with data analytic methods is often overlooked; the closer the methods of estimating sample size are to the methods of analysis, the better the chances are that the actual power achieved will match the level of planned power. Part of the cost of planning a more complex design and analysis derives from the additional information that must be acquired or approximated to accurately estimate how many participants will be required. The effort expended in gathering those pieces of information will necessarily be in proportion to the size of the study and the maturity of the research field in which the study is set. Once the methods are aligned, efforts should be focused on estimating the required parameters, while at the same time one must realize that it is uncommon to be able to base sample size estimates on a single, well-established effect size. It is equally important to recognize that the effect size and some of the other parameters, such as attrition rates, are themselves estimates. The more the estimates of these parameters vary, the more the sample estimates will vary. 
Whereas the scientifically conservative decision in the face of such variation would be to select the largest estimated sample size, this decision may be impractical and may be far in excess of the true requirement. Even well-established estimates of the parameters should be subjected to a sensitivity analysis to determine the extent to which the estimated sample size varies as the parameters vary. Following these recommendations means more work for the investigators planning a study and for the reviewers of proposals and manuscripts, but it is work that pays off in the long run, both for the investigators themselves and for the scientific community as a whole.
This work was supported by National Institute on Drug Abuse grant P50DA09253. Drs David Wasserman, Alan Bostrom, Roger Vaughan, and 3 anonymous reviewers provided many very helpful comments and suggestions.
Human Participant Protection
Peer Reviewed Accepted for publication July 14, 2003.
1. Cohen J. The statistical power of abnormal social and psychological research: a review. J Abnorm Soc Psychol. 1962;65:145–153.
2. Lenth RV. Some practical guidelines for effective sample size determination. Am Statistician. 2001;55:187–193.
3. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Statistician. 2001;55:19–24.
4. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358–367.
5. Sedlmeier P, Gigerenzer G. Do studies of statistical power have an effect on the power of studies? Psychol Bull. 1989;105:309–316.
6. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized controlled trial. Survey of 71 "negative" trials. N Engl J Med. 1978;299:690–694.
7. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial. In: Bailar JC III, Mosteller F, eds. Medical Uses of Statistics. 2nd ed. Boston, Mass: NEJM Books; 1992:357–373.
8. Sloan NL, Jordan E, Winikoff B. Effects of iron supplementation on maternal hematologic status in pregnancy. Am J Public Health. 2002;92:288–293.
9. Thornley B, Adams C. Content and quality of 2000 controlled trials in schizophrenia over 50 years. BMJ. 1998;317:1181–1184.
10. Bezeau S, Graves R. Statistical power and effect sizes of clinical neuropsychology research. J Clin Exp Neuropsychol. 2001;23:399–406.
11. Freedman KB, Bernstein J. Sample size and statistical power in clinical orthopaedic research. J Bone Joint Surg. 1999;81:1454–1460.
12. Dickison K, Bunn F, Wentz R, Edwards P, Roberts I. Size and quality of randomized controlled trials in head injury: review of published studies. BMJ. 2000;320:1308–1311.
13. Beal SL. Sample size determination for confidence intervals on the population mean and on the difference between two populations means. Biometrics. 1989;45:969–977.
14. Daly LE. Confidence intervals and sample sizes: don't throw out all your old sample size tables. BMJ. 1991;302:333–336.
15. Satten GA, Kupper LL. Sample size requirements for interval estimation of the odds ratio. Am J Epidemiol. 1990;131:177–184.
16. Volatier JL, Turrini A, Welten D; EFCOSUM Group. Some statistical aspects of food intake assessment. Eur J Clin Nutr. 2002;56(suppl 2):S46–S52.
17. Brogger J, Bakke P, Eide GE, Gulsvik A. Comparison of telephone and postal survey modes on respiratory symptoms and risk factors. Am J Epidemiol. 2002;155:572–576.
18. Bennett S, Lienhardt C, Bah-Wow O, et al. Investigation of environmental and host-related risk factors for tuberculosis in Africa, II: investigation of host genetic factors. Am J Epidemiol. 2002;155:1074–1079.
19. Panagiotakos DB, Chrysohoou C, Pitsavos C, et al. The association between secondhand smoke and the risk of developing acute coronary syndromes, among non-smokers, under the presence of several cardiovascular risk factors: the CARDIO2000 case-control study. BMC Public Health. 2002;2(1):9.
20. Sturmer T, Brenner H. Flexible matching strategies to increase power and efficiency to detect and estimate gene-environment interactions in case-control studies. Am J Epidemiol. 2002;155:593–602.
21. Yang Q, Khoury MJ, Friedman JM, Flanders DW. On the use of population attributable fraction to determine sample size for case-control studies of gene-environment interaction. Epidemiology. 2003;14:161–167.
22. Umbach DM. On the determination of sample size. Epidemiology. 2003;14:137–138.
23. Streiner DL. Sample size and power in psychiatric research. Can J Psychiatry. 1990;35:616–620.
24. Clark V. Sample size determination. Plast Reconstr Surg. 1991;87:569–573.
25. Castelloe JM, O'Brien RG. Power and Sample Size Determination for Linear Models. Proceedings of the Twenty-Sixth Annual SAS Users Group International Conference, Long Beach, Calif, 22–25 April 2001. Cary, NC: SAS Institute Inc; 2001.
26. Maxwell SE. Sample size and multiple regression analysis. Psychol Methods. 2000;5:434–458.
27. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum; 1988.
28. Kraemer HC. To increase power in randomized clinical trials without increasing sample size. Psychopharmacol Bull. 1991;27:217–224.
29. McAweeney MJ, Klockars AJ. Maximizing power in skewed distributions: analysis and assignment. Psychol Methods. 1998;3:117–122.
30. McClelland GH. Optimal design in psychological research. Psychol Methods. 1997;2:3–19.
31. Muller KE, LaVange LM, Landesman-Ramey S, Ramey CT. Power calculations for general linear multivariate models including repeated measures applications. J Am Stat Assoc. 1992;87:1209–1226.
32. Hsieh FY, Block DA, Larson MD. A simple method for sample size calculation for linear and logistic regression. Stat Med. 1998;17:1623–1634.
33. Hintze J. PASS 2000 [computer software]. Kaysville, Utah: Number Cruncher Statistical Software; 2000.
34. Muller KE, Barton CN. Approximate power for repeated measures ANOVA lacking sphericity. J Am Stat Assoc. 1989;84:549–555.
35. Overall JE, Doyle SR. Estimating sample sizes for repeated measurement designs. Control Clin Trials. 1994;15:100–123.
36. Overall JE, Atlas RS. Power of univariate and multivariate analyses of repeated measurements in controlled clinical trials. J Clin Psychol. 1999;55:465–485.
37. Rochon J. Sample size calculations for two-group repeated-measures experiments. Biometrics. 1991;47:1383–1398.
38. O'Brien RG. A Tour of UnifyPow, A SAS Module/Macro for Sample-Size Analysis. Proceedings of the Twenty-Third Annual SAS Users Group International Conference, Nashville, Tenn, 22–25 March 1998. Cary, NC: SAS Institute Inc; 1998.
39. Elashoff JD. nQuery Advisor [computer software]. Version 4.0. Sagus, Mass: Statistical Solutions; 2000.
40. Ahn C, Overall JE, Tonidandel S. Sample size and power calculations in repeated measurement analysis. Comput Methods Programs Biomed. 2001;64:121–124.
41. EgretSIZ [computer program]. Cytel Software Inc: Cambridge, Mass; 1994.
42. Hedeker D, Gibbons RD, Waternaux C. Sample size estimation for longitudinal designs with attrition: comparing time-related contrasts between two groups. J Educ Behav Stat. 1999;24:70–93.
43. Rochon J. Application of GEE procedures for sample size calculations in repeated measures experiments. Stat Med. 1998;17:1643–1658.
44. Delucchi KL. The use and misuse of chi-square: Lewis and Burke revisited. Psychol Bull. 1983;94:166–176.
45. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719–748.
46. Zhang J, Boos DD. Mantel-Haenszel test statistics for correlated binary data. Biometrics. 1997;53:1185–1198.
47. Wittes J, Wallenstein S. The power of the Mantel-Haenszel test. J Am Stat Assoc. 1987;82:1104–1109.
48. Donner A. Sample size requirements for stratified cluster randomized designs. Stat Med. 1992;11:743–750.
49. Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. London, England: Arnold; 2000.
50. Lee EW, Durbin N. Estimation and sample size considerations for clustered binary responses. Stat Med. 1994;13:1241–1252.
51. Jung S-H, Kang S-H, Ahn C. Sample size calculations for clustered binary data. Stat Med. 2001;20:1971–1982.
52. Pan W. Sample size and power calculations with correlated binary data. Control Clin Trials. 2001;22:211–227.
53. Liu G, Liang K-Y. Sample size calculations for studies with correlated observations. Biometrics. 1997;53:937–947.
54. Lindsey JK, Lambert P. On the appropriateness of marginal models for repeated measurements in clinical trials. Stat Med. 1998;17:447–469.
55. Muñoz A, Carey V, Shouten JP, Segal M, Rosner B. A parametric family of correlation structures for the analysis of longitudinal data. Biometrics. 1992;48:733–742.