The present article provides a primer on using effect sizes in research. A small heuristic data set is used in order to make the discussion concrete. Additionally, various admonitions for best practice in reporting and interpreting effect sizes are presented. Among these is the admonition to not use Cohen’s benchmarks for “small,” “medium,” and “large” effects, and instead to interpret effects in direct and explicit comparison against the effects in the related prior literature.
Introduction
The origins of p_CALCULATED values and null hypothesis statistical significance testing (NHSST) can be traced back to around 1700 (Huberty, 1999), but the widespread uptake of NHSST by social scientists did not occur until the 1950s (Hubbard & Ryan, 2000). During the next four decades, p_CALCULATED values acquired what some have called an "ontological mystique" as the sine qua non of research.
It may not be an exaggeration to say that for many PhD students, for whom the .05 alpha has acquired an almost ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is less than .05. However, if the p is greater than .05, it can mean ruin, despair, and advisor’s suddenly thinking of a new control condition that should be run. (Rosnow & Rosenthal, 1989, p. 1277)
During these decades the literature was plagued by a "file drawer" problem (Rosenthal, 1979) in which studies yielding p > .05 were rarely submitted for publication and, if submitted, were rarely published (Greenwald, 1975).
Of course, almost from the beginning, overreliance on NHSST attracted forceful critics such as Boring (1919) and Berkson (1938). Especially influential criticisms were published by Carver (1978), Cohen (1994), and Schmidt (1996). For example, Schmidt and Hunter (1997) argued that “Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution” (p. 37). Rozeboom (1997) was equally forceful:
Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students… [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism… (p. 335)
The onslaught of criticisms of NHSST grew exponentially over time, and occurred across numerous disciplines, as documented by Anderson, Burnham and Thompson (2000) and Fidler (2005). For example, criticisms of NHSST have been published in disciplines as diverse as economics (Ziliak & McCloskey, 2004), marketing (Armstrong, 2007; Hubbard & Armstrong, 2006), and the wildlife sciences (Johnson, 1999).
Three Criticisms of NHSST
Most of the published criticisms of NHSST can be sorted into three categories. Today, these criticisms have been widely accepted, and have led to changed emphases in social science research.
p Values Do Not Test Result Replicability
As conclusively shown by Cohen (1994), p_CALCULATED values provide no information about result replicability. A p_CALCULATED value "estimates the probability of the sample statistic(s) (and sample results even more extreme in their divergence from the null hypothesis than our sample results), assuming (a) the sample came from a population exactly described by the null hypothesis, and (b) given the sample size" (Thompson, 2006a, p. 179). A p_CALCULATED value does not estimate the probability of the population parameters given the sample results (i.e., p(P|S)). Instead, NHSST estimates p(S|P).
Our p_CALCULATED values would indeed bear upon result replicability if they were about p(P|S). Sadly, however, NHSST results (a) are about p(S|P), not p(P|S), and (b) p(S|P) ≠ p(P|S), just as the probability of being an animal given that one is a human (p(A|H)) does not equal the probability of being a human given that one is an animal (p(H|A))!
p Values Do Not Evaluate Result Importance
A p_CALCULATED value is a mathematical probability of the sample results under the two NHSST assumptions. This mathematical estimate does not include as input any information about the researcher's personal values. A valid deductive logic simply cannot contain in conclusions any information not present in the premises of the logic. As Thompson (1993) explained, "If the computer package [e.g., SPSS] did not ask you your values prior to its analysis, it could not have considered your value system in calculating p's, and so p's cannot be blithely used to infer the value of research results" (p. 365).
p Values Are In Part Tautological
Our p_CALCULATED values are a joint function of two, and only two, study features: (a) our sample size, and (b) our effect size (e.g., r², R², η²). Effect sizes quantify the extent to which our sample results diverge from the expectations specified in the null hypothesis we are testing (Cohen, 1994; Thompson, 1996, 2006a).
Our p_CALCULATED values cannot be used as effect sizes, because p's are confounded results affected by more than our effect sizes alone. Thus, as Thompson (1999) explained, "Because p values are confounded indices, in theory 100 studies with varying sample sizes and 100 different effect sizes could each have the same single p_CALCULATED, and 100 studies with the same single effect size could each have 100 different values for p_CALCULATED" (pp. 169-170).
These p_CALCULATED values in part are tautological, because they tell us how big (or small) our sample size was. As Cohen (1994, p. 1000) noted, "The point is made piercingly by Thompson (1992)," who was then quoted as saying:
Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they’re tired. This tautology has created considerable damage as regards the cumulation of knowledge… (Thompson, 1992, p. 436)
Each and every non-zero effect size will be statistically significant at some sample size, as we can see from the following illustrative results for testing the statistical significance of various Pearson r² values at various sample sizes.
| r² | n | p_CALCULATED | n + 1 | p_CALCULATED |
|---|---|---|---|---|
| 30.00% | 13 | 0.053 | 14 | 0.043 |
| 5.00% | 77 | 0.050 | 78 | 0.049 |
| 1.00% | 384 | 0.0502 | 385 | 0.0499 |
| 0.50% | 768 | 0.0501 | 769 | 0.0499 |
| 0.01% | 38,415 | 0.0500005 | 38,416 | 0.049998 |
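These tabled values follow from the standard conversion of a Pearson r into a t statistic with n − 2 degrees of freedom. The following minimal Python sketch (not part of the original analyses, which rely on SPSS and Excel; the helper function name is ours) illustrates how such p_CALCULATED values can be reproduced for a given r² and n:

```python
# Sketch: two-tailed p for a Pearson r^2 at sample size n,
# using t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
from scipy import stats


def p_for_r_squared(r_squared, n):
    """Two-tailed p_CALCULATED for a Pearson r^2 at sample size n (illustrative helper)."""
    r = r_squared ** 0.5
    t = r * ((n - 2) / (1 - r_squared)) ** 0.5
    return 2 * stats.t.sf(t, df=n - 2)


pairs = [(0.30, (13, 14)), (0.05, (77, 78)), (0.01, (384, 385)),
         (0.005, (768, 769)), (0.0001, (38415, 38416))]
for r2, (n_small, n_large) in pairs:
    for n in (n_small, n_large):
        print(f"r2 = {r2:.2%}, n = {n}: p = {p_for_r_squared(r2, n):.4f}")
```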
Because every nonzero effect size will be statistically significant at some sample size, p in part merely measures how tenacious we were in collecting a sample large enough to obtain this inevitable outcome, which makes p_CALCULATED less interesting to at least some scholars.
Purpose of the Present Article
Contemporary scholars focus their efforts on addressing two research questions: (a) How big are the effect sizes in our studies? and (b) Are our effect size results replicable?
In the words of Roger Kirk (2003), the …practice of focusing exclusively on a dichotomous reject-nonreject decision strategy of null hypothesis testing can actually impede scientific progress… In fact, focusing on p values and rejecting null hypotheses actually distracts us from our real goals: deciding whether data support our scientific hypotheses and are practically significant. The focus of research should be on our scientific hypotheses, what data tell us about the magnitude of effects, the practical significance of effects, and the steady accumulation of knowledge. (p. 100, italics added)
The present article has two purposes. First, some commonly used effect sizes are explained. Unfortunately, empirical studies show that statistics textbooks offer little, if any, coverage of effect sizes (Capraro & Capraro, 2002), although some modern textbooks have improved in this respect (Thompson, 2006a).
Second, some recommended effect size reporting and interpretation practices are presented. Readers interested in more in-depth treatment of these issues are directed to various articles (e.g., Thompson, 2002a, 2007; Vacha-Haase & Thompson, 2004), book chapters (e.g., Thompson, 2006b, 2008), or the excellent book written by Grissom and Kim (2005).
Heuristic Data
To make the discussion as accessible and concrete as possible, a small heuristic data set was created so that interested readers can replicate the calculations illustrated here. The example presumes the existence of the Martha Washington All-Girls Middle School Academy, from which 22 young women were randomly selected. These 22 young women and their parents voluntarily completed the required IRB forms to participate in a yearlong experiment intended to boost the posttest IQ scores of the 11 students randomly selected to be in the intervention group. Table 1 presents the posttest IQ scores of the 22 study participants.
Table 1. Heuristic Data Used to Calculate Effect Sizes
| Control | | | Experimental | | |
|---|---|---|---|---|---|
| ID | Name | Posttest | ID | Name | Posttest |
| 1 | Ida | 75 | 12 | Molly | 95 |
| 2 | Anne | 65 | 13 | Nancy | 105 |
| 3 | Eileen | 65 | 14 | Geri | 150 |
| 4 | Susan | 95 | 15 | Camie | 125 |
| 5 | Mary | 95 | 16 | Nancy Lynne | 85 |
| 6 | Barbara | 95 | 17 | Jan | 100 |
| 7 | Donna | 105 | 18 | Murray | 160 |
| 8 | Catherine | 105 | 19 | Carol | 135 |
| 9 | Kathy | 95 | 20 | Allegra | 80 |
| 10 | Wendy | 155 | 21 | Shawn | 125 |
| 11 | Deborah | 150 | 22 | Peggy | 105 |
| M | | 100 | | | 115 |
| SD | | 29.58 | | | 26.08 |
| Median | | 95 | | | 105 |
Effect Sizes
Some researchers erroneously believe that there is only one effect size, or only a few effect size statistics. In fact, there are dozens and dozens of choices. For example, in his seminal article Roger Kirk (1996) presented a table listing 40 choices, and his list did not even include the probability of superiority effect size (Grissom, 1994) or Huberty's group-overlap effect size (Hess, Olejnik, & Huberty, 2001; Huberty & Holmes, 1983; Huberty & Lowman, 2000).
Here, effect sizes of three types are presented, and in each category one or a few of the most commonly used choices are discussed. All statistics are either in an unsquared, "score-world" (e.g., mean, median, SD, r, R) or a squared, "area-world" (e.g., Cronbach's α, variance, r²) metric (see Thompson, 2006a). The effect sizes presented here will include effects from both worlds.
Standardized Differences Effect Sizes
Medical researchers routinely conduct experiments, and they measure many outcomes worldwide in uniform ways. For example, medical researchers everywhere measure cholesterol in milligrams per deciliter and mortality in deaths per 1,000 (or 10,000 or 100,000). These are naturally occurring metrics that are readily interpretable and intrinsically meaningful. Medical researchers therefore routinely compute score-world unstandardized difference effect sizes by subtracting the control group's mean (or median or some other central tendency statistic) from the intervention group's mean, to quantify how much a medication lowered cholesterol or reduced mortality.
However, in the social sciences, outcomes have no naturally occurring metric or standard deviation. For example, the Piers-Harris Self-Concept Test may have an SD of 15 while the Tennessee Self-Concept Test has an SD of 30. This means that unstandardized effect sizes are not useful in the social sciences, because we want to be able to compare effect sizes apples-to-apples across studies using different posttests with different standard deviations.
Standardized difference effect sizes compute effects by subtracting the control group’s location or central tendency statistic from the intervention group’s central tendency statistic, and then dividing that difference by some estimate of the standard deviation. In statistics, when we divide by something, we are removing from the answer whatever we are dividing by. Whenever we divide by an SD, we are removing the SD from our answer, and any statistic now having no metric of measurement (thanks to the division) is called “standardized.”
Glass' Δ is an unsquared, score-world standardized difference effect size computed by dividing the unstandardized difference (e.g., M_EXPERIMENTAL - M_CONTROL) by the SD of the posttest scores of the control group only. Glass reasoned that an intervention might affect not only posttest means (or medians or trimmed means or winsorized means) but also the dispersion of the posttest scores in the experimental group. He further reasoned that in cases where the control group received no intervention, or a placebo, the dispersion of the control group's posttest scores could not have been impacted by an intervention that these participants did not receive. For the Table 1 data, we have:
Δ = (M_EXPERIMENTAL - M_CONTROL) / SD_CONTROL
= (115 - 100) / 29.58
= 15 / 29.58 = 0.51.
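As a check on the arithmetic, here is a minimal Python sketch (an assumption of this illustration; the article itself works in SPSS and Excel) that computes Glass' Δ from the Table 1 posttest scores:

```python
# Sketch: Glass' delta for the Table 1 data (group vectors transcribed from Table 1).
import numpy as np

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])

# Standardize the mean difference by the control group's SD only
# (ddof=1 gives the n - 1 "sample" SD used in the text: 29.58).
glass_delta = (experimental.mean() - control.mean()) / control.std(ddof=1)
print(round(glass_delta, 2))  # 0.51
```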
Cohen's d is computed by dividing the unstandardized difference effect size by a weighted average of the posttest SDs of both groups. Cohen reasoned that (a) some interventions might not impact posttest score dispersion, and (b) an SD estimate based on both groups would rest on a larger n and therefore potentially be a better estimate of the population SD. For our data, which involve equal group sizes, we have:
d = (M_EXPERIMENTAL - M_CONTROL) / [(SD_EXPERIMENTAL² + SD_CONTROL²) / 2]^0.5
= (115 - 100) / [(26.08² + 29.58²) / 2]^0.5
= 15 / [(680.17 + 874.98) / 2]^0.5
= 15 / (1,555.15 / 2)^0.5
= 15 / 777.58^0.5
= 15 / 27.88 = 0.54.
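A parallel Python sketch (again an illustrative assumption rather than the article's own SPSS/Excel workflow) computes Cohen's d by pooling the two group variances for equal-sized groups:

```python
# Sketch: Cohen's d for two equal-sized groups, pooling the two group variances.
import numpy as np

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])

sd_pooled = np.sqrt((experimental.var(ddof=1) + control.var(ddof=1)) / 2)
d = (experimental.mean() - control.mean()) / sd_pooled
print(round(d, 2))  # 0.54
```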
The existence of these two rival choices for estimating a standardized difference effect size reinforces the important understanding that statistics is not the business of always performing one definitively “correct” analysis. Instead, statistics is about being thoughtful, and making a reasoned decision taking into account the features of a particular study.
When control group sample size is large, there may be little gain from pooling SD estimates across groups, and so Glass’ Δ might be the most reasonable effect characterization. When sample size is smaller, and there are theoretical or empirical arguments that the intervention did not impact posttest score dispersion, Cohen’s d might be the most reasonable effect characterization.
We also are not limited to using means in the standardized difference effect size estimate. We could use medians (or trimmed means, or winsorized means) instead. For example, if we are using medians to calculate d for our data, we have:
(105 - 95) / [(26.08² + 29.58²) / 2]^0.5
= 10 / [(680.17 + 874.98) / 2]^0.5
= 10 / (1,555.15 / 2)^0.5
= 10 / 777.58^0.5
= 10 / 27.88 = 0.36.
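The sketch above extends directly to this median-based variant; only the numerator changes:

```python
# Sketch: median-based d, using the same pooled SD with medians in the numerator.
import numpy as np

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])
sd_pooled = np.sqrt((experimental.var(ddof=1) + control.var(ddof=1)) / 2)

d_median = (np.median(experimental) - np.median(control)) / sd_pooled
print(round(d_median, 2))  # 0.36
```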
Means are not always the best statistics with which to characterize the central tendency of scores (see Thompson, 2006a).
Variance-Accounted-For Effect Sizes
In the bivariate case, the Pearson r² is an effect size, because the sample result characterizes how far the sample r² diverges from the hypothesized value of 0.0 under the null hypothesis, H₀: r² = 0. However, as explained in detail elsewhere (Thompson, 2006a), all statistical analyses are correlational, and part of a single general linear model (cf. Cohen, 1968; Knapp, 1978). Thus, all commonly used analyses yield or can yield r²-type effect sizes.
For example, for a t test or for any ANOVA the effect size η² is computed by dividing the Sum of Squares_MODEL by the Sum of Squares_TOTAL, as illustrated in Table 2. Our η² = 7.37% (1,237.5 / 16,787.5) tells us that, given only knowledge of the group to which each of our 22 participants belonged, we can explain or predict 7.37% of the variability, or individual differences, in the posttest IQ scores. For more information about t tests, see Capraro and Capraro and Zang elsewhere in this volume.
Table 2. ANOVA Summary Table for the Table 1 Example
| Source | Sum of Squares | df | Mean Square | F_CALCULATED | p_CALCULATED | η² |
|---|---|---|---|---|---|---|
| Model | 1,237.5 | 1 | 1,237.5 | 1.592 | 0.222 | 7.37% |
| Unexplained | 15,550.0 | 20 | 777.5 | | | |
| Total | 16,787.5 | 21 | | | | |
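The Table 2 quantities can likewise be reproduced with a short Python sketch (an illustrative assumption, not the article's SPSS output; SciPy's f_oneway supplies the matching F and p):

```python
# Sketch: eta-squared for the one-way design in Table 2, computed from sums of squares.
import numpy as np
from scipy import stats

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])

scores = np.concatenate([control, experimental])
grand_mean = scores.mean()
ss_total = ((scores - grand_mean) ** 2).sum()                      # 16,787.5
ss_model = sum(len(g) * (g.mean() - grand_mean) ** 2
               for g in (control, experimental))                   # 1,237.5
eta_squared = ss_model / ss_total
print(f"eta^2 = {eta_squared:.2%}")                                # 7.37%

f, p = stats.f_oneway(control, experimental)
print(round(f, 3), round(p, 3))                                    # matches Table 2: F = 1.592, p = 0.222
```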
Corrected Effect Sizes
Sampling error occurs when sample data diverge in some respect from the population from which the sample was drawn. All samples have sampling error, and the sampling error that occurs in one sample occurs in no other sample. Samples, like people, are all individually unique, and are all weird, and some samples (and people) are way weird.
The problem is that commonly used statistical methods invoke "least squares" estimation, which mathematically maximizes the Sum of Squares_MODEL and minimizes the Sum of Squares_UNEXPLAINED, and therefore tends to yield effect size estimates that are inflated by sampling error. Fortunately, we are aware that this bias occurs, we know the factors that cause sampling error, and thus we can adjust or correct our effect size estimates to obtain more accurate estimates of effect.
Three things affect sampling error. There is more sampling error in samples as (a) sample size is smaller, (b) the number of measured variables is larger, and (c) the population effect size is smaller (see Thompson, 2002a).
For our data, the corrected variance-accounted-for effect size, Hays' ω², is computed as:
ω² = [SOS_MODEL - (k - 1)MS_UNEXPLAINED] / [SOS_TOTAL + MS_UNEXPLAINED],
where k is the number of levels in the ANOVA way. As reported in Table 2, for our data we have:
[1,237.5 - (2 - 1)777.5] / [16,787.5 + 777.5]
= [1,237.5 - (1)777.5] / [16,787.5 + 777.5]
= [1,237.5 - 777.5] / [16,787.5 + 777.5]
= 460.0 / [16,787.5 + 777.5]
= 460.0 / 17,565.0 = 2.62%.
Our “corrected” effect size (i.e., 2.62%) is smaller than our “uncorrected” estimate (i.e., 7.37%), as expected.
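Because ω² is a simple function of the ANOVA summary quantities, it is easy to script. The following minimal Python sketch (the function name is ours, offered only for illustration) reproduces the 2.62% value from the Table 2 numbers:

```python
# Sketch: Hays' omega-squared from the ANOVA summary quantities in Table 2.
def omega_squared(ss_model, ss_total, ms_unexplained, k):
    """Hays' omega^2, where k is the number of levels in the ANOVA way (illustrative helper)."""
    return (ss_model - (k - 1) * ms_unexplained) / (ss_total + ms_unexplained)


print(f"{omega_squared(1237.5, 16787.5, 777.5, 2):.2%}")  # 2.62%
```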
Another correction formula, the Ezekiel correction, can be applied to both the multiple R² and the Pearson r² (Wang & Thompson, 2006). The corrected R², denoted R²*, can be computed as:
R²* = 1 - [(n - 1) / (n - p - 1)][1 - R²],
or as:
R²* = R² - ((1 - R²)(p / (n - p - 1))),
where p is the number of predictor variables. SPSS will compute this adjusted effect size automatically if we run the REGRESSION analysis command. For more information about regression see Yetkiner elsewhere in this volume.
For example, presume that we were predicting an outcome variable with five predictor variables, n = 51, and that our uncorrected R2 = 10.00%. For that situation we have:
0.10 - ((1 - 0.10)(5 / (51 - 5 - 1)))
= 0.10 - (0.90(5 / (51 - 5 - 1)))
= 0.10 - (0.90(5 / 45))
= 0.10 - (0.90(0.11))
= 0.10 - 0.10 = 0.00 (i.e., 0.00%).
If this analysis were conducted retrospectively, after the regression data were collected and analyzed, the result would indicate that the detected R² effect size of 10.0% was caused solely by sampling error, and that the predictor variables are actually useless for predicting the outcome variable once sampling error is taken into account. If the analysis were conducted prospectively, before data were collected, under the premise that the population R² is 0.0%, it can also be interpreted as indicating that for n = 51 the uncorrected R² is expected to be 10.0% even when the population R² is 0.0%!
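A minimal Python sketch of the Ezekiel correction (the function name is ours, for illustration only) reproduces this n = 51, five-predictor example:

```python
# Sketch: Ezekiel-corrected R^2 for n cases and p predictor variables.
def adjusted_r_squared(r_squared, n, p):
    """Ezekiel correction: 1 - [(n - 1) / (n - p - 1)](1 - R^2) (illustrative helper)."""
    return 1 - ((n - 1) / (n - p - 1)) * (1 - r_squared)


print(round(adjusted_r_squared(0.10, 51, 5), 4))  # 0.0, i.e., the 10% shrinks to 0%
```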
Some Effect Size Interpretation Precepts
The effect sizes presented here and elsewhere (cf. Grissom & Kim, 2005; Thompson, 2006b, 2008) either will be computed by SPSS (e.g., η², adjusted R²) or can readily be computed by entering SPSS output information into Excel (e.g., Hays' ω²). In 1988, Measurement and Evaluation in Counseling and Development became the first journal to require effect size reporting; Educational and Psychological Measurement became the second to do so in 1994. In 1996, the American Psychological Association Board of Scientific Affairs appointed its Task Force on Statistical Inference, which in its 1999 report repeatedly and unequivocally emphasized that effect size reporting is "essential" (Wilkinson & APA Task Force, 1999).
In 2002, the Australian Fiona Fidler noted that, “Of the major American associations, only all the journals of the American Educational Research Association [AERA] have remained silent on all these issues” (p. 754). But in 2006 AERA finally did speak, and required effect size reporting in all quantitative articles in AERA journals.
Thus, today social scientists are routinely expected to report and interpret effect sizes. Following several precepts when reporting and interpreting effect sizes will promote best practice.
First, because there are dozens of effect size statistics, authors must explicitly tell readers which effect sizes they are reporting. Some effect sizes can be negative (e.g., Cohen's d), but some effect sizes cannot be negative (e.g., η²). Some effect sizes can be greater than +1 (e.g., Glass' Δ, Cohen's d), but some cannot (e.g., η², Hays' ω²). Different effect size statistics have different ranges and properties, and readers can only intelligently interpret effects when they know which specific effect size is being reported.
Second, do not interpret effect sizes using Cohen's benchmarks for "small," "medium," and "large" effects; instead, interpret effects in direct and explicit comparison against the effects in the related prior literature (Thompson, 2006b). We cannot reasonably use universal benchmarks that fail to take into account the context of the study being conducted. A 1.0% η² when the outcome is human longevity, as in the case of the effect of smoking on longevity, is important even though the effect size is numerically small! Cohen (1988) himself emphasized "… that these proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible [italics added]. They were offered as conventions because they were needed in a research climate characterized by a neglect of attention to issues of [effect size] magnitude" (p. 532).
As noted elsewhere, “if people interpreted effect sizes [using fixed benchmarks] with the same rigidity that α = .05 has been used in statistical testing, we would merely be being stupid in another metric” (Thompson, 2001, pp. 82-83).
Third, we must remember that the failure to meet the statistical assumptions of analytic methods (e.g., homogeneity of regression in ANCOVA) to some degree compromises the accuracy of all statistical results, including effect sizes. We never perfectly meet statistical assumptions, and we need to take into account how inaccurate our estimates may be given varying degrees of assumption violations.
Fourth, effect sizes should be reported even for statistically nonsignificant effects. A cumulative literature may detect important effects even when all studies individually yield p_CALCULATED = 0.06, as illustrated by Thompson (2002b). Remember Rosnow and Rosenthal's (1989) view that "surely, God loves the .06 [level of statistical significance] nearly as much as the .05" (p. 1277)!
