The present article provides a primer on using effect sizes in research. A small heuristic data set is used in order to make the discussion concrete. Additionally, various admonitions for best practice in reporting and interpreting effect sizes are presented. Among these is the admonition to not use Cohen’s benchmarks for “small,” “medium,” and “large” effects, and instead to interpret effects in direct and explicit comparison against the effects in the related prior literature.
Introduction
The origins of p_CALCULATED values and null hypothesis statistical significance testing (NHSST) can be traced back to around 1700 (Huberty, 1999), but the widespread uptake of NHSST by social scientists did not occur until the 1950s (Hubbard & Ryan, 2000). During the next four decades, p_CALCULATED values acquired what some have called an "ontological mystique" as the sine qua non of research.
It may not be an exaggeration to say that for many PhD students, for whom the .05 alpha has acquired an almost ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is less than .05. However, if the p is greater than .05, it can mean ruin, despair, and advisor’s suddenly thinking of a new control condition that should be run. (Rosnow & Rosenthal, 1989, p. 1277)
During these decades the literature was plagued by a "file drawer" problem (Rosenthal, 1979) in which studies yielding p > .05 were rarely submitted for publication and, if submitted, were rarely published (Greenwald, 1975).
Of course, almost from the beginning, overreliance on NHSST attracted forceful critics such as Boring (1919) and Berkson (1938). Especially influential criticisms were published by Carver (1978), Cohen (1994), and Schmidt (1996). For example, Schmidt and Hunter (1997) argued that “Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution” (p. 37). Rozeboom (1997) was equally forceful:
Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students… [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism… (p. 335)
The onslaught of criticisms of NHSST grew exponentially over time, and occurred across numerous disciplines, as documented by Anderson, Burnham and Thompson (2000) and Fidler (2005). For example, criticisms of NHSST have been published in disciplines as diverse as economics (Ziliak & McCloskey, 2004), marketing (Armstrong, 2007; Hubbard & Armstrong, 2006), and the wildlife sciences (Johnson, 1999).
Three Criticisms of NHSST
Most of the published criticisms of NHSST can be sorted into three categories. Today, these criticisms have been widely accepted, and have led to changed emphases in social science research.
p Values Do Not Test Result Replicability
As conclusively shown by Cohen (1994), p_CALCULATED values provide no information about result replicability. A p_CALCULATED value "estimates the probability of the sample statistic(s) (and sample results even more extreme in their divergence from the null hypothesis than our sample results), assuming (a) the sample came from a population exactly described by the null hypothesis, and (b) given the sample size" (Thompson, 2006a, p. 179). A p_CALCULATED value does not estimate the probability of the population parameters given the sample results (i.e., p(P|S)). Instead, NHSST estimates p(S|P).
Our p_CALCULATED values would indeed bear upon result replicability if they were about p(P|S). Sadly, however, NHSST results (a) are about p(S|P), not p(P|S), and (b) p(S|P) ≠ p(P|S), just as the probability of being an animal given that one is a human (p(A|H)) does not equal the probability of being a human given that one is an animal (p(H|A))!
p Values Do Not Evaluate Result Importance
A p_CALCULATED value is a mathematical probability of the sample results under the two NHSST assumptions. This mathematical estimate does not include as input any information about the researcher's personal values. A valid deductive logic simply cannot contain in conclusions any information not present in the premises of the logic. As Thompson (1993) explained, "If the computer package [e.g., SPSS] did not ask you your values prior to its analysis, it could not have considered your value system in calculating p's, and so p's cannot be blithely used to infer the value of research results" (p. 365).
p Values Are In Part Tautological
Our p_CALCULATED values are a joint function of two, and only two, study features: (a) our sample size, and (b) our effect size (e.g., r², R², η²). Effect sizes quantify the extent to which our sample results diverge from the expectations specified in the null hypothesis we are testing (Cohen, 1994; Thompson, 1996, 2006a).
Our p_CALCULATED values cannot be used as effect sizes, because p's are confounded results affected by more than our effect sizes alone. Thus, as Thompson (1999) explained, "Because p values are confounded indices, in theory 100 studies with varying sample sizes and 100 different effect sizes could each have the same single p_CALCULATED, and 100 studies with the same single effect size could each have 100 different values for p_CALCULATED" (pp. 169-170).
These p_CALCULATED values in part are tautological, because they tell us how big (or small) our sample size was. As Cohen (1994, p. 1000) noted, "The point is made piercingly by Thompson (1992)," who was then quoted as saying:
Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they’re tired. This tautology has created considerable damage as regards the cumulation of knowledge… (Thompson, 1992, p. 436)
Each and every non-zero effect size will be statistically significant at some sample size, as we can see from the following illustrative results for testing the statistical significance of various Pearson r² values at various sample sizes.
| r² | n | p_CALCULATED | n + 1 | p_CALCULATED |
|---|---|---|---|---|
| 30.00% | 13 | 0.053 | 14 | 0.043 |
| 5.00% | 77 | 0.050 | 78 | 0.049 |
| 1.00% | 384 | 0.0502 | 385 | 0.0499 |
| 0.50% | 768 | 0.0501 | 769 | 0.0499 |
| 0.01% | 38,415 | 0.0500005 | 38,416 | 0.049998 |
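These tabled values follow from the standard conversion of a Pearson r into a t statistic with n − 2 degrees of freedom. The following minimal Python sketch (not part of the original analyses, which rely on SPSS and Excel; the helper function name is ours) illustrates how such p_CALCULATED values can be reproduced for a given r² and n:

```python
# Sketch: two-tailed p for a Pearson r^2 at sample size n,
# using t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
from scipy import stats


def p_for_r_squared(r_squared, n):
    """Two-tailed p_CALCULATED for a Pearson r^2 at sample size n (illustrative helper)."""
    r = r_squared ** 0.5
    t = r * ((n - 2) / (1 - r_squared)) ** 0.5
    return 2 * stats.t.sf(t, df=n - 2)


pairs = [(0.30, (13, 14)), (0.05, (77, 78)), (0.01, (384, 385)),
         (0.005, (768, 769)), (0.0001, (38415, 38416))]
for r2, (n_small, n_large) in pairs:
    for n in (n_small, n_large):
        print(f"r2 = {r2:.2%}, n = {n}: p = {p_for_r_squared(r2, n):.4f}")
```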
Because every nonzero effect size will be statistically significant at some sample size, p in part merely measures how tenacious we were in collecting a sample large enough to obtain this inevitable outcome, which makes p_CALCULATED less interesting to at least some scholars.
Purpose of the Present Article
Contemporary scholars focus their efforts on addressing two research questions: (a) How big are the effect sizes in our studies? and (b) Are our effect size results replicable?
In the words of Roger Kirk (2003), the …practice of focusing exclusively on a dichotomous reject-nonreject decision strategy of null hypothesis testing can actually impede scientific progress… In fact, focusing on p values and rejecting null hypotheses actually distracts us from our real goals: deciding whether data support our scientific hypotheses and are practically significant. The focus of research should be on our scientific hypotheses, what data tell us about the magnitude of effects, the practical significance of effects, and the steady accumulation of knowledge. (p. 100, italics added)
The present article has two purposes. First, some commonly used effect sizes are explained. Unfortunately, empirical studies show that statistics textbooks offer little, if any, coverage of effect sizes (Capraro & Capraro, 2002), although some modern textbooks have improved in this respect (Thompson, 2006a).
Second, some recommended effect size reporting and interpretation practices are presented. Readers interested in more in-depth treatment of these issues are directed to various articles (e.g., Thompson, 2002a, 2007; Vacha-Haase & Thompson, 2004), book chapters (e.g., Thompson, 2006b, 2008), or the excellent book written by Grissom and Kim (2005).
Heuristic Data
To make the discussion as accessible and concrete as possible, a small heuristic data set was created so that interested readers can replicate the calculations illustrated here. The example presumes the existence of the Martha Washington All-Girls Middle School Academy, from which 22 young women were randomly selected. These 22 young women and their parents voluntarily completed the required IRB forms to participate in a yearlong experiment intended to boost the posttest IQ scores of the 11 students randomly selected to be in the intervention group. Table 1 presents the posttest IQ scores of the 22 study participants.
Table 1. Heuristic Data Used to Calculate Effect Sizes
| Control | | | Experimental | | |
|---|---|---|---|---|---|
| ID | Name | Posttest | ID | Name | Posttest |
| 1 | Ida | 75 | 12 | Molly | 95 |
| 2 | Anne | 65 | 13 | Nancy | 105 |
| 3 | Eileen | 65 | 14 | Geri | 150 |
| 4 | Susan | 95 | 15 | Camie | 125 |
| 5 | Mary | 95 | 16 | Nancy Lynne | 85 |
| 6 | Barbara | 95 | 17 | Jan | 100 |
| 7 | Donna | 105 | 18 | Murray | 160 |
| 8 | Catherine | 105 | 19 | Carol | 135 |
| 9 | Kathy | 95 | 20 | Allegra | 80 |
| 10 | Wendy | 155 | 21 | Shawn | 125 |
| 11 | Deborah | 150 | 22 | Peggy | 105 |
| M | | 100 | | | 115 |
| SD | | 29.58 | | | 26.08 |
| Median | | 95 | | | 105 |
Effect Sizes
Some researchers erroneously believe that there is only one effect size, or only a few effect size statistics. In fact, there are dozens and dozens of choices. For example, in his seminal article Roger Kirk (1996) presented a table listing 40 choices, and his list did not even include the probability of superiority effect size (Grissom, 1994) or Huberty's group-overlap effect size (Hess, Olejnik, & Huberty, 2001; Huberty & Holmes, 1983; Huberty & Lowman, 2000).
Here, effect sizes of three types are presented, and in each category one or a few of the most commonly used choices are discussed. All statistics are either in an unsquared, "score-world" (e.g., mean, median, SD, r, R) or a squared, "area-world" (e.g., Cronbach's α, variance, r²) metric (see Thompson, 2006a). The effect sizes presented here will include effects from both worlds.
Standardized Differences Effect Sizes
Medical researchers routinely conduct experiments, and they measure many outcomes worldwide in uniform ways. For example, medical researchers everywhere measure cholesterol in milligrams per deciliter and mortality in deaths per 1,000 (or 10,000 or 100,000). These are naturally occurring metrics that are readily interpretable and intrinsically meaningful. Medical researchers therefore routinely compute score-world unstandardized difference effect sizes by subtracting the control group's mean (or median or some other central tendency statistic) from the intervention group's mean, to quantify how much a medication lowered cholesterol or reduced mortality.
However, in the social sciences, outcomes have no naturally occurring metric or standard deviation. For example, the Piers-Harris Self-Concept Test may have an SD of 15 while the Tennessee Self-Concept Test has an SD of 30. This means that unstandardized effect sizes are not useful in the social sciences, because we want to be able to compare effect sizes apples-to-apples across studies using different posttests with different standard deviations.
Standardized difference effect sizes compute effects by subtracting the control group’s location or central tendency statistic from the intervention group’s central tendency statistic, and then dividing that difference by some estimate of the standard deviation. In statistics, when we divide by something, we are removing from the answer whatever we are dividing by. Whenever we divide by an SD, we are removing the SD from our answer, and any statistic now having no metric of measurement (thanks to the division) is called “standardized.”
Glass' Δ is an unsquared, score-world standardized difference effect size computed by dividing the unstandardized difference (e.g., M_EXPERIMENTAL - M_CONTROL) by the SD of the posttest scores of the control group only. Glass reasoned that an intervention might affect not only posttest means (or medians or trimmed means or winsorized means) but also the dispersion of the posttest scores in the experimental group. He further reasoned that in cases where the control group received no intervention, or a placebo, the dispersion of the control group's posttest scores could not have been impacted by an intervention that these participants did not receive. For the Table 1 data, we have:
Δ = (M_EXPERIMENTAL - M_CONTROL) / SD_CONTROL
= (115 - 100) / 29.58
= 15 / 29.58 = 0.51.
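As a check on the arithmetic, here is a minimal Python sketch (an assumption of this illustration; the article itself works in SPSS and Excel) that computes Glass' Δ from the Table 1 posttest scores:

```python
# Sketch: Glass' delta for the Table 1 data (group vectors transcribed from Table 1).
import numpy as np

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])

# Standardize the mean difference by the control group's SD only
# (ddof=1 gives the n - 1 "sample" SD used in the text: 29.58).
glass_delta = (experimental.mean() - control.mean()) / control.std(ddof=1)
print(round(glass_delta, 2))  # 0.51
```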
Cohen's d is computed by dividing the unstandardized difference effect size by a weighted average of the posttest SDs of both groups. Cohen reasoned that (a) some interventions might not impact posttest score dispersion, and (b) an SD estimate based on both groups would rest on a larger n and therefore potentially be a better estimate of the population SD. For our data, which involve equal group sizes, we have:
d = (M_EXPERIMENTAL - M_CONTROL) / [(SD_EXPERIMENTAL² + SD_CONTROL²) / 2]^0.5
= (115 - 100) / [(26.08² + 29.58²) / 2]^0.5
= 15 / [(680.17 + 874.98) / 2]^0.5
= 15 / (1,555.15 / 2)^0.5
= 15 / 777.58^0.5
= 15 / 27.88 = 0.54.
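A parallel Python sketch (again an illustrative assumption rather than the article's own SPSS/Excel workflow) computes Cohen's d by pooling the two group variances for equal-sized groups:

```python
# Sketch: Cohen's d for two equal-sized groups, pooling the two group variances.
import numpy as np

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])

sd_pooled = np.sqrt((experimental.var(ddof=1) + control.var(ddof=1)) / 2)
d = (experimental.mean() - control.mean()) / sd_pooled
print(round(d, 2))  # 0.54
```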
The existence of these two rival choices for estimating a standardized difference effect size reinforces the important understanding that statistics is not the business of always performing one definitively “correct” analysis. Instead, statistics is about being thoughtful, and making a reasoned decision taking into account the features of a particular study.
When control group sample size is large, there may be little gain from pooling SD estimates across groups, and so Glass’ Δ might be the most reasonable effect characterization. When sample size is smaller, and there are theoretical or empirical arguments that the intervention did not impact posttest score dispersion, Cohen’s d might be the most reasonable effect characterization.
We also are not limited to using means in the standardized difference effect size estimate. We could use medians (or trimmed means, or winsorized means) instead. For example, if we are using medians to calculate d for our data, we have:
(105 - 95) / [(26.08² + 29.58²) / 2]^0.5
= 10 / [(680.17 + 874.98) / 2]^0.5
= 10 / (1,555.15 / 2)^0.5
= 10 / 777.58^0.5
= 10 / 27.88 = 0.36.
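The sketch above extends directly to this median-based variant; only the numerator changes:

```python
# Sketch: median-based d, using the same pooled SD with medians in the numerator.
import numpy as np

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])
sd_pooled = np.sqrt((experimental.var(ddof=1) + control.var(ddof=1)) / 2)

d_median = (np.median(experimental) - np.median(control)) / sd_pooled
print(round(d_median, 2))  # 0.36
```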
Means are not always the best statistics with which to characterize the central tendency of scores (see Thompson, 2006a).
Variance-Accounted-For Effect Sizes
In the bivariate case, the Pearson r² is an effect size, because the sample result characterizes how far the sample r² diverges from the hypothesized value of 0.0 under the null hypothesis, H₀: r² = 0. However, as explained in detail elsewhere (Thompson, 2006a), all statistical analyses are correlational, and part of a single general linear model (cf. Cohen, 1968; Knapp, 1978). Thus, all commonly used analyses yield or can yield r²-type effect sizes.
For example, for a t test or for any ANOVA the effect size η² is computed by dividing the Sum of Squares_MODEL by the Sum of Squares_TOTAL, as illustrated in Table 2. Our η² = 7.37% (1,237.5 / 16,787.5) tells us that, given only knowledge of the group to which each of our 22 participants belonged, we can explain or predict 7.37% of the variability, or individual differences, in the posttest IQ scores. For more information about t tests, see Capraro and Capraro and Zang elsewhere in this volume.
Table 2. ANOVA Summary Table for the Table 1 Example
| Source | Sum of Squares | df | Mean Square | F_CALCULATED | p_CALCULATED | η² |
|---|---|---|---|---|---|---|
| Model | 1,237.5 | 1 | 1,237.5 | 1.592 | 0.222 | 7.37% |
| Unexplained | 15,550.0 | 20 | 777.5 | | | |
| Total | 16,787.5 | 21 | | | | |
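The Table 2 quantities can likewise be reproduced with a short Python sketch (an illustrative assumption, not the article's SPSS output; SciPy's f_oneway supplies the matching F and p):

```python
# Sketch: eta-squared for the one-way design in Table 2, computed from sums of squares.
import numpy as np
from scipy import stats

control = np.array([75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150])
experimental = np.array([95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105])

scores = np.concatenate([control, experimental])
grand_mean = scores.mean()
ss_total = ((scores - grand_mean) ** 2).sum()                      # 16,787.5
ss_model = sum(len(g) * (g.mean() - grand_mean) ** 2
               for g in (control, experimental))                   # 1,237.5
eta_squared = ss_model / ss_total
print(f"eta^2 = {eta_squared:.2%}")                                # 7.37%

f, p = stats.f_oneway(control, experimental)
print(round(f, 3), round(p, 3))                                    # matches Table 2: F = 1.592, p = 0.222
```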
Corrected Effect Sizes
Sampling error occurs when sample data diverge in some respect from the population from which the sample was drawn. All samples have sampling error, and the sampling error that occurs in one sample occurs in no other sample. Samples, like people, are all individually unique, and are all weird, and some samples (and people) are way weird.
The problem is that commonly used statistical methods invoke "least squares" estimation, which mathematically maximizes the Sum of Squares_MODEL and minimizes the Sum of Squares_UNEXPLAINED, and therefore tends to yield effect size estimates that are inflated by sampling error. Fortunately, we are aware that this bias occurs, we know the factors that cause sampling error, and thus we can adjust or correct our effect size estimates to obtain more accurate estimates of effect.
Three things affect sampling error. There is more sampling error in samples as (a) sample size is smaller, (b) the number of measured variables is larger, and (c) the population effect size is smaller (see Thompson, 2002a).
For our data, the corrected variance-accounted-for effect size, Hays' ω², is computed as:
ω² = [SOS_MODEL - (k - 1)MS_UNEXPLAINED] / [SOS_TOTAL + MS_UNEXPLAINED],
where k is the number of levels in the ANOVA way. As reported in Table 2, for our data we have:
[1,237.5 - (2 - 1)777.5] / [16,787.5 + 777.5]
= [1,237.5 - (1)777.5] / [16,787.5 + 777.5]
= [1,237.5 - 777.5] / [16,787.5 + 777.5]
= 460.0 / [16,787.5 + 777.5]
= 460.0 / 17,565.0 = 2.62%.
Our “corrected” effect size (i.e., 2.62%) is smaller than our “uncorrected” estimate (i.e., 7.37%), as expected.
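Because ω² is a simple function of the ANOVA summary quantities, it is easy to script. The following minimal Python sketch (the function name is ours, offered only for illustration) reproduces the 2.62% value from the Table 2 numbers:

```python
# Sketch: Hays' omega-squared from the ANOVA summary quantities in Table 2.
def omega_squared(ss_model, ss_total, ms_unexplained, k):
    """Hays' omega^2, where k is the number of levels in the ANOVA way (illustrative helper)."""
    return (ss_model - (k - 1) * ms_unexplained) / (ss_total + ms_unexplained)


print(f"{omega_squared(1237.5, 16787.5, 777.5, 2):.2%}")  # 2.62%
```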
Another correction formula, the Ezekiel correction, can be applied to both the multiple R² and the Pearson r² (Wang & Thompson, 2006). The corrected R², denoted R²*, can be computed as:
R²* = 1 - [(n - 1) / (n - p - 1)][1 - R²],
or as:
R²* = R² - ((1 - R²)(p / (n - p - 1))),
where p is the number of predictor variables. SPSS will compute this adjusted effect size automatically if we run the REGRESSION analysis command. For more information about regression see Yetkiner elsewhere in this volume.
For example, presume that we were predicting an outcome variable with five predictor variables, n = 51, and that our uncorrected R2 = 10.00%. For that situation we have:
0.10 - ((1 - 0.10)(5 / (51 - 5 - 1)))
= 0.10 - (0.90(5 / (51 - 5 - 1)))
= 0.10 - (0.90(5 / 45))
= 0.10 - (0.90(0.11))
= 0.10 - 0.10 = 0.00 (i.e., 0.00%).
If this analysis were conducted retrospectively, after the regression data were collected and analyzed, the result would indicate that the detected R² effect size of 10.0% was caused solely by sampling error, and that the predictor variables are actually useless for predicting the outcome variable once sampling error is taken into account. If the analysis were conducted prospectively, before data were collected, under the premise that the population R² is 0.0%, it can also be interpreted as indicating that for n = 51 the uncorrected R² is expected to be 10.0% even when the population R² is 0.0%!
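A minimal Python sketch of the Ezekiel correction (the function name is ours, for illustration only) reproduces this n = 51, five-predictor example:

```python
# Sketch: Ezekiel-corrected R^2 for n cases and p predictor variables.
def adjusted_r_squared(r_squared, n, p):
    """Ezekiel correction: 1 - [(n - 1) / (n - p - 1)](1 - R^2) (illustrative helper)."""
    return 1 - ((n - 1) / (n - p - 1)) * (1 - r_squared)


print(round(adjusted_r_squared(0.10, 51, 5), 4))  # 0.0, i.e., the 10% shrinks to 0%
```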
Some Effect Size Interpretation Precepts
The effect sizes presented here and elsewhere (cf. Grissom & Kim, 2005; Thompson, 2006b, 2008) either will be computed by SPSS (e.g., η², adjusted R²) or can readily be computed by entering SPSS output information into Excel (e.g., Hays' ω²). In 1988, Measurement and Evaluation in Counseling and Development became the first journal to require effect size reporting; Educational and Psychological Measurement became the second to do so in 1994. In 1996, the American Psychological Association Board of Scientific Affairs appointed its Task Force on Statistical Inference, which in its 1999 report repeatedly and unequivocally emphasized that effect size reporting is "essential" (Wilkinson & APA Task Force, 1999).
In 2002, the Australian Fiona Fidler noted that, “Of the major American associations, only all the journals of the American Educational Research Association [AERA] have remained silent on all these issues” (p. 754). But in 2006 AERA finally did speak, and required effect size reporting in all quantitative articles in AERA journals.
Thus, today social scientists are routinely expected to report and interpret effect sizes. Following several precepts when reporting and interpreting effect sizes will promote best practice.
First, because there are dozens of effect size statistics, authors must explicitly tell readers which effect sizes they are reporting. Some effect sizes can be negative (e.g., Cohen's d), but some effect sizes cannot be negative (e.g., η²). Some effect sizes can be greater than +1 (e.g., Glass' Δ, Cohen's d), but some cannot (e.g., η², Hays' ω²). Different effect size statistics have different ranges and properties, and readers can only intelligently interpret effects when they know which specific effect size is being reported.
Second, do not interpret effect sizes using Cohen's benchmarks for "small," "medium," and "large" effects; instead, interpret effects in direct and explicit comparison against the effects in the related prior literature (Thompson, 2006b). We cannot reasonably use universal benchmarks that fail to take into account the context of the study being conducted. A 1.0% η² when the outcome is human longevity, as in the case of the effect of smoking on longevity, is important even though the effect size is numerically small! Cohen (1988) himself emphasized "… that these proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible [italics added]. They were offered as conventions because they were needed in a research climate characterized by a neglect of attention to issues of [effect size] magnitude" (p. 532).
As noted elsewhere, “if people interpreted effect sizes [using fixed benchmarks] with the same rigidity that α = .05 has been used in statistical testing, we would merely be being stupid in another metric” (Thompson, 2001, pp. 82-83).
Third, we must remember that the failure to meet the statistical assumptions of analytic methods (e.g., homogeneity of regression in ANCOVA) to some degree compromises the accuracy of all statistical results, including effect sizes. We never perfectly meet statistical assumptions, and we need to take into account how inaccurate our estimates may be given varying degrees of assumption violations.
Fourth, effect sizes should be reported even for statistically nonsignificant effects. A cumulative literature may detect important effects even when all studies individually yield p_CALCULATED = 0.06, as illustrated by Thompson (2002b). Remember Rosnow and Rosenthal's (1989) view that "surely, God loves the .06 [level of statistical significance] nearly as much as the .05" (p. 1277)!
