
The present article provides a primer on using effect sizes in research. A small heuristic data set is used in order to make the discussion concrete. Additionally, various admonitions for best practice in reporting and interpreting effect sizes are presented. Among these is the admonition to not use Cohen’s benchmarks for “small,” “medium,” and “large” effects, and instead to interpret effects in direct and explicit comparison against the effects in the related prior literature.

The origins of p_CALCULATED values and null hypothesis statistical significance testing (NHSST) can be traced back to around 1700 (Huberty, 1999). But the widespread uptake of NHSST by social scientists actually did not occur until the 1950s (Hubbard & Ryan, 2000). During the next four decades p_CALCULATED values obtained what some have called an "ontological mystique" as the sine qua non of research.

It may not be an exaggeration to say that for many PhD students, for whom the .05 alpha has acquired an almost ontological mystique, it can mean joy, a doctoral degree, and a tenure-track position at a major university if their dissertation p is less than .05. However, if the p is greater than .05, it can mean ruin, despair, and advisor’s suddenly thinking of a new control condition that should be run. (Rosnow & Rosenthal, 1989, p. 1277)

During these decades the literature was plagued by a "file drawer" problem (Rosenthal, 1979) in which studies yielding p > .05 were rarely submitted for publication, or published if submitted (Greenwald, 1975).

Of course, almost from the beginning, overreliance on NHSST attracted forceful critics such as Boring (1919) and Berkson (1938). Especially influential criticisms were published by Carver (1978), Cohen (1994), and Schmidt (1996). For example, Schmidt and Hunter (1997) argued that “Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution” (p. 37). Rozeboom (1997) was equally forceful:

Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students… [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism… (p. 335)

The onslaught of criticisms of NHSST grew exponentially over time, and occurred across numerous disciplines, as documented by Anderson, Burnham and Thompson (2000) and Fidler (2005). For example, criticisms of NHSST have been published in disciplines as diverse as economics (Ziliak & McCloskey, 2004), marketing (Armstrong, 2007; Hubbard & Armstrong, 2006), and the wildlife sciences (Johnson, 1999).

Three Criticisms of NHSST

Most of the published criticisms of NHSST can be sorted into three categories. Today, these criticisms have been widely accepted, and have led to changed emphases in social science research.

p Values Do Not Test Result Replicability

As conclusively shown by Cohen (1994), p_CALCULATED values provide no information about result replicability. A p_CALCULATED value "estimates the probability of the sample statistic(s) (and sample results even more extreme in their divergence from the null hypothesis than our sample results), assuming (a) the sample came from a population exactly described by the null hypothesis, and (b) given the sample size" (Thompson, 2006a, p. 179). A p_CALCULATED value does not estimate the probability of the population parameters given the sample results (i.e., p(P|S)). Instead, NHSST estimates p(S|P).

Our p_CALCULATED values indeed would bear upon result replicability if the p_CALCULATED values were about p(P|S). NHSST results (a) sadly are about p(S|P), and not p(P|S), and (b) sadly p(S|P) ≠ p(P|S), just as the probability of being an animal given that one is a human (p(A|H)) does not equal the probability that one is human given that one is an animal (p(H|A))!

p Values Do Not Evaluate Result Importance

A p_CALCULATED value is a mathematical probability of the sample results under the two NHSST assumptions. This mathematical estimate does not include as input any information about the researcher's personal values. A valid deductive logic simply cannot contain in conclusions any information not present in the premises of the logic. As Thompson (1993) explained, "If the computer package [e.g., SPSS] did not ask you your values prior to its analysis, it could not have considered your value system in calculating p's, and so p's cannot be blithely used to infer the value of research results" (p. 365).

p Values Are In Part Tautological

Our p_CALCULATED values are a joint function of two, and only two, study features: (a) our sample size, and (b) our effect size (e.g., r², R², η²). Effect sizes quantify the extent to which our sample results diverge from the expectations specified in the null hypothesis we are testing (Cohen, 1994; Thompson, 1996, 2006a).

Our p_CALCULATED values cannot be used as effect sizes, because p's are confounded results affected by more than only our effect sizes. Thus, as Thompson (1999) explained,

Because p values are confounded indices, in theory 100 studies with varying sample sizes and 100 different effect sizes could each have the same single p_CALCULATED, and 100 studies with the same single effect size could each have 100 different values for p_CALCULATED. (pp. 169-170)

These p_CALCULATED values in part are tautological, because they tell us how big (or small) our sample size was. As Cohen (1994, p. 1000) noted, "The point is made piercingly by Thompson (1992)," who was then quoted as saying:

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data from hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they’re tired. This tautology has created considerable damage as regards the cumulation of knowledge… (Thompson, 1992, p. 436)

Each and every non-zero effect size will be statistically significant at some sample size, as we can see from the following illustrative results for testing the statistical significance of various Pearson r² values at various sample sizes.

r²        n        p_CALCULATED    n        p_CALCULATED
30.00%    13       0.053           14       0.043
5.00%     77       0.05            78       0.049
1.00%     384      0.0502          385      0.0499
0.50%     768      0.0501          769      0.0499
0.01%     38,415   0.0500005       38,416   0.049998

The dynamic that every nonzero effect size will be statistically significant at some sample size means that p in part merely measures how tenacious we are at collecting large samples to obtain this inevitable outcome, and thus makes p_CALCULATED less interesting to at least some scholars.
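The tabled pattern can be reproduced with a short script. The sketch below computes the two-tailed p_CALCULATED for a given r² and n via the usual t statistic for testing r = 0; because the Python standard library has no t-distribution CDF, the tail probability is obtained here by Simpson's-rule integration of the t density (a statistics library such as SciPy would normally supply this).

```python
import math

def t_two_tailed_p(t, df, steps=20000):
    """Two-tailed p value for a t statistic, via Simpson's-rule
    integration of the t density (standard library only)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    f = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps
    area = f(0) + f(t)
    for i in range(1, steps):
        area += f(i * h) * (4 if i % 2 else 2)
    central = 2 * (h / 3) * area  # P(-t < T < +t)
    return 1 - central

def p_for_r2(r2, n):
    """p_CALCULATED for testing H0: r = 0, given r-squared and sample size n."""
    r = math.sqrt(r2)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r2)
    return t_two_tailed_p(t, n - 2)

# A fixed r-squared of 5% crosses the .05 threshold between n = 77 and n = 78,
# matching the table above.
print(p_for_r2(0.05, 77))  # still above .05
print(p_for_r2(0.05, 78))  # now below .05
```

Looping `p_for_r2` over the tabled r² values and sample sizes reproduces each row: holding the effect size fixed, p_CALCULATED always drops below .05 once n is large enough.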

Purpose of the Present Article

Contemporary scholars focus their efforts on addressing two research questions:

  1. How big are the effect sizes in our studies? and

  2. Are our effect size results replicable?

In the words of Roger Kirk (2003):

…practice of focusing exclusively on a dichotomous reject-nonreject decision strategy of null hypothesis testing can actually impede scientific progress… In fact, focusing on p values and rejecting null hypotheses actually distracts us from our real goals: deciding whether data support our scientific hypotheses and are practically significant. The focus of research should be on our scientific hypotheses, what data tell us about the magnitude of effects, the practical significance of effects, and the steady accumulation of knowledge. (p. 100, italics added)

The present article has two purposes. First, some commonly used effect sizes are explained. Unfortunately, empirical studies show that statistics textbooks offer little if any coverage of effect sizes (Capraro & Capraro, 2002), although some modern textbooks have improved in this respect (Thompson, 2006a).

Second, some recommended effect size reporting and interpretation practices are presented. Readers interested in more in-depth treatment of these issues are directed to various articles (e.g., Thompson, 2002a, 2007; Vacha-Haase & Thompson, 2004), book chapters (e.g., Thompson, 2006b, 2008), or the excellent book written by Grissom and Kim (2005).

Heuristic Data

To make the discussion as accessible and concrete as possible, a small heuristic data set was created so that interested readers can replicate the calculations illustrated here. The example presumes the existence of the Martha Washington All-Girls Middle School Academy from which 22 young women were randomly selected. These 22 young women and parents voluntarily completed required IRB forms to participate in a yearlong experiment intended to boost the posttest IQ scores of the 11 students randomly selected to be in the intervention group. Table 1 presents the posttest IQ scores of the 22 study participants.

Table 1

Heuristic Data Used to Calculate Effect Sizes

        Control                        Experimental
ID   Name        Posttest     ID   Name          Posttest
 1   Ida              75      12   Molly              95
 2   Anne             65      13   Nancy             105
 3   Eileen           65      14   Geri              150
 4   Susan            95      15   Camie             125
 5   Mary             95      16   Nancy Lynne        85
 6   Barbara          95      17   Jan               100
 7   Donna           105      18   Murray            160
 8   Catherine       105      19   Carol             135
 9   Kathy            95      20   Allegra            80
10   Wendy           155      21   Shawn             125
11   Deborah         150      22   Peggy             105
     M               100           M                115
     SD            29.58           SD             26.08
     Median           95           Median           105

Some researchers erroneously believe that there is only one effect size, or only a few effect size statistics. In fact, there are dozens and dozens of choices. For example, in his seminal article Roger Kirk (1996) presented a table listing 40 choices, and his list did not even include the probability of superiority effect size (Grissom, 1994) or Huberty's group-overlap effect size (Hess, Olejnik, & Huberty, 2001; Huberty & Holmes, 1983; Huberty & Lowman, 2000).

Here, effect sizes of three types are presented, and in each category one or a few of the most commonly used choices are discussed. All statistics are either in an unsquared, "score-world" (e.g., mean, median, SD, r, R) or a squared, "area-world" (e.g., Cronbach's α, variance, r²) metric (see Thompson, 2006a). The effect sizes presented here will include effects from both worlds.

Standardized Differences Effect Sizes

Medical researchers routinely conduct experiments. And medical researchers also measure many outcomes worldwide in uniform ways. For example, medical researchers from everywhere in the world measure cholesterol in milligrams per deciliter, and mortality in deaths per 1,000 (or 10,000 or 100,000). These are naturally occurring metrics that are readily interpretable and intrinsically meaningful. Medical researchers routinely compute score-world unstandardized difference effect sizes by subtracting the control group’s mean (or median or some other central tendency statistic) from the intervention group’s mean, to quantify how much a medication lowered cholesterol or reduced mortality.

However, in the social sciences, outcomes have no intrinsic scale or standard deviation. For example, the Piers-Harris Self-Concept Test may have an SD of 15 while the Tennessee Self-Concept Test has an SD of 30. This means that unstandardized effect sizes are not useful in the social sciences, because we want to be able to compare effect sizes apples-to- apples across studies using different posttests with different standard deviations.

Standardized difference effect sizes compute effects by subtracting the control group’s location or central tendency statistic from the intervention group’s central tendency statistic, and then dividing that difference by some estimate of the standard deviation. In statistics, when we divide by something, we are removing from the answer whatever we are dividing by. Whenever we divide by an SD, we are removing the SD from our answer, and any statistic now having no metric of measurement (thanks to the division) is called “standardized.”

Glass' delta is an unsquared, score-world standardized difference effect size computed by dividing the unstandardized difference (e.g., M_EXPERIMENTAL − M_CONTROL) by the SD of the posttest scores only of the control group. Glass reasoned that an intervention might affect posttest means or medians or trimmed means or winsorized means, but also might affect the dispersion of the posttest scores in the experimental group. He reasoned that in cases where the control group got no intervention, or a placebo, the dispersion of the control group's posttest scores could not have been impacted by an intervention that these participants did not receive. For the Table 1 data, we have:

Δ = (M_EXPERIMENTAL − M_CONTROL) / SD_CONTROL

= (115 − 100) / 29.58

= 15 / 29.58 = 0.51.

Cohen’s d is computed by dividing the unstandardized difference effect size by a weighted average of the posttest SDs of both groups. Cohen reasoned that (a) some interventions might not impact posttest score dispersion and (b) an estimate of the SD used in standardizing involving both groups would have a larger n, and therefore potentially be a better estimate of the population SD. For our data involving equal group sizes, we have:

d = (M_EXPERIMENTAL − M_CONTROL) / ((SD_EXPERIMENTAL² + SD_CONTROL²) / 2)^0.5

= (115 − 100) / ((26.08² + 29.58²) / 2)^0.5

= 15 / ((680.17 + 874.98) / 2)^0.5

= 15 / (1555.15 / 2)^0.5

= 15 / 777.58^0.5

= 15 / 27.88 = 0.54.

The existence of these two rival choices for estimating a standardized difference effect size reinforces the important understanding that statistics is not the business of always performing one definitively “correct” analysis. Instead, statistics is about being thoughtful, and making a reasoned decision taking into account the features of a particular study.

When control group sample size is large, there may be little gain from pooling SD estimates across groups, and so Glass’ Δ might be the most reasonable effect characterization. When sample size is smaller, and there are theoretical or empirical arguments that the intervention did not impact posttest score dispersion, Cohen’s d might be the most reasonable effect characterization.

We also are not limited to using means in the standardized difference effect size estimate. We could use medians (or trimmed means, or winsorized means) instead. For example, if we are using medians to calculate d for our data, we have:

d = (105 − 95) / ((26.08² + 29.58²) / 2)^0.5

= 10 / ((680.17 + 874.98) / 2)^0.5

= 10 / (1555.15 / 2)^0.5

= 10 / 777.58^0.5

= 10 / 27.88 = 0.36.

Means are not always the best statistics with which to characterize the central tendency of scores (see Thompson, 2006a).
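The three standardized differences computed above can be verified directly from the Table 1 scores; a minimal sketch using only the Python standard library:

```python
import statistics as st

# Posttest IQ scores from Table 1.
control = [75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150]
experimental = [95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105]

# Glass' delta: standardize the mean difference by the control group's SD only.
delta = (st.mean(experimental) - st.mean(control)) / st.stdev(control)

# Cohen's d: standardize by the SD pooled across both (equal-n) groups.
pooled_sd = ((st.stdev(experimental) ** 2 + st.stdev(control) ** 2) / 2) ** 0.5
d = (st.mean(experimental) - st.mean(control)) / pooled_sd

# The same d formula, but comparing medians instead of means.
d_median = (st.median(experimental) - st.median(control)) / pooled_sd

print(round(delta, 2), round(d, 2), round(d_median, 2))  # 0.51 0.54 0.36
```

Note that `statistics.stdev` computes the sample (n − 1) standard deviation, which matches the SDs reported in Table 1 (29.58 and 26.08).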

Variance-Accounted-For Effect Sizes

In the bivariate case, the Pearson r² is an effect size, because the sample result characterizes how far the sample r² diverges from the hypothesized value of 0.0 under the null hypothesis, H0: r² = 0. However, as explained in detail elsewhere (Thompson, 2006a), all statistical analyses are correlational, and part of a single general linear model (cf. Cohen, 1968; Knapp, 1978). Thus, all commonly used analyses yield or can yield r²-type effect sizes.

For example, for a t test or for any ANOVA the effect size η² is computed by dividing the Sum of Squares_MODEL by the Sum of Squares_TOTAL, as illustrated in Table 2. Our η² = 7.37% (1,237.5 / 16,787.5) tells us that, given only knowledge of the group to which each of our 22 participants belonged, we can explain or predict 7.37% of the variability in the individual differences on the posttest IQ scores. For more information about t tests see Capraro and Capraro, and Zang, elsewhere in this volume.

Table 2

ANOVA Summary Table for the Table 1 Example

Source         Sum of Squares    df    Mean Square    F_CALCULATED    p_CALCULATED    η²
Model                 1,237.5     1        1,237.5           1.592           0.222    7.37%
Unexplained          15,550.0    20          777.5
Total                16,787.5    21
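The Table 2 partition of the sums of squares can be reproduced from the raw Table 1 scores; a minimal sketch:

```python
# Posttest IQ scores from Table 1.
control = [75, 65, 65, 95, 95, 95, 105, 105, 95, 155, 150]
experimental = [95, 105, 150, 125, 85, 100, 160, 135, 80, 125, 105]

scores = control + experimental
grand_mean = sum(scores) / len(scores)

# Sum of Squares TOTAL: squared deviations of every score from the grand mean.
sos_total = sum((x - grand_mean) ** 2 for x in scores)

# Sum of Squares MODEL: squared deviation of each group's mean from the
# grand mean, weighted by group size.
sos_model = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2
    for g in (control, experimental)
)

eta_squared = sos_model / sos_total
print(sos_model, sos_total, round(eta_squared * 100, 2))  # 1237.5 16787.5 7.37
```

The unexplained sum of squares is then 16,787.5 − 1,237.5 = 15,550.0, matching Table 2.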

Corrected Effect Sizes

Sampling error occurs when sample data diverge in some respect from the population from which the sample was drawn. All samples have sampling error. And the sampling error that occurs in one sample occurs in no other sample. Samples, like people, are all individually unique, and are all weird, and some samples (and people) are way weird.

The problem is that commonly used statistical methods invoke "least squares" estimation, which mathematically maximizes the Sum of Squares_MODEL and minimizes the Sum of Squares_UNEXPLAINED, and so these methods tend to yield inflated effect size estimates due to the presence of sampling error. Fortunately, we are aware that this bias occurs, we know the factors that cause sampling error, and thus we can adjust or correct our effect size estimates to obtain more accurate effect estimates.

Three things affect sampling error. There is more sampling error in samples as (a) sample size is smaller, (b) the number of measured variables is larger, and (c) the population effect size is smaller (see Thompson, 2002a).

For our data, the corrected variance-accounted-for effect size Hays' ω² is computed as:

ω² = [SOS_MODEL − (k − 1) MS_UNEXPLAINED] / [SOS_TOTAL + MS_UNEXPLAINED],

where k is the number of levels in the ANOVA way. As reported in Table 2, for our data we have:

ω² = [1237.5 − (2 − 1) 777.5] / [16787.5 + 777.5]

= [1237.5 − (1) 777.5] / [16787.5 + 777.5]

= [1237.5 − 777.5] / [16787.5 + 777.5]

= 460.0 / 17,565.0 = 2.62%.

Our “corrected” effect size (i.e., 2.62%) is smaller than our “uncorrected” estimate (i.e., 7.37%), as expected.
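Hays' ω² correction can be wrapped as a small function taking the Table 2 quantities; a sketch:

```python
def hays_omega_squared(sos_model, sos_total, ms_unexplained, k):
    """Hays' omega-squared: a variance-accounted-for effect size corrected
    for sampling error. k is the number of levels in the ANOVA way."""
    return (sos_model - (k - 1) * ms_unexplained) / (sos_total + ms_unexplained)

# Table 2 values: SOS_MODEL = 1237.5, SOS_TOTAL = 16787.5,
# MS_UNEXPLAINED = 777.5, and k = 2 groups.
omega2 = hays_omega_squared(1237.5, 16787.5, 777.5, 2)
print(round(omega2 * 100, 2))  # 2.62
```

As expected, the corrected 2.62% is noticeably smaller than the uncorrected η² of 7.37%.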

Another correction formula, the Ezekiel correction, can be applied with both the multiple R² and the Pearson r² (Wang & Thompson, 2007). The corrected R², R²*, can be computed as:

R²* = 1 − [(n − 1) / (n − p − 1)] [1 − R²],

or as:

R²* = R² − ((1 − R²) (p / (n − p − 1))),

where p is the number of predictor variables. SPSS will compute this adjusted effect size automatically if we run the REGRESSION analysis command. For more information about regression see Yetkiner elsewhere in this volume.

For example, presume that we were predicting an outcome variable with five predictor variables, n = 51, and that our uncorrected R² = 10.00%. For that situation we have:

R²* = 0.10 − ((1 − 0.10) (5 / (51 − 5 − 1)))

= 0.10 − (0.90 (5 / 45))

= 0.10 − (0.90 (0.11))

= 0.10 − 0.10 = 0.00 (i.e., 0.00%).

If this analysis was conducted retrospectively, after the regression data were collected and analyzed, the result would indicate that the detected R² effect size of 10.0% was caused solely by sampling error, and that the predictor variables actually are completely useless in predicting the outcome variable, once we take sampling error into account. If the analysis was conducted prospectively, before data were collected, under a premise that the population R² was 0.0%, the analysis can also be interpreted as indicating that for n = 51 the uncorrected R² is expected to be 10.0% even if the population R² is 0.0%!
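Both algebraically equivalent forms of the Ezekiel correction can be checked in a few lines; a sketch, with the worked example above as input:

```python
def ezekiel_r2(r2, n, p):
    """Ezekiel-corrected R-squared, first form; p is the number of predictors."""
    return 1 - ((n - 1) / (n - p - 1)) * (1 - r2)

def ezekiel_r2_alt(r2, n, p):
    """Algebraically equivalent second form of the same correction."""
    return r2 - (1 - r2) * (p / (n - p - 1))

# The worked example: uncorrected R-squared = 10%, n = 51, five predictors.
corrected = ezekiel_r2(0.10, 51, 5)
print(corrected)  # essentially zero, within floating-point error

# The two forms agree for any inputs.
assert abs(ezekiel_r2(0.25, 40, 3) - ezekiel_r2_alt(0.25, 40, 3)) < 1e-12
```

Note that the correction can even go negative when the uncorrected R² is small relative to n and p, another reminder that it is an estimate rather than a proportion of variance in hand.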

Some Effect Size Interpretation Precepts

The effect sizes presented here and elsewhere (cf. Grissom & Kim, 2005; Thompson, 2006b, 2008) either will be computed by SPSS (e.g., η², adjusted R²) or can readily be computed by entering SPSS output information into Excel (e.g., Hays' ω²). In 1988, Measurement and Evaluation in Counseling and Development became the first journal to require effect size reporting, and a second journal, Educational and Psychological Measurement, did so in 1994. In 1996, the American Psychological Association Board of Scientific Affairs appointed its Task Force on Statistical Inference, which in its 1999 report repeatedly and unequivocally emphasized that effect size reporting is "essential" (Wilkinson & APA Task Force, 1999).

In 2002, the Australian Fiona Fidler noted that, "Of the major American associations, only all the journals of the American Educational Research Association [AERA] have remained silent on all these issues" (Fidler, 2002, p. 754). But in 2006 AERA finally did speak, and required effect size reporting in all quantitative articles in AERA journals.

Thus, today social scientists are routinely expected to report and interpret effect sizes. Following several precepts when doing so will help ensure best practice.

First, because there are dozens of effect size statistics, authors must explicitly tell readers which effect sizes they are reporting. Some effect sizes can be negative (e.g., Cohen's d), but some effect sizes cannot be negative (e.g., η²). Some effect sizes can be greater than +1 (e.g., Glass' Δ, Cohen's d), but some cannot (e.g., η², Hays' ω²). Different effect size statistics have different ranges and properties, and readers can only intelligently interpret effects when they know which specific effect size is being reported.

Second, do not interpret effect sizes using Cohen's benchmarks for "small," "medium," and "large" effects, and instead interpret effects in direct and explicit comparison against the effects in the related prior literature (Thompson, 2006b). We cannot reasonably use universal benchmarks that fail to take into account the context of the study being conducted. A 1.0% η² when the outcome is human longevity, as in the case of the effect of smoking on longevity, is important even though the effect size is numerically small! Cohen (1988) himself emphasized

… that these proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible [italics added]. They were offered as conventions because they were needed in a research climate characterized by a neglect of attention to issues of [effect size] magnitude. (p. 532)

As noted elsewhere, “if people interpreted effect sizes [using fixed benchmarks] with the same rigidity that α = .05 has been used in statistical testing, we would merely be being stupid in another metric” (Thompson, 2001, pp. 82-83).

Third, we must remember that the failure to meet the statistical assumptions of analytic methods (e.g., homogeneity of regression in ANCOVA) to some degree compromises the accuracy of all statistical results, including effect sizes. We never perfectly meet statistical assumptions, and we need to take into account how inaccurate our estimates may be given varying degrees of assumption violations.

Fourth, effect sizes should be reported even for statistically nonsignificant effects. A cumulative literature may detect important effects even when all studies individually yield p_CALCULATED = 0.06, as illustrated by Thompson (2002b). Remember Rosnow and Rosenthal's (1989) view that "surely, God loves the .06 [level of statistical significance] nearly as much as the .05" (p. 1277)!

References

American Educational Research Association. (2006). Standards for reporting on empirical social science research in AERA publications. Educational Researcher, 35(6), 33-40.

Anderson, D. R., Burnham, K. P., & Thompson, W. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912-923.

Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23, 321-327.

Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33, 526-536.

Boring, E. G. (1919). Mathematical vs. scientific importance. Psychological Bulletin, 16, 335-338.

Capraro, R. M., & Capraro, M. (2002). Treatments of effect sizes and statistical significance tests in textbooks. Educational and Psychological Measurement, 62, 771-782.

Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.

Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62, 749-770.

Fidler, F. (2005). From statistical significance to effect estimation: Statistical reform in psychology, medicine, and ecology. Doctoral dissertation, University of Melbourne. Retrieved from http://www.botany.unimelb.edu.au/envisci/fiona/fidlerphd_aug06.pdf

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.

Grissom, R. J. (1994). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79, 314-316.

Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach. Mahwah, NJ: Erlbaum.

Hess, B., Olejnik, S., & Huberty, C. J. (2001). The efficacy of two improvement-over-chance effect sizes for two-group univariate comparisons under variance heterogeneity and nonnormality. Educational and Psychological Measurement, 61, 909-936.

Hubbard, R., & Armstrong, J. S. (2006). Why we don't really know what statistical significance means: A major educational failure. Journal of Marketing Education, 28, 114-120.

Hubbard, R., & Ryan, P. A. (2000). The historical growth of statistical significance testing in psychology--and its future prospects. Educational and Psychological Measurement, 60, 661-681.

Huberty, C. J. (1999). On some history regarding statistical testing. In B. Thompson (Ed.), Advances in social science methodology (Vol. 5, pp. 1-23). Stamford, CT: JAI Press.

Huberty, C. J., & Holmes, S. E. (1983). Two-group comparisons and univariate classification. Educational and Psychological Measurement, 43, 15-26.

Huberty, C. J., & Lowman, L. L. (2000). Group overlap as a basis for effect size. Educational and Psychological Measurement, 60, 543-563.

Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63, 763-772.

Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Kirk, R. E. (2003). The importance of effect magnitude. In S. F. Davis (Ed.), Handbook of research methods in experimental psychology (pp. 83-105). Oxford, United Kingdom: Blackwell.

Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410-416.

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.

Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335-392). Mahwah, NJ: Erlbaum.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-129.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Erlbaum.

Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434-438.

Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26-30.

Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology, 9, 167-183.

Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues: Strong arguments move the field. Journal of Experimental Education, 70, 80-93.

Thompson, B. (2002a). "Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider? Journal of Counseling and Development, 80, 64-71.

Thompson, B. (2002b). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31(3), 24-31.

Thompson, B. (2006a). Foundations of behavioral statistics: An insight-based approach. New York: Guilford.

Thompson, B. (2006b). Research synthesis: Effect sizes. In J. Green, G. Camilli, & P. B. Elmore (Eds.), Complementary methods for research in education (pp. 583-603). Washington, DC: American Educational Research Association.

Thompson, B. (2007). Effect sizes, confidence intervals, and confidence intervals for effect sizes. Psychology in the Schools, 44, 423-432.

Thompson, B. (2008). Computing and interpreting effect sizes, confidence intervals, and confidence intervals for effect sizes. In J. Osborne (Ed.), Best practices in quantitative methods (pp. 246-262). Newbury Park, CA: Sage.

Vacha-Haase, T., & Thompson, B. (2004). How to estimate and interpret various effect sizes. Journal of Counseling Psychology, 51, 473-481.

Wang, Z., & Thompson, B. (2007). Is the Pearson r² biased, and if so, what is the best correction formula? Journal of Experimental Education, 75, 109-125.

Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Ziliak, S. T., & McCloskey, D. N. (2004). Size matters: The standard error of regressions in the American Economic Review. Journal of Socio-Economics, 33, 527-546.