
Recommendations made by major educational and psychological organizations (American Educational Research Association, 2006; American Psychological Association, 2001) call for researchers to regularly report confidence intervals. The purpose of the present paper is to provide support for the use of confidence intervals. To contextualize this discussion, a brief account of what null hypothesis statistical significance testing (NHSST) does and does not do as well as a sensible alternative to NHSST precedes the discussion of confidence intervals. Next, accessible instructions for computing confidence intervals and related graphics are provided to encourage and facilitate usage. Finally, recommendations are discussed that will facilitate the collective research enterprise.

Do you remember the game 20 questions, where the object of the game was to guess what the other person was thinking about? Suppose you could choose to play the game the old way, where you were allowed to ask only yes/no questions to arrive at your best guess, or instead play a new and improved version of the game. In the new version, not only would you be given a yes or no response to your questions, but the response would be quantitatively qualified. For example, suppose one of your questions was about the mystery object’s length. With the old rules, you would get a response of yes (or no, but suppose it was yes). With the new rules, you would get an answer of yes and the length is 12 cm. Additionally, if you liked, you could ask for the range of values that were plausible. If you were playing with the old rules, numerous questions would be required before you could determine a rough range of plausible values (e.g., Is the length equal to 2 cm? Is the length equal to 55 cm?). With the new rules, you could simply request the range. Perhaps you would be given range values between 8 cm and 16 cm. Even more exciting, in this new way to play you would be allowed (actually encouraged) to ask for input from your friends who had played this game before so that you could gather more information to help make your best guess! Which game rules would you choose? If you really do want to know what the mystery object is, the obvious response would be to choose to play by the new rules.

What does this story about a childhood game have to do with confidence intervals? In a crude and simplistic sense, this story is analogous to the choice between strict adherence to null hypothesis statistical significance testing (NHSST) versus recommended best practices of using effect sizes, confidence intervals, and confidence intervals for effect sizes (Hunter & Schmidt, 2004; Thompson, 2007). There is a long history of researchers (Bakan, 1966; Cohen, 1994; Jones, 1955; Oakes, 1986; Thompson, 1993, 1996, 1997) who have advocated the move away from the less informative, often misinterpreted (Nickerson, 2000), ways of knowing ascribed to NHSST, and toward more enlightened views. However, there are still those who just as vehemently stand by significance testing (Chow, 1988, 1998; Cortina & Dunlap, 1997; Robinson & Levin, 1997).

In support of NHSST, researchers have claimed that the arguments against NHSST are actually directed at its users. Therefore, some researchers claim that any problem attributed to NHSST “is the fault of those who are doing the interpreting, not the tools that they choose” (Cortina & Dunlap, 1997, p. 164). A similar argument insists “it is not the fault of the process… if some of its users misunderstand it, expect more from it than it promises to deliver, or apply it inappropriately” (Nickerson, 2000, p. 274). Misinterpretations aside, researchers who disparage NHSST argue that NHSST does not offer any information regarding the magnitude of the effect or any indication of sampling error (Carver, 1993).

In 1996, the Task Force on Statistical Inference (TFSI) was commissioned in part with the task of shedding light on the issues surrounding the statistical significance debate. While the Task Force did not recommend banning NHSST, it did make several recommendations regarding effect sizes and confidence intervals. Wilkinson and the TFSI (1999) directed researchers to “always [emphasis added] provide some effect-size estimate when reporting a p value” (p. 599). Furthermore, they explained, “it is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval” (Wilkinson & TFSI, 1999, p. 599).

Similarly, the American Psychological Association (2001) reminded researchers that in order “for the reader to fully understand the importance of your findings, it is almost always necessary to include some index of effect size or strength of relationship in your Results section” (p. 25). On the issue of reporting confidence intervals, the APA noted that confidence intervals “can be an extremely effective way of reporting results. Because confidence intervals combine information on location and precision and can often be directly used to infer significance levels, they are, in general, the best reporting strategy. The use of confidence intervals is therefore strongly [emphasis added] recommended” (p. 22).

This position has been criticized by some who expected more than mere recommendations from an organization which publishes a manual detailing specific requirements on arguably more trivial matters (Fidler, 2002; Finch, Thomason, & Cumming, 2002).

Also taking a stand on effect sizes and confidence intervals, the American Educational Research Association (2006) called effect sizes “useful” and recommended that researchers report effect sizes as a way to “capture the magnitude” (p. 37) of results. In addition, the AERA required that researchers report “for each of the statistical results that is critical to the logic of the design and analysis” a standard error or a confidence interval to give readers “an indication of the uncertainty” of that estimate (p. 37).

The purpose of the present paper is to explain the use of confidence intervals for middle grades research. To contextualize this discussion, a brief account of what NHSST does and does not do as well as a sensible alternative to NHSST precedes the discussion of confidence intervals. Next, accessible instructions for computing confidence intervals and related graphics are provided to encourage and facilitate usage. Finally, recommendations are discussed that will facilitate the collective research enterprise.

While calculated p-values allow us to reject or fail to reject the null hypothesis (often the nil null hypothesis), they provide no information about the magnitude of the effect we are purportedly investigating. For example, a statistical significance test may tell us that a sample statistic (e.g., mean, median, standard deviation, correlation) is greater than another sample statistic, but it does not tell us to what extent the first of the two statistics being compared is greater.

What We Wish pcalculated Was

So if p-values are not describing how different our sample statistic is, what do p-values tell us? A clear explanation is given by Thompson (1996): “pcalculated is the probability (0 to 1.0) of the sample statistics, given the sample size and assuming the sample was derived from a population in which the null hypothesis is exactly true” [emphasis added] (p. 27). First, this quote makes clear that the probability applies to the sample mean (or median, correlation, etc.) that is being investigated, and not to the population. As noted by several researchers, the inference is from the sample to the population, not from the population to the sample (Cohen, 1990; Nickerson, 2000; Thompson, 2006a). What we would like to know is how likely our value is in the population, because that would give us an indication of result replicability. However, as Cohen (1994) so poignantly explained, a significance test “does not tell us what we want to know, and we so much want to know what we want to know, out of desperation, we nevertheless believe that it does” (p. 997)!
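To make this definition concrete, consider a short simulation, sketched here in Python (the observed difference, sample size, and population values are invented for illustration): it draws many samples from a population in which the nil null is exactly true and counts how often a mean difference at least as extreme as the observed one occurs. That relative frequency is what pcalculated estimates.

import numpy as np

rng = np.random.default_rng(1)
n, sims = 30, 100_000
observed_diff = 5.0  # hypothetical observed difference between two group means

# Two groups drawn from the SAME population, so the nil null is exactly true
diffs = (rng.normal(50, 17, (sims, n)).mean(axis=1)
         - rng.normal(50, 17, (sims, n)).mean(axis=1))

# pcalculated ~ probability of a result at least this extreme, GIVEN the null
p_sim = np.mean(np.abs(diffs) >= observed_diff)
print(f"simulated two-tailed p given a true null: {p_sim:.3f}")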

Size Matters

The p-value is dependent on sample size. As numerous researchers have noted, any difference (or relationship) can be shown to be statistically significant if a large enough sample is used (Cohen, 1990; Thompson, 2006a). Thus, a statistically significant p-value mainly reminds researchers that the sample is large, which of course they already know, because they collected the data (Thompson, 1992, 2006a).
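The sample-size dependence is easy to demonstrate. In the hedged Python sketch below (a tiny, fixed true effect of 0.05 standard deviations, with invented data), the same effect drifts toward statistical significance as n grows:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (25, 100, 400, 1600, 6400):
    # The true effect never changes; only the sample size does
    sample = rng.normal(0.05, 1.0, n)
    t, p = stats.ttest_1samp(sample, popmean=0.0)
    print(f"n={n:5d}  t={t:6.2f}  p={p:.4f}")

With a large enough n, even this trivial departure from the null eventually produces p < .05.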

NHSST Tells You What You Already Know

Finally, the sample is assumed to have been drawn from a population where there was no difference in the sample statistic of interest (e.g., Meangirls = Meanboys). However, as Meehl (1978) pointed out, “the null hypothesis, taken literally, is always false [emphasis added]” (p. 822). Furthermore, Loftus (1996) noted, “rejecting a typical null hypothesis is like rejecting the proposition that the moon is made of green cheese. The appropriate response would be ‘Well, yes, okay … but so what?’” (p. 163).

Statistical Significance ≠ Importance

Another important point about NHSST: just because you obtain statistically significant results does not necessarily mean you have found something worthwhile or important. For example, if you correlate the district ID number with the enrollment of all Texas public secondary schools, you would find a statistically significant correlation of -0.105 (p = 0.000208, r2 = 1.1%, n = 1244). Wow, there is a negative relationship between secondary enrollment and the district identification number. Surely this is an important discovery that is going to change the world. Umm…no. Whether a given result is important cannot be answered by any given p-value. Long ago, commentary about significance testing warned us that “when we reach a point where our statistical procedures are substitutes instead of aids to thought, and we are led to absurdities, then we must return to the common sense basis” (Bakan, 1966, p. 436). Surely researchers recognize that p-values cannot make judgment calls.

Finally, Schmidt called the misconception that a non-statistically significant result meant that the difference or the relation was zero, or so small that it could be considered to be zero, “the most devastating of all to the research enterprise” (1996, p. 126). This belief is premised on the idea that if the null hypothesis cannot be rejected, then it must be accepted (Schmidt). Of course, we all remember from elementary statistics courses that the null can never be accepted; we can only fail to reject the null (see Thompson, 2006a, for a lucid explanation involving the celebration of Mardi Gras in New Orleans).

A Sensible Alternative to NHSST

Effect sizes quantify “the extent to which the sample statistics diverge from the null hypothesis” (Thompson, 2006a, p. 172). In other words, effect sizes tell you how different the sample statistics are from a specified value. When effect sizes are zero, your results support the null hypothesis that, for example, Meangirls = Meanboys on an outcome variable. Any deviation from this null will result in a nonzero effect size. For example, if Meanboys = 99 and Meangirls = 99, then your effect size = 0. However, if Meanboys = 98 and Meangirls = 99, then the effect size will not be zero. It will be very small, but it will not be zero. If Meanboys = 80 and Meangirls = 99, you would get a comparatively larger effect size. There are effect sizes that provide information about mean differences, known as the d family, and effect sizes that quantify relationships, known as the r family (Rosenthal, 1994). Just as there are formulas to calculate statistics, there are formulas to calculate the numerous types of effect sizes.
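For instance, Cohen’s d, a member of the d family, is simply the mean difference divided by the pooled standard deviation. A minimal Python sketch (group scores invented for illustration):

import numpy as np

def cohens_d(x, y):
    # Standardized mean difference using the pooled standard deviation
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

girls = np.array([99.0, 97.0, 101.0, 100.0, 98.0, 102.0])
boys = np.array([98.0, 96.0, 99.0, 97.0, 100.0, 98.0])
print(f"d = {cohens_d(girls, boys):.2f}")  # about 0.90 for these invented scores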

The use of effect sizes has been advocated by numerous researchers (Capraro, 2004; Elmore & Rotou, 2001; Henson, 2006; Thompson, 1999a, 1999b, 1999c, 2001, 2006b). Effect sizes come in many flavors. Some of the better known effect sizes are r2, ϕ, R2, η2, ω2, Wilks’ λ, Glass’s Δ, and Cohen’s d (cf. Elmore & Rotou, 2001; Kirk, 1996), and an excellent historical account of these and other effect sizes is provided by Huberty (2002). Given this variety, researchers must indicate in their results exactly which effect size is being used. No sensible researcher provides a location statistic without identifying which location statistic is being described, because they recognize that there are different location statistics. Furthermore, not all effect sizes fall in the range of 0 to 1, and some are even scaled in the opposite direction; therefore, correct effect size interpretation mandates that researchers always declare exactly which effect size is being presented in a given report and then interpret its meaning in terms of the study at hand.

How Do You Interpret Effect Sizes?

Some have argued that effect sizes are as subject to misinterpretation as significance tests. Deciding whether an effect is practically significant is not a mechanical yes/no evaluation against a predetermined level, as in significance testing. Instead, recommended best practices encourage researchers to compare effect sizes, “directly and explicitly,” to other effect sizes within the relevant literature (Thompson, Diamond, McWilliam, Snyder, & Snyder, 2005, p. 191).

In his classic text, Cohen (1988) gave general guidelines for researchers to help qualify the magnitude of effect sizes as “small,” “medium,” and “large” but encouraged people “not [emphasis added] to employ them if possible” (p. 532). Implicit in the encouragement to compare effect sizes within the relevant literature is that researchers should usually not use Cohen’s benchmarks as an interpretation crutch (Shaver, 1993). Blindly relying on Cohen’s benchmarks would ignore the relevant related literature, which is really the ruler by which researchers should evaluate the magnitude of a study’s effects. As Thompson pointed out, “if people interpreted effect sizes with the same rigidity that α = .05 has been used in statistical testing, we would merely be being stupid in another metric” (2001, pp. 82-83). Stated differently, a very small effect size may actually have very important consequences, and a very large effect size may actually be irrelevant. Thus, it is not sensible to qualify effect sizes out of context. For example, people are warned about the dangers of smoking even though the relationship between smoking and lung cancer is comparatively small (r2 = 1% to 2%), because for most people, knowledge about how to decrease the probability of getting lung cancer is important. On the other hand, no one would particularly care if I were to reveal to them that campus numbers and county numbers in Texas public schools are perfectly correlated (r2 = 100%).

Effect sizes are not a panacea for middle grades research. The old adage “garbage in, garbage out” is particularly relevant here. Because statistics calculated from a study are used to compute effect sizes, it follows that if a study has been poorly executed, neither the statistics nor the effect sizes derived from them are necessarily useful.

The point is, you cannot avoid careful reflection (nor should you want to!). A researcher invests a tremendous amount of time, energy, and resources in conducting a study; therefore, as previous researchers have noted, “no one is in a better position than the researcher who collected and analyzed the data to decide whether the effects are trivial or not. It is a curious anomaly that researchers are trusted to make a variety of complex decisions in the design and execution of an experiment, but in the name of objectivity they are not expected to nor even encouraged to decide whether the effects are practically significant” (Kirk, 2001, p. 214).

While effect size and statistical significance test proponents disagree on issues of statistical inference, one area of agreement is support for the inclusion of confidence intervals (Cohen, 1990, 1994; Gardner & Altman, 1986; Greenwald, Gonzalez, Harris, & Guthrie, 1996; Hunter & Schmidt, 2004; Loftus, 1996; Robinson & Wainer, 2002; Rozeboom, 1960; Schmidt, 1996; Thompson, 2006c). Put simply, confidence intervals are an estimated range of values around a point estimate (e.g., a mean or correlation) calculated from a given set of sample data that provides an indication about the precision of your estimate. Calls for reporting of confidence intervals are not new. In fact, Jones (1955) argued over fifty years ago that an investigator “would be misled less frequently and would be more likely to obtain the information he seeks were he to formulate his experimental problem in terms of the estimation of population parameters, with the establishment of confidence intervals about the estimated values, rather than in terms of a null hypothesis against all possible alternatives” (p. 407).

Figure 1

Twenty Randomly Drawn 95% Confidence Intervals for the Mean When µ = 30 and σ = 17

Confident About What?

Figure 1 presents 20 randomly drawn 95% confidence intervals for the mean, given n = 30, µ = 30, and σ = 17, where µ is the population mean and σ is the population standard deviation. The figure was drawn with the Exploratory Software for Confidence Intervals (ESCI) module CIjumping (cf. Cumming & Finch, 2001). Examining Figure 1, it is easy to see that out of the 20 randomly drawn 95% confidence intervals, only one (shown in black) did not capture the population mean; the other 19 did. It is important to make clear here what the confidence intervals are not telling you. A typical description of a confidence interval might be, “in the case of a 95% confidence interval, a researcher can be 95% confident that the computed interval contains the true value of the population parameter” (Sheskin, 2004, p. 65). However, as you can see, the probability of this ninth sample (centered at ~39) capturing the population mean is not 95%.

The appropriate interpretation is that if you drew an infinite number of samples from a given population, 95% of the intervals would capture the population value and 5% would not. The confidence interval is not about one sample; it is about an infinite number of samples. As tempting as it is to be confident in an individual study’s results, do not forget Thompson’s (2007) infamous inequality: 1 ≠ ∞. As is clearly illustrated in Figure 1, researchers should humbly recognize that even if their study is a randomly drawn sample from the population of interest, the 95% confidence interval for their one study may not be capturing the population mean.
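The long-run meaning of “95%” is easy to verify by simulation. A Python sketch under the same assumed parameters as Figure 1 (µ = 30, σ = 17, n = 30) builds a t-based interval from each of many samples and counts how many capture µ:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n, reps = 30, 17, 30, 10_000
tcrit = stats.t.ppf(0.975, df=n - 1)

samples = rng.normal(mu, sigma, (reps, n))
means = samples.mean(axis=1)
half_widths = tcrit * samples.std(axis=1, ddof=1) / np.sqrt(n)
captured = np.abs(means - mu) <= half_widths
print(f"{captured.mean():.1%} of intervals captured the population mean")

Across many replications the capture rate hovers near 95%, but nothing guarantees that any single interval, including yours, is one of the captures.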

Precision

Confidence intervals offer valuable insight about the precision of your sample estimates. If a confidence interval is narrower rather than wider, one will probably be more willing to trust the point estimate, because “when intervals are wide, the evidence for a given point estimate being correct is called into question” (Thompson et al., 2005). One important way to think about confidence intervals, then, is that they provide an estimate of the precision of sample statistics through the width of the confidence band. Thus, confidence intervals provide a means to help decide whether researchers should or should not be overly enthusiastic about their point estimate (Thompson, 2006a).

Consistency

Confidence intervals for a body of literature also paint a picture of the status of an entire area of research. If the confidence intervals around the sample point estimates are fairly stable, this implies that the literature is fairly consistent across studies. On the other hand, if the estimates fluctuate substantially and exhibit wide confidence intervals, the studies may be using small sample sizes but are, in any case and for whatever reason, producing unreliable estimates.

Examples From a Published MGRJ Article

To make subsequent examples more concrete, where possible, sample calculations will be drawn from a study that evaluated the impact of a standards-based curriculum, Connected Mathematics Project, on students’ mathematics achievement (Kulm, Capraro, & Capraro, 2007). For the purposes of these examples, one outcome measure from the study, the Texas Learning Index (TLI), will be used. The TLI describes a student’s performance on a state achievement test in relation to the passing standard. Figure 2 displays the actual 6th grade mean TLI score and 95% confidence interval from Kulm et al. alongside the results of nine hypothetical studies. You can see that some confidence intervals are quite broad and others are narrower; nonetheless, the confidence intervals seem to be centered on 77. Graphical depictions of the confidence interval from Kulm et al. versus the confidence intervals from the hypothetical related prior studies clearly situate results within the relevant literature as well as communicate an overall pattern or trend across a field.

Figure 2

Heuristic Comparison of Mean 6th Grade TLI Scores with 95% Confidence Intervals for Kulm et al. (2007) and Nine Hypothetical Studies

Interpreting Confidence Intervals

Cohen (1994) quipped that “‘everyone knows’ that confidence intervals contain all the information to be found in significance tests and much more” (p. 1002). For example, in Figure 2, if the null hypothesis is that the samples were drawn from a population where the mean is assumed to be 77, about 5 of the intervals would capture the parameter value. Results whose intervals contain 77 would not be statistically significant at the α = 0.05 level implied by a 95% confidence interval. While confidence intervals can be used to test a null hypothesis, providing the same result as statistical significance tests, this is not the best use of confidence intervals. One critical difference between using confidence intervals and NHSST becomes apparent when one realizes that one can compute and interpret confidence intervals without even having a null hypothesis, but one cannot conduct NHSST without a null hypothesis. Researchers can take advantage of what confidence intervals offer by comparing the overlaps of confidence intervals across studies, thereby empirically evaluating the consistency of evidence across studies (Thompson, 2002).
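The duality Cohen alluded to can be checked mechanically: a 95% confidence interval that excludes the hypothesized value corresponds to rejecting that null at α = .05, and one that includes it corresponds to failing to reject. A toy Python check with invented intervals:

null_value = 77.0
intervals = [(74.2, 79.1), (77.8, 81.3), (73.5, 76.4)]
for lo, hi in intervals:
    verdict = "fail to reject" if lo <= null_value <= hi else "reject at alpha = .05"
    print(f"[{lo}, {hi}] vs. H0 mean = {null_value}: {verdict}")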

Confidence Intervals for Effect Sizes

The endorsement of contextualizing current research in the framework of previous research is not likely to draw opposition. Surely we recognize the value of a thorough literature review that concisely summarizes what is known about middle grades research, particularly if the result includes a graphic figure portraying the confidence intervals for a specific literature. How wonderfully concise is a figure depicting confidence intervals that clearly and coherently contextualizes our results in the framework of prior research studies.

While depicting confidence intervals for statistics across studies is important, in educational research we are often more interested in examining the differences between or relationships among two or more groups. Confidence intervals for effect size satisfy this need.

Computing Confidence Intervals

After recognizing the inherent value of confidence intervals, the next pressing question is how to compute confidence intervals. Thankfully, the process of computing and graphically depicting confidence intervals has been greatly facilitated by computers.

It is not the intent of the present paper to demonstrate how to compute confidence intervals for all statistics and effect sizes, as there are several accessible and more comprehensive sources explaining how to compute confidence intervals for sample statistics (Fan & Thompson, 2001; Henson, 2006) and for effect sizes (Cumming & Finch, 2005; Smithson, 2001, 2003). Also, confidence intervals for commonly used statistics such as the mean can be obtained with ease using any statistical software package or Excel. The present paper considers examples of confidence intervals for the Pearson r, and for an effect size, the standardized mean difference given by Cohen’s d.
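For example, a t-based 95% confidence interval for a mean takes only a few lines in Python (scores invented for illustration); statistical packages and Excel produce the same interval through their menus:

import numpy as np
from scipy import stats

scores = np.array([72.0, 81.0, 77.0, 69.0, 85.0, 74.0, 79.0, 73.0, 80.0, 76.0])
m, se = scores.mean(), stats.sem(scores)  # sample mean and its standard error
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=m, scale=se)
print(f"M = {m:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")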

Confidence Interval for the Correlation Coefficient

Computing the confidence interval for the correlation coefficient requires transforming the Pearson product-moment correlation coefficient into Fisher’s z. This conversion can be done using tabled r to Fisher’s z values or, more conveniently, via a number of websites where simply providing an r value returns the transformed Fisher’s z (see Lane, 2008; Lowry, 2001). Next, the confidence interval is computed using the following formula:

CI = zr ± zcrit(1/√(n − 3)),

where zr is the result of transforming r to z as described above, zcrit is the critical value of the standard normal distribution (1.96 for a 95% confidence interval, α = 0.05), and n is the sample size. The limits of this interval can then be converted back into r via the same websites (see Lane; Lowry).

Given an r value of 0.33 and n = 40, you first convert r to zr (using tabled values or websites), obtaining zr = 0.343. Inputting these values into the equation results in the following:

0.343 ± 1.96(1/√(40 − 3)) = 0.343 ± 0.322,

or (0.021, 0.665) in the z metric. Finally, you convert these zr limits back to r (again using tabled values or websites), resulting in a confidence interval of (0.021, 0.582).
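The same steps can be scripted rather than worked through tables or websites. A minimal Python version of the procedure just described, reproducing the worked example:

import math
from statistics import NormalDist

def r_confidence_interval(r, n, conf=0.95):
    # Confidence interval for a Pearson r via the Fisher r-to-z transformation
    z_r = math.atanh(r)                            # transform r to Fisher's z
    z_crit = NormalDist().inv_cdf(0.5 + conf / 2)  # 1.96 for a 95% interval
    half_width = z_crit / math.sqrt(n - 3)
    # Back-transform the z limits to the r metric
    return math.tanh(z_r - half_width), math.tanh(z_r + half_width)

print(r_confidence_interval(0.33, 40))  # approximately (0.021, 0.582)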

Computing Confidence Intervals for Effect Sizes

While there are formulas to compute confidence intervals for statistics, there are no formulas to directly compute confidence intervals for effect sizes. The results are instead obtained through a computerized estimation process that successively refines guesses, called iteration. Thankfully, there are several scripts for statistical packages (Smithson, 2001, 2008; Zou, 2007) and software programs (Cumming & Finch, 2001, 2005; Fidler et al., 2005) that painlessly perform these operations.
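To make the iteration idea concrete, the hedged Python sketch below handles the one-sample case, where d = t/√n: it numerically searches for the noncentrality parameters whose noncentral t distributions place the observed t at the 97.5th and 2.5th percentiles, then rescales those bounds to the d metric. It illustrates the general pivoting technique; it is not the exact algorithm of the packages cited above.

from scipy import stats, optimize

def ci_for_d_one_sample(t_obs, n, conf=0.95):
    # CI for d = t / sqrt(n) by pivoting the noncentral t distribution
    df, alpha = n - 1, 1 - conf
    # Noncentrality parameter placing t_obs at the upper alpha/2 tail
    nc_lo = optimize.brentq(
        lambda nc: stats.nct.cdf(t_obs, df, nc) - (1 - alpha / 2), -50, 50)
    # Noncentrality parameter placing t_obs at the lower alpha/2 tail
    nc_hi = optimize.brentq(
        lambda nc: stats.nct.cdf(t_obs, df, nc) - alpha / 2, -50, 50)
    return nc_lo / n ** 0.5, nc_hi / n ** 0.5

print(ci_for_d_one_sample(t_obs=2.40, n=36))  # d = 0.40 with an asymmetric CI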

Table 1

Heuristic Example Comparing Results Across Studies

Study | n | t | p | Decision | d | 95% CI for d, Lower | 95% CI for d, Upper
1 | 33 | 1.90 | 0.067 | NS | 0.33 | -0.02 | 0.68
2 | 120 | 1.20 | 0.231 | NS | 0.11 | -0.07 | 0.29
3 | 35 | 1.72 | 0.095 | NS | 0.29 | -0.05 | 0.63
4 | 50 | -2.62 | 0.012 | *** | -0.37 | -0.65 | -0.08
5 | 37 | 1.09 | 0.281 | NS | 0.18 | -0.15 | 0.50
6 | 39 | 1.81 | 0.078 | NS | 0.29 | -0.03 | 0.61
7 | 99 | 1.49 | 0.139 | NS | 0.15 | -0.05 | 0.35
8 | 6 | 2.40 | 0.062 | NS | 0.98 | -0.04 | 1.94
9 | 40 | 1.71 | 0.096 | NS | 0.27 | -0.05 | 0.58
10 | 37 | 1.89 | 0.067 | NS | 0.31 | -0.02 | 0.64
Previous studies combined | 496 | 3.44 | 0.001 | *** | 0.16 | 0.07 | 0.24
Kulm et al. (2007) | 105 | 8.40 | <0.001 | *** | 0.82 | 0.60 | 1.04
Kulm et al. & previous studies combined | 601 | 6.64 | <0.001 | *** | 0.27 | 0.19 | 0.35

Note. NS indicates non-statistically significant results; *** indicates statistically significant results.

The confidence intervals for Figure 3 were drawn using Cumming’s ESCI software MAthinking (Cumming & Finch, 2001, 2005). Suppose that researchers across these individual studies were interested in the standardized mean difference in mathematics achievement as measured by the TLI, with a positive standardized mean difference score describing improvement from grade five to grade six, and a negative standardized mean difference score indicating a decline in performance from grade five to grade six.

Table 1 displays the values for Figure 3. For this example, ten hypothetical studies are compared to the Kulm et al. (2007) study results for the mean standardized difference between 5th grade TLI scores and 6th grade TLI scores. You will notice that among the previous hypothetical studies, only one researcher (Study 4) found statistically significant results (indicated by the p-value and by the confidence interval not capturing zero). Interpreting only the statistically significant result from the previous hypothetical studies would suggest that, on average, the TLI score decreases from grade five to grade six. On the other hand, following the recommended practice of situating your current study in the relevant related literature shows that grade six TLI scores display statistically significant improvement over grade five TLI scores (d = 0.27). In other words, across all 11 studies, the grade six TLI scores were on average roughly a quarter of a standard deviation better than grade five TLI scores. Moreover, the figure makes it easy to intuitively see that the results obtained by Kulm et al. show more improvement than other similar studies. However, it is the researcher’s responsibility to make the case for whether this is a meaningful result.
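The combined rows of Table 1 can be approximated with the standard fixed-effect, inverse-variance weighting of the individual d values. The Python sketch below uses a common large-sample variance approximation for a one-group standardized mean difference, an assumption here, since Table 1 does not specify the pooling method:

import numpy as np

# d values and sample sizes for the ten previous studies in Table 1
d = np.array([0.33, 0.11, 0.29, -0.37, 0.18, 0.29, 0.15, 0.98, 0.27, 0.31])
n = np.array([33, 120, 35, 50, 37, 39, 99, 6, 40, 37])

var_d = 1 / n + d ** 2 / (2 * n)   # approximate variance of each d (assumed form)
w = 1 / var_d                      # inverse-variance weights
d_pooled = np.sum(w * d) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
print(f"pooled d = {d_pooled:.2f}, "
      f"95% CI [{d_pooled - 1.96 * se:.2f}, {d_pooled + 1.96 * se:.2f}]")

Under these assumptions the result lands close to the “previous studies combined” row (d = 0.16, 95% CI [0.07, 0.24]).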

Figure 3

Heuristic Comparison of Standardized Mean Difference Effect Sizes, d, for Kulm et al. (2007) and Nine Hypothetical Studies

Graphing Your Results

Appendix A provides SPSS® (version 16.0) syntax to produce the graphic given in Figure 2. Notice that a partial screenshot of the data window shows how the variables are labeled. For those who prefer to point and click: (1) select GRAPHS, (2) CHART BUILDER, (3) under GALLERY select HIGH-LOW, (4) select the first box, and (5) drag and drop the variables listed in the left-hand pane into the appropriate boxes in the Chart Preview window: Low Variable (lower confidence band), High Variable (upper confidence band), Close Variable (point estimate), and X-Axis (Study ID). Selecting the Titles/Footnotes tab displayed underneath the Chart Preview window allows you to input whatever title you choose. After the chart is displayed in the Output, changes can be made within the Chart Editor by double clicking on any part of the chart.

Appendix B displays a Microsoft Excel® screenshot. Note the order of the variables: CI upper, CI lower, Mean. Ordering them in this way ensures that the chart displays correctly. After typing in the data, highlight the three variables (CI upper, CI lower, Mean) and select the Insert tab. In the tab containing the charts, select the arrow pointing diagonally. Next, select the Stock option and choose the first chart under it. After the chart is displayed, changes can be made by double clicking on any part of the chart.
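For readers working outside SPSS® and Excel®, the same high-low-close style chart can be sketched in Python with matplotlib (study labels and values invented for illustration):

import matplotlib.pyplot as plt

studies = ["Study 1", "Study 2", "Study 3", "Study 4", "Kulm et al."]
means = [77.2, 76.8, 77.5, 78.1, 76.4]
ci_low = [74.9, 75.6, 74.8, 76.0, 73.9]
ci_high = [79.5, 78.0, 80.2, 80.2, 78.9]

# errorbar expects distances from the point estimate, not the raw bounds
yerr = [[m - lo for m, lo in zip(means, ci_low)],
        [hi - m for m, hi in zip(means, ci_high)]]
plt.errorbar(studies, means, yerr=yerr, fmt="o", capsize=4)
plt.ylabel("6th Grade TLI")
plt.title("Confidence Intervals for the Mean Across Studies")
plt.show()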

Middle grades research is a cooperative endeavor among researchers, practitioners, and middle grades advocates who seek to find answers to questions that impact the field. In order to facilitate this process several recommendations have been put forth in the present paper. These recommendations are not new, but rather, are an aggregate of previous exhortations, recommendations, and insightful opinions of expert scholars. The recommendations are (a) if you must use NHSST, interpret your results correctly, (b) always report effect sizes, (c) always tell the reader exactly which effect size you are using, (d) interpret the results in light of previous research (i.e., do not use Cohen’s benchmarks), (e) report confidence intervals and especially confidence intervals for effect sizes, (f) graphically represent confidence intervals to demonstrate general trends across studies, and (g) think, reflect, then make the judgment call. The last recommendation could be interpreted as encompassing all the rest, because as Thompson (2006a) noted “methodology is not about math. Instead good social science research is primarily about thinking, about reflection, and about judgment” [emphasis added] (p. v).

References

American Educational Research Association. (2006). Standards for reporting on empirical social science research in AERA publications. Educational Researcher, 35, 33-40.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: American Psychological Association.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437.
Capraro, R. M. (2004). Statistical significance, effect size reporting, and confidence intervals: Best reporting strategies. Journal for Research in Mathematics Education, 35, 57-62.
Carver, R. P. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287-292.
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105-110.
Chow, S. L. (1998). The null-hypothesis significance-test procedure is still warranted. Behavioral and Brain Sciences, 21, 228-235.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997-1003.
Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161-172.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.
Cumming, G., & Finch, S. (2005). Inference by eye. American Psychologist, 60, 170-180.
Elmore, P. B., & Rotou, O. (2001, April). A primer on basic effect size concepts. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.
Fan, X., & Thompson, B. (2001). Confidence intervals for effect sizes. Educational and Psychological Measurement, 61, 517-531.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62, 749-770.
Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., et al. (2005). Toward improved statistical reporting in the Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 73, 136-143.
Finch, S., Thomason, N., & Cumming, G. (2002). Past and future American Psychological Association guidelines for statistical practice. Theory & Psychology, 12, 825-853.
Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than p values: Estimation rather than hypothesis testing. British Medical Journal, 292, 746-750.
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175-183.
Henson, R. (2006). Effect-size measures and meta-analytic thinking in counseling psychology research. The Counseling Psychologist, 34, 601-629.
Huberty, C. J. (2002). History of effect sizes. Educational and Psychological Measurement, 62, 227-240.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Thousand Oaks, CA: Sage.
Jones, L. V. (1955). Statistics and research design. Annual Review of Psychology, 6, 405-430.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Kirk, R. E. (2001). Promoting good statistical practices: Some suggestions. Educational and Psychological Measurement, 61, 213-218.
Kulm, G., Capraro, R. M., & Capraro, M. M. (2007). Teaching and learning middle grades mathematics with understanding. Middle Grades Research Journal, 2, 23-48.
Lane, D. (2008). Online statistics home page: r to Fisher z'. Retrieved April 25, 2009, from http://onlinestatbook.com/analysis_lab/r_to_z.html
Loftus, G. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161-171.
Lowry, R. (2001). Fisher r-to-z transformation. Retrieved April 25, 2009, from http://faculty.vassar.edu/lowry/tabs.html#fisher
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: John Wiley & Sons.
Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26, 21-26.
Robinson, D. H., & Wainer, H. (2002). On the past and future of null hypothesis significance testing. The Journal of Wildlife Management, 66, 263-271.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis. New York: Russell Sage Foundation.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-129.
Shaver, J. P. (1993). What statistical testing is, and what it is not. Journal of Experimental Education, 61, 293-316.
Sheskin, D. (Ed.). (2004). Handbook of parametric and nonparametric statistical procedures (3rd ed.). Boca Raton, FL: Chapman & Hall.
Smithson, M. J. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61, 605-632.
Smithson, M. J. (Ed.). (2003). Confidence intervals. Thousand Oaks, CA: Sage.
Smithson, M. J. (2008, April). Scripts and software for noncentral confidence interval and power calculations. Retrieved April 25, 2009, from http://psychology.anu.edu.au/people/smithson/details/CIstuff/CI.html
Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling & Development, 70, 434-438.
Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25, 26-30.
Thompson, B. (1997). Rejoinder: Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26, 29-32.
Thompson, B. (1999a). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology, 9, 165-181.
Thompson, B. (1999b). Statistical significance tests, effect size reporting and the vain pursuit of pseudo-objectivity. Theory & Psychology, 9, 191-196.
Thompson, B. (1999c). Why “encouraging” effect size reporting is not working: The etiology of researcher resistance to changing practices. Journal of Psychology, 133, 133-140.
Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues: Strong arguments move the field. Journal of Experimental Education, 70, 80-93.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 24-31.
Thompson, B. (2006a). Foundations of behavioral statistics: An insight-based approach. New York: Guilford.
Thompson, B. (2006b). Research synthesis: Effect sizes. In J. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of complementary methods in education research (pp. 583-603). Washington, DC: American Educational Research Association.
Thompson, B. (2007). Effect sizes, confidence intervals, and confidence intervals for effect sizes. Psychology in the Schools, 44, 423-432.
Thompson, B., Diamond, K. E., McWilliam, R., Snyder, P., & Snyder, S. W. (2005). Evaluating the quality of evidence from correlational research for evidence-based practice. Exceptional Children, 71, 181-194.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Zou, G. Y. (2007). Exact confidence interval for Cohen’s effect size is readily available. Statistics in Medicine, 26, 3054-3056.

Appendix A. SPSS® syntax for plotting confidence intervals

*Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Study
    MAXIMUM(CIUpper)[name="MAXIMUM_CIUpper"]
    MINIMUM(CILower)[name="MINIMUM_CILower"]
    MEAN(Mean)[name="MEAN_Mean"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Study=col(source(s), name("Study"), unit.category())
  DATA: MAXIMUM_CIUpper=col(source(s), name("MAXIMUM_CIUpper"))
  DATA: MINIMUM_CILower=col(source(s), name("MINIMUM_CILower"))
  DATA: MEAN_Mean=col(source(s), name("MEAN_Mean"))
  GUIDE: axis(dim(1), label("Study"))
  GUIDE: axis(dim(2), label("6th Grade TLI"))
  GUIDE: text.title(label("Confidence Intervals for the Mean Across Ten Studies"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: interval(position(region.spread.range(Study*(MINIMUM_CILower+MAXIMUM_CIUpper))), shape(shape.ibeam))
  ELEMENT: point(position(Study*MEAN_Mean), shape(shape.circle))
END GPL.

**** Double click on the chart - choose OPTIONS, TRANSPOSE CHART.

**** Double click on the marker, to change the shape, color or size of the marker.

Figure A1

SPSS® Screenshot Describing How To Plot Confidence Intervals
Appendix B

Figure B

Microsoft Excel® Screenshot Describing How To Plot Confidence Intervals