There are situations within middle school settings where measurements of students and teachers are used for high-stakes decisions. For example, student performance is used as an indicator of teacher quality or determines student eligibility for particular types of support services. Given the high-stakes nature of these types of assessments, understanding the strengths and limitations of the measurement is essential. Generalizability theory is a method to help better understand reliability and can be used to inform and improve measurement issues in middle school settings. The purpose of this article is to create a more informed consumer of measurement issues in middle school settings by focusing on the reliability of a particular measurement scenario: observation of middle school teacher behaviors. Teacher observations might be conducted by other teachers, principals, or other administrators or researchers as one indicator of instructional quality. We describe how a one-facet crossed design and a two-facet crossed design can be applied to different observational protocols and offer considerations for those interested in conducting a generalizability study. We emphasize that no single design will apply to all generalizability studies and findings for a particular observational measure will not necessarily generalize to other observational measures. Instead, the researcher must consider the purpose and use for their particular observational measure.
Researchers have long recognized that measurement issues in education are important but often overlooked (e.g., Haertel, 2006; Koretz, 2009; Linn, 2001; Pedhazur & Schmelkin, 1991). The high-stakes nature of testing-based accountability systems has led to increased attention to the impact of measurement issues and the need to better communicate the impact of these issues to practitioners and policymakers. Examples of measurement issues include whether student performance on assessments is reflective of teacher behavior or whether the sample of observations of a particular teacher is reflective of the day-to-day practices of the teacher. Without attention to measurement issues, findings predicated on the measures are not valid and can lead to inaccurate conclusions about students and teachers. The purpose of this article is to create a more informed consumer of measurement issues in middle school settings by focusing on the reliability of a particular measurement scenario: observation of middle school teacher behaviors.
The Association for Middle Level Education (2010) offers Curriculum, Instruction, and Assessment characteristics of successful middle level education. These characteristics describe the learning opportunities that should be provided for middle grade students, among them engagement in active, purposeful learning and a challenging, exploratory, integrative, and relevant curriculum. Teachers should be prepared to teach and also use multiple teaching approaches. So, one example of an observation at the middle school level would be to see if a teacher’s classroom exemplified the curriculum, instruction, and assessment characteristics of the Association for Middle Level Education. Other observations of teacher behaviors in middle school settings include situations where a principal observes a mathematics teacher (see for example, Blase & Blase, 2000; Ing, 2010; Nelson & Sassi, 2005) or teachers observe other teachers as part of a professional development opportunity (see for example, Fernandez, Cannon, & Chokshi, 2003; Hargreaves, 1991; Stigler & Hiebert, 1999). The details of every observational protocol vary depending on the purpose of the particular observation. For example, a principal might spend approximately 5 minutes per month observing each teacher in their school. The purpose of this observation might be to gather general information about whether each teacher holds their students’ attention. To do this, the principal might be looking to see whether students have their textbooks open or if the students are facing the teacher. The level of detail that the principal is able to gather in these 5 minutes is different from the purpose of another observation by researchers who set up audio and video equipment to capture the details of teacher-student interactions. There are measurement issues within each of these situations that should be of concern for the middle school population. 
For example, how long should the principal be present in each classroom to get a sense of student engagement? How many times should the principal visit each classroom? Should the principal observe all students in the classroom or select one or two students within each classroom? Answers to these questions will influence the type of information gathered from the observation. Generalizability theory helps address some of these questions and, in doing so, helps to improve the quality of behavioral measures in middle school settings.
INTRODUCTION TO GENERALIZABILITY THEORY
Generalizability theory (G theory) is a framework to evaluate the reliability or dependability of behavioral measurements (e.g., Brennan, 1992, 1997, 2000, 2001b; Cardinet, Tourneur, & Allal, 1976, 1981; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963; Shavelson & Webb, 1991). Behavioral measurement is a generic term that refers to any sort of measure of behaviors, such as a measure of student mathematics achievement based on performance on multiple-choice response items and a measure of teacher instructional practice based on observations of what occurs in classrooms. G theory addresses a limitation of classical test theory where sources of error are not differentiated (Brennan, 2011; Webb, Shavelson, & Haertel, 2007). In classical test theory, the observed score of the individual is the sum of the true score of the individual and error. The error term is undifferentiated in that one cannot tell whether the error arises because raters rank ordered individuals differently or because individuals were rank ordered differently depending on when they were observed. One does not know whether improving the dependability of the measure is best addressed by increasing the number of raters observing the individual, increasing the number of occasions to observe the individual, or increasing both raters and occasions. Moreover, in classical test theory, one does not know how many more raters to include or how many additional observations are needed to achieve a sufficient level of reliability.
Unlike classical test theory, G theory considers random and systematic sources of error and differentiates these different sources of variation. Through this process of identifying and quantifying these different sources of error, the reliability or dependability of the behavioral measurement can be improved. G theory draws on a sampling framework. A measurement is considered a sample from a universe of all possible observations. For example, a single observation of a teacher’s instructional practice is considered to be a sample from all possible observations of the teacher’s practice. The extent to which this single observation represents all possible observations indicates how dependable it is to generalize from this single observation to all possible observations.
To determine the extent to which one can generalize from this single observation, it is necessary to identify the potential sources of variation of the measurement situation. In G theory, these sources of variation are referred to as facets of measurement. Some typical examples of a facet are occasions, raters, items, and test forms. For observations of teachers, the occasion facet refers to the effect for all teachers due to behavioral inconsistencies from one occasion to another. In other words, do different occasions elicit different ratings of teacher behavior? Consider whether a teacher was observed on a Monday compared to a Friday. Would you expect to see different ratings of their behavior? In some cases, you might expect ratings to differ across days or occasions, whereas in other cases you might find behavior to be fairly similar regardless of which day or occasion you observed the teacher. In the Learning and Teaching Geometry project (Seago, Driscoll, & Jacobs, 2010), the researchers chose to videotape three lessons on 3 consecutive days, when the middle school students were introduced to the concept of similarity, to examine the change in mathematical focus of the lesson. It is important to emphasize that not all behavioral measures will have the same sources of error. Some behavioral measures will not be concerned with observations as a source of variation. Some measures might be more concerned with who is conducting the observations or the items included in the observational protocol. Decisions on which source of error to include are based on theoretical, practical, and statistical considerations.
One-Facet Crossed Design
In a one-facet design, there is one potential source of measurement error. For example, teachers (p) are the object of measurement and occasions (o) are the only source of error identified. This is a crossed design because all teachers were observed on all occasions (p × o). If there were ten teachers and four occasions in the study, a crossed design indicates that all ten teachers were observed on all four occasions. If five teachers were observed on the first two occasions and the other five teachers were observed on the last two occasions, this would not be a “crossed” design because not all teachers are observed across all occasions (see Figure 1).
In this one-facet crossed design, the observed score of teacher p on observation o is denoted as $X_{po}$. This is the sum of the grand mean, teacher effect, observation effect, and residual effect. The design is decomposed as follows:
$$X_{po} = \mu + (\mu_p - \mu) + (\mu_o - \mu) + (X_{po} - \mu_p - \mu_o + \mu)$$
where $\mu$ is the grand mean ($E_p E_o X_{po}$, the mean over both the teacher population and the universe of possible observations), $\mu_p$ is the teacher’s universe score ($E_o X_{po}$, the expected value of the random variable $X_{po}$ across observations), and $\mu_o$ is the teacher population mean for observation o ($E_p X_{po}$). The residual is the effect attributable to the interaction of teacher p with observation o confounded with experimental error, which is denoted as po, e. A teacher’s universe score ($\mu_p$) is defined as the expected value of his or her observed score over all observations in the universe. This is similar to the “true” score in classical test theory.
After the observed score is decomposed, the variance for each component is estimated. The variance components of the one-facet crossed model are as follows:
$$\sigma^2(X_{po}) = \sigma^2_p + \sigma^2_o + \sigma^2_{po,e}$$
where $\sigma^2_p$ is the variance due to teachers, $\sigma^2_o$ is the variance due to observations, and $\sigma^2_{po,e}$ is the residual variance. The residual variance includes variance in scores that is not associated with persons or with observations. This includes both systematic and unsystematic sources of error that have not been explicitly accounted for in this particular design.
The relative magnitude of the different variance components provides information about the potential source of error influencing a measurement and determines the dependability of the measure. The variance components are estimated by computing mean squares from an analysis of variance, equating these to their expected values, and solving the resulting set of linear equations. An advantage of generalizability theory is that each variance component can be interpreted directly; for example, the occasion component reflects the main-effect difference from one occasion to another.
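The estimation steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: it assumes a fully crossed p × o design with exactly one score per cell and no missing data, the function name `variance_components_p_x_o` and the score matrix are hypothetical, and negative estimates are set to zero, which is a common convention but not the only option.

```python
def variance_components_p_x_o(scores):
    """Estimate variance components for a one-facet crossed (p x o) design.

    scores: list of rows, one row per teacher and one column per occasion,
    with exactly one score in every cell (assumption: no missing data).
    """
    n_p = len(scores)
    n_o = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_o)
    p_means = [sum(row) / n_o for row in scores]
    o_means = [sum(row[j] for row in scores) / n_p for j in range(n_o)]

    # Sums of squares for teachers, occasions, and the residual (po,e).
    ss_p = n_o * sum((m - grand) ** 2 for m in p_means)
    ss_o = n_p * sum((m - grand) ** 2 for m in o_means)
    ss_res = sum(
        (scores[i][j] - p_means[i] - o_means[j] + grand) ** 2
        for i in range(n_p)
        for j in range(n_o)
    )

    # Mean squares: sums of squares divided by their degrees of freedom.
    ms_p = ss_p / (n_p - 1)
    ms_o = ss_o / (n_o - 1)
    ms_res = ss_res / ((n_p - 1) * (n_o - 1))

    # Equate mean squares to their expected values and solve:
    #   E(MS_p)   = var_res + n_o * var_p
    #   E(MS_o)   = var_res + n_p * var_o
    #   E(MS_res) = var_res
    # Negative estimates are truncated at zero (a convention, not a rule).
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_o, 0.0)
    var_o = max((ms_o - ms_res) / n_p, 0.0)
    return {"p": var_p, "o": var_o, "po,e": var_res}


# Hypothetical ratings: 4 teachers (rows) observed on 3 occasions (columns).
hypothetical_scores = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 5, 5],
    [1, 2, 2],
]
print(variance_components_p_x_o(hypothetical_scores))
```

Dividing each estimate by the sum of the three gives the proportion of total variance attributable to teachers, occasions, and the residual.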
Two-Facet Crossed Design
In a two-facet design, there are two potential sources of measurement error. For example, teachers (p) could be observed on multiple occasions (o) by multiple raters (r). In this situation, one is interested in the degree to which there are differences in teacher behaviors as observed by different raters on different occasions. This is referred to as a person-crossed-with-raters-and-occasions (p × r × o) design because raters and occasions are the two potential sources of error included in the measurement procedure and all teachers are observed on all occasions by all raters. The observed score of teacher p as rated by rater r on occasion o is denoted as $X_{pro}$, and is the sum of the grand mean, the teacher, rater, and occasion effects, their corresponding two-way interactions, and the residual effect. The residual effect is attributable to the three-way interaction of teacher p with rater r and occasion o confounded with unsystematic error.
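Written out, the verbal decomposition above might look like the following. This is a sketch that extends the one-facet notation; the symbols $\mu_{pr}$, $\mu_{po}$, and $\mu_{ro}$ (means over the facet not named in the subscript) are assumed here for illustration and do not appear in the original text.

```latex
\begin{align*}
X_{pro} = {} & \mu                                           && \text{grand mean} \\
  & + (\mu_p - \mu)                                          && \text{teacher effect} \\
  & + (\mu_r - \mu)                                          && \text{rater effect} \\
  & + (\mu_o - \mu)                                          && \text{occasion effect} \\
  & + (\mu_{pr} - \mu_p - \mu_r + \mu)                       && \text{teacher} \times \text{rater} \\
  & + (\mu_{po} - \mu_p - \mu_o + \mu)                       && \text{teacher} \times \text{occasion} \\
  & + (\mu_{ro} - \mu_r - \mu_o + \mu)                       && \text{rater} \times \text{occasion} \\
  & + (X_{pro} - \mu_{pr} - \mu_{po} - \mu_{ro}
       + \mu_p + \mu_r + \mu_o - \mu)                        && \text{residual } (pro, e)
\end{align*}
```

Summing the eight terms cancels every mean and leaves $X_{pro}$, which is a quick check that no effect has been counted twice.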
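The variance components of these effects in this model are as follows:
$$\sigma^2(X_{pro}) = \sigma^2_p + \sigma^2_r + \sigma^2_o + \sigma^2_{pr} + \sigma^2_{po} + \sigma^2_{ro} + \sigma^2_{pro,e}$$
where $\sigma^2_p$ is the variance due to teachers, $\sigma^2_r$ is the variance due to raters, $\sigma^2_o$ is the variance due to occasions, $\sigma^2_{pr}$ is the variance due to the interaction between teachers and raters, $\sigma^2_{po}$ is the variance due to the interaction between teachers and occasions, $\sigma^2_{ro}$ is the variance due to the interaction between raters and occasions, and $\sigma^2_{pro,e}$ is the residual variance.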
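As a rough illustration of how these seven components could be estimated, the sketch below applies the usual random-effects expected mean squares to a fully crossed p × r × o design. Several things here are assumptions: the function name and the data are hypothetical, there is exactly one score per teacher-rater-occasion cell, each facet has at least two levels, all facets are treated as random, and negative estimates are truncated at zero by convention.

```python
from itertools import product


def variance_components_p_x_r_x_o(x):
    """x[p][r][o]: one score per teacher-rater-occasion cell (fully crossed)."""
    n_p, n_r, n_o = len(x), len(x[0]), len(x[0][0])
    P, R, O = range(n_p), range(n_r), range(n_o)

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Grand mean, main-effect means, and two-way cell means.
    m = mean(x[p][r][o] for p, r, o in product(P, R, O))
    m_p = [mean(x[p][r][o] for r, o in product(R, O)) for p in P]
    m_r = [mean(x[p][r][o] for p, o in product(P, O)) for r in R]
    m_o = [mean(x[p][r][o] for p, r in product(P, R)) for o in O]
    m_pr = [[mean(x[p][r][o] for o in O) for r in R] for p in P]
    m_po = [[mean(x[p][r][o] for r in R) for o in O] for p in P]
    m_ro = [[mean(x[p][r][o] for p in P) for o in O] for r in R]

    # Mean squares = sum of squares / degrees of freedom.
    ms_p = n_r * n_o * sum((m_p[p] - m) ** 2 for p in P) / (n_p - 1)
    ms_r = n_p * n_o * sum((m_r[r] - m) ** 2 for r in R) / (n_r - 1)
    ms_o = n_p * n_r * sum((m_o[o] - m) ** 2 for o in O) / (n_o - 1)
    ms_pr = n_o * sum(
        (m_pr[p][r] - m_p[p] - m_r[r] + m) ** 2 for p, r in product(P, R)
    ) / ((n_p - 1) * (n_r - 1))
    ms_po = n_r * sum(
        (m_po[p][o] - m_p[p] - m_o[o] + m) ** 2 for p, o in product(P, O)
    ) / ((n_p - 1) * (n_o - 1))
    ms_ro = n_p * sum(
        (m_ro[r][o] - m_r[r] - m_o[o] + m) ** 2 for r, o in product(R, O)
    ) / ((n_r - 1) * (n_o - 1))
    ms_res = sum(
        (x[p][r][o] - m_pr[p][r] - m_po[p][o] - m_ro[r][o]
         + m_p[p] + m_r[r] + m_o[o] - m) ** 2
        for p, r, o in product(P, R, O)
    ) / ((n_p - 1) * (n_r - 1) * (n_o - 1))

    # Solve the expected-mean-square equations for each component.
    estimates = {
        "p": (ms_p - ms_pr - ms_po + ms_res) / (n_r * n_o),
        "r": (ms_r - ms_pr - ms_ro + ms_res) / (n_p * n_o),
        "o": (ms_o - ms_po - ms_ro + ms_res) / (n_p * n_r),
        "pr": (ms_pr - ms_res) / n_o,
        "po": (ms_po - ms_res) / n_r,
        "ro": (ms_ro - ms_res) / n_p,
        "pro,e": ms_res,
    }
    # Truncate negative estimates at zero (a convention, not a rule).
    return {name: max(value, 0.0) for name, value in estimates.items()}


# Hypothetical ratings: 3 teachers x 2 raters x 2 occasions.
hypothetical = [
    [[3, 4], [3, 3]],
    [[2, 2], [1, 2]],
    [[5, 4], [4, 5]],
]
print(variance_components_p_x_r_x_o(hypothetical))
```

With real data, a dedicated generalizability or mixed-model routine would usually be preferable, particularly when the design is unbalanced.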
Other Design Options
G studies need not be crossed (Brennan, 2001b). In a crossed design, facets are crossed with one another (Figure 1). For example, if students are the object of measurement, all students are observed on all occasions. In a nested design, facets are not crossed with each other. Instead, each level of one facet is associated with only one level of another facet. For example, if students are the object of measurement, some students were only observed on one occasion and another group of students were observed on another occasion. In this situation, students are nested within observation since students were associated with only one occasion and not another occasion. It is still possible to estimate the contribution of the nested facets, but it is not possible to separate the contribution of the different facets to the total variation (see for example, Cardinet, Johnson, & Pini, 2010).
Another design option is treating the facets as fixed and not random. G theory is based on a sampling framework, so facets are typically treated as being randomly selected from the universe. For example, the content of the observation (such as mathematics) is randomly sampled from across all possible content areas (such as science, art, social studies). Findings could then be generalized to all content areas and not just the areas that were randomly selected. A random facet indicates that the levels have been randomly selected from the respective population or universe of interest. With random facets, there is an intention to generalize to a larger universe, not just those facets included in the generalizability study. A fixed facet indicates that specific levels of the facet are purposefully selected. For example, if a particular content area is selected, such as mathematics, the researcher is not interested in generalizing beyond the particular content selected. G theory allows for both fixed and random facets; however, at least one facet must be random. It is not possible to have all fixed facets in a generalizability study because there would be no sampling or no intentions to generalize to a larger universe.
Assessing Dependability: Generalizability Coefficients
After the variance components are calculated, the dependability or reliability of the measure is assessed. There are two generalizability coefficients to help assess the dependability of a measure: the relative coefficient and the absolute coefficient. A relative coefficient is used for decisions where there is interest in rank ordering the object of measurement. For example, in a norm-referenced test, one might be interested in ranking student performance relative to other students’ performance. An absolute coefficient is used for decisions related to some absolute level of performance. For example, in a criterion-referenced test, one might be interested in making inferences about whether students achieved a particular level of understanding relative to specific content standards. Both coefficients range from 0 to 1, with higher values reflecting more dependable or reliable procedures. The relative coefficient, in particular, is analogous to the reliability coefficient in classical test theory.
To calculate the relative coefficient for a one-facet crossed design (p × o), variance components are used to estimate the sources of error for relative decisions and then the sources of error are used to calculate the relative coefficient. The variance components are considered “estimates” of the population values. The population values are never observed directly and can only be estimated from the sample of responses that are obtained. The source of error for relative decisions for a one-facet crossed design (p × o) is calculated as follows:
$$\hat{\sigma}^2_{\delta} = \frac{\hat{\sigma}^2_{po,e}}{n_o}$$
Here, $n_o$ is the number of observations and $\hat{\sigma}^2_{po,e}$ is the estimated variance component for the residual (po, e). Then the variance component for relative decisions is used to calculate the relative coefficient:
$$E\hat{\rho}^2 = \frac{\hat{\sigma}^2_p}{\hat{\sigma}^2_p + \hat{\sigma}^2_{\delta}}$$
Here, the estimated variance component related to the object of measurement (persons, $\hat{\sigma}^2_p$) and the estimated variance component for relative decisions ($\hat{\sigma}^2_{\delta}$) are included.
To calculate the absolute coefficient ($\hat{\Phi}$) for a one-facet crossed design (p × o), variance components are used to estimate the sources of error for absolute decisions and then the sources of error are used to calculate the absolute coefficient:
$$\hat{\sigma}^2_{\Delta} = \frac{\hat{\sigma}^2_o}{n_o} + \frac{\hat{\sigma}^2_{po,e}}{n_o}$$
Here, $\hat{\sigma}^2_o$ is the estimated variance component for observations, $n_o$ is the number of observations, and $\hat{\sigma}^2_{po,e}$ is the estimated variance component for the residual (po, e). This variance component for absolute decisions is then used to calculate the absolute coefficient (sometimes referred to as $\Phi$):
$$\hat{\Phi} = \frac{\hat{\sigma}^2_p}{\hat{\sigma}^2_p + \hat{\sigma}^2_{\Delta}}$$
Here, the estimated variance component related to the object of measurement (persons, $\hat{\sigma}^2_p$) and the estimated variance component for absolute decisions ($\hat{\sigma}^2_{\Delta}$) are included. For this one-facet crossed design (p × o), the difference in how the relative coefficient and absolute coefficient are calculated is that the absolute coefficient also includes the variance component for observations.
Improving Dependability: Decision Study
To improve dependability of the measure, results from a generalizability study are used to calculate a “what-if” decision study (d-study). Information about potential tradeoffs in future implementation of the measure is determined (e.g., Marcoulides, 1993, 1997; Marcoulides & Goldstein, 1990) by manipulating the variance components for relative or absolute decisions. This provides information about what could be done differently if different sources of error are modified. For example, if one was interested in the observations facet, the number of observations ($n_o$) could be changed to reflect what would happen if the number of observations increased or decreased. A hypothetical numerical example of how the variance components and generalizability coefficients change as the number of occasions changes for a one-facet crossed design (p × o) is presented in Table 1.
As can be seen from this example, with 10 observations, the absolute and relative generalizability coefficients are lower than .60; however, as the number of observations increases, so do the generalizability coefficients. After 40 observations, there appears to be a point of diminishing returns in that the increase in the generalizability coefficient levels off. This information could be used to guide how many observations to include in future administrations by forecasting the minimum number of observations that should be carried out to achieve a minimum level of reliability.
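The kind of forecasting described above can be sketched as a short decision-study routine. The variance components below are hypothetical values chosen only for illustration; they are not the values behind Table 1, and the function names `d_study` and `min_observations` are likewise assumptions.

```python
def d_study(var_p, var_o, var_res, n_o):
    """Return (relative, absolute) coefficients for a p x o design
    in which each teacher is observed n_o times."""
    rel_error = var_res / n_o             # error variance, relative decisions
    abs_error = (var_o + var_res) / n_o   # error variance, absolute decisions
    relative = var_p / (var_p + rel_error)
    absolute = var_p / (var_p + abs_error)
    return relative, absolute


def min_observations(var_p, var_o, var_res, target, absolute=False, max_n=1000):
    """Smallest n_o whose coefficient reaches `target`, or None if not
    reached within max_n observations."""
    for n_o in range(1, max_n + 1):
        relative, absolute_coef = d_study(var_p, var_o, var_res, n_o)
        if (absolute_coef if absolute else relative) >= target:
            return n_o
    return None


# Hypothetical variance components (teachers, observations, residual).
VAR_P, VAR_O, VAR_RES = 0.5, 0.1, 1.0

for n_o in (1, 5, 10, 20, 40, 80):
    relative, absolute = d_study(VAR_P, VAR_O, VAR_RES, n_o)
    print(f"n_o={n_o:3d}  relative={relative:.2f}  absolute={absolute:.2f}")

print("Minimum n_o for a relative coefficient of .70:",
      min_observations(VAR_P, VAR_O, VAR_RES, 0.70))
```

Because the error terms shrink in proportion to 1/n_o, each additional observation adds less than the one before, which is the leveling-off pattern described in the text.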
Current Trends
Current trends in G theory include different approaches for handling missing data (see for example, Chiu & Wolfe, 2002; Marcoulides, 1996) and conceptualizing facets as latent traits (see for example, Briggs & Wilson, 2007; Marcoulides & Drezner, 1993). Missing data is not necessarily problematic when conducting a generalizability analysis because if a score is missing on one occasion it does not impact the score on another occasion. In other words, if teachers are supposed to be observed on three occasions, but data is missing on one occasion, the two other observations are not likely to be impacted. Because how the teacher scores on one occasion is related to scores on the other occasions, statistical methods such as item response theory and structural equation modeling can be applied to address such missing data issues. However, there may be conceptual issues underlying the missing data that need to be addressed. For example, if data is systematically missing from teachers in difficult-to-staff schools (who dropped out of teaching during the study), it might call into question the extent to which the remaining sample is representative of the population of teachers that the researchers hope to generalize to.
Statistical methods are applied to modeling latent traits in generalizability theory. In G theory, for example, models are based on what is observed. Using latent variable modeling techniques such as item response theory or structural equation modeling, models can be estimated in terms of latent variables. Facets using a latent variable approach represent a continuum rather than a single observable point in time. In other words, it is not just about what we observed for particular raters that contributes to the rater facet. Raters are conceptualized in terms of unobservable characteristics that can place them along a continuum that might describe how severe or harsh they are. These unobservable characteristics are not necessarily something that is directly measured with a single variable but are modeled using several observable variables.
DESIGNING A GENERALIZABILITY STUDY
There is no shortage of considerations when designing a study to assess the reliability of observational measures. The reliability of an observational measure can be impacted by numerous factors such as the characteristics of the measure, the characteristics of the sample being measured, and the conditions of administration. Conducting a study of the reliability of the measure provides evidence of the extent to which these factors contribute to the consistency or dependability of the scores obtained. No single design will apply to all generalizability studies, and findings for a particular observational measure will not necessarily generalize to other observational measures. Instead, the researcher must consider the different purposes of their particular measure. Cronbach (2004) provided several considerations including aspects of the test plan; independence of sampling; heterogeneity of content; how the measurement will be used; and number of conditions for the test. Cronbach also suggested that the researcher first consider whether the measure will be used by others. If the measure would be used again by the researcher or others, there is a greater need to provide guidance for instrumentation. If there is no intention to use the measure at another time, the researcher is only concerned with the adequacy of the scores for the purpose of that one study and does not need to be concerned with instrumentation issues. Brennan (2001a) also suggested that the researcher start off with deciding on the purpose of the instrument by asking themselves, “What constitutes a replication of a measurement procedure?” He also suggested asking questions about the “characteristics of the data actually available or to be collected to estimate reliability” (Brennan, 2004, p. 9). With these considerations in mind, here are examples of the measurement issues around two hypothetical observational measures.
Observational Measure Example 1. Here is a hypothetical but widely applicable scenario of using an observational measure of teacher behaviors. In this scenario, consider an observational measure of mathematics teaching. This measure is designed to be used in middle school classrooms throughout a state. The state plans to use this measure again and is open to sharing this measure with others.
The purpose of this particular measure is to look at global or stable aspects of mathematics teaching, such as the physical layout of the classroom. The expectation is that what is being measured is fairly consistent throughout the school year regardless of the lesson or the students observed. For example, the physical layout of most middle school mathematics classrooms might not change very much throughout the school year. Mathematics teachers tend to have the same placement of desks and chairs. Unlike science teachers, mathematics teachers are not likely to have specialized equipment that might obstruct the pathway to the door or prohibit students from accessing their desks or chairs. Mathematics teachers are also not likely to change the layout of the desks from one class period to another. Students might change seats throughout the year, but the general layout is likely to be fairly consistent. Thus, observing the physical space will not depend on whether the classroom observation occurred at the beginning of the school year or end of the school year or what period is observed. It is also not likely to matter what students are included in the classroom or how the teacher and students are interacting on the day(s) the observations are conducted.
The facet of most concern, given these assumptions about the particular instrument, is who conducts the observations. There is likely a protocol for preparing raters to conduct these observations (see, for example, Martinez, Borko, & Stecher, 2011), which aims for consistency in the observations regardless of who conducts the observation. In other words, if one rater has 20 years of experience conducting these types of observations and another rater has no experience, all raters should notice and interpret aspects of the physical layout of the classroom in similar ways. One rater shouldn’t be paying attention to aspects of the classroom that do not relate to the physical layout compared to another rater.
To better understand whether raters are operating in similar ways, a generalizability study can be designed to investigate raters as a source of error. A one-facet crossed generalizability study in which the same raters observe the same classrooms provides information about the proportion of variation due to raters. If raters were fairly consistent, the proportion of variation due to raters is low. If raters were not fairly consistent, the proportion of variation due to raters is high. After conducting a generalizability study, a decision study provides more specific information about how the generalizability coefficients change when the number of raters is increased.
Table 2 provides hypothetical data that illustrate this example. In this illustration, five raters observed the physical layout of 10 classrooms with a 5-point scale. This is a one-facet crossed design in that there is only one facet considered in this study (raters) and all raters observed all objects of measurement (classrooms). A visual inspection of these hypothetical data suggests that raters are fairly consistent. For example, scores for Classrooms 6-10 are exactly the same regardless of which rater is considered.
When this data is analyzed using generalizability theory, 92% of the variation is due to differences between classrooms and only 2% of the variation is due to raters, which supports the visual inspection of the data (Table 3). The remaining 6% of the variation (residual) is not attributable to the main effects of classrooms or raters. The residual includes both systematic and unsystematic error. For example, some raters might be thinking about a different version of the observational protocol when visiting particular classrooms. The raters might be looking for different aspects of the physical layout or may pay attention to aspects that are not central to the current version of the observational protocol. This sort of variation due to raters might not necessarily show up in the rater facet because this is something that only happens once and with one or two raters (not consistently across all five raters).
The concern with this sort of observational protocol is probably related to absolute decisions. The observations might be done to ensure that the physical layout of all classrooms meets some specified level of safety standards. The observational measure is probably not as concerned with which classroom layout is safer than another classroom. Thus, the focus for this particular measure is on absolute decisions rather than relative decisions. The generalizability coefficient for absolute decisions is 0.98. This is a high coefficient and the percent of variance due to raters is low. One concludes from this hypothetical example that raters who are trained contribute little to the variation in scores. A decision study indicates that with only one rater, the generalizability coefficient is 0.92. This suggests that only one rater is needed for future observations to achieve an acceptable level of reliability using this particular protocol and training procedure. There is no need for more than one rater to observe classrooms using this hypothetical observational measure because additional raters require additional resources (such as funds to pay additional raters, coordinate the observation, and provide transportation to the school). These resources would not be well utilized because additional raters provide redundant information about classrooms that is consistent with information obtained from a single rater. However, it is still necessary to monitor raters throughout the course of all observations because sometimes raters shift their thinking about the protocol or become fatigued. To help do this, two raters could be assigned to a random sample of classrooms to ensure that these raters are still on track and thinking about the observational protocol in the same way.
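The two coefficients reported in this example can be checked against the variance percentages given for Table 3. The sketch below assumes those percentages (92% classrooms, 2% raters, 6% residual) can be read as proportions of the total variance for a single rater, and it uses $n_r$ for the number of raters and $c$ for classrooms; both symbols are introduced here only for illustration.

```latex
\begin{align*}
\hat{\Phi}(n_r) &= \frac{\hat{\sigma}^2_c}
  {\hat{\sigma}^2_c + \dfrac{\hat{\sigma}^2_r + \hat{\sigma}^2_{cr,e}}{n_r}} \\[6pt]
\hat{\Phi}(1) &= \frac{0.92}{0.92 + \dfrac{0.02 + 0.06}{1}}
  = \frac{0.92}{1.00} = 0.92 \\[6pt]
\hat{\Phi}(5) &= \frac{0.92}{0.92 + \dfrac{0.02 + 0.06}{5}}
  = \frac{0.92}{0.936} \approx 0.98
\end{align*}
```

Under that reading, moving from one rater to five raises the absolute coefficient by only about .06, which is consistent with the conclusion that a single trained rater is sufficient for this protocol.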
Observational Measure Example 2. Suppose that instead of observing a stable or global aspect of instruction, such as the physical layout of the classroom, you are interested in observing something that might be less stable or consistent, such as student engagement (Brinegar & Bishop, 2011; Association for Middle Level Education, 2010). Middle school mathematics teachers are the object of measurement in this situation. You will need to make decisions on whether you are interested in teachers focused on a particular content area or teachers focused on particular students (perhaps focusing on students who are not English language learners). For example, if students are observed while they are working independently on a worksheet, it is not likely that there will be much to observe in terms of student engagement. If the same students were observed on another day when they were comparing explanations, there might be greater opportunity to observe student engagement. Once these initial decisions are made, you will need to identify different facets of measurement. Here are examples of three potential facets.
Items. This facet of measurement relates to variation depending on how student engagement is defined. There are differences in how researchers define and measure student engagement (see, for example, Jansen, 2012). The items included in the protocol might define student engagement in terms of general student-teacher interaction. This type of interaction is logistically easier to observe but might potentially miss nuanced differences of teacher interaction between particular students compared to other students. The way in which the items are written might not allow for such nuances. Instead, the items might focus on more general observations, such as student-teacher interaction, rather than items that require greater attention to the quality of the interaction between teachers and students.
The items might help the rater focus on particular aspects of student-teacher engagement but might fail to capture potentially important student-student engagement (interactions that happen between students when the teacher isn’t present). The items included might also focus on what students say to each other. This method of defining engagement privileges verbal engagement but might miss important nonverbal student engagement. These nonverbal interactions might be more difficult to capture or observe but might be an important part of student engagement that occurs in a classroom. Thus, the items included may not capture the full range of opportunities to measure student engagement. Items are considered a source of variation because you might be interested in the extent to which the sample of items selected for the observational measure is representative of all possible ways in which students engage in classrooms.
Observations. This facet of measurement relates to variation depending on when and how frequently the classrooms are observed. For a given observational protocol, it might matter what time of year or how frequently the observation occurred. For example, student engagement at the beginning of the school year might be different from student engagement at the end of the school year or before winter or spring break. Student engagement might also depend on the time of day the observation occurs or on the lesson, content, activity, or student you are observing. If you observed a classroom on a single day, you might not feel confident that you captured student engagement in this classroom. Thus, exploring observations as a facet will provide information about how well a single observation represents all possible observations.
Raters. This facet of measurement relates to variation depending on who conducts the observations. All raters are expected to complete the same training procedures. However, for observational protocols that target aspects of instruction that are particularly difficult to capture, such as student engagement, there might still be differences between raters in terms of what they notice or pay attention to. Raters who have extensive experience observing middle school student interactions might pay attention to different aspects of the engagement compared to raters with less experience. The training is supposed to even out these sorts of differences between raters. However, when a protocol requires attention to how students are engaging around the mathematical content and the raters are not familiar with what this looks like for the particular content area, there is likely to be some variation between raters that is worth addressing. Raters might be trained for the mathematical content in algebra but might be less familiar with what to look for when the content shifts to geometry. One might not expect engagement to look particularly different depending on the mathematical content, but if the items included in the protocol require an understanding of the differences in student engagement depending on the content or problems posed, variability between raters might be a facet worth investigating further.
Students. This facet of measurement relates to variation depending on which students are included in the observation (Ing & Webb, 2012). You might be concerned that a randomly selected group of middle school students, rather than all middle school students, does not capture typical student engagement. Are the middle school students selected to be observed representative of student engagement for the entire classroom or for students across all classrooms for a single middle school teacher? What would be observed if the students selected for the observation were particularly verbal or interactive compared to students who did not interact much with others? Suppose you randomly select a pair of middle school students to observe. The middle school students might not be engaging with each other for a range of reasons, from logistical (maybe they are assigned to work independently) to more substantive (maybe these students never interact with each other even when provided the opportunity to do so). Conclusions about student engagement based on these two students might not necessarily be representative of student engagement for all of the students in that class. Suppose instead that you randomly selected a pair of students who are constantly talking with each other. You would still be concerned with whether this pair is representative of engagement for all middle school students in the classroom. Observing more middle school students in the classroom might address this issue of representativeness but then raises issues of how closely observations can attend to the details of student engagement. The level of engagement that can be captured for more students might be more general or superficial compared to the level of engagement that can be captured for fewer students. These are just a few of the tradeoffs that you must consider when measuring students from a given middle school classroom.
These are also just a few facets that might be relevant to this hypothetical observational measure. These facets are conceptually and statistically intertwined, so although these facets were presented in isolation, it is likely that there is interaction between these facets that needs to be considered. Generalizability theory models these interactions and provides information about the variance due to these interactions. For example, you might be more confident if you went back on several occasions or if you had additional observers in the classroom with you so that you could observe additional student interactions or could observe the same interactions and discuss what you observed. How student engagement is defined and measured on these different occasions raises issues about the interaction between the different facets that might require attention. The goal of this process is to determine how representative a particular sample of observations is of all possible observations. This requires attention to multiple facets, not just a single facet.
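To make the idea of interaction variance concrete, consider a simplified case that uses only two of the facets described here, raters (r) and observation occasions (o), fully crossed with teachers (p) as the object of measurement. Under the random-effects model commonly used in generalizability theory, the variance of a single observed score can be decomposed as sketched below. The symbols are notation introduced for illustration and are not taken from a particular protocol.

```latex
\sigma^2(X_{pro}) = \sigma^2_{p} + \sigma^2_{r} + \sigma^2_{o}
                  + \sigma^2_{pr} + \sigma^2_{po} + \sigma^2_{ro}
                  + \sigma^2_{pro,e}
```

In this sketch, \sigma^2_{p} is the variance among teachers, \sigma^2_{r} and \sigma^2_{o} are the main effects of raters and occasions, the two-way terms are the interactions between pairs of sources, and \sigma^2_{pro,e} confounds the three-way interaction with unexplained error because there is only one score per teacher, rater, and occasion.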
The impact of these types of decisions was explored in a recent study of mathematics classrooms using an observational protocol that focused on student explanations (Ing & Webb, 2012). The researchers found that decisions about issues such as how to define and measure engagement, what student(s) to focus on, and the overall structure of the classroom (whole class, small-group work, independent work) influenced what was observed and the conclusions about student engagement. Classroom profiles vastly changed depending on these types of decisions. This raises issues of the potential tradeoffs when conducting these observations. For example, it might be more feasible to hire raters who do not have a lot of experience observing classrooms. These raters could be prepared to implement observational protocols that are relatively straightforward, such as scoring the physical layout of the classroom. However, these raters might have a more challenging time with the implementation of more nuanced observational protocols, such as scoring student engagement. With a more nuanced observational protocol, it might matter more which raters are conducting these observations, how many observations are included, or what students are actually included in the observation. The tradeoff comes in decisions around what observational protocol to use and how to implement this protocol consistently.
These are the types of tradeoffs to consider when designing your own generalizability study and interpreting technical information about an observational measure. For example, suppose you are most concerned with raters and occasions. You could design a two-facet crossed design in which multiple raters observe classrooms on multiple occasions. The occasions could be randomly selected throughout a particular lesson or unit and/or across multiple units. The period in which the observations occur could also be randomly selected or the same periods could be selected across multiple observations. If the goal is to generalize across all periods, then the periods should be randomly selected. If there is no interest in generalizing across all periods (perhaps you’re only interested in periods with students with similar achievement levels), then selecting the same period on multiple occasions is reasonable. The selection of the periods might also be guided by theoretical and practical assumptions. It is also possible to incorporate some nested aspects, such as the period in which the teachers are observed (teachers nested within periods) or the lesson in which the teachers are observed (teachers nested within lessons). Once the data have been collected, there are numerous software options to help you conduct your generalizability study, including specialized options such as EduG (Cardinet et al., 2010) or GENOVA (Crick & Brennan, 1983) and more general statistical packages such as SAS (2011) or Stata (2012). There is no single design that will be appropriate for all observational protocols, and the facets identified for each study will vary depending on the issues around the particular observational protocols. It is the responsibility of the researcher to lay out the issues and gather the appropriate evidence to address them.
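For readers who want to see what such an analysis involves, the following is a minimal sketch in Python of a two-facet crossed design (teachers by raters by occasions). It is not the procedure used by any of the packages named above. It assumes a complete data set with one score per teacher, rater, and occasion, at least two levels of each, and all facets treated as random; it estimates variance components from the usual expected mean squares and sets negative estimates to zero, which is one common convention. The function names, dictionary keys, and scores are hypothetical and were made up for illustration.

```python
from itertools import product
from statistics import mean


def estimate_variance_components(scores):
    """scores[p][r][o]: score for teacher p from rater r on occasion o."""
    n_p, n_r, n_o = len(scores), len(scores[0]), len(scores[0][0])
    P, R, O = range(n_p), range(n_r), range(n_o)

    # Marginal means for every main effect and two-way combination.
    grand = mean(scores[p][r][o] for p, r, o in product(P, R, O))
    m_p = [mean(scores[p][r][o] for r, o in product(R, O)) for p in P]
    m_r = [mean(scores[p][r][o] for p, o in product(P, O)) for r in R]
    m_o = [mean(scores[p][r][o] for p, r in product(P, R)) for o in O]
    m_pr = [[mean(scores[p][r][o] for o in O) for r in R] for p in P]
    m_po = [[mean(scores[p][r][o] for r in R) for o in O] for p in P]
    m_ro = [[mean(scores[p][r][o] for p in P) for o in O] for r in R]

    # Mean squares: sums of squares divided by degrees of freedom.
    ms_p = n_r * n_o * sum((m_p[p] - grand) ** 2 for p in P) / (n_p - 1)
    ms_r = n_p * n_o * sum((m_r[r] - grand) ** 2 for r in R) / (n_r - 1)
    ms_o = n_p * n_r * sum((m_o[o] - grand) ** 2 for o in O) / (n_o - 1)
    ms_pr = n_o * sum((m_pr[p][r] - m_p[p] - m_r[r] + grand) ** 2
                      for p, r in product(P, R)) / ((n_p - 1) * (n_r - 1))
    ms_po = n_r * sum((m_po[p][o] - m_p[p] - m_o[o] + grand) ** 2
                      for p, o in product(P, O)) / ((n_p - 1) * (n_o - 1))
    ms_ro = n_p * sum((m_ro[r][o] - m_r[r] - m_o[o] + grand) ** 2
                      for r, o in product(R, O)) / ((n_r - 1) * (n_o - 1))
    ms_res = sum((scores[p][r][o] - m_pr[p][r] - m_po[p][o] - m_ro[r][o]
                  + m_p[p] + m_r[r] + m_o[o] - grand) ** 2
                 for p, r, o in product(P, R, O)
                 ) / ((n_p - 1) * (n_r - 1) * (n_o - 1))

    # Solve the expected-mean-square equations for each component.
    raw = {
        "p": (ms_p - ms_pr - ms_po + ms_res) / (n_r * n_o),
        "r": (ms_r - ms_pr - ms_ro + ms_res) / (n_p * n_o),
        "o": (ms_o - ms_po - ms_ro + ms_res) / (n_p * n_r),
        "pr": (ms_pr - ms_res) / n_o,
        "po": (ms_po - ms_res) / n_r,
        "ro": (ms_ro - ms_res) / n_p,
        "pro,e": ms_res,  # three-way interaction confounded with error
    }
    # Negative estimates are set to zero (an assumed convention).
    return {name: max(value, 0.0) for name, value in raw.items()}


def g_coefficient(vc, n_raters, n_occasions):
    """Generalizability coefficient for relative decisions about teachers."""
    relative_error = (vc["pr"] / n_raters + vc["po"] / n_occasions
                      + vc["pro,e"] / (n_raters * n_occasions))
    return vc["p"] / (vc["p"] + relative_error)


# Made-up scores: 4 teachers x 3 raters x 2 occasions.
example_scores = [
    [[2, 3], [2, 2], [3, 3]],
    [[4, 4], [3, 4], [4, 5]],
    [[1, 2], [2, 2], [1, 1]],
    [[3, 3], [4, 3], [3, 4]],
]
components = estimate_variance_components(example_scores)
for name, value in components.items():
    print(f"variance component {name}: {value:.3f}")
print("coefficient with 3 raters, 2 occasions:",
      round(g_coefficient(components, 3, 2), 3))
```

Calling `g_coefficient` with different numbers of raters and occasions shows how the projected reliability would change under alternative observation plans, which is the kind of tradeoff discussed above. A nested or unbalanced design would require different formulas than the ones in this sketch.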
CONCLUSIONS
Generalizability theory has been used in practice for many years (see, for example, Cronbach et al., 1963) and is incorporated in many research studies about middle schools (see, for example, Martinez, Borko, Stecher, Luskin, & Kloser, 2012; Newton, 2011). Generalizability theory is not limited to teacher observations and has been used for student observations and school observations (see, for example, Cronbach et al., 1997). Generalizability theory can be used to inform measurement issues in these different situations by estimating the magnitude of different sources of error and providing a reliability (generalizability) coefficient for the proposed use of the observational measure. Decomposing the sources of variation provides information that can be used to redesign an observational tool. While the statistical theory is well established, what is perhaps more challenging are the practical and conceptual questions around conducting observations (see, for example, Correnti & Martinez, 2012). For example, it is challenging to identify the facets of measurement for a particular observational tool. This requires making clear what the assumptions are about what is being observed and how these assumptions relate to decisions about what factors to consider in a generalizability study.
There is no straightforward design that will work for all observational measures. It is up to the discretion of those using the observational measure to make choices about the most important facets to capture and then design a study that provides information about those facets. Doing so will help create information about observational measures that are more reliable and hopefully lead to more accurate conclusions about middle school students and teachers.

