Purpose

This article addresses issues regarding the consistency and fragility of assessment instruments in learning study, based on a case. This includes what constitutes an adequate validation discussion as well as a reasonable, reliable and valid description of the research results, including necessary limitations.

Design/methodology/approach

Drawing on two learning studies, with negative numbers and the constitution of matter as objects of learning, the article focuses on the process of test development and validation. The qualitative analysis of interviews and the quantitative analysis of pilot test-taking are discussed in relation to the validity of knowledge tests in a learning study context.

Findings

This paper demonstrates the process of validating the interpretation and use of knowledge tests in relation to the aim of the learning study in a specific case. The findings indicate that validation of knowledge tests can be achieved and articulate what may be appropriate measures within existing limitations.

Practical implications

This article provides an example of hands-on evaluation of the validity of interpretation and use of knowledge tests, suggesting methods that can be implemented at the scale of a learning study to achieve this.

Originality/value

Few studies have thoroughly addressed issues of validity of interpretation and use of knowledge tests in learning and lesson studies. This study does so by discussing the dominant role of knowledge tests in some studies and the current position of validity discussions in published studies, and by describing the process of test development and validation in a specific learning study project.

Over the last two decades, increasing attention has been paid to studies of educational practice and teaching, and to classroom studies focusing on didactics (e.g. Osbeck et al., 2018). In the field of science and mathematics education, several researchers, for example Nuthall (2004), have argued for the importance of results that provide insights into relationships between teaching and learning for specific lessons or content areas, leading to the development of methods that can be used in teaching practice. Examples of such research are design-based research, lesson study and learning study. Of these, learning study has the most delimited focus on the educational objectives for specific content or, as it is described in this tradition, the objects of learning.

Learning study draws on elements from design-based research: the ambition to relate learning outcomes to teaching for specific content (see, e.g. Brown, 1992; Collins, 1992); from lesson study: the focus on a single lesson or a small set of lessons and a cyclical development process in which teaching is evaluated, observed and altered (see, e.g. Lewis et al., 2006; Stigler and Hiebert, 1999; Yoshida, 1999); and from action research: collaboration with teachers to develop practice (see Elliott, 1991). Learning study uses a theoretical input to the design and analysis of teaching and learning: variation theory (Marton et al., 2004). Many studies have shown considerable potential gains in the learning outcome (e.g. Marton and Pang, 2006). Making claims regarding the results of such studies, however, requires a large degree of control over relevant elements of teaching, as well as appropriate and specific measures of the learning outcome (in relation to what students knew beforehand).

Within the learning study tradition, attention has primarily been paid to what happens in the classroom and what might have been possible to learn (Kullberg, 2010). In contrast, as discussed below, little attention has been paid to how the learning outcome is assessed. In some cases, where pre- and post-tests have been set in an ad hoc way, little or no attention has been given to analysis or validation of the test and the test results are instead used, without further investigation, to support the adequacy of the teaching design.

Further, learning studies often focus mainly on the development process, usually involving a qualitative approach on a small scale (Lo, 2012). This means that claims from the study may be limited, being based on a small number of students, with only a few teachers involved. In this light, the certainty of such claims about learning outcomes might be questioned. While the learning outcomes are in frequent cases measured through pre- and post-tests (Maunula, 2018; Vikström, 2014; Kullberg, 2010; Runesson, 2007), a firmer basis for how the learning outcome is measured would contribute greatly to the overall validity of learning studies.

The aim of this article is to address and articulate pragmatic ways of assessing student knowledge specific to the object of learning in the context of a learning study. Further, it aims to evaluate to what extent progression in student knowledge with respect to the object of learning may be claimed to be statistically validated. Fulfilling the latter aim may be an important component in making causal inferences between teaching and learning in the learning study.

We will address this aim by discussing the process of test development and validation in a learning study project involving two objects of learning: the microscopic structure of matter in chemistry and negative numbers in mathematics. By drawing on the extensive literature on test development and assessment research, we offer concrete descriptions of how validation of the claims of test use and interpretation, in relation to the claims of the learning study, may be carried out in a specific case, and of how this process leads to an increased awareness of difficulties and uncertainties, which scaffolds the validity of the assessment and, accordingly, the validity of the research results.

In lesson and learning studies, teachers collaboratively develop the teaching of one or a few lessons in order to support student learning. A learning study is a form of action research (often) grounded in variation theory (Adamson and Walker, 2011). The process starts with identifying an object of learning and involves an investigation into the different ways in which students may experience the object of learning and thus what students need to discern to see the object of learning in the intended way (Kullberg, 2010; Runesson, 2007; Pong et al., 2005). Teaching is then carried out which attempts to give the students opportunities to discern that which is critical. The object of learning and the teaching design are central to a learning study and develop iteratively during the process. To get an indication of how a specific group of students understands the object of learning, a test or interview is typically used before and after the lesson(s). A test that is used as a measure of the students' knowledge of the object of learning must be specific in relation to the object of learning, for example the addition and subtraction of negative numbers, and cover both the whole and important parts of the object of learning, while not being overly reliant on generic domain knowledge.

In learning studies and similar kinds of lesson studies, the learners' knowledge development in relation to the object of learning and the development of teachers' skills are intertwined. Thus, one common approach to evaluating these studies is to focus on the teachers' experiences. Another approach, which is used in this study, is to assess the knowledge development of students participating in the designed lessons. When students' knowledge development is in focus, the form of assessment is chosen based on the participants' age, the design, the aim of the project and the size of the group. Observations (e.g. Kullberg et al., 2017), interviews or oral assessments (e.g. Björklund et al., 2021; Ljung-Djärf et al., 2014), pre- and post-tests (e.g. Adulyasas and Abdul Rahman, 2014) or a mix of some of these methods (e.g. Wester, 2021; Ryberg, 2018; Driver et al., 2015) are chosen to assess the knowledge development of learners.

In evaluations of the students' knowledge at the group level, the focus is on establishing a measure of the outcome. In these cases, the students' knowledge is assessed through oral or written pre- and post-tests and is mainly analyzed quantitatively. The validity of the research outcomes is therefore partly dependent on the validity of interpretation and use of the test (Bass et al., 2016; Newton and Shaw, 2014). However, as will be shown next in this article, the validity and reliability of the tests are rarely discussed when written tests are implemented and almost never mentioned when oral assessment is applied.

We carried out a review of published articles on research involving the implementation of knowledge tests in the field of lesson and learning studies over the last 10 years (2013–2023). We searched for articles in one central journal in the field: International Journal for Lesson and Learning Studies. This is not the sole journal publishing articles about lesson and learning studies but can be expected to reflect the current state of the field in terms of how lesson and learning studies have been carried out. To complement this sample of articles and find a larger proportion of the lesson or learning studies carried out within the fields of mathematics and natural science, in which our project is situated, we searched the broad database Scopus (Elsevier's abstract and citation database).

Articles that used instruments such as surveys or attitude questionnaires were excluded, as were those that mentioned using a knowledge test in their study but did not present the test results in that specific paper. This filtering of identified articles led us to 53 articles that used a knowledge test to evaluate the development of student knowledge in the content area affected by the study design. The assessments of student knowledge were, in most cases, implemented as written pre- and post-tests and, in rare cases, through interviews and essays.

In over 60% of included articles, there was no information about the process of test development. In those where the test development process was mentioned, the information was often limited to mentioning that the test was developed for the aim of the current study or that the test items were borrowed from other sources. More than 60% of the articles presented some sort of statistical measure of the test results (such as the average test results), but less than 20% presented the test itself or a selection of test items. The concept of validity regarding the test was mentioned in only 10% of the articles and even then, the information was mostly limited to explaining that the validity was examined by experts or that the items were taken from standard tests. In some rare cases, both the process of test development and the concept of test validity were mentioned (Ryberg, 2018; Lewis and Perry, 2017).

This review of articles presenting the knowledge outcome assessed by tests strongly suggests that very little attention has been given to the process of test development and validity of interpretation and use of tests in the context of lesson and learning study. This raises test development and validation as an important issue to address. The present article explores the procedure and methods we implemented when developing two tests for a learning study research project to achieve an adequate validity discussion with the available resources.

As mentioned earlier, the development of tests described in this article was conducted within the frame of a larger research project. The aim of the learning study project was to test lesson designs developed in a pilot phase for two different objects of learning, developing and using knowledge tests as a primary measure of student knowledge before and after teaching. During the pilot phase, lesson designs and knowledge tests were developed in sync with the cycles of a learning study (Kullberg et al., 2024). This article explores the process of knowledge test development.

In the learning study, one object of learning within mathematics and one within science were in focus. Both objects of learning are appropriate for the objectives specified for Grade 6 in the Swedish compulsory school curriculum and the empirical work was also in a school context involving students in this grade, i.e. approximately 12 years old. Within mathematics, the object of learning was addition and subtraction with negative numbers. Within science, the object of learning was the microscopic constitution of matter (the particle model) in relation to macroscopic properties. In both cases, the developed lessons (total 110–120 min) may be understood as an introduction, which could be followed by further lessons detailing further aspects of the respective object of learning. Two knowledge tests were developed in parallel with developing the teaching design during the learning study in the project pilot phase. These tests were later used in the main phase of the project as one of the tools for assessment.

This study was financed by the Swedish Institute for Educational Research and followed the institute's policy for ethical considerations in collaborative educational research. The reported study used test results from students and interviews with some students regarding the test. The collection of these data did not affect the participants physically or psychologically, and no sensitive information regarding the participants was collected. Collected data were anonymized before being subject to analysis.

In the work reported here, we draw on the extensive literature on test development and, in particular, on structures for development and validation processes for small-scale uses. The method and analysis should be understood in the context of, and in relation to, the function of the knowledge test in the learning study project. A triangulation of methods is implemented to support discussions about the validity of interpretation and use of the tests developed for the project. This includes subject-matter expert discussions, statistical analysis of the test results and analysis of think-aloud interviews.

The knowledge structure of a specific individual can only be measured or interpreted indirectly. This means that knowledge tests belong to the category of psychometric testing in which the attributes are of latent character: they are not directly measurable but are interpreted through sampling (Crocker and Algina, 2008). The main challenge of constructing knowledge tests is that, even in best-case scenarios where the items have high certainty and reliability, test items are limited samples of the targeted construct (Koretz, 2008). The uncertainty of test item structure influences the generalizability, reliability and accuracy of test use (Kane and Wools, 2020).

Meeting the standards of testing requires resources in terms of subject-matter and test experts, to qualitatively evaluate test items and quantitatively evaluate test and item statistics. These resources are limited in small-scale studies such as learning study. One method is borrowing test items from other standard tests, which can increase the validity of interpretation of the items, but the validity of test use still needs to be evaluated. Another method is to construct test items specifically for the designed study, which increases the relevance, or the validity of use, of the test, while the investigation of the validity of interpretation of each item might be insufficient. Bearing in mind that test items are small and delimited samples of the construct, and considering the impact of test results on the conclusions drawn about the teaching design in learning study as a research domain, the validity of interpretation and use of tests is of major importance.

In classroom assessments aiming to grade students or provide feedback for development, the measurement perspective obtained from test results is often complemented with a functional perspective, such as student-teacher interactions and students' performance during activities. The functional perspective, even though far from standardized assessment, plays a dominant role in assessing students' achievement at the classroom level (Kane and Wools, 2020). But when tests are used to evaluate the effects of interventions during research, the functional perspective often fades in favor of knowledge tests and the statistical presentation of test results. Using locally constructed knowledge tests as the only form of knowledge assessment is not ideal. But when the research design implies the obligation of using a test as the main assessment method to evaluate the result of the learning study, basic principles of quantitative and qualitative criteria are applicable even on small scales and can provide some insight into item and test accuracy (Kane and Wools, 2020).

Test development includes defining the construct, generating items that reflect the content, deciding the format and structure of the test, searching for evidence of validity and deciding on the measurement models that can be applied to the test (Irwing and Wiley, 2018; Downing, 2006; Wilson, 2005). Knowledge tests that are developed and used for the aims of classroom studies are usually implemented once before and once after the experimental design, as pre- and post-tests. Possibilities for evaluating the test in advance, through pilot test-takers or interviews with potential test-takers, are limited due to constraints of resources and time. The great opportunity that learning study offers to the process of test development is embedded in learning study principles. In a learning study, the teaching strategies are evaluated and refined iteratively based on information gathered on students' knowledge development and teacher discussions (Kullberg et al., 2024). The test and test items can be developed in parallel as a part of this process. In practice, many researchers refine the tests during this process as a natural and essential part of the evolving learning study design, but in most cases this part is not documented or reported.

The test development model used in our study is an iterative model aligned with the iterative cycles in a learning study. During the process, test items were discussed and designed in a subject-matter expert (SME) panel, tested with a group of test-takers for item statistics and examined through some think-aloud interviews for response processes (Downing, 2006). We followed Kane's (2013) validity framework, which defines validity of interpretation and use of a test as the degree of plausibility of the claims of the test, established by analyzing evidence that supports those claims. For this to be possible, the primary requirement is to have "a clear statement of the claims inherent in the proposed interpretations and uses of the test scores" (Kane, 2013). The object of learning and the critical aspects that are used in learning study as input from variation theory can be illuminating in defining the claim related to the knowledge test. The next step is seeking evidence for validity, in which we insistently looked "to find out what might be wrong", because a degree of validity is achieved only "when it has survived serious attempts to falsify it" (Cronbach, 1988).

Thus, we focused on some of the claims of the test that assess the object of learning (interpretation) and are aligned with the claims of the learning study (use). We then gathered the evidence for these claims by pilot-testing and think-aloud interviews. When analyzing the evidence, we made efforts to detect threats to validity (see Kane, 2013). Those threats became apparent when quantitative and qualitative approaches were compared. The pieces of evidence described here are some of those suggested in Standards for Educational and Psychological Testing, including content, internal structure, response processes and relation to other variables (American Educational Research Association, 2014): content is discussed in SME meetings; internal structure includes the statistical analysis of the results from pilot test-takers; response processes can be followed through think-aloud interviews; and relation to other variables includes teachers' perceptions of students' knowledge or classroom observations. This kind of evidence also aligns closely with the requirement in the learning study to gain insights into the students' knowledge regarding the object of learning throughout the process, to help in identifying weak parts of the lessons as well as less powerful examples or explanations used by the teacher.

It is important to note that test items were not constructed to reflect individual critical aspects; rather, the tests were developed to cover the content, which in learning study is defined as the object of learning. We carefully discussed all test items to ensure that they related to the object of learning and that discerning critical aspects, individually or in combination, was required to answer the items correctly.

The qualitative aspects were based on SME discussions and critically analyzed think-aloud interviews with potential test-takers. A think-aloud interview is a one-to-one dialogue between the researcher and a participant about a specific task. The aim is to identify the cognitive processes, thoughts and strategies that the participant experiences when completing the task (Leighton, 2017). The researcher asks the participant to explain, in as much detail as possible, the thoughts that come to mind when interpreting and responding to the items. When developing test items, this method can help test developers ensure that test-takers understand the items as intended and use the desired strategies when responding to them.

The quantitative evaluation was based on item statistics. The most basic indexes in classical test theory used for this evaluation are the reliability coefficient, item difficulty, item discrimination and standard deviation (Downing, 2006). These indexes can be informative even when the data set is small; they are simple to generate, and it is easy to argue whether they fall within an acceptable range.

The overall reliability coefficient (Cronbach's alpha) refers to the degree of intercorrelation among test items, and the impact of each item on the overall reliability is measured by "Cronbach's alpha if item dropped". If the reliability increases when an item is dropped, that item negatively affects the test reliability and should be carefully considered and possibly eliminated from the test.

Item difficulty, for one-point items with one correct answer, is the same as the mean of each item, i.e. the proportion of students who have answered the item correctly. This indicator is important to ensure that the test has various levels of difficulty and is neither too easy nor too difficult for the target group. Besides the overall difficulty of the test, items that are unexpectedly easy or difficult should be qualitatively examined (Livingston, 2006). Item discrimination is the tendency of an item to be answered correctly by test-takers with higher expertise in the content knowledge (Livingston, 2006). Easy items with low discrimination might be desirable to ensure that a specific knowledge level is achieved by the group (Koretz, 2008). However, a difficult item with a low or negative discrimination index means that students with lower performance on the test have performed relatively well on the item; such items should be examined closely.
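To make these indexes concrete, the following is a minimal sketch of how they can be computed from a dichotomous (0/1) response matrix. The study's own analysis was done in R; this Python version, the function names and the choice of the corrected item-total (item-rest) correlation as the discrimination index are our illustrative assumptions, not taken from the article.

```python
# Minimal sketch of the classical test theory indexes discussed above,
# for an (n_students x n_items) matrix of 0/1 scores. Illustrative only.
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

def item_statistics(scores):
    """Per item: difficulty (mean), SD, discrimination (item-rest correlation)
    and Cronbach's alpha if the item is dropped."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    stats = []
    for i in range(scores.shape[1]):
        item = scores[:, i]
        rest = totals - item  # total score excluding this item
        stats.append({
            "difficulty": item.mean(),  # proportion answering correctly
            "sd": item.std(ddof=1),
            "discrimination": np.corrcoef(item, rest)[0, 1],
            "alpha_if_dropped": cronbach_alpha(np.delete(scores, i, axis=1)),
        })
    return stats
```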

The quantitative analysis of the tests for the two objects of learning is based on 75 students in Grade 6 for the test on the constitution of matter and 87 students for the test on addition and subtraction of negative numbers. All students participated in the learning study. Results from participants who took only one of the tests (before or after the lessons) or missed one of the lessons were excluded. The data were analyzed using R software. All items were coded as dichotomous variables with correct (1) or incorrect (0) answers. Missing responses were coded as incorrect.
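As a hedged illustration of this coding rule (the authors' actual R code is not published), the snippet below dichotomizes a small response table, treats missing responses as incorrect and feeds the result to the functions sketched above. The column names are illustrative; two are borrowed from Table 2.

```python
# Illustration of the coding rule: 1 = correct, 0 = incorrect, missing -> 0.
import pandas as pd

raw = pd.DataFrame({
    "Q1":        [1, 1, 0, 1, 1],       # hypothetical additional item
    "Q7a_alt.4": [1, None, 0, 1, 0],    # None marks a missing response
    "Q7b_alt.6": [None, 1, 1, 1, 0],
})
coded = raw.fillna(0).astype(int)       # missing responses coded as incorrect

# Uses cronbach_alpha and item_statistics from the sketch above.
print(cronbach_alpha(coded.to_numpy()))
print(item_statistics(coded.to_numpy()))
```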

The SME panel consisted of three researchers engaged in the project (including the researchers responsible for developing the teaching design and for conducting the main study in the learning study) and was supervised by researchers independent of the project but knowledgeable in test development in science and mathematics.

Think-aloud interviews were conducted with four students on the test on the constitution of matter and four on the test on addition and subtraction of negative numbers, all in sixth grade. The interviewees did not participate in lessons on these topics in connection to the project. The aim of these interviews in this phase of the project was to map the interviewees’ process of interpreting the items and not to evaluate their knowledge.

In this section, we report on some of the challenges that we faced when developing the tests. The data were collected during three cycles for each test, and the tests were revised after each cycle. After discussion in the SME panel, some of the statistical data and qualitative feedback was found useful for revising and reconstructing some test items; in some cases, an item was removed entirely or replaced with a new one. The main arguments that led to reconstructing items were content- and construct-related issues, and the reasons for removing some items related to insufficient item reliability. However, some items were retained although there was a mismatch between the quantitative and qualitative data. We exemplify this challenge with two of these items, one from each test, and present our argument for why the items remain in the tests, with caution. The presented item from the test on the constitution of matter was interpreted and answered correctly by one test-taker without employing the required knowledge, and the item from the test on addition and subtraction of negative numbers was similarly answered correctly but using an incorrect argument. It is important to mention that these two patterns of answering were exceptions and were not observed in other interviews. The conclusion is that these two items, according to the other interviews and the statistical results, achieve an acceptable quality level.

The test on the constitution of matter has a total score of 19, where 12 points are selected response (multiple choice) and one task is constructed response (open-ended), graded with up to seven individual points. In the statistical analysis, each point was treated as a dichotomous item, both for the selected and the constructed response. The total reliability (Cronbach's alpha) is 0.805 (measured for the post-test with N = 75, SD = 0.20). There is no item that can be dropped to increase the reliability of the test. The paired-sample t-test shows a statistically significant increase from pre- to post-test (p < 0.05), indicating that the test was aligned with the content of the lessons.

The test on addition and subtraction of negative numbers consists of 33 items, 1 point each. One item was eliminated after two revisions in Cycle 1 and Cycle 2; it showed poor discrimination and influenced the reliability index negatively. The total reliability (Cronbach's alpha) for the test is 0.88 (measured for the post-test with N = 87, SD = 0.21), and none of the remaining 33 items influenced this index negatively. The paired-sample t-test shows a statistically significant improvement from pre- to post-test (p < 0.01).
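The article reports only the resulting p-values; as a minimal sketch of the comparison described above, a paired-sample t-test on total scores can be run as follows. The data here are simulated around the Table 1 means for the negative-numbers test, since the study's raw data are not published.

```python
# Hedged sketch of the pre/post comparison: paired-sample t-test on total scores.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
# Simulated totals for 87 students, roughly matching Table 1 (means 10.2 and 14.7 of 33):
pre_totals = rng.binomial(n=33, p=10.2 / 33, size=87)
post_totals = rng.binomial(n=33, p=14.7 / 33, size=87)

t_stat, p_value = ttest_rel(post_totals, pre_totals)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```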

Some data and statistical analysis for the two tests are presented in Table 1.

Table 1

Statistical analysis for the two tests

| Content | Total points (0/1) | Number of test-takers | Mean pre-test | Mean post-test | Cronbach's alpha | SD |
|---|---|---|---|---|---|---|
| Matter | 19 | 75 | 7.2 | 11.2 | 0.805 | 0.20 |
| Neg-numbers | 33 | 87 | 10.2 | 14.7 | 0.880 | 0.21 |

Knowledge test on the microscopic constitution of matter

In developing the test on the microscopic constitution of matter, one main challenge was creating the multiple-choice items. Due to time limitations, both for participants taking the test and for teachers/researchers assessing the results, and to ensure equal treatment and prevent differing interpretations in scoring, we decided to make most of the items multiple choice. The challenge was to phrase the alternatives in ways that do not provide clues to the correct answer. However, think-aloud interviews revealed how a test-taker could succeed in excluding incorrect statements and narrowing the options down to the correct choices using arguments that did not indicate the required knowledge.

All matter consists of atoms of various kinds. Pelle sets a piece of wood on fire. What happens to the atoms in the piece of wood? Pick one or two correct options.

  • □ All the atoms in the piece of wood burn up and disappear

  • □ Some atoms in the piece of wood burn up and disappear

  • □ All the atoms in the piece of wood are now the ash

  • □ Some atoms in the piece of wood are now the ash

  • □ All the atoms in the piece of wood are dispersed into the air

  • □ Some atoms in the piece of wood are now part of the air

  • □ All the atoms in the piece of wood are part of the ash

We demonstrate this problem with an example item (Item 7, which has two correct answers out of seven alternatives, where Q7a_alt.4 refers to the first correct alternative and Q7b_alt.6 to the second). The statistical analysis (Table 2) indicates good discrimination. The changes in difficulty of Items Q7a and Q7b are not statistically significant (p > 0.05). The other important statistical index used to evaluate items quantitatively was "Cronbach's alpha if item dropped". While Cronbach's alpha for the test was 0.805, Cronbach's alpha if Items Q7a or Q7b were dropped showed a decrease, meaning that these items contributed positively to test reliability. Although the statistical analysis for the final version of the item fulfilled the validity criteria, a think-aloud interview with one participant illuminated a different perspective on the uncertainty of item validity.

Table 2

Statistical analysis for Item 7

| Item | Mean pre | Mean post | SD | Disc. pre | Disc. post | Cronbach's alpha if item dropped |
|---|---|---|---|---|---|---|
| Q7a_alt.4 | 0.25 | 0.48 | 0.50 | 0.45 | 0.75 | 0.802 |
| Q7b_alt.6 | 0.26 | 0.32 | 0.48 | 0.70 | 0.56 | 0.803 |
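The article does not name the significance test behind the pre/post difficulty comparison above; for paired dichotomous item responses, McNemar's test is one standard choice, sketched here with illustrative data.

```python
# Hypothetical sketch: McNemar's test for a change in item difficulty between
# pre- and post-test, using paired 0/1 responses from the same students.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

pre = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])   # illustrative scores on one item
post = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 0])  # same students after the lessons

# 2x2 table of pre/post agreement and disagreement counts
table = np.array([
    [np.sum((pre == 0) & (post == 0)), np.sum((pre == 0) & (post == 1))],
    [np.sum((pre == 1) & (post == 0)), np.sum((pre == 1) & (post == 1))],
])
print(mcnemar(table, exact=True).pvalue)
```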

The participant, here called Memo, a twelve-year-old boy in Grade 6 with low performance on the test on the constitution of matter, gave the correct answer to Item 7, which, according to the statistics, is a difficult item with good discrimination (Table 2). He mentions at the beginning of the interview that he does not remember the difference between an atom and a molecule and asks the interviewer: "I know one of them is bigger, but which one?".

When answering Item 7, Memo does not rely only on his knowledge about matter; he also evaluates the choices by comparing them linguistically to one another. He first reads all the choices very carefully. He then explains: "I would rather not choose those alternatives that say 'All atoms', because it is too certain" (Memo, Grade 6, think-aloud interview, 2022). Following this argument, he had three statements left out of seven to choose between. He continues: "I choose alternative two because some of the atoms are the smoke now because we see the smoke … but wait, there is another option, part of the air, hmmm, they do not disappear you can still see them … hmm, this alternative makes more sense. I choose those two [four and six, the correct answers]" (Memo, Grade 6, think-aloud interview, 2022). Memo compares and discusses alternatives without mentioning any knowledge about the structure of atoms. This does not mean that his reasoning about the alternatives is wrong, but his way of narrowing the options down to the correct alternatives was not predictable by the SME panel.

Knowledge test on addition and subtraction of negative numbers

In this section, two items from the test on negative numbers are presented. According to the SME discussions, the items are similar and intended to assess the same knowledge at similar difficulty levels. The uncertainty about how these items were interpreted was increased by the statistical analysis and think-aloud interviews.

Item 2i: (−4) – (−1) = ?

Item 2k: (−6) – (−8) = ?

According to the SME discussions, both Items 2i and 2k require good knowledge of the subtraction of negative numbers to be answered correctly. The SMEs believed that subtracting a negative number from another negative number is a relatively difficult task and that these two items have the potential to discriminate test-takers with good content knowledge. But a think-aloud interview with one of the test-takers, Jona, 11 years old, showed how an incorrect argument might lead to a correct answer. Jona solved Item 2i in his own way:

There are a lot of minus signs here, so I just simply neglect them for now. Then I will have 4 and 1. And it is a minus in between, so I write 3. Now I can put back the minus sign that I removed earlier, which makes the answer −3 (Jona, Grade 6, think-aloud interview, 2022)

While this argument works for Item 2i, it does not for Item 2k: (−6) – (−8). Jona concludes, with the same argument, that the answer here is −2, which is wrong in this case. Jona's reasoning on Item 2i made us more observant of this item and made it an interesting case to discuss. The statistical analysis was also complicated to evaluate. Table 3 shows some of the statistical criteria for both items. Neither item strongly influenced Cronbach's alpha for the test (Cronbach's alpha for the test: 0.88; if Item 2i dropped: 0.88; if Item 2k dropped: 0.87), and the discrimination criteria for both items are in the acceptable range (Table 3). Some observations from the statistical analysis are:

Table 3

Statistical analysis for Items 2i and 2k in test on negative numbers

| Item | Mean pre | Mean post | SD | Disc. pre | Disc. post | Cronbach's alpha if item dropped |
|---|---|---|---|---|---|---|
| 2i: (−4) – (−1) | 0.48 | 0.49 | 0.50 | 0.30 | 0.26 | 0.88 |
| 2k: (−6) – (−8) | 0.08 | 0.26 | 0.44 | 0.35 | 0.52 | 0.87 |
  1. Item 2i is easier than Item 2k for test-takers to solve in both pre- and post-tests.

  2. Item 2k shows a statistically significant improvement from pre- to post-test (p < 0.01), while Item 2i remains unchanged.

  3. The discrimination index of Item 2k increases from pre- to post-test, while that of Item 2i decreases.

When facing such challenges, one option might be to eliminate the item. Another option is to make one item conditional on the other, meaning that test-takers get the point for Item 2i only if they have answered Item 2k correctly (a scoring rule sketched below). Another possibility is that the two items emphasize different parts of the construct. This could be an interesting case for further investigation and future research. Being aware of the complexity of the items, we believed both items were essential and worthy of further analysis and therefore decided that they should remain in the test.
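A minimal sketch of this conditional-scoring option, assuming dichotomous item scores in a data frame (the column names and data are illustrative):

```python
# Conditional scoring: the point for Item 2i counts only if Item 2k is correct.
import pandas as pd

scores = pd.DataFrame({"item_2i": [1, 1, 0, 1], "item_2k": [1, 0, 0, 1]})
scores["item_2i_conditional"] = scores["item_2i"] * scores["item_2k"]
print(scores)
```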

In summary, during the cycles of the learning study and in parallel with developing teaching strategies, test items were revised and refined. The results warrant a primary claim: that the tests in their final form can assess knowledge with respect to the two objects of learning, which is supported by evidence that includes content, internal structure and response processes. Another claim, which is important for what is claimed in the learning study as a whole, is that the tests at the group level can monitor the knowledge development resulting from the lesson design, which is examined through statistical analysis comparing means from pre- to post-test (p < 0.05 for both tests).

The results also contribute to discussions about validity of content, internal structure and response processes in the context of classroom studies, based on analysis of evidence gathered from different sources, such as pilot-testing and think-aloud interviews. In those cases where an item influenced the reliability negatively or the response process was not aligned with the intended purpose of the item, we revised or removed the item. Through this cyclical process of test development and validity discussion, we would argue that the tests achieved adequate validity of interpretation and use (cf. Kane, 2013), which is the degree to which the evidence supports the claims of interpretation and use of the test.

This article has pointed to the key role of assessment in lesson and learning study for the validity of the whole study, given the purpose of improving student learning. Consistent attention to assessment needs to be given in a way commensurable with the lesson and learning study tradition, yet comprehensive and systematic, to support the overall claims in these studies. Further, to fulfill this role, assessment must make it possible to put forward a validated claim about what the group of students knows about the object of learning or the focus of the lesson, and about possible differences before and after teaching that may be attributable to what has been taught in the lesson(s). As this article proposes, in addition to drawing on the research and professional expertise involved in lesson and learning studies, it is sensible to draw on the methods developed in educational assessment more broadly.

Assessment in a lesson and learning study differs from, on the one hand, normal classroom assessment and, on the other hand, large-scale educational assessments. The four primary differences from the latter concern:

  1. the statistical constraints due to a limited number of students taking the test as well as being available for piloting;

  2. the well-delimited content focus of the test and the purpose of the test being to measure changes in student knowledge due to teaching delimited to one or a few lessons;

  3. the iterative process of development, aligned with the lesson or learning study's cyclical development process, and the input from teachers that is inherently involved in that process;

  4. the dialogical dependency among the test construction, the test outcome and the teaching design.

Validating the interpretation and use of a test requires various resources and can be challenging in small-scale classroom studies. This can be one reason why many similar studies operating on a small or intermediate scale do not discuss test validity. Compromises are therefore necessary, and different roads may reasonably be taken in different situations. Ryberg (2018) made use of test items from an existing test and adapted some of the items to the context of his study; the test results in his work played a supporting role, while the main analysis was based on observations and analysis of students' reasoning. Lewis and Perry (2017) chose to mention the source of items and Cronbach's alpha for tests of educators' knowledge, and presented examples of items and their original sources for the knowledge test taken by students. Our literature review of different learning studies leads us to the conclusion that using items from standard tests is a common approach to ensuring the quality of the test. This approach has the advantage of ensuring the quality of each individual item's structure and its relationship to the object of learning, but limits the researchers in various ways. There might not be enough standard items for the chosen object of learning, and it might be challenging to examine the compatibility of items with the context and culture surrounding the research.

Constructing a test with high accuracy that is completely aligned with the purpose of the assessment, with high certainty in item interpretation and no bias or other threats to validity, is unlikely to be achieved within the limitations that a learning study implies. The measurement standards for classroom assessments are not always practical to follow for those who locally construct or use knowledge tests (Ferrara et al., 2020). Even if these standards are translated and followed by subject experts, quantitative means of assessment in the classroom are insufficient to capture different aspects of learning. Assessment in a classroom should involve collecting information from different sources, covering different forms of knowledge or skills (Kane and Wools, 2020). Kane and Wools (2020) suggest that the measurement perspective in classroom assessment, which includes the test result, should play a complementary role to the functional perspective, which is mainly achieved through student-teacher interactions. The principles of classroom assessment as such, with compromises, can be used in classroom-scale research.

The final quantitative analysis of the items indicates that we have both appropriate focus on and coverage of the object of learning and appropriate item difficulty across a wide range from easy to difficult, and that we have items that enable proper discrimination between test-takers with different levels of competence. The effort to validate the interpretation and use of the tests has, we believe, led to an acceptable degree of validity for the tests' specific use, but also resulted in rethinking the tools for assessment for the main phase of the project in order to increase the validity of its claims. In a learning study, the object of learning is defined by its critical aspects, and how well these aspects are discerned is the main interest of researchers. This purpose can plausibly be achieved by qualitative forms of assessment. Accordingly, in the main phase of the project, we supplemented the test results with a functional perspective (Kane and Wools, 2020) by gathering data from think-aloud interviews, classroom observation and feedback from class teachers. These qualitative sources of assessment form the basis for the functional perspective and are supported by the test results as the measurement perspective.

Lesson study as well as learning study are research approaches in which researchers and teachers collaborate and concern issues that are investigated "with teachers, rather than on teachers" (Runesson Kempe, 2019). In the iterative cycles of a learning study, teachers can contribute their ongoing assessment of learners while discussing and developing the teaching design with researchers. These research designs provide a ground for the functional perspective to be included in a final evaluation of possible knowledge achievement. Integrating observations, interviews, teachers' informal assessments and test results contributes to a firmer assessment of student knowledge development and thus to establishing the validity of the research results.

In cases where the learning outcome is mainly assessed by knowledge tests, the purpose of the test must be understood as closely aligned with the learning goal, i.e. the object of learning. Thus, addressing the validity of a knowledge test focused on the object of learning, in accordance with what has been described in this article, is clearly warranted. Tests developed locally by educators and researchers have the advantage of being a better match for their specific use but might be challenging to validate and need to be supported by other means of assessment (Kane and Wools, 2020).

As we demonstrate in this paper, there are recommended paths through the test development process that increase the degree of validity of use of knowledge tests and, as a result, lead to increased validity of the claims of the learning study. First, it is recommended to discuss test items in a team of SMEs. Second, even with limited resources available, a few think-aloud interviews with potential test-takers contribute to the validity argument. Third, ensuring a range of difficulty across test items supports further analysis of the test results and, accordingly, qualitative insight into the effects of the teaching design on students' learning. Finally, presenting the test, or some of the test items, when reporting results eases communication with researchers in the domain.

An ambition of this paper has been to increase attention to the consistency and fragility of assessment instruments in the learning and lesson study community and to provide an example of a test validation discussion. We are aware that not every article that is published will be able to describe and discuss the entire process of developing and validating the interpretation and use of tests. We are also aware that some studies will use previously developed tests or borrow items from standard tests. However, when test results are a ground for discussing the effects of a lesson design on learning, giving some description of the process of test development and validation, including accompanying limitations, contributes to a fair, reliable and valid description of the research results.

The authors would like to thank Ulf Ryberg, Jenny Svanteson Wester and the participating teachers for their contributions to the study.

References

Adamson, B. and Walker, E. (2011), "Messy collaboration: learning from a learning study", Teaching and Teacher Education, Vol. 27 No. 1, pp. 29-36.

Adulyasas, L. and Abdul Rahman, S. (2014), "Lesson study incorporating phase-based instruction using Geometer's Sketchpad and its effects on Thai students' geometric thinking", International Journal for Lesson and Learning Studies, Vol. 3 No. 3, pp. 252-271.

American Educational Research Association, American Psychological Association and National Council on Measurement in Education (2014), Standards for Educational and Psychological Testing, American Educational Research Association, Washington, DC.

Bass, K.M., Drits-Esser, D. and Stark, L.A. (2016), "A primer for developing measures of science content knowledge for small-scale research and instructional use", CBE-Life Sciences Education, Vol. 15 No. 2, rm2.

Björklund, C., Marton, F. and Kullberg, A. (2021), "What is to be learnt? Critical aspects of elementary arithmetic skills", Educational Studies in Mathematics, Vol. 107 No. 2, pp. 261-284.

Brown, A.L. (1992), "Design experiments: theoretical and methodological challenges in creating complex interventions in classroom settings", The Journal of the Learning Sciences, Vol. 2 No. 2, pp. 141-178.

Collins, A. (1992), "Toward a design science of education", in Scanlon, E. and O'Shea, T. (Eds), New Directions in Educational Technology, Berlin, Heidelberg, pp. 15-22.

Crocker, L.M. and Algina, J. (2008), Introduction to Classical and Modern Test Theory, Cengage Learning, Mason, OH.

Cronbach, L.J. (1988), "Five perspectives on validity argument", in Wainer, H. and Braun, H. (Eds), Test Validity, Erlbaum, Hillsdale, NJ, pp. 3-17.

Downing, S.M. (2006), "Twelve steps for effective test development", in Downing, S. and Haladyna, T. (Eds), Handbook of Test Development, L. Erlbaum, Mahwah, NJ.

Driver, A., Elliott, K. and Wilson, A. (2015), "Variation theory based approaches to teaching subject-specific vocabulary within differing practical subjects", International Journal for Lesson and Learning Studies, Vol. 4 No. 1, pp. 72-90.

Elliott, J. (1991), Action Research for Educational Change, Open University Press, Milton Keynes.

Ferrara, S., Maxey-Moore, K. and Brookhart, S.M. (2020), "Guidance in the standards for classroom assessment: useful or irrelevant?", in Brookhart, S.M. and McMillan, J.H. (Eds), Classroom Assessment and Educational Measurement, 1st ed., Routledge, pp. 97-119.

Irwing, F.P. and Wiley, I. (2018), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test Development, John Wiley & Sons, Hoboken, NJ.

Kane, M.T. (2013), "Validating the interpretations and uses of test scores", Journal of Educational Measurement, Vol. 50 No. 1, pp. 1-73.

Kane, M.T. and Wools, S. (2020), "Perspectives on the validity of classroom assessments", in Brookhart, S.M. and McMillan, J.H. (Eds), Classroom Assessment and Educational Measurement, 1st ed., Routledge, pp. 11-26.

Koretz, D.M. (2008), Measuring Up: What Educational Testing Really Tells Us, Harvard University Press, Cambridge, MA.

Kullberg, A. (2010), What is Taught and What is Learned, Gothenburg Studies in Educational Sciences 293, Acta Universitatis Gothoburgensis, Gothenburg.

Kullberg, A., Runesson Kempe, U. and Marton, F. (2017), "What is made possible to learn when using the variation theory of learning in teaching mathematics?", ZDM, Vol. 49 No. 4, pp. 559-569.

Kullberg, A., Ingerman, Å. and Marton, F. (2024), Planning and Analyzing Teaching: Using the Variation Theory of Learning, Routledge, Oxford.

Leighton, J.P. (2017), Using Think-Aloud Interviews and Cognitive Labs in Educational Research, Oxford University Press, New York.

Lewis, C. and Perry, R. (2017), "Lesson study to scale up research-based knowledge: a randomized, controlled trial of fractions learning", Journal for Research in Mathematics Education, Vol. 48 No. 3, pp. 261-299.

Lewis, C., Perry, R. and Murata, A. (2006), "How should research contribute to instructional improvement? The case of lesson study", Educational Researcher, Vol. 35 No. 3, pp. 3-14.

Livingston, S. (2006), "Item analysis", in Downing, S. and Haladyna, T. (Eds), Handbook of Test Development, L. Erlbaum, Mahwah, NJ.

Ljung-Djärf, A., Magnusson, A. and Peterson, S. (2014), "From doing to learning: changed focus during a pre-school learning study project on organic decomposition", International Journal of Science Education, Vol. 36 No. 4, pp. 659-676.

Lo, M.L. (2012), Variation Theory and the Improvement of Teaching and Learning, Gothenburg Studies in Educational Sciences 323, Acta Universitatis Gothoburgensis, Gothenburg.

Marton, F. and Pang, M.-F. (2006), "On some necessary conditions of learning", The Journal of the Learning Sciences, Vol. 15 No. 2, pp. 193-220.

Marton, F., Tsui, A.B.M., Chik, P.P.M., Ko, P.Y. and Lo, M.L. (2004), Classroom Discourse and the Space of Learning, Routledge, Mahwah, NJ.

Maunula, T. (2018), Students' and Teachers' Jointly Constituted Learning Opportunities: The Case of Linear Equations, Göteborg Studies in Educational Sciences 410, Acta Universitatis Gothoburgensis, Gothenburg.

Newton, P.E. and Shaw, S.D. (2014), Validity in Educational and Psychological Assessment, SAGE, Los Angeles.

Nuthall, G. (2004), "Relating classroom teaching to student learning: a critical analysis of why research has failed to bridge the theory-practice gap", Harvard Educational Review, Vol. 74 No. 3, pp. 273-306.

Osbeck, C., Claesson, S. and Ingerman, Å. (2018), Didactic Classroom Studies: A Potential Research Direction, Nordic Academic Press/Kriterium, Lund.

Pong, W.Y., Chik, P.M.P. and Lo, M.L. (2005), For Each and Everyone: Catering for Individual Differences through Learning Studies, Hong Kong University Press, Hong Kong.

Runesson, U. (2007), "A collective enquiry into critical aspects of teaching the concept of angles", Nordic Studies in Mathematics Education, Vol. 12 No. 4, p. 7.

Runesson Kempe, U. (2019), "Teachers and researchers in collaboration. A possibility to overcome the research-practice gap?", European Journal of Education, Vol. 54 No. 2, pp. 250-260.

Ryberg, U. (2018), "Generating different lesson designs and analyzing their effects: the impact of representations when discerning aspects of the derivative", The Journal of Mathematical Behavior, Vol. 51, pp. 1-14.

Stigler, J.W. and Hiebert, J. (1999), The Teaching Gap: Best Ideas from the World's Teachers for Improving Education in the Classroom, Free Press, New York.

Vikström, A. (2014), "What makes the difference? Teachers explore what must be taught and what must be learned in order to understand the particulate character of matter", Journal of Science Teacher Education, Vol. 25 No. 6, pp. 709-727.

Wester, J.S. (2021), "Students' possibilities to learn from group discussions integrated in whole-class teaching in mathematics", Scandinavian Journal of Educational Research, Vol. 65 No. 6, pp. 1020-1036.

Wilson, M. (2005), Constructing Measures: An Item Response Modeling Approach, Lawrence Erlbaum Associates, Mahwah, NJ.

Yoshida, M. (1999), "Lesson study: a case study of a Japanese approach to improving instruction through school-based teacher development", PhD thesis, The University of Chicago, Chicago, IL.
