- Test Score Equating Using a Mini-Version Anchor and a Midi Anchor: A Case Study Using SAT® Data
by Jinghua Liu
This study explores an anchor that is different from the traditional miniature anchor in test score equating. In contrast to a traditional "mini" anchor that has the same spread of item difficulties as the tests to be equated, the studied anchor, referred to as a "midi" anchor (Sinharay & Holland), has a smaller spread of item difficulties than the tests to be equated. Both anchors were administered in an operational SAT administration and the impact of anchor type on equating was evaluated with respect to systematic error or equating bias. Contradicting the popular belief that the mini anchor is best, the results showed that the mini anchor does not always produce more accurate equating functions than the midi anchor; the midi anchor was found to perform as well as or even better than the mini anchor. Because testing programs usually have more middle difficulty items and few very hard or very easy items, midi external anchors are operationally easier to build. Therefore, the results of our study provide evidence in favor of the midi anchor, the use of which will lead to cost saving with no reduction in equating quality.
- A Paradox in the Study of the Benefits of Test-Item Review
by Wim J. van der Linden
According to a popular belief, test takers should trust their initial instinct and retain their initial responses when they have the opportunity to review test items. More than 80 years of empirical research on item review, however, has contradicted this belief and shown minor but consistently positive score gains for test takers who changed answers they found to be incorrect during review. This study reanalyzed the problem of the benefits of answer changes using item response theory modeling of the probability of an answer change as a function of the test taker's ability level and the properties of items. Our empirical results support the popular belief and reveal substantial losses due to changing initial responses for all ability levels. Both the contradiction of the earlier research and support of the popular belief are explained as a manifestation of Simpson's paradox in statistics.
- Rater Effects on Essay Scoring: A Multilevel Analysis of Severity Drift, Central Tendency, and Rater Experience
by George Leckie
This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14-year-olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters' previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters' scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.
- Observed Score Linear Equating with Covariates
by Kenny Bränberg
This paper examined observed score linear equating in two different data collection designs, the equivalent groups design and the nonequivalent groups design, when information from covariates (i.e., background variables correlated with the test scores) was included. The main purpose of the study was to examine the effect (i.e., bias, variance, and mean squared error) on the estimators of including this additional information. A model for observed score linear equating with covariates first was suggested. As a second step, the model was used in a simulation study to show that the use of covariates such as gender and education can increase the accuracy of an equating by reducing the mean squared error of the estimators. Finally, data from two administrations of the Swedish Scholastic Assessment Test were used to illustrate the use of the model.
- The Random-Effect Generalized Rating Scale Model
by Wen-Chung Wang
Rating scale items have been widely used in educational and psychological tests. These items require people to make subjective judgments, and these subjective judgments usually involve randomness. To account for this randomness, Wang, Wilson, and Shih proposed the random-effect rating scale model in which the threshold parameters are treated as random effects rather than fixed effects. In the present study, the Wang et al. model was further extended to incorporate slope parameters, and the new model was embedded within the framework of multilevel nonlinear mixed-effect models. This was done so that (1) no efforts are needed to derive parameter estimation procedures, and (2) existing computer programs can be applied directly. A brief simulation study was conducted to ascertain parameter recovery using the SAS NLMIXED procedure. An empirical example regarding students' interest in learning science is presented to demonstrate the implications and applications of the new model.
- The Concept of Validity: Revisions, New Directions, and Applications edited by Robert W. Lissitz
by Andrew Maul
- REVIEWER ACKNOWLEDGMENTS