- The Performance of IRT Model Selection Methods With Mixed-Format Tests
by Whittaker, T. A., Chang, W., Dodd, B. G.
When tests consist of multiple-choice and constructed-response items, researchers are confronted with the question of which item response theory (IRT) model combination will appropriately represent the data collected from these mixed-format tests. This simulation study examined the performance of six model selection criteria, including the likelihood ratio test, Akaike’s information criterion (AIC), corrected AIC, Bayesian information criterion, Hannan and Quinn’s information criterion, and consistent AIC, with respect to correct model selection among a set of three competing mixed-format IRT models (i.e., one-parameter logistic/partial credit [1PL/PC], two-parameter logistic/generalized partial credit [2PL/GPC], and three-parameter logistic/generalized partial credit [3PL/GPC]). The criteria were able to correctly select less parameterized IRT models, including the PC, 1PL, and 1PL/PC models. In contrast, the criteria were less able to correctly select more parameterized IRT models, including the GPC, 3PL, and 3PL/GPC models. Implications of the findings and recommendations are discussed.
- A Stochastic Method for Balancing Item Exposure Rates in Computerized Classification Tests
by Huebner, A., Li, Z.
Computerized classification tests (CCTs) classify examinees into categories such as pass/fail, master/nonmaster, and so on. This article proposes the use of stochastic methods from sequential analysis to address item overexposure, a practical concern in operational CCTs. Item overexposure is traditionally dealt with in CCTs by the Sympson-Hetter (SH) method, but this method is unable to restrict the exposure of the most informative items to the desired level. The authors’ new method of stochastic item exposure balance (SIEB) works in conjunction with the SH method and is shown to greatly reduce the number of overexposed items in a pool and improve overall exposure balance while maintaining classification accuracy comparable with using the SH method alone. The method is demonstrated using a simulation study.
- Dynamic Problem Solving: A New Assessment Perspective
by Greiff, S., Wüstenberg, S., Funke, J.
This article addresses two unsolved measurement issues in dynamic problem solving (DPS) research: (a) unsystematic construction of DPS tests making a comparison of results obtained in different studies difficult and (b) use of time-intensive single tasks leading to severe reliability problems. To solve these issues, the MicroDYN approach is presented, which combines (a) the formal framework of linear structural equation models as a systematic way to construct tasks with (b) multiple and independent tasks to increase reliability. Results indicated that the assumed measurement model that comprised three dimensions, information retrieval, model building, and forecasting, fitted the data well (n = 114 students) and could be replicated in another sample (n = 140), showing excellent reliability estimates for all dimensions. Predictive validity of school grades was excellent for model building but nonexistent for the other two MicroDYN dimensions and for an additional measure of DPS. Implications are discussed.
- Improving Item Response Theory Model Calibration by Considering Response Times in Psychological Tests
by Ranger, J., Kuhn, J.-T.
Research findings indicate that response times in personality scales are related to the trait level according to the so-called speed–distance hypothesis. Against this background, Ferrando and Lorenzo-Seva proposed a latent trait model for the responses and response times in a test. The model consists of two components, a standard item response model and a supplemental response time model. The authors demonstrated that the consideration of response time increases the precision of trait estimation. However, the use of the model is not limited to trait estimation. The consideration of response time additionally enhances the calibration of the item response model. Due to limitations of the suggested estimation method for model calibration, Ferrando and Lorenzo-Seva were not able to exploit this additional benefit of response time modeling. In the article, an improved estimation method is proposed that is faster and more efficient than the original approach. It can be shown that the standard deviation of the parameter estimates of the item response model can be reduced by up to 25% when applying the new method.
- Software Note: Using BILOG for Fixed-Anchor Item Calibration
by DeMars, C. E., Jurich, D. P.
- Using the GLIMMIX Procedure in SAS 9.3 to Fit a Standard Dichotomous Rasch and Hierarchical 1-PL IRT Model
by Black, R. A., Butler, S. F.
Although Rasch models have been shown to be a sound methodological approach to develop and validate measures of psychological constructs for more than 50 years, they remain underutilized in psychology and other social sciences. Until recently, one reason for this underutilization was the lack of syntactically simple procedures to fit Rasch and item response theory (IRT) models in general statistical software packages. In this article, the authors demonstrate how to fit the standard dichotomous Rasch model and a dichotomous one-parameter logistic IRT model with nested random effects via the easy-to-use GLIMMIX procedure in SAS 9.3. For comparison purposes, the standard dichotomous Rasch model was also fit using the Rasch specialized software, WINSTEPS 3.68.2. The SAS code used to simulate the data on which the Rasch model was fit is provided to allow replication of estimates. Findings suggest that the GLIMMIX procedure may be a viable option for fitting the standard dichotomous Rasch and dichotomous IRT models.
- An Empirical Evaluation of the Slip Correction in the Four Parameter Logistic Models With Computerized Adaptive Testing
by Yen, Y.-C., Ho, R.-G., Laio, W.-W., Chen, L.-J., Kuo, C.-C.
In a selected response test, aberrant responses such as careless errors and lucky guesses might cause error in ability estimation because these responses do not actually reflect the knowledge that examinees possess. In a computerized adaptive test (CAT), these aberrant responses could further cause serious estimation error due to dynamic item administration. To enhance the robust performance of CAT against aberrant responses, Barton and Lord proposed the four-parameter logistic (4PL) item response theory (IRT) model. However, most studies relevant to the 4PL IRT model were conducted based on simulation experiments. This study attempts to investigate the performance of the 4PL IRT model as a slip-correction mechanism with an empirical experiment. The results showed that the 4PL IRT model could not only reduce the problematic underestimation of the examinees’ ability introduced by careless mistakes in practical situations but also improve measurement efficiency.
- A Negative Binomial Regression Model for Accuracy Tests
by Hung, L.-F.
Rasch used a Poisson model to analyze errors and speed in reading tests. An important property of the Poisson distribution is that the mean and variance are equal. However, in social science research, it is very common for the variance to be greater than the mean (i.e., the data are overdispersed). This study embeds the Rasch model within an overdispersion framework and proposes new estimation methods. The parameters in the proposed model can be estimated using the Markov chain Monte Carlo method implemented in WinBUGS and the marginal maximum likelihood method implemented in SAS. An empirical example based on models generated by the results of empirical data, which are fitted and discussed, is examined.
- Confirming Testlet Effects
by DeMars, C. E.
A testlet is a cluster of items that share a common passage, scenario, or other context. These items might measure something in common beyond the trait measured by the test as a whole; if so, the model for the item responses should allow for this testlet trait. But modeling testlet effects that are negligible makes the model unnecessarily complicated and risks capitalization on chance, increasing the error in parameter estimates. Checking each testlet to see if the items within the testlet share something beyond the primary trait could therefore be useful. This study included (a) comparison between a model with no testlets and a model with testlet g, (b) comparison between a model with all suspected testlets and a model with all suspected testlets except testlet g, and (c) a test of essential unidimensionality. Overall, Comparison b was most useful for detecting testlet effects. Model comparisons based on information criteria, specifically the sample-size adjusted Bayesian information criterion (SSA-BIC) and BIC, resulted in fewer false alarms than statistical significance tests. The test of essential unidimensionality had true hit rates and false alarm rates similar to the SSA-BIC when the testlet effect was zero for all testlets except the studied testlet. But the presence of additional testlet effects in the partitioning test led to higher false alarm rates for the test of essential unidimensionality.
- Using the Graded Response Model to Control Spurious Interactions in Moderated Multiple Regression
by Morse, B. J., Johanson, G. A., Griffeth, R. W.
Recent simulation research has demonstrated that using simple raw scores to operationalize a latent construct can result in inflated Type I error rates for the interaction term of a moderated statistical model when the interaction (or lack thereof) is proposed at the latent variable level. Rescaling the scores using an appropriate item response theory (IRT) model can mitigate this effect under similar conditions. However, this work has thus far been limited to dichotomous data. The purpose of this study was to extend this investigation to multicategory (polytomous) data using the graded response model (GRM). Consistent with previous studies, inflated Type I error rates were observed under some conditions when polytomous number-correct scores were used, and were mitigated when the data were rescaled with the GRM. These results support the proposition that IRT-derived scores are more robust to spurious interaction effects in moderated statistical models than simple raw scores under certain conditions.
- Bayesian Analysis of Item Response Models Using WinBUGS 1.4.3
by Cho, S.-J., Suh, Y.
- EstCRM: An R Package for Samejima's Continuous IRT Model
by Zopluoglu, C.
- SAMPAL: A Program for Determining Sample Sizes for Testing and Estimating Coefficient Alpha and for Comparing Two Alpha Coefficients
by Silver, N. C., Steiner, E. T., Guillaume, M. M.
- Exploring the Full-Information Bifactor Model in Vertical Scaling With Construct Shift
by Li, Y., Lissitz, R. W.
To address the lack of attention to construct shift in item response theory (IRT) vertical scaling, a multigroup, bifactor model was proposed to model the common dimension for all grades and the grade-specific dimensions. Bifactor model estimation accuracy was evaluated through a simulation study with manipulated factors of percentage of common items, sample size, and degree of construct shift. In addition, the unidimensional IRT (UIRT) model, which ignores construct shift, was also estimated to represent current practice. It was found that (a) bifactor models were well recovered overall, though the grade-specific dimensions were not as well recovered as the general dimension; (b) item discrimination parameter estimates were overestimated in UIRT models due to the effect of construct shift; (c) the person parameters of UIRT models were less accurately estimated than those of bifactor models; (d) group mean parameter estimates from UIRT models were less accurate than those of bifactor models; and (e) a large effect due to construct shift was found for the group mean parameter estimates of UIRT models. A real data analysis provided an illustration of how bifactor models can be applied to problems involving vertical scaling with construct shift. General procedures for testing practice were recommended and discussed.
- Effects of Vertical Scaling Methods on Linear Growth Estimation
by Lei, P.-W., Zhao, Y.
Vertical scaling is necessary to facilitate comparison of scores from test forms of different difficulty levels. It is widely used to enable the tracking of student growth in academic performance over time. Most previous studies on vertical scaling methods assume relatively long tests and large samples. Little is known about their performance when the sample is small or the test is short, challenges that small testing programs often face. This study examined effects of sample size, test length, and choice of item response theory (IRT) models on the performance of IRT-based scaling methods (concurrent calibration, separate calibration with Stocking–Lord, Haebara, Mean/Mean, and Mean/Sigma transformation) in linear growth estimation when the 2-parameter IRT model was appropriate. Results showed that IRT vertical scales could be used for growth estimation without grossly biasing growth parameter estimates when sample size was not large, as long as the test was not too short (≥20 items), although larger sample sizes would generally increase the stability of the growth parameter estimates. The optimal rate of return in total estimation error reduction as a result of increasing sample size appeared to be around 250. Concurrent calibration produced slightly lower total estimation error than separate calibration in the worst combination of short test length (≤20 items) and small sample size (n ≤ 100), whereas separate calibration, except in the case of the Mean/Sigma method, produced similar or somewhat lower amounts of total error in other conditions.
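
Several abstracts in this list compare competing IRT models with information criteria (for example, AIC, corrected AIC, BIC, Hannan and Quinn's criterion, and consistent AIC in the Whittaker, Chang, and Dodd entry). As a rough illustration only, the sketch below computes the standard textbook forms of these five criteria from a model's maximized log-likelihood. It is not the procedure used in any of the listed studies, and the likelihood ratio test is omitted. The function names, the choice of `n` as the number of examinees, and the log-likelihood and parameter counts in the example are all hypothetical.

```python
import math


def information_criteria(log_lik, k, n):
    """Return common information criteria for one fitted model.

    log_lik: maximized log-likelihood of the model
    k:       number of estimated parameters
    n:       sample size (assumed here to be the number of examinees)
    """
    deviance = -2.0 * log_lik
    aic = deviance + 2 * k
    return {
        "AIC": aic,
        # Small-sample correction to AIC; requires n > k + 1.
        "AICc": aic + (2 * k * (k + 1)) / (n - k - 1),
        "BIC": deviance + k * math.log(n),
        "HQIC": deviance + 2 * k * math.log(math.log(n)),
        "CAIC": deviance + k * (math.log(n) + 1),
    }


def select_model(fits, n):
    """For each criterion, pick the model with the smallest value.

    fits: dict mapping a model label to (log_lik, k)
    """
    scores = {name: information_criteria(ll, k, n) for name, (ll, k) in fits.items()}
    criteria = ["AIC", "AICc", "BIC", "HQIC", "CAIC"]
    return {c: min(scores, key=lambda name: scores[name][c]) for c in criteria}


# Hypothetical log-likelihoods and parameter counts, made up for illustration.
example_fits = {
    "1PL/PC": (-5120.0, 25),
    "2PL/GPC": (-5080.0, 45),
    "3PL/GPC": (-5075.0, 60),
}

if __name__ == "__main__":
    for criterion, model in select_model(example_fits, n=500).items():
        print(f"{criterion}: {model}")
```

All five criteria share the same deviance term and differ only in the penalty on `k`, so criteria with heavier penalties (BIC, CAIC) will tend to favor the less parameterized model whenever the improvement in fit is modest.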