Recent Content of Educational and Psychological Measurement
This site designed and maintained by
Prof. Glenn Fulcher

@languagetesting.info

Titles and Abstracts
     
 
Educational and Psychological Measurement

  • Evaluating the Bookmark Judgments of Standard-Setting Panelists
    by Engelhard, G.

    The purpose of this study is to describe a new approach for evaluating the judgments of standard-setting panelists within the context of the bookmark procedure. The bookmark procedure is widely used for setting performance standards on high-stakes assessments. A many-faceted Rasch (MFR) model is proposed for evaluating the bookmark judgments of the panelists, and its use illustrated with standard-setting judgments from the Michigan Educational Assessment Program (MEAP). Panelists set three performance standards to create four performance levels (Apprentice, Basic, Met, and Exceeded). The content area used to illustrate the model is mathematics in Grade 3. The analyses suggest that the MFR model provides a promising framework for examining bookmark judgments.



  • Computerized Classification Testing Under the One-Parameter Logistic Response Model With Ability-Based Guessing
    by Wang, W.-C., Huang, S.-Y.

    The one-parameter logistic model with ability-based guessing (1PL-AG) has been recently developed to account for the effect of ability on guessing behavior in multiple-choice items. In this study, the authors developed algorithms for computerized classification testing under the 1PL-AG and conducted a series of simulations to evaluate their performance. Four item selection methods (the Fisher information, the Fisher information with a posterior distribution, the progressive method, and the adjusted progressive method) and two termination criteria (the ability confidence interval [ACI] method and the sequential probability ratio test [SPRT]) were developed. In addition, the Sympson–Hetter online method with freeze (SHOF) was implemented for item exposure control. Major results include the following: (a) when no item exposure control was applied, all four item selection methods yielded very similar correct classification rates, but the Fisher information method had the worst item bank usage and the highest item exposure rate; (b) SHOF can successfully maintain the item exposure rate at a prespecified level, without substantially compromising accuracy and efficiency in classification; (c) once SHOF was implemented, all four methods performed almost identically; (d) ACI appeared to be slightly more efficient than SPRT; and (e) in general, a higher weight of ability in guessing led to slightly higher accuracy and efficiency, and a lower forced classification rate.



  • Comparing Construct Definition in the Angoff and Objective Standard Setting Models: Playing in a House of Cards Without a Full Deck
    by Stone, G. E., Koskey, K. L. K., Sondergeld, T. A.

    Typical validation studies on standard setting models, most notably the Angoff and modified Angoff models, have ignored construct development, a critical aspect associated with all conceptualizations of measurement processes. Stone compared the Angoff and objective standard setting (OSS) models and found that Angoff failed to define a legitimate and stable construct. The present study replicates and expands this work by presenting results from a 5-year investigation of both models, using two different approaches (equating and annual standard setting) within two testing settings (health care and education). The results support the original conclusion that although the OSS model demonstrates effective construct development, the Angoff approach appears random and lacking in clarity. Implications for creating meaningful and valid standards are discussed.



  • The Similarity of Bookmark Cut Scores With Different Response Probability Values
    by Wyse, A. E.

    Standard setting is a method used to set cut scores on large-scale assessments. One of the most popular standard setting methods is the Bookmark method. In the Bookmark method, panelists are asked to envision a response probability (RP) criterion and move through a booklet of ordered items based on that criterion. This study investigates whether it is possible to end up with the same cut scores if one were to apply the Bookmark method with two different RP values. Analytical formulas and two hypothetical examples from a large-scale state testing program indicate that it is rarely possible to obtain the same cut score estimates with two different RP values because of the item difficulty gaps present when the procedure is applied in practice. Results indicate that if the same group of panelists applied the Bookmark procedure as it is traditionally explained, then cut scores should be lower with the second chosen RP value than they were with the first RP value. This result holds whether the second RP value is higher or lower than the first RP value. The examples also reveal that differences in cut score estimates with different RP values can lead to changes in the percentage of examinees at or above the cut scores that may have important practical impacts.



  • Performance of the S – χ² Statistic for Full-Information Bifactor Models
    by Li, Y., Rupp, A. A.

    This study investigated the Type I error rate and power of the multivariate extension of the S – χ² statistic using unidimensional and multidimensional item response theory (UIRT and MIRT, respectively) models as well as full-information bifactor (FI-bifactor) models through simulation. Manipulated factors included test length, sample size, latent trait characteristics such as discrimination pattern and intertrait correlations, and model type misspecification. The nominal Type I error rates were observed under all conditions. The power of the S – χ² statistic for UIRT models was high for MIRT and FI-bifactor models that were structurally most distinct from the UIRT models but was low otherwise. The power of the S – χ² statistic to detect misfitting between MIRT and FI-bifactor models was low across all conditions because of the structural similarity of these two models. Finally, information-based indices of relative model–data fit and latent variable correlations were obtained, and these showed expected patterns across conditions.



  • Polytomous Adaptive Classification Testing: Effects of Item Pool Size, Test Termination Criterion, and Number of Cutscores
    by Gnambs, T., Batinic, B.

    Computer-adaptive classification tests focus on classifying respondents in different proficiency groups (e.g., for pass/fail decisions). To date, adaptive classification testing has been dominated by research on dichotomous response formats and classifications in two groups. This article extends this line of research to polytomous classification tests for two- and three-group scenarios (e.g., inferior, mediocre, and superior proficiencies). Results of two simulation experiments with generated and real responses (N = 2,000) to established personality scales of different length (12, 20, or 29 items) demonstrate that adaptive item presentations significantly reduce the number of items required to make such classification decisions while maintaining a consistent classification accuracy. Furthermore, the simulations highlight the importance of the selected test termination criterion, which has a significant impact on the average test length.



  • Formulating the Rasch Differential Item Functioning Model Under the Marginal Maximum Likelihood Estimation Context and Its Comparison With Mantel-Haenszel Procedure in Short Test and Small Sample Conditions
    by Paek, I., Wilson, M.

    This study elaborates the Rasch differential item functioning (DIF) model formulation under the marginal maximum likelihood estimation context. Also, the Rasch DIF model performance was examined and compared with the Mantel–Haenszel (MH) procedure in small sample and short test length conditions through simulations. The theoretically known relationship of the DIF estimators between the Rasch DIF model and the MH procedure was confirmed. In general, the MH method showed a conservative tendency for DIF detection rates compared with the Rasch DIF model approach. When there is DIF, the z test (when the standard error of the DIF estimator is estimated properly) and the likelihood ratio test in the Rasch DIF model approach showed higher DIF detection rates than the MH chi-square test for sample sizes of 100 to 300 per group and test lengths ranging from 4 to 39. In addition, this study discusses proposed Rasch DIF classification rules that accommodate statistical inference on the direction of DIF.



  • Examining Parallelism of Sets of Psychometric Measures Using Latent Variable Modeling
    by Raykov, T., Patelis, T., Marcoulides, G. A.

    A latent variable modeling approach that can be used to examine whether several psychometric tests are parallel is discussed. The method consists of sequentially testing the properties of parallel measures via a corresponding relaxation of parameter constraints in a saturated model or an appropriately constructed latent variable model. The underlying procedure is readily used in cases with at least three single measures studied for parallelism and is directly extended to the setting of multiple-component measuring instruments. The methodological limitations concerning intended attempts to test if two given measures are parallel are also explicated. The proposed modeling approach is illustrated with an example.



  • Erratum

    Turgeon, L., & Chartrand, E. (2003). Psychometric properties of the French Canadian version of the State-Trait Anxiety Inventory for Children. Educational and Psychological Measurement 63(1), 174-185. (DOI: 10.1177/0013164402239324).



  • A Commentary on the Use of Formative Measurement
    by Hardin, A., Marcoulides, G. A.

    The recent flurry of articles on formative measurement, particularly in the information systems literature, appears to be symptomatic of a much larger problem. Despite significant objections by methodological experts, these articles continue to deliver a predominately pro formative measurement message to researchers who rapidly incorporate these recommendations into their research. This commentary argues that many of these articles misinform readers due to the lack of theory underlying formative measurement and a misinterpretation of the early psychometric literature. The authors suggest that to avoid further confusing the consumers of this research, the prudent course of action may be to consider temporarily suspending the use of formative measurement. They further contend that the debate on formative measurement should be restricted primarily to premier methods journals where experts can ultimately develop a theoretical perspective that supports or rejects its implementation. Combined, these steps should help alleviate the rapid adoption of such controversial methods before they have been thoroughly vetted.



  • The Influence of Dimensionality on Parameter Estimation Accuracy in the Generalized Graded Unfolding Model
    by Carter, N. T., Zickar, M. J.

    The generalized graded unfolding model (GGUM) is an ideal point model of responding that is consistent with the Thurstonian theory of respondent behavior. Ideal point models have recently generated interest in the realms of attitude and personality assessment. One unclear aspect of applying ideal point models is the influence of multidimensionality on the accuracy of GGUM item and person parameter estimates. Using simulated data, the authors tested the influence of the balance, or ratio, of items loading onto two dimensions, the degree of bidimensionality, and sample size on parameter estimation accuracy. The results suggest that bidimensionality and the proportion of items loading onto a second trait increase estimation error. The second trait was chosen in estimation when a large number of the items in the survey reflected a highly irrelevant second trait. Estimation error was greater for persons and items at the extreme ends of the continuum; positive estimates were biased upward and positive parameters downward. The results suggest that although the GGUM chooses another trait in estimation, in most cases conventional fit analyses and checks for item parameter extremity are likely to be successful in identifying items measuring another trait. Furthermore, the conditions in which the trait being estimated may not be clear should be rare in practice. The implications of these results for researchers who wish to apply these models to real-life data are discussed.



  • Do Adjusted Subscores Lack Validity? Don't Blame the Messenger
    by Sinharay, S., Haberman, S. J., Wainer, H.

    Several techniques increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications. In this note, the authors question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining.



  • Optimal Sampling of Units in Three-Level Cluster Randomized Designs: An ANCOVA Framework
    by Konstantopoulos, S.

    Field experiments with nested structures assign entire groups such as schools to treatment and control conditions. Key aspects of such cluster randomized experiments include knowledge of the intraclass correlation structure and the sample sizes necessary to achieve adequate power to detect the treatment effect. The units at each level of the hierarchy have a cost associated with them, however, and thus, researchers need to take budget and costs into account when designing their studies. This article uses analysis of covariance and provides methods for computing power within an optimal design framework that incorporates costs of units at different levels and covariate effects for three-level cluster randomized balanced designs. The optimal sample sizes are a function of the variances at each level and the cost of each unit. Overall, the results suggest that when units at higher levels become more expensive, the researcher should sample units at lower levels. The covariates affect the sampling of units and the power estimates. Fewer units need to be sampled at levels where covariates explain considerable proportions of the variance.



  • Multiscale Measurement of Extreme Response Style
    by Bolt, D. M., Newton, J. R.

    This article extends a methodological approach considered by Bolt and Johnson for the measurement and control of extreme response style (ERS) to the analysis of rating data from multiple scales. Specifically, it is shown how the simultaneous analysis of item responses across scales allows for more accurate identification of ERS, and more effective control of ERS effects on the substantive trait estimates, than when analyzing just one scale. Moreover, unlike a competing approach presented by Greenleaf, the current strategy can accommodate conditions in which the substantive traits across scales correlate, as is almost always the case in social sciences research. Simulation and real data analyses are used for illustration.



  • Assessing Goodness of Fit in Item Response Theory With Nonparametric Models: A Comparison of Posterior Probabilities and Kernel-Smoothing Approaches
    by Sueiro, M. J., Abad, F. J.

    The distance between nonparametric and parametric item characteristic curves has been proposed as an index of goodness of fit in item response theory in the form of a root integrated squared error index. This article proposes to use the posterior distribution of the latent trait as the nonparametric model and compares the performance of an index based on this method with another approach based on the kernel-smoothing model. Error rates and power are evaluated using the two-parameter logistic model and three types of realistic misfitted items. Results show that for fitting items, the distance between parametric and nonparametric item characteristic curves decreased as the sample size increased for both procedures. Kernel-smoothing root integrated squared error also decreased as test length increased. Bootstrap methods are used to obtain a significance test. Both procedures performed adequately in terms of Type I error rates. Regarding power, the posterior probabilities method was superior, especially in small samples, although in short tests both procedures performed in a similar way.