Recent Content of Language Testing
This site designed and maintained by
Prof. Glenn Fulcher

@languagetesting.info

Titles and Abstracts
     
 
Language Testing

  • Validating score interpretations and uses
    by Kane, M.

    The argument-based approach to validation involves two steps: specification of the proposed interpretations and uses of the test scores as an interpretive argument, and the evaluation of the plausibility of the proposed interpretive argument. More ambitious interpretations and uses tend to involve an extended network of inferences and assumptions and require extensive evidence for their support. Simpler interpretations do not claim much and therefore may not require much evidential support. The evaluation of score-based decisions generally requires an evaluation of the consequences of the decision rule. In any case, the claims that are being made need to be justified.



  • Validity argument for language assessment: The framework is simple...
    by Chapelle, C. A.

  • Grounding the argument-based framework for validating score interpretations and uses
    by Oller, J. W.

    Kane’s argument-based framework is summarized and examined. He implicitly appeals to the backgrounded concepts of fairness and justice. From there it is a short distance to grounding the whole system in the mundane notion of truth. In fact, valid argument systems must depend on representations that are ‘true’ by virtue of agreement with purported facts. As a friendly amendment, therefore, I argue that (provided the ceteris paribus, all else being equal, requirement is met) agreement with known facts in testing, experimental research, and scientific measurement counts for a great deal more than disagreement. It follows by Peircean ‘exact logic’ that higher test scores (if the tests have any validity at all) are invariably more informative (interpretable in general) and thus more useful than lower scores. Why? Because higher scores show more agreement between the test-makers and the higher scoring test-takers about whatever facts (or performances) may be at issue. Exceptions are cases where the ceteris paribus requirement is not met. Necessary (but testable) inferences follow for interpretations and uses of ‘cutscores.’



  • Kane, validity and soundness
    by Davies, A.

  • Confidence scoring of speaking performance: How does fuzziness become exact?
    by Jin, T., Mak, B., Zhou, P.

    The fuzziness of assessing second language speaking performance raises two difficulties in scoring speaking performance: indistinction between adjacent levels and overlap between scales. To address these two problems, this article proposes a new approach, confidence scoring, to deal with such fuzziness, leading to confidence scores between two adjacent levels applied to three scales. Since confidence scores have to be transformed to an exact score for test interpretation and use, membership functions and rule bases are applied and a confidence scoring algorithm is developed. Confidence scoring is demonstrated in the paper by an example to facilitate easy understanding. The paper then describes a pilot study that was conducted to try out the confidence scoring design. Initial results reveal that: first, confidence scoring is as feasible as traditional scoring; second, confidence scoring performs better in scoring dependability and in correlations with established benchmarks. At the end of the article, further studies are called for in order to build a validity argument and make further revisions to the confidence scoring method described here.



  • Note-taking quality and performance on an L2 academic listening test
    by Song, M.-Y.

    This study investigated the relationships among the quality of L2 test takers’ notes evaluated in terms of different levels of information and test takers’ performance on open-ended listening tasks tapping into different comprehension subskills. In addition, this study examined the invariance of the structural relationships among the variables across two different note-taking formats, that is, a blank format and an outline format, by employing a multi-group structural equation modeling (SEM) approach. The results indicated that note quality measures, in particular the number of topical ideas found in the notes and the organization of these notes, may be good indicators of test takers’ second language academic listening proficiency. It was also found that despite the invariance of structural relationships among variables across the two note-taking formats, the associations between the open-ended listening measures and note quality measures were slightly stronger in the outline format than in the blank format. The implications of these results for L2 academic listening assessment are considered.



  • TOEFL iBT speaking test scores as indicators of oral communicative language proficiency
    by Bridgeman, B., Powers, D., Stone, E., Mollaun, P.

    Scores assigned by trained raters and by an automated scoring system (SpeechRater™) on the speaking section of the TOEFL iBT™ were validated against a communicative competence criterion. Specifically, a sample of 555 undergraduate students listened to speech samples from 184 examinees who took the Test of English as a Foreign Language Internet-based test (TOEFL iBT). Oral communicative effectiveness was evaluated both by rating scales and by the ability of the undergraduate raters to answer multiple-choice questions that could be answered only if the spoken response was understood. Correlations of these communicative competence indicators from the undergraduate raters with speech scores were substantially higher for the scores provided by the professional TOEFL iBT raters than for the scores provided by SpeechRater. Results suggested that both expert raters and SpeechRater are evaluating aspects of communicative competence, but that SpeechRater fails to measure aspects of the construct that human raters can evaluate.



  • Re-fitting for a different purpose: A case study of item writer practices in adapting source texts for a test of academic reading
    by Green, A., Hawkey, R.

    The important yet under-researched role of item writers in the selection and adaptation of texts for high-stakes reading tests is investigated through a case study involving a group of trained item writers working on the International English Language Testing System (IELTS). In the first phase of the study, participants were invited to reflect in writing, and then audio-recorded in a semantic-differential-based joint discussion, on the processes they employed to generate test material. The group were next observed at a simulated item writers’ editing meeting to refine their texts and items for an IELTS reading test module. The participants’ written descriptions and recorded discussions provided rich data on how source texts were perceived, selected and adapted for the test. The study reports findings from textual analyses using indices of readability and lexical density from the original material sourced by the item writers and their adapted versions for the test. Results from qualitative and quantitative analyses are discussed in terms of the implications for the IELTS reading module of editing actions such as: reducing redundancy and technical language, changing styles, deciding on potentially sensitive issues and relationships between texts and test items. The important issue of text authenticity in tests such as IELTS is also broached.



  • Factor structure of the revised TOEIC® test: A multiple-sample analysis
    by In'nami, Y., Koizumi, R.

    This study examined the factor structure of the listening and reading sections of the revised Test of English for International Communication (TOEIC®) test. The data from the TOEIC IP (institutional program) test taken by 569 English learners were randomly split into two samples (n = 285 vs. 284). Four models (higher-order, correlated, uncorrelated, and unitary) were hypothesized on the basis of the literature and were tested with each sample. The results from confirmatory factor analysis suggested that the correlated model fit the data best in both samples. Further, multiple-sample analysis using the two samples supported an invariance of factor loadings, measurement error variances, factor variances, and factor covariances for the correlated model in the revised TOEIC test. The presence of distinctive factors of listening and reading skills supports the reporting of separate scores for each skill, whereas the relatively high correlation between the two factors may support single score reporting. This is in accordance with the formats used to report revised TOEIC test scores. The results of the current study provide empirical support for the reporting practice of the revised TOEIC test and thus for test interpretation based on the test scores.



  • Norman Segalowitz, Cognitive Bases of Second Language Fluency
    by Dimova, S.

  • Z.H. Han and T. Cadierno (Eds), Linguistic Relativity in SLA: Thinking for Speaking
    by Davies, A.

  • Applied linguistics and measurement: A dialogue
    by McNamara, T.

  • Building out a measurement model to incorporate complexities of testing in the language domain
    by Wilson, M., Moore, S.

    This paper provides a summary of a novel and integrated way to think about the item response models (most often used in measurement applications in social science areas such as psychology, education, and especially testing of various kinds) from the viewpoint of the statistical theory of generalized linear and nonlinear mixed models. In addition, this new approach emphasizes how item response models can be coordinated and broadened to emphasize their explanatory uses beyond their standard descriptive uses. The basic explanatory principle is that item responses can be modeled as a function of qualities and features of various measurement contexts. These qualities and features can be: (a) characteristics of (i) items, (ii) persons, and (iii) combinations of items and persons; (b) observed or latent (of either items or persons); and (c) continuous or categorical. These ideas are exemplified in the context of a reading comprehension test. The paper starts with an introduction to the framework and then provides: (a) a description of the data that will be used to illustrate the new framework; (b) a discussion of data structure; (c) a brief description of the statistical approach we used; (d) a discussion of how the framework helps one to conceptualize existing item response models, linking the formal features of the models to substantive issues in the assessment of reading comprehension, as well as incorporating an example that goes beyond the usual range of item response models; and (e) a brief summary of further expansion.



  • Testing of second language pragmatics: Past and future
    by Roever, C.

    Testing of second language pragmatic competence is an underexplored but growing area of second language assessment. Tests have focused on assessing learners’ sociopragmatic and pragmalinguistic abilities but the speech act framework informing most current productive testing instruments in interlanguage pragmatics has been criticized for under-representing the construct. In particular, the assessment of learners’ ability to produce extended monologic and dialogic discourse is a missing component in existing assessments. This paper reviews existing tests and argues for a discursive re-orientation of pragmatics tests. Suggestions for tasks and scoring approaches to assess discursive abilities while maintaining practicality are provided, and the problematicity of native speaker benchmarking is discussed.



  • Effects of test-taker characteristics and the number of participants in group oral tests
    by Nakatsuhara, F.

    This study explores the nature of co-constructed interaction in group oral tests by examining whether a test-taker’s own and his or her group members’ extraversion levels and oral proficiency levels have different influences on conversational styles between two group sizes: groups of three and groups of four. Data were collected from 269 Japanese upper-secondary school students, who took group oral tests either in groups of three or four. All sessions were video-taped and transcribed following Conversation Analysis (CA) conventions. The data were quantitatively analysed in terms of goal-orientation, interactional contingency and quantitative dominance. Then, CA methodology was used to interpret and elaborate the statistical results. The findings have implications for our understanding of the group oral test construct and for appropriate choices of group size in group oral testing.