Research

This page lists some of the recent studies I have worked on. If you have difficulty obtaining the full paper for any of them, let me know at boz@uwm.edu.

On Standards-based Grading:

A Didactic Explanation of Standards-based Grading

While standards-based grading (SBG) has been implemented in school districts across the country, it faces two major challenges. Because research on standards-based grading is limited, many SBG practices are not substantiated by empirical evidence. Meanwhile, not all teachers are well prepared to implement SBG in the classroom. This didactic guide explains how to conduct standards-based grading, using traditional points-based grading as a point of comparison. It starts with the rationales for standards-based grading, then moves to its major components, and ends with a discussion of the major challenges in using it.

On Test Validity, with Regina Navejar

We found a clear association between noise during testing and measured math achievement. About 40% of students are bothered by noise during testing. The more bothersome the noise is, the lower the math score tends to be. Noise coping explains about 10% of the differences in test scores, about the same share as GPA.

On Classroom Assessment and Grading, with Jacob Misiak

We found that written feedback improves student achievement. Standards-based grading without written feedback is no better than points-based grading. We recommend that written feedback be highly relevant (to the academic standard being assessed), limited in number (to the major misconceptions students have), and paired with a second chance (for students to act on the feedback and improve their learning).

On Language Testing, Rater Reliability, with Yunnan Xiao and Juan Luo

When a large number of well-trained raters are used, rater reliability can be high and similar for both holistic and analytic scoring of second language writing tasks. But the scores assigned can be quite different. Students with lower writing proficiency tend to receive higher scores under analytic scoring; students with higher proficiency score higher under holistic scoring.

On Language Testing, Testlet Model and Proficiency Classification

Testlets are the darlings of language testers, but they create the problem of item dependency. Here I show that the testlet response theory (TRT) model should be used for proficiency classification. Using the standard IRT model inflates classification accuracy because it underestimates measurement error.
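
The direction of this effect can be illustrated with a minimal sketch. It is not the TRT model from the study; it is a linear (classical test theory) analog under assumed values, in which each item score is ability plus a shared testlet effect plus item-level error. The function name and all variance values are hypothetical. The sketch compares the reliability of the total score with the value Cronbach's alpha reports when items are treated as independent.

```python
def score_reliabilities(var_theta, var_gamma, var_error,
                        n_testlets, items_per_testlet):
    """Return (true_reliability, cronbach_alpha) for a total score.

    Assumed linear model (illustrative only, not the TRT model):
        item score = theta + gamma_testlet + error
    where theta is ability, gamma is an effect shared by all items in
    one testlet, and error is independent across items.
    """
    n_items = n_testlets * items_per_testlet

    # Variance of the total score: ability is shared by all items,
    # each testlet effect is shared by the items in that testlet.
    total_var = (n_items ** 2 * var_theta
                 + n_testlets * items_per_testlet ** 2 * var_gamma
                 + n_items * var_error)

    # Only the ability component counts as true-score variance.
    true_reliability = n_items ** 2 * var_theta / total_var

    # Cronbach's alpha treats every inter-item covariance as true-score
    # variance, including the covariance created by the testlet effect.
    sum_item_var = n_items * (var_theta + var_gamma + var_error)
    alpha = n_items / (n_items - 1) * (1 - sum_item_var / total_var)

    return true_reliability, alpha


# Hypothetical test: 5 testlets of 4 items each.
true_rel, alpha = score_reliabilities(1.0, 0.5, 1.0, 5, 4)
print(f"true reliability: {true_rel:.3f}, alpha ignoring testlets: {alpha:.3f}")
```

With a nonzero testlet variance, alpha comes out above the true reliability, which means measurement error is understated; with zero testlet variance the two values coincide. Classification accuracy computed from the understated error would look better than it really is.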

Here is more.