This page lists some of my recent studies. If you have difficulty obtaining the full text of any of them, let me know at boz@uwm.edu.
On Test Validity, with Regina Navejar
We found a clear association between testing-room noise and measured math achievement. About 40% of students are bothered by noise during testing, and the more bothersome the noise, the lower the math score tends to be. Noise coping explains about 10% of the variance in test scores, roughly as much as GPA does.
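For readers unfamiliar with "variance explained," here is a minimal sketch of the underlying idea, using simulated data and hypothetical variable names (not the study's actual data or analysis): the share of variance attributable to noise coping is the increase in R-squared when noise coping is added to a regression that already includes GPA.

```python
# A minimal sketch with simulated data and hypothetical variable names,
# illustrating what "explains about 10% of the variance" means.
import numpy as np

rng = np.random.default_rng(0)
n = 500
noise_coping = rng.normal(size=n)   # hypothetical noise-coping score
gpa = rng.normal(size=n)            # hypothetical standardized GPA
# Simulated math scores where each predictor accounts for roughly 10%.
math_score = 0.33 * noise_coping + 0.33 * gpa + rng.normal(size=n)

def r_squared(X, y):
    """R-squared of an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_full = r_squared(np.column_stack([gpa, noise_coping]), math_score)
r2_gpa = r_squared(gpa.reshape(-1, 1), math_score)
print(f"Variance explained by noise coping beyond GPA: {r2_full - r2_gpa:.2f}")
```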
On Classroom Assessment and Grading, with Jacob Misiak
We found that written feedback improves student achievement; standards-based grading without written feedback is no better than points-based grading. We recommend that written feedback be highly relevant (to the academic standard being assessed), limited in number (to the major misconceptions students have), and paired with a second chance (so students can act on it to improve their learning).
On Language Testing, Rater Reliability, with Yunnan Xiao and Juan Luo
When a large number of well-trained raters are used, rater reliability can be high and similar for both holistic and analytic scoring of second language writing tasks. But the two methods can assign quite different scores: students with lower writing proficiency tend to receive higher scores under analytic scoring, while students with higher proficiency score higher under holistic scoring.
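Why do many raters help? The standard account is the Spearman-Brown formula (an illustrative textbook calculation, not a figure from the paper): the reliability of the average of k raters is

```latex
\rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}
```

so even a modest single-rater reliability of ρ₁ = .50 yields ρ₁₀ = 5/5.5 ≈ .91 when ten ratings are averaged.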
On Language Testing, Testlet Model and Proficiency Classification
Testlets are the darlings of language testers. But they create the problem of item dependency: items sharing the same passage are locally dependent. Here I show that the testlet response theory (TRT) model should be used for proficiency classification; the standard IRT model treats those items as independent, which underestimates measurement error and thus inflates the estimated classification accuracy.
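For context, one common TRT formulation is the two-parameter testlet model in the tradition of Bradlow, Wainer, and Wang (the exact model used in the study may differ), which extends the standard 2PL model with a person-by-testlet effect:

```latex
P(y_{ij} = 1 \mid \theta_i)
  = \frac{1}{1 + \exp\!\left[-a_j\left(\theta_i - b_j - \gamma_{i\,d(j)}\right)\right]}
```

where γ_{i d(j)} is examinee i's random effect for the testlet d(j) containing item j; setting every γ to zero recovers the standard 2PL. When the testlet variance is actually nonzero, the standard model counts each item as an independent piece of information, overstating test information and understating the standard error of θ near the cut score, which is exactly why its classification accuracy estimates come out inflated.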
Here is more.