A Monte Carlo study of the effects of several test score properties on gain score reliability

Date of Award




Degree Name

Doctor of Philosophy (Ph.D.)


Educational Research

First Committee Member

Maria M. Llabre - Committee Chair


The primary purpose of this study was to examine variations in gain score reliability for six types of test scores: raw, z, stanine, percentile, normal curve equivalent, and grade equivalent. A secondary purpose was to compare gain score reliabilities for simple, residualized, and base-free difference scores. The data were generated using Monte Carlo techniques to simulate a variety of testing situations. The parameters varied in the simulations were the pre-test and post-test reliabilities, the correlation between the two tests, and the ratio of the sample standard deviations (lambda). Simulations were performed to investigate generalized testing situations as well as normative sample results for seventh, eighth, and ninth grade performance on the total mathematics component of the Stanford Achievement Test. Simulation results suggest that raw scores produce the most reliable gain scores when the pre-test and post-test distributions have different variances, so that lambda deviates from 1.0. Derived and developmental scores, which in most cases produce a lambda of 1.0 because the pre-test and post-test distributions have equal variances, were found to be generally less reliable for measuring gain. Where the reliabilities of the pre-test and post-test were extremely high, as would be found with most standardized achievement tests, few differences were noted between raw and transformed scores. Simple gain scores were found to be the most reliable, and the easiest to interpret, when raw scores were under consideration. When derived scores were to be used, however, residualized gain scores often produced higher reliability estimates. Educators, evaluators, and researchers are cautioned to consider relevant factors carefully, specifically the transformation of raw scores, before using these scores to assess instructional or treatment effectiveness.
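
The dependence of simple gain score reliability on the four simulation parameters can be illustrated with the standard classical test theory expression for the reliability of a difference score. The sketch below is not taken from the dissertation: it assumes uncorrelated measurement errors, defines lambda as the pre-test standard deviation divided by the post-test standard deviation (the abstract does not state the direction of the ratio), and uses a hypothetical function name, gain_reliability.

```python
def gain_reliability(rel_pre, rel_post, corr, lam):
    """Classical reliability of a simple gain score D = post - pre.

    Assumptions (not from the source): measurement errors are
    uncorrelated, and lam = sd_pre / sd_post.

    rel_pre, rel_post : reliabilities of the pre-test and post-test
    corr              : observed correlation between the two tests
    lam               : ratio of the standard deviations (lambda)
    """
    if lam <= 0:
        raise ValueError("lambda must be positive")
    # Var(D) and the true-score variance of D, both divided by sd_pre * sd_post.
    observed = lam + 1.0 / lam - 2.0 * corr
    true = lam * rel_pre + rel_post / lam - 2.0 * corr
    if observed <= 0:
        raise ValueError("gain scores have zero variance for these inputs")
    return true / observed


if __name__ == "__main__":
    # Same reliabilities and correlation; only lambda changes.
    for lam in (1.0, 1.5, 2.0):
        print(lam, round(gain_reliability(0.8, 0.8, 0.5, lam), 3))
```

Under these assumptions, holding the reliabilities and the correlation fixed, the computed reliability rises as lambda moves away from 1.0, which is consistent with the pattern the abstract reports for raw scores.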


Education, Tests and Measurements

Link to Full Text