An Investigation Of The Sampling Distribution Of Several Criterion-Referenced Reliability Coefficients Using Monte Carlo Techniques

Degree Name

Doctor of Philosophy (Ph.D.)


Educational Research


There are at least two concepts of test reliability. For over 50 years, measurement specialists perceived reliability as being related to the consistency of examinee performance on the items of parallel forms of a test. More recently, with the advent of criterion-referenced tests, reliability has been described in terms of the consistency of classification of examinees into mutually exclusive categories over parallel forms of a test.

The purpose of the study was to investigate the sampling distributions of two coefficients of reliability of criterion-referenced tests. The first, Livingston's K², was developed from the classical definition of reliability, while the second, Cohen's kappa (κ), is representative of the decision-theoretic approach to reliability.

A FORTRAN computer program was developed by the investigator to simulate test results under various combinations of item parameters and population distributions. Each experiment consisted of generating 400 classically parallel pairs of item-by-subject response matrices. The normal ogive model was used to relate the input parameters to the response matrices. The indices under investigation, κ and K², and their sampling variances, were then computed from the response matrices for each of several cutting scores. The classical reliability coefficients, ρ and KR-20, were computed for comparative purposes and for validation of the simulation model.

The effect of test length on the indices was examined using tests with 5, 10, 15, 20, and 30 items. It was found that changes in K² due to changes in test length could be accurately described by the Spearman-Brown prophecy formula. The kappa ratio also increased with test length, but not in accordance with the Spearman-Brown formula.
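The three indices discussed above have standard closed forms, and the relationships among them can be sketched in a few lines. The following is a minimal illustration in Python, not the investigator's FORTRAN program: Livingston's K² adjusts a classical reliability coefficient toward a cutting score, Cohen's kappa is computed from the 2×2 master/non-master classification table over two parallel forms, and the Spearman-Brown prophecy formula predicts reliability for a test lengthened by a factor k. The parameter values in the examples are arbitrary.

```python
def spearman_brown(rho, k):
    """Predicted reliability of a test lengthened by factor k."""
    return k * rho / (1 + (k - 1) * rho)

def livingston_k2(var_x, rho, mean_x, cut):
    """Livingston's K^2: classical reliability rho adjusted toward a cutting score.

    var_x, mean_x: observed-score variance and mean; cut: cutting score.
    """
    d2 = (mean_x - cut) ** 2
    return (rho * var_x + d2) / (var_x + d2)

def cohens_kappa(p11, p10, p01, p00):
    """Cohen's kappa from the joint classification proportions on two
    parallel forms: p11 = master on both, p00 = non-master on both, etc."""
    p_obs = p11 + p00                     # observed agreement
    m1 = p11 + p10                        # marginal: master on form 1
    m2 = p11 + p01                        # marginal: master on form 2
    p_exp = m1 * m2 + (1 - m1) * (1 - m2)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

Note that when the cutting score equals the test mean, K² reduces to the classical coefficient ρ, and it exceeds ρ as the cut moves away from the mean, which is consistent with K²'s derivation from squared deviations about the cutting score.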
Attempts to develop an alternative formula, from empirical data, to predict changes in kappa corresponding to changes in test length were not entirely successful.

Stability was defined as the tendency for a reliability index to give consistent results for random samples of examinees taken from the same population. The stability of the two indices was examined by comparing their sampling variances (from 400 random samples of 30 examinees) with each other and with the sampling variance of ρ. The kappa ratio was found to be less stable than ρ, and K² more stable than ρ, over all levels and combinations of item difficulty, biserial correlation, and criterion score. Variations in the relative magnitudes of K², κ, and ρ corresponding to changes in the biserial correlations were approximately the same.

The effect of three population distributions, normal, exponential, and uniform, on the three reliability indices and their sampling variances was investigated. The effects of distributional changes on κ and K² were similar. Both showed a general tendency to increase with item difficulty for the exponential distribution and to decrease with item difficulty for the uniform distribution when compared with the behavior of ρ. Therefore, neither κ nor K² can be considered distribution-free.

One obvious conclusion from the results of the study is that κ and K² do not appear to be measures of the same thing, and neither is equivalent to the classical reliability coefficient. Kappa appears somewhat less stable than the other indices of reliability when used to assess the reliability of a test with a small number of binary-scored items administered to classroom-size samples.
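The stability criterion described above can be illustrated with a toy Monte Carlo loop: draw repeated random samples of 30 examinees, compute kappa for each, and take the variance of the resulting estimates. This sketch replaces the study's normal ogive item model with a much simpler two-state latent mastery model (each form classifies an examinee correctly with a fixed probability), so the parameter values and the model itself are assumptions for illustration only.

```python
import random

def simulate_kappa_variance(n_reps=400, n_examinees=30, p_master=0.6,
                            consistency=0.85, seed=1):
    """Estimate the sampling variance of Cohen's kappa over repeated samples.

    Toy model: each examinee has a latent mastery state drawn with probability
    p_master; each of two parallel forms classifies the examinee in agreement
    with that state with probability `consistency`.
    """
    rng = random.Random(seed)
    kappas = []
    for _ in range(n_reps):
        p11 = p10 = p01 = p00 = 0
        for _ in range(n_examinees):
            master = rng.random() < p_master
            c1 = master if rng.random() < consistency else not master
            c2 = master if rng.random() < consistency else not master
            if c1 and c2:
                p11 += 1
            elif c1:
                p10 += 1
            elif c2:
                p01 += 1
            else:
                p00 += 1
        n = n_examinees
        p_obs = (p11 + p00) / n
        m1 = (p11 + p10) / n
        m2 = (p11 + p01) / n
        p_exp = m1 * m2 + (1 - m1) * (1 - m2)
        if p_exp < 1.0:  # skip degenerate samples with no variation
            kappas.append((p_obs - p_exp) / (1 - p_exp))
    mean_k = sum(kappas) / len(kappas)
    var_k = sum((k - mean_k) ** 2 for k in kappas) / (len(kappas) - 1)
    return mean_k, var_k
```

Comparing `var_k` across indices computed from the same samples is the study's notion of relative stability: the index with the smaller sampling variance over the 400 replications is the more stable one for that configuration.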


Education, Tests and Measurements
