Abstract
Purpose: Agreement across methods for identifying students as inadequate responders to intervention or as learning disabled is often poor. We report (1) an empirical examination of final-status methods (postintervention benchmarks) and dual-discrepancy methods, which combine growth during the intervention with final status, for assessing response to intervention and (2) a statistical simulation of psychometric issues that may explain the low agreement.
Methods: After a Tier 2 intervention, final-status benchmark criteria were used to identify 104 inadequate and 85 adequate responders to intervention, and agreement and coverage were compared among these benchmark methods and a dual-discrepancy method. Factors affecting agreement were investigated in a computer simulation that manipulated reliability, the intercorrelation between measures, cutoff points, normative samples, and sample size.
Results: When inadequate responders were identified with individual measures, each single measure captured relatively few members of the pool of 104 inadequate responders, and agreement between pairs of measures for identifying inadequate responders was only poor to fair. In the simulation, comparisons across 2 simulated measures generated indices of agreement (κ) that were generally low because of multiple psychometric issues inherent in any test.
Conclusions: Expecting excellent agreement between 2 correlated tests with even small amounts of unreliability may not be realistic. Assessing outcomes based on multiple measures, such as level of curriculum-based measure performance and short norm-referenced assessments of fluency, may improve the reliability of diagnostic decisions.
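To make the psychometric point concrete, the following is a minimal simulation sketch (not the study's actual simulation code): 2 tests share a single latent skill, each is perturbed by measurement error consistent with a chosen reliability, both are dichotomized at the same normative percentile cutoff, and Cohen's kappa is computed between the resulting classifications. The function names, sample size, reliabilities, and cutoff are illustrative assumptions.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two binary classifications (1 = flagged as inadequate responder)."""
    po = np.mean(a == b)                       # observed agreement
    pe = (np.mean(a) * np.mean(b)              # agreement expected by chance
          + np.mean(1 - a) * np.mean(1 - b))
    return (po - pe) / (1 - pe)

def simulate_agreement(n=100_000, reliability=0.85, cutoff_pct=25, seed=0):
    """Two tests measuring the same latent skill, each with the given reliability,
    dichotomized at the same normative percentile cutoff."""
    rng = np.random.default_rng(seed)
    true_skill = rng.standard_normal(n)
    loading = np.sqrt(reliability)             # share of variance due to true skill
    error_sd = np.sqrt(1 - reliability)        # remaining variance is measurement error
    test1 = loading * true_skill + error_sd * rng.standard_normal(n)
    test2 = loading * true_skill + error_sd * rng.standard_normal(n)
    cut1 = np.percentile(test1, cutoff_pct)    # cutoffs set within the simulated norm sample
    cut2 = np.percentile(test2, cutoff_pct)
    flagged1 = (test1 < cut1).astype(int)
    flagged2 = (test2 < cut2).astype(int)
    return cohens_kappa(flagged1, flagged2)

for rel in (0.95, 0.90, 0.85, 0.80):
    print(f"reliability {rel:.2f}: kappa = {simulate_agreement(reliability=rel):.2f}")
```

Because kappa corrects for chance agreement, the agreement between the dichotomized classifications runs noticeably below what the tests' reliabilities might suggest, illustrating why excellent agreement between 2 imperfect tests is difficult to achieve in practice.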