In research, as well as in clinical practice, different measures are used to evaluate patient status. The usefulness of these measures depends in large part on the reliability and validity of the instruments themselves. This article reviews the concepts of instrument reliability and validity. The article, Comparison of Self-Reported Pain and the PAINAD Scale in Hospitalized Cognitively Impaired and Intact Older Adults After Hip Fracture Surgery (DeWaters et al., 2008), is used, in part, to explain these concepts. DeWaters and colleagues examine the reliability and validity of the PAINAD scale as a measure of pain in cognitively impaired and cognitively intact older adults after hip fracture surgery.
Why worry about the reliability and validity of an instrument? Perhaps the common saying, "garbage in, garbage out," is a relevant response to this question. When you use a measure, you expect that it will accurately reflect the phenomenon or construct of interest; this is the validity of the instrument. Thus, you want a pain measure to reflect pain rather than similar phenomena, such as anxiety or restlessness. Furthermore, you want the measure to be consistent so that you can rely on its accuracy; this is the reliability. To use a comical example to illustrate this point, I share a recent experience that caused me to question the reliability of a measure. I was staying in a hotel and in the morning stepped on the scale so I could monitor my weight loss or gain during my trip. Much to my excitement, I found that in 48 hours I had lost 85 pounds! Lest I get too excited, by the next morning I had gained another 45 pounds! Clearly, I did not have to be a researcher to conclude that the measure was not reliable; there was too much fluctuation in the scores to believe in its accuracy.
When deciding whether to use an instrument, you should scrutinize its reliability and validity. Both properties indicate the extent to which error is present in the measurements the instrument produces.
Instrument Reliability
When examining for reliability, your focus is on the consistency, stability, and repeatability of a data-collection instrument. If the instrument is reliable, it will not vary with chance factors (random error) or environmental conditions; it will have consistent or stable results if repeated over time or if used by two different investigators. Knowing that an instrument is reliable allows you to make interpretations with confidence. With a highly reliable measure, you can predict with confidence that given repeated administrations with the same or similar groups of people, the results should be consistent, assuming the assessed conditions have not changed.
Table 1 summarizes the three different types of reliability to consider when critiquing a measure for use in research as well as in clinical practice. Measurement of stability over time is used when you assume that the concept being measured has remained, or should remain, constant. DeWaters would not be interested in this type of reliability because pain levels are expected to vary over time with changes in condition and treatment. Many of you reading this article have taken the Myers-Briggs Inventory as a measure of your personality or communication style. I myself have taken it five times during the past 20 years. Each time I take the test, I am surprised to find that my scores remain consistent, despite much variability in my work and life experience. It is measuring a concept that is not expected to vary over time. Therefore, on repeated measures, the instrument is stable; it measures the same way (within a reasonable range) each time the test is given.
Reliability testing for equivalence examines consistency between two versions of an instrument or across different people using the instrument. Some psychological and educational measures have different versions of instruments that measure the same construct. It is critical that the different versions be tested for equivalence to be assured that they are, in fact, testing the same thing. To do this, the same subjects would complete both instruments (parallel forms) during the same time period. The correlation between the two parallel forms is the estimate of reliability, as the sketch below illustrates.
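As a minimal illustration of a parallel-forms estimate, assume hypothetical scores for 10 subjects who completed two forms of the same instrument during the same time period; these numbers are invented for demonstration and are not from DeWaters' study.

```python
import numpy as np

# Hypothetical scores (not from DeWaters' study): the same 10 subjects
# completed two parallel forms of an instrument during the same time period.
form_a = np.array([12, 15, 9, 20, 14, 18, 11, 16, 13, 17], dtype=float)
form_b = np.array([13, 14, 10, 19, 15, 17, 12, 15, 14, 18], dtype=float)

# The correlation between the two forms is the estimate of equivalence reliability.
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability estimate: r = {r:.2f}")
```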
Another form of equivalence is inter-rater reliability, which is occasionally referred to as scorer agreement. This type of reliability is important for a measure in which different individuals will be asked to score observations. To be reliable, you would expect consistent scores across different raters. This form of reliability was used by DeWaters. Two research assistants scored 10 videotaped vignettes using the PAINAD tool, and the ratings were then correlated (intraclass correlation) as a test of reliability. Generally, you would anticipate an inter-rater reliability of .90 or greater. In DeWaters' study, the intraclass correlation was .98. DeWaters was able to conclude that there was high inter-rater reliability: when observing the same phenomenon or situation, the different scorers produced consistent results.
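For readers curious about how such a coefficient is calculated, the sketch below computes a two-way, single-rater, absolute-agreement intraclass correlation (Shrout and Fleiss's ICC(2,1)) from invented ratings. Both the data and the choice of this particular ICC form are assumptions for illustration only; DeWaters' article should be consulted for the exact model used in the study.

```python
import numpy as np

# Hypothetical data (not from DeWaters' study): 10 videotaped vignettes,
# each scored on the PAINAD (0-10) by 2 research assistants.
# Rows are vignettes (targets); columns are raters.
scores = np.array([
    [2, 2], [5, 5], [7, 6], [0, 0], [3, 3],
    [8, 8], [1, 2], [4, 4], [6, 6], [9, 9],
], dtype=float)

n, k = scores.shape                       # n vignettes, k raters
grand_mean = scores.mean()

# Two-way ANOVA sums of squares and mean squares.
ss_rows = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()   # between vignettes
ss_cols = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()   # between raters
ss_total = ((scores - grand_mean) ** 2).sum()
ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

# Two-way random-effects, single-rater, absolute-agreement ICC (Shrout & Fleiss ICC(2,1)).
icc = (ms_rows - ms_error) / (
    ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
)
print(f"Inter-rater reliability (ICC): {icc:.2f}")   # values of .90 or higher are desirable
```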
The third category of reliability measurement is internal consistency: a measure of how consistent the results are for different items within the measure (consistency among questions). With internal consistency reliability estimation, a single measurement instrument is administered to a group of people on one occasion. Each question provides a measure of the construct, so if the measure is reliable, one would expect consistency among the questions. There are various internal consistency measures that can be used. The split-half method divides a single questionnaire in half by some random method, and the two halves are correlated; if they consistently measure the same concept, a high correlation will be obtained. Cronbach's alpha, a measure of the average correlation among the items on a scale, is another common measure of internal consistency. Cronbach's alpha is expressed as a correlation coefficient, ranging in value from 0 to +1. An estimate of .70 or higher is desired for judging a scale reliable. In DeWaters' study, the Cronbach alpha was determined for both the cognitively intact and the cognitively impaired groups. Similar alphas were obtained (.846 and .847, respectively), allowing DeWaters to conclude that the PAINAD scale had high internal consistency.
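Cronbach's alpha can be computed directly from item-level data using the formula alpha = (k / (k - 1)) x (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below applies this formula to invented observations on the five PAINAD items (each scored 0 to 2); the numbers are hypothetical and are not DeWaters' data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an item matrix (rows = respondents, columns = items)."""
    k = items.shape[1]                               # number of items on the scale
    item_variances = items.var(axis=0, ddof=1)       # variance of each individual item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses (not DeWaters' data): 8 patients observed on the
# five PAINAD items, each item scored 0-2.
responses = np.array([
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [2, 2, 1, 2, 2],
    [0, 1, 0, 0, 0],
    [1, 2, 2, 1, 1],
    [2, 1, 2, 2, 2],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 2],
], dtype=float)

print(f"Cronbach's alpha: {cronbach_alpha(responses):.3f}")  # .70 or higher is usually judged reliable
```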
Instrument Validity
A test is reliable to the extent that it measures whatever it measures consistently. Reliability alone, however, does not tell you whether the instrument is valid, that is, whether it actually reflects the construct being examined. Reliability and validity are generally reported together because an instrument cannot be valid if it is not reliable.
Validity is a complex phenomenon, and there are many approaches to measuring it, all of which fall under the umbrella of construct validity. A common definition of validity is the extent to which the measurement instrument measures what it says it measures. Thus, a valid PAINAD scale measures pain. But is it possible that some of the items do not measure pain alone but capture another phenomenon, such as confusion? The extent to which the instrument measures confusion rather than pain is an example of systematic error. The more valid the instrument, the less the systematic error (Burns & Grove, 2007).
Content validity addresses the representativeness of the measurement tool in capturing the major elements or different facets relevant to the construct being measured. Content validity asks, "Is the content of this measurement tool representative of the content of the property being measured?" (Kerlinger, 2000). Most instruments are developed from a thorough review of the literature on the concept or from qualitative research findings in which representatives of the relevant population (e.g., people in pain or people with depression) provided data on the experience or construct. In this way, beginning content validity is established. However, to establish the content validity of an instrument, it is necessary to go beyond this. Other "competent" judges should review the content of the items to determine whether, in their judgment, the instrument captures or measures the phenomenon of interest. These judges could be experts in the field or representatives of the relevant population. The judges are given specific definitions of the concept of interest, as well as directions for making their judgment. The independent judgments can be pooled to determine content validity. Frequently, changes to the instrument are made based on the judges' review, and these changes enhance the overall content validity of the instrument. DeWaters' article does not address content validity of the PAINAD tool, but a review of the items suggests that they capture the content of pain. This appearance of validity, referred to as face validity, is not considered a sufficient measure of content validity.
Another approach to validating a new test is concurrent validity. Concurrent validity compares the scores on the new measure with the scores on a known and accepted measure of the concept. DeWaters established concurrent validity by comparing the PAINAD scores with the scores from a self-report numeric rating scale (NRS) of pain (1 to 10). The NRS is valid and clinically useful, and if the new scale, the PAINAD, is valid, one would expect the scores on the two measures to correlate (generally, a correlation of .60 or greater is expected). To determine concurrent validity, both measures are taken at the same time. DeWaters reports on 50 pain observations in which both the PAINAD and the NRS were used to measure pain. The results showed a significant correlation of .83 across all observations.
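A minimal sketch of this kind of check, using invented paired scores rather than DeWaters' data, is shown below: PAINAD and NRS scores collected at the same time are correlated, and a coefficient of roughly .60 or higher would support concurrent validity.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (not DeWaters' data): a PAINAD score and a
# self-reported NRS score collected on the same patient at the same time.
painad = np.array([2, 5, 7, 1, 4, 8, 3, 6, 0, 9], dtype=float)
nrs    = np.array([3, 5, 8, 2, 4, 7, 2, 7, 1, 9], dtype=float)

# The Pearson correlation between the two measures estimates concurrent validity.
r, p_value = stats.pearsonr(painad, nrs)
print(f"Concurrent validity: r = {r:.2f}, p = {p_value:.4f}")
```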
Another form of validity is discriminant validity. Generally, to establish discriminant validity you must show that measures that should not be strongly related are, in reality, not strongly related: that the instrument can discriminate between two related but distinct concepts. For example, you would expect scores from a pain measure and a well-being measure to have low correlations, validating that the measures discriminate between related but different concepts.
DeWaters also examined discriminant validity; however, her approach was not to examine two different measures but to determine whether the measure of interest (the PAINAD) changed logically when conditions changed, which is a measure of the sensitivity of the instrument. This was done by comparing pain scores taken at a time when pain was not expected (comfortably at rest) with scores taken at a time when pain was expected (with movement). The statistic used was the Wilcoxon signed-rank test, which compares paired scores (the same person at two periods: unlikely pain and likely pain) without assuming a normal distribution. The results showed the ability to discriminate, with pain scores higher during periods of likely pain than during periods of unlikely pain.
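The sketch below shows how such a paired comparison might be run, again on invented scores rather than the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical paired PAINAD scores (not DeWaters' data) for the same 10 patients:
# once comfortably at rest (pain unlikely) and once with movement (pain likely).
at_rest  = np.array([0, 1, 0, 2, 1, 0, 1, 2, 0, 1], dtype=float)
movement = np.array([3, 4, 2, 5, 3, 2, 4, 6, 1, 3], dtype=float)

# The Wilcoxon signed-rank test compares the paired scores without assuming normality.
statistic, p_value = stats.wilcoxon(at_rest, movement)
print(f"Wilcoxon signed-rank test: W = {statistic:.1f}, p = {p_value:.4f}")
# A significant result, with higher scores during movement, supports the
# instrument's sensitivity to a change in condition.
```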
A final important point about instrument validity is that an instrument is valid for a given group of people, and validity must be re-established when the instrument is used with different groups. Validity of the PAINAD instrument had been determined in the long-term care setting, a population likely to experience chronic rather than acute pain. DeWaters extended the validity of the instrument by testing it in older adults after hip fracture surgery, both cognitively intact and cognitively impaired, a group likely to experience acute pain. DeWaters' study determined that the PAINAD instrument was both reliable and valid in this population.
REFERENCES