Have you heard this story about reliability and validity? Joe walks into an ice cream shop every night at 5 pm and orders a milkshake. Every night at about 5:05 pm, the soda clerk gives him a milkshake that tastes just like it did the night before. Joe loves this because he knows he can walk into this ice cream shop and always get a drink he likes that tastes the same way each time. One day Joe invites a friend from work to join him. They both go into the shop and order what Joe thinks is the reliable milkshake. When it is delivered, however, his friend from work tastes the drink and declares, "This isn't a milkshake; it's an ice cream soda."
This story illustrates the concepts of reliability and validity. Although the clerk was reliably delivering the same drink night after night, he was not delivering a drink that actually fit the definition of a milkshake; therefore, the statement that the drink actually was a milkshake was not valid.
For Joe in the ice cream shop, it may not make much difference that he was receiving an ice cream soda and not a milkshake, but for neuroscience nurses measuring physical concepts such as weight or temperature or behavioral concepts such as brain impairment or disability, the instrument measuring the concept clearly needs to be both reliable and valid. Consequently, two universal challenges of any measurement tool are reliability and validity. Neuroscience nurses who use tools in practice, as well as researchers, must ask themselves two important questions: What is the reliability of the measurement instrument? What is the validity of the measurement instrument? This article will review types of reliability and validity, sometimes referred to collectively as the psychometric testing of an instrument. Relevant examples are used to illustrate the importance of reliability and validity to neuroscience nurses.
A measurement instrument that is reliable is one that is stable or consistent across time (Kerlinger, 1986). In statistical terms, reliability is the ability of an instrument to measure something consistently and repeatedly. It is easiest to picture reliability when thinking about physical measures such as weight. When measuring weight, given that all other variables are the same (e.g., the amount of food consumed), if a scale weighs a person at 120 pounds today, that same scale should weigh that person at 120 pounds the next day. However, understanding reliability in behavioral measures normally used by neuroscience nurses is a little more confusing.
The reliability of a behavioral measure is the ability of that measure to produce the same results each time it measures a construct (idea). The most common types of reliability are test-retest reliability, split-half reliability, and internal consistency reliability. Test-retest reliability means that each time a test is administered, the results will be the same. When measuring both brain impairment behaviors and disability, for example, if the scale used to measure each concept is administered to a group of people today, their answers should look similar 2 weeks from now given that all other variables are the same.
The statistical comparison measure used for test-retest reliability is the Pearson's r correlation coefficient; it can range from +1.00 to -1.00. A Pearson's r correlation coefficient of +1.00 indicates a perfect positive relationship, 0.00 indicates no relationship, and -1.00 indicates a perfect negative relationship (Munro, 2005).
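As a minimal sketch of how such a coefficient is obtained, the following Python example computes Pearson's r for two administrations of a scale. The scores are hypothetical values invented for illustration, and the function name `pearson_r` is an assumption of this sketch rather than part of any cited study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of the deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each administration
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one set of scores never varies")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people, 2 weeks apart
time_1 = [12, 18, 25, 30, 41]
time_2 = [14, 17, 27, 29, 40]

r = pearson_r(time_1, time_2)
print(f"test-retest r = {r:.2f}")
```

A value near +1.00 here would suggest that people who scored high at the first administration also scored high at the second.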
Cameron and colleagues (2008) used test-retest reliability to develop the Brain Impairment Behavior Scale (BIBS). Clinical team members tested the scale with 37 participants on two occasions 2 weeks apart. The correlation coefficients of .75, .88, .82, and .81, respectively, were reported for each of the four subscales, indicating strong positive relationships between the two administrations of the scale (Cameron et al.).
Two-week test-retest was used in the psychometric testing of the Americanized version of the Guy's Neurological Disability Scale (GNDS). A Pearson's r correlation of .91 was reported; this indicates a strong relationship between the amount of disability measured at different times (Fraser & McGurl, 2007).
Split-half reliability compares one half of a test to the other half based on the assumption that all items should be comparable in measuring one construct and the results should be similar. If there were 20 items on a measure, the first 10 items would be compared to the second 10 items. The scores on the two halves are correlated, and the Spearman-Brown formula is then used to estimate the reliability of the full-length test from that correlation.
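The split-half procedure can be sketched in Python as follows. This is an illustration under stated assumptions: the responses are hypothetical, the helper names are invented for this sketch, and the split is first half versus second half as in the 20-item example above. The Spearman-Brown step uses the standard two-half form, corrected r = 2r / (1 + r).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def split_half_reliability(responses):
    """responses: one list of item scores per respondent.

    Returns (correlation between halves, Spearman-Brown corrected estimate).
    """
    half = len(responses[0]) // 2
    # Total score on the first half and on the second half, per respondent
    first = [sum(row[:half]) for row in responses]
    second = [sum(row[half:]) for row in responses]
    r_half = pearson_r(first, second)
    # Spearman-Brown: estimate reliability of the full-length test
    corrected = (2 * r_half) / (1 + r_half)
    return r_half, corrected

# Hypothetical answers from five respondents on a 4-item scale
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 4],
    [5, 4, 5, 5],
]

r_half, corrected = split_half_reliability(responses)
print(f"half-test r = {r_half:.2f}, Spearman-Brown estimate = {corrected:.2f}")
```

The correction is needed because each half contains only half the items, and shorter tests tend to be less reliable than the full instrument.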
Internal consistency reliability is more complicated, because in this measure of reliability we are establishing how well each item in a scale measures the same construct. Internal consistency reliability often is measured with a statistic called Cronbach's alpha coefficient (Munro, 2005). This is a way of looking at the extent to which items on an instrument fit together. Cronbach's alpha reliability coefficient normally ranges between 0 and 1.0. The closer the resulting number is to 1.0, the greater the internal consistency of the items on the scale. In behavioral measures, a perfect coefficient of 1.0 would not be expected. As a rule of thumb, some professionals require a reliability of .70 or higher (obtained on a substantial sample) before they will use an instrument.
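For readers who want to see the arithmetic, here is a minimal Python sketch of Cronbach's alpha using its standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The responses are hypothetical and the function name is an assumption of this sketch.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: one list of item scores per respondent."""
    k = len(responses[0])  # number of items on the scale
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Transpose so each tuple holds every respondent's score on one item
    items = list(zip(*responses))
    sum_item_variances = sum(variance(item) for item in items)
    # Variance of each respondent's total score across all items
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical answers from five respondents on a 4-item scale
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 4],
    [5, 4, 5, 5],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
print("meets the .70 rule of thumb" if alpha >= 0.70 else "below the .70 rule of thumb")
```

When the items rise and fall together across respondents, the variance of the total scores is large relative to the summed item variances, and alpha moves toward 1.0.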
Cameron and colleagues (2008) reported Cronbach's alpha coefficients ranging from .78 to .91 for the four domains of their 18-item BIBS. Because the Cronbach's alpha coefficients all were greater than .70, the items within each subscale were assumed to be measuring the same thing. For example, with a Cronbach's alpha of .89, the items in the apathy subscale appear to be measuring apathy.
Another example of the measurement of internal consistency occurs in the work of Fraser and McGurl (2007), who reported Cronbach's alpha values for the entire GNDS for each administration of the scale. The Cronbach's alpha was .79 at Time 1, .78 at Time 2, and .80 at Time 3, indicating good internal consistency (Fraser & McGurl).
Validity in behavioral measures refers to how well the instrument measures the construct it says it is measuring (Kerlinger, 1986). For example, if an instrument is measuring disability, is it really measuring disability or is it measuring impairment? Establishing validity can be accomplished in many ways.
Content validity is established by having a panel of experts familiar with the construct being measured judge the content of the instrument to establish how well they believe the items actually measure the content. Multiple judges usually are used and their answers are compared to establish their level of agreement.
If Joe (in the opening example) had invited his colleague to the ice cream shop sooner, he would have had the benefit of the colleague's expert opinion and would have learned that his milkshake was in fact an ice cream soda. Another example of content validity is the research of Fraser and McGurl (2007), who stated that the content validity of the GNDS was determined by 49 neurologists who represented many countries.
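One simple way to compare judges' answers is to compute, for each item, the proportion of judges who rated it relevant to the construct. The Python sketch below illustrates that idea only; the ratings are hypothetical, the item labels and function name are invented, and it is not the procedure reported by Fraser and McGurl.

```python
def item_agreement(ratings):
    """ratings: dict mapping an item label to a list of booleans,
    one per judge (True = the judge rated the item as relevant)."""
    return {item: sum(votes) / len(votes) for item, votes in ratings.items()}

# Hypothetical ratings from five judges on three items
ratings = {
    "item_1": [True, True, True, True, True],
    "item_2": [True, True, False, True, True],
    "item_3": [True, False, False, True, False],
}

agreement = item_agreement(ratings)
for item, share in agreement.items():
    print(f"{item}: {share:.0%} of judges rated it relevant")
```

Items with a low proportion are candidates for revision or removal before the instrument is tested further.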
Construct validity refers to the theoretical soundness of the instrument, that is, how well it captures the construct it is intended to measure. This is established in multiple ways. When developing a behavioral instrument, authors usually hypothesize relationships between the new instrument and other established measures. For example, in disability, one might hypothesize that there would be a relationship between disability and activities of daily living (ADLs). So a person who is more disabled would, in theory, have more difficulty managing his or her ADLs. This process of establishing construct validity is involved and generally requires multiple studies.
Factor analysis is a statistical process that is used to establish how individual items cluster around a given dimension. Subscales can be developed in this manner. Exploratory factor analysis often is used in the early stages of instrument development (Munro, 2005). A study by Cameron and colleagues (2008) reports a factor analysis of the BIBS. In the beginning, the instrument had 37 items and four factors were identified when factor analysis was completed: apathy, comprehension/memory problems, depression/emotional distress, and irritability (Cameron et al.).
The technique of factor analysis was used in a slightly different manner in the testing of the GNDS. Based on previous research on the scale, a four-factor solution was tried, revealing a different configuration of items loading (clustering) on the factors. The authors concluded that the items on the Americanized version of the GNDS should not be conceptualized as falling together to form consistent subscales and recommended that a 15-item version be subjected to further testing (Fraser & McGurl, 2007).
Neuroscience nurses should base interventions on evidence. To do so, it is important to become good consumers of research. When reading research or scanning articles, it is important to determine if the findings are sound. If the measures that researchers use are not reliable and valid, their findings are not reliable or valid. Every time research is used, reliability and validity are some of the criteria upon which neuroscience nurses should base their evaluation of research.
Cameron, J. I., Cheung, A. M., Streiner, D. L., Coyte, P. C., Singh, M. D., & Stewart, D. E. (2008). Factor structure and reliability of the brain impairment behavior scale. Journal of Neuroscience Nursing, 40(1), 40-47.
Fraser, C., & McGurl, J. (2007). Psychometric testing of the Americanized version of the Guy's Neurological Disability Scale. Journal of Neuroscience Nursing, 39(1), 13-19.
Kerlinger, F. (1986). Foundations of behavioral research (3rd ed.). Orlando, FL: Harcourt Brace Jovanovich.
Munro, B. (2005). Statistical methods for health care research (5th ed.). Philadelphia: Lippincott Williams and Wilkins.