Recently, there has been a renewed push by the statistical community away from the use of the P value as the sole determinant of whether a given study finding is important. First, the American Statistical Association (ASA), the largest association of statisticians worldwide, published a statement on statistical significance and P values.1 The statement highlighted several shortcomings, including the erroneous interpretation of the P value as the probability that the null hypothesis about a study question is true, and the practice of basing conclusions, and often health policy decisions, on whether a P value falls above or below a certain threshold such as .05. This statement was followed by an entire issue of the ASA's journal The American Statistician, with over 40 individual contributions on how the scientific community can move beyond the use of P < .05.2
To get a better understanding of the issues addressed in these 2 publications, let us look first at the interpretation of P values and statistical significance.
A P value is the probability of obtaining a test statistic calculated from sample data (eg, a mean difference between 2 groups) that is equal to or more extreme than the one actually observed, assuming that the null hypothesis is true. In this sense, the P value measures how compatible the sample data are with the null hypothesis (eg, that there is no difference between the groups). A high P value indicates that data such as those observed are likely if the null hypothesis is true, whereas a small P value indicates that the observed data are unlikely under the null hypothesis. In other words, a low P value is conventionally taken as sufficient evidence to reject the null hypothesis (and to conclude that a difference or effect has been observed).
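To make this definition concrete, the brief sketch below uses hypothetical data for 2 groups and SciPy's two-sample t test (our illustration; none of these numbers come from a published study) to show how a P value is obtained in practice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=15.0, size=30)  # hypothetical measurements, group A
group_b = rng.normal(loc=108.0, scale=15.0, size=30)  # hypothetical measurements, group B

# P value: probability, assuming no true difference between the groups,
# of a t statistic at least as extreme as the one observed
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```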
In 1925, Ronald Fisher began using the threshold P = .05 as a measure of statistical significance; however, as early as 1885, Francis Edgeworth had used the phrase "statistical significance" as a tool to judge whether a result merited further study, rather than to assign a label of scientific importance to a finding, which is how it has been used since.3,4 The choice of this threshold is relatively arbitrary: a significance level of .05 translates to a 5% chance of being wrong, in other words, a 1 in 20 chance of rejecting a null hypothesis when it is actually true. This chance of being wrong is known as a type 1 error.
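A small simulation, offered here only as an illustration, makes the 1 in 20 figure tangible: when both groups are drawn from the same distribution, so that the null hypothesis is true by construction, roughly 5% of tests still yield P < .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, alpha = 10_000, 0.05
rejections = 0

for _ in range(n_trials):
    a = rng.normal(0, 1, 25)  # both groups drawn from the same distribution,
    b = rng.normal(0, 1, 25)  # so the null hypothesis is true by construction
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

# The proportion of "significant" results approximates the type 1 error rate
print(f"false-positive rate: {rejections / n_trials:.3f}")  # close to 0.05
```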
Using P values in a dichotomous way (ie, only looking at whether a given P value falls below the typically used significance level α of .05) often leads to the decision that an observed outcome is not relevant based solely on a P value above .05. This decision in turn often leads to reporting only those findings deemed deserving because they reached statistical significance (ie, fell below the .05 threshold), resulting in biased reporting, missed opportunities, and wasted resources. A good example is described in a review of the use of P values in research studies by McShane and Gal.5 The authors report on 2 studies investigating the same outcome that came to opposite conclusions despite obtaining essentially the same effect size. These 2 studies investigated the effect of subcutaneous heparin compared to intravenous heparin as initial treatment for deep vein thrombosis; even though the resulting odds ratios (ORs) were almost identical (OR = 0.62 vs OR = 0.61), one study team concluded that subcutaneous heparin was more effective than intravenous heparin while the other concluded the opposite, each basing its conclusion solely on the obtained P value. One study reported P > .05, while the other reported P < .05. Such a puzzling finding can result from various differences between the 2 studies, including sample size. Because statistical significance depends largely on sample size, if one study has a much larger sample size than the other, the conclusions may differ when a threshold such as .05 is used to dichotomize results into statistically significant versus not, even if the study findings are identical. This occurrence highlights the common problem that research findings are often judged by whether they reach a certain level of statistical significance. However, statistical significance is not related to clinical importance in any way. This means that a result with no clinical relevance can be judged statistically significant at a given significance level because of a large sample size, while a result from a small study could have clear clinical implications but is deemed not significant because the P value fell above the threshold. Clearly, there is a need for change.
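To illustrate the role of sample size, the sketch below uses hypothetical 2 × 2 counts (not the actual heparin trial data) chosen so that the odds ratio is roughly 0.6 in both cases; the smaller study yields P > .05 while a study 10 times larger yields P < .05 for the same odds ratio.

```python
import numpy as np
from scipy import stats

def odds_ratio_and_p(a, b, c, d):
    """Odds ratio and Wald-test P value from a 2x2 table of counts (a, b / c, d)."""
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of the log odds ratio
    z = log_or / se
    p = 2 * stats.norm.sf(abs(z))                 # two-sided P value
    return np.exp(log_or), p

# Hypothetical small study: OR of about 0.60, P > .05
print(odds_ratio_and_p(8, 42, 12, 38))
# Hypothetical study 10 times larger with the same proportions: same OR, P < .05
print(odds_ratio_and_p(80, 420, 120, 380))
```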
In an effort to move nursing beyond P < .05, a group of 25 statisticians working in schools of nursing across the country recently published an editorial in several nursing journals, including the International Journal of Nursing Studies,6 to promote the following changes to journal guidelines: when a P value is reported, the actual value should be provided; determination of the relevance of a finding should not be based on a threshold such as .05; and when P values are reported, appropriate measures of effect size should be included, along with a corresponding interval estimate such as a confidence interval.
What does this mean for the WOC nurse?
These changes will involve modifications to publication practices so that P values are no longer used in a dichotomous fashion. This includes no longer reporting P values only as <.05 or >.05, using "ns" for values above the traditional threshold, or using asterisks to indicate different levels of significance (one asterisk for .05, 2 for .01, and 3 for .001). Instead, any P value should be reported as the actual value obtained. At the same time, P values should not be the only findings reported for the study results. Each P value should be accompanied by an appropriate effect size; this can include raw effect sizes such as means, proportions, and correlation or regression coefficients, as well as computed effect sizes such as the well-known Cohen d, so that readers have sufficient information to draw appropriate conclusions. Along with effect sizes, appropriate measures of uncertainty, such as standard errors and interval estimates (ie, confidence intervals), should be reported. Finally, these findings should be interpreted in terms of their clinical relevance in the context of the research questions. Confidence intervals should be interpreted not with a focus on whether the null value is contained in the interval, but by explaining what the upper and lower limits of the interval mean clinically and in the context of the study. For example, for a hypothesized difference in body weight reduction of 12 lb between 2 (hypothetical) study groups of adults, with a lower confidence limit of 2 lb and an upper confidence limit of 22 lb, the confidence interval would be compatible both with a clinically important weight reduction (22 lb) and with a weight loss not considered clinically important (2 lb).
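As one possible illustration of this style of reporting, the sketch below uses hypothetical weight-loss data to report a mean difference, a Cohen d, a 95% confidence interval, and the exact P value together, rather than a significance label alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
diet_a = rng.normal(loc=22.0, scale=10.0, size=20)  # lb lost, hypothetical group A
diet_b = rng.normal(loc=10.0, scale=10.0, size=20)  # lb lost, hypothetical group B

# Raw effect size: difference in mean weight loss between the groups
diff = diet_a.mean() - diet_b.mean()

# Standardized effect size: Cohen d using the pooled standard deviation
n1, n2 = len(diet_a), len(diet_b)
sp = np.sqrt(((n1 - 1) * diet_a.var(ddof=1) + (n2 - 1) * diet_b.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = diff / sp

# 95% confidence interval for the mean difference
se = sp * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

# Exact P value reported alongside, not as a significance label
t_stat, p_value = stats.ttest_ind(diet_a, diet_b)
print(f"difference = {diff:.1f} lb, d = {cohens_d:.2f}, "
      f"95% CI [{ci_low:.1f}, {ci_high:.1f}] lb, P = {p_value:.3f}")
```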
Returning to the example mentioned earlier, if both study teams had interpreted their finding of an OR of approximately 0.6 clinically, we would expect to see the same conclusion, in contrast to the published reports, which focused on an arbitrary threshold to determine whether a finding was important. Focusing on the observed effect size together with its associated uncertainty, in other words, how precisely the findings obtained from the sample data estimate the population effect, along with the P value, will lead to better reporting and thus better science.
REFERENCES