Background
Critical appraisal tools (CATs) help readers to rate research papers and are used in systematic reviews, evidence-based practice and journal clubs.1,2 There are many well-known CATs available such as the Jadad scale,3 Maastricht scale,4 Critical Appraisal Skills Programme tools,5 Assessment of Multiple Systematic Reviews6 and Single-Case Experimental Design scale.7 However, these and other CATs suffer from similar problems. First, most CATs were designed to appraise either one or a small number of research designs.1,8 When a reader wants to appraise many papers that use a diverse range of qualitative and quantitative research designs, or that use multiple or mixed methods, then they must use multiple CATs. The scores from multiple CATs cannot be compared because they may use different scoring systems, design features or assumptions that are incompatible. Second, the majority of CATs lack the depth to fully appraise research8,9 or have scoring systems that are insufficient to accurately reflect the content of research papers.10-12 In either of these cases, the resultant score from the CAT can be compromised and, as a result, defects in the research may be hidden or not fully considered by a reader. Third, very few CATs have any validity and reliability data available.1,13,14 This means that there may be no evidence that a particular CAT is effective or consistent in appraising research.
The Crowe Critical Appraisal Tool (CCAT)1,15,16 was designed to overcome the problems outlined above. First, the CCAT was built on a review of the design of 44 CATs across all research designs.1 These CATs were analysed using a combination of the constant comparative method,17,18 standards for the reporting of research19-24 and research methods theory.25-27 This analysis led to the development of a tool consisting of eight categories (preliminaries, introduction, design, sampling, data collection, ethical matters, results and discussion) divided into 22 items, which were further divided into 98 item descriptors.1 The combination of categories, items and item descriptors allows a wide range of qualitative and quantitative health research to be appraised using one tool.1,15,16 Second, a comprehensive user guide was produced that is considered vital to obtaining valid scores from the CCAT. Scoring is described in the user guide as a combination of subjective and objective assessment in which each category is scored from 0 (the lowest score) to 5 (the highest score). Third, evaluations of score validity15 and reliability16 were completed for the CCAT. These preliminary assessments showed that the scores obtained had a reasonable degree of validity and that the CCAT could be considered a reliable means of appraising health research across a wide range of research designs. The CCAT and user guide, as used in this study, are available as additional material online (Appendix S1).
However, while undertaking research into the CCAT,15,16 two questions arose with regard to CATs in general. First, a search of the literature revealed only one article that tested whether using a CAT is an improvement over not using one to appraise research.28 Therefore, although it has been assumed that using a CAT is the better option, there is little evidence to substantiate this assertion. The second question was whether a reader's subject matter knowledge or research design knowledge influences the scores awarded to a research paper. In other words, when a reader looks for evidence as a basis for their practice, does their subject matter or research design knowledge affect how they rate research papers? If subject matter knowledge or research design knowledge does affect appraisal, then this may lead to situations where only evidence that reinforces current knowledge is incorporated into practice, while evidence that is new to, or contradicts, a reader's knowledge is discarded, no matter how worthy.
Teaching and implementation of evidence-based practice may be improved by exploring the relationship between using a CAT versus not using a CAT and the influence of subject matter knowledge and research design knowledge on the appraisal of research papers. Therefore, the aims of this study were:
1. to investigate whether using a CAT versus not using a CAT (i.e. informal appraisal) affected how readers appraise a sample of health research papers and
2. to examine whether subject matter knowledge or research design knowledge affected how readers appraise a sample of health research papers.
Methods
The CAT used in the study was the CCAT. The CCAT was used because it was known to the authors; score validity and reliability data were available; and it could be used across all health research designs, removing a potential confounder whereby a different CAT could be required for each research design. The alternative to using a CAT was an informal appraisal of research papers, where no CAT was supplied to participants. The outcome measure was the rating (total score as a percentage) of health research papers using either the CCAT or informal appraisal.
Design
Potential participants were asked to take part in the study through a series of invitations emailed to academic/research staff and postgraduate research students in: the School of Public Health, Tropical Medicine and Rehabilitation Science; the School of Nursing, Midwifery and Nutrition; and the School of Medicine and Dentistry, James Cook University, Australia.
Participants were match paired by the principal investigator (MC) based on their level of research experience so that participants with similar experience were allocated to each research group. Research experience was determined by a pre-enrolment questionnaire that asked participants to indicate: how many years they had been involved in research; on how many research projects they had worked; on how many projects they had been lead or principal researcher; and a subjective assessment of their level of research experience on a scale from 1 (novice) to 5 (expert). This measure of research experience was not validated because it was used only to match participants rather than as a conclusive measure of research experience. No additional inclusion or exclusion criteria were used.
When all participants had been match paired, they were randomly assigned by the principal investigator to either the informal appraisal (IA) group (control) or the CCAT group (intervention), using the random sequence generator available from http://www.random.org.29 The principal investigator was not blinded to the groups to which participants were allocated. Blinding was not considered necessary because participants scored papers individually, without input from the principal investigator. Participants were informed that they could contact the principal investigator with any general questions regarding the study. However, questions concerning how to score a research paper, whether using the CCAT or not, would not be answered because this could affect the scores awarded and bias the results obtained. Furthermore, participants were requested not to discuss the study with other participants, if they became aware of them, until data collection was completed.
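The match-pair allocation described above can be sketched in code. This is an illustrative reconstruction only (the study used the sequence generator at http://www.random.org, whereas this sketch uses Python's random module), and the participant identifiers and function name are hypothetical:

```python
import random

def assign_pairs(pairs, seed=None):
    """Randomly split each experience-matched pair between the two groups.

    pairs: list of (participant_a, participant_b) tuples, where each tuple
    holds two participants with a similar level of research experience.
    Returns a dict mapping group name to the list of assigned participants.
    """
    rng = random.Random(seed)  # illustrative stand-in for the random.org sequence
    groups = {"IA": [], "CCAT": []}
    for a, b in pairs:
        if rng.random() < 0.5:
            groups["IA"].append(a)
            groups["CCAT"].append(b)
        else:
            groups["IA"].append(b)
            groups["CCAT"].append(a)
    return groups
```

Because one member of every pair goes to each group, the two groups remain balanced on research experience regardless of the random draws.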
Sampling
A sample size calculation showed that six raters reading five papers each were required to achieve an intraclass correlation coefficient (ICC) of 0.90 (α = 95%, 1 − β = 0.79, rmin = 0.55).30 Two separate groups were required, which meant a minimum of 12 participants in total.
Health research papers to be scored were randomly selected using the random sequence generator available from http://www.random.org.29 The research papers were selected from a larger pool of papers used in two other studies.15,16 In brief, the larger pool was randomly selected from the full-text articles subscribed to by James Cook University, Australia, through OvidSP (Ovid, New York, NY, USA). Research papers in the larger pool were chosen based on the research design used in each paper, the possible categories being: true experimental, quasi-experimental, single system, descriptive, exploratory or observational, qualitative and systematic review. The five randomly selected papers were:
1. true experimental: Arts MP, Brand R, van den Akker EM, Koes BW, Bartels RH, Peul WC. Tubular diskectomy vs. conventional microdiskectomy for sciatica: a randomised controlled trial. JAMA 2009; 302: 149-58;
2. quasi-experimental: Polanczyk G, Zeni C, Genro JP et al. Association of the adrenergic α2A receptor gene with methylphenidate improvement of inattentive symptoms in children and adolescents with attention-deficit/hyperactivity disorder. Arch Gen Psychiatry 2007; 64: 218-24;
3. single system: Jais P, Haissaguerre M, Shah DC et al. A focal source of atrial fibrillation treated by discrete radiofrequency ablation. Circulation 1997; 95: 572-76;
4. qualitative: Beck C. Postpartum depressed mothers' experiences interacting with their children. Nurs Res 1996; 45: 98-104 and
5. systematic review: Singh S, Kumar A. Wernicke encephalopathy after obesity surgery: a systematic review. Neurology 2007; 68: 807-11.
Data collection
All data were collected in August and September 2010. Each participant was supplied with a copy of the research papers to be appraised, instructions on what was required and forms on which to record their scores (see Appendix S1 online for copies of the forms). For the IA group, participants were asked to read each research paper thoroughly and to rate each paper on a scale from 0 (the lowest score) to 10 (the highest score). No further instructions were given on how to determine the score for a paper other than to use their best judgement. For the CCAT group, participants were asked to read each paper thoroughly and to complete a CCAT form for each paper. The CCAT form was supplied with an extensive user guide to help participants use the tool as effectively as possible.
Participants in both groups were also asked to indicate their subject matter knowledge and their research design knowledge for each research paper. The scale used for both subject matter knowledge and research design knowledge was from 0 (no knowledge) to 5 (extensive knowledge).
Data analysis
When the appraisal forms were returned, the total scores for the CCAT group were checked by adding the individual category scores. Total scores for the research papers in the IA and CCAT groups were then converted to percentage scores so that the rating of papers could be compared. The reliability of scores was calculated using the ICC and generalisability theory (G theory). An analysis of covariance (ANCOVA) between the dependent variable (total score) and covariates (subject matter knowledge and research design knowledge) was also completed.
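The score conversion described above can be sketched as follows, assuming the CCAT total is the sum of eight category scores of 0-5 each (maximum 40, per the tool's design) and the IA rating is out of 10; both are expressed as percentages so the two groups can be compared on the same scale. The function names are illustrative:

```python
def ccat_percent(category_scores):
    """Convert eight CCAT category scores (0-5 each) to a percentage."""
    if len(category_scores) != 8 or not all(0 <= s <= 5 for s in category_scores):
        raise ValueError("expected eight category scores between 0 and 5")
    return 100 * sum(category_scores) / 40  # maximum total is 8 x 5 = 40

def ia_percent(score):
    """Convert an informal appraisal rating (0-10) to a percentage."""
    if not 0 <= score <= 10:
        raise ValueError("expected a rating between 0 and 10")
    return 100 * score / 10
```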
Ethics
Ethical approval for this study was obtained from James Cook University Human Ethics Committee (H3415) and the study conformed to the Declaration of Helsinki.31 Written informed consent was obtained from each participant before they took part in the study. Participants could withdraw at any stage without explanation or prejudice. The authors have no potential conflicts of interest or sources of funding to declare.
Results
A total of 19 people responded to the invitation to participate in the study, and 10 participants (53%) completed the study. Despite repeated emails to attract further participants to the study, no other participants were forthcoming. Eight participants were academic/research staff and two were postgraduate students. Eight participants (not all of them staff) were from the School of Public Health, Tropical Medicine and Rehabilitation Science; one participant was from the School of Nursing, Midwifery and Nutrition; and one participant was from the School of Medicine and Dentistry. The flow of participants through the study is indicated in Figure 1.
Participants were match paired based on their responses to the pre-enrolment questionnaire. Four participants (two pairs) had a low level of research experience, four participants (two pairs) had a medium level and two participants (one pair) had a high level. There was no difference between the IA group and the CCAT group based on research experience. Subject matter knowledge was positively skewed in both the IA group and the CCAT group (i.e. more participants stated they had low levels of knowledge than high levels). For research design knowledge, both groups had normally distributed data. There was no statistical difference (Mann-Whitney U-test) between the IA group and the CCAT group for subject matter knowledge (U = 298.5, z = −0.29, P = 0.78 two-tailed) or research design knowledge (U = 270.0, z = −0.85, P = 0.40 two-tailed).
The maximum score in the IA group was 90%, the minimum score was 30% (range 60%) and the average score was 67% with a standard deviation of 16%. The maximum score in the CCAT group was 98%, the minimum was 25% (range 73%) and the average score was 67% with a standard deviation of 22%. Total scores in both groups were normally distributed. In the IA group, Kendall's tau correlation coefficient showed a significant weak positive relationship (τ = 0.38, P = 0.03) between total score and subject matter knowledge. There was no significant relationship between total score and subject matter knowledge for the CCAT group, or between total score and research design knowledge for either group.
Reliability, based on total score, was calculated in SPSS version 18.02 (SPSS, Chicago, IL, USA) using the ICC for multiple raters. Reliability for the IA group showed an ICC for consistency of 0.84 and for absolute agreement of 0.76 (Table 1a). The CCAT group had an ICC for consistency of 0.89 and for absolute agreement of 0.88.
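As a sketch of the two ICC forms reported here, the average-measures consistency and absolute agreement coefficients can be computed from a papers × raters score matrix via the two-way ANOVA mean squares (the Shrout and Fleiss formulation on which SPSS's ICC output is based). The score matrix in the test is hypothetical, not the study's data:

```python
def icc_average(ratings):
    """Average-measures ICCs from a papers x raters matrix of scores.

    Returns (consistency, absolute_agreement): consistency ignores
    systematic rater differences; absolute agreement penalises them.
    """
    n = len(ratings)      # papers
    k = len(ratings[0])   # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    paper_means = [sum(row) / k for row in ratings]
    rater_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA mean squares: papers (rows), raters (columns), residual.
    msr = k * sum((pm - grand) ** 2 for pm in paper_means) / (n - 1)
    msc = n * sum((rm - grand) ** 2 for rm in rater_means) / (k - 1)
    sse = sum((ratings[i][j] - paper_means[i] - rater_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))

    consistency = (msr - mse) / msr
    absolute = (msr - mse) / (msr + (msc - mse) / n)
    return consistency, absolute
```

For example, if every rater ranks the papers identically but one rater scores systematically higher, consistency is 1.0 while absolute agreement falls below 1.0.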
A G study (Table 1b), using G_String_III,32 demonstrated where error occurred in the total scores. The IA group had 76% of variance attributable to the paper, 10% attributable to the rater and 14% attributable to the paper × rater interaction. The CCAT group had 88% of variance attributable to the paper, 1% attributable to the rater and 11% attributable to the paper × rater interaction. Taking an a priori minimum acceptable G coefficient of 0.75, a D (decision) study (Table 1c) showed that in the IA group, three raters would be required to achieve the relative G coefficient and five raters would be required for the absolute G coefficient. In the CCAT group, two raters would be required to achieve both the relative and absolute G coefficients.
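The D-study projection works by dividing the error variances by the number of raters averaged over. A minimal sketch of the two G coefficients for a fully crossed papers × raters design, using hypothetical variance components (not the study's estimates):

```python
def g_coefficients(var_paper, var_rater, var_interaction, n_raters):
    """Relative and absolute G coefficients when scores are averaged
    over n_raters raters; the interaction term is confounded with
    residual error in a fully crossed papers x raters design."""
    relative = var_paper / (var_paper + var_interaction / n_raters)
    absolute = var_paper / (var_paper + (var_rater + var_interaction) / n_raters)
    return relative, absolute

def min_raters(var_paper, var_rater, var_interaction, target=0.75, absolute=False):
    """Smallest number of raters whose averaged scores reach the target G."""
    n_raters = 1
    while True:
        relative, abs_g = g_coefficients(var_paper, var_rater,
                                         var_interaction, n_raters)
        if (abs_g if absolute else relative) >= target:
            return n_raters
        n_raters += 1
```

Because the rater variance enters only the absolute coefficient, a tool that shrinks the rater effect (as the CCAT did here) narrows the gap between the relative and absolute rater requirements.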
Analysis of covariance (Table 2) was used to determine whether raters (considered a random factor) were influenced by their subject matter knowledge or research design knowledge in appraising each paper. Assumptions of independence, normality, linearity, homogeneity and independence of covariates were met before the analysis of covariance was undertaken. There were significant results in the IA group for subject matter knowledge (F(1,18) = 7.03, P < 0.05 one-tailed, partial η² = 0.28) and rater (F(4,18) = 4.57, P < 0.05 one-tailed, partial η² = 0.50). There were no significant results for the CCAT group.
Discussion
Even though both groups had the same average score, the range for the IA group was narrower than that for the CCAT group; the CCAT group had a lower minimum and a higher maximum score. Therefore, it could be concluded that the CCAT had better discriminatory power than informal appraisal. In other words, finer distinctions could be made between papers using the CCAT.
With regard to reliability, it was expected that the scores from the CCAT group would be more reliable than those from the IA group because of the more structured approach to appraising the papers. This expectation was borne out: the CCAT group's ICC for consistency was 0.05 higher, and its ICC for absolute agreement 0.12 higher, than the IA group's. Furthermore, the CCAT almost eliminated the rater effect (variance in total scores caused by variability in how a rater scored a paper): the rater effect was 1% in the CCAT group compared with 10% in the IA group. The D study also showed that fewer raters would be required to achieve similar reliability using the CCAT than using informal appraisal, especially where absolute agreement was sought (two vs. five raters).
In the IA group, there was a significant subject matter knowledge effect (f = 0.63) and a weak positive Kendall's tau correlation between total score and subject matter knowledge (τ = 0.38, P = 0.03). This means that, with rater variance and research design knowledge variance taken into account, knowledge of subject matter had a significant effect on total scores in the IA group: the greater a rater's subject matter knowledge, the higher the score they awarded a paper. The ANCOVA also reinforced the significant rater effect (f = 1.00) for the IA group, as was apparent in the G study, and showed that the rater effect was larger than the subject matter knowledge effect. This was as expected, considering that subject matter knowledge is a characteristic of a rater.
The G study, ANCOVA and D study results show that using the CCAT appeared to neutralise any effects the raters or their subject matter knowledge had on the appraisal of the research papers. In other words, using the CCAT instead of an informal appraisal of research papers should help raters with different subject matter knowledge reach similar conclusions about a paper. This, in turn, has the potential to reduce poor conclusions being drawn from research papers and may even improve the implementation of evidence into practice.
The results did not show what other characteristics of the raters, besides subject matter knowledge (a significant effect) or research design knowledge (no effect), influenced the IA group's appraisal of the research papers. The level of research experience, which was used to match pair participants, could not be used because fewer participants were recruited than initially hoped for and the method used to determine researcher experience was not validated. Another limitation of this study was the small number of papers appraised. The same result may not be found if a large number of papers were appraised. Future research should address these two issues.
Conclusion
For the researcher, the decision on whether to use a CAT or an informal appraisal of research papers is clear: a structured approach is better. The CCAT was developed from theory and empirical evidence to work across multiple research designs, has a substantial user guide and has a published body of score validity and reliability data. The CCAT was shown to reduce the influence that raters and their subject matter knowledge had on the appraisal of research papers. Finally, by being a consistent and structured tool, the CCAT may in turn lead to improved understanding of findings and better application of evidence into practice.
Acknowledgement
The authors wish to thank Anne Jones (James Cook University) for her contribution to this paper.
References