Abstract
This study developed and validated a rule-based classification algorithm for prediabetes risk detection using natural language processing from home care nursing notes. First, we developed prediabetes-related symptomatic terms in English and Korean. Second, we used natural language processing to preprocess the notes. Third, we created a rule-based classification algorithm with 31 484 notes, excluding 315 instances of missing data. The final algorithm was validated by measuring accuracy, precision, recall, and the F1 score against a gold standard testing set (400 notes). The developed terms comprised 11 categories and 1639 words in Korean and 1181 words in English. Using the rule-based classification algorithm, 42.2% of the notes comprised one or more prediabetic symptoms. The algorithm achieved high performance when applied to the gold standard testing set. We proposed a rule-based natural language processing algorithm to optimize the classification of the prediabetes risk group, depending on whether the home care nursing notes contain prediabetes-related symptomatic terms. Tokenization based on white space and the rule-based algorithm were brought into effect to detect the prediabetes symptomatic terms. Applying this algorithm to electronic health records systems will increase the possibility of preventing diabetes onset through early detection of risk groups and provision of tailored intervention.