Abstract
Background: The preterm birth rate in the United States rose from 11.0% to 11.4% between 1996 and 1997, and preterm birth remains a complex healthcare problem.
Objective: To compare traditional statistical methods with newer methods, known as data mining or knowledge discovery in databases, in identifying accurate predictors of preterm birth.
Method: An ethnically diverse sample (N = 19,970) of pregnant women provided data (1,622 variables) for new methods of analysis. Preterm birth predictors were evaluated using traditional statistical and newer data mining analyses.
Results: Seven demographic variables (maternal age and binary coding for county of residence, education, marital status, payer source, race, and religion) yielded an area under the receiver operating characteristic (ROC) curve of .72, the measure used to test predictive accuracy. Adding hundreds of other variables increased the area under the curve by only .03.
Conclusion: Similar results across data mining methods suggest that the findings are data-driven rather than method-dependent, and that demographic variables offer a small, parsimonious set of predictors with reasonable accuracy in predicting preterm birth outcomes in a racially diverse population.
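
For readers unfamiliar with the accuracy measure reported in the Results, the area under the ROC curve can be read as the probability that a randomly chosen positive case (here, a preterm birth) receives a higher predicted score than a randomly chosen negative case. The following is a minimal sketch of that pairwise definition, not the study's actual analysis code; the function name `roc_auc` and the toy labels and scores are hypothetical and are not drawn from the study data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 outcomes (1 = positive case, e.g. preterm birth).
    scores: iterable of predicted risk scores, same length as labels.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy, made-up example (hypothetical values, not study data)
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An area of .5 corresponds to chance-level ranking and 1.0 to perfect separation, so the reported .72 sits between the two. The pairwise loop is quadratic in sample size; for a sample as large as the one described here, a rank-based computation would be the more practical choice.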