6th Annual Public Health Information Network Conference: Statistical Vs Keyword-Based Natural Language Processing: What May Work for Public Health Domain

Statistical Vs Keyword-Based Natural Language Processing: What May Work for Public Health Domain

Wednesday, August 27, 2008: 10:40 AM
Atlanta H
Ninad Mishra, MD, MS , NCPHI, Centers for Disease Control and Prevention, Atlanta, GA
David Cummo, BS , NCPHI, Centers for Disease Control and Prevention, Atlanta, GA
Jason Bonander, MA , NCPHI, Centers for Disease Control and Prevention, Atlanta, GA

Ninad Mishra MD, David Cummo BS, Jim Arnzen BA, Jason Bonander MA

We analyzed hospital discharge summaries released by i2b2 (Informatics for Integrating Biology and the Bedside) for its 2008 Obesity Challenge, a natural language processing data classification challenge. i2b2 is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System.

We compared the effectiveness of two approaches: a statistical approach and a keyword approach. The goal of the experiment was to classify hospital discharge summaries based on evidence of specific morbidities related to obesity. We plan to extend these methodologies to classify publicly available online profiles on the basis of the ‘risky’ behavior (as defined by HIV/STD researchers) that they exhibit.

 

Method: For the statistical approach, we employed a custom-developed naïve Bayesian classifier implementing a Bayes decision rule algorithm. For the keyword approach, we used a custom-developed classifier that checked each discharge summary for the presence of relevant keywords. The keywords consisted of the names of the co-morbidities, their synonyms and abbreviations, and the medications used to treat them.
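
As an illustration only, the two approaches might be sketched in Python as follows. This is not the classifiers used in the study: the keyword list, the tokenizer, the class labels, and the choice of a multinomial naïve Bayes model with add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical keyword list for one co-morbidity (name, abbreviations,
# medications). The abstract does not publish the actual lists.
DIABETES_TERMS = {"diabetes", "dm", "niddm", "metformin", "insulin"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_classify(text, terms):
    """Return True if any keyword occurs as a token in the text."""
    return any(tok in terms for tok in tokenize(text))


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, doc):
        """Bayes decision rule: pick the label with the highest log posterior."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(doc):
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for the example.
train_docs = [
    "patient on metformin for diabetes",
    "insulin dependent diabetes noted",
    "no acute distress chest clear",
    "chest pain resolved no fever",
]
train_labels = ["present", "present", "absent", "absent"]
model = NaiveBayes().fit(train_docs, train_labels)
```

The keyword classifier needs no training data, only a curated term list, whereas the statistical classifier depends entirely on how often distinguishing tokens appear in the labeled summaries.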

Results: The two approaches were compared on precision, recall, and F-measure. The keyword approach performed substantially better than the statistical approach on all three measures (F-measure .8766 versus .5227). The weaker performance of the statistical approach is likely due to the limited amount of training data in the discharge summary corpus. Another factor could be that too few distinguishing tokens appeared often enough for the classifier to separate one morbidity category from another. We believe that the best results in the public health/medical domain could be achieved by judiciously combining the statistical and keyword approaches.

 

 

             Statistical   Keyword
Precision    .4357         .7874
Recall       .6531         .9885
F-Measure    .5227         .8766
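
The reported F-measure values are consistent with the balanced F-measure (F1), the harmonic mean of precision and recall. The abstract does not state which variant was used, so the sketch below assumes F1; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
statistical_f = f_measure(0.4357, 0.6531)
keyword_f = f_measure(0.7874, 0.9885)

print(f"Statistical F-measure: {statistical_f:.4f}")
print(f"Keyword F-measure:     {keyword_f:.4f}")
```

Under this assumption, both computed values agree with the table to four decimal places.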

 
