Statistical vs. Keyword-Based Natural Language Processing: What May Work for the Public Health Domain
Ninad Mishra MD, David Cummo BS, Jim Arnzen BA, Jason Bonander MA
We analyzed hospital discharge summaries which were released by i2b2 (Informatics for Integrating Biology and the Bedside) for a natural language processing data classification challenge (2008 Obesity Challenge). i2b2 is an
We compared the effectiveness of two approaches, a statistical approach and a keyword approach, for classifying hospital discharge summaries according to evidence of specific obesity-related co-morbidities. We plan to extend these methods to classify publicly available online profiles on the basis of the ‘risky’ behavior (as defined by HIV/STD researchers) that they exhibit.
Method: For the statistical approach, we used a custom-developed naïve Bayes classifier that implements the Bayes decision rule. For the keyword approach, we used a custom-developed classifier that checked each discharge summary for relevant keywords: the names of the co-morbidities, their synonyms and abbreviations, and the medications used to treat them.
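The two classifiers described in the Method can be sketched roughly as follows. This is only an illustrative sketch, not the authors' custom implementation: the tokenizer, the add-one (Laplace) smoothing, the multinomial model, and the names `tokenize`, `keyword_classify`, and `NaiveBayes` are all assumptions introduced here, and the example texts are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_classify(text, keywords):
    """Return True if any keyword occurs in the text as a whole word or phrase.

    `keywords` stands in for the list of co-morbidity names, synonyms,
    abbreviations, and medications; its contents are hypothetical.
    """
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(k.lower()) + r"\b", lowered)
        for k in keywords
    )


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (assumed variant)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        """Bayes decision rule: choose the class with the highest posterior."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(doc):
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, for illustration only.
train_docs = [
    "patient has hypertension treated with lisinopril",
    "history of htn on lisinopril",
    "no chest pain normal exam",
    "normal blood pressure no complaints",
]
train_labels = ["present", "present", "absent", "absent"]
nb = NaiveBayes().fit(train_docs, train_labels)

print(nb.predict("htn controlled with lisinopril"))
print(keyword_classify("Pt with HTN, on lisinopril.", ["hypertension", "htn", "lisinopril"]))
```

The sketch highlights the practical difference between the two: the keyword matcher needs a curated term list but no training data, whereas the naive Bayes model depends entirely on how well the training corpus covers each category.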
Results: We compared the two approaches on precision, recall, and F-measure. The keyword approach performed substantially better than the statistical approach on all three measures. The weaker performance of the statistical approach is likely due to the limited amount of training data in the discharge summary corpus. Another possible factor is that too few distinguishing tokens appeared often enough for the classifier to separate one morbidity category from another. We believe that the best results in the public health and medical domain could be achieved by judiciously combining the statistical and keyword approaches.

             Statistical   Keyword
Precision    .4357         .7874
Recall       .6531         .9885
F-Measure    .5227         .8766
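As a check on how the reported figures relate to one another, the sketch below recomputes the F-measure from precision and recall. It assumes that the six reported values pair up as Statistical then Keyword for each metric, and that the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state either point explicitly, although the numbers are consistent with both. The function name `f_measure` and the `beta` parameter are introduced here for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Return the F-beta score; beta=1 gives the balanced F1 score."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported, under the assumed column order.
reported = {
    "Statistical": (0.4357, 0.6531),
    "Keyword": (0.7874, 0.9885),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f_measure(p, r):.4f}")
```

Under these assumptions the recomputed values match the reported F-measures of .5227 and .8766 to four decimal places.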