20901 Free, Fast, Flexible and Fit for First-Class GIS: Open-Source USPS Address Validation/Standardization QED

Wednesday, September 2, 2009: 10:20 AM
Baker
James L. Tobias, BS, GISP, Northrop Grumman, Atlanta, GA
Robert Borchers, Cancer Reporting System, Wisconsin Division of Public Health, Madison, WI
Introduction: CDC provides comprehensive support for central cancer registries. These and other disease registries provide an unparalleled data resource for population-based analysis of disease incidence, etiology, social burden, and related medical, social, and political policy. Since the mid-1990s, CDC has encouraged improved and expanded use of GIS technology to increase understanding of those dimensions of cancer that are measurably manifest in space and place. Although desktop GIS availability exploded during the 1990s, users soon noted that conventional automated geocoding often failed, sometimes spectacularly. These failures commonly involve 1) too-loose matching of addresses in raw medical records with geo-referenced place inventories and 2) too-limited input requirements.

Methods:  

A new approach to address validation and standardization emerged from several years of 1) reviewing data sources, 2) examining the nature of common errors in raw address information (e.g., in medical records), 3) investigating types of geocoding false positives and false negatives, and 4) exploring search algorithms. Once USPS address data licensing was found to be a relatively minor expense, a program was developed, initially for use by hospitals and clinics in Wisconsin. Design was guided by the desire for search flexibility, speed, and retrieval of standard USPS address data. Incidentally, the programming (open source by virtue of its institutional origin) was done in Visual Studio .NET (a free version of the IDE from Microsoft suffices), using basic SQL syntax that is readily adaptable to many common programming languages and databases.
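To make the validate-then-standardize idea concrete, the following is a minimal sketch in Python with SQLite, not the authors' program. It assumes a hypothetical reference table `ref` of street segments with house-number ranges; the table layout, the function names, and the two sample rows are invented for illustration and do not reflect the licensed USPS data format. The suffix map is a small subset of the abbreviations in USPS Publication 28. A raw address is accepted only if it matches exactly one reference row, which is one way to avoid the too-loose matching described above.

```python
import re
import sqlite3

# Small subset of street-suffix abbreviations (assumption: a real tool
# would load the full list from USPS Publication 28).
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}


def normalize(raw):
    """Uppercase, drop punctuation, and abbreviate a trailing street suffix."""
    tokens = re.sub(r"[^A-Za-z0-9 ]", " ", raw).upper().split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return tokens


def build_reference(conn):
    """Create a hypothetical reference table with made-up sample rows."""
    conn.execute(
        "CREATE TABLE ref (street TEXT, suffix TEXT, lo INTEGER, hi INTEGER,"
        " zip5 TEXT, plus4 TEXT, county TEXT)"
    )
    conn.executemany(
        "INSERT INTO ref VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("MAIN", "ST", 100, 198, "00001", "1234", "EXAMPLE"),
            ("MAIN", "ST", 200, 298, "00001", "1235", "EXAMPLE"),
        ],
    )


def validate(conn, raw, zip5):
    """Return a standardized address dict, or None if there is no unique match."""
    tokens = normalize(raw)
    # Require at least: house number, street name, suffix.
    if len(tokens) < 3 or not tokens[0].isdigit():
        return None
    number = int(tokens[0])
    street = " ".join(tokens[1:-1])
    suffix = tokens[-1]
    rows = conn.execute(
        "SELECT plus4, county FROM ref"
        " WHERE street = ? AND suffix = ? AND zip5 = ?"
        " AND ? BETWEEN lo AND hi",
        (street, suffix, zip5, number),
    ).fetchall()
    if len(rows) != 1:  # reject both misses and ambiguous matches
        return None
    plus4, county = rows[0]
    return {
        "address": f"{number} {street} {suffix}",
        "zip": f"{zip5}-{plus4}",
        "county": county,
    }
```

With the sample rows loaded via `build_reference`, calling `validate(conn, "123 main street.", "00001")` resolves to a single segment, while an out-of-range house number or an input lacking a house number returns `None` rather than a guessed match. The sketch ignores odd/even range parity, unit numbers, and directional prefixes, all of which a production tool would need to handle.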

Results: The program typically enables users at hospitals, clinics, and public health centers to quickly validate raw case delivery addresses and standardize them to the full USPS ZIP+4 format with the associated county.

Conclusion:    

The completeness and reliability of GIS output can be substantially improved through distributed free and open-source applications and up-to-date USPS data, which carries only a small license cost.