Centers for Disease Control and Prevention

Wednesday, October 20, 2004 - 8:35 AM
1

Probabilistic Record Matching and Deduplication Using Open Source Software

Michael C. Berry, HLN Consulting, LLC, 7072 Santa Fe Canyon Place, San Diego, CA, USA and Andres E. Blanco, Family Health, Rhode Island Department of Health, 3 Capitol Hill Rm 302, Providence, RI, USA.


BACKGROUND:
Rhode Island’s KIDSNET integrates data from multiple public health programs, including Immunization, Lead, WIC, Newborn Screening, Hearing Screening, Early Intervention, Home Visiting and Risk Response, and Vital Records. Developed in the mid-1990s, KIDSNET used a simple deterministic algorithm to match incoming data to existing KIDSNET demographic records. By 2004, KIDSNET had accumulated a queue of over 47,000 unmatched records. Working with a limited budget, Rhode Island embarked on a project to improve the matching process and, ultimately, to reduce the number of unmatched records.

OBJECTIVE:
Demonstrate how probabilistic matching and deduplication can be implemented with the help of open source software.

METHOD:
KIDSNET’s unmatched record queue was analyzed, and surveys of matching methods and software options were conducted. Requirements were documented, and an architecture for probabilistic matching, adding, and deduplication in KIDSNET was designed. Febrl (Freely Extensible Biomedical Record Linkage), an open source package, was modified for use within the new framework. Probabilistic parameters were developed, and an extensive six-month testing process followed. The process was placed into production in May 2004.
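
For readers unfamiliar with probabilistic matching, the following is a minimal sketch of the general idea in the Fellegi-Sunter style, where each field contributes an agreement or disagreement weight and the total is compared against two thresholds. It is not Febrl's API and not the KIDSNET configuration: the field names, the m and u probabilities, the thresholds, and the function names are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical per-field parameters (illustrative only, not KIDSNET's):
#   m = P(field agrees | the two records are the same child)
#   u = P(field agrees | the two records are different children)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birth_date": (0.98, 0.003),
    "zip_code": (0.85, 0.05),
}

# Hypothetical thresholds: totals between them go to human review.
UPPER_THRESHOLD = 10.0
LOWER_THRESHOLD = 0.0


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood-ratio weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight):
    """Map a total weight to match, non-match, or human review."""
    if weight >= UPPER_THRESHOLD:
        return "match"
    if weight <= LOWER_THRESHOLD:
        return "non-match"
    return "review"


incoming = {"last_name": "SMITH", "first_name": "ANNA",
            "birth_date": "2001-03-14", "zip_code": "02903"}
existing = {"last_name": "SMITH", "first_name": "ANN",
            "birth_date": "2001-03-14", "zip_code": "02903"}
print(classify(match_weight(incoming, existing)))
```

This sketch compares fields by exact equality to stay short; production record linkage typically uses approximate string comparison, so that a near miss such as ANNA versus ANN is penalized less than a completely different name. A deterministic matcher would reject this pair outright if it required every field to agree, whereas the weighted total lets strong agreement on the other fields outweigh one disagreement.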

RESULT:
The new process, combined with “human review” activity, reduced the number of unmatched records by over 93% in its first three weeks. Probabilistic deduplication combined with an interactive merging interface ensures that the number of duplicates in KIDSNET will remain low even as unmatched children are added to KIDSNET.

CONCLUSION:
There is a middle ground between less expensive, “home grown” deterministic matching and more expensive commercial products. KIDSNET has achieved success in this area with the help of open source software.

LEARNING OBJECTIVES:
To understand the software options for probabilistic matching, and to identify the potential benefits and limitations of implementing probabilistic matching with the help of open source software.

[ Recorded presentation ]

See more of Deduplication: Challenges and Solutions
See more of The 2004 Immunization Registry Conference