David Shields, QS Technologies, Inc, 37 Villa Road, Suite 508, Greenville, SC, USA
Learning Objectives for this Presentation:
By the end of the presentation, participants will be able to apply two simple probability-based methods to evaluate the effectiveness of their Immunization Registry deduplication strategy.
Background:
During a deduplication project for the CDC, researchers had difficulty determining when they had found all of the duplicates. Two probabilistic approaches were devised to check the effectiveness of the deduplication. Both techniques can be used by Data Quality Managers for Immunization Registries.
Objectives:
Our objective is to evaluate software tools that locate sets of records in a database that pertain to the same person.
Methods:
Two methods are presented:
1. The first method examines “Bridge Matches” and extrapolates from them an estimated count of the matches that were not found. This approach requires deduplication software that identifies pairs of records as duplicates.
2. The second method uses SQL to examine the distribution of dates of birth among records with uncommon names. It can produce an estimated count of duplicates before deduplication, and it can also estimate the duplicates remaining in a “fully deduplicated” database.
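The second method can be sketched as a single query. The example below is only an illustration under stated assumptions, not the query used in the project: the `patients` table, its `last_name`, `first_name`, and `dob` columns, the `max_name_count` threshold for what counts as an uncommon name, and the final extrapolation to the whole table are all hypothetical choices. It treats records that share both an uncommon name and a date of birth as suspected duplicates, then scales that rate to the full database.

```python
import sqlite3


def estimate_duplicates_from_uncommon_names(conn, max_name_count=3):
    """Estimate duplicates from uncommon names that share a date of birth.

    Assumes a hypothetical table patients(last_name, first_name, dob).
    A name is treated as uncommon when it occurs at most max_name_count times.
    """
    sql = """
        WITH uncommon AS (
            -- Names that occur rarely enough to be treated as uncommon.
            SELECT last_name, first_name
            FROM patients
            GROUP BY last_name, first_name
            HAVING COUNT(*) <= :max_count
        ),
        name_dob AS (
            -- Count records per (name, date of birth) among uncommon names.
            SELECT p.last_name, p.first_name, p.dob, COUNT(*) AS n
            FROM patients AS p
            JOIN uncommon AS u
              ON u.last_name = p.last_name AND u.first_name = p.first_name
            GROUP BY p.last_name, p.first_name, p.dob
        )
        SELECT COALESCE(SUM(n), 0)     AS uncommon_records,
               COALESCE(SUM(n - 1), 0) AS suspected_duplicates
        FROM name_dob
    """
    records, suspected = conn.execute(sql, {"max_count": max_name_count}).fetchone()
    total = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
    # Assumption: the duplicate rate seen among uncommon names applies to all records.
    estimate = total * suspected / records if records else 0.0
    return {
        "uncommon_records": records,
        "suspected_duplicates": suspected,
        "estimated_total_duplicates": estimate,
    }
```

Running the same function before and after deduplication gives two comparable estimates. The threshold and the extrapolation step are the places where a real registry would most likely need to adjust this sketch.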
Results:
The first method will underestimate the missed duplicates if the software producing the pairs is weak. The second method is independent of the software, but may give skewed results for other reasons. Both methods estimate how many duplicates exist, but neither identifies which records they are.
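The dependence of the first method on the software's output can be illustrated with a small sketch. The abstract does not define a “Bridge Match” or give the extrapolation formula, so the code below rests on an assumption: a bridge match is taken to be a pair of records that the software never reported directly but that is implied by a chain of reported pairs (A matches B and B matches C, so A and C are the same person). The function name and the returned fields are hypothetical. It only counts such pairs and stops short of any extrapolation.

```python
from collections import Counter


def bridge_match_summary(pairs):
    """Count directly reported pairs and pairs implied only through a chain.

    pairs: iterable of (record_id_a, record_id_b) reported as duplicates.
    Assumption: a "bridge" pair is one implied by transitivity but not reported.
    """
    # Normalize so (1, 2) and (2, 1) count once, and drop self-pairs.
    direct = {tuple(sorted(p)) for p in pairs if p[0] != p[1]}

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in direct:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Size of each cluster of records linked by reported pairs.
    cluster_sizes = Counter(find(x) for x in list(parent))
    # A cluster of k records implies k * (k - 1) / 2 matching pairs.
    implied = sum(k * (k - 1) // 2 for k in cluster_sizes.values())

    return {
        "direct_pairs": len(direct),
        "implied_pairs": implied,
        "bridge_pairs": implied - len(direct),
        "direct_rate": len(direct) / implied if implied else None,
    }
```

Under this reading, software that reports fewer pairs produces fewer and smaller clusters, so fewer bridge pairs are observed and any estimate built on them comes out low.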
Conclusions:
Most large public health databases continue to contain multiple records that represent the same person, despite our best efforts and the use of sophisticated software. The methods described here can help Data Quality Managers evaluate software packages and other techniques that remove or link these duplicate records.
See more of The 40th National Immunization Conference (NIC)