Wednesday, October 20, 2004 - 8:55 AM

Applying CDC Deduplication Toolkit to Test Deduplication

Brandy Altstadter, Immunization Registries, Scientific Technologies Corporation, 67 E. Weldon Avenue, Phoenix, AZ, USA

The CDC Deduplication Toolkit was used as an evaluation tool to help achieve two goals:
1) Improve the quality of the existing registry deduplication, which uses deterministic logic
2) Develop a new deduplication algorithm using probabilistic logic and verify that deduplication specificity and sensitivity are in fact improved with the probabilistic logic.
To ensure adequate testing of the registry's algorithms, extensive analysis was performed on several large datasets. This analysis revealed numerous advantages and disadvantages of the CDC Deduplication Toolkit.
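To make the distinction between the two kinds of logic concrete, the following is a minimal sketch, not the registry's actual algorithms. It assumes a deterministic rule that requires exact agreement on every compared field and a probabilistic rule that sums Fellegi-Sunter style agreement weights. The field names, the m/u probabilities, and the threshold are all hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # Hypothetical record layout; a real registry record has more fields.
    first: str
    last: str
    dob: str            # ISO date string, e.g. "2001-03-14"
    mother_maiden: str


FIELDS = ("first", "last", "dob", "mother_maiden")

# Hypothetical (m, u) pairs per field:
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
WEIGHTS = {
    "first": (0.90, 0.05),
    "last": (0.92, 0.02),
    "dob": (0.97, 0.01),
    "mother_maiden": (0.85, 0.02),
}


def _agree(a: Patient, b: Patient, field: str) -> bool:
    return getattr(a, field).strip().lower() == getattr(b, field).strip().lower()


def deterministic_match(a: Patient, b: Patient) -> bool:
    """Declare a duplicate only when every compared field agrees exactly."""
    return all(_agree(a, b, f) for f in FIELDS)


def probabilistic_score(a: Patient, b: Patient) -> float:
    """Sum log2 agreement/disagreement weights across fields."""
    score = 0.0
    for field, (m, u) in WEIGHTS.items():
        if _agree(a, b, field):
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score


def probabilistic_match(a: Patient, b: Patient, threshold: float = 8.0) -> bool:
    """Declare a duplicate when the total weight reaches the (assumed) threshold."""
    return probabilistic_score(a, b) >= threshold


a = Patient("Maria", "Lopez", "2001-03-14", "Garcia")
b = Patient("Marie", "Lopez", "2001-03-14", "Garcia")   # one-letter name variation
c = Patient("John", "Smith", "1999-07-02", "Brown")     # unrelated record
```

Under these assumed weights, the single name variation between `a` and `b` defeats the deterministic rule but not the probabilistic one, which is the kind of difference a sensitivity comparison is meant to surface.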

The purpose of the analysis was to improve the deduplication algorithms and determine the best methods for verifying the accuracy of deduplication.

Test data from the CDC Deduplication Toolkit were run through two different deduplication algorithms, and the results were analyzed. In addition, a large dataset was run through the same two algorithms and the differences between their results were analyzed. These differences were then compared with the CDC results to determine how commonly each deduplication scenario occurs and whether any scenarios are not covered by the CDC toolkit.
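The comparison described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the analysis code used in the study: each test case is assumed to carry a scenario label, a known true answer, and the decision of each algorithm. The tuple layout, the scenario names, and the sample cases are hypothetical.

```python
from collections import Counter


def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for paired lists of booleans.

    sensitivity = true duplicates found / all true duplicates
    specificity = non-duplicates left alone / all non-duplicates
    """
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def disagreements_by_scenario(cases):
    """Count, per scenario, the cases where the two algorithms disagree."""
    return Counter(
        scenario
        for scenario, _truth, det, prob in cases
        if det != prob
    )


def uncovered_scenarios(toolkit_cases, dataset_cases):
    """Scenarios seen in the larger dataset but absent from the toolkit cases."""
    toolkit = {c[0] for c in toolkit_cases}
    return {c[0] for c in dataset_cases} - toolkit


# Hypothetical cases: (scenario, is_duplicate, deterministic, probabilistic)
toolkit_cases = [
    ("name_typo", True, False, True),
    ("exact_copy", True, True, True),
    ("twins", False, False, True),
    ("unrelated", False, False, False),
]
dataset_cases = toolkit_cases + [("hyphenated_surname", True, False, True)]

truth = [c[1] for c in toolkit_cases]
det_result = sensitivity_specificity(truth, [c[2] for c in toolkit_cases])
prob_result = sensitivity_specificity(truth, [c[3] for c in toolkit_cases])
```

Tallying disagreements by scenario on the larger dataset gives a rough picture of how often each scenario occurs in practice, and the set difference highlights scenarios that would need supplemental test cases.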

The data analysis showed that while the CDC Deduplication Toolkit provides an excellent starting point for validating the quality of deduplication algorithms, it is limited by the number of test cases available and the breadth of the test case scenarios.

The CDC Deduplication Toolkit is a good tool for evaluating deduplication, but users need to be cognizant of its limitations and supplement their evaluation with additional testing that is specific to the dataset being evaluated.

Workshop attendees will learn how to better utilize the CDC Deduplication Toolkit to verify their deduplication results. They will also learn additional scenarios to consider when validating deduplication data.

