Monday, 28 October 2002 - 2:20 PM
47

This presentation is part of B8: Protect: Data Quality — Part I

Record Matching Challenges

Martin Buechi and Andrew Borthwick. ChoiceMaker Technologies, 41 East 11th Street, 11th Floor, New York, NY, USA


KEYWORDS:
approximate record matching, de-duplication, technical

BACKGROUND:
Immunization data entries must be linked to the correct child in an immunization registry database even if the entry contains spelling errors, variations in formatting, or a new address. In the absence of a unique identifier, such as an SSN, this requires approximate record matching. High-quality record matching is critical for an immunization registry: a mistaken match may cause underimmunization, and a missed match may cause overimmunization. Both types of error also lead to erroneous statistics. Yet building a good matching system is very difficult.

OBJECTIVE(S):
The objective of the talk is to describe some challenges of record matching and a particularly successful solution.

METHOD(S):
We describe a number of typical problems that one faces when implementing approximate record matching. These problems include the conflicting requirements of speed and accuracy, spelling errors, and contradictory information. For each of these problems we sketch the solution that we found for the New York City Immunization Registry and the Master Child Index. For example, to meet the speed and accuracy requirements we built a two-phase matching system that takes value frequencies into account in both phases. Spelling errors are compensated for by a combination of different phoneticization and approximate string comparison functions. Contradictory information is weighed with a combination of a probabilistic model and rules.
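
The two-phase idea can be illustrated with a minimal sketch. It is not ChoiceMaker's implementation: the record fields (first, last, dob), the blocking keys, the weights, and the threshold are all assumptions made for illustration, and a hand-weighted score stands in for the probabilistic model and rules. Value frequencies are also left out. Phase 1 uses a phonetic code (American Soundex) and the birth date to retrieve a small candidate set quickly; phase 2 scores only those candidates with an approximate string comparison from the Python standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Letter-to-digit table for American Soundex.
_SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def block_keys(rec):
    """Phase 1 keys (assumed): phonetic last name, or exact birth date."""
    return {("last", soundex(rec["last"])), ("dob", rec["dob"])}


def build_index(records):
    """Map each blocking key to the positions of the records that carry it."""
    index = defaultdict(set)
    for i, rec in enumerate(records):
        for key in block_keys(rec):
            index[key].add(i)
    return index


def similarity(a, b):
    """Approximate string comparison in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(a, b):
    """Phase 2 score; the weights are illustrative, not fitted to data."""
    s = 0.4 * similarity(a["first"], b["first"])
    s += 0.4 * similarity(a["last"], b["last"])
    s += 0.2 if a["dob"] == b["dob"] else 0.0
    return s


def match(entry, records, index, threshold=0.85):
    """Return (score, position) pairs at or above the threshold, best first."""
    candidates = set()
    for key in block_keys(entry):
        candidates |= index.get(key, set())
    scored = [(score(entry, records[i]), i) for i in candidates]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


records = [
    {"first": "Katherine", "last": "Smith", "dob": "1999-04-12"},
    {"first": "Robert", "last": "Jones", "dob": "2000-01-30"},
]
index = build_index(records)
entry = {"first": "Kathrine", "last": "Smyth", "dob": "1999-04-12"}
matches = match(entry, records, index)
```

Here the misspelled entry is retrieved through both blocking keys and scored only against the one candidate record, never against the whole registry. Blocking on several keys trades some speed for recall: a record whose last name is badly misspelled can still be found through the birth date.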

RESULT(S):
Statistics on NYC DOH measurements of ChoiceMaker’s accuracy will be presented.

CONCLUSION(S):
ChoiceMaker successfully solves the record matching challenges of de-duplication, linkage, and processing of new entries for the NYC DOH.

LEARNING OBJECTIVES:
The audience will get a better understanding of the typical problems and possible solutions for approximate record matching. This may be useful for evaluating possible solutions to a record matching problem in an immunization registry.


Web Page: www.choicemaker.com

Back to The 2002 Immunization Registry Conference of CDC