In his excellent post “New matching engines go beyond apples and oranges”, Winfried van Holland states that traditional matching engines are based on atomic string comparison functions, like match-codes, phonetic comparison, Levenshtein string distance and n-gram comparisons. He further argues that the drawback of these functions is that it’s not always clear for what purpose one needs to utilize a particular function, and that these low-level DQ functions cannot distinguish between apples and oranges – you end up comparing family names with street names.
Good point! In essence, this is the core of the debate about matching approaches in customer data management. Intelligent, automated matching of records spread across heterogeneous data sources is an essential prerequisite for correct and adequate customer data integration, and there are many opinions on how to achieve it.
In the theory of data matching, two methods generally prevail where customer data management is concerned: deterministic and probabilistic matching.
- Deterministic matching uses, among other things, country- and subject-specific knowledge, linguistic rules (such as phonetic conversion and comparison), business rules, and algorithms (such as letter transposition or contextual acronym resolution) to determine the degree of similarity between database records.
- Probabilistic matching uses statistical and mathematical algorithms, fuzzy logic and contextual frequency rules to assign the degree of similarity between database records. Fault-tolerance patterns play an important role here: the matching method can take into account that humans make specific errors. Probabilistic matching methods usually express the probability of a match as a percentage.
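To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular matching engine works: the function names are hypothetical, the deterministic rule covers just one case (two adjacent letters swapped), and the "probabilistic" side uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a real statistical match probability.

```python
from difflib import SequenceMatcher


def is_adjacent_transposition(a: str, b: str) -> bool:
    """Deterministic rule: True if b is a with exactly two adjacent letters swapped."""
    if len(a) != len(b) or a == b:
        return False
    # Positions where the two strings disagree.
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def similarity_percent(a: str, b: str) -> float:
    """Stand-in for a match probability: string similarity as a percentage.

    Assumption: a real probabilistic matcher would derive this figure from
    statistics about the data, not from a generic string ratio.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(100 * ratio, 1)


# The rule gives a yes/no answer; the score gives a graded one.
rule_hit = is_adjacent_transposition("Smith", "Simth")
score = similarity_percent("Smith", "Smyth")
```

The deterministic rule either fires or it does not, while the score leaves it to a threshold (or a human reviewer) to decide what counts as a match.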
Both methods have advantages and disadvantages, but I believe (following the train of thought in “Matching engines go beyond apples and oranges”) that the two methods should always be combined. The reason is quite simple: the better the matching engine can determine what is what in a particular context, the better it can calculate the probability of a match or a non-match. This is, in essence, what humans do: we determine what we know and then use contextual probability and pattern recognition to assign meaning to the words we come across.
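A toy sketch of that combination, under loud assumptions: the keyword list `STREET_WORDS` and the two-label classifier are invented for illustration (real engines rely on far richer country- and subject-specific knowledge), and `difflib.SequenceMatcher` again stands in for a proper probability calculation. The point is only the order of operations: first decide what a value is, then compare like with like.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical, deliberately tiny knowledge base for the deterministic step.
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave", "lane"}


def classify(value: str) -> str:
    """Deterministic step: label a value as 'street' or 'family_name'."""
    tokens = value.lower().replace(".", "").split()
    if tokens and tokens[-1] in STREET_WORDS:
        return "street"
    return "family_name"


def similarity(a: str, b: str) -> float:
    """Untyped comparison: happily compares apples with oranges."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def typed_similarity(a: str, b: str) -> Optional[float]:
    """Score two values only when they are the same kind of thing."""
    if classify(a) != classify(b):
        return None  # a family name versus a street name: not comparable
    return similarity(a, b)


untyped = similarity("Baker", "Baker Street")       # a score, though meaningless
typed = typed_similarity("Baker", "Baker Street")   # None: different kinds
same_kind = typed_similarity("Baker", "Bakker")     # two family names, scored
```

The untyped function returns a respectable-looking score for a family name against a street name; the typed version refuses the comparison and reserves the scoring for values of the same kind.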
Combining deterministic and probabilistic matching will yield more precise matching, with fewer mismatches and fewer missed matches. Probabilistic matching often uses weighting schemes that consider the frequency of information to calculate a score and/or ranking. The more common a particular data element is, the lighter the weight that should be used in a comparison. That is a sound and robust approach. However, assigning weighting factors to data that have already been interpreted and enhanced with statistical information will lift the matching results to a high level of precision.
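As a rough illustration of frequency-based weighting, the sketch below gives each value a weight of log(N / count), where N is the number of records. This inverse-frequency formula is one common choice and is an assumption here, not a formula taken from any specific matching product; the sample surnames and function names are made up.

```python
import math
from collections import Counter


def frequency_weights(values: list[str]) -> dict[str, float]:
    """Weight each distinct value by log(total / count): rarer means heavier."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log(total / count) for value, count in counts.items()}


def agreement_score(a: dict[str, str], b: dict[str, str],
                    weights: dict[str, dict[str, float]]) -> float:
    """Sum the weights of all fields on which two records agree exactly."""
    score = 0.0
    for field, field_weights in weights.items():
        if field in a and a.get(field) == b.get(field):
            score += field_weights.get(a[field], 0.0)
    return score


# Made-up sample: a common, a less common and a rare family name.
surnames = ["Smith"] * 6 + ["Jones"] * 3 + ["Zielinski"]
weights = {"surname": frequency_weights(surnames)}

common = agreement_score({"surname": "Smith"}, {"surname": "Smith"}, weights)
rare = agreement_score({"surname": "Zielinski"}, {"surname": "Zielinski"}, weights)
```

Two records that agree on a rare family name score higher than two that agree on a common one, which is exactly the "lighter weight for common data" idea. Note that a value shared by every record gets a weight of zero, since agreeing on it says nothing.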