Professional data matching engines are becoming more and more intelligent. At Human Inference, we see that our own matching techniques can draw on more and more of this intelligence as well, and we naturally build it into our engines so that they adapt to the way humans do their matching.
Traditional data quality or matching engines were based on atomic string comparison functions like match-codes, phonetic comparison, Levenshtein string distance, n-gram comparisons or similar functions. These kinds of functions are relatively easy to implement and to use, although a significant amount of plumbing is needed to get reasonable results. Open source projects like the Lucene search engine, and its variants, provide a solid and proven set of these functions. The drawback of these functions is that it’s not always clear which function to use for which purpose. An even larger issue is that these low-level DQ functions cannot distinguish between apples and oranges: you end up comparing family names with street names. We still see vendors, BI vendors for example, claiming to provide data quality functionality while offering only these atomic comparisons.
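To make the idea of an atomic comparison concrete, here is a minimal sketch in Python of two of the functions mentioned above: Levenshtein distance and a bigram (n-gram with n = 2) similarity. The function names and the choice of the Dice coefficient for the bigram score are illustrative assumptions, not the API of Lucene or of any Human Inference engine.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(text: str) -> set:
    """Set of overlapping two-character substrings, case-insensitive."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram sets (an assumed scoring choice)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(levenshtein("kitten", "sitting"))      # 3
print(bigram_similarity("night", "nacht"))   # 0.25
print(levenshtein("Baker", "Baker"))         # 0
```

Note that the last call returns 0 whether "Baker" is a family name or part of "Baker Street". The function only sees characters, which is exactly the apples-and-oranges problem described above.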
For years, Human Inference has been developing matching engines that look beyond these primary functions: engines capable of identifying given names, surnames, family names, postal codes, titles, initials, etc. The true benefit of this approach is that matching results are significantly better, because you are comparing apples with apples and oranges with oranges. The gluing or plumbing needed in this approach to validate street or family names stays completely under the hood, out of sight of the data stewards. With a correct set of reference data, the right mix of atomic functions and, not least, solid domain knowledge, these matching engines are capable of quickly and adequately finding duplicates, including ones that go beyond simple typos.
The complexity in matching apples starts when you take into account the variants of apples or, to put it in data quality terms, when you take into account that each country or region has its own more or less subtle differences in the use of names, streets, measurements and writing systems.
The moment you value these differences, you also recognize new opportunities. You will notice that by looking at an apple, you get information on oranges. By looking at the name Белоусовa (Beloussowa), you might recognize a family name and that you’re dealing with a female. By looking at the number 681012-2355, you might recognize that this is a valid Swedish personnummer, and that the birth date of this male is October 12, 1968. By looking at an email address like Winfried.vanHolland@humaninference.com, you might recognize a given name “Winfried”, that you’re dealing with a male, that he has the surname “van Holland” and that he is working for a company called Human Inference, and I leave it up to you to guess which country he originates from. By retrieving additional information out of obvious information, the matching moves beyond the apples and oranges, and becomes easier, faster and more accurate.
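The personnummer example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any Human Inference engine works: the function names are hypothetical, the century is passed in by the caller because the ten-digit form does not encode it, and special cases such as coordination numbers or the "+" separator for people aged 100 or over are ignored. The check digit uses the Luhn algorithm, and the second-to-last digit is odd for males and even for females.

```python
from datetime import date


def luhn_check_digit(nine_digits: str) -> int:
    """Luhn check digit over the first nine digits (weights 2,1,2,1,...)."""
    total = 0
    for index, char in enumerate(nine_digits):
        value = int(char) * (2 if index % 2 == 0 else 1)
        total += value // 10 + value % 10  # add the digits of the product
    return (10 - total % 10) % 10


def parse_personnummer(number: str, century: int = 1900):
    """Return birth date and gender for a YYMMDD-NNNC number, or None
    if it is not valid. The century default is an assumption."""
    digits = number.replace("-", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return None
    if luhn_check_digit(digits[:9]) != int(digits[9]):
        return None
    try:
        birth_date = date(century + int(digits[0:2]),
                          int(digits[2:4]),
                          int(digits[4:6]))
    except ValueError:  # impossible calendar date
        return None
    gender = "male" if int(digits[8]) % 2 == 1 else "female"
    return {"birth_date": birth_date, "gender": gender}


print(parse_personnummer("681012-2355"))
# {'birth_date': datetime.date(1968, 10, 12), 'gender': 'male'}
```

A single field thus yields three facts a matching engine can use: the identifier is well-formed, the person was born on October 12, 1968, and the person is male.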