What do you do when basic string comparison (fuzzy search) techniques don’t give the right results? Fuzzy search helps to find matches when people make typos (e.g. Human Inference versus Human Inverence), make up abbreviations (King str. versus King street) or ignore diacritics (Sørensen versus Soerensen). If the ‘wrong word’ is not an existing word, it is obvious that we have a match once the typo is corrected.
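As a rough illustration of such a basic fuzzy comparison, here is a minimal Python sketch. It assumes the standard library's difflib similarity ratio as the comparison measure and a tiny hand-made folding table (the name SPECIAL and its contents are made up for this example); a real product would use far more complete rules.

```python
import difflib
import unicodedata

# Assumption: a tiny hand-made table for letters that Unicode
# decomposition does not split into base letter + accent.
SPECIAL = {"ø": "oe", "Ø": "Oe", "æ": "ae", "Æ": "Ae", "ß": "ss"}

def fold(text):
    """Lower-case the text and remove diacritics."""
    text = "".join(SPECIAL.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def similarity(a, b):
    """Return a similarity between 0.0 and 1.0 for two folded strings."""
    return difflib.SequenceMatcher(None, fold(a), fold(b)).ratio()

print(fold("Sørensen"), fold("Soerensen"))
print(similarity("Human Inference", "Human Inverence"))
print(similarity("King str.", "King street"))
```

The diacritics example folds to the same string, and the typo example scores close to 1.0. The abbreviation example scores lower, which already hints that pure string comparison needs help.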
More challenges appear if the typo produces another existing word; now we need to decide how equal the two entries are. If you know how frequently words are used, you can use that in the equation. How to get the frequency of usage for words is another ballgame, but at least you can assume that a ‘wrong word’ is never used (a bit of a paradox).
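One possible way to put frequency ‘in the equation’ is a simple Bayes-style estimate, sketched below. Everything here is an assumption for illustration: the function name chance_of_typo, the typo_rate of 2% and the frequency counts are all invented, not measured.

```python
def chance_of_typo(observed, intended, freq, typo_rate=0.02):
    """Estimate how likely `observed` is a mistyped `intended`.

    freq maps a word to how often it is used; typo_rate is an assumed
    probability that a word gets mistyped into this particular variant.
    """
    as_typo = freq.get(intended, 0) * typo_rate
    as_written = freq.get(observed, 0) * (1 - typo_rate)
    total = as_typo + as_written
    return as_typo / total if total else 0.0

# Invented counts, purely for illustration.
freq = {"Miller": 9000, "Milner": 300}

# 'Mitler' is never used, so it can only be a typo of 'Miller'.
print(chance_of_typo("Mitler", "Miller", freq))
# 'Milner' exists as a name, so the answer is far less certain.
print(chance_of_typo("Milner", "Miller", freq))
```

A word that is never used gets the full weight as a typo, while an existing but rarer word ends up somewhere in between, and that is exactly where a decision is needed.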
A large group of matches that fuzzy search methods miss (i.e. missed matches) are entries that sound the same but are written rather differently. Often a call center agent types a name exactly as they hear it. An example would be the family names ‘Farren’ and ‘Pharan’. They differ in so many characters that it becomes rather hard for a string comparison to treat both entries as equal. Phonetic search would definitely help here. The drawback of relying on phonetics alone is that you may now combine entries that are certainly not matches (i.e. mismatches), e.g.:
- René Meierhofer and
- Renée Mayrhofer
These are two valid family names, and the given names indicate a male and a female person.
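To make both effects visible, here is a minimal sketch of a phonetic key in Python. It is a simplified Soundex-style code with one extra assumption that is not part of classic Soundex: ‘PH’ is rewritten to ‘F’ before coding. The function name phonetic_key is made up for this example.

```python
import unicodedata

# Soundex-style consonant groups: letters in one group get the same digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def phonetic_key(name):
    """Return a 4-character Soundex-style key (simplified sketch)."""
    decomposed = unicodedata.normalize("NFKD", name)
    letters = "".join(ch for ch in decomposed
                      if ch.isascii() and ch.isalpha()).upper()
    letters = letters.replace("PH", "F")  # assumption: treat PH as F
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so repeated codes count again
    return (key + "000")[:4]

for name in ("Farren", "Pharan", "Meierhofer", "Mayrhofer", "René", "Renée"):
    print(name, phonetic_key(name))
```

‘Farren’ and ‘Pharan’ both get the key F650, which is the match we wanted. But ‘Meierhofer’ and ‘Mayrhofer’ also share a key (M616), and so do ‘René’ and ‘Renée’ (R500), which is the mismatch described above.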
In a real-life example, we would expect a complete name with titles, and we’d still need to match it correctly. Take, for example,
- Dr. John J. Farren jr.
- John J. Pharan jr. PhD
Searches based on pure string comparison won’t work in this case. The complete entry could be matched by combining some smart synonyms for academic titles with an n-gram or matrix comparison of the individual elements.
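A minimal sketch of that combination could look like the Python below. It assumes a tiny, hand-made title table (TITLE_SYNONYMS, with placeholder tokens such as ‘<doctor>’), bigram Dice similarity as the n-gram measure, and a simple ‘best score per element’ reading of the comparison matrix. All names are hypothetical.

```python
# Assumption: a tiny illustrative table mapping titles and suffixes
# to a shared placeholder token.
TITLE_SYNONYMS = {"dr": "<doctor>", "phd": "<doctor>",
                  "jr": "<junior>", "junior": "<junior>"}

def tokens(name):
    """Split a name into lower-case elements and normalise known titles."""
    words = [w.strip(".,").lower() for w in name.split()]
    return [TITLE_SYNONYMS.get(w, w) for w in words if w]

def bigrams(word):
    """Return the set of 2-grams of a word, padded with spaces."""
    padded = f" {word} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def dice(a, b):
    """Dice coefficient on bigram sets, between 0.0 and 1.0."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def match_score(name_a, name_b):
    """Average, over the elements of name_a, of the best score in name_b."""
    ta, tb = tokens(name_a), tokens(name_b)
    if not ta or not tb:
        return 0.0
    matrix = [[dice(a, b) for b in tb] for a in ta]  # every element vs. every element
    return sum(max(row) for row in matrix) / len(matrix)

a = "Dr. John J. Farren jr."
b = "John J. Pharan jr. PhD"
print(tokens(a))
print(tokens(b))
print(match_score(a, b))
```

The titles and the element order no longer get in the way, so the overall score is high. The one element that still scores poorly is Farren versus Pharan, which n-grams alone cannot fix and where a phonetic comparison would have to step in.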
Introducing synonyms immediately generates new types of challenges. In address matching you will go a long way when you take into account the abbreviations for street types (Av. for Avenue, Str. for Street, etc.). For company names it definitely helps to have a synonym table of legal forms (Ltd for Limited, Inc. for Incorporated, etc.). With the actual company name itself it becomes more challenging. A German example might look like:
- Fahrrad-Handel Anna Cintula and
- Zweirad-Shop Anna Cintula
‘Fahrrad-Handel’ and ‘Zweirad-Shop’ are two synonyms for bike shop. In such situations people quite often think that adding a synonym table makes the challenge go away. They are absolutely right for part of the problem, but there is still a large set of words whose meaning depends on context, and that context determines which synonym applies. Take the following three entries; it seems evident that we cannot replace the word ‘art’ with one single synonym here:
- Art Gallery Garfunkel
- ART Auto Rendition Technology
- Paul Simon & Art Garfunkel
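The difference between the two situations can be sketched in a few lines of Python. The synonym table and the context rule below are toy assumptions made up for this example (including the names SYNONYMS, normalize and role_of_art); real context handling needs much more than a three-way rule.

```python
# Assumption: a tiny illustrative synonym table.
SYNONYMS = {
    "av": "avenue", "str": "street",
    "ltd": "limited", "inc": "incorporated",
    "fahrrad-handel": "<bike shop>", "zweirad-shop": "<bike shop>",
}

def normalize(entry):
    """Lower-case the elements of an entry and apply the synonym table."""
    words = [w.strip(".,").lower() for w in entry.split()]
    return [SYNONYMS.get(w, w) for w in words if w]

def role_of_art(entry):
    """Toy context rule: guess what the element 'art' means in an entry."""
    words = entry.split()
    idx = [w.lower() for w in words].index("art")
    if words[idx].isupper():
        return "acronym"
    following = words[idx + 1].lower() if idx + 1 < len(words) else ""
    if following in {"gallery", "museum", "studio"}:
        return "subject"
    return "given name"

# A plain table lookup is enough for the bike shops ...
print(normalize("Fahrrad-Handel Anna Cintula") == normalize("Zweirad-Shop Anna Cintula"))

# ... but 'art' needs its context before any synonym can be chosen.
for entry in ("Art Gallery Garfunkel",
              "ART Auto Rendition Technology",
              "Paul Simon & Art Garfunkel"):
    print(entry, "->", role_of_art(entry))
```

The table lookup settles the bike shop case, while the three ‘art’ entries each get a different role, so no single table entry for ‘art’ could be right for all of them.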
String comparison is fine as a start in data matching problems. To really avoid a serious number of mismatches and missed matches, and with that a serious amount of manual work, you need to know what you’re dealing with. You need to compare apples with apples and oranges with oranges. What would really help here is a bit of natural language processing ;-)