What is equal? – challenges with sound and synonyms

What do you do when basic string comparison (fuzzy search) techniques don’t give the right results? Fuzzy search helps to find matches where people make typos (e.g. Human Inference versus Human Inverence), make up abbreviations (King str. versus King street) or ignore diacritics (Sørensen versus Soerensen). If the ‘wrong word’ is not an existing word, it is obvious that we have a match once the typo is corrected.
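As a rough illustration of these three cases, here is a minimal Python sketch. It is not the technique of any particular product: the `TRANSLIT` table and the helper names are assumptions made for this example, and the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy comparison is actually used.

```python
import difflib
import unicodedata

# Hypothetical transliteration table (an assumption for this sketch):
# letters such as "ø" have no Unicode decomposition, so they are mapped by hand.
TRANSLIT = {"ø": "oe", "æ": "ae", "å": "aa", "ß": "ss"}


def normalize(text):
    """Lower-case, transliterate special letters and strip combining accents."""
    text = "".join(TRANSLIT.get(ch, ch) for ch in text.lower())
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a, b):
    """Similarity between 0.0 and 1.0 of the two normalized strings."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_abbreviation(short, full):
    """True if `short` (with an optional trailing dot) is a prefix of `full`."""
    stem = normalize(short).rstrip(".")
    return bool(stem) and normalize(full).startswith(stem)
```

With these helpers, `similarity("Sørensen", "Soerensen")` is 1.0, `similarity("Human Inference", "Human Inverence")` stays above 0.9, and `is_abbreviation("str.", "street")` is true.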

More challenges appear when the typo produces another existing word; we then need to decide how equal the two entries are. If you know how frequently words are used, you can factor that into the decision. How to obtain those frequencies is another ball game, but at least you can assume that a ‘wrong word’ is never used (a bit of a paradox).
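One possible way to bring word frequency into the decision is sketched below in Python. The weighting is an assumption chosen for illustration, not a formula from this post, and the frequency table in the usage note is made up. A word that is absent from the table counts as never used, in line with the remark above.

```python
def match_confidence(observed, candidate, freq):
    """Share of the combined usage that belongs to `candidate`.

    `freq` maps words to usage counts (hypothetical data in this sketch).
    A word missing from `freq` is treated as never used, so an observed
    non-word gives full confidence in the candidate.
    """
    f_observed = freq.get(observed, 0)
    f_candidate = freq.get(candidate, 0)
    if f_observed + f_candidate == 0:
        return 0.0  # neither word is known, so frequency tells us nothing
    return f_candidate / (f_observed + f_candidate)
```

With the made-up table `{"form": 1000, "from": 9000}`, `match_confidence("form", "from", ...)` is 0.9, whereas a non-word such as "inverence" compared with a known "inference" yields 1.0.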

The obfuscated address contest

Programmers sometimes hold contests in writing code that a compiler understands perfectly, but that people find very difficult to read.

When working on products for address standardisation, one can discover an interesting variant: people sometimes write – unintentionally, I suppose – addresses in such a way that they are rather understandable for people, but very difficult to process for computers.

Consider for example this street name:

Kerkchoosteeg hoogl

The official version is:

Hooglandsekerk-choorsteeg (‘high land church – choir alley’)

The written version contains several errors:

  • A hyphen is missing.
  • One ‘r’ is missing.
  • One word (‘Hooglandsekerk’) has been split up into two words.
  • The first word (‘Hooglandse’) is written at the end.
  • One word is abbreviated (‘hoogl’).

The first two errors are not very special, but the last three can only be discovered together: we can only see that the word ‘hooglandsekerk’ has been split into two words if we understand at the same time that its left part has been abbreviated and moved to the end.
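To make this interdependence concrete, here is a minimal Python sketch that tests all hypotheses at once. It rests on a simplifying assumption that is not taken from any real address product: a missing letter, a missing hyphen and an abbreviation all only remove characters, so a written name is accepted when some ordering of its words, glued together, is a subsequence of the official name. The function names are hypothetical.

```python
from itertools import permutations


def letters(text):
    """Keep only the letters, lower-cased (drops hyphens, spaces and dots)."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def is_subsequence(short, full):
    """True if the characters of `short` occur in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def reorder_match(written, official):
    """Return the fraction of the official name covered by the written one.

    Every ordering of the written words is tried; the result is 0.0 when no
    ordering is a subsequence of the official name. Trying all orderings is
    factorial in the number of words, which is acceptable only because
    street names contain few words.
    """
    target = letters(official)
    if not target:
        return 0.0
    tokens = [letters(token) for token in written.split()]
    for order in permutations(tokens):
        joined = "".join(order)
        if is_subsequence(joined, target):
            return len(joined) / len(target)
    return 0.0
```

For the example above, `reorder_match("Kerkchoosteeg hoogl", "Hooglandsekerk-choorsteeg")` gives 0.75: the words in their written order fail the test, and only the swapped order, with ‘hoogl’ read as an abbreviation, fits the official name.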