What is equal? – challenges with sound and synonyms

What can you do when basic string comparison (fuzzy search) techniques don’t give the right results? Fuzzy search helps to find matches in situations where people make typos (e.g. compare Human Inference with Human Inverence), make up abbreviations (King str. versus King street), or ignore diacritics (Sørensen and Soerensen). If the ‘wrong word’ is not a word in actual use, it is obvious that we have a match once the typo is corrected.
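
As a rough illustration of these three cases, here is a minimal Python sketch using only the standard library. The transliteration and abbreviation tables, and the function names `normalize` and `similarity`, are assumptions made for this example; a real product would rely on much richer reference data.

```python
import difflib
import unicodedata

# Hypothetical table for letters that Unicode decomposition cannot split
# into a base letter plus an accent (an assumption for this sketch).
TRANSLITERATIONS = {"ø": "oe", "æ": "ae", "ß": "ss"}

# Hypothetical abbreviation table; real address data needs far more entries.
ABBREVIATIONS = {"str.": "street", "str": "street"}


def normalize(text):
    """Lower-case, fold diacritics and expand known abbreviations."""
    text = text.lower()
    text = "".join(TRANSLITERATIONS.get(ch, ch) for ch in text)
    # NFKD splits accented letters into base letter + combining mark,
    # after which the combining marks can be dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 after normalization."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With these assumed tables, `similarity("Sørensen", "Soerensen")` and `similarity("King str.", "King street")` both come out as exact matches, while `similarity("Human Inference", "Human Inverence")` is high but below 1.0, leaving it to a threshold to decide whether the pair counts as equal.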

More challenges appear when the typo produces another existing word; now we need to decide how equal the two entries are. If you have some knowledge of how frequently words are used, you can use that in the equation. How to get the frequency of usage for words is another ballgame, but at least you can assume that a ‘wrong word’ is never used (a bit of a paradox).
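
One possible way to bring word frequency into the equation is sketched below in Python. The frequency counts, the function name `match_score` and the weighting formula are all assumptions chosen for illustration, not an established method: a word that never occurs is treated as a plain typo, while a word that does occur lowers the score in proportion to how common it is.

```python
import difflib

# Hypothetical usage counts; a real table would come from a corpus
# or from reference data.
WORD_FREQUENCY = {"from": 5000, "form": 900, "inference": 300}


def match_score(entered, candidate, frequency=WORD_FREQUENCY):
    """Score how likely it is that `entered` was meant to be `candidate`."""
    base = difflib.SequenceMatcher(None, entered, candidate).ratio()
    entered_count = frequency.get(entered, 0)
    candidate_count = frequency.get(candidate, 0)
    if entered_count == 0:
        # The entered word is never used, so assume it is simply a typo
        # and keep the plain string similarity.
        return base
    # Both spellings may be intentional: the more common the entered word
    # is relative to the candidate, the less likely it is a typo.
    typo_likelihood = candidate_count / (candidate_count + entered_count)
    return base * typo_likelihood
```

Here `match_score("inverence", "inference")` keeps its full string similarity because ‘inverence’ is not in the table, whereas `match_score("form", "from")` is scaled down because ‘form’ is an existing word in its own right.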

Standardizing crime

A recent article in a Dutch newspaper describes the success the Dutch police force is achieving with data mining products. Police officers are using data mining software to predict the time and place of potential criminal activities, such as burglary and robbery, and to direct extra police attention to these hotspots at those hours.
As with any data mining project, the quality of the analyses depends heavily on the quality of the data entered in the data warehouse.
Every statement entered in the system, every location, description of people, every relevant object needs to be comparable.
Address standardization products can help to enter locations in the system precisely and right the first time. Other data quality solutions are available for entering names and other data of people: suspects, victims, and witnesses.
But what about the other aspects of a statement? Was the crime the theft of a car, a vehicle, a van, a pick-up, etc.? Did the villain pick a purse or a wallet? A bicycle or a bike? The list of synonyms for objects of crime is endless.
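
Until then, a lookup table is one simple way to make such descriptions comparable. The Python sketch below is only an illustration: the table contents, the chosen standard terms and the function name `standardize_object` are assumptions, and whether a purse and a wallet should really count as the same object is a decision for whoever maintains the data.

```python
# Hypothetical synonym table mapping each term to one standard term.
CANONICAL_OBJECT = {
    "car": "vehicle",
    "van": "vehicle",
    "pick-up": "vehicle",
    "vehicle": "vehicle",
    "purse": "wallet",
    "wallet": "wallet",
    "bike": "bicycle",
    "bicycle": "bicycle",
}


def standardize_object(term):
    """Map a described object to its standard term, if one is known."""
    key = term.strip().lower()
    # Unknown terms are returned unchanged (lower-cased) so that they
    # can be reviewed and added to the table later.
    return CANONICAL_OBJECT.get(key, key)
```

With this table, statements mentioning a ‘Bike’ and a ‘bicycle’ end up under the same term, while an unlisted object such as ‘laptop’ passes through untouched.
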
I think the criminal community should come to an agreement and decide on standards to make analyses of these data mining projects even more successful. Now that Christmas is nearing, we all want a better world, don’t we?