Standardizing crime


A recent article in a Dutch newspaper describes the success the Dutch police force is realizing with data mining products. Policemen are using data mining software to predict time and place of potential criminal activities, such as burglary and robbery, and direct extra police attention to these hotspots at those hours.
As with any data mining project, the quality of the analyses depends heavily on the quality of the data entered in the data warehouse.
Every statement entered in the system, every location, description of people, every relevant object needs to be comparable.
Address standardization products can help to enter locations precisely and right the first time. Other data quality solutions are available for entering the names and other personal data of suspects, victims, and witnesses.
But what about the other aspects of a statement? Was the crime the theft of a car, a vehicle, a van, a pick-up, etc? Did the villain pick a purse or a wallet? A bicycle or a bike? The list of synonyms for objects of crime is endless.
I think the criminal community should come to an agreement and decide on standards to make the analyses in these data mining projects even more successful. Now that Christmas is nearing, we all want a better world, don’t we?

High precision matching – apples, oranges or fruit salad?

In his excellent post “New matching engines go beyond apples and oranges”, Winfried van Holland states that traditional matching engines are based on atomic string comparison functions, like match-codes, phonetic comparison, Levenshtein string distance and n-gram comparisons. He further argues that the drawback of these functions is that it is not always clear which function to use for which purpose, and that these low-level DQ functions cannot distinguish between apples and oranges, so you end up comparing family names with street names.
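
To make the “atomic” nature of these functions concrete, here is a minimal sketch of two of them: Levenshtein distance and a bigram comparison scored with the Dice coefficient. The function names and the choice of Dice scoring are my own assumptions for illustration, not taken from any particular matching engine.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ngrams(text: str, n: int = 2) -> set[str]:
    """Set of overlapping character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient on n-gram sets: 1.0 = identical sets, 0.0 = nothing shared."""
    grams_a, grams_b = ngrams(a.lower(), n), ngrams(b.lower(), n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Both functions only ever see two strings. Nothing tells them whether
# "Baker" is a family name or part of a street name.
print(levenshtein("Jansen", "Janssen"))
print(dice_similarity("Baker", "Baker Street"))
```

Neither function takes a field type as input, which is exactly the apples-and-oranges problem: the caller has to know which comparison makes sense for which kind of data.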

Good point! In essence, this is the basis of the discussion on the matching approach within customer data management: intelligent automated matching of records distributed over various heterogeneous data sources is an essential prerequisite for correct and adequate customer data integration, and there are many opinions on how to achieve it.

In data matching theory, two methods generally prevail where customer data management is concerned: deterministic and probabilistic matching. Continue reading ‘High precision matching – apples, oranges or fruit salad?’

Chinglish – the most delightful side-effect of internationalization

little grass has life

An increasing number of companies have to deal with data from the world’s fastest emerging economy: China. And the big question in this issue is of course: How can we compare these “strange” Chinese characters with our own writing set?

The grammar and character set of our Western alphabetic languages (such as English, French, Dutch or German) differ tremendously from those of Mandarin Chinese, the language spoken by most people in the People’s Republic of China and by many abroad. Mandarin is a tonal language with an ideographic character set. Almost all characters have a semantic and a phonetic component, and the pitch of the pronunciation ultimately determines the meaning.

Complicated? Definitely. But what about the other way around? Have you ever thought about the difficulties the Chinese have to face when trying to convert their language into meaningful English?

This phenomenon is sometimes hilariously illustrated by the many public signs in China meant to inform foreign visitors or to help them find their way around.

This is truly a delightful side-effect of internationalization. …. Continue reading ‘Chinglish – the most delightful side-effect of internationalization’

Matching persons with different official names

When matching persons or contact data in general, we are all aware that individuals may use abbreviations or nicknames as a kind of synonym for their name. Classic examples are Bill for William, or my own father, who goes by Mans while his official name is Hermanus. Most data matching engines use some kind of synonym table to handle this. That works because within a culture or region nicknames are usually linked to the same official names, and people do not tend to use completely different officially registered names.
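
A synonym table of this kind can be sketched in a few lines. The table contents and the function names below are hypothetical and only cover the two examples mentioned above; a real engine would ship a far larger, region-specific table.

```python
# Hypothetical nickname table: nickname -> official first name.
NICKNAMES = {
    "bill": "william",
    "mans": "hermanus",
}


def canonical_first_name(name: str) -> str:
    """Map a nickname to its official form; unknown names pass through unchanged."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def same_first_name(a: str, b: str) -> bool:
    """Compare two first names after resolving nicknames."""
    return canonical_first_name(a) == canonical_first_name(b)


print(same_first_name("Bill", "William"))
print(same_first_name("Mans", "Hermanus"))
```

The lookup only helps for pairs that someone has put in the table, so it stands or falls with the assumption that nicknames are stable within a culture or region.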

It becomes more challenging when there is no longer a link between nickname and official name. That may happen, for example, when people move from one cultural region to another where a different writing system is used. Take my Chinese friend 高为民, whose name in Latin script would be Gao Weimin (family name first), but who goes by William Gao whenever he works in Europe or the US. There is no common relation between the names William and Weimin, in either Latin or Chinese script, and they are not phonetic variants of each other. Continue reading ‘Matching persons with different official names’

International domain names – there goes the ASCIIhood….


The internet is on the verge of one of the most fundamental changes in its history. The Internet Corporation for Assigned Names and Numbers (ICANN) is expected to agree on the use of internet addresses in non-Latin characters during this week’s ICANN convention in Seoul. If all goes according to plan, it will be possible to use Greek, Cyrillic, Arabic, Chinese, Korean and many other characters in the internet browser’s address bar. More than half of the 1.6 billion internet users in the world use a non-Latin character set. ICANN therefore expects that the number of non-Latin domain names, and thus the number of new internet users, will increase rapidly.

This far-reaching change in the use of the internet is based on a system that can “translate” or “convert” different writing systems (sometimes with different writing directions, as in Arabic and Hebrew). On a high level, it would look a little like this, I would imagine:
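
In code, the conversion step can be tried out with Python’s built-in `idna` codec, which turns a Unicode domain name into the ASCII-compatible “xn--” form (Punycode) that name servers work with, and back again. This is only a sketch: the helper names `to_ascii` and `to_unicode` and the `.example` domain are my own, and the built-in codec follows the older IDNA 2003 rules rather than the latest revision of the standard.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII-compatible domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


ascii_form = to_ascii("bücher.example")
print(ascii_form)               # xn--bcher-kva.example
print(to_unicode(ascii_form))   # bücher.example
```

Labels that are already plain ASCII pass through unchanged, so only the non-Latin parts of an address are rewritten.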

Naturally, this phenomenon raises questions concerning the matching of internet addresses. Is ووو.هُمَنِنفِرِرِنسِ.كُم the same as its Latin-script counterpart? It appears that generic multilingual data matching issues also apply in this particular case.

Deduplication, first time wrong?


One of my current projects is to take an intelligent approach to removing the duplicates that already exist in a live system (SAP).

The client has already successfully used our software in their IT environment to effectively stop all new duplicates being entered into SAP. They now want to use the same technology to remove all existing duplicates. Their idea is so simple I am amazed that I have not heard of it being done elsewhere before.

Every evening the client’s whole SAP database will be searched for duplicates among their Companies and Contacts (more than 3 million records deduplicated in less than an hour!). The results are stored in a master result table that SAP has been given access to. Depending on the likelihood of the match, the duplicates fall into one of three categories: automatic merging, manual merging or no merge. If the score for the whole duplicate group is above the threshold for automatic merging, the automatic merging process is started. Continue reading ‘Deduplication, first time wrong?’
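
The three-way split can be sketched as a simple threshold check on the score of a duplicate group. The threshold values, the 0 to 100 score scale and all names below are assumptions for illustration only; the actual values used at the client are not given here.

```python
from enum import Enum


class MergeAction(Enum):
    AUTOMATIC = "automatic merging"
    MANUAL = "manual merging"
    NONE = "no merge"


# Hypothetical thresholds on an assumed 0-100 score scale.
AUTO_THRESHOLD = 90.0
MANUAL_THRESHOLD = 70.0


def classify_group(group_score: float,
                   auto_threshold: float = AUTO_THRESHOLD,
                   manual_threshold: float = MANUAL_THRESHOLD) -> MergeAction:
    """Decide what happens to a duplicate group based on its overall score."""
    if group_score > auto_threshold:
        return MergeAction.AUTOMATIC   # confident enough to merge unattended
    if group_score > manual_threshold:
        return MergeAction.MANUAL      # plausible match, needs a human decision
    return MergeAction.NONE            # too unlikely, leave the records alone


for score in (95.0, 80.0, 50.0):
    print(score, classify_group(score).value)
```

Keeping the thresholds as parameters makes it easy to start conservatively, with a high bar for automatic merging, and lower it once the manual reviews confirm the scores can be trusted.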