Deduplication, first time wrong?

One of my current projects has been to take an intelligent approach to the removal of duplicates already on an existing system (SAP).

The client has already successfully used our software in their IT environment to effectively stop all new duplicates being entered into SAP. They now want to use the same technology to remove all existing duplicates. Their idea is so simple I am amazed that I have not heard of it being done elsewhere before.

Every evening the client's whole SAP database will be searched for duplicates in their Companies and Contacts (more than 3 million records deduplicated in less than an hour!). The results are stored in a master result table that SAP has been given access to. Depending on the likelihood of the match, the duplicates fall into one of three categories: automatic merging, manual merging or no merge. If the score for the whole duplicate group is above the threshold for automatic merging, then the automatic merging process is started.
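
To make the three categories concrete, here is a minimal sketch in Python of how a duplicate group could be routed by its score. It is not the client's or the product's actual implementation: the function name `classify_group`, the 0-to-1 score scale and both threshold values are assumptions made purely for illustration.

```python
from enum import Enum


class MergeAction(Enum):
    AUTOMATIC = "automatic merging"
    MANUAL = "manual merging"
    NONE = "no merge"


# Hypothetical thresholds on an assumed 0..1 score scale.
# Real values would be tuned per client and are not given in the post.
AUTO_THRESHOLD = 0.95
MANUAL_THRESHOLD = 0.80


def classify_group(group_score: float,
                   auto_threshold: float = AUTO_THRESHOLD,
                   manual_threshold: float = MANUAL_THRESHOLD) -> MergeAction:
    """Route one duplicate group by the score of the whole group."""
    if group_score > auto_threshold:
        # Above the automatic threshold: start the automatic merge.
        return MergeAction.AUTOMATIC
    if group_score > manual_threshold:
        # Likely duplicates, but a person should confirm the merge.
        return MergeAction.MANUAL
    # Too uncertain: leave the records alone.
    return MergeAction.NONE


if __name__ == "__main__":
    for score in (0.97, 0.85, 0.50):
        print(score, "->", classify_group(score).value)
```

Keeping the thresholds as parameters means the split between automatic and manual work can be adjusted without touching the routing logic.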

Data Quality – who needs it!

Okay, so the theme of Data Quality (DQ) has been around for more than a couple of years now. If you are reading this, chances are that you are already well informed about what’s available.

I came from a large logistics company, where DQ was preached heavily and seen as a way of reducing costs. The further we went into what DQ could actually mean, though, the more vague and indirect the costs and effects seemed to be. The one thing we knew we really suffered from was a whole lot of duplicates in the system. These were always visible, and their effects were very tangible. They effectively helped screw up a perfectly good CRM tool. The solution was simple. Buy a deduplication tool and identify the duplicates!

A question of quality?

Yesterday I gave a lecture for management information system (MIS) students. We were looking into definitions of data quality linked to the natural language processing approach of the Human Inference software. As discussions developed, the students could not easily agree on criteria for quality in general. In an exercise, we talked about “good” and “bad” service. It appeared that, besides differences in taste, good service had a lot to do with expectation and the fulfillment of that expectation. Of course, there were also a lot of other “requirements” for good service, but the discussion made me think of a YouTube movie I had recently seen. Seeing this movie made the jump to a solid and generic data quality definition easy: data has quality if it satisfies the requirements of its intended use… Enjoy the movie!

Dutch comedians making fun of names

The Dutch radio program Andersmans Veren (Radio 2, AVRO) broadcast a special on names on February 15th. Although not related to any burning data quality issues, it is almost two hours of fun to listen to. From Theo Maassen to Toon Hermans, and of course Hans Teeuwen with his famous sketch. Point your browser to the radio stream or download it as a podcast. Note: only suitable for people who understand Dutch.