Deduplication, first time wrong?


One of my current projects has been to take an intelligent approach to the removal of duplicates already on an existing system (SAP).

The client has already successfully used our software in their IT environment to effectively stop all new duplicates being entered into SAP. They now want to use the same technology to remove all existing duplicates. Their idea is so simple I am amazed that I have not heard of it being done elsewhere before.

Every evening the client's whole SAP database is searched for duplicates in its Companies and Contacts (more than 3 million records deduplicated in less than an hour!). The results are stored in a master result table that SAP has been given access to. Depending on the likelihood of the match, each duplicate group falls into one of three categories: automatic merging, manual merging or no merge. If the score for the whole duplicate group is above the threshold for automatic merging, the automatic merging process is started.
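
To make the three categories concrete, here is a minimal sketch in Python. The threshold values, the 0-to-1 score scale and the names `classify_group`, `AUTO_THRESHOLD` and `MANUAL_THRESHOLD` are my own assumptions for illustration; the real thresholds are whatever the client has configured.

```python
from enum import Enum


class MergeDecision(Enum):
    AUTOMATIC = "automatic merge"
    MANUAL = "manual merge"
    NO_MERGE = "no merge"


# Hypothetical thresholds on a 0..1 match score; not the client's real values.
AUTO_THRESHOLD = 0.95
MANUAL_THRESHOLD = 0.75


def classify_group(group_score: float,
                   auto_threshold: float = AUTO_THRESHOLD,
                   manual_threshold: float = MANUAL_THRESHOLD) -> MergeDecision:
    """Sort one duplicate group into a category by its overall match score."""
    # Treating a score exactly on the threshold as "above" is an assumption.
    if group_score >= auto_threshold:
        return MergeDecision.AUTOMATIC
    if group_score >= manual_threshold:
        return MergeDecision.MANUAL
    return MergeDecision.NO_MERGE
```

Calling `classify_group` once per row of the master result table would give the nightly split into the three buckets.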

This merge process has been created by an external SAP consultancy group that does a lot of clever stuff in giving each record a score depending on its financial relevance, e.g. open payments, current order status, payment reminders etc. (Hey, it’s SAP and in the world according to SAP only financial dealings have a value!) In the end the one record with the highest score is set to be the lead duplicate. All information from the other records in the duplicate group is placed onto the leading record to create a unique (‘Golden’) record. All duplicate records with the exception of the lead duplicate are then removed from the system. In the case of SAP, these records are given a ‘set for deletion’ flag and subsequently archived.
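
The consultancy's actual scoring rules are not something I can reproduce here, but the general shape of the merge can be sketched in Python. Everything in this block is an assumption for illustration: the field names, the weights in `RELEVANCE_WEIGHTS`, and the rule that only empty fields on the lead record are filled from the other records.

```python
# Hypothetical weights for the financial relevance score (not the real rules).
RELEVANCE_WEIGHTS = {"open_payments": 5, "open_orders": 3, "payment_reminders": 1}


def relevance(record: dict) -> int:
    """Score a record by its financial activity; a missing field counts as 0."""
    return sum(weight * record.get(name, 0)
               for name, weight in RELEVANCE_WEIGHTS.items())


def merge_group(records: list[dict]) -> tuple[dict, list[str]]:
    """Return the golden record and the ids to be flagged for deletion."""
    lead = max(records, key=relevance)   # highest score becomes lead duplicate
    golden = dict(lead)                  # copy, so the input stays untouched
    for record in records:
        if record is lead:
            continue
        for key, value in record.items():
            # Assumption: only fill gaps on the lead, never overwrite its values.
            if golden.get(key) in (None, "") and value not in (None, ""):
                golden[key] = value
    flagged_for_deletion = [r["id"] for r in records if r is not lead]
    return golden, flagged_for_deletion
```

In SAP the flagged ids would receive the ‘set for deletion’ flag and be archived later; the sketch only returns them.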

The ‘Non merges’, i.e. the groups whose match score is below the accepted threshold level, are discarded, and all remaining groups (those not already merged automatically) are sent to a separate SAP mask for manual inspection on the following day. All that is required is to decide whether the records shown belong in a duplicate group or not. After this decision has been made, each confirmed duplicate group goes to the ‘merging’ process, just the same as in the automatic merge.

At the end of the day the whole process starts again. Wash, rinse, repeat! Simple! The first thing to happen is that over a short period of time all the secure duplicates disappear as they are merged automatically. This is highly visible: no more multiple identical records popping up whenever a new record is entered. The impact on the quality of the surrounding systems is just as direct. No sending out bills or marketing mails x times to the same person (having worked in Marketing before, I know the problem and it always leaves such a professional impression with the customer!). So it’s already something easy to sell to your managers and so far you have not had to lift a finger. Great!

The brilliance of the SAP data quality solution, though, lies elsewhere. The simple fact is that it really does not matter whether the rest of the results are worked through in a day, a month or a year, as they are captured anew every day. The net result is that the total level of duplicates is constantly decreasing. Where the merge process has taken place, the duplicates disappear. Only a change to a record will force it to be rechecked in the next round of deduplication. This means that, apart from the cost of enhancing the current system, the client has an effective DQ firewall that not only protects them from duplicate data being entered into their IT systems, but will also cleanse the system from within over time, even if it means assigning an employee to sporadically make decisions on the manual matches. It is something that the company or department can concentrate on whenever they have time and resources available. (That should be easy after showing what success you have had with it already!)
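
One way to read the "only a change forces a recheck" rule is sketched below in Python. This is my own interpretation, not the client's implementation: I assume each manual decision is stored with a timestamp, keyed by the set of record ids in the group, and that every record carries a last-changed timestamp. The names `needs_review`, `decisions` and `last_changed` are hypothetical.

```python
from datetime import datetime


def needs_review(group_ids: list[str],
                 last_changed: dict[str, datetime],
                 decisions: dict[frozenset, datetime]) -> bool:
    """Decide whether a candidate group must be shown for manual inspection.

    A group is shown if nobody has decided on it yet, or if any of its
    records has been changed since the decision was made.
    """
    key = frozenset(group_ids)           # the order of the ids does not matter
    decided_at = decisions.get(key)
    if decided_at is None:
        return True                      # never decided: still open
    return any(last_changed[record_id] > decided_at for record_id in key)
```

With a check like this the nightly run can find the same candidate groups again and again without putting already decided ones back in front of the employee.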

How about if the process could be easily and readily monitored, say by using Excel or a similar product? Bar graphs and pie charts always tell way more than actual figures! Then the impact of what is happening is all the more visible and easy to sell (a good budget retainer!).
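
As a rough idea of how the figures could get into Excel, the sketch below (Python, standard library only) counts the categories of one nightly run and writes them out as CSV, which Excel opens directly. The category labels, column names and function names are assumptions of mine; the real numbers would come from the master result table.

```python
import csv
import io
from collections import Counter

# Hypothetical column layout for the monitoring sheet.
COLUMNS = ["date", "automatic", "manual", "no_merge"]


def summarize_run(run_date: str, categories: list[str]) -> dict:
    """Count how many duplicate groups fell into each category in one run."""
    counts = Counter(categories)
    # Counter returns 0 for categories that did not occur in this run.
    return {"date": run_date,
            "automatic": counts["automatic"],
            "manual": counts["manual"],
            "no_merge": counts["no_merge"]}


def to_csv(rows: list[dict]) -> str:
    """Render one row per nightly run as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Appending one row per night gives a table from which the falling number of open duplicates can be charted with a couple of clicks.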

Good luck in dealing with your duplicates.
