Data Quality – who needs it!

Okay, so the theme of Data Quality (DQ) has been around for more than a couple of years now. If you are reading this, chances are you are already well informed on what’s available.

I came from a large logistics company, where DQ was preached heavily and seen as a way of reducing costs. The deeper we went into what DQ could actually mean, though, the more vague and indirect the costs and effects seemed to be. The one thing we knew we really suffered from was a whole lot of duplicates in the system. This was always visible and its effects very tangible; they effectively helped screw up a perfectly good CRM tool. The solution seemed simple: buy a deduplication tool and identify the duplicates!

Now there’s a good few deduplication tools out on the market. All of them will tell you how good they are at finding duplicates using mathematical, probabilistic and fuzzy matching; the list goes on and on. Where all providers seem to stop short, though, is in what they do with the duplicates found. All vendors have ways of identifying duplicate pairs, or even duplicate groups, and most will offer clever and fancy ways of bringing the duplicates together. But in almost all of these cases this happens OUTSIDE the client’s current IT systems. Of course, bulk loading old/new data back into the IT system is always so simple! So much fun! IT/ICS departments are always so understanding and helpful!!
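To make the matching-and-grouping step concrete: this is not any vendor's actual algorithm, just a minimal sketch using Python's standard library. The record strings, the similarity measure (`difflib` ratio) and the 0.85 threshold are all illustrative; real tools use far more sophisticated probabilistic matching.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude fuzzy-match score between two records (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def duplicate_groups(records, threshold=0.85):
    """Compare every pair, then merge matching pairs into groups (union-find)."""
    parent = list(range(len(records)))

    def find(i):  # follow parent pointers to the group's representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)  # union: j's group joins i's

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    return [g for g in groups.values() if len(g) > 1]

customers = [
    "ACME Logistics GmbH, Hamburg",
    "Acme Logistics GMBH, Hamburg",
    "Beta Freight Ltd, Dublin",
]
print(duplicate_groups(customers))
```

Note that finding pairs is the easy half; turning pairs into consistent groups (the union-find above) is already where things get less trivial, and everything after that happens outside the source system, which is exactly the problem.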

This is what we had to find out for ourselves, in a painful fashion. We quickly came to realise that finding the duplicates is barely even the tip of the iceberg. Actually trying to group the information together in such a way as to create a unique (golden) record was going to cost the company a lot of money. It would need the involvement of a systems integrator, because the problems are never related to just one system, right? Multiple man-years spent in project time (God! Consultants just pray for those types of projects.) Naturally, project costs rapidly rising into millions of euros.
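The "golden record" step usually comes down to survivorship rules: which field value wins when the duplicates disagree? A minimal sketch under one assumed rule ("newest non-empty value wins"); the field names, dates and the rule itself are illustrative, and in practice every system and field tends to need its own rule, which is where the consultant man-years go.

```python
from datetime import date

# One duplicate group as found by matching; fields are illustrative.
duplicates = [
    {"name": "ACME Logistics GmbH", "phone": "",            "updated": date(2007, 3, 1)},
    {"name": "Acme Logistics",      "phone": "+49 40 1234", "updated": date(2008, 6, 15)},
]

def golden_record(group):
    """Survivorship rule: take each field from the newest record that has a value."""
    merged = {}
    for rec in sorted(group, key=lambda r: r["updated"]):  # oldest first
        for field, value in rec.items():
            if value:  # newer non-empty values overwrite older ones
                merged[field] = value
    return merged

print(golden_record(duplicates))
```

The merge itself is a few lines; deciding the rules per field, and writing the result back into the live systems, is the expensive part.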

Where’s the saving then? I mean, the company has lived with the same problem for years right? So it can’t be that bad, can it?!? (Ever wondered exactly how the banks got into the current credit crisis?)

Unfortunately for the logistics company, the project never really got off the ground. Although the benefits of a duplicate-free system were visible (reduced overhead, a more streamlined sales force, increased effectiveness), the risks around project timelines and overall cost killed it.

All is not lost, though. Good DQ providers will also offer software that effectively stops duplicates from being allowed onto the client’s system. Of course, the safest way of ensuring a high level of DQ is to catch all the mistakes at the point of entry. Sounds wonderful, but as the Irish saying goes: “If I were you, I wouldn’t start from here.” The DQ providers are only asked for a solution because the need has already been identified by the prospective client. The client has no choice but to start from where they are. Just stopping the problem coming in through the front door does not make the waste that’s already on the system magically disappear.
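The "front door" check is conceptually simple: before a new record is saved, compare it against what is already on the system and flag a close match. A minimal sketch, reusing a stdlib fuzzy score; the threshold and record strings are illustrative, and a real entry firewall would search an index rather than scan every record.

```python
from difflib import SequenceMatcher

def is_probable_duplicate(new_record, existing, threshold=0.9):
    """Entry-time check: flag a new record that closely matches an existing one."""
    return any(
        SequenceMatcher(None, new_record.lower(), rec.lower()).ratio() >= threshold
        for rec in existing
    )

crm = ["ACME Logistics GmbH, Hamburg"]
print(is_probable_duplicate("Acme Logistics GMBH, Hamburg", crm))  # close match: flag it
print(is_probable_duplicate("Beta Freight Ltd, Dublin", crm))      # genuinely new: let it in
```

And, as the article says, such a check only stops new waste coming in; everything already on the system still has to be cleaned up separately.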

So data quality tools will of course offer solutions, but far too often they won’t go far enough! DQ can never really be about just installing a CD to solve the problem. Good expertise and guidance are necessary. It will always require a deeper understanding of the origin of the problem, a sharp focus on the company’s pain points and tight integration into the IT landscape. Suddenly a simple CD solution doesn’t fit anywhere; it quickly becomes an exponential cost, sucking up time and resources simply to make it work. At a time when companies are looking to become more efficient, cut costs and, unfortunately, overhead too, is that really the best way?

3 Responses to “Data Quality – who needs it!”

  • Very apt Paul.
    Along our data quality project we found out that stopping new duplicates pouring into the database isn’t enough; it’s just a good start, as you say.
    Usually there is a history of data which is already there (unchecked and terribly filthy), and of course there are business processes permanently running on this data.
    So we took a three-step approach:
    1. stop new duplicates coming in
    2. kill the duplicates that are already there, without stopping the running business
    3. improve the data content and so improve the duplicate-matching results
    But as you said: it’s a long way to Tipperary (which is a pub in London as I’ve heard – can you confirm?)

  • Nice to read :-) and interested in the second part *g*

    Best regards from Cologne

  • Balancing the cost/impact of fixing the information quality problems against the cost/impact of living with them is a constant theme I’ve come across working internally in a large company.

    The cost of poor quality is often seen as the “cost of doing business”, and often the thinking is tightly stuck in that rut. The only approach that I’ve ever seen come close to addressing the holy grail of good quality data has been to:

    1) Tackle people and processes to change behaviour that causes common errors
    2) Put measures in to track the changes in behaviour (profile data on the way into your systems for example)
    3) Then make the changes necessary to the systems to address the causes that are inherent in the system, or are required to address other process weaknesses that cause poor quality data.
    4) And then (either at the end or in parallel to steps 1 and 2), start working to clear out the crummy data in the systems.
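Step 2 above (profiling data on the way in to track behaviour change) can be sketched very simply. This is only an illustration, not the commenter's actual setup: the field names and validation rules are assumptions, and a real profiler would cover many more checks, but the idea of "pass rate per field, per batch" is the core of it.

```python
import re

# Illustrative rules: a field-level check per incoming record.
RULES = {
    "name":  lambda v: bool(v and v.strip()),
    "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "")),
}

def profile(batch):
    """Return the pass rate per field for one batch of incoming records."""
    rates = {}
    for field, rule in RULES.items():
        passed = sum(1 for rec in batch if rule(rec.get(field)))
        rates[field] = passed / len(batch)
    return rates

incoming = [
    {"name": "ACME Logistics", "email": "sales@acme.example"},
    {"name": "",               "email": "not-an-email"},
]
print(profile(incoming))
```

Tracking these rates batch over batch is what turns "we fixed the process" from a claim into a measurement.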

    Unfortunately, executives are too used to the “zip bang flashing lights” of IT systems delivery (for example sexy new web applications or call centre tools) to appreciate the hidden value in the unglamorous work of mucking out the database and making the processes hum along to stop crud getting in.

    It’s a bit like people being enamoured with “slim-quick” diet pills and ‘magic potions’ while not recognising the real long term value of going to the gym to improve fitness. The former gives unsustainable results quickly, but the latter gives longer lasting and more far-reaching benefits.

    And on that… just popping out for a run….

Comments are currently closed.