Doing a Data Quality Analysis (DQA) is a challenging task. You need to get your head into the business domain to understand what the data is all about. You need to talk to the users in the organization to understand how they work with the data. And within hours you’ll most probably have a dozen different data sources to dig into.
Tools, tools, tools
A DQA is anything but trivial, yet tool support is often lacking. Analysts end up working with a toolset just as diverse and unwieldy as the data they are trying to manage. The problem with this approach is that it eventually gets in your way: you spend your time getting two or three independent tools to play nicely together instead of having those functions available where you need them.
I don’t mind combining tools at all, but we have to do so with care and acknowledge that combining tools also adds constraints to our working process. Say, for example, that you’re analyzing string patterns in a set of company names. You notice that part of the pattern turns out to be legal forms such as GmbH, Ltd, A/S and so on. You want to separate the legal form from the company name, but switching between the tool that does the pattern finding (a profiling tool) and the one that does the separation (a transformation, perhaps even ETL, tool) means going back to step 1 of your workflow and redoing every step in a different tool. If your chain of analysis steps is more than a few steps long, you’re in for a lot of waste.
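To make the example concrete: once the profiling step has surfaced the legal forms, the separation step is essentially a suffix split. Here is a minimal sketch in Python, assuming a hand-curated list of legal forms (the list, the function name and the regex are illustrative, not taken from any particular profiling or ETL tool):

```python
import re

# Hypothetical list of legal-form suffixes spotted during pattern analysis.
LEGAL_FORMS = ["GmbH", "Ltd", "A/S", "Inc", "LLC", "S.A."]

# Regex matching "<name> <legal form>", with an optional comma before the
# legal form and an optional trailing period (e.g. "Acme, Ltd.").
_pattern = re.compile(
    r"^(?P<name>.+?)[,\s]+(?P<legal_form>"
    + "|".join(re.escape(f) for f in LEGAL_FORMS)
    + r")\.?$",
    re.IGNORECASE,
)

def split_legal_form(company):
    """Split a company string into (name, legal form).

    Returns the legal form as None when no known suffix is found.
    """
    m = _pattern.match(company.strip())
    if m:
        return m.group("name").rstrip(" ,"), m.group("legal_form")
    return company.strip(), None
```

For instance, `split_legal_form("Novo Nordisk A/S")` yields `("Novo Nordisk", "A/S")`, while a name without a recognized suffix comes back unchanged with `None`. The point of the example is not the regex itself but that this transformation lives naturally right next to the profiling result that motivated it.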