Doing a Data Quality Analysis (DQA) is a challenging task. You need to get your head into the domain of the business to understand what the data is all about. You need to talk to the users of the organization to understand how they work with the data. And within hours you’ll most probably have a dozen different data sources that you need to dig into.
Tools, tools, tools
The DQA is not trivial; quite the opposite. So you’ll often find that tool support is lacking. The analyst has to use a toolset that is just as diverse and uncontrollable as the data he is trying to manage. The problem with such an approach is that it will eventually get in your way, because you’re trying to get two or three independent tools to work nicely together instead of having these functions available where you need them.
I don’t mind combining tools at all, but we have to do so with care and acknowledge that combining tools also adds a lot of constraints to our working process. Say, for example, that you’re doing an analysis of string patterns in a set of company names. You notice that part of the pattern turns out to be legal forms such as GmbH, Ltd, A/S and so on. You want to separate the legal form from the company name, but switching between the tool that does the pattern finding (a profiling tool) and the one that does the separation (a transformation, perhaps even ETL, tool) means that you have to go back to step 1 and redo all the steps in your workflow in different tools. If your chain of analysis steps is more than just a few steps long, you’re in for a lot of waste.
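To make the pattern-separation step concrete, here is a minimal sketch of splitting a legal form off a company name. It is illustrative only and not taken from any particular tool; the legal-form list and the function name are assumptions, and a real analysis would use a much larger, per-country dictionary:

```python
import re

# Illustrative list of legal forms; a real DQ dictionary would be far larger.
LEGAL_FORMS = ["GmbH", "Ltd", "A/S", "Inc", "LLC", "S.A."]

# Match a trailing legal form, e.g. "Acme GmbH" or "Acme Ltd."
_pattern = re.compile(
    r"^(?P<name>.+?)[\s,]+(?P<form>"
    + "|".join(re.escape(form) for form in LEGAL_FORMS)
    + r")\.?$"
)

def split_legal_form(company: str):
    """Split a company name into (base name, legal form or None)."""
    match = _pattern.match(company.strip())
    if match:
        return match.group("name"), match.group("form")
    return company.strip(), None

print(split_legal_form("Acme GmbH"))      # ('Acme', 'GmbH')
print(split_legal_form("Foobar A/S"))     # ('Foobar', 'A/S')
print(split_legal_form("Plain Company"))  # ('Plain Company', None)
```

The point is not the regex itself, but that the profiling step (discovering the pattern) and the transformation step (acting on it) belong in the same environment.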
Now let’s have a look at some of the characteristics of doing a data quality analysis:
- Explorative – you need to go into the analysis with an open mind and explore the problem domain as well as the data’s current state.
- Interactive – you will need to go repeatedly from overview to detail view and adjust your assumptions along the way.
- Temporary and non-persistent – an analysis is an activity that provides a snapshot of the current state as well as a set of recommended actions. It’s not something that immediately feeds back into the data source from which it originates. And it does not require permanent deployment of any applications (that would be monitoring, which is only meaningful after a thorough analysis).
- Multi-sourced – At the analysis stage there is very rarely a data source containing the “single version of the truth”. This is why, as an analyst, you have to be prepared to deal with multiple data sources, such as databases, Excel spreadsheets, MS Access files and so on.
So what happens then?
Way too often, what happens is that the analyst turns to a full-fledged ETL tool, because it’s the most generic tool and can be configured to support most operations. So what’s wrong with that?
Transformations for the DQ domain
The transformations that you want to employ in a Data Quality Analysis are typically quite different from those that come out of the box in database scripts and ETL tools. Such tools are typically quite generic and will provide general-purpose tokenizers and the like, but they usually lack transformations pertaining to the DQ domain, such as date mask matchers, dictionary lookups and synonym replacements, and standardization of email addresses, names and URLs. If we really want to have “everything in one place”, then we should also be careful that “everything” is not just the lowest common denominator.
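As an illustration of one such DQ-specific transformation, a date mask matcher checks which of a set of known date patterns a value conforms to. The mask names and the tiny mask set below are assumptions made for the sake of the sketch; real tools ship with far larger mask libraries:

```python
import re

# A handful of illustrative date masks mapped to regular expressions.
DATE_MASKS = {
    "yyyy-MM-dd": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "dd/MM/yyyy": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "MM-dd-yy":   re.compile(r"^\d{2}-\d{2}-\d{2}$"),
}

def match_date_masks(value: str):
    """Return the names of all masks that a value conforms to."""
    return [name for name, rx in DATE_MASKS.items() if rx.match(value)]

print(match_date_masks("2011-02-14"))  # ['yyyy-MM-dd']
print(match_date_masks("14/02/2011"))  # ['dd/MM/yyyy']
print(match_date_masks("hello"))       # []
```

Notice that a value can match several masks at once, which is exactly the kind of ambiguity a DQ analysis needs to surface.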
Transformations are persistent
When you do pre-processing in separate tools, you also need to persist your transformed datasets. In a Data Quality Analysis this is just a waste of resources and yields poor performance. If you need to perform transformations, apply filtering and so on for the purpose of analyzing, profiling and exploring your data, it is much more efficient to perform these transformations on demand instead of storing the results up front. This also allows for a much freer user experience, where you can actually experiment with your data and your analysis instead of having to overthink it.
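The idea of applying transformations on demand, rather than persisting intermediate datasets, can be sketched as a lazy pipeline. The transformation steps and row layout here are hypothetical examples, not any tool’s actual API:

```python
def transformed_rows(rows, transforms):
    """Yield rows with a chain of transformations applied on the fly,
    without persisting any intermediate dataset."""
    for row in rows:
        for transform in transforms:
            row = transform(row)
        yield row

# Hypothetical transformation steps: trim whitespace, upper-case a code.
def trim(row):
    return {key: value.strip() for key, value in row.items()}

def upper_country(row):
    return {**row, "country": row["country"].upper()}

source = [
    {"name": " Acme ", "country": "de"},
    {"name": "Foobar", "country": "dk"},
]

for row in transformed_rows(source, [trim, upper_country]):
    print(row)
# {'name': 'Acme', 'country': 'DE'}
# {'name': 'Foobar', 'country': 'DK'}
```

Because nothing is written to disk, you can swap a step in or out and immediately re-run the analysis, which is precisely the experimentation the text argues for.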
In conclusion, this adds up to a tough case for data analysis tools. It’s very hard to compress all your DQA needs into a single tool, and it seems that a lot of vendors separate transformation activities from analysis activities because that’s the most convenient solution from a product development perspective.
These are some of the thoughts that we’ve had in mind for the last half year while developing the next version of the open source Data Quality application, DataCleaner. Last Monday we released DataCleaner 2.0, which represents our attempt to bridge the gap between time-consuming ETL tools, complicated data mining tools and constrained “pure profiling” applications. Let this be an invitation to join the debate around data profiling tools – do you agree with the DQA toolset requirements that we’re presenting, or are other features more important? How do you see the future of DQA tools?