At first glance, migrating data from one database to another looks trivial: you can easily discover the schema or data model of each database, then simply map the elements of one to the other. Any ETL tool worth its acronym will let you do that.
In practice, data migration is a nightmare…
A database belongs to an application, and it is the application that gives meaning to the data. So to map from one to the other, you need to understand each application. And you’ll soon find that one application has concepts that do not exist, in quite the same form, in the other.
For example, when migrating grants-management data for a charitable foundation, I found that the source package had broad indicators of the status of each grant application: Pending, Approved, Rejected, Closed. The target package had a similar top-level division, but allowed a much more fine-grained specification of the lifecycle below that. My client was keen to exploit this, as this finer grain enabled better coordination of the workflow, more automation of letter generation, and better management and reporting of workload and progress. To populate this fine-grained status, we had to scour the source database for additional information, and apply some 20 tests to a dozen additional fields.
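The shape of that derivation can be sketched as a rule cascade: take the coarse source status, then test other fields for clues about where in the lifecycle the grant really is. The field names and status labels below are invented for illustration; the real migration used some 20 such tests over a dozen fields.

```python
def fine_grained_status(record):
    """Derive a fine-grained lifecycle status from a coarse one
    plus clues scattered across other source fields.
    (Field names and labels are hypothetical.)"""
    coarse = record["status"]
    if coarse == "Approved":
        if record.get("cheque_sent"):
            return "Approved - Payment made"
        if record.get("award_letter_date"):
            return "Approved - Awaiting payment"
        return "Approved - Awaiting notification"
    if coarse == "Pending":
        if record.get("referee_reports_received"):
            return "Pending - Under review"
        return "Pending - Awaiting references"
    # Rejected and Closed map straight across in this sketch
    return coarse

fine_grained_status({"status": "Approved", "cheque_sent": True})
# → "Approved - Payment made"
```

Each rule is only as good as the auxiliary data behind it, which is why the scouring of the source database came first.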
OK, so this was a bit elaborate, but still entirely tractable. The data was there and could be interpreted unambiguously.
Things become much harder where it isn’t the application that gives meaning to the data, but the users. Users often find that their application does not quite allow them to express the distinctions they want to make, or to record all the facts they want to record. This is especially so in older applications that lack easy customisation – but it is still true of modern applications, as customisation involves time, effort and cost. Luckily, any application tends to contain a couple of spare fields, included by the designers because someone somewhere asked for them, but that are not actually of any use in this particular organisation. These spare fields then become the receptacle for the information that won’t fit elsewhere. Experience and lore tell users the true meaning of these fields, whatever label they may carry in the user interface.
And it gets worse. From time to time, users change their minds as to what is important. Fields fall into disuse and, after some time, are repurposed. In the grants-management migration, I found fields that had changed meaning three times within the life of the application, plus a few more times in an older application from which the data had been converted previously.
At the time, all I had available as a tool were my eyeballs. Now, of course, I would instantly reach for HIquality Inspect.
The ready-made Frequency info check in Inspect is a very handy tool for quickly listing the different values that occur in a given field. On the basis of this, you can then build regular expressions to retrieve individual records based on particular strings or substrings. Regular expressions are very flexible and allow a degree of fuzzy matching. Assuming the records also contain audit attributes, like Date Modified, you can then see the date ranges when these various strings were most popular. All you need to do then is to find a user who remembers what these strings actually meant, between August 1999 and April 2003. Which may be easier than finding a manager who is prepared to say this data can be safely discarded…
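The technique above can be imitated with a few lines of plain Python – a minimal stand-in for Inspect’s ready-made check, not its actual interface. The record layout, field name and values here are invented for illustration: list the distinct values in the spare field, use a regular expression to catch several spellings of the same idea, then read off the date range from the audit attribute.

```python
import re
from collections import Counter

# Hypothetical source records: a repurposed spare field plus an audit date
records = [
    {"spare1": "TRF",      "modified": "1999-08-12"},
    {"spare1": "trf.",     "modified": "2001-03-04"},
    {"spare1": "Transfer", "modified": "2003-04-19"},
    {"spare1": "HOLD",     "modified": "2004-06-02"},
]

# 1. Frequency listing of the values occurring in the field
freq = Counter(r["spare1"] for r in records)

# 2. Fuzzy matching: one pattern catches several spellings of one idea
pattern = re.compile(r"^tr(f|ansfer)", re.IGNORECASE)
matches = [r for r in records if pattern.search(r["spare1"])]

# 3. Date range when these variants were in use (ISO dates sort correctly)
dates = sorted(r["modified"] for r in matches)
first_seen, last_seen = dates[0], dates[-1]
```

What no amount of code will give you is the meaning of “TRF” in that period – for that you still need the user with the long memory.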
Things get even hairier when users don’t have spare fields and resort to placing various remarks alongside – or embedded in – the expected content of well-used fields. But that is one for another day.