In my previous blog post we learned that the Danish government has initiated a progressive strategy to make basic public data freely available. Now let’s see how we can use this data to cleanse our data. I will be using DataCleaner to pre- and post-process data needed for the cleansing operation.
We’ll assume you have some customer database with (among other information) a CVR number for (some of) your customers. Here’s a small example MySQL database table that I will use in this blog:
Alternatively you can also use a “P number”, identifying the physical location of the company. As a preparation step we need to build a list of these numbers only, which will be sent to the CVR “Masseudtræk” (mass extraction) service.
Download DataCleaner, start it up and register your customer database as a datastore. We will connect to the MySQL database, containing customer information, but you should connect to whatever datastore your customer information resides in:
Once the database is registered as a datastore, select it with the checkbox to the left , and click the big “Analyze!” button. This will open up the job editing mode of DataCleaner. Expand the tree on the left side to expose your table structure. Here’s what this looks like on my example database:
To create the list of CVR numbers we only need the column containing this information. In my case, this is the “cvr” column. So I double click this column, which adds it to the “Source” tab in the right side of the screen. Before we export the list, we want to make sure we only get the Danish customers in our result. So I also add the country column:
To perform the exclusion of non-Danish countries, we need to use an Equals filter. Click “Transform -> Filter -> Equals”. This will add a new tab, in which we can select the column to do filtering on, as well as the number of values that we’re looking for. In my database it seems that the country column contains both “Denmark” and “DK” values, so I add them both:
With this filter applied, we’re actually ready to write the export file. We’re interested in all records that evaluate “VALID” in this filter (e.g. equal to either “DK” or “Denmark”), so click this button and select “Set as default requirement”. Now click the “Write data” item in the top menu, and select “Create CSV file”:
This will bring up a new tab for configuring the exported CSV file. Fill in a name of the CSV file, and select a whitespace as the “Quote character” (since the CVR system does not currently support quote characters in the input):
Now click the big “Execute” button. The job will run and produce a text file similar to:
Notice the initial “cvr” header. This is added by DataCleaner to keep track of column names, but in this context it is a little inconvenient, since the CVR system requires no headers. So delete that line.
Now you can go to the CVR website and click “Masseudtræk”. Upload your file using the web form, and wait for it to be processed:
Note: Lately I’ve experienced quite some downtime on this service. So if you get an error message at this point, it is not uncommon at the moment. I imagine the CVR registry will become more stable as soon as they get used to the new load on their servers ;-)
When the file has been processed, the resulting company information will be available for you in your email inbox. Click the link in the email and download the file on your local disk. It will look something like this, if you open it in a text editor:
Now you have an up-to-date, high-quality file with company information for your customers. This includes both information that you already might have (like address information), but potentially also new data that you did not already have (like phone number, industry code, number of employees etc.).
In my next blog post I will go through the process of writing this information to your customer database – updating it with the most recent information, and enriching it with information that was never there before!