Cleaning Danish customer data with the CVR registry and DataCleaner, part 2 of 3

In my previous blog post we learned that the Danish government has initiated a progressive strategy to make basic public data freely available. Now let's see how we can use this data to cleanse our own customer records. I will be using DataCleaner to pre- and post-process the data needed for the cleansing operation.

We’ll assume you have some customer database with (among other information) a CVR number for (some of) your customers. Here’s a small example MySQL database table that I will use in this blog:
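If you want to follow along, something along these lines sets up a comparable table. This is only a sketch: the connection details, the database name "crm", the table name "customers" and the column layout are assumptions for this walkthrough (using the mysql-connector-python package), so adapt them to your own environment.

```python
# Sketch of the example table; all names and columns are assumptions
# for this walkthrough. Requires the mysql-connector-python package.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="user", password="secret", database="crm"
)
cursor = conn.cursor()
cursor.execute(
    """
    CREATE TABLE customers (
        id      INT PRIMARY KEY,
        name    VARCHAR(100),
        country VARCHAR(50),  -- holds values like 'DK' or 'Denmark'
        cvr     VARCHAR(8)    -- the Danish CVR number, 8 digits
    )
    """
)
cursor.execute(
    "INSERT INTO customers VALUES (1, 'Example ApS', 'DK', '12345678')"
)
conn.commit()
conn.close()
```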

Alternatively, you can use a "P number", which identifies the physical location of the company. As a preparation step we need to build a list of just these numbers, which will be sent to the CVR "Masseudtræk" (mass extraction) service.

Download DataCleaner, start it up and register your customer database as a datastore. I will connect to the MySQL database containing my customer information, but you should connect to whatever datastore your customer data resides in:

Once the database is registered as a datastore, select it with the checkbox to the left, and click the big "Analyze!" button. This will open up the job editing mode of DataCleaner. Expand the tree on the left side to expose your table structure. Here's what this looks like on my example database:


To create the list of CVR numbers we only need the column containing this information. In my case, this is the "cvr" column. So I double-click this column, which adds it to the "Source" tab on the right side of the screen. Before we export the list, we want to make sure we only get the Danish customers in our result. So I also add the country column:

To perform the exclusion of non-Danish countries, we need to use an Equals filter. Click "Transform -> Filter -> Equals". This will add a new tab, in which we can select the column to filter on, as well as the values that we're looking for. In my database it seems that the country column contains both "Denmark" and "DK" values, so I add them both:
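If it helps to see the filter in plain code, here is roughly what it boils down to; a sketch reusing the hypothetical "customers" table and "crm" database from above:

```python
# Roughly what the Equals filter does: keep only records whose
# country equals "DK" or "Denmark" (the "VALID" outcome), and take
# their CVR numbers. Database and table names are assumptions.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="user", password="secret", database="crm"
)
cursor = conn.cursor()
cursor.execute(
    "SELECT cvr FROM customers WHERE country IN ('DK', 'Denmark')"
)
danish_cvr_numbers = [row[0] for row in cursor.fetchall()]
conn.close()
```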

With this filter applied, we're actually ready to write the export file. We're interested in all records that evaluate to "VALID" in this filter (i.e. equal to either "DK" or "Denmark"), so click the "VALID" button and select "Set as default requirement". Now click the "Write data" item in the top menu, and select "Create CSV file":

This will bring up a new tab for configuring the exported CSV file. Fill in a name for the CSV file, and select a whitespace character as the "Quote character" (since the CVR system does not currently support quote characters in the input):

Now click the big “Execute” button. The job will run and produce a text file similar to:
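(The numbers below are made up for illustration; your file will of course contain your own customers' CVR numbers.)

```
cvr
12345678
87654321
11223344
```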

Notice the initial "cvr" header. This is added by DataCleaner to keep track of column names, but in this context it is a little inconvenient, since the CVR system expects a file without headers. So delete that line, either in a text editor or with a small script like the one below.
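A throwaway snippet along these lines does the trick; the filename "cvr_numbers.csv" is an assumption, so use whatever you named your export:

```python
# Drop the first line (the "cvr" header) that DataCleaner added,
# since the CVR mass extraction service expects a headerless file.
with open("cvr_numbers.csv") as f:
    lines = f.readlines()

with open("cvr_numbers.csv", "w") as f:
    f.writelines(lines[1:])
```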

Now you can go to the CVR website and click “Masseudtræk”. Upload your file using the web form, and wait for it to be processed:

Note: Lately I've experienced quite a bit of downtime on this service, so if you get an error message at this point, you are not alone. I imagine the CVR registry will become more stable once they get used to the new load on their servers ;-)

When the file has been processed, the resulting company information will be available in your email inbox. Click the link in the email and download the file to your local disk. It will look something like this if you open it in a text editor:

Now you have an up-to-date, high-quality file with company information for your customers. This includes information that you might already have (like address information), but potentially also new data that you did not have before (like phone number, industry code, number of employees etc.).

In my next blog post I will go through the process of writing this information to your customer database – updating it with the most recent information, and enriching it with information that was never there before!

Cleaning Danish customer data with the CVR registry and DataCleaner, part 1 of 3


Recently the Danish government has initiated a strategy to open up and make freely available basic public master data in a wide range of areas such as company registrations (CVR), social security (CPR), real estate, employment, environment and more. The initiative's name is "Good basic data for everyone – a driver for growth and efficiency" – read the full report here.

Over the coming years this means that public registries will be made available to the public. Of course privacy laws still apply, and the data is exposed in a manner where access is only granted in connection with specific search criteria. Still, you could say that this is a huge step toward liberalizing public data and making it available to everyone.

At Human Inference we recognize the value of good data quality, and we believe that this initiative by the Danish government will do a lot of good for both the public and the private sector in Denmark. The government report estimates annual cost savings of more than 100 million EUR for the Danish private and public sectors.

The timeline of the initiative spans many years of work, and it will be interesting to see how the end result is shaped. The plans include the notion of a central Data Distributor service, which will make data access streamlined and uniform to further improve availability. Personally I think this is a major step forward, and I cannot wait for this plan to be realized.


The first steps have already been taken, though! As of the first of January 2013, the price of accessing the CVR (company identification) registry was set to 0 DKK – free for everyone who accepts the terms and conditions. CVR, in my opinion, is one of two very central registries when it comes to common data quality issues in Denmark. The other one is the CPR registry for person identification and social security, which according to the plans will be opened in 2014 using the new Data Distributor platform.


So how do you start using the CVR data? In my next blog post (which will appear next Monday) I will give you a tutorial on how to cleanse your company customer data with CVR, using DataCleaner for all the pre- and post-processing needed. So stay tuned for part 2, which will be a real hands-on walkthrough!


DataCleaner adds expert cleansing functions – added value in Open Source

In late 2009, in their report "Who's Who in Open-Source Data Quality", Andreas Bitterer and Ted Friedman from Gartner already pointed to DataCleaner as a promising tool; a tool that, in their opinion, could certainly improve by offering more high-end cleansing functions and a less basic user experience.

Since then, a lot has happened in the DataCleaner space and in the profiling market. Before the launch of version 2 we notified everybody of the acquisition of eobjects.org's DataCleaner by Human Inference. Some of you may have been curious about what would happen with the functionality; as stated at the time, we would continue with the community and further participate in and expand it. Under the flag of Human Inference we launched the renewed DataCleaner 2.0, in which we definitely improved the user experience with an enhanced interface, together with the possibility to define filters and filter flows. Filter flows show their benefit when you analyze your data source and want to create new (temporary) data sources based on matching criteria. You can do that either manually, or in a completely automated way to monitor your data.

With Open Source in general, and with DataCleaner in particular, we want the community to participate in the functionality of the product. DataCleaner has long included the RegexSwap: the community site where you can share regular expressions. Why should everybody reinvent the same wheel to build a regular expression for credit card checks, emails, etc.?
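To give an idea of the kind of pattern you would fetch from the RegexSwap rather than write yourself, here is a sketch; the Visa-style pattern below is an illustration of mine, not an actual RegexSwap entry:

```python
# Illustrative credit card pattern check; the regex and test values
# are examples, not actual RegexSwap entries.
import re

# Visa-style numbers: start with 4, then 12 or 15 further digits.
VISA_PATTERN = re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$")

print(bool(VISA_PATTERN.match("4111111111111111")))  # True (test number)
print(bool(VISA_PATTERN.match("1234")))              # False
```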

Next to regular expressions that can be used to profile data, there is a need for data cleansing functions that contain much more business logic than can reasonably be covered by a regular expression. For example, validating that the syntax of an email address is correct is something else than validating whether there is also a running mail server attached to the domain. Cleansing functions are already part of DataCleaner, but there is always a need for other or more advanced functional extensions. To save you from having to create them in the 'DataCleaner' way yourself, we have created an easy extension-sharing mechanism.
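To make the email distinction concrete, here is a minimal sketch (not DataCleaner's actual implementation) contrasting a pure syntax check with a lookup of the domain's mail server. It assumes the third-party dnspython package for the MX query:

```python
# A syntax check says nothing about whether mail can actually be
# delivered; checking for an MX record gets one step closer.
# Not DataCleaner's implementation; requires the dnspython package.
import re
import dns.resolver

def syntax_ok(email: str) -> bool:
    # Deliberately simplistic pattern; real validation needs more logic.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}", email) is not None

def has_mail_server(email: str) -> bool:
    domain = email.rsplit("@", 1)[-1]
    try:
        dns.resolver.resolve(domain, "MX")  # raises if no MX record
        return True
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False

email = "jane@example.com"
print(syntax_ok(email))        # True: the syntax looks fine
print(has_mail_server(email))  # may still be False for this domain
```

Continue reading 'DataCleaner adds expert cleansing functions – added value in Open Source'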

Centenarians in Greece, Zimbabwe and the quality of birth dates

Population distribution in The Netherlands

This week started with a remarkable news item on the number of dead Greeks still drawing a pension. Especially the 9,000 centenarians (people aged over 100) give the feeling that there might be something wrong. At second glance, though, looking at the statistics of Europe, I held my horses – France, Spain, Italy, Germany and the UK also have a significant number of older people.

Anyway, while my mind was still boggling over these centenarians in Greece, a new news item popped in: an article on the statistics of the voters' lists of Zimbabwe. 41,100 potential voters in Zimbabwe are centenarians, 4 times more than currently in the UK – while the population of the UK is approximately 5 times that of Zimbabwe! And this in a country where the average life expectancy has fallen to 44.8 years. Even more extreme: 16,800 potential voters are aged 110 years old, all born on January 1st, 1901.

I cannot prevent that pointing to both of these news items might raise your eyebrows. Everything in me wants to avoid making any sort of political statement, though; I leave that to you.

For us, people living in the data quality world, these items raise the question: how can we identify this kind of weird data manipulation on dates? When we profile our customers' data sources, we are familiar with checks on certain date-related things – no rocket science – for example: is the date written in US or European style (mm/dd/yyyy or dd/mm/yyyy), are we dealing with two or four digits for the year, is the birth date before the current date, is the marriage date after the birth date, etc. A minimal sketch of such checks is shown below.
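Something along these lines, sketched in Python; the formats and the age threshold are illustrative choices only, not how our products implement it:

```python
# A few of the date checks mentioned above, sketched in Python.
from datetime import date, datetime

def parse_date(value):
    # Try European and US styles, with four- and two-digit years.
    for fmt in ("%d/%m/%Y", "%m/%d/%Y", "%d/%m/%y", "%m/%d/%y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # not a recognizable date

def date_issues(birth, marriage=None):
    issues = []
    today = date.today()
    if birth > today:
        issues.append("birth date lies in the future")
    if today.year - birth.year > 110:
        issues.append("age above 110, a likely default or entry error")
    if marriage is not None and marriage < birth:
        issues.append("marriage date before birth date")
    return issues

birth = parse_date("01/01/1901")
print(date_issues(birth))  # flags the implausible age
```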

We are also used to peaks at certain dates. A notorious one is January first of any year: on the one hand because it's the default in many entry screens, on the other hand because in some cultures the birth date itself is not that important – people from these cultures put more emphasis on name days and won't remember their day of birth. And then all of a sudden they are forced to give one, with the effect that they, or someone else, choose a default.
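Spotting such a peak in a profiling run can be as simple as counting day-of-year frequencies. A quick sketch, with birth dates made up for illustration:

```python
# Count how often each (day, month) combination occurs; a uniform
# distribution would put roughly 1/365 (~0.3%) on each day, so a big
# share on January 1st is a strong signal of default values.
from collections import Counter
from datetime import date

birth_dates = [
    date(1901, 1, 1), date(1901, 1, 1), date(1960, 1, 1),
    date(1975, 6, 12), date(1982, 11, 3),
]

day_counts = Counter((d.day, d.month) for d in birth_dates)
(day, month), count = day_counts.most_common(1)[0]
print(f"Most frequent birthday: {day:02d}/{month:02d} "
      f"({count / len(birth_dates):.0%} of all records)")
```

Continue reading 'Centenarians in Greece, Zimbabwe and the quality of birth dates'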

We have 180 million names! Which one is right?

The internet is an ocean of rich content, but unfortunately, as in the real world, it's heavily polluted.

As a company that has been in business for 25 years, Human Inference absolutely sees the benefits of the internet. For our reasoning processes, based on natural language processing, we gather content and classify it by type, such as given names, family names, prefixes, suffixes, etc. (See also my blog post on the comparison of apples and oranges ….)

In the past this was done manually, for example by investigating telephone books or researching census lists. But those were the 'pioneer years'. What we see now is an enormous amount of content that can be gathered from the internet. It's quite easy to find an internet page with 180 million records of person names. Great, so is knowledge gathering passé now? Continue reading 'We have 180 million names! Which one is right?'