Cleaning Danish customer data with the CVR registry and DataCleaner, part 3 of 3

In my previous two blog posts I’ve covered the background of the Danish CVR registry being made available to everyone (with more public registries to come), as well as how to do the initial pre-processing steps with DataCleaner to fetch CVR data about your customers. We ended up with a CSV file from the CVR service that looks like this:

In this blog entry I will go through the process of integrating this data in your customer database. Let’s first visit a few examples of why that would be relevant:

  • You can make sure that your customers’ address data is up-to-date and that letters and invoices reach their intended destination.
  • You can add more communication channels. The CVR registry contains email, phone and other basic communication information. Having this information available will improve the capabilities and reach of your support or customer service teams.
  • You can make it really easy for your customer. Instead of asking the customer for all kinds of information again and again, you can enter the customer’s name and let the data quality infrastructure handle the rest.
  • You can gain valuable insights with Business Intelligence tools, integrating the customer industries, company size and other metadata about the companies – provided through CVR.

The first step is to register our new file as a datastore in DataCleaner. That is easily done using the CSV icon on the front screen of the application. Once that is done, we will, like previously, connect to our customer database with DataCleaner using the ‘Analyze’ button. What we want to do, technically, is join the database table with the CSV file. For this we can use the ‘Table lookup’ component, found in the “Transformation” menu. Here’s how I’ve set up the Table lookup:

In the screenshot above you can see that I’ve set the condition to be that “cvr” from the database needs to be the same as the “cvrnr” field of the file. In the output I’m interested in getting the following fields from the CSV file: Name (“navn”), address (“adresse”), zip code (“postnr”), phone number (“telefon”) and email address.
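The Table lookup behaves like a left join keyed on the CVR number. Here is a minimal Python sketch of the same logic – the column names (“cvr”, “cvrnr”, “navn”, “adresse”, “postnr”, “telefon”, “email”) are taken from the examples above, while the sample data and delimiter are my own assumptions:

```python
import csv
import io

# A stand-in for the CVR export file (semicolon delimiter is an assumption;
# adjust to match your actual file).
cvr_export = io.StringIO(
    "cvrnr;navn;adresse;postnr;telefon;email\n"
    "12345678;Acme A/S;Hovedgaden 1;8000;86123456;info@acme.dk\n"
)

# Index the file by CVR number, which is essentially what the
# Table lookup component does.
lookup = {row["cvrnr"]: row for row in csv.DictReader(cvr_export, delimiter=";")}

FIELDS = ("navn", "adresse", "postnr", "telefon", "email")

def enrich(customer):
    """Return the matching CVR record's fields, or all-None when not found."""
    hit = lookup.get(customer["cvr"])
    if hit is None:
        return {k: None for k in FIELDS}
    return {k: hit[k] for k in FIELDS}
```

A customer whose CVR number is not in the file simply gets `<null>` values back, exactly as we will see in the preview below.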

Now let’s see a preview of this Table lookup transformation:

There are a few interesting things to notice here:

  • Not all the CVR numbers yielded a result. The last record in the example above shows only <null> values, which means the CVR number was not found in the file.
  • If we compare with the input, the names are not identical, but pretty close. The same holds for addresses, zip codes and so on.
  • Potentially we could face a situation where the CVR number itself had a typo in it. Then we would get another company’s information. That situation could be quite severe, so we would want to perform some checks on e.g. the similarity of the name before we automatically update our customer database!

Let’s look at these issues step by step. To check the similarity, we want to get a score. A simple and free mechanism for similarity scoring / matching is available for DataCleaner through the “EasyDQ additionals” extension.

  • Go to the EasyDQ additionals page on the DataCleaner website.
  • Click the “Install” button on that page. DataCleaner will now flash and ask for permission to install. Give the permission.
  • Once installed, click “Transform -> EasyDQ additionals -> Similarity evaluator”.
  • Now set up the similarity check so that you compare e.g. names and addresses:

If we preview this transformation’s data, here’s what we see:

Notice the Similarity score to the right. This is a score of the textual difference between the values (0 = no similarity, 100 = textual equality). Obviously, we can set up different business rules based on this score. As you can see, even for basic company names there can be quite a difference between what you already know and what the “official” truth is. Therefore it is normal to automate only up to a certain point – many of these updates need to be validated by a human being first (or by richer matching capabilities, for instance provided by Human Inference’s commercial add-ons).
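The exact algorithm behind the EasyDQ similarity score isn’t documented here, but a rough stand-in for a 0-100 textual similarity score can be sketched with Python’s standard library:

```python
from difflib import SequenceMatcher

def similarity_score(a, b):
    """A 0-100 textual similarity score (100 = textual equality).

    This is a simple stand-in based on difflib, NOT the actual
    EasyDQ algorithm -- it only illustrates the idea of the score.
    """
    if a is None or b is None:
        return 0
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)
```

Identical strings score 100, completely different strings score close to 0, and small deviations (a typo, an abbreviated legal form) land somewhere in between.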

Let’s suppose we want a business rule like this:

  • When the similarity score is less than 40, there’s probably a typo in the registered CVR number or in the related name and address. In such cases we want to raise an alert to the customer’s account managers or a data steward team.
  • When the similarity score is above 85, we want to automatically update the customer database.
  • For anything in between (a similarity score of 40 to 85), we want to put the record into a staging table where it can be reviewed and propagated by a data inspection team.
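The three-way business rule above can be sketched as a simple routing function – the thresholds come from the rule, while the branch names are just illustrative labels:

```python
def route(score):
    """Route a record based on its similarity score (0-100)."""
    if score < 40:
        return "alert"    # probable typo in the CVR number: notify account managers / stewards
    if score > 85:
        return "update"   # safe to update the customer database automatically
    return "staging"      # 40..85: park for review by the data inspection team
```

Note that the boundary values 40 and 85 themselves fall in the staging branch, matching the Number range filter setup shown below.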

For the sake of this blog entry’s scope and length, I am only going to demonstrate the automatic update (similarity score above 85). The approach for handling the remaining branches should become apparent from this example.

First, I will add a filter to check the value of the similarity score. Using this filter we will direct the further action to be taken:

  • Click “Transform -> Filter -> Number range”.
  • On the configuration panel of the Number range filter, select the Similarity score column.
  • Set the “Lowest value” to 40, and the “Highest value” to 85.

  • Now click the “HIGHER” button and select “Write data… -> Update table”. 
  • On the configuration panel of the Update table component, select the customer database as the target datastore, and the schema and table of your customer data.
  • For the “Update condition” panel, select the id of both source and target to be mapped together. This will ensure that the update targets the same record as the input record.
  • For the “Values to update” panel, select the lookup values and map them back to the field names in the database.
  • After configuration, your “Update table” configuration panel will look like this:

Now if you click “Execute”, you will automatically update the customer database with fresh information from CVR! In many real-life scenarios I would add additional information to the table as well. Obviously that requires a bit more work, since you need to add new columns to the table definitions. Additional information would typically include a timestamp for the latest CVR check, a log of changes and/or overridden values, and newly found information from the CVR export – such as industry type, phone number, email, number of employees and more. These choices are always important to think about, but depend a lot on the level of data management you wish to apply to your data.
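Conceptually, the Update table component issues one UPDATE statement per record, keyed on the mapped id. Here is a sketch with sqlite3 standing in for the MySQL customer database – the table layout and values are hypothetical, but the id-based update condition and the mapping of looked-up CVR fields back to database columns mirror the configuration above:

```python
import sqlite3

# A miniature stand-in for the customer database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT, zip TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Acme', 'Old Road 9', '8000')")

# One lookup result that passed the similarity threshold (hypothetical values).
record = {"id": 1, "navn": "Acme A/S", "adresse": "Hovedgaden 1", "postnr": "8000"}

# "Update condition": id maps to id.
# "Values to update": the looked-up CVR fields map back to the DB columns.
conn.execute(
    "UPDATE customer SET name = ?, address = ?, zip = ? WHERE id = ?",
    (record["navn"], record["adresse"], record["postnr"], record["id"]),
)
conn.commit()
```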

I hope these three blog entries have been both educational and interesting to read. As more and more open data becomes available to everyone, I see a natural fit for open source software to provide the processing power to handle it. In the DataCleaner team we will follow this trend closely and provide you with the best tooling to utilize the universe of data that is rapidly becoming available in nations, industries, communities and cyberspace.


Cleaning Danish customer data with the CVR registry and DataCleaner, part 2 of 3

In my previous blog post we learned that the Danish government has initiated a progressive strategy to make basic public data freely available. Now let’s see how we can use this data to cleanse our own customer data. I will be using DataCleaner to pre- and post-process the data needed for the cleansing operation.

We’ll assume you have some customer database with (among other information) a CVR number for (some of) your customers. Here’s a small example MySQL database table that I will use in this blog:

Alternatively you can also use a “P number”, identifying the physical location of the company. As a preparation step we need to build a list of these numbers only, which will be sent to the CVR “Masseudtræk” (mass extraction) service.

Download DataCleaner, start it up and register your customer database as a datastore. We will connect to the MySQL database, containing customer information, but you should connect to whatever datastore your customer information resides in:

Once the database is registered as a datastore, select it with the checkbox to the left, and click the big “Analyze!” button. This will open up the job editing mode of DataCleaner. Expand the tree on the left side to expose your table structure. Here’s what this looks like on my example database:


To create the list of CVR numbers we only need the column containing this information. In my case, this is the “cvr” column. So I double-click this column, which adds it to the “Source” tab on the right side of the screen. Before we export the list, we want to make sure we only get the Danish customers in our result. So I also add the country column:

To perform the exclusion of non-Danish countries, we need to use an Equals filter. Click “Transform -> Filter -> Equals”. This will add a new tab, in which we can select the column to do the filtering on, as well as the values that we’re looking for. In my database it seems that the country column contains both “Denmark” and “DK” values, so I add them both:

With this filter applied, we’re actually ready to write the export file. We’re interested in all records that evaluate to “VALID” in this filter (i.e. equal to either “DK” or “Denmark”), so click this button and select “Set as default requirement”. Now click the “Write data” item in the top menu, and select “Create CSV file”:
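In plain code, the Equals filter with the VALID requirement amounts to keeping only the rows whose country matches one of the configured values. A sketch (the customer rows are hypothetical, the column names follow the example table):

```python
# Example customer rows (hypothetical data).
customers = [
    {"cvr": "12345678", "country": "DK"},
    {"cvr": "23456789", "country": "Denmark"},
    {"cvr": "99999999", "country": "Germany"},
]

# The two spellings found in the country column, as configured in the filter.
DANISH = {"DK", "Denmark"}

# The VALID outcome of the Equals filter: country equals one of the values.
danish_customers = [row for row in customers if row["country"] in DANISH]
```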

This will bring up a new tab for configuring the exported CSV file. Fill in a name for the CSV file, and select a whitespace character as the “Quote character” (since the CVR system does not currently support quote characters in the input):

Now click the big “Execute” button. The job will run and produce a text file similar to:

Notice the initial “cvr” header. This is added by DataCleaner to keep track of column names, but in this context it is a little inconvenient, since the CVR system requires no headers. So delete that line.
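After deleting the header, the upload file for the Masseudtræk service is simply one CVR number per line, with no header and no quoting. For reference, the same content could be produced by hand like this (the numbers are hypothetical):

```python
def cvr_upload_content(cvr_numbers):
    """One CVR number per line, no header, no quote characters --
    the input format expected by the CVR Masseudtraek service."""
    return "\n".join(cvr_numbers) + "\n"

content = cvr_upload_content(["12345678", "23456789"])
```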

Now you can go to the CVR website and click “Masseudtræk”. Upload your file using the web form, and wait for it to be processed:

Note: Lately I’ve experienced quite a bit of downtime on this service, so if you get an error message at this point, you’re not alone. I imagine the CVR registry will become more stable once they get used to the new load on their servers ;-)

When the file has been processed, the resulting company information will be available for you in your email inbox. Click the link in the email and download the file on your local disk. It will look something like this, if you open it in a text editor:

Now you have an up-to-date, high-quality file with company information for your customers. This includes both information that you already might have (like address information), but potentially also new data that you did not already have (like phone number, industry code, number of employees etc.).

In my next blog post I will go through the process of writing this information to your customer database – updating it with the most recent information, and enriching it with information that was never there before!

Cleaning Danish customer data with the CVR registry and DataCleaner, part 1 of 3


Recently the Danish government has initiated a strategy to open up and make freely available basic public master data on a wide range of areas such as company registrations (CVR), social security (CPR), real estate, employment, environment and more. The initiative’s name is “Good basic data for everyone – a driver for growth and efficiency” – read the full report here.

Over the coming years this means that public registries will be made available to the public. Of course privacy laws still apply, and the data is exposed in a manner where access is only granted in connection with specific search criteria. Still, this is a huge step toward liberalizing public data and making it available to everyone.

At Human Inference we recognize the value of good data quality, and we believe that this initiative by the Danish government will do a lot of good for both the public sector and the private sector in Denmark. In the government report, it is estimated that more than 100 million EUR per year in costs will be saved for the private and public sector in Denmark.

The timeline of the initiative spans many years of work, and it will be interesting to see how the end result is shaped. A notion of a central Data Distributor service is included in the plans, which will make data access streamlined and uniform to further improve availability. Personally I think this is a major step forward, and I cannot wait for this plan to be realized.


The first steps have already been taken though! As of the first of January 2013, the price of accessing the CVR (company identification) registry was set to 0 DKK – free for everyone who accepts the terms and conditions. CVR, in my opinion, is one of two very central registries when it comes to common data quality issues in Denmark. The other one is the CPR registry for person identification and social security, which according to the plans will be opened in 2014 using the new Data Distributor platform.


So how do you start using the CVR data? In my next blog post (which will appear next Monday) I will give you a tutorial on how to cleanse your company customer data with CVR, using DataCleaner for all the pre- and post-processing needed. So stay tuned for part 2, which will be a real hands-on walkthrough!


Data Quality Analysis – It requires a bit of all worlds

Doing a Data Quality Analysis (DQA) is a challenging task. You need to get your head into the domain of the business to understand what the data is all about. You need to talk to the users in the organization to understand how they work with data. And within hours you’ll most probably have a dozen different data sources that you need to dig into.

Tools, tools, tools
The DQA is not trivial – quite the opposite – so you’ll often see that tool support is lacking. The analyst himself will have to use a toolset that is just as diverse and uncontrollable as the data he is trying to manage. The problem with such an approach is that it will eventually get in your way, because you’re trying to get 2-3 independent tools to work nicely together instead of just having these functions available where you need them.

I don’t mind combining tools at all, but we have to do so with care and acknowledge that combining tools also adds a lot of constraints to our working process. Let us for example say that you’re doing an analysis of string patterns in a set of company names. You notice a piece of the pattern that turns out to be legal forms like GmbH, Ltd, A/S and so on. You want to separate the legal form from the company name, but switching between tools that do the pattern finding (a profiling tool) and the separation (a transformation, perhaps even ETL, tool) means that you have to go back to step 1 in your workflow and re-do all the steps in different tools. If your chain of analysis steps is more than just a few steps long, you’re in for a lot of waste.
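As a concrete illustration of that separation step, splitting a legal form off a company name can be done with a small regular expression – the list of legal forms below is illustrative, not exhaustive:

```python
import re

# Common legal forms; extend this tuple as needed for your data.
LEGAL_FORMS = ("GmbH", "Ltd", "A/S", "ApS", "Inc", "B.V.")

# Lazily match the name, then a separator, then one of the legal forms,
# optionally followed by a trailing dot.
_pattern = re.compile(
    r"^(?P<name>.+?)[\s,]+(?P<form>"
    + "|".join(re.escape(f) for f in LEGAL_FORMS)
    + r")\.?$"
)

def split_legal_form(company):
    """Return (name, legal_form); legal_form is None when not recognized."""
    m = _pattern.match(company.strip())
    if m:
        return m.group("name"), m.group("form")
    return company.strip(), None
```

The point of the post stands, though: being able to do this transformation right next to the profiling step, in the same tool, is what saves the re-work.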