Cleaning Danish customer data with the CVR registry and DataCleaner, part 3 of 3

In my previous two blogs I’ve dealt with the background of the Danish CVR registry being made available to everyone (and more public registries to come) as well as how to do the initial pre-processing steps with DataCleaner to fetch CVR data about your customers. We ended up with a CSV file from the CVR service that looks like this:

In this blog entry I will go through the process of integrating this data in your customer database. Let’s first visit a few examples of why that would be relevant:

  • You can make sure that your address is up-to-date and that letters and invoices to your customer reach their intended destination.
  • You can add more communication channels. The CVR registry contains email, phone and other basic communication information. Having this information available will improve the capabilities and reach of your support or customer service teams.
  • You can make it really easy for your customer. Instead of asking the customer for all kinds of information again and again, you can enter the customer’s name and let the data quality infrastructure handle the rest.
  • You can gain valuable insights with Business Intelligence tools, integrating the customer industries, company size and other metadata about the companies – provided through CVR.

The first step is to register our new file as a datastore in DataCleaner. That part is easily done using the icon that says CSV on the front screen of the application. Once that is done, like previously, we will connect to our customer database with DataCleaner using the ‘Analyze’ button. What we want to do is technically to join the database table with the CSV file. For this we can use the ‘Table lookup’ component, found in the “Transformation” menu. Here’s how I’ve set up the Table lookup:

In the screenshot above you can see that I’ve set the condition to be that “cvr” from the database needs to be the same as the “cvrnr” field of the file. In the output I’m interested in getting the following fields from the CSV file: Name (“navn”), address (“adresse”), zip code (“postnr”), phone number (“telefon”) and email address.

Now let’s see a preview of this Table lookup transformation:

There’s a few interesting things to notice here:

  • Not all the CVR numbers yielded a result. The last record in the example above shows only  <null> values. This means that the CVR number was not found in the file. 
  • If we compare with the input, the names are not totally alike, but pretty close. The same holds for addresses, zip codes and so on. 
  • Potentially we could face a situation where the number itself had a typo in it. Then we would get another company’s information. That situation could be quite severe, so we would want to do some checks on e.g. the similarity of the name before we automatically update our customer database!

Let’s look at these issues step by step. To check the similarity, we want to get a score. A simple and free mechanism for similarity scoring / matching is available for DataCleaner through the “EasyDQ additionals” extension.

  • Go to the EasyDQ additionals page on the DataCleaner website.
  • Click the “Install” button on that page. DataCleaner will now flash and ask for permission to install. Give the permission.
  • Once installed, click “Transform -> EasyDQ additionals -> Similarity evaluator”.
  • Now set up the similarity check so that you compare e.g. names and addresses:

If we preview this transformation’s data, here’s what we see:

Notice the Similarity score to the right. This is a score of the textual differences between the different values (0 = no similarity, 100 = textual equality). Obviously, we can set up different business rules based on this score. As you can see even in basic company names, there might be quite a difference between what you already know and what the “official” truth is. Therefore it is normal to only automate up to a certain point – many of these updates need to be validated by a human being first (or richer matching capabilities, for instance provided by Human Inference’scommercial add-ons).

 Let’s suppose we want a business rule like this:

When similarity score is less than 40, there’s probably a typo in the registered CVR number or the related name and address. In such cases we want to raise an alert to the customer’s account managers or a data steward team.

  • When the similarity score is above 85, we want to automatically update the customer database.
  • Anything in between a similarity score of 40 and 85, we want to put it into a staging table where it can be reviewed and propagated by a data inspection team.

For the sake of this blog entry’s scope and length, I am only going to demonstrate the automatic update (similarity score above 85). The approach for handling the remaining subroutines should become visible through this example.

First, I will add a filter to check the value of the similarity score. Using this filter we will direct the further action to be taken:

Click “Transform -> Filter -> Number range”. 

  • On the configuration panel of the Number range filter, select the Similarity score column.
  • Set the “Lowest value” to 40, and the “Highest value” to 85.

  • Now click the “HIGHER” button and select “Write data… -> Update table”. 
  • On the configuration panel of the Update table component, select the customer database as the target datastore, and the schema and table of your customer data.
  • For the “Update condition” panel, select the id of both source and target to be mapped together. This will ensure that the update targets the same record as the input record.
  • For the “Values to update” panel, select the lookup values and map them back to the field names in the database.
  • After configuration, your “Update table” configuration panel will look like this:
Now if you click execute, you will automatically update the customer database with updated information from CVR! In many real-life scenarios, I would add additional information to the table as well. Obviously that requires a bit more, since you need to add new columns to the table definitions. Additional information would typically include a timestamp for the latest CVR check, a log of changes and/or overridden values and newly found information from the CVR export – such as industry type, phone number, email, number of employees and more. These choices are always important to think of, but depend a lot on the level of data management you wish to apply to your data.

I hope these three blog entries have been both educational and interesting to read. As more and more open data becomes available to everyone, I see a natural fit for open source software to provide the processing power to handle it. In the DataCleaner team we will follow this trend closely and provide you with the best tooling to utilize the universe of data that is rapidly becoming available in nations, industries, communities and cyberspace.



0 Responses to “Cleaning Danish customer data with the CVR registry and DataCleaner, part 3 of 3”

Comments are currently closed.