Today is a memorable day for data quality in the Netherlands. Exactly two hundred years ago, on August 18, 1811, the French emperor (and occupier) Napoleon Bonaparte issued the decree that all citizens of the northern provinces of the Netherlands were to choose a surname. This name was very useful in the municipal registers of the Dutch inhabitants: how else could the French army know which lad to draw for military service, or which peasant to pursue for taxes? Continue reading ‘200 years of family names’
What is equal? – challenges with sound and synonyms
What do you do when basic string comparison (fuzzy search) techniques don’t give the right results? Fuzzy search helps to find matches in situations where people make typos (e.g. Human Inference versus Human Inverence), make up abbreviations (King str. versus King street) or ignore diacritics (Sørensen versus Soerensen). If the ‘wrong word’ is not an existing word, it is obvious that we have a match once the typo is corrected.
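To make the three cases above concrete, here is a minimal sketch in Python. It is not how any particular product does it: the abbreviation table, the transliteration table and the threshold of one edit are assumptions chosen purely for illustration, and the comparison uses the classic Levenshtein edit distance.

```python
import unicodedata

# Hypothetical abbreviation table; a real system would use a curated list.
ABBREVIATIONS = {"str.": "street", "str": "street"}

# Letters that Unicode decomposition does not reduce to ASCII
# (assumed transliterations, for illustration only).
TRANSLITERATIONS = str.maketrans({"ø": "oe", "æ": "ae", "ß": "ss"})


def normalize(text: str) -> str:
    """Lowercase, expand known abbreviations and fold diacritics."""
    words = [ABBREVIATIONS.get(word, word) for word in text.lower().split()]
    folded = " ".join(words).translate(TRANSLITERATIONS)
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_fuzzy_match(a: str, b: str, max_distance: int = 1) -> bool:
    """True if the normalized strings differ by at most max_distance edits."""
    return edit_distance(normalize(a), normalize(b)) <= max_distance
```

With these assumptions, all three example pairs from the paragraph above come out as matches, while two clearly different strings do not.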
More challenges appear if the typo produces another existing word; now we need to decide how equal the two entries are. If you have some knowledge of how frequently words are used, you can use that in the equation. How to get those word frequencies is another ballgame, but at least you can assume that a ‘wrong word’ is never used (a bit of a paradox). Continue reading ‘What is equal? – challenges with sound and synonyms’
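One way to put word frequency "in the equation" is sketched below. The scoring rule and the counts are invented for illustration; they are assumptions, not the method described in the full post. The idea: if one of two near-identical spellings is never used, treat it as a typo; if both are in use, the closer their frequencies, the more likely they are genuinely different words.

```python
from typing import Mapping


def same_word_score(a: str, b: str, frequency: Mapping[str, int]) -> float:
    """Rough score in [0, 1] that two near-identical spellings mean the same word.

    Assumes the caller has already established that a and b are close
    (for example, one edit apart).
    """
    if a == b:
        return 1.0
    freq_a, freq_b = frequency.get(a, 0), frequency.get(b, 0)
    if freq_a == 0 or freq_b == 0:
        # At least one spelling is never used: most likely a typo.
        return 1.0
    # Both spellings exist: similar frequencies suggest two distinct words.
    return 1.0 - min(freq_a, freq_b) / max(freq_a, freq_b)


# Made-up counts, for illustration only.
FREQUENCY = {"jansen": 5000, "janssen": 4800}
```

Here `same_word_score("jansen", "jasnen", FREQUENCY)` gives 1.0 because "jasnen" never occurs, whereas "jansen" against "janssen" scores close to zero: both are common, so they are probably two different names.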
Has your name ever hurt you? – when nomen becomes omen
Addressing clients with the right data often means the difference between making a profit and not making a profit. Working with data quality experts has made me ever more conscious of the value personal data represents for people. In this respect names are especially intriguing to me, as owners appear to identify with their name a lot. So I decided to do a little research and determine if people really are what their name tells you. Can nomen indeed become omen?
Your parents probably gave a lot of thought to the name they once gave you, and as it turns out they were right to do so! Research tells us a name can do wonders for its owner, as well as a lot of damage for that matter. Let’s have a look at some remarkable results.
Peter for President!
Recent studies show that in the US a student called Fred is more likely to fail his exam than a student who just happened to be named Andrew: people tend to identify with their name and, in general, have a positive feeling about letters that correspond with their initials. Consequently Fred is far more likely to settle for a meager F, while Andrew will have an extra motive to strive for an A. Continue reading ‘Has your name ever hurt you? – when nomen becomes omen’
Marketing? – Let your ingredients interact!
Throughout the years Human Inference has carried out and supported research with regard to the importance, the impact and the perception of customer data quality in business environments. This research shows that the perception of customer data quality is shifting. In general, one could argue that, in the early years, data quality used to be perceived as “something that is being carried out by the IT-department”, whereas nowadays more and more companies and organizations are recognizing the importance of customer data and information quality. Issues and initiatives such as the value of a single customer view, data integration, fraud prevention, customer relationship management, operational risk management, compliance and anti-terrorism have become boardroom themes. Continue reading ‘Marketing? – Let your ingredients interact!’
DataCleaner adds expert cleansing functions- added value in Open Source
In late 2009, in their report Who’s Who in Open-Source Data Quality, Andreas Bitterer and Ted Friedman of Gartner already pointed to DataCleaner as a promising tool: a tool that, in their opinion, could certainly improve by offering more high-end cleansing functions and by improving the rather basic user experience.
Since then, a lot has happened in the DataCleaner space and in the profiling market. Before the launch of version 2 we announced the acquisition of eobjects.org (DataCleaner) by Human Inference. Some of you may have been curious about what would happen to the functionality; as we stated at the time, we would continue with the community and further participate in and expand it. Under the flag of Human Inference we launched the renewed DataCleaner 2.0, in which we definitely improved the user experience with an enhanced user interface and the ability to define filters and filter flows. Filter flows show their benefit when you analyze your data source and want to create new (temporary) data sources based on matching criteria. You can do that either manually or in a completely automated way to monitor your data.
With Open Source in general, and with DataCleaner in particular, we want the community to contribute to the functionality of the product. DataCleaner has long included the RegexSwap: the community where you can share regular expressions. Why should everybody reinvent the wheel by building the same regular expressions for credit card checks, email addresses, etc.?
Next to regular expressions that can be used to profile data, there is a need for data cleansing functions that contain much more business logic than a regular expression can reasonably cover. For example, validating that the syntax of an email address is correct is something else than validating that a running mail server is attached to the domain. Cleansing functions are already part of DataCleaner, but there is always a need for other or more advanced functional extensions. To spare you from having to create them in the ‘DataCleaner’ way, we have created an easy extension sharing mechanism. Continue reading ‘DataCleaner adds expert cleansing functions- added value in Open Source’
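The difference between the two kinds of email validation can be sketched in a few lines of Python. This is an illustration only and does not use the DataCleaner extension API: the pattern is deliberately simple, and `domain_resolves` is a hypothetical stand-in that only checks whether the domain resolves in DNS, which is weaker than confirming a running mail server.

```python
import re
import socket
from typing import Callable

# Deliberately simple pattern: one "@", no whitespace, a dot in the domain.
EMAIL_SYNTAX = re.compile(r"^[^@\s]+@([^@\s]+\.[^@\s]+)$")


def domain_resolves(domain: str) -> bool:
    """Stand-in for a mail server check: only tests that the domain resolves."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (OSError, UnicodeError):
        return False


def check_email(address: str,
                domain_check: Callable[[str], bool] = domain_resolves) -> str:
    """Return 'invalid syntax', 'unknown domain' or 'ok'."""
    match = EMAIL_SYNTAX.match(address)
    if match is None:
        return "invalid syntax"       # what a regular expression can tell you
    if not domain_check(match.group(1)):
        return "unknown domain"       # what needs logic beyond a regex
    return "ok"
```

The second step is passed in as a function so that a stricter check (for example a real MX lookup) can be swapped in without touching the syntax check.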
Centenarians in Greece, Zimbabwe and the quality of birth dates

Population distribution in The Netherlands
This week started with a remarkable news item on the number of dead Greeks still drawing a pension. The 9,000 centenarians (people aged 100 or older) in particular give the feeling that something might be wrong. At second glance, looking at the statistics for Europe, I held my horses: France, Spain, Italy, Germany and the UK also have a significant number of older people.
Anyway, while my mind was still boggling over these centenarians in Greece, a new news item popped up: an article on the statistics of the voters’ lists of Zimbabwe. 41,100 potential voters in Zimbabwe are centenarians, four times more than currently in the UK, while the population of the UK is approximately five times that of Zimbabwe! And this while the average life expectancy in Zimbabwe has fallen to 44.8 years. Even more extreme is the number of 16,800 potential voters aged 110, all born on January 1st, 1901.
I realize that pointing to both these news items might raise your eyebrows. I have no wish to make any sort of political statement; I leave that to you.
For us, people living in the data quality world, these items trigger a question: how can we identify this kind of weird data manipulation on dates? When we profile our customers’ data sources, we are familiar with checks on certain date-related things (no rocket science), for example: is the date written in US or European style (mm/dd/yyyy or dd/mm/yyyy), are we dealing with two or four digits for the year, is the birth date before the current date, is the marriage date after the birth date, etc.
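The checks listed above could look roughly like this in Python. It is a sketch under assumptions: dates arrive as slash-separated strings, and the function names and issue messages are made up for this example.

```python
from datetime import date, datetime
from typing import List, Optional


def guess_day_first(values: List[str]) -> Optional[bool]:
    """Guess dd/mm versus mm/dd for a column: a field above 12 must be a day."""
    day_first_votes = month_first_votes = 0
    for value in values:
        parts = value.split("/")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue
        first, second = int(parts[0]), int(parts[1])
        if first > 12 >= second:
            day_first_votes += 1
        elif second > 12 >= first:
            month_first_votes += 1
    if day_first_votes == month_first_votes:
        return None  # the column gives no clear answer
    return day_first_votes > month_first_votes


def parse_date(text: str, day_first: bool) -> Optional[date]:
    """Parse a four-digit-year date, or return None if it does not fit."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def date_issues(birth_text: str, marriage_text: Optional[str],
                day_first: bool, today: date) -> List[str]:
    """Return a list of plausibility problems for one record."""
    if len(birth_text.rsplit("/", 1)[-1]) == 2:
        return ["two-digit year"]
    birth = parse_date(birth_text, day_first)
    if birth is None:
        return ["unparseable birth date"]
    issues = []
    if birth >= today:
        issues.append("birth date not in the past")
    if marriage_text:
        marriage = parse_date(marriage_text, day_first)
        if marriage is None:
            issues.append("unparseable marriage date")
        elif marriage <= birth:
            issues.append("marriage date not after birth date")
    return issues
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check repeatable when the same data source is profiled again later.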
We are also used to peaks at certain dates. A notorious one is January 1st of any year: on the one hand because it’s the default in many entry screens, on the other hand because in some cultures the birth date itself is not that important. People from these cultures put more emphasis on name days and may not remember their day of birth. When they are suddenly forced to give one, they or someone else chooses a default. Continue reading ‘Centenarians in Greece, Zimbabwe and the quality of birth dates’
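Spotting such peaks is a matter of counting. The sketch below flags any day of the year that occurs far more often than an even spread of birthdays would predict; the factor of 10 is an arbitrary assumption for illustration, not a recommended threshold.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def january_first_share(birth_dates: Iterable[date]) -> float:
    """Fraction of records whose birth date falls on January 1st."""
    dates = list(birth_dates)
    if not dates:
        return 0.0
    hits = sum(1 for d in dates if (d.month, d.day) == (1, 1))
    return hits / len(dates)


def peak_days(birth_dates: Iterable[date],
              factor: float = 10.0) -> Dict[Tuple[int, int], int]:
    """Return (month, day) pairs seen at least `factor` times more often
    than a uniform spread over the year would predict."""
    counts = Counter((d.month, d.day) for d in birth_dates)
    expected = sum(counts.values()) / 365.25
    return {day: n for day, n in counts.items() if n >= factor * expected}


# Synthetic sample: 50 records on 1 January 1901 plus 950 spread-out dates.
sample = [date(1901, 1, 1)] * 50 + [
    date(1950 + i % 40, 1 + i % 12, 2 + i % 27) for i in range(950)
]
```

On this synthetic sample, `peak_days(sample)` singles out January 1st, and `january_first_share(sample)` shows that 5% of all records share that one birthday, where roughly 0.3% would be expected.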
