What is equal? – challenges with sound and synonyms

What do you do when basic string comparison (fuzzy search) techniques won’t give the right results? Fuzzy search helps to find matches where people make typos (e.g. Human Inference versus Human Inverence), make up abbreviations (King str. versus King street) or ignore diacritics (Sørensen versus Soerensen). If the ‘wrong word’ is not an existing word, it is obvious that we have a match once the typo is corrected.
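To make the three cases above concrete, here is a minimal sketch in Python. It is not how any particular product does it: the `TRANSLITERATIONS` table and the function names `normalize` and `similarity` are assumptions made for this illustration, and the score comes from the standard library's `difflib`, which is only one of many possible similarity measures.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical transliteration table for letters that Unicode decomposition
# does not split into base letter + accent (extend per language as needed).
TRANSLITERATIONS = {"ø": "oe", "æ": "ae", "å": "aa", "ß": "ss"}


def normalize(text: str) -> str:
    """Lower-case, transliterate special letters and strip remaining accents."""
    text = text.lower()
    for source, target in TRANSLITERATIONS.items():
        text = text.replace(source, target)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 (nothing in common) and 1.0 (equal)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    pairs = [
        ("Human Inference", "Human Inverence"),
        ("King str.", "King street"),
        ("Sørensen", "Soerensen"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

With this sketch the diacritics pair is treated as identical after normalization and the typo pair scores above 0.9, while the abbreviation scores lower. Where to put the threshold for "equal" remains a business decision.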

More challenges appear if the typo produces another existing word; now we need to decide how equal the two entries are. If you have some knowledge of how frequently words are used, you can include that in the equation. How to obtain word frequencies is another ballgame – at least you can assume that a ‘wrong word’ is never used (a bit of a paradox).

Has your name ever hurt you? – when nomen becomes omen

Addressing clients with the right data often means the difference between making a profit and not making one. Working with data quality experts has made me ever more conscious of the value personal data represents for people. In this respect names are especially intriguing to me, as owners appear to identify with their name a lot. So I decided to do a little research and determine if people really are what their name tells you. Can nomen indeed become omen?

Your parents probably gave a lot of thought to the name they once gave you, and as it turns out they were right to do so! Research tells us a name can do wonders for its owner, as well as a lot of damage for that matter. Let’s have a look at some remarkable results.

Peter for President!
Recent studies show that in the US a student called Fred is more likely to fail his exam than a student who just happens to be named Andrew: people tend to identify with their name and, in general, have a positive feeling about letters that correspond with their initials. Consequently Fred is far more likely to settle for a meager F, while Andrew will have an extra motive to strive for an A.

Marketing? – Let your ingredients interact!

 

Throughout the years Human Inference has carried out and supported research into the importance, the impact and the perception of customer data quality in business environments. This research shows that the perception of customer data quality is shifting. In general, one could argue that in the early years data quality used to be perceived as “something that is being carried out by the IT-department”, whereas nowadays more and more companies and organizations recognize the importance of customer data and information quality. Issues and initiatives such as the value of a single customer view, data integration, fraud prevention, customer relationship management, operational risk management, compliance and anti-terrorism have become boardroom themes.

DataCleaner adds expert cleansing functions – added value in Open Source

In late 2009, in their report Who’s Who in Open-Source Data Quality, Andreas Bitterer and Ted Friedman of Gartner already pointed to DataCleaner as a promising tool. A tool that, in their opinion, could certainly improve by offering more high-end cleansing functions and by improving the rather basic user experience.

Since then, a lot has happened in the DataCleaner space and in the profiling market. Before the launch of version 2 we informed everybody about the acquisition of eobjects.org, or DataCleaner, by Human Inference. Some of you may have been curious about what would happen to the functionality; as stated at the time, we continue with the community and will further participate in it and expand it. Under the flag of Human Inference we launched the renewed DataCleaner 2.0, in which we definitely improved the user experience with an enhanced user interface, together with the possibility to define filters or filter flows. Filter flows show their benefit if you analyze your data source and want to create new (temporary) data sources based on matching criteria. You can do that either manually, or in a completely automated way to monitor your data.

With Open Source in general, and with DataCleaner in particular, we want the community to participate in the functionality of the product. DataCleaner has long included the RegexSwap: the community where you can share regular expressions. Why would everybody reinvent the wheel to build a regular expression for credit card checks, emails, etc.?

Next to regular expressions that can be used to profile data, there is a need for data cleansing functions that contain much more business logic than can be covered in a regular expression. For example, validating that the syntax of an email address is correct is something else than validating that a running mail server is attached to the domain. Cleansing functions are already part of DataCleaner, but there is always a need for other or more advanced functional extensions. So that you do not need to create them in the ‘DataCleaner’ way, we have created an easy extension sharing mechanism.
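The difference between the two levels of email validation can be sketched as follows. This is plain Python, not DataCleaner code or its extension API. The pattern `EMAIL_SYNTAX` is deliberately simplistic, and `has_mail_server` is a hypothetical callback: in a real cleansing function it would perform a DNS or SMTP lookup, which is left out here to keep the example self-contained.

```python
import re
from typing import Callable

# Deliberately simple pattern (an assumption for this sketch): exactly one "@",
# a non-empty local part and a domain with at least one dot.
EMAIL_SYNTAX = re.compile(r"^[^@\s]+@([^@\s.]+\.)+[^@\s.]+$")


def check_email(address: str, has_mail_server: Callable[[str], bool]) -> str:
    """Classify an address as 'invalid syntax', 'no mail server' or 'ok'."""
    # Level 1: something a regular expression can decide.
    if not EMAIL_SYNTAX.match(address):
        return "invalid syntax"
    # Level 2: business logic that needs knowledge outside the string itself.
    domain = address.rsplit("@", 1)[1]
    if not has_mail_server(domain):
        return "no mail server"
    return "ok"


if __name__ == "__main__":
    # Stand-in for a real lookup: pretend only this domain has a mail server.
    known_domains = {"example.com"}
    lookup = lambda domain: domain in known_domains
    for candidate in ("info@example.com", "info@example.invalid", "not-an-email"):
        print(candidate, "->", check_email(candidate, lookup))
```

Passing the lookup in as a function keeps the regular expression part and the "expert" part separate, which is the distinction the paragraph above makes.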

Centenarians in Greece, Zimbabwe and the quality of birth dates

Population distribution in The Netherlands

This week started with a remarkable news item on the number of dead Greeks still drawing a pension. Especially the 9000 centenarians (people aged 100 or older) give the feeling that something might be wrong. On second thought, looking at the statistics for Europe, I held my horses – France, Spain, Italy, Germany and the UK also have a significant number of older people.

Anyway, as my mind was still boggling over these centenarians in Greece, a new news item popped up: an article on the statistics of the voters’ lists of Zimbabwe. 41,100 potential voters in Zimbabwe are centenarians, 4 times more than currently in the UK, while the population of the UK is approximately 5 times that of Zimbabwe! And this at a time when the average life expectancy in Zimbabwe has fallen to 44.8 years. Even more extreme is the number of 16,800 potential voters aged 110, all born on January 1st 1901.

I realise that pointing to both these news items might raise your eyebrows. I very much want to avoid making any sort of political statement; I leave that to you.

For us, people living in the data quality world, these items trigger a question: how can we identify this kind of weird data manipulation on dates? When we profile our customers’ data sources we are familiar with checks on certain date-related things – no rocket science – for example: is the date written in US or European style (mm/dd/yyyy or dd/mm/yyyy), are we dealing with two or four digits for the year, is the birth date before the current date, is the marriage date after the birth date, etc.
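The checks listed above can be sketched in a few lines of Python. The function names `possible_styles`, `year_digits` and `date_issues` are made up for this illustration, and the sketch assumes dates are written with slashes; a real profiler would handle many more formats.

```python
from datetime import date, datetime
from typing import List, Optional


def possible_styles(text: str) -> List[str]:
    """Return which styles (US mm/dd/yyyy, European dd/mm/yyyy) can parse the text."""
    styles = []
    for name, fmt in (("US", "%m/%d/%Y"), ("European", "%d/%m/%Y")):
        try:
            datetime.strptime(text, fmt)
            styles.append(name)
        except ValueError:
            pass  # not a valid date in this style
    return styles


def year_digits(text: str) -> int:
    """Count the digits used for the year in a slash-separated date string."""
    return len(text.rsplit("/", 1)[-1])


def date_issues(birth: date, marriage: Optional[date], today: date) -> List[str]:
    """Apply the plausibility checks to already parsed dates."""
    issues = []
    if birth >= today:
        issues.append("birth date is not before the current date")
    if marriage is not None and marriage <= birth:
        issues.append("marriage date is not after the birth date")
    return issues


if __name__ == "__main__":
    print(possible_styles("13/02/1980"))  # day 13 rules out the US reading
    print(possible_styles("03/04/1985"))  # ambiguous: both styles parse
    print(year_digits("03/04/85"))
    print(date_issues(date(1990, 5, 1), date(1985, 6, 1), date(2011, 5, 30)))
```

Note that a single value such as 03/04/1985 cannot tell you the style on its own; in practice you look at the whole column and see which style parses every value.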

We are also used to peaks at certain dates. A notorious one is January first of any year: on the one hand because it is the default in many entry screens, on the other hand because in some cultures the birth date itself is not that important – people from these cultures put more emphasis on name days and won’t remember their day of birth. When all of a sudden they are forced to give one, they or someone else chooses a default.
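Such peaks are easy to surface once you count birth dates per calendar day. The sketch below is one possible approach, not a prescribed method: it assumes birthdays are roughly evenly spread over the year, and the `factor` and `min_count` thresholds are arbitrary illustrative values.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def suspicious_days(
    birth_dates: Iterable[date], factor: float = 5.0, min_count: int = 5
) -> List[Tuple[Tuple[int, int], float]]:
    """Return (month, day) pairs that occur far more often than expected.

    Each result carries the ratio between the observed count and the count
    expected if birthdays were spread evenly over the year.
    """
    dates = list(birth_dates)
    if not dates:
        return []
    counts = Counter((d.month, d.day) for d in dates)
    expected = len(dates) / 365.25
    peaks = [
        (day, count / expected)
        for day, count in counts.items()
        if count >= min_count and count > factor * expected
    ]
    # Strongest peak first.
    return sorted(peaks, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample: one record per day for ten months, plus 50 defaults.
    sample = [date(1950, m, d) for m in range(2, 12) for d in range(1, 29)]
    sample += [date(1901, 1, 1)] * 50
    for (month, day), ratio in suspicious_days(sample):
        print(f"{month:02d}-{day:02d} occurs {ratio:.0f}x more often than expected")
```

The same counting works per full date instead of per calendar day, which would also reveal clusters like a single birth date shared by thousands of records.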

Expected: Continuous rise and fall of Social Networks

Last week LinkedIn went public. With the enormous growth of social networks like LinkedIn, Facebook and the like, and the commercial value they (virtually) represent, nobody can deny that these networks are booming.

A couple of years ago there were many local / specialised players, and during their first hiccups some of them lost the public’s attention while others quickly moved on. The ones that survived this consolidation battle are now growing for the big money. They have linked communities together that were previously pretty hard to access, and they even reconnected people who had lost sight of each other. A great phenomenon with completely new dynamics. I want to emphasize two of these new dynamics:

  1. Access the network, not the individual.
    Via the social networks it becomes much easier to connect to people that might be interested in your products or services. And the known relations between individuals provide insight into who else might become potential targets or groups. There is less need to get direct personal details from individuals before you can contact them. What you need is a good advertisement for your target audience: the moment they open a page you need to convince them to come to you.
  2. Use your network id to access everything.
    For the individual there is the benefit of using your social network id to identify yourself on other internet pages or services. Providing your personal information again and again is a thing of the past. Simply enter, e.g., your Facebook authentication and retrieve your ticket or order your goods.