Expected: Continuous rise and fall of Social Networks

Last week LinkedIn went public. With the enormous growth of social networks like LinkedIn, Facebook and the like, and the commercial value they (virtually) represent, nobody can deny that these networks are booming.

A couple of years ago there were many local or specialised players. During their first hiccups some of them lost the public's attention, while others quickly moved ahead. The ones that survived this consolidation battle are now growing for the big money. They have linked communities that were previously pretty hard to access, and they have even reconnected people who had lost sight of each other. A great phenomenon with completely new dynamics. I want to emphasize two of these new dynamics:

  1. Access the network, not the individual.
    Via the social networks it becomes much easier to connect to people who might be interested in your products or services. And the known relations between individuals provide insight into who else might become potential targets or groups. There is less need to get direct personal details from individuals before you can contact them. What you need is a good advertisement for your target audience: the moment they open a page, you need to convince them to come to you.
  2. Use your network id to access everything.
    For the individual there is the benefit of using your social network id to identify yourself at other internet pages or services. Providing your personal information again and again is a thing of the past. Simply enter, e.g., your Facebook authentication and retrieve your ticket or order your goods.

First Time Right in Action

In previous blogs on the First Time Right (FTR) principle, we’ve talked about preventing your data from becoming polluted. After reading the white paper on FTR you might want to see some actual examples. Yesterday I saw some demos and trials from our development group (special thanks to Kasper Sørensen and Ankit Kumar!) that I want to share. Look at them, play with them and give me feedback on how to improve things. The demos focus only on guiding the user to provide correct names (so I’m aware that email, telephone, address, etc. are not yet incorporated).

The first demo is a mockup for Microsoft CRM. Go to the name fields (First name and Last name) and see how the entry form guides you to correct names. I have to admit that the Microsoft CRM demo works better in Internet Explorer (I wonder why …. ;-).

The second demo shows the key possibilities of HIquality Name Worldwide in a LinkedIn mockup.

I am enthusiastic about how easy it is to integrate the first time right mechanism in a web form (or any application with a web UI). The engineers showed me that it’s quite non-intrusive: they added five lines at the beginning of the page and it was working already.

We have 180 million names! Which one is right?

The internet is an ocean of rich content, but unfortunately, as in the real world, it’s heavily polluted.

As a company in business for 25 years, Human Inference absolutely sees the benefits of the internet. For our reasoning processes, based on natural language processing, we gather content and classify it by type, such as given names, family names, prefixes, suffixes, etc. (See also my blog post on the comparison of apples and oranges ….)

In the past this was done manually, for example by investigating telephone books or researching census lists. But these were the ‘pioneer years’. What we see now is an enormous amount of content that can be gathered on the internet. It’s quite easy to find an internet page with 180 million records of person names. Great, so knowledge gathering is passé now?

Fødselsnummer – Crossing centuries in Norway

The Norwegian Fødselsnummer (birth number) is an 11-digit number whose last 2 digits are control digits. The 10th digit is a control digit calculated with a weighted modulo 11 variant over the first 9 digits. The 11th digit is a control digit calculated with another weighted modulo 11 variant over the first 9 digits combined with the 10th digit.
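To make the two control digits concrete, here is a minimal sketch in Python. The post does not list the weights, so the two weight rows below are an assumption taken from the commonly published description of the Fødselsnummer, and the function names are just illustrative.

```python
# Weight rows as commonly published for the Fødselsnummer.
# They are NOT given in this post, so treat them as an assumption.
K1_WEIGHTS = (3, 7, 6, 1, 8, 9, 4, 5, 2)
K2_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def mod11_control_digit(digits, weights):
    """Return the weighted modulo 11 control digit, or None if it would be 10."""
    total = sum(d * w for d, w in zip(digits, weights))
    control = (11 - total % 11) % 11
    return None if control == 10 else control


def control_digits(first_nine):
    """Return (10th digit, 11th digit) for a 9-digit string, or None."""
    digits = [int(ch) for ch in first_nine]
    k1 = mod11_control_digit(digits, K1_WEIGHTS)
    if k1 is None:
        return None
    # The second control digit also covers the first control digit.
    k2 = mod11_control_digit(digits + [k1], K2_WEIGHTS)
    if k2 is None:
        return None
    return k1, k2


def is_valid_fodselsnummer(number):
    """Check length, digits and both control digits of an 11-digit string."""
    if len(number) != 11 or not number.isdecimal():
        return False
    return control_digits(number[:9]) == (int(number[9]), int(number[10]))
```

Note that this only checks the control digits; it says nothing about whether the date part is a real date.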

As in other countries, this number is based on the date of birth. The first 6 digits represent the birth date as “ddmmyy”. The problem with a 6-digit date is that you cannot identify the century: is a Fødselsnummer starting with 121009 someone born in 1909 or 2009? The Norwegian government has solved this by assigning ranges of the following 3 digits (the individual number) to particular eras. If you were born between 1854 and 1899, your individual number must be between 500 and 749; if you were born between 1900 and 1999, it lies between 000 and 499; and if you were born between 2000 and 2039, it lies between 500 and 999. There are some exceptions for individual numbers between 900 and 999.
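The century rule can be sketched as a small lookup. This is only an illustration of the three ranges listed above; the function name is made up, and the exceptions for individual numbers 900 to 999 are deliberately left unresolved because they are not spelled out here.

```python
def birth_year(yy, individual_number):
    """Resolve the four-digit birth year from yy and the individual number.

    Returns None when none of the three listed ranges applies. That
    includes the exceptions for individual numbers 900-999, which are
    mentioned but not detailed in the text.
    """
    if not 0 <= yy <= 99:
        raise ValueError("yy must be a two-digit year")
    if 0 <= individual_number <= 499:
        return 1900 + yy          # born 1900-1999
    if 500 <= individual_number <= 749 and 54 <= yy <= 99:
        return 1800 + yy          # born 1854-1899
    if 500 <= individual_number <= 999 and 0 <= yy <= 39:
        return 2000 + yy          # born 2000-2039
    return None


# The 121009 example: the individual number decides the century.
print(birth_year(9, 123))   # 1909
print(birth_year(9, 623))   # 2009
```

The 500 to 749 range is shared by two eras, so the two-digit year is what separates 1854-1899 from 2000-2039.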

New Matching Engines go beyond apples and oranges

Professional data matching engines are becoming more and more intelligent. Within Human Inference, we also see that our matching techniques are capable of using more and more intelligence, and needless to say we incorporate this intelligence in our engines in order to adapt to the way that humans do their matching.

Traditional data quality or matching engines were based on atomic string comparison functions like match-codes, phonetic comparison, Levenshtein string distance, n-gram comparisons or similar functions. These kinds of functions are relatively easy to implement and to use, although a significant amount of plumbing is needed to get reasonable results. Open source projects like the Lucene search engine, and variants, provide a solid and proven set of these functions. The drawback of these functions is that it’s not always clear for what purpose one needs to use a particular function. An even larger issue is the fact that these low-level DQ functions cannot distinguish between apples and oranges: you end up comparing family names with street names. We still see vendors, for example BI vendors, that claim to provide data quality functionality while they only provide these atomic comparisons.
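As an illustration of such an atomic function, here is a plain textbook Levenshtein distance in Python. It is a sketch for this post, not code from any particular engine, and the sample values are made up. The point is that the function only sees characters: it has no idea whether it is looking at a family name or a street name.

```python
def levenshtein(a, b):
    """Return the edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Two spellings of a family name: a sensible comparison.
print(levenshtein("Jansen", "Janssen"))  # 1

# A family name against a street name: apples and oranges, yet the
# function happily reports a perfect match.
family_name = "Park"
street_name = "Park"
print(levenshtein(family_name, street_name))  # 0
```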

Is 270368A172X a correct Finnish Henkilötunnus?

The Finnish national personal identification number, the Henkilötunnus (aka Hetu or Ht), has the following format: ddmmyyc999C. For details on how to calculate the control character, I refer to the overview blog on National Identification Numbers.

Validating the Hetu 270368A172X shows that it is indeed a correct number. The number 270368172 indeed yields 29 for the modulo 31 proof, represented by control character “X” in the checksum list. The number shows that this is the 86th girl born on the 27th of March 2068.
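The modulo 31 proof can be sketched in a few lines of Python. The checksum list itself is not reproduced in this post, so the string below is an assumption based on the commonly published list of 31 control characters; the function names are illustrative.

```python
# Commonly published list of Hetu control characters (an assumption here,
# since this post only refers to the list without reproducing it).
CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"


def hetu_check_character(hetu):
    """Return the expected control character for an 11-character Hetu string."""
    # ddmmyy plus the 3-digit individual number; the century sign is skipped.
    digits = hetu[0:6] + hetu[7:10]
    if len(hetu) != 11 or not digits.isdecimal():
        raise ValueError("expected the form ddmmyyc999C")
    return CHECK_CHARS[int(digits) % 31]


def has_valid_check_character(hetu):
    """Compare the expected control character with the one in the string."""
    return hetu_check_character(hetu) == hetu[10]


print(int("270368172") % 31)                     # 29
print(hetu_check_character("270368A172X"))       # X
print(has_valid_check_character("270368A172X"))  # True
```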

The latter is exactly the starting point for the discussion on validity. Although the number itself is well formed and passes all the automatic checks, dealing with this number in a data quality assessment will raise your digital eyebrow. In the data quality world we will nowadays say that this Hetu is a wrong Hetu, that it cannot be correct.
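A simple plausibility check on top of the checksum could look like the sketch below. It assumes the classic century signs (“+” for the 1800s, “-” for the 1900s, “A” for the 2000s), which are not spelled out in this post, and the function names are made up for the example.

```python
from datetime import date

# Classic century signs; an assumption, as this post only shows "A".
CENTURY_BY_SIGN = {"+": 1800, "-": 1900, "A": 2000}


def hetu_birth_date(hetu):
    """Decode the birth date from the ddmmyy part and the century sign."""
    day, month, yy = int(hetu[0:2]), int(hetu[2:4]), int(hetu[4:6])
    return date(CENTURY_BY_SIGN[hetu[6]] + yy, month, day)


def is_plausible(hetu, today):
    """A person cannot have been born after the reference date."""
    return hetu_birth_date(hetu) <= today


print(hetu_birth_date("270368A172X"))                 # 2068-03-27
print(is_plausible("270368A172X", date(2011, 6, 1)))  # False
```

The reference date is passed in explicitly, so the same assessment gives the same answer whenever it is rerun.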

So always use a bit of human inference when dealing with Finnish national personal identification numbers.