This week started with a remarkable news item on the number of dead Greeks still drawing a pension. The 9,000 centenarians (people aged 100 or over) in particular give the feeling that something might be wrong. At second glance, looking at the statistics for Europe, I held my horses: France, Spain, Italy, Germany and the UK also have a significant number of very old people.
Anyway, while my mind was still boggling at these centenarians in Greece, another news item popped up: an article on the statistics of the voters' lists of Zimbabwe. According to it, 41,100 potential voters in Zimbabwe are centenarians, four times the number currently in the UK, even though the population of the UK is approximately five times that of Zimbabwe! And this in a country where the average life expectancy has fallen to 44.8 years. Even more extreme: 16,800 potential voters are listed as 110 years old, all born on January 1st, 1901.
I realize that pointing to both of these news items might raise your eyebrows. I have no wish to make any sort of political statement here; I leave that to you.
For us, people living in the data quality world, these items trigger a question: how can we identify this kind of weird data manipulation on dates? When we profile our customers' data sources, we are familiar with checks on certain date-related things (no rocket science), for example: is the date written in US or European style (mm/dd/yyyy or dd/mm/yyyy), are we dealing with two or four digits for the year, is the birth date before the current date, is the marriage date after the birth date, etc.
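The checks listed above could be sketched in a few lines of Python. This is only an illustration, not how any particular profiling tool does it: the function names and the slash-separated input format are assumptions made for the example.

```python
import re
from datetime import date, datetime


def parse_date(text, day_first=True):
    """Parse 'dd/mm/yyyy' (European) or 'mm/dd/yyyy' (US); None if invalid."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def is_ambiguous(text):
    """True if the text is a valid date in both styles, with different results."""
    european = parse_date(text, day_first=True)
    american = parse_date(text, day_first=False)
    return european is not None and american is not None and european != american


def has_two_digit_year(text):
    """True if the year part consists of exactly two digits."""
    return re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2}", text) is not None


def check_record(birth, marriage=None, today=None):
    """Return a list of plausibility problems for one record."""
    today = today or date.today()
    problems = []
    if birth > today:
        problems.append("birth date in the future")
    if marriage is not None and marriage <= birth:
        problems.append("marriage date not after birth date")
    return problems
```

For instance, `is_ambiguous("03/04/1980")` flags a value that could be either 3 April or 4 March, while `25/12/1980` can only be European style.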
We are also used to peaks at certain dates. A notorious one is January first of any year. On the one hand this is because it is the default in many entry screens; on the other hand, in some cultures the birth date itself is not that important. People from these cultures put more emphasis on name days and won't remember their day of birth. Then all of a sudden they are forced to give one, with the effect that they or someone else picks a default.
The good thing is that most of these issues in your data sources can be found by data profiling (e.g. with DataCleaner): perform a date-to-age transformation and look at the value distribution. With wrong data the peaks will immediately jump out at you. The problem, however, is that, except in Zimbabwe where all the birth dates of the 110-year-olds were put on Jan 1st, people who commit fraud are normally a bit smarter. They take random names and random days of birth, and if they add a significant population they map it onto reasonable-looking statistics, meaning they spread the days of birth equally over the year.
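Outside a profiling tool, the same idea (turn dates into ages, then look for peaks) could be sketched as follows. The function names and the `factor` threshold are assumptions chosen for illustration; a real profile would be inspected by eye rather than with a fixed cut-off.

```python
from collections import Counter
from datetime import date


def age_on(birth, today):
    """Age in completed years on the given day."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_distribution(birth_dates, today):
    """Value distribution of ages: age -> number of records."""
    return Counter(age_on(b, today) for b in birth_dates)


def suspicious_days(birth_dates, factor=5.0):
    """(month, day) pairs occurring more than `factor` times the per-day average."""
    counts = Counter((b.month, b.day) for b in birth_dates)
    expected = len(birth_dates) / 365.25
    return sorted(day for day, n in counts.items() if n > factor * expected)
```

A list with thousands of records on `(1, 1)` would show up both as a spike at one age in `age_distribution` and as an entry in `suspicious_days`.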
And that is where the human being, with human inference, can still tell the real figures from the generated ones. It so happens that birthdays are not spread equally over the year: in Western countries, for example, more people are born in the period August-October. Another remarkable thing: in countries where most deliveries take place in hospitals, births are not spread equally over the days of the week either. Fewer people are born on Saturday and Sunday than on weekdays! Happily, DataCleaner provides the nice WeekDay distribution validation to see whether people have even taken care of that statistic.
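To make the weekday argument concrete, here is a minimal sketch that is independent of DataCleaner and does not reflect its implementation. It counts births per weekday and computes a chi-square statistic against a perfectly even spread; the function names are made up for this example. The twist is that for birth dates a near-perfect fit to the even spread is the suspicious outcome, since real data should show a weekend dip.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_distribution(birth_dates):
    """Number of births per day of the week, Monday first."""
    counts = Counter(b.weekday() for b in birth_dates)
    return {WEEKDAYS[i]: counts.get(i, 0) for i in range(7)}


def weekend_share(birth_dates):
    """Fraction of births on Saturday or Sunday (an even spread gives 2/7)."""
    weekend = sum(1 for b in birth_dates if b.weekday() >= 5)
    return weekend / len(birth_dates)


def chi_square_uniform(distribution):
    """Chi-square statistic of the weekday counts against an even spread."""
    observed = list(distribution.values())
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)
```

With six degrees of freedom, a statistic above roughly 12.59 rejects the even spread at the 5% level. On a large population from a country with mostly hospital births, one would expect to exceed that value and to see a `weekend_share` noticeably below 2/7; a statistic close to zero suggests dates that were generated rather than recorded.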