An Easy mashup of ETL and DQ

Today I saw how easy it can be to make a mashup of ETL and data quality tools. More and more ETL vendors see the need not only to extract, transform and load data, but also to enhance the data with data quality tools at the same time. Most of them stick to so-called tick-mark data quality: mainstream enhancements that are easy to get. The results are mostly experienced as disappointing or, at best, average. Building ETL solutions is a different ball game from building data quality solutions. You need to mash these worlds together.
Together with Pentaho, we at Human Inference are creating a mashup of their Kettle ETL tool and our HIquality Data Quality solutions. The nice thing is that the data quality solutions can be used both in the cloud and on-premise.
It’s almost finished now, and as a teaser I just want to show you a hot screenshot of it. It will soon be available as an add-on from our easyDQ website, followed by inclusion in the coming Pentaho release. If you need it right away, please contact us directly.

Ask Me is linked with Any Body and relates with Walther Von Stolzing

Weird subject, isn’t it? It is quite obvious to everybody that ‘Ask Me’ and ‘Any Body’ are artificial names. They will never belong to a real person. How they relate to ‘Walther von Stolzing’ will follow.

For over 25 years Human Inference has collected reference data, for instance on persons. Because of our reference set we immediately recognize that ‘Ask Me’ and ‘Any Body’ are fake names. People are using these either in test situations or to hide their actual names.

In the old days we only needed to test for ‘Test Test’; in more recent years we have seen great inventiveness in these fake names. A brief sample can be seen in the following list.

Alpha Beta
Any Body
Ask Me
Best Friend
Blue Sky
Cool Dude
Dress Code
El Comandante
Guess Who
In Cognito

If you cannot rely on reference data and interpretation, you need to provide a checklist. Providing it is one thing, but since users tend to be really creative, maintaining it is essential.
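A checklist of this kind can be sketched in a few lines of Python. This is only an illustration, not how the Human Inference reference data works: the set `FAKE_NAMES` and the helper names are hypothetical, and the list is seeded with nothing more than the examples above.

```python
import re

# Hypothetical checklist, seeded with the sample names from this post.
# In practice the list needs continuous maintenance as users invent new names.
FAKE_NAMES = {
    "test test", "alpha beta", "any body", "ask me", "best friend",
    "blue sky", "cool dude", "dress code", "el comandante",
    "guess who", "in cognito",
}


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", name.strip().lower())


def is_listed_fake(name: str) -> bool:
    """Return True if the normalized name appears on the checklist."""
    return normalize(name) in FAKE_NAMES
```

A call such as `is_listed_fake("  ASK   me ")` matches the entry ‘ask me’, while a name that is not on the list passes. That is exactly the weakness of a checklist: it only knows the names somebody has already added.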

Know Your Customers – improving your Corporate Social Responsibility

It’s not only what you achieve, it’s also how you behave. Some small organizations can still operate more or less undetected on their way to successful results. For medium and large organizations, that is not what governments and customers expect. Transparency on Corporate Social Responsibility (CSR) is key here, and a significant number of countries have therefore agreed on principles for it in, among others, the OECD Guidelines for Multinational Enterprises.

This week, the latest results on transparency in the banking sector were presented in The Netherlands. Although some institutions score really well, others need to go at least an extra mile to reach a good or even a fair score.

We agree with the report’s recommendation that compliance regulations can help, or force, institutions to be more transparent; for example, the SEC in the USA demands more detailed information than its Dutch peer, the AFM. For Basel II, too, financial institutions need to know who they are dealing with in the end. The phrase ‘in the end’ makes things even more difficult for CSR, because not only the ultimate legal entity is needed, but also additional details per region and per sector.

What is equal? – challenges with sound and synonyms

What to do when basic string comparison (fuzzy search) techniques won’t give the right results? Fuzzy search helps to find matches in situations where people make typos (e.g. compare Human Inference with Human Inverence), make up abbreviations (King str. with King street) or ignore diacritics (Sørensen and Soerensen). If the ‘wrong word’ is not an existing word, it is obvious that we have a match once the typo is corrected.
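The three cases above (typos, abbreviations, diacritics) can be illustrated with a minimal Python sketch. It is not the matching logic of any Human Inference product. The similarity score comes from `difflib.SequenceMatcher` in the standard library, which is a stand-in for whatever fuzzy measure you prefer, and both the `SPECIAL_LETTERS` mapping and the `ABBREVIATIONS` table are small, assumed examples.

```python
import unicodedata
from difflib import SequenceMatcher

# Letters without a Unicode decomposition need an explicit mapping
# (assumed and far from complete).
SPECIAL_LETTERS = str.maketrans(
    {"ø": "oe", "Ø": "Oe", "æ": "ae", "Æ": "Ae", "ß": "ss"}
)

# Hypothetical abbreviation table.
ABBREVIATIONS = {"str.": "street", "str": "street"}


def normalize(text: str) -> str:
    """Map special letters, strip accents, expand abbreviations, lower-case."""
    text = text.translate(SPECIAL_LETTERS)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = [ABBREVIATIONS.get(word, word) for word in stripped.lower().split()]
    return " ".join(words)


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 of the two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

After normalization, ‘Sørensen’ and ‘Soerensen’ as well as ‘King str.’ and ‘King street’ become identical strings, while ‘Human Inference’ and ‘Human Inverence’ still differ by one letter and therefore score just below 1.0.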

More challenges appear if the typo produces another existing word; now we need to decide how equal the two entries are. If you have some knowledge of how frequently words are used, you can use that in the equation. How to get the frequency of usage for words is another ball game, but at least you can assume that a ‘wrong word’ is never used (a bit of a paradox).
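One simple way to bring frequency into the equation is to weigh string similarity by how often each known word is used. The sketch below is only one possible scoring, not a method described in this post: the counts in `WORD_COUNTS` are made up for illustration, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict

# Made-up usage counts, purely illustrative. A 'wrong word' has no entry.
WORD_COUNTS = {"from": 900, "form": 100}


def intended_word_scores(observed: str) -> Dict[str, float]:
    """Score each known word as the intended one.

    The raw score is string similarity times relative frequency; the raw
    scores are then scaled so that they add up to 1.
    """
    total = sum(WORD_COUNTS.values())
    raw = {
        word: SequenceMatcher(None, observed, word).ratio() * count / total
        for word, count in WORD_COUNTS.items()
    }
    norm = sum(raw.values())
    return {word: score / norm for word, score in raw.items()} if norm else raw
```

With these invented counts, an observed ‘form’ still scores higher for ‘from’ than for itself, simply because ‘from’ is assumed to be so much more common. Whether that is the right decision depends entirely on the quality of the frequency data.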

DataCleaner adds expert cleansing functions – added value in Open Source

In late 2009, in their report Who’s Who in Open-Source Data Quality, Andreas Bitterer and Ted Friedman of Gartner already pointed to DataCleaner as a promising tool: one that, in their opinion, could certainly improve by offering more high-end cleansing functions and by improving the rather basic user experience.

Since then, a lot has happened in the DataCleaner space and in the profiling market. Before the launch of version 2 we announced the acquisition of DataCleaner by Human Inference. Some of you may have been curious about what would happen to the functionality; as stated at the time, we continue with the community and will further participate in and expand it. Under the flag of Human Inference we launched the renewed DataCleaner 2.0, in which we clearly improved the user experience with an enhanced user interface and the possibility to define filters and filter flows. Filter flows show their benefit when you analyze a data source and want to create new (temporary) data sources based on matching criteria. You can do that either manually or in a completely automated way to monitor your data.

With Open Source in general, and with DataCleaner in particular, we want the community to participate in the functionality of the product. DataCleaner has long included the RegexSwap: the community where you can share regular expressions. Why would everybody reinvent the wheel to build a regular expression for credit card checks, emails, etc.?

Next to regular expressions that can be used to profile data, there is a need for data cleansing functions that contain much more business logic than can be covered in a regular expression. For example, validating that the syntax of an email address is correct is something else than validating that there is also a running mail server attached to the domain. Cleansing functions are already part of DataCleaner, but there is always a need for other or more advanced functional extensions. To spare you from having to create them in the ‘DataCleaner’ way, we have created an easy extension sharing mechanism.
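The difference between a syntax check and a check with business logic can be sketched as follows. This is a Python illustration, not a DataCleaner extension and not its API. The pattern `EMAIL_SYNTAX` is a deliberately simple assumption, and the mail server check is passed in as a function (`domain_accepts_mail`, a hypothetical name) because a real one would need a DNS lookup for the domain’s mail exchanger.

```python
import re
from typing import Callable

# Deliberately simple syntax pattern (an assumption, much looser than the
# full grammar of email addresses).
EMAIL_SYNTAX = re.compile(r"^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")


def check_email(address: str, domain_accepts_mail: Callable[[str], bool]) -> str:
    """Return 'invalid syntax', 'no mail server' or 'ok'.

    `domain_accepts_mail` stands in for the business logic that a regular
    expression cannot express, such as a lookup of the domain's mail server.
    """
    if not EMAIL_SYNTAX.match(address):
        return "invalid syntax"
    domain = address.rsplit("@", 1)[1]
    if not domain_accepts_mail(domain):
        return "no mail server"
    return "ok"
```

In a test you can pass a stub such as `lambda domain: domain == "example.com"`; in production the callable would perform the actual lookup. Keeping the two steps separate makes clear which part a shared regular expression can cover and which part needs a cleansing function.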

Centenarians in Greece, Zimbabwe and the quality of birth dates

Population distribution in The Netherlands

This week started with a remarkable news item on the number of dead Greeks still drawing a pension. Especially the 9,000 centenarians (people aged 100 or older) give the feeling that something might be wrong. On second thought, looking at the statistics for Europe, I hold my horses: France, Spain, Italy, Germany and the UK also have a significant number of older people.

Anyway, while my mind was still boggling over these centenarians in Greece, a new news item popped up: an article on the statistics of the voters lists of Zimbabwe. 41,100 potential voters in Zimbabwe are centenarians, four times as many as currently in the UK, while the population of the UK is approximately five times that of Zimbabwe. Relative to population, that is roughly twenty times the UK rate. And that in a country where the average life expectancy has fallen to 44.8 years. Even more extreme: 16,800 potential voters are 110 years old, all born on January 1st, 1901.

I realize that pointing to both these news items might raise your eyebrows. I do not want to make any sort of political statement; I leave that to you.

For us, people living in the data quality world, these items trigger a question: how can we identify this kind of weird data manipulation on dates? When we profile our customers’ data sources, we are familiar with checks on certain date-related things (no rocket science), for example: is the date written in US or European style (mm/dd/yyyy or dd/mm/yyyy), are we dealing with two or four digits for the year, is the birth date before the current date, is the marriage date after the birth date, etc.
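The checks listed above could look roughly like this in Python. It is a sketch under simple assumptions, not the profiling code we use: dates arrive as slash-separated strings, and all function names are hypothetical.

```python
from datetime import date
from typing import List, Optional


def guess_date_style(values: List[str]) -> str:
    """Guess 'dd/mm', 'mm/dd', 'mixed' or 'ambiguous' for a column of dates.

    A first component above 12 can only be a day, so it points to European
    style; a second component above 12 points to US style.
    """
    first_over_12 = second_over_12 = False
    for value in values:
        first, second, _year = value.split("/")
        first_over_12 = first_over_12 or int(first) > 12
        second_over_12 = second_over_12 or int(second) > 12
    if first_over_12 and second_over_12:
        return "mixed"
    if first_over_12:
        return "dd/mm"
    if second_over_12:
        return "mm/dd"
    return "ambiguous"


def has_two_digit_year(value: str) -> bool:
    """True if the year part of a slash-separated date has two digits."""
    return len(value.split("/")[2]) == 2


def date_problems(birth: date, marriage: Optional[date], today: date) -> List[str]:
    """List the plausibility rules that a record violates."""
    problems = []
    if birth >= today:
        problems.append("birth date is not before the current date")
    if marriage is not None and marriage <= birth:
        problems.append("marriage date is not after the birth date")
    return problems
```

Note that the style can only be guessed from the column as a whole: a single value such as 05/06/1980 is valid in both styles and stays ambiguous.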

We are also used to peaks at certain dates. A notorious one is January first of any year: on the one hand because it is the default in many entry screens, on the other hand because in some cultures the birth date itself is not that important. People from these cultures put more emphasis on name days and may not remember their day of birth. When they are suddenly forced to give one, they or someone else chooses a default.
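Such peaks are easy to spot by counting how often each day of the year occurs and comparing that with an even spread over the year. The Python sketch below is one simple way to do it; the function name and the thresholds `factor` and `min_count` are assumptions you would tune for your own data.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple


def day_of_year_peaks(
    birth_dates: List[date], factor: float = 5.0, min_count: int = 10
) -> Dict[Tuple[int, int], int]:
    """Return (month, day) pairs that occur suspiciously often.

    A pair is reported when its count is at least `min_count` and at least
    `factor` times what an even spread over the year would predict.
    """
    counts = Counter((d.month, d.day) for d in birth_dates)
    expected = len(birth_dates) / 365.25
    return {
        month_day: n
        for month_day, n in counts.items()
        if n >= min_count and n >= factor * expected
    }
```

On a data source where January first is used as a default, the pair (1, 1) stands out immediately. The same count grouped on the full date would reveal a cluster such as thousands of records all born on January 1st, 1901.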