International data quality

When I read Henrik Liliendahl Sørensen’s blog on cross border data quality, I made a mental note to write a follow-up blog, because his theme closely borders on a presentation I am preparing for the 2012 ECCMA Data Quality Summit in October. As it happened, the organizing committee of the summit asked me to write an article on my upcoming presentation, and so I thought I’d combine my efforts and use the article as input for this blog.

As Henrik pointed out, there are a lot of data and information quality aspects to consider when crossing the border, and they cannot be solved by using domestic tools such as a national change of address service. Organizations are investing substantial amounts of money to deal with issues and initiatives such as the value of a single customer view, data integration, fraud prevention, operational risk management and compliance. But how do these investments equip companies for the inevitable internationalization of our business community?

Many companies doing business abroad seem to forget that they are dealing with a large variety of languages, names, address conventions and other culturally embedded business rules and habits. If we take a look at European contact data diversity, here are a couple of examples:

The names Haddad, Hernández, Le Fèvre, Smid, Ferreiro, Schmidt, Kuznetsov and Kovács illustrate the variety of names in Europe. These names all mean “Smith” in different countries. Naturally, there is a large variety of names in the US as well, but the rules and habits concerning structure, storage, exchange and representation are far more intricate in the various European countries. Think of the use of patronymics (Sergei Ivanovich Golubev and Olga Ivanovna Golubeva), prefix sorting and different meanings for similar name components. An even greater challenge lies in the interpretation and processing of European postal addresses. The variety in address components and the differences in order and formatting of these components are extraordinary.
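To make the point about prefix sorting concrete, here is a minimal Python sketch. It assumes a Dutch-style convention in which a surname is filed under its main part rather than its prefix; the `PREFIXES` set and the `sort_key` function are hypothetical illustrations, and other countries apply different rules to the same name components.

```python
# Hypothetical, incomplete list of surname prefixes (assumption for this sketch).
PREFIXES = {"van", "de", "der", "den", "ter", "ten"}

def sort_key(surname: str) -> tuple[str, str]:
    """Return (main part, prefix) so that 'van der Berg' files under B."""
    words = surname.split()
    i = 0
    # Consume leading prefix words, but always keep at least one word as the main part.
    while i < len(words) - 1 and words[i].lower() in PREFIXES:
        i += 1
    prefix = " ".join(words[:i]).lower()
    main = " ".join(words[i:]).lower()
    return (main, prefix)

names = ["de Vries", "Smid", "van der Berg", "Schmidt"]
sorted_names = sorted(names, key=sort_key)
print(sorted_names)  # ['van der Berg', 'Schmidt', 'Smid', 'de Vries']
```

A plain alphabetical sort would instead group "de Vries" and "van der Berg" by their prefixes, which is exactly the kind of domestic assumption that breaks across borders.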

Naturally, there are many more data and information quality aspects to consider when crossing the border. Think of multiple languages, character sets, privacy issues, and different currency and date notation. Companies working with international data are highly dependent on understanding name specifics, address conventions, languages, code pages, culture, habits, business rules and legislation.
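Two of these aspects, date notation and code pages, can be shown in a few lines of Python. This is only a sketch using the standard library; the sample values are made up for illustration.

```python
from datetime import datetime

# The same string is a different date depending on the notation assumed.
raw = "03/04/2012"
european = datetime.strptime(raw, "%d/%m/%Y").date()  # day first: 3 April 2012
american = datetime.strptime(raw, "%m/%d/%Y").date()  # month first: 4 March 2012
print(european.isoformat(), american.isoformat())  # 2012-04-03 2012-03-04

# A name stored as UTF-8 but read with the wrong code page gets garbled.
stored = "Kovács".encode("utf-8")
garbled = stored.decode("latin-1")
print(garbled)  # KovÃ¡cs
```

Neither result raises an error, which is why such mismatches tend to go unnoticed until the data is matched or mailed.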

In my presentation during the 2012 ECCMA Data Quality Summit, I will address natural language processing methods and international name and address specifics. Furthermore, I will show some examples of the application of the insights with regard to international data quality. For more information, you can also check out our website.

 

 

International data quality – Is a football always a football?

High-quality customer data have become the prerequisite for successful business decisions. In order to reach the intended data quality level, a lot of money is being invested in solutions for input control, file merging, data enrichment and duplicate identification. But do these investments guarantee high quality data and information? For example, are the data quality tools and processes equipped for the inevitable internationalization of our business community? Is a football always a football?

Natural language processing

How do we know that William Jones International Logistics Ltd and W. Jones Int. Transport Co. are probably different notations for the same company? How do we determine that Leonard is a given name in Leonard Peters and a surname in Leonard & Peters? Without being all that aware of it, we are using methods such as pattern recognition, context analysis and other linguistic considerations. To answer the question “what is what in customer data?” people use their knowledge of language and culture to interpret the data they encounter in daily life.
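The two questions above can be approximated in code. The following Python sketch is a deliberately naive illustration, not a description of any particular product: the lookup tables, the function names and the scoring rule are all assumptions made for this example.

```python
import re

# Hypothetical lookup tables; a real system would need far larger, per-country lists.
ABBREVIATIONS = {"int": "international"}
LEGAL_FORMS = {"ltd", "co", "inc", "bv", "gmbh"}

def tokens(name: str) -> list[str]:
    """Lower-case the name, expand known abbreviations, drop legal forms."""
    words = re.findall(r"\w+", name.lower())
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return [w for w in words if w not in LEGAL_FORMS]

def same_token(a: str, b: str) -> bool:
    """Equal words match; a single letter matches a word it could abbreviate."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)

def similarity(name1: str, name2: str) -> float:
    """Share of tokens in name1 that have a counterpart in name2."""
    t1, t2 = tokens(name1), tokens(name2)
    if not t1 or not t2:
        return 0.0
    matched = sum(any(same_token(a, b) for b in t2) for a in t1)
    return matched / max(len(t1), len(t2))

def role_of_first_word(name: str) -> str:
    """Context rule: 'X & Y' suggests two surnames, 'X Y' a given name plus surname."""
    words = name.split()
    if len(words) >= 3 and words[1].lower() in {"&", "and"}:
        return "surname"
    return "given name"

print(similarity("William Jones International Logistics Ltd",
                 "W. Jones Int. Transport Co."))  # 0.75
print(role_of_first_word("Leonard Peters"))       # given name
print(role_of_first_word("Leonard & Peters"))     # surname
```

Even this toy version depends on language-specific knowledge (abbreviations, legal forms, the meaning of "&"), which is the knowledge people apply without noticing.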

Chinglish – the most delightful side-effect of internationalization

little grass has life

An increasing number of companies have to deal with data from the world’s fastest emerging economy: China. And the big question here is of course: How can we compare these “strange” Chinese characters with our own writing system?

The grammar and character set of our Western alphabet-languages (such as English, French, Dutch or German) differ tremendously from those of Mandarin Chinese, the language spoken by most people in the People’s Republic of China and by many abroad. Mandarin is a tonal language with an ideographic character set. Almost all characters have a semantic and a phonetic component. The pitch used in the pronunciation eventually determines the meaning.

Complicated? Definitely. But what about the other way around? Have you ever thought about the difficulties the Chinese have to face when trying to convert their language into meaningful English?

This phenomenon is sometimes hilariously illustrated by the many public signs in China meant to inform foreign visitors or to help them find their way around.

This is truly a delightful side-effect of internationalization.

Toponymic confusion

Did you know that Urshalim, al-Quds, Yerushalayim and Jerusalem are four names for the same city? There is great international confusion over the names of countries, cities, streets and rivers, which have been changing so frequently that postal services, health and rescue workers and transportation companies are struggling to cope.
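In data terms, the four names for the same city are aliases that need to resolve to one reference value. A minimal Python sketch follows; the `ALIASES` table and the `canonical_city` function are hypothetical, and picking "Jerusalem" as the reference spelling is an arbitrary choice for this example, not a statement about which name is correct.

```python
# Hypothetical alias table; the reference spelling is an arbitrary choice here.
ALIASES = {
    "urshalim": "Jerusalem",
    "al-quds": "Jerusalem",
    "yerushalayim": "Jerusalem",
    "jerusalem": "Jerusalem",
}

def canonical_city(name: str) -> str:
    """Map a known alias to its reference spelling; return unknown names unchanged."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.casefold(), cleaned)

print(canonical_city("al-Quds"))         # Jerusalem
print(canonical_city(" Yerushalayim "))  # Jerusalem
print(canonical_city("Mumbai"))          # Mumbai
```

The hard part is not the lookup but keeping such a table current while names keep changing.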

The UN’s expert committee on names is expanding standardisation efforts in order to make it easier to find your way in an increasingly globalized world. The most prominent examples of these efforts are the change from Bombay to Mumbai and from Peking to Beijing, thus reinstating the correct names from a pre-colonial era. But the toponymic name battle still has some major challenges. Some examples…