High-quality customer data have become a prerequisite for successful business decisions. To reach the intended level of data quality, companies invest a lot of money in solutions for input control, file merging, data enrichment and duplicate identification. But do these investments guarantee high-quality data and information? For example, are the data quality tools and processes equipped for the inevitable internationalization of our business community? Is a football always a football?
Natural language processing
How do we know that William Jones International Logistics Ltd and W. Jones Int. Transport Co. are probably different notations for the same company? How do we determine that Leonard is a given name in Leonard Peters and a surname in Leonard & Peters? Without being all that aware of it, we use methods such as pattern recognition, context analysis and other linguistic considerations. To answer the question “what is what in customer data?”, people use their knowledge of language and culture to interpret the data they encounter in daily life.
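As a rough illustration of the pattern recognition involved, the sketch below compares the two company names token by token and treats a token as matching when one is a prefix abbreviation of the other. The function names, the small list of legal forms and the position-by-position comparison are all assumptions made for this example, not a description of any particular product.

```python
# Hypothetical list of legal forms to ignore; a real list would be much longer.
LEGAL_FORMS = {"ltd", "co", "inc"}


def tokens(name: str) -> list[str]:
    """Lowercase the name, strip punctuation and drop legal forms."""
    cleaned = [t.strip(".,").lower() for t in name.split()]
    return [t for t in cleaned if t and t not in LEGAL_FORMS]


def token_matches(a: str, b: str) -> bool:
    """True if the tokens are equal or one abbreviates the other as a prefix."""
    return a.startswith(b) or b.startswith(a)


def similarity(name_a: str, name_b: str) -> float:
    """Share of positions whose tokens match (assumes the same word order)."""
    a, b = tokens(name_a), tokens(name_b)
    if not a or not b:
        return 0.0
    matched = sum(1 for x, y in zip(a, b) if token_matches(x, y))
    return matched / max(len(a), len(b))


score = similarity(
    "William Jones International Logistics Ltd",
    "W. Jones Int. Transport Co.",
)
print(score)  # 0.75: three of the four remaining tokens match
```

A score like this only suggests that the records are "probably" the same company. Deciding whether Logistics and Transport are close enough still requires the kind of semantic knowledge discussed next.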
Correct, automated interpretation of customer data needs to imitate these natural language processing abilities. This requires knowledge: relevant information about the components that customer data consist of. Furthermore, a “grammar” is needed to take care of issues such as context rules, ambiguity checks, structure recognition, semantic associations and probability estimates. Using the knowledge and the grammar, the software solution decides on the most probable meaning of a word in a database record. This is the basis for all customer data quality processes. A quick look at the following example illustrates the power of this approach. We, as humans, immediately understand the ambiguity of “ART”. We also understand that automating this interpretation is highly complex:
Art Johnson Sporting Goods
Art Gallery Johnson & Johnson
ART Ltd. Auto Rendition Technology
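To make the combination of “knowledge” and “grammar” a little more concrete, here is a minimal sketch that interprets the first word of each of the three records above. The tiny word lists stand in for the knowledge base and the three ordered rules stand in for the grammar. Both are hypothetical and far too small for real data, and a production solution would work with probabilities rather than hard rules.

```python
# Hypothetical mini knowledge base; a real one would hold far more entries.
GIVEN_NAMES = {"art", "leonard", "william"}
NOUNS = {"art", "gallery", "sporting", "goods", "auto", "rendition", "technology"}
LEGAL_FORMS = {"ltd", "ltd.", "inc", "inc.", "co", "co."}


def interpret_first_token(record: str) -> str:
    """Guess the role of the first word using simple context rules."""
    tokens = record.split()
    first, rest = tokens[0], tokens[1:]

    # Rule 1: an all-caps word that spells the initials of the words that
    # follow (ignoring legal forms) is most likely an acronym.
    content = [t for t in rest if t.lower() not in LEGAL_FORMS and t != "&"]
    if first.isupper() and len(first) > 1:
        initials = "".join(t[0].upper() for t in content)
        if first in initials:
            return "acronym"

    # Rule 2: a known given name followed by a word that is not a known
    # noun is most likely a given name followed by a surname.
    if first.lower() in GIVEN_NAMES and rest and rest[0].lower() not in NOUNS:
        return "given name"

    # Rule 3: otherwise fall back to the dictionary meaning, if any.
    if first.lower() in NOUNS:
        return "noun"
    return "unknown"


for record in (
    "Art Johnson Sporting Goods",
    "Art Gallery Johnson & Johnson",
    "ART Ltd. Auto Rendition Technology",
):
    print(record, "->", interpret_first_token(record))
```

Under these assumptions the three records come out as a given name, a noun and an acronym respectively. The order of the rules matters, which is exactly why a grammar is needed on top of plain word lists.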
If we take a look at international business initiatives, things become even more complex. Many companies doing business abroad seem to forget that they are dealing with a large variety of languages, names, address conventions and other culturally embedded business rules and habits. For this post, I will limit myself to some focus points when dealing with names in an international context.
The names Haddad, Herrero, Le Fèvre, Smid, Ferreiro, Schmidt, Kuznetsov and Kovács all mean “Smith” in different languages. That’s a factual observation, which is not necessarily helpful in solving complex data quality problems. There are, however, many aspects of the various national naming conventions that represent a challenge for every company doing business across the border. Here are some examples:
Meaning of name components
Due to divergent naming conventions, there is great variety in the storage, exchange, representation and meaning of names. For example, the first name Joan is male in Spain and female in Belgium. The representation is reversed as well, and the forms of address differ: the Spanish ‘Señor’ is the male counterpart of the Dutch ‘Mevrouw’. In Spain: Señor Joan Martinez Fonseca Andrade. In Belgium: Mevrouw Vandenwalle, Joan.
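The Joan example can be sketched as a lookup keyed on both the name and the country, instead of on the name alone. The tables below contain only the two cases from this post. The country codes, the table names and the rule that Belgian records are written family name first are assumptions made for illustration.

```python
# Hypothetical reference data, limited to the two examples in the text.
GENDER = {("joan", "ES"): "male", ("joan", "BE"): "female"}
FORM_OF_ADDRESS = {("ES", "male"): "Señor", ("BE", "female"): "Mevrouw"}
FAMILY_NAME_FIRST = {"BE"}  # assumption taken from the Belgian example


def render_name(given: str, family: str, country: str) -> str:
    """Render a name with the form of address and word order of the country."""
    gender = GENDER.get((given.lower(), country))
    # Unknown combinations get no form of address rather than a guess.
    address = FORM_OF_ADDRESS.get((country, gender), "")
    if country in FAMILY_NAME_FIRST:
        name = f"{family}, {given}"
    else:
        name = f"{given} {family}"
    return f"{address} {name}".strip()


print(render_name("Joan", "Martinez Fonseca Andrade", "ES"))
print(render_name("Joan", "Vandenwalle", "BE"))
```

The design point is that the given name by itself is not enough to choose a form of address. Without the country, the only safe option is to leave the form of address out.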
A name like Van Buren would be sorted under ‘V’ in the US. In the Netherlands, for example, that name will always be found under the letter ‘B’. Additionally, the spelling of prefixes (initial capital or not?) differs per situation and per country.
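A country-dependent sort key is one way to handle the Van Buren case. The sketch below assumes a short, hypothetical list of Dutch prefixes and applies the prefix rule only when the country code is "NL". Every other country falls back to sorting on the full name.

```python
# Hypothetical prefix list, longest first so "van der" wins over "van".
DUTCH_PREFIXES = ("van der ", "van den ", "van de ", "van ", "den ", "de ", "ter ")


def sort_key(family_name: str, country: str) -> tuple[str, str]:
    """Return a sort key that ignores the prefix for Dutch records."""
    lowered = family_name.strip().lower()
    if country == "NL":
        for prefix in DUTCH_PREFIXES:
            if lowered.startswith(prefix):
                # Sort on the main word first, then on the prefix.
                return (lowered[len(prefix):], prefix.strip())
    return (lowered, "")


names = ["Van Buren", "Carter", "Adams"]
print(sorted(names, key=lambda n: sort_key(n, "US")))
print(sorted(names, key=lambda n: sort_key(n, "NL")))
```

With the US key, Van Buren ends up last, under ‘V’. With the Dutch key it lands between Adams and Carter, under ‘B’.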
The use of patronymics (names derived from the father’s first name) is highly country-specific. A Russian man whose father’s first name is Ivan will carry the patronymic Ivanovich alongside his family name, whereas his sister will use Ivanovna: Sergei Ivanovich Golubev and Olga Ivanovna Golubeva. In Iceland, it is impossible to establish a direct relation through analysis of family names. Here the patronymic serves as the family name itself: the son and daughter of Björn Thorgeirsson will be called Nils Björnsson and Anna Björnsdóttir respectively.
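For the Russian example, a duplicate or household check has to look past the gendered endings. The sketch below reduces the patronymic and the family name to a common root before comparing them. The suffix lists are a deliberately small assumption and cover only the regular patterns. The code also assumes that each name is written as exactly three words.

```python
# Hypothetical suffix lists covering only the most regular patterns.
PATRONYMIC_SUFFIXES = ("ovich", "evich", "ovna", "evna")
FEMININE_FAMILY_ENDINGS = ("ova", "eva", "ina")


def patronymic_root(patronymic: str) -> str:
    """Strip the gendered suffix, e.g. Ivanovich and Ivanovna both give 'ivan'."""
    lowered = patronymic.lower()
    for suffix in PATRONYMIC_SUFFIXES:
        if lowered.endswith(suffix):
            return lowered[: -len(suffix)]
    return lowered


def family_root(family: str) -> str:
    """Reduce a feminine family name to the masculine form (Golubeva -> golubev)."""
    lowered = family.lower()
    if lowered.endswith(FEMININE_FAMILY_ENDINGS):
        return lowered[:-1]
    return lowered


def may_share_father(name_a: str, name_b: str) -> bool:
    """True if patronymic root and family root agree for both names."""
    parts_a, parts_b = name_a.split(), name_b.split()
    if len(parts_a) != 3 or len(parts_b) != 3:
        return False  # outside the 'given patronymic family' assumption
    return (
        patronymic_root(parts_a[1]) == patronymic_root(parts_b[1])
        and family_root(parts_a[2]) == family_root(parts_b[2])
    )


print(may_share_father("Sergei Ivanovich Golubev", "Olga Ivanovna Golubeva"))
```

A match only indicates that two people may be siblings, since many unrelated people share a common patronymic and family name. For Icelandic records the same comparison on family names would be meaningless, because the last name changes with every generation.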
Naturally, there are many more data and information quality aspects to consider when crossing the border. Think of multiple character sets, privacy issues, multilingualism and different currency and date notations. The examples given are meant to illustrate the following: companies working with international data must invest in understanding customer data specifics. One size does not fit all!