Short question, complex answer: Who is who and what is what in your database?

 

Any organization that deals with customer, prospect, supplier, distributor, product and service information uses all kinds of data in its day-to-day business processes. Identifying a customer or a product within an automated system, using a specific ID number, the name or any other identifying feature, is a key issue in these processes. It is also a task that needs considerable attention, since the collection and management of data is inherently error-prone. People make mistakes, names are misheard, numbers are typed in the wrong order; there are simply too many causes of defective data and poor information quality.

The collective term ‘business data’ is often used without a precise notion of what business data actually contain. It is not just the customer identification numbers and product codes. Naturally, the sort and the importance of data used in a business process will differ from organization to organization. However, a closer look at the seemingly endless variation will show that names and addresses of persons and organizations are as detailed and complicated as they are identifying. The following classification will show the details of names, addresses and complementary data.

* In personal names we will encounter: given (first) names, middle names, initials, surnames, surname prefixes, surname suffixes, forms of address, titles, functions, qualifications, professions, patronymics and nicknames.

* The name of an organization can consist of virtually everything: legal forms, fantasy words, natural language words, personal names, numbers, Roman numerals, ordinals, letters, acronyms, geographical indications, suffixes, articles, prepositions, conjunctions, indication of year of establishment and non-alphabetical signs.

* Postal address data combine recipient information with delivery points: countries, regions, towns, districts, proximate towns, delivery service indicators, delivery service qualifiers, postcodes, addressee and mailee indicators, thoroughfare names, thoroughfare types, house or plot numbers, house number additions, building names, building types and delivery point access data, such as wing, floor or door.

* Complementary data used in business processes include: phone numbers, fax numbers, e-mail addresses, dates of birth, contract dates, social media account IDs, product and brand names, product codes, product numbers, gender indication, financial data, lifestyle data and transaction data.
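As an illustration only, a small part of this classification could be captured in explicit record types. The sketch below is a minimal one in Python; the class names `PersonName` and `PostalAddress` and the selection of fields are assumptions made for this example, and a real model would need many more of the elements listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PersonName:
    """A few of the personal-name elements from the classification (hypothetical model)."""
    title: str = ""
    given_names: List[str] = field(default_factory=list)
    initials: str = ""
    surname_prefix: str = ""   # e.g. "van der"
    surname: str = ""
    surname_suffix: str = ""   # e.g. "Jr."

    def full_name(self) -> str:
        # Join only the elements that are actually filled in.
        parts = [self.title, *self.given_names,
                 self.surname_prefix, self.surname, self.surname_suffix]
        return " ".join(p for p in parts if p)


@dataclass
class PostalAddress:
    """A few of the postal-address elements from the classification (hypothetical model)."""
    thoroughfare_name: str = ""
    thoroughfare_type: str = ""
    house_number: str = ""
    house_number_addition: str = ""
    postcode: str = ""
    town: str = ""
    country: str = ""


# Example usage with made-up data.
person = PersonName(given_names=["Jan"], surname_prefix="van der", surname="Berg")
print(person.full_name())
```

Keeping each element in its own field, rather than in one free-text name or address line, is what makes later comparison of individual elements possible.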

Defining the data groups as precisely and in as much detail as possible is the first step towards useful interpretation. People, applying their natural language processing capabilities, structure the information as they interpret it. They use their frame of reference, which includes their knowledge dictionary, their linguistic repository, statistical information and mathematical information.

Knowledge-based interpretation, incorporated in an automated system to solve data quality issues, must work in exactly the same way. Consider the following examples:

Any close encounters with the FBI terrorist watchlist?

Just before this summer the U.S. Department of Justice filed a report about the FBI Terrorist Watchlist. This watchlist serves as a critical tool for screening and law enforcement personnel, alerting them when they come across a known or suspected terrorist. It is used by personnel at airports, harbours and the border. When you apply for a visa, you are also matched against this watchlist. The Terrorist Screening Center, a subsidiary of the FBI, is responsible for maintaining the watchlist.

This watchlist was created in 2004 from several other lists, and at that time it consisted of about 68,000 entries. I use the word entries because in the years that followed it became unclear whether one record corresponds to one individual. By the end of 2008 the list had grown to over 1.1 million entries. In 2008, after the American Civil Liberties Union (ACLU) pointed out that the list had passed the 1 million mark, the government came with an explanation: although over 1 million entries have been recorded in the database, these records correspond to about 400,000 individuals. Terrorists often use different and thus multiple identities, several (falsified) passports, etc. But adding an entry with only the first initials and last name, while an entry with the full first names and last name already exists, will result in unwanted side effects.
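To make the side effect of initials-only entries concrete, here is a minimal sketch in Python. It does not describe how the actual watchlist is matched; the function `may_match`, the matching rule and all names are made up for illustration. The assumed rule is that two full given names must be equal, while an initial matches any given name starting with that letter.

```python
from typing import List, Tuple

Name = Tuple[str, str]  # (given name or initial, surname)


def may_match(entry: Name, candidate: Name) -> bool:
    """Return True if the two names cannot be told apart under the assumed rule."""
    given_a, surname_a = entry
    given_b, surname_b = candidate
    if surname_a.strip().lower() != surname_b.strip().lower():
        return False
    given_a = given_a.strip().rstrip(".").lower()
    given_b = given_b.strip().rstrip(".").lower()
    if len(given_a) > 1 and len(given_b) > 1:
        # Both are full given names: they must be identical.
        return given_a == given_b
    # At least one side is only an initial: compare first letters.
    return given_a[:1] == given_b[:1]


def hits(entry: Name, travellers: List[Name]) -> List[Name]:
    """All travellers that would be flagged by a single list entry."""
    return [t for t in travellers if may_match(entry, t)]


# Made-up travellers, not real data.
travellers = [("John", "Smith"), ("Jane", "Smith"), ("Mary", "Smith")]

print(hits(("John", "Smith"), travellers))  # full-name entry
print(hits(("J.", "Smith"), travellers))    # initials-only entry
```

Under this rule the full-name entry flags one traveller, while the initials-only entry flags both John and Jane Smith. The extra entry adds no information about the person it was meant to identify, but it does widen the set of people who are stopped.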

Bi-lingual streetnames in Amsterdam, do we really need it?

Every once in a while I visit Amsterdam and have a drink or two in the centre. Afterwards I take the tram back to the hotel. This weekend I was quite surprised to find that all the street names are announced in English at each stop. The easy and obvious one is of course Centraal Station, which was translated to Central Station. I can also see how they came up with Rembrandt Square instead of Rembrandtsplein. But translating “Spui” to “Courtyard with a chapel” doesn’t help any tourist find their destination.

The importance of persistent identification

A much overlooked issue in Customer Data Integration projects is “Persistent Identification”.

Persons and companies are very often identified using their address data. But what do you do if a person has moved from address A to address B? One thing you really don’t want is for the person to be added to the database as a new person (INSERT). From that moment on, a duplicate person or company resides in your system. This should be prevented by creating search indexes that include both the current and the previous addresses of the persons and companies in your database.
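The idea of a search index that covers both current and previous addresses can be sketched as follows. This is a minimal in-memory illustration in Python, not a description of any particular product; the class `PersonIndex`, its methods and the simple `normalize` function are assumptions made for this example, and a real system would use far more tolerant name and address matching.

```python
import itertools
from typing import Dict, Optional, Tuple


def normalize(text: str) -> str:
    """Very crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class PersonIndex:
    """Hypothetical index that keeps old addresses searchable after a move."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], int] = {}
        self._names: Dict[int, str] = {}
        self.current_address: Dict[int, str] = {}
        self._ids = itertools.count(1)

    def add(self, name: str, address: str) -> int:
        person_id = next(self._ids)
        self._names[person_id] = name
        self.current_address[person_id] = address
        self._by_key[(normalize(name), normalize(address))] = person_id
        return person_id

    def move(self, person_id: int, new_address: str) -> None:
        # Index the new address, but deliberately keep the old key as well.
        name = self._names[person_id]
        self.current_address[person_id] = new_address
        self._by_key[(normalize(name), normalize(new_address))] = person_id

    def find(self, name: str, address: str) -> Optional[int]:
        return self._by_key.get((normalize(name), normalize(address)))

    def find_or_add(self, name: str, address: str) -> int:
        # Only INSERT when neither a current nor a previous address matches.
        found = self.find(name, address)
        return found if found is not None else self.add(name, address)


# Example usage with made-up data.
index = PersonIndex()
anna = index.add("Anna Jansen", "Address A 1")
index.move(anna, "Address B 2")
print(index.find_or_add("Anna Jansen", "Address A 1") == anna)
```

Because the key for address A is kept after the move, a record that arrives later with the old address still resolves to the existing person instead of creating a duplicate.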
