The internet is an ocean of rich content, but unfortunately, as in the real world, it’s heavily polluted.
As a company that has been in business for 25 years, Human Inference absolutely sees the benefits of the internet. For our reasoning processes, based on natural language processing, we gather content and classify it by type: given names, family names, prefixes, suffixes, etc. (See also my blog post on the comparison of apples and oranges ….)
In the past this was done manually, for example by investigating telephone books or researching census lists. But those were the ‘pioneer years’. What we see now is an enormous amount of content that can be gathered on the internet. It’s quite easy to find an internet page with 180 million records of person names. Great, so knowledge gathering is passé now?
Sorry to say, it has become far more complex. The volume is much larger than in the past, but the quality is worse. Of course there are nuances in quality (e.g. the difference between content coming from social media and content from census lists), but people working in data quality will absolutely agree with me that even lists coming from the government contain false names and garbage.
Before we can discuss how to gather the right content from a large set, we need a common understanding of what a right name is. With linguists you can have long debates about this. Take my family name for example: according to the Dutch rules the family name “van Holland” is valid and known, while the family names “Van Holland” and “Vanholland” are invalid and unknown. In an international setting, however, both names are absolutely valid and known (for example in Belgium or the United States). Similar situations may occur when people move between countries and patronyms (names derived from the father’s name) are used in one country but not in another. You could then find a name like Maria Romanov, who, in her country of origin, would be named Maria Romanova.
This phenomenon, which I call the ‘degeneration’ of names, has been going on ever since official administrations started storing the names of individuals. That’s why there are, for example, so many variants of the name Mohammed. Sometimes this cannot be prevented, because the original name contains characters that are not known at the registration office. An example would be the family name Sørensen (quite common in Scandinavia), which would become Sorensen because of the unavailability of the letter ‘ø’. The moment a name is officially registered and used as such, it becomes a valid name. This becomes even more complex once a name is written in a different writing system and has to be transcribed or transliterated.
We see here that right and wrong need to be seen in the context of a region. We define a name as right if the individual related to that name identifies her/himself with it. So when I travel to the US, I might still feel that “Vanholland” is wrong, but I can imagine that my relatives who have lived in the US for many years now call themselves “Vanholland”. They identify with that name, and in that case the name becomes “right”.
Back to the 180 million records. In order to filter these down to valid names, we first apply so-called purification filters. That’s a huge set of different filters, divided into specific sets such as salutation filters, company-related word filters, extraordinary sequence filters, strange character filters, capitalization filters, etc. For example, the strange character filter removes a name like “be@home” but keeps “d’Ancona”, and the legal forms filter in the company-related words set removes “John Baker Inc.”. Nice examples in the extraordinary sequence set are the words on our black list (e.g. “asdf” or “qwerty”) and repeating sequences (e.g. “abcabcabc”). After this purification we end up with 160 million records.
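To give a flavour of what such filters look like, here is a minimal Python sketch of the three filter types mentioned above. The rules, character set and black list are my own illustrative assumptions; the actual Human Inference filter sets are far larger and more refined.

```python
import re

# Illustrative purification filters; the rules below are assumptions,
# not the actual Human Inference filter sets.
BLACKLIST = {"asdf", "qwerty"}                # extraordinary-sequence black list
STRANGE_CHARS = re.compile(r"[@#$%^*_=<>]")   # rejects "be@home", keeps "d'Ancona"
REPEATED = re.compile(r"^(.{2,})\1+$")        # repeating sequences like "abcabcabc"

def passes_purification(name: str) -> bool:
    """Return True if a candidate name survives these example filters."""
    lowered = name.lower()
    if lowered in BLACKLIST:
        return False
    if STRANGE_CHARS.search(name):
        return False
    if REPEATED.match(lowered):
        return False
    return True
```

With these toy rules, “d’Ancona” passes while “be@home”, “qwerty” and “abcabcabc” are filtered out, mirroring the examples above.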
These 160 million records are aggregated on family name, and the frequency of each name is counted. This causes a major reduction: we now have a list of 14 million potential family names. If we compare these 14 million with our internal knowledge, we immediately see the difficulty of dealing with names, i.e., the long tail. In general, the top 10% of the names in a region account for over 90% of the population; the so-called long tail, the remaining 90% of the names, accounts for less than 10% of that population. This means that the names in this long tail have a very low frequency, as do most of the wrong names! So on frequency alone you cannot tell whether a name is valid! For example, within these 14 million names, roughly 1.1 million account for over 140 million persons in the original population. The remaining 12.9 million family names still need a lot of attention to verify whether they are valid names. An example of such names with their respective frequencies is given in the table below, which immediately shows that it is not trivial to assume that all these names are valid:
| Potential family name | Frequency (from 160 million records) |
| --- | --- |
Roughly 3 million of these names have a frequency of 1! They are very hard to validate automatically, but there are techniques for this, which I will describe later in a separate blog post.
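The aggregation step and the long-tail problem can be sketched in a few lines of Python. The records below are invented for illustration, and the head/tail split on raw frequency is a deliberately naive assumption, shown here only to illustrate why frequency alone cannot separate a rare-but-valid name from garbage.

```python
from collections import Counter

# Invented example records; real input would be the purified 160 million.
records = ["Jansen", "de Vries", "Jansen", "Sørensen",
           "Jansen", "de Vries", "asdfgh"]

# Aggregate on family name and count frequencies.
frequencies = Counter(records)

# Naive split: frequent "head" versus frequency-1 "long tail".
head = {n: f for n, f in frequencies.items() if f > 1}
long_tail = {n: f for n, f in frequencies.items() if f == 1}

# Both the valid "Sørensen" and the garbage "asdfgh" end up in the
# long tail with frequency 1, so frequency alone cannot decide validity.
```

Note how “Sørensen” and “asdfgh” are indistinguishable by frequency: both land in the long tail, which is exactly why the low-frequency names need further validation.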
In order to handle these volumes of records and place them in the right validation buckets, we use DataCleaner, a very powerful and fast open source profiling toolset. With DataCleaner 2.0 we also use the validation steps and the fact that you can now build a complete hierarchy of these steps in a single template. This way, the results can be stored in new data stores that can be examined later within the same tool!
We are not afraid of tackling large datasets to gather content. A large dataset will not automatically provide you with a large update to your knowledge; there is an enormous amount of pollution in such a set. We use DataCleaner for purification because it’s powerful and extremely flexible. We have defined a complete process to purify a given dataset and automatically retrieve as many valid names as possible, which are then added to our knowledge. Manual work will remain, but we can now focus on linguistically and culturally challenging names instead of spending most of our time on trivial ones. Purifying 180 million records is a hell of a job, but we are good at it!