We have 180 million names! Which one is right?

The internet is an ocean rich in content, but unfortunately, as in the real world, it’s heavily polluted.

As a company that has been in business for 25 years, Human Inference absolutely sees the benefits of the internet. For our reasoning processes, which are based on natural language processing, we gather content and classify it by type, such as given names, family names, prefixes, suffixes, etc. (See also my blog post on the comparison of apples and oranges…)

In the past this was done manually, for example by combing through telephone books or researching census lists by hand. But these were the ‘pioneer years’. What we see now is an enormous amount of content that can be gathered on the internet. It’s quite easy to find an internet page with 180 million records of person names. Great, so knowledge gathering is passé now?

International domain names – there goes the ASCIIhood….


The internet is on the verge of one of the most fundamental changes in its history. The Internet Corporation for Assigned Names and Numbers (ICANN) is expected to agree on the use of internet addresses in non-Latin characters during this week’s ICANN convention in Seoul. If all goes according to plan, it will be possible to use Greek, Cyrillic, Arabic, Chinese, Korean and many other characters in the internet browser’s address bar. More than half of the 1.6 billion internet users in the world use a character set which is not Latin. Therefore, ICANN expects that the number of non-Latin domain names, and thus the number of new internet users, will increase rapidly.

This far-reaching change in the use of the internet is based on a system that can “translate” or “convert” different writing systems (sometimes with different writing directions, e.g. Arabic and Hebrew). On a high level, it would look a little like this, I would imagine:
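To make the idea of “converting” a writing system a bit more concrete, here is a minimal sketch. It assumes the conversion is the IDNA/Punycode mapping that turns a Unicode host name into an ASCII-compatible one; the post itself does not name the mechanism. The helper names `to_ascii` and `to_unicode` and the host name `bücher.example` are made up for illustration, and Python’s built-in `idna` codec follows the older IDNA 2003 rules, so treat the result as indicative only.

```python
# Sketch only: assumes the "conversion" is IDNA/Punycode (not stated in the post).
# to_ascii / to_unicode are hypothetical helper names for this illustration.


def to_ascii(hostname: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible form."""
    # Python's built-in "idna" codec implements the IDNA 2003 rules.
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ASCII-compatible host name back to Unicode."""
    return hostname.encode("ascii").decode("idna")


ascii_name = to_ascii("bücher.example")
print(ascii_name)              # xn--bcher-kva.example
print(to_unicode(ascii_name))  # bücher.example
```

A plain ASCII name such as www.humaninference.com passes through unchanged, so in this picture the existing Latin addresses keep working as they are.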


Naturally, this phenomenon raises questions concerning the matching of internet addresses. Is ووو.هُمَنِنفِرِرِنسِ.كُم the same as www.humaninference.com? It appears that generic multilingual data matching issues also apply in this particular case.
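A small sketch shows why this is a matching problem rather than a simple string comparison. It assumes a naive matcher that only applies Unicode normalization (NFKC) and case folding; the function name `normalize` is hypothetical and this is not a description of how Human Inference matches data.

```python
import unicodedata

# Sketch only: a naive matcher built on Unicode normalization and case folding.
# "normalize" is a hypothetical helper, not an existing product API.


def normalize(name: str) -> str:
    """NFKC-normalize and case-fold so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKC", name).casefold()


latin = "www.humaninference.com"
arabic = "ووو.هُمَنِنفِرِرِنسِ.كُم"

# Differences in case or in compatibility characters (here: full-width "w") disappear.
print(normalize("WWW.HumanInference.COM") == normalize(latin))  # True
print(normalize("ｗｗｗ.humaninference.com") == normalize(latin))  # True

# A transliteration into another script does not: the code points simply differ.
print(normalize(arabic) == normalize(latin))  # False
```

Normalization takes care of variants within one script, but deciding that the Arabic spelling refers to the same name requires knowledge about transliteration, which is exactly the kind of multilingual matching issue mentioned above.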