Boundless Search

We live with restrictions every day.

  • A rafter blocks my cellar stairs, so I always bend when I enter.
  • At the end of the street a barking dog runs to the fence whenever I pass, so Ialways cross the street just before I reach the dog.

We learn to live with restrictions and they become a habit. After a while you just stopAngry dog realizing why you do things the way you do. The rafter has been removed, and the dog has died. Then why do I still bend on the cellar stairs, and why do I still cross the street before I reach the end? Recently I was confronted with similar obsolete restrictions at Human Inference customer support.

A rafter is visible and a barking dog can be heard, so it doesn’t take long before my habits change to fit to the new situation. It’s different however for technical restrictions of which you never get to know that they have disappeared. When I build descriptions for more than a million source records in a SQL Server database, I automatically switch from the free SQL Server Express database to an Enterprise edition. A customer decided to build descriptions for 7 million source records in a SQL Server Express edition. I was rather surprised the build was successful at the end. It turned out that in SQL Server Express 2008 the maximum database size is 10 GB as compared to SSE 2005 having 4GB. As It turned out I had been crossing the street for years to hide for a dead dog.

Searching with HIquality Identify

HIquality Identify is used for search, deduplication and data matching. To retrieve fast search results, subsets are used to preselect records for evaluation. Before the actual search, the maximum number of evaluations as set in the configuration is checked. When a subset exceeds this maximum, it is skipped. When all subsets are skipped, a message is returned indicating that not enough search data was entered. Searching for Müller with house number 9 in the whole of Germany for example will preselect too many candidates, and is, even when results are found, not very useful. Besides the number of candidates per subset, the total of evaluations in all subsets is checked with the maximum. This maximum is not allowed to be above the magical limit of 32768.

Continue reading ‘Boundless Search’

We make ‘null’ mistakes

Wherever software is created, mistakes are being made. Software providers often presume their products are bug-free, but software of that kind doesn’t exist. Our departments works hard to prevent it, yet in our HIquality Life Cycle new bugs could still be introduced, even in the oldest modules that have been in use for over 25 years already. 

HIquality bug cycle

Usually our customers are satisfied with our product suite. At customer support I never receive information about the successful implementations. I got to know our software through the problems that occur, and in almost 15 years of acceptance testing and customer support, I’ve seen all kind of bugs passing by.
HIquality bug cycleSoftware crashes and never ending loops are nasty. Worse are those bugs that are not that visible in the beginning, but keep on growing in the course of time.
Recently we caught such a bug in our longest existing product HIquality Identify. Continue reading ‘We make ‘null’ mistakes’

First time right? Let your data decide!

Data quality consultants will tell you that collecting data correctly, getting it right first time is essential, whilst in contrast almost every organisation actually puts most of their budget and labour into attempting to cleanse data after collection.

The proactive versus reactive debate rages, but in fact data quality must be both a proactive and a reactive process. The data will dictate which to use, or whether both are required. Continue reading ‘First time right? Let your data decide!’

An Easy mashup of ETL and DQ

Today I saw how easy it can be to make a mashup from ETL and DataQuality tools. More and more ETL vendors see the need to not only extract, transform and load data, but at the same time also enhance the data by hand with data quality tools. Most of them stick to so-called tick mark data quality – main stream easy to get enhancements. These results are mostly experienced as disappointing or at max average. Building ETL solutions is another ball-game than building data quality solutions. You need to mash these worlds together.
Together with Pentaho we as Human Inference are creating a mashup with their Kettle ETL tool and our HIquality Data Quality solutions. The nice thing is that the data quality solutions can be used both in the cloud as well as on-premise.
It’s almost finished now and as a teaser I just want to show you a hot screenshot of it. Soon available as add-on from our easyDQ website, followed by an inclusion in the coming Pentaho release. If you need it right away, please contact us directly.

Komerc in Croatia

People find many ways to be unique, including in their choice of names and how they are written.  Common names may be written in any number of ways (Zachery, Zaccari, Zachery, Zakarey and so) and in any number of forms (Za’Korey, zaKori). This variation, and the importance that the customer attaches to it, reinforces the importance of first time right when collecting information about a person’s name.

I was reminded recently that this rule applies also to company names when reviewing a directory containing Croatian companies.  The directoryshowed a great variation in words that at first glance would seem ideal candidates for correction and standardization. For example, many companies contained strings like these:

Commerc, Comerce, Comerc, Kommerce, Kommerc, Komerce, Komerc

There are many other examples which had me scratching my head: Compani, Konsulting, Konzalting, Konsalting and so on.

Why the variance is spelling?  Are these companies with the English word commerce in their names where that word has been typed as heard by call centre workers with a limited knowledge of English? Are they typos of a valid Croatian word? Are they accurate representations of a valid Croatian word as rendered in different dialects? Is it a mixture of all these factors?

Continue reading ‘Komerc in Croatia’

Ask Me is linked with Any Body and relates with Walther Von Stolzing

Weird subject, isn’t it? Quite obvious for everybody, the persons ‘Ask Me’ and ‘Any Body’ are artificial names. They will never belong to a real person. How they relate to ‘Walter von Stolzing’ will follow.

For over 25 years Human Inference has collected reference data, for instance on persons. Because of our reference set we immediately recognize that ‘Ask Me’ and ‘Any Body’ are fake names. People are using these either in test situations or to hide their actual names.

In the old days we only needed to test on ‘Test Test’, in more recent years we see great inventiveness on these fake names. A brief example can be seen in the following list.

Alpha Beta Any Body
Ask Me Best Friend
Blue Sky Cool Dude
Dress Code El Comandante
Guess Who In Cognito

In case you cannot rely on reference data and interpretation you need to provide a check list. Providing it is one thing, but since users tend to be really creative, maintaining it is essential. Continue reading ‘Ask Me is linked with Any Body and relates with Walther Von Stolzing’