Boundless Search

We live with restrictions every day.

  • A rafter blocks my cellar stairs, so I always bend when I enter.
  • At the end of the street a barking dog runs to the fence whenever I pass, so I always cross the street just before I reach the dog.

We learn to live with restrictions and they become a habit. After a while you just stop realizing why you do things the way you do. The rafter has been removed, and the dog has died. Then why do I still bend on the cellar stairs, and why do I still cross the street before I reach the end? Recently I was confronted with similar obsolete restrictions at Human Inference customer support.

A rafter is visible and a barking dog can be heard, so it doesn’t take long before my habits adapt to the new situation. It is different, however, for technical restrictions: you may never learn that they have disappeared. When I build descriptions for more than a million source records in a SQL Server database, I automatically switch from the free SQL Server Express edition to an Enterprise edition. A customer decided to build descriptions for 7 million source records in a SQL Server Express edition. I was rather surprised that the build succeeded in the end. It turned out that in SQL Server 2008 R2 Express the maximum database size is 10 GB, compared to 4 GB in SQL Server 2005 Express. I had been crossing the street for years to hide from a dead dog.

Searching with HIquality Identify

HIquality Identify is used for search, deduplication and data matching. To return search results quickly, subsets are used to preselect records for evaluation. Before the actual search, each subset is checked against the maximum number of evaluations set in the configuration. When a subset exceeds this maximum, it is skipped. When all subsets are skipped, a message is returned indicating that not enough search data was entered. Searching for Müller with house number 9 in the whole of Germany, for example, will preselect too many candidates and is not very useful even when results are found. Besides the number of candidates per subset, the total number of evaluations in all subsets is checked against the maximum. This maximum is not allowed to exceed the magical limit of 32768.
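The pre-check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the HIquality Identify code: the names Subset, plan_search and max_evaluations are made up for the example, and the assumption that a search is rejected when the total over the remaining subsets exceeds the maximum is ours, since the text only says that the total is checked.

```python
from dataclasses import dataclass

HARD_LIMIT = 32768  # upper bound for the configured maximum, as stated in the text


@dataclass
class Subset:
    name: str             # hypothetical label, e.g. "surname" or "postcode"
    candidate_count: int  # number of records this subset would preselect


def plan_search(subsets, max_evaluations):
    """Return (status, names of the subsets that would be evaluated)."""
    if max_evaluations > HARD_LIMIT:
        raise ValueError("configured maximum may not exceed 32768")

    # Skip every subset that would preselect more candidates than allowed.
    kept = [s for s in subsets if s.candidate_count <= max_evaluations]
    if not kept:
        return "not enough search data", []

    # Check the total over the remaining subsets as well.
    # Assumption: exceeding the total rejects the search (not stated in the text).
    total = sum(s.candidate_count for s in kept)
    if total > max_evaluations:
        return "too many evaluations", []

    return "ok", [s.name for s in kept]
```

With a maximum of 1000, a surname subset of 50000 candidates is skipped while a postcode subset of 120 candidates is kept, so the search still runs on the postcode subset alone.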

The pre-calculation of the number of evaluations was done using the sum of all subsets, but the sum of all subsets is often higher than the union of the subsets, because a record that appears in several subsets is counted more than once. Before improving the algorithm, we investigated why this maximum of 32768 is used. The maximum was driven by the fact that we used host arrays in the Oracle implementation. In host arrays the maximum number of records that can be fetched is 32768. For host arrays in Oracle 11 this is still the maximum. Yet we found out that we had stopped using host arrays in our code several years ago. To prevent internal overflows, we still checked in our code that the maximum was not exceeded. Out of fear of the dead dog we had been crossing the street for years.

Of course there are some risks in just removing these obsolete restrictions. By allowing more data to pass than we ever did, code further down the line may be hit that has never been hit before. Moreover, the reason for using a maximum on the evaluations is to prevent a search action from taking too much time. What is the limit for time and number of evaluations? All we know is that it is no longer dictated by obsolete restrictions.
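The difference between the sum and the union of the subsets is easy to see in a small example. The sketch below assumes that every subset can be represented as a set of record ids; the function name evaluation_counts and the sample ids are invented for illustration.

```python
def evaluation_counts(subsets):
    """Return (sum of subset sizes, number of distinct records)."""
    # Summing the sizes counts a record once for every subset it appears in.
    total_sum = sum(len(s) for s in subsets)
    # The union counts every record exactly once.
    distinct = len(set().union(*subsets))
    return total_sum, distinct


by_surname = {1, 2, 3, 4}   # record ids preselected on surname
by_postcode = {3, 4, 5}     # record ids preselected on postcode

print(evaluation_counts([by_surname, by_postcode]))  # prints (7, 5)
```

Records 3 and 4 are in both subsets, so the sum reports 7 evaluations where only 5 distinct records need to be evaluated. A check based on the sum can therefore reject a search that would have stayed within the limit.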

Boundless search?

Tests showed that after the removal of the restriction it is now possible to find over 150,000 records in a single search call. Functionally such an exercise is completely useless, and overall it is far easier to get these results with a simple database query than with an advanced tool like HIquality Identify. It is nice to see, though, that technically this has become possible. As expected, the performance for a query like this is just terrible. Where usually 100 calls are done within a second, a single call now takes over 90 seconds.

So to keep the performance at an acceptable level, the limit of 32768 is still kept in our configuration tool. That limit can be changed by editing the configuration file. Unsupported changes of this kind often caused trouble in the past, but from now on they may work better than ever.

The world is changing. Computers become faster and memory grows. Old restrictions have disappeared. We can search without limits. But let’s not just search because there is so much to find. Always keep the purpose of searching in mind. After all, the rafter may have been removed, but those who jump high enough may still bump their heads.
