DataCleaner adds expert cleansing functions – added value in Open Source

In late 2009, in their report Who’s Who in Open-Source Data Quality, Andreas Bitterer and Ted Friedman of Gartner already pointed to DataCleaner as a promising tool. In their opinion it was a tool that could certainly improve by offering more high-end cleansing functions and by improving its rather basic user experience.

Since then, a lot has happened in the DataCleaner space and in the profiling market. Before the launch of version 2 we announced the acquisition of DataCleaner by Human Inference. Some of you may have been curious about what would happen to the functionality. As we stated at the time, we will continue with the community and keep participating in it and expanding it. Under the flag of Human Inference we launched the renewed DataCleaner 2.0, in which we definitely improved the customer experience with an enhanced user interface and the possibility to define filters and filter flows. Filter flows show their benefit when you analyze your data source and want to create new (temporary) data sources based on matching criteria. You can do that either manually or in a completely automated way to monitor your data.

With Open Source in general, and with DataCleaner in particular, we want the community to contribute to the functionality of the product. DataCleaner has long included RegexSwap: a community where you can share regular expressions. Why should everybody reinvent the wheel to build a regular expression for credit card checks, emails, etc.?

Besides regular expressions that can be used to profile data, there is a need for data cleansing functions that contain much more business logic than a regular expression can reasonably cover. For example, validating that the syntax of an email address is correct is something different from validating that there is also a running mail server attached to the domain. Cleansing functions are already part of DataCleaner, but there is always a need for other or more advanced functional extensions. So that you do not have to build them the ‘DataCleaner’ way, we have created an easy extension sharing mechanism.
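To make the difference between the two kinds of email check concrete, here is a minimal Python sketch. It is not DataCleaner code and not a RegexSwap entry: the pattern, the function names and the `resolve` parameter are all assumptions made for illustration. The second check only tests whether the domain resolves in DNS, which is a simplified stand-in for a real mail server check.

```python
import re
import socket

# Deliberately simple syntax pattern (an illustrative assumption,
# not a pattern taken from DataCleaner or RegexSwap).
EMAIL_SYNTAX = re.compile(r"^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")


def has_valid_syntax(email):
    """Profiling-style check: does the value look like an email address?"""
    return EMAIL_SYNTAX.match(email) is not None


def domain_resolves(email, resolve=socket.gethostbyname):
    """Cleansing-style check: does the domain part resolve in DNS?

    This is only a stand-in for a true mail server check, which would
    need an MX lookup or an SMTP probe. The `resolve` parameter is a
    hypothetical hook so the lookup can be replaced, e.g. in tests.
    """
    if not has_valid_syntax(email):
        return False
    domain = email.rsplit("@", 1)[1]
    try:
        resolve(domain)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

An address can pass `has_valid_syntax` and still fail `domain_resolves`, which is exactly the kind of logic that does not fit in a regular expression.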

Let’s be honest – Solve your data quality before jumping into Pattern-Based Strategy


In the evolution of information technology, Gartner has introduced a new term for the ultimate goal to reach: Pattern-Based Strategy.

Just as you were reaching the final destination of your journey to transform bits and bytes into real information, you encounter a new optimum. Pattern-Based Strategy, as described by Yvonne Genovese et al., can be identified as the latest of all the eras of IT value add. Basically, the level of control identifies which era you currently operate in: from tight control and pure automation in the ‘old’ days, via augmentation, e-commerce/Web 1.0 and Web 2.0, to the highest era, called Pattern-Based Strategy.

Data Quality Summit 2010 – a must-attend-event for every data quality professional


On 28 January 2010 the next Human Inference Data Quality Summit will be held in the Evoluon in Eindhoven (NL). The theme, “Value your data, value your future”, is inspired by the idea that investments in data quality have become part of standard business and that vision, strategy and solutions are being synchronized with these investments. As data quality has reached a certain level of maturity, it is time to take an in-depth look at the (near) future of data quality.

The program is challenging, comprehensive and entertaining. Keynote speakers include Ted Friedman (vice president Gartner Research), Mathias Klier (professor at the University of Innsbruck) and Sabine Palinckx (CEO Human Inference). Additionally, the break-out sessions will address a wide variety of theme-related topics: maximising the business value of information, guiding a DQ project through migration, data quality maturity, marketing effectiveness and many more. In short, the Data Quality Summit is not to be missed!

Save the date and register by clicking this link!

How green is your data value?

Number 4 in Gartner’s top 10 list of Strategic Technologies is Green IT. David Cearley’s take on this is quite straightforward: regulations and more efficient equipment will force or help us to reduce unwanted emissions. For our discussion, which is about data value, I see several angles:

  • Having the right contact details reduces the waste of natural resources, because deliveries go to the right place straight away. It is not only that deliveries can be optimized: we can also prevent deliveries from getting lost, in which case natural resources are effectively piped to /dev/null!
  • By valuing our data through deduplication we can, in general, avoid wasting energy needlessly, both human energy and other resources, and spend the scarce energy only on those who actually need it. Here I share David’s remark in his blog: with rising energy prices and increasing emission penalties, there will come a moment in the near future when that aspect outweighs the actual waste of goods and human energy in the equation.
  • Resources are now also saved by concentrating or centralizing services, optimizing the service per energy unit. For data we see this happening in the virtualization of data and in Master Data Management technologies. Data quality will play a strong role in your centralizing strategy; that is what will bring you real value.

I encourage you all to think outside the box about how data value can help to make the world a better place for the future. But I’m afraid that in this economic climate the short term rules, not the long(er) term.

Virtualization: It’s the data! – not the hardware

The first Strategic Technology to watch according to Gartner is Virtualization, and I do like their twist in the whole virtualization debate: a focus on data. The whole world links the word virtualization with optimizing your hardware assets by putting a virtual layer on top of your hardware. By optimizing the usage of your assets in this virtual way you can significantly reduce the total cost of ownership (TCO).

David Cearley at Gartner comes at it from a fascinating other angle. Basically, he also sees virtualization as a strategic technology for virtualizing the data. And with that twist, data quality and data governance appear, annoyingly, right in the middle of your radar screen. To use this strategy for your operational excellence, that is, to reduce the amount of redundant data on your real storage devices and to put a virtual layer between your applications and this virtual data storage, you need to be sure that all your applications can work seamlessly with that virtual data.


Top 10 Technical Strategies for 2009

Recently – close your eyes and imagine the meaning of recently in this climate of economic crisis – David Cearley from Gartner published a blog post on the most important technical strategies for 2009. In a couple of posts I want to pick out some of them and give my view on them in relation to data value.

In general I agree with the top 10 of technological strategies, apart from some slight personal changes in priority, but let’s focus on that in later posts. What is missing, in my opinion, is an emphasis on risk mitigation, although I do realize that things have changed since October 2008. Which technologies can we adopt to avoid providing services, products and, in the end, money to the wrong contacts, and to be sure that we deliver them to the right contacts? The technology strategy of Master Data Management, Know Your Customer, Single View of X, or whatever we call it, will need our attention in 2009!