The term Golden Record is closely tied to Customer Data Integration, or MDM for customer data. It refers to the “single truth” created or calculated from all the duplicate customer records scattered across different systems. This post is not about finding or tagging those duplicates; there are all kinds of ways to find them, using advanced statistical methods, fuzzy matching and so on.
But what do you do once you have found the duplicates? How do you create the best possible customer data out of all the gathered elements?
First of all, we have to define what is meant by the Golden Record. We at Human Inference use the acronym ACCU, short for Actual, Correct, Complete and Unique. Obviously, we want one unique record; that’s why we use matching or identity resolution software. But Actual, Correct and Complete are less absolute: they can be interpreted subjectively. You can have never-ending discussions about them and build the most complex business rules ever. I prefer to start by simply determining the superlative of Actual, Correct and Complete. In other words, the most actual, the most correct and the most complete data element or attribute “wins” and makes it into the Golden Record. Let’s take the following example: two almost identical records are gathered from two different systems (A and B).
| Record 1 from System A | Record 2 from System B |
| --- | --- |
| J. (John) Miller | J.F. Miller |
| 26 Spring Gdns | 26 Spring Gardens |
| Manchester, Lancashire, M2 1BB | Manchester, Lancashire, M2 1BA |
| United Kingdom | United Kingdom |
The basic rule is that only Correct data will make it into the Golden Record. So, if you can validate data, please do so. For instance, you can check social security, bank account and credit card numbers using algorithms, and you can validate email addresses. Using postal reference data, it is also possible to verify the correctness of addresses. Names are the most difficult to validate: extensive knowledge is needed to check whether the names of persons and organizations are valid.
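As an illustration of checking numbers “using algorithms”, here is a minimal sketch of the Luhn check commonly used for credit card numbers (the function name and test values are my own, not from the original post):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 when the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4539 1488 0343 6467"))  # a well-known valid test number -> True
print(luhn_valid("1234 5678 1234 5678"))  # -> False
```

A check like this only proves the number is well-formed, of course, not that the account actually exists; full validation needs reference data or an external service.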
In my own experience and opinion, you should always either discard incorrect data or have it corrected by a data steward. In the end, nobody should be left wondering whether a Golden Record was built on doubtful data.
The next step is to examine the records attribute (field) by attribute, using the example from above:
| Attribute | Decision |
| --- | --- |
| Initials | “J.F.” wins over “J.” because it consists of more characters (simply use the LEN function). |
| First Name | “John” wins over the missing first name in Record 2. You can also deduce this person is male. |
| Street | “26 Spring Gardens” wins over “26 Spring Gdns”; the full form is preferred over the abbreviated one. |
| Housenumber | 26/II wins; once again it consists of more characters (more complete). |
| Postcode | M2 1BB wins: this is the correct postal code for the even house numbers. |
| City & Country | It doesn’t matter; both records contain the same data. |
So, using validation techniques to separate correct data from incorrect data, and comparing the length of each attribute in the provided records, results in the following Golden Record:
Mister J.F. (John) Miller
26 Spring Gardens
Manchester, Lancashire, M2 1BB
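The attribute-by-attribute “longest value wins” rule sketched above can be expressed in a few lines. This is a simplified illustration (the field names and `merge_golden` helper are my own, and real survivorship rules would first filter out values that failed validation):

```python
record_a = {
    "initials": "J.",
    "first_name": "John",
    "street": "26 Spring Gdns",
    "city": "Manchester",
}
record_b = {
    "initials": "J.F.",
    "first_name": "",
    "street": "26 Spring Gardens",
    "city": "Manchester",
}

def merge_golden(*records: dict) -> dict:
    """Per attribute, keep the longest (most complete) value across all records."""
    golden = {}
    for key in records[0]:
        golden[key] = max((r.get(key, "") for r in records), key=len)
    return golden

print(merge_golden(record_a, record_b))
# {'initials': 'J.F.', 'first_name': 'John', 'street': '26 Spring Gardens', 'city': 'Manchester'}
```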
Even if you have a lot more attributes in your Golden Record, this method still works: determine which data is correct, use only correct data, and use the length function (LEN) to pick the “most complete” value, where most complete simply means consisting of the most characters. If the source systems also provide dates for “date entered” and “date last changed”, you can use these to determine the most actual data, with a formula like MIN(“CurrentDate” minus “Last Changed Date”).
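The recency rule can be sketched the same way: the source whose “last changed” date lies closest to today supplies the most actual value. The system names and dates below are hypothetical, purely for illustration:

```python
from datetime import date

# Hypothetical "last changed" dates per source system.
last_changed = {"System A": date(2011, 3, 14), "System B": date(2012, 9, 2)}
today = date(2013, 1, 1)

# MIN(CurrentDate - LastChangedDate): the smallest difference marks the freshest source.
most_recent = min(last_changed, key=lambda s: (today - last_changed[s]).days)
print(most_recent)  # System B
```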
I believe this method will lead to a very usable Golden Record in 90 to 95% of all cases. It only becomes much more complex when you have to deal with complicated data, for instance a father and son living at the same address and sharing the same initials. I am curious which rules of thumb and methods you use when calculating the Golden Record. Please put your ideas in the comments.