<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data Value Talk &#187; knowledge</title>
	<atom:link href="http://datavaluetalk.com/tag/knowledge/feed/" rel="self" type="application/rss+xml" />
	<link>http://datavaluetalk.com</link>
	<description>Customer data is a valuable asset. Why not treat it that way?</description>
	<lastBuildDate>Thu, 10 May 2012 14:49:53 +0000</lastBuildDate>
	<language>nl</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Ask Me is linked with Any Body and relates to Walther von Stolzing</title>
		<link>http://datavaluetalk.com/data-quality/ask-me-is-linked-with-any-body-and-relates-with-walther-von-stolzing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ask-me-is-linked-with-any-body-and-relates-with-walther-von-stolzing</link>
		<comments>http://datavaluetalk.com/data-quality/ask-me-is-linked-with-any-body-and-relates-with-walther-von-stolzing/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 08:51:26 +0000</pubDate>
		<dc:creator>Winfried van Holland</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Names]]></category>
		<category><![CDATA[cleansing]]></category>
		<category><![CDATA[identity]]></category>
		<category><![CDATA[interpretation]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[name]]></category>
		<category><![CDATA[names]]></category>

		<guid isPermaLink="false">http://datavaluetalk.com/?p=1991</guid>
		<description><![CDATA[Weird title, isn&#8217;t it? As is obvious to everybody, &#8216;Ask Me&#8217; and &#8216;Any Body&#8217; are artificial names. They will never belong to a real person. How they relate to &#8216;Walther von Stolzing&#8217; will follow. For over 25 years Human Inference has collected reference data, for instance on persons. Because of our reference set we immediately recognize [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://datavaluetalk.com/cms/wp-content/uploads/2011/10/Obama.png"><img class="alignleft size-thumbnail wp-image-2022" title="I'm Obama" src="http://datavaluetalk.com/cms/wp-content/uploads/2011/10/Obama-150x150.png" alt="" width="150" height="150" /></a>Weird title, isn&#8217;t it? As is obvious to everybody, &#8216;Ask Me&#8217; and &#8216;Any Body&#8217; are artificial names. They will never belong to a real person. How they relate to &#8216;Walther von Stolzing&#8217; will follow.</p>
<p>For over 25 years Human Inference has collected reference data, for instance on persons. Because of our reference set we immediately recognize that &#8216;Ask Me&#8217; and &#8216;Any Body&#8217; are fake names. People are using these either in test situations or to hide their actual names.</p>
<p>In the old days we only needed to test for &#8216;Test Test&#8217;; in more recent years we have seen great inventiveness in these fake names. A brief sample is shown in the following list.</p>
<div align="center">
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="137">Alpha Beta</td>
<td valign="top" width="137">Any Body</td>
</tr>
<tr>
<td valign="top" width="137">Ask Me</td>
<td valign="top" width="137">Best Friend</td>
</tr>
<tr>
<td valign="top" width="137">Blue Sky</td>
<td valign="top" width="137">Cool Dude</td>
</tr>
<tr>
<td valign="top" width="137">Dress Code</td>
<td valign="top" width="137">El Comandante</td>
</tr>
<tr>
<td valign="top" width="137">Guess Who</td>
<td valign="top" width="137">In Cognito</td>
</tr>
</tbody>
</table>
</div>
<p>If you cannot rely on reference data and interpretation, you need to provide a checklist. Providing it is one thing, but since users tend to be really creative, maintaining it is essential.<span id="more-1991"></span></p>
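<p>As a rough illustration of such a check list, here is a minimal Python sketch. The set <code>FAKE_NAMES</code> and the helper names are assumptions made for this example, filled with the sample names from the table above; it is not how our reference data works, and a real list would need the ongoing maintenance mentioned above.</p>

```python
# Hypothetical check list, seeded with the sample names from the table above.
FAKE_NAMES = {
    "test test", "alpha beta", "any body", "ask me", "best friend",
    "blue sky", "cool dude", "dress code", "el comandante",
    "guess who", "in cognito",
}


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def is_listed_fake(name: str) -> bool:
    """Return True if the normalized name appears on the check list."""
    return normalize(name) in FAKE_NAMES


print(is_listed_fake("  Ask   ME "))
print(is_listed_fake("Walther von Stolzing"))
```

<p>An exact lookup like this only catches names that are already on the list, which is exactly why maintaining the list matters.</p>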
<p>In these 25 years we have seen a move from &#8216;real fake names&#8217; towards &#8216;real names used in a fake way&#8217;. In the USA, for example, we identified popular Hollywood names and names of politicians being used as fake names. Currently the usage of the name &#8216;George Bush&#8217; is decreasing, whereas &#8216;Barack Obama&#8217; is increasingly used. We recognize the false usage of these names because of the change in frequency figures of the given name and family name as well as the usage of the combination itself. It is remarkable that &#8216;Abraham Lincoln&#8217; and &#8216;George Washington&#8217; remain quite steady.</p>
<p>Back to &#8216;Walther von Stolzing&#8217;. By now you might have guessed what is happening here. We recognized that in German-speaking areas this name is also passing our threshold on validity. By <a href="http://en.wikipedia.org/wiki/Die_Meistersinger_von_N%C3%BCrnberg" rel="nofollow">googling</a> the name you can see that Walther is actually a character in Wagner’s opera &#8216;Die Meistersinger von Nürnberg&#8217; from 1868!</p>
<p>Let’s see if in 100 years’ time people are still using &#8216;Darth Vader&#8217;, &#8216;Lord Rings&#8217; or &#8216;Snoop Dogg&#8217;!</p>
<p>All the names used in this blog are ‘real’ names coming from a popular social media site. Please check our <a href="http://www.humaninference.nl/producten/data-cleansing">data cleansing</a> products in case you need cleansing solutions.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://datavaluetalk.com/data-quality/ask-me-is-linked-with-any-body-and-relates-with-walther-von-stolzing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Has your name ever hurt you? &#8211; when nomen becomes omen</title>
		<link>http://datavaluetalk.com/data-quality/when-nomen-becomes-omen/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=when-nomen-becomes-omen</link>
		<comments>http://datavaluetalk.com/data-quality/when-nomen-becomes-omen/#comments</comments>
		<pubDate>Mon, 08 Aug 2011 12:46:30 +0000</pubDate>
		<dc:creator>Esther Labrie</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Names]]></category>
		<category><![CDATA[customer data]]></category>
		<category><![CDATA[customer view]]></category>
		<category><![CDATA[first name]]></category>
		<category><![CDATA[identity]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[names]]></category>

		<guid isPermaLink="false">http://datavaluetalk.com/?p=1887</guid>
		<description><![CDATA[Addressing clients with the right data often means the difference between making a profit and not making a profit. Working with data quality experts has made me ever more conscious of the value personal data represents for people. In this respect names are especially intriguing to me, as owners appear to identify with their name [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://datavaluetalk.com/data-quality/when-nomen-becomes-omen/attachment/baby-baby-names-3/" rel="attachment wp-att-1899"><img class="alignleft size-thumbnail wp-image-1899" title="bad baby names" src="http://datavaluetalk.com/cms/wp-content/uploads/2011/08/baby-baby-names2-150x150.jpg" alt="" width="150" height="150" /></a>Addressing clients with the right data often means the difference between making a profit and not making a profit. Working with <a title="Data Quality" href="http://www.humaninference.com" target="_blank">data quality</a> experts has made me ever more conscious of the value personal data represents for people. In this respect names are especially intriguing to me, as owners appear to identify with their name <em>a lot</em>. So I decided to do a little research and determine if people really are what their name tells you. Can <em>nomen</em> indeed become <em>omen</em>?</p>
<p>Your parents probably gave a lot of thought to the name they once gave you, and as it turns out they were right to do so! Research tells us a name can do wonders for its owner, as well as a lot of damage for that matter. Let’s have a look at some remarkable results.</p>
<p><strong>Peter for President!<br />
</strong>Recent studies show that in the US a student called Fred is more likely to fail his exam than a student who just happens to be named Andrew: people tend to identify with their name and, in general, have a positive feeling about letters that correspond with their initials. Consequently Fred is far more likely to settle for a meager F, while Andrew will have an extra motive to strive for an A. <span id="more-1887"></span>It also explains how in choosing a partner we show a slight preference for someone whose name resembles our own, or why Mary will prefer to live in Maryland, while Monica is more inclined to settle in Santa Monica. Most of these preferences only show themselves through our subliminal selves, so we are not actually aware of the motivation for some of our choices. Another US study endorses these findings: inspired by the results mentioned above, researchers decided to investigate another letter. They came up with the letter K, which in baseball stands for strikeout. The study showed once again that there is a connection between a letter and its bearer: batters whose names began with a K struck out more often than other batters.</p>
<p><strong>Ominous names<br />
</strong>A UK study tells us that as many as one in five parents regret how they named their child. The novelty might have worn off after a few years, but can there be any real objections to a certain name? Apparently, there are plenty! Ironically it’s not the parents who’ll have to carry this burden for the rest of their lives…</p>
<p><strong>“Hi, I’m Antwan, but you can call me Antoine…”<br />
</strong>It seems that even children’s language skills are influenced by their name. This has to do with the effect negative emotions can have on a child’s performance. If for example you decided to name your son ‘Gene’ but spell it ‘Jene’, he is very likely to get confronted with disbelief from his teachers. “Are you sure your name isn’t spelled with a ‘G’?” This can severely undermine Jene’s sense of confidence. That explains why children with an unusual name or a name that is unusually spelled generally are less adequate spellers and readers.</p>
<p><strong>“But Sissi is a Royal name, dear!”<br />
</strong>When a girl is called Frankie we think it’s a fun name, a cool and robust statement to fit a strong personality. Yet when a boy is called Mckenzie (yes, some parents think it’s cute to give their boy a name that has a feminine touch to it), we see a similar effect, but with a different outcome. This is something his parents obviously had not foreseen: their son will constantly be shaking off his girly image. The effect is striking: boys with an androgynous name misbehave more often than their unambiguously named peers, especially when they reach puberty. A boy called Mckenzie or Aubrey is even more likely to display bad behaviour when there is a girl with the same name among his peers. One more reason for parents to stick to conventions when choosing a name for their newborn.</p>
<p><strong>Want to produce the new Einstein? Call her Kate!<a href="http://datavaluetalk.com/data-quality/when-nomen-becomes-omen/attachment/einstein/" rel="attachment wp-att-1911"><img class="alignright size-thumbnail wp-image-1911" title="The new Einstein? Kate!" src="http://datavaluetalk.com/cms/wp-content/uploads/2011/08/einstein-150x150.jpg" alt="The new Einstein? Kate!" width="150" height="150" /></a><br />
</strong>A name can be a burden, but if you use this knowledge wisely, you might just turn it into an advantage. What happens to a girl when she has finished school and needs to choose what subject to study? Well, according to a US study, her choice depends on her name. As it turns out girls with a very feminine name like Julietta or Isabella are more likely to study humanities, while those whose name is less obviously feminine are more partial towards science. The question is: who’s aspiring to whom? Could it be that parents would treat Kate in a different way than Barbara? Or did the parents subconsciously decide they wanted to raise a scientist when they decided to call their daughter Kate?</p>
<p><strong>Would you rather hire Vanity or Grace?<br />
</strong>Of course it’s not just letters or gender that determines how we feel about a name. In fact, how other people perceive us very much depends on the meaning of our name. For example: when looking for a new member on your marketing team, would you rather hire Vanity or Grace? In spite of what her name tells us, Grace might be a job jumper who doesn’t know how to work in unison with her colleagues. Vanity on the other hand could just be a daughter of a well-read mother who had just finished her latest Thackeray when she gave birth. Still, both women will either meet a lot of prejudice or feel the need to live up to a very high standard because of their name.</p>
<p>It all goes to show that a name definitely possesses some self-fulfilling qualities. The fact that so many parents regret their choice of name afterwards makes me think that the owners of those names might share these sentiments. So what does that mean when looking at it from a data quality point of view? Unisex names for example are responsible for a lot of data quality issues. As the borders between male and female names are fading we’ll need to update our knowledge continually. The human in Human Inference will definitely take care of that. After all, we wouldn’t want you to put off Mrs Clinton when sending her a petition to take pity on the Syrian citizens starting: &#8220;<em>Dear Mr. Clinton</em>…”.</p>
<p>Source: Livescience.com &amp; Babynames.com</p>
]]></content:encoded>
			<wfw:commentRss>http://datavaluetalk.com/data-quality/when-nomen-becomes-omen/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>We have 180 million names! Which one is right?</title>
		<link>http://datavaluetalk.com/data-quality/we-have-180-million-names-which-one-is-right/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=we-have-180-million-names-which-one-is-right</link>
		<comments>http://datavaluetalk.com/data-quality/we-have-180-million-names-which-one-is-right/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 10:46:24 +0000</pubDate>
		<dc:creator>Winfried van Holland</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[apples and oranges]]></category>
		<category><![CDATA[DataCleaner]]></category>
		<category><![CDATA[family names]]></category>
		<category><![CDATA[filters]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[Name research]]></category>
		<category><![CDATA[naming conventions]]></category>
		<category><![CDATA[patronyms]]></category>
		<category><![CDATA[purification]]></category>
		<category><![CDATA[transliteration]]></category>

		<guid isPermaLink="false">http://datavaluetalk.com/?p=1692</guid>
		<description><![CDATA[The internet is an ocean of rich content, but unfortunately, as in the real world, it’s heavily polluted. As a company in business for 25 years, Human Inference absolutely sees the benefits of the internet. For our reasoning processes, based on natural language processing, we gather content and we classify this content on type, such [...]]]></description>
			<content:encoded><![CDATA[<p><a rel="attachment wp-att-1694" href="http://datavaluetalk.com/data-quality/we-have-180-million-names-which-one-is-right/attachment/name-cloud/"><img class="alignleft size-thumbnail wp-image-1694" title="Name cloud" src="http://datavaluetalk.com/cms/wp-content/uploads/2011/02/Name-cloud-150x150.png" alt="" width="150" height="150" /></a></p>
<p>The internet is an ocean of rich content, but unfortunately, as in the real world, it’s heavily polluted.</p>
<p>As a company in business for 25 years, Human Inference absolutely sees the benefits of the internet. For our reasoning processes, based on natural language processing, we gather content and we classify this content on type, such as given names, family names, prefix, suffix, etc. (See also my <a href="http://datavaluetalk.com/2010/02/11/new-matching-engines-go-beyond-apples-and-oranges/" target="_blank">blog post on the comparison of apples and oranges</a>.)</p>
<p>In the past this was done manually by, for example, investigating telephone books or manual research of census lists. But these were the ‘pioneer years’. What we see now is an enormous amount of content that can be gathered on the internet. It’s quite easy to find an internet page with 180 million records of person names. Great, so knowledge gathering is passé now?<span id="more-1692"></span></p>
<p>Sorry to say, it has become far more complex now. The volume is much larger than in the past, but the quality is worse. Of course there are nuances in quality (e.g. the difference in content coming from social media versus the content from census lists), but in all these cases people working in <a href="http://www.humaninference.com">data quality</a> will absolutely agree with me that even lists coming from the government contain false names and garbage.</p>
<p>Before we can discuss how to gather the right content from a large set, we need to have a common understanding of what a right name is. In discussions with linguists you can have long debates about this. Take my family name for example: according to the Dutch rules the family name “van Holland” is valid and known, and the family name “Van Holland” or “Vanholland” is invalid and unknown. In an international setting however, both of those names are absolutely valid and known (for example in Belgium or the United States). Similar situations might occur when people move between countries and patronyms (names derived from the father’s name) or gendered family names are used in one country, but not in another. Then you could find a name like Maria <em>Romanov</em>, who, in her country of origin, would be named Maria <em>Romanova.</em></p>
<p>This phenomenon, which I call the ‘degeneration’ of names, has been going on ever since official administrations wanted to store names of individuals. That’s the reason why there are, for example, so many variants of the name Mohammed. Sometimes this cannot be prevented because there are characters in the original name which are not known at the registration office. An example would be the family name Sørensen (quite common in Scandinavia), which would become Sorensen, because of the unavailability of the letter ‘ø’. The moment a name is officially registered and used as such, the name becomes a valid name. This becomes even more complex once a name is written in a different writing system and has to be transcribed or transliterated.</p>
<p>We see here that right and wrong need to be seen in the context of the region. We define a name as right if the individual related to that name identifies her/himself with that name. So, when I’m travelling to the US, I might still feel that “Vanholland” is wrong, but I can imagine that my relatives who have been living in the US for many years now call themselves “Vanholland”. They identify with that name and in that case the name becomes “right”.</p>
<p>Back to the 180 million records. In order to filter these to get valid names we first apply so-called purification filters. That’s a huge collection of different filters divided into specific sets, such as salutation filters, company related word filters, extraordinary sequence filters, strange character filters, capitalization filters, etc. For example, the strange character filter removes the name “be@home” but keeps “d’Ancona”, and the legal forms filter in the company related words set removes “John Baker Inc.”. Nice examples in the extraordinary sequence set are always the words on our black list (e.g. “asdf” or “qwerty”) or repeating sequences (e.g. “abcabcabc”). After this purification we end up with 160 million records.</p>
<p>These 160 million records are aggregated on family name and the frequency is counted. This causes a major reduction, because now we have a list of 14 million potential family names. If we compare these 14 million with our internal knowledge we immediately see the difficulty of dealing with names, i.e., the long tail. In general, the top 10% of the names in a region account for over 90% of the population. The so-called long tail, the remaining 90% of the names, accounts for less than 10% of that population. This means that names in this long tail have a very low frequency, as do most of the wrong names! So on frequency alone you cannot tell whether a name is valid or not! For example, we see that within these 14 million names, roughly 1.1 million account for over 140 million persons in the original population! The remaining 12.9 million family names still need a lot of attention to verify whether they are valid names. A sample of such names with their respective frequencies is given in the table below and immediately shows that we cannot simply assume that all these names are valid:</p>
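<p>To make the idea of purification filters a bit more concrete, here is a minimal Python sketch of three of them. Everything in it (the word lists, the allowed punctuation, the rule that a sequence must repeat at least three times, the function name <code>rejection_reason</code>) is an assumption made for this illustration; the real filter sets are far larger and are not reproduced here.</p>

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
LEGAL_FORMS = {"inc", "ltd", "llc", "gmbh", "bv"}
BLACKLIST = {"asdf", "qwerty"}
# Punctuation we assume may legitimately occur in a person name.
ALLOWED_PUNCTUATION = " '’-."
# A sequence repeated at least three times, e.g. "abcabcabc".
# Three repeats is an assumption, chosen so that names like "Mimi" survive.
REPEAT = re.compile(r"(.+?)\1{2,}")


def rejection_reason(name):
    """Return the name of the filter that rejects `name`, or None if it passes."""
    tokens = [t.strip(".").lower() for t in name.split()]
    # Strange character filter: anything that is not a letter or allowed punctuation.
    if not all(ch.isalpha() or ch in ALLOWED_PUNCTUATION for ch in name):
        return "strange character"
    # Company related word filter: legal forms such as "Inc.".
    if any(t in LEGAL_FORMS for t in tokens):
        return "company related word"
    # Extraordinary sequence filter: black list words or repeating sequences.
    if any(t in BLACKLIST or REPEAT.fullmatch(t) for t in tokens):
        return "extraordinary sequence"
    return None


for candidate in ["be@home", "d’Ancona", "John Baker Inc.", "qwerty", "abcabcabc"]:
    print(candidate, "->", rejection_reason(candidate))
```

<p>Returning the reason instead of a plain yes/no makes it possible to route rejected records to separate buckets for later inspection.</p>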
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>Potential Family name</td>
<td>Frequency (from 160 million records)</td>
</tr>
<tr>
<td>Rofriguez</td>
<td>117</td>
</tr>
<tr>
<td>Rodriiguez</td>
<td>91</td>
</tr>
<tr>
<td>Rodrigguez</td>
<td>78</td>
</tr>
<tr>
<td>Rpdriguez</td>
<td>74</td>
</tr>
<tr>
<td>Rodrgiguez</td>
<td>51</td>
</tr>
<tr>
<td>Rodrigfuez</td>
<td>20</td>
</tr>
</tbody>
</table>
<p>Roughly 3 million of these names have a frequency of 1! They are very hard to validate automatically, but there are techniques we use which I will describe later in a separate blog post.</p>
<p>In order to handle these volumes of records and to place records in the right validation buckets we use <a href="http://datacleaner.eobjects.org/" target="_blank">DataCleaner</a>. This is a very powerful and fast open source profiling toolset. With DataCleaner 2.0 we also use the validation steps and the fact that you can now make a complete hierarchy of these steps in a single template. This way, the results can be stored in new data stores that can be examined later within the same tool!</p>
<p>We are not afraid of tackling large datasets to gather content. A large dataset will not automatically provide you with a large update in your knowledge; there is an enormous amount of pollution in such a set. We use DataCleaner for purification because it’s powerful and extremely flexible. We have defined a complete process to purify a given dataset and to automatically retrieve as many valid names as possible, which will then be added to our knowledge. Manual work will remain, but we can now focus more on linguistically and culturally challenging names, instead of spending most of our time on trivial ones. Purifying 180 million records is a hell of a job, but we are good at it!</p>
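<p>The aggregation step and the long-tail effect described above can be sketched in a few lines of Python. The toy data and the function names are assumptions for this illustration only; the point is merely that a frequency of 1 is shared by typos and by rare but valid names alike.</p>

```python
from collections import Counter


def family_name_frequencies(family_names):
    """Aggregate a list of family names into name -> frequency."""
    return Counter(family_names)


def head_coverage(freq, top_fraction=0.10):
    """Share of all records covered by the most frequent `top_fraction` of names."""
    counts = sorted(freq.values(), reverse=True)
    head_size = max(1, round(len(counts) * top_fraction))
    return sum(counts[:head_size]) / sum(counts)


def singletons(freq):
    """Names that occur exactly once."""
    return [name for name, n in freq.items() if n == 1]


# Toy data: one very common name plus nine names that occur once each.
records = ["Rodriguez"] * 91 + [
    "Rofriguez", "Rpdriguez", "Rodrigfuez",      # typos
    "Sørensen", "d’Ancona", "van Holland",       # valid but rare in this set
    "Vanholland", "Romanova", "Stolzing",
]
freq = family_name_frequencies(records)
print(len(freq), "distinct names")
print(head_coverage(freq), "of the records fall in the top 10% of names")
print(len(singletons(freq)), "names occur only once")
```

<p>In this toy set the typo “Rofriguez” and the perfectly valid “Sørensen” have the same frequency, so frequency alone cannot separate them.</p>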
]]></content:encoded>
			<wfw:commentRss>http://datavaluetalk.com/data-quality/we-have-180-million-names-which-one-is-right/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Your name is too &#8220;common&#8221;&#8230;.</title>
		<link>http://datavaluetalk.com/data-governance/your-name-is-too-common/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=your-name-is-too-common</link>
		<comments>http://datavaluetalk.com/data-governance/your-name-is-too-common/#comments</comments>
		<pubDate>Mon, 07 Sep 2009 13:14:24 +0000</pubDate>
		<dc:creator>Holger Wandt</dc:creator>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Banks]]></category>
		<category><![CDATA[Chinese characters]]></category>
		<category><![CDATA[customer view]]></category>
		<category><![CDATA[deduplication]]></category>
		<category><![CDATA[interpretation]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[single customer view]]></category>

		<guid isPermaLink="false">http://datavaluetalk.com/?p=1207</guid>
		<description><![CDATA[A major bank in Dongguan (China) refused a potential customer because his name is Li Jun. Apparently, there were already over 300 bank accounts assigned to the name Li Jun. Not that this particular Li Jun was responsible for opening all these accounts, there were just too many men with exactly the same name. The [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-thumbnail wp-image-1209" title="chinese-characters" src="http://datavaluetalk.com/cms/wp-content/uploads/2009/09/chinese-characters-150x150.jpg" alt="chinese-characters" width="150" height="150" /></p>
<p>A major bank in Dongguan (China) refused a potential customer because his name is Li Jun. Apparently, there were already over 300 bank accounts assigned to the name Li Jun. Not that this particular Li Jun was responsible for opening all these accounts; there were just too many men with exactly the same name. The bank states that the refusal is nothing personal, since nobody with the name Li Jun will be accepted as a customer in the near future&#8230; In the meantime, Li Jun is taking legal action against the bank.<span id="more-1207"></span></p>
<p>When I read this news article this morning, my first thought was that it was perhaps a hoax. It turns out, however, that the story is true. From a data quality point of view this strikes me as really strange. How does this particular bank manage its customer data? Are there no additional identifiers (address, date of birth, etc.) to determine that you are actually dealing with the customer you think you are dealing with? Imagine that every John Smith had a hard time opening a bank account, applying for a job or buying a product via the web. Or Jenny Jones? Bob Johnson? When is a name too &#8220;common&#8221;? It is a common misbelief that the complexity of ideographic characters, such as those used to write Mandarin Chinese, makes names harder to identify. At Human Inference we have carried out some pretty serious deduplications of Chinese files and, taking into account that Mandarin Chinese is a tonal language and that other principles of fault tolerance apply, the duplicate identification was rather accurate.</p>
<p>It is all a matter of using an intelligent <a title="data matching" href="http://www.humaninference.com/products/data-matching" target="_blank">data matching</a> method and knowing what kind of data one is working on. Every name can be identified; even &#8220;common&#8221; names.</p>
]]></content:encoded>
			<wfw:commentRss>http://datavaluetalk.com/data-governance/your-name-is-too-common/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Any close encounters with the FBI terrorist watchlist?</title>
		<link>http://datavaluetalk.com/data-governance/any-close-encounters-with-the-fbi-terrorist-watchlist/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=any-close-encounters-with-the-fbi-terrorist-watchlist</link>
		<comments>http://datavaluetalk.com/data-governance/any-close-encounters-with-the-fbi-terrorist-watchlist/#comments</comments>
		<pubDate>Mon, 17 Aug 2009 09:14:34 +0000</pubDate>
		<dc:creator>Ramon de Noronha</dc:creator>
				<category><![CDATA[Data Governance]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[compliance]]></category>
		<category><![CDATA[identification]]></category>
		<category><![CDATA[identity]]></category>
		<category><![CDATA[interpretation]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[persistent identification]]></category>
		<category><![CDATA[processes]]></category>
		<category><![CDATA[suspect list matching]]></category>

		<guid isPermaLink="false">http://datavaluetalk.com/?p=1125</guid>
		<description><![CDATA[Just before this summer the U.S. Department of Justice filed a report about the FBI Terrorist Watchlist. This watchlist serves as a critical tool for screening and law enforcement personnel, alerting them when they come across a known or suspected terrorist. It is used by personnel at airports, harbours and the border. Also when [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1127" src="http://datavaluetalk.com/cms/wp-content/uploads/2009/08/tsc080105a.jpg" alt="tsc080105a" width="160" height="152" />Just before this summer the U.S. Department of Justice filed a report about the FBI Terrorist Watchlist. This watchlist serves as a critical tool for screening and law enforcement personnel, alerting them when they come across a known or suspected terrorist. It is used by personnel at airports, harbours and the border. Also when you apply for a visa you are matched against this watchlist. The Terrorist Screening Center, a subsidiary of the FBI, is responsible for maintaining the watchlist.</p>
<p>This watchlist was created in 2004 from several other lists and at that time it consisted of about 68,000 entries. I use the word entries, because in the years after it became unclear whether one record corresponds to one individual. By the end of 2008 the list had grown to over 1.1 million entries. In 2008, after the American Civil Liberties Union (ACLU) mentioned that the list had <a title="Numbers don't add up" href="http://www.aclu.org/privacy/gen/36064res20080721.html" target="_blank">passed the 1 million</a> mark, the government came up with an explanation. <em>Although we have recorded over 1 million entries in the database, the net result is that these records correspond to about 400,000 individuals. </em>Terrorists often use different and thus multiple identities, use several (falsified) passports etc. But adding entries with only the first initials and last name, while an entry with the full first names and last name already exists, will result in unwanted side effects.<span id="more-1125"></span></p>
<p>As people interested in data quality and identity resolution, we all know that J. Robinson will result in many more matches (hits) than James Robinson. Indeed, the number of matches found will skyrocket, and they all have to be evaluated manually. Might this be the reason that we see more and more security personnel at airports?</p>
<p>In the <a href="http://www.usdoj.gov/oig/reports/FBI/a0925/final.pdf" target="_blank">latest audit report</a> of the U.S. Department of Justice about this watchlist, one other problem was analyzed. While extensive procedures were made for nominating and adding suspects to the watchlist, there is no procedure for removing people from the list. Based on a sample of almost 70,000 entries and investigation of the individuals, an astounding 35% of the entries turned out to be outdated: people who had died were still on the list, as were people who were no longer under investigation and cases that had been closed. So this watchlist is <a href="http://www.aclu.org/privacy/spying/watchlistcounter.html" target="_blank">growing and growing</a>, with the result that screening personnel ensnare many innocent travelers as suspected terrorists, which wastes their time and diverts their energy from looking for true terrorists. It seems to me that the FBI and TSC could benefit from better Data Governance. What do you think?</p>
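<p>The effect of initial-only entries is easy to demonstrate. The following Python sketch uses a deliberately naive matching rule (same last name, and either the same first name or, if one side is only an initial, the same first letter). The rule, the function names and the traveler list are all assumptions for this illustration and say nothing about how the actual watchlist matching works.</p>

```python
def first_names_match(a, b):
    """Naive rule: an initial matches any first name starting with that letter."""
    a = a.rstrip(".").lower()
    b = b.rstrip(".").lower()
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def hits(entry, travelers):
    """Return the travelers that match a (first name, last name) watchlist entry."""
    first, last = entry
    return [
        (t_first, t_last)
        for t_first, t_last in travelers
        if t_last.lower() == last.lower() and first_names_match(first, t_first)
    ]


# Hypothetical travelers, for illustration only.
travelers = [
    ("James", "Robinson"),
    ("John", "Robinson"),
    ("Jane", "Robinson"),
    ("Mary", "Robinson"),
    ("James", "Robins"),
]

print(len(hits(("James", "Robinson"), travelers)), "hit(s) for James Robinson")
print(len(hits(("J.", "Robinson"), travelers)), "hit(s) for J. Robinson")
```

<p>Each extra hit on the initial-only entry is a traveler who has to be cleared by hand.</p>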
]]></content:encoded>
			<wfw:commentRss>http://datavaluetalk.com/data-governance/any-close-encounters-with-the-fbi-terrorist-watchlist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Budget for Data Quality seems no problem</title>
		<link>http://datavaluetalk.com/data-quality/budget-for-data-quality-seems-no-problem/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=budget-for-data-quality-seems-no-problem</link>
		<comments>http://datavaluetalk.com/data-quality/budget-for-data-quality-seems-no-problem/#comments</comments>
		<pubDate>Thu, 16 Oct 2008 12:00:36 +0000</pubDate>
		<dc:creator>Emile van de Klok</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Budget]]></category>
		<category><![CDATA[interpretation]]></category>
		<category><![CDATA[knowledge]]></category>
		<category><![CDATA[processes]]></category>

		<guid isPermaLink="false">http://datavaluetalk.wordpress.com/?p=95</guid>
		<description><![CDATA[A 2008 survey by Human Inference indicates that processes are perceived as the biggest challenge in relation to Data Quality. The one subject that seems to be no problem, however, is the budget. Human Inference differentiates itself by interpretation of knowledge. So from this perspective I wonder how the respondents interpreted the word &#8220;processes&#8221;. Do they [...]]]></description>
			<content:encoded><![CDATA[<div class="mceTemp">
<p class="MsoNormal" style="margin: 0;"><span style="font-size: small; font-family: Myriad Pro;">A 2008 survey by Human Inference indicates that processes are perceived as the biggest challenge in relation to <a title="data quality" href="http://www.humaninference.com" target="_blank">Data Quality</a>. The one subject that seems to be no problem, however, is the budget. Human Inference differentiates itself by interpretation of knowledge. So from this perspective I wonder how the respondents interpreted the word &#8220;processes&#8221;. Do they mean the processes within the value chain of their companies or do they actually mean the process of obtaining a budget for Data Quality? The latter would actually explain a lot.</span></p>
</div>
<div id="attachment_75" class="wp-caption alignnone" style="width: 310px"><a href="http://datavaluetalk.files.wordpress.com/2008/10/survey-challenge-dq.jpg"><img class="size-medium wp-image-75" title="survey-challenge-dq" src="http://datavaluetalk.files.wordpress.com/2008/10/survey-challenge-dq.jpg?w=300" alt="HI Survey Results" width="300" height="188" /></a><p class="wp-caption-text">HI Survey Results</p></div>
]]></content:encoded>
			<wfw:commentRss>http://datavaluetalk.com/data-quality/budget-for-data-quality-seems-no-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

