Blame It on Ted
First Published Thursday, 26th April 2012 02:30 pm from TIBCO Software : Dave Chamberlain
Imperfect data - a historical perspective
Our world of computing in 1969 was very different from today's. That year, Dr. E.F. (Ted) Codd published his first internal IBM paper, "Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks", followed in 1970 by the ACM publication, "A Relational Model of Data for Large Shared Data Banks" - the birth of relational databases as we know them today.
Organizations used to have complete control of their data. With just a few systems (usually automating back-office functions) there was no concept of customer self-service, or integrated supply chains, or third-party data feeds, or just about anything else we take for granted today.
alt="" width="300" height="239" />
generated by professional data entry staff; they took pride in
getting the data entry right, with very low error rates. Data was
processed sequentially, tapes spinning round and lights flashing
brightly; often you could tell what job was being processed by
the noises in the computer room.
What's changed over 40 years? Today the typical
organization runs hundreds, if not thousands, of systems spread
across large data centers - many of these applications sharing
data with external sources, their supply chain, external data
feeds - and, of course, we constantly encourage our customers to serve themselves as much as possible. When you add up 40+ years' worth of growth and change, it is easy to see how organizations have come to hold such volumes of "imperfect" data - data that is full of errors and differences.
SQL has little ability to deal with imperfect data
In 1969, there was no concept of data being anything other than perfect. This goes a long way towards explaining why, as RDBMS and SQL were being defined, very little allowance was made for errors in data. "LIKE" or "contains" clauses and wildcard characters let data with known errors be found, and very little else. If SQL can't find the data that people and systems need, it has to be searched for by hand, so significant human effort ends up being spent trawling through databases for the right data. Some organizations have tried to deal with the problem by building monolithic dynamic SQL search systems, which they typically find to be very resource-intensive. These systems take a lot of effort to design, build and maintain, and still end up not being able to find the data.
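To make the limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers table and the misspelled surnames are entirely hypothetical, chosen only to illustrate the point:

```python
import sqlite3

# Hypothetical customer data, including common misspellings of one surname.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, surname TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Chamberlain"), (2, "Chamberlin"),
     (3, "Camberlain"), (4, "Chmaberlain")],
)

# An exact match finds only the correctly spelled row.
print(conn.execute(
    "SELECT id, surname FROM customers WHERE surname = 'Chamberlain'"
).fetchall())  # [(1, 'Chamberlain')]

# A wildcard helps only where we already anticipated the error:
# 'Chamberl%n' tolerates a dropped 'a' near the end, and nothing else.
print(conn.execute(
    "SELECT id, surname FROM customers WHERE surname LIKE 'Chamberl%n'"
).fetchall())  # [(1, 'Chamberlain'), (2, 'Chamberlin')]

# Every further variant ('Camberlain', 'Chmaberlain', ...) needs its own
# hand-written pattern; SQL has no general notion of "close enough".
```

The combinatorics are the real problem: each new class of error multiplies the patterns a query must enumerate, which is exactly why the hand-built dynamic SQL search systems described above grow so large and still miss data.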
If only we could
leverage all that we now know about data and go back in time to
build RDBMS and SQL with the built-in ability to deal with all
sorts of data effectively and efficiently. More realistically, of
course, we need a different way to find the data
people and systems require without needing to know
the multitude of ways data can be "imperfect." We also need to
bear in mind that people are very good at finding data using
their built-in ability to see through errors and differences -
the only problem is that they work at their own, much slower
pace. Providing systems with the ability to work as accurately as humans, yet at the speed of systems, is long overdue.
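As one illustration of what that could look like - a simple similarity-scoring approach, not necessarily what any particular product implements - Python's standard-library difflib ranks candidates by closeness instead of demanding an exact or pattern match. The surnames below are the same hypothetical examples used earlier:

```python
import difflib

# The same hypothetical misspelled surnames, plus an unrelated one.
candidates = ["Chamberlain", "Chamberlin", "Camberlain",
              "Chmaberlain", "Smith"]

# get_close_matches scores each candidate against the query and keeps
# those above a similarity cutoff - misspellings and all - with no
# per-error patterns to write.
matches = difflib.get_close_matches("Chamberlain", candidates,
                                    n=5, cutoff=0.8)
print(matches)
# e.g. ['Chamberlain', 'Chamberlin', 'Camberlain', 'Chmaberlain']
```

The trade-off is that similarity scoring is far more expensive per comparison than an indexed lookup, which is why doing it at the speed of systems, over real data volumes, is the hard part.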
Stay tuned for five things to consider when evaluating solutions to deal with imperfect data (http://www.tibco.com/products/business-optimization/pattern-matching/default.jsp)!