Blame It on Ted

First Published Thursday, 26th April 2012 02:30 pm from TIBCO Software : Dave Chamberlain

The opinions expressed by this blogger and those providing comments are theirs alone and do not reflect the opinion of Automated Trader or any employee thereof. Automated Trader is not responsible for the accuracy of any of the information supplied by this article.


Imperfect data - A historical perspective

Our world of computing in 1969 was very different from today. In 1969, Dr. E.F. (Ted) Codd published his first internal IBM paper, "Derivability, Redundancy and Consistency of Relations stored in Large Data Banks", followed in 1970 by the ACM publication, "A Relational Model of Data for Large Data Banks" - the birth of relational databases as we know them today.

Organizations used to have complete control of their data. With just a few systems (usually to automate back office functions) there was no concept of customer self-service, or integrated supply chains, or third party data feeds, or just about anything we take for granted today.


Data was generated by professional data entry staff; they took pride in getting the data entry right, with very low error rates. Data was processed sequentially, tapes spinning round and lights flashing brightly; often you could tell what job was being processed by the noises in the computer room.

What's changed?

What's changed over 40 years? Today the typical organization runs hundreds, if not thousands, of systems spread across large data centers. Many of these applications share data with external sources such as supply chain partners and external data feeds and, of course, we are constantly trying to get our customers to do as much for themselves as possible. Add up 40+ years' worth of growth and change, and we can see how organizations have come to have such volumes of "imperfect" data to deal with - data that is full of errors, inaccuracies and inconsistencies.

SQL has little ability to deal with imperfect data

In 1969, there was no concept of data that was anything other than perfect. This is a major reason why, as RDBMS and SQL were being defined, very little allowance was made for errors in data. "LIKE" or "contains" clauses and wildcard characters enable data with known errors to be found, and very little else. If SQL can't find the data people and systems need, then the data has to be searched for by hand, so you often find significant human effort being spent trawling through databases to find the right data. Some organizations have tried to deal with the problem by building monolithic dynamic SQL search systems, which they typically find are very resource-intensive. These systems take a lot of effort to design, build and maintain, and still end up not being able to find the data.
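
To make the limitation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the sample names and the helper function like_search are all hypothetical, chosen only for illustration. It shows a wildcard pattern catching the one kind of variation the query author anticipated (a different ending) while missing misspellings that fall outside the pattern.

```python
import sqlite3


def like_search(names, pattern):
    """Return the names matching a SQL LIKE pattern, sorted by name.

    Hypothetical helper: loads the names into a throwaway in-memory table.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE customers (name TEXT)")
        conn.executemany(
            "INSERT INTO customers (name) VALUES (?)",
            [(n,) for n in names],
        )
        rows = conn.execute(
            "SELECT name FROM customers WHERE name LIKE ? ORDER BY name",
            (pattern,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


# Made-up sample data: one correct spelling and three "imperfect" variants.
NAMES = ["John Smith", "Jon Smith", "Jhon Smith", "John Smyth"]

# The wildcard only covers variation after "John Sm".
found = like_search(NAMES, "John Sm%")
missed = [n for n in NAMES if n not in found]

print("found: ", found)
print("missed:", missed)
```

The pattern finds "John Smyth" because that error was anticipated, but "Jon Smith" and "Jhon Smith" are left for someone to find by hand, or for yet another pattern to be added to the query.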

The route forward

If only we could leverage all that we now know about data and go back in time to build RDBMS and SQL with the built-in ability to deal with all sorts of data effectively and efficiently. More realistically, of course, we need a different way to find the data people and systems require without needing to know the multitude of ways data can be "imperfect." We also need to bear in mind that people are very good at finding data using their built-in ability to see through errors and differences - the only problem is that they work at their own, much slower pace. Providing systems with the ability to work as accurately as humans, yet at the speed of systems, is long overdue.
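
As a rough illustration of the idea, and not a description of any particular product, the sketch below uses the similarity scoring in Python's standard difflib module to find records that are close to a query without listing the possible errors in advance. The function fuzzy_find, the sample records and the 0.8 cutoff are all assumptions made for this example.

```python
import difflib


def fuzzy_find(query, candidates, cutoff=0.8):
    """Return candidates whose similarity to the query is at least cutoff.

    Hypothetical helper: comparison is case-insensitive and results are
    ordered from most to least similar.
    """
    # Map lowercased text back to the original spelling of each record.
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        query.lower(),
        list(lowered),
        n=max(1, len(lowered)),  # n must be positive, even with no candidates
        cutoff=cutoff,
    )
    return [lowered[h] for h in hits]


# Made-up sample data: several imperfect spellings plus an unrelated record.
RECORDS = ["Jon Smith", "Jhon Smith", "John Smyth", "John Smith", "Jane Doe"]

matches = fuzzy_find("John Smith", RECORDS)
print(matches)
```

Here the misspelled variants are returned alongside the exact match, while the unrelated record is not. A similarity cutoff is a blunt instrument compared with human judgment, and the right threshold depends on the data, so treat the value above as a starting point only.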

Stay tuned for five things to consider when evaluating solutions (http://www.tibco.com/products/business-optimization/pattern-matching/default.jsp) to deal with imperfect data!


  • Copyright © Automated Trader Ltd 2013 - The Gateway to Algorithmic and Automated Trading
