
The Year of Big Data: The challenges of putting trillions of odd-shaped pegs into neat little holes

Published in Automated Trader Magazine Issue 27 Q4 2012

The volume of digital content is forecast to surge to a staggering 2.7 zettabytes in 2012. How big is that? A single zettabyte is 10 to the 21st power bytes, so we're talking billions of terabytes. What's more, at least 90 percent of that content will be unstructured, made up of enormous numbers of files generated by social media or web-enabled processes. How do companies turn this information soup into something palatable? The short answer is: in lots of ways. Adam Cox reports.


This is the year when "Big Data" becomes a must-have competency, according to International Data Corporation (IDC). Its forecast volume of 2.7 zettabytes for the world's digital content represents a 48 percent jump from the 2011 total. Most of that - social media files, text, images, videos, audio - is unstructured and will pose formidable challenges for business intelligence (BI) specialists. But don't expect those specialists to be daunted.

"As businesses seek to squeeze high-value insights from this data, IDC expects to see offerings that more closely integrate data and analytics technologies, such as in-memory databases and BI tools, move into the mainstream," IDC said in a report on this year's data outlook.

How do information specialists make sense of all this unstructured data and what are the key hurdles they face? In the realm of financial markets, much comes down to the techniques they use to process textual data.

BT Radianz, which focuses on telecoms for the financial markets, is seeing an increasing number of information providers come onto its network to deliver unstructured data that has been reshaped into structured data.

Michael Cooper

"I have a feel for it, and that feeling is that it's broadening and increasing," said Michael Cooper, BT Radianz's chief technology officer, referring to the use of unstructured data. "The base of what people are using is definitely broadening and I think that the frequency or volume of that data is increasing as well."

That points to one of the most immediate challenges arising from this trend: data capacity. "So one of the things you have to do with this type of data, you really have to use it intelligently," Cooper said. That means not only dealing with the sheer volume, which he suggests is actually manageable, but also addressing its fluid nature, which is less so.

"I actually think there's a reasonable roadmap towards more capacity," Cooper said. "I think the complication and the difficulty is the volatility of it. Whilst you can install 100 gigabytes everywhere if you want to, the commercials don't make sense but the volatility may actually require it at some point."

"More intelligent software, virtualisation in terms of underlying capacity efficiencies and other technological advances such as distributed computing will be key, " Cooper said. "To that end, BT Radianz has been experimenting in its labs with Apache Hadoop, the open-source software that features prominently in the unstructured data world because it allows for distributed processing of extremely large data sets."

"If I'm an algorithmic trader, I've probably got a footprint in a number of places," he said. "I might like to be able to avail myself of some sort of common standard capability, which might give me on-tap compute, the ability to process very large data sets at location," Cooper said.

But moving and storing the data is just one of many hurdles. The really funky problems begin when companies start to try to analyse and understand the data itself. At this stage, all manner of technological dilemmas appear, from how to identify entities to how to extract specific information from the clutter of data in a document. Natural Language Processing, the burgeoning field that combines computer science with linguistics, acts as a kind of lifeblood in many systems.

First of all, a system needs to be able to work out what has happened. RavenPack's director of quantitative research, Peter Hafez, said his company is able to detect about 1,200 different events, about one third of which are related to companies and two-thirds of which are more macro-type events. These include natural and man-made disasters and all manner of political and social events.
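
As a much-simplified illustration of what event detection involves (RavenPack's actual taxonomy and methods are proprietary and far richer than this), a handful of keyword patterns can already tag the most obvious event types:

```python
import re

# Toy event patterns: a drastic simplification of a commercial taxonomy
# spanning roughly 1,200 event types.  Purely illustrative.
EVENT_PATTERNS = {
    "earnings": re.compile(r"\b(reports?|posts?)\s+(quarterly|annual)\s+(profit|loss|earnings)\b", re.I),
    "dividend": re.compile(r"\b(raises?|cuts?|declares?)\s+(its\s+)?dividend\b", re.I),
    "natural_disaster": re.compile(r"\b(earthquake|hurricane|flood|tsunami)\b", re.I),
}

def detect_events(text):
    """Return the event types whose pattern appears in the text."""
    return [name for name, pattern in EVENT_PATTERNS.items() if pattern.search(text)]

print(detect_events("Acme Corp posts quarterly profit and raises dividend"))
# -> ['earnings', 'dividend']
```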

Tim Haas

Brian Rooney, global business manager for Bloomberg core news products, said one of the "next big things" in unstructured data R&D lies in handling more and more types of events beyond the easier-to-define ones such as earnings, dividends or buy-backs.

Next there is the question of entity detection. Who is involved in those events? Who are the instigators, victims, beneficiaries, bystanders et cetera? RavenPack says it can detect 140,000 entities, such as companies, people or organisations.

Then it starts to get even thornier. Not only does a system need to ensure that the right entities are linked to the right events, but also it needs to perform semantic analysis that takes into account some of the idiosyncrasies of human language.
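
A toy version makes the problem concrete. In the sketch below, the alias list, the event keywords and the "nearest mention wins" rule are all invented for illustration rather than taken from any vendor's method:

```python
# Toy entity detection and entity-to-event linking.  A commercial system
# maps ~140,000 entities with multi-word names, aliases and disambiguation
# logic, and assigns roles rather than relying on simple proximity.
ENTITY_ALIASES = {"acme": "ACME", "globex": "GLOBEX"}
EVENT_KEYWORDS = {"dividend": "dividend", "earthquake": "natural-disaster"}

def link_entities_to_events(text):
    tokens = text.lower().replace(",", " ").split()
    entities = [(i, eid) for i, tok in enumerate(tokens)
                for alias, eid in ENTITY_ALIASES.items() if tok == alias]
    events = [(i, ev) for i, tok in enumerate(tokens)
              for kw, ev in EVENT_KEYWORDS.items() if tok == kw]
    links = []
    for pos, event in events:
        if entities:
            # Attach each detected event to the nearest entity mention.
            _, nearest = min(entities, key=lambda p: abs(p[0] - pos))
            links.append((nearest, event))
    return links

print(link_entities_to_events("Acme declares dividend as earthquake hits Globex headquarters"))
# -> [('ACME', 'dividend'), ('GLOBEX', 'natural-disaster')]
```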

"If we look at our news analytics system, it's basically a Natural Language Processing system that uses a combination of statistical and linguistic methods," said Rich Brown, global head of quantitative and event-driven trading solutions at Thomson Reuters.

"What that basically does is count up from a statistical perspective how many negative words there are and how negative they are - or positive for example. 'Challenging management environment' might score a negative 2, 'exceeds expectations' might score plus 3, and you put these into context for the company within the article by looking at the proximity of words to one another, the modifiers and which words are describing the subject of the article."

That may sound straightforward enough, but then there is what Brown calls the "harder stuff", namely semantics. "So 'good' is a good word, 'terrible' is a bad word, 'terribly good' is a very good phrase," he said, which makes this kind of analysis that much trickier than the "bag-of-words" technique that others have used in the past.
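
A toy phrase-lexicon scorer shows the idea. The two phrase weights reuse the examples Brown quotes; everything else, including the crude intensifier rule standing in for the "harder" semantics, is an assumption made for illustration:

```python
# Toy lexicon-based sentiment scorer; not any vendor's actual method.
PHRASE_SCORES = {                       # multi-word phrases, matched first
    "challenging management environment": -2.0,
    "exceeds expectations": +3.0,
}
WORD_SCORES = {"good": +1.0, "terrible": -3.0}
INTENSIFIERS = {"terribly": 2.0, "very": 1.5}   # amplify the word that follows

def score(text):
    text = text.lower()
    total = 0.0
    for phrase, weight in PHRASE_SCORES.items():
        if phrase in text:
            total += weight
            text = text.replace(phrase, " ")
    tokens = text.replace(",", " ").replace(".", " ").split()
    for i, tok in enumerate(tokens):
        if tok in WORD_SCORES:
            weight = WORD_SCORES[tok]
            # "terribly good" scores as very good, not as bad.
            if i > 0 and tokens[i - 1] in INTENSIFIERS:
                weight *= INTENSIFIERS[tokens[i - 1]]
            total += weight
    return total

print(score("The quarterly report exceeds expectations"))  # 3.0
print(score("A terribly good set of results"))             # 2.0
```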

A bag-of-words analysis looked at whether an article as a whole was positive or negative, rather than which aspects within it were positive or negative concerning specific entities. "Some folks are still doing the bag-of-words technique and some claim that they're doing the entity-level specificity," Brown said.
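
The difference is easy to see in a sketch. Below, a document-level bag-of-words score mixes two companies' news together, while an entity-level score (approximated very crudely here by sentence co-occurrence) isolates the company of interest; the lexicon and company names are invented:

```python
# Bag-of-words versus entity-level scoring, in miniature.
LEXICON = {"exceeds expectations": +3.0, "terrible": -3.0}

def text_score(text):
    text = text.lower()
    return sum(w for phrase, w in LEXICON.items() if phrase in text)

def bag_of_words_score(article):
    # Whole article treated as one blob of words.
    return text_score(article)

def entity_level_score(article, entity):
    # Only sentences mentioning the entity contribute to its score.
    sentences = article.split(".")
    return sum(text_score(s) for s in sentences if entity.lower() in s.lower())

article = "Acme exceeds expectations. Rival Globex reports a terrible quarter."
print(bag_of_words_score(article))           # 0.0 -- the two companies cancel out
print(entity_level_score(article, "Acme"))   # 3.0 -- only Acme's sentence counts
```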

Having overcome all of that, there are more nuanced challenges for would-be meaning detectors. Ronen Feldman, author of some of the most widely cited academic work in the field of unstructured data, said huge amounts of progress have been made when it comes to solving the question of what happened and who did it. But machines have enormous difficulty in identifying sarcasm, he said. Brown of Thomson Reuters noted that this comes up increasingly as systems process social data, where sarcasm is often one of the chief currencies.

Peter Hafez

"What you run into problems with are things like profanity, sarcasm, emoticons, all-caps and multiple exclamation points. Those things can mean very different things to different readers, and even depend on who the author is," Brown said.

Sarcasm is particularly problematic with shorter text excerpts, he added, noting that longer stories or posts tended to be accompanied by factual-looking data and more context. In a larger document, the process will arrive at an overall indicator, so an incidence of sarcasm in the beginning may not affect the results if the rest of the post gives the real context. But in a 140-character tweet, there are rarely enough clues to help a system get the intended meaning.

In addition to working out the meaning of text in any given item, systems are also focused on measuring the volume and frequency of content (the supply) as well as the ways that people are searching for it and consuming it (the demand). A flurry of references to a company immediately tells a machine something. Analysis of the frequency of certain kinds of Google queries tells a machine something else.
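
On the supply side, one simple way to turn "a flurry of references" into a signal is to flag days when mention counts jump well above their recent average. The window, threshold and counts below are arbitrary illustrations:

```python
# Toy "news supply" signal: flag days when the number of items mentioning
# a company spikes relative to its recent history.
from statistics import mean, stdev

def volume_spikes(daily_counts, window=20, threshold=3.0):
    """Return indices of days whose mention count exceeds the trailing
    window's average by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

counts = [12, 9, 11, 10, 13, 8, 12, 11, 9, 10,
          11, 12, 10, 9, 13, 11, 10, 12, 9, 11, 55]
print(volume_spikes(counts))   # -> [20]: the flurry arrives on the last day
```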

Here is an area where Bloomberg believes it has an advantage, as its analytics capture both supply and demand. "We have a great vision into the flow of news and one of the analytics that is important is how that flow of news is changing on any given day," Rooney said.

Underlying what these companies do is an unshakable belief in the value of news. "It's interesting because, on the one hand, so much has changed so quickly, right? The shifting definition of real time, from seconds to sub-seconds to milliseconds to microseconds. The shifting requirement of the format, years ago from the broad tape to reading news over a terminal, to multicast data delivery, to derived sentiment signals via a news analytics platform," said Tim Haas, global product manager for algorithmic trading at Dow Jones.

"But the core of it hasn't changed at all," Haas said, offering a testimonial for the value of trusted editorial judgment. "That's the core of what we do, so if you want to talk about how we win, that's how we do it."

When Charles Dow and Edward Jones co-founded their publishing company in 1882, could they have conceived how their journalism would one day be processed by machines? It's highly doubtful. But the news service that bears their name is today one of the companies that are doing just that.