The Gateway to Algorithmic and Automated Trading

Voracious readers: The changing world of unstructured data

Published in Automated Trader Magazine Issue 27 Q4 2012

Reckon you're a fast reader? There's a reasonable chance your company has machines that have already read every single word in this article, long before you could lay eyes on it. If that's the case, you're currently in the minority among market participants - but you may not be for long. Signs point to a steady climb in the adoption of systems that allow vast amounts of news and social data to be consumed instantly. As the technology has advanced, the market has grown more comfortable embracing it. Adam Cox surveys the unstructured data landscape and asks what's in store.

In 1958, a German-born computer scientist for IBM wrote about his ideas for a tool that would make sense of incoming text based on word patterns. A little more than half a century later, the social media giant Twitter struck a deal to provide its entire archives to the US Library of Congress.

The two events are major milestones for the algorithmic trading community, points on a chart that depicts the market's remarkable progress towards being able to make sense of unthinkable amounts of information in the blink of an eye. The first event represented the starting point in a race to create text-reading machines; the second marked a development that would ultimately allow those machines to make use of billions of online conversations. Along the way, market participants, technologists and information providers have been engaged in a mad scramble to create systems that could consistently outwit the very best that human traders can achieve. The results of the past several years suggest the mad scramble is paying off.

Brian Rooney

Until recently, much of the focus has been on making sense of news

"News is the classic big data problem," said Brian Rooney, global business manager for Bloomberg core news products. "You've got this flood of incredibly valuable information that starts at least largely unstructured, and the great art and value is in structuring the content, to make it easier for both machines and for humans to make sense of, and to ultimately act on."

But increasingly, market participants are adding social data to their models and vendors are looking at better ways to handle it. Should quants be worried about Beyoncé and Justin Bieber? Probably not much. But for the next generation of quants, the Twitter deal could make all the difference when it comes to building robust, cutting-edge models.

A company in the middle of the Twitter library deal - a niche group called Gnip - entered into arrangements to pipe Twitter data wherever it was wanted. That meant that billions of tweets about virtually anything happening around the world could be archived, normalised and used for backtesting.

"They asked us to partner with them to help the Library of Congress manage that data," said Seth McGuire, director of business development at Gnip. "So we have the full Twitter historical corpus. We've spent a lot of time over the past year working with it to normalise it, clean it, manage it and create an infrastructure that allows use of it."

Social media firms, it turned out, had been focusing on immediate communication but had given much less thought to capturing, categorising and packaging all of the content they were generating.

Who is using all this data and what are they doing with it? The answers to those questions cast a spotlight on one of the most dynamic and fast-growing areas in the financial markets.

Market reaction from Bloomberg's Event-Driven Feed

The Users

"We certainly see a range of clients," said Tim Haas, global product manager for algorithmic trading at Dow Jones. "Some are still quite sceptical when you're talking about leveraging a sentiment signal in a model. One recently said to me, 'That sounds like a spectacular way to lose money quickly.' But that is quickly becoming the minority customer. We're seeing more and more who are, at a minimum, willing to evaluate and dip their toe in the water."

At the other end of the spectrum are what Haas called the "believers", the data-ravenous firms that are always on the lookout for non-traditional data sets no one else is using. "There are firms like that which are always looking for the next thing: 'Where can I find that edge? Where is the data set where I can build pull signals and include those factors in my models?'. Those are the more interesting conversations."

One of the companies that works with Dow Jones is RavenPack, which provides news analytics technology. RavenPack's director of quantitative research, Peter Hafez, described a broad array of users whose trading strategies involving machine-readable news can range from sub-second to multi-year.

"We're seeing a good mix, actually. We have the ultra-high-frequency guys, we have intraday. A lot of people believe that the sweet spot for this type of data is a week to a month. But we have found signals that last for a year out. And we even have had clients invested with a three-year horizon," Hafez said.

Hafez estimated that perhaps a quarter of his firm's clients were high-frequency users, while most of the customers fell into the one-week to one-month bucket. "But obviously we have clients that are very secretive about what they do, and some of them trade all horizons."

In Automated Trader's latest survey, which polled more than 500 people at a range of organisations, 8 percent of buy side firms said they were using news algorithms successfully and another 13 percent said they were doing so with mixed results. On the sell side, 11 percent were using them with success and another 6 percent had mixed results. So all told, about one-fifth of the market was already trading with news-reading algorithms. Many more firms were testing the systems or planning on using them in the next two years.

News is obviously just one part of the unstructured data universe, with social data increasingly being incorporated into models as well. A report by Aite Group, based on interviews with people at more than a dozen quantitative trading firms, found that more than a quarter of them were already running unstructured data in their strategies and another 36 percent were considering doing so.

Predictably, the early interest for using social data has come from larger, quantitative hedge funds. "They tended to be funds with over $1 billion AUM, the larger funds that had a definite quantitative slant. I'd say most of them algorithmic, some of them AI even," McGuire of Gnip said.

"They wanted to understand the volume of conversation relating to specific things. And that could be simply looking at the mentions of Apple or looking at the mentions of iPhone. Or it could be understanding volume in relation to, say, the Sudan and oil," he said.

"The conversation" is a term that McGuire and others in this industry use frequently. For Gnip - whose name was inspired by spelling the word "ping" backwards - that means everything being said anywhere online. It is the world's conversation with itself, and this technology allows anyone in the market to eavesdrop in real time.

McGuire said his clients typically won't go "open-curtain" when it comes to details about who they are and what they do. But he was able to paint a broad geographical picture. Most of Gnip's customers are in the United States, with several in the UK and with burgeoning interest coming from Asia, notably from Hong Kong and Tokyo.

"The US is certainly the primary client base for finance, and now we're moving from the buy side algorithmic traders to incorporation from the trade desks and sell sides at the larger banks. They're starting to say, 'Hey, this is a valuable news source, I need to incorporate it.'"

Seth McGuire

The Providers

The initial vision came from a man born in the 19th century. Hans Peter Luhn, a computer scientist who had fled war-torn Europe and ended up working for IBM in the United States, wrote an article called "A Business Intelligence System" about tools which could automatically abstract incoming text based on word patterns.

Over the decades, as computers became more powerful and versatile, the idea became more feasible, and by the 1990s academics and companies were engineering the early versions of the systems now being deployed. The idea also led to a cottage industry made up of firms ranging from smaller specialist technology vendors such as Gnip and RavenPack to massive information providers such as Bloomberg, Thomson Reuters and Dow Jones.

Those large news companies have gone down different routes to tap into this market. All of them argue they provide an edge and all of them have their fans. Bloomberg and Thomson Reuters both use their own technology and apply that to both their own content and thousands of other news streams. Dow Jones meanwhile offers a suite of products that can process the same content differently, each based on technology from different companies.

Early instances of machine-readable news from the big information vendors focused on providing real-time, field-based economic data that could be fed into algorithms to trigger sub-second trades.

But economic data releases merely involve delivering already-structured data in a structured format. The business of instantly reading millions of textual items and turning them into meaningful, structured data was still a way off. In 2008, Thomson Reuters started providing news sentiment indicators that aimed to do just that.

In the broadest of terms, there are two major types of news and social data analysis: sentiment analysis and fact extraction.

Sentiment analysis processes perform linguistic and statistical analysis on the words and phrases in one or more items of text. The analysis considers all the words that concern entities which have been identified and then assigns positive or negative scores. This can be aggregated with other analysis on the same entity to come up with an overall indicator. That in turn can then be viewed at a single point in time, or in a time series, to show how sentiment has changed.
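
To make the mechanics concrete, the sketch below shows, in a few lines of Python, the bare bones of such a pipeline: identify entities, score the surrounding words, then aggregate per-item scores into an overall indicator. The word lists, entities and scoring are purely illustrative assumptions, not any vendor's actual methodology.

```python
from collections import defaultdict

# Illustrative word lists and entities -- real engines use far richer linguistic models.
POSITIVE = {"record", "growth", "upgrade", "beat"}
NEGATIVE = {"lawsuit", "layoffs", "downgrade", "miss"}
ENTITIES = {"IBM", "Apple"}

def score_item(text):
    """Return {entity: score} for a single news item."""
    words = text.replace(",", " ").replace(".", " ").split()
    mentioned = {w for w in words if w in ENTITIES}
    score = sum(w.lower() in POSITIVE for w in words) - \
            sum(w.lower() in NEGATIVE for w in words)
    return {entity: score for entity in mentioned}

def aggregate(items):
    """Combine per-item scores into an overall indicator per entity."""
    totals = defaultdict(list)
    for text in items:
        for entity, score in score_item(text).items():
            totals[entity].append(score)
    return {e: sum(s) / len(s) for e, s in totals.items()}

headlines = [
    "IBM posts record growth, analysts see upgrade",
    "Apple faces lawsuit over patents",
]
print(aggregate(headlines))   # {'IBM': 3.0, 'Apple': -1.0}
```

Scores like these, computed item by item and bucketed over time, are what produce the sentiment time series that traders then feed into their models.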

A separate type of analysis involves fact extraction. This too may require some semantic analysis, but generally it relies on an easier method of simply identifying certain word patterns. In the case of fact extraction, machines read text looking for words and phrases and then, based on other text in the item, can determine what factual developments have occurred or have been announced. That allows funds to build algorithms that will automatically trade based on fact-based triggers.
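
Fact extraction can be sketched even more simply: scan the text for a known pattern, pull out structured fields, and hand the result to a trading rule. The headline pattern and the jobs-cut threshold below are hypothetical, for illustration only.

```python
import re

# Hypothetical headline pattern: "<Company> to cut <N> jobs"
LAYOFF_PATTERN = re.compile(r"(?P<company>[A-Z][\w&]*) to cut (?P<jobs>[\d,]+) jobs")

def extract_layoff(headline):
    """Return a structured event dict if the headline matches, else None."""
    match = LAYOFF_PATTERN.search(headline)
    if not match:
        return None
    return {
        "event": "LAYOFF",
        "company": match.group("company"),
        "jobs_cut": int(match.group("jobs").replace(",", "")),
    }

event = extract_layoff("Acme to cut 12,000 jobs amid restructuring")
if event and event["jobs_cut"] > 10_000:
    print("fact-based trigger:", event)   # placeholder for an automated trading response
```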

Rich Brown, global head of quantitative and event-driven trading solutions at Thomson Reuters, said his company works with Moreover Technologies, which has some 4.5 million social media feeds and about 54,000 internet news sites. All of those feeds then flow into the news sentiment products that Thomson Reuters began selling in 2008.

"So when you combine these various content sets and analyse them with a consistent methodology, it allows you to understand which signals are generated from internet news, from premium financial news such as Reuters for example, or even social media," Brown said.

Bloomberg is similarly active in the battle to bring order to the chaos of unstructured data. Rooney said Bloomberg aggregates somewhere in the neighbourhood of 100,000 different sources and has built up an archive of about 200 million articles. A key differentiator, he said, is in the proprietary technology the company developed.

"I don't think you can get to where you need to be by simply cobbling together a series of different vendors. Vendors can be very important, and they can be very useful for a specific tool that you want to have, but ultimately you need someone with the domain expertise in a particular field to really get the most out of structuring data," Rooney said.

As an example, he cited tools that summarise long pieces of content. A generic summarisation tool faced with digesting a 10-K filing might produce very different results from the one Bloomberg has developed, which focuses on extracting the information that matters to the markets.

Dow Jones took a different tack and focused on providing clients with technology choices.

"We realised, as we talked to more and more firms, that there was a broader universe of companies out there - hedge funds, and even banks - that didn't have the technical resources in-house to take our vast universe of unstructured content and turn it into structured data. So we reached out and started to work with a range of technology vendors who we feel have game, who are really strong and can do that piece of it," Haas of Dow Jones said.

In addition to working with RavenPack, Dow Jones also partners with Alexandria, Digital Trowel and SemLab. "We can talk to somebody and same day give them a cut of sample data so they can quickly evaluate how that might work in their model, in their methodology. That's been a huge change for us, to be able to give customers that choice and for them to be able to quickly evaluate."

Another player is Need to Know News, a specialist news provider that was acquired by Deutsche Börse in 2009. It has added a new twist to the machine-readable news game by offering a custom service covering specific macroeconomic news events via Deutsche Börse's algo news feed. Instead of trying to cover the universe, Deutsche Börse and Need to Know News have honed their focus on a select type of news event and on providing ultra-fast technology.

The company does not perform semantic analysis, opting instead to enter data and decisions into field-based systems. But it keeps open communication with its customers and polls them regularly on specific new fields to introduce, so that clients can build bespoke algorithms.

"One of the things that we do differently, with some of these macro events, is we'll often change some of the particular data points that we enter," said Clint Rhea, chief operating officer at Need to Know News. "So for example, if there's a specific portion on unemployment or on retail sales that a customer might be interested in, we can change that information."

The company can alter or augment the data streams it sends right up to the night before. The only criterion is that the information can't be open-ended. Deutsche Börse delivers the data straight to multiple co-location facilities around the world. Companies believe they can make money if they get the news just millionths of a second faster than others. "It is down to a microsecond game now at this point, at least in the United States," Rhea said. In Europe, he said it was still a millisecond game, although in parts of Asia the critical difference was viewed in seconds.

Thomson Reuters' BUBBLE-OMETER (Market Risk Index)

The Trades

As could be expected, the companies that are already using these feeds employ a variety of tactics. A key factor is the time horizon.

"When you look at longer horizons, you tend to aggregate information more," Hafez of RavenPack said, adding that traders can then aggregate data at a company, sector or market level and look at how the indicators interact.

"In the ultra-high-frequency space, you tend to be more event-driven," he said. In these cases, firms will programme machines to look for specific news events in certain markets - for example, layoffs or a CEO departure - and build algorithms that will trade a certain way in the event that the news occurs.

"These signals can be more or less sophisticated. We tend to talk about two types of events - what we call simple events and what we call complex events. Simple events are when you make decisions based on the face value of the data," the RavenPack executive said. "Here's a layoff event, you make a decision. Here's a negative earnings report, you make a decision."

Richard Brown

In the case of complex events, models do not look at news analytics in isolation. "They try to put it into context, bring in other types of data, for instance: 'Does it matter how layoffs recently have been interpreted by the market, or how the company has been portrayed in the news going into the layoff event?'" he said.
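
A rough Python sketch of the distinction Hafez draws might look like this, with a simple event acted on at face value and a complex event checked against recent context first. The event fields, context measures and thresholds are illustrative assumptions, not RavenPack's actual logic.

```python
def simple_event_decision(event):
    """Simple event: act on the face value of the data alone."""
    if event["type"] == "LAYOFF":
        return "SELL"
    if event["type"] == "EARNINGS" and event["surprise"] < 0:
        return "SELL"
    return None

def complex_event_decision(event, context):
    """Complex event: weigh the same data against the recent news context."""
    signal = simple_event_decision(event)
    if signal is None:
        return None
    # Hypothetical context checks: how similar events have been received lately,
    # and how the company was portrayed going into this one.
    if context["avg_return_after_similar_events"] > 0:
        return None   # the market has been shrugging these events off
    if context["pre_event_sentiment"] < -0.5:
        return None   # bad news was likely already priced in
    return signal

event = {"type": "LAYOFF", "company": "Acme"}
context = {"avg_return_after_similar_events": -0.02, "pre_event_sentiment": 0.1}
print(complex_event_decision(event, context))   # SELL under these assumptions
```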

Sentiment indicators tend to work best over a time horizon of a few days, according to Brown of Thomson Reuters. "The bulk of the value is in a week or less," he said, citing research by JP Morgan and Deutsche Bank, "but certain techniques enable signals to be utilised over several weeks or even months."

JP Morgan's global head of equity quant research, Marco Dion, published an 80-page report last year on his team's experiments with Thomson Reuters news feed products. The headline result was an eye-catching 95 percent annualised return and a maximum drawdown during the backtested period of 15 percent. But the numbers came with a big catch.

"Whilst this strategy looks extremely appealing we do realise that it is also extremely turnover intensive, and that this may therefore deter most quant managers," Dion and his team found. "We also appreciate that considering the level of turnover, presenting the results pre-commissions, slippage and transaction costs is not realistic, as transaction costs may erode much of the alpha displayed in a research/backtested environment."

Such a strategy would mainly suit firms that act as market-makers. For others, longer time horizons could make more sense. But the returns, while still positive, can be less spectacular. In a test of a week-based news trading strategy, JP Morgan achieved a 22 percent annualised return for the period backtested, with a maximum drawdown of 13 percent.
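
For readers less familiar with the metrics being quoted, annualised return and maximum drawdown can be computed from a backtested equity curve roughly as in the sketch below; the sample curve is invented for illustration and far shorter than any real backtest.

```python
def annualised_return(equity, periods_per_year=252):
    """Geometric annualised return from an equity curve sampled once per period."""
    total_return = equity[-1] / equity[0]
    years = (len(equity) - 1) / periods_per_year
    return total_return ** (1 / years) - 1

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# A toy daily equity curve -- far too short to be meaningful, purely to show the arithmetic.
curve = [100, 104, 101, 108, 97, 110, 115]
print(max_drawdown(curve))        # roughly 0.10, i.e. a 10 percent drawdown
print(annualised_return(curve))   # enormous on a six-day toy curve; real tests span years
```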

Analysis by Deutsche Bank conducted a year earlier reached similar conclusions about the value of news sentiment indicators. "Overall, we find that news sentiment, in conjunction with non-linear models, can generate alpha. Even better, we find this alpha is relatively uncorrelated with the more traditional quant factors. Of course, there is also a downside. The predictive ability of news sentiment is short-lived," Deutsche wrote. "The best results are obtained when forecasting only the next five days."

Finding alpha is just one of the ways people are using text-reading machines. They can also be used for more defensive purposes in automated systems.

"Some folks use it as a circuit breaker," Brown said. "So if you're buying IBM and news comes out, it just stops until a human can make a judgment as to its importance."

RavenPack Workflow

Another use, he said, was to avoid trading on something that appears to be news but may already have been discounted by the market. This is where aggregating many news sources as well as social data becomes valuable.

"The importance from understanding that is to figure out where you are in the information cycle," Brown said. "Do you really want to generate an automated trade on something that happens in one particular source? Possibly, but you may not want to do that if it showed up on the internet first from an internet news site or social media."

In other words, text-reading machines can act as a safety-valve. Or, conversely, they could allow traders to trade against headlines that their systems have already determined to be old news, Brown said. "You know, the old investor adage, 'By the time the idea makes it to the retail schmuck like me it's time to get out of it'? Really understanding where you are in that information cycle is extremely important. It's kind of the holy grail as you look across all of these sources. Is the short you have an informed one, and does the rest of the market know about it already?"
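
One crude way to approximate "where you are in the information cycle" is to record when and where a story was first seen and treat later sightings on slower sources as echoes. The source tiers, story keys and staleness window in the sketch below are assumptions for illustration, not how any vendor actually sequences sources.

```python
from datetime import datetime, timedelta

# Hypothetical source tiers: lower numbers tend to surface stories earlier in the cycle.
SOURCE_TIER = {"twitter": 0, "internet_news": 1, "premium_wire": 2}

first_seen = {}   # story_key -> (timestamp, source)

def is_fresh(story_key, source, now, max_age=timedelta(minutes=30)):
    """Return True if this sighting looks like new information rather than an echo."""
    if story_key not in first_seen:
        first_seen[story_key] = (now, source)
        return True
    seen_at, seen_source = first_seen[story_key]
    if SOURCE_TIER[seen_source] <= SOURCE_TIER[source]:
        return False                      # already circulating on an earlier-cycle source
    return now - seen_at <= max_age       # newer-cycle source, but only if the story is still young

now = datetime(2012, 10, 1, 14, 30)
print(is_fresh("acme-layoffs", "twitter", now))                               # True: first sighting
print(is_fresh("acme-layoffs", "premium_wire", now + timedelta(minutes=5)))   # False: an echo
```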

But one problem with widening the pool of information sources is that it can increase the amount of low-quality information being processed and make finding the genuine news that much more difficult, according to Hafez.

"There have been a lot of stories in the media where you say, well, Twitter broke it first. But one of the things these stories don't talk about is that it is kind of a needle in the haystack. How many times did it get it wrong? Were you actually able to filter on the data in a way that this could actually have been detecting it - with such high confidence that you would be willing to trade on it?" he said.

Another issue he noted was that the universe of companies discussed on Twitter is limited because they tend to be household names and brands - the Apples and Microsofts of the world.

That's where StockTwits comes in. Named by TIME as one of the 50 best websites of 2010, StockTwits is a social network for traders and investors. It works the same way Twitter does, just with a $ symbol in front of an equity ticker.
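
Pulling those cashtags out of a message is a straightforward pattern match; a minimal sketch, with an illustrative regular expression, might look like this.

```python
import re

CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")

def cashtags(message):
    """Return the equity tickers mentioned via $-prefixed cashtags."""
    return [ticker.upper() for ticker in CASHTAG.findall(message)]

print(cashtags("Adding to $AAPL here, trimming $msft into strength"))   # ['AAPL', 'MSFT']
```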

Gnip said the emergence of StockTwits was an important development for its own business.

"It became a very interesting second layer of data. It also was a good proof point for us because our clients would come back and say, 'Hey, Twitter has been valuable so far, what other data source can I use?' and that's where StockTwits has been really taking off," McGuire said.

Critical to all of these strategies is the ability to weight different sources of information in an unstructured data processing system, to slice and dice the streams of data and then to perform sophisticated analysis. That can mean looking only at news feeds from the big news vendors such as Bloomberg, Dow Jones and Thomson Reuters or from specialist firms such as Need to Know News, or it could mean training a system to look at only certain types of blogs or Twitter and StockTwits sources.

"Those who tend to automate tend to look at some subset of volume, some subset of sentiment, some subset of audience," McGuire said. "So again, getting to that point where you're using the metadata to narrow down the scope of what you're looking at."

Clint Rhea

The Future

What nearly everyone in the industry agrees on is that demand for these systems is on the rise.

"Right now it is at the very early front of its alpha period," McGuire at Gnip said. "So it is a very unique data source used by specific funds. We're starting to see that translate to a wider understanding and adoption."

Brown of Thomson Reuters sees new sources of text as an interesting area for development: "I think there is a lot of opportunity, a host of different data points to inject," he said. "If you look at earnings call transcripts and closed captioning from video feeds, there are tons of ways that you can really exploit this."

He also said that certain aspects of processing unstructured data could become standard, but there was little chance of full commoditisation in this area. He noted that the Thomson Reuters engine had some 90 fields. Add to that the number of other sources across the internet and then combine that with other factors in any given quant model: the result is infinite combinations and strategies.

RavenPack highlighted several trends for the future. Hafez expects demand for more specialised sources in multiple languages, more derived analytics products that cater for people outside the hard-core quant community, and constantly improving technology.

"It's a space where everything is improving all the time, so event detection will become better, entity detection will become better," Hafez said. He added that entity detection - an area where RavenPack believes it is particularly strong - is already at a pretty high level. "What you will be able to extract from the news will obviously continue to improve."

At the moment, the bulk of the systems are analysing English language content. But other languages are on the horizon. Alexandria, one of the companies that partners with Dow Jones, is already able to process Japanese and Chinese content. Gnip is language-agnostic since it is focused on holding and piping the data rather than analysing it.

Rooney of Bloomberg foresees a widening of the pool of users. Increasing demand is coming from the event-driven trading community, whether that is people looking for arbitrage opportunities based on new information or those putting the systems to more defensive uses to manage risk.

"Then also we're seeing rising demand from organisations that really are not specialised in quantitative solutions but who are realising that the human trading techniques they're using today are missing out on opportunities," he said. "As the machines are getting ahead of the humans, they need to be in this space as well. So we're seeing demand from organisations seeking to replicate strategies that previously worked with their human traders."

Perhaps the last word here belongs to one of the early pioneers of processing unstructured data. Ronen Feldman is credited with coming up with the term "text mining" in 1995. An academic who has published widely cited papers and books on the subject, Feldman is also a co-founder of Digital Trowel, one of the companies that works with Dow Jones.

His take on the potential for exploiting text-reading machines is simple: growth will be explosive. "I think it's just starting," he said.

Disclosure: The author worked as a journalist for both Bloomberg, from 1990-1992, and then Reuters (later Thomson Reuters), from 1992-2012.