The Gateway to Algorithmic and Automated Trading

Is ABC really as easy as 123? The world of NLP, according to StreamBase

Published in Automated Trader Magazine Issue 27 Q4 2012

Since this edition of Automated Trader is exploring the fast-evolving world of unstructured data, we decided to spend some time with complex event processing group StreamBase. Adam Cox asks their chief technology officer, Richard Tibbetts, how far the industry has come in adopting Natural Language Processing techniques in algorithms, and what the future holds.

Richard Tibbetts

Adam: Let's start by talking about some of the more interesting developments you've seen in the last few years in terms of tackling unstructured data as they relate to financial players.

Richard: First is the naïve or obvious exercise in using sentiment indicators to drive intraday and real-time trading. That was the initial noise in the market. But people have backed off from that, having realised that off-the-shelf sentiment technology is often not enough to drive real trading strategies, at least those simply based on news sentiment. My favourite example is that you have a news story about General Electric layoffs. Whether the tone of that story is positive or negative, it actually doesn't impact how the market's going to move. What you're actually looking for is something a bit more factual or a bit more specialised.

What we saw first - and people have continued to realise this over time - is that you need to be more sophisticated in how you process unstructured data than just looking at straight sentiment. One of the things we've seen is people adopting Natural Language Processing-type technologies and creating their own annotations on the news data. That has been an ongoing trend, and people are starting to realise that you need to create something with your own secret sauce to succeed.

Adam: In other words, so much has to do not just with how many hot and cold words it has and how it triggers an algorithm, but also what the market expectations are and the context you might not get from the data itself. Is that what you're talking about?

Richard: What we found is there's a whole spectrum of different trading strategies matching different goals. The simplest thing to understand and the most straightforward method is actually a lot like what a traditional human trader would do - acting on news. You see this most strikingly with news data coming from key financial indicators. If you're going to write a trading strategy based on a key financial indicator, you don't do it based on the sentiment of an article in the New York Times. You do it based on having set an expectation for the jobs numbers and then seeing what the actual numbers are, and then driving your trading strategy based on whether it met or exceeded your expectations.

Adam: With economic indicators, that's been done for years and that's not actually unstructured data. The main information vendors provide it even faster in structured form than in unstructured form.

Richard: Exactly, but it's a good example to understand. What's one of those trading strategies people are accomplishing with unstructured data? Well, actually, they're structuring it in that same way. They have an expectation of a CEO departure, or an expectation of layoffs, or expectations of a particular earnings number associated with a new product launch. Whatever the expectation is, you build a recogniser that is customised to look at news articles and pick out the specific quantitative information that you're expecting to see.

Adam: So there's a risk that Company X will fire its CEO. You believe it will have an impact on the share price and you set up your system to capture that. Are people building in comfort factors in terms of who the source is?

Richard: One of the things is, if you're going to build a recogniser for this sort of thing, you wouldn't want to build a completely generic recogniser for CEO departures and then run it on the whole web. What you'd actually do is build up a corpus of articles about executive departures from major news outlets that you happen to have in your data feed, and train your recogniser based on those. So you'd know how Bloomberg writes this and how Reuters writes this.

Adam: Are people having much success?

Richard: We have seen firms that are having success. The tricky thing with this is, it's not a fire-and-forget strategy. Everybody's always looking for the algo trading strategy that you just set up, let run and never have to babysit. Whereas this kind of strategy requires a human to constantly work on new hypotheses about executive departures or layoffs, or whatever the characteristic is, and then set up a real-time strategy based on those.

Adam: Is there an appetite for people who want these off-the-cuff ones, who every week do something different and are programming them almost on the fly?

Richard: Obviously, it tends to be a somewhat smaller fund profile. It tends to be an individual trading group that happens to want to work this way, that believes they have this kind of insight and enjoys working the markets this way.

Adam: Does this tend to interact with the HFT world?

Richard: You can actually push this kind of data into the HFT world, but what I'm describing here is not HFT. The speed of execution matters, but it's not like a rebate-recovery strategy. The reason you're automating is to achieve speed of execution, not to benefit from market-making activities.

Adam: In terms of the technology, what has changed in the last couple of years to improve performance, either from you or from the industry as a whole?

Richard: With the industry as a whole, it's been the emergence of platforms like StreamBase that make it easy to build applications, and more economical to build one-off or very simple strategies. Also the increased availability of Natural Language Processing tools and expertise, and the fact that computers are getting not only better at this but generically faster.

Adam: Those sorts of tools, do you have to be a programmer to use them?

Richard: You certainly have to be a programmer. But there's a spectrum of programmers, from someone who's most comfortable in Excel all the way down to someone who's writing tight C++ code to run a smart order router.

Adam: You were talking about how some people will be doing it based on a Bloomberg news feed or a Reuters news feed or what have you - it can be any piped-in news feed. What about other sources of information? Does that affect the way these tools are built and the way people design their programmes?

Richard: There's a range of things you see people doing. You start out with recognising something that's relatively quantitative out of traditional news feeds. You can then segue into something quantitative out of web-based data. There are a few fairly publicised situations where people are doing web-scraping of Amazon or other sites. For example, a few years ago people were web-scraping Amazon to predict the earnings numbers of Magellan, the GPS manufacturer, from its Amazon sales rank relative to competitors' GPS units - there's a bunch of maths there. Amazon exposes a certain amount of information about sales, and the conclusion on the part of this trading group was that it was enough information to get pretty predictive about how bad a holiday season Magellan was having.
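As a rough illustration of the kind of maths involved: empirical studies have modelled Amazon unit sales as a power law of sales rank, so competing ranks can be turned into an implied market share. The coefficients, ranks and product names below are invented for illustration, not Magellan's actual figures or the trading group's model.

```python
# Hypothetical sketch: inferring relative unit sales from Amazon sales ranks.
# Assumed power-law model: units_per_day ~ a * rank ** -b (a, b illustrative).

def estimated_units(rank: int, a: float = 5000.0, b: float = 0.9) -> float:
    """Rough daily unit sales implied by a sales rank under a power-law model."""
    if rank < 1:
        raise ValueError("sales rank starts at 1")
    return a * rank ** -b

def relative_share(ranks: dict) -> dict:
    """Convert competitors' sales ranks into an implied market share."""
    units = {sku: estimated_units(r) for sku, r in ranks.items()}
    total = sum(units.values())
    return {sku: u / total for sku, u in units.items()}

# Invented ranks: a lower rank means more sales.
shares = relative_share({"BrandA": 480, "BrandB": 55, "BrandC": 120})
```

Tracked over a holiday season and compared against the same calculation for competitors, a deteriorating implied share is the kind of signal the group reportedly traded on.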

Adam: Did it work?

Richard: Oh yeah, absolutely. This was reported to me as a successful exercise. There's another example, which is even less structured and has nothing to do with high frequency. People were looking at iPhone serial numbers at the Apple Store. That has nothing to do with computers at all. That's just a case of walking down to the 5th Avenue Apple store and seeing how much bigger the number was than it was last week.

That's where you start to blend into potentially tradable information that you can get from the broader mass of Internet users or social media. For example, if you identify public companies where consumer sentiment is particularly relevant, that's where being able to pull a generic sentiment number out of data you can get from Gnip becomes much more compelling. It's tradable not because people are saying mean things about the stock ticker, but because people are having a positive or negative experience with the actual product. On the other hand that's not a high frequency scenario either. That's just another form of market research.

The place where we've seen the most effective use of high frequency data has really been around a negative signal. I call these news-based circuit breakers. Let's say I am doing a high-frequency, rebate-oriented strategy where I'm market-making in a particular security. The example event I use for this is when the news broke on Twitter that the Micron CEO was killed in an airplane crash. This is a very unfortunate case, of course. If Micron is a security that you're market-making in, just the fact that there's a flurry of news information, without having to decide if it's positive or negative, that flurry of information could be enough for you to decide to pull out or widen your spread or expect higher volatility. It's not necessarily being able to predict price movement, so much as being able to predict volatility, or to just predict the lack of certainty.

You may want to pull liquidity, or you may just want to charge more for liquidity and widen your spread. If you pull out of the market completely you may miss an opportunity, so there's a risk-reward exercise there. But the general idea is, you don't need to know if it's good news or bad news, just that there is more than normal amounts of news out there.
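A minimal sketch of such a news-based circuit breaker, assuming an illustrative "normal" message rate and trip multiple (the class name, window and thresholds are invented for this sketch, not StreamBase parameters):

```python
from collections import deque

class NewsFlurryBreaker:
    """Sliding-window counter of news items for one symbol.

    Trips when volume in the window exceeds a multiple of the normal
    rate - no sentiment analysis, just 'more news than usual'.
    """

    def __init__(self, window_s: float = 60.0, normal_rate: float = 0.5,
                 trip_multiple: float = 5.0):
        self.window_s = window_s
        # e.g. 0.5 msg/s * 60 s * 5x = trip above 150 messages per minute
        self.threshold = normal_rate * window_s * trip_multiple
        self.timestamps = deque()

    def on_message(self, now: float) -> bool:
        """Record one news item; return True if the breaker has tripped."""
        self.timestamps.append(now)
        while self.timestamps and now - self.timestamps[0] > self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold

breaker = NewsFlurryBreaker()
# Simulate a burst: 200 items in 20 seconds - far above the normal rate.
tripped = any(breaker.on_message(i * 0.1) for i in range(200))
```

On a trip, the market-making strategy would widen its quotes or pull them entirely, as described above; the breaker says nothing about direction.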

Adam: Once you widen your information pool to include so much non-verifiable information, where there is no control or formal ability to establish a correction to incorrect reports, how does the industry deal with that?

Richard: Most of the opportunities I outlined have worked around and avoided that scenario, because it is quite challenging to deal with a major false positive. It really depends on what your time horizon is on trading. For example, if you're planning to short the South Korean stock exchange whenever North Korea gets shaky, you may not care whether it's valid or invalid excitement in the Twittersphere about North Korea. The strategy is seldom: 'People are going to make a lot of noise on Twitter and then there will be a fundamental move in the stock price'. It's much more likely to be: 'People are going to make a lot of noise on Twitter and that will increase the volatility', or: 'People are going to make a lot of noise on Twitter and that's a leading indicator of some more traditional number that I can then be predicting and trading on'.

Google is sitting on a fabulous data set. They can tell you four weeks ahead of time what the unemployment numbers are going to look like. The way everybody goes to file for unemployment is they search on Google for "How do I file for unemployment?" The number of people who go online and search for that is a pretty good indicator of how many people are becoming unemployed. Google has actually been pretty public about how this exists and works.

That's an example of where, if you had this dataset, you would have the number. Otherwise you have to look at more oblique ways of getting the number, which you can certainly do, whether it's through sponsored links or through looking for people complaining about their unemployed status on Twitter - that sort of thing. Obviously, those who have the most predictive and tradable numbers for that sort of thing are very much not forthcoming.
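One way to sanity-check a candidate leading indicator of this kind is a lagged correlation: shift the search-volume series against the official series and see which lag fits best. The numbers below are entirely synthetic (the "claims" series is just the search series shifted by two periods), not real Google or unemployment data.

```python
# Illustrative lagged-correlation check for a leading indicator.

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def lagged_corr(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag]."""
    return pearson(leader[:-lag], follower[lag:])

searches = [10, 12, 15, 22, 30, 28, 25, 20, 18, 17]        # synthetic index
claims = [11, 10, 10, 12, 15, 22, 30, 28, 25, 20]          # searches, 2 periods later

best_lag = max(range(1, 5), key=lambda k: lagged_corr(searches, claims, k))
```

With real data the fit would of course be noisier, but the best-fitting lag is exactly the "four weeks ahead" figure Richard describes.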

Adam: So where is StreamBase putting its focus?

Richard: Our core business for the StreamBase CEP platform is about making it easy for people to build real-time applications - specifically in capital markets real-time trading strategies for anyone who needs to analyse data and make an automated decision based on timely information, whether that's structured or unstructured data. That includes everything from trading strategies to risk management and fraud detection.

Where we are today is getting people access to more data and getting people access to more analytics. Our platform is really just the substrate, the plumbing for moving from raw data to analysed data to tradable data to actually making the trade. We also integrate with statistical computing packages like R and MATLAB, and for Natural Language Processing analytics we use LingPipe. Our goal is to integrate with all the data sources our customers find valuable.

Adam: Let's talk about what the future holds in terms of challenges. Presumably, one big issue is simply the raw amount of data compared with structured data. What do you see changing in that area?

Richard: First of all, the volume of data is going up, but the availability of computation is also going up, and platforms like StreamBase make it easy to scale the computation across multiple machines.

Adam: Which one is outpacing the other?

Richard: Neither is going to run away from the other. I think you'll see your costs of processing go up, but it's not like you won't be able to get the processing power. You're going to have to write a slightly bigger cheque, but not prohibitively so. Twitter still peaks at under 25,000 messages per second. It is a lot of data, but it's not a soul-crushing amount. From the perspective of people who've worked with traditional data feeds, options data certainly involves a lot more updates per second.

Now, as for the cost of chewing through each of those pieces of unstructured data, it is higher than for processing a piece of structured data. But that's something you can certainly deal with. So computational complexity is not a major impediment; it's just a fact of life.

What you're seeing is continuous improvement in the ability to analyse this sort of data. One thing I think you will see is the 'mainstream-ification' of different kinds of Natural Language Processing. Today, the average quantitative statistics expert really has a relatively superficial understanding of Natural Language Processing. People who are in school today see this as a key area that they need to focus on. There's a lot more interest, a lot more work being done, and a lot more people getting exposed to this technology in school or after school. It's becoming more mainstream.

At the same time, if I were going to make one prediction, I think you'll see more derived data products built around these sources. Even though, as I said, in order to really make profits you need to tailor the feeds, tailor your analysis to a particular algo, I think there is an appetite for getting this data, even if it's only a check on the positions you've already made elsewhere.

I think you'll see people producing more data products around this, that pre-analyse it and pre-can it.

Adam: For re-selling it? Because wouldn't the value of this data decrease as more people have it and it becomes less unique?

Richard: You have firms that are only ever going to want their own custom analysis, their own custom data. That's just true, and they're certainly here to stay and you'll see more of that over time. But if you're a traditional asset manager, a mid-sized asset manager, you're unlikely to have the capacity to do a lot of custom analysis in this way - not the computational capacity but the quantitative, intellectual capacity. So they're going to look for off-the-shelf, even if the value is notionally eroded. If you're doing that to check on a trading strategy, then maybe the off-the-shelf isn't such a problem.

Adam: I guess it's a bit like a buyside firm using a broker algorithm. Yes, it's already out there, yes they're not getting an advantage over a sizeable portion of the market - but they're keeping pace with another sizeable portion of the market.

Richard: Exactly. Or, frankly, using broker research. Other people are reading this research but it would be ridiculous of me to not read it because of that fact.

Adam: In terms of Natural Language Processing, is the goal here to read text almost as the human mind can?


Richard: That's the goal but that's far from the reality. An important thing to realise is NLP systems today almost never look for understanding. It would be false to think of them as understanding the data. For example, if you're going to use the NLP technology to build a recogniser for layoffs or for CEO departures, what you'd actually be building is a machine that knows how to look at news articles, break them up by word or letter - you actually do some funky tricks there - and then notice that the pattern of words in this one is very similar to the pattern of words in other articles that I have been told represent CEO departures. So that's a categoriser. Then you might do some sort of entity extraction, which is to say: 'Now that I've decided this is probably about a CEO departure, I should go and find a sentence that looks much like the sentences I've seen before, and then know that the second phrase in that sentence - the capitalised noun in that sentence - is going to be the CEO's name.'
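A toy version of that categorise-then-extract pipeline might look like the sketch below: word-overlap similarity against labelled examples stands in for a trained categoriser, and a crude capitalised-phrase pattern stands in for real entity extraction. The example headlines, threshold and regex are all invented; a tool like LingPipe would supply proper statistical models for both stages.

```python
import re
from collections import Counter

# Tiny labelled corpus standing in for real training articles (invented).
DEPARTURE_EXAMPLES = [
    "chief executive steps down after board pressure",
    "company announces departure of its ceo effective immediately",
    "ceo resigns amid restructuring",
]

def tokens(text):
    """Bag-of-words representation: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def looks_like_departure(article, threshold=0.15):
    """Stage 1, categoriser: cosine similarity of word counts vs examples."""
    t = tokens(article)
    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm = (sum(v * v for v in a.values()) ** 0.5 *
                sum(v * v for v in b.values()) ** 0.5)
        return dot / norm if norm else 0.0
    return max(cosine(t, tokens(ex)) for ex in DEPARTURE_EXAMPLES) > threshold

def extract_name(article):
    """Stage 2, crude entity extraction: first capitalised two-word phrase
    occurring before a departure verb in the same sentence."""
    m = re.search(r"([A-Z][a-z]+ [A-Z][a-z]+)[^.]*\b(resign|depart|step)",
                  article)
    return m.group(1) if m else None

headline = "John Smith resigns as CEO of Acme Corp"
```

Note that neither stage "understands" anything - the categoriser only measures word overlap, and the extractor only matches a surface pattern, which is exactly the limitation Richard is describing.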

Adam: The mind obviously would never do anything like that. Are there quantum leaps that we're hoping to make in terms of doing that much more efficiently, or in a much better way?

Richard: There are a lot of people getting PhDs in this and working on those problems. They're very hard problems. Understanding English without having wandered around as a human for several years is quite challenging! People make all sorts of implicit assumptions, there's all sorts of scattered information. Things are completely unclear.

If you try to understand the entirety of a news article, they're going to say something like: 'It was a Sputnik moment', and you're screwed. Or they're going to say something less subtle. For a place where people are doing some work on this, you can look at IBM's Watson playing Jeopardy. The interesting thing with watching it play Jeopardy is, it gets a lot of questions right, but it also gets a lot of really easy questions wrong - and it gets them wrong in completely ridiculous ways.

Adam: Is that something you keep an eye on, in terms of what this next generation of PhDs is doing? Or is that getting into another branch that is not really that important at this stage for the markets?

Richard: At this point, the market still hasn't digested what is relatively well-established technology when it comes to Natural Language Processing. So I certainly pay attention, but not because I think the next discovery will be the one that makes it for the market. It's much more about watching them slowly adopt things that are pretty straightforward and really figure out how to apply those to their own problems, because the people who are working on their PhDs at Carnegie Mellon are not focused on tradability of the data.