"Before I got into this, I didn't realise how long I would spend preparing my data versus how long I would spend doing the cool algorithm part."
That's what one client told Tanya Morton, an application engineering manager at MathWorks. Morton said that it's for that very reason -- because data cleansing can be so time-consuming and require so much fiddling -- that many companies should probably be thinking about investing more to improve their processes. In other words, spending a lot of time and energy on data cleaning quickly gets costly, so an investment upfront can save money later.
The causes of faulty data are multiple, from having a hodgepodge of different systems to relying on the wrong tools for data warehousing to business culture factors. The methods for ensuring you do have clean data are similarly numerous. Technology has made a difference in allowing you to spot rogue or missing data, but the view from data specialists is that if you want clean data there ultimately is no way to avoid wading in and getting your hands dirty.
"People think that with all the high technology available now, how difficult can it be to clean the data? And the answer is extremely," said Simon Garland, chief strategist at Kx Systems. "And it's not what people want to hear. It's laborious and of course it's expensive. They keep hoping there's some shortcut."
Whatever the causes and whatever the solutions, consultants and data experts agree that the impact from not putting in sufficient effort to clean your data can be substantial. Whether it's for back-testing a new model, transaction cost analysis or satisfying compliance and regulatory requirements, data cleanliness is paramount and insufficient attention to that need can cost you money.
Where the problems start
A common cause of problems is how a firm uses its order management system (OMS).
Many firms -- when doing transaction cost analysis (TCA), reporting or other analysis -- will pull the data from the OMS. But such systems were not designed for data warehousing and data extraction. They were designed for managing orders, in other words for trading, not for reporting or analysis.
"Data cleansing is largely necessary because of the systems that we use," said Jeffrey Alexander, co-head of equity execution consulting and analytics at Barclays.
Alexander offered a simple example: a buy-side shop that uses only an order management system. "That system is used to execute orders. Very little thought was put in, when they designed it, as to what happens after the fact: the retrieval and the reporting."
Once there is a cancel of an order, a correction, or more commonly, a merge or a split, it can wreak havoc, Alexander said. "Your trail gets messy very, very quickly. If you go through some of the various order management systems and you talk to some of the buy-side people and you say, 'What's your original order size?' a lot of them can't answer the question."
He added: "There is not a single buy-side order management system that is well-equipped to facilitate accurate reporting if you just do a standard plain vanilla data pull."
Alexander and the other co-head of his group, Linda Giordano, work with the British bank's trading clients to help them improve their systems and operations. He looks at data cleansing as an issue that can only be discussed in a larger framework. "It's not even the data cleansing, it's the data architecture," he said.
"If you want to get the right answers, you've got to do a lot more work than to sit there and say I'm going to do a generic pull and I'm going to plot out my orders table and I'm going to pull out my execution table and pull out my allocation table, because that's just going to give you a final state on everything and it's not going to answer the questions that you want."
He described one US client that was using a vendor to help it understand how its portfolio managers were performing. The client asked Alexander to look at a feed of this data to get a feel for how the firm operated and how it was performing. It turned out that nearly a third of the firm's allocations were unaccounted for in the feed, and that this had been going on for years.
His colleague Giordano said the problem in this case had to do with a unique feature in the feed that the client was using and a bespoke database the customer had built. "The cookie-cutter feed didn't pick up the uniqueness of the database and because of that, 30% of the allocations were dropped," she said. In this case, the impact would be management would have incorrect information on how well the portfolio managers were performing, a key factor for deciding salaries and bonuses.
"People don't realise that there are more reasons why data can be dirty than just somebody mistyping," Garland of Kx said.
Sometimes, there may be no glitch or mistake whatsoever and a firm can end up with large amounts of dirty data without realising it. For instance, Garland noted that the German market can switch to auction mode if prices become too volatile, but if your system doesn't make this distinction you could be recording indicative prices as actual prices. "So you'd better know that you've switched to auction mode. You're getting dirty data but for a weird reason. Everything's dirty but it looks okay."
Erik DiGiacomo, head of Virtusa's financial services consulting, said the sheer amount of data firms need to process, and the multitude of sources, made dirty data even more of an issue. "It's pervasive, it's everywhere, it's being tracked and stored in a variety of different formats and a variety of different points in time," he said.
The abundance of data sources was also an issue raised by Steve Wilcockson, industry manager for the financial services division at MathWorks, who works with Morton.
"We're pulling financial data from multiple sources: market data, reference data, in some cases text sentiment-type data, maybe. And we want to import and massage it consistently into a central data environment, which becomes a reliable multi-purpose warehouse hub. For instance, ensuring holiday extrapolation onto, say, returns data so that bank holidays are accounted for consistently."
The multitude of systems came up time and again in conversations with specialists.
"On the sell side you have issues with a lot of legacy systems," Giordano said. "I worked at a bunch of sell side firms, and it's always been a challenge, just because you have so many different areas and each area tends to have their own databases and there's always a project in combining all these databases and getting the data put together."
There may be some systems that work extremely well but there could also be legacy systems that are always in the process of being re-built and re-vamped. She added: "It gets very complex monitoring the trail of data, following a trade from when a client sends it in to where it goes all throughout the firm because it's making so many stops along the way."
Another scenario that Garland cited was when exchanges made changes to their feeds.
"Often people aren't following those changes immediately. If you've got, let's say, 20 years of tick-by-tick history and then an exchange adds an extra indicator, are you going to go back and re-write 20 years of history with a null entry in those fields? It's a big deal when an exchange adds another field. For them, in the feed itself, it's comparatively trivial: only the feed handler needs to be updated," he said.
"The easiest thing to do is just ignore that new field, which is fine -- until it isn't," he added.
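Garland's scenario, an exchange adding a field mid-history, is essentially a schema-evolution problem. One defensive approach can be sketched in a few lines: map each incoming record onto the current schema and null-fill any field the record predates, so old history stays readable without a 20-year rewrite. The schema and field names here are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical tick schema; "size" stands in for a field the exchange
# added after our stored history began.
SCHEMA = ["time", "price", "size"]

def parse_tick(fields, schema=SCHEMA):
    """Map a delimited feed record onto the current schema. Records that
    predate a newly added field come back with None in that slot rather
    than shifting columns or forcing a rewrite of the stored history."""
    record = dict(zip(schema, fields))
    for name in schema[len(fields):]:
        record[name] = None              # old record predates this field
    return record

old = parse_tick(["09:30:00", "101.2"])           # pre-change record
new = parse_tick(["09:30:01", "101.3", "500"])
print(old["size"], new["size"])                   # None 500
```

The point is that "ignoring the new field" becomes an explicit, queryable null rather than a silent column shift.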
To make matters worse, even if a firm has made the necessary investments and has spotted incorrect data as it comes in, there's no guarantee that an exchange will do likewise. That throws up a conceptual issue.
"Let's say somebody mistyped something. If you're going to clean that, you're assuming that the exchange will also pick that up and clean it up in the reference version," Garland said.
"But if something's wrong and yet the exchange doesn't correct it, what does that mean about that data? Is the fact that the exchange says this is the gospel according to the XYZ Exchange, does that make it true? For trading it sort of does," he added. "You have to trade off what is, not what you think it should be."
Arkady Maydanchik, a data specialist and director at eLearningCurve, is the author of a book on data quality which includes a chapter devoted to the various causes of bad data. Maydanchik came up with 13 broad causes of bad data, from initial data conversions to system consolidations. Ironically, one of the causes is data cleansing itself.
"In the old days, cleansing was done manually and was rather safe. The new methodologies have arrived that use automated data cleansing rules to make corrections en masse," Maydanchik writes, adding that he is an ardent promoter of automated rule-driven data cleansing. "Unfortunately, the risks and complexities of automated data cleansing are rarely well understood."
Maydanchik says data cleansing is dangerous mainly because data quality problems are usually complex and interrelated.
Wilcockson noted another important source of problems: derived data. As an example, he noted that different firms have different conventions for calculating yield curves and this needs to be accounted for, particularly when you're taking data from multiple feeds.
Given all these issues, from hardware to data volume to all manner of externalities, how can a firm ever hope to get its data in reasonable shape?
Morton of MathWorks said one consideration at the data cleansing stage was to quickly identify outliers.
"You're not necessarily trying to understand the complexity of the underlying trend. You want to start off with a fairly simple model that you want to fit to it, because the outliers should really stick out like a sore thumb," she said.
For instance, one way is to start with a simple clustering algorithm such as K-means. "That basically picks out key clusters, and then if you get any data points that are a long way from those clusters, then that data point you might want to go and have a look at," Morton said.
"Similarly from a regression point of view, where you've got an output you're trying to predict as a function of its inputs, I would start out there with a pretty straightforward regression, like a linear regression, and then spot the points that are very far away from that model, which would then identify the clear outliers as a starter. And once you've got rid of those, you can move on to your more complex methods."
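The regression variant of Morton's screen can be sketched in a few lines of plain numpy. The synthetic data, the three-sigma threshold and the function name are illustrative assumptions, not her actual tooling: fit a simple model first, then let the residuals make the outliers "stick out like a sore thumb".

```python
import numpy as np

def flag_outliers(x, y, z_thresh=3.0):
    """Fit a straight line and flag points whose residual lies more than
    z_thresh standard deviations from the fit: the 'sore thumb' test."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    z = np.abs(residuals) / residuals.std()
    return z > z_thresh

# Synthetic data: a clean linear trend with one corrupted point at index 40.
rng = np.random.default_rng(0)
x = np.arange(100.0)
y = 2.0 * x + rng.normal(0, 1, 100)
y[40] += 50.0                             # a fat-finger style error
print(np.where(flag_outliers(x, y))[0])   # the corrupted index stands out
```

The same residual idea carries over to the clustering approach: score each point by its distance to the nearest K-means centroid instead of to the fitted line.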
But a word of warning: people need to be careful not to throw away too much data.
First of all, the data itself can in some cases be costly, and outliers identified from regression analysis can be contentious. "If you're in a case where each data point is expensive to calculate or collect in some way, which we sometimes have in some circumstances, then you really do want to keep as much data as possible," she said.
Even if that means holding onto bad data?
"Is it bad or not? That's where you bring in the statistical knowledge and it becomes a bit more of an art to debate. So was there a problem that caused that data point or is it actually a random variation that's within the reasonable balance you might expect? If you throw out all the unusual data points, there's the danger that you're just throwing away the data that's not convenient, so that can be a source of some good debates."
Morton added: "The danger is if you throw away too much of the bad data you're throwing away some of the signal as well. So getting rid of outliers is something to do carefully and probably with multiple models."
In fact, using multiple models to spot outliers is one of the techniques Morton advises, particularly for checking that unusual data isn't in fact strange but true. "One thing I've found where it's useful to fit multiple methods is where you have some genuine outliers, some genuinely spurious results. Then you tend to find that multiple models will point to that as being a spurious result, so that's one way you get a confirmation factor," she said.
Another issue is missing data. The problem here is that one doesn't know what's missing. For instance, there could be a data file and suddenly there is a glitch. On a spreadsheet the resulting gap may be visible but as one deals with larger and larger datasets, it is not necessarily practical to try to visually scan everything to spot holes.
Also, if the gap is very large, extrapolating between points can be, as Morton said, "scary".
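Gaps of the kind Morton describes need not be found by eye: a mechanical scan of consecutive timestamps will surface them at any data size. A minimal sketch, with illustrative tick times and an assumed five-minute tolerance:

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive timestamps are further
    apart than max_gap: candidate holes in the feed."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]

# Illustrative tick times with a 42-minute hole after 09:03.
ticks = [datetime(2024, 1, 2, 9, 0) + timedelta(minutes=m)
         for m in (0, 1, 2, 3, 45, 46, 47)]
for start, end in find_gaps(ticks):
    print(f"gap from {start:%H:%M} to {end:%H:%M}")
```

The tolerance would in practice depend on the instrument's normal trading frequency; what counts as a hole in a liquid future is routine silence in an illiquid bond.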
Morton and Wilcockson both suggested that ultimately data cleansing becomes an art rather than a science. For instance, Wilcockson said it was important to consider the context of the data. "There might be different use cases. So for instance, you might have a stock that trades only occasionally: how do you extrapolate to be consistent at every time point? Perhaps a multi-asset portfolio combining sporadic derivative cash flows with equity tick updates, perhaps also incorporating IPOs where your stock does not conform to the complete time horizon of your universe. You have to deal with that in your own way, according to your preferred time-points for analysis: intraday, daily, weekly, monthly or whenever. This approach to 'missing data' differs from a technical issue, say your feed stopping due to a power cut."
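One common way to put a sporadically traded instrument onto a regular analysis grid is to carry the last observed price forward, leaving the pre-listing period empty. This is a generic sketch of that idea, not Wilcockson's actual approach; the data and function are hypothetical:

```python
def forward_fill(obs, grid):
    """obs maps trade time -> price for a sporadically traded instrument;
    grid is the sorted sequence of analysis time-points. Carry the last
    observed price forward; None before the first trade (e.g. pre-IPO)."""
    times = sorted(obs)
    out, i, last = {}, 0, None
    for t in grid:
        while i < len(times) and times[i] <= t:
            last = obs[times[i]]
            i += 1
        out[t] = last
    return out

prices = {2: 10.0, 5: 10.4}      # trades at t=2 and t=5 only
filled = forward_fill(prices, range(8))
print(filled)
```

Whether carrying a stale price forward is acceptable is exactly the kind of context-dependent judgment Wilcockson is pointing at; for some use cases an explicit None is the more honest answer.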
Garland said that clever algorithms that make it easier to identify dodgy data will only take you so far. Sometimes a person's gut feel needs to come into play.
"There is still a need for common sense. There have to be knowledgeable users. That's why banks have whole teams of market data experts who will be looking at the data and saying there's something strange here," he said.
"It's not that these people are just sitting there quietly receiving good data from the exchanges. It's a two-way thing. It's people who really know the data and the markets inside out pointing out things that can't be or that look very odd."
The Barclays consultants agreed that having someone on the team be the data guru makes a big difference.
"We do think that buy side firms that have invested in having a guy on the desk who is either a tech or a project manager, that they're worth their weight in gold," Alexander said.
Giordano added: "The ones who have even a part-time trading/operations person, whose job is to be the data geek, the data shark, they tend to fare much better than firms who have no one and rely on their vendor to be the expert. The vendor's going to spend their one day every couple of months there and go away. They're not going to know the ins and outs and protect you."
She said that at some firms this is usually the disgruntled guy who gets angry about the vendors. "He's the guy who's going to know every field of the data and can rattle them off in his head, and he's the expert. And whenever they bring in a new vendor, he's the one who's going to get to know them very well. He's going to be their best friend and their worst enemy. But that's the guy that you need to have. If you're serious about doing any kind of analysis and analytics, you need someone who is master of the data."
Still, even if a firm has invested in good analytics tools and has staff who know all the idiosyncrasies of the data they handle, that may not be enough. For DiGiacomo of Virtusa, it all comes down to discipline.
"Folks always talk about the strategy," said DiGiacomo of Virtusa. "Strategy is the easy part. It's the discipline and sticking with it that's the hard part."
Companies need to not just formulate a methodology for cleaning the data but also go back and review that methodology on a regular basis, he said.
"You have to have a discipline to go back and test your decisions or what you'll find is over time, one (price) source has deteriorated or the market has moved or changed, and you don't pick up that idiosyncrasy in the marketplace as well. That's what we see in the cleansing and scrubbing space, that folks set the rules on day one, and never come back to it."
DiGiacomo used the idea of trying to lose weight to illustrate the issue.
"The best way to lose weight is to eat healthy and exercise regularly. That's not a complicated strategy. Doing it every day? Impossible for me," he said. "The same thing happens on the data-side. You build this methodology, and then what happens? A market opportunity pops up."
Wilcockson identified another key workplace culture factor in achieving good data hygiene: teamwork.
"Where small teams think holistically - about the meaning of missing, extrapolated and consistent data - it can work well. In some instances, market data teams can appear resistant to other people's opinions, and likewise 'experts' can be intimidating or overly dogmatic towards market data teams. Where barriers of communication and personality are overcome, cross-functional benefits can emerge."
Also key is picking good data providers, particularly when dealing with real-time data.
Alexei Chekhlov, head of research at Systematic Alpha Management, a fund based in New York City (see separate interview on page 44), draws a distinction between dealing with historical data and real-time data.
"Historical is surely easier to do because some of the data providers do it for you and you are able to compare data from various sources, so that is relatively easy to implement. The real time data is probably the major source of technological risk, infrastructure risk and there you have to build something that will shield a robot from taking erroneous trades," he said.
He gave one example in which a primary data provider sent a British pound quote with an erroneous timestamp that ran ahead of Systematic Alpha's clock and could have triggered the wrong trade.
"This happens almost never, but in the past we had some crazy things happen. And they happen less with good providers and more frequently with not-so-good providers. And as you well know, the cost of infrastructure with these data providers has been just going up and up and up consistently," Chekhlov said.
He brought up the same issue that Morton of MathWorks did: the cost of data can be so high that it's worth the extra investment to ensure it's clean.
"The usage of this is a very key cost, so it has to be made worthwhile. You have to use it to your advantage," Chekhlov said.
He added that redundancy was very important, as were statistical filters on the quotes based on the properties of a particular market. For instance, attention to night versus day sessions can be critical in some markets because order volumes drop significantly at night.
"So all of these issues have to be thought of as a practical matter," Chekhlov said. "Both the algorithms need to be adapted to this and the traders who are ultimately watching over these robots' algorithms need to be fully aware of these things and need to be proactive, because despite all these filters, you cannot foresee everything obviously."
He added: "We've been trading almost all the most liquid futures markets for many years, and we've stumbled into many awkward situations that we can tell stories about. Having this machinery is definitely very worthwhile."
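Chekhlov does not spell out his filters, but the idea of a statistical quote filter tuned to session characteristics can be sketched simply: reject a quote that strays too far from the median of recently accepted prices, with a wider band at night when thin volume makes larger jumps plausible. The thresholds, window and function names here are hypothetical:

```python
from statistics import median

def accept_quote(price, recent, is_night, day_tol=0.005, night_tol=0.02):
    """Accept a quote only if it lies within a tolerance band around the
    median of recently accepted prices. Night sessions get a wider band
    because thin volume makes larger jumps plausible."""
    if len(recent) < 5:
        return True                      # too little history to judge
    tol = night_tol if is_night else day_tol
    m = median(recent[-20:])             # rolling window of last 20 quotes
    return abs(price - m) / m <= tol

hist = [100.0, 100.1, 99.9, 100.05, 100.0]
print(accept_quote(100.2, hist, is_night=False))   # small move: accepted
print(accept_quote(103.0, hist, is_night=False))   # 3% jump by day: rejected
```

A median rather than a mean keeps the reference level itself robust to the very outliers the filter is trying to catch; as Chekhlov notes, such filters are a shield, not a substitute for traders watching the robots.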
Getting your hands dirty
In the end, if you want clean data, there seems to be no substitute for getting in among the weeds.
"The answer is you really need to go in. You need to look at the audit files. You need to invest the time to see what's going on. Because there are things that are going to come out," Alexander said. "You're going to see time stamps that don't add up. You'll see orders that end before they start. That'll happen a lot. You'll see orders with executed size greater than order size. You'll see weirdness in terms of prices."
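The "weirdness" Alexander lists can be encoded as mechanical audit rules and run over every record in a data pull. A sketch over a hypothetical order record; the field names and rules are illustrative, not any vendor's schema:

```python
from datetime import datetime

def audit_order(order):
    """Return a list of red flags for one order record: the kinds of
    anomalies that surface in raw OMS data pulls."""
    flags = []
    if order["end_time"] < order["start_time"]:
        flags.append("order ends before it starts")
    if order["executed_qty"] > order["order_qty"]:
        flags.append("executed size greater than order size")
    if order["avg_price"] <= 0:
        flags.append("non-positive price")
    return flags

bad = {"start_time": datetime(2024, 1, 2, 10, 0),
       "end_time":   datetime(2024, 1, 2, 9, 55),   # ends before it starts
       "order_qty": 1000, "executed_qty": 1200,     # over-executed
       "avg_price": 101.25}
print(audit_order(bad))
```

The value of writing the checks down is that they then run on every pull, not just the day someone remembers to look at the audit files.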
But the good news is that in doing all that, not only will a firm have better data, but also there can be some fringe benefits.
His colleague Giordano said that on one project, she and her team found compliance issues as a result of investigating their data systems. "The client was actually in violation of certain compliance rules on their side because their OMS was not behaving in the right way. So because this is such a complex thing, it's something that the client is not even aware of."
Arkady Maydanchik, in his book Data Quality Assessment, identified 13 causes of enterprise data problems.
1. Initial data conversion
As Maydanchik writes, databases rarely begin their life empty. More often than not, data is converted from another source, and that process often introduces errors.
2. System consolidations
These are common in the IT world. The challenges are the same as with data conversion, only magnified.
3. Manual data entry
A lot of data is inevitably typed into databases. People make mistakes.
4. Batch feeds
The source system that originates a batch feed may be subject to structural changes, updates, and upgrades. Testing the impact of these changes on the data feeds to multiple independent downstream databases can be difficult and impractical.
5. Real-time interfaces
Data may be propagated too fast, with little time to verify its accuracy. And if data gets rejected, it may be lost forever.
6. Data processing
If one could document everything going on in a database and how various processes are interrelated, the problem of errors from everyday data processing could be mitigated. But this, Maydanchik says, is an insurmountable task. So regular data processing inside the database will always be a cause of problems.
7. Data cleansing
Not only can data cleansing actually introduce new problems, but also it can create a sense of complacency once a cleansing project is completed.
8. Data purging
When data is purged, there is always a risk that some relevant data is also lost. Or, conversely, less data than intended might be purged.
9. Changes not captured
In an age of numerous interfaces across systems, people rely largely on a change being made in one place migrating to all other places. But this does not always happen, and data may decay as a result.
10. System upgrades
Upgrades are often designed for and tested against what data is expected to be, not what it really is.
11. New data uses
Data may be good enough for one purpose but inadequate for another.
12. Loss of expertise
Firms will often have one or two people who know the database better than anyone else. When those people move on, problems can start.
13. Process automation
Humans automatically validate data before using it, but computer programs take data literally and can't make proper judgments about the likelihood of it being correct. Validation systems may fail to see all data peculiarities.