Tick Data: Crunchable Numbers

Published in Automated Trader Magazine Issue 23 Q4 2011

A regular conversation piece with our readers is data. Not just the usual "faster, faster" topic, but also how to minimise the management overhead of historical data used for analysis and model building. Which led Automated Trader's Founder, Andy Webb, and our friends in the Wrecking Crew to take a look at Tick Data's data service and its TickWrite application.

Sherlock Holmes got it partially right:
"It is a capital mistake to theorise before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
(A Scandal in Bohemia, Sir Arthur Conan Doyle)

Inconsiderately, Holmes didn't come up with any convenient additional quotes about how to efficiently warehouse and manage an excess of murder clues, but we'll just have to make do on that front. Our readers could oblige, but regrettably most of their quotes about quotes are, um, unquotable. The commonest causes of their profanity about historical data are:

• There's too much of it

• Managing it doesn't generate alpha

That means that anything that can streamline its collection and the hassle of manipulating the stuff (such as re-parsing time frames) is worth looking at. So we are.

Historical history

Even though it was founded in 1984, TickData is one of those companies that feels like it's been around the market ever since the first butter and egg futures traded. Certainly, one member of the review team who is curiously reticent about his precise age recalls buying CBOT futures data from the company "a while back" and spending a very long weekend swapping floppy disks (the really floppy prehistoric five and a quarter inch ones) to load it all.

Things have moved on and the company now offers historical data as end of day downloads or (in the case of larger history blocks) via couriered hard drive. The range of data available has also increased: for example, over the past decade or so the company has been steadily adding equity markets to its service. In addition to the US, it now covers most equity markets in Western Europe, plus Japan and Brazil. In addition to price and volume data, subscribers also receive files for each instrument for splits, mergers and dividends.

On the derivatives front, nearly 100 futures markets are covered (with a broadly similar geographic spread to equities), plus all US options reported by OPRA. Cash indices are available from more than 30 countries, as are 16 US market indicators (such as the CBOE VIX). Where applicable, historical trades, quotes and volume are all available. (Historical quotes consist of best bid/offer, not entire depth of book.)

How clean is my data?

Square One in the data world is cleanliness. One truism in trading is that since exchanges went electronic, data has become cleaner. No longer is some hard-pressed clerk in the trading pit desperately hammering away on a keypad trying to keep pace with the screaming mayhem around them. So bad ticks don't happen anymore - right?

Sadly, they do. Another result of the migration away from open outcry is that individual trade sizes have shrunk while trade volumes have shot up (and as for quote volumes...). Bursts of activity can overwhelm data recording mechanisms, network issues can create out of sequence errors, and so on.

A further issue is that some exchanges and trading venues have simply under-invested in their historical data capture technology. This not only increases the risk of spurious trade ticks, but can also result in sequence errors due to the coarse granularity of time stamping. In some cases, you are lucky even to get per second time stamping - never mind millisecond or microsecond.

Survivorship bias

This is one of those areas that matters not at all to some people and a great deal to others. Unfortunately, there are probably quite a few traders out there scratching their heads over trading models that massively underperform in real time, who assumed themselves to be in the first group when they should be in the second. One reason is that they may have unwittingly fallen foul of survivorship bias.

If a trading model is only built and tested using symbols that are currently traded, there is a bias in favour of survivors. Model performance may be radically different if the symbols tested (typically stocks) also include those that have subsequently been acquired, gone bust, merged or otherwise delisted. For this reason, if you buy either all US equities or options symbols, TickData automatically includes both active and inactive companies in your order. Nice touch.

The persistence of bad ticks raises the question of data scrubbing: how much of it should a historical data vendor be doing? TickData's general philosophy is that under-scrubbing is better than over-scrubbing, as it leaves the end user with greater discretion over what to include or exclude. A data point that has been scrubbed away by the vendor is invisible to the user, who therefore no longer has the option of using it.

To avoid this situation, TickData gives customers a number of choices. Equity trade tick data is provided in its original raw form but is enriched with a range of condition code fields that flag up problems such as out of sequence trades that the company's cleaning algorithm regards as errors. A further field includes a suggested replacement price for any trades flagged in this fashion. However, the user still has the discretion whether to use this suggested price, the original price, or another price generated by their own proprietary filtering/cleaning algorithm. (Note that this only applies to data supplied as ticks - TickData also offers data prebuilt in one minute price bars, which are automatically cleaned.)
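The discretion described above can be sketched in a few lines of Python. This is only an illustration: the field names `flagged` and `suggested_price` and the three policy names are our own assumptions, not TickData's actual file layout or condition codes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TradeTick:
    timestamp: str
    price: float                      # raw price as originally reported
    flagged: bool                     # hypothetical: a condition code marked this tick as suspect
    suggested_price: Optional[float]  # hypothetical: vendor's suggested replacement, if any


def effective_price(tick: TradeTick, policy: str = "suggested") -> Optional[float]:
    """Return the price to use for a tick, or None if the tick should be dropped."""
    if policy not in ("raw", "suggested", "drop"):
        raise ValueError(f"unknown policy: {policy!r}")
    # Unflagged ticks, and every tick under the "raw" policy, keep the original price.
    if not tick.flagged or policy == "raw":
        return tick.price
    if policy == "drop":
        return None
    # policy == "suggested": fall back to the raw price if no suggestion exists.
    return tick.price if tick.suggested_price is None else tick.suggested_price
```

A proprietary filter would simply be a fourth policy. The point is that the raw price is never lost, so the choice can be revisited later.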

For historical futures tick data, TickData uses a similar principle, but adds further condition codes. These flag data points attributable to off-exchange trades, such as exchange for physical or block trades, which are often completely omitted from many historical data services. Again, the user has the option of deciding whether or not to include these trades in any data output file.

One of the many advantages of the Wrecking Crew is that they bring not just a range of expertise, but also a range of tools and data as well. We took advantage of this opportunity to do some random comparisons of TickData's data and a couple of alternative vendors' products supplied by Wrecking Crew members. This certainly wasn't an exhaustive process, but it did throw up some interesting discrepancies.

The general consensus was that where TickData suggested alternative prices to bad ticks, those alternatives seemed reasonable. By contrast, one Crew member gleefully found a couple of examples in a fellow member's data from another vendor where a known bad tick had been automatically replaced - with an even worse one.

Figure 1

TickWrite

Although it is possible to execute commands from a console or via batch file, the majority of users will probably elect to use the GUI of TickData's TickWrite application (available for Windows and Linux) for collecting and manipulating data. The software can be installed in one of two ways: Primary Installation or Workstation. Workstation is intended for situations where multiple users on the same local area network wish to run the application independently and access a single primary data repository. However, Workstations still require at least one machine on the network to have a Primary Installation that they can connect to in order to function properly.

Figure 1 shows the main TickWrite window with one of the data category tabs (Futures) selected. Tabs appear in TickWrite only for the categories of data the user has purchased, and each tab contains commands specific to that category. To apply any of the functions at the bottom of the tab to a particular security, the relevant checkbox needs to be ticked (or there is an option to automatically apply the functions to all symbols on a tab).

One extremely handy feature of TickWrite is its ability to save any combination of these functions that is used regularly as a predefined job. So if a common requirement is to process individual ticks to create tick bars each containing 10 trade ticks, the settings can be saved and simply run with a single click (or via the Scheduler - see the Scheduling section below).
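For readers unfamiliar with tick bars, the idea is simply to group every N consecutive trades into one open/high/low/close bar, regardless of how much clock time they span. The sketch below shows the general technique in Python. It is our own illustration, not TickWrite's implementation, and the `keep_partial` option is a hypothetical convenience.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Bar(NamedTuple):
    open: float
    high: float
    low: float
    close: float
    volume: int


def _to_bar(chunk: List[Tuple[float, int]]) -> Bar:
    prices = [price for price, _ in chunk]
    return Bar(prices[0], max(prices), min(prices), prices[-1],
               sum(volume for _, volume in chunk))


def tick_bars(ticks: Iterable[Tuple[float, int]], ticks_per_bar: int = 10,
              keep_partial: bool = False) -> List[Bar]:
    """Group (price, volume) trade ticks into bars of ticks_per_bar trades each."""
    if ticks_per_bar < 1:
        raise ValueError("ticks_per_bar must be at least 1")
    bars: List[Bar] = []
    chunk: List[Tuple[float, int]] = []
    for tick in ticks:
        chunk.append(tick)
        if len(chunk) == ticks_per_bar:
            bars.append(_to_bar(chunk))
            chunk = []
    # Trailing ticks that do not fill a whole bar are discarded unless requested.
    if chunk and keep_partial:
        bars.append(_to_bar(chunk))
    return bars
```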

The range of output options when processing tick data using TickWrite is undoubtedly impressive. Figures 2 and 3, which relate to the Equities tab, give an inkling of this. This particular configuration has been set up to generate a comma separated value (CSV) file of all quotes (in this case for Philips) for the last 100 days. Other interval options include trades, quotes, tick based bars, time bars, plus daily/weekly/monthly bars. Further options relate to handling of missing intervals and the option not to use prices that have not been filtered by TickData's cleaning algorithm.

A handy feature on the Output tab is the ability to choose to output data to screen (provided not too much data is involved) rather than file. We found this useful as a means of sanity checking other settings before running a long job on a large number of symbols over an extended data history. We simply ran the job first on a single symbol for a short period and eyeballed the screen output to check that it was as anticipated, before running the job in full.

Figure 2

Figure 3

Another common sense feature is the ability to adjust output handling with the 'If Output Exists' radio buttons, particularly 'Append Updates'. This last will attempt to tack new output onto existing files that were generated by the same job. For users taking daily updates, this allows a significant saving on processing time by only processing new data on each occasion.
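The saving from appending rather than rebuilding is easy to see in a sketch. The Python below illustrates the general idea only and is not how TickWrite does it. It assumes a header-less CSV whose first column is an ISO-8601 timestamp (so that string comparison matches time order) and rows already sorted by time.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def append_updates(path, new_rows: Iterable[List[str]]) -> int:
    """Append only rows newer than the last timestamp already in the file.

    Returns the number of rows appended.
    """
    path = Path(path)
    last_timestamp = ""  # sorts before any real timestamp
    if path.exists():
        with path.open(newline="") as existing:
            for row in csv.reader(existing):
                if row:
                    last_timestamp = row[0]
    # Keep only rows strictly later than what is already on disk.
    fresh = [row for row in new_rows if row[0] > last_timestamp]
    with path.open("a", newline="") as out:
        csv.writer(out).writerows(fresh)
    return len(fresh)
```

Only the new rows are written on each run, which is why daily updates stay quick even when the underlying history is large.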

Space limitations preclude covering all the numerous other TickWrite job options when processing data.

Suffice it to say that they appear to cover pretty much any base you care to think of in terms of content, timeline, time zone, trading session, file format/compression, naming convention and fields/ticks to include/exclude. In addition, there is the ability to filter data in/out on a range of parameters that vary according to security type. For example, in the case of equities these include price, volume, trade location and cross trade - while for options there is a further tab allowing you to filter by option expiry date range, puts/calls and strike price range.

Performance

Updating data files manually via TickData's download service is painless. Selecting Get Updates from the Actions menu in TickWrite brings up a dialog box where you elect to download either all available items or only data for a specific category of security (such as Futures), and whether to download trades, quotes or both. TickData's file format of choice is CSV and it uses a combination of zip and gzip file formats for data compression.

This seems to work well and TickData's servers also appear to have a decent amount of bandwidth. We tested this by timing the download of five weeks of tick quotes on 185 Euronext stocks, which took 22 minutes 41 seconds over a 10Mbit connection. Considering that some more active stocks would easily have racked up several million bid/offer quotes in five weeks, we reckoned that was pretty respectable.

However, the performance that really impressed the review team was the speed with which TickWrite processed ticks into price bars. For example, converting two months of trade ticks for the five most active stocks on Euronext (approaching ten million ticks) into seven tick price bars took 16 seconds. Performance on options was if anything even more impressive: processing two months of Citigroup option trades into six tick price bars took less than four seconds. This included writing 714 separate files to disk to cover all possible permutations of call/put, strike prices and expiries available during the two months.
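To put those timings in context, here is a back-of-envelope calculation in Python. It rests on our own simplifying assumptions: the 10Mbit line is taken as 10 million bits per second, fully saturated with no protocol overhead (so the figure is an upper bound on the compressed data transferred), and "approaching ten million ticks" is rounded up to ten million.

```python
def max_bytes_transferred(link_bits_per_second: float, seconds: float) -> float:
    """Upper bound on bytes moved over a fully saturated link."""
    return link_bits_per_second * seconds / 8


def ticks_per_second(ticks: float, seconds: float) -> float:
    """Average processing rate."""
    return ticks / seconds


# Download test: 22 minutes 41 seconds over a 10Mbit connection.
download_seconds = 22 * 60 + 41
upper_bound_gb = max_bytes_transferred(10_000_000, download_seconds) / 1e9

# Bar-building test: roughly ten million ticks in 16 seconds.
bar_rate = ticks_per_second(10_000_000, 16)

print(f"Download took {download_seconds} s, at most {upper_bound_gb:.2f} GB compressed")
print(f"Bar building ran at up to {bar_rate:,.0f} ticks per second")
```

In other words, the download cannot have exceeded about 1.7 GB of compressed data, and the bar builder was handling somewhere in the region of 600,000 ticks per second.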

Scheduling

The Scheduler is almost certainly the TickWrite component that those sick to death of managing data will love most of all. Using this to automatically run price updates immediately followed by a predefined job (or jobs) enables users to simply define a schedule, go home, and come back in the morning to find the latest data already downloaded and processed to their requirements.

One caveat is that certain markets make their data available rather late to TickData. For example, Euronext trade data are not available until 4am EST (at the earliest) for the previous day's trading, so traders in Europe wishing to use these data as 'live' trading inputs for the open of the current day will be out of luck. However, other European markets such as Italy, London and Germany are not a problem in this respect as their data are typically available before midnight EST on the same day.

The Schedule Editor (see Figure 4) allows users to build considerable resilience into the execution of TickWrite updates and jobs. The ability to schedule updates individually for quotes or trades for each data type is useful, as it prevents a single problematic data item causing all updates to fail. Multiple update retries can be made at defined intervals for each scheduled update and automatic email notifications can be sent after a specified number of failed attempts. Further email notifications can be sent about the failure, success or both of the entire scheduled update.
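The retry-then-notify pattern described above is a familiar one, and a minimal Python sketch of it looks like this. It is a generic illustration rather than TickWrite's own logic. All the parameter names are ours, and `notify` stands in for whatever sends the email.

```python
import time
from typing import Callable


def run_with_retries(task: Callable[[], None],
                     max_attempts: int = 3,
                     wait_seconds: float = 60.0,
                     notify_after: int = 2,
                     notify: Callable[[str], None] = print,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run task, retrying at fixed intervals. Returns True on success.

    notify is called once, when the number of failed attempts reaches notify_after.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True
        except Exception as exc:
            if attempt == notify_after:
                notify(f"update failed {attempt} times, last error: {exc}")
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                sleep(wait_seconds)
    return False
```

Running one such wrapper per update (trades and quotes separately, for each data type) is what keeps a single late file from sinking the whole overnight run.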

Figure 4

Crew Views

WILLIAM R (option prop trader, US bank): Managing historical options data is a horrible job. Multiple strikes and expiries make just the trade data a pain, but then when you add in historical quotes where thousands are often generated per actual trade, it becomes a complete nightmare. On that point, while the TickData default naming convention when writing multiple option data files is very sensible, it's handy that you can also add custom suffixes and prefixes. The Option Instrument Filters are also worthwhile. Being able to filter by expiry date and strike price range is handy. My only thought here is that it would be great to be able to filter by delta as well. Then you'd be able to quickly put together a time series of (say) just 50 delta calls and puts, which is actually the sort of data that I'm more likely to be using day to day.

PIERRE F (quant, European hedge fund): While scheduling as a general concept is obviously nothing new, the Scheduler is a really practical implementation of it. Built in resilience, alerting before/after task failure and the ability to batch multiple updates and data management jobs together is a great combination. Even though I've already built in-house quite a few of the sort of data management tools TickWrite provides, I'm still probably spending an average of 20 minutes a day tinkering with data. I suspect for those quants and traders who haven't already built their own management tools, this number would be considerably greater - as would the advantages of using TickWrite.

Pricing

So, how much is it? Well, a year of historical tick data (trades and quotes) for a single futures contract (a 'symbol year') is USD 175, as is a year of the update service for a symbol. This includes all listed contract months available for that instrument during the year. How valuable this actually is in practice depends a lot on the instrument concerned. For some instruments where all the action is in the front month, it's not unreasonable but it isn't exactly a huge bargain. However, for other instruments, such as the Eurodollar short term interest rate future, which has more than 40 contract months live at any one time, it's excellent value. Tick data for stocks costs considerably less. A year of tick data (trades and quotes) for a single US stock is USD 30, while for a single Euronext, Italian, German, Brazilian or London stock it is USD 20, with daily updates for a year costing the same in both cases.

In addition there are significant volume discounts available. For example, buying a 'symbol year' of data for 500 or more European stocks cuts the unit price by 25 per cent. For those who don't need tick quote history, a symbol year of just trade ticks costs USD 125 for futures, USD 20 for US stocks and USD 15 for Euronext, Italian, German, Brazilian or London stocks. For those who don't need even this level of granularity, stock data only is also available in prebuilt one minute bars for USD 15 for US stocks per symbol year and USD 10 for other stocks.
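As a worked example of the arithmetic, the Python sketch below prices a hypothetical order. It assumes the 25 per cent discount applies to the USD 20 trades-and-quotes price quoted for European stocks. The article's figures do not say which base price the discount is taken from, so treat that as our assumption.

```python
def order_cost(unit_price: float, symbols: int, years: int = 1,
               discount: float = 0.0) -> float:
    """Total cost in USD for symbols x years symbol years at a discounted unit price."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be a fraction between 0 and 1")
    return unit_price * symbols * years * (1.0 - discount)


# 500 European stocks, trades and quotes, one year, 25 per cent volume discount.
discounted = order_cost(20, 500, discount=0.25)
undiscounted = order_cost(20, 500)

print(f"USD {discounted:,.0f} with discount, USD {undiscounted:,.0f} without")
```

On those assumptions the order comes to USD 7,500 rather than USD 10,000.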

As a comparison, we took a look at the pricing on CQG's Data Factory, where a symbol year for a futures contract (trades and quotes) works out at USD 216, while for a stock (any country) it is USD 108, with one minute data being USD 120 and USD 60 respectively. As a further comparison, we checked out what Euronext itself was offering. A year of trade ticks for all Euronext cash instruments purchased direct from the exchange is currently EUR 9,000 (approx USD 12,000 at the time of writing), for which TickData also charges USD 12,000 - although it does offer discounts for multiple years (e.g. two years is USD 20,000).

Conclusion

While TickData aren't exactly giving the stuff away, the review team felt the data was good value - especially in view of the quality and options around data cleansing.

However, the key point to bear in mind here is the TickWrite software the company also includes with data orders (by contrast, the CQG and Euronext prices above are just for the data). The Wrecking Crew's unanimous opinion (see box 'Crew Views' for details) was that TickWrite turned a good value data package into an outstanding one. The consensus was that if one considered the time usually wasted managing and manipulating historical data by other means, TickWrite could easily offset much (if not all) of the cost of a data subscription in time savings alone.

Which makes it a thumbs up situation.