High frequency data analysis
As the requirements for storing, manipulating and deriving intelligence from ever larger data sets continue to expand, techniques and technology have to keep pace. Brian Sentance, CEO of Xenomorph, outlines some of the prerequisites.
It is hardly a secret that data management in financial markets is undergoing a period of fundamental change, driven by a variety of factors across the capital markets spectrum. Combined, these have precipitated a colossal increase in message and transaction volumes, as well as both profit and cost incentives to move away from single asset class data silos. Among many others, these factors include:
- The use of algorithmic trading as a means of reducing market impact and improving execution quality
- High-frequency statistical and order book arbitrage
- Developments in areas such as credit theory that establish market relationships, which in turn motivate more complex cross-asset trading strategies
- Regulations such as MiFID and RegNMS
End of day to high frequency
Just a decade ago, many traders were typically analysing end of day historic data for strategy backtesting and instrument pricing purposes. In at least some cases they arguably had little choice from a technological perspective: the capture, storage and analysis of intraday data volumes was challenging even then, especially at a time when the relational database was still a relatively new technology. However, with wider derivative pricing margins and profitable statistical arbitrage possible using only end of day data, there was also little incentive to store and analyse intraday tick and high frequency data. Since then, far tighter trading margins, cross-asset trading and improved technology have shifted traders' perceptions of what is required and what kind of analysis is possible with high frequency intraday data.
At the same time, intraday data has not created great excitement among risk managers. This is hardly surprising, given that many of them find obtaining clean data for end of day risk measurement challenging enough. Risk measurement techniques such as Monte Carlo or historical simulation VaR require large amounts of historical data and are calculation intensive. Large data universes or poor implementation may mean that it is challenging to run these techniques as an overnight batch, let alone perform the calculations in real or near real-time. However, growing intraday trading exposure, better understanding of intraday market behaviour and recent regulatory requirements regarding data transparency and data quality are driving risk managers towards analysis of tick by tick and intraday data.
One of the challenges when building a tick database for multiple instruments is defining precisely what data items to capture without having to learn and use some form of database scripting language. Ideally, business users will be looking for some form of GUI that allows them to select multiple data attributes for real-time capture and monitoring, and that also provides an easy means of configuring how the download is to be monitored, filtered and logged.
A considerable degree of flexibility is also desirable around the real-time modification and monitoring of the tick database instrument universe. For example, rather than having to continually specify instruments by individual name or symbol, it is more productive if some form of query option can be applied. Using this, a tick capture process could be established that only captured data for stocks with a market capitalisation between certain levels, or that were part of an index and that were of a certain level of volatility. If this query could also be configured to refresh itself automatically, then no new instruments would ever be excluded from the capture process.
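The query-driven universe described above can be sketched in a few lines of Python. This is only an illustration and not how any particular tick database implements it: the `Instrument` record, its fields, the `capture_universe` function and all the sample figures are hypothetical. The rule follows the wording of the example literally, so an instrument qualifies if its market capitalisation falls within a band, or if it is an index member with at least a given volatility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instrument:
    """Hypothetical reference-data record for one stock."""
    symbol: str
    market_cap: float      # in a single reporting currency
    indices: frozenset     # names of indices the stock belongs to
    volatility: float      # annualised, as a fraction (0.35 = 35%)


def capture_universe(instruments, min_cap, max_cap, index, min_vol):
    """Return the symbols that currently satisfy the capture rule.

    Calling this again after the reference data changes is the
    "automatic refresh": newly qualifying instruments simply appear.
    """
    selected = []
    for inst in instruments:
        in_cap_band = min_cap <= inst.market_cap <= max_cap
        volatile_member = index in inst.indices and inst.volatility >= min_vol
        if in_cap_band or volatile_member:
            selected.append(inst.symbol)
    return sorted(selected)


# Made-up reference data, for illustration only.
universe = [
    Instrument("AAA", 5e9, frozenset({"IDX"}), 0.35),
    Instrument("BBB", 50e9, frozenset({"IDX"}), 0.10),
    Instrument("CCC", 2e9, frozenset(), 0.50),
]
before = capture_universe(universe, 1e9, 10e9, "IDX", 0.30)

# A new listing arrives; re-running the same query picks it up.
universe.append(Instrument("DDD", 80e9, frozenset({"IDX"}), 0.40))
after = capture_universe(universe, 1e9, 10e9, "IDX", 0.30)
```

Because the rule is re-evaluated rather than stored as a fixed list of symbols, nobody has to remember to add the new listing by hand.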
Tick Storage - Real-Time, Intraday and History All in
Storage of intraday data is becoming increasingly challenging due to the data volumes involved. A highly traded stock may have tens of thousands of price events per day, quickly resulting in a storage requirement of gigabytes of data per day and terabytes of data per year for any reasonably sized instrument universe.
Bear in mind that these figures are just today's requirements. If one extrapolates the current growth rate in data densities just a year or two into the future, one will be looking at demands that are an order (or perhaps several orders) of magnitude greater. Therefore any database engine that is to be remotely future-proof will ideally already be capable of storing petabytes of data today.
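The storage arithmetic above can be made concrete with a back-of-envelope calculation. Every input below is an assumption chosen purely for illustration (universe size, events per instrument, bytes per stored event, trading days per year); none of these figures comes from a real feed or database.

```python
def daily_bytes(instruments, events_per_instrument, bytes_per_event):
    """Raw bytes needed to store one trading day of price events."""
    return instruments * events_per_instrument * bytes_per_event


# Assumed inputs, for illustration only.
INSTRUMENTS = 5_000            # size of the instrument universe
EVENTS_PER_DAY = 50_000        # "tens of thousands" of price events per stock
BYTES_PER_EVENT = 32           # e.g. timestamp, price, size and flags
TRADING_DAYS = 250             # trading days per year

per_day = daily_bytes(INSTRUMENTS, EVENTS_PER_DAY, BYTES_PER_EVENT)
per_year = per_day * TRADING_DAYS

print(f"per day : {per_day / 1e9:.0f} GB")
print(f"per year: {per_year / 1e12:.0f} TB")

# Growth of one and three orders of magnitude in data density.
for factor in (10, 1_000):
    print(f"x{factor:<5}: {per_year * factor / 1e15:.2f} PB per year")
```

Under these assumptions the universe needs 8 GB per day and 2 TB per year, and growth of three orders of magnitude takes the annual figure to 2 PB, which is why petabyte capacity is a sensible design target.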
However, storage is only part of the picture. A viable database engine will also need to be capable of high performance data retrieval at speeds well beyond those currently possible with mainstream relational database technology. If it is, large calculations with intraday and time series data that ordinarily run as a batch job can be run in real-time and near real-time. This delivers a significant competitive advantage in implementing trading strategies, back-testing of trading strategies, risk management and transaction cost analysis.
Data Validation - Why Waste Half Your Time?
One of the key issues in analysing market data, and in particular intraday data, is that many traders, analysts and risk managers spend far too much of their time checking, validating and interpolating data and not enough time on making decisions with it. For many kinds of analysis, it is essential that the raw database data stored from the market be adjusted in some way prior to applying a calculation. For example, the data may need rebasing in frequency, aligning across multiple series or filling in some way.
In standard SQL databases this kind of manipulation is very difficult to do, requiring much post-processing of queried data. This has the net result of reducing the usefulness of SQL, increasing the complexity of the queries written and limiting the analysis that can be done by non-technologists.
One effective way of addressing this issue is to have "data rules" that set a data context for the data to be loaded in a query or calculation. Some examples of data rules are shown below, the key point being that by separating out the data rules, the queries themselves can be kept simple and hence more productive for both technologists and business users alike.
- Data Frequency and Time Snapping - this involves converting from tick-by-tick (irregular time frequency) to some other more regular basis such as price samples taken every five minutes
- Data Filling, Aligning and Loading Rules - in the data frequency conversion above, it cannot be guaranteed (and indeed it is unlikely) that there will be a traded price observed at every five minute frequency point. Therefore some form of data rule should be available that allows the user to substitute the latest, next or an interpolated price point
- Data Source Selection - another data management issue that can arise is where multiple data sources are available, and the user has some kind of preference for which sources should be used first. This can be resolved by a data rule that specifies the priority of the various data sources. Then if the primary source does not have any data for a particular capture point, data from the secondary source can be substituted.
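The three kinds of data rule listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the TimeScape implementation: the function names `snap_to_grid` and `pick_by_priority`, the explanation labels and the sample prices are all hypothetical. The first function snaps irregular ticks onto a regular grid and fills gaps with the latest, next or a linearly interpolated price; the second picks, for each grid point, the value from the highest-priority source that has one.

```python
from datetime import datetime, timedelta


def snap_to_grid(ticks, start, end, step, fill="last"):
    """Sample irregular (time, price) ticks at start, start+step, ... up to end.

    Returns (grid_time, price, explanation) rows. fill is "last", "next"
    or "interpolate" and is used when no tick falls exactly on a grid point.
    """
    ticks = sorted(ticks)
    rows, i, t = [], 0, start
    while t <= end:
        # Advance i so that ticks[i] is the first tick strictly after t.
        while i < len(ticks) and ticks[i][0] <= t:
            i += 1
        prev = ticks[i - 1] if i > 0 else None
        nxt = ticks[i] if i < len(ticks) else None
        if prev is not None and prev[0] == t:
            rows.append((t, prev[1], "real"))
        elif fill == "last" and prev is not None:
            rows.append((t, prev[1], "filled: last"))
        elif fill == "next" and nxt is not None:
            rows.append((t, nxt[1], "filled: next"))
        elif fill == "interpolate" and prev is not None and nxt is not None:
            weight = (t - prev[0]) / (nxt[0] - prev[0])
            price = prev[1] + weight * (nxt[1] - prev[1])
            rows.append((t, price, "filled: interpolated"))
        else:
            rows.append((t, None, "missing"))
        t += step
    return rows


def pick_by_priority(values_by_source, priority, times):
    """For each time, take the price from the first source that has one."""
    out = []
    for t in times:
        for source in priority:
            price = values_by_source.get(source, {}).get(t)
            if price is not None:
                out.append((t, price, source))
                break
        else:
            out.append((t, None, None))
    return out


# Made-up ticks for one morning, sampled every five minutes.
nine = datetime(2006, 1, 12, 9, 0)
ticks = [
    (nine, 100.0),
    (nine + timedelta(minutes=3, seconds=30), 101.0),
    (nine + timedelta(minutes=12), 103.0),
]
five = timedelta(minutes=5)
last_rows = snap_to_grid(ticks, nine, nine + 3 * five, five, fill="last")
interp_rows = snap_to_grid(ticks, nine, nine + 3 * five, five, fill="interpolate")

# Two hypothetical sources; the secondary fills a gap in the primary.
sources = {"primary": {nine: 100.0}, "secondary": {nine: 100.5, nine + five: 101.2}}
chosen = pick_by_priority(sources, ["primary", "secondary"], [nine, nine + five])
```

Keeping the rules in functions like these, separate from whatever calculation consumes the rows, is what lets the query itself stay short. Returning an explanation with every value also lets the reader of a report tell a traded price from a filled one.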
Real or Interpolated Data?
The data rules briefly outlined above are powerful, but a key requirement for anyone using them is to quickly understand the effect of applying them to the data loaded in a query. An example query using Xenomorph TimeScape is shown in Figure 1 below:
Figure 1 - Example Query Explaining Effects of Data Rules Applied
This example query converts tick data to an hourly frequency for Barclays, retrieving date/time, value, the data status and a data status explanation. It can be seen that none of the data has been officially validated as shown in the data status column.
The data point explanation column tells the user how the data value was arrived at, i.e. whether the data value is "real" or interpolated. This transparency of understanding around data rules and their effects is vital to the validity of any calculation or report.
Data Analysis - From Backtesting to Best Execution
Data only becomes valuable when data analysis translates it into information. Why store vast amounts of market data if you don't have the ability to analyse it? The current competitive challenge in the market is to apply ever more complex analysis to ever-increasing volumes of data.
Backtesting for automated and statistical trading strategies is growing in importance as a pre-trade decision support function. Within risk management, intraday and cross-market trading is driving the need to analyse intraday data more fully and faster. Regulation is now also driving data analysis in areas such as historic verification of best execution policies.
In order to deal with these expanding data analysis requirements, both traders and technologists need the tools to analyse more data, more quickly. It is no longer sufficient to rely solely on technologists to translate business requirements into business analytics. Such a process makes technology staff the business bottleneck in responding to trading requirements. It also means that the technology staff are hit with daily tactical requirements from trading that inhibit the timely delivery of strategic systems and processes. What is needed is a strategic way of alleviating this tactical pressure point, providing a transparent, easy to use and powerful framework for data analysis across all asset classes. Examples of the type of functions that can be used by end users to facilitate this analysis include:
- Chaining Analytical Functions - if calculations can be chained together to perform more complex analysis, greater productivity results. For example, a query could take all the available trade price data for HBOS, calculate a rolling twenty point volatility (adjusting automatically for irregular data frequency) and then calculate the average of these volatilities.
When one combines the analysis capabilities shown above with some of the data rules and frequency examples previously explained, it becomes possible to achieve some relatively complex pieces of analysis - but without having to write complex queries.
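The chained volatility query described for HBOS can be approximated in plain Python as follows. This is a sketch under stated assumptions rather than the actual query language: the helper names (`time_scaled_returns`, `rolling`, `chain`) are hypothetical, the tick data is synthetic, and the adjustment for irregular frequency (dividing each log return by the square root of the elapsed seconds) is one simple choice among several. The window here is taken over twenty consecutive returns.

```python
import math
from statistics import mean, stdev


def time_scaled_returns(ticks):
    """Log returns scaled to a per-second basis.

    ticks is a list of (seconds, price) pairs in time order. Dividing by
    sqrt(dt) makes returns over uneven gaps comparable with each other.
    """
    out = []
    for (t0, p0), (t1, p1) in zip(ticks, ticks[1:]):
        dt = t1 - t0
        if dt > 0:                      # skip ticks with identical timestamps
            out.append(math.log(p1 / p0) / math.sqrt(dt))
    return out


def rolling(values, window, fn):
    """Apply fn to every run of `window` consecutive values."""
    return [fn(values[i - window:i]) for i in range(window, len(values) + 1)]


def chain(data, *steps):
    """Feed data through each step in turn, left to right."""
    for step in steps:
        data = step(data)
    return data


# Synthetic ticks with gaps of one, two or three seconds (illustration only).
ticks, t = [], 0.0
for i in range(40):
    t += 1 + i % 3
    ticks.append((t, 100.0 + 0.1 * (i % 5)))

average_volatility = chain(
    ticks,
    time_scaled_returns,
    lambda returns: rolling(returns, 20, stdev),   # rolling twenty point volatility
    mean,                                          # average of those volatilities
)
```

Each step is an ordinary function, so swapping `mean` for `max`, or changing the window length, alters the analysis without touching the other steps.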
An example of these analysis capabilities is shown in Figure 2, where an historic tick data series is converted into a daily time series at different times of day, just as the market is closing. It can be seen how the volatility calculated at market close is much lower than that observed just a few minutes earlier.
This kind of intraday measurement of volatility and other measures such as correlation is proving increasingly important in the pricing and risk management of derivatives, and in the formulation of new trading ideas.
- Intraday Time Period Analysis - the example in Figure 2 shows how the behaviour of a market can vary substantially depending upon the time during the day that it is observed. Looking at this issue in a more granular way, another chained query could be used to split a trading day into any number of time buckets and return values and calculations during each bucket.
Figure 2 - Volatility Measured as a Market Closes
For example, it could return values for the open, high, low and close prices plus the number of points and average price for each ten minute period bucket applied to the intraday price history of a stock.
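As a rough illustration of the ten minute bucket query, the following Python sketch groups ticks by bucket and returns open, high, low, close, point count and average price for each. The function name `bucket_stats`, the output layout and the sample prices are hypothetical; buckets are aligned to midnight, which is an assumption rather than anything the original query specifies.

```python
from datetime import datetime, timedelta


def bucket_stats(ticks, bucket=timedelta(minutes=10)):
    """Summarise (datetime, price) ticks per time bucket.

    Returns {bucket_start: {"open", "high", "low", "close", "count", "average"}}.
    """
    stats = {}
    for t, price in sorted(ticks):
        midnight = t.replace(hour=0, minute=0, second=0, microsecond=0)
        # Floor the tick time to the start of its bucket.
        start = midnight + ((t - midnight) // bucket) * bucket
        s = stats.get(start)
        if s is None:
            stats[start] = {"open": price, "high": price, "low": price,
                            "close": price, "count": 1, "total": price}
        else:
            s["high"] = max(s["high"], price)
            s["low"] = min(s["low"], price)
            s["close"] = price          # ticks are sorted, so the last one wins
            s["count"] += 1
            s["total"] += price
    for s in stats.values():
        s["average"] = s.pop("total") / s["count"]
    return stats


# Made-up ticks, for illustration only.
day = datetime(2006, 1, 12)
ticks = [
    (day.replace(hour=9, minute=1), 100.0),
    (day.replace(hour=9, minute=4), 102.0),
    (day.replace(hour=9, minute=9), 99.0),
    (day.replace(hour=9, minute=12), 101.0),
]
result = bucket_stats(ticks)
first = result[day.replace(hour=9, minute=0)]
second = result[day.replace(hour=9, minute=10)]
```

Passing a different `bucket` value, for example `timedelta(hours=1)`, splits the day into any other number of buckets without changing the function.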
So what does high frequency data analysis require? Fundamentally, the current needs of the market translate into capturing, cleansing, storing and analysing ever more data, ever more quickly. Building on the above fundamental market requirement, it is evident that more analytical power needs to be put in the hands of the people who make trading and risk management decisions.
The productivity gains from such an approach are manifold, particularly when they are based upon a foundation that also delivers automated data cleansing, centralisation and transparency to IT and compliance departments.
Data Accessibility - An Illustration
Changes in recent years in the density and variety of market data require technology that can adapt to the pace of change and deliver high performance analysis even when the quantity of intraday data being analysed is massive. However, such technology must make the data it is intended to analyse readily accessible to both technologists and business users. Traders in particular will be looking for something that can give them a quick and intuitive grasp of the financial objects they are studying, be they instruments, curves, indices etc. They typically do not take kindly to data applications that require them to become instantly expert in database technology or table-based data models.
Fortunately this is unnecessary, as the following figures hopefully illustrate. Figure 3 shows the stock Imperial Chemical Industries (ICI) being viewed in Xenomorph's TimeScape WorkBench. The view shows intraday data captured from the London Stock Exchange, and in particular the number of ticks per day for intraday trade prices captured for the stock. Once a particular date has been selected, then ticks for that day can be observed and charted as shown in Figure 4.
The left hand side of Figure 4 shows how firstly the stock ICI is selected and how the properties available for ICI are expanded for further browsing by the user. Upon selecting the trade price ("TradePrice") property for ICI, all of the available data sources for this property are displayed. Whilst there could be multiple data sources, in this case there is only one, that of "LSE". Underneath the "LSE" data source each day's history is displayed. The righthand side of the application is split into two, one listing the tick times and values for the day selected on the left (in this case the 12th of January 2006), the other charting the ticks for this same