
Do Your Testing Methods Deliver?

Published in Automated Trader Magazine Issue 08 Q1 2008

Much of the underperformance of messaging systems can be attributed to inadequate measuring tools and testing conditions, according to Barry Thompson, Founder and CTO, and Dave Lauer, Senior Systems Engineer, Tervela.

The financial services industry is deploying so-called next generation network devices and software systems to combat the exponential increase in market data volumes and internally-generated messaging traffic. While messaging technologies, such as message-oriented middleware, have evolved, the means used to appraise them have not. This disconnect has led to 'game-changing' technologies failing spectacularly when moved from testing into production. Buy-side firms, in particular, are feeling the pain. For example, the ability to process and distribute market data relies heavily upon messaging systems, such as feed handlers and complex event processing (CEP) engines. For order execution, similar challenges exist for both order management systems and FIX engines. The sell side is not immune to this either, and often leverages the same technology to service its buy-side clients. Though there are many components to a financial services trading infrastructure, the emphasis of this article is on the performance of messaging systems because they are so critical to other parts of the system. We also examine how replayed and live data yield completely different results when trading firms are testing competitive technology offerings.

Expectations and reality diverge

Recently, a large investment bank in New York walked us through its testing approach for new market data infrastructure products, an approach created after a messaging system for market data distribution vastly underperformed against published test results. The bank lamented that, even with its own lab, it lacked the requisite methodology and framework to achieve a level of accuracy that would ensure seamless live deployment. So why does production reality fall short of expectations, given how well these technologies performed in the lab?

Low latency and high throughput are the mantras of this generation of technology, yet very little attention has been paid to properly measuring the performance of messaging systems, or indeed the other technologies that support automated trading. In general, many products are evaluated in a development laboratory by replaying recorded data streams through the software or device under test. Firms use replayed data because it provides a common basis for evaluating different products and observing how they perform under identical conditions. Frequently, however, systems put into a production environment show materially different performance characteristics compared with this 'clean room' lab environment. These real-time production performance problems cost firms profitability, but could have been uncovered with proper pre-production performance measurement. It has become increasingly apparent that replayed data does not have the same characteristics as live data and cannot accurately demonstrate or measure system performance. Ultimately, trading firms - along with their end users and customers - suffer because they cannot get the market data they need to execute trades in a timely manner: their systems simply do not perform as they did in lab tests.

The problems inherent in using replayed data to evaluate how a system under test will perform in a live setting can be broken into two areas - measuring tools and test conditions.

Reproducing extreme volatility

Existing test tools for evaluating trading infrastructure technologies sit on commodity hardware rather than specialised equipment and typically employ non-real-time operating systems, which can result in clock skew or difficulties in synchronisation. This is acceptable for evaluating legacy technologies in which latency is measured in milliseconds, but not for 'next generation' technologies that operate in the microsecond realm. Such skew impedes the coordinated replay of market data from multiple systems, which is critical to recreating the 'microbursting' characteristics of a live feed, and the vast difference in granularity means that performance cannot be accurately measured for technologies operating in the sub-millisecond space.

Unrealistic test conditions

Even in cases where clock synchronisation can be guaranteed with hardware or the new Precision Time Protocol (PTP), tests of technologies such as feed handlers, CEP engines and FIX engines are often performed under unrealistic conditions or with non-relevant use cases. Clock synchronisation alone is not enough to accurately simulate the extreme 'burstiness' of a live data feed. The network stacks of commodity hardware and operating systems are difficult to bypass when putting packets on the wire, and the substantial queuing in a standard network stack can distort and smooth the output of replayed data. The network infrastructure and the architecture of the data centre are also factors that need to be considered. One hedge fund we work with over-provisions its network by a factor of five, yet finds its message servers consistently overrun during peak conditions. The fund discovered that adding more servers to compensate only yielded diminishing returns while increasing both footprint and operational complexity.

"These real-time production performance problems cost firms profitability, but could have been uncovered with proper pre-production performance measurement."

Replayed data feeds do not come close to accurately mimicking the true dynamic nature of live feeds. If, for example, the OPRA (Options Price Reporting Authority) market data feed produces volume spikes of over 450,000 messages per second (MPS), it would seem reasonable to use 450,000 MPS as the target rate for the replay feed used as test data. However, viewing the OPRA feed by the millisecond reveals that some periods contain as many as 1,000 messages per millisecond. That burstiness extrapolates to 1,000,000 MPS, more than twice the rate that would have been used in the replay feed. Increasing the feed to 1,000,000 MPS will demonstrate the performance of the tested system under large sustained volumes, but will not indicate how well the system can handle extreme traffic bursts in very small time periods.
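To make the arithmetic concrete, here is a minimal Python sketch (the function and variable names are ours, and it assumes a list of per-message arrival timestamps, in seconds, taken from a feed capture) showing how per-second aggregation hides exactly the microbursts that per-millisecond bucketing exposes:

from collections import Counter

def peak_rates(timestamps_s):
    # Bucket the same capture two ways: whole seconds and whole milliseconds.
    stamps = list(timestamps_s)
    per_second = Counter(int(t) for t in stamps)
    per_millisecond = Counter(int(t * 1000) for t in stamps)
    # Peak messages in any one second, and peak messages in any single millisecond.
    return max(per_second.values()), max(per_millisecond.values())

# A second averaging 450,000 messages can still contain individual milliseconds
# carrying 1,000 messages - an instantaneous rate of 1,000,000 MPS.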

Emulating volatile live feed behaviour

As observed above, recreating the volatile behaviour of live feeds is very difficult in a traditional laboratory environment. However, it is critical to use live data or recreate live data conditions in order to accurately evaluate new technologies, such as feed handlers, middleware, CEP engines, etc. Even with available replay tools, multiple synchronised systems working together would be necessary to mimic the level of microbursting, and given the limited accuracy and granularity of timer services in today's commodity hardware and operating system combinations, this is not possible.

To compound this problem, there is also the issue of reporting and comparison. True performance testing requires a live data stream, but live data is inherently chaotic. Therefore, only a scientific method for describing the relative qualities of different periods of live data would allow the latency and performance characteristics of devices processing those periods to be reported and compared. Though we discuss the challenges around messaging, we could easily expand our scope to address the performance measurement of middleware systems, switches, routers, firewalls and other network devices. Evaluating any of these technologies with simulated live traffic would require an entire parallel enterprise network infrastructure in order to truly recreate data patterns and volume spikes. One way to overcome this difficulty is with a 'live data coefficient', a single numeric value that can describe a period of live data and allow different live periods to be compared. Such a comparison would enable previously impossible cross-vendor analysis with live data tests.

"… it is critical to use live data or recreate live data conditions in order to accurately evaluate new technologies …"

Live data coefficient methodology

So what's the methodology behind the live data coefficient? The coefficient is a measurement of bandwidth utilisation within a millisecond time window. By analysing live data on a millisecond basis, the true extent of data burstiness can be evaluated, exposing microbursts that are normally lost when data is described on a per-second basis (e.g. megabits per second or messages per second). Using the live data coefficient to describe each millisecond window allows those windows to be analysed individually and the overall time period to be characterised. This eliminates the aforementioned problems of measuring replayed data as opposed to live data.
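The exact formulation is not spelled out here, but one plausible sketch (the names and the assumed 1 Gbps link are ours, purely for illustration) treats the coefficient for each window as the bytes observed in that millisecond divided by the bytes the link could carry in a millisecond:

def live_data_coefficients(messages, link_bps=1_000_000_000):
    # messages: iterable of (timestamp_seconds, size_bytes) from a live capture.
    # Returns {millisecond_window: fraction of link capacity used in that window}.
    capacity_bytes_per_ms = link_bps / 8 / 1000.0
    windows = {}
    for ts, size in messages:
        window = int(ts * 1000)                       # 1 ms bucket index
        windows[window] = windows.get(window, 0) + size
    return {w: b / capacity_bytes_per_ms for w, b in windows.items()}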

To delve a little deeper, let's assume that we measure the performance of an infrastructure product such as a feed handler or middleware product over a 60-second period of live data. Sixty seconds yield 60,000 one-millisecond windows, and therefore 60,000 coefficients. We can then analyse this dataset by looking at the percentile distribution of the coefficients and their standard deviation to derive measurements of bandwidth and burstiness respectively. These values allow performance results from technology infrastructure products to be compared under variable load conditions using live data.
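Continuing that illustrative sketch (names are ours), the 60,000 coefficients can be reduced to a percentile profile plus a standard deviation - one way to express bandwidth and burstiness for a period:

import statistics

def summarise_period(coefficients):
    # coefficients: the 60,000 per-millisecond values for one 60-second period.
    ordered = sorted(coefficients)
    def pct(p):
        return ordered[min(len(ordered) - 1, int(p / 100.0 * len(ordered)))]
    return {
        "median": pct(50),                    # typical bandwidth utilisation
        "p99": pct(99),                       # heavy but routine milliseconds
        "p99.9": pct(99.9),                   # microburst territory
        "max": ordered[-1],
        "stdev": statistics.pstdev(ordered),  # proxy for burstiness
    }

Two test runs captured on different days can then be compared by their period summaries rather than by raw replay rates.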

The lab versus the real world

These ideas about live data patterns must be connected to real-world performance measurement. Using live data to compare products will enable firms to move away from canned data and the problems explained above. The live data coefficient can support this transition because it allows firms evaluating new technology infrastructure to compare performance results from tests run over disparate data periods, and thus to provision their technological infrastructures for optimal and, most importantly, predictable market performance. This matters because periods of market volatility represent critical automated trading windows, yet any volatility in technology performance will shut that window and erase any financial upside.

As automated and algorithmic trading continues to evolve, it is imperative for firms to align their infrastructure with their trading strategy. Money can be made or lost based on predictable system responsiveness during volatile periods. Many buy-side firms say that performance predictability can be more important than low latency; their algorithms can make money when they know how much latency to expect on a given data stream, but they are left in the dark when results are non-deterministic.

When performance of critical technology infrastructure products such as feed handlers, middleware, CEP engines and FIX engines differs dramatically between the lab testing environment and the volatile real world environment, algorithms that appear to be successful when backtested can fail miserably. If acting on old data, or on data with unpredictable latency, some algorithms can actually accelerate their activity, resulting in disastrous positions that can be costly to unwind.

Absolute and relative performance

When examining the performance of these technologies, we are interested both in absolute performance and in how closely the results track changes in the metrics established by the live data. While absolute performance is important to measure, the deviation in performance of the unit under test as the data pattern changes (from increases in bandwidth or burstiness) is also a critical measurement, and a quality that varies significantly between software and hardware products.

One might expect a high correlation between software system load and the live data coefficient metrics because these systems have little or no control over system interrupts, process scheduling or context switching. However, it is notable that there is also a high correlation between the coefficient and the load of a hardware-based system. Generally speaking, the main differences between software- and hardware-based systems are the magnitude of the standard deviation of latency and the distribution of latency results when viewed as percentiles.

Even with hardware-based devices, we can see a very high correlation between the coefficient values and the latency results observed. Simple queuing theory dictates that as a feed increases in magnitude or burstiness, there will be some change in performance. The magnitude of the change in latency is what emerges as the most important feature. If a hardware-based solution demonstrates an orders-of-magnitude improvement over a similar software-based system (as has been observed in empirical testing), then the correlation of the latency to the feed characteristics becomes less important. Software-based feed handlers, for example, have demonstrated a standard deviation of over 300 microseconds on OPRA data distributed over Ethernet, whereas a standard deviation of less than 25 microseconds has been observed for certain hardware-based messaging systems. With this in mind, the correlation between latency and feed characteristics (which is roughly equivalent for software and hardware systems) seems less important.
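One way such a relationship might be quantified (a sketch using our own names, assuming latency samples have already been paired with the coefficient of the millisecond window in which they were measured) is a simple Pearson correlation:

from statistics import mean, pstdev

def pearson(coefficients, latencies_us):
    # Correlate per-window live data coefficients with the latency (in microseconds)
    # observed for traffic in those same windows.
    mx, my = mean(coefficients), mean(latencies_us)
    cov = sum((x - mx) * (y - my) for x, y in zip(coefficients, latencies_us)) / len(coefficients)
    return cov / (pstdev(coefficients) * pstdev(latencies_us))

# Both software and hardware devices tend to show a high correlation; what separates
# them is the spread of latency (e.g. >300 us vs <25 us standard deviation above).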

"If acting on old data, or on data with unpredictable latency, some algorithms can actually accelerate their activity, resulting in disastrous positions …"

Performance and scalability

One has to consider whether a software-oriented approach to messaging systems will satisfy the demands of the automated trading market. With data volumes growing faster than computing capacity, performance analysis is more critical than ever, especially as new hardware-oriented solutions enter the market. A primitive lab test between two computers just won't be accurate.

As trading systems scale to hundreds of applications running across hundreds of nodes, simple benchmarking becomes more difficult. One of the greatest difficulties in the world of software-based middleware and data fabrics is scalability, i.e. realising the same performance between 500 end-points as observed between two end-points. It is this increase in routing complexity that drove networking equipment from software running on commodity processors to specialised hardware, and it is currently having the same impact in the middleware industry. This impacts not only the testing methodologies that buy- and sell-side firms apply to their trading systems but also the purchase decisions they make when updating their infrastructure.

"… trading firms need to stop blindly absorbing lab test results and examine the methodologies they use to evaluate systems …"

Bridging the gap

While 'game-changing' and 'next generation' technologies appear very promising, they often fail to deliver on those promises simply because of the way their performance is lab tested before the products are released. Products that get the green light from lab testing are failing in the field under live traffic conditions. Lab performance testing using playback data is flawed because it does not truly reflect real-world data traffic characteristics. Measuring performance on live data is now feasible using methods, such as the live data coefficient, that allow results from disparate periods of live data to be compared. Live feed data modelling allows us to analyse performance under different load conditions and provides a much more objective, 'apples to apples' approach when firms compare test results between different vendors.

In addition to average latency, one must also consider measuring other characteristics that can quantify and contrast true product performance.

The measurement of standard deviation could prove to be even more important than the average, along with maximum values and the performance in the 99th, 99.9th and 99.99th percentiles. Products with the lowest standard deviation when presented with high magnitude and 'bursty' traffic will have superior ability to absorb massive volume spikes with minimal latency outliers. It is the performance at these extreme conditions that can ultimately dictate whether millions are made or lost when an automated trading system is tweaked to maintain pace with this massive data rate explosion.
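As an illustrative sketch (names are ours; latency samples assumed to be in microseconds), the same percentile-and-deviation treatment applied to the coefficient can be applied to a device's latency results:

import statistics

def latency_report(samples_us):
    # Summarise one test run: the average alone hides exactly the outliers that matter.
    ordered = sorted(samples_us)
    def q(p):
        return ordered[min(len(ordered) - 1, int(p / 100.0 * len(ordered)))]
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p99": q(99),
        "p99.9": q(99.9),
        "p99.99": q(99.99),
        "max": ordered[-1],
    }

# When two products are compared over the same bursty live period, the one with the
# lower stdev and tighter tail percentiles absorbs volume spikes with fewer outliers.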

Ultimately, trading firms need to stop blindly absorbing lab test results and examine the methodologies they use to evaluate the systems upon which they are building their automated trading infrastructures. They also need to pay close attention to industry benchmarks and make sure that the results they extrapolate have applicability to real-world trading models. Aligning strategy with infrastructure is the only way to mitigate risk. Failure to do so can result in big surprises when firms find that their carefully tested trading strategies don't perform well in a live environment and don't match the results they generated in their labs. The disparity between expectations and reality then becomes the unanticipated foundation for operational and financial risk to buy-side and sell-side organisations alike.