The Gateway to Algorithmic and Automated Trading

When failure is not an option

Published in Automated Trader Magazine Issue 13 Q2 2009

How do you prevent systemic failures in high-performance trading infrastructures? We called in Rob Ciampa, Vice President, Product Management at Tervela.

Today's competitive environment leaves little room for operational mishaps, yet the foundation of the trading business - technology - is more susceptible than ever to failure. Mission-critical, high-performance trading infrastructures remain extremely vulnerable to constituent failures, while burgeoning complexity and increasing system co-dependencies exacerbate risk and time-to-resolution.

My objective with this article is to examine key technology hazards and then to explain how to mitigate danger proactively without negatively impacting the performance equation.

The Sky is Falling

Recently, a major financial services institution specializing in mutual funds suffered a 48-hour outage in a critical business unit, resulting in over $20 million in losses and numerous litigation issues. The cause was the inability of algorithmic trading servers to process market data effectively, forcing the data producers to rebroadcast information continually and ultimately bringing the underlying network down.

Another leading international financial firm recently saw its large New York City trading floor experience numerous outages that resulted in a suspended order flow of $90 billion and losses of $3 million per hour. Data volumes wreaking havoc in the middle office and back office triggered network and system outages that brought down the front office.

The trading world continues to evolve as firms adjust their trading strategies to exploit market opportunities before their competitors. But neither of these variables - trading strategies nor market opportunities - is discrete, autonomous or effectively measurable in real time. Instead, they are part of an ecosystem of complex systems with sophisticated inter- and intra-dependencies.

In addition to diverse algorithmic trading strategies, firms must also manage trade structure, order routing and execution costs across diverse asset classes, markets and liquidity venues. All of this runs on diverse IT infrastructures with diverse Service Level Agreements (if any) and disparate controls and authority.
It's critical in high-performance trading environments that all the components in the value chain are understood and their inherent interdependencies and risks are quantified.

The path to interdependent systems

Trading infrastructures will continue to evolve as long as algorithmic and technological advances keep yielding positive financial benefits. Challenging economic and market conditions inherently present numerous and diverse execution opportunities. Alternative execution venues, direct market access (DMA), algorithmic containers, electronic communication networks (ECNs), smart order routers, exchange co-location, multi-core processing, distributed caches, low-latency networks and messaging systems are just some of the ingredients that must be combined into the optimal trading recipe.
Building a high-performance trading infrastructure from scratch is just not a practical option, for several reasons. Specific expertise is rare. There are too many niche components. Integration schedules won't match the market opportunity window. And, finally, the operational risk of a monolithic system can knock a firm out of the market, which is far too common with many legacy systems today. Hedge funds, market makers, new liquidity venues, algorithmic trading houses and high-frequency trading firms are but a sampling of organizations either leading the charge into high-performance trading or being dragged into it. They understand that the status quo won't do.

The overriding factor, and the foundational driver of competitive advantage with interdependent systems, is effective and efficient automation of all critical components of the trade lifecycle, including market data processing and distribution, risk management and order routing. Historically, a good deal of emphasis has been on straight-through processing (STP) to automate and integrate information flow within a firm, but contemporary demands emphasize other critical-path trade routes. For example, market data distribution has been a critical challenge for firms, especially as market data volumes continue to grow.

Figure 1

In its latest year-end report on market data capacity, the Financial Information Forum (FIF) reported significant volume growth across all feeds and market centers. The implication for firms is that any processing challenges in this area will only be exacerbated down the line. Other parts of the trading ecosystem face challenges as well, requiring a look at the overall risk. Our goal is to keep our overall risk profile within an acceptable range (Figure 1). So let's move from interdependency to complexity, and then examine some of the more effective techniques used to keep the trading ecosystem risk curve in line with the target risk curve.

The risk of complexity

With so many disparate systems in the trade lifecycle, the risk equation changes greatly. A single system can be readily modeled for risk. With two systems, symbiotic models can be produced that sufficiently define the risk profile. With more than two systems, the level of complexity becomes increasingly difficult to model.

If there are three systems, for example - A, B and C - then there are ten risk factors: A, B, C, AB, BC, AC, ABC, AB on C, BC on A and AC on B. Beyond that, the model becomes significantly more complex, as does the risk profile.
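For readers who want to see how quickly this combinatorial growth takes hold, here is a minimal Python sketch. The counting scheme is one reading of the enumeration above (joint-failure modes plus groups of systems acting on the remainder) and is purely illustrative.

```python
# A minimal sketch of how interaction terms multiply as systems are added.
# The counting scheme is one reading of the enumeration in the text
# (joint-failure modes plus "group acting on the remainder"); illustrative only.
from itertools import combinations

def risk_factors(systems):
    """Enumerate joint-failure modes and 'group acting on the remainder' modes."""
    n = len(systems)
    factors = []
    # Every non-empty combination of systems is a joint-failure mode: A, B, AB, ABC, ...
    for size in range(1, n + 1):
        for group in combinations(systems, size):
            factors.append("".join(group))
    # Groups of two or more systems can also act on the systems left out: "AB on C", ...
    for size in range(2, n):
        for group in combinations(systems, size):
            remainder = [s for s in systems if s not in group]
            factors.append("".join(group) + " on " + "".join(remainder))
    return factors

for names in ("ABC", "ABCD", "ABCDE"):
    modes = risk_factors(list(names))
    print(f"{len(names)} systems -> {len(modes)} risk factors")
# 3 systems -> 10 risk factors (A, B, C, AB, AC, BC, ABC, AB on C, AC on B, BC on A)
# 4 systems -> 25 and 5 systems -> 56: the model grows far faster than the system count.
```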

What elements are part of the modern trading equation? Networks (switches, routers, WAN links), messaging systems (producers, consumers, middleware, queues, servers), order management systems (OMS) and execution management systems (EMS) are just some of them. The assertion is that merely adding a single system to the trading ecosystem increases the risk not by the autonomous risk profile of that system, but potentially by orders of magnitude as that system interacts with the others.

A simple example is warranted - we'll call it "The Slow Consumer Problem." Assume market data is being distributed to 30 systems using conventional message-oriented middleware over an Ethernet network. Efficiencies have been gained by using multicast technology so that a single message can be delivered to all systems simultaneously.

When one system falls behind in processing the market data, it (the slow consumer) requests that the producer retransmit information. This, in turn, slows the producer down and forces all the other consumers to process the retransmitted information (which they will likely discard). As a result, all systems - producers and consumers - lose valuable processing cycles, and network bandwidth begins to erode.

We have research that shows the slow consumer issue is not an isolated, one-time event. Excessive requests may ultimately and dangerously impact the producer of information. This cascades down to all the other components in the trading ecosystem, ultimately resulting in such conditions as price slippage and the inability to trade profitably and effectively. Many trading environments are that fragile.
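The dynamics are easy to reproduce in a toy model. The following Python sketch is purely illustrative - the publish rates, buffer limits and retransmit sizes are invented, not measured from any real deployment - but it shows how a single lagging consumer forces retransmissions that every other consumer must absorb.

```python
# A toy, discrete-time sketch of the slow consumer problem described above.
# Rates, buffer sizes and retransmit costs are invented for illustration;
# the point is the shape of the behaviour, not the absolute numbers.

TICKS = 1_000
PUBLISH_RATE = 100          # messages the producer multicasts per tick
RETRANSMIT_BATCH = 50       # messages resent when a consumer falls too far behind
BUFFER_LIMIT = 500          # backlog at which a consumer requests retransmission

consumers = {f"consumer_{i}": {"rate": 100, "backlog": 0} for i in range(1, 31)}
consumers["consumer_30"]["rate"] = 60   # one slow consumer

published = 0
retransmitted = 0

for _ in range(TICKS):
    batch = PUBLISH_RATE
    retrans = 0
    for c in consumers.values():
        if c["backlog"] > BUFFER_LIMIT:
            retrans = RETRANSMIT_BATCH   # the laggard asks for the data again
    for c in consumers.values():
        # Everyone receives both the new batch and any retransmission,
        # even consumers that will simply discard the duplicates.
        c["backlog"] += batch + retrans
        c["backlog"] = max(0, c["backlog"] - c["rate"])
    published += batch
    retransmitted += retrans

print(f"published: {published}, retransmitted: {retransmitted}")
print(f"extra traffic forced on all 30 consumers: {100 * retransmitted / published:.0f}%")
print(f"backlog now carried by a previously healthy consumer: {consumers['consumer_1']['backlog']}")
```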

Given this less-than-desirable operational profile, there is an effective way to model the risk, but there are also distinct challenges. Chaos theory is the most appropriate model. The IT media company TechTarget provides an excellent definition:

In a scientific context, the word chaos has a slightly different meaning than it does in its general usage as a state of confusion, lacking any order. Chaos, with reference to chaos theory, refers to an apparent lack of order in a system that nevertheless obeys particular laws or rules; this understanding of chaos is synonymous with dynamical instability, a condition discovered by the physicist Henri Poincare in the early 20th century that refers to an inherent lack of predictability in some physical systems…The two main components of chaos theory are the ideas that systems - no matter how complex they may be - rely upon an underlying order, and that very simple or small systems and events can cause very complex behaviors or events.
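A concrete illustration of that last point - very simple systems producing very complex behavior - is the textbook logistic map. The short Python sketch below is not drawn from any trading system; it simply shows how two nearly identical starting conditions diverge completely under a trivially simple deterministic rule.

```python
# A minimal illustration of the "dynamical instability" referred to above:
# the logistic map, x -> r*x*(1-x), is about as simple as a system can get,
# yet two almost identical starting points diverge completely within a few
# dozen iterations. The parameter value is a standard textbook choice.

r = 3.9                      # a value in the chaotic regime
a, b = 0.500000, 0.500001    # two "identical" measurements of the same state

for step in range(1, 41):
    a, b = r * a * (1 - a), r * b * (1 - b)
    if step % 10 == 0:
        print(f"step {step:2d}: a={a:.6f}  b={b:.6f}  gap={abs(a - b):.6f}")
# By step 40 the two trajectories bear no resemblance to each other, even
# though the rule is deterministic and the initial difference was one part in a million.
```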

In his book, Chaos Theory in the Financial Markets, Dimitris M. Chorafas analyzes in great depth the role of nonlinear systems, volatility, risk and cumulative exposure, as well as cognitive models for financial operations. Dr. Chorafas states:

"An overriding need in any business is the ability to represent problem information in such a way that the full complexity and dynamic nature of the underlying structures is captured. Financial systems are no exception … As the information and the tools needed to solve prediction problems becomes more complex, it is increasingly more challenging to foresee and represent the evolving real-world situation. From risk management to generation of profits, flexibility is the cornerstone of a successful prediction process."

Dr. Chorafas' assertions address the external dynamics of financial markets. The premise of this article is that organizations must apply the same scrutiny to their internal trading systems, understanding fully that a symbiotic relationship exists between these two environments.

Several challenges exist. First, firms seldom assign their quants to develop models for internal systems. Next, the operational behavior of most of these systems is not readily quantifiable - even the application of chaos theory to weather systems has more measurable components to work with. Last, even if these two challenges were overcome, it would be extremely difficult to bridge the operational - and political - compartments involved in the trading infrastructure.

Mitigation strategies

Due to the complexity of tackling risk at the systemic level, addressing component risk first is both a pragmatic, tactical move and an effective, long-term strategy. Purists may argue that systems should be engineered from the ground up for effective operational risk mitigation, but this is nearly impossible in practice, especially given the continual changes in technology and infrastructure. Furthermore, the risk probability curve rises precipitously as components are added, so factoring out hazards in individual components will have the inverse, beneficial effect.

Is there an optimum number of systems for a trading infrastructure?

If they were all built the same way, then the answer would be an emphatic "yes." However, reality quickly sets in. One international broker-dealer we work with was able to reduce the number of FIX engines on legacy middleware from 24 down to two (with redundancy) on a modern messaging platform. Outages dropped by 87 percent while performance increased by over 300 percent.

In another example, one of our hedge fund clients detected network (and latency) deterioration after just five direct mesh connections on options processing servers. They moved to a centralized communication system yielding predictable and consistent sub-100 microsecond latency. In both of these examples, many factors were at play including data volumes, intersystem communication, excess messaging traffic, et cetera.
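The arithmetic behind the mesh example is worth spelling out. The sketch below is simple combinatorics, not data from the hedge fund in question: a full mesh requires a path between every pair of systems, and each of those paths is an adjacency that must be provisioned, monitored and failed over.

```python
# A back-of-the-envelope sketch of why direct mesh connectivity degrades so
# quickly. The figures are simple combinatorics, not measurements from the
# client examples above.

def mesh_links(n):
    """A full mesh needs a dedicated path between every pair of systems."""
    return n * (n - 1) // 2

def hub_links(n):
    """A centralized communication layer needs one path per system."""
    return n

for n in (5, 10, 24, 30):
    print(f"{n:2d} systems: mesh = {mesh_links(n):3d} paths to provision, "
          f"monitor and fail over; hub = {hub_links(n):2d}")
# Each mesh path is also an adjacency that must be sized for bursts and
# included in the risk model, which is why deterioration can show up after
# only a handful of directly connected servers.
```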

Before going through the steps, what types of risk are critical to remove from the internal trading infrastructure? There are several. The most dangerous is operational risk: the failure of a critical component in the information flow. Next is performance risk: the executional degradation of a critical component in the order flow, which may include processing slowdowns and system malfunctions that produce errant behavior. Finally, there is flexibility risk: the inability of systems to adapt dynamically to diverse market conditions. All of these ultimately aggregate into risk implications for the overall system, like being behind the market or entirely out of the market.

These risk conditions are further exacerbated by the difficulty of fault detection and isolation. The more complex the ecosystem, the more difficult the root cause analysis and the return to execution. Understand that a system that comes back up will likely be hit, for example, by the same market data deluge that sank it in the first place, much like the mutual fund company discussed at the beginning of this article.

Architecting for continuity

The goal of the architecture is to remove systemic risk while ensuring predictable, market-leading performance. Fortunately - or realistically - this work does not require a rip and replace; it can be applied to existing trading infrastructures. The following are key items to consider:

1. Build continuity into systems with the greatest number of adjacencies
Redundancy and high availability are critical requirements for all components in a mission-critical system, but even more so for those that interconnect other components. This would include liquidity connectivity, network infrastructure and messaging middleware.

It's important to note that building a mesh infrastructure as a means to resiliency can actually have the opposite effect, because risk increases when as few as five nodes are interconnected in this manner. This frequently occurs at both the network and messaging levels.
Redundancy does not mean plugging two systems into a risk-prone infrastructure - it doesn't make the risk go away. The underlying paths must mitigate the "adjacency effect."

2. Eliminate unnecessary integration
Much like a great team, each player in the trading infrastructure must play its part and play it well. Too often, technology is adopted that allows systems to do many things. The challenge is that individual component behavior becomes obscured and, along with the other functions with which it is commingled, surprisingly at risk. Only do what you need to do. This often generates conflicts, because business units prefer to build out a separate infrastructure rather than embrace the economies of scale proposed by a centralized technology group. If the latter could guarantee a specific risk profile, that would be an ideal scenario; otherwise, the savings may not outweigh the potential operational issues.

3. Optimize performance characteristics
Major technology evolution occurs every five to seven years, and when it does, the results are dramatic. Given that the benefits are often an order of magnitude higher and the cost a fraction of the previous incarnation's, adopting at the right time is important - most often when the technology has been in the market (not a lab) for about nine months, though competitive pressures also influence this number. We saw this with networks evolving to hardware, processors adding more cores and now messaging moving to silicon. It's a continual evolution requiring ongoing education and evaluation for even the savviest of firms.

4. Effectively provision management of multiple information streams
A great deal of market data and order flow is moving through the organization. It's critical that these streams do not all come together at under-provisioned rendezvous points, because volatility and micro-bursting can turn those points into information dams, slowing data flow and causing risk-inducing latency.
This is a key consideration in messaging systems for market data distribution and order routing: software-based systems will need distributed streams while hardware-based systems can handle aggregated flows. In this case, the risk profile is matched to the infrastructure capabilities.
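A toy queuing sketch makes the micro-bursting point concrete. The arrival and drain rates below are invented for illustration; note that the bursty stream averages well under capacity, yet still builds a queue - and therefore latency - at the rendezvous point.

```python
# A toy sketch of how a micro-burst turns an under-provisioned rendezvous
# point into an "information dam". Arrival and drain rates are invented; the
# aggregate stream averages well under capacity, yet a two-millisecond burst
# builds a queue that takes far longer to drain.

DRAIN_PER_MS = 1_200                 # messages/ms the rendezvous point can forward
steady = [800] * 100                 # 100 ms of steady flow, well under capacity
burst = [800] * 49 + [6_000, 6_000] + [800] * 49   # same average window with a micro-burst

def worst_case_delay(arrivals_per_ms, drain_per_ms):
    queue, worst = 0, 0.0
    for arrivals in arrivals_per_ms:
        queue = max(0, queue + arrivals - drain_per_ms)
        worst = max(worst, queue / drain_per_ms)   # ms a message waits behind the queue
    return worst

print(f"steady flow : worst queuing delay = {worst_case_delay(steady, DRAIN_PER_MS):.2f} ms")
print(f"micro-burst : worst queuing delay = {worst_case_delay(burst, DRAIN_PER_MS):.2f} ms")
```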

Achieving operational excellence

Architecture is one side of the coin; operations is the other. Operational excellence is not a one-time activity; it requires regular tuning and modification. Quantified data and the ability to measure it are critical success factors. Figure 2 highlights the reconciliation of risk with some of the requisite operational elements.

Figure 2

1. Establish operational checkpoints
Gone are the days when high-performance systems were compromised by monitoring capabilities. Establish checkpoints at the system level to mitigate the risk of component failure. Establish checkpoints at major ingress and egress points to mitigate the risk of systemic failure. Make sure that the checkpoints can report independently of being polled, especially when baseline conditions are breached.
Checkpoints need to be managed as well. Too many checkpoints - if improperly set up - can adversely affect the trading cycle by slowing systems down or creating excess network traffic.
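As a rough illustration of a self-reporting checkpoint, consider the following Python sketch. The metric names, thresholds and alert mechanism are assumptions for the purpose of the example; the key behaviors are that the checkpoint pushes an alert when its baseline is breached rather than waiting to be polled, and that it throttles its own reporting so monitoring does not become a source of excess traffic.

```python
# A minimal sketch of a self-reporting checkpoint along the lines described
# above. The thresholds, alert sink and metric names are assumptions for
# illustration only.
import time

class Checkpoint:
    def __init__(self, name, baseline, alert_sink, min_alert_gap_s=5.0):
        self.name = name
        self.baseline = baseline            # e.g. maximum acceptable latency in microseconds
        self.alert_sink = alert_sink        # callable that forwards the alert
        self.min_alert_gap_s = min_alert_gap_s
        self._last_alert = 0.0

    def observe(self, value):
        if value <= self.baseline:
            return
        now = time.monotonic()
        # Rate-limit alerts so the checkpoint does not add excess traffic of its own.
        if now - self._last_alert >= self.min_alert_gap_s:
            self._last_alert = now
            self.alert_sink(f"{self.name}: {value} breached baseline {self.baseline}")

def log_alert(message):
    print("ALERT", message)

ingress_latency = Checkpoint("feed-A ingress latency (us)", baseline=250, alert_sink=log_alert)
for sample in (120, 180, 260, 300, 240):   # hypothetical measurements
    ingress_latency.observe(sample)
```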

2. Measure, measure, measure
Having the checkpoint in place is one thing, but the performance expectations both individually and in the aggregate must also be considered. Not knowing is not an excuse. Set the criteria, establish benchmarks, validate regularly, and warn when operational thresholds are compromised. Measure data volume - not just averages but peaks as well. Measure latency across the entire trading cycle in addition to specific execution points. Measure end-to-end system performance and compare with benchmarks and trending curves. Measure server utilization rates. Measure your service providers and your trading partners. Aggregate what you measure and perform regular statistical analysis.
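The sketch below illustrates why averages alone are not enough. The sample volumes, latencies and benchmark figures are invented for illustration; the point is that peaks and tail latencies, not means, are what put a system behind the market.

```python
# A small sketch of the "measure, measure, measure" point: averages alone hide
# the peaks and tail latencies that matter. Sample data and benchmarks are invented.
import math
import statistics

# Per-second message counts over a five-minute window (hypothetical feed).
volumes = [9_000] * 295 + [55_000, 60_000, 58_000, 12_000, 9_500]
# One-way latencies in microseconds for a sample of orders (hypothetical).
latencies = [95, 101, 98, 110, 97, 102, 1_850, 99, 104, 100]

BENCHMARK_PEAK_MSGS = 40_000      # what the infrastructure was sized for
BENCHMARK_P99_US = 500            # latency budget agreed with the desk

avg_vol, peak_vol = statistics.fmean(volumes), max(volumes)
p99 = sorted(latencies)[max(0, math.ceil(0.99 * len(latencies)) - 1)]

print(f"volume  : avg {avg_vol:,.0f}/s, peak {peak_vol:,}/s "
      f"({'OVER' if peak_vol > BENCHMARK_PEAK_MSGS else 'within'} benchmark)")
print(f"latency : median {statistics.median(latencies)} us, worst {max(latencies)} us, "
      f"p99 {p99} us ({'OVER' if p99 > BENCHMARK_P99_US else 'within'} budget)")
```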

3. Isolate problems dynamically while maintaining performance
Even with proper planning, problems are inevitable. If components can be dynamically isolated while overall system performance is maintained, that's a major step forward. The slow consumer problem described earlier is an excellent example from the messaging arena; it is being solved by contemporary messaging platforms that have the intelligence to be self-isolating.
High-availability and resiliency are important, but have historically impacted performance, at least in the short term. Far better to add the requisite intelligence to prune systems (with notification, of course) while ensuring consistently high performance.
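To make the self-isolation idea concrete, here is a minimal sketch of a distribution layer that prunes a lagging consumer (with notification) once its backlog breaches a limit, so the producer and the healthy consumers keep running at full speed. The names and thresholds are assumptions for illustration, not a description of any particular vendor's platform.

```python
# A minimal sketch of the self-isolation idea: the distribution layer tracks a
# per-consumer backlog and prunes (with notification) any consumer that falls
# too far behind, instead of letting its retransmit requests drag down the
# producer and every healthy consumer. Names and thresholds are assumptions.

class Distributor:
    def __init__(self, backlog_limit, notify):
        self.backlog_limit = backlog_limit
        self.notify = notify
        self.backlogs = {}        # consumer -> undelivered message count
        self.isolated = set()

    def register(self, consumer):
        self.backlogs[consumer] = 0

    def publish(self, batch_size):
        for consumer in list(self.backlogs):
            if consumer in self.isolated:
                continue
            self.backlogs[consumer] += batch_size
            if self.backlogs[consumer] > self.backlog_limit:
                # Prune the laggard so the remaining consumers keep full performance.
                self.isolated.add(consumer)
                self.notify(f"{consumer} isolated at backlog {self.backlogs[consumer]}")

    def ack(self, consumer, delivered):
        self.backlogs[consumer] = max(0, self.backlogs[consumer] - delivered)

dist = Distributor(backlog_limit=1_000, notify=lambda msg: print("OPS:", msg))
for name in ("algo_1", "algo_2", "slow_box"):
    dist.register(name)

for _ in range(20):
    dist.publish(100)                     # multicast a batch of 100 messages
    dist.ack("algo_1", 100)               # healthy consumers keep up
    dist.ack("algo_2", 100)
    dist.ack("slow_box", 40)              # the slow consumer drains only 40

print("still receiving:", sorted(set(dist.backlogs) - dist.isolated))
```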

Continuity is key

The cost of a lapse in operational continuity in today's high-performance trading infrastructures is too high to leave things to chance. Numerous and diverse systems create complex interdependencies with complicated risk profiles. The most effective means to model this risk is with chaos theory, but it becomes impractical in real-world environments. Capriciously adding new technology may not mitigate the perils either.

Fortunately, risk can be driven down substantially by addressing continuity factors in key, individual components, especially those that have a large number of adjacencies. Architectural improvement can be done on both new and, more likely, existing trading platforms. Major technology innovations play a key role here. Architecture progression in isolation is not enough; operational controls and metrics must be established as well. Though we'll never entirely remove all risk, we can certainly reduce the chance of systemic failure by orders of magnitude. That will keep savvy firms both in and ahead of the market.