Bridge over troubled WANs

Published in Automated Trader Magazine Issue 15 Q4 2009

If you’re running data over a world-sized wide-area network, latency is a function of geography. And in the absence of radical downsizing, geography happens to be an irreducible constant. And that’s not good. Shawn McAllister, Chief Architect, Solace Systems, discusses the latest in wide-area bridge-building techniques.

Many low-latency financial-services architects prefer using in-memory databases instead of traditional disc-based databases to accelerate data manipulation and lookups. Sophisticated products have emerged that can cache data in data grids (also known as "spaces") and efficiently flush updates out to databases as needed. This combines the performance of a cache with the persistence of a database.
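
As a rough illustration of that pattern (and not of any particular vendor's API), the following sketch keeps objects in an in-memory map and flushes updates to a backing store asynchronously; the class, the persist callback and all names are hypothetical.

import java.util.Map;
import java.util.concurrent.*;
import java.util.function.BiConsumer;

// Minimal write-behind cache sketch: reads and writes hit memory at cache
// speed, while a background thread flushes updates to a backing store
// (e.g. a JDBC upsert) so the data also ends up persisted.
public class WriteBehindCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final BlockingQueue<K> dirtyKeys = new LinkedBlockingQueue<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    public WriteBehindCache(BiConsumer<K, V> persist) {
        flusher.submit(() -> {
            try {
                while (true) {
                    K key = dirtyKeys.take();          // block until an update is queued
                    V value = cache.get(key);
                    if (value != null) {
                        persist.accept(key, value);    // flush to the database; batched in practice
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // shutdown requested
            }
        });
    }

    public V get(K key)             { return cache.get(key); }   // memory-speed lookup
    public void put(K key, V value) { cache.put(key, value); dirtyKeys.add(key); }
    public void shutdown()          { flusher.shutdownNow(); }
}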

Frequently, though, the data in one space needs to be shared with data in another space. This is relatively easy over a high-speed LAN, but challenging over a WAN. For example, suppose your risk policies require you to consolidate your active positions between algorithmic trading engines in New York, London and Tokyo. Your ultra-low latency sub-millisecond space-based applications are suddenly separated by a 50 or 100 millisecond WAN link. What does low latency mean in such a configuration? A space can do thousands of updates per second on a LAN, but how many can be achieved over a WAN? How can you avoid traffic between spaces falling behind the rate of updates within each space?

Space bridging is a new technique that enables application traffic to flow across the WAN between clustered computing environments and in-memory data grids (IMDG). As I will describe, space bridging can avoid or overcome issues such as constrained bandwidth, latency from geographic distance, speed mismatches, inconsistent availability, and the need for security. I will also describe the architectural considerations, along with the requisite software and hardware assets, that go into linking clustered computing environments with distributed space bridging.

Challenges with bridging applications over the WAN

I shall start with the challenges, taking them one at a time.

Round-trip latency. There are two ways WAN links add latency. The first is sheer distance, as even at the speed of light it takes time for data to traverse long distances. The second is the overhead of sending packets through intermediate routers. Between these two factors it's not uncommon for packets to have a round trip time (RTT) of 10 milliseconds between disaster recovery sites, and latency for a transcontinental connection across North America is around 50 milliseconds.

This means that RTT latency can hinder the transfer rate even if there is sufficient bandwidth available to support higher transfer rates. As a result, any synchronous exchange of information over a WAN has to wait at least as long as the round trip time. For example, two sites with 10ms round trip time can only communicate 100 times per second even though the same hardware and software on a LAN can operate hundreds or even thousands of times faster. For asynchronous exchanges of information, TCP/IP's sliding window protocol adds acknowledgement delay which can limit the maximum throughput of a single connection.
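
The arithmetic behind those limits is easy to work through. The short example below, using the 10 millisecond round trip time mentioned above and an assumed classic 64 KB TCP window, shows both the synchronous request/reply ceiling and the sliding-window throughput cap of a single connection.

// Back-of-the-envelope WAN limits for an assumed 10 ms round trip time.
public class WanLatencyLimits {
    public static void main(String[] args) {
        double rttSeconds = 0.010;                       // 10 ms round trip time

        // A synchronous request/reply exchange completes at most once per RTT.
        double maxSyncOpsPerSec = 1.0 / rttSeconds;      // = 100 operations per second

        // A single TCP connection is capped at roughly (window size / RTT),
        // no matter how much bandwidth the link itself offers.
        double windowBytes = 64 * 1024;                  // classic 64 KB receive window
        double maxMbitPerSec = (windowBytes / rttSeconds) * 8 / 1_000_000;

        System.out.printf("Synchronous ops/sec:        %.0f%n", maxSyncOpsPerSec);
        System.out.printf("Single-connection ceiling:  %.1f Mbit/s%n", maxMbitPerSec);
    }
}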

Chart A

Limited bandwidth and compute capacity. Bandwidth isn't free, and in many cases it's not even abundant. A spike of updates that can easily be handled in a LAN-based cluster can quickly exceed the bandwidth of a WAN link. In cloud computing environments there is a lag time between demand and elastic compute capacity. Space bridging can buffer sudden bursts of transactions and allow time for compute capacity to become available, without data loss.

It's also important to remember that other applications frequently share the bandwidth of a WAN link, potentially impacting inter-application communications. If a company shares a WAN connection for email, FTP, Web traffic, VOIP, etc. then care must be taken to avoid negatively impacting these other critical business functions, and vice versa.

Firewalls and undetermined IP addresses. Hosts are often behind firewalls that mask IP addresses, use unregistered IP addresses, or use Network Address Translation (NAT) to modify address information. Multicast traffic typically does not get forwarded across WAN/LAN boundaries, which complicates dynamic discovery. Increasingly, processes run in virtual machines with virtual network interfaces, or in cloud computing machine instances that are started on demand with unpredictable host and network identifiers. These complications make it difficult or impossible for clusters (or nodes within a cluster) to dynamically find one another. When IP addresses are reused across organizational boundaries, or created on demand, they cannot be statically embedded in code or configuration either.

Infinite loops and resonance. When multiple sites are actively providing shared updates, care must be taken to ensure these updates do not form a feedback loop or resonance effect that causes duplication of transactions or a never-ending reapplication of the same updates between two sites. If two sites simply replicate their own local updates across the WAN to each other, the two sites can end up with different values for the same objects as the result of a race condition whereby updates inappropriately overwrite each other.

These race conditions are made more likely by the long round trip latency and low bandwidth of the WAN environment. When the topology expands to three or more distributed clusters, each updating the others in a network of sites, keeping the data consistent and correct becomes a very difficult problem. Networks can also become partitioned into a "split-brain" scenario in which each site believes itself to be active and the others to be down. When the flow of data is restored, a flood of updates fights to be the last and sole survivor, and data can end up incorrect.

Multi-site distribution. Mirroring two sites is complex in its own right, but adding a third increases that complexity exponentially. If you want updates from a given cluster to be sent to two or more remote clusters over the WAN and then persisted in a local database for fast reloads and secure storage, this one-to-many distribution must be addressed. Updates can (and should) be delivered to each receiver (remote cluster or local database) at the rate the receiver can handle and at the times when the receiver is connected. These speed and connectivity mismatches across the various receivers, combined with the requirement to deliver every update in order and without loss, are one of the main reasons that message-oriented middleware (MOM) systems were first invented. Recreating decades of message-queuing code and best practice directly in application logic is time and effort better spent on business logic and application code.
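
To make the speed-mismatch point concrete, here is a minimal sketch of one-to-many delivery using an independent FIFO queue per receiver, so a slow remote cluster or database writer does not hold the others back. The interfaces and names are illustrative only; a real MOM system would additionally spool to persistent storage rather than block when a queue fills.

import java.util.List;
import java.util.concurrent.*;
import java.util.function.Consumer;

// One bounded FIFO queue per receiver: each remote cluster or local
// database drains at its own pace, while ordering is preserved per queue.
public class FanOutPublisher<T> {
    private final List<BlockingQueue<T>> queues = new CopyOnWriteArrayList<>();
    private final ExecutorService workers = Executors.newCachedThreadPool();

    // Register a receiver (e.g. a remote cluster bridge or a database writer).
    public void addReceiver(Consumer<T> receiver, int capacity) {
        BlockingQueue<T> queue = new LinkedBlockingQueue<>(capacity);
        queues.add(queue);
        workers.submit(() -> {
            try {
                while (true) {
                    receiver.accept(queue.take());   // delivered in FIFO order
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Publish one update to every receiver's queue; blocks if a queue is full,
    // which is where a real MOM would spool to disk instead of blocking.
    public void publish(T update) throws InterruptedException {
        for (BlockingQueue<T> queue : queues) {
            queue.put(update);
        }
    }
}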

Space-bridging components and their capabilities

A typical bridging solution will consist of the following components:

1. An API to receive transactions from a local in-memory data grid and package them for sending to one or more remote in-memory data grids (shown in Chart A) via a network of intermediary hardware devices. It should be capable of distinguishing local transactions (which should be replicated across the WAN) from those that may have originated elsewhere (and should not be redundantly reflected back to the source in-memory data grid).

2. Software to receive a stream of transactions sent from a source IMDG through the message queue(s) of a hardware-based middleware device and execute the appropriate operation (write, take, or update) on the remote IMDG (again, shown in Chart A); a minimal sketch of this receive-and-apply step follows the list.

3. A messaging appliance to act as a guaranteed message-oriented middleware (MOM) server for in-order (first-in, first-out) asynchronous message queuing. It should make sure messages are never lost, even in the case of a crashed server or network outage, and it should transmit and persist data even when one or both of the software elements described above are not running.
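
A minimal sketch of the second component, under the assumption of a hypothetical Space interface offering write/take/update operations: each update carries an origin-site tag, so updates that originated locally are not reapplied (avoiding the feedback loops described earlier), while everything else is executed against the local grid. All type and method names here are illustrative rather than any specific product's API.

// Hypothetical update envelope carried over the WAN queue: the origin-site
// tag is what lets the bridge avoid reflecting updates back to their source.
record BridgedUpdate(String originSite, String operation, String key, Object value) {}

// Minimal stand-in for a local in-memory data grid.
interface Space {
    void write(String key, Object value);
    void take(String key);
    void update(String key, Object value);
}

public class SpaceBridgeReceiver {
    private final String localSite;
    private final Space localSpace;

    public SpaceBridgeReceiver(String localSite, Space localSpace) {
        this.localSite = localSite;
        this.localSpace = localSpace;
    }

    // Called for each message drained from the MOM queue.
    public void onUpdate(BridgedUpdate u) {
        if (localSite.equals(u.originSite())) {
            return;   // originated here: do not reapply, avoiding an infinite loop
        }
        switch (u.operation()) {
            case "write"  -> localSpace.write(u.key(), u.value());
            case "take"   -> localSpace.take(u.key());
            case "update" -> localSpace.update(u.key(), u.value());
            default       -> throw new IllegalArgumentException("Unknown op: " + u.operation());
        }
    }
}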

Chart B: typical update rates for synchronous WAN communications with software and with asynchronous transfers with hardware-based middleware.

Achieving the benefits of space bridging

High levels of performance and low latency are achieved by putting large parts of the data path and processing in hardware rather than software. The benefits of this kind of hardware implementation over traditional software are:

• Multi-Tenant - Isolation of users from one another, so that no single user has a significant impact on the other users concurrently using the system.
• In-Memory - Breaking through the speed limitations of physically spinning hard disks by combining memory caching with intelligent batching of persistence I/O (reducing or eliminating round trips to disk), so that spikes of updates from an IMDG can be absorbed while still providing persistent storage.
• Wire Speed - The ability to perform wire-speed enhancements such as compression, tagging, mirroring, routing, and filtering of message streams.

When incoming data rates exceed WAN capacity, data is buffered in a failsafe manner to a combination of redundant memory and SAN storage. This allows the system to survive spikes in data volume without loss of data. Chart B shows typical update rates for synchronous WAN communications with software and with asynchronous transfers with hardware-based middleware.

In a WAN environment, two or more messaging appliances can act as a distributed message queue and provide a number of additional performance enhancing functions such as:

Non-Blocking Replication - Asynchronous (non-blocking) spooling of messages using publish/subscribe (rather than synchronous request/reply) to avoid the penalty imposed by the high round trip time from site to site over the WAN.
TCP Optimizations - Optimizations such as large TCP receive windows are used to maximize the throughput of individual TCP connections over high-bandwidth links.
Streaming Compression - The ability to perform per-connection, streaming payload compression in hardware frequently provides an 80% compression ratio, thereby saving bandwidth and, interestingly enough, reducing the latency of large message transfers, especially over low-bandwidth links (a software illustration of the idea follows this list).
Striped TCP Connections - Striping communications across multiple TCP/IP socket connections to reduce the limitations of the TCP/IP sliding window over high-latency networks.
Edge Cache - Pre-fetching and pre-emptive distribution of persistent data (as in edge computing or content delivery networks) so data is replicated across the WAN prior to system startup.
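
The appliance performs compression in hardware; purely to illustrate the streaming-compression idea in software, the sketch below wraps a socket's streams with java.util.zip deflate/inflate and sync-flushes after each message so the receiver can decode it immediately.

import java.io.*;
import java.net.Socket;
import java.util.zip.*;

// Illustration only: per-connection streaming compression in software.
// An appliance would perform the equivalent at wire speed in hardware.
public class CompressedConnection implements Closeable {
    private final DeflaterOutputStream out;
    private final InflaterInputStream in;
    private final Socket socket;

    public CompressedConnection(Socket socket) throws IOException {
        this.socket = socket;
        // syncFlush = true lets each message be flushed without closing the stream.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        this.out = new DeflaterOutputStream(socket.getOutputStream(), deflater, 8192, true);
        this.in = new InflaterInputStream(socket.getInputStream());
    }

    public void send(byte[] message) throws IOException {
        out.write(message);
        out.flush();             // SYNC_FLUSH so the receiver can decode now
    }

    public int receive(byte[] buffer) throws IOException {
        return in.read(buffer);  // decompresses transparently
    }

    @Override public void close() throws IOException { out.close(); in.close(); socket.close(); }
}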

Collectively, these features can ensure that the limiting factor in performance is bandwidth instead of round trip time, and that network bandwidth is used in an efficient manner to transfer application data, rather than protocol overhead.

Network efficiency and optimisation

Your WAN is a critical piece of your corporate infrastructure, and it is often shared with other applications. Space bridging can optimize your WAN in the following ways:

Filtering - Serialized Java data objects are tagged with small amounts of metadata, which the content routers inspect and use to prevent unwanted or unnecessary updates from flowing across WAN links, so duplicate updates are blocked at the source and there is no risk of infinite loops.
Compression - Hardware-based compression ensures that the compression engine is not a performance bottleneck and can reduce payloads to roughly 20% of their original size.
Duplicate Data Reduction - Data updates are sent only once per link even when there are multiple receivers at the far end of the link. Remote routers keep track of subscription counts and duplicate the updates for local receivers (as well as guarantee in-order delivery) at the far end of the WAN link.
Bandwidth Throttling - Tunnels all WAN traffic through a configurable set of TCP/IP sockets and ports. In conjunction with your existing IP routers and switches, these ports can be rate limited so that they do not use 100% of the WAN bandwidth at the expense of other applications (a simple software illustration of rate limiting follows).
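
Rate limiting is typically enforced by the routers and switches themselves, as noted above, but the underlying idea can be sketched in software as a simple token bucket that caps the bytes sent per second on a bridge connection; this is a conceptual illustration, not part of the appliance described.

// Simple token-bucket throttle: refills at 'bytesPerSecond' and makes the
// sender wait when the budget for the current interval is exhausted.
public class BandwidthThrottle {
    private final long bytesPerSecond;
    private double availableBytes;
    private long lastRefillNanos = System.nanoTime();

    public BandwidthThrottle(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;
    }

    // Call before writing 'size' bytes to the WAN socket.
    public synchronized void acquire(int size) throws InterruptedException {
        refill();
        while (availableBytes < size) {
            wait(10);        // back off briefly, then re-check the budget
            refill();
        }
        availableBytes -= size;
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        availableBytes = Math.min(bytesPerSecond, availableBytes + elapsedSeconds * bytesPerSecond);
        lastRefillNanos = now;
    }
}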

High availability

Since the motivation for mirroring clusters across a WAN is often disaster recovery, it becomes critical that any additional components enhance the overall reliability of the system. They should support the following capabilities:

Active/Active Message Store - All mirrored messages should be stored in a fault-tolerant pair of active/active message queues. Should a router or any software component fail, all queued messages are recoverable.
Backup Energy - The appliances should contain a built-in energy storage system that provides enough power to flush all critical message buffers to removable compact flash media, so that even in the event of a total loss of power to the system, no messages are lost.
High Speed Replication - Increasing the rate at which data can be replicated across the WAN means a correspondingly smaller window of lost transactions, even in the event of catastrophic loss of an entire data center.

Monitoring and management

The solution should feature graphical tools that enable real-time monitoring and configuration of all components in a running system, including statistics such as queue depths and spool usage, message rates, active/inactive client connections, TCP RTT, lost packets, retransmits, etc.

All of this information should ideally be available via command line interface (CLI), Java API, and as a web service.

The hardware element of the system should also be able to be monitored like any other network device via SNMP or Syslog, so alarms can be raised when there are problems with connectivity or when queues reach pre-defined thresholds.

Security

Access to the MOM system should be secured by several configurable access and authorization methods, including basic login/password and full integration with carrier-class third-party AAA systems such as those implementing LDAP or RADIUS. Users should be assigned to closed user groups (CUGs), which partition the network into VPNs. Automatic hardware-accelerated data encryption and compression should be easy to enable within the network.