The Gateway to Algorithmic and Automated Trading

Hard and Fast?

Published in Automated Trader Magazine Issue 12 Q1 2009

This is an extended version of the Tech Forum that appeared in the Q1 2009 edition of Automated Trader. It includes an additional interviewee and expanded answers from all interviewees on the latest techniques for hardware and networking infrastructures.

  • With:
  • Pat Aughavin, Senior Business Development Director, Financial Services, AMD
  • Vincent Berkhout, Client Engagement Director, COLT
  • Michael Cooper, Head of Product Technology, BT Global Financial Services
  • Andrew Graham, IT Architect, Financial Markets, IBM UK
  • Shawn McAllister, Vice President, Architecture, Solace Systems
  • Parm Sangha, Business Development Manager - Financial Services Industry Solutions, Cisco Systems
  • Geno Valente, Vice President, Marketing and Sales, XtremeData, Inc
  • Nigel Woodward, Head of Financial Services, Europe, Intel

What developments in processing capabilities should firms adopt to support algo/auto trading?

Aughavin: While many companies are investigating parallel programming, they are proceeding methodically because it can be difficult to maintain and support. However, companies recognise the potential of accelerated computing and how it can reduce power consumption and ease infrastructure complexity. In recent months, select companies have launched accelerated computing initiatives which are specifically designed to help technology partners deliver open, flexible and scalable silicon designs. These solutions can significantly boost performance in compute-intensive applications. A key part of such solutions is a stable platform which will help foster dynamic development, enabling technological differentiation that is not economically disruptive at a time when accelerated computing is moving to the mainstream.

Cooper: Alternatives to traditional horizontal and technology upgrade approaches are beginning to emerge that address complex event processing (CEP), capacity and performance requirements. Network-attached compute appliances seek to address processing capacity and performance by offloading processing from existing systems to an optimised appliance. Additionally, some of these appliances mitigate the overheads frequently incurred with platform and technology upgrades by minimising systems modifications and application development. In addition to meeting existing application performance requirements, these appliances can service multiple systems while providing significant scalability and capacity for growth. They also address other issues like power consumption and cooling requirements, are relatively straightforward to deploy and can prolong the life of the existing systems estate. As a consequence they enable new approaches to be developed and new functionality to be supported that would not have been feasible on existing platforms.

Graham: The need to analyse applications to ensure software is designed to exploit multi-core/multi-threaded technology safely is ever more important. A balanced solution stack must always be considered; the old adage that fixing one bottleneck will only move it to another part of the system still holds true. That said, emerging technologies include:
• Offload engines/accelerators to perform XML transformations, security processing acceleration, FIX/FIXML acceleration, market data feeds optimisation, TCP offload engines (TOEs) and hardware devices such as graphics processing units (GPUs), field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs) and cell broadband engines;
• Accelerated migration of applications from 32 to 64 bit hardware and operating systems, to exploit in-memory databases and larger histories;
• Streaming/event-based technologies with the option to perform more complex processing are gaining significant traction, often blended with column orientation over row orientation (a small illustration follows this list);
• The need for predictive quality of service across the architecture, driving real-time Java solutions, real-time extensions to Linux, and dedicated or highly-managed networks; and
• Daemonless low-latency middleware that exploits true multicast networking.
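
To make the column-orientation point above concrete, here is a minimal Python sketch (illustrative only; the symbols and figures are invented and no particular vendor's engine is implied). It computes the same volume-weighted average price from row-oriented records and from column-oriented arrays; the column layout lets the calculation scan only the fields it needs.

    # Minimal sketch: the same tick history held row-wise and column-wise.
    # Column orientation keeps each field contiguous, so aggregates touch
    # only the data they need.
    import array

    # Row orientation: one record per tick
    rows = [
        {"symbol": "XYZ", "price": 101.5, "size": 200},
        {"symbol": "XYZ", "price": 101.7, "size": 100},
        {"symbol": "XYZ", "price": 101.6, "size": 300},
    ]

    # Column orientation: one contiguous array per field
    prices = array.array("d", [101.5, 101.7, 101.6])
    sizes = array.array("i", [200, 100, 300])

    # Volume-weighted average price, row-wise: every record is visited in full
    vwap_rows = sum(r["price"] * r["size"] for r in rows) / sum(r["size"] for r in rows)

    # Column-wise: only the two relevant columns are scanned
    vwap_cols = sum(p * s for p, s in zip(prices, sizes)) / sum(sizes)

    assert abs(vwap_rows - vwap_cols) < 1e-9
    print(round(vwap_cols, 3))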

Parm Sangha, Cisco Systems

"The organisation needs to be able to monitor data and message flow in order to determine where bottlenecks may be occurring and what to do about them."

McAllister: Hardware infrastructure solutions combine the best of network and content-processing hardware advances to accelerate data delivery, routing and transformation in support of algorithmic and other applications. These solutions use FPGA, ASIC and network processor-based systems to move content processing into silicon which improves uptime and delivery rates by parallelising processing and eliminating the unpredictability of software on servers. With data volumes increasing exponentially and buy- and sell-side firms struggling with increased complexity, latency and unpredictability in their software infrastructures, hardware solutions can deliver an order of magnitude greater throughput while guaranteeing ultra-low, consistent latency.

Sangha: Improving processing capabilities requires a combination of investing in faster processors and making existing server investments work harder through server and application virtualisation and intelligent routing of workload between servers via a high performance trading (HPT) network. As data traverses the different components of a trading platform - including market data delivery, order routing and execution - the HPT infrastructure should not only provide a lowest latency interconnect at each component but, at the same time, allow server CPUs to dedicate more capacity to the application, by offloading networking traffic processing to the switching fabric. Applications can also take advantage of multi-core CPUs when the underlying operating system is designed to support virtualisation and different processes are applied to different CPU cores.

Valente: CPUs are not getting faster and twice as many cores do not make the system twice as fast. FPGAs are still getting faster every generation and doubling in size every 18 months, so the performance gap will actually widen in future. Unfortunately, there is a very fine line between the time it takes to develop, test and deploy new algorithms, and the latency/performance benefits that one can get with exotic technologies. Tomorrow's successful approaches must be multi-threaded in nature, but targeted to any technology at compile time (i.e. quad-core now, octal-core or FPGAs later). New API layers are helping developers move from technology to technology faster, regardless of the original or future target platform.
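
As a rough illustration of the "write once, retarget later" idea, the sketch below expresses a compute kernel independently of its execution back-end and selects the back-end at start-up. Everything here is a hypothetical placeholder in Python - a production system would target real accelerators through a vendor API layer - but it shows the shape of keeping algorithm code separate from the target technology.

    # Minimal sketch: the kernel is written once; the back-end (serial now,
    # thread pool or a future accelerator later) is chosen at start-up.
    from concurrent.futures import ThreadPoolExecutor

    def kernel(chunk):
        # Placeholder compute: a simple average per chunk of prices
        return sum(chunk) / len(chunk)

    def run_serial(chunks):
        return [kernel(c) for c in chunks]

    def run_threads(chunks, workers=4):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(kernel, chunks))

    BACKENDS = {"serial": run_serial, "threads": run_threads}

    if __name__ == "__main__":
        data = [[float(i + j) for j in range(1000)] for i in range(8)]
        backend = BACKENDS["threads"]   # swap to "serial" (or a future accelerator) here
        print(backend(data))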

Woodward: Processors are now available in single-, dual- and multi-core versions with multi-sockets. Both the operating system and the application software have to be designed to take advantage of this processor layer. Also, various acceleration technologies embedded in the hardware can increase the performance of I/O-sensitive applications. Generally, the focus should be on newer infrastructure technologies and tuning, e.g. Ethernet networking in ten gigabits and Infiniband.

Geno Valente, XtremeData, Inc

"CPUs are not getting faster and twice as many cores do not make the system twice as fast."

How should firms handle both the increased diversity and sheer quantity of market data in algo/auto trading?

Cooper: Basic connectivity is clearly an issue: more sources require more connectivity. Frequent changes and systems test acceptance procedures complicate the situation: for example, how do you validate new data and test systems integration? Seeking to address the problem through point-to-point connectivity will lead to scalability and integration issues, high costs and inefficient use of infrastructure, and is an inflexible approach that precludes rapid adoption, testing and validation. Consumers of multiple data sources need to identify service providers who can provide connectivity (with the appropriate performance attributes) to multiple sources and are able to provide flexible solutions that support rapid connection to new sources, new applications and new data services.

Graham: Consideration of an enterprise metadata model for structured and unstructured data is important in terms of management and exploitation by numerous applications and people. Consolidation of feed adapters and exploitation of multicast technologies will help alleviate some overheads. Edge-of-domain performant technologies that filter the signal from the noise would help too.

Data caching and data grid technologies that handle the replication of data to many nodes are worth considering, feeding in-memory databases and specialist compute engines. Systems are being developed which are an execution platform for user-developed applications that ingest, filter, analyse and correlate potentially massive volumes of continuous data streams. They support the composition of new applications in the form of stream-processing graphs that can be created on the fly, mapped to a variety of hardware configurations and adapted as requests come and go, and as relative priorities shift. Systems are being designed to acquire, analyse, interpret and organise continuous streams on a single processing node, and scale to high-performance clusters of hundreds of processing nodes in order to extract knowledge and information from potentially enormous volumes and varieties of continuous data streams.
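
A minimal sketch of the stream-processing-graph idea, using plain Python generators: stages for ingest, filtering and analysis are composed into a graph that processes ticks as they arrive. The tick data and window size are invented, and real streaming platforms distribute such graphs across many processing nodes; this only shows the shape.

    # Composable stages: ingest -> filter -> analyse, wired into one graph.
    def ingest(ticks):
        for t in ticks:
            yield t

    def keep_symbol(stream, symbol):
        for t in stream:
            if t["symbol"] == symbol:
                yield t

    def rolling_mean(stream, window=3):
        buf = []
        for t in stream:
            buf.append(t["price"])
            if len(buf) > window:
                buf.pop(0)
            yield {"symbol": t["symbol"], "mean": sum(buf) / len(buf)}

    ticks = [
        {"symbol": "ABC", "price": 10.0},
        {"symbol": "XYZ", "price": 101.5},
        {"symbol": "XYZ", "price": 101.7},
        {"symbol": "XYZ", "price": 101.6},
    ]

    graph = rolling_mean(keep_symbol(ingest(ticks), "XYZ"))
    for out in graph:
        print(out)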

McAllister: Hardware infrastructure can improve throughput and deliver low, predictable latency allowing firms to easily scale as data rates scale. Additional hardware can perform tasks such as data inspection, filtering, routing, transformation, compression and security to allow algorithms to include market data, algorithmic news, research data and more at rates 100 or more times faster than software and servers. Today, firms typically have a separate infrastructure for their market data, their reference data and their back office SOA-style systems. Hardware solutions allow all classes of service to be combined under a single API without impacting performance on the highest end systems. Many parallel infrastructures can be consolidated into a single infrastructure with better performance and reliability than the stove-piped systems we have today.

Sangha: This is not just a matter of simply keeping pace but of competitive advantage. While processing power is important, it is vital that the network infrastructure is designed to support these ever-increasing volumes of data. Low latency, high throughput and deterministic behaviour are key attributes across the entire trading cycle. Consolidated information feeds provide key insight for algorithmic trading engines and human traders alike, but in the automated trading environment now common in equities and derivatives, fast, direct feeds with latency measured in milliseconds and microseconds are key. Applications like market data feed-handlers need to be matched to the appropriate networking technology and built upon the most appropriate processor and operating system. Vendors such as Cisco, Reuters, Sun and Intel have joined forces to ensure an integrated approach is tested and ready to deploy. High performance 'cells' that are optimised for just such a demanding environment can sit alongside the standard infrastructure, where the 'need for speed' is less acute.

Woodward: Firms should look at new market data technologies, e.g. compression using the FIX FAST protocol or CEP engines to handle and analyse various data forms. Data caches such as Gemstone, Gigaspaces and Tangosol can store and feed real-time data to CEP engines. In addition, dashboard and analytical tools exist for analysing market data, such as Xenomorph and Kx Systems. New approaches to storage include offline archives from major vendors such as EMC and NetApp, and from niche players such as Copan.

How are hardware/networking firms addressing the lead lag in clients' purchasing cycles?

Berkhout: Early engagement ensures that integral parts of network design, such as proximity requirements, can be incorporated at the outset. For example, rather than connecting a server farm directly to the nearest exchange, it could be connected directly to the data centre hosting both the proximity services and an exchange. This avoids unnecessary reroutes through the metro network.

Graham: Each firm has its own buying cycle and of course there will be a lag between procurement and deployment. However, these lags are reducing and there is pressure on firms to compete against each other using technology deployment as a weapon of competitive edge. By providing a constantly-updating stream of technology options, vendors can address the requirements according to the individual firm's cycle.

McAllister: The best way for hardware-based solution providers to deal with deployment lag is to provide a modular industry or de facto standards-based solution that can evolve. For example, a customer may deploy gigabit Ethernet or Infiniband while planning to migrate towards 10 gigabit Ethernet. The hardware solution must provide customers with a path to deploy on one network infrastructure today, and adapt to changes as they are adopted by each customer. Similarly, allowing applications written to industry standard messaging APIs to run unchanged over hardware instead of software will eliminate the time-consuming step of retooling and retesting applications when increasing capacity. Hardware products use a modular design so that clients can choose whether they want messaging, persistence, transformation or event-processing by simply installing new blades into an expandable hardware chassis.

Sangha: I don't think that financial sector companies are implementing out-of-date technology. Financial services companies invest in IT that meets their needs. If their need is for speed then they will be looking at the latest, fastest processors and networking equipment. If speed is not their primary need then they may well wait for new technology to prove itself and become more affordable before investing. This time lag doesn't make the technology out of date, rather that the investment is fit for purpose.

Woodward: At Intel, we have opened our Low Latency Lab to enable tuning and proof of concepts on the latest technologies. This can fast-track adoption of the latest technologies by enabling the testing of new combinations of infrastructure elements. While this will not necessarily accelerate adoption times, it gives a low-risk path to testing the new technologies more quickly than environments can be provisioned inside the firm.

How can firms optimise hardware deployment to overcome problems with power-to-rack ratios at popular co-location centres?

Aughavin: Virtualisation is one of the emerging trends which firms are using to run multiple systems and applications on the same server, the benefits being simpler deployment, an elegant scalable architecture and a more efficient use of computing resources. One approach is to deploy architectures that circumvent the front-side bus and its bottlenecks, enabling efficient partitioning and memory access to and from the processing cores.

Berkhout: While some co-location centres have been chosen as an immediate solution for low-latency trading, in the medium term financial institutions will need to be more selective over their chosen sites, locations and the power options available to them. One solution is 'long lining', i.e. dedicated connectivity to alternative locations nearby with more power and, equally important, ample cooling.

Cooper: The adoption of new systems with lower power usage and reduced cooling requirements is certainly one approach. In addition, consideration should be given to the use of appliances that provide functionality that can be used to support multiple systems, e.g. compute, I/O and storage appliances.

Graham: Mixing workloads within a rack will help balance the distribution of power input/heat output within a rack and across the data centre - often this requires organisational change since workloads are often arranged by line of business. New techniques including robot-assisted approaches can model and analyse the thermal distribution within a data centre to enable a more informed distribution of workloads. Blade-based solutions, with more efficient power supply arrangements than rack-mounted servers, are also worth considering; tie this in with active power management software that controls the hardware in real time in response to workload demands. To raise utilisation rates and hence attempt to reduce the overall data centre footprint, virtualisation technologies have a place for some workloads, whether through hardware or software hypervisor implementations. For 'hotspots', rear-door water cooling can help address the cooling dimension, with 50 per cent of rear-door heat being removed through the low pressure system in the door.

McAllister: A single redundant pair of hardware-based nodes in a content infrastructure can handle the workload of 20-60 equivalent general purpose servers running software middleware. Software solutions have widely variable latency characteristics as volumes increase, leading many firms to deploy infrastructures that can handle five or more times market peaks. This can literally mean hundreds or thousands of servers that are very lightly loaded under normal market conditions, just to assure reasonable performance during trading spikes. Hardware solutions can handle many more connections, greater throughput and perform consistently as loading limits are approached. Additionally, a single hardware infrastructure can combine the requirements of low-latency middleware and persistent messaging for tasks such as order routing without impact on performance.
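
A back-of-envelope version of the head-room arithmetic described above, with purely illustrative figures (the message rates and per-server capacities are assumptions, not benchmarks):

    # Illustrative sizing only - all numbers are made up for the example.
    normal_peak_msgs = 500_000          # messages/sec at a normal market peak
    headroom_factor = 5                 # provision for 5x peaks
    per_server_capacity = 50_000        # sustainable messages/sec per software server

    required_capacity = normal_peak_msgs * headroom_factor
    servers_needed = -(-required_capacity // per_server_capacity)   # ceiling division
    print(f"software middleware: {servers_needed} servers for {required_capacity:,} msgs/sec")

    # If one redundant pair of hardware nodes replaces the work of roughly
    # 20-60 servers, the same head-room collapses into a handful of devices.
    print(f"equivalent hardware pairs (at 50 servers/pair): {-(-servers_needed // 50)}")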

Sangha: Virtualisation is delivering power consumption as well as performance benefits. Driving existing hardware harder through virtualisation reduces power consumption by reducing the need to turn on or invest in extra servers to meet service level agreements during trading spikes. Firms can benefit from a reduced footprint in their data centres, as well as reduced power and HVAC (heating ventilation & air conditioning).

Woodward: The energy consumption of processors is reducing dramatically. Firms should pressurise their co-location vendors on the range of technical configurations being offered and data centre design to ensure optimum service. They should also tune their applications at the time of relocation, making sure performance is optimised, rather than simply re-hosting existing software. For example, simple use of processor-based acceleration has been proved to improve performance of FIX throughput, and code optimisation is usually expected to contribute at least five per cent gains, reflected in both speed and scale of business process and energy consumption.

Pat Aughavin, AMD

"Virtualisation is one of the emerging trends which firms are using to run multiple systems and applications on the same server, …"

Is parallel processing fundamental to facilitating algo/auto trading? If so, what technologies should firms deploy?

Aughavin: Yes, and many of the technologies that companies should be looking to deploy, such as FPGAs, GPGPUs and multi-core processing, are being evaluated. From a GPU perspective, the bottlenecks are mainly in data transfer and latency rather than computation. Until recently, GPU products were not suitable for data-intensive tasks. However, with newer hardware the latency (i.e. transfer to GPU and back) can be mostly hidden, meaning that the acceleration from GPU-based searches and other algorithms will become more apparent very soon. The GPU will excel at more computationally-intensive tasks. As software tools and programming standards emerge, GPU-based applications will grow in number and quality, and the cost of adoption should shrink further.
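
As a toy model of the latency-hiding point, the sketch below compares a naive transfer-compute-transfer loop with a double-buffered pipeline in which transfers overlap computation. The timings and chunk count are invented purely for illustration.

    # Toy latency model of hiding accelerator transfer latency.
    chunks = 8
    transfer_ms = 0.4    # host-to-device (or back) per chunk, assumed
    compute_ms = 0.5     # accelerator compute per chunk, assumed

    # Naive: transfer in, compute, transfer out for every chunk, strictly in sequence
    naive = chunks * (transfer_ms + compute_ms + transfer_ms)

    # Overlapped (double-buffered): while chunk N computes, chunk N+1 is transferred,
    # so the steady-state cost per chunk is max(compute, transfer), plus pipeline
    # fill and drain.
    overlapped = transfer_ms + chunks * max(compute_ms, transfer_ms) + transfer_ms

    print(f"naive: {naive:.1f} ms, overlapped: {overlapped:.1f} ms")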

Graham: Yes, the runway for relying on single-threaded speed jumps within general purpose CPUs is running out, driving the need to consider parallel models to exploit multi-core (and multi-threaded) technologies and/or specialist hardware devices. All these devices typically offer higher performance for their chip area, consume much less power per computation than general purpose processors and are highly efficient in addressing a narrow range of tasks. However, most are expensive to programme because the skills needed are rare, they lack mature application development tooling and they have extremely limited ISV support. Parallel technologies also mean that event determinism, race conditions and thread-safe applications must be considered. And with specialist hardware solutions there is always the need to balance the cost of implementation and management of exotic technologies against more general purpose solutions. The gaming industry is also driving innovation, so the technology it is considering should be investigated - such technology also has good economies of scale. Gaming is also driving changes to the Linux kernel that can be exploited in financial applications.
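
A minimal Python sketch of the race-condition and thread-safety concern raised here: several threads update a shared position counter, and the lock is what makes the update safe. The counter and order counts are invented for illustration.

    # Two or more threads updating shared state: without the lock, increments
    # can be lost; with it, the update is thread-safe at the cost of some
    # serialisation.
    import threading

    position = 0
    lock = threading.Lock()

    def fill_orders(n):
        global position
        for _ in range(n):
            with lock:                 # remove this lock and the total may come up short
                position += 1

    threads = [threading.Thread(target=fill_orders, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(position)   # 400000 with the lock; potentially less without it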

McAllister: As chip-manufacturing technology has reached the point where clock rates within a given execution context can no longer easily be increased, the only way to solve a problem faster is to break it into smaller problems and solve each in parallel. If implemented in the right technology, parallel processing allows increasingly complex algorithms to be performed without latency penalty. With FPGAs, for example, creating new parallel execution contexts to handle additional work is a very natural thing to do. However, with parallel processing comes the need for inter-context communication and synchronisation. This is where software solutions based on general purpose CPUs continue to be challenged in high-end performance requirements, since much more time needs to be spent on scheduling, synchronisation and event management. FPGAs and network processors have integrated, purpose-built hardware assists to deal with such functions which have made them a popular choice in networking layers for years.

Valente: Exotic, market-specific technologies will have a short-term niche, but are usually beaten by next generation x86 CPUs or FPGAs. Both of these technologies are at the forefront of the process curve (45 and 65nm) and can be leveraged in many different markets to keep volumes up and costs down. Trading companies need to make sure that their investment is warranted and has longevity. FPGAs and x86 CPUs offer both.

Woodward: Not necessarily. FPGAs are specialist proprietary technologies which can be used for specific workloads to great effect. Is it possible to support all applications on these? It's unlikely, due to the cost of redevelopment. More likely, niche functions will run on FPGAs, while mainstream functions will run on ever faster core processors. In the future, we will likely see development environments in which code is developed to be deployed on the appropriate infrastructure - removing the dedicated tight coupling that exists today.

How should buy-side firms that use in-house automated trading models and algorithms optimise their hardware infrastructure to achieve low latency?

Cooper: For consistently achieving the best possible end-to-end latency, there are two key themes: (i) achieving the best possible results for data forwarding and processing in terms of absolute latency; and (ii) achieving results consistently with variance only as a consequence of inherent variables, e.g. packet size variation and elements outside of the buy-side firm's control. The process of optimisation is iterative and needs to be founded on fundamental data identifying different domains (network and systems for example). In principle, optimal latency can be obtained through the reduction of introduced delay (in terms of end-point to end-point device and component forwarding), transit component optimisation (ensuring devices have sufficient resources to process data with no delay, e.g. network switching without queuing) and addressing sources of variance. Ultimately, each component must be optimised for the lowest possible latency, but invariably there will be components for which optimisation is uneconomic or not feasible.
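
On the measurement side of this, here is a minimal Python sketch of recording per-message latencies and reporting percentiles and variance rather than a single average, since consistency matters as much as the absolute figure. The dummy processing hop and sample count stand in for real components and are assumptions.

    # Record per-call latencies, then report p50, p99 and spread.
    import statistics, time

    def timed_call(fn, *args):
        start = time.perf_counter_ns()
        fn(*args)
        return time.perf_counter_ns() - start

    def dummy_hop(payload):
        return payload.upper()          # stand-in for one processing/forwarding step

    samples = [timed_call(dummy_hop, "fix message") for _ in range(10_000)]
    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[int(len(samples) * 0.99)]
    print(f"p50={p50} ns  p99={p99} ns  stdev={statistics.pstdev(samples):.0f} ns")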

McAllister: FPGA and ASIC-based messaging is specifically designed to eliminate common infrastructure latency problems created by operating system garbage collection and context switching, thereby providing consistent ultra-low latency even at the most demanding peaks. Furthermore, hardware can optionally perform rules-based analysis of complex content to expand the range of algorithms beyond simple price activity to include news, market alerts and research notes. Hardware content infrastructure offloads CPU-intensive tasks - such as content filtering, routing and transformation - from the algorithmic application so that it can focus on performing its own unique algorithms faster.

Sangha: The biggest challenge for companies running operations in house is understanding latency across the whole trading cycle - applications, servers, network and multiple venues. The organisation needs to be able to monitor data and message flow in order to determine where bottlenecks may be occurring and what to do about them. Buy-side firms need metrics to distinguish the levels of service that sell-side firms and venues are really providing. By monitoring and measuring latency early in the cycle, firms can make better decisions about which network service and which market, intermediary or counterparty to select for routing trade orders.

Valente: Typically, protocol handlers and streaming databases are large contributors to latency. Solutions that target these spaces should offer minimal disruption, like a full-blown FIX or FIXFAST offload engine running in an in-socket accelerator or PCI-express card. This would allow for acceleration of the trader's existing infrastructure, without having to change everything.

Woodward: Probably by using packaged offerings tailored to their size and business strategies. Firms that run stat arb strategies will have higher tech requirements, but less performance-sensitive firms can get tech-enabled at a lower cost. Traditionally, buy-side vendors are not leading technology exponents and as such buy-side firms might not get, or be able to afford, the best advice. If one wants the best one must scour the showrooms and do the necessary research.

Nigel Woodward, Intel

"Firms should pressurise colocation vendors on the range of technical configurations being offered …"

What hardware-based strategies should buy-side firms adopt to store and access execution data as effectively as possible?

Graham: Investment in performant and reliable data technologies is imperative. Storage area networks (SANs) are more competitively priced nowadays and storage is always getting cheaper, so the associated benefits in performance, reliability, thin provisioning, storage virtualisation, remote backup and disaster recovery options are worth investing in. The largest bottleneck may be the current computational paradigm where data is created, stored and then analysed. A way forward may be to create and analyse data, but then only store a subset. Shifts such as this may well play a key role in shaping sophisticated analytical environments to come, where real-time data mining can play an increasing role in analytics, risk management and trade execution.

McAllister: Storage technology evolution has reduced the costs and improved the availability of massive volumes of data. While SAN-based storage is a popular choice for high availability and rapid data lookup, you will generally find a wide spectrum of architectures in buy-side firms. Customer choices are influenced by which configuration works best with the database or data-caching solution at the layer above. For high-speed assured data movement (for example order routing), persisting to physical disk media creates many latency and throughput challenges that have left transactional data rates far behind the messaging rates of non-persistent applications like market data. Hardware specifically designed to provide assured delivery is finally unlocking these limits with 100 per cent failsafe architectures that are not slowed down to the rate of the fastest disk write. In-transit messages are persisted to dual redundant caches and only the data that requires long-term storage is written to disk. Battery-backed RAM ensures that memory-cached messages cannot be lost, even in the event of a power loss to the redundant pair.
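
A minimal software sketch of the assured-delivery pattern described above, with in-process lists standing in for the dual redundant caches and a background thread doing the write-behind to disk. The message text and journal file name are invented, and a real implementation would place the replicas on separate, battery-backed devices; this only shows why the acknowledgement path need not wait for the disk.

    # Ack once the message sits in two in-memory replicas; write to disk lazily.
    import queue, threading

    primary_cache, mirror_cache = [], []
    disk_queue = queue.Queue()

    def persist_to_disk():
        with open("journal.log", "a") as journal:
            while True:
                msg = disk_queue.get()
                if msg is None:
                    break
                journal.write(msg + "\n")   # slow path, off the acknowledgement path

    writer = threading.Thread(target=persist_to_disk)
    writer.start()

    def publish(msg):
        primary_cache.append(msg)          # replica 1
        mirror_cache.append(msg)           # replica 2
        disk_queue.put(msg)                # write-behind to long-term storage
        return "ack"                       # acknowledged without waiting for the disk

    print(publish("NewOrderSingle id=42"))
    disk_queue.put(None)
    writer.join()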

Sangha: No matter what storage technology companies are investing in - and we would recommend making storage an integral part of the network with storage area networks - it is unlikely that they will be able to store all their information in one place, with information spread across trading partners and exchanges. Financial institutions need to be able to bring all this information together in a timely and efficient way, which is where storage virtualisation comes in: it provides a way of mapping and managing storage across the enterprise and third parties, while also maximising usage of existing, internal storage resources.

Valente: Analysing and searching terabytes of data is complicated and slow, especially when table joins, sorts and groups are involved. Storing data is simple - using, analysing and retrieving it is a different story. In addition to just storing it, standard database languages like SQL need to be supported and it needs to work on commodity hardware, so the IT professionals will support it. Specific storage/SQL appliances are starting to show up in the market place. These are hardware accelerated very large database (VLDB) appliances that can process SQL queries at a rate of 1TB of data in about a minute.

Andrew Graham, IBM

"… the runway for relying on single-threaded speed jumps within general purpose CPUs is running out, …"

Are the sell side's biggest hardware challenges on the processing side or on the networking side?

Berkhout: Having initially concentrated on processing power and system optimisation, many firms have recently realised that further latency reductions can be achieved only if they focus on the network as well. To get a good understanding of latency introduced by the network layer, consider three prime contributors to latency (a worked example follows this list):
• Serialisation delay arises from converting information into packets or a bit stream - it is governed by packet size and available bandwidth - and can lead to buffering. To eradicate excessive buffering, we recommend sufficiently dimensioned end-equipment and bandwidth;
• Switching delay is caused by hops across the network and the processing power of routers and switches. This delay is inherent to packet-switched networks and latency can be improved through labelled path switching. The preferred option would be connection-oriented networks close to the optical or transmission layer, avoiding switching altogether;
• Propagation delay is virtually constant in optical networks. It is determined by the speed of light, the refractive index of glass (i.e. resistance) and a linear function of the distance travelled. One way of ensuring the shortest physical routes is for metro fibres to be spliced directly between end sites and not routed through (multiple) exchanges. A proximity solution would virtually eliminate propagation delay for market-makers focused on a single exchange, but can present challenges for cross-asset or multi-market arbitrage models.
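
A worked example of those three contributions, using purely illustrative figures (the frame size, link speed, hop count, per-hop delay and route length below are assumptions):

    # Rough latency budget for one packet over one metro route.
    packet_bits = 1500 * 8              # a full Ethernet frame
    link_bps = 1_000_000_000            # 1 Gbit/s access link

    serialisation_us = packet_bits / link_bps * 1e6          # ~12 microseconds

    hops, per_hop_us = 4, 10            # assumed switching/routing hops
    switching_us = hops * per_hop_us

    distance_km = 40                    # assumed metro fibre route length
    light_in_fibre_km_per_ms = 300_000 / 1.5 / 1000          # c reduced by refractive index ~1.5
    propagation_us = distance_km / light_in_fibre_km_per_ms * 1000

    print(f"serialisation ~{serialisation_us:.0f} us, switching ~{switching_us} us, "
          f"propagation ~{propagation_us:.0f} us")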

Cooper: Both represent significant challenges and organisations will have different priorities. On the network side (i) market structure and evolving trading models present challenges in terms of increases in the number of destinations, integration and shorter lead-time-to-connect expectations, while (ii) network performance continues to be the focus of attention in the context of scalability and increasingly rigorous latency expectations. Consolidation of connectivity requirements through service providers that offer multiple connections across a single communications infrastructure should address some of the flexibility requirements. Network performance - in terms of low latency criteria, the end-to-end deterministic forwarding in microseconds with little or no variance - requires precise engineering of all components on all forwarding paths.

The problem is compounded by the need to instrument the network at a commensurate level of granularity in order to capture and understand network behaviour - a requirement that introduces new products and requires the development of new skills and supporting practices. Different approaches include over-provisioning of bandwidth; the adoption of very high-performance devices and systems; and an iterative analysis and optimisation cycle using information from a developing set of groups and bodies that provide specialised benchmarking and development services.

Vincent Berkhout, COLT

"… further latency reductions can be achieved only if they focus on the network as well."

Graham: With data volumes and the number of data sources ever increasing, the full architectural stack will be under pressure. Some implementations are currently using lower network bandwidth (100Mb links) as a throttle to enable processing to occur more reliably elsewhere - this then impacts the competitiveness of the overall platform. The complete architecture has to be considered to determine whether a scale-out design could be used for example to enable consumption of full data volumes. By filtering the signal from the noise earlier in the data lifecycle, it may be possible to consume the full data channel and process it in real time. Newer streaming frameworks allow filtering, aggregation, correlation and enrichment that can scale to thousands of individual physical processing elements, effectively using low-latency multicast technologies to distribute the problem.

New middleware can provide a high-throughput, low-latency transport fabric designed for one-to-many data delivery, many-to-many data, or point-to-point exchange in a publish/subscribe fashion. This technology exploits the physical IP multicast infrastructure to ensure scalable resource conservation and timely information distribution.
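
A minimal sketch of IP-multicast publish/subscribe using only the Python standard library: a single publisher send fans out to every subscriber that has joined the group. The group address, port and payload are arbitrary illustrative values, it relies on local multicast loopback when run on one host, and production feeds would layer reliability and recovery on top.

    # One send reaches every subscriber joined to the multicast group.
    import socket, struct

    GROUP, PORT = "239.1.1.1", 5007     # illustrative group address and port

    # Subscriber: bind to the port and join the multicast group
    sub = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sub.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sub.bind(("", PORT))
    membership = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sub.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

    # Publisher: a single send is delivered to all joined subscribers
    pub = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    pub.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    pub.sendto(b"XYZ bid=101.55 ask=101.60", (GROUP, PORT))

    print(sub.recvfrom(1024))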

Michael Cooper, BT Global Financial Services

"Network performance … requires precise engineering
of all components on all forwarding paths."

McAllister: In high-performance distributed applications, the challenges are a combination of both. Hardware-based middleware solutions use FPGA, ASIC and network processor technology to perform sophisticated processing at extremely high message rates with very low, predictable latency. For example, sophisticated, hardware-based content routing ensures that only the content of interest is sent to a given application. This in turn reduces both bandwidth demands in the network as well as processing demands by the subscriber. Such sophisticated routing also removes the need for publishing applications to perform content routing, or to deploy special content routing add-on software services that need to be managed and scaled independently. Hardware solutions can also transform data into a format convenient for the receiving application, again reducing processing demands. TCP offload engines along with zero-copy APIs further offload communication processing from host CPUs to purpose-built hardware, which both increases networking performance and increases CPU cycles available to the application.
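
A minimal sketch of the content-routing idea in Python: the routing layer matches each message's topic against per-consumer subscriptions and forwards only what each consumer asked for, rather than every consumer receiving and discarding the full stream. The topic scheme, consumer names and subscriptions are invented for illustration.

    # Route messages by topic so each consumer sees only the content of interest.
    from fnmatch import fnmatch

    subscriptions = {
        "algo_engine":  ["equities.XYZ.*"],
        "risk_monitor": ["equities.*.trade"],
    }

    def route(topic, payload):
        deliveries = {}
        for consumer, patterns in subscriptions.items():
            if any(fnmatch(topic, p) for p in patterns):
                deliveries[consumer] = payload
        return deliveries

    print(route("equities.XYZ.quote", {"bid": 101.55}))   # algo_engine only
    print(route("equities.ABC.trade", {"px": 10.0}))      # risk_monitor only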

Sangha: It's about having complete visibility of processing capacity usage and networking performance; the success of every trade depends on both factors working in tandem. The continuing huge growth in market data and order volumes is demanding huge investment in sell-side firms' data centres. These data centres are rapidly being filled with ever-increasing numbers of servers to drive the consolidation of market data and handle the increasing demands of algorithm processing and the growing complexity of risk modelling. Low latency is critical in many areas and computational power is key in risk calculations, so good data-centre design is essential. The application, compute and network components must act cohesively for best price/performance. Network bandwidth is increasing within the data centres with Infiniband and 10 gigabit Ethernet now becoming common as firms demand the highest performance from their systems. External links to execution venues become key here, where the proximity of a firm's trading and execution platform can dictate who wins and who loses in the 'need for speed'. With this in mind, the design of the data centre, physical distance and type of network connectivity to exchanges - as well as the option of co-locating some of the firm's systems at the exchange or in a service provider's facility - are all part of the mix.

Woodward: The issues divide between application code, networking infrastructure and hardware platform. These combinations will differ depending on the functions being performed, e.g. between low-latency trading and high-volume settlement. Often the code is old, and/or is running on a heavy operating system that is not multi-threaded and thus unable to take advantage of multi-processors. Double digit gains in performance and latency are now commonplace from enhancements in the infrastructure, e.g. I/O acceleration, Infiniband, multi-threading, tuning of operating systems, etc. It is all about ROI from effort expended. Trading performance is paramount and any gains are competitive advantage and so can be justified; elsewhere, a judgement call has to be made between cost - capital, resources and disruption - and the return.

What infrastructure changes must sell-side firms undertake to optimise powerful new servers, i.e. with 10-gigabit Ethernet cards?

Berkhout: We're seeing 10-gigabit Ethernet services deployed more widely and price pressure on high-volume 10-gigabit Ethernet services. One of the drivers is that hardware prices for interface cards and switches have come down significantly. There is also increased demand for Fibre Channel, with two- and four-gigabit Fibre Channel gaining greater market share.

Cooper: Failure to consider the whole end-to-end topology and infrastructure will lead to sub-optimal latency and latency variation. Consideration should certainly be given to increasing throughput and the inherent advantages of higher speed interfaces like 10-gigabit Ethernet, but new technologies like Infiniband should also be considered in the data centre environment. These technologies do offer performance advantages, but full optimisation has implications for application and systems design and development.

Similarly the adoption of very high-speed interfaces and platforms that utilise them has a knock-on effect on the underlying network infrastructure, e.g. switching models and capability along with external data centre communications. The effect of high volumes of data being presented to sub-optimal switching and routing infrastructures needs to be assessed.

Graham: If the software can exploit parallel techniques, consideration must be given to Infiniband - for example, writing software that can directly exploit lower-latency protocols. As the ability to accurately measure latency across the architecture becomes more important, time-syncing technology should also be considered, such as Stratum-x NTP clocks to ensure enterprise-wide time-stamping.

Through virtualisation of I/O devices, networking and storage can be abstracted to a degree with consideration of any latency implications. Optimisation of packet sizes should also be done to reflect the workload being considered. Finally, as servers become more powerful, consideration of the data storage and distribution must be maintained to avoid potential 'starvation' of the processors and other issues that may result from an unbalanced system.

McAllister: With the popularity of increased bandwidth capacity (via 10-gigabit Ethernet interfaces) and increased application processing capacity (via more and faster multi-core processors), servers need to be fed by a hardware-based content infrastructure at the messaging layer. These applications demand much higher message rates and lower consistent latency for market data and for order execution. Such high processing capacity devices need to be served by an even higher capacity messaging infrastructure - otherwise, processing cycles are wasted either waiting for data or repetitively filtering and transforming data. These issues work against the intended purpose of investing in high-performance servers and leave performance bottlenecks unchanged. Coupling a hardware-based content infrastructure with more powerful application servers uses the right tools for the right jobs and eliminates waste in end-to-end latency and throughput limits.

Sangha: Many firms are now looking beyond the traditional 10/100 and gigabit Ethernet connectivity that forms the basis of most data centres. We've seen Infiniband being used by firms where an extremely low-latency interconnect is required between servers. Market data infrastructure and algo-trading farms are typical applications of this alternative to Ethernet as well as in the construction of high-performance computing clusters or grids. Infiniband uses specific network interface cards and drivers for the servers that wish to connect to this high-speed, low-latency interconnect. In reality, this is not an issue and many firms see the benefit of this technology on existing servers and applications with little if any need for major systems changes. Additionally, if a firm wishes to optimise its trading applications by modifying some of the code, Infiniband can offer even better performance as well as removing up to 70 per cent of the load on the servers that can be caused by more typical Ethernet technology. Ten-gigabit Ethernet can offer some of the throughput advantages of Infiniband, but as of today is less deterministic in providing the consistent low levels of latency that automated trading demands.

For sell-side firms needing to overhaul/replace legacy systems, how best to optimise existing infrastructure and build a more scalable one?

Berkhout: For ultra-low latency requirements, hosting in the same building is the preferred way forward. Scalability is achieved by default as network costs are insignificant for primary feeds within the building. Only a few carriers are well positioned as they provide the combination of both data centre space and network services.

Graham: Abstraction technologies that isolate layers of the system facilitate virtualisation where appropriate. Splitting compute nodes and persistent data nodes enables good separation of concerns too.
Writing software to support 64-bit and safe multi-threading will allow the applications to exploit future hardware. Stress testing of existing systems to profile and identify bottlenecks and hot spots in end-to-end processes will also become increasingly important.

McAllister: The critical factor is to choose an infrastructure that can offload processing from legacy systems, thereby extending their life. Hardware infrastructure provides processing headroom for the future along with an increasing performance curve. For example, many trading floors today distribute market data through Ethernet multicast which requires each application to receive, inspect and accept or discard each message. Typically, the ratio of discarded messages to accepted messages is very high which means a heavy CPU load for non-productive work. As message rates increase, more and more time is spent filtering messages, meaning less and less time for application processing. Hardware-based middleware solutions use network processor and FPGA technology to perform message filtering before messages reach the application. Hardware can also ease transition between legacy and modern systems by reformatting data as it is delivered - from the format of the old application to the format of the new. These approaches allow communication between old and new applications to be evolutionary rather than disruptive, with no performance impact.

Is the server farm the most appropriate model for exchanges looking to expand capacity?

Berkhout: Traditionally, server farms are deployed in a scenario with a primary and secondary location. These have typically been owned locations and we see a trend towards leveraging external server farms to complement these. This spreads risk and deals with fluctuations in capacity demand.

Graham: There are trade-offs to this approach. Overall system architecture becomes more complex due to the need to build high availability and recoverability into the solution. Usually this means additional hardware in the form of primary and secondary servers. Management of a large server farm introduces a host of issues that ultimately threaten system stability and reliability.

One alternative would be a two-tier system, comprised of high-speed server components on the front-end, and a robust database server to provide high availability. This approach applies the appropriate server technology that best suits the opposing needs of low latency and high reliability. The advantage is the creation of an exchange solution with both high performance and proven reliability. Another alternative is the adoption of dynamic resource reallocation. In fast markets, the ability for humans to respond to market spikes and allocate additional capacity is diminished. Server technology exists today that has the capability to dynamically reallocate resources.

McAllister: The best price performance comes down to the message volume an infrastructure can sustain with the least complexity, for the least amount of expense. Server farms that run software are typically complex to manage, consume large amounts of data centre space and struggle to scale to the high-volume requirements of exchanges today. A hardware-based infrastructure provides an integrated platform with very predictable behaviour and eliminates the performance challenges introduced by the interaction between software and operating systems. This allows multiple orders of magnitude better performance with ultra-low consistent latency, which is essential to exchanges as the source of large-scale information feeds.

Woodward: The key choices around the use of server farms are between horizontal and vertical scaling, and between deployment of proprietary or commodity industry-standard technologies, especially as there are questions over the reliability of the latter. The debate is swinging towards horizontal scaling. This gives access to lower cost, industry standard hardware, and operational risk is managed by designing resilience into the infrastructure.

At the switches and routers level, what should exchanges be doing to improve response times?

Berkhout: We recommend running low and ultra-low latency applications as close to the optical layer as possible, where commercially viable, and with the fewest protocol conversions. In practice, deployments are a compromise between ease of configuration, ease of management and support versus dedicated connections and will vary per application and user.

McAllister: Higher capacity links such as 10-gigabit Ethernet and cut-through switches can be used to reduce communication latency and thus distributed information sharing latency. However, this latency is already literally in the one microsecond range. An important infrastructure factor in reducing order acknowledgement latency is the performance of the persistent messaging systems used between the various stages of order execution in some venues. Persistent messaging is typically performed using rotating disks which are orders of magnitude slower than RAM and processor speeds. Hardware-accelerated persistent messaging resolves this challenge to ensure that messages can never be lost, supported message rates are very high and latency is consistently ultra low. This simple infrastructure change dramatically improves response times, especially at peak trading hours, because it provides the excess processing capacity and stability only possible with hardware implementations.