What developments in processing capabilities should firms adopt to support algo/auto trading?
Aughavin: While many companies are investigating
parallel programming, they are proceeding methodically because it
can be difficult to maintain and support. However, companies
recognise the potential of accelerated computing and how it can
reduce power consumption and ease infrastructure complexity. In
recent months, select companies have launched accelerated
computing initiatives which are specifically designed to help
technology partners deliver open, flexible and scalable silicon
designs. These solutions can significantly boost performance in
compute-intensive applications. A key part of such solutions is a
stable platform which will help foster dynamic development,
enabling technological differentiation that is not economically
disruptive at a time when accelerated computing is moving into the mainstream.
Cooper: Alternatives to traditional horizontal-scaling and technology-upgrade approaches are beginning to emerge that address complex event processing (CEP), capacity and performance requirements. Network-attached compute appliances seek to address processing capacity and performance by offloading processing from existing systems to an optimised appliance. Additionally, some of these appliances mitigate the overheads frequently incurred with platform and technology upgrades by minimising systems modifications and application development. In addition to meeting existing application performance requirements, these appliances can service multiple systems while providing significant scalability and capacity for growth. They also address other issues such as power consumption and cooling requirements, are relatively straightforward to deploy and can prolong the life of the existing systems estate. As a consequence, they enable new approaches to be developed and new functionality to be supported that would not have been feasible on existing platforms.
Graham: The need to analyse applications
to ensure software is designed to exploit
multi-core/multi-threaded technology safely is ever more
important. A balanced solution stack must always be considered;
the old adage that fixing one bottleneck will only move it to
another part of the system still holds true. That said, emerging technology trends include:
• Offload engines/accelerators to perform XML transformations, security processing acceleration, FIX/FIXML acceleration, market data feeds optimisation, TCP offload engines (TOEs) and hardware devices such as graphics processing units (GPUs), field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs) and cell broadband engines;
• Accelerated migration of applications from 32 to 64 bit hardware and operating systems, to exploit in-memory databases and larger histories;
• Streaming/event-based technologies with the option to perform more complex processing are gaining significant traction, often blended with column orientation over row orientation;
• The need for predictive quality of service across the architecture, driving real-time Java solutions, real-time extensions to Linux, and dedicated or highly-managed networks; and
• Daemonless low-latency middleware that exploits true multicast networking.
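Graham's caution that fixing one bottleneck only moves it elsewhere, and that doubling cores does not double system speed, can be made concrete with Amdahl's law. The sketch below is an illustration added here, not part of the panel's remarks: the serial fraction of a workload caps the speedup no matter how many cores are thrown at it.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: overall speedup when only part of a workload parallelises."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Even a 90%-parallel workload gains little beyond a handful of cores:
for cores in (1, 2, 4, 8, 64):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

At 64 cores the 10 per cent serial remainder dominates: speedup stalls below 10x, which is exactly the "bottleneck moves elsewhere" effect.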
"CPUs are not getting faster and twice as many cores do not make the system twice as fast."
McAllister: Hardware infrastructure solutions
combine the best of network and content-processing hardware
advances to accelerate data delivery, routing and transformation
in support of algorithmic and other applications. These solutions
use FPGA, ASIC and network processor-based systems to move
content processing into silicon which improves uptime and
delivery rates by parallelising processing and eliminating the
unpredictability of software on servers. With data volumes
increasing exponentially and buy- and sell-side firms struggling
with increased complexity, latency and unpredictability in their
software infrastructures, hardware solutions can deliver an order
of magnitude greater throughput while guaranteeing ultra-low, predictable latency.
Valente: CPUs are not getting faster and twice as many cores do not make the system twice as fast. FPGAs are still getting faster every generation and doubling in size every 18 months, so the performance gap will actually widen. Unfortunately, there is a very fine line between the time it takes to develop, test and deploy new algorithms, and the latency/performance benefits that one can get with exotic technologies. Tomorrow's successful approaches must be multi-threaded in nature, but targetable to any technology at compile time (i.e. quad-core now, octal-core or FPGAs later). New API layers are helping developers move from technology to technology faster, regardless of the original or future target platform.
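Valente's "write once, retarget at compile time" idea can be loosely illustrated in Python, purely as an analogy: the API layers he refers to are vendor toolchains, not Python executors. The algorithm is coded once against an abstract executor; only a configuration choice names the backend.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def run_parallel(fn, inputs, backend="threads", workers=4):
    """Dispatch the same work to different execution backends.

    'backend' stands in for a compile-time target choice (multi-core CPU
    today, a different accelerator tomorrow); this is an illustrative
    analogy, not a real cross-target API.
    """
    pools = {"threads": ThreadPoolExecutor, "processes": ProcessPoolExecutor}
    with pools[backend](max_workers=workers) as pool:
        return list(pool.map(fn, inputs))

# The algorithm itself never mentions the backend:
squares = run_parallel(lambda x: x * x, range(6), backend="threads")
print(squares)  # [0, 1, 4, 9, 16, 25]
```

Swapping the backend requires no change to `fn`, which is the portability property Valente is describing.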
Woodward: Processors are now available in single-, dual- and multi-core versions with multiple sockets. Both the operating system and the application software have to be designed to take advantage of this processor layer. Also, various acceleration technologies embedded in the hardware can increase the performance of I/O-sensitive applications. Generally, the focus should be on newer infrastructure technologies and tuning, e.g. 10 Gigabit Ethernet networking and InfiniBand.
How should firms handle both the increased diversity and sheer quantity of market data in algo/auto trading?
Cooper: Basic connectivity is
clearly an issue: more sources require more connectivity.
Frequent changes and systems test acceptance procedures
complicate the situation: for example, how do you validate new
data and test systems integration? Seeking to address the problem
through point-to-point connectivity will lead to scalability and
integration issues, high costs and inefficient use of
infrastructure, and is an inflexible approach that precludes
rapid adoption, testing and validation. Consumers of multiple
data sources need to identify service providers who can provide
connectivity (with the appropriate performance attributes) to
multiple sources and are able to provide flexible solutions that
support rapid connection to new sources, new applications and new destinations.
Graham: Consideration of an enterprise metadata model for structured and unstructured data is important in terms of management and exploitation by numerous applications and people. Consolidation of feed adapters and exploitation of multicast technologies will help alleviate some overheads. Edge-of-domain performant technologies that filter the signal from the noise would help too.
Data caching and data grid technologies that handle the
replication of data to many nodes are worth considering, feeding
in-memory databases and specialist compute engines. Systems are
being developed which are an execution platform for
user-developed applications that ingest, filter, analyse and
correlate potentially massive volumes of continuous data streams.
They support the composition of new applications in the form of
stream-processing graphs that can be created on the fly, mapped
to a variety of hardware configurations and adapted as requests
come and go, and as relative priorities shift. Systems are being
designed to acquire, analyse, interpret and organise continuous
streams on a single processing node, and scale to
high-performance clusters of hundreds of processing nodes in
order to extract knowledge and information from potentially
enormous volumes and varieties of continuous data streams.
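As a rough single-node sketch of the ingest/filter/analyse pipeline described above (illustrative only; the platforms in question distribute these stages across many processing nodes), composable generator stages can model a stream-processing graph:

```python
from collections import defaultdict, deque

def ingest(ticks):
    """Source stage: yield raw (symbol, price) events."""
    yield from ticks

def filter_symbols(stream, wanted):
    """Filter stage: drop events no downstream consumer asked for."""
    return (ev for ev in stream if ev[0] in wanted)

def rolling_mean(stream, window=3):
    """Analysis stage: emit (symbol, mean of last `window` prices per symbol)."""
    hist = defaultdict(lambda: deque(maxlen=window))
    for sym, px in stream:
        hist[sym].append(px)
        yield sym, sum(hist[sym]) / len(hist[sym])

ticks = [("VOD", 100.0), ("BP", 50.0), ("VOD", 101.0), ("VOD", 102.0)]
graph = rolling_mean(filter_symbols(ingest(ticks), {"VOD"}))
print(list(graph))  # [('VOD', 100.0), ('VOD', 100.5), ('VOD', 101.0)]
```

Because each stage is lazy, stages can be composed "on the fly" in different graphs without changing the stages themselves, which is the adaptability the text describes.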
McAllister: Hardware infrastructure can improve throughput and deliver low, predictable latency allowing firms to easily scale as data rates scale. Additional hardware can perform tasks such as data inspection, filtering, routing, transformation, compression and security to allow algorithms to include market data, algorithmic news, research data and more at rates 100 or more times faster than software and servers. Today, firms typically have a separate infrastructure for their market data, their reference data and their back office SOA-style systems. Hardware solutions allow all classes of service to be combined under a single API without impacting performance on the highest end systems. Many parallel infrastructures can be consolidated into a single infrastructure with better performance and reliability than the stove-piped systems we have today.
Woodward: Firms should look at new market data
technologies, e.g. compression using the FIX FAST protocol or CEP
engines to handle and analyse various data forms. Data caches
such as GemStone, GigaSpaces and Tangosol can store and feed
real-time data to CEP engines. In addition, dashboard and
analytical tools exist for analysing market data, such as
Xenomorph and Kx Systems. New approaches to storage include
offline archives from major vendors such as EMC and NetApp, and
niche players such as Copan.
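Part of the compression FIX FAST achieves comes from transmitting only the fields that changed since the previous message. The sketch below illustrates that copy/delta idea in simplified form; it is not the actual FAST wire encoding, just the principle behind it.

```python
def encode_delta(prev: dict, curr: dict) -> dict:
    """Send only the fields that differ from the previous message."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def decode_delta(prev: dict, delta: dict) -> dict:
    """Rebuild the full message by applying the delta to the prior state."""
    return {**prev, **delta}

prev = {"symbol": "VOD", "bid": 100.0, "ask": 100.2, "size": 500}
curr = {"symbol": "VOD", "bid": 100.1, "ask": 100.2, "size": 500}
delta = encode_delta(prev, curr)
print(delta)  # {'bid': 100.1} -- one field on the wire instead of four
assert decode_delta(prev, delta) == curr
```

Since quote updates typically change only one or two fields, this is where most of the bandwidth saving comes from.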
How can firms optimise hardware deployment to overcome problems with power-to-rack ratios at popular co-location centres?
Aughavin: Virtualisation is one of the emerging
trends which firms are using to run multiple systems and
applications on the same server, the benefits being simpler
deployment, an elegant scalable architecture and a more efficient
use of computing resources. One approach is to deploy
architectures that circumvent front-side bus
bottlenecks, enabling efficient partitioning and memory access to
and from the processing cores.
Berkhout: While some co-location centres have been chosen as an immediate solution for low-latency trading, in the medium term financial institutions will need to be more selective over their chosen sites, locations and the power options available to them. One solution is 'long lining', i.e. dedicated connectivity to alternative locations nearby with more power and, equally important, ample cooling.
Cooper: The adoption of new systems with lower power usage and reduced cooling requirements is certainly one approach. In addition, consideration should be given to the use of appliances that provide functionality that can be used to support multiple systems, e.g. compute, I/O and storage appliances.
Graham: Mixing workloads within a rack will help balance the distribution of power input/heat output within a rack and across the data centre - often this requires organisational change since workloads are often arranged by line of business. New techniques including robot-assisted approaches can model and analyse the thermal distribution within a data centre to enable a more informed distribution of workloads. Blade-based solutions with more efficient power supply arrangements over rack-mounted servers are also worth considering; tie this in with active power management software that controls the hardware in real time in response to workload demands. To raise utilisation rates and hence attempt to reduce the overall data centre footprint, virtualisation technologies have a place for some workloads, whether through hardware or software hypervisor implementations. For 'hotspots', rear-door water cooling can help address the cooling dimension, with 50 per cent of rear-door heat being removed through the low pressure system in the door.
McAllister: A single redundant pair of hardware-based nodes in a content infrastructure can handle the workload of 20-60 equivalent general purpose servers running software middleware. Software solutions have widely variable latency characteristics as volumes increase, leading many firms to deploy infrastructures that can handle five or more times market peaks. This can literally mean hundreds or thousands of servers that are very lightly loaded under normal market conditions, just to assure reasonable performance during trading spikes. Hardware solutions can handle many more connections, greater throughput and perform consistently as loading limits are approached. Additionally, a single hardware infrastructure can combine the requirements of low-latency middleware and persistent messaging for tasks such as order routing without impact on performance.
Woodward: The energy consumption of processors is reducing dramatically. Firms should pressurise their co-location vendors on the range of technical configurations being offered and data centre design to ensure optimum service. They should also tune their applications at the time of relocation, making sure performance is optimised, rather than simply re-hosting existing software. For example, simple use of processor-based acceleration has been proved to improve performance of FIX throughput, and code optimisation is usually expected to contribute at least five per cent gains, reflected in both speed and scale of business process and energy consumption.
Pat Aughavin, AMD
"Virtualisation is one of the emerging trends which firms are using to run multiple systems and applications on the same server, …"
Is parallel processing fundamental to facilitating algo/auto trading? If so, what technologies should firms deploy?
Aughavin: Yes, and many of the technologies that
companies should be looking to deploy, such as FPGAs, GPGPUs and
multi-core processing, are being evaluated. From a GPU
perspective, the bottlenecks are mainly in data transfer and
latency rather than computation. Until recently, GPU products
were not suitable for data-intensive tasks. However, with newer
hardware the latency (i.e. transfer to GPU and back) can be
mostly hidden, meaning that the acceleration from GPU-based
searches and other algorithms will become more apparent very
soon. The GPU will excel at more computationally-intensive tasks.
As software tools and programming standards emerge GPU-based
applications will grow in number and quality, and the cost of
adoption should shrink further.
Graham: Yes, the runway for relying on single-threaded speed jumps within general purpose CPUs is running out, driving the need to consider parallel models that exploit multi-core (and multi-threaded) technologies and/or specialist hardware devices. All these devices typically offer higher performance for their chip area, consume much less power per computation than general purpose processors and are highly efficient at addressing a narrow range of tasks. However, most are expensive to programme because the skills needed are rare, they lack mature application development tooling and they have extremely limited ISV support. Parallel technologies also mean that event determinism, race conditions and thread safety must be considered. And with specialist hardware solutions there is always the need to balance the cost of implementing and managing exotic technologies against more general purpose solutions. The gaming industry is also driving innovation, so the technology it is adopting should be investigated - it also enjoys good economies of scale. Gaming is also driving changes to the Linux kernel that can be exploited in financial applications.
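Graham's warning about race conditions and thread safety can be shown in miniature: a shared counter updated by several threads is only deterministic when the read-modify-write is serialised with a lock. A minimal illustration, not production code:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    """Increment shared state; without the lock, concurrent
    read-modify-write cycles can silently lose updates."""
    global counter
    for _ in range(n):
        with lock:  # serialise the critical section
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- deterministic only because of the lock
```

Remove the lock and the final count becomes non-deterministic, which is exactly the class of defect that is cheap to write and expensive to diagnose.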
McAllister: As chip-manufacturing technology has reached the point where clock rates within a given execution context can no longer easily be increased, the only way to solve a problem faster is to break it into smaller problems and solve each in parallel. If implemented in the right technology, parallel processing allows increasingly complex algorithms to be performed without latency penalty. With FPGAs, for example, creating new parallel execution contexts to handle additional work is a very natural thing to do. However, with parallel processing comes the need for inter-context communication and synchronisation. This is where software solutions based on general purpose CPUs continue to be challenged in high-end performance requirements, since much more time needs to be spent on scheduling, synchronisation and event management. FPGAs and network processors have integrated, purpose-built hardware assists to deal with such functions which have made them a popular choice in networking layers for years.
Valente: Exotic, market-specific technologies will have a short-term niche, but are usually beaten by next generation x86 CPUs or FPGAs. Both of these technologies are at the forefront of the process curve (45 and 65nm) and can be leveraged in many different markets to keep volumes up and costs down. Trading companies need to make sure that their investment is warranted and has longevity. FPGAs and x86 CPUs offer both.
Woodward: Not necessarily. FPGAs are specialist proprietary technologies which can be used for specific workloads with high results. Is it possible to support all applications on these? It's unlikely, due to the cost of redevelopment. More likely, niche functions will run on FPGAs, while mainstream functions will run on ever faster core processors. In the future, we will likely see development environments in which code is developed to be deployed on the appropriate infrastructure - removing the dedicated tight coupling that exists today.
Nigel Woodward, Intel
"Firms should pressurise colocation vendors on the range of technical configurations being offered …"
What hardware-based strategies should buy-side firms adopt to store and access execution data as effectively as possible?
Graham: Investment in performant and reliable
data technologies is imperative. Storage area networks (SANs) are
more competitively priced nowadays and storage is always getting
cheaper, so the associated benefits in performance, reliability,
thin provisioning, storage virtualisation, remote backup and
disaster recovery options are worth investing in. The largest
bottleneck may be the current computational paradigm where data
is created, stored and then analysed. A way forward may be to
create and analyse data, but then only store a subset. Shifts
such as this may well play a key role in shaping sophisticated
analytical environments to come, where real-time data mining can
play an increasing role in analytics, risk management and trade execution.
McAllister: Storage technology evolution has reduced the costs and improved the availability of massive volumes of data. While SAN-based storage is a popular choice for high availability and rapid data lookup, you will generally find a wide spectrum of architectures in buy-side firms. Customer choices are influenced by which configuration works best with the database or data-caching solution at the layer above. For high-speed assured data movement (for example order routing), persisting to physical disk media creates many latency and throughput challenges that have left transactional data rates far behind the messaging rates of non-persistent applications like market data. Hardware specifically designed to provide assured delivery is finally unlocking these limits with 100 per cent failsafe architectures that are not slowed down to the rate of the fastest disk write. In-transit messages are persisted to dual redundant caches and only the data that requires long-term storage is written to disk. Battery-backed RAM ensures that memory-cached messages cannot be lost, even in the event of a power loss to the redundant pair.
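The assured-delivery pattern McAllister outlines - acknowledge once a message sits in two redundant caches, and write to disk only what needs long-term storage - can be modelled in a few lines. This is a toy illustration, not any vendor's product: plain dicts stand in for the battery-backed RAM of a redundant pair, and a list stands in for disk.

```python
class AssuredChannel:
    """Toy model of 'ack after redundant in-memory copies, disk later'."""

    def __init__(self):
        self.primary, self.mirror = {}, {}
        self.disk = []  # slow path; drained asynchronously in a real system

    def publish(self, msg_id, payload):
        # The message is acknowledged once BOTH caches hold it -- the
        # sender never waits on a disk write.
        self.primary[msg_id] = payload
        self.mirror[msg_id] = payload
        return "ack"

    def archive(self, msg_id):
        # Only data needing long-term storage ever touches disk.
        self.disk.append((msg_id, self.primary.pop(msg_id)))
        self.mirror.pop(msg_id)

chan = AssuredChannel()
print(chan.publish(1, "NewOrder VOD 100@100.0"))  # ack
chan.archive(1)
print(chan.disk)  # [(1, 'NewOrder VOD 100@100.0')]
```

The point of the design is that the acknowledgement latency is bounded by memory replication, not by the fastest disk write.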
Valente: Analysing and searching terabytes of
data is complicated and slow, especially when table joins, sorts
and groups are involved. Storing data is simple - using,
analysing and retrieving it is a different story. In addition to
just storing it, standard database languages like SQL need to be
supported, and it needs to work on commodity hardware so that IT
professionals will support it. Specific storage/SQL appliances
are starting to show up in the market place. These are hardware
accelerated very large database (VLDB) appliances that can
process SQL queries at a rate of 1TB of data in about a minute.
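The query shape Valente refers to - joins, sorts and groups over execution data - looks like the following, here run on sqlite3 purely for illustration; the appliances he describes execute the same style of SQL in hardware at terabyte scale.

```python
import sqlite3

# sqlite3 stands in for a hardware-accelerated VLDB appliance here --
# the query shape (join + group + sort) is what matters, not the engine.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL);
    CREATE TABLE refdata (symbol TEXT, sector TEXT);
    INSERT INTO trades VALUES ('VOD', 100, 1.5), ('VOD', 200, 1.6), ('BP', 50, 5.0);
    INSERT INTO refdata VALUES ('VOD', 'Telecoms'), ('BP', 'Energy');
""")
rows = db.execute("""
    SELECT r.sector, SUM(t.qty * t.price) AS notional
    FROM trades t JOIN refdata r ON t.symbol = r.symbol
    GROUP BY r.sector ORDER BY notional DESC
""").fetchall()
print(rows)  # [('Telecoms', 470.0), ('Energy', 250.0)]
```

It is precisely the join, group and sort steps in queries like this that dominate cost at scale, which is what the appliances accelerate.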
Andrew Graham, IBM
"… the runway for relying on single-threaded speed jumps within general purpose CPUs is running out, …"
Are the sell side's biggest hardware challenges on the processing side or on the networking side?
Berkhout: Having initially concentrated on
processing power and system optimisation, many firms have
recently realised that further latency reductions can be achieved
only if they focus on the network as well. To get a good
understanding of latency introduced by the network layer,
consider three prime contributors to latency:
• Serialisation delay arises from converting information into packets or a bit stream - restricted by packet size and available bandwidth - and can lead to buffering. To eradicate excessive buffering, we recommend sufficiently dimensioned end-equipment and bandwidth;
• Switching delay is caused by hops across the network and the processing power of routers and switches. This delay is inherent to packet-switched networks and latency can be improved through labelled path switching. The preferred option would be connection-oriented networks close to the optical or transmission layer, avoiding switching altogether;
• Propagation delay is virtually constant in optical networks. It is determined by the speed of light, the refractive index of glass (i.e. resistance) and a linear function of the distance travelled. One way of ensuring the shortest physical routes is for metro fibres to be spliced directly between end sites and not routed through (multiple) exchanges. A proximity solution would virtually eliminate propagation delay for market-makers focused on a single exchange, but can present challenges for cross-asset or multi-market arbitrage models.
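Berkhout's first and third contributors can be put into rough numbers. The figures below are illustrative additions (a refractive index of 1.47 is typical of optical fibre, not a COLT specification):

```python
C = 299_792_458          # speed of light in vacuum, m/s
REFRACTIVE_INDEX = 1.47  # typical for optical fibre: light travels ~c/1.47

def serialisation_us(frame_bytes: int, bandwidth_bps: float) -> float:
    """Time to clock a frame onto the wire, in microseconds."""
    return frame_bytes * 8 / bandwidth_bps * 1e6

def propagation_us(distance_km: float) -> float:
    """Glass-bound speed-of-light delay over the fibre route, in microseconds."""
    return distance_km * 1000 * REFRACTIVE_INDEX / C * 1e6

# A 1,500-byte frame on 1 Gbit/s vs 10 Gbit/s:
print(round(serialisation_us(1500, 1e9), 1))   # 12.0 us
print(round(serialisation_us(1500, 10e9), 2))  # 1.2 us
# 100 km of spliced metro fibre:
print(round(propagation_us(100), 0))           # 490.0 us
```

The comparison shows why route length dominates once bandwidth is adequate: upgrading the link cuts serialisation delay tenfold, but ~5 us per kilometre of fibre is irreducible, which is the case for proximity hosting.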
Cooper: Both represent significant challenges and organisations will have different priorities. On the network side (i) market structure and evolving trading models present challenges in terms of increases in the number of destinations, integration and shorter lead-time-to-connect expectations, while (ii) network performance continues to be the focus of attention in the context of scalability and increasingly rigorous latency expectations. Consolidation of connectivity requirements through service providers that offer multiple connections across a single communications infrastructure should address some of the flexibility requirements. Network performance - in terms of low latency criteria, the end-to-end deterministic forwarding in microseconds with little or no variance - requires precise engineering of all components on all forwarding paths.
The problem is compounded by the need to instrument the network at a commensurate level of granularity in order to capture and understand network behaviour - a requirement that introduces new products, requires the development of new skills and supporting practices. Different approaches include over-provisioning of bandwidth, the adoption and utilisation of very high-performance devices and systems, an iterative analysis and optimisation cycle utilising information from a developing set of groups and bodies providing specialised benchmarking and development services.
Vincent Berkhout, COLT
"… further latency reductions can be achieved only if they focus on the network as well."
Graham: With data volumes and the number of data
sources ever increasing, the full architectural stack will be
under pressure. Some implementations are currently using lower
network bandwidth (100Mb links) as a throttle to enable
processing to occur more reliably elsewhere - this then impacts
the competitiveness of the overall platform. The complete
architecture has to be considered to determine whether a
scale-out design could be used for example to enable consumption
of full data volumes. By filtering the signal from the noise
earlier in the data lifecycle, it may be possible to consume the
full data channel and process it in real time. Newer streaming
frameworks allow filtering, aggregation, correlation and
enrichment that can scale to thousands of individual physical
processing elements, effectively using low-latency multicast
technologies to distribute the problem.
New middleware can provide a high-throughput, low-latency transport fabric designed for one-to-many data delivery, many-to-many data, or point-to-point exchange in a publish/subscribe fashion. This technology exploits the physical IP multicast infrastructure to ensure scalable resource conservation and timely information distribution.
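The one-to-many, content-filtered delivery described above can be sketched in-process. This is a toy stand-in: real fabrics use IP multicast and perform the filtering in dedicated middleware or hardware rather than in application code.

```python
class PubSubFabric:
    """In-process stand-in for a multicast pub/sub fabric: the fabric,
    not each subscriber, decides who receives what."""

    def __init__(self):
        self.subs = []  # (predicate, inbox) pairs

    def subscribe(self, predicate):
        inbox = []
        self.subs.append((predicate, inbox))
        return inbox

    def publish(self, msg):
        # One send fans out to every matching subscriber (one-to-many);
        # non-matching subscribers never see the message at all.
        for predicate, inbox in self.subs:
            if predicate(msg):
                inbox.append(msg)

fabric = PubSubFabric()
vod_box = fabric.subscribe(lambda m: m["symbol"] == "VOD")
big_box = fabric.subscribe(lambda m: m["qty"] >= 1000)

fabric.publish({"symbol": "VOD", "qty": 200})
fabric.publish({"symbol": "BP", "qty": 5000})
print(len(vod_box), len(big_box))  # 1 1
```

Filtering at the fabric is what conserves bandwidth and subscriber CPU: each publisher sends once, and each subscriber receives only the content it registered interest in.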
Michael Cooper, BT Global Financial Services
"Network performance … requires precise engineering
of all components on all forwarding paths."
McAllister: In high-performance distributed
applications, the challenges are a combination of both.
Hardware-based middleware solutions use FPGA, ASIC and network
processor technology to perform sophisticated processing at
extremely high message rates with very low, predictable latency.
For example, sophisticated, hardware-based content routing
ensures that only the content of interest is sent to a given
application. This in turn reduces both bandwidth demands in the
network as well as processing demands by the subscriber. Such
sophisticated routing also removes the need for publishing
applications to perform content routing, or to deploy special
content routing add-on software services that need to be managed
and scaled independently. Hardware solutions can also transform
data into a format convenient for the receiving application,
again reducing processing demands. TCP offload engines along with
zero-copy APIs further offload communication processing from host
CPUs to purpose-built hardware, which both increases networking
performance and increases the CPU cycles available to the application.
Woodward: The issues divide between application code, networking infrastructure and hardware platform. These combinations will differ depending on the functions being performed, e.g. between low-latency trading and high-volume settlement. Often the code is old, and/or is running on a heavy operating system that is not multi-threaded and thus unable to take advantage of multi-processors. Double-digit gains in performance and latency are now commonplace from enhancements in the infrastructure, e.g. I/O acceleration, InfiniBand, multi-threading, tuning of operating systems, etc. It is all about ROI from effort expended. Trading performance is paramount and any gains are competitive advantage and so can be justified; elsewhere, a judgement call has to be made between cost - capital, resources and disruption - and the return.