The Gateway to Algorithmic and Automated Trading

FPGA's - Parallel Perfection?

Published in Automated Trader Magazine Issue 02 July 2006

FPGAs (Field Programmable Gate Arrays) may not be new technology, but as the data crunching race in automated/algorithmic trading continues to intensify, could they be an idea whose time has finally come? AT talks to Alistair MacArthur, Senior Research Engineer at Celoxica, who discusses current FPGA technology and outlines its potential for tasks such as parsing algorithmic news feeds.

Just how old are FPGAs as a technology concept?

More than twenty years. The original concept was developed by two of the founders of Xilinx - Ross Freeman and Bernie Vonderschmitt - in about 1984. Ross Freeman's idea was that rather than using a generic processor and writing software to run on it, one could customise electronic chips to perform specific tasks by directly programming them. In effect, the program's calculations would be implemented in hardware rather than software. The generic name for this type of chip is a Programmable Logic Device (PLD), with FPGAs being a subset of those.

So what does an FPGA chip actually consist of?

It is essentially a complex form of Static Random Access Memory (SRAM - a very fast type of memory commonly used for the on chip cache on conventional computer processors). Though modern FPGAs can contain a number of additional components, the three principal components are registers, function generators (also known as look up tables or LUTS) and the FPGA clock.

Each function generator contains of a set of logic gates1. It is typically laid out so it has four inputs, one output and a configuration port via which the logic gates are configured.

The basic concept of operation is that each function generator can be configured to produce a particular output for a given set of inputs. This can be thought of as a truth table. For instance a 4-input function generator could perform a 2 bit binary match.

Registers are essentially single storage "cells" each capable of storing one bit (either a 0 or a 1) of data. However, just as the logic gates in a function generator can be configured by a programmer, the registers can be similarly aggregated into larger sections of memory, such as 8, 32 or 64 bits.

The FPGA clock controls the number of times per second that each function generator and register can receive input and generate output. The fastest production FPGAs currently run between 200 and 400MHz, so each function generator and register can therefore theoretically process input/output between 200m and 400m times per second.

Xilinx Virtex 5 FPGA Chip Family

Xilinx Virtex 5 FPGA Chip Family

Plenty of standard computer processors have far faster clocks than that, so where would be the performance advantage in using an FPGA?

The important difference is that a single conventional CPU can only process a single instruction per clock cycle. By contrast, an FPGA can be configured as multiple virtual processors capable of functioning in parallel. Some large FPGAs may contain millions of function generators and registers, so the configuration for a simple processing task such as matching a short text string could result in an FPGA yielding tens of thousands of virtual processors. A configuration such as this will therefore quickly outweigh a clock speed disadvantage multiple of ten or fifteen.

The other point to bear in mind is that FPGAs can bypass a lot of other system latency. With a conventional processor you might be receiving the news feed you are processing via a TCP socket onto an Ethernet chip, but that then has to go through a MAC layer, then a North Bridge chip, then onto the processor main bus, then an interrupt has be flagged, then all the data has to be transferred into the user space. All these things can of course be done very quickly but there are nonetheless a lot of steps to go through that do not apply to FPGAs.

Why not? Is this a function of how FPGAs are connected to a system?

That has a lot to do with it, yes. In a production environment that encompasses algorithmic/automated trading activity, FPGAs are most likely to be found on PCI or PCI-X cards, with the cards being fitted with FPGAs from Xilinx and Altera. A more recent innovation has been to place the FPGA on a co-processor module that plugs directly into the computer motherboard. (These modules are produced by companies such as DRC Computer and can only be used on multiprocessor AMD motherboards that support HyperTransport technology.)

If an FPGA is mounted on a PCI card then the primary bottleneck will be the speed of the PCI connection. However, the advent of the PCI-X standard - with transfer rates of up to 4.26 GB/sec (PCI-X 533) - is making this much less of an issue. Furthermore, FPGAs that are mounted on a PCI card can also be fed data directly from an Ethernet socket mounted on the backplane of the card, completely bypassing the PCI bus.

Co-processor mounted FPGAs have an even faster access channel - with version 3.0 of the HyperTransport specification making transfer rates of up to 41.6 GB/sec possible. A fairly typical configuration here would be to use a dual processor AMD motherboard, with a conventional CPU processor mounted in one socket and a FPGA coprocessor mounted in the other. An additional advantage of this approach is that the FPGA co-processor has

direct access to the main system memory. Therefore, when testing news based algorithms, large news databases and any search terms can be loaded directly into main memory and accessed extremely quickly from there by the FPGA co-processor.

DRC Processor Socket FPGA

DRC Processor
Socket FPGA

What does this mean in practical terms?

Well, as we have seen with developments such as Dow Jones Newswires' announcement of their News and Archives for Algorithmic Applications, text processing is rapidly gaining in importance in algorithmic trading. While we may not as yet have reached the point where models will trade off the news wires alone, there is nevertheless a lot of work currently being undertaken on incorporating news flow into algorithmic models. The parallelism of FPGAs is perfectly suited to very high speed text parsing required for this approach, where you may be screening multiple news feeds for a huge number of keyword combinations relating to perhaps thousands of securities.

Tackling this scale of computing task using conventional computer processors would be relatively inefficient and would almost certainly require multiple processors if it was to be accomplished in a timely manner. While the availability of dual core processors obviously helps, one would still be looking at sizeable investment in hardware to achieve a similar level of parallel execution - and a larger electricity bill.

Clustering or grid technology could be used in order to access idle capacity, but this then raises questions of bandwidth overhead and communications synchronisation. Furthermore, an additional central server would be required in order to load balance and distribute the jobs across the other processors. Ultimately, if evaluating multiple regular expressions2 (such as in news processing for algorithmic trading), using conventional processors is akin to using a rather inefficient sledge hammer to crack a nut.

Alistair MacArthur, Senior Research Engineer at Celoxica

" FPGA can be configured as multiple
virtual processors capable of functioning
in parallel."

How do the technologies compare as the complexity of the expressions being evaluated increases? For example, does searching a news feed for multiple regular expressions result in a significant performance hit for either FPGAs or general purpose CPUs?

There is a very significant difference. As the number of regular expressions being evaluated increases, conventional CPU performance deteriorates appreciably in relation to that of an FPGA. While the precise figures will obviously depend upon individual circumstances, it is a reasonable approximation to say that by the time you are evaluating just fifty regular expressions a conventional CPU will have an execution time more than 100 times that of a comparable FPGA.

Altera Stratix II GX

Altera Stratix II GX

So why have FPGAs achieved so little penetration in financial markets to date?

It is only recently that FPGAs have become an efficient and affordable alternative to raw CPU power. With the advent of PCI-X and HyperTransport, the communications bottleneck has been overcome and with C-based software programming tools for FPGAs maturing, developers can program hardware in a familiar C-based environment.

Additionally, I think people have tended to stick with conventional computing technology for a number of reasons: one of the main reasons is comfort - people naturally prefer to stay with the technology they already know. It is perceived as less risky both for the organisation and for them personally in career terms. That applies to both business and technology roles, so a business line manager trying to evangelise FPGAs is likely to meet resistance from an IT department that may not understand the technology and in a sense feel threatened by it. (That is one reason FPGAs still tend to be seen as a "disruptive technology".)

The other problem for many people who might be prepared to use FPGAs is that many of the companies that provide services in the space are relatively new and/or small. They are therefore seen as a higher risk as suppliers - especially when compared to long-established industry names selling conventional technology.

Are there any significant market participants using FPGAs?

Yes - some participants are engaging with us to leverage the parallel processing capabilities of FPGAs for iterative processes that can benefit from parallelism. For example, we are dealing with a number of major investment banks that are using FPGAs for Monte Carlo simulations as part of their risk management processes.

The performance gains that are possible when using FPGAs for this type of problem can be substantial. For example, we have just completed a prototype implementation of an option pricing model for one investment bank that runs 400 times faster on an FPGA than it did on conventional processors - as well as consuming less significantly less power.

Figure 1

How would an FPGA perform a text search as part of an algorithmic model?

As I mentioned earlier, each pair of inputs to an FPGA function generator is capable of comparing two bits. ASCII codes for letters of the alphabet consist of eight bits - for example, the capital letter "A" is represented in binary as 01000001. Therefore, as an extremely simple example, if you wanted to check a news stream for the presence of the letter "A" you would need to use four function generators (eight pairs of inputs) connected to a further single function generator. You would also need sixteen one bit registers to buffer (store short term) the incoming data from the news feeds and the desired search value "A" that you had input. (Eight bits for the ASCII code for each incoming letter from the news feed and eight for the ASCII code for the search string "A").

Figure 1 is a schematic illustration of this, with the registers shown in pink, the first phase function generators in blue and the second phase function generator in green. The black arrows and lines represent the input path for the search string "A" and the numbers in boxes to the left of the registers represent (from top to bottom) the binary ASCII code for "A" (01000001). The ASCII binary code for each bit of each character in the news feed would be applied in the same corresponding order to the red arrows/lines.

If, as illustrated in Figure 1, both of these red inputs to a function generator match the corresponding black inputs (the search string bits) then the generator will output a 1 (the output connection is the blue line on the RHS of the generator). This output is fed to the second phase function generator along with the output from all the other first phase generators. If all four inputs to the second phase generator are "1" then it will also output a "1" (green line at RHS) indicating that the letter "A" has been found in the text stream.

This is obviously a very trivial example, but the basic concept can be scaled up to accommodate far more complex searches involving multiple words that must be within a certain distance of each other. If this condition was met, the information could then be passed to the algorithmic model in order to trigger a certain action, such as the suspension of trading or the resizing of position slices.

Alistair MacArthur, Senior Research Engineer at Celoxica

"The parallelism of FPGAs is perfectly suited
to the very high speed text parsing required
in situations where you are screening
multiple news feeds for a huge number of
keyword combinations relating to perhaps
thousands of securities."

You have already outlined some of the reasons for slow FPGA take-up in financial markets, but this still sounds rather too good to be true. Is there a significant cost hurdle to FPGAs?

The hardware, especially when viewed in terms of its processing cost per dollar, isn't particularly expensive. At the entry level, external USB-connected FPGA boards start at around $1000, while a top of the range PCI-X based FPGA board might cost $10,000. Obviously in both cases you would still have to factor in the cost of a relatively lowly specified computer to which the FPGA board would be connected.

However, when you consider that reasonably-specified brand name four processor Opteron servers start at around $9000, the relative costs still favour the FPGA. Even a mid range FPGA, at around $5000, will display substantially higher performance on the sort of text processing we have been discussing than a four processor conventional server. The coprocessor alternative is also competitive - for example, DRC Computer's modules start at about $4500, but this will probably fall over time (as volumes increase) to nearer $3000.

Celoxica HTX FPGA PCI Board

Celoxica HTX FPGA PCI Board

The software design environment can be a cost hurdle. Programming toolkits for FPGAs usually start in the $40- 50,000 range. While that sounds a lot, they are typically bought by electronics companies, such as mobile phone manufacturers, who can amortise that cost across potentially hundreds of thousands of units. Similarly, for a broker dealer developing multiple algorithms for its own or its customers' use this might not be a major barrier, but for entities such as smaller hedge funds it would be. The advantage today is that C-based design tools have slashed hardware design time and these efficiencies equate directly to saved man-months of effort and design-cost savings. To date, nobody appears to be developing inexpensive off the shelf FPGA tools for text string searching and other tasks related to algorithm development and trading. Most efforts in this space are still focused on accelerating financial software relating to risk or option models, though this should change as the opportunities FPGAs offer in automated/algorithmic trading become more widely appreciated.