Accelerating Automated Trading

Published in Automated Trader Magazine Issue 19 Q4 2010

We first covered FPGAs in the second issue of AT and it’s a topic we’ve revisited on a number of subsequent occasions. Here, David Buechner, Vice President, Impulse Accelerated Technologies, Larry Cohen, CEO, Accelerated Computing Solutions and Edward Trexel, Senior Staff/Hardware Engineer, bring us up to the second with a discussion of low-latency automated trading via FPGA-based NICs. Faster order execution via network-connected FPGA co-processing, anybody?

Faster processing results in successful trade execution and generates increased revenue. Getting your orders executed before your competition is essential, and more companies are using hardware acceleration to decrease latency and remain competitive. Here, we look at how your team can quickly design, develop and maintain hardware-based trading systems. And we discuss techniques you can use to facilitate migration to better and faster co-processors and Network Interface Cards (NICs).

What are the first steps in developing an automated trading solution?

  1. Identify compute-intensive processes that are slowing your automated trading. Move those specific processes to network-connected hardware co-processors.
  2. Partition entire processes so that each runs on one chip, minimizing the use of external memory and of an external processor as a "traffic cop."
  3. Develop much of the hardware-based application using a higher-level compiler (e.g., ANSI C rather than HDL) so the full team can program and maintain it.

Some financial institutions are starting to realize the benefits of using hardware in automated trading. Most firms are starting with leasable appliances from companies with pre-configured proprietary APIs. While these hardware-based systems are fast, they output a large amount of unneeded data to standard processing grids. When the data hits the grid it encounters a standard OS, which introduces jitter. About 90% of what comes out of feed parsers is never used.

So the processors bog down working with unneeded data, and the grid consumes massive amounts of power. When customizable hardware is introduced at the network interface using customer-programmed FPGAs (Field Programmable Gate Arrays), data is processed without an OS and more than 90% of the feed data is parsed away before it ever hits the microprocessor grid. The result is a dramatic speed-up in processing, faster placement of orders and increased revenue. The standard processing grid is offloaded and only sorts through actionable data.

David Buechner

Hardware choices

Given unlimited resources and unlimited time, full-custom ASICs (Application Specific Integrated Circuits) may be fastest, as they are completely custom. But they take 6 months to develop, and non-recurring engineering (NRE) costs can run 10 to 100x more than the cost of developing an FPGA implementation. Post-production changes are also painfully expensive with ASICs, while changes to an FPGA are virtually free, as FPGAs are reprogrammable.

FPGA technologies enable feed parsing and processing to occur directly in hardware, and they "play well" with microprocessors. Trading firms pick the data they want to watch. Only that data is passed to standard processors. This offers trading firms ultra-low latency platforms for identifying and responding to specific information from incoming feeds. Firms designing low latency, programmable hardware systems are able to apply their proprietary strategies and make trades in the shortest possible times.

FPGA-based systems can be thought of as customizable application-specific hardware used to offload single- or multi-core processors. Financial feeds can be managed in such a dedicated hardware processing environment, freeing the host processors to work on less latency-critical tasks. This approach to processing minimizes overall latency, increases the determinism of the latency, and reduces system power consumption.

In the race to get the quickest trades turned around, automated traders increasingly rely on network-connected FPGA cards. FPGA trading solutions typically cost less than $100K and can be deployed on a desktop. The FPGA based solutions amount to customizable network interface cards where processing is done in-line with the data "streams" at or near line speed.

Originating in military research in the 90s, Stream Computing refers to massive parallelization on optimally configured computers. FPGAs are programmed to process multiple parallel streams. Using higher level tools programmers use APIs that are similar to C based threading libraries. But unlike military arrays of the 90s, this can now be done in reconfigurable logic, on a desktop computer. New Commercial Off-the-Shelf (COTS) hardware and software creates a way for software developers to re-factor algorithms for accelerated feed-handling via FPGA tools. This enables a developer to profile and partition code iteratively between FPGA and microprocessor and judge the outcome - without being a hardware expert.
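The stream idea can be modeled on a desktop as a bounded FIFO that connects a producer stage to a consumer stage. The sketch below is an illustration in plain C, not the Impulse C stream API: the type `stream_fifo`, the functions `fifo_init`, `fifo_write` and `fifo_read`, and the depth of 4 are all hypothetical choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4   /* assumed depth; chosen only for illustration */

/* Hypothetical software model of a stream between two processing stages. */
typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head;    /* index of the next word to read */
    size_t count;   /* number of words currently buffered */
} stream_fifo;

void fifo_init(stream_fifo *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Returns 1 if the word was buffered, 0 if the FIFO is full
   (the producer stage would have to stall). */
int fifo_write(stream_fifo *f, uint32_t w)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = w;
    f->count++;
    return 1;
}

/* Returns 1 and stores the oldest word in *w, or 0 if the FIFO is empty
   (the consumer stage would have to wait). */
int fifo_read(stream_fifo *f, uint32_t *w)
{
    assert(f != NULL && w != NULL);
    if (f->count == 0)
        return 0;
    *w = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

In this model each stage only ever sees its own input and output FIFOs, which is what lets several stages run side by side once they are moved into hardware.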

Larry Cohen

System Overview

With higher-level tools, programmers can create flexible, reconfigurable, scalable FPGA-based feed parsing systems for direct off-the-wire financial analysis. Typical project goals are an order of magnitude or greater acceleration of algorithms over software-based implementations, offloading a significant portion of the processor work to hardware, and minimizing latency from receipt of data to system memory. A well-designed system shortens the entire round trip from incoming message management to outbound order execution.

The UDP feed (Figure 1) enters the system via a 1 Gb/s Ethernet connection. It travels through the MAC interface to the highly parallelized processing engine (the FPGA), which parses the feed protocol and filters message types with defined target characteristics for specific symbols. Traders usually track a subset of symbols, sometimes just a small fraction of the total set (e.g., < 10%). The resulting output of the FPGA is a subset of the data that meets the trader's criteria. Common criteria include target message types, picking off a specific symbol, or trader-defined analytics. The system latency is further improved as the FPGA offloads the Linux host, which now has far fewer data sets to process. Also, FPGA processing is much more deterministic, with less operating system interference.
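The filtering step can be sketched in plain C. This is a minimal illustration rather than the authors' implementation: the decoded message layout `feed_msg`, the `watch_list` structure, the 8-character symbol width and the function `msg_passes_filter` are all hypothetical, and a real parser would first decode these fields from the wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SYMBOL_LEN 8   /* assumed fixed-width, space-padded symbol field */

/* Hypothetical, already-decoded feed message (not an actual wire layout). */
typedef struct {
    char msg_type;              /* single-character message type code */
    char symbol[SYMBOL_LEN];    /* space-padded, not NUL-terminated */
} feed_msg;

/* Hypothetical trader-defined criteria: which types and symbols to keep. */
typedef struct {
    const char *msg_types;              /* NUL-terminated list of wanted type codes */
    const char (*symbols)[SYMBOL_LEN];  /* array of watched symbols */
    size_t n_symbols;
} watch_list;

/* Returns 1 if the message should be forwarded to the host, 0 if it is dropped. */
int msg_passes_filter(const feed_msg *m, const watch_list *w)
{
    size_t i;
    assert(m != NULL && w != NULL);
    /* Reject a zero type code explicitly: strchr would match the terminator. */
    if (m->msg_type == '\0' || strchr(w->msg_types, m->msg_type) == NULL)
        return 0;                       /* message type not of interest */
    for (i = 0; i < w->n_symbols; i++) {
        if (memcmp(m->symbol, w->symbols[i], SYMBOL_LEN) == 0)
            return 1;                   /* watched symbol: keep */
    }
    return 0;                           /* symbol not on the watch list */
}
```

On a processor the symbol comparisons run one after another; in an FPGA the same comparisons could in principle be laid out side by side, which is part of the appeal of doing this step in hardware.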

In this configuration (Figure 1) the original software system is accelerated beyond processor-based performance. The trading algorithm is the greatest asset a high-frequency trader owns. Typically that algorithm is written in ANSI C. Using high-level tools, those C-based algorithms are ported to an FPGA. The hardware implementation becomes processor- and hardware-independent, so it can be ported again to new hardware as it becomes available.

Again, the logic and algorithms are the key intellectual property and higher level tools allow for those algorithms to be targeted relatively easily to the best platforms currently available and new ones as they are released.

figure 1

Hardware selection and verification

The hardware in this type of application is selected from a wide range of standard-interface, scalable FPGA-based co-processing boards. Recent hardware platforms have multiple FPGAs communicating with processors over a PCIe bus. The boards plug into a slot in a common industrial chassis. Because the software is ultimately a larger investment than the hardware, the design is kept as device-independent as possible.

Basic architectures and partitions are verified in standard ANSI C debuggers before hardware is considered. Hardware functionality is verified later after a board support wrapper layer is added. The board support wrapper is called a Platform Support Package (PSP). The PSP isolates most of the elements that make that board unique, to a layer downstream of the design. This makes the design "device independent" such that it can be tried on alternative boards late in the process. As new boards are available, the design can be easily ported.
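The isolation that a PSP provides can be illustrated in ordinary C with an interface layer of function pointers. The sketch below is an analogy only and does not show the actual Impulse C PSP mechanism: `board_io`, `mem_board`, `forward_tagged` and the tag-matching rule are hypothetical names and behavior invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical board-support interface: everything board-specific sits
   behind these function pointers. Not the actual Impulse C PSP API. */
typedef struct {
    const char *name;
    int (*read_word)(void *ctx, uint32_t *word);  /* 1 on success, 0 when empty */
    int (*write_word)(void *ctx, uint32_t word);  /* 1 on success, 0 when full */
    void *ctx;
} board_io;

/* Device-independent logic: forwards only words whose top byte equals tag.
   Returns the number of words forwarded. */
size_t forward_tagged(const board_io *in, const board_io *out, uint8_t tag)
{
    uint32_t w;
    size_t forwarded = 0;
    assert(in != NULL && out != NULL);
    assert(in->read_word != NULL && out->write_word != NULL);
    while (in->read_word(in->ctx, &w)) {
        if ((uint8_t)(w >> 24) == tag && out->write_word(out->ctx, w))
            forwarded++;
    }
    return forwarded;
}

/* A memory-backed stand-in for a board, usable for desktop testing. */
typedef struct {
    uint32_t *buf;
    size_t len;   /* valid words (source) or capacity (sink) */
    size_t pos;   /* next word to read or write */
} mem_board;

int mem_read(void *ctx, uint32_t *word)
{
    mem_board *b = (mem_board *)ctx;
    if (b->pos >= b->len)
        return 0;
    *word = b->buf[b->pos++];
    return 1;
}

int mem_write(void *ctx, uint32_t word)
{
    mem_board *b = (mem_board *)ctx;
    if (b->pos >= b->len)
        return 0;
    b->buf[b->pos++] = word;
    return 1;
}
```

Because `forward_tagged` touches the hardware only through `board_io`, moving to a different board in this model means supplying a different pair of read and write functions and leaving the processing logic untouched.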

A C-to-hardware compiler produces synthesizable HDL as an intermediate file, which is passed to the FPGA manufacturer's or third-party synthesis tool. The synthesis tool is fully cognizant of the intricacies and resources of each specific FPGA and performs a complex "place and route" to deploy the logic. As an ancillary benefit, the creation of IEEE-compliant VHDL or Verilog as an intermediate file format means that further hardware verification can be done via popular HDL simulators such as ModelSim®.

Here's an example

In a typical project the input is streaming market data, such as an ITCH over MoldUDP feed, arriving via a 1 Gb Ethernet MAC interface. The overall target is to move latency from sub-millisecond to well under 50 microseconds (a 10x or greater acceleration over the current best implementation).
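To put a 50-microsecond target in context, it helps to work out how long a frame spends simply being clocked onto the wire. The helper below is an illustrative calculation only: the function name `wire_time_ns` is hypothetical, and the figure it returns ignores preamble, inter-frame gap and all processing time.

```c
#include <assert.h>
#include <stdint.h>

/* Time in nanoseconds needed to clock frame_bytes onto a link running at
   link_bps bits per second. Illustrative helper; integer division truncates. */
uint64_t wire_time_ns(uint64_t frame_bytes, uint64_t link_bps)
{
    assert(link_bps > 0);
    return (frame_bytes * 8u * UINT64_C(1000000000)) / link_bps;
}
```

At 1 Gb/s a 1500-byte frame occupies the wire for 12 microseconds (`wire_time_ns(1500, 1000000000)` gives 12000). If such a frame is fully received before anything is sent back out, receiving and re-sending it uses roughly half of a 50-microsecond budget before any parsing takes place.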

Processing: UDP packets are input and parsed using FPGA hardware. The configuration shown is UDP-in, UDP-out: feeds arrive at the FPGA as UDP. TCP output from the host CPU (not shown) is also common, and can itself be offloaded and accelerated by generating the TCP output in hardware.

Possible algorithms: String matching and other operations are performed on the input data. Processing algorithms are described using C, and parallelized for acceleration. The C-to-FPGA compiler, Impulse C, is a fully ANSI C compatible tool set that enables developers to optimize and parallelize their existing microprocessor-oriented C algorithms for multi-stream processing on an FPGA within this model.

FPGA Hardware Latencies

The simplest path from Ethernet off the wire, through an Ethernet MAC, to an Impulse C hardware process, and back out to the wire is shown above.

Cost of Ownership

To retain value, hardware configurations should be scalable from one to multiple cards. Impulse and its partners usually present several hardware options for a given feed implementation scheme. Most work begins with a preliminary analysis of hardware options, data paths and the expected performance and latency of each option. The code is designed so that it can port to next-generation hardware as it becomes available. Portability relates to device independence, i.e., avoiding customization of code that locks it to any one architecture. Using C as a design and test language, and isolating device-specific code in a post-process or an interface layer, reduces cost and preserves value.

Portable Code Example

Device-independent C is compiled to HDL using a high-level compiler. For this compiler, C code is re-factored for streaming processes. It is still recognizable C code that can be developed and verified with standard C development and debugging tools such as GCC/GDB or Visual Studio. For instance, an ITCH over MoldUDP feed processor, re-factored for hardware, may look like this:

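#include "co.h"

// assumed 32-bit stream words, matching the two-word sequence number read below
#define MOLD_STREAM_WIDTH 32
#define ITCH_STREAM_WIDTH 32

void mold_proc( co_stream mold_stream, co_stream itch_stream )
{
    co_uint32 moldDataIn;   // current word read from the MoldUDP64 stream
    co_uint32 itchDataOut;  // current word written to the ITCH stream
    co_uint64 seqNumIn;
    co_uint16 msgLen;       // per-message length; its handling is not shown in this excerpt
    co_uint16 msgCount;

    co_stream_open(mold_stream, O_RDONLY, UINT_TYPE(MOLD_STREAM_WIDTH));
    co_stream_open(itch_stream, O_WRONLY, UINT_TYPE(ITCH_STREAM_WIDTH));

    do {
        // read sequence number (two words, high word first)
        co_stream_read(mold_stream, &moldDataIn, sizeof(moldDataIn));
        seqNumIn = ((co_uint64)moldDataIn) << 32;
        co_stream_read(mold_stream, &moldDataIn, sizeof(moldDataIn));
        seqNumIn |= moldDataIn;
        // decode MoldUDP64 packet breaking it up into individual ITCH message packets
        co_stream_read(mold_stream, &moldDataIn, sizeof(moldDataIn));
        msgCount = (moldDataIn >> 16);
        if ( msgCount == 0 ) {
            // heartbeat
        } else if ( msgCount == ((co_uint16)0xffff) ) {
            // end of session
        } else {
            // parse out messages (simplified: forwards one stream word per
            // iteration and counts it as one message)
            while (1) {
                co_stream_read(mold_stream, &moldDataIn, sizeof(moldDataIn));
                itchDataOut = moldDataIn;
                co_stream_write(itch_stream, &itchDataOut, sizeof(itchDataOut));
                msgCount--;
                if ( msgCount == 0 ) break;
            }
        }
    } while (1); // main loop
}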
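For desktop verification it can help to keep a plain-C reference model of the same decode step next to the streaming version. The sketch below assumes the published MoldUDP64 layout (a 10-byte session field, an 8-byte sequence number and a 2-byte message count, all big-endian, followed by message blocks that each start with a 2-byte length). The names `mold_header`, `mold_parse_header` and `mold_count_messages` are hypothetical and are not part of Impulse C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOLD_HEADER_LEN 20           /* 10-byte session + 8-byte sequence + 2-byte count */
#define MOLD_END_OF_SESSION 0xFFFFu  /* message count value that marks end of session */

typedef struct {
    uint64_t seq_num;
    uint16_t msg_count;
} mold_header;

/* Big-endian field readers. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint64_t be64(const uint8_t *p)
{
    uint64_t v = 0;
    int i;
    for (i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decodes the packet header. Returns 0 on success, -1 if the packet is too short. */
int mold_parse_header(const uint8_t *pkt, size_t len, mold_header *out)
{
    assert(pkt != NULL && out != NULL);
    if (len < MOLD_HEADER_LEN)
        return -1;
    out->seq_num = be64(pkt + 10);    /* skip the 10-byte session field */
    out->msg_count = be16(pkt + 18);
    return 0;
}

/* Walks the message blocks. Returns the number of complete messages,
   0 for a heartbeat or end-of-session packet, or -1 if the packet is truncated. */
int mold_count_messages(const uint8_t *pkt, size_t len)
{
    mold_header h;
    size_t off = MOLD_HEADER_LEN;
    int n = 0;
    if (mold_parse_header(pkt, len, &h) != 0)
        return -1;
    if (h.msg_count == 0 || h.msg_count == MOLD_END_OF_SESSION)
        return 0;
    while (n < h.msg_count) {
        uint16_t mlen;
        if (len - off < 2)
            return -1;                /* no room for the length field */
        mlen = be16(pkt + off);
        off += 2;
        if (len - off < mlen)
            return -1;                /* payload runs past the end of the packet */
        off += mlen;
        n++;
    }
    return n;
}
```

A model like this runs under GCC/GDB with captured packets as input, and its results can serve as the expected values when the streaming implementation is checked later.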

High-level tools use API functions that compile to a bit map, which connects via a PSP to hardware platform resources such as a bus interface and on-chip, on-board or system memory. Using this tool set, software engineers generate modules and connect them to a larger system. C-to-FPGA tools are available from several manufacturers with varying degrees of "ANSI compatibility." Impulse C abstracts hardware specifics into Platform Support Packages, allowing software programmers to work in C with little knowledge of hardware. They can use a C approach to control streams, signals, memory, I/O and other hardware features. Without the PSP, the software developer would need to bring in a hardware engineer to write HDL code.

Basic Steps for C-to-FPGA Implementation:


  1. In a window like the top one in the illustration, C code is manipulated and debugged.
  2. Simple graphic tools (like the single process flow diagram block inset) enable developers to optimize each process, identifying bottlenecks that may require more resources. The entire file can be crudely "wrapped" in an FPGA for quick functional verification. It will be inefficient until re-factored for an FPGA.
  3. Complex graphic tools like the Stage Delay inset (the data and operator tree and branches) enable developers to view the interaction between operators and optimize the stages.
  4. Target hardware is then selected from a pull-down menu. It may be a single FPGA with embedded hard and soft processors, or a powerful board with various host communication methods, combinations of FPGAs, different types of memory resources and microprocessors already configured. The code is then further re-factored for the target device.
  5. The design is verified at three levels:
     a. Run the software in a desktop executable to test functionality prior to ever creating hardware.
     b. Verification in Impulse C once a target device is selected.
     c. ModelSim to test that the performance of the C and HDL implementations match.
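The comparison between the C and HDL implementations comes down to checking two output sequences word by word. A minimal sketch of that check follows. It assumes the expected words come from the desktop C run and the actual words were captured from the HDL simulation (how they are captured is outside this sketch), and the function name `first_mismatch` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compares n output words from two implementations.
   Returns the index of the first word that differs, or n if all n match. */
size_t first_mismatch(const uint32_t *expected, const uint32_t *actual, size_t n)
{
    size_t i;
    assert(expected != NULL && actual != NULL);
    for (i = 0; i < n; i++) {
        if (expected[i] != actual[i])
            return i;
    }
    return n;
}
```

Returning the index rather than a simple pass or fail points directly at the first word that diverges, which is usually where the debugging starts.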

figure 3

Getting Started Checklist

For a manager new to hardware accelerated computing, here is a possible project schedule:

Web conference to review potential FPGA-accelerated trading solutions: 4 hours

Develop statement of work and data flow diagram with HLS tools vendor. Include processing requirements and latency expectations: 2 weeks
  - Develop SOW for corollary reference design (operational or R&D)
  - Identify target feeds
  - Review required depth of book
  - Is there a need for internal feed normalization?
  - Does the firm want to insert risk management algorithms?
  - What kind of outbound messaging will be required?

Finalize Specification: 2 days
  - I/O
  - Hardware selection
  - Target speed
  - Acceptance test suite definition

Prepare System Integration, Hardware, Software Quotation: 2 days

After acceptance of SOW and purchase requisition:

Alpha code delivery to test network verification at near-target speeds: 6 weeks
  - Functionality development
      • Milestones: First streams, preliminary functional verification, speed benchmarking.
  - Feature development
      • Milestone: Full functional test
  - Speed improvement and documentation
      • Milestone: Functionality at speed

Beta testing: 2 weeks

Iterations, as required: 2 - 4 weeks

On-site training on how to modify the reference design to incorporate proprietary trading logic: 2 days

The migration from software-only financial trading to hardware accelerated trading has already started with FPGA based appliances. The step from that to customizable parsers is made easier through the use of higher level ANSI-C to FPGA tools.

About the authors

David Buechner is an industry leader in the hardware acceleration market and has helped hundreds of companies integrate hardware-accelerated systems into their operations. His clients have included key financial, military and national security customers. His focus is ensuring that software developers achieve first-project success. Buechner is a VP at Impulse Accelerated Technologies. Mr. Buechner has a BA from Calvin College and an MA from Holy Names University.

Larry Cohen advises banks and hedge funds on how to achieve competitive advantages by significantly increasing the performance and quality of their most time-critical systems. He has designed numerous VLSI ASICs for Bell Labs and has provided technical services to firms such as Citigroup, Merrill Lynch, and JPMorgan Chase. He is CEO of Accelerated Computing Solutions. Mr. Cohen has an MS in Electrical Engineering from Stanford University.

Ed Trexel is a Senior Staff/Hardware Engineer with years of experience in developing audio, VoIP, video and other circuits in FPGAs from Altera and Xilinx. Mr. Trexel has created designs that are in use with groups as disparate as the US Air Force, major US trading firms and leading audio processing companies. Mr. Trexel has authored many articles and application notes on the subject of FPGA-enabled co-processing. Mr. Trexel graduated from Colorado State University with dual degrees in electrical engineering and computer science.