The Gateway to Algorithmic and Automated Trading

What I've learned building low latency systems

Published in Automated Trader Magazine Issue 41 Q4 2016

Software development for low latency trading tends to be shrouded in mystery. Development practices are often wrapped in layers of computational alchemy that tend to be impenetrable to outsiders. The industry rarely gives insights, even though it borrows heavily from other sectors to drive its own progress.

Ariel Silahian

AUTHOR'S BIO

Ariel Silahian is an experienced technologist developing low latency trading platforms for hedge funds and proprietary trading firms. He is particularly interested in finding new ways of leveraging technology to exploit opportunities in the capital markets.

The high frequency trading world is secretive, and it's no real surprise that nobody wants to give anything away. Apart from the obvious commercial implications, there are also legal constraints: every firm you work for makes you sign non-disclosure agreements, which deters people from talking about even the most rudimentary aspects of the technology.

And once you have worked on a few different projects or in a few different companies, you start building up a veritable collection of NDAs. That has certainly been my experience. But talking about many of the technical aspects of high frequency trading does not give away any secrets. Sure, not many people talk about it, but it isn't secret knowledge; it's simply best practice as it would be known in any other industry. After spending many years in this problem domain, I have compiled a few points that I want to share.

Many of the software developers working in hedge funds and HFT firms have a gaming or communications background, which allows financial firms to draw on up-to-date expertise that actually originates in other industries.

High frequency trading is about low latency processing and ultra-fast communication: everything is geared towards speed. From the design and prototyping stage all the way to the final implementation, speed is always the primary concern. If you can manage to gain just a couple of microseconds somewhere along the way, it is considered a big deal.

To give you an idea of what kind of speeds we are talking about, let's go through various types of computer operations and how much time they typically take. This is summarized in Table 01 below.

I've discovered that in order to operate with the kind of speeds shown in the upper half of Table 01, I really do need to forget everything I have learned about modern software engineering. It requires a different mindset and the last 30 years of 'best practice development' basically go out the window: latency reigns supreme, no matter how ugly the code gets.

I still remember how I used to lay out my system designs in layers. There would be a Data Access Layer and a Business Logic Layer; there could also be a User Interface Layer. If you are designing an e-commerce platform, it doesn't matter if your code takes an additional 50 microseconds. Nobody even notices. In the end you might be happy if the checkout operation completes in a few seconds.

Choose the right language

Forget scripting languages (i.e. interpreted languages). They are simply of no use in this field. I get a lot of people asking me: "Hey, is Python a good language to use for HFT?". No! Python is slow. Terribly slow, in fact! Your application will be the software equivalent of a turtle. Or perhaps more accurately: some kind of slow-moving snake. Pythons are not quick. Languages like Python are good for exploring and testing new ideas and doing some research, but no more than that.

When you are looking to shave those last few microseconds off your overall processing time, you simply cannot afford the overhead of an interpreted language. Additionally, you really want a strong memory model to enable lock-free programming (a minimal sketch follows Table 01 below), so you should be looking at languages like:

  • C#
  • Java
  • Scala
  • C/C++11 (my preferred one, but that's just me)
  • OCaml
  • Haskell
  • Smalltalk
| Task | ns | µs | ms | Notes |
|------|----|----|----|-------|
| L1 cache reference | 1 | | | |
| L2 cache reference | 3 | | | 3x L1 cache |
| Branch mispredict | 4 | | | |
| L3 cache reference | 11 | | | 11x L1 cache |
| Mutex lock/unlock | 25 | | | |
| Main memory reference | 60 | | | 20x L2 cache, 60x L1 cache |
| Compress 1K bytes with Zippy | 3,000 | 3 | | |
| Send 1K bytes over 1 Gbps network | 10,000 | 10 | | |
| Read 4K randomly from SSD | 150,000 | 150 | | ~1 GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 | 250 | | |
| Round trip within same datacenter | 500,000 | 500 | | |
| Read 1 MB sequentially from SSD | 1,000,000 | 1,000 | 1 | ~1 GB/sec SSD, 4x memory |
| Disk seek | 10,000,000 | 10,000 | 10 | 20x datacenter round trip |
| Read 1 MB sequentially from disk | 20,000,000 | 20,000 | 20 | 80x memory, 20x SSD |
| Send packet CA → Netherlands → CA | 150,000,000 | 150,000 | 150 | |

1 ns = 10⁻⁹ seconds | 1 µs = 10⁻⁶ seconds = 1,000 ns | 1 ms = 10⁻³ seconds = 1,000 µs = 1,000,000 ns
Table 01: Latency Numbers in Modern Computers (2016)
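
To make the 'strong memory model' point concrete, here is a minimal C++11 sketch of the release/acquire hand-off that underpins lock-free code. All names and values are illustrative only:

    #include <atomic>
    #include <iostream>
    #include <thread>

    // Minimal release/acquire hand-off: the basic building block of
    // lock-free code. No mutex, no OS arbitration.
    double payload = 0.0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42.125;                              // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // 3. spin until published
            ;                                          //    (busy wait, no lock)
        std::cout << payload << '\n';                  // 4. guaranteed to see 42.125
    }

    int main() {
        std::thread c(consumer);
        std::thread p(producer);
        p.join();
        c.join();
    }

The acquire load synchronizes with the release store, so once the consumer observes the flag it is guaranteed to see the payload as well; in a language without such guarantees you would need a lock to get the same effect.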

Abstraction

Forget encapsulation and making your code nice, clean and reusable. Always remember that your goal is to make your code 'as fast as possible', not nice or easy to use.

Keep it all in memory

Read everything at the beginning of your program. Forget about databases or active persistence; they will add huge latencies, giving you enough time to whack a golf ball or two between orders.

Anything related to I/O will simply destroy your overall latency, so make sure all of your data is pre-loaded (a warm-up cycle) and held in memory.

This generally means managing your own in-memory data structures and maintaining a persistent log, so that you can rebuild the state after a machine or process restart.
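
As a rough illustration of this pattern, here is a minimal C++ sketch; the class name, record format and file handling are all hypothetical. State lives in memory, every update is appended to a journal, and the journal is replayed at startup:

    #include <fstream>
    #include <string>
    #include <unordered_map>

    // State in memory, log on disk: reads never touch the disk, and the
    // journal lets you rebuild the state after a restart.
    class PositionStore {
        std::unordered_map<std::string, long> positions_;  // hot path: memory only
        std::ofstream                         log_;        // append-only journal

    public:
        explicit PositionStore(const std::string& path)
            : log_(path, std::ios::app) {}

        void apply(const std::string& symbol, long delta) {
            positions_[symbol] += delta;              // update state in memory
            log_ << symbol << ' ' << delta << '\n';   // journal the change
            // no flush here: fsync policy is an explicit latency vs.
            // durability trade-off you have to decide on
        }

        long position(const std::string& symbol) const {
            auto it = positions_.find(symbol);
            return it == positions_.end() ? 0 : it->second;
        }

        // replay the journal at startup to rebuild state
        static PositionStore recover(const std::string& path) {
            PositionStore s(path);
            std::ifstream in(path);
            std::string sym; long delta;
            while (in >> sym >> delta) s.positions_[sym] += delta;
            return s;
        }
    };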

Alternatively, you might (just might) be able to get away with running a local, persistent in-memory database like Redis or MongoDB (with available memory exceeding the data requirements). Note that you can still lose data should something unexpected happen while it is being persisted to disk in the background, but that is the price you have to pay (also see the section on asynchronous operations at the end of the article).

For best performance I would recommend not having any database/disk access on your critical path.

Keep your hardware underutilized

Low latency requires always having resources freely available to process requests: free CPU cores, free memory and so on. When setting up your server, make sure you have enough capacity for your intended use. If you know that your system will use 24 CPU cores and 20 GB of memory, provision at least 30% additional resources to leave enough room.

Do not try to run at the limit of what your hardware can handle. Always leave plenty of headroom to allow for bursts: your system will eventually grow, and it is precisely at those all-important peak times that it will otherwise slow down. Make sure you keep plenty of spare resources available.

Keep context switching to a minimum

In concurrent applications you will end up using multiple threads to handle workloads simultaneously. Frequent context switches are a sign that you are doing more compute work than you have resources for. Usually the operating system handles running those threads and decides which CPU core takes which thread. Now, if you are running multiple threads at the same time, as in most HFT systems, the operating system will alternate threads on the CPU in order to give them all a chance to execute. Every time the operating system instructs a CPU to switch from one thread to another, the CPU needs to save the current thread's state (in order to resume it later) and load the state of the next scheduled thread. This can kill your processing performance.

My best approach has been to pin each critical thread to a dedicated core and keep it busy-spinning instead of blocking. That way I avoid triggering context switches and incurring the associated cache misses (as explained in the next point).
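
A bare-bones version of that approach on Linux might look like the following. The pin_to_core() helper and the polling loop are illustrative placeholders, and pthread_setaffinity_np is Linux-specific (compile with -pthread):

    #include <pthread.h>
    #include <sched.h>
    #include <atomic>
    #include <thread>

    std::atomic<bool> running{true};

    // Pin the calling thread to one core so the scheduler never migrates it.
    void pin_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    void hot_loop() {
        pin_to_core(3);  // keep this core dedicated to the hot path
        while (running.load(std::memory_order_relaxed)) {
            // poll_market_data();  // placeholder: check the wire; no sleeps,
            //                      // no locks, no syscalls on the happy path
        }
    }

    int main() {
        std::thread t(hot_loop);
        running.store(false);      // in a real system this happens at shutdown
        t.join();
    }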

Keep your reads sequential

Make sure you are using the CPU caches (L1/L2). All forms of storage perform significantly better when accessed sequentially. Sequential reads also trigger a pre-fetch from main memory: done properly, the next piece of data you need will already be in the L1 cache right before you need it. The easiest way to help the CPU with this is to make heavy use of arrays of either primitive data types or structs.

Stepping through a chain of reference pointers, whether via linked lists or via arrays of object references, should be avoided at all costs.
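
To illustrate the difference, compare these two C++ loops over the same data; the Order type is made up for the example:

    #include <vector>

    struct Order { long id; double price; int qty; };

    // Contiguous array of structs: the prefetcher sees a linear access
    // pattern, so the next Order is usually in L1 before you need it.
    double total_contiguous(const std::vector<Order>& orders) {
        double total = 0.0;
        for (const Order& o : orders)        // sequential walk
            total += o.price * o.qty;
        return total;
    }

    // Linked list: every step is a dependent load from an arbitrary
    // address, which defeats prefetching and tends to miss cache per node.
    struct Node { Order order; Node* next; };

    double total_linked(const Node* head) {
        double total = 0.0;
        for (const Node* n = head; n; n = n->next)   // pointer chasing
            total += n->order.price * n->order.qty;
        return total;
    }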

Use non-blocking calls where possible

Design your system to use non-blocking and wait-free data structures and algorithms wherever possible. Every time you take a contended lock you have to reach down to the OS to arbitrate it, which again comes with a huge overhead. If you know what you are doing, you can often get around locks entirely by understanding the memory model of the CLR, the JVM or the C++11 runtime.
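
As a concrete example, here is a textbook single-producer/single-consumer ring buffer in C++11. It is a minimal sketch, correct only for exactly one producer thread and one consumer thread, with a power-of-two capacity:

    #include <atomic>
    #include <cstddef>

    // Classic SPSC ring buffer: messages pass between two threads without
    // ever touching an OS lock.
    template <typename T, size_t N>
    class SpscQueue {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T buf_[N];
        std::atomic<size_t> head_{0};  // advanced by the consumer
        std::atomic<size_t> tail_{0};  // advanced by the producer

    public:
        bool push(const T& item) {     // producer thread only
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N)
                return false;                               // full
            buf_[t & (N - 1)] = item;
            tail_.store(t + 1, std::memory_order_release);  // publish the slot
            return true;
        }

        bool pop(T& item) {            // consumer thread only
            size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;                               // empty
            item = buf_[h & (N - 1)];
            head_.store(h + 1, std::memory_order_release);  // free the slot
            return true;
        }
    };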

Avoid exception handling

Yes, avoid it! It's expensive. Exception handling adds at least 10-20% overhead: compilers need to emit additional code and stack-management machinery to support unwinding, and that costs time. And before somebody points me to GCC's 'zero-cost' model: I suggest you profile your application and measure it. Remember: each microsecond counts.
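
One common alternative on the hot path is to return plain status codes and mark functions noexcept. A minimal sketch follows; the SendResult enum and the order-entry function are made up for the example:

    #include <cstdint>

    enum class SendResult : uint8_t { Ok, RiskRejected, SessionDown };

    // noexcept tells the compiler (and the reader) that no unwinding
    // machinery is needed here.
    SendResult send_order(long order_id, double price, int qty) noexcept {
        if (qty <= 0 || price <= 0.0)
            return SendResult::RiskRejected;   // error as a value, no throw
        // ... write to the wire ...
        (void)order_id;
        return SendResult::Ok;
    }

    // the caller branches on the result; a branch costs far less than a throw
    void on_signal() {
        if (send_order(42, 101.25, 100) != SendResult::Ok) {
            // handle the failure inline, off the exception machinery
        }
    }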

Async as much as possible

Any processing, and particularly any I/O, that is not absolutely necessary for building the response message should be done outside the critical path. For instance, if your system has to log transactions to disk or send transactions to a secondary server, those actions can happen asynchronously: you don't have to wait for them to complete before moving on to the next instruction, as that would just block your execution path.
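
A simple way to get I/O off the critical path is a background writer thread fed by a queue. The sketch below uses a mutex-protected queue purely to keep the example short; on a real hot path you would feed it through a lock-free queue like the one shown earlier:

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // The trading thread only enqueues a record and returns immediately;
    // the background thread does the actual disk write.
    class AsyncLogger {
        std::queue<std::string> q_;
        std::mutex              m_;
        std::condition_variable cv_;
        bool                    done_ = false;
        std::thread             writer_;

    public:
        explicit AsyncLogger(const std::string& path)
            : writer_([this, path] {
                  std::ofstream out(path, std::ios::app);
                  std::unique_lock<std::mutex> lk(m_);
                  for (;;) {
                      cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                      while (!q_.empty()) {       // drain, then go back to sleep
                          out << q_.front() << '\n';
                          q_.pop();
                      }
                      if (done_) break;
                  }
              }) {}

        void log(std::string line) {              // called by the trading thread
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(line)); }
            cv_.notify_one();                     // returns immediately, no I/O
        }

        ~AsyncLogger() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            writer_.join();                       // flush remaining records
        }
    };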

As you can see, there is more to coding in this problem domain than the coding itself. While writing every line of code, you have to keep in mind what is really happening behind the scenes.

One last piece of advice: don't take anything for granted. Benchmark different approaches and look at how they differ and why. Not only will you learn something, but you will eventually find the best solution for your problem.
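
Getting started doesn't require much. The hypothetical harness below times a snippet many times and reports the median and 99th percentile rather than a single average; for serious work you would reach for a profiler or a dedicated benchmarking framework, but even this is enough to catch surprises:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Time a callable many times and report distribution statistics,
    // because in latency work the tail matters more than the mean.
    template <typename F>
    void bench(const char* name, F&& f, int iters = 100000) {
        using clock = std::chrono::steady_clock;
        std::vector<int64_t> ns(iters);
        for (int i = 0; i < iters; ++i) {
            auto t0 = clock::now();
            f();
            auto t1 = clock::now();
            ns[i] = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        }
        std::sort(ns.begin(), ns.end());
        std::cout << name << ": median " << ns[iters / 2]
                  << " ns, 99th pct " << ns[iters * 99 / 100] << " ns\n";
    }

    int main() {
        volatile double x = 1.0;  // volatile keeps the work from being optimized away
        bench("multiply", [&] { x = x * 1.000001; });
    }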