Throughput Performance: Comparing Apples to Apples
from Real-Time Innovations (RTI) : Rick Warren - 31st December 1969
The opinions expressed by this blogger and those providing comments are theirs alone, this does not reflect the opinion of Automated Trader or any employee thereof. Automated Trader is not responsible for the accuracy of any of the information supplied by this article.
If you're serious about the performance of your distributed system, you probably read with interest the performance claims made by network middleware vendors. And if you're a network middleware vendor, you've probably published your share of performance claims. (RTI has comprehensive performance numbers available for both our DDS and JMS APIs.) But in order to know which claims are meaningful - and more importantly, which are useful to you - it's important to understand what you're reading. In the words on one of my coworkers, "many apples are compared to rhinoceroses."
First of all, there are three primary axes along which people tend to measure network performance:
- Latency: end-to-end, round-trip, and loaded latency; the amount of time it takes to send a certain amount of data from somewhere to somewhere else under various conditions
- Latency jitter: the amount of variation in latency measurements
- Throughput: the number of data quanta (either raw bytes or fixed-size samples/messages) transmitted over a given amount of time
The whitepaper "The Data-Centric Future" on the RTI website has a good overview of general performance considerations, so I won't reiterate that material here. What I'll focus on today is throughput, and specifically some of the subtleties you'll want to keep in mind when you're evaluating a networking middleware product.
The size of the data samples you send in a throughput test has a huge impact on the throughput you measure. At small data sizes (dozens to hundreds of bytes), performance is dominated by the expense of traversing the layers of software in between your application and the network: your middleware, if any; your operating system's network stack, including any system calls; and your network driver. As the data size increases, these fixed costs become less significant relative to the cost of actually copying the data (including the "copy" of the data across the network).
This difference in the throughput profile among different data sizes means that if you're reading a performance report, and you see "bytes" but your application deals with "samples" or "messages," it's important to understand the sample size(s) used to generate that report. It's not sound to generate some data with one sample size, add up the total number of bytes, and then divide by another sample size, because you're not correctly accounting for per-sample constant factors.
One vendor - an enterprise software vendor newly entering the messaging market, who shall remain nameless - made a series of throughput measurements for 256-byte samples. This vendor then declared that what their customers really cared about was 16-byte samples, and so immediately multiplied their measurements by 16 (= 256 / 16) and published those extrapolated results! Depending on how they were planning on sending those 16-byte samples, they were assuming either that (a) the cost of performing 16x the number of network sends is zero or (b) the cost of packing and unpacking 16-byte samples into and out of 256-byte chunks is zero. Of course, both of these costs are emphatically non-zero. But that brings me to my next point:
Understand data batching.
Because of the high fixed cost of a network send, especially relative to the cost of copying a small data sample, it is a common practice to batch multiple data samples together and send them together as a unit.
Suppose you're publishing 64-byte samples. Remember that each time you send a packet on the network, you're also sending a couple dozen bytes of IP header data and whatever meta-data your middleware requires. That adds up to a 30-100% space penalty - added to the time penalty discussed above. But if you can amortize these costs over many samples, they become much less important. In fact, batching data can increase your effective throughput by more than an order of magnitude in some cases.
In our experience, you can send 50-80,000 smallish packets per second using commodity OS, computing, and networking components. When you see samples-per-second-style throughput numbers much higher than that, it means that those samples are being batched under the hood.
Note that data batching is an intrinsic part of the TCP protocol, so any middleware implementation that relies on TCP batches data all the time.
Differentiate between one-to-n and aggregate
There are two ways to look at throughput: an application-centric view and a network-centric view. Which one of these a given community cares about governs which one gets measured and reported by vendors that market into that community. It means that you need to be aware of what you're reading when you see "n samples per second," especially when dealing with new vendor.
- Applications typically care about the number of samples that can be sent/received to/from particular destinations using particular data producing and consuming objects. For example, suppose I'm publishing sensor data. I know that my device has a new value available 10,000 times every second. If I try to send that much data, will it work? This viewpoint is relevant to applications with individual data streams that place a significant burden on the network all by themselves. Streaming media, high-rate sensor data, real-time command and control, and other similar domains are in this category. Throughput data is typically reported from one (the data producer) to n (the number of data consumers).
- Other systems take a network-centric view, measuring the total number of samples in flight across an entire system. This view is relevant when both of the following are true: (a) individual data streams may not be demanding by themselves, but there are many of them, and (b) all of those streams have a common choke point. Enterprise integration and web services often fall into this category, as services are invoked on human time scales, and middleware implementations typically include central message broker components. Network-centric throughput is typically reported in aggregate, from n (producers) to m (consumers), where all (n + m) entities bottleneck through a common broker. The goal of the test, in such a case, is to measure the limits of the broker itself, not of the applications that use it.
When you're evaluating a throughput claim, be sure you know which one of these scenarios you're looking at! I can tell you that there was a flurry of activity at RTI when a competing vendor started touting 6 million samples per second - until we read further into that vendor's testing methodology and discovered that the result was an aggregate across 60 applications. For the record, the throughput numbers you will find on RTI's website - showing over 1 million samples per second - are measured 1-to-n. That's one publisher on one box publishing to one destination.
Understand the architecture.
To me, 1-to-n numbers are the honest numbers when you have a peer-to-peer solution, as RTI does. That's because, assuming your switch can keep up, there's nothing to test other than the so-called "client" applications themselves. We can saturate a gigabit link for data sizes not much over 100 bytes, and come close even for very small sizes. Do you have more data to send than can fit over a single link? Then add another link. At that point, you're no longer testing the performance of the RTI infrastructure; you're testing the performance of your switch.
Knowledge is Power, or, Forewarned is Forearmed.
Now, hopefully, you have a better understanding of how throughput performance numbers data is measured and reported, what to expect from that data, and what to look out for. You're ready to enter the wide and wild world of performance evaluation. (Care to try your own hand? Pick up a copy of RTI Data Distribution Service or RTI Message Service and run the comprehensive performance test you can find in RTI's Knowledge Base - search for "Example Performance Test.")
Of course, there's a lot more to network performance besides throughput. No doubt I (and/or someone else) will be returning to talk about latency, loaded latency, jitter, competitive analysis, and other topics in the future - stay tuned.