How can a Pure Software Messaging Solution Whip the NIC off a Hardware Messaging Appliance?

Fast cars, fast software, and beautiful scenery. Why is it that the latest Bugatti Veyron doesn’t have the fastest time at the Top Gear test track? How could a highly regulated and constrained 2004 Renault R24 Formula One car outperform it? The Bugatti outperforms the F1 car in many tests, but not on that prestigious Top Gear track, because of the design trade-offs each car’s designers made. We also don’t know whether the driver of the Bugatti was really in tune with the vehicle. Martin Thompson carries this metaphor from racing over to software development in his blog, Mechanical Sympathy.

"The name "Mechanical Sympathy" comes from the great racing driver Jackie Stewart, who was a 3 times world Formula 1 champion. He believed the best drivers had enough understanding of how a machine worked so they could work in harmony with it." -Martin Thompson

In the software world, “Mechanical Sympathy” comes down to building software with an intimate knowledge of the underlying hardware. That is, if you know your hardware platform, how that platform works, and where it is heading, you can optimize your software to realize higher efficiency today and to scale along with future advancements.

In this post, we will investigate this principle further and answer the question of how AMPS running on best-of-breed commodity hardware can be 2.5 times faster in message processing and delivery throughput than a specialized hardware appliance.

Losing the First Heat but Winning in Record Time in the Final:

Admittedly, a while back, AMPS was in a similar situation and was outperformed on a test that resembled the Top Gear track. The use case was a niche one that relied mostly on our state of the world cache and little on our other capabilities. A specialized software vendor humbled us when their product outperformed AMPS, which we had thought of as our Bugatti. We started the race by out-lapping our opponent, as we were 30X faster on inserts; after a few more laps, however, we saw that we were significantly slower at querying. The AMPS approach is optimized for frequently updated and changing data. Most AMPS applications use subscriptions as a “continuous query” that delivers results as each row is inserted. For other applications, AMPS employs a sophisticated parallel divide and conquer approach to querying. This use case was more of a ‘write once, then distribute’ model. The test provided a priori knowledge of the keys and begged for an index-based approach. Divide and conquer wasn’t good enough in this case. We wanted to win!

In a few days, we revamped AMPS’ indexing model to add hash indexes, and we were able to outpace the competitor on queries while maintaining the 30X insert advantage. To elaborate, indexing schemes in most database systems rightfully optimize reads at the expense of maintaining indices during writes (updating and possibly rebalancing B-Trees, etc.). The result is slower writes and updates but faster reads, a good tradeoff because most use cases are ‘write once, read many’. By adding hash indexes, AMPS gained the ability to perform extremely fast lookups for this scenario while keeping inserts extremely efficient. In the last laps, AMPS was fueled by a combination of statically maintained indexes and on-demand indexes, which allowed AMPS to zoom past the checkered flag in record time.

In summary, AMPS’ hash index model lets you very rapidly find data for queries that are fully covered by a hash index, while still taking advantage of the divide and conquer on-demand indexing traditionally used by AMPS. Hash indexes brought us back to Bugatti level, as they dramatically improve AMPS performance for query-heavy (NoSQL) scenarios.
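To make the idea concrete, here is a minimal Python sketch of the two query paths described above. It is an illustration only, not AMPS code or the AMPS API: the `RecordStore` class, its field names, and the append-only design are assumptions made for the example. A query whose fields exactly match the indexed fields is answered with a single hash lookup; any other query falls back to scanning every record (the fallback here is a plain sequential scan, standing in for AMPS’ parallel divide and conquer approach).

```python
from collections import defaultdict


class RecordStore:
    """Toy append-only store with an optional hash index (hypothetical, not AMPS)."""

    def __init__(self, index_fields=()):
        self.records = []                        # rows in insertion order
        self.index_fields = tuple(index_fields)  # fields covered by the hash index
        self.index = defaultdict(list)           # key tuple -> row positions

    def insert(self, record):
        # Inserts stay cheap: one list append plus one hash-table update.
        position = len(self.records)
        self.records.append(record)
        if self.index_fields:
            key = tuple(record[field] for field in self.index_fields)
            self.index[key].append(position)

    def query(self, **criteria):
        # Fast path: the query is fully covered by the hash index.
        if self.index_fields and set(criteria) == set(self.index_fields):
            key = tuple(criteria[field] for field in self.index_fields)
            return [self.records[p] for p in self.index.get(key, [])]
        # Slow path: examine every record.
        return [
            record for record in self.records
            if all(record.get(k) == v for k, v in criteria.items())
        ]


store = RecordStore(index_fields=("symbol",))
store.insert({"symbol": "IBM", "price": 101})
store.insert({"symbol": "MSFT", "price": 42})
store.insert({"symbol": "IBM", "price": 102})

by_key = store.query(symbol="IBM")   # answered from the hash index
by_scan = store.query(price=42)      # answered by scanning
```

The trade-off mirrors the one in the text: maintaining the hash table adds a small constant cost per insert, and in exchange, lookups on known keys no longer depend on the number of records.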

Hardware Messaging Appliances Have a Place

Once in a while we see an analogy to a Formula 1 in the guise of a hardware appliance that claims superior throughput, predictable latency, and management simplicity. Their literature tells us there are scenarios where offloading or redirecting workloads to a messaging appliance makes sense, and if you are doing everything you need to do on the NIC, then a hardware messaging appliance is something that could be useful. By avoiding disk paging and CPU interrupts alone, they argue that they can predictably provide low latency with minimal interruption.

It’s fairly understandable that some people still think that hardware appliances will always be faster: an appliance is a dedicated box with minimal contention and can have registers aligned to the actual message types in the use case. The software driving the hardware can be finely tuned and optimized for both the use case and the customized hardware. The vendor can optimize their TCP stack, employ zero-copy processing and hand off results without leaving the NIC. Nothing could be better at receiving, processing and pushing out messages… right?

The premium costs for dedicated appliances make sense only if they do something better (i.e. faster) than what best-of-breed commodity hardware and software can do, if they are simpler to manage, and if you can afford to scale out with more or more advanced hardware. That is a lot of “ifs”… perhaps the Formula 1 cars should go race on their own special track.

What if a software solution proves these assertions wrong?

Dedicated hardware is fast - until you need to turn the other direction! When your needs change, hardware appliances can make you feel like you’re spinning your wheels.

A View to A Trade

So let’s take a real-world example: a View Server system where front office equity traders are constantly changing the data they want to see. With even a dozen traders, let alone 100, this can be taxing on the server technology during peak loads, when the server has to process incoming market streams and quickly serve data requests from the traders, while potentially having to enrich or validate the structured and unstructured data.

The hardware appliance was able to drive throughput to 1.6 million 1KB messages per second and the AMPS software (running on commodity hardware) was able to realize 4.5 million messages per second – only being bound by the memory, network and CPU.

If we upgraded any part of the hardware, we would hit better numbers. In fact, when we switched from a 10Gb network to a 40Gb network, we realized 6.5 million messages per second. To read more about this, look at our ‘Shock Absorber’ white paper, or if you want particular fan-out details and CPU/network saturation numbers, please get in touch with us. If you need to see it in action, even our evaluation offering on a VM will demonstrate its inherent high throughput capacity (though AMPS works even better on real hardware).

Defending COTS is Hard Work!

AMPS is implemented in regular programming languages and works best on best-of-breed Commercial Off The Shelf (COTS) hardware. It will even scale down to work on a VM, which is great for DevOps testing and development. However, when embracing COTS, you cannot simply leave your software at the whim of compiler optimizations, VM tuning, the OS, and hardware capacity. Much of the performance gain on COTS comes from understanding each level of the system and putting a great deal of effort into measuring and fine-tuning at every stage of processing.

Measuring Success:

Many of our successful real-world customers are also driven by measurements, and they give us throughput targets as well as latency tolerance goals expressed as minimum, median and maximum times. In our deployments and in our proofs of concept, we actually work towards predicting how well we will perform. The challenging part is calculating the processing step, but we have been doing this long enough that we understand the dynamics and can inform our estimates of AMPS’ maximum throughput from the type and size of the memory scheme, the CPU type, the networking capacity, and so on. By ensuring highly concurrent processing, intelligent memory usage strategies, and a lot of common sense, we hit our predicted rates.
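As a small illustration of goals stated this way, the Python sketch below summarizes a set of latency samples as minimum, median and maximum and checks them against a set of targets. The function names, the microsecond unit, and the sample and goal values are all hypothetical, chosen for the example; they are not measurements from AMPS or from any customer.

```python
import statistics


def summarize_latencies(samples_us):
    """Return the minimum, median and maximum of latency samples (microseconds)."""
    if not samples_us:
        raise ValueError("at least one latency sample is required")
    return {
        "min": min(samples_us),
        "median": statistics.median(samples_us),
        "max": max(samples_us),
    }


def meets_goals(summary, goals_us):
    """True if every summarized statistic is at or below its goal."""
    return all(summary[name] <= limit for name, limit in goals_us.items())


# Hypothetical samples and goals, for illustration only.
samples = [120, 80, 95, 300, 110]
summary = summarize_latencies(samples)
within_goals = meets_goals(summary, {"median": 150, "max": 500})
too_strict = meets_goals(summary, {"median": 150, "max": 250})
```

Tracking the maximum alongside the median matters because a system can look fast on typical messages while still missing its tolerance on the worst ones.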

When we hear that AMPS provided 2.5-4X more throughput in different cases… we aren’t ecstatic. Instead, we reassess whether the 2.5X figure was a theoretical bound or whether we were doing something wrong that prevented us from hitting that bound.

It is due to this philosophy that AMPS has been deployed in some of the largest and most successful high-volume trading systems in the US, and AMPS scales with best-of-breed COTS. Given AMPS’ effective use of resources, you can do a lot with one server, which reduces the need to scale AMPS out. It’s so well behaved that we are often hosted on a multi-tenant server. If you upgrade CPUs, memory or network, AMPS scales to take advantage of the new capacity. It can scale down to run on a VM for development and testing, and up to 400 CPUs. We set aggressive goals for ourselves because that is what both our clients and we have come to expect.

The Secret Sauce

In terms of the secret sauce, it’s simple: don’t follow all of Google’s C++ standards :), implement amazingly fast parsers for XML, FIX, JSON and other formats, exploit concurrency with lock-free data structures, and build for the enterprise. The last point, ‘the enterprise’, is vital: AMPS has a simple model for deployment, configuration, and updates. It also has enterprise-level high availability, monitoring, and admin tools. Due to the low cost point of AMPS, we have seen deployments in the thousands with minimal DevOps support requirements.

Still not convinced? Our many-trick pony also has a rich set of data selection capabilities, available through our multi-language APIs, that ensure it sends only the data that needs to be sent. This reduces the risk of broadcast storms and network contention and minimizes CPU waste due to over-subscription. Developers can leverage a state of the world data cache of all recent values, complex aggregations and calculations, conflated messages with data updates at an interval, out of focus notifications, and the greatest invention of all time: a transaction log that can be leveraged for message replay, failover, and resuming a subscription at the exact point where a restarted subscriber left off. None of these capabilities is implemented on the assumption that we will trade off throughput; it’s the opposite. They increase holistic system performance and resilience.
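Conflation is perhaps the easiest of these capabilities to picture. The Python sketch below is a generic illustration of the technique, not the AMPS implementation or API: the `Conflator` class, its `publish` and `flush` methods, and the explicit `now` timestamp argument are assumptions made for the example. Updates for the same key overwrite each other while they wait, and once the interval has elapsed, only the latest value per key is delivered.

```python
class Conflator:
    """Keeps only the latest pending update per key (hypothetical sketch, not AMPS)."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.pending = {}        # key -> latest message not yet delivered
        self.last_flush = None   # time of the last delivery, or of the first update

    def publish(self, key, message, now):
        """Record an update and return the messages to deliver now (possibly none)."""
        self.pending[key] = message   # a newer update replaces the older one
        if self.last_flush is None:
            self.last_flush = now     # the interval starts at the first update
        if now - self.last_flush >= self.interval_s:
            return self.flush(now)
        return []

    def flush(self, now):
        """Deliver one message per key and start a new interval."""
        delivered = list(self.pending.values())
        self.pending.clear()
        self.last_flush = now
        return delivered


conflator = Conflator(interval_s=1.0)
first = conflator.publish("IBM", {"price": 101}, now=0.0)
second = conflator.publish("IBM", {"price": 102}, now=0.5)
third = conflator.publish("MSFT", {"price": 42}, now=1.0)
```

In this run, the first two calls deliver nothing, and the third delivers two messages: the latest IBM update and the MSFT update. A slow consumer therefore receives two messages instead of three, which is the kind of saving that reduces network contention when keys update far more often than subscribers need to see them. Passing the timestamp in explicitly keeps the sketch deterministic; a real system would use a timer.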

We are fortunate to say that we win a lot of races, but we know we have to keep getting better and fully embrace the “Mechanical Sympathy” principle. The hardware messaging appliance did a great job serving 1.6 million messages per second, and the simple yet finely tuned software vanquished all doubt by realizing 4.5 million 1KB messages per second in its ‘slowest case’. When we added the 40Gb network, we hit over 6.5 million messages per second. A little COTS can go a long way!