Performance of Modern Cloud Workloads on Ampere® Altra®


Naren Nayak – Sr. Director, Application Engineering

The ever-increasing demands of the modern cloud – higher performance, better scalability, energy efficiency – are why, early last year, we announced the first cloud-native processor, the Ampere® Altra®. Designed from the ground up to meet the needs of the modern cloud, with compelling integer performance, a high core count, and great energy efficiency, Ampere® Altra® is the perfect choice for the cloud computing market. We are happy and proud to see that, a year later, the industry is rewarding it with a resounding A+ on these metrics.

Recently, AnandTech, Phoronix, and Serve The Home (STH) reviewed the Ampere® Altra® processor and published detailed analyses across a variety of performance benchmarks. These third-party reviews confirm the value it brings to the cloud when compared to competitive offerings. Below are key highlights from these reviews:

AnandTech – “The Altra overall is an astounding achievement – the company has managed to meet, and maybe even surpass all expectations out of this first-generation design. With one fell swoop Ampere managed to position itself as a top competitor in the server CPU market.”

According to the review, on SPECrate 2017 Integer, Ampere® Altra® delivered up to 2.4x better performance on base metrics compared to Intel Xeon 8280 (that’s one Ampere® Altra® CPU outperforming two high-end Intel Xeon CPUs!) and 7% better performance than the AMD EPYC 7742. The author, Andrei Frumusanu, added – “while SMT helps, it’s not enough to counteract the raw 25% core count advantage of the Altra system when comparing 80 vs 64 cores.” This further validates our decision to invest in single-threaded cores for better predictability and performance.

In his article, Michael Larabel from Phoronix states – “The performance exceeded my expectations where the Ampere Altra was able to collect wins in not only the performance-per-Watt but in the raw performance as well.” The latter portion of this statement is important. Arm processors have always excelled in performance/Watt, but raw performance has never been as competitive as it is with Ampere® Altra®.

Patrick Kennedy at Serve The Home noted – “The Wiwynn Mt. Jade platform absolutely was an upside surprise we were not expecting. After 4-5 years of working with Arm servers, this is the first server to provide what we can see as a wide-scale deployable solution.” The Wiwynn Mt. Jade platform, a dual-socket platform with Ampere® Altra® processors, was given the STH Editor’s Choice Award.

Winning on benchmarks is great, but we don’t expect our customers to buy our products to run SPECint in their datacenters; running real-world applications is what really matters. We designed Ampere® Altra® to deliver competitive out-of-the-box performance on real-world workloads, and to really shine when optimized. We recently studied competitive performance on a wide variety of applications commonly used in the cloud – from web hosting and media encoding to storage and databases.

 

Memcached – In-memory Caching

Memcached is one of the oldest key-value stores still used in the cloud. At its core, it’s a distributed in-memory hash table that was designed to be a fast, predictable cache sitting in front of a slower, disk-based database like MySQL. Memcached is so lightweight and fast that most benchmarking efforts end up with engineers spending more time tuning the network and client settings than Memcached itself!

The Ampere® Altra® delivered up to 29% higher performance compared to AMD EPYC 7742 at similar latencies.

 

NGINX – Web Server

NGINX is a popular web server that can also be used for other functions such as load balancing and reverse proxying. By some counts, close to a third of today’s websites are powered by NGINX.

In our tests, the Ampere® Altra® delivered up to 14% higher throughput compared to AMD EPYC 7742 at a p.99 latency of under 10 ms.

 

 

Media Encoding

Media encoding is one of the most popular workloads in the modern cloud, and H.264/AVC is still the most prevalent codec in the industry. Ampere® Altra® delivers up to 9% higher FPS compared to AMD EPYC 7742 when encoding a 1080p50 clip in a Video on Demand scenario.

 

We set out to design Ampere® Altra® to specifically address the needs of the modern cloud. The metrics the cloud values are different from those in other segments – great integer performance without consuming lots of power, lots of cores so performance can scale out, and compelling out-of-the-box performance with standard, open-source software.

In this blog, I have compared the raw performance of Ampere® Altra® to the current state of the art in server processors, with Ampere® Altra® leading across the board. In a follow-up blog, I will discuss the power consumption and energy efficiency benefits of Ampere® Altra® on these workloads, which make it an even more competitive and compelling product.

Needless to say, we are pleased with the leadership performance of Ampere® Altra® as demonstrated by the technical reviews cited earlier and results on real-world workloads. But this is just the beginning. Ampere® Altra® Max, with up to 128 cores, is just around the corner and will push the boundaries of performance even further. And that will be followed by its successor, codenamed “Siryn,” in 2022.

At Ampere Computing, innovation never stops. We will continue to deliver the world’s most performant, scalable, and power-efficient processors that are uniquely designed for the needs of the modern cloud.

 

Footnotes

SPEC2017 Rate-N Estimated Performance
Data source: AnandTech: https://www.anandtech.com/show/16315/the-ampere-altra-review/6
Hardware Configuration
Ampere® Altra® Q80-33, 2 sockets, 80 cores, 3.3 GHz, L1/L2/SLC = 64KB/1MB/32MB, DDR4@3200MHz – 32GB x 8 1DPC, CentOS 8.0.1905
AMD EPYC 7742, 2 sockets, 64 cores/128 threads, 2.25 GHz CPU (3.4 GHz boost), L1/L2/L3 = 32KB/512KB/256MB, DDR4@3200MHz – 32GB x 8 1DPC, cTDP=240W, CentOS 8.1.1911
Common
2x Mellanox MT27800 ConnectX-5 NICs, 2x Intel Xeon 2697 v4 (Broadwell) load generators
Software Configuration
NGINX
Two NGINX v1.15.4 instances, each serving a 50KB static HTML file over HTTPS/TLS, with Brotli for compression and LuaJIT to pre-process the URL string. 2x Intel Xeon 2697 v4 wrk load generators. Metric is throughput (requests/second) under an SLA – p.99 latency <= 10ms. Load was gradually increased until the SLA was violated.
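The "increase load until the SLA is violated" procedure can be sketched as a simple ramp loop. This is only an illustrative outline, not the harness used for the results above: `measure` is a hypothetical hook that would drive the load generators at a given level and return throughput and p.99 latency, and `fake_measure` is a made-up model included solely so the sketch runs on its own.

```python
from typing import Callable, Iterable, Optional, Tuple


def max_throughput_under_sla(
    load_levels: Iterable[int],
    measure: Callable[[int], Tuple[float, float]],
    p99_limit_ms: float = 10.0,
) -> Optional[float]:
    """Return the highest throughput observed before the p.99 SLA is violated.

    measure(load) is a hypothetical hook: it applies the given load level and
    returns (requests_per_second, p99_latency_ms).
    """
    best = None
    for load in load_levels:
        rps, p99_ms = measure(load)
        if p99_ms > p99_limit_ms:
            break  # SLA violated: stop ramping and keep the last good result
        best = rps if best is None else max(best, rps)
    return best


def fake_measure(connections: int) -> Tuple[float, float]:
    # Made-up latency/throughput model for illustration only, not measured data.
    rps = 1000.0 * connections
    p99_ms = 2.0 + 0.5 * connections
    return rps, p99_ms


result = max_throughput_under_sla(range(1, 64), fake_measure)
print(result)
```

In a real run, the step size of the ramp determines how closely the reported number approaches the true SLA boundary, so finer steps near the limit give a tighter result.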
Memcached
Memcached v1.6.3, with Memtier v1.2.17 to generate the load. Multiple instances of Memcached were run, each with 4 threads. IRQs for each of the two network cards were affinitized to their respective CPU sockets. Each instance of Memcached was targeted by a Memtier process with 8 threads, 32 clients per thread, and a pipeline depth of 20. The requests followed a 90R/10W (90% reads, 10% writes) ratio. The metric was aggregate throughput with a p.99 latency of <= 10ms. Load was gradually increased until the SLA was violated.
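To put the Memtier settings in perspective, the short calculation below works out how much concurrency one Memtier process can present to a Memcached instance. It assumes each client holds one connection and keeps its pipeline full, so the in-flight figure is an upper bound rather than a measured value; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemtierLoad:
    threads: int = 8              # Memtier threads per process (from the text)
    clients_per_thread: int = 32  # clients per thread (from the text)
    pipeline_depth: int = 20      # pipelined requests per connection (from the text)

    @property
    def connections(self) -> int:
        # Assumption: one connection per client.
        return self.threads * self.clients_per_thread

    @property
    def max_in_flight(self) -> int:
        # Upper bound: every connection has a full pipeline outstanding.
        return self.connections * self.pipeline_depth


load = MemtierLoad()
print(load.connections)    # 256
print(load.max_in_flight)  # 5120
```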
Media Encoding
x264 v0.161.3027, clip used – Ducks Take off 1080p50
./x264 --preset medium --psnr --tune psnr --threads 1 --frames 100 --profile main
Multiple single-threaded x264 instances were started (1 per core/thread). The metric was the aggregate of the FPS reported by each of the instances.
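The aggregate-FPS metric can be illustrated with a small sketch that sums the per-instance FPS figures. It assumes each instance's output has been captured as text and ends with x264's usual summary line of the form "encoded N frames, X fps, Y kb/s"; the function name and the sample lines are made up for illustration and are not measured results.

```python
import re
from typing import Iterable

# Matches the FPS figure in an x264-style summary line (assumed format).
FPS_PATTERN = re.compile(r"encoded\s+\d+\s+frames,\s+([\d.]+)\s+fps")


def aggregate_fps(logs: Iterable[str]) -> float:
    """Sum the FPS reported by each single-threaded encoder instance."""
    total = 0.0
    for log in logs:
        match = FPS_PATTERN.search(log)
        if match is None:
            raise ValueError("no x264 summary line found in log")
        total += float(match.group(1))
    return total


# Made-up sample output from two instances, for illustration only.
sample_logs = [
    "encoded 100 frames, 12.50 fps, 9000.00 kb/s",
    "encoded 100 frames, 12.25 fps, 9100.00 kb/s",
]
total = aggregate_fps(sample_logs)
print(total)
```

Summing per-instance FPS rewards scale-out: with one instance per core or thread, the metric grows with core count as long as the instances do not starve each other of memory bandwidth or cache.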

 

© 2021 Ampere Computing LLC. All rights reserved. Ampere, Ampere Computing, Altra and the Ampere logo are all trademarks of Ampere Computing LLC. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies; no ownership, affiliation, or endorsement by Ampere or the companies is intended or implied.


