Naren Nayak – Senior Director of Application Engineering
In February, I blogged about the performance of Ampere® Altra® on four key modern cloud workloads. The Ampere Altra Q80-33 beat the AMD EPYC 7742 processor on all of them. At that time, I promised to follow up with data on the Altra's energy efficiency. This blog does just that.
The movement of legacy workloads to the cloud has resulted in fewer, larger, more efficient datacenters. The large power demands of these datacenters have driven cloud service providers (CSPs) closer to the grid. There is a big focus on carbon neutrality, and some CSPs are already setting goals to achieve it. To meet those goals, server/infrastructure power consumption and efficiency, both of which are key components of the total cost of ownership (TCO), will continue to play an important role.
The higher power efficiency of Arm-based CPUs has long been taken for granted. The Arm CPUs used in the phones in our pockets likely have high performance/Watt ratios, but that doesn’t mean their raw performance is appropriate for cloud workloads. CSPs strive for cutting-edge performance that meets their SLAs at the lowest possible power levels. The Ampere Altra processor is the first generally available server CPU based on the Arm Instruction Set Architecture (ISA) that can deliver raw performance that easily rivals x86 CPUs while consuming less power.
The power efficiency of Ampere Altra on integer workloads has been discussed publicly before. The graphs in Figure 1 below show the Ampere Altra consuming 16% less power than the AMD EPYC 7742, while performing 9% better on SPECrate2017_int. While these numbers can be plugged into TCO models, every datacenter company we have worked with has its own TCO model, and CapEx and OpEx costs can vary significantly based on each company's hardware configurations, power/thermal solutions and commodity costs. Therefore, I’ve decided not to rely on TCO models but rather to share performance/Watt ratios, a key technical input for those models.
Figures 2 and 3 show the CPU frequency and CPU power consumption while running the SPECrate2017_int benchmark. The difference in philosophies is stark – the AMD EPYC 7742 consumes close to the thermal design power (TDP) throughout the benchmark and throttles the CPU frequencies down on the power-hungry sub-components to maintain TDP. The Ampere Altra, on the other hand, runs cooler, comfortably below the TDP while being able to maintain the maximum CPU frequency (3.3 GHz in this case) across all cores.
This consistency of CPU frequencies leads to predictable behavior for end users – the Ampere Altra Q80-33 will run all the cores at 3.3 GHz for most integer workloads. And of course, the resulting performance/Watt is stellar.
Next, let’s look at the other workloads I discussed in the performance blog.
Memcached – In-memory Caching
The Memcached test measures the number of operations/sec (90/10 Gets/Sets ratio) at a given service level agreement (SLA). In this case, the SLA is set at a p.99 latency of 20 ms for both platforms. After careful tuning of network parameters, such as affinitizing network IRQs to the appropriate cores on both platforms, the CPUs on each test platform were close to fully utilized. The Ampere® Altra® Q80-33 performed 29% better on Memcached at 10% lower power, resulting in a 43% higher performance/Watt ratio and clearly showing its superior power efficiency on a real-world workload.
NGINX – Web Server
For the NGINX test, our metric was the number of requests per second each platform could sustain while keeping the p.99 latency within a 10 ms SLA. The Ampere® Altra® Q80-33 delivered 14% higher throughput (requests/sec) than the AMD EPYC 7742 at 24% lower power, resulting in a performance/Watt ratio that was better by almost 50%!
x264 – Video Encoding
For this test, we scaled up to as many instances of x264 as there were cores/threads available on the platform. With the AMD EPYC 7742, as expected, aggregate frames per second started to taper off once the number of instances exceeded the number of physical cores; SMT provided only a minor improvement beyond that threshold. On this test, the Ampere Altra Q80-33 encoded 9% more frames/second than the AMD EPYC 7742 while consuming 8% less power, yielding a performance/Watt ratio 18% better than that of the EPYC 7742.
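The performance/Watt ratios quoted for each workload follow from dividing the relative performance by the relative power. The short Python sketch below redoes that arithmetic using only the rounded percentages given in this post, with the AMD EPYC 7742 normalized to 1.00. The names `RESULTS` and `perf_per_watt_gain` are illustrative, and the SPECrate2017_int ratio is derived here from the quoted 9% and 16% figures rather than stated in the post.

```python
# Relative performance and power of the Ampere Altra Q80-33 versus the
# AMD EPYC 7742 (EPYC = 1.00), taken from the rounded percentages in this post.
RESULTS = {
    "SPECrate2017_int": (1.09, 0.84),  # 9% better, 16% less power
    "Memcached": (1.29, 0.90),         # 29% better, 10% less power
    "NGINX": (1.14, 0.76),             # 14% better, 24% less power
    "x264": (1.09, 0.92),              # 9% better, 8% less power
}


def perf_per_watt_gain(rel_perf, rel_power):
    """Return the performance/Watt advantage over the baseline, in percent."""
    return (rel_perf / rel_power - 1.0) * 100.0


for name, (perf, power) in RESULTS.items():
    gain = perf_per_watt_gain(perf, power)
    print(f"{name}: {gain:.0f}% better performance/Watt")
```

Because the inputs are rounded, the outputs are approximate; they land on the 43%, roughly 50% and 18% figures quoted above for Memcached, NGINX and x264.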
Let’s recap what I’ve discussed so far. The Ampere Altra Q80-33 is groundbreaking for many reasons. With respect to performance, it can compete with the best x86 has to offer, and it can hit those performance levels while consuming less power. The power efficiency of Altra results in higher performance/TCO$ for cloud service providers. Altra’s higher core count combined with lower power consumption per core can be used to increase VM/user density at the rack level, leading to higher revenues. The ability to hit and maintain maximum CPU frequencies consistently translates to higher predictability along with fewer noisy neighbor problems for developers in the public cloud. With all these boxes checked and our focus on scale-out at a CPU level, the Ampere Altra family is increasing density to a point where new levels of performance and efficiency can be achieved. This will lead to new platform form factors, configurations and innovative cloud designs at both the infrastructure and the workload levels.
SPEC2017 Rate-N Estimated Performance
Data source: AnandTech: https://www.anandtech.com/show/16315/the-ampere-altra-review/6
Ampere® Altra® Q80-33, 2 sockets (1 socket used for the SPECrate2017_int data), 80 cores, 3.3 GHz, L1/L2/SLC = 64KB/1MB/32MB, DDR4@3200MHz – 32GB x 8 1DPC, CentOS 8.0.1905
AMD EPYC 7742, 2 sockets, 64 cores/128 threads, 2.25 GHz CPU (3.4 GHz boost), L1/L2/L3 = 32KB/512KB/256MB, DDR4@3200MHz – 32GB x 8 1DPC, cTDP=240W, CentOS 8.1.1911
2x Mellanox MT27800 ConnectX-5 NICs, 2x Intel Xeon 2697 v4 (Broadwell) load generators
Two NGINX v1.15.4 instances each serving a 50KB static HTML file over HTTPS/TLS, Brotli for compression, LuaJIT to pre-process the URL string. 2x Intel Xeon 2697 v4 Wrk load generators. Metric is throughput (requests/second) under an SLA – p.99 latency <= 10 ms. Load was gradually increased till the SLA was violated.
Memcached v1.6.3, Memtier v1.2.17 to generate the load. Multiple instances of Memcached were run, each with 4 threads. IRQs for each of the two network cards were affinitized to their respective CPU sockets. Each instance of Memcached was targeted with a Memtier process with 8 threads, 32 clients per thread, with a pipeline depth of 20. The requests followed a 90R/10W (Gets/Sets) ratio. The metric was aggregate throughput with a p.99 latency of <= 10 ms. Load was gradually increased till the SLA was violated.
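The "increase load until the SLA is violated" procedure used for NGINX and Memcached can be sketched in a few lines of Python. This is an illustration under assumptions, not the harness used for these results: the function names, the nearest-rank percentile method and the shape of the `steps` input are all hypothetical, and the latency samples would in practice come from the load generator's own reports.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latency samples (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def max_throughput_under_sla(steps, sla_ms):
    """Return the best throughput seen before the SLA is first violated.

    steps: iterable of (throughput, latency_samples_ms) pairs, ordered
           from the lightest to the heaviest offered load.
    Returns None if even the first step violates the SLA.
    """
    best = None
    for throughput, samples in steps:
        if p99(samples) > sla_ms:
            break  # SLA violated: stop ramping the load
        best = throughput if best is None else max(best, throughput)
    return best
```

For example, calling `max_throughput_under_sla(steps, sla_ms=10)` with three load steps, where only the third has a p.99 latency above 10 ms, returns the throughput of the second step.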
x264 v0.161.3027, clip used – Ducks Take off 1080p50
./x264 --preset medium --psnr --tune psnr --threads 1 --frames 100 --profile main
Multiple single-threaded x264 instances started up (1 per core/thread). The metric was aggregate of the FPS reported by each of the instances.
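As a rough illustration of this setup, the Python sketch below starts several single-threaded x264 processes with the options listed above and sums the frame rates they report. Several details are assumptions rather than part of the original test description: the clip path passed in by the caller, sending the encoded output to `/dev/null`, and parsing the `encoded N frames, X fps` summary line that x264 prints to stderr. The sketch does not pin instances to cores.

```python
import re
import subprocess

# Encoder options as listed in this post.
X264_ARGS = ["--preset", "medium", "--psnr", "--tune", "psnr",
             "--threads", "1", "--frames", "100", "--profile", "main"]

# Assumed format of x264's final summary line on stderr.
FPS_RE = re.compile(r"encoded \d+ frames, ([\d.]+) fps")


def parse_fps(log_text):
    """Extract the reported frames/second from one instance's log (0.0 if absent)."""
    match = FPS_RE.search(log_text)
    return float(match.group(1)) if match else 0.0


def run_instances(count, clip, x264="./x264"):
    """Start `count` x264 processes at once and return their aggregate FPS."""
    cmd = [x264, *X264_ARGS, "-o", "/dev/null", clip]  # output discarded (assumption)
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                         stderr=subprocess.PIPE, text=True)
        for _ in range(count)
    ]
    # communicate() waits for each process and returns (stdout, stderr).
    return sum(parse_fps(proc.communicate()[1]) for proc in procs)
```

A call such as `run_instances(80, "clip.y4m")`, with a hypothetical clip file name, would correspond to one instance per core on an 80-core platform.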
© 2021 Ampere® Computing LLC. All rights reserved. Ampere®, Ampere® Computing, Ampere® and the Ampere® logo are all trademarks of Ampere® Computing LLC. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies; no ownership, affiliation, or endorsement by Ampere® or the companies is intended or implied.