Arm Native Processors Deliver Higher Performance Results

DeathStarBench Social Network Workload Brief

Ampere - Empowering What’s Next

The Ampere® Altra® and Ampere® Altra® Max processors are complete system-on-chip (SoC) solutions built for cloud native applications. Ampere Altra supports up to 80 cores and Ampere Altra Max supports up to 128 AArch64 cores. In addition to incorporating a large number of high-performance cores, the innovative architecture delivers predictable high performance, linear scaling, and high energy efficiency.

Web services are commonly deployed services composed of multiple cloud native applications working together to deliver content over the internet. They are increasingly built using a microservice-based architecture, which breaks the traditional monolithic software architecture into smaller components or services. The individual micro-services can use different programming languages and frameworks but communicate with each other using a common messaging infrastructure. They can be easily deployed, managed, and scaled using a containerized environment like Kubernetes.

In this solution brief, we use the social network application from the DeathStarBench suite to simulate a web service like Twitter or Facebook. We compare Ampere Altra Max M128-30 to Intel’s latest generation Ice Lake processor, the Intel® Xeon® Platinum 8380, and AMD’s latest generation Milan processor, the EPYC™ 7763. Each platform runs the social network application while we measure the throughput and latencies of requests sent by simulated clients (users).

DeathStarBench Social Network on Ampere Altra Max

DeathStarBench is an open-source benchmark suite developed at Cornell University. It includes several end-to-end services, one of which is the social network application. The social network application uses an NGINX web server as the frontend; micro-services written in C++ and Python to implement the core application logic; and Redis, Memcached, and MongoDB for backend caching and storage of data. To simulate a real-world social media application running at scale, thousands of users connect to the frontend over HTTP, compose posts, tag other users, add media or URLs to the posts, and save them to the user and home timelines.

The applications in DeathStarBench are publicly available at: http://microservices.ece.cornell.edu under a GPL license.

Benefits of Running DSB Social Network on Ampere Altra Max

Cloud Native: Designed from the ground up for cloud customers, Ampere Altra and Ampere Altra Max processors are ideal for web services that use common cloud-native applications and are deployed in a containerized environment.
Scalable: The high core count combined with consistent frequency for all cores on Ampere Altra processors results in predictable performance for web services even under high utilization. This solution shows 23% better throughput as compared to legacy x86 platforms.
Power Efficient: Industry-leading energy efficiency allows Ampere processors to hit competitive levels of raw performance while consuming less than half the power compared to the competition.

Ampere Altra Max

128 64-bit cores at 3.0GHz
64KB i-Cache, 64KB d-Cache per core
1MB L2 Cache per core
16MB System Level Cache
Coherent mesh-based interconnect

Memory

8x 72-bit DDR4-3200 channels
ECC and DDR4 RAS
Up to 16 DIMMs (2 DPC) and 4TB addressable memory

Connectivity

128 lanes of PCIe Gen4
Coherent multi-socket support
4x16 CCIX lanes

System

Armv8.2+, SBSA Level 4
Advanced Power Management

Performance

SPECrate®2017 Integer Estimated: 350

Benchmarking Configuration

To allow comparative testing between various compute platforms as described above, each system under test is exercised using a load generator in an isolated network environment to eliminate outside influences on the data collected.

The tests use wrk2, which is part of the DeathStarBench suite, as the load generator. Wrk2 is based on the open-source benchmarking tool wrk and is modified to be an open-loop load generator: new requests are sent according to a fixed schedule even if the responses to previous requests have not yet been received. The wrk2 application runs on a client system and generates multiple simultaneous HTTP requests to the social network application running on the target server. The tests are configured to run with multiple threads and connections.
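The open-loop behavior described above can be illustrated with a toy model. The sketch below is not wrk2's implementation; it assumes a single server that handles one request at a time, and the function names are hypothetical. It shows that send times depend only on the requested rate, and that latency is measured from the intended send time, so queueing behind a slow response is counted.

```python
from typing import List


def open_loop_send_times(rate_rps: float, duration_s: float) -> List[float]:
    """Intended send time (seconds from start) of every request in the run."""
    if rate_rps <= 0 or duration_s < 0:
        raise ValueError("rate must be positive and duration non-negative")
    interval = 1.0 / rate_rps
    total_requests = int(rate_rps * duration_s)
    return [i * interval for i in range(total_requests)]


def open_loop_latencies(rate_rps: float, service_times_s: List[float]) -> List[float]:
    """Latency of each request against a toy single-server queue.

    Request i is scheduled at i / rate_rps no matter how long earlier
    requests took. Latency runs from that scheduled time to completion.
    """
    interval = 1.0 / rate_rps
    server_free_at = 0.0
    latencies = []
    for i, service_time in enumerate(service_times_s):
        intended = i * interval
        start = max(intended, server_free_at)  # wait if the server is busy
        server_free_at = start + service_time
        latencies.append(server_free_at - intended)
    return latencies
```

In this model a single slow response (for example, a 0.3 s service time at 10 RPS) raises the latency of the request scheduled after it, even though that request's own service time is short.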

On the target system, the social network application is deployed on a Kubernetes node using Helm charts. The operating system is Fedora 35. The Kubernetes cluster is deployed using kubeadm and runs Kubernetes version 1.23.5. The CPU and memory resources and the number of replicas for each service have been configured and tuned to achieve the lowest p99 latency at the highest throughput on the system under test.

We used the compose post workload generator for this test. The load generator, wrk2, was configured to run with 80 threads and 5000 total connections for a duration of 5 minutes, with the requested throughput initially set to 1000 RPS. In other words, each run uses 80 threads, keeps 5000 HTTP connections open, and targets a constant 1000 requests per second in total across all connections. The requested throughput was then increased gradually in steps of 500 RPS to observe the impact on achieved throughput and p99 latency. At the end of each test, we recorded the throughput in requests per second (RPS) and the p99 latency. Each test was repeated at least 5 times, and the highest RPS and p99 across the runs were used for the final comparison. We observed modest run-to-run variation in RPS and p99 latency because of the nature of this workload.
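The rate sweep described above could be scripted along the following lines. This is a minimal sketch, not the harness used for the published results. The `-t`, `-c`, `-d`, `-R`, `-L`, and `-s` options are standard wrk2 flags; the binary name, the Lua script path, the target URL, and the sweep's upper bound are placeholders, since the brief does not state them.

```python
from typing import List


def requested_rates(start_rps: int, step_rps: int, max_rps: int) -> List[int]:
    """Requested throughput for each step of the sweep, inclusive of max_rps."""
    if step_rps <= 0:
        raise ValueError("step_rps must be positive")
    return list(range(start_rps, max_rps + 1, step_rps))


def wrk2_command(rate_rps: int, url: str, script: str, binary: str = "wrk",
                 threads: int = 80, connections: int = 5000,
                 duration: str = "5m") -> List[str]:
    """Build the argument list for one wrk2 run at a fixed requested rate."""
    return [
        binary,
        "-t", str(threads),      # worker threads
        "-c", str(connections),  # total open HTTP connections
        "-d", duration,          # test duration
        "-L",                    # print the latency distribution
        "-s", script,            # Lua workload script (placeholder path)
        url,                     # frontend endpoint (placeholder URL)
        "-R", str(rate_rps),     # constant requested throughput in RPS
    ]


if __name__ == "__main__":
    # Placeholder values: adjust to the actual deployment before use.
    for rate in requested_rates(1000, 500, 3000):
        print(" ".join(wrk2_command(rate, "http://frontend.example:8080/",
                                    "compose-post.lua")))
```

Each generated command could be passed to `subprocess.run` and repeated at least 5 times per rate, as in the procedure above.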

Since this test simulates end-to-end client requests, it is realistic to measure throughput under a specified Service Level Agreement (SLA). We found that the most suitable SLA was a 99th percentile latency (p99) of 2 seconds. This ensures that 99 percent of requests have a response time of 2 seconds or less while the throughput of the overall load is maximized at very high observed processor utilization on the system under test. In this way, the SLA sets the “limit” of the test at the point where the system under test begins to behave poorly, as indicated by rising p99 latency and a rising drop rate of requests.
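As a rough illustration of how throughput under an SLA can be selected from a sweep, the sketch below computes a nearest-rank p99 and picks the highest achieved RPS whose p99 stays within the limit. The function names are hypothetical, and the nearest-rank method is an assumption; wrk2 reports its own percentile figures, which may be computed differently.

```python
from typing import Optional, Sequence, Tuple


def p99(latencies_s: Sequence[float]) -> float:
    """99th percentile latency using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def max_throughput_under_sla(runs: Sequence[Tuple[float, float]],
                             sla_p99_s: float = 2.0) -> Optional[float]:
    """Highest achieved RPS among runs whose p99 latency meets the SLA.

    Each run is a pair (achieved_rps, p99_latency_s). Returns None when
    no run meets the SLA.
    """
    passing = [rps for rps, p99_s in runs if p99_s <= sla_p99_s]
    return max(passing) if passing else None


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measured results.
    sweep = [(1000.0, 0.4), (1500.0, 1.9), (2000.0, 2.6)]
    print(max_throughput_under_sla(sweep))
```

With the made-up sweep above, the run at 2000 RPS is excluded because its p99 exceeds 2 seconds, so the reported throughput is the best passing run.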

The workload was run on Ampere Altra Max M128-30, AMD EPYC 7763, and Intel Ice Lake 8380 (refer to the figures below for results). The same client system was used as the load generator for all platforms under test.

Benchmarking Results

Figs 1 and 2 show the test results for the social network web service application relative to Intel Ice Lake 8380 and AMD Milan 7763 single-socket servers. Ampere Altra Max M128-30 delivers up to 20% higher throughput in requests per second (RPS) and about 34% lower response times, measured as p99 latency, when compared to the high-end Intel Xeon and AMD EPYC series processors.

For large-scale cloud deployments, performance per watt (i.e., energy efficiency) is an important metric in addition to raw performance. Ampere Altra Max processors lead in performance with a significant perf/watt advantage: they deliver the same throughput while consuming half the power of the other platforms, resulting in 2.3x better power efficiency, as shown in Fig 3.
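The performance-per-watt comparison is a ratio of ratios, which the short sketch below makes explicit. The function names are hypothetical, and no measured wattages are assumed, since the brief does not publish them.

```python
def perf_per_watt(throughput_rps: float, power_watts: float) -> float:
    """Requests per second delivered per watt consumed."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return throughput_rps / power_watts


def relative_efficiency(throughput_ratio: float, power_ratio: float) -> float:
    """Efficiency of platform A relative to platform B.

    throughput_ratio is A's throughput divided by B's throughput;
    power_ratio is A's power draw divided by B's power draw.
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return throughput_ratio / power_ratio


if __name__ == "__main__":
    # Equal throughput at exactly half the power gives 2.0x.
    print(relative_efficiency(1.0, 0.5))
```

Equal throughput at exactly half the power works out to 2.0x, so a 2.3x result corresponds to a power ratio somewhat below one half, a throughput ratio above one, or both.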

Fig 1. P99 Latency @RPS=6000 (Lower is Better)

Fig 2. Throughput-Requests/Sec (SLA @ p99 latency<2secs)

Fig 3. Performance/Watt

Benchmarking Conclusions

Cloud-native is a modern approach to building and running software applications that makes use of the flexibility, scalability, and resilience of cloud computing. More and more developers are embracing a cloud-native, microservices-based architecture to develop and deploy applications like web services to the cloud.

The social network application suite used here simulates a real-world web service using many popular cloud native applications, such as NGINX, Redis, Memcached, and MongoDB, run as micro-services. All these micro-services, individually as well as part of a web service solution, perform exceptionally well using the large-scale compute resources of Ampere Altra Max.

Ampere Altra and Altra Max processors, with their high core count, large caches, predictable performance, and power efficiency, are the ideal platform for running web services in the cloud. In this set of tests, Altra Max delivered up to 23% higher throughput, measured in requests per second under maximum load, with 34% lower response times when compared to legacy x86-based systems. In addition, the most sustainable processing architecture in the industry delivers more performance at lower latencies while consuming less than half the power of the legacy x86 systems. This demonstrates that Ampere Altra and Altra Max processors are truly the best and most sustainable choice for cloud computing workloads in the market today.

Footnotes

All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.

System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.

©2022 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com

Created At : July 5th 2022, 5:12:43 pm

Last Updated At : December 19th 2024, 5:40:01 am
