DeathStarBench is an Open-Source Benchmark Suite
July 2022
Oracle Cloud Infrastructure (OCI) offers Ampere® Altra® compute instances on the new Cloud Native Ampere A1 platform. The Ampere A1 platform can be deployed as bare metal servers or flexible VM shapes, giving customers full control of their entire cloud stack. The Ampere A1 VM shapes provide flexible sizing from 1-80 Oracle CPUs (OCPUs) and 1-64 GB of memory per core, along with several key benefits such as deterministic performance, linear scalability, and a secure architecture with the best price-performance in the market.
Cloud compute instances like OCI Ampere A1 are widely used to deploy web services which deliver hosted content over the internet. Web services consist of discrete, reusable components known as microservices that are designed to easily integrate into any cloud environment. Containerized microservices are used to build distributed applications for web services that are fault tolerant and can scale out more effectively than monolithic applications. Ampere Altra processors are the perfect choice for deploying cloud native applications such as these due to the predictable and highly scalable nature of the architecture.
In this solution brief, we compare the performance of a web service that simulates a social network service like Twitter or Facebook on popular compute instances on OCI, using an application suite that is representative of a real-world microservices-based application. We have used DeathStarBench, an open-source application suite developed at Cornell University to study the performance of a simulated social network service. The OCI Ampere A1 compute platform is compared to similarly equipped AMD and Intel OCI compute instances.
More information on DeathStarBench Suite can be found at "An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems", Y. Gan et al., ASPLOS 2019.
The DeathStarBench Social Network application is an end-to-end service that implements a broadcast-style social network with unidirectional follow relationships. Users (clients) send requests over HTTP, which first reach an NGINX load balancer. Users can create posts embedded with text, media, links, and tags to other users; these posts are then broadcast to their followers. The service's backend uses Redis and Memcached for caching, and MongoDB for persistent storage of posts, profiles, media, and recommendations. Many of the microservices are written in C++ and Python.
The benchmark suite comes with a modified workload generator, WRK2, based on WRK. WRK2 uses LuaJIT scripts to perform HTTP request generation, response processing, and custom reporting.
Scalable: Designed from the ground up for cloud customers, Ampere Altra processors are ideal for cloud native uses such as web services, delivering 2x higher performance value for the highest throughput achieved under a predefined SLA.
Predictable Performance: The high core count of Ampere processors, together with compelling single-threaded performance and consistent frequency across all cores, helps deliver more than 4x performance value in P99 latency for web services deployed on Ampere A1 compute shapes on OCI.
Power Efficient: Industry-leading energy efficiency allows Ampere Altra shapes to reach competitive levels of raw performance while consuming much less power than the competition, often resulting in lower costs for Ampere-based shapes.
Oracle CPU (OCPU) is the unit of measurement for the number of processors allocated to a VM in OCI. It differs from the industry-standard vCPU:
One OCPU on an x86 system equals two vCPUs: the main CPU core and its simultaneous multithreading (SMT) sibling thread.
One OCPU on an Ampere A1 system equals one vCPU: a single main CPU core, since Ampere systems have dedicated single-threaded cores and do not use SMT.
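As a minimal sketch of the conversion described above (the helper name is ours, not part of any OCI API):

```python
# Sketch of the OCPU-to-vCPU relationship on OCI compute shapes.
# The function name and arch labels are illustrative, not an OCI API.

def ocpu_to_vcpu(ocpus: int, arch: str) -> int:
    """Return the vCPU count for a given OCPU allocation."""
    if arch == "x86_64":
        # One x86 OCPU = one physical core + its SMT sibling = 2 vCPUs.
        return ocpus * 2
    if arch == "aarch64":
        # Ampere A1 cores are single-threaded: one OCPU = one vCPU.
        return ocpus
    raise ValueError(f"unknown architecture: {arch}")

print(ocpu_to_vcpu(16, "x86_64"))   # 16 OCPUs on x86 -> 32 vCPUs
print(ocpu_to_vcpu(32, "aarch64"))  # 32 OCPUs on A1  -> 32 vCPUs
```

This is why the benchmark later pairs a 32-OCPU A1 shape against 16-OCPU x86 shapes: both expose 32 vCPUs.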
Connectivity: Scalable 1-40 Gb Ethernet
VNIC Options: Configurable with up to 24 VNICs
The DeathStarBench suite comes with Helm charts for deploying the Social Network application on a Kubernetes cluster. The default configuration uses pre-compiled x86 images from Docker Hub. The first step was to rebuild the images for aarch64 and update the Helm chart values to use the new image names and tags. Once built, the images were uploaded to the OCI container registry. The next step was to deploy an OCI Kubernetes cluster using the desired compute shape, number of CPUs, memory, and boot volume size.
| Shape | VM.Standard3.Flex | VM.Standard.E4.Flex | VM.Standard.A1.Flex |
|---|---|---|---|
| OCPU | 16 | 16 | 32 |
| Cores/Threads | 16/32 | 16/32 | 32/32 |
| Memory | 128 GB | 128 GB | 128 GB |
| Network b/w | 16 Gbps | 16 Gbps | 32 Gbps |
| Arch | x86_64 | x86_64 | aarch64 |
| OS | Oracle Linux 8.5 | Oracle Linux 8.5 | Oracle Linux 8.5 |
| Kubernetes Version | v1.21.5 | v1.21.5 | v1.21.5 |
The WRK2 load generator ran on a separate OCI VM instance in the same region and availability domain as the Social Network Kubernetes cluster. The Kubernetes cluster and the client VM instance were configured to use the same subnet within a shared Virtual Cloud Network, so clients access the cluster through an internal IP address. The OKE cluster uses Kubernetes v1.21.5 on Oracle Linux 8.5. The three OCI instances compared are listed in the configuration table above.
For each Kubernetes cluster, a single 32-vCPU node with 128 GB of memory was used. Since a single OCPU on x86 systems is worth two vCPUs, the benchmark configuration uses a 32-OCPU Ampere A1 VM and 16-OCPU VMs on the x86 systems.
A compose-post workload was used to simulate clients connecting to the Social Network application and creating posts. Each test was configured to run for a duration of 5 minutes using 100 threads and 1000 total connections distributed across the threads. The service load was gradually scaled up by increasing the Requests Per Second (RPS), starting from RPS=1000, then measured for P99 latency and throughput after each run. Each test was run 5 times to ensure minimal run-to-run variance. The highest P99 latency of the 5 runs was used for the results.
Since this test simulates end-to-end client requests, it is realistic to measure throughput under a specified Service Level Agreement (SLA). A 99th-percentile (P99) latency of 2 seconds was selected as the SLA ceiling for the systems under test, ensuring that 99 percent of requests receive a response in 2 seconds or less.
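The P99/SLA check above can be sketched in a few lines; the latency samples here are hypothetical stand-ins for the per-request latencies WRK2 reports:

```python
import math

def p99(latencies):
    """99th-percentile latency via the nearest-rank method."""
    s = sorted(latencies)
    # ceil(0.99 * n) gives the nearest-rank index (1-based).
    rank = math.ceil(0.99 * len(s))
    return s[rank - 1]

# Hypothetical 1000-sample run: mostly fast, a slow tail, one outlier.
samples = [0.05] * 989 + [1.8] * 10 + [2.5]

lat = p99(samples)
print(lat, lat <= 2.0)  # -> 1.8 True: this run meets the 2-second SLA
```

Note that a single 2.5 s outlier does not break the SLA, since P99 tolerates the slowest 1% of requests.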
The DeathStarBench source repository, along with licensing information, is available at: http://microservices.ece.cornell.edu/
Performance Value – The combined advantage of Performance and Price
| OCPU Shape | Price/OCPU* | Total Price/Hour | Price Advantage | Performance Advantage (P99 Latency) | Performance Value (P99 Latency) | Performance Advantage (Throughput) | Performance Value (Throughput) |
|---|---|---|---|---|---|---|---|
| Intel Standard3.Flex (Xeon 8358) | $0.04 | $0.04 * 16 = $0.64 | 1.0 | 1.00 | 1.00 | 1.0 | 1.00 |
| AMD E4.Flex (EPYC 7J13) | $0.025 | $0.025 * 16 = $0.40 | 1.6 | 0.93 | 1.50 | 1.0 | 1.60 |
| Ampere A1.Flex (Q80-30) | $0.01 | $0.01 * 32 = $0.32 | 2.0 | 2.14 | 4.28 | 1.1 | 2.20 |
Oracle Compute Pricing (June 2022)
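The Performance Value arithmetic in the table above can be reproduced with a short sketch (prices and ratios come from the table; the variable names are ours):

```python
# Performance Value = Price Advantage x Performance Advantage,
# with the Intel Standard3.Flex shape as the 1.0 baseline.

shapes = {
    # name: (price per OCPU in $/hr, OCPUs used in the test)
    "Intel Standard3.Flex": (0.04, 16),
    "AMD E4.Flex": (0.025, 16),
    "Ampere A1.Flex": (0.01, 32),
}

baseline = 0.04 * 16  # $0.64/hr for the Intel baseline shape

for name, (price, ocpus) in shapes.items():
    total = price * ocpus
    price_advantage = baseline / total
    print(f"{name}: ${total:.2f}/hr, price advantage {price_advantage:.1f}x")

# Ampere A1 latency performance value:
# 2.0 (price advantage) x 2.14 (latency advantage) = 4.28
print(round((baseline / (0.01 * 32)) * 2.14, 2))  # -> 4.28
```

The same formula yields the throughput column: 2.0 x 1.1 = 2.20 for Ampere A1.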
As seen in Figure 1, with the Social Network application on the OCI Ampere A1 instance, clients connecting to the web service receive responses twice as fast as on the x86 instances. Response times were measured as P99 latency captured at the maximum throughput (requests per second) allowed under the defined SLA. The P99 latency on Ampere A1 is half that measured on the Intel Standard3 and AMD E4 instances.
When comparing throughput, measured as the highest requests per second (RPS) delivered under the SLA of P99 latency below 2 seconds, Figure 2 shows that the Ampere A1 instance handles 10% more requests per second than the x86 instances while maintaining a response time of 2 seconds or less per request.
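The "highest throughput under SLA" metric amounts to a scan over per-run (RPS, P99) measurements; the data points below are hypothetical, standing in for the stepped-load runs described earlier:

```python
# Find the highest offered load whose measured P99 latency still meets
# the SLA. The (rps, p99_seconds) pairs are hypothetical sample results.

SLA_SECONDS = 2.0

runs = [
    (1000, 0.4),
    (1500, 0.7),
    (2000, 1.1),
    (2500, 1.9),
    (3000, 3.5),  # exceeds the SLA; this run does not count
]

under_sla = [rps for rps, p99 in runs if p99 <= SLA_SECONDS]
max_rps = max(under_sla)
print(max_rps)  # -> 2500, the reported throughput for this shape
```

Each shape's reported throughput is the `max_rps` found this way; the 10% advantage is the ratio of these values between A1 and the x86 shapes.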
Figure 3 shows the distribution of latencies for all responses in a single test run at RPS=2000 on each of the compute shapes under test. This load was selected to examine the systems under a higher stress level without exceeding the chosen SLA of a 2-second response time. Under this load, the standard deviation of latencies on the Ampere A1 instance is half that of the x86 instances. In addition, the latency distribution on Ampere A1 shows a higher peak, with more responses clustered around the mode, and a mean response time up to 3x lower. The graph illustrates the much more stable and predictable response profile of the Ampere A1 instance across all requests. This is a critical finding for any cloud native workload, and especially for the microservice-based applications tested here; this predictable profile is what makes Ampere Altra family processors remarkable for web-tier services like the social media simulation built with DeathStarBench.
In conclusion, the performance advantage of the Ampere instances, combined with the price advantages offered by Oracle Cloud, provides much higher value when using Ampere A1 compute instances for cloud deployments of web services and similar SaaS applications. Based on response times as well as maximum throughput delivered, the Ampere A1 instance offers up to a 4.3x performance value advantage over the nearest competitive shapes from AMD and Intel.
All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.
System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.
Price performance was calculated from the OCI Compute pricing list for A1 Flex VMs in March 2022. Refer to individual tests for core counts. Memory and storage are the same across all VMs and hence were not considered.
©2022 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com