AI Inference on Ampere Altra Max

Ampere—Empowering What’s Next

The Ampere® Altra® Max processor is a complete system-on-chip (SOC) solution that supports up to 128 high-performance cores, with an innovative architecture that delivers predictable high performance, linear scaling, and high energy efficiency. AI inference is a rapidly growing production workload in the cloud. While training deep neural networks requires a significant amount of GPU or similar hardware acceleration infrastructure, running inference on fully trained, deployment-ready AI models can be handled by CPUs in most situations. We demonstrate that Ampere Altra Max is ideal for running AI inference in the cloud, not only meeting latency and throughput requirements but also outperforming CPUs based on the x86 architecture as well as other Arm-based processors currently used in the cloud.

AI Inference on Ampere Altra Max

Ampere Altra Max processors deliver exceptional performance and power efficiency for AI workloads. Running AI inference on Ampere Altra Max requires no modification or translation of your neural network, regardless of the platform it was trained on, as long as it was built with one of the industry-standard AI development frameworks such as TensorFlow, PyTorch, or ONNX. Ampere’s optimized TensorFlow, PyTorch, and ONNX Runtime builds are available at no charge, either from our cloud suppliers or directly from Ampere.

Ampere Altra Max supports the fp32, fp16, and int8 data formats. Native fp16 support provides up to a 2x speed-up over fp32 models with no or negligible loss of accuracy. Converting a model from fp32 is straightforward and requires no retraining or rescaling of the weights. A model trained in fp16 on a GPU can run inference out of the box.
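To give a feel for why moving from fp32 to fp16 usually costs little accuracy, the sketch below rounds a handful of hypothetical weight values to half precision and measures the per-value rounding error. It uses only Python's standard `struct` module and is not Ampere's conversion tooling; the weight values are invented for illustration, and model-level accuracy still has to be checked on the actual network.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round a value to the nearest IEEE 754 half-precision value.

    struct raises OverflowError for magnitudes beyond the fp16 range (about 65504).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical weights (an assumption for this sketch), all inside fp16's normal range.
weights_fp32 = [to_fp32(w) for w in (0.0123, -0.5, 0.3333333, 1.75, -2.71828, 0.001)]
weights_fp16 = [to_fp16(w) for w in weights_fp32]

# fp16 keeps 11 significant bits, so round-to-nearest error is at most 2**-11 (relative).
FP16_ROUNDING_BOUND = 2.0 ** -11

worst_rel_error = max(
    abs(half - single) / abs(single)
    for single, half in zip(weights_fp32, weights_fp16)
)
print(f"worst relative error: {worst_rel_error:.2e} (bound {FP16_ROUNDING_BOUND:.2e})")
```

Values that are exactly representable in fp16, such as -0.5 and 1.75, survive the conversion unchanged; the others move by less than about 0.05% each.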

Ampere provides an ever-growing family of optimized, pre-trained models available for download to use for demos or to adapt and use in your applications.

Finally, Ampere Altra Max CPUs also work in tandem with NVIDIA GPUs for your training needs.

We have run a series of benchmarks following MLCommons guidelines to demonstrate and document the superior performance of Ampere Altra Max CPUs in many representative AI inference tasks, including computer vision and NLP applications.

Benefits of running AI Inference on Ampere Altra Max

Cloud Native: Designed from the ground up for cloud-native workloads, Ampere Altra Max delivers up to 2x higher inference performance than the best x86 servers and up to 4x higher than other Arm-based processors.
Industry Standard Platforms: Ampere Altra Max runs AI inference workloads developed on TensorFlow, PyTorch, or ONNX without modifications. Customers can run their applications by simply using our optimized frameworks, available free of charge from Ampere or our cloud partners.
Support for fp16 format: Ampere Altra Max natively supports the fp16 data format. Quantizing fp32-trained networks to fp16 is straightforward and results in no visible accuracy loss, while providing up to a 2x speed-up.
Scalable: With an innovative scale-out architecture, Ampere Altra Max processors have a high core count with compelling single-threaded performance. Combined with a consistent frequency across all cores, Ampere Altra Max delivers consistent socket-level performance that exceeds the best x86 servers. This leads to much higher resistance to noisy neighbors in multitenant environments.
Energy Efficiency: With up to 128 energy-efficient Arm cores, Ampere Altra Max has a 60% performance/watt advantage over leading x86 servers while also delivering better performance. Industry-leading performance and high energy efficiency give Ampere Altra Max a smaller carbon footprint and reduce Total Cost of Ownership (TCO).

Ampere Altra Max

128 64-bit cores at 3.0GHz
64KB i-Cache, 64KB d-Cache per core
1MB L2 Cache per core
16MB System Level Cache
Coherent mesh-based interconnect

Memory

8x72 bit DDR4-3200 channels
ECC and DDR4 RAS
Up to 16 DIMMs (2 DPC) and 4TB addressable memory
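As a rough worked example, the memory figures above imply a theoretical peak bandwidth that can be computed directly. The sketch assumes each 72-bit channel carries 64 data bits plus 8 ECC bits (the usual layout for ECC DDR4) and treats 4TB as 4096 GB; sustained bandwidth in practice will be lower than this theoretical peak.

```python
CHANNELS = 8                 # 8 DDR4 channels (from the list above)
MEGATRANSFERS_PER_S = 3200   # DDR4-3200: 3200 million transfers per second
CHANNEL_WIDTH_BITS = 72      # per-channel width (from the list above)
ECC_BITS = 8                 # assumption: 64 data bits + 8 ECC bits per channel
CORES = 128
MAX_DIMMS = 16
ADDRESSABLE_GB = 4 * 1024    # 4TB addressable memory, taken as 4096 GB

# Only the data bits contribute to usable bandwidth.
data_bytes_per_transfer = (CHANNEL_WIDTH_BITS - ECC_BITS) // 8

# Theoretical peak bandwidth in MB/s, then GB/s (decimal units).
peak_mb_per_s = CHANNELS * MEGATRANSFERS_PER_S * data_bytes_per_transfer
peak_gb_per_s = peak_mb_per_s / 1000

# Share of that peak available per core if all 128 cores stream at once.
per_core_gb_per_s = peak_gb_per_s / CORES

# DIMM capacity needed to reach the full addressable memory.
gb_per_dimm = ADDRESSABLE_GB // MAX_DIMMS

print(f"peak bandwidth: {peak_gb_per_s} GB/s")
print(f"per-core share: {per_core_gb_per_s} GB/s")
print(f"DIMM size for 4TB: {gb_per_dimm} GB")
```

Under these assumptions the peak works out to 204.8 GB/s per socket, 1.6 GB/s per core, and 256 GB DIMMs to populate the full 4TB.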

Connectivity

128 lanes of PCIe Gen4
Coherent multi-socket support
4x16 CCIX lanes

System

Armv8.2+, SBSA Level 4
Advanced Power Management

Performance

SPECrate®2017 Integer (estimated): 350

Inference Performance

Having run various AI workloads according to MLCommons benchmarking guidelines, we present some of our results below.

In computer vision, using SSD ResNet-34 for a typical object detection application, Ampere Altra Max outperforms the Intel Xeon 8375C by 2x, and the AMD EPYC 7J13 and Graviton by 4x, on latency in fp32 mode. In fp16, Altra Max extends its lead by an additional factor of two while maintaining the same accuracy. See Figure 1.

Fig 1. Object Detection Single-Stream Latency in FPS

Fig 2. ResNet Throughput (FPS)/Power (W)
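The two figure metrics can be related with a small sketch: single-stream latency is measured by timing one inference at a time, a tail percentile summarizes the runs, its reciprocal gives frames per second, and dividing by power gives throughput per watt. This is a minimal illustration, not the benchmark harness used for the results above. The `infer` callable, the sample timings, and the 200 W power figure are hypothetical, and the choice of the 90th percentile is an assumption about how single-stream results are typically summarized.

```python
import time
from typing import Any, Callable, List, Sequence


def single_stream_latencies(infer: Callable[[Any], Any], sample: Any, runs: int) -> List[float]:
    """Time `runs` back-to-back inferences, one sample at a time, in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def fps_from_latency(latency_s: float) -> float:
    """Single-stream throughput: one frame per measured latency."""
    return 1.0 / latency_s


def fps_per_watt(fps: float, watts: float) -> float:
    """Throughput normalized by power draw."""
    return fps / watts


# Hypothetical latencies in seconds (invented for illustration only).
measured = [0.010, 0.012, 0.011, 0.050, 0.010, 0.013, 0.011, 0.012, 0.010, 0.020]
p90 = percentile(measured, 90)
fps = fps_from_latency(p90)
print(f"p90 latency: {p90 * 1000:.1f} ms, {fps:.1f} FPS, "
      f"{fps_per_watt(fps, 200.0):.3f} FPS/W at a hypothetical 200 W")
```

To time a real model, pass its prediction function as `infer`, for example `single_stream_latencies(model_fn, image, runs=1000)`, where `model_fn` and `image` stand for whatever your framework provides.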

Summary

Ampere Altra Max processors are a complete system-on-chip (SOC) solution built for cloud-native workloads, designed to deliver exceptional performance and energy efficiency for AI inference. Ampere Altra Max delivers up to 4x the performance of the Intel® Xeon® Platinum 8375C and AMD EPYC 7J13.

Visit https://solutions.amperecomputing.com/solutions/ampere-ai to learn how to access Ampere systems from our partner Cloud Service Providers and experience the performance and power efficiency of Ampere processors.

Downloads

Ampere Optimized PyTorch

Ampere Optimized TensorFlow

Ampere Optimized ONNX Runtime

Benchmarking Configuration

The benchmarks were performed using TensorFlow on bare-metal single-socket servers with equivalent memory, networking, and storage configurations for the x86 platforms shown. Processors tested include AMD EPYC 7J13 “Milan” with TF 2.7 ZenDNN, Intel Xeon 8375C “Cascade Lake” with TF 2.7 DNNL, Intel Xeon 8380 “Ice Lake” with TF 2.7 DNNL, and Ampere Altra Max M128-30 with Ampere Optimized TF 2.7. The Arm64-based “Graviton 2”, available exclusively through AWS (c6g shape), was tested in a 64-core configuration.

Detailed benchmark conditions and configurations for each device type can be found here

Footnotes

All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.

System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.

©2022 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com

Created At : May 25th 2022, 4:50:08 pm

Last Updated At : December 9th 2024, 6:41:56 pm
