Llama 3 AI Inference on AmpereOne

Sustainable Large Language Model (LLM) Deployments at Scale

Overview

This workload brief focuses on the performance of AmpereOne® processors for inference tasks using Llama 3. The Llama 3 model represents a significant advancement in large language models (LLMs), offering powerful natural language processing capabilities. Despite its capabilities, running inference on such a massive model comes with substantial computational costs, particularly for enterprises needing to manage large volumes of data and queries. This makes high-performance hardware essential for optimizing throughput and ensuring that the model can deliver real-time responses without excessive energy consumption.

As enterprises increasingly seek to deploy LLMs at scale, there’s a growing trend toward smaller, optimized versions of these models, such as the 8-billion-parameter Llama 3 model (Llama 3 8B). These scaled-down models balance accuracy against computational efficiency, reducing the cost of running large-scale inference workloads. Combined with techniques such as quantization, pruning, or distillation, they help enterprises deliver rapid, low-latency inference while controlling infrastructure costs, making AI deployments more sustainable and cost-effective for everyday business operations.
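To make the cost argument concrete, the following minimal sketch estimates how much memory the weights of an 8-billion-parameter model need at different numeric precisions. It is illustrative arithmetic only: the helper name `weight_memory_gb` is hypothetical, the parameter count is the nominal "8B" figure, and this brief does not state which precision was used in the measurements.

```python
def weight_memory_gb(num_parameters: int, bits_per_weight: int) -> float:
    """Approximate storage for model weights only, in gigabytes (10^9 bytes).

    Ignores activations, the KV cache, and runtime overhead.
    """
    return num_parameters * bits_per_weight / 8 / 1e9


# Nominal parameter count for an "8B" model (assumption: the exact count differs slightly).
PARAMS_8B = 8_000_000_000

for label, bits in [("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(PARAMS_8B, bits):.1f} GB")
```

Halving the bits per weight halves the weight footprint, which is one reason quantized models are cheaper to serve.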

Results and Key Findings

Figure 1 illustrates the socket-level efficiency (performance/Watt) of the AmpereOne A192-26X processor compared to AMD EPYC 9754 (Bergamo) and 9654 (Genoa) when running the Llama 3 8B model (normalized to AMD EPYC 9654 as the baseline).

Fig. 1: Socket-level efficiency (perf/Watt) comparison of AmpereOne, AMD EPYC 9754 (Bergamo), and AMD EPYC 9654 (Genoa). AMD EPYC 9654 (Genoa) is used as the baseline of 1.00.
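The normalization used in this comparison can be sketched as follows: efficiency is performance divided by socket power, and each result is then divided by the baseline's efficiency so that the baseline reads 1.00. This is a minimal sketch under stated assumptions, not Ampere's test harness. The class and function names are hypothetical, tokens per second is assumed as the performance metric, and the numbers are placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketResult:
    name: str
    tokens_per_second: float   # assumed performance metric
    socket_power_watts: float  # average socket power during the run

    @property
    def perf_per_watt(self) -> float:
        return self.tokens_per_second / self.socket_power_watts


def normalize(results: list[SocketResult], baseline_name: str) -> dict[str, float]:
    """Express each result's perf/Watt relative to the named baseline (baseline = 1.00)."""
    baseline = next(r for r in results if r.name == baseline_name)
    return {r.name: r.perf_per_watt / baseline.perf_per_watt for r in results}


# Placeholder values for illustration only; NOT measured data from this brief.
results = [
    SocketResult("baseline", tokens_per_second=100.0, socket_power_watts=400.0),
    SocketResult("candidate", tokens_per_second=120.0, socket_power_watts=300.0),
]

relative = normalize(results, "baseline")
for name, value in relative.items():
    print(f"{name}: {value:.2f}")
```

A value above 1.00 means the processor delivers more inference work per Watt than the baseline. A higher score can come from higher throughput, lower power draw, or both.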

Conclusion

AmpereOne Delivers Leading Llama 3 Energy Efficiency

This comparison underscores AmpereOne’s advantage in delivering higher inference efficiency, making it ideal for enterprises aiming to deploy LLMs at scale while controlling energy costs.

About Ampere and AmpereOne

Ampere Computing focuses on delivering high-performance, power-efficient processors for Cloud-Native applications. The AmpereOne processor, with its innovative Arm architecture and up to 192 cores, is designed to meet the demands of modern AI workloads. The integration of Ampere Optimized AI Frameworks (AIO) and Ampere Model Library (AML) further enhances AmpereOne’s AI inference capabilities and eases the transition from legacy x86 architectures.

Footnotes

All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.

System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.

©2024 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com