GCC Guide for Ampere Processors

How to effectively use GCC for optimizing applications running on Ampere Processors

UPDATED: May, 2025

Introduction

This paper describes how to optimize applications for Ampere Processors by effectively using the GCC compiler. When attempting to optimize an application, it's essential to measure whether a potential optimization, including compiler options, actually improves performance. Using advanced compiler options may result in better runtime performance, potentially at the cost of increased compile time, making code more difficult to debug, and often increased binary size. Why compiler options affect performance is beyond the scope of this paper, although the short answer is that the interaction between code generation and modern processor architectures is very complex. Another important point is that different processors may benefit from different compiler options due to variations in computer architecture and microarchitecture. Repeated experimentation with optimizations is key to performance success.

How to measure an application’s performance to determine its limiting factors, along with optimization strategies, has already been covered in previously published articles that apply to Ampere Altra and AmpereOne Processors. The paper The First 10 Questions to Answer While Running on Ampere Altra-Based Instances describes what performance data to collect to understand the entire system’s performance. A Performance Analysis Methodology for Optimizing Ampere® Altra® Family Processors explains how to optimize effectively and efficiently using a data-driven approach.

This paper first summarizes the most common performance-related GCC options and describes how they affect applications. It then presents a case study that uses GCC options to improve the performance of the MySQL database on Ampere Processors. Similar strategies have been used effectively to optimize other software running on Ampere Processors.

GCC Recommendations

The GCC compiler provides many options that can improve application performance. See the GCC website for details. To generate code that takes advantage of all the performance features available in Ampere Processors, use the appropriate gcc -mcpu option, summarized in Table 1.

To use the gcc -mcpu option, either name the CPU model explicitly or tell GCC to use the CPU model of the machine it is running on via -mcpu=native. Note that on legacy x86-based systems, gcc -mcpu is a deprecated synonym for -mtune, while on Arm-based systems gcc -mcpu is fully supported. See Arm’s guide Compiler flags across architectures: -march, -mtune, and -mcpu for details.

In summary, whenever possible, use only -mcpu and avoid -march and -mtune when compiling for Arm.

Setting the -mcpu option:

-mcpu=ampere1a: Generate code that will run on Ampere AmpereOne AC04 Processors. AmpereOne AC04 is the next generation of Cloud Native Processors from Ampere, extending the family of high-performance processors to new industry-leading core counts. Note that this can generate code that will not run on Ampere Altra, Altra Max, and AmpereOne AC03 Processors. This option first appeared in GCC 14 and was backported to GCC 13.1, GCC 12.3, GCC 11.4, and GCC 10.5.
-mcpu=ampere1: Generate code that will run on Ampere AmpereOne Processors, including AmpereOne AC03 and AmpereOne AC04. Note that this can generate code that will not run on Ampere Altra and Altra Max Processors. This option first appeared in GCC 12.1 and was backported to GCC 11.3 and GCC 10.5.
-mcpu=neoverse-n1: Generate code that will run on Ampere Altra, Ampere Altra Max as well as Ampere AmpereOne. While using this option for code that will run on Ampere AmpereOne is supported, it will potentially not take advantage of all the new performance features available. Note, GCC version 9.1 or higher is required to enable CPU specific tunings for Ampere Altra and Ampere Altra Max processors.
-mcpu=native: Generate code setting the CPU model based on the CPU GCC is running on. Note, GCC version 9.1 or higher is required to enable CPU specific tunings for Ampere Processors.

The -mcpu=native option is potentially easier to use, but it can cause problems if the executable, shared library, or object file is used on a different system. If the build was done on an Ampere AmpereOne Processor, the code may not run on an Ampere Altra or Altra Max Processor because the generated code may include Armv8.6+ instructions that are supported only on Ampere AmpereOne Processors. If the build was done on an Ampere Altra or Altra Max Processor, GCC will not take advantage of the latest performance improvements available on Ampere AmpereOne Processors. This is a general issue when building code to take advantage of performance features for any architecture.

Table 1 summarizes which GCC versions support each Ampere Processor -mcpu value, along with the required binutils version.

Processor	ISA	-mcpu Value	Binutils Version	GCC 9	GCC 10	GCC 11	GCC 12	GCC 13	GCC 14	LLVM
Ampere Altra	Armv8.2+	neoverse-n1	≥ 2.33	≥ 9.1	ALL	ALL	ALL	ALL	ALL	≥ 16.0
Ampere Altra Max	Armv8.2+	neoverse-n1	≥ 2.33	≥ 9.1	ALL	ALL	ALL	ALL	ALL	≥ 16.0
AmpereOne AC03	Armv8.6+	ampere1	≥ 2.34	N/A	≥ 10.5	≥ 11.3	≥ 12.1	ALL	ALL	≥ 16.0
AmpereOne AC04 and AC04_1	Armv8.6+	ampere1a	≥ 2.34	N/A	≥ 10.5	≥ 11.4	≥ 12.3	ALL	ALL	≥ 16.0

Table 1: GCC and LLVM versions that support each Ampere Processor -mcpu value, along with the required binutils version.
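A build script can encode the GCC columns of Table 1 as a small lookup to check whether an installed compiler accepts a given -mcpu value. The Python sketch below is illustrative only: the names MIN_GCC and gcc_supports are hypothetical, the version data is copied from Table 1, and the handling of GCC series newer than 14 (assumed to keep support) is an assumption rather than something the table states.

```python
# Minimum GCC release per major series for each -mcpu value, copied from Table 1.
# None means "N/A" in the table; (N, 0) stands for "ALL" releases of series N.
MIN_GCC = {
    "neoverse-n1": {9: (9, 1), 10: (10, 0), 11: (11, 0),
                    12: (12, 0), 13: (13, 0), 14: (14, 0)},
    "ampere1":     {9: None, 10: (10, 5), 11: (11, 3),
                    12: (12, 1), 13: (13, 0), 14: (14, 0)},
    "ampere1a":    {9: None, 10: (10, 5), 11: (11, 4),
                    12: (12, 3), 13: (13, 0), 14: (14, 0)},
}


def gcc_supports(mcpu: str, version: str) -> bool:
    """Return True if Table 1 lists `version` (e.g. "11.4.0") as supporting `mcpu`."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    series = MIN_GCC[mcpu]
    if major not in series:
        # Assumption: series newer than the table keep support, older ones lack it.
        return major > max(series)
    minimum = series[major]
    return minimum is not None and (major, minor) >= minimum
```

A version string of the expected form can typically be obtained from gcc -dumpfullversion. For example, gcc_supports("ampere1a", "11.3.0") is False under this table, which suggests falling back to -mcpu=ampere1 or upgrading the compiler.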

Our recommendation is to use the gcc -mcpu option with the appropriate value described above (-mcpu=ampere1a, -mcpu=ampere1, -mcpu=neoverse-n1, or -mcpu=native) together with -O2 to establish a performance baseline, then explore additional optimization options and measure whether each one improves performance compared to the baseline.
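One way to make the baseline comparison repeatable is to time each build several times and compare medians. The Python sketch below assumes the workload can be run as a single command; the function names time_command and percent_improvement, and the binary names in the usage note, are hypothetical and not part of any Ampere or GCC tooling.

```python
import subprocess
import time
from statistics import median


def time_command(argv, repeats=5):
    """Run argv `repeats` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def percent_improvement(baseline_seconds, candidate_seconds):
    """Percent reduction in median run time; positive means the candidate is faster."""
    base = median(baseline_seconds)
    cand = median(candidate_seconds)
    return (base - cand) / base * 100.0
```

For instance, if ./app-O2 was built with -O2 -mcpu=native and ./app-O3 with -O3 -mcpu=native, then percent_improvement(time_command(["./app-O2"]), time_command(["./app-O3"])) reports how much faster or slower the -O3 build ran. Using the median rather than the mean reduces the influence of occasional outlier runs.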

Summary of common GCC options:

-mcpu: Recommended when building on Ampere Processors to enable processor-specific tuning and optimizations. See the discussion under ‘Setting the -mcpu option’ above for details.
-Os: Optimize to reduce code size. Potentially useful if your application is limited by instruction fetch.
-O2: Considered the standard GCC optimization level and a good baseline for comparison with other GCC options.
-O3: Adds further optimizations that generate more efficient code for loops. Useful to try if your application’s run time is dominated by loops.
Profile Guided Optimization (PGO), -fprofile-generate and -fprofile-use: Generate profile data that the compiler uses to make potentially better decisions on optimizations such as inlining, loop optimizations, and default branches. This is considered an advanced optimization because it requires changes to the build system; see the GCC Profile Guided Optimization section below.
Link-Time Optimization (LTO), -flto: Enable link-time optimizations, allowing the compiler to optimize across individual source files. This enables functions to be inlined across source files, among other compiler optimizations. This is also considered an advanced optimization and potentially requires changes to the build system. This option increases overall build time, which can be dramatic for large applications. It is possible to use LTO only on performance-critical source files to limit the increase in build time.

GCC Profile Guided Optimization

This section provides an overview of GCC’s Profile Guided Optimization (PGO) and a case study of optimizing MySQL with PGO. Profile Guided Optimization enables GCC to make better optimization decisions, including optimizing branches, reordering code blocks, inlining functions, and optimizing loops via loop unrolling, loop peeling, and vectorization. Using PGO requires modifying the build environment to do a three-step build:

1. Build the application with profiling instrumentation using gcc -fprofile-generate.
2. Run the application on representative workloads to generate the profile data.
3. Rebuild the application using the profile data with gcc -fprofile-use.
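The three-step build can be sketched as a small helper that returns the commands in order. This Python example is a minimal illustration under stated assumptions: the function name pgo_build_commands, the default profile directory pgo-data, and the source, output, and training-command names are all hypothetical. It passes an explicit directory to -fprofile-generate and -fprofile-use, adds -fprofile-update=atomic for multi-threaded programs as the GCC manual recommends, and deliberately uses the same output name for both builds so that the profile file names GCC derives are the same in each.

```python
def pgo_build_commands(sources, output, training_cmd,
                       mcpu="native", profile_dir="pgo-data", threaded=True):
    """Return [instrumented build, training run, optimized build] as argv lists."""
    common = ["gcc", "-O2", f"-mcpu={mcpu}"]
    # -fprofile-update=atomic avoids lost counter updates in multi-threaded runs.
    update = ["-fprofile-update=atomic"] if threaded else []
    # Step 1: build with instrumentation that writes profile data to profile_dir.
    instrumented = (common + [f"-fprofile-generate={profile_dir}"] + update
                    + list(sources) + ["-o", output])
    # Step 3: rebuild, letting GCC read the profile data collected in step 2.
    optimized = (common + [f"-fprofile-use={profile_dir}"]
                 + list(sources) + ["-o", output])
    # Step 2 (the middle entry) is the caller's representative workload.
    return [instrumented, list(training_cmd), optimized]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in sequence. A real project would more likely thread the same flags through its existing build system (for example via CFLAGS and LDFLAGS) than invoke gcc directly.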

A challenge of using PGO is the extremely high performance overhead in step 2 above. Because an application built with gcc -fprofile-generate runs slowly, it may not be practical to run it on systems operating in a production environment. For additional details, see the Program Instrumentation Options section of the GCC manual for building applications with run-time instrumentation, and the Options That Control Optimization section for rebuilding with the generated profile information.

As described in the GCC manual, -fprofile-update=atomic is recommended for multi-threaded applications, and can improve performance by collecting improved profile data.

When to Use PGO?

PGO lets GCC optimize applications better by supplying additional information, such as how often branches are taken versus not taken and measured loop trip counts. PGO is a useful optimization to try to see if it improves performance. Performance signatures where PGO may help include applications with a significant percentage of branch mispredictions, which can be measured using the perf utility to read the CPU’s Performance Monitoring Unit (PMU) counter BR_MIS_PRED_RETIRED. Large numbers of branch mispredictions lead to a high percentage of front-end stalls, which can be measured by the STALL_FRONTEND PMU counter. Applications with a high L2 instruction cache miss rate may also benefit from PGO, possibly because of mispredicted branches. In summary, a large percentage of branch mispredictions, CPU front-end stalls, and L2 instruction cache misses are performance signatures where PGO can improve performance.
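Raw PMU counts are easier to interpret as ratios. The Python sketch below turns the two counters named above into a branch misprediction rate and a front-end stall fraction. It assumes two additional Arm PMU events as denominators, BR_RETIRED and CPU_CYCLES, which this paper does not mention; the function name pmu_ratios is hypothetical, and the exact event spellings accepted by perf depend on the kernel and platform.

```python
def pmu_ratios(counts):
    """Derive ratios from a dict of raw PMU event counts collected in one run.

    Assumed keys: BR_MIS_PRED_RETIRED, BR_RETIRED, STALL_FRONTEND, CPU_CYCLES.
    """
    return {
        # Fraction of retired branches that were mispredicted.
        "branch_mispredict_rate":
            counts["BR_MIS_PRED_RETIRED"] / counts["BR_RETIRED"],
        # Fraction of cycles in which the front end could not supply operations.
        "frontend_stall_fraction":
            counts["STALL_FRONTEND"] / counts["CPU_CYCLES"],
    }
```

Comparing these ratios between the default build and the PGO build on the same workload gives a more direct view of whether PGO addressed the signatures described above than throughput alone.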

MySQL database GCC PGO Case Study

MySQL is the world’s most popular open-source database and, because of its very large binary, is an ideal candidate for GCC PGO. Without profile information, GCC cannot correctly predict the many different code paths executed. Using PGO greatly reduces branch mispredictions, the L2 instruction cache miss rate, and CPU front-end stalls on the Ampere Altra Max Processor. The paper MYSQL ON AMPEREONE® A192-32X describes the throughput and latency of running MySQL on AmpereOne A192-32X Processors.

Summarizing how MySQL is optimized using GCC PGO:

Sysbench was used to evaluate MySQL performance
GCC PGO was trained using the MySQL MTR (mysql-test-run) test suite
Sysbench’s oltp_point_select and oltp_read_only tests were used to measure the performance of the PGO build compared to the default build
The number of threads was then varied from 1 to 1024, giving an average speedup of 29% for the oltp_point_select test and 20% for the oltp_read_only test on an Ampere Altra Max M128-30 processor
With 64 threads, PGO improved MySQL’s throughput by 32%

Additional details can be found on the Ampere Developer’s website in the MySQL Tuning Guide.

Summary

Optimizing applications requires experimenting with different strategies to determine what works best for each application and workload. This paper shows how to use GCC compiler optimizations effectively to generate high-performing applications on Ampere Processors, and references previous papers that describe how to measure performance and choose optimization strategies. It provides recommendations for common GCC options and shows the benefits these options give to applications. It highlights the -mcpu option as the easiest way to generate code that takes advantage of all the features supported by Ampere Cloud Native Processors. We showed how to use GCC options to optimize the MySQL database and provided recommendations on other GCC optimizations that can help improve the performance of applications running on Ampere Processors.

Built for sustainable cloud computing, Ampere’s Cloud Native Processors deliver predictable high performance, platform scalability, and power efficiency unprecedented in the industry. We invite you to learn more about our developer efforts and find best practices at https://developer.amperecomputing.com and join the conversation at https://community.amperecomputing.com/

Created At : September 20th 2023, 4:59:51 pm

Last Updated At : May 13th 2025, 4:40:02 pm
