How to effectively use GCC for optimizing applications running on Ampere Processors
UPDATED: May, 2025
This paper describes how to optimize applications for Ampere Processors by effectively using the GCC compiler. When attempting to optimize an application, it's essential to measure whether a potential optimization, including a change in compiler options, actually improves performance. Advanced compiler options may yield better runtime performance, potentially at the cost of longer compile times, code that is harder to debug, and often a larger binary. Why compiler options affect performance is beyond the scope of this paper; the short answer is that the interaction between code generation and modern processor architectures is very complex. Different processors may also benefit from different compiler options because of variations in architecture and microarchitecture. Repeated experimentation with optimizations is key to performance success.
Previously published articles that apply to Ampere Altra and AmpereOne Processors already cover how to measure an application’s performance to determine its limiting factors, along with optimization strategies. The paper The First 10 Questions to Answer While Running on Ampere Altra-Based Instances describes what performance data to collect to understand the entire system’s performance. A Performance Analysis Methodology for Optimizing Ampere® Altra® Family Processors explains how to optimize effectively and efficiently using a data-driven approach.
This paper first summarizes the most common performance-related GCC options and describes how they affect applications. It then presents case studies that use GCC options to improve the performance of the MySQL database on Ampere Processors. Similar strategies have been used effectively to optimize other software running on Ampere Processors.
The GCC compiler provides many options that can improve application performance. See the GCC website for details. To generate code that takes advantage of all the performance features available in Ampere Processors, use the appropriate gcc -mcpu option, summarized in Table 1.
To use the gcc -mcpu option, either name the CPU model explicitly or pass -mcpu=native to have GCC target the CPU model of the machine it is running on. Note that on legacy x86-based systems gcc -mcpu is a deprecated synonym for -mtune, whereas on Arm-based systems gcc -mcpu is fully supported. See Arm’s guide to Compiler flags across architectures: -march, -mtune, and -mcpu for details.
In summary, whenever possible, use only -mcpu and avoid -march and -mtune when compiling for Arm.
Setting the -mcpu option:
Using -mcpu=native is potentially easier, but it can cause problems if the resulting executable, shared library, or object file is used on a different system. If the build was done on an AmpereOne Processor, the code may not run on an Ampere Altra or Altra Max Processor because the generated code may include Armv8.6+ instructions that are supported only on AmpereOne Processors. If the build was done on an Ampere Altra or Altra Max Processor, GCC will not take advantage of the latest performance improvements available on AmpereOne Processors. This is a general issue when building code to take advantage of performance features for any architecture.
Table 1 summarizes which GCC versions support the -mcpu value for each Ampere Processor, along with the required binutils version.
Processor | ISA | -mcpu Value | Binutils Version | GCC 9 | GCC 10 | GCC 11 | GCC 12 | GCC 13 | GCC 14 | LLVM |
---|---|---|---|---|---|---|---|---|---|---|
Ampere Altra | Armv8.2+ | neoverse-n1 | ≥ 2.33 | ≥ 9.1 | ALL | ALL | ALL | ALL | ALL | ≥ 16.0 |
Ampere Altra Max | Armv8.2+ | neoverse-n1 | ≥ 2.33 | ≥ 9.1 | ALL | ALL | ALL | ALL | ALL | ≥ 16.0 |
AmpereOne AC03 | Armv8.6+ | ampere1 | ≥ 2.34 | N/A | ≥ 10.5 | ≥ 11.3 | ≥ 12.1 | ALL | ALL | ≥ 16.0 |
AmpereOne AC04 and AC04_1 | Armv8.6+ | ampere1a | ≥ 2.34 | N/A | ≥ 10.5 | ≥ 11.4 | ≥ 12.3 | ALL | ALL | ≥ 16.0 |
_Table 1: GCC, binutils, and LLVM versions that support the -mcpu value for each Ampere Processor._
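As an illustration of how Table 1 might be used in a build script, the following minimal sketch checks whether a given GCC release accepts a given -mcpu value. The data is transcribed from the GCC columns of Table 1; the names `MIN_MINOR` and `gcc_supports` are hypothetical and not part of any Ampere or GCC tooling.

```python
# Minimum minor release per GCC major version, transcribed from Table 1.
# A missing major means "ALL" in the table; None means "N/A" (unsupported).
MIN_MINOR = {
    "neoverse-n1": {9: 1},
    "ampere1": {9: None, 10: 5, 11: 3, 12: 1},
    "ampere1a": {9: None, 10: 5, 11: 4, 12: 3},
}

# Table 1 only covers GCC 9 through GCC 14.
TABLE_MAJORS = range(9, 15)


def gcc_supports(mcpu: str, gcc_version: str) -> bool:
    """Return True if Table 1 lists gcc_version as supporting -mcpu=<mcpu>."""
    parts = gcc_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if major not in TABLE_MAJORS:
        raise ValueError(f"GCC {major} is not covered by Table 1")
    if mcpu not in MIN_MINOR:
        raise KeyError(f"-mcpu={mcpu} is not listed in Table 1")
    required = MIN_MINOR[mcpu].get(major, 0)  # 0 stands for "ALL"
    return required is not None and minor >= required
```

The version string could be taken from the output of `gcc -dumpfullversion`. A build script might fall back to a value the installed compiler does support, or stop with a clear message, when the check returns False.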
Our recommendation is to use the gcc -mcpu option with the appropriate value described above (-mcpu=neoverse-n1, -mcpu=ampere1, -mcpu=ampere1a or -mcpu=native) together with -O2 to establish a performance baseline, then explore additional optimization options and measure whether each one improves performance compared to the baseline.
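The baseline-then-compare workflow can be sketched as a small helper that compares repeated timings of a candidate build against the baseline build. This is only an illustration: the function names and the 2% threshold are assumptions, and the timings would come from whatever benchmark is representative of the application.

```python
import statistics


def speedup(baseline_seconds, candidate_seconds):
    """Median baseline time divided by median candidate time.

    A result above 1.0 means the candidate build ran faster.
    """
    return statistics.median(baseline_seconds) / statistics.median(candidate_seconds)


def keep_option(baseline_seconds, candidate_seconds, min_gain=0.02):
    """Keep a candidate option only if it beats the baseline by min_gain.

    min_gain is an assumed noise margin (2% here), not a recommended value.
    """
    return speedup(baseline_seconds, candidate_seconds) >= 1.0 + min_gain
```

Using the median of several runs rather than a single run reduces the chance that run-to-run noise is mistaken for an improvement.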
Summary of common GCC options:
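This section provides an overview of GCC’s Profile Guided Optimization (PGO) and a case study of optimizing MySQL with PGO. Profile Guided Optimization enables GCC to make better optimization decisions, including optimizing branches, reordering code blocks, inlining functions, and optimizing loops via loop unrolling, loop peeling and vectorization. Using PGO requires modifying the build environment to do a 3-part build: build the application with instrumentation (gcc -fprofile-generate), run the instrumented application on a representative workload to collect profile data, and rebuild the application using the collected profile (gcc -fprofile-use).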
A challenge of using PGO is the extremely high performance overhead of the second part of the build, in which the instrumented application is run to collect profile data. Because an application built with gcc -fprofile-generate runs slowly, it may not be practical to run it on systems operating in a production environment. For additional details, see the Program Instrumentation Options section of the GCC manual for building applications with run-time instrumentation, and the Options That Control Optimization section for rebuilding with the generated profile information.
As described in the GCC manual, -fprofile-update=atomic is recommended for multi-threaded applications, and can improve performance by collecting improved profile data.
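The 3-part PGO build can be sketched as a helper that assembles the three command lines without running them. This is a minimal sketch under stated assumptions: `pgo_commands`, the single-command compile-and-link layout, and the example file names are hypothetical, while -fprofile-generate, -fprofile-use and -fprofile-update=atomic are the GCC options discussed above.

```python
import shlex


def pgo_commands(sources, output, mcpu, train_cmd, multithreaded=True):
    """Return the three PGO steps as argument lists: instrument, train, rebuild."""
    common = ["gcc", "-O2", f"-mcpu={mcpu}"]

    # Part 1: build with instrumentation.
    generate = common + ["-fprofile-generate"]
    if multithreaded:
        # Recommended by the GCC manual for multi-threaded applications.
        generate.append("-fprofile-update=atomic")
    generate += ["-o", output] + list(sources)

    # Part 3: rebuild using the profile data written during the training run.
    use = common + ["-fprofile-use", "-o", output] + list(sources)

    # Part 2 is simply the training workload supplied by the caller.
    return [generate, list(train_cmd), use]


if __name__ == "__main__":
    # Hypothetical single-file application and training command.
    for step in pgo_commands(["app.c"], "app", "neoverse-n1", ["./app", "--train"]):
        print(shlex.join(step))
```

A real project would pass the same options through its build system (for example via CFLAGS and LDFLAGS) rather than a single gcc invocation, and the rebuild must be able to find the profile data files written by the training run.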
With PGO, GCC can better optimize applications because it is given additional information, such as how often branches are taken versus not taken and measured loop trip counts. PGO is a useful optimization to try to see whether it improves performance. Performance signatures where PGO may help include applications with a significant percentage of branch mispredictions, which can be measured using the perf utility to read the CPU’s Performance Monitoring Unit (PMU) counter BR_MIS_PRED_RETIRED. Large numbers of branch mispredictions lead to a high percentage of front-end stalls, which can be measured by the STALL_FRONTEND PMU counter. Applications with a high L2 instruction cache miss rate may also benefit from PGO, possibly because those misses are related to mispredicted branches. In summary, a large percentage of branch mispredictions, CPU front-end stalls and L2 instruction cache misses are performance signatures where PGO can improve performance.
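The two counters named above become easier to interpret when turned into ratios. The following sketch assumes the raw counts have already been collected (for example with perf stat) and stored in a dictionary. BR_RETIRED and CPU_CYCLES are used as denominators on the assumption that those events are also available on the system; `pgo_signature` is a hypothetical name.

```python
def pgo_signature(counts):
    """Convert raw PMU counts into the ratios discussed in the text.

    counts maps PMU event names to totals from one measurement run.
    """
    return {
        # Fraction of retired branches that were mispredicted.
        "branch_mispredict_rate": counts["BR_MIS_PRED_RETIRED"] / counts["BR_RETIRED"],
        # Fraction of cycles in which the front end was stalled.
        "frontend_stall_fraction": counts["STALL_FRONTEND"] / counts["CPU_CYCLES"],
    }
```

Comparing these ratios before and after a PGO rebuild, on the same workload, shows whether PGO addressed the signature it was expected to address.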
MySQL is the world’s most popular open-source database and, because of its very large binary, an ideal candidate for GCC PGO. Without profile information, GCC cannot correctly predict which of the many different code paths will be executed. Using PGO greatly reduces branch mispredictions, the L2 instruction cache miss rate and CPU front-end stalls on the Ampere Altra Max Processor. The paper MYSQL ON AMPEREONE® A192-32X describes the throughput and latency of running MySQL on AmpereOneX Processors.
Summarizing how MySQL is optimized using GCC PGO:
Additional details can be found on the Ampere Developer’s website in the MySQL Tuning Guide.
Optimizing applications requires experimenting with different strategies to determine what works best for each application and workload. This paper shows how to use GCC compiler optimizations effectively to generate high-performing applications on Ampere Processors, and it references previous papers that explain how to measure performance and choose optimization strategies. It provides recommendations for common GCC options and shows the benefits these options give to applications. It highlights the -mcpu option as the easiest way to generate code that takes advantage of all the features supported by Ampere Cloud Native Processors. It also shows how to use GCC options to optimize the MySQL database and recommends other GCC optimizations that can help improve the performance of applications running on Ampere Processors.
Built for sustainable cloud computing, Ampere’s Cloud Native Processors deliver predictable high performance, platform scalability, and power efficiency unprecedented in the industry. We invite you to learn more about our developer efforts and find best practices at https://developer.amperecomputing.com and join the conversation at https://community.amperecomputing.com/