GCC Guide for Ampere Processors
How to effectively use GCC for optimizing applications running on Ampere Processors
This paper describes how to use GNU Compiler Collection (GCC) options effectively to optimize application performance on Ampere Processors. When attempting any optimization, including a change of compiler options, it is essential to measure whether it actually improves performance. Advanced compiler options may produce better runtime performance, potentially at the cost of longer compile times, harder debugging, and often larger binaries. Why compiler options affect performance is beyond the scope of this paper, although the short answer is that code generation, modern processor architectures, and how they interact are very complicated. Another important point is that different processors may benefit from different compiler options because of variations in the computer architecture and the specific microarchitecture. Repeated experimentation with optimizations is key to performance success.
How to measure an application’s performance to determine its limiting factors, along with optimization strategies, has been covered in previously published articles. The paper The First 10 Questions to Answer While Running on Ampere Altra-Based Instances describes what performance data to collect to understand the entire system’s performance. A Performance Analysis Methodology for Optimizing Ampere Altra Family Processors explains how to optimize effectively and efficiently using a data-driven approach.
This paper first summarizes the most common GCC options with a description of how these options affect applications. It then presents case studies that use GCC options to improve the performance of the VP9 video encoding software and the MySQL database on Ampere Processors. Similar strategies have been used effectively to optimize other software running on Ampere Processors.
The GCC compiler provides many options that can improve application performance. See the GCC website for details. To generate code that takes advantage of all the performance features available in Ampere Processors, use the gcc -mcpu option.
To use the gcc -mcpu option, either set the CPU model explicitly or pass -mcpu=native to have GCC detect the CPU model of the machine it is running on. Note that on legacy x86-based systems, gcc -mcpu is a deprecated synonym for -mtune, while on Arm-based systems -mcpu is fully supported. See Arm’s guide Compiler flags across architectures: -march, -mtune, and -mcpu for details.
In summary, whenever possible, use only -mcpu and avoid -march and -mtune when compiling for Arm. Below is a case study highlighting performance gains by setting the gcc -mcpu option with VP9 video encoding software.
Setting the -mcpu option:
Using -mcpu=native is potentially easier to use, although it has a potential problem if the executable, shared library, or object file are used on a different system. If the build was done on an Ampere AmpereOne Processor, the code may not run on an Ampere Altra or Altra Max Processor because the generated code may include Armv8.6+ instructions supported on Ampere AmpereOne Processors. If the build was done on an Ampere Altra or Altra Max processor, GCC will not take advantage of the latest performance improvements available on Ampere AmpereOne Processors. This is a general issue when building code to take advantage of performance features for any architecture.
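One way to check which CPU model -mcpu=native resolves to on a given build machine is to ask GCC to print its target options. This is a sketch; the exact output format depends on the GCC version, and the command assumes an Arm-based system.

```shell
# Print the target options GCC would use for this machine.
# On an Ampere Altra or Altra Max system, the -mcpu line should show a
# neoverse-n1-based value; exact formatting varies by GCC version.
gcc -mcpu=native -Q --help=target | grep -i -- '-mcpu'
```

Comparing this output between the build machine and the deployment machine is a quick way to spot the portability mismatch described above.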
The following table lists the GCC versions that support each Ampere Processor -mcpu value.
| Processor | -mcpu Value | GCC 9 | GCC 10 | GCC 11 | GCC 12 | GCC 13 |
|---|---|---|---|---|---|---|
| Ampere Altra | neoverse-n1 | ≥ 9.1 | ALL | ALL | ALL | ALL |
| Ampere Altra Max | neoverse-n1 | ≥ 9.1 | ALL | ALL | ALL | ALL |
| AmpereOne | ampere1 | N/A | ≥ 10.5 | ≥ 11.3 | ≥ 12.1 | ALL |
Our recommendation is to use the gcc -mcpu option with the appropriate value described above (-mcpu=ampere1, -mcpu=neoverse-n1, or -mcpu=native) together with -O2 to establish a performance baseline, then explore additional optimization options, measuring whether each improves performance compared to the baseline.
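As a concrete sketch of establishing that baseline, the commands below compile a hypothetical single-file program (app.c is a stand-in for an application’s sources) both ways. They assume an Arm-based system with a GCC version that supports the chosen -mcpu value.

```shell
# Hypothetical single-file application used to illustrate the baseline build.
cat > app.c <<'EOF'
#include <stdio.h>
int main(void) { puts("baseline build"); return 0; }
EOF

# Baseline: -O2 plus explicit CPU tuning; the binary is portable to any
# Altra-family system regardless of where it was built.
gcc -O2 -mcpu=neoverse-n1 app.c -o app_baseline

# Alternative: detect the build host's CPU. The binary may not run on
# processors older than the build machine (see the portability note above).
gcc -O2 -mcpu=native app.c -o app_native

./app_baseline
```

Further optimization flags can then be added to these commands one at a time, keeping only those that measurably beat the baseline.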
Summary of common GCC options:
VP9 is a video coding format developed by Google. libvpx is the open-source reference software implementation for the VP8 and VP9 video codecs from Google and the Alliance for Open Media (AOMedia). libvpx provides a significant improvement in video compression over x264 at the expense of additional computation time. Additional information on VP9 and libvpx is available on Wikipedia.
In this case study, the VP9 build is configured to use the gcc -mcpu=native option to improve performance. As mentioned above, use the -mcpu option when compiling on Ampere Processors to enable CPU-specific tuning and optimizations. Initially, libvpx was built using the default configuration and then rebuilt using -mcpu=native. To evaluate VP9 performance, a 1080P input video file, original_videos_Sports_1080P_Sports_1080P-0063.mkv from YouTube’s User Generated Content Dataset, was used. See Ampere’s ffmpeg tuning and build guide for details on how to build ffmpeg and various codecs, including VP9, for Ampere Processors.
Default libvpx Build
$ git clone https://chromium.googlesource.com/webm/libvpx
$ cd libvpx/
$ export CFLAGS="-DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11"
$ ./configure
$ make verbose=1
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug
How to Optimize libvpx Build with -mcpu=native
$ # rebuild with -mcpu=native
$ make clean
$ export CFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11"
$ ./configure
$ make verbose=1
$ # verify the build uses the sdot dot product instruction:
$ objdump -d vpxenc | grep sdot | wc -l
128
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug
An investigation using Linux perf to measure CPU cycles found that the functions consuming the most time include vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon. The libvpx git repository shows these functions were optimized by Arm to use the Arm SDOT (signed dot-product) instruction, which is supported by Ampere Processors.
The CPU cycles spent in vpx_convolve8_horiz_neon were reduced from 6.07E+11 to 2.52E+11 by using gcc -mcpu=native to enable the dot-product optimization on an Ampere Altra processor, a 2.4x reduction in CPU cycles.
For vpx_convolve8_vert_neon, the CPU cycles were reduced from 2.46E+11 to 2.07E+11, for a 16% reduction.
Overall, using -mcpu=native to enable the dot product instruction sped up transcoding the file original_videos_Sports_1080P_Sports_1080P-0063.mkv by 7% on an Ampere Altra processor by improving the application throughput. The following table shows data collected using the perf record and perf report utilities to measure CPU cycles and instructions retired.
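The per-function cycle counts above were collected with the Linux perf record and perf report utilities. The commands below are a minimal sketch of that workflow; sleep 1 is a placeholder workload, and in the case study it would be replaced by the vpxenc invocation shown in the build steps.

```shell
# Sample CPU cycles while the workload runs; perf.data is written to the
# current directory. Replace "sleep 1" with the vpxenc command under test.
perf record -e cycles -o perf.data sleep 1

# Break down the sampled cycles per function. In the libvpx case study,
# vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon appear near the top.
perf report -i perf.data --stdio | head -n 30
```

Note that perf record may require elevated privileges or a relaxed perf_event_paranoid setting, depending on the system configuration.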
This section provides an overview of GCC’s Profile-Guided Optimization (PGO) and a case study of optimizing MySQL with PGO. Profile-Guided Optimization enables GCC to make better optimization decisions, including optimizing branches, reordering code blocks, inlining functions, and performing loop optimizations such as loop unrolling, loop peeling, and vectorization. Using PGO requires modifying the build environment to do a 3-step build:
1. Build the application with instrumentation using gcc -fprofile-generate.
2. Run the instrumented application with a representative workload to collect profile data.
3. Rebuild the application with gcc -fprofile-use to apply the collected profile.
A challenge of using PGO is the high performance overhead of step 2 above. Because an application built with gcc -fprofile-generate runs slowly, it may not be practical to run it on systems operating in a production environment. See the GCC manual’s Program Instrumentation Options section for building applications with run-time instrumentation, and the Options That Control Optimization section for rebuilding with the generated profile information.
As described in the GCC manual, -fprofile-update=atomic is recommended for multi-threaded applications; by collecting more accurate profile data, it can lead to better optimized builds.
With PGO, GCC can better optimize applications using additional information such as measured branches taken vs. not taken and measured loop trip counts. PGO is a useful optimization to try to see whether it improves performance. Performance signatures where PGO may help include applications with a significant percentage of branch mispredictions, which can be measured by using the perf utility to read the CPU’s Performance Monitoring Unit (PMU) counter BR_MIS_PRED_RETIRED. Large numbers of branch mispredictions lead to a high percentage of front-end stalls, which can be measured with the STALL_FRONTEND PMU counter. Applications with a high L2 instruction cache miss rate, possibly related to mispredicted branches, may also benefit from PGO. In summary, a large percentage of branch mispredictions, CPU front-end stalls, and L2 instruction cache misses are performance signatures where PGO can improve performance.
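These PMU counters can be read with perf stat. The sketch below uses sleep 1 as a placeholder workload; the event names are the lowercase forms reported by perf list on Neoverse N1-based systems and may differ across kernel and PMU driver versions.

```shell
# Count retired branches, mispredicted branches, front-end stall cycles,
# and total cycles for the workload. Replace "sleep 1" with the application
# under study. Event names come from the Arm PMU and may vary by system.
perf stat -e br_retired,br_mis_pred_retired,stall_frontend,cycles sleep 1
```

A high ratio of br_mis_pred_retired to br_retired, together with a large stall_frontend fraction of cycles, is the signature described above where PGO is worth trying.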
MySQL is the world’s most popular open-source database and, due to its huge binary size, is an ideal candidate for GCC PGO optimization. Without profile information, it is difficult for GCC to correctly predict the many different code paths executed. Using PGO greatly reduces branch mispredictions, the L2 instruction cache miss rate, and CPU front-end stalls on Ampere Altra Max Processors.
Summarizing how MySQL is optimized using GCC PGO:
Additional details can be found on the Ampere Developer’s website in the MySQL Tuning Guide.
Optimizing applications requires experimenting with different strategies to determine what works best. This paper provides recommendations for different GCC compiler optimizations to generate high performing applications running on Ampere Processors. It highlights using the -mcpu option as the easiest way to generate code that takes advantage of all the features supported by Ampere Cloud Native Processors. Two case studies, for MySQL database and VP9 video encoder, show the use of GCC options to optimize these applications where performance is critical.
Built for sustainable cloud computing, Ampere’s first Cloud Native Processors deliver predictable high performance, platform scalability, and power efficiency unprecedented in the industry. We invite you to learn more about our developer efforts and find best practices at https://developer.amperecomputing.com and join the conversation at: https://community.amperecomputing.com/