When to use Larger Page Sizes on Ampere(R) CPUs
One of the ways that the Arm64 architecture is different from x86 is the ability to configure the size of memory pages in the Memory Management Unit (MMU) of the CPU to 4K, 16K, or 64K. This article summarizes what memory page size is, how to configure page size on Linux systems, and when it might make sense to use a different page size in your applications.
As we previously discussed in Diagnosing and Fixing a Page Fault Performance Issue with Arm64 Atomics, operating systems present a virtual memory address space to applications and map physical memory pages to virtual addresses using a page table. To avoid walking the page table on every access, the CPU provides a small cache of recently used address translations called the Translation Lookaside Buffer (TLB), so that the physical pages backing recently accessed virtual addresses can be found quickly.
The size of physical memory pages (called granules) on the x86 architecture is fixed at 4KB. On Arm64 systems like Ampere Altra(R) or AmpereOne(R), however, the developer can configure the size of physical memory pages to be 4KB, 16KB, or 64KB.
As changing the page size can impact the memory efficiency and performance of your system, it is important to understand when a larger page size makes sense and what the trade-offs are. The first trade-off is memory efficiency: allocations that do not fill a page still consume the whole page. For example, storing 7KB of data uses two 4KB pages on a system with 4KB kernel pages, for a total of 8KB of memory and an efficiency of 87.5%. On a system with 64KB pages, the same data occupies a single 64KB page, for an efficiency of roughly 11%. However, the MMU and the OS kernel are smart enough to reuse previously allocated pages that are not yet full for future allocations. If the same process later allocates another 32KB, we are still using just one 64KB page, now with 39KB occupied; with a 4K page size, we would now be managing ten 4KB pages.
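This page-level arithmetic is easy to reproduce. A minimal shell sketch, using the 39KB cumulative allocation from the example above:
SIZE_KB=39
for PAGE_KB in 4 64; do
    PAGES=$(( (SIZE_KB + PAGE_KB - 1) / PAGE_KB ))   # pages needed, rounded up
    echo "${PAGE_KB}KB pages: ${PAGES} page(s), $(( PAGES * PAGE_KB ))KB resident"
done
This prints 10 pages (40KB resident) for 4KB pages and 1 page (64KB resident) for 64KB pages, matching the figures above.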
The second trade-off is in performance, due to misses in the page table look-up caches. The TLB holds a relatively small number of entries at each level (L1 and L2), and with larger page sizes each TLB entry covers a larger amount of physical memory. On Ampere Altra and Altra Max processors, for example, the L1 data TLB has 48 entries and the L2 TLB has 1280 entries. This means that with a 4KB granule, the L1 data TLB can cache translations for 192KB of physical memory, and the L2 TLB can cover 5MB. With 64KB pages, this increases to 3MB for the L1 data TLB and 80MB for the L2 TLB. Each TLB miss costs time for a page walk to find the physical page matching a virtual address, after which the translation is cached and the TLB updated. With larger pages you get fewer TLB misses and better performance for memory intensive workloads. You also improve I/O performance by having larger zones of contiguous memory available.
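The reach figures above follow directly from the number of entries multiplied by the page size:
$ echo "4K:  L1 dTLB reach $(( 48 * 4 ))KB, L2 TLB reach $(( 1280 * 4 / 1024 ))MB"
4K:  L1 dTLB reach 192KB, L2 TLB reach 5MB
$ echo "64K: L1 dTLB reach $(( 48 * 64 / 1024 ))MB, L2 TLB reach $(( 1280 * 64 / 1024 ))MB"
64K: L1 dTLB reach 3MB, L2 TLB reach 80MB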
As a result, data intensive applications that have a lot of data in memory or in transit can benefit from larger page sizes. Some of these applications are:
Databases: Database systems tend to store a lot of information in memory for caching purposes and have lots of disk I/O for large datasets. Both characteristics make database servers great candidates for large memory page sizes.
Virtualization infrastructure: Virtual Machines (VMs) include a disk image, comprising an operating system kernel and all the applications required by that VM, and range in size from hundreds of megabytes to hundreds of gigabytes. As a result, they can use large amounts of memory and can benefit from larger page sizes.
Build servers for Continuous Integration: Tasks like building the Linux kernel process thousands of source files and use a lot of RAM while compiling them. As a high throughput workload, hosts configured with larger page sizes tend to perform better as build servers.
Network or I/O heavy applications: For applications with a lot of network I/O and in-memory data processing like object caches, load balancers, firewalls, or video streaming, large memory pages can result in fewer page faults, improving performance.
Memory intensive applications like AI Inference: AI inference, executing a trained model such as a recommendation engine or an LLM chatbot, is a memory and CPU intensive workload where large memory page sizes can help deliver high performance.
In general, how these types of applications perform with larger page sizes will depend on several factors, including the data sets involved and the application's memory access patterns. If you believe that your application could benefit from larger memory pages, benchmark your target workload with both 4K and 64K pages, using production-style data, and base your deployment decision on the results. You can also evaluate the potential benefit of larger page sizes with the perf tool by measuring TLB stalls (that is, how often TLB misses cause the CPU pipeline to stall while waiting for a translation to be resolved from memory). First, check that the kernel supports the TLB stall counters available on AmpereOne and newer CPUs:
# perf list | grep end_tlb
  stall_backend_tlb
  stall_frontend_tlb
With kernel support confirmed, the pipeline stalls due to TLB misses can be measured:
# perf stat -e instructions,cycles,stall_frontend_tlb,stall_backend_tlb ./a.out
time for 12344321 * 100M nops: 3.7 s
Performance counter stats for './a.out':
12,648,071,049 instructions # 1.14 insn per cycle
11,109,161,102 cycles
1,482,795,078 stall_frontend_tlb
1,334,751 stall_backend_tlb
3.706937365 seconds time elapsed
3.629966000 seconds user
0.000995000 seconds sys
The ratio (stall_frontend_tlb + stall_backend_tlb) / cycles is an upper bound on the fraction of time that could be saved by using larger memory pages.
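For the run above, this bound works out to roughly 13% of cycles:
$ echo "scale=3; (1482795078 + 1334751) / 11109161102" | bc
.133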
Beware, however, that because 4K has been the default page size for so long, some software packages may assume it on your system, resulting in inefficient memory usage. This is not a common situation in modern software stacks, but it is another reason to test and benchmark before committing to larger page sizes.
Changing the memory page size requires running an operating system kernel that has been compiled to support your desired size. Popular cloud operating systems like Red Hat Enterprise Linux, Oracle Linux, SUSE Linux Enterprise, and Ubuntu from Canonical ship with pre-built Arm64 kernels supporting both 4KB and 64KB page sizes.
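As a quick check, you can ask your package manager whether a 64K kernel variant is available; the exact package names vary by distribution, as detailed below:
$ dnf list --available 'kernel*64k*'     # Red Hat Enterprise Linux / Oracle Linux
$ apt-cache search linux-generic-64k     # Ubuntu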
To use a kernel with 64KB pages on Red Hat Enterprise Linux 9:
1. Install the kernel-64k package:
dnf -y install kernel-64k
2. Set the 64K kernel as the default at boot time:
k=$(echo /boot/vmlinuz*64k)
grubby --set-default=$k \
--update-kernel=$k \
--args="crashkernel=2G-:640M"
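3. Optionally, confirm that the 64K kernel is now the default before rebooting (the kernel version shown here is illustrative and will differ on your system):
# grubby --default-kernel
/boot/vmlinuz-5.14.0-284.11.1.el9_2.aarch64+64k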
To boot a 64KB kernel on Ubuntu 22.04:
1. Install Ubuntu from the arm64+largemem ISO, which uses the 64K kernel by default, or:
2. Install the linux-generic-64k package, which adds a 64K kernel option to the boot menu, with the command sudo apt install linux-generic-64k
3. You can set the 64K kernel as the default boot option by updating the grub2 boot menu with the command:
echo "GRUB_FLAVOUR_ORDER=generic-64k" | sudo tee
/etc/default/grub.d/local-order.cfg
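4. For the new default to take effect, regenerate the GRUB configuration and reboot:
sudo update-grub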
For 64KB pages on Oracle Linux:
1. Install the kernel-uek64k package:
sudo dnf install -y kernel-uek64k
2. Set the 64K kernel as the default at boot time:
sudo grubby --set-default=$(echo /boot/vmlinuz*64k)
3. After rebooting the system, you can verify that you are running the 64K kernel using getconf as described below.
Similar instructions may be available on the websites of other operating system distributions.
If you are building your own Linux kernel, you can use make menuconfig to change the kernel configuration. In the "Kernel Features" submenu, you will find the "Page size" configuration option, which you can change from 4K to 16K or 64K. Alternatively, you can edit the kernel configuration file .config directly to set the value of CONFIG_ARM64_PAGE_SHIFT from its default value of 12 (4K = 2^12 bytes) to 14 (16K = 2^14 bytes) or 16 (64K = 2^16 bytes). You can then create multiple bootloader entries for kernels with different page sizes and choose the appropriate kernel at boot time.
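For reference, the relevant .config fragment for a 64KB-page kernel looks like this (following the CONFIG_ARM64_PAGE_SHIFT mapping above):
# .config excerpt for a kernel built with 64KB pages
CONFIG_ARM64_64K_PAGES=y
CONFIG_ARM64_PAGE_SHIFT=16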
To verify the page size setting of your running Linux kernel, you can use the getconf utility. With a 64K page size, it will show the following:
$ getconf PAGESIZE
65536
To summarize: Changing the kernel memory page size on your cloud systems can have a positive impact on application performance for many common cloud workloads. If your application does a lot of disk, memory, or network I/O, you may be able to improve its performance significantly by using a kernel with 16K or 64K pages on Arm64 hosts.
However, this is not a panacea, and your mileage may vary. We recommend testing with both synthetic and real-world benchmarks to see whether changing the page size has a positive impact on your bottom line. Many common Linux distributions already include 64K Arm64 kernels in their repositories, so the cost of installing these kernel packages, booting into them, and testing whether larger pages provide a performance improvement is relatively low.