Customer reference board (CRB) platforms from Ampere

PostgreSQL Tuning Guide

for Ampere Altra Processors on Oracle Cloud Infrastructure

Overview

This document provides setup and tuning details for PostgreSQL running on Ampere Altra and Altra Max systems.

Note: This tuning guide will show you how to build/tune PostgreSQL for Ampere processors. Please note there is scope to stress the system more by using additional pg threads/more memory to get the best performance as needed.

This test was done on OCI - A1 instance with 4 OCPUs and 40GB RAM. For this test, we used pgbench to stress PostgreSQL. We ran pgbench on the same system as PostgreSQL to avoid network latencies, however, if you find that pgbench consumes significant CPU resources, then it is recommended to run it on a different client system/VM.

Operating Sytem: Oracle Linux 8.6
Kernel: 5.4.17-2136.308.9.el8uek.aarch64
Storage: 1 TB – default balanced SSD (use iSCSI commands to attach the disk)
THP is set to always.

Setup and Tuning

Follow the steps below to setup the system:

sudo yum update 
sudo yum groupinstall 'Development Tools' 
sudo yum install readline-devel

The test used an external disk converted to xfs file system for storing PostgreSQL data.

The test downloaded the latest PostgreSQL source code to build it. GCC 10.2 and above is recommended to build PostgreSQL as it includes lse based optimizations.

wget https://ftp.postgresql.org/pub/source/v14.4/postgresql-14.4.tar.gz

This is an example to build PostgreSQL with recommended GCC flags.

CC=${path to GCC bin} CFLAGS="-mcpu=neoverse-N1 -march=armv8.2-a+crypto+FP16+rcpc+dotprod+crc+lse -O3" ./configure --enable-arm-crc32 --prefix=${path to postgresql build} &&     make clean &&     make -j &&     make install

Initilize and start the newly built PostgreSQL.

pg_ctl -D /<postgres data dir> initdb 
pg_ctl -D /<postgres data dir> -l logfile start

Here is an example of postgresql.conf used for this benchmarking exercise:

listen_addresses = 'localhost'          
port = 5432                            # (change requires restart) 
max_connections = 100 # (change requires restart) 
shared_buffers = 25GB # min 128kB 
huge_pages = try # on, off, or try 
max_wal_senders = 0 
max_wal_size=200GB 
maintenance_work_mem = 512MB # min 1MB 
autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem 
max_stack_depth = 7MB # min 100kB 
dynamic_shared_memory_type = posix # the default is the first option 
max_files_per_process = 4000 # min 25 
effective_io_concurrency = 32 # 1-1000; 0 disables prefetching 
wal_level = minimal # minimal, archive, hot_standby, or logical 
synchronous_commit = off # synchronization level; 
wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers 
checkpoint_timeout = 1h # range 30s-1h 
checkpoint_completion_target = 1 # checkpoint target duration, 0.0 - 1.0 
checkpoint_warning = 0 # 0 disables 
log_min_messages = error # values in order of decreasing detail: 
log_min_error_statement = error # values in order of decreasing detail: 
log_timezone = 'GB' 
log_checkpoints = on 
log_autovacuum_min_duration = 0 
autovacuum = off # Enable autovacuum subprocess? 'on' 
datestyle = 'iso, dmy' 
timezone = 'GB' 
lc_messages = 'C.UTF-8' # locale for system error message 
lc_monetary = 'C.UTF-8' # locale for monetary formatting 
lc_numeric = 'C.UTF-8' # locale for number formatting 
lc_time = 'C.UTF-8' # locale for time formatting 
default_text_search_config = 'pg_catalog.english' 
max_locks_per_transaction = 64 # min 10 
max_pred_locks_per_transaction = 64 # min 10 On Recoverable mode, we took out some of the flags that would impact recoverability.

Adequately scale max_connections, shared_buffers and max_wal_size to avoid bottlenecks to sufficiently scale throughput delivered by PostgreSQL. Now, we create a database and populate it using pgbench. Next, we run a test with 8 clients and 8 threads to stress the database.

pgbench -i -s 50 pg50 // create a database of scale factor 50 <use the size you need> 
pgbench -c 8 -j 8 -T 60 pg50 // start the test, you may -c and -j as needed.

Following the above steps, we measured a TPS of 6146, with a latency of 1.3ms on the above-mentioned OCI A1 system based on Ampere Altra. The same steps on a comparable OCI E4 instance resulted in a TPS of 5547 with a latency of 1.4ms. There is more scope to push CPU/pgbench threads as needed on this system, as seen below.

Scaling of -c (clients) and -j(threads)

This table shows scaling of -c (clients) and -j(threads) within pgbench on a database of scale factor 50.

-c, -j	A1 - tps	A1 - latency (ms)	E4 – tps	E4 – latency (ms)
8	6,041	1.3	4,963	1.6
16	6,723	2.3	5,664	2.8
32	7,946	4	6,010	5.3
48	8,351	5.7	6,462	7.4
64	8,624	7.4	6,545	9.8

In general, it is good practice to do a manual checkpoint between each run. This is to avoid automatic checkpoints in between the run.

Created At : April 6th 2023, 9:18:33 am

Last Updated At : July 31st 2023, 4:27:15 pm

Ampere Computing LLC

4655 Great America Parkway Suite 601

Santa Clara, CA 95054

| | |

This site runs on Ampere Processors.