For Ampere Altra Processors
Memcached is an open source, in-memory key-value data store that is typically used for caching small chunks of arbitrary data, such as strings or objects, returned by database and API calls. Due to its in-memory nature, Memcached is intended to speed up dynamic web applications by caching data and objects in RAM and reducing the number of database lookups. It was one of the seminal caching stores in the cloud and continues to be popular today.
The purpose of this guide is to describe techniques to run memcached in an optimal manner on Ampere® Altra® processors.
Running an application in a performant manner starts with building it correctly and using the appropriate compiler flags. When running on Ampere Altra processors, we recommend building Memcached from source with GCC version 10 or newer. Newer compilers tend to have better support for new processor features and incorporate more advanced code generation techniques.
We used CentOS 8 as the operating system for our testing.
Download and install GCC 10 from the SCL repository:
sudo yum -y install scl-utils scl-utils-build
sudo yum -y install gcc-toolset-10-gcc
scl enable gcc-toolset-10 bash
For other operating systems like Ubuntu 20.04 LTS and Debian, GCC 10.2.1 is available and can be installed directly from the respective repositories.
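On Ubuntu 20.04 LTS, for example, the following commands should install a suitable compiler (package names can vary slightly between distributions and releases):
sudo apt-get update
sudo apt-get install -y gcc-10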
Libevent is required to build memcached and can be downloaded as follows:
sudo yum install libevent-devel
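On Debian-based systems such as Ubuntu, the equivalent development package is typically named libevent-dev:
sudo apt-get install -y libevent-dev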
The installation guide on the Memcached wiki has instructions for installing memcached on Debian/Ubuntu and Red Hat/Fedora. Source code is available on the memcached project page. We recommend using the latest stable version.
The following commands can be used to download Memcached.
wget https://memcached.org/latest
# you might need to rename the downloaded file
tar -zxf memcached-1.x.x.tar.gz
cd memcached-1.x.x
When configuring the Memcached build, we add compiler flags that are specific to Ampere Altra processors.
./configure CFLAGS="-O3 -march=native -mcpu=neoverse-n1" --prefix=/usr/local/memcached
make && make test && sudo make install
Memcached is notoriously network heavy, and tuning of both the kernel and the network interface card (NIC) is necessary to achieve good performance.
Most kernel tuning parameters can be set at runtime through the sysctl interface (/proc/sys). However, some knobs require the kernel to be recompiled. A generic kernel optimization is to use a 64 KB page size for the operating system, which improves Translation Lookaside Buffer (TLB) efficiency on Ampere Altra processors.
To check the page size being used on the system:
getconf PAGESIZE
A return value of 65536 is expected for a 64 KB page size. If that is not the case, check whether CONFIG_ARM64_64K_PAGES has been set in the kernel config file, then recompile and install the kernel and reboot.
CONFIG_ARM64_64K_PAGES=y
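As a quick check before rebuilding, the configuration of the installed kernel can usually be inspected directly; which of the two locations below exists depends on the distribution and kernel build options:
# check the config file shipped with the installed kernel
grep CONFIG_ARM64_64K_PAGES /boot/config-$(uname -r)
# or, if the kernel exposes its config via CONFIG_IKCONFIG_PROC
zgrep CONFIG_ARM64_64K_PAGES /proc/config.gz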
Given the myriad of kernel configuration knobs, it is often easier to use pre-defined profiles that match your usage scenario. Tuned is one such tuning service that can configure the operating system for better performance by applying tuning profiles.
Using CentOS 8 as an example, if Memcached throughput is the primary metric, we recommend using the throughput-performance profile of tuned. This profile sets the CPU governor to performance mode, reduces scheduling latency, maximizes I/O throughput, and reduces the swappiness value, all of which have been found to improve performance.
For Ubuntu, tuned may need to be installed if it’s not part of the operating system installation.
sudo apt-get update -y
sudo apt-get install -y tuned
To improve kernel scheduling latencies on Ampere Altra processors, we recommend changing sched_wakeup_granularity_ns to 5000 by updating the corresponding setting in the tuned profile.
PROFILE_FILE=/usr/lib/tuned/throughput-performance/tuned.conf
sed -i 's/sched_wakeup_granularity_ns = 15000000/sched_wakeup_granularity_ns = 5000/g' $PROFILE_FILE
Then use the following command to apply the throughput-performance profile:
tuned-adm profile throughput-performance
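To confirm that the profile took effect, tuned-adm can report the active profile and the scheduler setting can be read back from the kernel (the path below assumes a kernel that still exposes this knob under /proc/sys/kernel, as the CentOS 8 kernel does):
tuned-adm active
cat /proc/sys/kernel/sched_wakeup_granularity_ns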
Applications like Memcached are typically tuned to deliver high throughput while maintaining stringent Service Level Agreements (SLAs). A 99th percentile (p99) latency target is usually a good starting point. To meet such SLAs, we recommend tuning the kernel TCP/IP settings, since incoming requests are established over TCP connections.
The list of TCP/IP tuning settings we have used for our Memcached testing is as follows:
echo 9999999 > /proc/sys/net/core/somaxconn
echo 4194304 > /proc/sys/net/core/rmem_max
echo 4194304 > /proc/sys/net/core/wmem_max
echo 4194304 > /proc/sys/net/core/rmem_default
echo 4194304 > /proc/sys/net/core/wmem_default
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_mem
echo 250000 > /proc/sys/net/core/netdev_max_backlog
echo 50 > /proc/sys/net/core/busy_read
echo 50 > /proc/sys/net/core/busy_poll
echo 3 > /proc/sys/net/ipv4/tcp_fastopen
echo 0 > /proc/sys/kernel/numa_balancing
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
echo 1 > /proc/sys/net/ipv4/tcp_low_latency
echo 0 > /proc/sys/net/ipv4/tcp_sack
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
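Note that values written through /proc/sys take effect immediately but do not persist across reboots. If persistence is needed, the equivalent sysctl keys can be placed in a drop-in file and reloaded; the file name and the two keys shown below are only an illustration:
cat <<'EOF' | sudo tee /etc/sysctl.d/90-memcached-tuning.conf
net.core.somaxconn = 9999999
net.ipv4.tcp_fastopen = 3
EOF
sudo sysctl --system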
In addition to the kernel TCP/IP settings, we need to ensure the application takes advantage of the hardware offload capabilities built into most network interface cards (NICs), such as Generic Receive Offload (GRO), which aggregates multiple incoming packets belonging to the same stream, and Large Receive Offload (LRO), which combines incoming TCP/IP packets belonging to the same connection into one large receive segment before passing it to the kernel.
This is done as follows:
ethtool -K <NIC_NAME> gro on
ethtool -K <NIC_NAME> lro on
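The resulting offload settings can be verified with the lowercase -k option, which lists the long-form feature names:
ethtool -k <NIC_NAME> | grep -E "generic-receive-offload|large-receive-offload"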
For network-bound workloads like Memcached, it is highly recommended to distribute NIC interrupts (IRQs) over multiple cores to avoid bottlenecks. The Linux kernel's documentation on SMP IRQ affinity is a good reference.
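As a rough sketch of distributing IRQs manually (the interface name eth0, the IRQ number, and the target CPU below are placeholders rather than values from our testing; irqbalance is stopped first so it does not overwrite the manual affinity):
# stop irqbalance so it does not override manual affinity settings
sudo systemctl stop irqbalance
# list the IRQ numbers belonging to the NIC
grep eth0 /proc/interrupts
# pin one of the NIC's IRQs (e.g. IRQ 45) to a specific CPU (e.g. CPU 4)
echo 4 | sudo tee /proc/irq/45/smp_affinity_list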
It is also recommended to use the following commands to check the number of hardware channels supported by your NIC and then set the number of active channels to match that capacity:
ethtool -l <NIC_NAME>
sudo ethtool -L <NIC_NAME> combined <CHANNEL_NUM>
Memcached itself can be tuned to better match your usage. A good starting point is the stats functionality built into Memcached. Stats can be inspected by connecting to Memcached over telnet and issuing the stats command as follows:
telnet localhost 11211
Connected to localhost.
Escape character is '^]'.
stats
STAT pid 23599
STAT uptime 675
STAT time 1211439587
STAT version 1.2.5
STAT pointer_size 32
STAT rusage_user 1.404992
STAT rusage_system 4.694685
STAT curr_items 32
STAT total_items 56361
STAT bytes 2642
STAT curr_connections 53
STAT total_connections 438
STAT connection_structures 55
STAT cmd_get 113482
STAT cmd_set 80519
STAT get_hits 78926
STAT get_misses 34556
STAT evictions 0
STAT bytes_read 6379783
STAT bytes_written 4860179
STAT limit_maxbytes 67108864
STAT threads 1
END
The get_hits and get_misses values are especially important, as they can be used to calculate the cache hit/miss ratio for Memcached. The rule of thumb for an in-memory cache like Memcached is to keep the cache hit ratio in the high 90s.
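As an illustration using the sample output above, the hit ratio works out to get_hits / (get_hits + get_misses) = 78926 / (78926 + 34556) ≈ 0.70, i.e. roughly 70%, well below the high-90s target and a sign that the cache size or caching strategy would need to be revisited.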
The evictions value counts the number of non-expired items that were evicted from the cache to free up space for new items. A high number of evictions can indicate overuse of the cache or that the amount of allocated memory is insufficient.
Finally, the number of Memcached threads is probably the single setting with the greatest influence on overall Memcached performance. With high core count processors like the Ampere Altra family, we recommend increasing the number of threads to use as many cores as possible while studying how performance scales. An extremely high number of threads can result in lock contention and lower performance. The number of threads can be changed with the -t option when starting Memcached.
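For example, a Memcached instance running 32 worker threads might be started as follows (the user, memory limit, and connection limit shown here are illustrative values, not recommendations):
memcached -u memcache -p 11211 -t 32 -m 4096 -c 4096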
Fine tuning an application like Memcached in a production environment requires intimate knowledge of the usage and the end-to-end software stack. We hope that the settings discussed in this guide help improve Memcached performance, and we recommend studying all the configuration options it provides to better match your usage.