Memcached Tuning Guide

For Ampere Altra Processors

Overview

Memcached is an open source, in-memory key-value data store that is typically used for caching small chunks of arbitrary data such as strings, or objects from results of database and API calls. Due to its in-memory nature, Memcached is intended for use in speeding up dynamic web applications by caching data and objects in RAM and alleviating database lookups. It was one of the seminal caching stores in the cloud and continues to be popular today.

The purpose of this guide is to describe techniques to run Memcached in an optimal manner on Ampere® Altra® processors.

Build Prerequisites

Running an application in a performant manner starts with building it correctly and using the appropriate compiler flags. When running on Ampere Altra processors, we recommend building from source with GCC version 10 or newer. Newer compilers tend to have better support for new processor features and incorporate more advanced code generation techniques.

We have used CentOS8 as the operating system for our testing.

Download and install GCC 10 from SCL repository:

sudo yum -y install scl-utils scl-utils-build
sudo yum -y install gcc-toolset-10-gcc 
scl enable gcc-toolset-10 bash

For other operating systems like Ubuntu 20.04 LTS and Debian, GCC 10.2.1 is available and can be installed directly from the respective repositories.
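
Because the GCC 10 recommendation is easy to miss when several compilers are installed, a small check before configuring the build can help. The following is a minimal sketch; the helper name gcc_is_at_least_10 is our own and is not part of GCC.

```shell
#!/bin/sh
# Succeeds when a GCC version string such as "10.2.1" has a major version >= 10.
# gcc_is_at_least_10 is a hypothetical helper name used for illustration.
gcc_is_at_least_10() {
    major=${1%%.*}                      # keep the text before the first dot
    [ "$major" -ge 10 ] 2>/dev/null     # non-numeric input counts as a failure
}

# Example usage on a build host:
#   gcc_is_at_least_10 "$(gcc -dumpfullversion)" || echo "GCC is older than 10" >&2
```

The check only looks at the major version, which is all the recommendation above depends on.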

Libevent is required to build Memcached and can be installed as follows:

sudo yum install libevent-devel

Building and Installation

The installation guide on the Memcached wiki [1] has instructions for installing Memcached on Debian/Ubuntu and Red Hat/Fedora. Source code is available on the Memcached project page. We recommend using the latest stable version.

The following commands can be used to download Memcached.

wget https://memcached.org/latest 
# You might need to rename the downloaded file to match the name used below
tar -zxf memcached-1.x.x.tar.gz 
cd memcached-1.x.x

Before configuring the Memcached build, let us add some compiler flags that are specific to Ampere Altra processors.

./configure CFLAGS="-O3 -march=native -mcpu=neoverse-n1" --prefix=/usr/local/memcached 
make && make test && sudo make install
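
Since -march=native resolves to whatever CPU the build host has, it can be worth confirming that the build host actually exposes Neoverse N1 cores before relying on that flag. The sketch below assumes the usual arm64 /proc/cpuinfo layout, where Neoverse N1 reports implementer 0x41 and part 0xd0c; the helper name is_neoverse_n1 is our own.

```shell
#!/bin/sh
# Succeeds when the given cpuinfo file lists an Arm Neoverse N1 core
# (CPU implementer 0x41, CPU part 0xd0c). Defaults to /proc/cpuinfo.
# is_neoverse_n1 is a hypothetical helper name used for illustration.
is_neoverse_n1() {
    cpuinfo=${1:-/proc/cpuinfo}
    grep -q 'CPU implementer.*: 0x41' "$cpuinfo" &&
        grep -q 'CPU part.*: 0xd0c' "$cpuinfo"
}

# Example usage:
#   is_neoverse_n1 || echo "Build host does not look like a Neoverse N1 system" >&2
```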

Kernel Tuning

Memcached is network intensive, so tuning the kernel and the network interface card (NIC) is necessary to achieve good performance.

Most kernel tuning settings can be changed at runtime by writing to files under /proc/sys or sysfs. However, some knobs might require the kernel to be recompiled. A generic kernel optimization is to use a 64 KB page size for the operating system, which improves Translation Lookaside Buffer (TLB) efficiency on Ampere Altra processors.

To check the page size being used on the system:

getconf PAGESIZE

A return value of 65536 is expected for 64 KB page size. If that is not the case, check whether CONFIG_ARM64_64K_PAGES has been applied to the kernel config file, recompile and install the kernel, and reboot.

CONFIG_ARM64_64K_PAGES=y
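
The page size check can be scripted as sketched below. The helper names are our own, and the location of the kernel config file (/boot/config-$(uname -r) on many distributions) is an assumption that may not hold on every system.

```shell
#!/bin/sh
# Succeeds when the page size in bytes is 64 KB.
# With no argument, the running system is queried through getconf.
has_64k_pages() {
    size=${1:-$(getconf PAGESIZE)}
    [ "$size" -eq 65536 ]
}

# Prints the CONFIG_ARM64_*_PAGES option that is enabled in a kernel config file.
enabled_page_config() {
    grep -E '^CONFIG_ARM64_(4K|16K|64K)_PAGES=y' "$1"
}

# Example usage (config path is an assumption, see above):
#   has_64k_pages || enabled_page_config "/boot/config-$(uname -r)"
```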

Tuned Profiles

Given the myriad kernel configuration knobs, it is sometimes easier to use a pre-defined profile that matches your usage scenario. Tuned is one such tuning service: it configures the operating system for better performance by applying tuning profiles.

Using CentOS 8 as an example, if throughput of Memcached is the primary metric, we recommend using the throughput-performance profile of tuned. This profile sets the CPU governors to performance mode, reduces scheduling latency, maximizes I/O throughput, and reduces the swappiness value, all of which have been found to improve performance.

For Ubuntu, tuned may need to be installed if it’s not part of the operating system installation.

sudo apt-get update -y 
sudo apt-get install -y tuned

To improve kernel scheduling latencies on Ampere Altra processors, we recommend changing sched_wakeup_granularity_ns to 5000 by updating the corresponding setting in the tuned profile.

PROFILE_FILE=/usr/lib/tuned/throughput-performance/tuned.conf
sudo sed -i 's/sched_wakeup_granularity_ns = 15000000/sched_wakeup_granularity_ns = 5000/g' "$PROFILE_FILE"

Then use the following command to apply the throughput-performance profile:

tuned-adm profile throughput-performance
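
The sed substitution above changes nothing if the profile file does not contain the exact value 15000000, so it is worth verifying the result. The sketch below reads the setting back from a profile file and extracts the profile name from the output of tuned-adm active; both helper names are our own.

```shell
#!/bin/sh
# Prints the value of sched_wakeup_granularity_ns found in a tuned profile file,
# ignoring commented-out lines. Works with or without a "kernel." prefix.
profile_wakeup_granularity() {
    awk -F= '/sched_wakeup_granularity_ns/ && $0 !~ /^[ \t]*#/ {
                 gsub(/[ \t]/, "", $2)   # trim whitespace around the value
                 print $2
             }' "$1"
}

# Reads `tuned-adm active` output on stdin and prints only the profile name.
active_profile() {
    sed -n 's/^Current active profile: //p'
}

# Example usage:
#   profile_wakeup_granularity /usr/lib/tuned/throughput-performance/tuned.conf
#   tuned-adm active | active_profile
```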

Networking Settings

Applications like Memcached are typically tuned to deliver high throughput while meeting stringent Service Level Agreements (SLAs). The 99th percentile (p99) latency is usually a good starting point for such an SLA. To meet it, we recommend tuning kernel TCP/IP settings, since incoming requests arrive over TCP connections.

The list of TCP/IP tuning settings we have used for our Memcached testing is as follows:

echo 9999999 > /proc/sys/net/core/somaxconn
echo 4194304 > /proc/sys/net/core/rmem_max
echo 4194304 > /proc/sys/net/core/wmem_max
echo 4194304 > /proc/sys/net/core/rmem_default
echo 4194304 > /proc/sys/net/core/wmem_default
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_wmem
echo "4096 87380 4194304" > /proc/sys/net/ipv4/tcp_mem
echo 250000 > /proc/sys/net/core/netdev_max_backlog
echo 50 > /proc/sys/net/core/busy_read
echo 50 > /proc/sys/net/core/busy_poll
echo 3 > /proc/sys/net/ipv4/tcp_fastopen
echo 0 > /proc/sys/kernel/numa_balancing
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
echo 1 > /proc/sys/net/ipv4/tcp_low_latency
echo 0 > /proc/sys/net/ipv4/tcp_sack
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
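
Values written to /proc/sys do not survive a reboot. One way to keep them is a sysctl.d fragment, and the sketch below converts /proc/sys paths into the dotted keys that sysctl expects. The helper names and the file name 90-memcached.conf in the usage comment are our own choices, not something the settings above require.

```shell
#!/bin/sh
# Converts a /proc/sys path into the dotted key used by sysctl,
# e.g. /proc/sys/net/core/somaxconn -> net.core.somaxconn.
proc_path_to_sysctl_key() {
    printf '%s\n' "${1#/proc/sys/}" | tr '/' '.'
}

# Reads "PATH VALUE..." lines on stdin and prints "key = value" lines.
# Everything after the path is kept as the value, so multi-field
# settings such as tcp_rmem are preserved.
sysctl_lines() {
    while read -r path value; do
        printf '%s = %s\n' "$(proc_path_to_sysctl_key "$path")" "$value"
    done
}

# Example usage (hypothetical file name; run as root):
#   printf '%s\n' '/proc/sys/net/core/somaxconn 9999999' \
#       '/proc/sys/net/ipv4/tcp_rmem 4096 87380 4194304' |
#       sysctl_lines > /etc/sysctl.d/90-memcached.conf
#   sysctl --system
```

The simple slash-to-dot conversion is enough for the keys used in this guide; keys that embed an interface name containing dots would need extra care.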

In addition to kernel TCP/IP settings, we need to ensure the application can take advantage of hardware offload capabilities built into most network interface cards (NICs). Two examples are Generic Receive Offload (GRO), which aggregates multiple incoming packets belonging to the same stream, and Large Receive Offload (LRO), which combines incoming TCP/IP packets that belong to the same connection into one large receive segment before passing it to the kernel.

This is done as follows:

ethtool -K <NIC_NAME> gro on 
ethtool -K <NIC_NAME> lro on
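
Not every driver allows these features to be changed; ethtool marks such features as [fixed] in its feature listing. As a minimal sketch, the helper below (the name offload_state is our own) reads the output of ethtool -k and reports whether a given feature ended up on or off.

```shell
#!/bin/sh
# Reads `ethtool -k <NIC_NAME>` output on stdin and prints the state
# ("on" or "off") of the named top-level feature, dropping any
# trailing "[fixed]" marker.
offload_state() {
    awk -v feature="$1" -F': ' '
        $1 == feature { split($2, parts, " "); print parts[1] }'
}

# Example usage:
#   ethtool -k <NIC_NAME> | offload_state generic-receive-offload
#   ethtool -k <NIC_NAME> | offload_state large-receive-offload
```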

For network-bound workloads like Memcached, it is highly recommended to distribute NIC interrupts (IRQs) over multiple cores to avoid bottlenecks. The kernel documentation on SMP IRQ affinity [2] is a good reference.
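
One simple way to spread interrupts is a round-robin assignment of IRQ numbers to cores. The sketch below only prints the plan; applying it is shown as a comment because it requires root and because a running irqbalance service may override manual settings. The helper name plan_irq_affinity and the way the IRQ list is gathered are assumptions, since IRQ naming in /proc/interrupts depends on the NIC driver.

```shell
#!/bin/sh
# Prints "IRQ CPU" pairs, assigning the given IRQ numbers to
# CPUs 0..NCPUS-1 in round-robin order.
# Usage: plan_irq_affinity NCPUS IRQ [IRQ...]
plan_irq_affinity() {
    ncpus=$1
    shift
    i=0
    for irq in "$@"; do
        printf '%s %s\n' "$irq" $((i % ncpus))
        i=$((i + 1))
    done
}

# Example usage (as root; the grep pattern is driver dependent):
#   NIC_IRQS=$(grep <NIC_NAME> /proc/interrupts | cut -d: -f1)
#   plan_irq_affinity "$(nproc)" $NIC_IRQS | while read -r irq cpu; do
#       echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
#   done
```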

It is also recommended to check how many hardware channels (queues) your NIC supports and then set the number of channels in use to match that maximum. The first command below lists the supported and current channel counts; the second sets the count:

ethtool -l <NIC_NAME> 
sudo ethtool -L <NIC_NAME> combined <CHANNEL_NUM>
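
The value for <CHANNEL_NUM> can be read from the "Pre-set maximums" section of the ethtool -l output. The following sketch extracts it; the helper name max_combined_channels is our own, and some NICs report "n/a" here when they do not support combined channels.

```shell
#!/bin/sh
# Reads `ethtool -l <NIC_NAME>` output on stdin and prints the
# "Combined" value from the "Pre-set maximums" section.
max_combined_channels() {
    awk '/^Pre-set maximums:/          { in_max = 1; next }
         /^Current hardware settings:/ { in_max = 0 }
         in_max && $1 == "Combined:"   { print $2 }'
}

# Example usage:
#   CHANNEL_NUM=$(ethtool -l <NIC_NAME> | max_combined_channels)
#   sudo ethtool -L <NIC_NAME> combined "$CHANNEL_NUM"
```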

Memcached Configuration

Memcached itself can be tuned to better match your usage. A good starting point is the stats functionality built into Memcached. Stats can be studied by connecting to Memcached using telnet and running the stats command as follows:

telnet localhost 11211
Connected to localhost.
Escape character is '^]'.
telnet> stats
STAT pid 23599
STAT uptime 675
STAT time 1211439587
STAT version 1.2.5
STAT pointer_size 32
STAT rusage_user 1.404992
STAT rusage_system 4.694685
STAT curr_items 32
STAT total_items 56361
STAT bytes 2642
STAT curr_connections 53
STAT total_connections 438
STAT connection_structures 55
STAT cmd_get 113482
STAT cmd_set 80519
STAT get_hits 78926
STAT get_misses 34556
STAT evictions 0
STAT bytes_read 6379783
STAT bytes_written 4860179
STAT limit_maxbytes 67108864
STAT threads 1
END

The get_hits and get_misses values are especially important and they can be used to calculate the cache hit/miss ratios for Memcached. The rule of thumb for an in-memory cache like Memcached is to keep the cache hit ratio in the high 90s.
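
The hit ratio is get_hits divided by the sum of get_hits and get_misses. As a minimal sketch, the helper below (the name hit_ratio is our own) computes it as a percentage from stats output. For the sample output above it yields roughly 69.5%, well short of the high 90s.

```shell
#!/bin/sh
# Reads Memcached "stats" output on stdin and prints the cache hit
# ratio as a percentage with two decimals, or "n/a" if no gets were seen.
hit_ratio() {
    tr -d '\r' | awk '
        $1 == "STAT" && $2 == "get_hits"   { hits = $3 }
        $1 == "STAT" && $2 == "get_misses" { misses = $3 }
        END {
            total = hits + misses
            if (total > 0) printf "%.2f\n", 100 * hits / total
            else print "n/a"
        }'
}

# Example usage (assumes nc is installed; any client that can send
# "stats" to the Memcached port works equally well):
#   printf 'stats\r\nquit\r\n' | nc localhost 11211 | hit_ratio
```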

The evictions value counts the number of non-expired items that were evicted from the cache to free up space for new items. A high number of evictions can indicate overuse of the cache or that the amount of allocated memory is insufficient.

Finally, the number of Memcached threads is probably the single setting that most influences overall Memcached performance. With high core count processors like the Ampere Altra family, we recommend increasing the number of threads to use as many cores as possible while studying how performance scales. However, an extremely high number of threads can cause lock contention and lower performance. The number of threads can be changed by using the -t option when starting Memcached.
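
A scaling study needs a set of thread counts to try. The sketch below generates powers of two up to a maximum, always ending with the maximum itself; the helper name thread_sweep and the choice of powers of two are our own assumptions, and the benchmark step is left as a placeholder.

```shell
#!/bin/sh
# Prints thread counts to test on one line: powers of two below MAX,
# followed by MAX itself (typically the number of available cores).
thread_sweep() {
    max=$1
    n=1
    out=""
    while [ "$n" -lt "$max" ]; do
        out="$out$n "
        n=$((n * 2))
    done
    printf '%s%s\n' "$out" "$max"
}

# Example usage:
#   for t in $(thread_sweep "$(nproc)"); do
#       echo "run: memcached -t $t, then benchmark and record throughput and p99"
#   done
```

Comparing throughput and tail latency across the sweep shows where adding threads stops helping.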

Fine tuning an application like Memcached in a production environment requires intimate knowledge of the usage and the end-to-end software stack. We hope that the settings discussed in this guide can help improve Memcached performance and recommend studying all the configuration options it provides to better match your usages.

References

1. https://github.com/memcached/memcached/wiki/Install
2. https://docs.kernel.org/core-api/irq/irq-affinity.html
3. https://solutions.amperecomputing.com/briefs/memcached-workload-brief
4. https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html

Created At : February 28th 2023, 9:32:21 am

Last Updated At : June 21st 2023, 7:41:09 am

Ampere Computing LLC

4655 Great America Parkway Suite 601

Santa Clara, CA 95054
