Locking Primitives and Memory Ordering on the Ampere Altra Family of Processors
Oct, 2022
Let us take care of some basics first. Arm introduced Large System Extensions (LSE) in the Arm v8.1-A architecture that replaced a sequence of instructions for lock operations with a single atomic instruction. A good summary can be found here. While older Arm versions will work fine functionally, expect a performance hit as the core count increases and locks are contended more often. Ampere Altra and Ampere Altra Max support LSE and are well equipped for scalable lock performance.
To illustrate the difference in instructions used, let us look at the gcc intrinsic __atomic_fetch_add(). In this case decrementing a lock value by 1:
__atomic_fetch_add(&lockptr->lockval, -1, __ATOMIC_ACQ_REL);
Compiling using the –march=armv8.1-a option, the compiler generates code with an atomic instruction:
998: f8f60280 ldaddal x22, x0, [x20]
On the other hand, setting –march=armv8-a (no LSE support), generates a different sequence:
9a4: c85ffe60 ldaxr x0, [x19]
9a8: d1000400 sub x0, x0, #0x1
9ac: c801fe60 stlxr w1, x0, [x19]
9b0: 35ffffa1 cbnz w1, 9a4 <main+0x104>
An exclusive monitor is needed to make the sequence atomic. The ldaxr gets a tag for the address, [x19] in this case. Then a subtraction is performed followed by a store back to the memory location. The store, however, is only successful if that tag at the time of store matches the tag from the load. The conditional branch, cbnz, after the stlxr checks if the store was successful, meaning that the tags from load and store match. If not, jump back to the beginning of the sequence, at address 0x9a4 in this case.
What is noteworthy here is that without LSE instructions, this sequence of instructions might be executed several times before it is deemed successful. With LSE, the ldaddal instruction is guaranteed to finish with a single instruction and no loop is needed.
Figure 1 shows the difference in performance for exclusive lock gets per second with and without LSE as the thread count is increased from 1 to 80.
Typically, Compare and Exchange hardware instructions are used to implement locks in software. It is important to note that these instructions need to be atomic.
What does atomic mean in this context? The instructions first get ownership of the cache line containing the lock and load it into the local cache of the CPU. Then the current value is compared with the compare value submitted with the instruction. If equal, the new value submitted as part of the instruction will replace the current value. If not equal, the current value remains. Atomicity in this regard means that the entire sequence is executed by a thread without other threads getting access to the cache line, guaranteed by hardware.
There are different types of locks that can be implemented in software, like mutexes, ticket locks, and spinlocks. As mentioned previously, different lock types are implemented in software, and hardware provides instructions like cmpxchg or fetchadd. The same lock types run on different hardware, and only the instructions used will differ.
This is the million-dollar question. Let us break it down into two options: 1) using available libraries and 2) using atomic instructions to implement a proprietary locking algorithm.
Option 1 has several advantages. The libraries exist, no custom implementation is needed, it is well tested, and typically will be maintained in future library versions. Examples are pthread_mutex_lock and pthread_rwlock. Sounds great, so what is the catch? Lack of statistics can be an issue. No information is returned to the application reporting spins or how many times the thread was scheduled out. Also, the library implementation might not be a good fit for an application as libraries are more generic.
Option 2 is more involved. It requires implementing the locking functions and maintaining them. However, it can reap benefits since it is designed specifically for the application. Locking primitives and atomic instructions can be implemented through inline assembly or utilizing compiler support. Again, writing code with inline assembly requires the application to maintain that code.
For compilers, gcc provides atomic built-in functionsthat enable the application with low-level functions which will be compiled into Arm atomic instructions. These built-in functions provide the application with different methods for atomic instructions and memory ordering directives. The code is also more portable. It is important, however, to compile the application with the correct settings for -mcpu or -march to generate Arm LSE instructions. Ampere Altra and Ampere Altra Max use the Neoverse-n1 architecture, which includes LSE.
Nonetheless, implementing locks with atomic instructions requires design decisions. Is spinning justified if a lock is held? How many times to spin? Should the thread yield the CPU if unsuccessful after a certain number of spins? Is a backoff required in the spin loop, or is it a straight spin? These are just a few considerations that need to be addressed.
As mentioned previously, the correct memory ordering directive is important for correctness. AArch64 follows a relaxed memory model. With LSE, AArch64 instructions enforce certain memory ordering. For example, the cmpxchg instruction set has versions to acquire (CASA instruction), release (CASL instruction), and acquire and release (CASAL instruction). Hardware guarantees that these instructions follow the memory model for their specific directive. It is up to software to use the appropriate instructions. Typically, an acquire is used on lock get and a release on lock free. However, if the application reads data after lock free (for example, if there are any waiters for the lock), then the release semantics of the free can cause issues, as the read of the waiter structure can be hoisted above the lock free, and thus stale data is present in the register after the free. In those cases, it is better to use acquire and release semantics. Again, this is dependent on the application implementation. The gcc compiler uses these directives directly in the built-in functions as seen here.
The Ampere family of CPUs is pushing the envelope with its ever-increasing core count and has all the ingredients for scalable performance for multithreaded applications that use locks. With LSE, atomic instructions are provided in hardware for great lock performance. As we have seen in this white paper, it is up to the application developer to make the best use of these instructions, either through locking libraries or correct implementation of locking algorithms.