AArch64 shared memory synchronization
by Mike Krinkin
I’m continuing to play with the 64-bit ARM architecture, and the next thing I want to try is spinning up multiple CPUs. However, before doing that, I need to get the question of synchronizing multiple concurrently running CPUs out of the way, and that’s what I touch on in this post.
The plan
Basically, for synchronization of multiple CPUs it should be enough to implement a simple mutual exclusion mechanism, sometimes referred to as a spinlock. And the simplest implementation of a spinlock is pretty straightforward, as you will see.
Once we have a simple implementation, I’m going to use it as an opportunity to look at somewhat more subtle details of shared memory synchronization, like the memory model and memory barriers. Hopefully that will show why correct synchronization is difficult and why using simple synchronization primitives like mutual exclusion is so convenient.
Simple spinlock
In the simple implementation I will rely on an atomic read-modify-write operation that exists in many programming languages. To be specific, I will be using C, but C++ and Rust have very similar mechanisms available.
In standard C there is a header stdatomic.h that defines the type atomic_flag. You can think of this type as a bool variable, with the caveat that you can’t (or shouldn’t) work with it as a simple bool variable and should instead rely on special functions.
The basic idea of the implementation is that we have a bool variable that is initially set to false. We want to implement two functions: lock and unlock.
The point of those two functions is that a thread of execution that has successfully finished executing lock, and has not yet executed unlock, becomes the “owner” of the lock on which the function was called.
And while the lock has an owner like that, all other threads of execution that call lock on the same object will get stuck in the lock function and will not complete until the current owner of the lock calls the unlock function.
Let’s take a look at an example and for that we will start with defining an interface for our implementation:
struct spinlock {
    ...
};

void spinlock_lock(struct spinlock *l);
void spinlock_unlock(struct spinlock *l);
Now with that interface in mind, let’s consider an example when we have a structure with a complex state that is being accessed from multiple threads. Let’s say for example, we have a binary search tree or something like that:
struct node {
    struct node *parent;
    struct node *left;
    struct node *right;
};

struct search_tree {
    struct node *root;
};
If we start inserting and deleting elements of the tree from multiple threads without any synchronization, then, because making changes to the search tree requires multiple operations done in the correct order, those threads will start stepping on each other’s feet.
To avoid that we can use the spinlock as follows:
struct spinlock lock;
struct search_tree tree;

...

void search_tree_insert(struct search_tree *tree, struct node *new_node)
{
    spinlock_lock(&lock);
    tree_insert(tree, new_node);
    spinlock_unlock(&lock);
}
Because the lock can have no more than one owner, at most one thread can execute tree_insert at the same time, and thus we avoid the situation when multiple threads step on each other’s feet.
NOTE: We only get this guarantee of mutual exclusion if all the threads of execution follow the same protocol and take ownership of the same lock before making any updates to the search tree in the example above.
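To make that point concrete, here is a toy sketch (the names are hypothetical, not from the example above) of what breaking the protocol looks like - a thread that bypasses the lock races with every thread that follows it:

struct spinlock counter_lock; /* hypothetical lock guarding the counter */
long counter = 0;

void safe_increment(void)
{
    spinlock_lock(&counter_lock);
    counter++; /* a read-modify-write, needs mutual exclusion */
    spinlock_unlock(&counter_lock);
}

void broken_increment(void)
{
    counter++; /* bypasses the lock and races with safe_increment */
}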
Now, let’s take a look at how we can implement this using only C functions defined in the standard (which are not architecture specific):
struct spinlock {
    atomic_flag flag;
};

void spinlock_setup(struct spinlock *l)
{
    atomic_flag_clear(&l->flag);
}

void spinlock_lock(struct spinlock *l)
{
    while (atomic_flag_test_and_set(&l->flag));
}

void spinlock_unlock(struct spinlock *l)
{
    atomic_flag_clear(&l->flag);
}
There are basically just a couple of functions:
- atomic_flag_test_and_set
- atomic_flag_clear
atomic_flag_clear atomically writes the value false to the atomic_flag structure - rather simple, if we leave aside the details of what atomic actually means.
atomic_flag_test_and_set is somewhat more interesting: atomically, as one indivisible operation, it performs the following actions:
- read the current value of the flag
- write true to the flag
So it reads the old value and writes a new one as one operation. The old value is returned as a result of the function.
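A non-atomic sketch of the semantics might make it clearer. The real operation does all of this as a single indivisible step, unlike this plain C illustration:

#include <stdbool.h>

/* Illustration only: the real atomic_flag_test_and_set performs
 * these steps as one indivisible operation. */
bool test_and_set_illustration(bool *flag)
{
    bool old = *flag; /* read the current value */
    *flag = true;     /* write true to the flag */
    return old;       /* return the old value */
}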
With that semantic in mind, let’s look at the spinlock_lock function in a bit more detail. The function will spin in the loop as long as the value in the flag is true.
If the function at some point observes that the value is not true, it will write true there again and return.
Intuitively it’s rather easy to understand - when the atomic_flag structure contains true it means that the lock has an owner, and when it contains false the lock is free and does not have an owner.
We can acquire the lock by writing the value true to the flag, but only as long as it does not have an owner already. And we can release the lock by writing false there, though obviously only the current owner of the lock should do that.
This implementation of mutual exclusion is actually correct, but as we will see below its simplicity might be a bit misleading and there is much more going on here than meets the eye. Moreover, as we will see there are some things that we can actually change in this implementation to gain some additional properties.
Memory ordering constraints
Quoting Leslie Lamport, “accessing a single memory location in a multiprocessor is traditionally assumed to be atomic. Such atomicity is a fiction; a memory access consists of a number of hardware actions, and different accesses may be executed concurrently”.
What it basically means is that just because the instructions in the program go in a certain order does not mean that the processor executing them has to perform them in the same order.
It might sound weird, but it actually makes sense if you consider that a processor may try to optimize what it is doing to achieve the same result faster or more efficiently.
The problem is that when you have multiple cores working independently, it’s difficult to tell what the expected end result should be. Even without the potential re-orderings that a processor might do, when you run a program on multiple cores without proper synchronization, the result might essentially be unpredictable.
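The classic store-buffering litmus test illustrates this. In the sketch below (a hypothetical two-CPU scenario with plain, unsynchronized variables) it is entirely possible for both r1 and r2 to end up 0, because each CPU may make its store visible after its load:

int x = 0, y = 0;
int r1, r2;

void cpu1(void) { x = 1; r1 = y; } /* may observe y == 0 */
void cpu2(void) { y = 1; r2 = x; } /* may observe x == 0 */

/* Intuitively one of the CPUs "went first" and the other should see
 * its store, yet on real hardware r1 == 0 && r2 == 0 is allowed. */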
And that’s what it comes down to in the end - proper synchronization must tell the processor what operations can and cannot be re-ordered. Now, with that in mind, let’s take a look at the spinlock implementation and figure out what ordering constraints we have.
Looking again at the hypothetical example of updating a search tree:
// Code before the critical section
spinlock_lock(&lock);
// Critical section that should only be executed by one thread at a time
tree_insert(tree, new_node);
spinlock_unlock(&lock);
// Code after critical section
One thing that we want to enforce here is that the tree_insert function is not executed concurrently by multiple threads, and that’s why we use the lock. So, at the very least, we want to prevent re-ordering of tree_insert, or any of its parts, before spinlock_lock or after spinlock_unlock; otherwise it defeats the purpose of having the lock in the first place.
On the other hand, for correctness at least, we don’t need to prevent any code before or after the critical section from being moved into the critical section. That’s because that code could be run by multiple threads at the same time, so it could be run by a single thread as well, without violating correctness.
As I mentioned earlier, the implementation of the spinlock is correct, so what exactly in the implementation prevents re-ordering of the tree_insert logic outside of the critical section guarded by the spinlock?
What controls memory ordering constraints around atomic operations like atomic_flag_clear or atomic_flag_test_and_set is the memory order parameter. The atomic operations I used above don’t take such a parameter, but there are versions of these functions that do: atomic_flag_clear_explicit and atomic_flag_test_and_set_explicit. The versions I used above are equivalent to the explicit versions with the parameter memory_order_seq_cst.
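In other words, the two forms below should be interchangeable:

atomic_flag_test_and_set(&l->flag);
/* is equivalent to */
atomic_flag_test_and_set_explicit(&l->flag, memory_order_seq_cst);

atomic_flag_clear(&l->flag);
/* is equivalent to */
atomic_flag_clear_explicit(&l->flag, memory_order_seq_cst);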
memory_order_seq_cst means sequentially consistent memory operation ordering.
To understand what kind of constraints memory_order_seq_cst implies, let’s look at the C standard. The relevant parts we need to pay attention to are:
- “sequenced before” relation
- “synchronizes with” relation
- “inter-thread happens before” relation
- “happens before” relation
Quite a few things to unpack here, but once you go through the formalisms, it’s not that hard to develop basic intuition about them.
Sequenced-before relation
Let’s start with the “sequenced before” relation. It’s an asymmetric and transitive relation between evaluations executed in a single thread.
This relation induces a partial order between evaluations in a single thread, which in this case means that for some of the evaluations we can tell whether one of them is computed before another.
Let’s take a look at a toy example:
#include <stdio.h>
#include <stdlib.h>
...
char *buf = malloc(large_enough_buffer_size);
char *p = buf;
...
while ((*p++ = getchar()) != EOF);
What can we say about evaluations in this example and how they are ordered?
I can say that malloc is definitely “sequenced before” p = buf. That’s because those two statements are separated by ‘;’, which creates a so-called sequence point between the two.
Similarly, I can tell that p = buf is “sequenced before” p++ and getchar because they are separated by a sequence point. Because “sequenced before” is a transitive relation, I can also say that malloc is “sequenced before” p++ and getchar, because it’s “sequenced before” p = buf, and, in turn, p = buf is “sequenced before” p++ and getchar.
In the same example, however, I cannot tell whether the increment of p is “sequenced before” getchar or not. The sequencing of those two evaluations is not determined, which demonstrates that the relation is not total, but in fact partial - not all evaluations can be ordered by this relation.
To summarize, the “sequenced before” relation defines how evaluations happening in a single thread can or cannot be ordered.
One point that I want to stress is that the “sequenced before” relation is strictly single threaded. It does not order, at least directly, evaluations that happen in different threads. For example, just because A is sequenced before B in thread T1, as unnatural as it sounds, it does not actually imply that another thread T2 will see the effects of A before it sees the effects of B.
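Here is a sketch of that last point (hypothetical thread functions, plain non-atomic variables):

int a = 0, b = 0; /* plain, non-atomic variables */

void thread1(void)
{
    a = 1; /* sequenced before the next statement... */
    b = 1; /* ...but only from thread1's point of view */
}

void thread2(void)
{
    /* Even after observing b == 1, this thread may still observe a == 0:
     * "sequenced before" in thread1 does not order what thread2 sees
     * (and this unsynchronized access is a data race in standard C). */
    if (b == 1 && a == 0) {
        /* this branch is allowed to be taken */
    }
}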
Synchronizes-with relation
This relation is really the core and probably the most important thing we need to understand.
Unlike the previous relation, this one is about evaluations, or operations, happening in different threads. And the language standard has to define which operations synchronize with each other.
The C standard covers several cases, but one particularly important case is that atomic release and atomic acquire operations on the same object synchronize with each other under some conditions.
What is an atomic release operation? An atomic release operation is an atomic operation that performs a write, or stores a value in the atomic variable, using one of the following memory ordering parameters:
- memory_order_release
- memory_order_acq_rel - this is just short for acquire and release
- memory_order_seq_cst - this is the one atomic operations use by default.
An acquire operation is an atomic operation that performs a read, or loads a value from an atomic variable, using one of the following memory ordering parameters:
- memory_order_acquire
- memory_order_acq_rel
- memory_order_seq_cst.
NOTE: some atomic operations do more than just a load or a store. For example, atomic_flag_test_and_set performs both a read and a write; in this case the operation can still be a release or an acquire operation even though it does more than just a store or just a load.
Now, I mentioned that two atomic operations synchronize with each other if they act on the same atomic variable, use the right memory ordering parameters, and satisfy certain conditions. What are those conditions?
Let’s say we consider two operations A and B, both acting on the same atomic variable M. Let A be a release operation, which means that, among other things, it writes a value to M. And let B be an acquire operation, so, among other things, it reads from M.
A synchronizes with B if B reads from M a value written by A or any atomic operation that wrote a value to M after A. In other words, A synchronizes with B, if B reads a value written by A or a value written after A wrote something to the atomic variable.
To put it even more simply, two operations on the same atomic variable synchronize with each other if one operation sees the result of the other, plus all the additional conditions on the memory order parameters.
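Here is a minimal sketch of the release/acquire pairing (the thread functions are hypothetical; only atomic_store_explicit and atomic_load_explicit from stdatomic.h are used):

#include <stdatomic.h>

int payload = 0;      /* plain, non-atomic variable */
atomic_int ready = 0; /* the atomic variable M */

void thread_a(void)
{
    payload = 42; /* sequenced before the release write */
    atomic_store_explicit(&ready, 1, memory_order_release); /* A */
}

void thread_b(void)
{
    /* B: spin until it reads the value written by A */
    while (atomic_load_explicit(&ready, memory_order_acquire) != 1)
        ;
    /* A synchronizes with B, so the write to payload is visible here
     * and payload is guaranteed to read as 42. */
}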
Inter-thread happens before and happens before relations
So now we have some idea about the “sequenced before” and “synchronizes with” relations; what remains is to understand some rules for how those relations can be combined with each other and what properties those combinations have.
In order to do that, the C standard introduces an additional relation called “inter-thread happens before”. We don’t need to know all the rules of this relation for this post, so I will simplify and only cover those that I’m going to use.
Again, let’s say we have two evaluations A and B; unlike in the case of the “sequenced before” relation, these evaluations may potentially be in different threads. We say that A “inter-thread happens before” B if one of these is true:
- A synchronizes with B
- There is some evaluation X such that A is sequenced before X and X inter-thread happens before B
- There is some evaluation X such that A inter-thread happens before X and X is sequenced before B.
So it’s a recursively defined relation, but it’s a rather intuitive definition that allows us to combine sequences of “sequenced before” and “synchronizes with” relations together.
The “happens before” relation is even simpler: evaluation A happens before B if one of the following is true:
- A is sequenced before B, which implies that A and B are executed in the same thread
- A inter-thread happens before B, in which case A and B can be executed in different threads.
Hopefully, the name of the relation “happens before” gives you a clue as to what this relation implies in practice, but the language standard additionally formalizes it by introducing the concept of a visible side effect.
Basically, a side effect of an operation A on some object M is visible to an operation B on the same object M if A happens before B and there is no other operation X affecting M such that A happens before X and X happens before B.
NOTE: in this case M does not even have to be an atomic object - as long as we can establish that A happens before B, the side effects of A should be visible, as long as they were not overridden by some other operation.
Putting things together
Now let’s try to put things together and return to our hypothetical example:
// Code before the critical section
spinlock_lock(&lock);
// Critical section that should only be executed by one thread at a time
tree_insert(tree, new_node);
spinlock_unlock(&lock);
// Code after critical section
Let’s consider an example with just two threads, T1 and T2, to be specific. This example will extend the same way to a larger number of threads.
Let’s say that thread T1 owns the lock lock at the moment and thread T2 tries
to take ownership by calling spinlock_lock.
Assuming that the tree_insert call does not get stuck indefinitely, thread T1 will eventually call spinlock_unlock. Remember that spinlock_unlock internally uses the function atomic_flag_clear, which writes to an atomic variable and uses memory_order_seq_cst. So it’s a release operation.
Concurrently, thread T2 calls spinlock_lock, and in its implementation we used the atomic_flag_test_and_set function, which, among other things, reads from an atomic variable using memory_order_seq_cst. So it’s an acquire operation.
If inside spinlock_lock T2 reads the value true from the atomic variable, it means that either T1 hasn’t written false into the atomic variable yet, or the write is not visible to T2 yet. Either way, T2 will re-try the operation in the loop.
Once T1 actually writes false to the atomic variable and T2 reads the written value from the same variable, we can tell that atomic_flag_clear inside spinlock_unlock synchronizes with atomic_flag_test_and_set inside spinlock_lock.
And at this point, we know that everything in thread T1 that was sequenced before the atomic_flag_clear in spinlock_unlock is guaranteed to have happened before the atomic_flag_test_and_set in spinlock_lock, and before everything that is sequenced after it in thread T2.
Thus we demonstrated that T2 cannot take ownership of the lock before thread T1 completed all its operations in the critical section and released the lock.
Improved implementation
The name of this section is somewhat misleading - I don’t actually know how much of an improvement this is in practice. That being said, a careful reader may have noticed that earlier I mentioned some other memory ordering options besides memory_order_seq_cst.
Specifically, there were also the following memory ordering parameters that would have provided us with the same properties as memory_order_seq_cst:
- memory_order_acquire
- memory_order_release.
memory_order_seq_cst provides the same guarantees as these two memory orderings plus some more. It stands to reason that if we don’t need those additional guarantees and can use memory_order_acquire and memory_order_release, it will give the compiler and processor more freedom and might yield better results.
I will additionally introduce here another type of operation in C called a fence. Fences can be used to establish “synchronizes with” relations in the execution of the program, just like atomic variables, except that they are not tied to any particular location in memory the way atomic variables are.
Finally, I will introduce one last additional type of memory ordering that the C standard allows for: memory_order_relaxed. Atomic operations that use memory_order_relaxed do not establish “synchronizes with” relations in the program execution and, in a way, memory_order_relaxed provides the weakest guarantees across all the memory orderings and allows the most freedom to re-order operations.
NOTE: atomic operations using memory_order_relaxed are still atomic though, meaning that observers will see the operation either complete or not, but will never see intermediate results of an atomic operation using memory_order_relaxed.
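A typical use case, as a sketch, is a statistics counter where we only care that no increments are lost, not about ordering relative to other memory operations:

#include <stdatomic.h>

atomic_long events = 0;

void record_event(void)
{
    /* Atomic, so no increments are lost, but this establishes no
     * "synchronizes with" relation with other threads. */
    atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
}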
With those additional tools and considerations we can rewrite our locks as follows:
void spinlock_lock(struct spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_relaxed));
    atomic_thread_fence(memory_order_acquire);
}

void spinlock_unlock(struct spinlock *l)
{
    atomic_thread_fence(memory_order_release);
    atomic_flag_clear_explicit(&l->flag, memory_order_relaxed);
}
Similarly to the analysis above, when the atomic_flag_test_and_set_explicit operation in spinlock_lock sees a value written by atomic_flag_clear_explicit inside spinlock_unlock, the atomic_thread_fence calls in the two functions help us establish the “synchronizes with” relation.
This implementation might appear more intuitive if you think of atomic_thread_fence as a barrier that prevents re-ordering of operations. Specifically, atomic_thread_fence(memory_order_acquire) prevents reads and writes that happen after the fence from being re-ordered before the fence, and atomic_thread_fence(memory_order_release) prevents reads and writes before the fence from being re-ordered after the fence.
This way we get exactly the property that I started with - operations that happen in the critical section cannot be moved outside of the critical section due to the fences we put around them.
NOTE: the new implementation has two advantages over the previous one. It replaces the stronger memory_order_seq_cst with the weaker memory_order_acquire and memory_order_release; besides that, it makes atomic_flag_test_and_set_explicit, which is called in the loop inside spinlock_lock, use the weakest memory_order_relaxed, so we don’t have to call “a heavy” memory_order_seq_cst atomic operation in a loop.
What about ARM?
So far, everything we’ve looked at is provided by the C standard in an architecture-agnostic way. The same spinlock should work just as correctly on ARM machines as it would on Intel machines.
FWIW, C is still called a high-level programming language, so it should, as much as practical, abstract away the details of the specific hardware, and, at least when it comes to memory ordering, it does so well enough for us not to worry about how it would work on ARM, Intel, or any other hardware for that matter.
However, it would be interesting to see what this implementation actually translates to, so let’s take a look…
llvm-objdump -d spinlock.o
spinlock.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <spinlock_setup>:
0: 1f 00 00 39 strb wzr, [x0]
4: c0 03 5f d6 ret
0000000000000008 <spinlock_lock>:
8: 28 00 80 52 mov w8, #1
c: 09 7c 5f 08 ldxrb w9, [x0]
10: 08 7c 0a 08 stxrb w10, w8, [x0]
14: 29 01 00 12 and w9, w9, #0x1
18: 49 01 09 2a orr w9, w10, w9
1c: 89 ff ff 35 cbnz w9, 0xc <spinlock_lock+0x4>
20: bf 39 03 d5 dmb ishld
24: c0 03 5f d6 ret
0000000000000028 <spinlock_unlock>:
28: bf 3b 03 d5 dmb ish
2c: 1f 00 00 39 strb wzr, [x0]
30: c0 03 5f d6 ret
I will start with the spinlock_unlock function, given that it’s simpler. The strb instruction is just a variant of a regular store instruction; the b suffix means that this store stores a single byte.
It takes a value from the wzr register, which is a special register that always reads as 0, so it’s just a fancy way to get the value 0.
And x0, according to the calling convention, contains the first parameter of the function, which in our case would be the struct spinlock pointer.
Putting things together, strb wzr, [x0] writes 0 to the first byte of the struct spinlock passed as a parameter. Given that struct spinlock consists of just the atomic_flag, it’s not hard to see that this writes 0 to the flag.
NOTE: there is nothing specific about the strb instruction that makes the store atomic; that’s because there isn’t any need for that - some stores in ARM are atomic, i.e. indivisible, by default without any additional actions.
The only interesting part of this function is the dmb ish instruction. This is a memory barrier instruction, and that’s basically what atomic_thread_fence(memory_order_release) translated to in the end - a single instruction.
dmb is a data memory barrier instruction. As per the ARM reference manual, dmb ish prevents re-ordering of read and write operations before the barrier with read and write operations after the barrier.
NOTE: one rather simplistic and inaccurate way to think about the dmb instruction is that it waits until the relevant operations complete before allowing any other loads or stores, i.e. dmb ish might wait for all loads and stores before the barrier to complete before any new loads and stores would be allowed to start.
Now, let’s take a look at the implementation of the spinlock_lock. We know
that it must contain a loop of some form, so let’s figure out where in the code
the loop is hidden first.
The key to finding a loop is finding a conditional jump instruction. In this case cbnz is such an instruction. As you can see, it has two parameters: register w9 and some address.
This instruction compares the value in the register with 0, and if the value is not 0, it will transfer control to the address given as the second parameter. Otherwise control goes to the next instruction after it.
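In C-like pseudocode, cbnz w9, target behaves roughly like this:

/* cbnz w9, target - a rough C rendering: */
if (w9 != 0)
    goto target;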
Looking at the address where cbnz would transfer control in our case we can
derive the general structure of the loop:
mov w8, #1
# before the loop
body:
# The loop body
ldxrb w9, [x0]
stxrb w10, w8, [x0]
and w9, w9, #0x1
orr w9, w10, w9
cbnz w9, body
# after the loop
dmb ishld
ret
Let’s try to parse the body of the loop next. There are two important instructions to pay attention to here: ldxrb and stxrb.
ldxrb is Load Exclusive Register Byte. This is a special instruction that reads a byte from memory, indicated by address [x0], into register w9.
stxrb is Store Exclusive Register Byte. This is another special instruction that writes a byte from register w8 to memory, indicated by address [x0]. However, stxrb only succeeds if the memory indicated by address [x0] has not been modified since the ldxrb instruction.
Whether stxrb was able to successfully write the data or not is recorded in register w10. It will contain 0 if the store operation was successful and 1 otherwise.
The ldxrb and stxrb instructions work together with each other and form a pair of operations often referred to as “load linked/store conditional”. These operations basically implement atomic_flag_test_and_set_explicit in our case. And we have to do them in a loop, until the store succeeds, just like it’s written in the C code.
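In C-like pseudocode, with hypothetical helpers standing in for the ldxrb and stxrb instructions, the generated loop does roughly this:

/* Hypothetical helpers standing in for ldxrb/stxrb:
 * store_exclusive_byte returns 0 on success, 1 on failure. */
unsigned old, failed;

do {
    old    = load_exclusive_byte(&l->flag);     /* ldxrb */
    failed = store_exclusive_byte(&l->flag, 1); /* stxrb */
} while (failed || (old & 1)); /* retry if the store failed or
                                  the lock was already taken */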
One last finishing touch is the dmb ishld instruction. Just like the dmb ish instruction we looked at above, it’s a memory barrier instruction. The only difference is that this particular memory barrier prevents loads that happen before the fence from being re-ordered with loads and stores after the fence.
Why the difference? The dmb ish that was used earlier would work here as well, but dmb ishld is a weaker barrier than dmb ish. And while dmb ishld is weaker than dmb ish, it’s still enough for correctness, so it was preferred.
Remember that we want loads and stores that happened within the critical section to stay within the critical section and not escape it. In other words, by the time we finish the memory barrier instruction, we want all the effects of the loads and stores to be visible to everyone.
We don’t need to prevent any loads and stores that happen outside of critical section getting inside the critical section, as it’s not needed for correctness.
The only reason dmb ish was used in spinlock_unlock is that there is no other, weaker barrier that would give the right properties in the ARM instruction set.
NOTE: let’s think a bit about what properties we actually need here in terms of barriers. In the spinlock_lock function we need to make sure that all loads and stores within the critical section happen strictly after the load from the atomic_flag that atomic_flag_test_and_set does, so we need to prevent re-ordering of all the loads and stores after the barrier with the load before the barrier. Thinking similarly about spinlock_unlock, we want to prevent all loads and stores before the barrier from being re-ordered with the store after the barrier. For the first case we have dmb ishld, which provides exactly the properties we need, but the ARM architecture does not have an instruction with exactly the properties we need for the latter case, so we have to resort to the next best thing, and that’s dmb ish.
Back to sequential consistency
Out of curiosity, does going back to memory_order_seq_cst produce different code? Based on what we’ve learned above, it should, so let’s take a look at what this code translates to:
void spinlock_lock(struct spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_relaxed));
    atomic_thread_fence(memory_order_seq_cst);
}

void spinlock_unlock(struct spinlock *l)
{
    atomic_thread_fence(memory_order_seq_cst);
    atomic_flag_clear_explicit(&l->flag, memory_order_relaxed);
}
And indeed it does:
llvm-objdump -d spinlock.o
spinlock.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <spinlock_setup>:
0: 1f 00 00 39 strb wzr, [x0]
4: c0 03 5f d6 ret
0000000000000008 <spinlock_lock>:
8: 28 00 80 52 mov w8, #1
c: 09 7c 5f 08 ldxrb w9, [x0]
10: 08 7c 0a 08 stxrb w10, w8, [x0]
14: 29 01 00 12 and w9, w9, #0x1
18: 49 01 09 2a orr w9, w10, w9
1c: 89 ff ff 35 cbnz w9, 0xc <spinlock_lock+0x4>
20: bf 3b 03 d5 dmb ish
24: c0 03 5f d6 ret
0000000000000028 <spinlock_unlock>:
28: bf 3b 03 d5 dmb ish
2c: 1f 00 00 39 strb wzr, [x0]
30: c0 03 5f d6 ret
As you can see, with memory_order_seq_cst we now use dmb ish, the strongest
barrier of those we’ve seen, both in spinlock_lock and spinlock_unlock. So
we do in fact see a difference in the generated code, though it’s a difference
in just a single instruction.
How does Linux do it?
We looked at how to implement spinlocks in standard C and then looked at what ARM instructions they translate to. I hope I didn’t make any terrible mistakes there that would lead you towards incorrect assumptions about how synchronization works on ARM.
That being said, let’s take a look at how the Linux Kernel does it and compare it with what I created above. The Linux Kernel spinlock implementation comes with all the complexity of a large code base that needs to support multiple different architectures and a lot of tooling (including tooling used to enforce correctness).
Once you get through the complexities, you will find that the actual implementation of ARM spinlocks lives in arch/arm/include/asm/spinlock.h and it’s not that large or complicated.
And even though it’s not that large, it has a few fundamental differences from the implementation above:
- It uses a different algorithm
- It employs an ARM-specific relaxation strategy
- It uses dmb ish barriers in both lock and unlock.
Let’s start by briefly touching on the algorithm it uses. It appears that the ARM spinlock implementation in the Linux Kernel relies on the so-called ticket lock algorithm.
Like our algorithm, it relies on some kind of read-modify-write operation and has a loop in the lock function. However, unlike our algorithm, the ticket lock provides an additional guarantee of fairness that our algorithm does not have.
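For reference, here is a hedged sketch of the ticket lock idea (my illustration, not the kernel’s code): each locker atomically grabs the next ticket and waits until its number is served, which makes the lock FIFO-fair.

#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;  /* next ticket to hand out */
    atomic_uint owner; /* ticket currently being served */
};

void ticket_lock_lock(struct ticket_lock *l)
{
    /* atomically grab a ticket; the grab itself needs no ordering */
    unsigned ticket = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);
    /* wait until our ticket is served */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != ticket)
        ;
}

void ticket_lock_unlock(struct ticket_lock *l)
{
    /* serve the next ticket in line */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}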
The next difference is that the Linux Kernel implementation does not just spin in a loop like I did above. Instead, it executes the wfe instruction in the loop inside the lock implementation while it waits to acquire ownership of the lock.
The ARM documentation describes wfe, or Wait For Event, as a hint to the CPU that it could enter a low-power state and remain in that state until a wake-up event occurs.
This seems to suggest that an ARM CPU can potentially enter a mode of operation that consumes less energy and stay in such a state until something, e.g. another CPU presumably holding the lock, executes the sev instruction that will wake it up.
Finally, the arch_spin_lock function ends with smp_mb, and the arch_spin_unlock function starts with the same smp_mb call. smp_mb is a memory barrier function, an equivalent of the C atomic_thread_fence function, if you will.
So the Linux Kernel uses the same type of barrier in both the lock and unlock functions.
And digging through the code you can indeed find that on ARM the smp_mb function translates into the dmb ish instruction, which is a stronger barrier than needed, at least to the best of my understanding.
Instead of conclusion
Hopefully I explored the rather well-explored subject of creating a spinlock in enough detail, and followed up with enough specifics of the memory model, for this post not to be way too boring.
I was a bit surprised to find that lock in the Linux Kernel enforces a full memory barrier. Naturally, I suspected that I was wrong in my understanding of how things work, and that’s still certainly a plausible explanation.
That being said, reading doc/Documentation/memory-barriers.txt suggests that the lock operation in the Linux Kernel indeed only needs to behave as an acquire operation.
Whether the Linux Kernel implementation is too conservative or not in its use of memory barriers, at least from a correctness point of view, using stronger and simpler-to-reason-about barriers seems wise given the overall complexity. And I’m not even sure using a weaker barrier would have a measurable effect.
tags: aarch64 - synchronization - concurrency