Synchronizing threads in C and C++

Introduction

Question: How should I synchronize threads with shared variables in C and C++?

Or: What semantics are guaranteed by C and C++ when executing multiple threads that access shared variables?

I am an experienced computer programmer, yet I find myself unable to give precise answers to these questions. This bothers me because I believe that, in order to be a good engineer, I need to have a precise understanding of the technology I use.

I tried to educate myself by searching the internet. Although I found many explanations about multi-threading, most of them do not provide straight answers to my questions. Some explanations are just wrong or misleading. Other explanations focus on machine-level aspects in a way that is not directly relevant to the C++ code. Many authors give sensible advice, but do not clarify it at the level of detail that I'm looking for.

I looked for answers in the C++ language specification, but that did not help me much. The formal specification is much too complicated to use as a mental model while writing code.

However, I believe there are intuitive mental models that can be used to understand many multi-threaded programming patterns that are used in practice.

Outline of this article

My goal for this article is to explain thread synchronization at 3 different levels. We start with a simple rule of thumb, then move on to more complete but more complicated models.

  • Level 1: Always use a mutex.

  • Level 2: A precise model for programs without low-level atomics.

  • Level 3: An incomplete model for programs with low-level atomics.

I also provide pointers to literature which will hopefully help you to find answers to your own questions about multi-threading. That may be the most important part of this article. I spent a lot of time finding the papers that introduce various core concepts of the C++ thread semantics.

Non-goal: This is not an entry-level introduction to threads. I assume you already know a lot about C++ and multi-threading. This article does not discuss practical ways to design multi-threaded code.

If you are looking for such things, I recommend that you stop reading this article and instead read the book "C++ Concurrency In Action" by Anthony Williams.

Disclaimer: I am not a computer scientist, and not an expert on any subject discussed here. I'm just a programmer. If you spot any errors or misleading explanations in this article, please send me an email.

Multi-threading concepts in C and C++

Both C and C++ support threads. Starting with C11 and C++11, the semantics of multi-threaded programs are explicitly defined in the language specifications. The multi-threaded aspects of C are equivalent to those of C++: both languages have the same rules for thread synchronization and accessing shared variables. From this point on, I will talk about C++ but everything I say is also applicable to C.

The library class std::thread creates a new thread. Every thread must eventually be cleaned up by calling thr.join() or thr.detach().
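
For example, here is a minimal sketch of starting and joining two threads (the function worker is just a placeholder of my own, not part of any standard API):

#include <thread>

void worker() {
    // ... thread body: do some work ...
}

int main() {
    std::thread t1(worker);   // start two threads running worker()
    std::thread t2(worker);
    t1.join();                // wait for both threads to finish;
    t2.join();                // every thread must be joined or detached
}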

The class std::mutex may be used to define sections of code that are not allowed to be executed simultaneously by multiple threads. A thread may call mut.lock() to lock the mutex. If another thread has already locked the mutex, this function blocks until the mutex is released. When a thread no longer needs the lock, it must release it by calling mut.unlock(). A mutex may only be unlocked by the thread that locked it.

The template std::atomic provides atomic variables. For example std::atomic<int> is an atomic integer type. Atomic operations have two special properties. First, atomic operations never expose an intermediate state of the object. When two threads access the same atomic object concurrently, each thread observes the other atomic operation as either not-yet-started or already-completed, but never as an intermediate state. Furthermore, atomic operations can be used to synchronize threads.

Atomic operations come in two categories: high-level and low-level. High-level atomic operations are those that do not explicitly specify a memory order parameter (memory_order_xxx). Atomic operations that do specify a memory order are low-level operations. High-level operations are relatively easy to understand, while low-level operations are extremely tricky.
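
As an illustration (a hypothetical snippet, not one of the numbered examples), both categories use the same operations on std::atomic; the only syntactic difference is the explicit memory order parameter:

std::atomic<int> a{0};

void categories() {
    a.store(1);                                 // high-level: no memory order parameter
    int v = a.load();                           // high-level

    a.store(2, std::memory_order_release);      // low-level: explicit memory order
    int w = a.load(std::memory_order_acquire);  // low-level
    (void)v; (void)w;                           // silence unused-variable warnings
}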

In C, thread management is provided by the thrd_xxx functions, mutexes are provided by the mtx_xxx functions, and atomic operations are provided by the atomic_xxx types and functions.

Level 1: Always use a mutex

The easiest way to synchronize your threads is by following this rule:

  • For each shared variable, choose a mutex that protects the variable.
  • Whenever a thread accesses a shared variable, it must hold the mutex that protects that variable.

This is a good rule for several reasons. First, it typically leads to programs that are easy to understand (well, easier than programs that do not follow the rule).

Second, programs that follow this rule are guaranteed to be free of data races. A data race is a specific type of bug. At this point I can not explain exactly what it means; we will get to that later.

/* example 1 */
std::mutex mut;
int x = 0, y = 0;

void thread1() {
    mut.lock();
    if (y == 0)
        x = 1;
    mut.unlock();
}

void thread2() {
    mut.lock();
    if (x == 0)
        y = 1;
    mut.unlock();
}

In example 1, shared variables x and y are both protected by the mutex mut. Both threads only ever access x or y while holding mut. Therefore this program is correctly synchronized and free of data races.

Note: The rule says you may never access a shared variable without locking the mutex, not even when you only read from the variable.

Note: A variable must be protected by the same mutex each time it is accessed. A mutex can protect multiple variables, but each variable must have just one mutex protecting it.

Let's look at an example where the rule is broken:

/* example 2 (BAD C++) */
std::mutex mut;
int x = 0, y = 0;

void thread1() {
    mut.lock();
    if (y == 0)
        x = 1;
    mut.unlock();
}

void thread2() {
    if (x == 0) {  // BAD: access x without lock
        mut.lock();
        if (x == 0)
            y = 1;
        mut.unlock();
    }
}

In example 2, someone tried to optimize thread2 by checking the value of x before taking the lock. This is clearly a violation of our rule. In fact the unsynchronized read from x is a potential data race. This program has undefined behaviour, and no good will come of it.

Mutexes are often explained as a way to ensure that related actions occur together, such that other threads are unable to observe the intermediate state between the actions. That is a fine way to use a mutex. But note that the level 1 rule goes much further: Even a single access to a shared variable must use the mutex. Even a single read access of a shared variable must use the mutex. That is the rule.
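
For instance, under this rule even a read-only accessor for x from example 1 must take the lock. A minimal sketch (the function read_x is my own addition):

int read_x() {
    mut.lock();
    int v = x;      // even a lone read of a shared variable needs the lock
    mut.unlock();
    return v;
}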

Programs that follow the level 1 rule are free of data races but can still contain other kinds of concurrency bugs. For example, a thread that unlocks the mutex between related changes to a shared data structure may accidentally allow another thread to observe the intermediate state of the data. As another example, careless use of mutexes may cause deadlock.

In example 3 below, assume that the function pay_debt() may be invoked by multiple threads. This program follows the level 1 rule: every access to the shared variables (savings and debt) is protected by the mutex. However, the code unlocks the mutex between related updates. As a result, the same debt could accidentally be paid off more than once if multiple threads call the function simultaneously. There is no data race in this program, but the program clearly has a thread-related bug.

/* example 3 */
std::mutex mut;
int savings = 43;
int debt = 7;

void pay_debt() {
    mut.lock();
    savings -= debt;
    mut.unlock();

    mut.lock();
    debt = 0;
    mut.unlock();
}
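
A possible fix (my sketch, not one of the numbered examples) is to hold the mutex across both updates, so that no thread can act on the intermediate state:

void pay_debt_fixed() {
    mut.lock();
    savings -= debt;   // both updates happen while holding the mutex,
    debt = 0;          // so no other thread can pay the same debt twice
    mut.unlock();
}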

The level 1 rule is a good rule. It avoids nasty threading bugs and does not require a detailed understanding of the C++ memory model. So follow the rule and be happy!

However, maybe you want to know why the rule works, or understand the precise semantics of programs that follow the rule, or learn when you can break the rule without breaking your program. In that case, let's move on to level 2.

What is sequential consistency?

Sequential consistency is a model for the execution of a multi-threaded program. Understanding sequential consistency is important because we will use it as a conceptual tool to talk about other models.

Please keep in mind that C++ does not provide sequential consistency. The model which I describe in this section is not directly applicable to C++ programs.

Sequential consistency can be informally understood as follows:

  • Within each thread, actions happen in program order. Program order is simply the order in which the thread executes its code as defined by the single-threaded semantics of the programming language.

  • In a multi-threaded program, all actions happen according to some interleaving of the actions of the threads. At each point, the computer picks an arbitrary thread and executes the next action in the program order of that thread.

Let's look at an example to get some feeling for what this means:

      int x = 0, y = 0;
      int p, q;

  thread1:    thread2:
    x = 1;      y = 1;
    p = y;      q = x;

This program has several sequentially consistent executions (interleavings of thread actions). Some of them are shown below:

     execution 1                 execution 2                 execution 3
  thread1:  thread2:          thread1:  thread2:          thread1:  thread2:
  x = 1;                      x = 1;                                y = 1;
  p = y;                                y = 1;            x = 1;
            y = 1;            p = y;                      p = y;
            q = x;                      q = x;                      q = x;
  (result: p=0, q=1)          (result: p=1, q=1)          (result: p=1, q=1)

Note that the result of this program (the final values of the variables) depends on which of the sequential-consistent executions ends up occurring. That's fine. Sequential consistency does not imply that the program has only one possible result. Note, however, that there is no sequentially consistent execution that could result in p=0, q=0.

Now that we have some intuition for sequential consistency, we can move to a more precise definition. An execution is sequentially consistent if it satisfies the following conditions:

  • The combined set of all actions of all threads is executed in some total order: the sequential order.

  • For each thread, the sequential order of its actions is consistent with the program order of that thread. The program order is simply the order of evaluations and side-effects according to the rules that also apply to single-threaded code. In the C++ specification, this is called the sequenced-before relation.

    In most cases, the program order follows clearly and intuitively from the code. Note however that C++ leaves the order of actions unspecified in some cases. For example, when calling a function, the parameters passed to the function are evaluated in some unspecified order.

  • Reading a memory location returns the value that was most recently written to it according to the sequential order of actions.

  • After a thread locks a mutex, the same thread must unlock that mutex at a later point in the sequential order. In the meantime, the mutex may not be locked by any thread.

Sequential consistency is a deeply intuitive model for thinking about threads. In fact, it is so intuitive that many inexperienced programmers take sequential consistency for granted and expect that C++ provides it by default. But they are wrong. C++ does not guarantee sequential consistency unless the program follows very specific rules to synchronize its threads. Programmers who are not aware of these rules and still expect sequential consistency are likely to cause weird bugs that are difficult to figure out.

What is a data race?

A data race is a conflicting, concurrent, non-atomic access of the same memory location by two threads. Data races are bad and must be avoided.

Let's look at each technical term that appears in this definition.

A memory location contains either a single scalar object or a sequence of adjacent bit fields. Separate objects thus occupy separate memory locations, unless both objects are bit fields in the same struct instance. Threads do not interfere with each other when they access separate memory locations.

In the following example, the objects a, b[0], b[1] and s.c are all in different memory locations, but s.d and s.e share the same memory location:

int a;
char b[2];
struct { int c; int d:2, e:3; } s;

There are several types of actions that access memory:

  • load: Happens when taking the value of a memory location (reading a variable).
  • store: Happens when writing a value to a memory location (assigning to a variable).
  • read-modify-write: Only applies to atomic variables.
  • lock and unlock: Only applies to mutexes.

A load or store action can be either atomic or non-atomic.

Two actions are conflicting accesses if they both access the same memory location and at least one action is a store or a read-modify-write.

In the context of a sequentially consistent analysis, two accesses are concurrent if they are directly adjacent in the sequential order of actions. This definition is precise enough for our level 2 model. For our level 3 model, we will need a more general definition: two accesses are concurrent if they are not ordered with respect to each other.

A data race occurs when:

  • two different threads perform conflicting accesses
  • and the accesses are concurrent
  • and at least one of the accesses is non-atomic.

Note: A data race does not occur when all involved threads are only reading the memory location (even if the reads are non-atomic and concurrent).

Note: A data race does not occur when all involved threads perform atomic access (even if the accesses are conflicting and concurrent).

Informally, a data race is a situation where two threads access the same variable, and the outcome would depend on the precise order in which these accesses just happen to occur, all else being equal. You may imagine a race contest between the two threads, with both threads trying to get to the memory location as fast as they can. Depending on which thread wins the contest, the result would turn out differently. However, it is not a data race if both threads use atomic access.

Level 2: A model for mutexes and high-level atomics

The data-race-free model is a relatively simple mental model for multi-threaded C++ programs. This model correctly describes the semantics of C++ programs that don't use low-level atomics (Boehm, Adve, 2008). The model consists of a condition and a promise:

  • If there is no data race in any sequentially consistent execution of the program,
  • then the program will execute with sequentially consistent semantics,
  • else the program has undefined behaviour.

The if condition is a responsibility of the programmer. The programmer must make sure that the program does not contain any data race. To figure out whether the program might have a data race, all possible sequentially consistent executions of the program must be considered. If none of those executions contain a data race, the program is data-race-free.

The then part is a promise from the C++ system. If the program is data-race-free, it will appear to execute with sequentially consistent semantics, i.e. as if the actions of all threads occur in some interleaved order. (The system is still free to perform the actions in a different order internally, as long as the outcome of the program matches a sequentially consistent execution.)

The else part is a severe punishment for programs that are not data-race-free. Such programs have undefined behaviour, which means that absolutely nothing is guaranteed about them. This implies that a data race is just as bad as an invalid pointer reference (i.e. extremely bad).

The great thing about this model is that it allows us to assume sequential consistency while thinking about the program. We don't have to worry about some thread observing actions in a different order, or whether some access can be re-ordered to a different point in the program. None of that. If we can prove, assuming sequential consistency, that the program is data-race-free, then the program is data-race-free and runs as if sequentially consistent. Nothing will be reordered in a way that affects the outcome of the program.

Please keep in mind that this model does not apply to C++ programs that use low-level atomics (atomic operations with an explicit memory_order_xxx parameter). But the model works fine for programs that use mutexes and/or high-level atomics and/or condition variables.

Let's look at some examples to see how this model applies to a program:

  • As a first exercise, scroll back up to example 1 and consider its sequentially consistent executions. Do you see why this program does not have a data race in any sequentially consistent execution?

  • Then take another look at example 2. Can you find a sequentially consistent execution that exposes a data race?

  • Consider the level 1 rule: always lock a mutex. Do you see why this rule guarantees that there will never be a data race?

  • Consider example 4 below. This program is data-race-free.

    The actions a.store(1) and a.load() do not cause a data race because they are both atomic accesses (even though they are conflicting and may happen concurrently).

    The actions x=1 and x+=1 do not cause a data race because they can not happen concurrently (even though they are conflicting and non-atomic). To understand why these actions are never concurrent, consider that thread2 only observes a.load()==1 if thread1 accessed x before a.load(), but thread2 accesses x after a.load(). Thus, the two conflicting accesses to x are not directly adjacent, therefore no data race. On the other hand, if thread2 observes a.load()==0, thread2 will not execute its access of x, therefore again no data race.

    Note that the result of this program depends on the order in which actions occur at run time. Some executions of this program could end with x having value 1, other executions could end with value 2. That's just fine. Data-race-free does not imply that the program has only one possible result. The result of the program may depend on some unpredictable order of actions, just as long as there is no data race.

/* example 4 */
int x = 0;
std::atomic<int> a{0};

void thread1() {
    x = 1;
    a.store(1);
}

void thread2() {
    if (a.load() == 1)
        x += 1;
}

  • Consider example 5 below. Although it may seem that it should do the same as example 4, this program has a data race and therefore its behaviour is undefined. The problem is that thread2 always accesses x, regardless of the value returned by a.load(). As a result, this program has a sequentially consistent execution where both threads access x concurrently (i.e. directly adjacent). Since these accesses are also conflicting and non-atomic, they form a data race.

/* example 5 (BAD C++) */
int x = 0;
std::atomic<int> a{0};

void thread1() {
    x = 1;
    a.store(1);
}

void thread2() {
    x += a.load();
}

  • Consider example 6 below. This program is data-race-free.

    The assignments to z from the two threads do not cause a data race, because at most one of these assignments actually happens. In any sequentially consistent execution of the program where thread1 observes b.load()==0, there must come a later point in the execution where thread2 observes a.load()==1. And vice versa.

    This program may result in either z==0 or z==1 or z==2 depending on the order in which actions occur. But there will never be a data race.

/* example 6 */
std::atomic<int> a{0}, b{0};
int z = 0;

void thread1() {
    a.store(1);
    if (b.load() == 0)
        z = 1;
}

void thread2() {
    b.store(1);
    if (a.load() == 0)
        z = 2;
}

Beyond sequential consistency

Programs that use low-level atomics may behave in ways that are not sequentially consistent. In order to understand low-level atomics, we have to say goodbye to sequential consistency and learn new rules for thinking about multi-threaded programs.

The first new rule is this: Absolutely nothing may be assumed about the order in which threads observe each other's actions. If thread1 performs a sequence of store actions to shared variables, it may appear to thread2 that these changes occur in a different order. And if thread3 also looks at the same variables, it may observe the same changes in yet another order. The only way to establish a consistent view between threads is through explicit synchronization actions.

Using the new rules will not be easy. Sequential consistency is so deeply intuitive that you may find yourself accidentally thinking that way, even in situations where it is simply not applicable. This can lead to bugs where the code appears correct if you only consider sequentially consistent executions, while C++ in fact allows a different execution that exposes the bug.

Normal programmers, like me, should probably not mess around with low-level atomics. If I tried, I would introduce weird bugs that are difficult to detect and even more difficult to track down.

Why do I try to understand low-level atomics if I don't plan to use them? Because I want to see what hides below the surface of the simpler models. And because I sometimes need to debug code that somebody else wrote using low-level atomics.

Low-level atomics

Low-level atomics are atomic operations that specify an explicit memory order parameter.

All atomic operations, high-level or low-level, regardless of memory order parameter, provide the following guarantees:

  • Atomic operations are atomic in the sense that no thread can observe or interact with an intermediate state during the operation. To any other thread, the operation appears either as not yet started or as fully completed. This applies even to atomic read-modify-write operations such as fetch_add().

  • Atomic operations will not cause a data race. If two threads concurrently access the same variable, it does not count as a data race as long as both threads use atomic access. (If one thread uses non-atomic access, it may be a data race.)

    The outcome of concurrent atomic access by two threads may depend on the order in which the operations happen to occur, which is typically unpredictable. But it does not count as a data race.

In addition to these two guarantees, atomic operations can also be used to synchronize threads. The synchronization properties of an atomic access depend on the memory order parameter. The following memory orders can be used:

  • memory_order_relaxed
    Does not provide any synchronization.

  • memory_order_acquire
    Provides acquire semantics.
    Can only be used for atomic load or atomic read-modify-write.

  • memory_order_release
    Provides release semantics. (We will soon look at acquire and release in detail.)
    Can only be used with atomic store or atomic read-modify-write.

  • memory_order_acq_rel
    Provides both acquire and release semantics.
    Can only be used with atomic read-modify-write.

  • memory_order_seq_cst
    Provides acquire semantics when used with atomic load or atomic read-modify-write.
    Provides release semantics when used with atomic store or atomic read-modify-write.
    In addition, the operation is sequentially consistent with respect to other memory_order_seq_cst operations.

    This is the default memory order and provides the strongest guarantees.
    High-level atomics are equivalent to low-level atomics with memory_order_seq_cst.

(There is also memory_order_consume but I'm not going to talk about that.)

Why are there so many different memory orders to choose from? Because the memory orders that provide strong guarantees are also relatively slow. Operations with weak memory orders can be much faster, depending on the hardware architecture and the way in which they are used.

If you want the highest performance that the hardware architecture can offer, you should specify the weakest memory order that is still sufficient to guarantee the correctness of your code. You may be able to save hundreds of clock cycles by avoiding memory_order_seq_cst in a situation where memory_order_acquire suffices. Such optimizations make sense in a thread support library, or in lock-free data structures, or for similar low-level programming tasks. But you should probably not use this in normal application code.

Acquire/release semantics

As described above, an atomic store (or read-modify-write) access can have release semantics, and an atomic load (or read-modify-write) access can have acquire semantics. These two concepts fit together as follows:

  • If thread1 performs a release access,
  • and thread2 performs an acquire access on the same memory location
  • and the acquire in thread2 reads the value stored by the release in thread1
  • then all actions performed by thread1 before the release will appear to happen before any actions performed by thread2 after the acquire.

In terms of the C++ standard, a store-release synchronizes with a load-acquire which reads the value stored by the release.

Consider example 7 below. Thread1 performs an atomic store-release, and thread2 performs an atomic load-acquire of the same variable. If the acquire in thread2 reads the value stored by thread1, then thread2 continues to execute part 2B. All actions from part 1A will appear to happen before any actions from part 2B. This implies that actions in 2B will observe the effects of actions from 1A. Conversely, actions in 1A will definitely not observe any effects from 2B.

Nothing can be concluded about the effects of actions performed by thread1 after the release, or by thread2 before the acquire.

There is no guarantee that the acquire in thread2 will observe the release from thread1. Perhaps thread2 performs its acquire before thread1 gets to the release. Or perhaps the effects of the store-release have not yet propagated to thread2. In such cases, the load-acquire in thread2 returns 0 (the previous value of the variable) and thread2 continues to execute part 2C. When this happens, no synchronization has taken place and nothing can be concluded about the relative order of actions from the two threads.

If the load-acquire in thread2 returns 0, you might argue that the acquire must have happened before the release, and thus retroactively conclude that part 2A must have happened before part 1B. However that is sequentially consistent thinking, which is no longer allowed! Nothing can be concluded when an acquire does not observe the effect of a release.

/* example 7 */
std::atomic<int> a{0};

void thread1() {
    // Part 1A:
    // Actions here will appear to happen before any actions from 2B
    // (if part 2B is executed at all).

    a.store(1, std::memory_order_release);

    // Part 1B:
    // Actions here may appear to happen at any point relative to thread2.
}

void thread2() {
    // Part 2A:
    // Actions here may appear to happen at any point relative to thread1.

    if (a.load(std::memory_order_acquire) == 1) {
        // Part 2B:
        // Our load-acquire observed the store-release from thread1.
        // Therefore our actions here will appear to happen after
        // any actions from 1A.
    } else {
        // Part 2C:
        // Actions here may appear to happen at any point relative to thread1.
        // Our load-acquire did not see the store-release from thread1.
        // We can not conclude ANYTHING from that.
        // We can NOT conclude that 2A happened before 1B.
    }
}

Note: When we say that the acquire must read the value stored by the release, this means that the acquire must actually observe some causal effect of the release. It won't work if the acquire reads a value which has already been there for ages and just happens to be the same value as the one stored by the release. The computer will not be fooled by the trick in example 8 below.

/* example 8 */
std::atomic<int> a{42}, x{0};

void thread1() {
    x.store(5, std::memory_order_relaxed);
    a.store(42, std::memory_order_release);
}

void thread2() {
    if (a.load(std::memory_order_acquire) == 42) {
        // We read 42, which is the value stored by thread1.
        // However this proves nothing because the value 42
        // would already be there in any case. So now we
        // don't know if we are synchronized with thread1
        // and the assert below may very well fail.

        assert(x.load(std::memory_order_relaxed) == 5);  // this can fail
    }
}
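
To close this section, here is a classic use of acquire/release semantics: a minimal spinlock sketch (my own construction, not a numbered example). The acquire on lock and the release on unlock ensure that everything the previous owner did in its critical section happens before the next owner's critical section:

std::atomic<bool> locked{false};

void spin_lock() {
    // exchange() atomically sets the flag and returns its previous value.
    // Acquire: all actions of the previous owner before its unlock
    // will appear to happen before our critical section.
    while (locked.exchange(true, std::memory_order_acquire)) {
        // spin until we observe the lock as free
    }
}

void spin_unlock() {
    // Release: our writes in the critical section become visible to
    // the next thread whose acquire observes this store.
    locked.store(false, std::memory_order_release);
}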

Level 3: A model for low-level atomics

I will now describe a model for the semantics of multi-threaded C++ programs with low-level atomics. This model is safe but incomplete. All programs that are correct according to this model are really correct. But there are correct C++ programs that seem incorrect according to this model.

We may say that an action A happens-before an action B. It means that B may observe the effect of A, while A will certainly not observe any effect of B.

  • Happens-before is a partial order. There can be many pairs of actions P and Q where P does not happen-before Q and neither does Q happen-before P.

  • The happens-before relation is transitive: if A happens-before B and B happens-before C, then A happens-before C.

  • The happens-before relation is irreflexive: no action happens before itself.

  • The happens-before relation is a property of a program execution, not of the program itself. If the same program is executed more than once, the happens-before relation may turn out different in each execution.

  • Within a program execution, the happens-before relation involves the actions of all threads. It is not a separate relation per thread or anything like that.

Any execution of a multi-threaded program must obey the following rules:

  • Threads may observe each other's actions in any order, unless explicitly restricted by these rules.

  • Within each thread, actions happen in program order. If a thread performs actions A and B, and A is sequenced-before B according to the single-threaded evaluation rules of C++, then A happens-before B.

    The sequenced-before relation is simply the rule that determines when side effects become visible within a thread. You should already be familiar with this rule, since it applies equally to single-threaded programs and multi-threaded programs. The sequenced-before relation corresponds roughly to the program order: each statement in the program is sequenced-before the next statement. However, within an expression, the evaluation of arguments and subexpressions may be unsequenced. As a result, there may be no happens-before relation between actions that are part of the same full expression.

  • For any given mutex, all lock actions of that mutex happen in some total order.

  • If thread1 and thread2 both lock the same mutex, and thread1 locks the mutex before thread2, then thread1 surely unlocks the mutex before thread2 locks it. In this case, the unlock() in thread1 happens-before the lock() in thread2.

    Because happens-before is transitive, this implies that all actions performed by thread1 before unlocking the mutex happen-before any actions performed by thread2 after locking the mutex. (A sketch after this list of rules illustrates this mutex hand-off.)

  • If thread1 performs an atomic access with release semantics, and thread2 performs an atomic access with acquire semantics on the same memory location, and the acquire access observes the effect of the release access, then the release access happens-before the acquire access.

    Because happens-before is transitive, this implies that all actions in thread1 before the release access happen-before any actions in thread2 that follow the acquire access.

  • All atomic operations on variables that are only accessed with memory_order_seq_cst happen in some total order. If A and B are two such memory_order_seq_cst operations, either A happens-before B or B happens-before A.

    This rule does not apply to variables that are sometimes accessed with memory_order_seq_cst and sometimes with weaker memory orders. Please don't do that, it makes everything horribly complicated.

  • Any load access L reads a value that was written to the same memory location by some store access S which is visible to L. Store S is visible to load L if:

    • S and L both access the same memory location
    • and L does not happen-before S
    • and there is no intervening store S2 to the same memory location, such that S happens-before S2 and S2 happens-before L.

    A load may have more than one visible store. In this case the system may choose any visible store and return that value.

    (By load and store here we mean any type of access, including atomic, non-atomic and read-modify-write.)
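
The following sketch (mine, not a numbered example) illustrates the mutex rule from the list above. If thread2 locks the mutex after thread1 unlocks it, the unlock happens-before the lock, and transitivity orders the surrounding actions:

std::mutex mut;
int shared = 0;

void thread1() {
    mut.lock();
    shared = 42;     // (1)
    mut.unlock();    // (2)
}

void thread2() {
    mut.lock();      // (3) if (3) comes after (2) in the total lock order,
                     //     then (2) happens-before (3), and by transitivity
                     //     (1) happens-before the read below
    int v = shared;  // reads 42 in that case; either way there is no data
    mut.unlock();    // race, because the accesses to "shared" are ordered
    (void)v;         // by the mutex
}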

If a program can encounter a data race in any execution that follows the rules above, the program has undefined behaviour. Recall that a data race is a conflicting, concurrent, non-atomic access of the same memory location by two threads. In this context, two accesses are concurrent if the rules above do not guarantee that one access happens-before the other.

Please keep in mind that the model I described here is incomplete: some correct C++ programs would seem to be incorrect according to this model. Unfortunately this is the best I can do. If you want a model which is correct and complete, you can find that in the C++ standard and in several papers referenced below. However that model is so wildly complicated that I don't know how to use it.

Let's look at some examples. Hopefully that will clarify how these rules work.

Example 9 is an extremely simplified lock-free queue:

/* example 9 */

// This is an extremely simplified lock-free queue.
//
// We assume that only one thread makes calls to put(),
// and only one (other) thread makes calls to get().
//
// Let's also assume that put is called at most 1000 times.

int data[1000];
std::atomic<unsigned int> head{0};
unsigned int tail = 0;

void put(int v) {
    unsigned int h = head.load(std::memory_order_relaxed);
    data[h] = v;
    head.store(h + 1, std::memory_order_release);

    // The store to "data[h]" Happens-Before the store-release "head := h+1".
    // (because of program order)
}

bool get(int& v) {
    unsigned int h = head.load(std::memory_order_acquire);

    // If h > 0, the load-acquire just observed the effect of a store-release "head := h".
    // In that case, the store "head := h" Happens-Before this point.

    if (tail < h) {
        // The store "head := h" Happens-Before this point.
        // Therefore the store "head := tail+1" Happens-Before this point.
        // Therefore the store to "data[tail]" Happens-Before this point.
        // Therefore we can safely access "data[tail]" here.
        // There will be no data race, and we will obtain the value stored by "put()".

        v = data[tail];
        tail += 1;
        return true;
    }
    return false;
}

Example 10 is an incorrect implementation of the lock-free queue:

/* example 10 (BAD C++) */

// This is an incorrect variant of example 9 above.
// The only difference is that we replaced acquire/release by std::memory_order_relaxed.

int data[1000];
std::atomic<unsigned int> head{0};
unsigned int tail = 0;

void put(int v) {
    unsigned int h = head.load(std::memory_order_relaxed);
    data[h] = v;
    head.store(h + 1, std::memory_order_relaxed);
}

bool get(int& v) {
    unsigned int h = head.load(std::memory_order_relaxed);

    // Although the load just observed the effect of a store "head := h",
    // we can conclude nothing from this because memory_order_relaxed does not
    // provide synchronization between threads.
    //
    // You may think "surely a store must happen before the load that observes its effect",
    // but you would be wrong. That is sequentially consistent thinking.
    // There is no rule that produces a Happens-Before relation in this situation.

    if (tail < h) {
        // We have no guarantee that the store to "data[tail]" Happens-Before this point.
        // Therefore we have two conflicting, non-atomic accesses to "data[tail]" that
        // are not ordered by Happens-Before. This is a data race.
        v = data[tail];  // BAD: data race
        tail += 1;
        return true;
    }
    return false;
}

Example 11 shows that even when two actions are ordered by happens-before, another thread may still observe their effects in the opposite order:

/* example 11 */

std::atomic<int> a{0}, b{0};

void thread1() {
    a.store(1, std::memory_order_relaxed);
    b.store(1, std::memory_order_relaxed);
    // "a.store" Happens-Before "b.store" due to program order.
}

void thread2() {
    int b2 = b.load(std::memory_order_relaxed);  // this may return 1
    int a2 = a.load(std::memory_order_relaxed);  // and then this may return 0
    // "b.load" Happens-Before "a.load" due to program order.

    // The two stores are ordered by Happens-Before and so are the two loads.
    // However, thread2 may observe the effects of the stores in reverse order.
    // To determine which value a load may observe, we would need Happens-Before
    // ordering between the store and the load. But in this case there is no rule
    // that produces such ordering, so all bets are off.

    assert( !(b2 == 1 && a2 == 0) );  // this can fail
}

Example 12 shows a simple case of atomic counting without any synchronization. Even though the threads are concurrently accessing the same variable, the basic property of indivisible atomic access is still guaranteed.

/* example 12 */

std::atomic<int> a{0};

void thread1() {
    for (int i = 0; i < 1000; i++) {
        a.fetch_add(1, std::memory_order_relaxed);
    }
}

void thread2() {
    for (int i = 0; i < 1000; i++) {
        a.fetch_add(1, std::memory_order_relaxed);
    }
}

// After both threads have completed running, the final value of "a" is 2000.
// Atomic operations are indivisible and can not be disturbed by other threads.
// No data race occurs because both threads use atomic access.

Further reading

Below is a list of papers and links that I found particularly helpful. The papers are listed in order of publication date. That is probably also the best reading order, since many of these papers use concepts that were introduced in earlier papers. However if you have time to read only one of these papers, I recommend "Foundations of the C++ Concurrency Memory Model" by Boehm and Adve.

If you have time to read a book, I recommend "C++ Concurrency In Action" by Anthony Williams.

I wrote a few remarks to describe what I find most interesting about each paper. These remarks should not be taken as representative summaries or fair judgements of the papers.

  1. L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System." Comm. ACM 21(7), 1978. (ACM) (pdf)
    Introduces the happens-before relation as a method to define a consistent notion of time in distributed systems. The topic of this paper is in fact not directly relevant to multi-threading. However, the happens-before relation plays a big role in later papers.

  2. L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs." IEEE Trans. Comp. C-28(9), 1979. (IEEE) (pdf)
    Introduces the concept of sequential consistency and lists technical requirements for a sequentially consistent multiprocessor architecture. It is apparent from this paper that sequential consistency was, at that time, the standard assumption for parallel programming.

  3. M. Dubois, C. Scheurich, F. Briggs, "Memory Access Buffering in Multiprocessors." ACM SIGARCH Comp. Arch. News 14(2), 1986. (ACM) (pdf)
    Introduces weak consistency, a programming model that allows reordering of normal memory access in ways that violate sequential consistency. The model still preserves sequential ordering with respect to specially marked synchronization variables. The paper shows that weak consistency allows important hardware optimizations that are not allowed by sequential consistency.

  4. S. V. Adve, M. D. Hill, "Weak ordering - a new definition." ACM SIGARCH Comp. Arch. News 18(2SI), 1990. (ACM) (pdf)
    Introduces data-race-free-0, a programming model which promises sequentially consistent executions for programs that do not contain data races. This model is similar to weak consistency, but redefined as a contract between software and hardware.

  5. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors." ACM SIGARCH Comp. Arch. News 18(2SI), 1990. (ACM) (pdf)
    Introduces release consistency, a programming model where the programmer marks certain memory accesses as acquire or release access. Release consistency is similar to weak consistency, but can provide weaker guarantees to allow additional optimizations.

  6. H.-J. Boehm, S. V. Adve, "Foundations of the C++ Concurrency Memory Model." ACM SIGPLAN Notices 43(6), 2008. (ACM) (pdf)
    Description of the C++ memory model and the reasons for choosing this particular model. The paper is quite technical, but still a lot easier to read than the C++ language specification.

  7. H.-J. Boehm, "A Less Formal Explanation of the Proposed C++ Concurrency Memory Model." C++ Std. Committee WG21/N2480, 2007. (N2480)

  8. "std::memory_order" on cppreference.com, 2021. (link)

  9. H. Sutter, "Use Critical Sections to Eliminate Races." Dr. Dobb's Journal 32(10), 2007. (link)
    Part of a series of columns called Effective Concurrency in Dr. Dobb's Journal. These are well written articles, full of practical examples of the challenges of concurrent programming in C++.

  10. H. Sutter, "Lock-Free Code: A False Sense of Security", Dr. Dobb's Journal 33(9), 2008. (link)

  11. A. Williams, C++ Concurrency In Action, 2nd ed, 2019. (book)
    This book starts with an entry-level introduction to multi-threading and a thorough explanation of the C++ features for concurrent programming. It then moves on to a detailed description of the C++ memory model, an explanation of lock-free data structures and other advanced topics. In addition to explaining the concepts, the book also shows how to properly design concurrent code.

Copyright

Written in 2022 by Joris van Rantwijk. This work is licensed under CC BY-ND 4.0.