Reading and writing, part 1: locations and locks

This article is part of a five-part series on race conditions.

It is literally impossible to write correct synchronized code using only data accesses.

The Rustonomicon

One of the benefits of learning Rust is that it has made me acutely aware of how any program I work on – in any language – uses and modifies data. The Rust compiler and the APIs in its standard library prevent you from accidentally writing code containing data races. A data race is any situation in which a set of read/write operations happen concurrently, their ordering affects the final state of some data, and that ordering is not adequately controlled. When this happens, events in your program can happen in an unpredictable order and leave your data in an invalid or inconsistent state. I like this pithy definition:

A data race is any unsynchronized, concurrent access to data involving a write.

Fearless concurrency with Rust

Another benefit I’ve derived from learning Rust is that its use of static types gives me a vocabulary for recognising and naming structures that have the potential to be unsafe, to behave in unpredictable or undesirable ways. And one thing that keeps occurring to me while I’m programming now is that any instance of the following type is potentially unsafe:

Location<T> = {
    read() -> T;
    write(T);
}

The names of this type and its operations don’t matter, but what this type represents is any mutable location – any place where a value (the type T) can be stored that allows one to get the current value (read() -> T) and replace it with a new value (write(T)), and where the above type fully describes the operations available. In particular, read() returning T means that it always returns a value, and write(T) returning nothing implies that it too always succeeds – if it could fail, it would need to return a result indicating the outcome. In this series of articles I am going to explore what this means: where this type comes up in practice, how to recognise it, and how we can avoid creating it.

Before we go any further: this series is not really about Rust, and it’s not only about statically typed languages. Even if a language does not have a type system, we can still use types to characterise things that happen within it. And the above type is pervasive throughout imperative programming. I am not here to tell you to ditch your current language and take up Rust, or a functional language like Clojure or Haskell. Instead I want us to recognise when the Location type appears in any program we write, and decide whether it’s a problem and what to do about it.

In order to demonstrate the problem, let’s start with a very small program. This JavaScript snippet creates a variable named counter and defines a function inc() that increments the counter. The main() function calls inc() a thousand times and prints the value of counter afterward.

let counter = 0;

function inc() {
  let tmp = counter;
  counter = tmp + 1;
}

function main() {
  for (let i = 0; i < 1000; i++) {
    inc();
  }
  console.log("counter:", counter);
}

main();

Here, counter is an instance of the Location type, Location<Number> to be precise. It is a free variable declared outside the inc() function, which inc() is able to modify by getting its current value (by referring to the name counter), and replacing it with another (by assigning counter = expression). Getting the current value corresponds to the read() -> T operation, and replacing it corresponds to write(T). The variable being external to the inc() function will prove important and we will see further examples of this later.

If we run this program, nothing surprising happens. Calling inc() a thousand times causes counter to be incremented up to the value 1000, which is then printed. To see how this type becomes a problem, let’s make a small change to the definition of inc():

async function inc() {
  let tmp = await counter;
  counter = tmp + 1;
}

Running this program now prints 0 – it seems that counter was not changed at all. The reason is down to what happens in this loop:

  for (let i = 0; i < 1000; i++) {
    inc();
  }

Each iteration of the loop invokes inc(), which is now an async function. It begins executing, and the first thing it does is to read the value of counter. Having done that, execution reaches the await keyword, which causes the function to pause; await is always asynchronous and using it means the function yields and allows other code to run. So when inc() returns control to its caller, it has not actually finished executing, it is paused on the await counter expression.

This means the loop starts a thousand instances of the inc() function, and all of them read the value 0 out of counter, and then return control to main(). These inc() calls will resume on the next tick of the event loop, but by then main() has completed its for-loop, printed the value 0, and the program has exited.

Now we need to define a bit of terminology. What do we mean when we say inc() returns but has not finished executing? Well, the async function above is a built-in JavaScript shorthand for this Promise-based implementation:

function inc() {
  return Promise.resolve(counter).then((tmp) => {
    counter = tmp + 1;
  });
}

inc() immediately returns a Promise, which encapsulates some asynchronous work and lets us get a value out once the work is complete. Even though in the control-flow sense, inc() has returned control to its caller, its logical workload is still executing. I will use the word task to refer to a logical unit of work encapsulated in a function, class, or other building block. A task may continue after the function that defines it returns, and the return value represents the task’s continued execution. In JavaScript, the main abstraction for this is the Promise type and the async/await shorthand for it.

(Historically, JavaScript has also used callbacks for handling async tasks, but any function that takes a callback is semantically equivalent to one that returns a Promise, with the callback being passed to Promise.then(), so we won’t deal with callbacks specifically here. Also, the word “task” does have a technical meaning in the internals of the JavaScript event loop, but it’s less overloaded than terms like “process”, “thread”, or “job”, so I’m sticking with it.)
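To make that equivalence concrete, here’s a minimal sketch; the taskWithCallback and taskWithPromise names are made up for illustration:

// A hypothetical function that takes a callback...
function taskWithCallback(callback) {
  setTimeout(() => callback(42), 10);
}

// ...is semantically equivalent to one that returns a Promise:
// the callback becomes the function we would otherwise pass to then().
function taskWithPromise() {
  return new Promise((resolve) => taskWithCallback(resolve));
}

taskWithPromise().then((value) => console.log(value)); // prints 42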

We can now see that our program prints 0 because we never wait for the inc() tasks to complete, which we could do by waiting on their returned promises. Let’s fix that:

async function main() {
  let tasks = [];

  for (let i = 0; i < 1000; i++) {
    tasks.push(inc());
  }
  await Promise.all(tasks);
  console.log("counter:", counter);
}

We’re still launching a thousand inc() tasks, now storing them in a list and using await Promise.all() to wait for all of them to complete. However, the inc() tasks still run concurrently, and so all of them read the value 0 out of counter, and then pause until the next tick of the event loop. Because we now wait for them all to complete, they each add 1 to 0 and write 1 back into counter, so the program prints 1.

What we have here is called a lost update bug. A thousand separate tasks are concurrently trying to increment a counter, but most of these operations do not get recorded. This happens because of a combination of two things. First, counter itself is fully described by the Location type: the only operations it supports are reading and writing, these operations always succeed, and it has no other access control mechanism protecting it. Second, the inc() tasks are not co-ordinating to make sure they make correct use of the Location they’re all sharing access to. Depending on your situation, fixing a Location-related data race requires fixing at least one of these issues.

For example, we can force the tasks to execute sequentially by awaiting each inc() task before starting the next:

async function main() {
  for (let i = 0; i < 1000; i++) {
    await inc();
  }
  console.log("counter:", counter);
}

This forces the first inc() task to complete, setting counter to 1, before the second one starts. We no longer need to keep hold of the tasks list because we’re forcing each task to complete before starting the next. That’s enough to prevent losing any updates.

Another approach would be to wrap the bare counter in some structure that lets its users access it safely. For example, we could turn it into a Promise, and make inc() work by transforming one Promise into another using then():

let counter = Promise.resolve(0);

function inc() {
  counter = counter.then((value) => value + 1);
}

async function main() {
  for (let i = 0; i < 1000; i++) {
    inc();
  }
  let value = await counter;
  console.log("counter:", value);
}

main();

This is slightly more subtle and can’t be implemented by using await in inc(): it relies on immediately replacing counter with another promise rather than asynchronously extracting its value. It turns counter into a long chain, like:

Promise.resolve(0).then((v) => v + 1).then((v) => v + 1).then ...

Piping the initial value through a sequence of increments by chaining then() achieves the same effect as forcing inc() to execute sequentially. The difference is that we’ve placed the control mechanism with the data itself, rather than co-ordinating the tasks that access the data. counter no longer has the interface described by the Location type, it instead has the Promise interface:

Promise<T> = {
    then(fn(T) -> U) -> Promise<U>;
}

I’m abusing Rust notation slightly, but the intent is that Promise.then() takes a function which itself takes a T and returns some other type U, and returns Promise<U>. For example, Promise.resolve("hello").then((s) => s.length) converts a Promise<String> to a Promise<Number>. Whereas Location describes reading and writing a mutable value, Promise describes transforming one value into another via a pure function. It allows for control that the Location type does not make possible, and so lets us avoid the race condition that Location facilitates.
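To see that transformation run, here’s a minimal sketch of the example from the paragraph above:

Promise.resolve("hello")            // Promise<String>
  .then((s) => s.length)            // Promise<Number>
  .then((n) => console.log(n));     // prints 5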

Now you may be thinking, after reading this far, that these examples are absurd, or at least contrived: nobody would ever write code like this, it’s obviously wrong, and so on. The reason I started with this example is that it establishes a few concepts and also demonstrates how little is needed in order to get a data race into a program. In particular, you do not need threads, and you do not need complex data structures; an asynchronously accessed number is sufficient. These examples may look simple, but many systems do things equivalent to this all the time, it’s just much more indirect. Maybe the counter lives in a database or remote service, maybe the tasks fighting over mutating it are in distant parts of the system barely aware of one another. You probably wouldn’t write the code above, but you have probably written something equivalent to it, just with thousands of lines of code so you can’t see that it’s happening.

One common myth is that race conditions only happen due to threads or operating system processes. JavaScript is single-threaded, but it still supports concurrency: it contains a mechanism where an async task can pause, allowing arbitrary other code to run in the meantime. If that task makes decisions on the basis of things it read before pausing, then those decisions, and the operations they lead to, may now be incorrect. JavaScript is actually rather good at deterministically demonstrating concurrency problems like this.

However, the absence of threads does let us make certain assumptions. We know that a function like this will never be interrupted part-way through its execution:

function inc() {
  let tmp = counter;
  counter = tmp + 1;
}

JavaScript functions run without interruption until something asynchronous happens, which is usually signalled by a function taking a callback, or returning a promise requiring the use of then() or await. JavaScript concurrency is cooperative, meaning functions have to explicitly yield to let other code run; they don’t pause at random points. So if you see two synchronous statements one after the other, you know that no other code will be executed between those statements.

This leads us to the idea that the read() -> T and write(T) operations can be used safely, if we can guarantee no other code runs during the task that uses them. For example, if our inc() function called some other function between reading the counter value and writing it, this guarantee is lost:

function inc() {
  let tmp = counter;    // read counter
  anotherFunction();
  counter = tmp + 1;    // write counter
}

To make sure this code is still correct, we need to know what anotherFunction() does – we can’t treat it as an opaque box. If it modifies counter then it invalidates the decisions that inc() makes:

function anotherFunction() {
  --counter;
}

We might have been hoping this program would print 0 because we decrement counter whenever we increment it. But it doesn’t: it prints 1000, because inc() makes a decision about the value to write to counter before allowing another function to modify it. So this problem is not caused only by concurrent tasks executing independently; even functions on the same call stack can mess with each other’s assumptions.

The above program can be made correct by keeping the read and write to counter together and making sure no other code executes between them:

function inc() {
  let tmp = counter;    // read counter
  counter = tmp + 1;    // write counter
  anotherFunction();
}

We can now treat anotherFunction() as an opaque box and ignore what it does – it won’t affect the read and write in inc() because it cannot interrupt between them.

There’s a clear drive here to be able to control how reads and writes happen, and to think about code in small pieces without having to understand the whole system at once. Having many functions arbitrarily access the same memory locations is hard to audit, and it’s easy to cause bugs this way, particularly when there are multiple pieces of data that must be kept consistent.

These concerns are one of the motivations for object-oriented programming. The idea is that by grouping data with the functions that act on it in some way, we can control access to the data by limiting the set of allowed operations that can be called to change it.

Imagine we wanted something like JavaScript’s Map type, but it iterates its keys in sorted order. We don’t want to pay the cost of sorting the keys every time the map is iterated. We could implement this by storing a sorted list of the keys, which we update whenever the Map is modified. For this to work, it is important that at all times, the #keys list holds a sorted list of exactly the keys that appear in #map – there is a consistency requirement between these two values.

class SortedMap {
  #map
  #keys

  constructor() {
    this.#map  = new Map();
    this.#keys = [];
  }

  set(key, value) {
    this.#map.set(key, value);

    if (!this.#keys.includes(key)) {
      this.#keys.push(key);
      this.#keys = this.#keys.sort();
    }
  }

  all() {
    return this.#keys.map((key) => {
      let value = this.#map.get(key);
      return [key, value];
    });
  }
}

function main() {
  let map = new SortedMap();

  map.set("London",     9.75e6);
  map.set("Birmingham", 2.45e6);
  map.set("Manchester", 1.90e6);

  for (let [key, value] of map.all()) {
    console.log(key, value);
  }
}

main();

// prints:
// Birmingham 2450000
// London 9750000
// Manchester 1900000

In JavaScript the syntax this.#name denotes a private field – those properties cannot be accessed from outside this object’s methods. That means we can check that these fields are modified correctly and consistently by only examining the methods in this class, not the entire system. As long as every method leaves the object in a valid state, the class works correctly.

Looking at the SortedMap.set() method, all it does is synchronously access the object’s own private fields, and it doesn’t call any other methods or objects during the modifications it makes. Therefore we know no other code runs concurrently with these actions, and we can audit them in isolation. If the method did anything asynchronous, or called out to other code in the middle of its work, we’d need to exercise more caution and this might prompt us to change our design.

But even the use of objects in and of itself makes any calls to other methods safer. Inside SortedMap.set(), we call Map.set(), Array.includes(), Array.push() and Array.sort(). We know this is safe because neither Map nor Array holds a reference back to our SortedMap type, and we don’t pass a SortedMap into any of these methods. Therefore, none of these calls can affect the SortedMap in any way. It is certainly possible to create cyclic references in an object-oriented system such that object A calling into object B may trigger a call back into object A, but it’s harder to do this by accident than it is when you have no defined relationships between data and functions.
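As a hedged sketch of the kind of cycle that makes this harder (the Store and Index classes here are hypothetical, not part of the example above): Store calls into Index, and because Index holds a reference back to the Store, that call can re-enter the Store before set() has finished.

class Index {
  constructor(store) { this.store = store; }

  update(key) {
    if (key === "reset") this.store.clear();   // calls back into the store
  }
}

class Store {
  constructor() {
    this.map = new Map();
    this.index = new Index(this);   // cyclic reference: Store -> Index -> Store
  }

  set(key, value) {
    this.map.set(key, value);
    this.index.update(key);         // may re-enter the Store while set() is still on the stack
  }

  clear() {
    this.map.clear();
  }
}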

Objects also create a sense of ownership. Before, all we needed for a data race was for two functions to access the same shared variable without co-ordination. With objects and private fields, this isn’t possible – object methods mediate access to the object’s data, so other objects cannot access that data directly. They can only affect the data by going through the object’s methods, which greatly restricts what they’re able to do. So we see that a key part of the problem with the Location type is that it describes data that a task does not own – there may be many tasks accessing the location, and no control over what they can do with it.

We have removed the risk of using #map and #keys as locations by putting a protective barrier around them: the interface of the SortedMap class, which doesn’t allow these fields to be accessed directly, so they can’t be modified from outside the class. This grouping of values that need to be kept consistent, versus values that can vary independently, is a key part of object-oriented design and one of the major reasons for using that design approach in the first place, but unfortunately it’s not well-covered in contemporary literature. You can read a whole book on OOP and not see a single state mutation happen in the entire text.

Now object-orientation by itself does not solve all our problems. It’s easy to accidentally turn a property back into a Location by writing a getter for it:

class SortedMap {
  // ... as above

  keys() {
    return this.#keys;
  }
}
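To make the hazard concrete, here’s a hypothetical misuse: a caller can now mutate the internal #keys list directly, bypassing set() and breaking the invariant between #keys and #map.

let map = new SortedMap();
map.set("London", 9.75e6);

map.keys().push("Paris");      // mutates the internal #keys list directly
console.log(map.all());        // [["London", 9750000], ["Paris", undefined]]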

This leaks a reference to the internal keys list out of the object’s protective wrapper and lets anybody retrieve and mutate it – it’s now a Location again. The same thing could happen if we turned our counter into an object:

class Counter {
  #value;

  constructor() {
    this.#value = 0;
  }

  inc() {
    this.#value += 1;
  }

  get() {
    return this.#value;
  }
}

This design is safe: although Counter.get() lets us read the internal value (i.e. it implements read() -> T), we can’t mutate it because the value is a number, and JavaScript numbers are primitive immutable values. Counter.get() effectively returns a copy of the value, not a reference to a mutable structure. And although inc() lets us mutate the object, it does not allow for the internal value to be accessed directly or arbitrarily replaced – it does not implement write(T), in fact it does not take any arguments at all, so it performs a controlled mutation on data that it has exclusive access to, rather than letting the caller pass in anything it likes.
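By contrast, if the internal value were a mutable structure rather than a primitive number, a getter would leak a writable reference. Here’s a hedged sketch using a hypothetical History class:

class History {
  #events = [];

  record(event) {
    this.#events.push(event);
  }

  events() {
    return this.#events;                    // leaks the mutable internal array
  }
}

let history = new History();
history.events().push("forged entry");      // mutates internal state from outside

Counter avoids this only because JavaScript numbers are copied out by value rather than shared by reference.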

However, adding one more method makes this unsafe:

class Counter {
  // ... as above

  set(value) {
    this.#value = value;
  }
}

This design is now broken, because it lets code outside the class both read and write the #value field, effectively turning it back into a shared mutable variable that anyone can write a broken asynchronous version of inc() on top of. For this sort of reason, there is a mantra in object-oriented design called tell, don’t ask: objects should receive commands and mutate their own internal data. They should not let callers extract data, make their own decisions about it, and then change the object from outside. There are many reasons this mantra exists, but preventing concurrency bugs is a very good reason to follow it.
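As a hedged sketch of the difference, assuming the Counter class above with get() and set(): the “ask” style pulls the value out, decides, and writes its decision back from outside, which recreates the read/write race; the “tell” style keeps the decision inside the object.

// "Ask": the caller extracts the value and writes back its own decision.
// With an async gap between the read and the write, this is the same
// lost-update bug we started with.
async function incFromOutside(counter) {
  let value = await counter.get();
  counter.set(value + 1);
}

// "Tell": the caller issues a command and the object owns the decision.
function incByTelling(counter) {
  counter.inc();
}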

We have seen that objects can be highly useful in defining how a data structure can be accessed and mutated, an improvement on top of using a bare Location with uncontrolled access. In a single-threaded context, they provide a lot of reassurance about which operations can occur against a value, and in what order. But when we throw threads into the picture, things get more complicated.

Here’s the same Counter type, implemented in Ruby. The attr_reader :value line defines a method for reading the private @value property.

class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  def inc
    @value += 1
  end
end

We can use it as we have above: incrementing it a certain number of times. In Ruby, parentheses on method calls are optional and usually omitted for methods with no arguments; counter.value is a method call, not a bare property access. In Ruby you cannot get direct access to object attributes, you can only access them via the object’s methods.

counter = Counter.new
1000.times { counter.inc }

puts "counter: #{counter.value}"

As expected, this prints 1000. Let’s try the same thing, but calling Counter#inc from multiple threads. My machine has four CPU cores, so I will start four threads. I am running this example using JRuby, which supports true parallel CPU usage, as opposed to the canonical Ruby implementation which has a global interpreter lock and only allows one thread to run at a time.

counter = Counter.new

threads = (1..4).map do
  Thread.new do
    1000.times { counter.inc }
  end
end

threads.each(&:join)
puts "counter: #{counter.value}"

We’d want a correct implementation to print 4000, but this will print some random number that’s less than that. That’s because when two threads enter the inc method at the same time, they can both read the @value attribute, get the same result, and then add 1 to the result and write it back to @value. @value += 1 is just short-hand for the two-step read and write we were doing above and is not atomic.

So even though Counter is not a Location per se, we no longer have the guarantee that the code inside a single method will run to completion without interruption. This is because threads are preemptive – they can interrupt your code at any point they like and allow another thread to run. Threads have broken the encapsulation we had, but we can get it back quite easily.

The solution here is to use a mutex. This is a locking mechanism that prevents multiple threads from entering the same block of code at the same time. When a thread hits a lock call, it blocks if another thread has already locked the mutex. When the mutex is unlocked, then one of the threads blocked on it can try to acquire the lock and make progress. In Ruby this is implemented as follows:

class Counter
  attr_reader :value

  def initialize
    @value = 0
    @mutex = Mutex.new
  end

  def inc
    @mutex.lock
    @value += 1
    @mutex.unlock
  end
end

In the inc method, we call Mutex#lock to allow at most one thread to proceed past this point. We then perform our change by reading and writing @value, and call Mutex#unlock to wake up one of the blocked threads. With this small change, the program now reliably prints 4000. We had a problem with @value being used as a Location; the logical task encapsulated in the inc method is to read @value and then write it:

    tmp = @value        # read
    @value = tmp + 1    # write

We have fixed this problem by adding an access control mechanism: a mutex that prevents this task from being executed concurrently, forcing the threads to run it sequentially and therefore not lose any updates. It is critical that the mutex protects both the read and the write; this implementation is incorrect:

  def inc
    tmp = @value
    @mutex.lock
    @value = tmp + 1
    @mutex.unlock
  end

This still allows two concurrent threads to begin executing the inc method, both getting the same value for tmp and therefore both writing the same new value. When a write is determined from data we read from the same location, both the read and the write must happen while the mutex is locked:

  def inc
    @mutex.lock
    tmp = @value
    @value = tmp + 1
    @mutex.unlock
  end

This means a thread cannot read @value while some other thread is incrementing it, so it never gets a stale value that would lead it to making an incorrect decision.

We can think of this as having changed the type from Location to something more complex – before we can read or write the value, we need to lock it. In a type system, if we want to enforce that some operation g() may only be called after calling operation f(), we can do that by putting these operations into different types, and making f() return the type that implements g(), i.e.:

// Allows calls f() and g()       // Only allows g() after f()
// in any order

MyType = {                        MyType = {
    f();                              f() -> Other;
    g();                          }
}
                                  Other = {
                                      g();
                                  }

In our case, we want to allow read() -> T and write(T) only after calling lock(), giving us:

Mutex<T> = {
    lock() -> Lock<T>;
}

Lock<T> = {
    read() -> T;
    write(T);
    unlock();
}

We can think of lock() as giving us a Lock<T> type, which allows us to read and write the underlying value, and to unlock the mutex. The key thing here is that only one Lock<T> is allowed to exist at a time, and lock() blocks until the existing one is destroyed by calling unlock() on it. This structural difference in the types of the structures and operations present is what lets us say this code is now safe, compared to the Location-typed implementation which was not.

Some languages bake this concept of a mutex into their syntax. For example, here is a non-thread-safe counter written in Java:

class Counter {
    private int value = 0;

    int get() {
        return value;
    }

    void inc() {
        value += 1;
    }
}

We can use this from a single thread, and it works fine. This program will print 1000 as expected:

public class Main {
    public static void main(String[] args) {
        Counter ctr = new Counter();

        for (int i = 0; i < 1000; i++) {
            ctr.inc();
        }
        System.out.println("counter: " + ctr.get());
    }
}

However, if we try to use this counter from multiple threads, we’ll get a nonsense result. This is equivalent to the Ruby we saw above, which ran four threads, each incrementing the counter a thousand times.

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        Counter ctr = new Counter();
        List<Thread> threads = new ArrayList<Thread>();

        for (int t = 0; t < 4; t++) {
            Thread th = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    ctr.inc();
                }
            });
            th.start();
            threads.add(th);
        }

        for (Thread th: threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                // ignore
            }
        }
        System.out.println("counter: " + ctr.get());
    }
}

All that’s needed to make the counter thread-safe is to put synchronized before the inc() method definition:

class Counter {
    // ... as above

    synchronized void inc() {
        value += 1;
    }
}

This does the same thing that we saw in Ruby: it gives the object a mutex, and uses the mutex to protect all methods marked synchronized; only one thread at a time is allowed to execute any of the object’s synchronized methods. If get() returned a reference to an internal structure, rather than a copy of an int, you would also want to mark it synchronized or else you would risk a thread being able to access that structure in an intermediate state, while another thread was executing a synchronized method.

The nice thing about making mutexes explicit, rather than implicit as Java does, is that it makes it easier to put them in different places. For example, we could have fixed the Ruby version not by using Mutex inside Counter, but by using it alongside Counter, in the threads that are interacting with it.

counter = Counter.new
mutex   = Mutex.new

threads = (1..4).map do
  Thread.new do
    1000.times do
      mutex.synchronize { counter.inc }
    end
  end
end

Adding mutex synchronisation to a function adds some performance overhead and you might not want to pay that cost all the time. If you know a bunch of the code using Counter is single-threaded, then it makes sense to have no mutex inside that class, and only use a mutex when it’s needed. Or, we could invent a wrapper around Counter whose only purpose was to wrap any calls to it with mutex synchronisation. If you have many disparate parts of a program accessing the same data, it may be easier to wrap the data with a control mechanism than to make those parts of your program co-ordinate their use of the data. For example, you could use a mutex-like construction to force async JavaScript tasks to execute sequentially, as sketched below.
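Here is a hedged sketch of what that might look like; this Mutex class is illustrative, not a built-in API. It loosely mirrors the Mutex<T> shape from earlier: you must await lock() before touching the data, and call the returned unlock function when you’re done.

class Mutex {
  #last = Promise.resolve();

  // Resolves with an unlock function once every earlier holder has
  // called its own unlock, so only one holder is active at a time.
  lock() {
    let unlock;
    let willUnlock = new Promise((resolve) => { unlock = resolve; });
    let willLock = this.#last.then(() => unlock);
    this.#last = willUnlock;
    return willLock;
  }
}

let counter = 0;
let mutex = new Mutex();

async function inc() {
  let unlock = await mutex.lock();
  let tmp = await counter;    // read counter
  counter = tmp + 1;          // write counter
  unlock();
}

With this in place, the version of main() that launches a thousand concurrent inc() tasks and awaits Promise.all() prints 1000 again, because the read and write inside each task can no longer interleave with another task’s. (One caveat of this shape: a task that forgets to call unlock() blocks every later lock() forever.)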

So far, we’ve established some basic concepts. We’ve met the Location type, which characterises an object that allows uncontrolled reads and writes. We’ve seen how that pattern can lead to data races, in threaded code, single-threaded asynchronous code, or simply by calling out to other functions. And, we’ve seen a common method of performing safe concurrent mutation, using mutexes, and how that changes the type of the interaction to require a successful lock() call before performing any data access. In the next part of this series, we’ll look at how these patterns appear when accessing data outside our program, in files and databases.