This article is part of a five-part series on race conditions. The complete series is:
- Part 1: locations and locks
- Part 2: files and databases
- Part 3: web applications
- Part 4: user workflows
- Part 5: concurrency and language design
–
It is literally impossible to write correct synchronized code using only data accesses.
One of the benefits of learning Rust is that it has made me acutely aware of how any program I work on – in any language – uses and modifies data. The Rust compiler and the APIs in its standard library prevent you from accidentally writing code containing data races. A data race is any situation in which a set of read/write operations happen concurrently, their ordering affects the final state of some data, and that ordering is not adequately controlled. When this happens, events in your program can happen in an unpredictable order and leave your data in an invalid or inconsistent state. I like this pithy definition:
A data race is any unsynchronized, concurrent access to data involving a write.
Another benefit I’ve derived from learning Rust is that its use of static types gives me a vocabulary for recognising and naming structures that have the potential to be unsafe, to behave in unpredictable or undesirable ways. And one thing that keeps occurring to me while I’m programming now, is that any instance of the following type is potentially unsafe:
Location<T> = {
  read() -> T;
  write(T);
}
The names of this type and its operations don’t matter, but what this type
represents is any mutable location – any place where a value (the type T) can be
stored that allows one to get the current value (read() -> T) and replace it
with a new value (write(T)), and where the above type fully describes the
operations available. In particular, read() returning T means that it always
returns a value, and write(T) returning nothing implies it always does the same
thing, i.e. it always succeeds – if it could fail, it would need to return a
result indicating the outcome. In this series of articles I am going to explore
what this means: where this type comes up in practice, how to recognise it, and
how we can avoid creating it.
Before we go any further: this series is not really about Rust, and it’s not
only about statically typed languages. Even if a language does not have a type
system, we can still use types to characterise things that happen within it. And
the above type is pervasive throughout imperative programming. I am not here to
tell you to ditch your current language and take up Rust, or a functional
language like Clojure or Haskell. Instead I want us to recognise when the
Location type appears in any program we write, and decide whether it’s a problem
and what to do about it.
In order to demonstrate the problem, let’s start with a very small program. This
JavaScript snippet creates a variable named counter and defines a function
inc() that increments the counter. The main() function calls inc() a thousand
times and prints the value of counter afterward.
let counter = 0;

function inc() {
  let tmp = counter;
  counter = tmp + 1;
}

function main() {
  for (let i = 0; i < 1000; i++) {
    inc();
  }
  console.log("counter:", counter);
}

main();
Here, counter is an instance of the Location type, Location<Number> to be
precise. It is a free variable declared outside the inc() function, which inc()
is able to modify by getting its current value (by referring to the name
counter), and replacing it with another (by assigning to counter = expression).
Getting the current value corresponds to the read() -> T operation, and
replacing it corresponds to write(T). The variable being external to the inc()
function will prove important and we will see further examples of this later.
If we run this program, nothing surprising happens. Calling inc() a thousand
times causes counter to be incremented up to the value 1000, which is then
printed. To see how this type becomes a problem, let’s make a small change to
the definition of inc():
async function inc() {
  let tmp = await counter;
  counter = tmp + 1;
}
Running this program now prints 0 – it seems that counter was not changed at
all. The reason is down to what happens in this loop:
for (let i = 0; i < 1000; i++) {
  inc();
}
Each iteration of the loop invokes inc(), which is now an async function. It
begins executing, and the first thing it does is to read the value of counter.
Having done that, execution reaches the await keyword, which causes the function
to pause; await is always asynchronous and using it means the function yields
and allows other code to run. So when inc() returns control to its caller, it
has not actually finished executing, it is paused on the await counter
expression.
This means the loop starts a thousand instances of the inc() function, and all
of them read the value 0 out of counter, and then return control to main().
These inc() calls will resume on the next tick of the event loop, but by then
main() has completed its for-loop, printed the value 0, and the program has
exited.
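We can observe this directly by shrinking the loop and recording what each task reads. This is an instrumented sketch; the reads array is added here purely for illustration:

```javascript
let counter = 0;
let reads = [];

async function inc() {
  let tmp = await counter;
  reads.push(tmp); // record the value this task saw
  counter = tmp + 1;
}

function main() {
  for (let i = 0; i < 5; i++) {
    inc();
  }
  // All five tasks are now paused at the await; none has written yet:
  console.log("counter:", counter); // prints 0
}

main();
```

When the paused tasks later resume, every one of them has read 0, so each writes 1 and all the other updates are lost.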
Now we need to define a bit of terminology. What do we mean when we say inc()
returns but has not finished executing? Well, the async function above is a
built-in JavaScript shorthand for this Promise-based implementation:
function inc() {
  return Promise.resolve(counter).then((tmp) => {
    counter = tmp + 1;
  });
}
inc() immediately returns a Promise, which encapsulates some asynchronous work
and lets us get a value out once the work is complete. Even though in the
control-flow sense, inc() has returned control to its caller, its logical
workload is still executing. I will use the word task to refer to a logical unit
of work encapsulated in a function, class, or other building block. A task may
continue after the function that defines it returns, and the return value
represents the task’s continued execution. In JavaScript, the main abstraction
for this is the Promise type and the async/await shorthand for it.
(Historically, JavaScript has also used callbacks for handling async tasks, but
any function that takes a callback is semantically equivalent to one that
returns a Promise, with the callback being passed to Promise.then(), so we won’t
deal with callbacks specifically here. Also, the word “task” does have a
technical meaning in the internals of the JavaScript event loop, but it’s less
overloaded than terms like “process”, “thread”, or “job”, so I’m sticking with
it.)
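To illustrate that equivalence, here is a sketch with hypothetical names: a callback-taking version of inc(), and a wrapper that recovers the Promise interface by passing the promise’s resolve function as the callback.

```javascript
let counter = 0;

// Callback style: the continuation is passed in directly.
function incWithCallback(callback) {
  Promise.resolve(counter).then((tmp) => {
    counter = tmp + 1;
    callback();
  });
}

// Promise style: wrapping the callback version recovers a then()-able task.
function inc() {
  return new Promise((resolve) => incWithCallback(resolve));
}
```

Either form encapsulates the same task; only the mechanism for signalling completion differs.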
We can now see that our program prints 0 because we’re not waiting for all the
inc() tasks to complete, by waiting on their returned promises. Let’s fix that:
async function main() {
  let tasks = [];
  for (let i = 0; i < 1000; i++) {
    tasks.push(inc());
  }
  await Promise.all(tasks);
  console.log("counter:", counter);
}
Now we’re still launching a thousand inc() tasks and storing them in a list
before using await Promise.all() to wait for all of them to complete. However,
all the inc() tasks run concurrently, and so all of them read the value 0 out of
counter, and then pause until the next tick of the event loop. Now that we’re
waiting for them all to complete, they all add 1 to 0 and write 1 back into
counter, so the program prints 1.
What we have here is called a lost update bug. A thousand separate tasks are
concurrently trying to increment a counter, but most of these operations do not
get recorded. This happens because of a combination of two things. First,
counter itself is fully described by the Location type: the only operations it
supports are reading and writing, these operations always succeed, and it has no
other access control mechanism protecting it. Second, the inc() tasks are not
co-ordinating to make sure they make correct use of the Location they’re all
sharing access to. Depending on your situation, fixing a Location-related data
race requires fixing at least one of these issues.
For example, we can force the tasks to execute sequentially by awaiting each
inc() task before starting the next:
async function main() {
  for (let i = 0; i < 1000; i++) {
    await inc();
  }
  console.log("counter:", counter);
}
This forces the first inc() task to complete, setting counter to 1, before the
second one starts. We no longer need to keep hold of the tasks list because
we’re forcing each task to complete before starting the next. That’s enough to
prevent losing any updates.
Another approach would be to wrap the bare counter in some structure that lets
its users access it safely. For example, we could turn it into a Promise, and
make inc() work by transforming one Promise into another using its then()
method:
let counter = Promise.resolve(0);

function inc() {
  counter = counter.then((value) => value + 1);
}

async function main() {
  for (let i = 0; i < 1000; i++) {
    inc();
  }
  let value = await counter;
  console.log("counter:", value);
}

main();
This is slightly more subtle and can’t be implemented by using await in inc():
it relies on immediately replacing counter with another promise rather than
asynchronously extracting its value. It turns counter into a long chain, like:
Promise.resolve(0).then((v) => v + 1).then((v) => v + 1).then ...
Piping the initial value through a sequence of increments by chaining then()
achieves the same effect as forcing inc() to execute sequentially. The
difference is that we’ve placed the control mechanism with the data itself,
rather than co-ordinating the tasks that access the data. counter no longer has
the interface described by the Location type, it instead has the Promise
interface:
Promise<T> = {
  then(fn(T) -> U) -> Promise<U>;
}
I’m abusing Rust notation slightly here, but the intent is that Promise.then()
takes a function which itself takes a T and returns some other type U, and
returns Promise<U>. For example, Promise.resolve("hello").then((s) => s.length)
converts a Promise<String> to a Promise<Number>. Whereas Location describes
reading and writing a mutable value, Promise describes transforming one value
into another via a pure function. It allows for control that the Location type
does not make possible, and so lets us avoid the race condition that Location
facilitates.
Now you may be thinking, after reading this far, that these examples are absurd, or at least contrived: nobody would ever write code like this, it’s obviously wrong, and so on. The reason I started with this example is that it establishes a few concepts and also demonstrates how little is needed in order to get a data race into a program. In particular, you do not need threads, and you do not need complex data structures; an asynchronously accessed number is sufficient. These examples may look simple, but many systems do things equivalent to this all the time, it’s just much more indirect. Maybe the counter lives in a database or remote service, maybe the tasks fighting over mutating it are in distant parts of the system barely aware of one another. You probably wouldn’t write the code above, but you have probably written something equivalent to it, just with thousands of lines of code so you can’t see that it’s happening.
One common myth is that race conditions only happen due to threads or operating system processes. JavaScript is single-threaded, but it still supports concurrency: it contains a mechanism where an async task can pause, allowing arbitrary other code to run in the meantime. If that task makes decisions on the basis of things it read before pausing, then those decisions, and the operations they lead to, may now be incorrect. JavaScript is actually rather good at deterministically demonstrating concurrency problems like this.
However, the absence of threads does make it possible to make certain assumptions. We know that a function like this will never be interrupted part-way through its execution:
function inc() {
  let tmp = counter;
  counter = tmp + 1;
}
JavaScript functions always run to completion until something asynchronous
happens, which is usually signalled by a function taking a callback, or
returning a promise requiring the use of then() or await. JavaScript concurrency
is cooperative, meaning functions have to explicitly yield to let other code
run, they don’t pause at random points. So you know that if you see two
synchronous statements one after the other, then no other code will be executed
between those statements.
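A small sketch makes this guarantee visible: even a callback scheduled before some synchronous statements run cannot execute until the current synchronous code has run to completion.

```javascript
let log = [];

// Schedule an async callback *before* the synchronous statements run:
Promise.resolve().then(() => log.push("async"));

log.push("sync-1");
log.push("sync-2"); // nothing can run between these two statements

// At this point log is ["sync-1", "sync-2"]: the queued callback is still
// waiting for the current synchronous execution to finish.
```

The "async" entry only appears in the log once the current run-to-completion unit has ended and the event loop processes the queued callback.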
This leads us to the idea that the read() -> T and write(T) operations can be
used safely, if we can guarantee no other code runs during the task that uses
them. For example, if our inc() function called some other function between
reading the counter value and writing it, this guarantee is lost:
function inc() {
  let tmp = counter; // read counter
  anotherFunction();
  counter = tmp + 1; // write counter
}
To make sure this code is still correct, we need to know what anotherFunction()
does – we can’t treat it as an opaque box. If it modifies counter then it
invalidates the decisions that inc() makes:
function anotherFunction() {
  --counter;
}
We might have been hoping this program would print 0 because we decrement
counter whenever we increment it. But it doesn’t: it prints 1000, because inc()
makes a decision about the value to write to counter, before allowing another
function to modify it. So, it is not just concurrent tasks executing
independently that causes this problem, it can even be functions on the same
call stack messing with each others’ assumptions.
The above program can be made correct by keeping the read and write to counter
together and making sure no other code executes between them:
function inc() {
  let tmp = counter; // read counter
  counter = tmp + 1; // write counter
  anotherFunction();
}
We can now treat anotherFunction() as an opaque box and ignore what it does – it
won’t affect the read and write in inc() because it cannot interrupt between
them.
There’s a clear drive here to be able to control how reads and writes happen, and to think about code in small pieces without having to understand the whole system at once. Having many functions arbitrarily access the same memory locations is hard to audit, and it’s easy to cause bugs this way, particularly when there are multiple pieces of data that must be kept consistent.
These concerns are one of the motivations for object-oriented programming. The idea is that by grouping data with the functions that act on it in some way, we can control access to the data by limiting the set of allowed operations that can be called to change it.
Imagine we wanted something like JavaScript’s Map type, but one that iterates
its keys in sorted order. We don’t want to pay the cost of sorting the keys
every time the map is iterated. We could implement this by storing a sorted list
of the keys, which we update whenever the Map is modified. For this to work, it
is important that at all times, the #keys list holds a sorted list of exactly
the keys that appear in #map – there is a consistency requirement between these
two values.
class SortedMap {
  #map
  #keys

  constructor() {
    this.#map = new Map();
    this.#keys = [];
  }

  set(key, value) {
    this.#map.set(key, value);
    if (!this.#keys.includes(key)) {
      this.#keys.push(key);
      this.#keys = this.#keys.sort();
    }
  }

  all() {
    return this.#keys.map((key) => {
      let value = this.#map.get(key);
      return [key, value];
    });
  }
}
function main() {
  let map = new SortedMap();

  map.set("London", 9.75e6);
  map.set("Birmingham", 2.45e6);
  map.set("Manchester", 1.90e6);

  for (let [key, value] of map.all()) {
    console.log(key, value);
  }
}

main();

// prints:
// Birmingham 2450000
// London 9750000
// Manchester 1900000
In JavaScript the syntax this.#name denotes a private field – those properties
cannot be accessed from outside this object’s methods. That means we can check
that these fields are modified correctly and consistently by only examining the
methods in this class, not the entire system. As long as every method leaves the
object in a valid state, the class works correctly.
Looking at the SortedMap.set() method, all it does is synchronously access the
object’s own private fields, and it doesn’t call any other methods or objects
during the modifications it makes. Therefore we know no other code runs
concurrently with these actions, and we can audit them in isolation. If the
method did anything asynchronous, or called out to other code in the middle of
its work, we’d need to exercise more caution and this might prompt us to change
our design.
But even the use of objects in and of itself makes any calls to other methods
safer. Inside SortedMap.set(), we call Map.set(), Array.includes(), Array.push()
and Array.sort(). We know this is safe because neither Map nor Array holds a
reference back to our SortedMap type, and we don’t pass a SortedMap into any of
these methods. Therefore, none of these calls can affect the SortedMap in any
way. It is certainly possible to create cyclic references in an object-oriented
system such that object A calling into object B may trigger a call back into
object A, but it’s harder to do this by accident than it is when you have no
defined relationships between data and functions.
Objects also create a sense of ownership. Before, all we needed for a data race
was for two functions to access the same shared variable without co-ordination.
With objects and private fields, this isn’t possible – object methods mediate
access to the object’s data, so other objects cannot access that data directly.
They can only affect the data by going through the object’s methods, which
greatly restricts what they’re able to do. So we see that a key part of the
problem with the Location type is that it describes data that a task does not
own – there may be many tasks accessing the location, and no control over what
they can do with it.
We have removed the risk of using #map and #keys as locations by putting a
protective barrier around them: the interface of the SortedMap class, which
doesn’t allow these fields to be accessed directly, so they can’t be modified
from outside the class. This grouping of values that need to be kept consistent,
versus values that can vary independently, is a key part of object-oriented
design and one of the major reasons for using that design approach in the first
place, but unfortunately it’s not well-covered in contemporary literature. You
can read a whole book on OOP and not see a single state mutation happen in the
entire text.
Now object-orientation by itself does not solve all our problems. It’s easy to
accidentally turn a property back into a Location by writing a getter for it:
class SortedMap {
  // ... as above

  keys() {
    return this.#keys;
  }
}
This leaks a reference to the internal keys list out of the object’s protective
wrapper and lets anybody retrieve and mutate it – it’s now a Location again.
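To make the leak concrete, here is a minimal sketch using a stripped-down stand-in for SortedMap, keeping only the parts needed to show the problem:

```javascript
class SortedMap {
  #keys = [];

  set(key) {
    if (!this.#keys.includes(key)) {
      this.#keys.push(key);
      this.#keys.sort();
    }
  }

  keys() {
    return this.#keys; // leaks a reference to the private array
  }
}

let map = new SortedMap();
map.set("b");
map.set("a");

let keys = map.keys();
keys.reverse(); // mutates #keys from outside the class

// map.keys() is now ["b", "a"] – the sorted-order invariant is broken,
// even though no method of the class ever violated it.
```

Returning a copy instead, for example [...this.#keys], would keep the private array out of reach.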
The same thing could happen if we turned our counter into an object:
class Counter {
  #value;

  constructor() {
    this.#value = 0;
  }

  inc() {
    this.#value += 1;
  }

  get() {
    return this.#value;
  }
}
This design is safe: although Counter.get() lets us read the internal value
(i.e. it implements read() -> T), we can’t mutate it because the value is a
number, and JavaScript numbers are primitive immutable values. Counter.get()
effectively returns a copy of the value, not a reference to a mutable structure.
And although inc() lets us mutate the object, it does not allow for the internal
value to be accessed directly or arbitrarily replaced – it does not implement
write(T), in fact it does not take any arguments at all, so it performs a
controlled mutation on data that it has exclusive access to, rather than letting
the caller pass in anything it likes.
However, adding one more method makes this unsafe:
class Counter {
  // ... as above

  set(value) {
    this.#value = value;
  }
}
This design is now broken, because it lets code outside the class both read and
write the #value field, effectively turning it back into a shared mutable
variable that anyone can write a broken asynchronous version of inc() on top of.
For this sort of reason, there is a mantra in object-oriented design called
tell, don’t ask: objects should receive commands and mutate their own internal
data. They should not let callers extract data, make their own decisions about
it, and then change the object from outside. There are many reasons this mantra
exists, but preventing concurrency bugs is a very good reason to follow it.
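The difference can be sketched like this, assuming a Counter with the unsafe set() method added. The ask-style increment reads, decides outside the object, and writes back; split across an async gap, it recreates the lost-update bug. The tell-style increment lets the object do the whole read-modify-write synchronously.

```javascript
class Counter {
  #value = 0;

  get() { return this.#value; }
  set(value) { this.#value = value; } // the unsafe addition
  inc() { this.#value += 1; }         // tell, don't ask
}

let counter = new Counter();

// Ask-style: extract the value, decide outside, write back.
// Across the await, other askInc() tasks can read the same stale value.
async function askInc() {
  let tmp = await counter.get();
  counter.set(tmp + 1);
}

// Tell-style: the object performs the whole mutation itself, synchronously,
// so no other code can run between its read and its write.
function tellInc() {
  counter.inc();
}
```

Running many concurrent askInc() tasks loses updates exactly as the bare-variable version did, while tellInc() cannot, because the read and write never separate.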
We have seen that objects can be highly useful in defining how a data structure
can be accessed and mutated, an improvement on top of using a bare Location with
uncontrolled access. In a single-threaded context, they provide a lot of
reassurance about which operations can occur against a value, and in what order.
But when we throw threads into the picture, things get more complicated.
Here’s the same Counter type, implemented in Ruby. The attr_reader :value line
defines a method for reading the private @value property.
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  def inc
    @value += 1
  end
end
We can use it as we have above: incrementing it a certain number of times. In
Ruby, parentheses on method calls are optional and usually omitted for methods
with no arguments; counter.value is a method call, not a bare property access.
You cannot get direct access to object attributes in Ruby, you can only access
them via the object’s methods.
counter = Counter.new
1000.times { counter.inc }
puts "counter: #{counter.value}"
As expected, this prints 1000. Let’s try the same thing, but calling Counter#inc
from multiple threads. My machine has four CPU cores, so I will start four
threads. I am running this example using JRuby, which supports true parallel CPU
usage, as opposed to the canonical Ruby implementation, which has a global
interpreter lock and only allows one thread to run at a time.
counter = Counter.new

threads = (1..4).map do
  Thread.new do
    1000.times { counter.inc }
  end
end

threads.each(&:join)
puts "counter: #{counter.value}"
We’d want a correct implementation to print 4000, but this will print some
random number that’s less than that. That’s because when two threads enter the
inc method at the same time, they can both read the @value attribute, get the
same result, and then add 1 to the result and write it back to @value.
@value += 1 is just short-hand for the two-step read and write we were doing
above and is not atomic.
So even though Counter is not a Location per se, we no longer have the guarantee
that the code inside a single method will run to completion without
interruption. This is because threads are preemptive – they can interrupt your
code at any point they like and allow another thread to run. Threads have broken
the encapsulation we had, but we can get it back quite easily.
The solution here is to use a mutex. This is a locking mechanism that prevents
multiple threads from entering the same block of code at the same time. When a
thread hits a lock call, it blocks if another thread has already locked the
mutex. When the mutex is unlocked, then one of the threads blocked on it can try
to acquire the lock and make progress. In Ruby this is implemented as follows:
class Counter
  attr_reader :value

  def initialize
    @value = 0
    @mutex = Mutex.new
  end

  def inc
    @mutex.lock
    @value += 1
    @mutex.unlock
  end
end
In the inc method, we call Mutex#lock to allow at most one thread to proceed
past this point. We then perform our change by reading and writing @value, and
call Mutex#unlock to wake up one of the blocked threads. With this small change,
the program now reliably prints 4000. We had a problem with @value being used as
a Location; the logical task encapsulated in the inc method is to read @value
and then write it:

tmp = @value     # read
@value = tmp + 1 # write
We have fixed this problem by adding an access control mechanism: a mutex that prevents this task from being executed concurrently, forcing the threads to run it sequentially and therefore not lose any updates. It is critical that the mutex protects both the read and the write; this implementation is incorrect:
def inc
  tmp = @value
  @mutex.lock
  @value = tmp + 1
  @mutex.unlock
end
This still allows two concurrent threads to begin executing the inc method, both
get the same value for tmp, and therefore both write the same new value. When a
write is determined from data we read from the same location, both the read and
the write must happen while the mutex is locked:
def inc
  @mutex.lock
  tmp = @value
  @value = tmp + 1
  @mutex.unlock
end
This means a thread cannot read @value while some other thread is incrementing
it, so it never gets a stale value that would lead it to making an incorrect
decision.
We can think of this as having changed the type from Location to something more
complex – before we can read or write the value, we need to lock it. In a type
system, if we want to enforce that some operation g() may only be called after
calling operation f(), we can do that by putting these operations into different
types, and making f() return the type that implements g(), i.e.:
// Allows calls f() and g()
// in any order
MyType = {
  f();
  g();
}

// Only allows g() after f()
MyType = {
  f() -> Other;
}

Other = {
  g();
}
In our case, we want to allow read() -> T and write(T) only after calling
lock(), giving us:
Mutex<T> = {
  lock() -> Lock<T>;
}

Lock<T> = {
  read() -> T;
  write(T);
  unlock();
}
We can think of lock() as giving us a Lock<T> type, which allows us to read and
write the underlying value, and to unlock the mutex. The key thing here is that
only one Lock<T> is allowed to exist at a time, and lock() blocks until the
existing one is destroyed by calling unlock() on it. This structural difference
in the types of the structures and operations present is what lets us say this
code is now safe, compared to the Location-typed implementation which was not.
Some languages bake this concept of a mutex into their syntax. For example, here is a non-thread-safe counter written in Java:
class Counter {
  private int value = 0;

  int get() {
    return value;
  }

  void inc() {
    value += 1;
  }
}
We can use this from a single thread, and it works fine. This program will print
1000 as expected:
public class Main {
  public static void main(String[] args) {
    Counter ctr = new Counter();
    for (int i = 0; i < 1000; i++) {
      ctr.inc();
    }
    System.out.println("counter: " + ctr.get());
  }
}
However, if we try to use this counter from multiple threads, we’ll get a nonsense result. This is equivalent to the Ruby we saw above, which ran four threads, each incrementing the counter a thousand times.
import java.util.ArrayList;
import java.util.List;

public class Main {
  public static void main(String[] args) {
    Counter ctr = new Counter();
    List<Thread> threads = new ArrayList<Thread>();

    for (int t = 0; t < 4; t++) {
      Thread th = new Thread(() -> {
        for (int i = 0; i < 1000; i++) {
          ctr.inc();
        }
      });
      th.start();
      threads.add(th);
    }

    for (Thread th : threads) {
      try {
        th.join();
      } catch (InterruptedException e) {
        // ignore
      }
    }

    System.out.println("counter: " + ctr.get());
  }
}
All that’s needed to make the counter thread-safe is to put synchronized before
the inc() method definition:
class Counter {
  // ... as above

  synchronized void inc() {
    value += 1;
  }
}
This does the same thing that we saw in Ruby: it gives the object a mutex, and
uses the mutex to protect all methods marked synchronized; only one thread at a
time is allowed to execute any of the object’s synchronized methods. If get()
returned a reference to an internal structure, rather than a copy of an int, you
would also want to mark it synchronized or else you would risk a thread being
able to access that structure in an intermediate state, while another thread was
executing a synchronized method.
The nice thing about making mutexes explicit, rather than implicit as Java does,
is that it makes it easier to put them in different places. For example, we
could have fixed the Ruby version not by using Mutex inside Counter, but by
using it alongside Counter, in the threads that are interacting with it.
counter = Counter.new
mutex = Mutex.new

threads = (1..4).map do
  Thread.new do
    1000.times do
      mutex.synchronize { counter.inc }
    end
  end
end
Adding mutex synchronisation to a function adds some performance overhead and
you might not want to pay that cost all the time. If you know a bunch of the
code using Counter is single-threaded, then it makes sense to have no mutex
inside that class, and only use a mutex when it’s needed. Or, we could invent a
wrapper around Counter whose only purpose was to wrap any calls to it with mutex
synchronisation. If you have many disparate parts of a program accessing the
same data, it may be easier to wrap the data with a control mechanism than to
make those parts of your program co-ordinate their use of the data. For example,
you could use a mutex-like construction to force async JavaScript tasks to
execute sequentially.
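As a sketch of that idea, here is a minimal async mutex in JavaScript (AsyncMutex is a hypothetical helper, not a built-in): a promise chain plays the role of the lock queue, handing each task a release function once every earlier task has released.

```javascript
let counter = 0;

// A minimal async mutex: lock() resolves with a release() function once
// every earlier caller has released, so only one task proceeds at a time.
class AsyncMutex {
  #tail = Promise.resolve();

  lock() {
    let release;
    const willRelease = new Promise((resolve) => (release = resolve));
    const acquired = this.#tail.then(() => release);
    this.#tail = willRelease; // the next caller waits for our release()
    return acquired;
  }
}

const mutex = new AsyncMutex();

async function inc() {
  const release = await mutex.lock();
  let tmp = counter;       // read
  await Promise.resolve(); // an async gap that would normally lose updates
  counter = tmp + 1;       // write, still under the lock
  release();
}

async function main() {
  let tasks = [];
  for (let i = 0; i < 1000; i++) {
    tasks.push(inc());
  }
  await Promise.all(tasks);
  console.log("counter:", counter); // prints 1000
}

main();
```

As with the Ruby mutex, the lock covers both the read and the write, so the tasks run their read-modify-write sequences one at a time even though they were all launched concurrently.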
So far, we’ve established some basic concepts. We’ve met the Location type,
which characterises an object that allows uncontrolled reads and writes. We’ve
seen how that pattern can lead to data races, in threaded code, single-threaded
asynchronous code, or simply by calling out to other functions. And, we’ve seen
a common method of performing safe concurrent mutation, using mutexes, and how
that changes the type of the interaction to require a successful lock() call
before performing any data access. In the next part of this series, we’ll look
at how these patterns appear when accessing data outside our program, in files
and databases.