This article is part of a five-part series on race conditions. The complete series is:
- Part 1: locations and locks
- Part 2: files and databases
- Part 3: web applications
- Part 4: user workflows
- Part 5: concurrency and language design
–
In this series, we’ve been exploring the idea of a Location
type and how it
manifests in various programming situations. In part 1 we looked at
programs managing data in memory and using mutexes to control concurrent access.
In part 2 we moved to looking at files and databases, and we saw how lost
updates can be prevented by transactional locks, repeatable read isolation,
and compare-and-swap. In part 3 we looked at uniqueness violations, partial
validity, and other forms of write skew, and how we can use the aforementioned
techniques along with serializable isolation to prevent such bugs.
We have been building up a library of phenomena that can happen in web applications, or really any multi-user client-server system, when two tasks try to read and write the same record or table without co-ordination. Everything we’ve examined so far has been a single-request phenomenon, that is the task in question is confined to a single update, typically involving a single request to the server. These types of bugs are often overlooked because it seems unlikely they’d ever happen in practice: what are the odds of two conflicting requests actually overlapping in time and triggering one of the race conditions we’ve studied?
There are several answers I like to give here. The simplest is that, allowing concurrent requests is the whole reason we run multiple instances of our applications. If our traffic is low enough, and we don’t mind downtime during deployments, we can run a single instance of the application, process requests one at a time, and none of the phenomena we’ve examined will ever happen. But for many applications, this is not acceptable, and we run multiple instances entirely to increase throughput, by processing requests concurrently.
That means races can happen, but it doesn’t mean they will happen. A race condition actually being triggered requires two conflicting requests – two requests that modify the same record, or an interrelated set of records – being processed at the same time. This isn’t a simple function of how much traffic you have, although having more traffic will certainly make races more likely in general. It also depends on how traffic is distributed across resources. For example, if your service produces situations where many users are interacting with the same resource – a viral social media post, or an in-demand concert, say – then it may concentrate a lot of activity on the same resource in a short period of time, even if your overall traffic is still low.
It is especially critical to pay attention to concurrency bugs if the resource represents a scarce good that must be reliably controlled. For example, when thousands of users are trying to buy concert tickets, you need to make sure you don’t oversell your allocation, so any race condition that results in miscounting sold tickets must be avoided. This is a particularly bad scenario because the conditions that are most likely to trigger a race condition are also those where it causes the worst problems. A database under heavy load will also make this worse, because it causes requests to take longer, and it therefore becomes more likely that they will overlap.
It is often worth proactively preventing these bugs rather than waiting for them
to happen in production. Sometimes, you’ll have a little corrupted data to
manually clean up, and that’s manageable. But if you oversell a
ten-thousand-seat concert, it’s a horrible task to have to clean up the mess
after it’s happened. These bugs are insidious because it’s often very hard to
notice them during development, and they only show up after going to production,
where they can cause a lot of damage. This is why I’m interested in tools to
help us spot them sooner; when PostgreSQL rejects an unserialisable transaction,
or when CouchDB rejects writes without the correct _rev
value, they’re helping
us ship more robust code so these bugs never make it to production. The
languages and libraries we use can help too, but we also need a conceptual
understanding of where these bugs come from – hence this series.
Ultimately, every bug I’ve given as an example here is something I’ve observed in real production systems. They become even harder to deal with at scale, because you rely so much more on autonomous data management without manual checking of much of the data in your systems. These bugs can go unnoticed for a while, until your service starts behaving erratically, at which point you may not be able to make any sense of the data. The fact that these bugs lead to nonsense being persisted in databases can make their effects quite long-lasting, and also far-removed from the original cause, and therefore even more expensive to clean up once the bug is noticed.
Hopefully, if you’ve read this far, I’ve persuaded you that these concurrency
bugs are worth paying attention to. But even if we prevent race conditions
during a single request’s interaction with the database, stepping up a level in
the stack reveals yet another Location
: when users interact with a website to
update some data, they typically do this by loading a page presenting some
existing resource in a form (i.e. a read() -> T
) and then at some later time,
they submit the form, sending the resource’s new state to the server (a
write(T)
). This creates further opportunity for race conditions, and makes
them much more likely to occur, because the task spans the whole read/write
cycle of the user loading, editing, and saving the resource. Rather than being
confined to a single request, this multi-request workflow could span minutes
or even hours, between the user receiving a copy of the resource, and sending
their modified copy back to the server. You might even deploy a new version of
your service while the user is making their changes.
Although forms in web applications introduce new opportunities for race conditions, they also introduce some established semantics we can build on, as well as context that helps us make decisions about the risks and how they should be handled. Rather than looking at a computational task in isolation, we now have an end user to consider, and we should think about what their expectations are when they use our services, and how those expectations are shaped by the interface and information the service presents.
There are many useful ways to characterise multi-request data races that can help us decide how they should be handled. I’m going to focus on a couple of distinctions that I often use when thinking about them: whether the race involves a single user or multiple users, and whether it involves conflicting or duplicate requests.
Let’s deal with single-user scenarios first. These are situations where the tasks involved in a race condition are all triggered by the same end user. For example, they might open the same form in two different tabs – maybe even on different devices – make different changes in each place, and submit both of them. This would be a conflicting request scenario, because the server receives concurrent requests containing different state for the same resource.
These requests are concurrent in the sense of causality, even if the POST
requests do not actually overlap in time: they both contain state that’s derived
from the same starting point, rather than one including the changes made by the
other and therefore coming after it. Imagine I open my user profile in two tabs.
In one of them I change my username and submit the form, and in the other I
change the email address and submit. This sequence of events looks like this:
Tab 1 | Tab 2
--------------------------------+--------------------------------
GET /profile |
| GET /profile
# change username |
POST /profile |
| # change email address
| POST /profile
The GET /profile
request will return the same state in both tabs, and so the
POST
in Tab 2 will contain the username from before I edited it in Tab 1.
Therefore Tab 2 will overwrite the changes made in Tab 1, causing a lost update.
Again, we see it’s a race condition because the result is different if we
perform these tasks sequentially:
Tab 1 | Tab 2
--------------------------------+--------------------------------
GET /profile |
# change username |
POST /profile |
| GET /profile
| # change email address
| POST /profile
In this scenario, the GET
from Tab 2 happens causally after the POST
from
Tab 1; it includes the username we set in Tab 1 and no updates are lost. This is
equivalent to me editing my profile twice in the same tab.
To decide how to handle this, we can ask what the user’s expectations are. The nice thing about forms is that they’re complete: all the inputs are sent to the server, regardless of whether they’ve been changed. The user is aware that they made both these submissions, and so it’s reasonable to simply let the last submission win. If the user can see the whole form and submit button, then it is reasonable to interpret this as them asserting what the complete state of their profile should be, and letting the most recent submission overwrite all previous ones. It may actually be confusing to try to merge both these requests, because it would result in the username not matching what the user could see in the last form they submitted.
The one wrinkle in this plan is that we can’t control the order in which
requests are processed. Unless the requests are sent over the same TCP
connection, they can be reordered between the browser and the server. Once they
reach the server, they might still be reordered by a load balancer dispatching
the requests to different app servers. So we can’t rely on the order the client
sent the requests to make any guarantees. In the above scenario, the POST
in
Tab 1 might be delayed for long enough that the user decides to retry on a
different device. For example, Tab 1 might be on a phone that’s experiencing
very high latency connecting to the internet. While they’re repeating the
process in Tab 2 on their laptop, the server eventually receives the POST
from
Tab 1 and handles it. We’ll amend the above sequence diagram to show the first
POST
being sent while Tab 2 is active.
Tab 1 | Tab 2
--------------------------------+--------------------------------
GET /profile |
# change username |
POST /profile |
. | GET /profile
. | # change email address
. | POST /profile
# response received
If the events in Tab 2 occur while the first POST
is in flight, then the GET
in Tab 2 may or may not include the changes from Tab 1, and the race condition
may occur. Preventing this would require co-ordination between clients, which
requires them to communicate over the network, which is exactly what they’re
failing to do to begin with. So there’s not a nice solution to this that doesn’t
add a ton of additional complexity. It’s simpler to assume that the UI of forms
means the user knows what state they’re saving, and we should clearly show them
what state their data is in as they continue to use the service.
If the data being modified is especially sensitive, and unlikely to be proactively checked without digging in the service’s settings, it may be worth alerting the user to the change, either with an on-site notification or by sending them an email or push notification. For example, some services will email you if your password is changed, to make sure you intended for that to happen, and so you can go and amend the change if necessary.
(Note that, as we saw in part 3, because these tasks update different fields, it
is still possible to cause write skew if the POST
requests actually overlap in
time and the fields are both subject to a validation constraint. However, as
these tasks are being triggered by a single person, it’s quite unlikely that
would actually happen. The most likely cause of this is network latency delaying
requests, as we have just discussed.)
We have seen that the correct response to a conflicting set of requests doesn’t have a clear-cut answer; it depends on how the UI communicates with the user about what a certain action will do, and how it informs the user about the new state after any action they take. This is even more true in the case of duplicates, where the user sends the same request twice. This often happens when a user double-clicks a form, or retries a request that appears to have failed.
It’s tempting to try to prevent such situations by using JavaScript to disable a form’s submit button once it’s clicked, and sending the request using JavaScript. However, this requires error handling: we need to detect when the request fails, and retry the request, or re-enable the submit button so the user can retry it. The problem is – and this is a general problem in distributed systems – that it’s not always possible to tell if the request was processed.
If we receive an HTTP response from the server then we know the outcome of the request for certain, but until that response arrives the client can only guess at what might have happened. It may not have been sent at all, if the client is offline. It may have been sent but the server may not have received it yet. It may have been received by a server but is sitting in a queue waiting to be processed. It may be being processed by the server-side application. It may have been processed, and the response may be on its way to the client, but not received yet. It could be in any of these states and the connection could be severed at any point. So it’s possible the request was received by the server, or even processed, without the client receiving a response. So if the client has not received a response, it cannot know whether its original request was received and processed.
We’d like to handle failed requests and re-enable the form so the user can re-submit it. But, we can’t wait to receive a failed HTTP response – this may never arrive, so we would wait forever and the form would remain unusable. The alternative is that we re-enable the form after a timeout if no response is received, but in this scenario we cannot know whether the server received the first request. So we have to choose between definitely preventing duplicate requests (exactly once delivery) but making the form potentially unusable, or keeping the form usable but allowing duplicate requests to be sent (at least once delivery). It tends to be a better bet to pick the latter, and make sure the server can reliably handle receipt of duplicate requests.
For requests that update an existing resource, a duplicate write is often completely safe – the later request will overwrite the earlier one, but since they contain the same data the net effect is the same. This only becomes a problem if the request causes new records to be created, and it’s problematic to create the same record twice, for example creating a new user account. In this situation, the first submission of the sign-up form will create the account, and the second request will fail if there’s a unique index on the account username or email address. If that failure is not handled then the user will see an error response, even though their request for a new account was actually fulfilled.
A useful approach in such situations is to think not in terms of what change a request should produce, but what state it should bring about, regardless of how many times it’s processed. In the case of a sign-up form, we want an account with the requested username to exist, but it must be a new account. We don’t want to hand out access to whatever existing account already exists with that name; we want to create a new account for this user, if the username is not already taken. If the given username is not already registered, then the sign-up proceeds normally: the account is created and the user is logged in as that account. If it already exists, we can handle the duplicate-request case gracefully by checking if the request contains the same information as the existing account – specifically, does its password match that already stored.
class UsersController < ApplicationController
def create
begin
@user = User.create(params[:user])
rescue ActiveRecord::RecordNotUnique
username, password = params[:user].values_at(:username, :password)
existing_user = User.find_by(username: username)
if existing_user.authenticate(password)
@user = existing_user
else
# return an error
end
end
# log the user in
# show success page
end
end
If the credentials match the existing account, then @user
is assigned to the
existing account and everything proceeds normally. If not, then we display an
error page, but note that this error page must not display information from
existing_user
– that’s an account the current user does not have access to.
If we handle the error by re-displaying the sign-up form, then the form should
include the data from @user
, the object containing the information this user
submitted, not existing_user
, which is not their account.
Now, let’s consider multi-user scenarios. When multiple users are involved, then we have genuine race conditions because people may be submitting conflicting requests without co-ordinating with one another. We can’t just wave the problem away as we did for single-user scenarios with the argument that the conflicting requests all came from a single user, so they’re aware of all the requests that have been made and may reasonably expect their last write to win. Now the conflicting requests result from different users making independent decisions, and having one user’s changes overwrite another’s may not be acceptable.
It is not advisable to place a database-level lock on a resource while this happens, because it prevents anyone else reading the record, and also creates a situation where a database transaction must span multiple requests, but only apply to requests from a single user, which is hard to implement.
The main problem with implementing locks at this level is that HTTP is stateless – once a client has requested a page, its connection to the server is not preserved (other than as a performance optimisation at the networking layer). A database client can initiate transactions and acquire locks and that state persists while the client is connected; if a client disconnects during a transaction, the transaction is aborted and all its locks released. In HTTP there is no such mechanism for reliably releasing locks when a client disconnects, because there is no persistent and stateful connection at a logical level between a user agent and the server, and a single end-user’s interaction with the service might span multiple user-agents, for example their laptop and their phone’s browsers. Therefore if you implement application-level locking, you need a way to reliably release locks held by users, or a way to negotiate them being released after a time limit, or when the user’s login session ends.
Instead, a common solution is to implement a model where each resource has a clear owner or set of owners at any given time, and only that owner can modify the resource. Race conditions usually involve a lack of co-ordination, and it’s unlikely that a single end user would make concurrent edits to the same object without knowing there were doing that. In any case, the semantics of web forms are that the user submits the entire form as a single request, so even if they open the same form in two tabs, it’s reasonable for them to expect that the set of data they submit last is what ends up being saved. This model can be extended to a resource being owned by a single team or organisation at a time, rather than a single person, if that team works closely together and is likely to communicate via other channels about the changes they’re making. The weaker the communication and co-ordination is among owners, the more care the service needs to take over controlling concurrent access.
You can also use an optimistic control technique, like embedding a version
number in the form and atomically checking it on submission. For example, you
can embed the object’s current lock_version
in the form, and that will make
Rails reject the update if someone changed the resource concurrently. The main
risk here from a usability point of view is that a user spends some time working
on their changes only to have them rejected. This should be handled gracefully
by, at least, presenting the form back to them with their submitted data intact.
The workflow can then be improved by showing them the new state of the resource
if someone else submitted an update while they were working, potentially
attempting to merge both sets of changes together. How viable this is will
depend on the structure of your data and what it represents, but for example the
diff3 algorithm allows two sets of changes to a text document to be merged,
if you have a copy of the state before the changes were made.
Beyond this, there are many systems for real-time collaborative editing, which are beyond the scope of this series, including conflict-free replicated data types, operational transforms, and hybrid approaches. For most applications that don’t require this model, using an ownership system or optimistic control is probably most appropriate.
To finish up our discussion of conflicts in user workflows, I’d like to visit a topic that trips a lot of developers up in both single-user and multi-user scenarios: state machines. Let’s imagine we’re running a food delivery service, where orders can be in one of these states:
accepted
: the order has been received and is awaiting processingpreparing
: the order is being prepared in the kitchenshipping
: the meal is has been handed off to a delivery driverdelivered
: the meal has been delivered to the customer
The states are strictly sequential; all orders begin in the accepted
state and
progress through the others in order. As the people responsible for fulfilling
the order complete the work, they use an app to update the status of the order
so the customer can see what’s going on. Because the states are sequential, we
always know what state an order will move to next, based on its current status.
Given that, it’s tempting to implement the action for updating the state as a
single button that triggers the next state. For example, if an order is
accepted
then we might display a button allowing the user to indicate they’ve
starting preparing it:
<form method="post" action="/orders/42">
<input type="submit" value="Start preparing">
</form>
This form contains no input values other than the submit button label. The
controller handles the POST /orders/42
request by moving the order to
whichever status is next:
class OrdersController < ApplicationController
STATES = [
"accepted",
"preparing",
"shipping",
"delivered",
]
def update
order = Order.find(params[:id])
new_status = status_after(order.status)
order.update(status: new_status)
end
def status_after(status)
index = STATES.index(status)
STATES[index + 1]
end
end
This seems reasonable, and is simple enough to implement. However, it is prone to race conditions if two users perform the following steps:
User 1 | User 2
--------------------------------+--------------------------------
Load the order page |
| Load the order page
Click 'Start preparing' |
| Click 'Start preparing'
The form submission from User 2 will update the status to shipping
, because
the controller handles Order#status
like a counter, incrementing it to the
next state in the sequence. This is not what the user expects on clicking a
button labelled Start preparing
, and it’s a race condition because if User 2
loaded the form after User 1’s submission, they’d see a different label on the
button and potentially make a different decision. This is a common mistake in
web form design: the page a user is currently looking at does not necessarily
agree with the current state on the server, which may have changed since the
page was loaded.
When we implemented counters earlier in this series we saw that each caller
supplying the desired next state was a problem, but here it’s exactly what we
need. If the form includes a status
input indicating what state we’re
requesting, then the controller can verify that this request still makes sense.
<form method="post" action="/orders/42">
<input type="hidden" name="status" value="preparing">
<input type="submit" value="Start preparing">
</form>
Rather than incrementing Order#status
to the next value, we now check that
the requested new status is the next state in the sequence, and therefore
represents a valid transition. If not, we show the user an error.
def update
order = Order.find(params[:id])
new_status = params[:status]
if new_status == status_after(order.status)
order.update(status: new_status)
# show success page
else
# show error page
end
end
Now we see a problem: this controller doesn’t handle double-submission
gracefully. If I double-click Start preparing
, the first request will cause a
transition from accepted
to preparing
, but the second request will fail
because preparing
is now not the next state in the sequence. We can handle
this by checking whether the order is already in the requested state, and
showing a success response if so. This also deals with the situation above where
two different users try to cause the same state change – the intent of each is
now honoured. We’re thinking about the desired end state we want, rather than
the desired change – sometimes the resource is already in the desired state and
no change is necessary.
def update
order = Order.find(params[:id])
new_status = params[:status]
if new_status == order.status
# show success page
elsif new_status == status_after(order.status)
order.update(status: new_status)
# show success page
else
# show error page
end
end
An order changing state may result in side effects, like a notification being sent to the customer. Where possible we only want these effects to be executed once, and in this case we can trigger them only when the order changes state, not when it’s already in the desired state.
def update
order = Order.find(params[:id])
new_status = params[:status]
if new_status == order.status
# show success page
elsif new_status == status_after(order.status)
order.update(status: new_status)
send_notification(order)
# show success page
else
# show error page
end
end
Finally, we spot another Location
pattern occurring here: we’re loading an
Order
from the database and making some decision about how to change it based
on its current state. Using the lock_version
method for optimistic control
would cause errors on duplicate requests, so let’s instead use a pessimistic
lock to ensure the record is not changed while we’re making these changes.
def update
ActiveRecord::Base.transaction do
order = Order.lock.find(params[:id])
new_status = params[:status]
if new_status == order.status
# show success page
elsif new_status == status_after(order.status)
order.update(status: new_status)
send_notification(order)
# show success page
else
# show error page
end
end
end
This action is now robust against either the same user or multiple users submitting semantically duplicate requests, and against the order being changed during those requests. We’ve seen that although one can model the order’s status as a counter, that doesn’t make sense if the UI presented to the user implies the order will be put into a particular state as a result of their action. Our modelling needs to match the interface we’re exposing to users, and the user’s intentions as informed by that interface need considering when deciding on solutions to concurrency problems.
There’s a lot of highly non-obvious things to get right when coding something
like this, and while there are libraries that deal with modelling state
machines, they don’t consider all these use cases in their API. For example,
aasm uses a transactional lock to update a record’s state
attribute, but
it acquires that lock after first loading the record. It doesn’t deal with the
duplicate request problem at all because its API is centred on triggering events
that necessarily change the record’s state, rather than expressing a desired
next state, checking if that’s a legal transition, and handling the case where
the record is already in that state. It does prevent various classes of race
condition by being diligent about how it checks and effects state transitions,
and you’re probably better off using it than hand-coding a state machine, but
you need to be aware that it will sometimes result in errors that could have
been handled more gracefully if its API were designed to centre target states
rather than changes in state.
We have scratched the surface of concurrency problems that occur in web
application workflows here, but we’ve identified some common patterns. We can
spot potential problems by recognising instances of the Location
pattern,
where users can both read and write a resource as is common when using forms.
When deciding how to handle the resulting problems, we can think about requests
originating from a single user, a team of communicating people, or from
completely independent actors. We can ask what happens when requests contain
duplicate or conflicting information. And, we must pay attention to the
interface we expose to users and whether our service’s semantics match the
user’s expectations.
A similar set of problems occur when designing APIs, both for internal and external consumption, and a lot of the above considerations apply. They pose additional challenges, for example APIs often allow partial updates – whereas web forms contain a complete representation of resource, APIs may let the client sent only the fields it wants to change, and so write skew becomes more of a factor.
Whereas web pages typically represent many related resources in a single page,
an API may require the client to make multiple requests to get all the objects
it wants, making the request non-atomic. What happens if any of those objects
change during this process? It might make more sense to allow all the related
resources to be fetched in a single request, maybe using something like
GraphQL – especially if the network round-trip time is large, for example
if the client is a JavaScript app running in a browser. Again, concerts are my
go-to example for this: a concert is mostly made of pointers to other things:
the artists, the venue, the city the venue is in, the ticket vendors, etc. A
client almost always wants all that information in order to display anything
useful, so having an API return a concert
object that’s just a date and a list
of URLs to other objects tends to be wasteful.
Many of the techniques we’ve seen above apply when designing APIs for concurrent
access. Using ownership systems and compare-and-swap are both very effective,
particular in dealing with the fact that you can’t rely on request ordering
between clients and servers. The only way to make sure one request follows
another is to make it contain a value from the first request’s response – for
example, see how CouchDB requires all requests to contain the latest value of
the _rev
field for a document. If you’re talking to a API or web service that
doesn’t have any concurrency control built in, you may need to use your own
database to co-ordinate your interaction with it. For example, if you have
multiple processes all talking to an external CRM that lets you retrieve and
update users’ email preferences, then using your own database to place advisory
locks against user records while their CRM record is being updated can stop your
processes overwriting each other’s changes. If multiple applications are talking
to this service without any such co-ordination, then you can’t prevent race
conditions and all bets are off.
That wraps up our discussion of race conditions in web systems. In the final article in this series, we will look at alternative approaches to dealing with these problems at the language and library layer, in systems that don’t allow bugs like this to happen at all.