Reading and writing, part 4: user workflows

In this series, we’ve been exploring the idea of a Location type and how it manifests in various programming situations. In part 1 we looked at programs managing data in memory and using mutexes to control concurrent access. In part 2 we moved to looking at files and databases, and we saw how lost updates can be prevented by transactional locks, repeatable read isolation, and compare-and-swap. In part 3 we looked at uniqueness violations, partial validity, and other forms of write skew, and how we can use the aforementioned techniques along with serializable isolation to prevent such bugs.

We have been building up a library of phenomena that can happen in web applications, or really any multi-user client-server system, when two tasks try to read and write the same record or table without co-ordination. Everything we’ve examined so far has been a single-request phenomenon, that is, the task in question is confined to a single update, typically involving a single request to the server. These types of bugs are often overlooked because it seems unlikely they’d ever happen in practice: what are the odds of two conflicting requests actually overlapping in time and triggering one of the race conditions we’ve studied?

There are several answers I like to give here. The simplest is that allowing concurrent requests is the whole reason we run multiple instances of our applications. If our traffic is low enough, and we don’t mind downtime during deployments, we can run a single instance of the application, process requests one at a time, and none of the phenomena we’ve examined will ever happen. But for many applications, this is not acceptable, and we run multiple instances entirely to increase throughput, by processing requests concurrently.

That means races can happen, but it doesn’t mean they will happen. A race condition actually being triggered requires two conflicting requests – two requests that modify the same record, or an interrelated set of records – being processed at the same time. This isn’t a simple function of how much traffic you have, although having more traffic will certainly make races more likely in general. It also depends on how traffic is distributed across resources. For example, if your service produces situations where many users are interacting with the same resource – a viral social media post, or an in-demand concert, say – then it may concentrate a lot of activity on the same resource in a short period of time, even if your overall traffic is still low.

It is especially critical to pay attention to concurrency bugs if the resource represents a scarce good that must be reliably controlled. For example, when thousands of users are trying to buy concert tickets, you need to make sure you don’t oversell your allocation, so any race condition that results in miscounting sold tickets must be avoided. This is a particularly bad scenario because the conditions that are most likely to trigger a race condition are also those where it causes the worst problems. A database under heavy load will also make this worse, because it causes requests to take longer, and it therefore becomes more likely that they will overlap.

It is often worth proactively preventing these bugs rather than waiting for them to happen in production. Sometimes, you’ll have a little corrupted data to manually clean up, and that’s manageable. But if you oversell a ten-thousand-seat concert, it’s a horrible task to have to clean up the mess after it’s happened. These bugs are insidious because it’s often very hard to notice them during development, and they only show up after going to production, where they can cause a lot of damage. This is why I’m interested in tools to help us spot them sooner; when PostgreSQL rejects an unserialisable transaction, or when CouchDB rejects writes without the correct _rev value, they’re helping us ship more robust code so these bugs never make it to production. The languages and libraries we use can help too, but we also need a conceptual understanding of where these bugs come from – hence this series.

Ultimately, every bug I’ve given as an example here is something I’ve observed in real production systems. They become even harder to deal with at scale, because you rely far more on automated data management, without manually checking much of the data in your systems. These bugs can go unnoticed for a while, until your service starts behaving erratically, at which point you may not be able to make any sense of the data. The fact that these bugs lead to nonsense being persisted in databases can make their effects quite long-lasting, and also far-removed from the original cause, and therefore even more expensive to clean up once the bug is noticed.

Hopefully, if you’ve read this far, I’ve persuaded you that these concurrency bugs are worth paying attention to. But even if we prevent race conditions during a single request’s interaction with the database, stepping up a level in the stack reveals yet another Location: when users interact with a website to update some data, they typically do this by loading a page presenting some existing resource in a form (i.e. a read() -> T) and then at some later time, they submit the form, sending the resource’s new state to the server (a write(T)). This creates further opportunity for race conditions, and makes them much more likely to occur, because the task spans the whole read/write cycle of the user loading, editing, and saving the resource. Rather than being confined to a single request, this multi-request workflow could span minutes or even hours, between the user receiving a copy of the resource, and sending their modified copy back to the server. You might even deploy a new version of your service while the user is making their changes.

Although forms in web applications introduce new opportunities for race conditions, they also introduce some established semantics we can build on, as well as context that helps us make decisions about the risks and how they should be handled. Rather than looking at a computational task in isolation, we now have an end user to consider, and we should think about what their expectations are when they use our services, and how those expectations are shaped by the interface and information the service presents.

There are many useful ways to characterise multi-request data races that can help us decide how they should be handled. I’m going to focus on a couple of distinctions that I often use when thinking about them: whether the race involves a single user or multiple users, and whether it involves conflicting or duplicate requests.

Let’s deal with single-user scenarios first. These are situations where the tasks involved in a race condition are all triggered by the same end user. For example, they might open the same form in two different tabs – maybe even on different devices – make different changes in each place, and submit both of them. This would be a conflicting request scenario, because the server receives concurrent requests containing different state for the same resource.

These requests are concurrent in the sense of causality, even if the POST requests do not actually overlap in time: they both contain state that’s derived from the same starting point, rather than one including the changes made by the other and therefore coming after it. Imagine I open my user profile in two tabs. In one of them I change my username and submit the form, and in the other I change the email address and submit. This sequence of events looks like this:

Tab 1                           | Tab 2
--------------------------------+--------------------------------
GET /profile                    |
                                | GET /profile
# change username               |
POST /profile                   |
                                | # change email address
                                | POST /profile

The GET /profile request will return the same state in both tabs, and so the POST in Tab 2 will contain the username from before I edited it in Tab 1. Therefore Tab 2 will overwrite the changes made in Tab 1, causing a lost update. Again, we see it’s a race condition because the result is different if we perform these tasks sequentially:

Tab 1                           | Tab 2
--------------------------------+--------------------------------
GET /profile                    |
# change username               |
POST /profile                   |
                                | GET /profile
                                | # change email address
                                | POST /profile

In this scenario, the GET from Tab 2 happens causally after the POST from Tab 1; it includes the username we set in Tab 1 and no updates are lost. This is equivalent to me editing my profile twice in the same tab.
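This lost update can be reproduced in a few lines. Below is a minimal in-memory sketch (plain Ruby, no server involved) in which each “tab” copies the same starting state, edits one field locally, and then writes the whole record back, just as a complete form submission would:

```ruby
# The shared resource, standing in for the profile record on the server.
profile = { username: "alice", email: "alice@old.example" }

# Both tabs GET /profile before either submits, so each holds the same copy.
tab1 = profile.dup
tab2 = profile.dup

tab1[:username] = "alice_2024"      # edit made in Tab 1
tab2[:email] = "alice@new.example"  # edit made in Tab 2

profile = tab1 # Tab 1 submits its complete form
profile = tab2 # Tab 2 submits, overwriting the whole record

profile[:email]    # the new email address survives
profile[:username] # "alice" again: Tab 1's change is lost
```

Because each submission carries the full record, the later write silently discards the other tab’s change, without any error being raised anywhere.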

To decide how to handle this, we can ask what the user’s expectations are. The nice thing about forms is that they’re complete: all the inputs are sent to the server, regardless of whether they’ve been changed. The user is aware that they made both these submissions, and so it’s reasonable to simply let the last submission win. If the user can see the whole form and submit button, then it is reasonable to interpret this as them asserting what the complete state of their profile should be, and letting the most recent submission overwrite all previous ones. It may actually be confusing to try to merge both these requests, because it would result in the username not matching what the user could see in the last form they submitted.

The one wrinkle in this plan is that we can’t control the order in which requests are processed. Unless the requests are sent over the same TCP connection, they can be reordered between the browser and the server. Once they reach the server, they might still be reordered by a load balancer dispatching the requests to different app servers. So we can’t rely on the order the client sent the requests to make any guarantees. In the above scenario, the POST in Tab 1 might be delayed for long enough that the user decides to retry on a different device. For example, Tab 1 might be on a phone that’s experiencing very high latency connecting to the internet. While they’re repeating the process in Tab 2 on their laptop, the server eventually receives the POST from Tab 1 and handles it. We’ll amend the above sequence diagram to show the first POST being sent while Tab 2 is active.

Tab 1                           | Tab 2
--------------------------------+--------------------------------
GET /profile                    |
# change username               |
POST /profile                   |
  .                             | GET /profile
  .                             | # change email address
  .                             | POST /profile
  # response received           |

If the events in Tab 2 occur while the first POST is in flight, then the GET in Tab 2 may or may not include the changes from Tab 1, and the race condition may occur. Preventing this would require co-ordination between clients, which requires them to communicate over the network, which is exactly what they’re failing to do to begin with. So there’s not a nice solution to this that doesn’t add a ton of additional complexity. It’s simpler to assume that the UI of forms means the user knows what state they’re saving, and we should clearly show them what state their data is in as they continue to use the service.

If the data being modified is especially sensitive, and unlikely to be proactively checked without digging in the service’s settings, it may be worth alerting the user to the change, either with an on-site notification or by sending them an email or push notification. For example, some services will email you if your password is changed, to make sure you intended for that to happen, and so you can go and reverse the change if necessary.

(Note that, as we saw in part 3, because these tasks update different fields, it is still possible to cause write skew if the POST requests actually overlap in time and the fields are both subject to a validation constraint. However, as these tasks are being triggered by a single person, it’s quite unlikely that would actually happen. The most likely cause of this is network latency delaying requests, as we have just discussed.)

We have seen that the correct response to a conflicting set of requests doesn’t have a clear-cut answer; it depends on how the UI communicates with the user about what a certain action will do, and how it informs the user about the new state after any action they take. This is even more true in the case of duplicates, where the user sends the same request twice. This often happens when a user double-clicks a form, or retries a request that appears to have failed.

It’s tempting to try to prevent such situations by using JavaScript to disable a form’s submit button once it’s clicked, and sending the request using JavaScript. However, this requires error handling: we need to detect when the request fails, and retry the request, or re-enable the submit button so the user can retry it. The problem is – and this is a general problem in distributed systems – that it’s not always possible to tell if the request was processed.

If we receive an HTTP response from the server then we know the outcome of the request for certain, but until that response arrives the client can only guess at what might have happened. The request may not have been sent at all, if the client is offline. It may have been sent but not yet received by the server. It may have been received by a server but be sitting in a queue waiting to be processed. It may be in the middle of being processed by the server-side application. It may have been processed, with the response on its way to the client but not yet received. The connection could be severed at any of these points, so it’s possible the request was received by the server, or even processed, without the client ever receiving a response. In short, a client that has not received a response cannot know whether its original request was received and processed.

We’d like to handle failed requests and re-enable the form so the user can re-submit it. But, we can’t wait to receive a failed HTTP response – this may never arrive, so we would wait forever and the form would remain unusable. The alternative is that we re-enable the form after a timeout if no response is received, but in this scenario we cannot know whether the server received the first request. So we have to choose between definitely preventing duplicate requests (exactly once delivery) but making the form potentially unusable, or keeping the form usable but allowing duplicate requests to be sent (at least once delivery). It tends to be a better bet to pick the latter, and make sure the server can reliably handle receipt of duplicate requests.

For requests that update an existing resource, a duplicate write is often completely safe – the later request will overwrite the earlier one, but since they contain the same data the net effect is the same. This only becomes a problem if the request causes new records to be created, and it’s problematic to create the same record twice, for example creating a new user account. In this situation, the first submission of the sign-up form will create the account, and the second request will fail if there’s a unique index on the account username or email address. If that failure is not handled then the user will see an error response, even though their request for a new account was actually fulfilled.

A useful approach in such situations is to think not in terms of what change a request should produce, but what state it should bring about, regardless of how many times it’s processed. In the case of a sign-up form, we want an account with the requested username to exist, but it must be a new account. We don’t want to hand out access to whatever account already exists with that name; we want to create a new account for this user, if the username is not already taken. In that case, the sign-up proceeds normally: the account is created and the user is logged in as that account. If the username already exists, we can handle the duplicate-request case gracefully by checking whether the request contains the same information as the existing account – specifically, whether its password matches the one already stored.

class UsersController < ApplicationController
  def create
    begin
      # Relies on a unique database index on username, which makes the
      # insert for a duplicate sign-up raise ActiveRecord::RecordNotUnique.
      @user = User.create(params[:user])
    rescue ActiveRecord::RecordNotUnique
      username, password = params[:user].values_at(:username, :password)
      existing_user = User.find_by(username: username)
      # authenticate comes from has_secure_password: it returns the user
      # when the password matches, and false otherwise.
      if existing_user.authenticate(password)
        @user = existing_user
      else
        # return an error
      end
    end

    # log the user in
    # show success page
  end
end

If the credentials match the existing account, then @user is assigned to the existing account and everything proceeds normally. If not, then we display an error page, but note that this error page must not display information from existing_user – that’s an account the current user does not have access to. If we handle the error by re-displaying the sign-up form, then the form should include the data from @user, the object containing the information this user submitted, not existing_user, which is not their account.

Now, let’s consider multi-user scenarios. When multiple users are involved, then we have genuine race conditions because people may be submitting conflicting requests without co-ordinating with one another. We can’t just wave the problem away as we did for single-user scenarios with the argument that the conflicting requests all came from a single user, so they’re aware of all the requests that have been made and may reasonably expect their last write to win. Now the conflicting requests result from different users making independent decisions, and having one user’s changes overwrite another’s may not be acceptable.

It is not advisable to place a database-level lock on a resource for the duration of a user’s editing session, because it prevents anyone else reading the record, and also creates a situation where a database transaction must span multiple requests, but only apply to requests from a single user, which is hard to implement.

The main problem with implementing locks at this level is that HTTP is stateless – once a client has requested a page, its connection to the server is not preserved (other than as a performance optimisation at the networking layer). A database client can initiate transactions and acquire locks and that state persists while the client is connected; if a client disconnects during a transaction, the transaction is aborted and all its locks released. In HTTP there is no such mechanism for reliably releasing locks when a client disconnects, because there is no persistent and stateful connection at a logical level between a user agent and the server, and a single end-user’s interaction with the service might span multiple user-agents, for example their laptop and their phone’s browsers. Therefore if you implement application-level locking, you need a way to reliably release locks held by users, or a way to negotiate them being released after a time limit, or when the user’s login session ends.
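As a sketch of the “released after a time limit” idea, here is a minimal lease-based lock. The class and its names are illustrative, and the in-process Hash is a stand-in for real storage: an actual implementation would keep the leases in a shared database so that every app server sees them.

```ruby
# An application-level lock whose grants expire after a fixed time-to-live,
# so a client that disconnects cannot hold a resource forever.
class LeaseLock
  Lease = Struct.new(:owner, :expires_at)

  def initialize(ttl:)
    @ttl = ttl
    @leases = {}
    @mutex = Mutex.new
  end

  # Grant the lock if it's free or its lease has expired; re-acquiring
  # a lease you already hold renews it.
  def acquire(resource, owner, now: Time.now)
    @mutex.synchronize do
      lease = @leases[resource]
      if lease.nil? || lease.expires_at <= now || lease.owner == owner
        @leases[resource] = Lease.new(owner, now + @ttl)
        true
      else
        false
      end
    end
  end

  def release(resource, owner)
    @mutex.synchronize do
      @leases.delete(resource) if @leases[resource]&.owner == owner
    end
  end
end

locks = LeaseLock.new(ttl: 300) # five-minute leases
locks.acquire("order:42", "alice")                    # granted
locks.acquire("order:42", "bob")                      # refused: alice holds it
locks.acquire("order:42", "bob", now: Time.now + 301) # granted: lease expired
```

The time limit turns “the user disconnected and never released the lock” from a permanent outage into a bounded delay, at the cost that a very slow user can lose their lease while still editing.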

Instead, a common solution is to implement a model where each resource has a clear owner or set of owners at any given time, and only that owner can modify the resource. Race conditions usually involve a lack of co-ordination, and it’s unlikely that a single end user would make concurrent edits to the same object without knowing they were doing so. In any case, the semantics of web forms are that the user submits the entire form as a single request, so even if they open the same form in two tabs, it’s reasonable for them to expect that the set of data they submit last is what ends up being saved. This model can be extended to a resource being owned by a single team or organisation at a time, rather than a single person, if that team works closely together and is likely to communicate via other channels about the changes they’re making. The weaker the communication and co-ordination is among owners, the more care the service needs to take over controlling concurrent access.

You can also use an optimistic control technique, like embedding a version number in the form and atomically checking it on submission. For example, you can embed the object’s current lock_version in the form, and that will make Rails reject the update if someone changed the resource concurrently. The main risk here from a usability point of view is that a user spends some time working on their changes only to have them rejected. This should be handled gracefully by, at least, presenting the form back to them with their submitted data intact. The workflow can then be improved by showing them the new state of the resource if someone else submitted an update while they were working, potentially attempting to merge both sets of changes together. How viable this is will depend on the structure of your data and what it represents, but for example the diff3 algorithm allows two sets of changes to a text document to be merged, if you have a copy of the state before the changes were made.
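The version-check mechanism can be sketched without Rails. The following self-contained Ruby models what lock_version does conceptually: the form carries the version it was rendered from, and a write is rejected if that version is no longer current. (StaleObjectError here is a local stand-in for ActiveRecord::StaleObjectError, and the class is a stand-in for the database.)

```ruby
StaleObjectError = Class.new(StandardError)

# A record that only accepts writes from callers who read its current version.
class VersionedRecord
  attr_reader :attributes, :version

  def initialize(attributes)
    @attributes = attributes
    @version = 0
  end

  # Apply changes only if the caller's copy was derived from the
  # current version; otherwise reject the stale write.
  def update(changes, expected_version)
    raise StaleObjectError if expected_version != @version
    @attributes = @attributes.merge(changes)
    @version += 1
  end
end

record = VersionedRecord.new(title: "Draft")

# Two users render the form while the record is at version 0.
version_seen_by_alice = record.version
version_seen_by_bob = record.version

# Alice submits first: version matches, the write succeeds.
record.update({ title: "Alice's edit" }, version_seen_by_alice)

begin
  # Bob's form still carries version 0, so his write is rejected.
  record.update({ title: "Bob's edit" }, version_seen_by_bob)
rescue StaleObjectError
  # re-render the form with Bob's submitted data intact so he can retry
end
```

In Rails the same flow means putting lock_version in a hidden field and rescuing ActiveRecord::StaleObjectError in the controller to re-render the form.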

Beyond this, there are many systems for real-time collaborative editing, which are beyond the scope of this series, including conflict-free replicated data types, operational transforms, and hybrid approaches. For most applications that don’t require this model, using an ownership system or optimistic control is probably most appropriate.

To finish up our discussion of conflicts in user workflows, I’d like to visit a topic that trips a lot of developers up in both single-user and multi-user scenarios: state machines. Let’s imagine we’re running a food delivery service, where orders can be in one of these states:

  • accepted: the order has been received and is awaiting processing
  • preparing: the order is being prepared in the kitchen
  • shipping: the meal has been handed off to a delivery driver
  • delivered: the meal has been delivered to the customer

The states are strictly sequential; all orders begin in the accepted state and progress through the others in order. As the people responsible for fulfilling the order complete the work, they use an app to update the status of the order so the customer can see what’s going on. Because the states are sequential, we always know what state an order will move to next, based on its current status. Given that, it’s tempting to implement the action for updating the state as a single button that triggers the next state. For example, if an order is accepted then we might display a button allowing the user to indicate they’ve started preparing it:

<form method="post" action="/orders/42">
  <input type="submit" value="Start preparing">
</form>

This form contains no input values other than the submit button label. The controller handles the POST /orders/42 request by moving the order to whichever status is next:

class OrdersController < ApplicationController
  STATES = [
    "accepted",
    "preparing",
    "shipping",
    "delivered",
  ]

  def update
    order = Order.find(params[:id])
    new_status = status_after(order.status)
    order.update(status: new_status)
  end

  # Returns the state following the given one, or nil if the order is
  # already in its final state.
  def status_after(status)
    index = STATES.index(status)
    STATES[index + 1]
  end
end

This seems reasonable, and is simple enough to implement. However, it is prone to race conditions if two users perform the following steps:

User 1                          | User 2
--------------------------------+--------------------------------
Load the order page             |
                                | Load the order page
Click 'Start preparing'         |
                                | Click 'Start preparing'

The form submission from User 2 will update the status to shipping, because the controller handles Order#status like a counter, incrementing it to the next state in the sequence. This is not what the user expects on clicking a button labelled Start preparing, and it’s a race condition because if User 2 loaded the form after User 1’s submission, they’d see a different label on the button and potentially make a different decision. This is a common mistake in web form design: the page a user is currently looking at does not necessarily agree with the current state on the server, which may have changed since the page was loaded.
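You can see the mechanics of this bug in isolation. Applying the “advance to the next state” logic once per click means the second click moves the order past the state both users intended:

```ruby
# The controller's state table and transition logic, extracted so the
# double-submission behaviour can be observed directly.
STATES = ["accepted", "preparing", "shipping", "delivered"]

def status_after(status)
  STATES[STATES.index(status) + 1]
end

status = "accepted"
status = status_after(status) # User 1 clicks 'Start preparing': -> preparing
status = status_after(status) # User 2 clicks the same button:   -> shipping!
```

Both users clicked a button labelled ‘Start preparing’, yet the order ends up in the shipping state, which neither of them asked for.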

When we implemented counters earlier in this series we saw that each caller supplying the desired next state was a problem, but here it’s exactly what we need. If the form includes a status input indicating what state we’re requesting, then the controller can verify that this request still makes sense.

<form method="post" action="/orders/42">
  <input type="hidden" name="status" value="preparing">
  <input type="submit" value="Start preparing">
</form>

Rather than incrementing Order#status to the next value, we now check that the requested new status is the next state in the sequence, and therefore represents a valid transition. If not, we show the user an error.

  def update
    order = Order.find(params[:id])
    new_status = params[:status]

    if new_status == status_after(order.status)
      order.update(status: new_status)
      # show success page
    else
      # show error page
    end
  end

Now we see a problem: this controller doesn’t handle double-submission gracefully. If I double-click Start preparing, the first request will cause a transition from accepted to preparing, but the second request will fail because preparing is now not the next state in the sequence. We can handle this by checking whether the order is already in the requested state, and showing a success response if so. This also deals with the situation above where two different users try to cause the same state change – the intent of each is now honoured. We’re thinking about the desired end state we want, rather than the desired change – sometimes the resource is already in the desired state and no change is necessary.

  def update
    order = Order.find(params[:id])
    new_status = params[:status]

    if new_status == order.status
      # show success page
    elsif new_status == status_after(order.status)
      order.update(status: new_status)
      # show success page
    else
      # show error page
    end
  end

An order changing state may result in side effects, like a notification being sent to the customer. Where possible we only want these effects to be executed once, and in this case we can trigger them only when the order changes state, not when it’s already in the desired state.

  def update
    order = Order.find(params[:id])
    new_status = params[:status]

    if new_status == order.status
      # show success page
    elsif new_status == status_after(order.status)
      order.update(status: new_status)
      send_notification(order)
      # show success page
    else
      # show error page
    end
  end

Finally, we spot another Location pattern occurring here: we’re loading an Order from the database and making some decision about how to change it based on its current state. Using the lock_version method for optimistic control would cause errors on duplicate requests, so let’s instead use a pessimistic lock to ensure the record is not changed while we’re making these changes.

  def update
    ActiveRecord::Base.transaction do
      order = Order.lock.find(params[:id])
      new_status = params[:status]

      if new_status == order.status
        # show success page
      elsif new_status == status_after(order.status)
        order.update(status: new_status)
        send_notification(order)
        # show success page
      else
        # show error page
      end
    end
  end

This action is now robust against either the same user or multiple users submitting semantically duplicate requests, and against the order being changed during those requests. We’ve seen that although one can model the order’s status as a counter, that doesn’t make sense if the UI presented to the user implies the order will be put into a particular state as a result of their action. Our modelling needs to match the interface we’re exposing to users, and the user’s intentions as informed by that interface need considering when deciding on solutions to concurrency problems.

There are a lot of highly non-obvious things to get right when coding something like this, and while there are libraries that deal with modelling state machines, they don’t consider all these use cases in their API. For example, aasm uses a transactional lock to update a record’s state attribute, but it acquires that lock after first loading the record. It doesn’t deal with the duplicate request problem at all because its API is centred on triggering events that necessarily change the record’s state, rather than expressing a desired next state, checking if that’s a legal transition, and handling the case where the record is already in that state. It does prevent various classes of race condition by being diligent about how it checks and effects state transitions, and you’re probably better off using it than hand-coding a state machine, but you need to be aware that it will sometimes result in errors that could have been handled more gracefully if its API were designed to centre target states rather than changes in state.

We have scratched the surface of concurrency problems that occur in web application workflows here, but we’ve identified some common patterns. We can spot potential problems by recognising instances of the Location pattern, where users can both read and write a resource as is common when using forms. When deciding how to handle the resulting problems, we can think about requests originating from a single user, a team of communicating people, or from completely independent actors. We can ask what happens when requests contain duplicate or conflicting information. And, we must pay attention to the interface we expose to users and whether our service’s semantics match the user’s expectations.

A similar set of problems occurs when designing APIs, both for internal and external consumption, and a lot of the above considerations apply. APIs pose additional challenges; for example, they often allow partial updates – whereas web forms contain a complete representation of the resource, APIs may let the client send only the fields it wants to change, and so write skew becomes more of a factor.
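Here is a small sketch of how partial updates produce write skew. Suppose, purely for illustration, a rule that a user must keep at least one notification channel enabled; two clients each read the settings, see both channels on, and each sends a partial update turning a different one off. Each request is valid against the state it read, but together they violate the rule:

```ruby
# The resource, subject to the (illustrative) constraint that at least one
# of these flags must remain true.
settings = { email_alerts: true, sms_alerts: true }

# Both clients read the settings before either writes.
seen_by_a = settings.dup
seen_by_b = settings.dup

# Client A: "email can go, SMS is still on" -- valid against seen_by_a.
settings = settings.merge(email_alerts: false) if seen_by_a[:sms_alerts]

# Client B: "SMS can go, email is still on" -- valid against seen_by_b.
settings = settings.merge(sms_alerts: false) if seen_by_b[:email_alerts]

settings # both channels are now off: the constraint is violated
```

With a complete-form submission this particular skew can’t happen, because each request would overwrite both fields; it’s the field-at-a-time nature of partial updates that lets two individually valid writes combine into an invalid state.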

Whereas web pages typically represent many related resources in a single page, an API may require the client to make multiple requests to get all the objects it wants, making the request non-atomic. What happens if any of those objects change during this process? It might make more sense to allow all the related resources to be fetched in a single request, maybe using something like GraphQL – especially if the network round-trip time is large, for example if the client is a JavaScript app running in a browser. Again, concerts are my go-to example for this: a concert is mostly made of pointers to other things: the artists, the venue, the city the venue is in, the ticket vendors, etc. A client almost always wants all that information in order to display anything useful, so having an API return a concert object that’s just a date and a list of URLs to other objects tends to be wasteful.

Many of the techniques we’ve seen above apply when designing APIs for concurrent access. Ownership systems and compare-and-swap are both very effective, particularly in dealing with the fact that you can’t rely on request ordering between clients and servers. The only way to make sure one request follows another is to make it contain a value from the first request’s response – for example, see how CouchDB requires all requests to contain the latest value of the _rev field for a document. If you’re talking to an API or web service that doesn’t have any concurrency control built in, you may need to use your own database to co-ordinate your interaction with it. For example, if you have multiple processes all talking to an external CRM that lets you retrieve and update users’ email preferences, then using your own database to place advisory locks against user records while their CRM record is being updated can stop your processes overwriting each other’s changes. If multiple applications are talking to this service without any such co-ordination, then you can’t prevent race conditions and all bets are off.
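The advisory-lock idea can be sketched in-process with a lock table keyed by user ID; across real processes you would keep the lock in your shared database instead (for example, PostgreSQL provides pg_advisory_xact_lock, taking a key you might derive from the user ID). The CRM here is simulated with a Hash, and the two threads stand in for two of our own processes:

```ruby
# One lock per user ID; the guard mutex makes lazy creation thread-safe.
LOCKS = Hash.new { |hash, key| hash[key] = Mutex.new }
LOCKS_GUARD = Mutex.new

def with_user_lock(user_id, &block)
  lock = LOCKS_GUARD.synchronize { LOCKS[user_id] }
  lock.synchronize(&block)
end

# A stand-in for the external CRM: preferences are read and written
# as whole records, with no concurrency control of its own.
crm = { 7 => { newsletter: false, offers: false } }

# Two of our processes each flip a different preference for user 7.
threads = [:newsletter, :offers].map do |field|
  Thread.new do
    with_user_lock(7) do
      prefs = crm[7].dup   # read the record from the CRM
      prefs[field] = true  # change one field locally
      sleep 0.01           # widen the race window
      crm[7] = prefs       # write the whole record back
    end
  end
end
threads.each(&:join)

crm[7] # both updates survive: the lock serialised the read-modify-write cycles
```

Without with_user_lock, the two read-modify-write cycles could interleave and one field change would be lost, exactly the lost update from earlier in the series; the lock makes each cycle atomic with respect to the other, even though the CRM itself offers no help.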

That wraps up our discussion of race conditions in web systems. In the final article in this series, we will look at alternative approaches to dealing with these problems at the language and library layer, in systems that don’t allow bugs like this to happen at all.