Reading and writing, part 3: web applications

This article is part of a five-part series on race conditions.

In part 1 of this series, we met the idea of the Location type, a resource that can be read from and written to without co-ordination, and which therefore allows race conditions to happen. In part 2 we saw how this pattern occurs when interacting with files and databases, and we learned many ways that databases let us address its problems using pessimistic and optimistic control techniques. In this article, we’ll put this in context by looking at how read and write patterns in web applications can lead to race conditions and inconsistent data.

To begin with, let’s look at probably the most common form of race condition in such applications: uniqueness violations. It’s common for applications to require certain fields, like a user’s email address, to be unique, and Rails has historically not dealt with this requirement robustly. Feral Concurrency Control gives a good account of this problem, including a reminder that, at least in its early days, the design approach in Rails was that all logic should reside in application code, not in the database. The database should be treated as a simple data store for reading and writing records, and nothing more – that is, it should be thought of as a Location.

I don’t want my database to be clever! … I consider stored procedures and constraints vile and reckless destroyers of coherence. No, Mr. Database, you can not have my business logic. Your procedural ambitions will bear no fruit and you’ll have to pry that logic from my dead, cold object-oriented hands… I want a single layer of cleverness: My domain model.

David Heinemeier Hansson

The problem with this assertion is that once you’re running more than one instance of your application – whether that’s using multiple threads, processes, or server machines – you no longer have a single “layer of cleverness”: you have multiple clients talking concurrently to your database. In that environment, many of the interfaces Rails provides for guaranteeing consistency aren’t effective unless they’re combined with database concurrency control.

I should emphasize that the problems I am about to demonstrate are not unique to Rails, they’re commonplace in web application logic across many platforms. Rails is simply a well-known example, with which I’m most familiar. It has also had an enormous influence on how web apps in general are written; tools like Rails are how many people have learned to develop applications, and so its lack of decent concurrency control contributes to many people never learning how to notice, diagnose or fix the bugs it leads to. Tools aren’t just a means of getting work done, they’re a means of understanding the work itself, and the ergonomic effect of a tool making it easy to write certain categories of bug has an impact beyond the use of that tool in particular.

Here’s an example of how the “Rails Way” for ensuring uniqueness can be easily broken. Say we have a table that stores settings – configuration data treated as a key-value store. Many open-source applications include such a table, so that end users can tweak the system’s functionality. In Rails we would declare a schema for this table with two columns, name and value, and a model class to represent the table.

ActiveRecord::Schema.define do
  create_table :settings, force: true do |t|
    t.string :name
    t.string :value
  end
end

class Setting < ActiveRecord::Base
end

Now, we’d like to design an API for assigning settings in this table, making sure each name appears only once and thus corresponds to a unique value. Here’s the experiment we’ll run to determine whether each implementation is correct. We’ll start a set of threads, each of which will try to assign a different value for the timezone setting. Once the threads have completed, we’ll count the number of rows in the settings table.

threads = (1..4).map do |n|
  Thread.new do
    Setting.set_value("timezone", "UTC+#{n}")
  end
end

threads.each(&:join)
p [:count, Setting.count]

As we’ve only assigned to a single setting name, we expect only one row in the table. As I mentioned in a previous article, I am using JRuby here as it supports truly parallel thread execution.

Historically, by far the most common way to see this implemented is to use find_or_create_by to check if a record exists with certain properties, and create one if not. Having ensured the existence of a Setting with the right name, we can assign a value to it.

class Setting
  def self.set_value(name, value)
    setting = find_or_create_by(name: name)
    setting.update(value: value)
  end
end

When I run this implementation, there are almost always four records in settings, all with the same name, so this is clearly not workable. Let’s look at the SQL this generates to understand what’s happening:

SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
COMMIT;
BEGIN;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;

It begins by doing a SELECT to see if a record with name = 'timezone' exists. If this returns nothing, then it performs an INSERT for that name inside a transaction; this will return the id of the new record. Given that id, it then issues an UPDATE to set the value on the new record, in a separate transaction. If the initial SELECT does return a record, then the middle transaction containing the INSERT does not happen:

SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
COMMIT;

Note that if there is already a transaction in progress, Rails does not wrap the INSERT or UPDATE queries in transactions of its own – it relies on the existing transaction to make any associated life-cycle callbacks atomic.

BEGIN;
...
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
...
COMMIT;

We can see why this doesn’t work: we’re treating the settings table as a Location – not an individual record in the table, but the table itself. We read from the table to check if any records with a certain name exist, and then based on the result of that we write to the table to add an additional row, which always succeeds. Since this operation can be characterised by the Location type, it is inherently prone to race conditions. In this case, it means two tasks can both run the first SELECT, both see that no matching records exist, and therefore both INSERT one.

Task 1                          | Task 2
--------------------------------+--------------------------------
SELECT * FROM settings ...      |
    --> no results              |
                                | SELECT * FROM settings ...
                                |     --> no results
BEGIN                           |
                                | BEGIN
INSERT INTO settings ...        |
                                | INSERT INTO settings ...
COMMIT                          |
                                | COMMIT

Putting the INSERT in a transaction here does not give us any protection from the database – these INSERT queries are both valid, and don’t conflict with each other since the database doesn’t know name needs to be unique. So all the above queries succeed, and two rows are inserted with the same name.

This race condition is an example of write skew: two transactions read from the same set of records, and then both write something that would be valid on its own, but invalidates a decision made by the other transaction. In this case, both transactions decide to create a record with name = 'timezone' because they believe that none exists, but when both are committed then our requirement that name is unique has been violated. It’s a race condition because if these tasks were executed sequentially, this mistake wouldn’t happen: the later task would see the record inserted by the earlier one, and decide to do nothing.

A lost update can be viewed as a special case of write skew in which both writes go to the same location, and so one overwrites the other. In general, write skews involve both writes being kept and together creating an invalid state. When we examined lost updates with the counters example, we saw that in PostgreSQL the repeatable read isolation level can detect lost updates and reject the transactions that cause them. Some write skew scenarios can be detected this way, but not all, because in general the writes affect different records.
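As a reminder of how that detection works, the counter increment from part 2 can be wrapped in a repeatable read transaction. This is a minimal sketch, assuming a Counter model with name and value columns like the one used in that article:

# If another transaction commits an update to the same row first, PostgreSQL
# rejects this one with ActiveRecord::SerializationFailure rather than
# letting it silently overwrite the concurrent increment.
Counter.transaction(isolation: :repeatable_read) do
  counter = Counter.find_by(name: "visits")
  counter.update(value: counter.value + 1)
end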

Nor can this particular scenario be fixed by row locking. For example, let’s try using the lock method inside a transaction and look at the generated SQL:

class Setting
  def self.set_value(name, value)
    transaction do
      setting = lock.find_or_create_by(name: name)
      setting.update(value: value)
    end
  end
end

As expected, this generates a FOR UPDATE clause on the initial SELECT. However, it does not prevent the race from happening, because FOR UPDATE locks the records that happen to match the WHERE clause right now, it does not prevent new records matching the clause being created, existing records being deleted, or existing records being modified to match.

BEGIN;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1 FOR UPDATE;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;

That is, it locks specific records, rather than the abstract set of { all records where name = 'timezone' }. Therefore, two transactions can both run the INSERT query and create a duplicate record in the process.

The same approach is used in the validates_uniqueness_of macro, and its own documentation will tell you it’s not safe under concurrent requests, as we’ve just seen. It recommends instead that a unique index is added to the database, which will cause it to reject any INSERT or UPDATE that creates duplicate values on the given columns. This sounds promising; as we’ve seen, simply making write() fallible makes it possible to avoid some consistency bugs.
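For reference, the macro is declared like this – shown only for illustration, it’s not part of our running example:

class Setting
  # Before saving, this runs a SELECT to check whether another row already
  # has the same name - the same read-then-write pattern shown above.
  validates_uniqueness_of :name
end

Adding the recommended unique index to our settings schema looks like this: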

ActiveRecord::Schema.define do
  create_table :settings, force: true do |t|
    t.string :name
    t.string :value

    t.index [:name], unique: true
  end
end

We can then reverse the order of operations: first try an INSERT, and if that fails, detect the RecordNotUnique exception and UPDATE the existing record. This exception type was added in Rails 3.0, released in 2010. Unfortunately, documentation for using it correctly is scant, and it’s easily misused. Here’s a first attempt:

class Setting
  def self.set_value(name, value)
    create(name: name, value: value)
  rescue ActiveRecord::RecordNotUnique
    where(name: name).update_all(value: value)
  end
end

When a matching record does not exist, then we perform the INSERT with all the required values, and we’re done:

BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1');
COMMIT;

If a matching record does exist, the INSERT fails. The transaction that was started inside the Rails create method is rolled back, and then we run an UPDATE to set the desired value on the existing record.

BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';

This will often work fine, although the control flow is a little awkward outside of trivial cases because of the exception handling. The deeper problem is that this doesn’t work if a transaction is already in progress. Most databases don’t support nested transactions, and as we saw above Rails won’t initiate a transaction to create or update something if there’s a transaction already open on the current connection. Because we catch the RecordNotUnique exception, and create does not start and manage its own transaction, no ROLLBACK happens:

BEGIN; -- existing transaction started outside of Setting.set_value
...
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
...

Because the INSERT was rejected, PostgreSQL now considers this transaction to have failed: it hasn’t been rolled back yet, but any further queries inside it will be rejected until it is. But, we don’t want to ROLLBACK, as that would undo whatever changes were already done before the INSERT happened – we just want to recover from the failed INSERT, not throw the whole transaction away.

On MySQL, the effect is even worse. Because of the way it locks index records during INSERT and UPDATE operations, two transactions executing the above queries can deadlock each other. The failed INSERT does not render the transaction unusable, and when a duplicate-key error happens, INSERT acquires a shared lock on the index record for the given key. As the transaction is still viable, we can run an UPDATE, which tries to upgrade the index lock to an exclusive lock. If both transactions have executed the INSERT and hold a shared lock, neither can acquire the exclusive lock until the other commits or aborts.

This is problematic because your code will usually appear to work fine during development, unless you actively check for concurrency bugs, but as soon as you put it in production you’ll get errors from users deadlocking the database. By eagerly invalidating the transaction after the failed INSERT, PostgreSQL forces you to write something that will work predictably in production. An important element of ergonomic design in software systems is the decision to eagerly report errors for things that aren’t guaranteed to work, rather than letting the code work sometimes and only revealing in production that it can fail. In this case, PostgreSQL protects you from making this mistake, but as we’ve seen there are other cases where it won’t unless you explicitly ask it to. For example, by running transactions in the read committed level by default, it does not prevent lost updates or write skew.

To handle rolling back the failed INSERT, PostgreSQL, MySQL and other databases provide savepoints. These let you set markers within a transaction and roll back to the state at a given point, without discarding the entire transaction. Rails has long had an API for savepoints, but it remains undocumented and is not widely known. We can use it in our example by calling create_savepoint before the INSERT operation, and then rollback_to_savepoint if it fails:

class Setting
  def self.set_value(name, value)
    sp_name = "create_setting_#{name}"

    transaction do
      connection.create_savepoint(sp_name)
      create(name: name, value: value)
    rescue ActiveRecord::RecordNotUnique
      connection.rollback_to_savepoint(sp_name)
      where(name: name).update_all(value: value)
    end
  end
end

The transaction call starts a new transaction if there is not one currently running; savepoints can only be used inside transactions, and wrapping things this way means set_value works whether or not it is called from inside another transaction. It generates the following queries when no matching records exist:

BEGIN;
SAVEPOINT create_setting_timezone;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1');
COMMIT;

When a matching record does exist, a ROLLBACK TO SAVEPOINT happens and we can continue to use the open transaction – this also prevents deadlocks on MySQL by releasing the lock acquired by the failed INSERT.

BEGIN;
SAVEPOINT create_setting_timezone;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK TO SAVEPOINT create_setting_timezone;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
COMMIT;

There is a more convenient way to use savepoints in Rails: the transaction method takes an option called requires_new. Databases don’t support nested transactions, but requires_new lets you pretend they do, at least as far as rolling back is concerned. Normally transaction only starts a transaction with BEGIN if there isn’t already one running. requires_new means that, if there’s already a transaction running, a savepoint will be set so that any exception causes a rollback to that savepoint, rather than undoing the whole transaction. It wraps the above workflow up so we can use the same code whether we’re already in a transaction or not.

class Setting
  def self.set_value(name, value)
    transaction(requires_new: true) do
      create(name: name, value: value)
    end
  rescue ActiveRecord::RecordNotUnique
    where(name: name).update_all(value: value)
  end
end

Note that our exception handler is outside the transaction block. To issue a ROLLBACK, transaction needs to know an exception has happened, and catching it inside the transaction block would prevent that. Instead, we let transaction roll the transaction back and then handle the exception ourselves. When there’s no transaction in progress, this generates the following queries:

BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';

If we’re inside an existing transaction, then an equivalent set of operations is done using SAVEPOINT, so that the whole transaction is not aborted and we can keep using it.

BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK TO SAVEPOINT active_record_1;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
...
COMMIT;

In Rails 6.0, released in 2019, a hybrid between this pattern and find_or_create_by was introduced, named create_or_find_by. It reverses the operations of find_or_create_by by attempting an INSERT first, and then doing a SELECT if this fails. The record returned by either of these operations can then be completed using an UPDATE.

class Setting
  def self.set_value(name, value)
    setting = create_or_find_by(name: name)
    setting.update(value: value)
  end
end

create_or_find_by deals with transactions correctly, starting a new one if needed and using savepoints if one is already in progress, so it prevents transactions failing on PostgreSQL or deadlocking on MySQL. When no matching record exists, we get an INSERT followed by an UPDATE:

BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
COMMIT;
BEGIN;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;

Or, if a transaction is already running:

BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
RELEASE SAVEPOINT active_record_1;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
...
COMMIT;

If a matching record exists, then the INSERT fails, so we ROLLBACK, load the existing record with SELECT and modify it using UPDATE:

BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
ROLLBACK;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
COMMIT;

If there is already a transaction running, then savepoints are used, but note that the INSERT, SELECT and UPDATE now happen in the same transaction. If some other query deletes the matching record after the INSERT fails, then under read committed the SELECT will return nothing. Under repeatable read, the SELECT will return the record that caused the INSERT to fail, even though it no longer exists, and the UPDATE will trigger a serialization failure, so this can still go wrong in certain scenarios.

BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
ROLLBACK TO SAVEPOINT active_record_1;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
...
COMMIT;

Although create_or_find_by is a welcome addition in terms of producing consistent writes, it is less efficient than our RecordNotUnique approach above. When the INSERT succeeds, it only fills in the name column and an additional UPDATE is needed to add the value. When the INSERT fails, an additional SELECT is done to load the existing record into memory so it can be updated. This is consistent with the general Rails approach of performing updates by loading a record with SELECT, modifying the in-memory copy, and then writing it back with UPDATE, but it has some notable disadvantages.

In addition to requiring extra queries, create_or_find_by shares a flaw with find_or_create_by: it creates partial records. The INSERT only fills in the field we’re searching on – the name – and leaves all other fields NULL. This NULL may never be observed, if create_or_find_by is run inside a transaction wrapping both the INSERT and UPDATE operations, and if none of your queries run in the read uncommitted isolation level. However, it still requires that your schema allow the other columns to be NULL, which may be undesirable even if it is the default for columns created by Rails.

Fortunately, PostgreSQL (since 9.5) and MySQL both offer a solution to all these problems: the upsert operation, short for update or insert. Rather than thinking about the exact change operations we want to make to the data, we can think instead in terms of the final state we want to produce, which is that exactly one row must exist in settings with the given name, and that row should have the given value. On PostgreSQL this is written as:

INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1')
    ON CONFLICT (name) DO UPDATE SET value=excluded.value
    RETURNING id;

On MySQL we do not have to specify which column the conflict occurs on, and the equivalent query is:

INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1')
    ON DUPLICATE KEY
    UPDATE name=VALUES(name), value=VALUES(value);

This is a single atomic query that achieves the desired state without having to read anything first, and does not require a transaction. The Rails wrapper for it is called upsert, also added in 6.0:

class Setting
  def self.set_value(name, value)
    upsert({ name: name, value: value }, unique_by: [:name])
  end
end

On PostgreSQL the unique_by option is needed to specify the columns for the ON CONFLICT clause, whereas on MySQL this option must not be used, and will cause an error if present, so this API is not portable across databases. However, it is the first API Rails has introduced that directly maps to the functionality built into databases for precisely this scenario.
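If you need to support both databases, one workaround is to branch on the connection adapter. This is a sketch of that idea, under the assumption that only PostgreSQL and MySQL need to be supported – the adapter_name check is our own choice, not something Rails prescribes:

class Setting
  def self.set_value(name, value)
    attrs = { name: name, value: value }

    if connection.adapter_name.match?(/postgres/i)
      # PostgreSQL needs the conflict target for its ON CONFLICT clause
      upsert(attrs, unique_by: [:name])
    else
      # MySQL infers the conflicting key and rejects the unique_by option
      upsert(attrs)
    end
  end
end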

Not all write skew errors can be prevented with a unique index. Such indexes can prevent duplicate values from appearing in a column, but they can’t express more complex constraints like “there must be exactly one article marked as featured for each author_id”. In general, preventing write skew requires either taking a FOR UPDATE lock on the existing records over which the constraint applies, or using an application-level advisory lock if the operation involves the set of matching records gaining or losing rows. For example, taking an advisory lock for the affected author_id would handle the aforementioned requirement.
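To make that concrete, here is a sketch of the advisory-lock idea on PostgreSQL. Article, author_id and featured are hypothetical names standing in for the requirement above, and pg_advisory_xact_lock is PostgreSQL-specific:

class Article < ActiveRecord::Base
  def self.set_featured(author_id, article_id)
    transaction do
      # Take a per-author lock so concurrent attempts to change this
      # author's featured article run one at a time; the lock is released
      # automatically at COMMIT or ROLLBACK.
      connection.execute(
        "SELECT pg_advisory_xact_lock(#{Integer(author_id)})"
      )
      where(author_id: author_id).update_all(featured: false)
      where(id: article_id).update_all(featured: true)
    end
  end
end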

The final option you have for preventing this type of bug is to run the operation in a transaction with the serializable isolation level. This means the database will run transactions so their result is consistent with running them in some sequential order; any operation that would lead to an ambiguous outcome due to the ordering of concurrent queries is rejected. That makes it safe to do this:

BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;

As we saw with find_or_create_by, this transaction is ambiguous depending on how two transactions interleave the queries: both could run SELECT, get no results, and so both run the INSERT, whereas if they were run sequentially then only one would run the INSERT. That ambiguity means running these transactions concurrently under serializable isolation is illegal, and one of them will be rejected.

This means we can take our find_or_create_by implementation, wrap it in a serializable transaction and retry it if it’s rejected, and it will do the right thing.

class Setting
  def self.set_value(name, value)
    transaction(isolation: :serializable) do
      setting = find_or_create_by(name: name)
      setting.update(value: value)
    end
  rescue ActiveRecord::SerializationFailure
    retry
  end
end

Serializable is the strictest isolation level offered by PostgreSQL, but that comes with a cost. Preventing these race conditions requires managing locks and delaying or rejecting some concurrent tasks, reducing the write throughput of the database. So for many applications, it may not be desirable, and one of the other techniques above should be considered instead. In particular, you should aim for the finest-grained locks that will give you the desired result, locking over particular records or values, rather than locking a whole table on every write.

Now, there are many other kinds of race conditions possible in web applications, but one thing that makes them easier to reason about is forms. One nice thing about forms in terms of relating user expectations to internal semantics is that they’re complete: when you submit a form, all its inputs and their values are sent to the server, not just the ones you’ve changed. You’re submitting a complete representation of the resource to the server, which can validate and accept it as such. If you submit a form on a website, and the site indicates this was successful, your expectation is that all the values you saw in the form are what was saved, not some combination of your changes and someone else’s.

However, it is still possible to end up with a hybrid state under concurrent requests, even an invalid state, because of a performance optimisation in how Rails updates records. An update action in a Rails controller typically does something like this, interspersed with access control, validation, generating the response and so on:

  def update
    thing = Thing.find(params[:id])
    thing.update(params[:thing])
  end

What happens here is that state from the database and the incoming POST request is merged, as follows. To begin with, the id URL parameter is used to SELECT a record. This makes a copy of the database record in memory, as a Thing instance. Then, the update method does three things. First, the attributes of this in-memory object are updated from the form parameters stored in params[:thing]. Second, the model’s validations are run against the in-memory attributes. Third, if validation succeeds, an UPDATE query is sent to the database to store the in-memory state.
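Spelled out, the controller action above is roughly equivalent to this sequence – a sketch of the behaviour, not the actual Rails implementation:

  def update
    thing = Thing.find(params[:id])           # SELECT ... WHERE id = ?
    thing.assign_attributes(params[:thing])   # merge the form data in memory
    thing.save                                # run validations, then UPDATE
  end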

As a performance optimisation, Rails does not include the complete object state in the UPDATE query. It only includes the fields that have changed, i.e. those that differ between the values obtained from the initial SELECT and those provided in the form data params[:thing]. This creates the possibility that some other process modifies the same record between your SELECT and UPDATE, altering fields you didn’t change, but in a way that would make the record invalid, even though each change is valid on its own – another form of write skew.
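You can see this dirty tracking from a console session – a quick sketch using the Setting model from earlier; the id in the generated SQL will vary:

setting = Setting.find_by(name: "timezone")
setting.value = "UTC+3"
setting.changed    #=> ["value"], the only column that will appear in the UPDATE
setting.save       # UPDATE settings SET value = 'UTC+3' WHERE id = ...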

Let’s say we have the following schema for storing events, which have a name, and two dates: starts_on and ends_on.

ActiveRecord::Schema.define do
  create_table :events, force: true do |t|
    t.string :name, null: false
    t.date :starts_on, null: false
    t.date :ends_on, null: false
  end
end

In the Event model that represents this table, we’ll implement a validation rule that the ends_on date must always be greater than or equal to the starts_on date – that is, an event does not finish before it starts.

class Event < ActiveRecord::Base
  validate do
    unless starts_on <= ends_on
      errors.add(:base, "Events must start before they end")
    end
  end
end

Our expectation here is that Rails should make sure that all objects written to the database are valid, because it runs the record’s validations as a precondition for sending an UPDATE query. We’ll run an experiment that tests this, by running the following scenario a hundred times. First, we call Event.create to create an event that begins tomorrow and ends in four days’ time, so it lasts for three days. We then start two threads; one changes the start date of the event and one changes the end date. After a hundred iterations, we print the total number of events, and the number that are valid according to the above logic.

100.times do |i|
  dates = { starts_on: 1.day.from_now, ends_on: 4.days.from_now }
  Event.create(name: "event-#{i}", **dates)

  threads = [
    Thread.new { move_start_date("event-#{i}") },
    Thread.new { move_end_date("event-#{i}") }
  ]

  threads.each(&:join)
end

p [:count, Event.count]
p [:valid, Event.count(&:valid?)]

Our initial implementations for move_start_date and move_end_date will be to load the relevant record using Event.find, and then call Event#update. move_start_date increments the starts_on date by two days, and move_end_date decrements the ends_on date by the same amount. Each of these changes is valid on its own, but applying both to the same record would violate the validation rule we defined above.

def move_start_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on + 2.days, ends_on: evt.ends_on)
end

def move_end_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
end

To simulate how update is usually called in web applications with forms, I am passing both starts_on and ends_on to the update call, even though only one attribute is being changed in each case. Forms always send all their inputs’ values, and those values are passed into the model, so I want to make it clear that the following bug is not a result of me passing a subset of the attributes to update.

These two functions are executed concurrently, and each one performs a SELECT to load the record with the given name, and then an UPDATE against the same record. Let’s say the initial insert is:

INSERT INTO events (name, starts_on, ends_on)
VALUES ('event-42', '2020-09-01', '2020-09-04');

Now, suppose move_start_date runs first. It runs a SELECT query and will get the Event in its initial state, with starts_on and ends_on three days apart. It will then modify starts_on so that this in-memory copy of the record has a duration of one day. Rails will validate this object, see that starts_on <= ends_on as required, and therefore send an UPDATE query to persist this state to the database. When move_end_date runs, its SELECT will see the Event either before or after the UPDATE from move_start_date has been committed. We’d then expect that one of the following happens:

  1. If the Event includes the change from move_start_date, then changing ends_on will result in an invalid object where starts_on > ends_on, and Rails should not save it to the database.
  2. If the Event is in its initial state, then changing its ends_on date will result in a valid object, and Rails will persist that object state to the database.

Either way, we expect the database record to remain valid and obey the rules encoded in the Event model. But, that’s not what happens, because Rails generates UPDATE queries that only include fields that have been changed in memory, relative to the state retrieved in the SELECT query. In case 2, that means it sends the following queries:

SELECT * FROM events WHERE name = 'event-42';
--> { id = 42, starts_on = 2020-09-01, ends_on = 2020-09-04 }

BEGIN;
UPDATE events SET ends_on = '2020-09-02' WHERE id = 42;
COMMIT;

Rails determines that the copy of the Event record in memory is valid, but it only persists part of that object to the database – the part that’s changed since the initial SELECT. This allows the following sequence of events to happen:

move_start_date                 | move_end_date
--------------------------------+--------------------------------
SELECT * FROM events ...        |
                                |
BEGIN                           |
                                | SELECT * FROM events ...
UPDATE events                   |
    SET starts_on = '2020-09-03'|
    WHERE id = 42               |
                                | BEGIN
COMMIT                          |
                                | UPDATE events
                                |     SET ends_on = '2020-09-02'
                                |     WHERE id = 42
                                |
                                | COMMIT

This is a form of write skew: two tasks read a record’s current state, and based on that they modify different fields of it. Each of these changes would be valid on its own, but they are not valid when combined, and the later one would not have been made if these tasks had executed sequentially, because it would have hit case 1 above and refused to save. When I run this experiment on PostgreSQL, 70 to 80 records out of 100 end up in an invalid state.

This bug is the result of an optimisation in Rails – even though web forms submit a complete representation of an object, and all that data is sent through to the model, the implementation of update means a record can end up with a mix of two requests’ submitted values, and no program or person has verified that the combined state is valid.

This race condition will happen if the SELECT in move_end_date happens at any time before the COMMIT in move_start_date, so that both tasks see the record’s initial state. In PostgreSQL, we can run these tasks inside repeatable read transactions, which means one of the UPDATEs will be rejected if the transactions are concurrent – PostgreSQL treats this as a lost update since both tasks modify the same record, and indeed it would be a lost update if one request simply overwrote the other. MySQL does not provide such protection in repeatable read, and in serializable mode these tasks will deadlock.
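On PostgreSQL, that protection can be requested from Rails by passing the isolation option to transaction. This is a sketch of one of the two functions; the rejected update surfaces as ActiveRecord::SerializationFailure:

def move_end_date(name)
  Event.transaction(isolation: :repeatable_read) do
    evt = Event.find_by(name: name)
    evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
  end
rescue ActiveRecord::SerializationFailure
  # one of the two concurrent updates is rejected; retry it or report it
end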

The best portable protection against this bug is to use compare-and-swap, for example by taking advantage of the built-in Rails behaviour when we add a lock_version to the table:

ActiveRecord::Schema.define do
  change_table :events do |t|
    t.integer :lock_version, null: false, default: 0
  end
end

We also need to catch the resulting StaleObjectError in our functions. I’m handling this error by doing nothing, just to prevent invalid records ending up in the database, but you’d probably want to ask the user to resolve the conflict here rather than silently losing their input.

def move_start_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on + 2.days, ends_on: evt.ends_on)
rescue ActiveRecord::StaleObjectError
end

def move_end_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
rescue ActiveRecord::StaleObjectError
end

Again, we’ve seen that simply making write() fallible, changing its type from write(T) to write(T) -> bool (or something more complex), can prevent race conditions that our web framework is introducing. “Fallible” can mean that the database outright rejects a write due to isolation rules, or it can report that the write affected no rows in a compare-and-swap, but either way it’s not simply allowing the write to unconditionally succeed as the empty return type of write(T) implies.
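Both forms of fallibility are visible to the calling code in Rails. Here is a quick sketch using the models from this article; the manual compare-and-swap via update_all is only for illustration, since the built-in lock_version behaviour above already does this for you:

begin
  Setting.create(name: "timezone", value: "UTC+1")
rescue ActiveRecord::RecordNotUnique
  # the unique index made the database reject the write outright
end

# update_all returns the number of rows affected, so a result of zero tells
# us the expected version was no longer current and the write did not happen.
rows = Event.where(id: 1, lock_version: 3)
            .update_all(ends_on: Date.new(2020, 9, 2), lock_version: 4)
conflict = rows.zero?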

In this article, we’ve added another category of race condition to our understanding. Whereas a lost update means one write replacing another, a write skew means two writes being persisted, where one of them would have been different – or not performed at all – if the tasks were executed sequentially, often leaving your data in an inconsistent state. Everything we’ve looked at so far is a single-request bug, which happens if two POST requests arrive at the same time. In the next instalment, we’ll extend this concept and look at multi-request scenarios where the Location type appears in the workflow of viewing and updating resources in applications.