This article is part of a five-part series on race conditions. The complete series is:
- Part 1: locations and locks
- Part 2: files and databases
- Part 3: web applications
- Part 4: user workflows
- Part 5: concurrency and language design
In part 1 of this series, we met the idea of the Location type, a
resource that can be read from and written to without co-ordination, and which
therefore allows race conditions to happen. In part 2 we saw how this
pattern occurs when interacting with files and databases, and we learned many
ways that databases let us address its problems using pessimistic and
optimistic concurrency control techniques. In this article, we’ll put this in
context by
looking at how read and write patterns in web applications can lead to race
conditions and inconsistent data.
To begin with, let’s look at probably the most common form of race condition in
such applications: uniqueness violations. It’s common for applications to
require certain fields, like a user’s email address, to be unique, and Rails has
historically not dealt with this requirement robustly. Feral Concurrency
Control gives a good account of this problem, including a reminder that, at
least in its early days, the design approach in Rails was that all logic should
reside in application code, not in the database. The database should be treated
as a simple data store for reading and writing records, and nothing more – that
is, it should be thought of as a Location.
I don’t want my database to be clever! … I consider stored procedures and constraints vile and reckless destroyers of coherence. No, Mr. Database, you can not have my business logic. Your procedural ambitions will bear no fruit and you’ll have to pry that logic from my dead, cold object-oriented hands… I want a single layer of cleverness: My domain model.
David Heinemeier Hansson
The problem with this assertion is that once you’re running more than one instance of your application – whether that’s using multiple threads, processes, or server machines – you no longer have a single “layer of cleverness”, you have multiple clients talking concurrently to your database. In that environment, many of the interfaces Rails provides for guaranteeing consistency aren’t effective unless they’re combined with database concurrency control.
I should emphasize that the problems I am about to demonstrate are not unique to Rails, they’re commonplace in web application logic across many platforms. Rails is simply a well-known example, with which I’m most familiar. It has also had an enormous influence on how web apps in general are written; tools like Rails are how many people have learned to develop applications, and so its lack of decent concurrency control contributes to many people never learning how to notice, diagnose or fix the bugs it leads to. Tools aren’t just a means of getting work done, they’re a means of understanding the work itself, and the ergonomic effect of a tool making it easy to write certain categories of bug has an impact beyond the use of that tool in particular.
Here’s an example of how the “Rails Way” for ensuring uniqueness can be easily
broken. Say we have a table that stores settings – that is, configuration data –
which should be treated as a key-value store. Many open-source applications
include such a table, so that end users can tweak the system’s functionality. In
Rails we would declare a schema for this table with two columns, name and
value, and a model class to represent the table.
ActiveRecord::Schema.define do
  create_table :settings, force: true do |t|
    t.string :name
    t.string :value
  end
end
class Setting < ActiveRecord::Base
end
Now, we’d like to design an API for assigning settings in this table, making
sure each name appears only once and thus corresponds to a unique value.
Here’s the experiment we’ll run to determine whether each implementation is
correct. We’ll start a set of threads, each of which will try to assign a
different value for the timezone setting. Once the threads have completed,
we’ll count the number of rows in the settings table.
threads = (1..4).map do |n|
  Thread.new do
    Setting.set_value("timezone", "UTC+#{n}")
  end
end
threads.each(&:join)
p [:count, Setting.count]
As we’ve only assigned to a single setting name, we expect only one row in the table. As I mentioned in a previous article, I am using JRuby here as it supports truly parallel thread execution.
Historically, by far the most common way to see this implemented is to use
find_or_create_by to check if a record exists with certain properties, and
create one if not. Having ensured the existence of a Setting with the right
name, we can assign a value to it.
class Setting
  def self.set_value(name, value)
    setting = find_or_create_by(name: name)
    setting.update(value: value)
  end
end
When I run this implementation, there are almost always four records in
settings, all with the same name, so this is clearly not workable. Let’s
look at the SQL this generates to understand what’s happening:
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
COMMIT;
BEGIN;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
It begins by doing a SELECT to see if a record with name = 'timezone'
exists. If this returns nothing, then it performs an INSERT for that name
inside a transaction; this will return the id of the new record. Given that
id, it then issues an UPDATE to set the value on the new record, in a
separate transaction. If the initial SELECT does return a record, then the
middle transaction containing the INSERT does not happen:
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
COMMIT;
Note that if there is already a transaction in progress, Rails does not wrap
the INSERT or UPDATE queries in transactions of their own – it relies on the
existing transaction to make any associated life-cycle callbacks atomic.
BEGIN;
...
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
...
COMMIT;
We can see why this doesn’t work: we’re treating the settings table as a
Location – not an individual record in the table, but the table itself. We
read from the table to check if any records with a certain name exist, and
then based on the result of that we write to the table to add an additional row,
which always succeeds. Since this operation can be characterised by the
Location type, it is inherently prone to race conditions. In this case, it
means two tasks can both run the first SELECT, both see that no matching
records exist, and therefore both INSERT one.
Task 1                          | Task 2
--------------------------------+--------------------------------
SELECT * FROM settings ...      |
    --> no results              |
                                | SELECT * FROM settings ...
                                |     --> no results
BEGIN                           |
                                | BEGIN
INSERT INTO settings ...        |
                                | INSERT INTO settings ...
COMMIT                          |
                                | COMMIT
Putting the INSERT in a transaction here does not give us any protection from
the database – these INSERT queries are both valid, and don’t conflict with
each other since the database doesn’t know name needs to be unique. So, all
the above queries succeed and two INSERT queries are done with the same
name.
This race condition is an example of write skew: two transactions read from
the same set of records, and then both write something that would be valid on
its own, but invalidates a decision made by the other transaction. In this case,
both transactions decide to create a record with name = 'timezone' because
they believe that none exists, but when both are committed then our requirement
that name is unique has been violated. It’s a race condition because if these
tasks were executed sequentially, this mistake wouldn’t happen: the later task
would see the record inserted by the earlier one, and decide to do nothing.
A lost update can be viewed as a special case of write skew in which both
writes go to the same location, and so one overwrites the other. In general,
write skews involve both writes being kept and together creating an invalid
state. When we examined lost updates with the counters example, we saw that in
PostgreSQL the repeatable read isolation level can detect lost updates and
reject the transactions that cause them. Some write skew scenarios can be
detected this way, but not all, because in general the writes affect different
records.
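As a reminder of what that looks like for the lost-update case – using the
counters table from the previous article, with illustrative values – repeatable
read in PostgreSQL rejects the second of two concurrent updates to the same
row, rather than letting it silently overwrite the first:
Task 1                          | Task 2
--------------------------------+--------------------------------
BEGIN ISOLATION LEVEL           |
    REPEATABLE READ             |
                                | BEGIN ISOLATION LEVEL
                                |     REPEATABLE READ
SELECT value FROM counters      |
    WHERE id = 1                |
                                | SELECT value FROM counters
                                |     WHERE id = 1
UPDATE counters SET value = 2   |
    WHERE id = 1                |
COMMIT                          |
                                | UPDATE counters SET value = 2
                                |     WHERE id = 1
                                |     --> ERROR: could not serialize
                                |         access due to concurrent update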
Write skews like our uniqueness race also can’t be addressed by row locking.
For example, let’s try using the lock method inside a transaction and look at
the generated SQL:
class Setting
  def self.set_value(name, value)
    transaction do
      setting = lock.find_or_create_by(name: name)
      setting.update(value: value)
    end
  end
end
As expected, this generates a FOR UPDATE clause on the initial SELECT.
However, it does not prevent the race from happening, because FOR UPDATE locks
the records that happen to match the WHERE clause right now; it does not
prevent new records matching the clause from being created, existing records
from being deleted, or existing records from being modified so that they match.
BEGIN;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1 FOR UPDATE;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
That is, it locks specific records, rather than the abstract set of { all
records where name = 'timezone' }. Therefore, two transactions can both run
the INSERT query and create a duplicate record in the process.
The same approach is used in the validates_uniqueness_of macro, and its
own documentation will tell you it’s not safe under concurrent requests, as
we’ve just seen. The documentation recommends instead that a unique index is
added to the database, which will cause it to reject any INSERT or UPDATE
that creates duplicate values in the given columns. This sounds promising; as
we’ve seen, simply making write() fallible makes it possible to avoid some
consistency bugs.
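For reference, the macro-based approach looks something like this in the model
– a minimal sketch, and one that issues a SELECT before saving, so it is
vulnerable to exactly the race shown above:
class Setting < ActiveRecord::Base
  # Checks for an existing row with a SELECT before saving; two concurrent
  # saves can both pass that check and both INSERT a row
  validates_uniqueness_of :name
end
Here is the schema again with the recommended unique index added: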
ActiveRecord::Schema.define do
  create_table :settings, force: true do |t|
    t.string :name
    t.string :value
    t.index [:name], unique: true
  end
end
We can then reverse the order of operations: first try an INSERT, and if that
fails, detect the RecordNotUnique exception and UPDATE the existing record.
This exception type was added in Rails 3.0, released in 2010. Unfortunately,
documentation for using it correctly is scant, and it’s easily misused. Here’s a
first attempt:
class Setting
  def self.set_value(name, value)
    create(name: name, value: value)
  rescue ActiveRecord::RecordNotUnique
    where(name: name).update_all(value: value)
  end
end
When a matching record does not exist, then we perform the INSERT with all the
required values, and we’re done:
BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1');
COMMIT;
If a matching record does exist, the INSERT fails. The transaction that was
started inside the Rails create method is rolled back, and then we run an
UPDATE to set the desired value on the existing record.
BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
This will often work fine, although the control flow is a little awkward outside
of trivial cases because of the exception handling. The deeper problem is that
this doesn’t work if a transaction is already in progress. Most databases don’t
support nesting transactions, and as we saw above Rails won’t initiate a
transaction to create or update something if there’s a transaction already open
on the current connection. Because we catch the RecordNotUnique exception, and
create does not start and manage its own transaction, no ROLLBACK happens:
BEGIN; -- existing transaction started outside of Setting.set_value
...
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
...
Because the INSERT was rejected, PostgreSQL now considers this transaction
failed; it has not been rolled back, but running any further queries in it is
now an error.
But, we don’t want to ROLLBACK, as that would undo whatever changes were
already done before the INSERT happened – we just want to recover from the
failed INSERT, not throw the whole transaction away.
On MySQL, the effect is even worse. Because of the way it locks index
records during INSERT and UPDATE operations, two transactions executing
the above queries can deadlock each other. The failed INSERT does not
render the transaction unusable, and when a duplicate-key error happens,
INSERT acquires a shared lock on the index record for the given key. As the
transaction is still viable, we can run an UPDATE, which tries to upgrade the
index lock to an exclusive lock. If both transactions have executed the INSERT
and hold a shared lock, neither can acquire the exclusive lock until the other
commits or aborts.
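Sketching that interleaving:
Task 1                          | Task 2
--------------------------------+--------------------------------
INSERT INTO settings ...        |
    --> duplicate key, takes    |
        shared lock on index    |
                                | INSERT INTO settings ...
                                |     --> duplicate key, takes
                                |         shared lock on index
UPDATE settings ...             |
    --> waits for exclusive     |
        lock, blocked by Task 2 |
                                | UPDATE settings ...
                                |     --> waits for exclusive lock,
                                |         blocked by Task 1: deadlock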
This is problematic because your code will usually appear to work fine during
development, unless you actively check for concurrency bugs, but as soon as you
put it in production you’ll get errors from users deadlocking the database. By
eagerly invalidating the transaction after the failed INSERT, PostgreSQL
forces you to write something that will work predictably in production. An
important element of ergonomic design in software systems is the decision to
eagerly report errors for things that aren’t guaranteed to work, rather than
letting code that only works some of the time reach production before you find
out it can fail. In this case, PostgreSQL protects you from making this
mistake, but as we’ve seen there are other cases where it won’t unless you
explicitly ask it to. For example, by running transactions in the read
committed level by default, it does not prevent lost updates or write skew.
To handle rolling back the failed INSERT, PostgreSQL, MySQL and other
databases provide savepoints. These let you set markers within a
transaction and roll back to the state at a given point, without discarding the
entire transaction. Rails has long had an API for savepoints, but it remains
undocumented and is not widely known. We can use it in our example by calling
create_savepoint before the INSERT operation, and then
rollback_to_savepoint if it fails:
class Setting
  def self.set_value(name, value)
    sp_name = "create_setting_#{name}"
    transaction do
      connection.create_savepoint(sp_name)
      create(name: name, value: value)
    rescue ActiveRecord::RecordNotUnique
      connection.rollback_to_savepoint(sp_name)
      where(name: name).update_all(value: value)
    end
  end
end
The transaction call starts a new transaction if there is not one currently
running; savepoints can only be used inside transactions, and this construction
is safe if we call set_value from within another transaction. It generates the
following queries when no matching records exist:
BEGIN;
SAVEPOINT create_setting_timezone;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1');
COMMIT;
When a matching record does exist, a ROLLBACK TO SAVEPOINT happens and we can
continue to use the open transaction – this also prevents deadlocks on MySQL by
releasing the lock acquired by the failed INSERT.
BEGIN;
SAVEPOINT create_setting_timezone;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK TO SAVEPOINT create_setting_timezone;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
COMMIT;
There is a more convenient way to use savepoints in Rails: the
transaction method takes an option called requires_new. Databases don’t
support nested transactions, but requires_new lets you pretend they do, at
least as far as rolling back is concerned. Normally transaction only starts a
transaction with BEGIN if there isn’t already one running. requires_new
means that, if there’s already a transaction running, a savepoint will be set so
that any exception causes a rollback to that savepoint, rather than undoing the
whole transaction. It wraps the above workflow up so we can use the same code
whether we’re already in a transaction or not.
class Setting
  def self.set_value(name, value)
    transaction(requires_new: true) do
      create(name: name, value: value)
    end
  rescue ActiveRecord::RecordNotUnique
    where(name: name).update_all(value: value)
  end
end
Note that our exception handler is outside the transaction block. To issue a
ROLLBACK, transaction needs to know an exception has happened, and catching
it inside the transaction block would prevent that. Instead, we let
transaction roll the transaction back and then handle the exception ourselves.
When there’s no transaction in progress, this generates the following queries:
BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
If we’re inside an existing transaction, then an equivalent set of operations is
done using SAVEPOINT, so that the whole transaction is not aborted and we can
keep using it.
BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK TO SAVEPOINT active_record_1;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
...
COMMIT;
In Rails 6.0, released in 2019, a hybrid between this pattern and
find_or_create_by was introduced, named create_or_find_by. It reverses the
operations of find_or_create_by by attempting an INSERT first, and then
doing a SELECT if this fails. The record returned by either of these
operations can then have its remaining attributes set using an UPDATE.
class Setting
  def self.set_value(name, value)
    setting = create_or_find_by(name: name)
    setting.update(value: value)
  end
end
create_or_find_by deals with transactions correctly, starting a new one if
needed and using savepoints if one is already in progress, so it prevents
transactions failing on PostgreSQL or deadlocking on MySQL. When no matching
record exists, we get an INSERT followed by an UPDATE:
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
COMMIT;
BEGIN;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
Or, if a transaction is already running:
BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
RELEASE SAVEPOINT active_record_1;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
...
COMMIT;
If a matching record exists, then the INSERT fails, so we ROLLBACK, load the
existing record with SELECT and modify it using UPDATE:
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
ROLLBACK;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
COMMIT;
If there is already a transaction running, then savepoints are used, but note
that the INSERT, SELECT and UPDATE now happen in the same transaction. If
some other query deletes the matching record after the INSERT fails, then
under read committed the SELECT will return nothing. Under repeatable
read, the SELECT will return the record that caused the INSERT to fail,
even though it no longer exists, and the UPDATE will trigger a serialization
failure, so this can still go wrong in certain scenarios.
BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
ROLLBACK TO SAVEPOINT active_record_1;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
...
COMMIT;
Although create_or_find_by is a welcome addition in terms of producing
consistent writes, it is less efficient than our RecordNotUnique approach
above. When the INSERT succeeds, it only fills in the name column and an
additional UPDATE is needed to add the value. When the INSERT fails, an
additional SELECT is done to load the existing record into memory so it can be
updated. This is consistent with the general Rails approach of performing
updates by loading a record with SELECT, modifying the in-memory copy, and
then writing it back with UPDATE, but it has some notable disadvantages.
In addition to requiring extra queries, create_or_find_by shares a flaw with
find_or_create_by: it creates partial records. The INSERT only fills in the
field we’re searching on – the name – and leaves all other fields NULL.
This NULL may never be observed, if create_or_find_by is run inside a
transaction wrapping both the INSERT and UPDATE operations, and if none of
your queries run in the read uncommitted isolation level. However, it still
requires that your schema allow the other columns to be NULL, which may be
undesirable even if it is the default for columns created by Rails.
Fortunately, PostgreSQL (since 9.5) and MySQL both offer a solution to all these
problems: the upsert operation, short for update or insert. Rather than
thinking about the exact change operations we want to make to the data, we can
think instead in terms of the final state we want to produce, which is that
exactly one row must exist in settings with the given name, and that row
should have the given value. On PostgreSQL this is written as:
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1')
    ON CONFLICT (name) DO UPDATE SET value=excluded.value
    RETURNING id;
On MySQL we do not have to specify which column the conflict occurs on, and the equivalent query is:
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1')
    ON DUPLICATE KEY
    UPDATE name=VALUES(name), value=VALUES(value);
This is a single atomic query that achieves the desired state without having to
read anything first, and does not require a transaction. The Rails wrapper for
it is called upsert, also added in 6.0:
class Setting
  def self.set_value(name, value)
    upsert({ name: name, value: value }, unique_by: [:name])
  end
end
On PostgreSQL the unique_by option is needed to specify the columns for the
ON CONFLICT clause, whereas on MySQL this option must not be used, and will
cause an error if present, so this API is not portable across databases.
However, it is the first API Rails has introduced that directly maps to the
functionality built into databases for precisely this scenario.
Not all write skew errors can be prevented with a unique index. Such indexes can
prevent duplicate values from appearing in a column, but they can’t express more
complex constraints like “there must be exactly one article marked as
featured for each author_id”. In general, preventing write skew requires
either taking a FOR UPDATE lock on the existing records over which the
constraint applies, or using an application-level advisory lock if the operation
involves the set of matching records gaining or losing rows. For example, taking
an advisory lock for the affected author_id would handle the aforementioned
requirement.
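Here’s a minimal sketch of that idea, assuming PostgreSQL and a hypothetical
Article model with an author_id and a boolean featured column. A
transaction-scoped advisory lock keyed on the author serialises concurrent
writes for that author, while leaving other authors unaffected:
class Article < ActiveRecord::Base
  def self.set_featured(article)
    transaction do
      # pg_advisory_xact_lock blocks until the lock for this key is free,
      # and releases it automatically when the transaction ends
      connection.execute(
        "SELECT pg_advisory_xact_lock(#{article.author_id.to_i})"
      )
      where(author_id: article.author_id).update_all(featured: false)
      article.update(featured: true)
    end
  end
end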
The final option you have for preventing this type of bug is to run the operation in a transaction with the serializable isolation level. This means the database will run transactions so their result is consistent with running them in some sequential order; any operation that would lead to an ambiguous outcome due to the ordering of concurrent queries is rejected. That makes it safe to do this:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
As we saw with find_or_create_by, this transaction is ambiguous depending on
how two transactions interleave the queries: both could run SELECT, get no
results, and so both run the INSERT, whereas if they were run sequentially
then only one would run the INSERT. That ambiguity means running these
transactions concurrently under serializable isolation is illegal, and one of
them will be rejected.
This means we can take our find_or_create_by implementation, wrap it in a
serializable transaction and retry it if it’s rejected, and it will do the right
thing.
class Setting
  def self.set_value(name, value)
    transaction(isolation: :serializable) do
      setting = find_or_create_by(name: name)
      setting.update(value: value)
    end
  rescue ActiveRecord::SerializationFailure
    retry
  end
end
Serializable is the strictest isolation level offered by PostgreSQL, but that comes with a cost. Preventing these race conditions requires managing locks and delaying or rejecting some concurrent tasks, reducing the write throughput of the database. So for many applications, it may not be desirable, and one of the other techniques above should be considered instead. In particular, you should aim for the finest-grained locks that will give you the desired result, locking over particular records or values, rather than locking a whole table on every write.
Now, there are many other kinds of race conditions possible in web applications, but one thing that makes them easier to reason about is forms. One nice thing about forms in terms of relating user expectations to internal semantics is that they’re complete: when you submit a form, all its inputs and their values are sent to the server, not just the ones you’ve changed. You’re submitting a complete representation of the resource to the server, which can validate and accept it as such. If you submit a form on a website, and the site indicates this was successful, your expectation is that all the values you saw in the form are what was saved, not some combination of your changes and someone else’s.
However, it is still possible to end up with a hybrid state under concurrent
requests, even an invalid state, because of a performance optimisation in how
Rails updates records. An update action in a Rails controller typically does
something like this, interspersed with access control, validation, generating
the response and so on:
  def update
    thing = Thing.find(params[:id])
    thing.update(params[:thing])
  end
What happens here is that state from the database and the incoming POST
request is merged, as follows. First the id URL parameter is used to SELECT
a record. This makes a copy of the database record in memory, as a Thing
instance. Then, the update method does three things. First, the attributes of
this in-memory object are updated from the form parameters stored in
params[:thing]. Second, the model’s validations are run against the in-memory
attributes. If this is successful, then an UPDATE query is sent to persist
the in-memory state to the database.
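Roughly speaking – this is a sketch of the behaviour, not the actual
ActiveRecord implementation – the update call is equivalent to:
thing.assign_attributes(params[:thing])  # merge the form params into the in-memory copy
thing.save  # run the validations, and send an UPDATE only if they pass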
As a performance optimisation, Rails does not include the complete object state
in the UPDATE query. It only includes the fields that have changed, i.e. those
that differ between the values obtained from the initial SELECT and those
provided in the form data params[:thing]. This creates the possibility that
some other process modifies the same record between your SELECT and UPDATE,
altering fields you didn’t change, but in a way that would make the record
invalid, even though each change is valid on its own – another form of write
skew.
Let’s say we have the following schema for storing events, which have a name,
and two dates: starts_on and ends_on.
ActiveRecord::Schema.define do
  create_table :events, force: true do |t|
    t.string :name, null: false
    t.date :starts_on, null: false
    t.date :ends_on, null: false
  end
end
In the Event model that represents this table, we’ll implement a validation
rule that the ends_on date must always be greater than or equal to the
starts_on date; that is, an event does not finish before it starts.
class Event < ActiveRecord::Base
  validate do
    unless starts_on <= ends_on
      errors.add(:base, "Events must start before they end")
    end
  end
end
Our expectation here is that Rails should make sure that all objects written to
the database are valid, because it runs the record’s validations as a
precondition for sending an UPDATE query. We’ll run an experiment that tests
this, by running the following scenario a hundred times. First, we call
Event.create to create an event that begins tomorrow and ends in four days’
time, so it lasts for three days. We then start two threads; one changes the
start date of the event and one changes the end date. After a hundred
iterations, we print the total number of events, and the number that are valid
according to the above logic.
100.times do |i|
  dates = { starts_on: 1.day.from_now, ends_on: 4.days.from_now }
  Event.create(name: "event-#{i}", **dates)
  threads = [
    Thread.new { move_start_date("event-#{i}") },
    Thread.new { move_end_date("event-#{i}") }
  ]
  threads.each(&:join)
end
p [:count, Event.count]
p [:valid, Event.count(&:valid?)]
Our initial implementations for move_start_date and move_end_date will be to
load the relevant record using Event.find, and then call Event#update.
move_start_date increments the starts_on date by two days, and
move_end_date decrements the ends_on date by the same amount. Each of these
changes is valid on its own, but applying both to the same record would violate
the validation rule we defined above.
def move_start_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on + 2.days, ends_on: evt.ends_on)
end
def move_end_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
end
To simulate how update is usually called in web applications with forms, I am
passing both starts_on and ends_on to the update call, even though only
one attribute is being changed in each case. Forms always send all their inputs’
values, and those values are passed into the model, so I want to make it clear
that the following bug is not a result of me passing a subset of the attributes
to update.
These two functions are executed concurrently, and each one performs a SELECT
to load the record with the given name, and then an UPDATE against the same
record. Let’s say the initial insert is:
INSERT INTO events (name, starts_on, ends_on)
VALUES ('event-42', '2020-09-01', '2020-09-04');
Now, suppose move_start_date runs first. It runs a SELECT query and will get
the Event in its initial state, with starts_on and ends_on three days
apart. It will then modify starts_on so that this in-memory copy of the record
has a duration of one day. Rails will validate this object, see that starts_on
<= ends_on as required, and therefore send an UPDATE query to persist this
state to the database. When move_end_date runs, its SELECT will see the
Event either before or after the UPDATE from move_start_date has been
committed. We’d then expect that one of the following happens:
- If the Event includes the change from move_start_date, then changing ends_on will result in an invalid object where starts_on > ends_on, and Rails should not save it to the database.
- If the Event is in its initial state, then changing its ends_on date will result in a valid object, and Rails will persist that object state to the database.
Either way, we expect the database record to remain valid and obey the rules
encoded in the Event model. But, that’s not what happens, because Rails
generates UPDATE queries that only include fields that have been changed in
memory, relative to the state retrieved in the SELECT query. In case 2, that
means it sends the following queries:
SELECT * FROM events WHERE name = 'event-42';
--> { id = 42, starts_on = 2020-09-01, ends_on = 2020-09-04 }
BEGIN;
UPDATE events SET ends_on = '2020-09-02' WHERE id = 42;
COMMIT;
Rails determines that the copy of the Event record in memory is valid, but it
only persists part of that object to the database – the part that’s changed
since the initial SELECT. This allows the following sequence of events to
happen:
move_start_date                 | move_end_date
--------------------------------+--------------------------------
SELECT * FROM events ...        |
                                |
BEGIN                           |
                                | SELECT * FROM events ...
UPDATE events                   |
    SET starts_on = 2020-09-03  |
    WHERE id = 42               |
                                | BEGIN
COMMIT                          |
                                | UPDATE events
                                |     SET ends_on = 2020-09-02
                                |     WHERE id = 42
                                |
                                | COMMIT
This is a form of write skew: two tasks read a record’s current state, and based on that they modify different fields of it. Each of these changes would be valid on its own, but they are not valid when combined, and one of them would not have been made if these tasks executed sequentially – the later task would have seen the earlier change, produced an invalid object, and declined to save it, as in case 1 above. When I run this experiment on PostgreSQL, 70 to 80 records out of 100 end up in an invalid state.
This bug is the result of an optimisation in Rails – even though web forms
submit a complete representation of an object, and all that data is sent through
to the model, the implementation of update means a record can end up with a
mix of two requests’ submitted values for the record, and no program or person
has verified that the combined state is valid.
This race condition will happen if the SELECT in move_end_date happens at
any time before the COMMIT in move_start_date, so that both tasks see the
record’s initial state. In PostgreSQL, we can run these tasks inside repeatable
read transactions, which means one of the UPDATEs will be rejected if the
transactions are concurrent – PostgreSQL treats this as a lost update since
both tasks modify the same record, and indeed it would be a lost update if one
request simply overwrote the other. MySQL does not provide such protection in
repeatable read, and in serializable mode these tasks will deadlock.
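A sketch of that PostgreSQL-specific approach, wrapping each task in a
repeatable read transaction; move_end_date is shown here, and move_start_date
would be changed in the same way. The serialization failure is handled by
simply giving up, though retrying or reporting the conflict would usually be
more useful:
def move_end_date(name)
  Event.transaction(isolation: :repeatable_read) do
    evt = Event.find_by(name: name)
    evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
  end
rescue ActiveRecord::SerializationFailure
  # one of the two concurrent tasks ends up here instead of committing a
  # conflicting change
end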
The best portable protection against this bug is to use compare-and-swap, for
example by taking advantage of the built-in Rails behaviour when we add a
lock_version column to the table:
ActiveRecord::Schema.define do
  change_table :events do |t|
    t.integer :lock_version, null: false
  end
end
We also need to catch the resulting StaleObjectError in our functions. I’m
handling this error by doing nothing, just to prevent invalid records ending up
in the database, but you’d probably want to ask the user to resolve the conflict
here rather than silently losing their input.
def move_start_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on + 2.days, ends_on: evt.ends_on)
rescue ActiveRecord::StaleObjectError
end
def move_end_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
rescue ActiveRecord::StaleObjectError
end
Again, we’ve seen that simply making write() fallible, changing its type from
write(T) to write(T) -> bool (or something more complex), can prevent race
conditions that our web framework is introducing. “Fallible” can mean that the
database outright rejects a write due to isolation rules, or it can report that
the write affected no rows in a compare-and-swap, but either way it’s not simply
allowing the write to unconditionally succeed as the empty return type of
write(T) implies.
In this article, we’ve added another category of race condition to our
understanding. Whereas a lost update means one write replacing another, a
write skew means two writes being persisted, where one of them would have been
different – or not performed at all – if the tasks were executed sequentially,
often leaving your data in an inconsistent state. Everything we’ve looked at so
far is a single-request bug, which happens if two POST requests arrive at
the same time. In the next instalment, we’ll extend this concept and look at
multi-request scenarios where the Location type appears in the workflow of
viewing and updating resources in applications.