This article is part of a five-part series on race conditions. The complete series is:
- Part 1: locations and locks
- Part 2: files and databases
- Part 3: web applications
- Part 4: user workflows
- Part 5: concurrency and language design
–
In part 1 of this series, we met the idea of the Location
type, a
resource that can be read from and written to without co-ordination, and which
therefore allows race conditions to happen. In part 2 we saw how this
pattern occurs when interacting with files and databases, and we learned many
ways that databases let us address its problems using pessimistic and
optimistic control techniques. In this article, we’ll put this in context by
looking at how read and write patterns in web applications can lead to race
conditions and inconsistent data.
To begin with, let’s look at probably the most common form of race condition in
such applications: uniqueness violations. It’s common for applications to
require certain fields, like a user’s email address, to be unique, and Rails has
historically not dealt with this requirement robustly. Feral Concurrency
Control gives a good account of this problem, including a reminder that, at
least in its early days, the design approach in Rails was that all logic should
reside in application code, not in the database. The database should be treated
as a simple data store for reading and writing records, and nothing more – that
is, it should be thought of as a Location.
I don’t want my database to be clever! … I consider stored procedures and constraints vile and reckless destroyers of coherence. No, Mr. Database, you can not have my business logic. Your procedural ambitions will bear no fruit and you’ll have to pry that logic from my dead, cold object-oriented hands… I want a single layer of cleverness: My domain model.
David Heinemeier Hansson
The problem with this assertion is that once you’re running more than one instance of your application – whether that’s using multiple threads, processes, or server machines – you no longer have a single “layer of cleverness”, you have multiple clients talking concurrently to your database. In that environment, many of the interfaces Rails provides for guaranteeing consistency aren’t effective unless they’re combined with database concurrency control.
I should emphasize that the problems I am about to demonstrate are not unique to Rails, they’re commonplace in web application logic across many platforms. Rails is simply a well-known example, with which I’m most familiar. It has also had an enormous influence on how web apps in general are written; tools like Rails are how many people have learned to develop applications, and so its lack of decent concurrency control contributes to many people never learning how to notice, diagnose or fix the bugs it leads to. Tools aren’t just a means of getting work done, they’re a means of understanding the work itself, and the ergonomic effect of a tool making it easy to write certain categories of bug has an impact beyond the use of that tool in particular.
Here’s an example of how the “Rails Way” for ensuring uniqueness can be easily
broken. Say we have a table that stores settings, that is configuration data,
and it should be treated as a key-value store. Many open-source applications
include such a table, so that end users can tweak the system’s functionality. In
Rails we would declare a schema for this table with two columns, name and
value, and a model class to represent the table.
ActiveRecord::Schema.define do
  create_table :settings, force: true do |t|
    t.string :name
    t.string :value
  end
end

class Setting < ActiveRecord::Base
end
Now, we’d like to design an API for assigning settings in this table, making
sure each name appears only once and thus corresponds to a unique value.
Here’s the experiment we’ll run to determine whether each implementation is
correct. We’ll start a set of threads, each of which will try to assign a
different value for the timezone setting. Once the threads have completed,
we’ll count the number of rows in the settings table.
threads = (1..4).map do |n|
  Thread.new do
    Setting.set_value("timezone", "UTC+#{n}")
  end
end

threads.each(&:join)
p [:count, Setting.count]
As we’ve only assigned to a single setting name, we expect only one row in the table. As I mentioned in a previous article, I am using JRuby here as it supports truly parallel thread execution.
Historically, by far the most common way to see this implemented is to use
find_or_create_by to check if a record exists with certain properties, and
create one if not. Having ensured the existence of a Setting with the right
name, we can assign a value to it.
class Setting
  def self.set_value(name, value)
    setting = find_or_create_by(name: name)
    setting.update(value: value)
  end
end
When I run this implementation, there are almost always four records in
settings, all with the same name, so this is clearly not workable. Let’s look
at the SQL this generates to understand what’s happening:
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
COMMIT;
BEGIN;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
It begins by doing a SELECT to see if a record with name = 'timezone' exists.
If this returns nothing, then it performs an INSERT for that name inside a
transaction; this will return the id of the new record. Given that id, it then
issues an UPDATE to set the value on the new record, in a separate transaction.
If the initial SELECT does return a record, then the middle transaction
containing the INSERT does not happen:
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
COMMIT;
Note if there is already a transaction in progress, then Rails does not insert
transactions around the INSERT
or UPDATE
queries – it relies on the
existing transaction to make any associated life-cycle callbacks atomic.
BEGIN;
...
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
...
COMMIT;
We can see why this doesn’t work: we’re treating the settings table as a
Location – not an individual record in the table, but the table itself. We read
from the table to check if any records with a certain name exist, and then
based on the result of that we write to the table to add an additional row,
which always succeeds. Since this operation can be characterised by the
Location type, it is inherently prone to race conditions. In this case, it
means two tasks can both run the first SELECT, both see that no matching
records exist, and therefore both INSERT one.
Task 1                          | Task 2
--------------------------------+--------------------------------
SELECT * FROM settings ...      |
--> no results                  |
                                | SELECT * FROM settings ...
                                | --> no results
BEGIN                           |
                                | BEGIN
INSERT INTO settings ...        |
                                | INSERT INTO settings ...
COMMIT                          |
                                | COMMIT
Putting the INSERT in a transaction here does not give us any protection from
the database – these INSERT queries are both valid, and don’t conflict with
each other since the database doesn’t know name needs to be unique. So, all the
above queries succeed and two INSERT queries are done with the same name.
This race condition is an example of write skew: two transactions read from
the same set of records, and then both write something that would be valid on
its own, but invalidates a decision made by the other transaction. In this
case, both transactions decide to create a record with name = 'timezone'
because they believe that none exists, but when both are committed then our
requirement that name is unique has been violated. It’s a race condition
because if these tasks were executed sequentially, this mistake wouldn’t
happen: the later task would see the record inserted by the earlier one, and
decide to do nothing.
A lost update can be viewed as a special case of write skew in which both
writes go to the same location, and so one overwrites the other. In general,
write skews involve both writes being kept and together creating an invalid
state. When we examined lost updates with the counters
example, we saw that in
PostgreSQL the repeatable read isolation level can detect lost updates and
reject the transactions that cause them. Some write skew scenarios can be
detected this way, but not all, because in general the writes affect different
records.
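For reference, this is roughly how you would ask Rails for that isolation
level, using the isolation option that transaction accepts – the same option
we’ll use with :serializable later on. It is only a sketch: repeatable read
catches two tasks updating the same settings row, but not the duplicate-INSERT
scenario we just saw, because those writes touch different rows.
# A sketch: running the read-then-write under repeatable read. PostgreSQL will
# abort one of two concurrent transactions that update the same row, but two
# transactions that each INSERT a new 'timezone' row don't conflict, so this
# alone does not enforce uniqueness.
Setting.transaction(isolation: :repeatable_read) do
  setting = Setting.find_or_create_by(name: "timezone")
  setting.update(value: "UTC+1")
end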
These write skews also can’t be addressed by locking. For example, let’s try
using the lock method inside a transaction and look at the generated SQL:
class Setting
  def self.set_value(name, value)
    transaction do
      setting = lock.find_or_create_by(name: name)
      setting.update(value: value)
    end
  end
end
As expected, this generates a FOR UPDATE clause on the initial SELECT.
However, it does not prevent the race from happening, because FOR UPDATE locks
the records that happen to match the WHERE clause right now; it does not
prevent new records matching the clause being created, existing records being
deleted, or existing records being modified to match.
BEGIN;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1 FOR UPDATE;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
That is, it locks specific records, rather than the abstract set of { all
records where name = 'timezone'
}. Therefore, two transactions can both run
the INSERT
query and create a duplicate record in the process.
The same approach is used in the validates_uniqueness_of macro, whose own
documentation will tell you it’s not safe under concurrent requests, as we’ve
just seen. It recommends instead that a unique index is added to the database,
which will cause it to reject any INSERT or UPDATE that creates duplicate
values in the given columns. This sounds promising; as we’ve seen, simply
making write() fallible makes it possible to avoid some consistency bugs.
ActiveRecord::Schema.define do
  create_table :settings, force: true do |t|
    t.string :name
    t.string :value
    t.index [:name], unique: true
  end
end
We can then reverse the order of operations: first try an INSERT, and if that
fails, detect the RecordNotUnique exception and UPDATE the existing record.
This exception type was added in Rails 3.0, released in 2010. Unfortunately,
documentation for using it correctly is scant, and it’s easily misused. Here’s a
first attempt:
class Setting
  def self.set_value(name, value)
    create(name: name, value: value)
  rescue ActiveRecord::RecordNotUnique
    where(name: name).update_all(value: value)
  end
end
When a matching record does not exist, then we perform the INSERT
with all the
required values, and we’re done:
BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1');
COMMIT;
If a matching record does exist, the INSERT fails. The transaction that was
started inside the Rails create method is rolled back, and then we run an
UPDATE to set the desired value on the existing record.
BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
This will often work fine, although the control flow is a little awkward
outside of trivial cases because of the exception handling. The deeper problem
is that this doesn’t work if a transaction is already in progress. Most
databases don’t support nesting transactions, and as we saw above Rails won’t
initiate a transaction to create or update something if there’s a transaction
already open on the current connection. Because we catch the RecordNotUnique
exception, and create does not start and manage its own transaction, no
ROLLBACK happens:
BEGIN; -- existing transaction started outside of Setting.set_value
...
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
...
Because the INSERT was rejected, PostgreSQL now considers this transaction
failed; it hasn’t been rolled back, but running further queries in it is now an
error. But we don’t want to ROLLBACK, as that would undo whatever changes were
already made before the INSERT happened – we just want to recover from the
failed INSERT, not throw the whole transaction away.
On MySQL, the effect is even worse. Because of the way it locks index records
during INSERT and UPDATE operations, two transactions executing the above
queries can deadlock each other. The failed INSERT does not render the
transaction unusable, and when a duplicate-key error happens, INSERT acquires a
shared lock on the index record for the given key. As the transaction is still
viable, we can run an UPDATE, which tries to upgrade the index lock to an
exclusive lock. If both transactions have executed the INSERT and hold a shared
lock, neither can acquire the exclusive lock until the other commits or aborts.
This is problematic because your code will usually appear to work fine during
development, unless you actively check for concurrency bugs, but as soon as you
put it in production you’ll get errors from users deadlocking the database. By
eagerly invalidating the transaction after the failed INSERT, PostgreSQL forces
you to write something that will work predictably in production. An important
element of ergonomic design in software systems is the decision to report
errors eagerly for things that aren’t guaranteed to work, rather than letting
the code succeed sometimes so that you reach production before discovering it
can fail. In this case, PostgreSQL protects you from making this mistake, but
as we’ve seen there are other cases where it won’t unless you explicitly ask it
to. For example, by running transactions at the read committed level by
default, it does not prevent lost updates or write skew.
To handle rolling back the failed INSERT, PostgreSQL, MySQL and other databases
provide savepoints. These let you set markers within a transaction and roll
back to the state at a given point, without discarding the entire transaction.
Rails has long had an API for savepoints, but it remains undocumented and is
not widely known. We can use it in our example by calling create_savepoint
before the INSERT operation, and then rollback_to_savepoint if it fails:
class Setting
  def self.set_value(name, value)
    sp_name = "create_setting_#{name}"
    transaction do
      connection.create_savepoint(sp_name)
      create(name: name, value: value)
    rescue ActiveRecord::RecordNotUnique
      connection.rollback_to_savepoint(sp_name)
      where(name: name).update_all(value: value)
    end
  end
end
The transaction call starts a new transaction if there is not one currently
running; savepoints can only be used inside transactions, and this construction
is safe even if we call set_value from within another transaction. It generates
the following queries when no matching records exist:
BEGIN;
SAVEPOINT create_setting_timezone;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1');
COMMIT;
When a matching record does exist, a ROLLBACK TO SAVEPOINT happens and we can
continue to use the open transaction – this also prevents deadlocks on MySQL by
releasing the lock acquired by the failed INSERT.
BEGIN;
SAVEPOINT create_setting_timezone;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK TO SAVEPOINT create_setting_timezone;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
COMMIT;
There is a more convenient way to use savepoints in Rails: the transaction
method takes an option called requires_new. Databases don’t support nested
transactions, but requires_new lets you pretend they do, at least as far as
rolling back is concerned. Normally transaction only starts a transaction with
BEGIN if there isn’t already one running. requires_new means that, if there’s
already a transaction running, a savepoint will be set so that any exception
causes a rollback to that savepoint, rather than undoing the whole transaction.
It wraps the above workflow up so we can use the same code whether we’re
already in a transaction or not.
class Setting
  def self.set_value(name, value)
    transaction(requires_new: true) do
      create(name: name, value: value)
    end
  rescue ActiveRecord::RecordNotUnique
    where(name: name).update_all(value: value)
  end
end
Note that our exception handler is outside the transaction block. To issue a
ROLLBACK, transaction needs to know an exception has happened, and catching it
inside the transaction block would prevent that. Instead, we let transaction
roll the transaction back and then handle the exception ourselves.
When there’s no transaction in progress, this generates the following queries:
BEGIN;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
If we’re inside an existing transaction, then an equivalent set of operations is
done using SAVEPOINT
, so that the whole transaction is not aborted and we can
keep using it.
BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+2');
ROLLBACK TO SAVEPOINT active_record_1;
UPDATE settings SET value = 'UTC+2' WHERE name = 'timezone';
...
COMMIT;
In Rails 6.0, released in 2019, a hybrid between this pattern and
find_or_create_by was introduced, named create_or_find_by. It reverses the
operations of find_or_create_by by attempting an INSERT first, and then doing a
SELECT if this fails. The record returned by either of these operations can
then be completed using an UPDATE.
class Setting
  def self.set_value(name, value)
    setting = create_or_find_by(name: name)
    setting.update(value: value)
  end
end
create_or_find_by deals with transactions correctly, starting a new one if
needed and using savepoints if one is already in progress, so it prevents
transactions failing on PostgreSQL or deadlocking on MySQL. When no matching
record exists, we get an INSERT followed by an UPDATE:
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
COMMIT;
BEGIN;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
Or, if a transaction is already running:
BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
RELEASE SAVEPOINT active_record_1;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
...
COMMIT;
If a matching record exists, then the INSERT fails, so we ROLLBACK, load the
existing record with SELECT and modify it using UPDATE:
BEGIN;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
ROLLBACK;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
BEGIN;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
COMMIT;
If there is already a transaction running, then savepoints are used, but note
that the INSERT, SELECT and UPDATE now happen in the same transaction. If some
other query deletes the matching record after the INSERT fails, then under read
committed the SELECT will return nothing. Under repeatable read, the SELECT
will return the record that caused the INSERT to fail, even though it no longer
exists, and the UPDATE will trigger a serialization failure, so this can still
go wrong in certain scenarios.
BEGIN;
...
SAVEPOINT active_record_1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
ROLLBACK TO SAVEPOINT active_record_1;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
UPDATE settings SET value = 'UTC+2' WHERE id = 1;
...
COMMIT;
Although create_or_find_by is a welcome addition in terms of producing
consistent writes, it is less efficient than our RecordNotUnique approach
above. When the INSERT succeeds, it only fills in the name column, and an
additional UPDATE is needed to add the value. When the INSERT fails, an
additional SELECT is done to load the existing record into memory so it can be
updated. This is consistent with the general Rails approach of performing
updates by loading a record with SELECT, modifying the in-memory copy, and then
writing it back with UPDATE, but it has some notable disadvantages.
In addition to requiring extra queries, create_or_find_by shares a flaw with
find_or_create_by: it creates partial records. The INSERT only fills in the
field we’re searching on – the name – and leaves all other fields NULL. This
NULL may never be observed, if create_or_find_by is run inside a transaction
wrapping both the INSERT and UPDATE operations, and if none of your queries run
in the read uncommitted isolation level. However, it still requires that your
schema allow the other columns to be NULL, which may be undesirable even if it
is the default for columns created by Rails.
Fortunately, PostgreSQL (since 9.5) and MySQL both offer a solution to all
these problems: the upsert operation, short for update or insert. Rather than
thinking about the exact change operations we want to make to the data, we can
think instead in terms of the final state we want to produce, which is that
exactly one row must exist in settings with the given name, and that row should
have the given value. On PostgreSQL this is written as:
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1')
ON CONFLICT (name) DO UPDATE SET value=excluded.value
RETURNING id;
On MySQL we do not have to specify which column the conflict occurs on, and the equivalent query is:
INSERT INTO settings (name, value) VALUES ('timezone', 'UTC+1')
ON DUPLICATE KEY
UPDATE name=VALUES(name), value=VALUES(value);
This is a single atomic query that achieves the desired state without having to
read anything first, and does not require a transaction. The Rails wrapper for
it is called upsert, also added in 6.0:
class Setting
  def self.set_value(name, value)
    upsert({ name: name, value: value }, unique_by: [:name])
  end
end
On PostgreSQL the unique_by
option is needed to specify the columns for the
ON CONFLICT
clause, whereas on MySQL this option must not be used, and will
cause an error if present, so this API is not portable across databases.
However, it is the first API Rails has introduced that directly maps to the
functionality built into databases for precisely this scenario.
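If you need the same model code to run against either database, one option is
to branch on the adapter. Here’s a rough sketch, assuming PostgreSQL and MySQL
are the only adapters in play – the adapter check is a workaround of my own,
not part of the upsert API:
class Setting
  def self.set_value(name, value)
    # Only PostgreSQL needs the unique_by option here; MySQL rejects it, so we
    # omit it on that adapter and let the unique index drive conflict handling.
    if connection.adapter_name.match?(/postgresql/i)
      upsert({ name: name, value: value }, unique_by: [:name])
    else
      upsert({ name: name, value: value })
    end
  end
end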
Not all write skew errors can be prevented with a unique index. Such indexes
can prevent duplicate values from appearing in a column, but they can’t express
more complex constraints like “there must be exactly one article marked as
featured for each author_id”. In general, preventing write skew requires either
taking a FOR UPDATE lock on the existing records over which the constraint
applies, or using an application-level advisory lock if the operation involves
the set of matching records gaining or losing rows. For example, taking an
advisory lock for the affected author_id would handle the aforementioned
requirement.
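Here’s a sketch of what that might look like on PostgreSQL, assuming a featured
boolean column on articles; the set_featured method and the lock key scheme are
illustrative rather than a standard API:
class Article < ActiveRecord::Base
  def self.set_featured(author_id, article_id)
    transaction do
      # pg_advisory_xact_lock serialises concurrent calls for the same author
      # and is released automatically when the transaction ends. It only helps
      # if every code path that touches the featured flag takes the same lock.
      connection.execute(
        "SELECT pg_advisory_xact_lock(#{connection.quote(author_id)})"
      )
      where(author_id: author_id).update_all(featured: false)
      where(id: article_id).update_all(featured: true)
    end
  end
end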
The final option you have for preventing this type of bug is to run the operation in a transaction with the serializable isolation level. This means the database will run transactions so their result is consistent with running them in some sequential order; any operation that would lead to an ambiguous outcome due to the ordering of concurrent queries is rejected. That makes it safe to do this:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM settings WHERE name = 'timezone' LIMIT 1;
INSERT INTO settings (name) VALUES ('timezone') RETURNING id;
UPDATE settings SET value = 'UTC+1' WHERE id = 1;
COMMIT;
As we saw with find_or_create_by, this transaction is ambiguous depending on
how two transactions interleave the queries: both could run SELECT, get no
results, and so both run the INSERT, whereas if they were run sequentially then
only one would run the INSERT. That ambiguity means running these transactions
concurrently under serializable isolation is illegal, and one of them will be
rejected.
This means we can take our find_or_create_by
implementation, wrap it in a
serializable transaction and retry it if it’s rejected, and it will do the right
thing.
class Setting
  def self.set_value(name, value)
    transaction(isolation: :serializable) do
      setting = find_or_create_by(name: name)
      setting.update(value: value)
    end
  rescue ActiveRecord::SerializationFailure
    retry
  end
end
Serializable is the strictest isolation level offered by PostgreSQL, but that comes with a cost. Preventing these race conditions requires managing locks and delaying or rejecting some concurrent tasks, reducing the write throughput of the database. So for many applications, it may not be desirable, and one of the other techniques above should be considered instead. In particular, you should aim for the finest-grained locks that will give you the desired result, locking over particular records or values, rather than locking a whole table on every write.
Now, there are many other kinds of race conditions possible in web applications, but one thing that makes them easier to reason about is forms. One nice thing about forms in terms of relating user expectations to internal semantics is that they’re complete: when you submit a form, all its inputs and their values are sent to the server, not just the ones you’ve changed. You’re submitting a complete representation of the resource to the server, which can validate and accept it as such. If you submit a form on a website, and the site indicates this was successful, your expectation is that all the values you saw in the form are what was saved, not some combination of your changes and someone else’s.
However, it is still possible to end up with a hybrid state under concurrent
requests, even an invalid state, because of a performance optimisation in how
Rails updates records. An update
action in a Rails controller typically does
something like this, interspersed with access control, validation, generating
the response and so on:
def update
  thing = Thing.find(params[:id])
  thing.update(params[:thing])
end
What happens here is that state from the database and the incoming POST
request is merged, as follows. First the id
URL parameter is used to SELECT
a record. This makes a copy of the database record in memory, as a Thing
instance. Then, the update method does three things. First, the attributes of
this in-memory object are updated from the form parameters stored in
params[:thing]. Second, the model’s validations are run against the in-memory
attributes. Third, if validation succeeds, an UPDATE query is sent to the
database to persist the in-memory state.
As a performance optimisation, Rails does not include the complete object state
in the UPDATE query. It only includes the fields that have changed, i.e. those
that differ between the values obtained from the initial SELECT and those
provided in the form data params[:thing]. This creates the possibility that
some other process modifies the same record between your SELECT and UPDATE,
altering fields you didn’t change in a way that, combined with your change,
leaves the record invalid, even though each change is valid on its own –
another form of write skew.
Let’s say we have the following schema for storing events, which have a name,
and two dates: starts_on and ends_on.
ActiveRecord::Schema.define do
  create_table :events, force: true do |t|
    t.string :name, null: false
    t.date :starts_on, null: false
    t.date :ends_on, null: false
  end
end
In the Event model that represents this table, we’ll implement a validation
rule that the ends_on date must always be greater than or equal to the
starts_on date; that is, an event does not finish before it starts.
class Event < ActiveRecord::Base
  validate do
    unless starts_on <= ends_on
      errors.add(:base, "Events must start before they end")
    end
  end
end
Our expectation here is that Rails should make sure that all objects written to
the database are valid, because it runs the record’s validations as a
precondition for sending an UPDATE
query. We’ll run an experiment that tests
this, by running the following scenario a hundred times. First, we call
Event.create
to create an event that begins tomorrow and ends in four days’
time, so it lasts for three days. We then start two threads; one changes the
start date of the event and one changes the end date. After a hundred
iterations, we print the total number of events, and the number that are valid
according to the above logic.
100.times do |i|
  dates = { starts_on: 1.day.from_now, ends_on: 4.days.from_now }
  Event.create(name: "event-#{i}", **dates)
  threads = [
    Thread.new { move_start_date("event-#{i}") },
    Thread.new { move_end_date("event-#{i}") }
  ]
  threads.each(&:join)
end

p [:count, Event.count]
p [:valid, Event.count(&:valid?)]
Our initial implementations for move_start_date and move_end_date will be to
load the relevant record using Event.find_by, and then call Event#update.
move_start_date
increments the starts_on
date by two days, and
move_end_date
decrements the ends_on
date by the same amount. Each of these
changes is valid on its own, but applying both to the same record would violate
the validation rule we defined above.
def move_start_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on + 2.days, ends_on: evt.ends_on)
end

def move_end_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
end
To simulate how update
is usually called in web applications with forms, I am
passing both starts_on
and ends_on
to the update
call, even though only
one attribute is being changed in each case. Forms always send all their inputs’
values, and those values are passed into the model, so I want to make it clear
that the following bug is not a result of me passing a subset of the attributes
to update
.
These two functions are executed concurrently, and each one performs a SELECT
to load the record with the given name
, and then an UPDATE
against the same
record. Let’s say the initial insert is:
INSERT INTO events (name, starts_on, ends_on)
VALUES ('event-42', '2020-09-01', '2020-09-04');
Now, suppose move_start_date runs first. It runs a SELECT query and will get
the Event in its initial state, with starts_on and ends_on three days apart. It
will then modify starts_on so that this in-memory copy of the record has a
duration of one day. Rails will validate this object, see that starts_on <=
ends_on as required, and therefore send an UPDATE query to persist this state
to the database. When move_end_date runs, its SELECT will see the Event either
before or after the UPDATE from move_start_date has been committed. We’d then
expect that one of the following happens:
- If the Event includes the change from move_start_date, then changing ends_on
  will result in an invalid object where starts_on > ends_on, and Rails should
  not save it to the database.
- If the Event is in its initial state, then changing its ends_on date will
  result in a valid object, and Rails will persist that object state to the
  database.
Either way, we expect the database record to remain valid and obey the rules
encoded in the Event
model. But, that’s not what happens, because Rails
generates UPDATE
queries that only include fields that have been changed in
memory, relative to the state retrieved in the SELECT
query. In case 2, that
means it sends the following queries:
SELECT * FROM events WHERE name = 'event-42';
--> { id = 42, starts_on = 2020-09-01, ends_on = 2020-09-04 }
BEGIN;
UPDATE events SET ends_on = '2020-09-02' WHERE id = 42;
COMMIT;
Rails determines that the copy of the Event
record in memory is valid, but it
only persists part of that object to the database – the part that’s changed
since the initial SELECT
. This allows the following sequence of events to
happen:
move_start_date                 | move_end_date
--------------------------------+--------------------------------
SELECT * FROM events ...        |
                                |
BEGIN                           |
                                | SELECT * FROM events ...
UPDATE events                   |
  SET starts_on = 2020-09-03    |
  WHERE id = 1                  |
                                | BEGIN
COMMIT                          |
                                | UPDATE events
                                |   SET ends_on = 2020-09-02
                                |   WHERE id = 1
                                |
                                | COMMIT
This is a form of write skew: two tasks read a record’s current state, and based on that they modify different fields of it. Each of these changes would be valid on its own, but they are not valid when combined, and one of them would not have been made if these tasks executed sequentially and all executions followed case 1 above. When I run this experiment on PostgreSQL, 70 to 80 records out of 100 end up in an invalid state.
This bug is the result of an optimisation in Rails – even though web forms
submit a complete representation of an object, and all that data is sent through
to the model, the implementation of update
means a record can end up with a
mix of two requests’ submitted values for the record, and no program or person
has verified that the combined state is valid.
This race condition will happen if the SELECT
in move_end_date
happens at
any time before the COMMIT
in move_start_date
, so that both tasks see the
record’s initial state. In PostgreSQL, we can run these tasks inside repeatable
read transactions, which means one of the UPDATES
will be rejected if the
transactions are concurrent – PostgreSQL treats this as a lost update since
both tasks modify the same record, and indeed it would be a lost update if one
request simply overwrote the other. MySQL does not provide such protection in
repeatable read, and in serializable mode these tasks will deadlock.
The best portable protection against this bug is to use compare-and-swap, for
example by taking advantage of the built-in Rails behaviour when we add a
lock_version
to the table:
ActiveRecord::Schema.define do
  change_table :events do |t|
    t.integer :lock_version, null: false
  end
end
We also need to catch the resulting StaleObjectError in our functions. I’m
handling this error by doing nothing, just to prevent invalid records ending up
in the database, but you’d probably want to ask the user to resolve the
conflict here rather than silently losing their input.
def move_start_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on + 2.days, ends_on: evt.ends_on)
rescue ActiveRecord::StaleObjectError
end

def move_end_date(name)
  evt = Event.find_by(name: name)
  evt.update(starts_on: evt.starts_on, ends_on: evt.ends_on - 2.days)
rescue ActiveRecord::StaleObjectError
end
Again, we’ve seen that simply making write() fallible, changing its type from
write(T) to write(T) -> bool (or something more complex), can prevent race
conditions that our web framework is introducing. “Fallible” can mean that the
database outright rejects a write due to isolation rules, or it can report that
the write affected no rows in a compare-and-swap, but either way it’s not
simply allowing the write to unconditionally succeed as the empty return type
of write(T) implies.
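To make that signature concrete, here’s a minimal hand-rolled compare-and-swap
sketch in the style of the counters example, using the Setting model from
earlier; the compare_and_set name is purely illustrative:
class Setting
  # update_all returns the number of rows it changed, so this write is
  # effectively write(T) -> bool: false means another task changed the value
  # since we read it, and the caller should re-read and retry.
  def self.compare_and_set(name, expected, new_value)
    where(name: name, value: expected).update_all(value: new_value) == 1
  end
end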
In this article, we’ve added another category of race condition to our
understanding. Whereas a lost update means one write replacing another, a
write skew means two writes being persisted, where one of them would have been
different – or not performed at all – if the tasks were executed sequentially,
often leaving your data in an inconsistent state. Everything we’ve looked at so
far is a single-request bug, which happens if two POST
requests arrive at
the same time. In the next instalment, we’ll extend this concept and look at
multi-request scenarios where the Location
type appears in the workflow of
viewing and updating resources in applications.