A little over a year ago, I wrote an essay called Callbacks are Imperative, Promises are Functional, in which I argued that promises are not a control-flow device but rather a tool to describe relationships between dependent values, such that the promise library can figure out the control flow – the correct and optimal order of operations – for you. I gave the example of a module loader, where by modelling a set of modules as a graph of lazy promises, we can write code that describes the dependencies between modules and automatically have them downloaded in the correct order and with the maximum possible parallelisation. This echoes a quote from the book Learn You a Haskell for Great Good:
In purely functional programming you don’t tell the computer what to do as such but rather you tell it what stuff is.
Writing the control flow for optimally downloading modules by hand is rather hard; better to invent a system where you can describe the shape of your domain and have the system solve the problem for you.
A few months ago, I wrote a tutorial on Make, having spent a year with it while writing JavaScript Testing Recipes. Though I didn’t realise it when writing my promises essay, there is a deep similarity between Make and my promise-based module loader: in both systems we work by defining dependencies among a set of artefacts, and then the platform figures out which things must be run, and in what order, when you ask for one of those artefacts. Make can even parallelise the tasks it runs when it spots that two tasks are independent and can therefore be run in any order – if it doesn’t matter which order you perform two tasks in, you can probably run them in parallel.
This week I was working on some Redis code, and I realised that this concept of solving problems by relating values and deriving the control flow from those relationships had given me an insight into solving a problem I’d been having with performing atomic changes to a Redis database. I don’t use any of the above concepts directly in the code, but I found that they gave me a new way to think about the problem and suddenly realise a solution. Let me try to explain.
Shortly before writing that promises article, I released reStore, a remoteStorage server written in Node with both filesystem- and Redis-based backends. remoteStorage is a protocol for storing arbitrary documents via HTTP; each user on a server has a directory tree and they can grant applications access to parts of that tree for the application to store user data. The application uses GET to retrieve documents by pathname, and receives the document’s content, plus its media type in a Content-Type header and its version in an ETag. It can also request folder listings by sending GET to a pathname ending in a trailing slash, which returns a list of the folder’s children (documents and other folders) with their current versions.

The app can save new documents using PUT, setting their type using Content-Type, and remove them using DELETE. Both these methods allow use of the If-Match header: the application can pass in the current version it has for a document, and the server will process the request only if that version is indeed current. Otherwise, the app makes a GET to retrieve the latest version, reapplies its changes and tries to save again; this prevents applications from clobbering one another’s changes.
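To make this concrete, here’s a sketch of the kind of exchange involved. The path, version value and URL layout here are hypothetical – the details vary between servers and spec revisions – but the shape is this: the app sends its last known version with the write, and the server refuses if that version is no longer current.

PUT /storage/jcoglan/books/title.txt HTTP/1.1
Content-Type: text/plain
If-Match: 1402430750408

JavaScript Testing Recipes

If the document’s version is still 1402430750408, the server applies the write and returns the new version; otherwise it responds with 412 Precondition Failed, and the app must GET the latest version, merge its changes, and retry.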
Now, these semantics lend themselves naturally to using the filesystem as a backend, and indeed reStore has such a backend. But it also has a Redis backend, and I’ve recently been doing some work to make sure that when it saves and deletes documents, all changes to the Redis database are done atomically.
Most of the complexity in doing this comes from the fact that remoteStorage is not quite as low-level as the Unix filesystem. When you save a document, all its parent folders are created implicitly, and if you delete the only remaining item from a folder then that folder is deleted. So folders exist only if they contain documents, or other folders. Clients never manipulate folders directly, they only deal with saving documents.
Furthermore, all folders have an ETag version just like documents do. A folder’s version must change if any document within its tree is updated or removed. This is to facilitate syncing: a client can easily tell whether its copy of a tree is up to date by comparing the version tag on the root folder, rather than checking every document within the tree.
So when a document is saved, we need to make sure that all its containing folders exist, and update all of their versions. (We could store only the documents and figure out the folders when a GET is made, but it’s cheaper to generate the tree on write than on read.) For ease of implementation, the version of a document in reStore is simply the Unix timestamp of the last time it was changed (the remoteStorage spec previously required versions to be timestamps, back when it used Last-Modified rather than ETag). The version of a folder is the latest timestamp that exists on any of its children.
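For example, with made-up timestamps, a tree containing two documents might carry these versions, each folder holding the latest timestamp found among its children:

/                        modified = 1402430750408
  books/                 modified = 1402430750408
    jstr/                modified = 1402430750408
      preface.txt        modified = 1402430750408
      intro.txt          modified = 1402430000000

Saving intro.txt again with a newer timestamp would propagate that timestamp up through jstr/, books/ and the root folder.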
Here’s a naive version of a backend store for the requirements we’ve laid out. Although reStore is written in Node, I’ve written these examples in Ruby since it’s far easier to see what’s actually happening without all those callbacks getting in the way. It has one public method, put(), which handles requests to save documents. It takes a username, document path, content, media type, and an optional version string. It begins by splitting the pathname into its segments, for example parse_path('/path/to/file.txt') returns ['/', 'path/', 'to/', 'file.txt']. It checks whether the given version is current – if no version is given, the request is always allowed, but if a version is given for a document that doesn’t exist the request is forbidden since the client has stale data – then proceeds to update the tree. It uses SADD to make sure each item is listed as a child of its parent folder, and HSET to set the modified time on each parent folder to the current time. Finally, it checks whether the target file exists (we need to tell the client if the document was created or updated), then saves its content, media type, size and modified time using HMSET.
class Store
  VersionConflict = Class.new(StandardError)

  def initialize(redis_client)
    @redis = redis_client
  end

  def put(username, pathname, content, type, version = nil)
    query    = parse_path(pathname)
    filename = query.pop
    prefix   = "users:#{username}:data:"
    modified = (Time.now.utc.to_f * 1000).floor

    is_current = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    query.each_with_index do |dir, i|
      key = prefix + query.slice(0, i + 1).join('')
      @redis.hset(key, 'modified', modified.to_s)
      @redis.sadd(key + ':children', query[i + 1] || filename)
    end

    exists = @redis.exists(prefix + pathname)

    @redis.hmset(prefix + pathname, :length,   content.size,
                                    :type,     type,
                                    :modified, modified,
                                    :content,  content)
    [!exists, modified]
  end

  private

  def parse_path(pathname)
    # '/books/jstr/preface.txt' -> ['/', 'books/', 'jstr/', 'preface.txt']
    pathname.scan(/[^\/]*(?:\/|$)/)[0..-2]
  end

  # The version check described above: a write with no version is always
  # allowed; a write with a version is allowed only if it matches the
  # stored document's modified time.
  def get_current_state(key, version)
    modified = @redis.hget(key, 'modified')
    return version.nil? if modified.nil?
    return true if version.nil?
    version.to_i == modified.to_i
  end
end
Here’s what this code does when we run the call:
store.put('jcoglan', '/books/jstr/preface.txt', 'Preface to JSTR', 'text/plain')
The Redis MONITOR command prints the following:
"hget" "users:jcoglan:data:/books/jstr/preface.txt" "modified"
"hset" "users:jcoglan:data:/" "modified" "1402430750408"
"sadd" "users:jcoglan:data:/:children" "books/"
"hset" "users:jcoglan:data:/books/" "modified" "1402430750408"
"sadd" "users:jcoglan:data:/books/:children" "jstr/"
"hset" "users:jcoglan:data:/books/jstr/" "modified" "1402430750408"
"sadd" "users:jcoglan:data:/books/jstr/:children" "preface.txt"
"exists" "users:jcoglan:data:/books/jstr/preface.txt"
"hmset" "users:jcoglan:data:/books/jstr/preface.txt" "length" "15" \
"type" "text/plain" "modified" "1402430750408" "content" "Preface to JSTR"
So we can see it checking the version of the target document, updating all the parent folders and finally the document itself. However, there are two glaring holes in this implementation:
- While we’re running these commands, another client might be doing work that becomes interleaved with ours. We might both check the current version, find that we’re up to date, and then both go ahead and update the document, clobbering one another’s changes. This is precisely the ‘interceding update’ problem that the version check is supposed to fix.
- If our server process crashes part-way through this set of updates, the tree is left in an invalid state. Folders are said to have children that do not actually exist, or have modification times that don’t reflect the true state of things.
To make the system robust, we need to solve both of these problems, and this will partly involve making better use of the tools Redis gives us, and also re-framing the problem slightly to make it fit better into Redis’s constraints. The former technique is certainly useful but it’s the latter that I am particularly interested in here.
The first problem is actually quite easy to solve: since we’re making changes to the user’s entire directory tree, conditional on the state of one document, we will lock the user’s data while this is happening. For each update to have a consistent view of the world, updates must be applied in sequence, and the lock will make sure only one executes at a time.
The documentation for SETNX provides a locking algorithm for us. You should read those docs to understand the algorithm; here’s an implementation in Ruby:
LOCK_TIMEOUT = 10_000

def lock(name)
  result = nil

  while result.nil?
    lock_key     = "locks:#{name}"
    current_time = (Time.now.utc.to_f * 1000).floor
    expiry       = current_time + LOCK_TIMEOUT

    # If no one holds the lock, claim it and run the critical section.
    if @redis.setnx(lock_key, expiry)
      next result = yield.tap { @redis.del(lock_key) }
    end

    # Someone else holds the lock; if it has not expired yet, wait and retry.
    timeout = @redis.get(lock_key)
    if timeout.nil? or current_time < timeout.to_i
      next sleep 0.1
    end

    # The lock has expired: try to claim it. GETSET tells us whether another
    # process got there first, in which case we wait and retry.
    old_value = @redis.getset(lock_key, expiry)
    if old_value == timeout.to_s
      result = yield.tap { @redis.del(lock_key) }
    else
      sleep 0.1
    end
  end

  result
end
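For illustration, a hypothetical call looks like this; the block runs only once we hold locks:jcoglan, and the lock is deleted when the block returns:

lock 'jcoglan' do
  # Reads and writes on this user's tree go here; no other process that
  # acquires the lock via this same protocol can interleave with them.
  @redis.hget('users:jcoglan:data:/', 'modified')
end

Note that this is an advisory lock: it only excludes other processes that use the same locking protocol, not arbitrary clients writing to the keys directly.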
Wrapping this mechanism around our PUT logic lets us make sure that nothing about the user’s tree will change between us checking the version and making our changes.
def put(username, pathname, content, type, version = nil)
  query    = parse_path(pathname)
  filename = query.pop
  prefix   = "users:#{username}:data:"

  lock username do
    modified = (Time.now.utc.to_f * 1000).floor

    is_current = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    query.each_with_index do |dir, i|
      key = prefix + query.slice(0, i + 1).join('')
      @redis.hset(key, 'modified', modified.to_s)
      @redis.sadd(key + ':children', query[i + 1] || filename)
    end

    exists = @redis.exists(prefix + pathname)

    @redis.hmset(prefix + pathname, :length,   content.size,
                                    :type,     type,
                                    :modified, modified,
                                    :content,  content)
    [!exists, modified]
  end
end
The second problem is more subtle. We want to make all these changes to the database – a combination of HSET, SADD and HMSET commands – happen as one, so Redis will execute them as a unit. Fortunately, Redis has transactions that let us do exactly this. But there’s one minor problem: a Redis transaction works by sending the MULTI command, then all the commands you want to run together, then sending EXEC. Only then will all the commands be executed; you cannot have a command that depends on the result of another command, because you don’t get any responses until after all the commands have run.
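Here’s a small sketch of what that means in the Ruby client used in these examples, assuming the redis-rb gem’s block form of MULTI: commands issued inside the block return placeholder Redis::Future objects, and their real values only become available after EXEC:

future = nil

@redis.multi do
  @redis.hset('users:jcoglan:data:/', 'modified', '1402430750408')
  # Queued, not executed: we get back a Redis::Future, not a reply.
  future = @redis.exists('users:jcoglan:data:/books/jstr/preface.txt')
end

future.value # only now does the future contain the EXISTS reply

So any command whose result we need in order to decide what else to send cannot live inside the transaction.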
In our code, there’s a pesky little EXISTS sitting between the writes to the parent folders and the document itself. This must happen before the final HMSET, since we need to know if the document existed before that write, and there’s a temptation to put it as close to the HMSET as possible to make sure no other changes happen between the two.
However, notice that none of the writes depend on the result of the EXISTS call. They do depend on the result of the HGET that checks the version, since they are conditional on it, but everything after that depends only on information from the original method inputs. Because of this lack of dependencies, the writes can actually be done in any order and we will get the same result. As long as it happens at some point before the final HMSET, and within the lock so we know no other changes have happened, the EXISTS call can happen as early in the process as we like. We can push it before the parent folder updates:
def put(username, pathname, content, type, version = nil)
  query    = parse_path(pathname)
  filename = query.pop
  prefix   = "users:#{username}:data:"

  lock username do
    modified = (Time.now.utc.to_f * 1000).floor

    is_current = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    exists = @redis.exists(prefix + pathname)

    query.each_with_index do |dir, i|
      key = prefix + query.slice(0, i + 1).join('')
      @redis.hset(key, 'modified', modified.to_s)
      @redis.sadd(key + ':children', query[i + 1] || filename)
    end

    @redis.hmset(prefix + pathname, :length,   content.size,
                                    :type,     type,
                                    :modified, modified,
                                    :content,  content)
    [!exists, modified]
  end
end
Or we can get rid of it entirely: the initial version check can tell us whether the document exists, since it will have no modified key if it’s absent. Let’s modify get_current_state() to return two values: a boolean for whether the version is current, and one for whether the document exists:
def get_current_state(key, version)
  modified = @redis.hget(key, 'modified')
  return [version.nil?, false] if modified.nil?
  return [true, true] if version.nil?
  [version.to_i == modified.to_i, true]
end
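To illustrate, assuming key holds a document whose modified field is 1402430750408 and new_key holds nothing, the method behaves like this:

get_current_state(key, nil)              #=> [true, true]   no version given: always allowed
get_current_state(key, '1402430750408')  #=> [true, true]   version is current: allowed
get_current_state(key, '1402430000000')  #=> [false, true]  stale version: conflict
get_current_state(new_key, nil)          #=> [true, false]  creating a new document
get_current_state(new_key, '12345')      #=> [false, false] version given for a missing document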
Now we can use this new value in our put() method:
def put(username, pathname, content, type, version = nil)
  query    = parse_path(pathname)
  filename = query.pop
  prefix   = "users:#{username}:data:"

  lock username do
    modified = (Time.now.utc.to_f * 1000).floor

    is_current, exists = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    query.each_with_index do |dir, i|
      key = prefix + query.slice(0, i + 1).join('')
      @redis.hset(key, 'modified', modified.to_s)
      @redis.sadd(key + ':children', query[i + 1] || filename)
    end

    @redis.hmset(prefix + pathname, :length,   content.bytes.size,
                                    :type,     type,
                                    :modified, modified,
                                    :content,  content)
    [!exists, modified]
  end
end
Now if we run our test again, we see this pattern of commands:
"setnx" "locks:jcoglan" "1402432911685"
"hget" "users:jcoglan:data:/books/jstr/preface.txt" "modified"
"hset" "users:jcoglan:data:/" "modified" "1402432901692"
"sadd" "users:jcoglan:data:/:children" "books/"
"hset" "users:jcoglan:data:/books/" "modified" "1402432901692"
"sadd" "users:jcoglan:data:/books/:children" "jstr/"
"hset" "users:jcoglan:data:/books/jstr/" "modified" "1402432901692"
"sadd" "users:jcoglan:data:/books/jstr/:children" "preface.txt"
"hmset" "users:jcoglan:data:/books/jstr/preface.txt" "length" "15" \
"type" "text/plain" "modified" "1402432901692" "content" "Preface to JSTR"
"del" "locks:jcoglan"
Two important things have happened here. First, all the commands have happened inside a locking block bounded by the SETNX and DEL commands. Second, and more importantly for making the writes atomic: notice that this operation is now a read (HGET) followed by a sequence of writes (HSET, SADD, HMSET). The fact that all the writes are uninterrupted by any reads, and do not depend on one another, means we can wrap them in a MULTI transaction:
def put(username, pathname, content, type, version = nil)
  query    = parse_path(pathname)
  filename = query.pop
  prefix   = "users:#{username}:data:"

  lock username do
    modified = (Time.now.utc.to_f * 1000).floor

    is_current, exists = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    @redis.multi do
      query.each_with_index do |dir, i|
        key = prefix + query.slice(0, i + 1).join('')
        @redis.hset(key, 'modified', modified.to_s)
        @redis.sadd(key + ':children', query[i + 1] || filename)
      end

      @redis.hmset(prefix + pathname, :length,   content.bytes.size,
                                      :type,     type,
                                      :modified, modified,
                                      :content,  content)
    end

    [!exists, modified]
  end
end
This leaves us with our final command sequence:
"setnx" "locks:jcoglan" "1402433277043"
"hget" "users:jcoglan:data:/books/jstr/preface.txt" "modified"
"multi"
"hset" "users:jcoglan:data:/" "modified" "1402433267046"
"sadd" "users:jcoglan:data:/:children" "books/"
"hset" "users:jcoglan:data:/books/" "modified" "1402433267046"
"sadd" "users:jcoglan:data:/books/:children" "jstr/"
"hset" "users:jcoglan:data:/books/jstr/" "modified" "1402433267046"
"sadd" "users:jcoglan:data:/books/jstr/:children" "preface.txt"
"hmset" "users:jcoglan:data:/books/jstr/preface.txt" "length" "15" \
"type" "text/plain" "modified" "1402433267046" "content" "Preface to JSTR"
"exec"
"del" "locks:jcoglan"
We have made all the changes made by the put() method into a single transaction. Redis will not run any of them until it receives EXEC, preventing any of them from running if the process crashes while sending them. Either they all run, or none does.
This was kind of a silly example though: technically, we could have left that EXISTS call in the middle of the transaction. None of the writes depended on its result, so we could have left it in before the final HMSET and received its result at the end of the process. It turned out the EXISTS call was redundant anyway, but this example demonstrates a key idea: analysing the dependencies between operations, flattening that graph so that things happen as early as possible, disinterleaving reads and writes, and noticing when the order of commands does and does not matter, is a key strategy for atomising Redis operations. We got lucky with PUT, but DELETE is far more tricky.
DELETE looks a little like PUT run in reverse. We delete the target document, and remove it as a child from its parent folder. Then we go back up the tree, removing and updating things. If a folder has no children, we delete it. If it does have children, we set its version to the latest version of any of its remaining children, then move up to the next level. Here’s a first stab at an implementation, where we’ve added another return value to get_current_state(): we need to tell the client whether the deleted document existed, and what its version was:
def delete(username, pathname, version = nil)
  prefix   = "users:#{username}:data:"
  query    = parse_path(pathname)
  filename = query.pop
  parents  = (0...query.size).map { |i| query[0..i].join('') }.reverse

  lock username do
    is_current, exists, modified = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    @redis.del(prefix + pathname)
    @redis.srem(prefix + parents.first + ':children', filename)

    parents.each_with_index do |parent, i|
      children = @redis.smembers(prefix + parent + ':children')

      if children.empty?
        @redis.del(prefix + parent)
        if i + 1 < parents.size
          item = query[query.size - i - 1]
          @redis.srem(prefix + parents[i + 1] + ':children', item)
        end
      else
        mtimes = children.map do |child|
          @redis.hget(prefix + parent + child, 'modified')
        end
        @redis.hset(prefix + parent, 'modified', mtimes.map(&:to_i).max)
      end
    end

    [exists, modified]
  end
end

private

def get_current_state(key, version)
  modified = @redis.hget(key, 'modified')
  return [version.nil?, false, nil] if modified.nil?
  return [true, true, modified.to_i] if version.nil?
  [version.to_i == modified.to_i, true, modified.to_i]
end
As an example, say we run the following commands to add three documents and then delete them in reverse order:
store.put('jcoglan', '/books/jstr/preface.txt', 'Preface to JSTR', 'text/plain')
store.put('jcoglan', '/books/jstr/chapters/browser.txt', 'Browser Applications', 'text/plain')
store.put('jcoglan', '/books/jstr/chapters/cli.txt', 'Command-line Interfaces', 'text/plain')
store.delete('jcoglan', '/books/jstr/chapters/cli.txt')
store.delete('jcoglan', '/books/jstr/chapters/browser.txt')
store.delete('jcoglan', '/books/jstr/preface.txt')
Here’s the command sequence for the second delete(), which will implicitly delete its parent folder but leave the others intact. We see it deleting the document, then its parent, then updating the versions of the other folders above it, by checking the children of each folder in turn.
"setnx" "locks:jcoglan" "1402435177397"
"hget" "users:jcoglan:data:/books/jstr/chapters/browser.txt" "modified"
"del" "users:jcoglan:data:/books/jstr/chapters/browser.txt"
"srem" "users:jcoglan:data:/books/jstr/chapters/:children" "browser.txt"
"smembers" "users:jcoglan:data:/books/jstr/chapters/:children"
"del" "users:jcoglan:data:/books/jstr/chapters/"
"srem" "users:jcoglan:data:/books/jstr/:children" "chapters/"
"smembers" "users:jcoglan:data:/books/jstr/:children"
"hget" "users:jcoglan:data:/books/jstr/preface.txt" "modified"
"hset" "users:jcoglan:data:/books/jstr/" "modified" "1402435167365"
"smembers" "users:jcoglan:data:/books/:children"
"hget" "users:jcoglan:data:/books/jstr/" "modified"
"hset" "users:jcoglan:data:/books/" "modified" "1402435167365"
"smembers" "users:jcoglan:data:/:children"
"hget" "users:jcoglan:data:/books/" "modified"
"hset" "users:jcoglan:data:/" "modified" "1402435167365"
"del" "locks:jcoglan"
This problem seems far harder to fix: in order to delete each folder or update its version, we need to check its children, and in order to make sure the children are up to date, we need to update the next folder down the chain, all the way back to the target document. This is, of course, how we would implement it on a real filesystem, where one cannot delete non-empty directories.
But there is a way out. Think back to the PUT example: we made the writes atomic by figuring out as many of their parameters as possible in advance, allowing us to disinterleave the writes from anything they depended on and making it possible to run them in a single transaction. If you think about it for a minute, you’ll realise you can predict which folders will need to be deleted and updated before you start deleting anything at all, and we’ll use that trick to tease apart the reads and writes.
We can say, in advance, that if we delete a document, then the folders forming a chain up from its immediate parent that each have only one child will need to be deleted: those folders exist entirely to contain the document we’re going to delete. So we can figure out the deletable folders in advance by walking up from the document until we find a folder with more than one child.
Similarly, we can figure out the versions of all the remaining folders after the deletion by reading all the relevant item versions in advance, that is, getting the versions of all the children of all the parents of the deleted document. Note this doesn’t involve reading the whole tree; only the immediate children of each folder need to be considered. From these lists, we can remove the version for the topmost deletable folder, then back-propagate to calculate the resulting versions of the other folders.
The code for this is significantly more verbose but I’ve tried to comment what’s going on at each point. You should be able to see how the previous approach of deleting an item, then checking how its parents need to be updated, then recursing up the tree, has been replaced with an algorithm that plans all the changes we need to make up front by doing that same process on data that we’ve already gathered into memory ahead of time.
def delete(username, pathname, version = nil)
  prefix  = "users:#{username}:data:"
  query   = parse_path(pathname)
  parents = (0...(query.size - 1)).map { |i| query[0..i].join('') }.reverse

  # e.g. if query = ['/', 'books/', 'jstr/', 'preface.txt']
  # then parents = ['/books/jstr/', '/books/', '/']

  lock username do
    is_current, exists, modified = get_current_state(prefix + pathname, version)
    raise VersionConflict unless is_current

    # Find all the immediate child items of the list of parent folders; this
    # returns an Array<Array<String>>.
    children = parents.map do |parent|
      @redis.smembers(prefix + parent + ':children')
    end

    # Determine which folders will be empty by starting with the immediate
    # parent and finding folders with only one child.
    empty, index = [], 0

    while index < parents.size and children[index].size == 1
      empty << parents[index]
      index += 1
    end

    # The remaining folders are the non-empty ones, i.e. those after the index
    # we just counted up to. Their children are the corresponding lists in the
    # children array.
    remaining = parents[index..-1]
    children  = children[index..-1]

    # Construct an Array<Array<Fixnum>> of modified times corresponding to the
    # children array.
    mtimes = children.map.with_index do |child_list, i|
      child_list.map do |child|
        @redis.hget(prefix + remaining[i] + child, 'modified').to_i
      end
    end

    # An index where we'll store the computed versions for remaining folders.
    ancestors = {}

    if remaining.size > 0
      # Remove the uppermost empty folder as a child from its parent, and
      # delete its corresponding version from the mtimes list.
      item = query[query.size - empty.size - 1]
      @redis.srem(prefix + remaining.first + ':children', item)
      mtimes.first.delete_at(children.first.index(item))

      # Back-propagate the version updates to the other remaining parent
      # folders. We calculate the new version from each folder's remaining
      # children, update this version in the mtimes list then repeat at the
      # next level up.
      remaining.each_with_index do |dir, i|
        ancestors[dir] = mtimes[i].max
        if i + 1 < remaining.size
          item = query[query.size - empty.size - 2 - i]
          mtimes[i + 1][children[i + 1].index(item)] = ancestors[dir]
        end
      end

      # Save the updated versions to the database.
      ancestors.each do |dir, mtime|
        @redis.hset(prefix + dir, 'modified', mtime)
      end
    end

    # Delete all the folders we determined were empty, and their sets of
    # children.
    empty.each do |dir|
      @redis.del(prefix + dir)
      @redis.del(prefix + dir + ':children')
    end

    # And finally, delete the document itself.
    @redis.del(prefix + pathname)

    [exists, modified]
  end
end
This code produces the following command sequence for deleting the second item in our test:
"setnx" "locks:jcoglan" "1402437739275"
"hget" "users:jcoglan:data:/books/jstr/chapters/browser.txt" "modified"
"smembers" "users:jcoglan:data:/books/jstr/chapters/:children"
"smembers" "users:jcoglan:data:/books/jstr/:children"
"smembers" "users:jcoglan:data:/books/:children"
"smembers" "users:jcoglan:data:/:children"
"hget" "users:jcoglan:data:/books/jstr/preface.txt" "modified"
"hget" "users:jcoglan:data:/books/jstr/chapters/" "modified"
"hget" "users:jcoglan:data:/books/jstr/" "modified"
"hget" "users:jcoglan:data:/books/" "modified"
"srem" "users:jcoglan:data:/books/jstr/:children" "chapters/"
"hset" "users:jcoglan:data:/books/jstr/" "modified" "1402437729226"
"hset" "users:jcoglan:data:/books/" "modified" "1402437729226"
"hset" "users:jcoglan:data:/" "modified" "1402437729226"
"del" "users:jcoglan:data:/books/jstr/chapters/"
"del" "users:jcoglan:data:/books/jstr/chapters/:children"
"del" "users:jcoglan:data:/books/jstr/chapters/browser.txt"
"del" "locks:jcoglan"
Before, we had calls to SMEMBERS and HGET interleaved with SREM, HSET and DEL. Now, however, we have completely disinterleaved the reads and writes; all the writes have been planned ahead of time and aren’t interleaved with the reads that determine what needs to be changed. This means we can wrap all the writes in a MULTI; I won’t reproduce the Ruby code for this (we wrap part of the delete() method in @redis.multi do ... end) but the commands look like this:
"setnx" "locks:jcoglan" "1402437849398"
"hget" "users:jcoglan:data:/books/jstr/chapters/browser.txt" "modified"
"smembers" "users:jcoglan:data:/books/jstr/chapters/:children"
"smembers" "users:jcoglan:data:/books/jstr/:children"
"smembers" "users:jcoglan:data:/books/:children"
"smembers" "users:jcoglan:data:/:children"
"hget" "users:jcoglan:data:/books/jstr/preface.txt" "modified"
"hget" "users:jcoglan:data:/books/jstr/chapters/" "modified"
"hget" "users:jcoglan:data:/books/jstr/" "modified"
"hget" "users:jcoglan:data:/books/" "modified"
"multi"
"srem" "users:jcoglan:data:/books/jstr/:children" "chapters/"
"hset" "users:jcoglan:data:/books/jstr/" "modified" "1402437839350"
"hset" "users:jcoglan:data:/books/" "modified" "1402437839350"
"hset" "users:jcoglan:data:/" "modified" "1402437839350"
"del" "users:jcoglan:data:/books/jstr/chapters/"
"del" "users:jcoglan:data:/books/jstr/chapters/:children"
"del" "users:jcoglan:data:/books/jstr/chapters/browser.txt"
"exec"
"del" "locks:jcoglan"
This is now a robust deletion procedure. The SETNX lock around the whole process means that while we’re making our reads to figure out what to do, we know the structure won’t change, so we get a consistent view of the data. And putting all the writes together in one transaction means that we cannot leave the database in an invalid state by accident. After every atomic change that Redis makes, the data will be in a meaningful state that makes sense to the application, rather than an intermediate one.
What we’ve done here has a deep similarity with optimising module loading or Make task execution: they all involve flattening a dependency graph. If you look at the first command trace for the delete() method, where reads and writes are interleaved, almost every command depends on the one before it. The HSET that updates a folder’s modified time depends on the HGETs that read its children’s modified times, which in turn depend on the SMEMBERS that finds the children’s names, all of which depend on the list of children and their times being left up to date by the previous iteration. The only parallelisable commands in the whole list are the DEL/SREM pairs that delete items; the dependency graph of this trace is thirteen steps deep, including the initial version check.
HGET current document version
|
DEL document / SREM from parent's children
|
SMEMBERS parent's children
|
DEL folder / SREM from parent's children
|
SMEMBERS parent's children
|
HGET all child versions
|
HSET parent version
|
SMEMBERS parent's children
|
HGET all child versions
|
HSET parent version
|
SMEMBERS parent's children
|
HGET all child versions
|
HSET parent version
By contrast, the final trace we ended up with is very shallow: all the writes are mutually independent, which is why we can put them in a transaction. All the HGETs are mutually independent too, though each depends on one of the SMEMBERS calls, which are themselves mutually independent. Including the initial version check, this dependency graph is only four steps deep, compared to the thirteen of the previous trace.
HGET current document version
|
SMEMBERS the children of all parent folders
|
HGET remaining children's versions
|
DEL / SREM / HSET items as necessary
We’ve reshaped the graph so that more of its items are order-independent. For some problems, this lets us run tasks in parallel. For this task, it means operations can be grouped into transactions and made atomic.
Now you’re probably wondering what all this has to do with promises, or Make, or functional programming. Well, the interesting thing for me is what triggered me to do this work in the first place. I was actually working on the Redis backend for Faye, where each client has a set of channels it’s subscribed to (and each channel has a reverse index of clients subscribed to it, for routing messages), and a list of messages that are queued for delivery to that client. The engine periodically sweeps the database for expired sessions and removes all their subscriptions and message queues.
For a long time after I launched the Redis engine, there were problems with race conditions that would let expired sessions leak memory over time. For example, if we delete a client’s message queue before we delete its subscriptions, then there’s a window of time where a published message can still be routed to that client and added to its queue, after that queue has been deleted.
I found this problem really hard to solve for a long time – the Redis backend launched over three years ago – and it surprised me that all of a sudden the solution became obvious to me. Just like our PUT example above, the session expiry mechanism had a read right in the middle of it that looked up the client’s subscriptions so it could delete them all, actually by delegating to another function that just handled unsubscribing a client from a single channel. Thinking about the problem as a dependency graph made me realise that the read could be pushed to the start of the process, leaving a bunch of mutually independent writes that could all be put in a transaction. Et voilà: atomic client expiry, one less race condition to worry about.
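The shape of the fix looks something like this – a minimal sketch, not Faye’s actual code, with hypothetical key names – one read up front, then nothing but mutually independent writes inside the transaction:

def expire_client(client_id)
  # The only read: find every channel this client is subscribed to.
  channels = @redis.smembers("clients:#{client_id}:channels")

  @redis.multi do
    channels.each do |channel|
      # Remove the client from each channel's reverse index.
      @redis.srem("channels:#{channel}", client_id)
    end
    @redis.del("clients:#{client_id}:channels")  # the subscription set
    @redis.del("clients:#{client_id}:messages")  # the message queue
  end
end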
I had been stuck because I was thinking about the problem from an imperative programming mindset. In imperative programming, or more precisely, in systems with strict rather than lazy evaluation, we are forced to tell the computer to perform actions in a certain order. Sometimes, the order we decide upon is important – the program would break if its statements were reordered – and sometimes it is not – a set of statements can be evaluated in any order and the program would still work. That is, in an imperative program we often see code like this:
f()
g()
But we are not sure whether it matters that f() is called first. This ambiguity arises because the program does not express the dependency relationships between the entities it manipulates. Those relationships are in the programmer’s head, or on a whiteboard somewhere, and they have done a bunch of computation outside of the program to determine what the program itself should look like. The program expresses the solution the programmer invented, rather than solving the problem itself, and crucial constraints on the design of the solution are omitted.
In a functional language like Haskell, where functions are lazy and have no side effects, the only way to force two expressions to run in a certain order is to make one depend on the other, either by taking its output as input:
g(f())
Or by making one expression conditional on the other:
if f() {
g()
}
(These examples are generic pseudocode, not executable Haskell.)
In the above construction that simply calls f() and then g(), it’s not clear from the code itself whether f() can be called later, or removed from the program entirely. The only reason it could be necessary to place it before g() is that it has some side effect that affects the result of g(). The only way to tell for sure is to run the program, possibly via a test harness, and check what its cumulative side effects are, for example by checking the state it leaves a database in.
The latter two constructs express, via the syntax of the program, that g() depends on f(), in ways that are far stronger than simply listing them one after the other. They make it clear that the result of the program will be very different if f() is removed; if its output is fed directly into g(), or used to decide whether to call g() at all, then we know it must affect g()’s output in some way. This dependency information is encoded in the program itself, not enforced implicitly via a test suite or manual QA process.
Stuck in the imperative mindset, I was convinced the solution had to lie in the order I was running these Redis commands in, that if I could just find the right recipe and run commands at just the right times, the problem would be solved. But having spent the last year thinking about things like promises and Make, I suddenly saw the problem instead in terms of dependencies. Just as I have done in the examples above, I thought about the dependencies between values explicitly, then tried to see if I could rearrange that graph to bring all the writes together, and lo and behold the right answer just fell out.
The funny thing is, I’m not sure anyone would call my final code for DELETE a ‘functional’ program; it’s still very much stateful imperative code, full of side effects and mutable data structures. But I would argue that it’s more functional than the original DELETE code. In the original, the order of statements is very important, and there are few places where an explicit dependency is drawn between two operations by having values defined in terms of other values. Frequently, getting a correct value depends on some SREM or HSET side effect having been executed before its calculation, and that dependency is entirely encoded by the statements executing in a certain order, rather than by relationships between values.
But in the second example, far more of the statements that are order-dependent are forced to be so by having values defined in terms of other values. The calculations for empty, remaining and mtimes depend explicitly on the value of children, thus forcing children to be calculated before them. Once we get to the write operations, those can be executed in any order, and we can tell this because no write operation uses the value of another write operation as its input. We can reorder the statements for the writes and the program will still work. Whereas, if we reorder the statements for the reads, we will get an ‘undefined variable’ error because a value they depend on is not available yet. We get direct feedback from the programming language that an operation makes no sense, rather than being able to freely reorder statements and having to check the database to see what went wrong. If we were writing this in a compiled language, this error could even be flagged by static analysis, without running the program at all.
In other words, the program uses functional techniques to indicate and force the correct order of operations, rather than imperative ones, and makes it clear where dependencies exist and where they do not. Instead of using the database as the working space to incrementally figure out which changes to make, we’ve moved all that work into the program’s memory where we can derive values from one another and compute a single set of changes to make at the end.
Once you start thinking in terms of dependency graphs, you start seeing them everywhere. They’re a great tool for figuring out how you can optimise programs, and make them more explicit and resistant to people accidentally introducing sequencing bugs later. They encode the problem constraints directly in the program, and get you a step closer to letting the computer solve your problems for you.