The general thrust of this post is: use a MAC function like HMAC to sign data, don’t use hash functions. Although not all hash functions suffer from the problem I’m going to illustrate, in general using a hash function for message authentication comes with a lot of potential problems because those functions aren’t designed for this task. You shouldn’t try to work around it by creatively processing the inputs or inventing some fancy way of chaining hash functions. Just use the functions that were designed for this task instead of inventing your own crypto schemes.
This morning I was reading an article on the Intridea blog about signed idempotent action links. These are links you can embed in emails to your users that let them perform authenticated actions on your site, without going through a login screen or any other interactions. This technique relies on embedding action data in the link, for example an email from Twitter could include your user ID and the name of another user, and following the link would make you follow that user on Twitter.
But these parameters are typically easily guessable and so we must provide an
authentication mechanism, otherwise anyone can construct URLs to modify other
accounts however they like. To do this, we ‘sign’ the action data by combining
it with a secret value, for example a global secret token stored on our servers,
or the user’s password hash to produce a ‘tag’. This ‘tag’ does not reveal the
secret values used in its construction but lets us verify that the link was
generated by our site and for the correct user. The article gives the following
methods you could add to a user class to let them sign and verify actions,
assuming the user has an id
and a secret_token
. (I don’t mean to pick on
Intridea, I’ve seen a lot of other people make this mistake and I only found out
recently why you shouldn’t do it.)
class User
def sign_action(action, *params)
Digest::SHA1.hexdigest(
"--signed--#{id}-#{action}-#{params.join('-')}-#{secret_token}"
)
end
def verify(signature, action, *params)
signature == sign_action(action, *params)
end
end
This combines the action data with a secret token and calculates the SHA-1 hash; this hash is then appended to the URL to go in the email, and verified when the link is followed. Seems fine, right? SHA-1 doesn’t reveal anything about its input, so the secret is safe. This might be true (modulo the existence of rainbow tables) but hash functions still don’t provide any guarantee that a (message,tag) pair is genuine. To see why, we need to consider how hash functions work.
Many hash functions are based on something called the ‘Merkle-Damgard iterated construction’, which works like this: say you have a long string you want to hash, that looks like this:
+--------------------------+
| abcdefghijklmnopqrstuvwx |
+--------------------------+
Your message gets split into fixed-size blocks like this:
+--------+ +--------+ +--------+ +--------+
| abcdef | | ghijkl | | mnopqr | | stuvwx |
+--------+ +--------+ +--------+ +--------+
It is then prefixed with a value called the ‘initialization vector’ or IV, which is a global constant that’s part of the hash function’s definition. This IV is the same size as the message blocks:
IV M0 M1 M2 M3
+--------+ +--------+ +--------+ +--------+ +--------+
| Zn8AGy | | abcdef | | ghijkl | | mnopqr | | stuvwx |
+--------+ +--------+ +--------+ +--------+ +--------+
This sequence is then folded using a compression function h()
. The details of
the h()
depend on the hashing function, but the only thing that concerns us
here is that the compression function takes two message blocks and returns
another block of the same size.
So, first we take IV
and M0
and compute h(IV,M0)
to get H0
. Then we take
H0
and M1
and compute H1 = h(H0,M1)
and so on down the chain.
IV M0 M1 M2 M3
+--------+ +--------+ +--------+ +--------+ +--------+
| Zn8AGy | | abcdef | | ghijkl | | mnopqr | | stuvwx |
+--------+ +--------+ +--------+ +--------+ +--------+
| | | | |
| +-+-+ +-+-+ +-+-+ +-+-+
+-------->| h |--H0-->| h |--H1-->| h |--H2-->| h |--> H3
+---+ +---+ +---+ +---+
Whatever value comes out the end of the chain is the result of the hash
function; this is how hash functions take arbitrary-size input and produce
fixed-size output. But what does it mean for message authentication? Well say
you have a (message,tag) pair, like our action params and SHA-1 hash from above.
The construction of hash functions means that if you know the value of
hash(string)
, you can easily work out the value of hash(string +
modification)
without knowing what string
is: you just take the hash you
already have, and do some more rounds of Merkle-Damgard with your
modification
. So, if we’re signing data by doing this:
tag = sha1(secret + message)
we’re then going to send (message
,tag
) over the wire in an email. The
Merkle-Damgard construction means an attacker can take these values and easily
compute the following:
fake_tag = sha1(secret + message + modification)
The attacker can do this without knowing secret
, and can therefore construct
new (message
,tag
) pairs that look genuine to the application. The creation
of new pairs by untrusted parties is called an extension attack, and common hash
functions are vulnerable to it.
Fortunately, there’s a function that is resistant to extension attacks and is
designed precisely for signing messages: HMAC, short for Hash-based Message
Authentication Code. In Ruby you can use the OpenSSL
module from the standard
library for this:
sha1 = OpenSSL::Digest::Digest.new('sha1')
tag = OpenSSL::HMAC.hexdigest(sha1, secret_token, message)
Although HMAC is based on hash functions, it’s constructed in a way that prevents extension attacks and other classes of ‘existential forgery’, i.e. people constructing fake-but-valid (message,tag) pairs. If you’re signing data, you should use it.
But wait: there’s more! Take another look at the verify()
function:
def verify(signature, action, *params)
signature == sign_action(action, *params)
end
See anything wrong with it? It’s actually vulnerable to a timing attack: the
String#==
method as commonly implemented compares strings
character-by-character (or byte-by-byte) and exits with false
as soon as it
finds an unequal pair. This fact means that an attacker can determine the first
correct character of the tag by submitting requests to a signed URL with a
different first character in the tag each time, and stopping when the request
takes a little longer than usual. After guessing the first character they can
move onto the second, and so on until they’ve guessed the whole correct tag.
The easiest way to defeat this attack is, instead of directly comparing two strings, compare their mappings under a collision-resistant hash function:
def verify(signature, action, *params)
expected = Digest::SHA1.hexdigest(sign_action(action, *params))
actual = Digest::SHA1.hexdigest(signature)
expected == actual
end
This strategy means that if any of the supplied tag is wrong, it will probably differ on the first character when compared to the hash of the expected signature. Even if it doesn’t sometimes, you’ve destroyed the linear relationship between the correctness of the string and the time the comparison takes.
Finally, you should make sure your application does not exit early if the tag is invalid. You should do all the data processing you would normally do, just short of modifying the database, and check the tag last. If you return early you risk another timing attack.
All the information in this post I learned from Stanford’s introduction to cryptography, which is an excellent six-week primer on the topic. It’s a little mathsy but not nearly as much as it could be, meaning it’s fairly accessible and focuses on practical problems and intuition, and is really useful to anyone handling user data for a living. I’d go so far as to say it should be required reading for any web developer. It starts up again on Monday June 11: go sign up and make sure you’re not the next victim of the password leaks we’ve seen this week.