Why you should never use hash functions for message authentication

The general thrust of this post is: use a MAC function like HMAC to sign data, don’t use hash functions. Although not all hash functions suffer from the problem I’m going to illustrate, in general using a hash function for message authentication comes with a lot of potential problems because those functions aren’t designed for this task. You shouldn’t try to work around it by creatively processing the inputs or inventing some fancy way of chaining hash functions. Just use the functions that were designed for this task instead of inventing your own crypto schemes.

This morning I was reading an article on the Intridea blog about signed idempotent action links. These are links you can embed in emails to your users that let them perform authenticated actions on your site, without going through a login screen or any other interactions. This technique relies on embedding action data in the link, for example an email from Twitter could include your user ID and the name of another user, and following the link would make you follow that user on Twitter.

But these parameters are typically easily guessable and so we must provide an authentication mechanism, otherwise anyone can construct URLs to modify other accounts however they like. To do this, we ‘sign’ the action data by combining it with a secret value, for example a global secret token stored on our servers, or the user’s password hash to produce a ‘tag’. This ‘tag’ does not reveal the secret values used in its construction but lets us verify that the link was generated by our site and for the correct user. The article gives the following methods you could add to a user class to let them sign and verify actions, assuming the user has an id and a secret_token. (I don’t mean to pick on Intridea, I’ve seen a lot of other people make this mistake and I only found out recently why you shouldn’t do it.)

class User
  def sign_action(action, *params)
    Digest::SHA1.hexdigest(
      "--signed--#{id}-#{action}-#{params.join('-')}-#{secret_token}"
    )
  end
  
  def verify(signature, action, *params)
    signature == sign_action(action, *params)
  end
end

This combines the action data with a secret token and calculates the SHA-1 hash; this hash is then appended to the URL to go in the email, and verified when the link is followed. Seems fine, right? SHA-1 doesn’t reveal anything about its input, so the secret is safe. This might be true (modulo the existence of rainbow tables) but hash functions still don’t provide any guarantee that a (message,tag) pair is genuine. To see why, we need to consider how hash functions work.

Many hash functions are based on something called the ‘Merkle-Damgard iterated construction’, which works like this: say you have a long string you want to hash, that looks like this:

+--------------------------+
| abcdefghijklmnopqrstuvwx |
+--------------------------+

Your message gets split into fixed-size blocks like this:

+--------+  +--------+  +--------+  +--------+
| abcdef |  | ghijkl |  | mnopqr |  | stuvwx |
+--------+  +--------+  +--------+  +--------+

It is then prefixed with a value called the ‘initialization vector’ or IV, which is a global constant that’s part of the hash function’s definition. This IV is the same size as the message blocks:

    IV          M0          M1          M2          M3
+--------+  +--------+  +--------+  +--------+  +--------+
| Zn8AGy |  | abcdef |  | ghijkl |  | mnopqr |  | stuvwx |
+--------+  +--------+  +--------+  +--------+  +--------+

This sequence is then folded using a compression function h(). The details of the h() depend on the hashing function, but the only thing that concerns us here is that the compression function takes two message blocks and returns another block of the same size.

So, first we take IV and M0 and compute h(IV,M0) to get H0. Then we take H0 and M1 and compute H1 = h(H0,M1) and so on down the chain.

    IV          M0          M1          M2          M3
+--------+  +--------+  +--------+  +--------+  +--------+
| Zn8AGy |  | abcdef |  | ghijkl |  | mnopqr |  | stuvwx |
+--------+  +--------+  +--------+  +--------+  +--------+
     |           |           |           |           |
     |         +-+-+       +-+-+       +-+-+       +-+-+
     +-------->| h |--H0-->| h |--H1-->| h |--H2-->| h |--> H3
               +---+       +---+       +---+       +---+

Whatever value comes out the end of the chain is the result of the hash function; this is how hash functions take arbitrary-size input and produce fixed-size output. But what does it mean for message authentication? Well say you have a (message,tag) pair, like our action params and SHA-1 hash from above. The construction of hash functions means that if you know the value of hash(string), you can easily work out the value of hash(string + modification) without knowing what string is: you just take the hash you already have, and do some more rounds of Merkle-Damgard with your modification. So, if we’re signing data by doing this:

tag = sha1(secret + message)

we’re then going to send (message,tag) over the wire in an email. The Merkle-Damgard construction means an attacker can take these values and easily compute the following:

fake_tag = sha1(secret + message + modification)

The attacker can do this without knowing secret, and can therefore construct new (message,tag) pairs that look genuine to the application. The creation of new pairs by untrusted parties is called an extension attack, and common hash functions are vulnerable to it.

Fortunately, there’s a function that is resistant to extension attacks and is designed precisely for signing messages: HMAC, short for Hash-based Message Authentication Code. In Ruby you can use the OpenSSL module from the standard library for this:

sha1 = OpenSSL::Digest::Digest.new('sha1')
tag  = OpenSSL::HMAC.hexdigest(sha1, secret_token, message)

Although HMAC is based on hash functions, it’s constructed in a way that prevents extension attacks and other classes of ‘existential forgery’, i.e. people constructing fake-but-valid (message,tag) pairs. If you’re signing data, you should use it.

But wait: there’s more! Take another look at the verify() function:

def verify(signature, action, *params)
  signature == sign_action(action, *params)
end

See anything wrong with it? It’s actually vulnerable to a timing attack: the String#== method as commonly implemented compares strings character-by-character (or byte-by-byte) and exits with false as soon as it finds an unequal pair. This fact means that an attacker can determine the first correct character of the tag by submitting requests to a signed URL with a different first character in the tag each time, and stopping when the request takes a little longer than usual. After guessing the first character they can move onto the second, and so on until they’ve guessed the whole correct tag.

The easiest way to defeat this attack is, instead of directly comparing two strings, compare their mappings under a collision-resistant hash function:

def verify(signature, action, *params)
  expected = Digest::SHA1.hexdigest(sign_action(action, *params))
  actual = Digest::SHA1.hexdigest(signature)
  expected == actual
end

This strategy means that if any of the supplied tag is wrong, it will probably differ on the first character when compared to the hash of the expected signature. Even if it doesn’t sometimes, you’ve destroyed the linear relationship between the correctness of the string and the time the comparison takes.

Finally, you should make sure your application does not exit early if the tag is invalid. You should do all the data processing you would normally do, just short of modifying the database, and check the tag last. If you return early you risk another timing attack.

All the information in this post I learned from Stanford’s introduction to cryptography, which is an excellent six-week primer on the topic. It’s a little mathsy but not nearly as much as it could be, meaning it’s fairly accessible and focuses on practical problems and intuition, and is really useful to anyone handling user data for a living. I’d go so far as to say it should be required reading for any web developer. It starts up again on Monday June 11: go sign up and make sure you’re not the next victim of the password leaks we’ve seen this week.