I’m writing a book. It’s called JavaScript Testing Recipes.

Why you should never use hash functions for message authentication

The general thrust of this post is: use a MAC function like HMAC to sign data, don’t use hash functions. Although not all hash functions suffer from the problem I’m going to illustrate, in general using a hash function for message authentication comes with a lot of potential problems because those functions aren’t designed for this task. You shouldn’t try to work around it by creatively processing the inputs or inventing some fancy way of chaining hash functions. Just use the functions that were designed for this task instead of inventing your own crypto schemes.

This morning I was reading an article on the Intridea blog about signed idempotent action links. These are links you can embed in emails to your users that let them perform authenticated actions on your site, without going through a login screen or any other interactions. This technique relies on embedding action data in the link, for example an email from Twitter could include your user ID and the name of another user, and following the link would make you follow that user on Twitter.

But these parameters are typically easily guessable and so we must provide an authentication mechanism, otherwise anyone can construct URLs to modify other accounts however they like. To do this, we ‘sign’ the action data by combining it with a secret value, for example a global secret token stored on our servers, or the user’s password hash to produce a ‘tag’. This ‘tag’ does not reveal the secret values used in its construction but lets us verify that the link was generated by our site and for the correct user. The article gives the following methods you could add to a user class to let them sign and verify actions, assuming the user has an id and a secret_token. (I don’t mean to pick on Intridea, I’ve seen a lot of other people make this mistake and I only found out recently why you shouldn’t do it.)

class User
  def sign_action(action, *params)
    Digest::SHA1.hexdigest(
      "--signed--#{id}-#{action}-#{params.join('-')}-#{secret_token}"
    )
  end
  
  def verify(signature, action, *params)
    signature == sign_action(action, *params)
  end
end

This combines the action data with a secret token and calculates the SHA-1 hash; this hash is then appended to the URL to go in the email, and verified when the link is followed. Seems fine, right? SHA-1 doesn’t reveal anything about its input, so the secret is safe. This might be true (modulo the existence of rainbow tables) but hash functions still don’t provide any guarantee that a (message,tag) pair is genuine. To see why, we need to consider how hash functions work.

Many hash functions are based on something called the ‘Merkle-Damgard iterated construction’, which works like this: say you have a long string you want to hash, that looks like this:

+--------------------------+
| abcdefghijklmnopqrstuvwx |
+--------------------------+

Your message gets split into fixed-size blocks like this:

+--------+  +--------+  +--------+  +--------+
| abcdef |  | ghijkl |  | mnopqr |  | stuvwx |
+--------+  +--------+  +--------+  +--------+

It is then prefixed with a value called the ‘initialization vector’ or IV, which is a global constant that’s part of the hash function’s definition. This IV is the same size as the message blocks:

    IV          M0          M1          M2          M3
+--------+  +--------+  +--------+  +--------+  +--------+
| Zn8AGy |  | abcdef |  | ghijkl |  | mnopqr |  | stuvwx |
+--------+  +--------+  +--------+  +--------+  +--------+

This sequence is then folded using a compression function h(). The details of the h() depend on the hashing function, but the only thing that concerns us here is that the compression function takes two message blocks and returns another block of the same size.

So, first we take IV and M0 and compute h(IV,M0) to get H0. Then we take H0 and M1 and compute H1 = h(H0,M1) and so on down the chain.

    IV          M0          M1          M2          M3
+--------+  +--------+  +--------+  +--------+  +--------+
| Zn8AGy |  | abcdef |  | ghijkl |  | mnopqr |  | stuvwx |
+--------+  +--------+  +--------+  +--------+  +--------+
     |           |           |           |           |
     |         +-+-+       +-+-+       +-+-+       +-+-+
     +-------->| h |--H0-->| h |--H1-->| h |--H2-->| h |--> H3
               +---+       +---+       +---+       +---+

Whatever value comes out the end of the chain is the result of the hash function; this is how hash functions take arbitrary-size input and produce fixed-size output. But what does it mean for message authentication? Well say you have a (message,tag) pair, like our action params and SHA-1 hash from above. The construction of hash functions means that if you know the value of hash(string), you can easily work out the value of hash(string + modification) without knowing what string is: you just take the hash you already have, and do some more rounds of Merkle-Damgard with your modification. So, if we’re signing data by doing this:

tag = sha1(secret + message)

we’re then going to send (message,tag) over the wire in an email. The Merkle-Damgard construction means an attacker can take these values and easily compute the following:

fake_tag = sha1(secret + message + modification)

The attacker can do this without knowing secret, and can therefore construct new (message,tag) pairs that look genuine to the application. The creation of new pairs by untrusted parties is called an extension attack, and common hash functions are vulnerable to it.

Fortunately, there’s a function that is resistant to extension attacks and is designed precisely for signing messages: HMAC, short for Hash-based Message Authentication Code. In Ruby you can use the OpenSSL module from the standard library for this:

sha1 = OpenSSL::Digest::Digest.new('sha1')
tag  = OpenSSL::HMAC.hexdigest(sha1, secret_token, message)

Although HMAC is based on hash functions, it’s constructed in a way that prevents extension attacks and other classes of ‘existential forgery’, i.e. people constructing fake-but-valid (message,tag) pairs. If you’re signing data, you should use it.

But wait: there’s more! Take another look at the verify() function:

def verify(signature, action, *params)
  signature == sign_action(action, *params)
end

See anything wrong with it? It’s actually vulnerable to a timing attack: the String#== method as commonly implemented compares strings character-by-character (or byte-by-byte) and exits with false as soon as it finds an unequal pair. This fact means that an attacker can determine the first correct character of the tag by submitting requests to a signed URL with a different first character in the tag each time, and stopping when the request takes a little longer than usual. After guessing the first character they can move onto the second, and so on until they’ve guessed the whole correct tag.

The easiest way to defeat this attack is, instead of directly comparing two strings, compare their mappings under a collision-resistant hash function:

def verify(signature, action, *params)
  expected = Digest::SHA1.hexdigest(sign_action(action, *params))
  actual = Digest::SHA1.hexdigest(signature)
  expected == actual
end

This strategy means that if any of the supplied tag is wrong, it will probably differ on the first character when compared to the hash of the expected signature. Even if it doesn’t sometimes, you’ve destroyed the linear relationship between the correctness of the string and the time the comparison takes.

Finally, you should make sure your application does not exit early if the tag is invalid. You should do all the data processing you would normally do, just short of modifying the database, and check the tag last. If you return early you risk another timing attack.

All the information in this post I learned from Stanford’s introduction to cryptography, which is an excellent six-week primer on the topic. It’s a little mathsy but not nearly as much as it could be, meaning it’s fairly accessible and focuses on practical problems and intuition, and is really useful to anyone handling user data for a living. I’d go so far as to say it should be required reading for any web developer. It starts up again on Monday June 11: go sign up and make sure you’re not the next victim of the password leaks we’ve seen this week.

36 thoughts on “Why you should never use hash functions for message authentication

  1. The example you introduce at the top has the secret key at the beginning of the message, but your explanation uses a secret key at the end of the message. As such, it’s not clear to me how the explanation is relevant to the initial example.

    I suggest using the same type of example in both cases.

  2. J: This is true, but if you’re using a hash function it’s easy enough to put them the other way round: there’s no obvious right way to combine the inputs. On the other hand, HMAC accepts the two inputs separately and makes sure they’re safely combined, so you don’t need to worry about it.

  3. A timing attack is a ridiculous notion for that scenario – how long would it take to compare 40-byte strings on a modern CPU? In both the case of a first-character mismatch and two matching strings, the number is very likely smaller than one microsecond, which is well below the noise levels for time-measurement over the internet, or over a network in general (just one context switch can easily add an unpredictable tens of milliseconds of latency)

  4. Hi,

    I’m wondering if it would be ‘safe’ to send two related hashes, e.g.

    hash1 = sha1 ( f(message, secret) )
    hash2 = sha1 ( f(message, secret) + hash1)

    The attacker may attempt to break hash1 and hash2 separately as you explained, producing fakehash1 and fakehash2, but would it ever match the condition the backend would then test, i.e. fakehash2 = sha1 ( f(message, secret) + fakehash1) ?

  5. While this article does a decent job of describing the high level problems associated with using secure hashes as digital signatures, it misses the minutiae enough that it might lead readers to draw the wrong conclusions. I detailed this on HackerNews:

    http://news.ycombinator.com/item?id=4089076

  6. Jakub: you’ve reinvented something that sort of looks like PBKDF2, but not quite, and the devil really is in the details. That example is vulnerable to an extension on hash1, for example.

    You should not try and invent clever schemes based on inappropriate functions. Just use functions that have already been developed and thoroughly analysed by experts in the field, that are appropriate to your use case. For message authentication, this means MAC functions, HMAC being one family of those.

  7. I’m not sure that’s true in the general case, Luca. The current ruby head uses memcmp, as far as I can tell, which may or may not be timing attack invariant. That probably comes down to the compiler. And that entirely leaves aside the issue of what rubinius or jRuby do in that case.

  8. Luca: relying on the implementation here is a little weak. It’s to be expected that a reasonable implementation of String#== would not be time-invariant, so it’s better to use approaches that make sure you don’t have a timing problem, based on properties of the algorithms you’re using.

    I also worry that, given that so much of our craft is learned by cargo-culting from other people, relying on tricks like this sets a bad example. This is especially true for security stuff, where good information is almost invisible to many web practitioners.

  9. Better yet, use a perishable token like Authlogic does. This is tied to a single user, and it expires both after being used and after a set amount of time.

  10. @Joost is right, this is the same problem as sessions and should be treated the same way. Create a random session_id (or call it token, if you prefer), save it to the DB together with the user_id and email it to your user together with that the user_id (or some unique hash if you’d rather keep the ID secret) . Make sure you use a good random function and that token expires after the first use and you’ll be fine.

    On the another level, good idea is to always try to prevent brute forcing the accountlog by logging IP address on all auth errors. When you notice more then a few errors from the same IP just block it. This will, of course, not stop a determined hacker, but will stop the most of script kiddies and at least slow down the rest…

  11. I don’t understand why a program shouldn’t fail early if the tag is invalid?

    If the tag is invalid and ends at 40ms, and a valid tag request ends at 60ms, the only way a timing attack could uncover something is if the hacker found a valid tag, in which case they could determine that information in other ways anyway and the damage would already be done.

  12. Joost et al: yes, that’s another way you can implement authenticated links but that wasn’t really the point of the article. For what it’s worth you can implement sessions, and many other things often solved using stored tokens, using self-contained signed messages. e.g. Rails lets you store the whole session in the cookie rather than on disk using authenticated encryption.

    Besides, storage comes with its own set of problems: you need to secure the storage, make sure nothing is in plaintext when you don’t need it etc. I’ve seen plenty of token-based solutions (for authenticated links, OAuth, etc.) that store tokens in plaintext when the application never needs the real value, it just needs to validate it when the user supplies it. Much like passwords, these tokens grant access to change things and should not be stored in plaintext. Next time a site gets broken into and we get all its OAuth tokens instead of passwords you’ll see what I mean.

  13. Tyler: it can be useful to give an attacker no indication at all that their attempt failed. If there’s no way to tell whether the request was valid or not, they have no information to use to keep trying until they succeed.

  14. Why you should never use hash functions for message authentication « Don McCaughey

  15. Actually, most cryptographic hash functions (in the MD5/SHA family) include extension-resistant padding at the end of the last block. That is part of the Merkle-Damgard construction. According to Wikipedia, they’re only vulnerable to extension if you’ve already found a collision in the hash.

    Agree that the HMAC is still stronger and more appropriate for the purpose, though.

  16. James: What real life goal can be accomplished without this sort of feedback? For instance, a web based API call either will return a response of an error or the desired response based on whether or not the credentials validated.

    I can’t imagine a scenario where there would not be a different response if the request failed than if it succeeded.

  17. You do realize, of course, that sha3 is about to come out, rendering the concerns about attacks on the construction of hash functions in general totally irrelevant.

  18. >The easiest way to defeat this attack is, instead of directly comparing two strings, compare their mappings under a collision-resistant hash function:

    This is not enough. You must make sure the attacker doesn’t know the bytes that are compared. If just plain SHA1 is used (without a concatenated secret etc), the attacker can also compute the value by just SHA1′ing the signature.

    Easiest way to defeat the timing side-channel is to use so called “double HMAC”.

    expected = HMAC(sign_action(action, *params))
    actual = HMAC(signature)
    expected == actual

    The signing key (the HMAC key) is unknown to the attacker, which makes it impossible for her to know the bytes.

    Also, make sure the user supplied “signature” has correct length.

  19. Shaun: I deliberately left padding out of this discussion because on the surface it looks like it will protect you from attacks. While it is part of the M-D construction, too many people see it and assume that hashes work for authentication as long as the message isn’t an integer multiple of the block size. This turns out not to be the case, for example there was an attack on Flickr that got around this.

  20. Tyler: this is certainly a concern — legitimate users need helpful feedback, and trading this off against information leaks is hard. These authenticated links are typically not part of the standard resource structure of the site seen in browsers or by an API client, for example they’re obvious all state-changing requests made using GET so they’re kind of a special case.

    Of course, if the authentication grants access to read secret resources, you can’t make the success and failure responses the same.

  21. Bram: I didn’t realize SHA-3 was so close to being shipped. I also didn’t mean to give the impression that *all* hash functions are constructed this way, I just wanted to point out that tools designed for authentication specifically exist, and using other functions not designed for this purpose can leave you open to attacks you don’t know about.

  22. timoh: it’s a valid point to use unpredictable bytes, e.g. using HMAC with a secret rather than a vanilla hash. I’m not sure how it helps the attacker in this case: knowing you got the first 39 characters of a hash correct is not that useful since you don’t know how to modify the input to get the next character correct — the entire input is probably wrong. Could you explain?

  23. Good article, but I would start it by stating what is guaranteed by a typical cryptographic hash function. It guarantees that when you compute a hash of some data, you cannot alter that data without altering the hash.

    It seems that the root problem of message signing with a hash is that you _really_ verify the message against your secret_token, not the hash. So the attacker can alter the message and generate a _new_ hash that would still match the secret.

  24. “Just use the functions that were designed for this task instead of inventing your own crypto schemes.”

    Exactly.

    When you design a crypto scheme and use it on the Internet, what you’re doing is building a barrier and claiming that nobody *in the world* *from the future* is going to figure out how to bypass your barrier throughout its entire useful lifetime. Also, unlike with other kinds of software, there’s probably no way to confirm that you’ve succeeded.

    Far too few programmers really grasp how difficult that is to accomplish.

  25. James Coglan: It is still leaking “some secrets”. While maybe not all the bytes, but still at least the first bytes are quite easy to fetch. This is unpleasant.

    HMAC also offers some properties that could save you when the used hash function fails (I mean a situation where HMAC with, say, SHA-256 still resists the attack while SHA-256 itself has a explotable problem).

  26. The vulnerability you’ve detailed is worth considering when constructing an auth mechanism like this, but it’s not a show stopper… Whilst we can easily get hash(real_input+secret+malicious) from hash(params+secret), that doesn’t really matter: in the ruby above, and I’d have thought the most common case, you’re checking that token == hash(input+secret). So whilst an attacker can get hash(real_input+secret+malicious), for this to be useful the attacker would need to be able to compute hash(real_input+malicious+secret). Which they can’t: they can only add to the hash after the secret, which would garbage their fake token.

  27. It is really possible that guys who cracked these passwords on the past week have taken the same course you’ve had :-)

  28. Rounded Corners 378 – Lifting weights violates DRY /by @assaf

  29. @Shaun: if the attacker can arrange things such that the original hash length is a valid value for a parameter and in the right position, then they can trivially defeat the extension-resistent padding in the MD5/SHA-1 family (RIPEMD-160 is an exception). This is why HMAC is constructed as it is. The 1 stop-bit and the 0 padding bits, and the length field in SHA-1 are treated exactly like other data to be hashed, so if the attacker can construct a message where these extra bits appended to the message would make a valid message, they can continue on their merry way.

    HMAC (or CMAC) is the way to go, no doubt about it. You’re less clever than you think. Unless you’ve discovered some as of yet unpublished attack against HMAC-SHA1, trying to save a few cycles with some wacky homebrew MAC is like trying to save storage by storing years as two digits. Do it right and sleep well at night.

    @Jakub: your construction is (1) similar to HMAC (2) more costly to compute than HMAC (3) less well studied than HMAC and (4) generates . The good news is you’re on the right track.

  30. I should clarify that CMAC using AES is reasonable if you’re using hardware with acceleration for AES but without acceleration for SHA-1.

    @Jakub: that should read “(4) generates auth tokens twice as big as HMAC”.

    There are also some subtleties that seem counter-intuitive. I designed a CAPTCHA system for a former employer (for reasons out of my control, the challenges needed to be created in PHP and checked in Perl, and the two systems couldn’t use common storage, ruling out all existing options I could find). Let’s pretend I used HMAC-SHA1 (I used MD5( secret0 + data + secret1) where each secret was 128 bits. It’s not terrible, but don’t do this.) I truncated the HMAC output to 96 bits because this was much much stronger than the CAPTCHA anyway (30 bits), and in the case of some cryptographic weakness, it leaks 64 fewer bits to the attacker. (For instance, even if you’re silly and use plain SHA-1 with a secret key just at the beginning, truncating to 96 bits still gives you 64 bits of resistance to extension attacks. Don’t take this as an excuse to do things you know are wrong, however.)

    I also changed that employer’s Gnutella querykey implemetation to no longer depend on some big DES library and instead use TEA CBC-MAC (constant length messages, so extension attacks against CBC-MAC aren’t a problem). I also shortened the querykeys from 64-bits to 40-bits, since at that length an attacker still needs an average of about 130 billion wasted packets sent to each drone before they can start their amplification DDoS… and this must be done within the key rotation time window.

    That brings me to key rotation. Don’t assume you’ve done everything right, even if you’re using HMAC or CMAC. Have a pair of keys (key0 and key1). Let’s say these keys are for password change emails and will expire after 24 hours. Keep a creation timestamp for each key (not an expiry timestamp; always leave yourself the option to change the key lifetime on a running system). In your token creation routine, if key0 is older than 24 hours, make key1 = key0 and securely generate a new key0. In checking your tokens, start with key0, check if key0 is older than 48 hours then reject. If HMAC with key0 validates the message, then accept. Otherwise, check the timestamp of key1 and if it hasn’t expired check the HMAC with key1. Note that depending on exactly when the token is generated, it will actually be good from anywhere from 24 to 48 hours. You can tell them the real lifetime of the password change link, or you can just tell them it expires in 24 hours.

  31. Morning (Missing) Link? « Digital: Divide and Conquer

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>