UTF-8: it’s what strings are made of

Earlier today, a change to Node.js was announced that affects how it handles invalid UTF-8 in strings that are converted to buffers. Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I can never remember how to make sense of the regex that performs the validation. I just copied it off a webpage once, and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too, so I figured I’d write it down.

The first thing you need to understand is that Unicode and UTF-8 are not the same thing. Unicode is a standard that aims to assign a fixed number to every character and glyph in all the world’s writing systems. For example, number 65, or U+0041, is the uppercase letter ‘A’; 90, or U+005A, is the uppercase letter ‘Z’; and 32, or U+0020, is a space. U+02A4 is the character ‘ʤ’, U+046C is ‘Ѭ’, U+0BF5 is ‘௵’, and so on. In all, these numbers or ‘code points’ range up to U+10FFFF, or 1,114,111.

A Unicode string, a sequence of characters, is really a sequence of these numbers, from 0 to 1,114,111. How those numbers translate to the glyphs you see on the screen depends on which font is used to render the text. But when we send text over a TCP connection or save it on disk, we store it as a sequence of fixed-width bytes. An 8-bit byte can only take one of 256 values, so how are we supposed to represent 1,114,112 possible code points? That’s where encodings come in.

UTF-8 is one of many encodings of Unicode. An encoding defines a mapping between a sequence of bytes and a sequence of code points, and tells us how to convert between them. UTF-8 is an encoding commonly used on the web, and is mandated as the encoding of text messages in the WebSocket protocol.

So how does UTF-8 work? Well, the first thing to realise is that we can’t just map all the code points onto bytes: most of them are too big. We can’t even do that for the code points 00 to FF, since that would leave us with no bytes to represent higher values. But we can do it for the range 00 to 7F (0 to 127) and leave the range 80 to FF to represent the other code points. The first 128 code points are represented literally, using the least significant 7 bits of a single byte:

    U+0000 to U+007F:

    00000000  00  --  7F  01111111
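
In Node you can see this identity mapping directly; this snippet is just an illustration I’m adding here, not one of the examples from the announcement:

    new Buffer('AZ ', 'utf8');
    // <Buffer 41 5a 20>

    // Every character is below U+0080, so the UTF-8 encoding of pure
    // ASCII text is byte-for-byte identical to the ASCII encoding.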

This is what makes UTF-8 special: rather than use 3 bytes to represent each code point (1,114,111 is 21 bits), it uses a variable number of bytes, between 1 and 4. The first 128 code points are 1 byte each, and all the remaining code points are made of combinations of the remaining 128 byte values. This has two main benefits, although one applies mostly to programmers and English speakers. The first is that UTF-8 is backward-compatible with ASCII: all valid ASCII documents are valid UTF-8 documents, per the above mapping. The second, which is a consequence of this, is that we did not double or triple the volume of data required to transmit English text.

The 1-byte range gives us 7 bits to play with. To represent bigger numbers, we need more bytes, and UTF-8 defines the 2-byte range as being composed of pairs of the form 110xxxxx 10yyyyyy. The x and y bits are variable, giving us 11 bits to play with; this gets us up to U+07FF.

    U+0080 to U+07FF:

    11000010  C2  --  DF  11011111
    10000000  80  --  BF  10111111

That is, the code point U+0080 becomes the bytes C2 80 and code point U+07FF becomes DF BF. Note that it is an error to use more space than you need to represent a code point: C1 BF or 11000001 10111111 could translate to U+007F but we can represent this code point with only one byte, so C1 BF is not a legal byte sequence.
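
As a sketch of the arithmetic (the helper name here is mine, not code from websocket-driver), here’s how a code point in this range packs into two bytes, and why C1 BF can never appear:

    // Encode a code point in the 2-byte range: 110xxxxx 10yyyyyy.
    // Assumes 0x80 <= codePoint <= 0x7FF.
    function encode2(codePoint) {
      return [
        0xC0 | (codePoint >> 6),   // top 5 bits  -> 110xxxxx
        0x80 | (codePoint & 0x3F)  // low 6 bits  -> 10yyyyyy
      ];
    }

    encode2(0x0080); // -> C2 80
    encode2(0x07FF); // -> DF BF

Because the code point is at least 0x80, the leading byte is always at least C2, which is why C0 and C1 never occur in valid UTF-8.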

In general, a multibyte code point is encoded as a leading byte with its two most significant bits set (i.e. a byte >= C0), followed by one or more continuation bytes of the form 10xxxxxx. These trailing bytes can therefore range from 80 to BF. Bytes lower than 80 are used for 1-byte code points and it is an error to see them inside a multibyte sequence. The value of the leading byte tells us how many trailing bytes will follow it.
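
That structure means the expected length of a sequence can be read straight off its leading byte. A minimal sketch of mine, using the byte ranges derived in this article:

    // How many bytes a UTF-8 sequence occupies, judged by its first byte.
    // Returns 0 for bytes that can never start a valid sequence.
    function sequenceLength(leadByte) {
      if (leadByte <= 0x7F) return 1;                      // 0xxxxxxx
      if (leadByte >= 0xC2 && leadByte <= 0xDF) return 2;  // 110xxxxx
      if (leadByte >= 0xE0 && leadByte <= 0xEF) return 3;  // 1110xxxx
      if (leadByte >= 0xF0 && leadByte <= 0xF4) return 4;  // 11110xxx
      return 0;  // 80-BF (continuation only), C0-C1, F5-FF
    }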

Moving on to 3-byte code points, those are of the form 1110xxxx 10yyyyyy 10zzzzzz, giving us 16 bits of data and letting us reach code point U+FFFF. However, we now run into an accident of history. Unicode was first described in the Unicode 88 paper, which states:

The idea of expanding the basis for character encoding from 8 to 16 bits is so sensible, indeed so obvious, that the mind initially recoils from it. There must be a catch to it…

Are 16 bits, providing at most 65,536 distinct codes, sufficient to encode all characters of all the world’s scripts? Since the definition of a “character” is itself part of the design of a text encoding scheme, the question is meaningless unless it is restated as: Is it possible to engineer a reasonable definition of “character” such that all the world’s scripts contain fewer than 65,536 of them?

The answer to this is Yes.

– Joseph D. Becker PhD, ‘Unicode 88’

Of course, it turns out that the answer to this is in fact No, as you may have guessed from the existence of 1,114,112 modern-day code points. During the design of UTF-16 – an encoding that began as a fixed two-bytes-per-code-point scheme – it was realised that 16 bits were in fact insufficient to encode all known characters. So, the Unicode standard reserves a special range of code points that UTF-16 uses to encode values above FFFF. Such values are encoded using four bytes, that is two 16-bit code units, where the first unit is in the range D800 to DBFF and the second is in the range DC00 to DFFF. Code points in the range U+D800 to U+DFFF are called surrogates, and UTF-16 uses surrogate pairs to represent large values. No characters will ever be assigned to these code points and no encoding should attempt to represent them.
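
For the curious, the surrogate arithmetic looks like this; it’s a sketch of how UTF-16 splits a large code point, not something UTF-8 itself ever does:

    // Split a code point above U+FFFF into a UTF-16 surrogate pair.
    // Assumes 0x10000 <= codePoint <= 0x10FFFF.
    function toSurrogatePair(codePoint) {
      var offset = codePoint - 0x10000;  // 20 bits left to encode
      return [
        0xD800 + (offset >> 10),   // high surrogate: D800 -- DBFF
        0xDC00 + (offset & 0x3FF)  // low surrogate:  DC00 -- DFFF
      ];
    }

    toSurrogatePair(0x10000);  // -> D800 DC00
    toSurrogatePair(0x10FFFF); // -> DBFF DFFF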

So, for our 3-byte range we can actually only encode the code points U+0800 to U+D7FF and U+E000 to U+FFFF.

    U+0800 to U+D7FF:

    11100000  E0  --  ED  11101101
    10100000  A0  --  9F  10011111
    10000000  80  --  BF  10111111

    U+E000 to U+FFFF:

    11101110  EE  --  EF  11101111
    10000000  80  --  BF  10111111
    10000000  80  --  BF  10111111
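
The bit-packing for three bytes follows the same pattern as before; again this is a sketch of mine, with the surrogate gap made explicit:

    // Encode a code point in the 3-byte range: 1110xxxx 10yyyyyy 10zzzzzz.
    // Assumes 0x800 <= codePoint <= 0xFFFF.
    function encode3(codePoint) {
      if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        throw new Error('surrogates cannot be encoded');

      return [
        0xE0 | (codePoint >> 12),          // top 4 bits
        0x80 | ((codePoint >> 6) & 0x3F),  // middle 6 bits
        0x80 | (codePoint & 0x3F)          // low 6 bits
      ];
    }

    encode3(0x20AC); // -> E2 82 AC, the euro sign ‘€’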

And finally, we reach the 4-byte range, whose bytes look like 11110www 10xxxxxx 10yyyyyy 10zzzzzz, giving us 21 bits and letting us reach U+10FFFF. There are no gaps in this range, but we don’t need to use all possible byte values in this range to cover the remaining code points, so we end up with:

    U+010000 to U+10FFFF:

    11110000  F0  --  F4  11110100
    10010000  90  --  8F  10001111
    10000000  80  --  BF  10111111
    10000000  80  --  BF  10111111
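
And the same pattern once more for four bytes, purely as an illustration:

    // Encode a code point in the 4-byte range:
    // 11110www 10xxxxxx 10yyyyyy 10zzzzzz.
    // Assumes 0x10000 <= codePoint <= 0x10FFFF.
    function encode4(codePoint) {
      return [
        0xF0 | (codePoint >> 18),           // top 3 bits
        0x80 | ((codePoint >> 12) & 0x3F),
        0x80 | ((codePoint >> 6) & 0x3F),
        0x80 | (codePoint & 0x3F)
      ];
    }

    encode4(0x1F600); // -> F0 9F 98 80, the ‘😀’ emoji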

We’ve now covered all the valid sequences of bytes that can represent a single character in UTF-8. They are:

  • [00-7F]
  • [C2-DF] [80-BF]
  • E0 [A0-BF] [80-BF]
  • [E1-EC] [80-BF] [80-BF]
  • ED [80-9F] [80-BF]
  • [EE-EF] [80-BF] [80-BF]
  • F0 [90-BF] [80-BF] [80-BF]
  • [F1-F3] [80-BF] [80-BF] [80-BF]
  • F4 [80-8F] [80-BF] [80-BF]

These ranges can be encoded as a regex, but remember that regexes work on characters, not bytes. In Node, we can convert a buffer (a byte array) into a string whose characters are the literal code points given by those bytes (i.e. code points 0 to 255) using buffer.toString('binary'), and then match that string against the regex to validate it.
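
Here’s one way those ranges translate into a regex; this is a sketch built directly from the list above, not the exact code in websocket-driver:

    // One alternative per row of the list of valid sequences.
    var VALID_UTF8 = new RegExp('^(?:' + [
      '[\\x00-\\x7F]',                      // 1 byte:  U+0000  - U+007F
      '[\\xC2-\\xDF][\\x80-\\xBF]',         // 2 bytes: U+0080  - U+07FF
      '\\xE0[\\xA0-\\xBF][\\x80-\\xBF]',    // 3 bytes: U+0800  - U+0FFF
      '[\\xE1-\\xEC][\\x80-\\xBF]{2}',      //          U+1000  - U+CFFF
      '\\xED[\\x80-\\x9F][\\x80-\\xBF]',    //          U+D000  - U+D7FF
      '[\\xEE-\\xEF][\\x80-\\xBF]{2}',      //          U+E000  - U+FFFF
      '\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}', // 4 bytes: U+10000 - U+3FFFF
      '[\\xF1-\\xF3][\\x80-\\xBF]{3}',      //          U+40000 - U+FFFFF
      '\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}'  //          U+100000 - U+10FFFF
    ].join('|') + ')*$');

    function isValidUTF8(buffer) {
      return VALID_UTF8.test(buffer.toString('binary'));
    }

    isValidUTF8(new Buffer([0xE2, 0x82, 0xAC])); // true  (the euro sign)
    isValidUTF8(new Buffer([0xC1, 0xBF]));       // false (overlong encoding)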

So, now that we understand UTF-8, we can understand what has changed in Node:

    // Prior to this change:
    new Buffer('ab\ud800cd', 'utf8');
    // <Buffer 61 62 ed a0 80 63 64>

    // After this change:
    new Buffer('ab\ud800cd', 'utf8');
    // <Buffer 61 62 ef bf bd 63 64>

The character \ud800 is a surrogate, and is therefore not a valid character: no encoding should ever represent it. However, JavaScript allows this string to exist without raising an error, so it was decided that converting such a string to a buffer should not raise an error either. Instead, that character is now replaced with '\ufffd', the replacement character. Rather than letting your program emit bytes that another peer would reject as invalid UTF-8, Node swaps the offending character for one that is not a surrogate, preventing downstream errors. In general I’d advise against attempting to guess what the programmer meant when they provide malformed input, but since Unicode provides a code point that is meant to be "used to replace an incoming character whose value is unknown or unrepresentable in Unicode", this seems like a reasonable choice.
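
If you want to catch these strings yourself before handing them to Buffer, a check for lone surrogates might look like this sketch (scrub is a hypothetical helper of mine, not anything Node provides):

    // Replace lone surrogates with U+FFFD; properly paired surrogates
    // are left alone. A sketch of the idea, not Node's actual code.
    function scrub(string) {
      return string.replace(/[\uD800-\uDFFF]/g, function(chr, index) {
        var code = chr.charCodeAt(0);
        if (code <= 0xDBFF) {
          // high surrogate: keep it only if a low surrogate follows
          var next = string.charCodeAt(index + 1);
          return (next >= 0xDC00 && next <= 0xDFFF) ? chr : '\ufffd';
        } else {
          // low surrogate: keep it only if a high surrogate precedes
          var prev = string.charCodeAt(index - 1);
          return (prev >= 0xD800 && prev <= 0xDBFF) ? chr : '\ufffd';
        }
      });
    }

    new Buffer(scrub('ab\ud800cd'), 'utf8');
    // <Buffer 61 62 ef bf bd 63 64>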