Earlier today, a change was announced to Node.js that affects how it handles invalid UTF-8 in strings that are converted to buffers. Once again, I find myself checking over the UTF-8 validation code in websocket-driver, and once again I find I cannot ever remember how to make sense of this regex that performs the validation. I just copied it off a webpage once and it took a while (and reimplementing UTF-8 myself) to fully understand what it does. If you write software that processes text, you’ll probably need to understand this too, so I figured I’d write it down.
The first thing you need to understand is that Unicode and UTF-8 are not the same thing. Unicode is a standard that aims to assign a fixed number to every character and glyph in all the world’s writing systems. For example, number 65, or U+0041, is the uppercase letter ‘A’, 90 or U+005A is the uppercase letter ‘Z’, and 32 or U+0020 is a space. U+02A4 is the character ‘ʤ’, U+046C is ‘Ѭ’, U+0BF5 is ‘௵’, and so on. In all, these numbers or ‘code points’ range up to U+10FFFF, or 1,114,111.
A Unicode string, a sequence of characters, is really a sequence of these numbers, from 0 to 1,114,111. How those numbers translate to the glyphs you see on the screen depends on which font is used to render the text. But when we send text over a TCP connection or save it on disk, we store it as a sequence of fixed-width bytes. An 8-bit byte can only take one of 256 values, so how are we supposed to represent 1,114,112 possible code points? That’s where encodings come in.
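To make that concrete, here is a tiny sketch using standard JavaScript string methods (nothing here is specific to Node or to websocket-driver):

// Each character in a string is identified by its code point
'A'.codePointAt(0);          // 65  (U+0041)
'Z'.codePointAt(0);          // 90  (U+005A)
'ʤ'.codePointAt(0);          // 676 (U+02A4)

// And we can go from number back to character
String.fromCodePoint(0x46C); // 'Ѭ'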
UTF-8 is one of many encodings of Unicode. An encoding defines a mapping between a sequence of bytes and a sequence of code points, and tells us how to convert between them. UTF-8 is the encoding most commonly used on the web, and is mandated as the encoding of text messages in the WebSocket protocol.
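As a rough illustration of what an encoding does, Node’s built-in Buffer will convert between code points and UTF-8 bytes for us (this uses the modern Buffer.from API; the bytes shown are the real UTF-8 encoding of U+046C):

// Encoding: code points -> bytes
Buffer.from('Ѭ', 'utf8');                   // <Buffer d1 ac>

// Decoding: bytes -> code points
Buffer.from([0xd1, 0xac]).toString('utf8'); // 'Ѭ'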
So how does UTF-8 work? Well, the first thing to realise is we can’t just map all the code points onto bytes: most of them are too big. We can’t even do that for the code points 00 to FF, since that leaves us with no bytes to represent higher values. But we can do it for the range 00 to 7F (0 to 127) and leave the range 80 to FF to represent the other code points. The first 128 code points are represented literally using the least significant 7 bits of a single byte:

U+0000 to U+007F:
00000000 00 -- 7F 01111111
This is what makes UTF-8 special: rather than use 3 bytes to represent each code point (1,114,111 is 21 bits), it uses a variable number of bytes, between 1 and 4. The first 128 code points are 1 byte each, and all the remaining code points are made of combinations of the remaining 128 bytes. This has two main benefits, although one applies mostly to programmers and English speakers. The first is that UTF-8 is backward-compatible with ASCII: all valid ASCII documents are valid UTF-8 documents, per the above mapping. The second, which is a consequence of the first, is that we did not double or triple the volume of data required to transmit English text.
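A quick sketch of that ASCII compatibility, using nothing beyond the standard Buffer API:

// ASCII text produces exactly the same bytes under both encodings
Buffer.from('Hello', 'ascii'); // <Buffer 48 65 6c 6c 6f>
Buffer.from('Hello', 'utf8');  // <Buffer 48 65 6c 6c 6f>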
The 1-byte range gives us 7 bits to play with. To represent bigger numbers, we need more bytes, and UTF-8 defines the 2-byte range as being composed of pairs of the form 110xxxxx 10yyyyyy. The x and y bits are variable, giving us 11 bits to play with; this gets us up to U+07FF.

U+0080 to U+07FF:
11000010 C2 -- DF 11011111
10000000 80 -- BF 10111111
That is, the code point U+0080 becomes the bytes C2 80 and code point U+07FF becomes DF BF. Note that it is an error to use more space than you need to represent a code point: C1 BF or 11000001 10111111 could translate to U+007F, but we can represent this code point with only one byte, so C1 BF is not a legal byte sequence.
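If you want to see the bit arithmetic behind the 2-byte form, here is a minimal sketch (encode2 is a made-up name for illustration, not part of any library); note how it refuses to produce overlong sequences like C1 BF:

// Encode a code point in the range U+0080 to U+07FF as 110xxxxx 10yyyyyy
function encode2(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7FF) {
    throw new Error('not a 2-byte code point');
  }
  return [
    0xC0 | (codePoint >> 6),   // top 5 bits
    0x80 | (codePoint & 0x3F)  // bottom 6 bits
  ];
}

encode2(0x0080); // [ 0xC2, 0x80 ]
encode2(0x07FF); // [ 0xDF, 0xBF ]
encode2(0x007F); // throws rather than produce the overlong sequence C1 BF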
In general, multibyte code points are composed of a byte with the most significant bit set (i.e. bytes >= 80) followed by one or more bytes of the form 10xxxxxx. These trailing bytes can therefore range from 80 to BF. Bytes lower than 80 are used for 1-byte code points and it is an error to see them in a multibyte encoding. The value of the first byte tells us how many trailing bytes will follow it.
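That rule is easy to express in code; this is a hypothetical helper (sequenceLength is my own name), based purely on the bit patterns of the first byte:

// The high bits of the first byte tell us how long the sequence is
function sequenceLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((byte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0; // 10xxxxxx is a trailing byte and cannot start a sequence
}

sequenceLength(0x41); // 1 ('A')
sequenceLength(0xC2); // 2
sequenceLength(0xE0); // 3
sequenceLength(0xF0); // 4
sequenceLength(0x80); // 0

(Some lead bytes that fit these patterns, like C0, C1 and F5 to FF, are still invalid: they would only produce overlong sequences or code points beyond U+10FFFF, as we’ll see below.)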
Moving on to 3-byte code points, those are of the form 1110xxxx 10yyyyyy 10zzzzzz, giving us 16 bits of data and letting us reach code point U+FFFF.
However, we now run into an accident of history. Unicode was first described in
the Unicode 88 paper, which states:
The idea of expanding the basis for character encoding from 8 to 16 bits is so sensible, indeed so obvious, that the mind initially recoils from it. There must be a catch to it…
Are 16 bits, providing at most 65,536 distinct codes, sufficient to encode all characters of all the world’s scripts? Since the definition of a “character” is itself part of the design of a text encoding scheme, the question is meaningless unless it is restated as: Is it possible to engineer a reasonable definition of “character” such that all the world’s scripts contain fewer than 65,536 of them?
The answer to this is Yes.
– Joseph D. Becker PhD, ‘Unicode 88’
Of course, it turns out that the answer to this is in fact No, as you may have guessed from the existence of 1,114,112 modern-day code points. During the design of UTF-16 – another encoding of Unicode, originally conceived as a fixed two-bytes-per-codepoint scheme – it was realised that 16 bits were in fact insufficient to encode all known characters. So, the Unicode standard reserves a special range of code points that UTF-16 uses to encode values above FFFF. Such values are encoded using four bytes, that is two ‘regular’ code points, where the first two bytes are in the range D8 00 to DB FF and the second two are in the range DC 00 to DF FF. Code points in the range U+D800 to U+DFFF are called surrogates, and UTF-16 uses surrogate pairs to represent large values. No characters will ever be assigned to these code points and no encoding should attempt to represent them.
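Since JavaScript strings are themselves UTF-16, we can watch a surrogate pair being built with a little arithmetic (a sketch using only the constants from the UTF-16 scheme just described):

// U+1F600 is above U+FFFF, so UTF-16 splits it into two surrogates
const offset = 0x1F600 - 0x10000;         // 20 bits left to encode
const high   = 0xD800 | (offset >> 10);   // 0xD83D
const low    = 0xDC00 | (offset & 0x3FF); // 0xDE00

String.fromCharCode(high, low) === '\u{1F600}'; // true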
So, for our 3-byte range we can actually only encode the code points U+0800 to U+D7FF and U+E000 to U+FFFF.
U+0800 to U+D7FF:
11100000 E0 -- ED 11101101
10100000 A0 -- 9F 10011111
10000000 80 -- BF 10111111
U+E000 to U+FFFF:
11101110 EE -- EF 11101111
10000000 80 -- BF 10111111
10000000 80 -- BF 10111111
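The same bit arithmetic as before extends naturally to the 3-byte form; here is a sketch (encode3 is again a made-up name), including a check for the surrogate gap:

// Encode a code point from U+0800-U+D7FF or U+E000-U+FFFF
// as 1110xxxx 10yyyyyy 10zzzzzz
function encode3(codePoint) {
  if (codePoint < 0x800 || codePoint > 0xFFFF) {
    throw new Error('not a 3-byte code point');
  }
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    throw new Error('surrogates must not be encoded');
  }
  return [
    0xE0 | (codePoint >> 12),          // top 4 bits
    0x80 | ((codePoint >> 6) & 0x3F),  // middle 6 bits
    0x80 | (codePoint & 0x3F)          // bottom 6 bits
  ];
}

encode3(0x20AC); // [ 0xE2, 0x82, 0xAC ] -- '€'
encode3(0xD800); // throws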
And finally, we reach the 4-byte range, whose bytes look like 11110www 10xxxxxx 10yyyyyy 10zzzzzz, giving us 21 bits and letting us reach U+10FFFF. There are no gaps in this range, but we don’t need to use all possible byte values in this range to cover the remaining code points, so we end up with:
U+010000 to U+10FFFF:
11110000 F0 -- F4 11110100
10010000 90 -- 8F 10001111
10000000 80 -- BF 10111111
10000000 80 -- BF 10111111
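And the same idea once more for the 4-byte range (encode4 is also just an illustrative name):

// Encode a code point from U+10000 to U+10FFFF
// as 11110www 10xxxxxx 10yyyyyy 10zzzzzz
function encode4(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new Error('not a 4-byte code point');
  }
  return [
    0xF0 | (codePoint >> 18),           // top 3 bits
    0x80 | ((codePoint >> 12) & 0x3F),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F)
  ];
}

encode4(0x10000);  // [ 0xF0, 0x90, 0x80, 0x80 ]
encode4(0x10FFFF); // [ 0xF4, 0x8F, 0xBF, 0xBF ]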
We’ve now covered all the valid sequences of bytes that can represent a single character in UTF-8. They are:
[00-7F]
[C2-DF] [80-BF]
E0 [A0-BF] [80-BF]
[E1-EC] [80-BF] [80-BF]
ED [80-9F] [80-BF]
[EE-EF] [80-BF] [80-BF]
F0 [90-BF] [80-BF] [80-BF]
[F1-F3] [80-BF] [80-BF] [80-BF]
F4 [80-8F] [80-BF] [80-BF]
These can be encoded as a regex, but remember that regexes work on characters, not bytes. In Node, we can convert a buffer (a byte array) into a string whose characters are the literal code points given by those bytes (i.e. code points 0 to 255) using buffer.toString('binary'), and then compare that string to the regex to validate it.
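Putting the list above into that form might look something like this sketch (not the literal code from websocket-driver, just an illustration of the technique):

const UTF8_REGEX = new RegExp('^(?:' +
  '[\\x00-\\x7F]'                       + // 1 byte
  '|[\\xC2-\\xDF][\\x80-\\xBF]'         + // 2 bytes
  '|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]'    + // 3 bytes, E0 lead
  '|[\\xE1-\\xEC][\\x80-\\xBF]{2}'      +
  '|\\xED[\\x80-\\x9F][\\x80-\\xBF]'    + // excludes the surrogates
  '|[\\xEE-\\xEF][\\x80-\\xBF]{2}'      +
  '|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}' + // 4 bytes, F0 lead
  '|[\\xF1-\\xF3][\\x80-\\xBF]{3}'      +
  '|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}' +
  ')*$');

function isValidUTF8(buffer) {
  return UTF8_REGEX.test(buffer.toString('binary'));
}

isValidUTF8(Buffer.from([0x61, 0xC2, 0x80])); // true
isValidUTF8(Buffer.from([0xC1, 0xBF]));       // false (overlong)
isValidUTF8(Buffer.from([0xED, 0xA0, 0x80])); // false (surrogate)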
So, now that we understand UTF-8, we can understand what has changed in Node:
// Prior to these releases:
new Buffer('ab\ud800cd', 'utf8');
// <Buffer 61 62 ed a0 80 63 64>
// After this release:
new Buffer('ab\ud800cd', 'utf8');
// <Buffer 61 62 ef bf bd 63 64>
The character \ud800 is a surrogate, and is therefore not a valid character since no encoding should exist for it. However, JavaScript allows this string to exist without raising an error, so it was decided that converting such a string to a buffer should not raise an error either. Instead, that character is now replaced with '\ufffd', the replacement character. Rather than letting your program send a valid JS string to another peer that would reject it as invalid UTF-8, Node replaces it with a character that’s not a surrogate, preventing downstream errors. In general I’d advise against attempting to guess what the programmer meant when they provide malformed input, but since Unicode provides a code point that should be “used to replace an incoming character whose value is unknown or unrepresentable in Unicode”, this seems like a reasonable choice.
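For what it’s worth, the replacement behaviour is easy to observe in current versions of Node, using the modern Buffer.from API rather than the now-deprecated new Buffer:

const buf = Buffer.from('ab\ud800cd', 'utf8');
buf;                  // <Buffer 61 62 ef bf bd 63 64>
buf.toString('utf8'); // 'ab\ufffdcd'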