A couple of days ago I pushed the latest release of faye-websocket for Node and Ruby. The only user-facing change in version 0.5 is that the library now better supports the I/O conventions of each platform; on Node this means WebSocket objects are now duplex streams so making an echo server is as simple as:
    var http      = require('http'),
        WebSocket = require('faye-websocket');

    var server = http.createServer();

    server.on('upgrade', function(request, socket, body) {
      var ws = new WebSocket(request, socket, body);
      ws.pipe(ws);
    });

    server.listen(8000);
On Ruby, it means that Faye::WebSocket now supports the rack.hijack API for
accessing the TCP socket, which means you can now use it to handle WebSockets in
apps served by Puma, Rainbows 4.5, Phusion Passenger 4.0, and other servers.
But there’s a much bigger change behind the scenes, which is that faye-websocket is now powered by an I/O agnostic WebSocket protocol module called websocket-driver, available for Node and Ruby. The entire protocol is encapsulated in that module such that all the user needs to do is supply some means of doing I/O. faye-websocket is now just a thin module that hooks websocket-driver up to various I/O systems, such as Rack and Node web servers or TCP/TLS sockets on the client side.
I started work on this a few weeks ago when the authors of Celluloid and
Puma asked me if faye-websocket could be used to add WebSocket support to
those systems. I said it could probably already do this, since Poltergeist
and Terminus have been using the protocol classes with Ruby’s TCPServer
for a while without too much effort. So I began extracting these classes into
their own library, and wrote the beginnings of some documentation for them.
But as I got into explaining how to use this new library, I noticed how hard it was to use correctly. Loads of protocol details were leaking out of these classes and would have to be reimplemented by users. For example, here’s a pseudocode-ish outline of how the client would have to process data it received over TCP. If it looks complicated, that’s because it is complicated, but I’ll explain it soon enough.
    class Client
      def initialize(url)
        @uri       = URI.parse(url)
        @parser    = Faye::WebSocket::HybiParser.new(url, :masking => true)
        @state     = :connecting
        @tcp       = tcp_connect(@uri.host, @uri.port || 80)
        @handshake = @parser.create_handshake

        @tcp.write(@handshake.request_data)
        loop { parse(@tcp.read) }
      end

      def parse(data)
        case @state
        when :connecting
          leftovers = @handshake.parse(data)
          return unless @handshake.complete?
          if @handshake.valid?
            @state = :open
            parse(leftovers)
            @queue.each { |msg| send(msg) } if @queue
          else
            @state = :closed
          end
        when :open, :closing
          @parser.parse(data)
        end
      end

      def send(message)
        case @state
        when :connecting
          @queue ||= []
          @queue << message
        when :open
          data = @parser.frame(message, :text)
          @tcp.write(data)
        end
      end
    end
But using websocket-driver the equivalent implementation would be:
    class Client
      attr_reader :url

      def initialize(url)
        @url    = url
        @uri    = URI.parse(url)
        @driver = WebSocket::Driver.client(self)
        @tcp    = tcp_connect(@uri.host, @uri.port || 80)

        @driver.start
        loop { parse(@tcp.read) }
      end

      def parse(data)
        @driver.parse(data)
      end

      def send(message)
        @driver.text(message)
      end

      def write(data)
        @tcp.write(data)
      end
    end
So before, the client had to implement code to create a handshake request, route the input stream according to whether it was currently parsing the HTTP handshake headers or a WebSocket frame, and switch state accordingly, remembering to parse the leftovers. It’s entirely possible to receive the handshake headers and some WebSocket frame data in the same data chunk, and you can’t drop that frame data. Because of the design of the WebSocket wire format, dropping or misinterpreting even one byte of input changes the meaning of the rest of the stream, possibly leading to behaviour an attacker might exploit.
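To make the leftovers problem concrete, here is a minimal sketch of the split the old client had to get right. The `split_handshake` helper and the `HEADER_END` constant are hypothetical names invented for this illustration; neither comes from faye-websocket or websocket-driver, and a real handshake parser does a good deal more than look for a blank line.

```ruby
# Hypothetical helper: split a received chunk at the end of the HTTP
# handshake headers, keeping whatever bytes arrived after them.
HEADER_END = "\r\n\r\n".b

def split_handshake(buffer)
  buffer = buffer.b                        # work on raw bytes
  index  = buffer.index(HEADER_END)
  return nil if index.nil?                 # headers are not complete yet

  cut = index + HEADER_END.bytesize
  [buffer.byteslice(0, cut), buffer.byteslice(cut, buffer.bytesize - cut)]
end

# One TCP read that contains the handshake response AND the first frame.
chunk = "HTTP/1.1 101 Switching Protocols\r\n" \
        "Upgrade: websocket\r\n" \
        "Connection: Upgrade\r\n\r\n" \
        "\x81\x05Hello"

headers, leftovers = split_handshake(chunk)
p leftovers   # frame bytes that must be handed on to the frame parser
```

If the caller throws `leftovers` away, the frame parser starts reading in the middle of the stream, which is exactly the failure described above.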
It also had to maintain state around sending messages out, since messages can
only be sent after the handshake is complete. So if you tried to send a message
while in the :connecting state, it would put the message in a queue and deliver
it once the handshake was complete.
When we switch to websocket-driver, all those concerns go away. We treat the
whole TCP input stream as one stream of data, because that’s what it is. We
stream all incoming bytes to the driver and let it deal with managing state. It
will emit events to tell us when interesting things happen, like the handshake
completing or a message being received. When we want to send a message, we tell
the driver to format it as a text frame. If the driver knows the handshake is
not complete it will queue it and deliver it when it’s safe to do so. In the
second example, we don’t even mention the concept of handshakes: the user
doesn’t need to know anything about how the protocol works to use the driver
correctly. The new Client class just hooks the driver up to a TCP socket and
provides an API for sending messages.
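The queueing behaviour can be sketched with a toy stand-in. `ToyDriver`, `RecordingIO` and `handshake_complete` are hypothetical names made up for this example, and the strings written are placeholders rather than real protocol bytes. This is not the websocket-driver API, only an illustration of a library deciding when output happens.

```ruby
# Toy stand-in for a protocol driver (NOT the websocket-driver API).
class ToyDriver
  def initialize(io)
    @io    = io            # any object that responds to write(data)
    @state = :connecting
    @queue = []
  end

  def start
    @io.write("<handshake request>")          # placeholder, not real headers
  end

  def text(message)
    if @state == :open
      @io.write("<text frame: #{message}>")   # placeholder, not a real frame
    else
      @queue << message                       # too early: hold it for later
    end
  end

  # Stand-in for the moment a real driver notices, while parsing input,
  # that the handshake has finished.
  def handshake_complete
    @state = :open
    @queue.shift(@queue.size).each { |message| text(message) }
  end
end

# Records what the driver asks us to write, in order.
class RecordingIO
  attr_reader :written

  def initialize
    @written = []
  end

  def write(data)
    @written << data
  end
end

io     = RecordingIO.new
driver = ToyDriver.new(io)
driver.start
driver.text("Hello")          # queued, the handshake is not done
driver.handshake_complete     # the queue is flushed here
driver.text("World")          # written straight away
puts io.written
```

The caller never checks the connection state. It says what it wants to send and the driver picks the moment to call `write`.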
The driver produces TCP output by calling the client’s write() method with the
data we should send over the socket. When we call @driver.start, the driver
calls client.write with a string containing handshake request headers. When we
call @driver.text("Hello"), the driver will call client.write("\x81\x05Hello")
(for unmasked frames), either immediately or after the handshake is complete.
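For the curious, this is roughly where those bytes come from. The sketch below builds a single unfragmented text frame following the layout in RFC 6455. The `text_frame` method and its `mask_key` parameter are hypothetical and not part of websocket-driver, whose whole point is that you never have to write this yourself.

```ruby
# Hypothetical helper: encode one unfragmented text frame (RFC 6455 layout).
# mask_key, when given, must be a 4-byte string.
def text_frame(message, mask_key: nil)
  payload  = message.b
  length   = payload.bytesize
  mask_bit = mask_key ? 0x80 : 0x00
  header   = [0x81].pack("C")                  # FIN bit + text opcode (0x1)

  if length <= 125
    header << [mask_bit | length].pack("C")
  elsif length <= 0xFFFF
    header << [mask_bit | 126, length].pack("Cn")    # 16-bit length follows
  else
    header << [mask_bit | 127, length].pack("CQ>")   # 64-bit length follows
  end

  return header + payload unless mask_key

  # Masked frames XOR each payload byte with the key, cycling every 4 bytes.
  key    = mask_key.bytes
  masked = payload.bytes.each_with_index.map { |byte, i| byte ^ key[i % 4] }
  header + mask_key.b + masked.pack("C*")
end

p text_frame("Hello")
p text_frame("Hello", mask_key: "\x37\xfa\x21\x3d".b)
```

The "for unmasked frames" caveat matters because RFC 6455 requires clients to mask every frame they send, so a real client's output carries a masking key and a scrambled payload instead of the literal text.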
This final point highlights the core problem with a lot of protocol libraries. By taking a strictly object-oriented approach where all protocol state is encapsulated and objects send commands to one another, we’ve allowed the protocol library to control when output happens, not just what output happens. A protocol is not just some functions for parsing and formatting between TCP streams and domain messages, it’s a sequence of actions that must be performed in a certain order by two or more parties in order to reach a goal. A protocol library, if it wishes to help users deploy the protocol correctly and safely, should drive the user’s code by telling it when to do certain things, not just give the user a bag of parsing ingredients and ask them to glue them together in the right order.
The fact that other protocol libraries have no means of telling the user when to send certain bits of output means that they end up leaking a lot of protocol details into the user’s code. For example, WebSocket has various control frames aside from those for text and binary messages. If you receive a ‘ping’ frame, you must respond by sending a ‘pong’ frame containing the same payload. If you receive a ‘close’ frame, you should respond in kind and then close the connection. If you receive malformed data you should send a ‘close’ frame with an error code and then close the connection. So there are various situations when the parser should react, not by yielding the data to the user, but by telling the user to send certain types of responses. But the most-downloaded Ruby library for this (websocket) handles the latter case by yielding the data to the user and expecting them to do the right thing.
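Here is a rough sketch of what "reacting" means for the first two of those rules. The `react` and `control_frame` methods and the shape of the returned hash are assumptions made for this illustration, not any library's API. The sketch only handles short, unmasked frames as a server would send them, and it ignores close-payload validation.

```ruby
# Opcodes defined by RFC 6455.
TEXT   = 0x1
BINARY = 0x2
CLOSE  = 0x8
PING   = 0x9
PONG   = 0xA

# Hypothetical helper: a short unmasked control frame (payload of at most
# 125 bytes), as a server would send it.
def control_frame(opcode, payload)
  [0x80 | opcode, payload.bytesize].pack("CC") + payload.b
end

# Hypothetical decision table: given one parsed frame, what should happen?
#   :deliver  data to hand to the application
#   :write    bytes the library itself must send in response
#   :close    whether the connection should then be closed
def react(opcode, payload)
  case opcode
  when TEXT, BINARY
    { deliver: payload }
  when PING
    { write: control_frame(PONG, payload) }          # same payload back
  when CLOSE
    { write: control_frame(CLOSE, payload), close: true }
  else
    {}                                               # e.g. a stray pong
  end
end

p react(PING, "abc")
p react(TEXT, "Hello")
```

Only the `deliver` branch is really the application's business. Everything under `write` is dictated by the protocol, which is why it belongs inside the library.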
I’ve tried reimplementing faye-websocket’s Client class on top of websocket and
the amount of boilerplate required is huge if you want to produce a correct
implementation. Here’s a laundry list of stuff you need to implement yourself
(links are to relevant sections of code):
- Know that there is a handshake process and send a handshake over the TCP connection
- Manage the state of the parser and route input either to the handshake or to the frame parser
- Switch modes when the handshake is complete
- Remember to parse frame data received in the same chunk as the handshake
- Queue messages if the handshake is not done and flush the queue after the handshake
- Respond to pings by sending a matching pong frame
- Validate the payload and encoding of close frames and extract the closing code from the first two bytes of the payload
- Know which closing codes are allowed and return an error code if you receive one that is not
- Respond to close frames with a matching frame with the same code
- Know about all the types of errors and that you should send a close frame if any of them happen
- Remember which version of the protocol is in use and tell the frame formatter to use this version when creating frames
- Do all this while accounting for the fact that multiple versions of the WebSocket protocol exist that have very different requirements
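To give a flavour of just the close-frame items on that list, here is a minimal sketch of validating a close payload. The `parse_close_payload` method and its return shape are hypothetical. The code list reflects my reading of RFC 6455 section 7.4, and answering every failure with 1002 (protocol error) is a simplifying assumption, not a statement of what any particular library does.

```ruby
# Close codes that may appear in a close frame, per RFC 6455 section 7.4.
# 1004, 1005, 1006 and 1015 are reserved and must not be sent on the wire.
VALID_CLOSE_CODES = [1000, 1001, 1002, 1003, 1007, 1008, 1009, 1010, 1011].freeze
APP_CLOSE_CODES   = (3000..4999)    # registered and private-use ranges
PROTOCOL_ERROR    = 1002

# Hypothetical helper: returns { code:, reason: } for an acceptable payload,
# or { error: code } naming the close code we should reply with.
def parse_close_payload(payload)
  payload = payload.b
  return { code: nil, reason: "" } if payload.empty?         # no code given
  return { error: PROTOCOL_ERROR } if payload.bytesize == 1  # half a code

  code   = payload.unpack1("n")     # first two bytes, big-endian
  reason = payload.byteslice(2, payload.bytesize - 2)
                  .force_encoding(Encoding::UTF_8)

  unless VALID_CLOSE_CODES.include?(code) || APP_CLOSE_CODES.cover?(code)
    return { error: PROTOCOL_ERROR }
  end
  return { error: PROTOCOL_ERROR } unless reason.valid_encoding?

  { code: code, reason: reason }
end

p parse_close_payload([1000].pack("n") + "bye")
p parse_close_payload([1005].pack("n"))
```

That is one helper for three bullet points, and none of it involves a decision the application author gets to make.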
So this protocol library not only leaks by making the user track the state of the connection and the state of the parser, but also makes them implement stuff the protocol should deal with. Almost all the above points are behaviours set down in the specification; the user must implement them this way or their deployment is buggy. Since the user has no meaningful control over how this stuff works, all this code is just boilerplate that requires significant knowledge to write correctly. In contrast, faye-websocket and websocket-driver have never emitted events on ping frames because the user has no choice over how to handle them, so why make them write code for that? In websocket-driver, all the above points (and this list is not exhaustive) are dealt with by the protocol library and this gives users a much better hope of deploying WebSockets correctly and safely.
I’m not saying the websocket library is broken, per se. I’m saying it doesn’t go
far enough. In Ruby we have lots of different means of doing network I/O, and
there are a few in Node if you consider HTTP servers and TCP/TLS sockets, though
they all have similar interfaces. If you want to build a widely useful protocol
library, you should solve as many problems as possible for the user so that they
just need to bring some I/O and they’re pretty much done. Asking the user to
rebuild half the protocol themselves is a recipe for bugs, security holes and
wasted effort. We shouldn’t have to rebuild each protocol for every I/O stack we
invent, so let’s stop.
Let the user tell you what they want to do, and then tell their code how and when to realize this over TCP. If you find yourself explaining the protocol mechanics when you’re documenting your library, it’s not simple enough yet. Refactor until I don’t need to read the RFC to deploy it properly.