websocket-driver: an I/O-agnostic WebSocket module, or, why most protocol libraries aren't

A couple of days ago I pushed the latest release of faye-websocket for Node and Ruby. The only user-facing change in version 0.5 is that the library now better supports the I/O conventions of each platform. On Node this means WebSocket objects are now duplex streams, so making an echo server is as simple as:

var http      = require('http'),
    WebSocket = require('faye-websocket');

var server = http.createServer();

server.on('upgrade', function(request, socket, body) {
  var ws = new WebSocket(request, socket, body);
  ws.pipe(ws);
});

server.listen(8000);

On Ruby, it means that Faye::WebSocket now supports the rack.hijack API for accessing the TCP socket, which means you can now use it to handle WebSockets in apps served by Puma, Rainbows 4.5, Phusion Passenger 4.0, and other servers.

But there’s a much bigger change behind the scenes, which is that faye-websocket is now powered by an I/O-agnostic WebSocket protocol module called websocket-driver, available for Node and Ruby. The entire protocol is encapsulated in that module, such that all the user needs to do is supply some means of doing I/O. faye-websocket is now just a thin module that hooks websocket-driver up to various I/O systems, such as Rack and Node web servers or TCP/TLS sockets on the client side.

I started work on this a few weeks ago when the authors of Celluloid and Puma asked me if faye-websocket could be used to add WebSocket support to those systems. I said it could probably already do this, since Poltergeist and Terminus have been using the protocol classes with Ruby’s TCPServer for a while without too much effort. So I began extracting these classes into their own library, and wrote the beginnings of some documentation for them.

But as I got into explaining how to use this new library, I noticed how hard it was to use correctly. Loads of protocol details were leaking out of these classes and would have to be reimplemented by users. For example, here’s a pseudocode-ish outline of how the client would have to process data it received over TCP. If it looks complicated, that’s because it is complicated, but I’ll explain it soon enough.

class Client
  def initialize(url)
    @uri       = URI.parse(url)
    @parser    = Faye::WebSocket::HybiParser.new(url, :masking => true)
    @state     = :connecting
    @tcp       = tcp_connect(@uri.host, @uri.port || 80)
    @handshake = @parser.create_handshake

    @tcp.write(@handshake.request_data)
    loop { parse(@tcp.read) }
  end

  def parse(data)
    case @state
    when :connecting
      leftovers = @handshake.parse(data)
      return unless @handshake.complete?
      if @handshake.valid?
        @state = :open
        parse(leftovers)
        @queue.each { |msg| send(msg) } if @queue
      else
        @state = :closed
      end
    when :open, :closing
      @parser.parse(data)
    end
  end

  def send(message)
    case @state
    when :connecting
      @queue ||= []
      @queue << message
    when :open
      data = @parser.frame(message, :text)
      @tcp.write(data)
    end
  end
end

But using websocket-driver the equivalent implementation would be:

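require 'uri'
require 'websocket/driver'

class Client
  # The driver reads the URL from this accessor to build the handshake
  attr_reader :url

  def initialize(url)
    @url    = url
    @uri    = URI.parse(url)
    @driver = WebSocket::Driver.client(self)
    # tcp_connect is a placeholder for however you open a socket
    @tcp    = tcp_connect(@uri.host, @uri.port || 80)

    @driver.start
    loop { parse(@tcp.read) }
  end

  # Feed every incoming byte to the driver; it tracks the protocol state
  def parse(data)
    @driver.parse(data)
  end

  def send(message)
    @driver.text(message)
  end

  # Called by the driver whenever it has bytes to put on the wire
  def write(data)
    @tcp.write(data)
  end
end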

So before, the client had to create a handshake request, route the input stream according to whether it was currently parsing the HTTP handshake headers or a WebSocket frame, switch state accordingly, and remember to parse the leftovers. It’s entirely possible to receive the handshake headers and some WebSocket frame data in the same data chunk, and you can’t drop that frame data. Because of the design of the WebSocket wire format, dropping or misinterpreting even one byte of input changes the meaning of the rest of the stream, possibly leading to behaviour an attacker might exploit.
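
To make the leftovers problem concrete, here’s a minimal sketch in plain Ruby, with no WebSocket library involved, of one TCP chunk that carries both the end of the handshake and the start of the frame data. The split_handshake helper is a hypothetical name of my own, and a real parser would also have to buffer across chunks and validate the headers.

```ruby
# Hypothetical helper (not part of any library): splits a buffer at the
# blank line that ends the HTTP handshake headers. Returns
# [headers, leftovers], or nil if the headers are not complete yet.
def split_handshake(buffer)
  buffer   = buffer.b                      # work on raw bytes
  boundary = buffer.index("\r\n\r\n".b)
  return nil unless boundary

  header_end = boundary + 4
  [buffer[0, header_end], buffer[header_end..-1]]
end

# One chunk holding the server's handshake response *and* a text frame.
chunk = "HTTP/1.1 101 Switching Protocols\r\n" \
        "Upgrade: websocket\r\n" \
        "Connection: Upgrade\r\n\r\n" \
        "\x81\x05Hello"

headers, leftovers = split_handshake(chunk)
# leftovers now holds the frame bytes, which must go to the frame parser
```

If the caller forgets about leftovers, the seven bytes of the first frame silently vanish and everything after them is misread.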

It also had to maintain state around sending messages out, since messages can only be sent after the handshake is complete. So if you tried to send a message while in the :connecting state, it would put the message in a queue and deliver it once the handshake was complete.

When we switch to websocket-driver, all those concerns go away. We treat the whole TCP input stream as one stream of data, because that’s what it is. We stream all incoming bytes to the driver and let it deal with managing state. It will emit events to tell us when interesting things happen, like the handshake completing or a message being received. When we want to send a message, we tell the driver to format it as a text frame. If the driver knows the handshake is not complete it will queue it and deliver it when it’s safe to do so. In the second example, we don’t even mention the concept of handshakes: the user doesn’t need to know anything about how the protocol works to use the driver correctly. The new Client class just hooks the driver up to a TCP socket and provides an API for sending messages.
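
Here’s a rough sketch of that idea in plain Ruby. ToyDriver and RecordingSocket are hypothetical classes I’ve made up for illustration; this is not how websocket-driver is implemented, the “handshake” check is deliberately naive, and the frames are unmasked. It only shows the shape of the design: the driver owns the connection state and decides when bytes reach the socket’s write method.

```ruby
# ToyDriver is a hypothetical stand-in, NOT websocket-driver's actual code.
class ToyDriver
  def initialize(socket)
    @socket  = socket        # any object that responds to write(data)
    @state   = :connecting
    @queue   = []
    @on_open = nil
  end

  # Register a callback to run once the handshake completes.
  def on_open(&block)
    @on_open = block
  end

  # Simplified: treats any input containing a blank line as a completed
  # handshake. A real driver parses and validates the response here.
  def parse(data)
    return unless @state == :connecting && data.include?("\r\n\r\n")

    @state = :open
    @on_open.call if @on_open
    @queue.each { |message| @socket.write(frame(message)) }
    @queue.clear
  end

  # Send now if the connection is open, otherwise hold the message.
  def text(message)
    if @state == :open
      @socket.write(frame(message))
    else
      @queue << message
    end
  end

  private

  # Unmasked text frame, valid only for payloads shorter than 126 bytes.
  def frame(message)
    [0x81, message.bytesize].pack('C2') + message.b
  end
end

# Stand-in for a TCP socket that just remembers what was written.
class RecordingSocket
  attr_reader :written

  def initialize
    @written = []
  end

  def write(data)
    @written << data
  end
end

socket = RecordingSocket.new
driver = ToyDriver.new(socket)

driver.text("Hello")                                      # queued, not written
driver.parse("HTTP/1.1 101 Switching Protocols\r\n\r\n")  # handshake done, queue flushed
```

The calling code never checks the connection state; it says what it wants to send and the driver works out when that is allowed.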

The driver produces TCP output by calling the client’s write() method with the data we should send over the socket. When we call @driver.start, the driver calls client.write with a string containing handshake request headers. When we call @driver.text("Hello"), the driver will call client.write("\x81\x05Hello") (for unmasked frames), either immediately or after the handshake is complete.
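
If those bytes look opaque, this small sketch decodes the two header bytes of that example frame by hand. It covers only the simplest case, an unmasked frame with a payload shorter than 126 bytes, and the variable names are mine.

```ruby
frame = "\x81\x05Hello".b

first, second = frame.unpack('C2')

fin     = (first & 0x80) != 0    # top bit set: this is the final fragment
opcode  = first & 0x0F           # 0x1 means a text frame
masked  = (second & 0x80) != 0   # top bit of second byte: payload is masked
length  = second & 0x7F          # payload length, when it is below 126
payload = frame[2, length]       # the bytes of "Hello"
```

Every later byte is interpreted relative to that length field, which is why losing a single byte corrupts the rest of the stream.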

This final point highlights the core problem with a lot of protocol libraries. By taking a strictly object-oriented approach where all protocol state is encapsulated and objects send commands to one another, we’ve allowed the protocol library to control when output happens, not just what output happens. A protocol is not just some functions for parsing and formatting between TCP streams and domain messages; it’s a sequence of actions that must be performed in a certain order by two or more parties in order to reach a goal. A protocol library, if it wishes to help users deploy the protocol correctly and safely, should drive the user’s code by telling it when to do certain things, not just give the user a bag of parsing ingredients and ask them to glue them together in the right order.

The fact that other protocol libraries have no means of telling the user when to send certain bits of output means that they end up leaking a lot of protocol details into the user’s code. For example, WebSocket has various control frames aside from those for text and binary messages. If you receive a ‘ping’ frame, you must respond by sending a ‘pong’ frame containing the same payload. If you receive a ‘close’ frame, you should respond in kind and then close the connection. If you receive malformed data you should send a ‘close’ frame with an error code and then close the connection. So there are various situations when the parser should react, not by yielding the data to the user, but by telling the user to send certain types of responses. But the most-downloaded Ruby library for this (websocket) handles these situations by yielding the data to the user and expecting them to do the right thing.
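
As a sketch of what “the parser should react” means, here’s some plain Ruby that maps an incoming control frame to the reply the protocol requires. The names OPCODES, control_frame and reply_to are hypothetical, the frames are unmasked, and malformed input isn’t handled; a real implementation has much more to check.

```ruby
# Control frame opcodes as defined by the WebSocket protocol.
OPCODES = { close: 0x8, ping: 0x9, pong: 0xA }.freeze

# Builds an unmasked control frame. Control payloads may not exceed 125 bytes.
def control_frame(type, payload = "")
  payload = payload.b
  raise ArgumentError, "control payload too long" if payload.bytesize > 125

  [0x80 | OPCODES.fetch(type), payload.bytesize].pack('C2') + payload
end

# Returns the frame we are obliged to send in response to an incoming
# control frame, or nil when no automatic reply is needed.
def reply_to(opcode, payload)
  case opcode
  when OPCODES[:ping]  then control_frame(:pong, payload)          # same payload back
  when OPCODES[:close] then control_frame(:close, payload.b[0, 2]) # echo the closing code
  end
end

pong  = reply_to(0x9, "abc")
close = reply_to(0x8, [1000].pack('n') + "bye")
```

None of this involves a decision by the application, which is the argument for keeping it inside the library.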

I’ve tried reimplementing faye-websocket’s Client class on top of websocket and the amount of boilerplate required is huge if you want to produce a correct implementation. Here’s a laundry list of stuff you need to implement yourself:

Know that there is a handshake process and send a handshake over the TCP connection
Manage the state of the parser and route input either to the handshake or to the frame parser
Switch modes when the handshake is complete
Remember to parse frame data received in the same chunk as the handshake
Queue messages if the handshake is not done and flush the queue after the handshake
Respond to pings by sending a matching pong frame
Validate the payload and encoding of close frames and extract the closing code from the first two bytes of the payload
Know which closing codes are allowed and return an error code if you receive one that isn’t
Respond to close frames with a matching frame with the same code
Know about all the types of errors and that you should send a close frame if any of them happen
Remember which version of the protocol is in use and tell the frame formatter to use this version when creating frames
Do all this while accounting for the fact that multiple versions of the WebSocket protocol exist that have very different requirements
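
To give a flavour of just the close-frame items in that list, here’s a sketch of the validation in plain Ruby. The parse_close name is hypothetical, and the set of permitted codes is my reading of RFC 6455 section 7.4, so treat it as an assumption and check it against the RFC before relying on it.

```ruby
# Assumption: closing codes permitted on the wire, per my reading of
# RFC 6455 section 7.4. Verify against the RFC before relying on this.
DEFINED_CODES     = [1000, 1001, 1002, 1003, 1007, 1008, 1009, 1010, 1011].freeze
APPLICATION_CODES = (3000..4999)
PROTOCOL_ERROR    = 1002

# Hypothetical helper: returns [code, reason] for a close frame payload.
# Any violation is reported by substituting the protocol error code.
def parse_close(payload)
  payload = payload.b
  return [nil, ""] if payload.empty?                   # no closing code given
  return [PROTOCOL_ERROR, ""] if payload.bytesize == 1 # a code needs two bytes

  code   = payload.unpack('n').first                   # big-endian 16-bit integer
  reason = payload[2..-1].force_encoding(Encoding::UTF_8)

  allowed = DEFINED_CODES.include?(code) || APPLICATION_CODES.cover?(code)
  return [PROTOCOL_ERROR, ""] unless allowed && reason.valid_encoding?

  [code, reason]
end

normal   = parse_close([1000].pack('n') + "bye")
reserved = parse_close([1005].pack('n'))
```

That’s a couple of dozen lines for one frame type, and the user of a lower-level library has to know to write every one of them.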

So this protocol library not only leaks by making the user track the state of the connection and the state of the parser, but also makes them implement stuff the protocol library should deal with. Almost all the above points are behaviours set down in the specification; the user must implement them this way or their deployment is buggy. Since the user has no meaningful control over how this stuff works, all this code is just boilerplate that requires significant knowledge to write correctly. In contrast, faye-websocket and websocket-driver have never emitted events on ping frames because the user has no choice over how to handle them, so why make them write code for that? In websocket-driver, all the above points (and this list is not exhaustive) are dealt with by the protocol library and this gives users a much better hope of deploying WebSockets correctly and safely.

I’m not saying the websocket library is broken, per se. I’m saying it doesn’t go far enough. In Ruby we have lots of different means of doing network I/O, and there are a few in Node if you consider HTTP servers and TCP/TLS sockets, though they all have similar interfaces. If you want to build a widely useful protocol library, you should solve as many problems as possible for the user so that they just need to bring some I/O and they’re pretty much done. Asking the user to rebuild half the protocol themselves is a recipe for bugs, security holes and wasted effort. We shouldn’t have to rebuild each protocol for every I/O stack we invent, so let’s stop.

Let the user tell you what they want to do, and then tell their code how and when to realize this over TCP. If you find yourself explaining the protocol mechanics when you’re documenting your library, it’s not simple enough yet. Refactor until I don’t need to read the RFC to deploy it properly.

The If Works

by James Coglan
