Terminus: a client-side Capybara driver

Last week the Capybara project released version 0.4. Since the 0.3.9 release, which added support for third-party drivers, I’ve been working on turning Terminus into fully compatible driver for it. It’s still an experiment but it’s in the sort of semi-useful state that means I’m okay throwing it out to see if anyone’s interested in it.

So what is it? Terminus is a Capybara driver where most of the driver functions are implemented in client-side JavaScript. It lets you script any browser on any machine using the Capybara API, without any browser plugins or extensions. ‘Any browser’ really means ‘any browser that supports the document.evaluate() XPath API’ for now, so you can’t use IE, but the ‘any machine’ part is true: any browser that can see your development machine can be controlled using Terminus, including any phone, iPad or desktop machine on your local network.

This is very much still an experiment. Because every little Capybara call has to go over the network (it uses Faye to send commands to browsers) it’s very slow, and the connection sometimes flakes out. Under ideal circumstances though it does pass most of the Capybara test suite. It supports JavaScript, cookies and window switching (assuming the connected browser has these features enabled). The major omissions are:

  • attach_file does not work since file uploads cannot be controlled by JavaScript for security reasons.
  • A few selectors and HTML5 features in the specs don’t work due to browser inconsistencies.
  • Redirects work but infinite redirection cannot be detected.

The original intention was for Terminus to be a tool for automating cross-browser testing, so the lack of IE support is kind of a problem. I actually went some way towards fixing this; I wrote a pretty stupid document.evaluate() replacement called Pathology, and to support that a PEG parser compiler called Canopy so I could write a declarative XPath grammar. Unfortunately these don’t yet perform well enough to support the workload that Capybara throws at a browser.

I also want to add an API for switching browsers based on vendor and OS versions, but this release is just enough to support the required Capybara API.

You can find out more on the project’s GitHub page, and installation is the usual gem install terminus command – I’d love if a few of you could kick its tyres a bit and figure out if it’s actually useful for anything.

Improving the observer pattern with subscription objects

Way back in February I wrote a series of articles on event-driven programming; the pattern I find myself using the most out of all those I covered is the observer pattern: an object that collects callback functions in order to notify other parts of a system when interesting things happen. The basic pattern looks something like this:

Observable = {
  bind: function(eventType, listener, context) {
    this._listeners = this._listeners || {};
    var list = this._listeners[eventType] = this._listeners[eventType] || [];
    list.push([listener, context]);
  },

  trigger: function(eventType, args) {
    if (!this._listeners) return;
    var list = this._listeners[eventType];
    if (!list) return;
    list.forEach(function(listener) {
      listener[0].apply(listener[1], args);
    });
  }
};

This is pretty simple: we can add callbacks to a list, and call them all when something happens. But what if we want to remove a listener? The approach taken by many DOM interfaces is to provide essentially the opposite of the bind() method:

Observable.unbind = function(eventType, listener, context) {
  if (!this._listeners) return;
  var list = this._listeners[eventType];
  if (!list) return;
  
  var i = list.length;
  while (i--) {
    if (list[i][0] === listener && list[i][1] === context)
      list.splice(i, 1);
  }
};

This just takes the original event type and listener function that we gave to bind(), and removes the listener from that event type’s list of listeners. Internally, that’s exactly what we want to achieve, but the API is hard to use. It reeks of something inspired by Java, where listeners are objects and easier to keep references to. Idiomatic JavaScript rests heavily on anonymous functions, and being forced keep references to a function and a context object just so you can remove a listener often feels clunky. It would be nicer if the bind() API returned something to represent the subscription so we could cancel it later:

var subscription = object.bind('someEvent', function(event) {
  // Handle the event
});

// Later on
subscription.cancel();

This is easily done, we just need an object to store the event type, listener function and observed object so that it can call the unbind() method for us when we want to cancel the subscription.

Subscription = function(target, eventType, listener, context) {
  this._target    = target;
  this._eventType = eventType;
  this._listener  = listener;
  this._context   = context;
  this._active    = true;
};

Subscription.prototype.cancel = function() {
  if (!this._active) return;
  this._target.unbind(this._eventType, this._listener, this._context);
  this._active = false;
};

We can then change the bind() method to return one of these objects:

Observable.bind = function(eventType, listener, context) {
  this._listeners = this._listeners || {};
  var list = this._listeners[eventType] = this._listeners[eventType] || [];
  list.push([listener, context]);
  return new Subscription(this, eventType, listener, context);
};

There are further refactorings we could make; maybe it would be better to store the subscription objects rather than the listener functions on the observable object. We’d probably then move the listener calling logic into the Subscription class, although this could introduce extra function call overhead while dispatching events. For now I’m happy leaving it as sugar over the unbind() method. (This is cribbed from the Faye client, which is required by the protocol to support an unsubscribe() method that takes a channel name and listener to remove.)

So we’ve got an API that we’re happy with but its internals are sub-optimal: it takes O(n) time to remove a listener from an object, and if we need to do this operation a lot at scale we’ll have a performance problem. In Faye, for example, every client that subscribes to a channel must eventually unsubscribe, so we’d like that to be fast. In this situation we have two things we can use to make this faster: every client has a unique ID, and the list of subscribers to a channel must contain no duplicates – otherwise a client would receive some messages twice.

Putting all this together, we could model a channel’s subscribers using a JavaScript object to produce set-like behaviour:

var Channel = function() {
  this._subscribers = {};
};

Channel.prototype.bind = function(clientId, listener, context) {
  this._subscribers[clientId] = [listener, context];
};

Channel.prototype.unbind = function(clientId) {
  delete this._subscribers[clientId];
};

Channel.prototype.trigger = function(message) {
  for (var clientId in this._subscriptions) {
    var client = this._subscriptions[clientId];
    client[0].call(client[1], message);
  }
};

This would make it easy for us to add and remove subscriptions from a channel, but it means that message dispatch is slowed down due to the poor performance of for/in. But as I mentioned last week, there’s an easy solution to that: keep a sorted list of all the subscribed client IDs. The list speeds up iteration and message dispatch, while sorting the list makes removal O(log n) rather than O(n). Combining this with the Subscriber class from above, we get a final draft of our Channel API:

Channel = function() {
  this._clientIds   = [];
  this._subscribers = {};
};

Channel.prototype.bind = function(clientId, listener, context) {
  if (!this._subscribers.hasOwnProperty(clientId)) {
    var index = this._indexOf(clientId);
    this._clientIds.splice(index, 0, clientId);
  }
  this._subscribers[clientId] = [listener, context];
  return new Subscription(this, clientId);
};

Channel.prototype.unbind = function(clientId) {
  if (!this._subscribers.hasOwnProperty(clientId)) return;
  delete this._subscribers[clientId];
  var index = this._indexOf(clientId);
  this._clientIds.splice(index, 1);
};

Channel.prototype.trigger = function(message) {
  var clientIds = this._clientIds,
      i         = clientIds.length,
      client;
  
  while (i--) {
    client = this._subscribers[clientIds[i]];
    client[0].call(client[1], message);
  }
};

Channel.prototype._indexOf = function(key) {
  var keys = this._clientIds,
      n    = keys.length,
      i    = 0,
      d    = n;

  if (n === 0)         return 0;
  if (key < keys[0])   return 0;
  if (key > keys[n-1]) return n;

  while (key !== keys[i] && d > 0.5) {
    d = d / 2;
    i += (key > keys[i] ? 1 : -1) * Math.round(d);
    if (key > keys[i-1] && key < keys[i]) d = 0;
  }
  return i;
};

All we’ve done here is use the implementation of SortedTable from the previous article to model the channel’s set of subscribers. It supports O(log n) insertion and removal of subscribers and should iterate over them reasonably quickly. Of course in practise we would want to measure the total load of binding, dispatching and unbinding under production traffic to see which model has the best trade-off for our use case. For example, in Ruby the iteration performance of a Hash is nowhere near as bad as for JavaScript objects, so we could use the simpler Channel implementation and get O(1) insertion and removal of clients.

One final benefit of this API is that all that’s required to unbind from a channel is the clientId, since that uniquely identifies the listener to remove. The use of a Subscription object hides the mechanics of unbinding from the caller so we don’t have to change its code if we change the Channel class’s storage model.

JavaScript objects: the poor man’s set

Every so often, you need a collection of items where you know the items in it are unique. JavaScript having no standard library, there’s a pretty strong temptation to hack something using an array, since that’s the closest-looking data structure available to you. A set’s just a list with no duplicates, right?

Well, kinda. In practise it won’t work out that well, since the time it takes to determine whether an item is already in the set scales with the size of the collection, and this becomes very expensive for large amounts of data. You could keep the array sorted, which will speed up searches quite a bit: you get O(log n) insertion rather than O(n). But we can do better.

Many set implementations (see Ruby, Java) use a hashtable as the storage for a set. This is because hashtables are optimised for finding keys as quickly as possible (ideally they should maintain constant search performance as the collection grows), and they make sure no key appears twice. In JavaScript we don’t have proper hashtables, but we do have Object, and if our data consists of strings or numbers, this is good enough.

In a set implementation, we just use the fact that an object does not allow duplicate keys, and put some dummy values in for each key. We don’t care what the values are, we only care about the keys.

Set = function() {
  this._table = {};
};

Set.prototype.add = function(value) {
  if (this.contains(value)) return false;
  this._table[value] = true;
  return true;
};

Set.prototype.remove = function(value) {
  delete this._table[value];
};

Set.prototype.contains = function(value) {
  return this._table.hasOwnProperty(value);
};

Set.prototype.forEach = function(block, context) {
  for (var value in this._table)
    block.call(context, value);
};

Set.prototype.toArray = function() {
  var array = [];
  this.forEach(array.push, array);
  return array;
};

This is plenty to let us add, remove and iterate over values in the set. Note how the add() method returns true or false to tell us if the item was added or not.

> load('set.js')
> s = new Set()
> s.add('foo')
true
> s.add('foo')
false
> s.add('bar')
true
> s.toArray().join(', ')
foo, bar
> s.remove('foo')
> s.toArray().join(', ')
bar

Of course, if I need a collection of objects rather than just strings, I’ll reach for one of the JS.Set classes, but for a lot of what you’d typically do with JavaScript the above will get you a long way. And if iteration performance is problematic, there are ways to optimise.

I am a fast loop

I’ve been taking a lot of measurements recently, trying to find out ways I can optimise Faye and JS.Class. One of the big things I’ve been looking at is how much various iteration constructs cost in JavaScript. Loops are what made initial versions of Sylvester slow, and there’s a good chance they’re holding back other projects of mine performance-wise. If you’re thinking of moving to a different data structure to speed access times up, it can have unexpected consequences for iteration.

As a preliminary exercise, I took a look at these three common constructs for looping over an array. Note that my benchmarks include the cost of accessing each member of the collection, not just doing the iteration.

// Array: for loop
for (var i = 0, n = array.length; i < n; i++) { array[i] }

// Array: reverse while
var i = array.length;
while (i--) { array[i] }

// Array: forEach()
array.forEach(function(x) {});

The reverse while loop is not so common, but I use it a lot when I don't care about the order since it's slightly faster and more terse without losing clarity.

Some representative numbers from these tests are below. I used an array containing 100,000 integers and measured how long it took to iterate over the array 100 times. Times are given in milliseconds.

V8 (Chrome, Node)
Array: for loop      :: 64 +/- 1%
Array: reverse while :: 60 +/- 0%
Array: forEach()     :: 235 +/- 0%

Firefox 3.6
Array: for loop      :: 965 +/- 0%
Array: reverse while :: 964 +/- 1%
Array: forEach()     :: 1465 +/- 1%

Opera 10
Array: for loop      :: 35 +/- 20%
Array: reverse while :: 40 +/- 28%
Array: forEach()     :: 371 +/- 4%

Rhino (Narwhal)
Array: for loop      :: 644 +/- 4%
Array: reverse while :: 585 +/- 1%
Array: forEach()     :: 569 +/- 3%

So we see a reverse while loop is ever-so-slightly faster than a for loop, and forEach() is usually much more expensive since it involves a function call on every iteration. (Internet Explorer 8 doesn’t have forEach(), but its while and for perform similarly to other platforms.)

Now, a common optimisation if you have a collection of items that needs to support fast removal (for example, unsubscribing a client from a channel in Faye) is to replace an array with a hashtable as the collection’s storage mechanism. Arrays have O(n) time complexity for finding and removing an item, whereas a table is O(1) modulo some edge cases in the implementation.

But, there’s a problem: iterating over an object in JavaScript using this technique is extremely slow:

// Object: for/in
for (var key in object) {
  if (!object.hasOwnProperty(key)) continue;
  // do something with object[key];
}

Don’t believe me? Here are the numbers for an object with 100,000 fields, compared to a 100,000-item array:

V8 (Chrome, Node)
Array: reverse while :: 60 +/- 0%
Object: for/in       :: 2480 +/- 0%

Firefox 3.6
Array: reverse while :: 964 +/- 1%
Object: for/in       :: 8846 +/- 4%

Internet Explorer 8
Array: reverse while :: 21 +/- 35%
Object: for/in       :: 111 +/- 14%

Opera 10
Array: reverse while :: 40 +/- 28%
Object: for/in       :: 1010 +/- 8%

Rhino (Narwhal)
Array: reverse while :: 585 +/- 1%
Object: for/in       :: 3085 +/- 3%

This doesn’t look good: making unsubscription faster in Faye could end up making message distribution orders of magnitude slower! However, what if we combine an array and an object to get the best of both worlds? Suppose the object’s keys are stored in the array, we can iterate over the array instead of the object:

// Object: while
var i = array.length, key;
while (i--) {
  key = array[i];
  // do something with object[key];
}

We could even make a new class to encapsulate this and manage the array for us:

Table = function() {
  this._keys = [];
  this._data = {};
};

Table.prototype.put = function(key, value) {
  if (!this._data.hasOwnProperty(key)) this._keys.push(key);
  this._data[key] = value;
};

Table.prototype.forEach = function(block, context) {
  var keys = this._keys,
      data = this._data,
      i    = keys.length,
      key;
  
  while (i--) {
    key = keys[i];
    block.call(context, key, data[key]);
  }
};

Table#put() stores a key/value pair in the table, and Table#forEach() iterates over the stored keys for us and yields a key and value to the block. Checking whether a key already exists just involves looking it up in a table, so adding pairs to the collection is O(1).

Let’s see how a while loop and Table#forEach() compare to a for/in loop:

V8 (Chrome, Node)
Object: for/in       :: 2480 +/- 0%
Object: while        :: 98 +/- 0%
Table: forEach()     :: 191 +/- 1%

Firefox 3.6
Object: for/in       :: 8846 +/- 4%
Object: while        :: 2728 +/- 1%
Table: forEach()     :: 2732 +/- 0%

Internet Explorer 8
Object: for/in       :: 111 +/- 14%
Object: while        :: 42 +/- 14%
Table: forEach()     :: 63 +/- 18%

Opera 10
Object: for/in       :: 1010 +/- 8%
Object: while        :: 79 +/- 9%
Table: forEach()     :: 874 +/- 6%

Rhino (Narwhal)
Object: for/in       :: 3085 +/- 3%
Object: while        :: 1729 +/- 0%
Table: forEach()     :: 2626 +/- 1%

From the above numbers, we see we can get from 2x to 10x faster iteration depending on the platform. Some platforms even run Table#forEach() at a good enough pace to beat a for/in loop! Firefox 4.0 actually reported Table#forEach() faster than a while loop, which I have to say leaves me a little puzzled.

There are further optimisations we can make, though. Recall that my whole reason for wanting a new data structure is that I want fast removal (i.e. better than O(n)), but this Table class won’t support that since to remove a key I have to look through an array for it, which is no improvement on what I was doing before for storing client-channel bindings.

In JavaScript, object keys are strings and so we can sort them. If we can keep the keys array sorted we’ll get O(log n) removal; we’ll also get O(log n) insertion rather than O(1), but for large n where every client must eventually be removed, 2log(n) < n by quite some margin. Let's make a subclass and override the storage methods to keep the keys array in order:

SortedTable = function() {
  Table.call(this);
};
SortedTable.prototype = new Table();

SortedTable.prototype.put = function(key, value) {
  if (!this._data.hasOwnProperty(key)) {
    var index = this._indexOf(key);
    this._keys.splice(index, 0, key);
  }
  this._data[key] = value;
};

SortedTable.prototype.remove = function(key) {
  if (!this._data.hasOwnProperty(key)) return;
  delete this._data[key];
  var index = this._indexOf(key);
  this._keys.splice(index, 1);
};

SortedTable.prototype._indexOf = function(key) {
  var keys = this._keys,
      n    = keys.length,
      i    = 0,
      d    = n;

  if (n === 0)         return 0;
  if (key < keys[0])   return 0;
  if (key > keys[n-1]) return n;

  while (key !== keys[i] && d > 0.5) {
    d = d / 2;
    i += (key > keys[i] ? 1 : -1) * Math.round(d);
    if (key > keys[i-1] && key < keys[i]) d = 0;
  }
  return i;
};

The indexOf() method is a basic binary search that assumes values are comparable using JavaScript operators and that the values are unique. It’s a simplified version of the code from JS.SortedSet, which is orders of magnitude faster than JS.Set for insertion and removal on large collections. You could get fancy and use a tree structure but the ones I’ve seen don’t perform much better than my sorted array approach.

In a later article I’ll go through a practical application of this, showing how we could use it to speed up Faye’s client disconnection routines.