Canopy produces portable PEG parsers

Most of the open-source software that I write is done at least partly for learning purposes. The majority of it isn’t used by many people; I wrote it to implement some existing idea so I could understand how it works.

One such project is Canopy, a PEG parser generator that I originally wrote as part of a very protracted yak shave. While writing a Capybara driver to run in the browser, I found I needed to implement an XPath query API for IE (which lacks document.evaluate()), and rather than write an XPath parser by hand I thought it’d be fun to make a parser generator, since I figured I had a better chance of getting that right than I would with a hand-written parser.

It’s been three years and a day since the last release. It’s not something I often find use for, and not something I have ever spent an awful lot of time on, but now and then I get the urge to go back to one of my old projects and see if I can’t improve it using the understanding I’ve gained since I last worked on it. Three months ago, I thought it would be fun to see if I could transform Canopy from being a JavaScript-specific parser generator into a compiler for arbitrary languages, and that’s what I’m releasing today.

This is only possible at all because of an important separation of concerns that I learned from Treetop. Many parser generator tools are built for a specific language, and encourage or require you to mix your processing code into the grammar definition. Treetop supports this model, per its examples:

grammar Arithmetic
  rule additive
    multitive a:( '+' multitive )* {
      def value
        a.elements.inject(multitive.value) { |sum, e|
          sum+e.multitive.value
        }
      end
    }
  end

  # other rules below ...
end

After a rule’s expression, you can enclose a set of methods between braces, and those methods will be added to the generated node. However, you can also specify these methods via a named module:

grammar Arithmetic
  rule additive
    multitive ('+' multitive)* <AdditiveNode>
  end

  # other rules below ...
end

This latter example is the only style Canopy has ever supported, and it brings two important benefits: first, it makes the grammar itself easier to read, since it’s not cluttered with processing code, and second, it makes the parser definition language-agnostic. Nothing about the second example is specific to Ruby; it could just as easily be used to generate JavaScript, Python, or any number of other languages.

So I thought, with more and more logic being moved off servers and into browsers and smartphone apps, it would be interesting and useful to have parsing tools that could generate parsers with the same behaviour in any language. It’s annoying when the server and client parts of an app are written by different teams, and the server team picks some format to represent something, and the client team has to build their own parser because there wasn’t one available for their language. Or, they don’t write an actual parser (because that’s an awful lot of work) and instead deploy some ad-hoc text processing that ends up being full of edge cases. It would be nice to define a grammar once and generate code for wherever you need to use it.

So Canopy now supports four output languages: its original JavaScript, plus Java, Ruby and Python. You can write a single grammar:

grammar URL

url       <-  scheme "://" host pathname search hash?
scheme    <-  "http" "s"?
host      <-  hostname port?
hostname  <-  segment ("." segment)*
segment   <-  [a-z0-9-]+
port      <-  ":" [0-9]+
pathname  <-  "/" [^ ?]*
search    <-  ("?" query:[^ #]*)?
hash      <-  "#" [^ ]*

Compile it to any of the supported languages:

$ canopy url.peg --lang js
$ canopy url.peg --lang java
$ canopy url.peg --lang python
$ canopy url.peg --lang ruby

And then use the grammar in JavaScript:

var url = require('./url');

var tree = url.parse('http://example.com/search?q=hello#page=1');

tree.elements.forEach(function(node) {
  console.log(node.offset, node.text);
});

Or Java:

import url.URL;
import url.TreeNode;
import url.ParseError;

public class Example {
    public static void main(String[] args) throws ParseError {
        TreeNode tree = URL.parse("http://example.com/search?q=hello#page=1");

        for (TreeNode node : tree.elements) {
            System.out.println(node.offset + ", " + node.text);
        }
    }
}

Or Python:

import url

tree = url.parse('http://example.com/search?q=hello#page=1')

for node in tree.elements:
    print(node.offset, node.text)

Or Ruby:

require './url'

tree = URL.parse('http://example.com/search?q=hello#page=1')

tree.elements.each do |node|
  puts node.offset, node.text
end

The parsers have no runtime dependency on Canopy itself and should have exactly the same semantics. They all produce a tree of nodes that have properties called text, offset and elements. However, a new feature in this release lets you change how nodes are generated. Say you have a grammar that matches lists of integers:

grammar Ints

list    <-  "[" number ("," number)* "]" %make_list
number  <-  [0-9]+ %make_number

The words following the % signs refer to named functions (called actions) that you can supply to the parser to make it generate the values you want. Rather than the default tree, let’s generate an actual list of integers:

var ints = require('./ints');

var actions = {
  make_list: function(input, start, end, elements) {
    var list = [elements[1]];
    elements[2].forEach(function(el) { list.push(el.number) });
    return list;
  },

  make_number: function(input, start, end, elements) {
    return parseInt(input.substring(start, end), 10);
  }
};

ints.parse('[1,2,3]', {actions: actions})
// -> [ 1, 2, 3 ]

Similar functionality exists for the other target languages. There are a few useful differences between this approach and the still-supported <TypeAnnotation> style.

  • Whereas types add new methods to the default tree structure, actions add new functions to the parser itself that allow it to produce any kind of value.
  • Actions run while the parser is executing, whereas methods added by types have to be invoked after the parser has finished. This means processing can be interleaved with parsing, and parsing can be halted with an exception if something syntactically valid but semantically meaningless is encountered (see the sketch after this list).
  • Actions are run instead of the built-in methods for generating nodes, so the parser doesn’t do work to generate nodes you don’t want.
  • For Java, with its static type system, this was more easily realised than dynamically adding new methods to values at runtime would have been. The actions can be grouped into an interface that the caller implements; an implementation of that interface is passed to the parse() method, which delegates node construction to it.
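
To illustrate the second point, here’s a minimal sketch of halting the parse from inside an action, using the ints grammar above. The range check is my own invention rather than anything the grammar requires: the input is syntactically valid, but the make_number action rejects it mid-parse by throwing.

var ints = require('./ints');

var actions = {
  make_list: function(input, start, end, elements) {
    var list = [elements[1]];
    elements[2].forEach(function(el) { list.push(el.number) });
    return list;
  },

  make_number: function(input, start, end, elements) {
    var value = parseInt(input.substring(start, end), 10);
    // hypothetical semantic rule: only values up to 255 are meaningful,
    // so abort the parse as soon as an out-of-range number is seen
    if (value > 255) throw new Error('number out of range: ' + value);
    return value;
  }
};

ints.parse('[1,2,999]', {actions: actions});
// -> throws Error: number out of range: 999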

Also, notice how the actions are supplied to the parser:

ints.parse('[1,2,3]', {actions: actions})

The same style has been adopted for passing in types. Whereas before they were static settings on the parser class:

var arithmetic = require('./arithmetic');

arithmetic.Parser.AdditiveNode = {
  value: function() { ... }
};

arithmetic.parse('1 + 2');

They are now passed into the parse() method just like actions:

var arithmetic = require('./arithmetic');

var types = {
  AdditiveNode: {
    value: function() { ... }
  }
};

arithmetic.parse('1 + 2', {types: types});

Having these passed to the parser on every invocation means that you can keep several different implementations of your actions and types and use whichever set suits the context. You can put the same language definition to multiple purposes just by passing in a different object.
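
As a sketch of what that enables, here are two alternative action sets for the ints grammar above; the summing set is my own hypothetical addition, not part of the earlier example. The same parser produces entirely different values depending on which one you hand it.

var ints = require('./ints');

// one action set builds a list, exactly as before
var listActions = {
  make_list: function(input, start, end, elements) {
    var list = [elements[1]];
    elements[2].forEach(function(el) { list.push(el.number) });
    return list;
  },
  make_number: function(input, start, end, elements) {
    return parseInt(input.substring(start, end), 10);
  }
};

// another sums the numbers without building a list at all
var sumActions = {
  make_list: function(input, start, end, elements) {
    var sum = elements[1];
    elements[2].forEach(function(el) { sum += el.number });
    return sum;
  },
  make_number: function(input, start, end, elements) {
    return parseInt(input.substring(start, end), 10);
  }
};

ints.parse('[1,2,3]', {actions: listActions})
// -> [ 1, 2, 3 ]

ints.parse('[1,2,3]', {actions: sumActions})
// -> 6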

This API also makes supplying types across languages more consistent; since every language has a different set of rules for how constants are loaded and referenced, converting them from global settings into an injected dependency made things simpler.

Finally, I was interested in performance, both in reducing the generated code’s footprint and in making it run faster. Compared to version 0.2, 0.3 sees some modest improvements. On this JSON grammar:

grammar CanopyJson

document  <-  __ (object / array) __
object    <-  "{" pair ("," pair)* "}" / "{" __ "}"
pair      <-  __ string __ ":" value
array     <-  "[" value ("," value)* "]" / "[" __ "]"
value     <-  __ (object / array / string / number / boolean_ / null_) __
string    <-  "\"" ("\\" . / [^"])* "\""
number    <-  "-"? ("0" / [1-9] [0-9]*) ("." [0-9]+)? (("e" / "E") ("+" / "-" / "") [0-9]+)?
boolean_  <-  "true" / "false"
null_     <-  "null"
__        <-  [\s]*

Canopy 0.2 produces a JavaScript file of 67,459 bytes (5,964 bytes gzipped) while 0.3 produces a file of 50,335 bytes (4,475 bytes gzipped). The 0.3 parser runs slightly faster too, according to this benchmark (on Node 0.12.7 on Ubuntu 15.04):

var json2     = require('./json2'),
    json3     = require('./json3'),
    benchmark = require('benchmark');

var pkg   = require('fs').readFileSync('package.json', 'utf8'),
    suite = new benchmark.Suite();

suite.add('0.2', function() { json2.parse(pkg) });
suite.add('0.3', function() { json3.parse(pkg) });

suite.on('complete', function() {
  this.forEach(function(result) { console.log(result.toString()) });
});

suite.run();

Running this produces:

0.2 x 1,531 ops/sec ±0.45% (100 runs sampled)
0.3 x 1,863 ops/sec ±0.19% (101 runs sampled)

I also compared it to other PEG libraries for JavaScript, Python and Ruby, using a set of grammars I’ve ported to various tools. Here it is compared to PEG.js, with caching turned on so that the same action is never run more than once with the same input (higher is better):

Canopy JSON x 1,896 ops/sec ±0.56% (101 runs sampled)
PEG.js JSON x 1,303 ops/sec ±0.26% (100 runs sampled)

Canopy Lisp x 25,606 ops/sec ±0.14% (100 runs sampled)
PEG.js Lisp x 11,982 ops/sec ±1.68% (91 runs sampled)

Canopy PEG x 465 ops/sec ±1.18% (93 runs sampled)
PEG.js PEG x 390 ops/sec ±3.08% (86 runs sampled)

Compared to Parsimonious (lower is better):

------ benchmark: 6 tests, min 5 rounds (of min 25.00us), 1.00s max time, timer: time.time -------
Name (time in us)                       Min         Max        Mean     StdDev  Rounds  Iterations
--------------------------------------------------------------------------------------------------
test_canopy_parse_json            6659.9846  17949.1043   7536.4686  1579.5468     129           1
test_parsimonious_parse_json     14208.0784  20632.0286  15598.0053  2023.8025      67           1
--------------------------------------------------------------------------------------------------
test_canopy_parse_lisp             367.8799    689.9834    374.6855    16.7133    2088           1
test_parsimonious_parse_lisp       599.8611    993.9671    611.9249    22.9380    1369           1
--------------------------------------------------------------------------------------------------
test_canopy_parse_peg             9866.9529  15202.0454  10825.6472  1577.9357      90           1
test_parsimonious_parse_peg      16062.9749  22645.9503  17493.1793  2203.0655      59           1
--------------------------------------------------------------------------------------------------

And compared to various Ruby libraries (higher is better):

 Canopy JSON    178.743  (± 1.1%) i/s -    901.000 
 Citrus JSON    203.357  (± 2.5%) i/s -      1.020k
Parslet JSON     35.186  (± 5.7%) i/s -    177.000 
Treetop JSON     92.870  (± 4.3%) i/s -    468.000 

 Canopy Lisp      2.862k (± 0.7%) i/s -     14.504k
 Citrus Lisp      3.106k (± 1.6%) i/s -     15.756k
Parslet Lisp    562.075  (± 2.8%) i/s -      2.856k
Treetop Lisp      1.563k (± 3.3%) i/s -      7.854k

  Canopy PEG    113.604  (± 1.8%) i/s -    572.000 
  Citrus PEG      9.444  (± 0.0%) i/s -     48.000 
 Parslet PEG     25.936  (± 3.9%) i/s -    130.000 
 Treetop PEG     63.826  (± 6.3%) i/s -    318.000

Many of these can be tweaked, either via their grammar facilities or their compiler settings, to optimise things; I have tried to keep each port as close to the Canopy grammar as possible for a fair comparison. While Canopy is not the fastest (and this depends very much on which grammar you use and which version of a runtime you benchmark on), it’s clearly performing comparably to the popular tools in this space.

Although porting to each language did pose some challenges – Python benefits from pre-compiled regexes, Java generates multiple files and needs static types, attributes work differently everywhere – the target languages are still roughly similar. They’re all high-level object-oriented languages with a similar model of computation, common data and control structures, and mutable data. It would be interesting to see what supporting something wildly different like Haskell or Clojure would involve (even though I’ve cheated on Clojure by already targeting the JVM). I don’t expect I’ll get to that any time soon, but I know I’ll have a project to experiment and learn with when I get around to learning those languages.