Most of the open-source software that I write is done at least partly for learning purposes. The majority of it isn’t used by many people; I just wrote it to implement some existing idea so I could understand how it works.
One such project is Canopy, a PEG parser generator that I originally wrote as part of a very protracted yak shave – while writing a Capybara driver to run in the browser, I found I needed to implement an XPath query API for IE (which lacks document.evaluate()), and rather than write an XPath parser by hand I thought it’d be fun to make a parser generator, since I figured I had a better chance of getting that right than I did a hand-written parser.
It’s been three years and a day since the last release. It’s not something I often find use for, and not something I have ever spent an awful lot of time on, but now and then I get the urge to go back to one of my old projects and see if I can’t improve it using understanding I’ve gained since I last worked on it. Three months ago, I thought it would be fun to see if I could transform Canopy from being a JavaScript-specific parser generator into a compiler for arbitrary languages, and that’s what I’m releasing today.
This is only possible at all because of an important separation of concerns that I learned from Treetop. Many parser generator tools are built for a specific language, and encourage or require you to mix your processing code into the grammar definition. Treetop supports this model, per its examples:
grammar Arithmetic
  rule additive
    multitive a:( '+' multitive )* {
      def value
        a.elements.inject(multitive.value) { |sum, e|
          sum + e.multitive.value
        }
      end
    }
  end

  # other rules below ...
end
After a rule’s expression, you can enclose a set of methods between braces, and those methods will be added to the generated node. However, you can also specify these methods via a named module:
grammar Arithmetic
  rule additive
    multitive ('+' multitive)* <AdditiveNode>
  end

  # other rules below ...
end
This latter example is the only style Canopy has ever supported, and it produces two important benefits: first, the grammar itself is easier to read, since it isn’t cluttered with processing code; and second, the parser definition becomes language-agnostic. Nothing about the second example is specific to Ruby; it could just as easily be used to generate JavaScript, Python, or any number of other languages.
So I thought, with more and more logic being moved off servers and into browsers and smartphone apps, it would be interesting and useful to have parsing tools that could generate code in any language that would do the same thing. It’s annoying when the server and client parts of an app are written by different teams, and the server team picks some format to represent something, and the client team has to build their own parser because there wasn’t one available for their language. Or, they don’t write an actual parser (because that’s an awful lot of work) and deploy some ad-hoc text processing that ends up being full of edge cases. It would be nice to define a grammar once and generate code for wherever you need to use it.
So Canopy now supports four output languages: its original JavaScript, plus Java, Ruby and Python. You can write a single grammar:
grammar URL
  url      <- scheme "://" host pathname search hash?
  scheme   <- "http" "s"?
  host     <- hostname port?
  hostname <- segment ("." segment)*
  segment  <- [a-z0-9-]+
  port     <- ":" [0-9]+
  pathname <- "/" [^ ?]*
  search   <- ("?" query:[^ #]*)?
  hash     <- "#" [^ ]*
Compile it to any of the supported languages:
$ canopy url.peg --lang js
$ canopy url.peg --lang java
$ canopy url.peg --lang python
$ canopy url.peg --lang ruby
And then use the grammar in JavaScript:
var url = require('./url');

var tree = url.parse('http://example.com/search?q=hello#page=1');

tree.elements.forEach(function(node) {
  console.log(node.offset, node.text);
});
Or Java:
import url.URL;
import url.TreeNode;
import url.ParseError;

public class Example {
    public static void main(String[] args) throws ParseError {
        TreeNode tree = URL.parse("http://example.com/search?q=hello#page=1");
        for (TreeNode node : tree.elements) {
            System.out.println(node.offset + ", " + node.text);
        }
    }
}
Or Python:
import url

tree = url.parse('http://example.com/search?q=hello#page=1')
for node in tree.elements:
    print(node.offset, node.text)
Or Ruby:
require './url'

tree = URL.parse('http://example.com/search?q=hello#page=1')
tree.elements.each do |node|
  puts node.offset, node.text
end
The parsers have no runtime dependency on Canopy itself and should have exactly the same semantics. They all produce a tree of nodes with properties called text, offset and elements. However, a new feature means you can change how nodes are generated. Say you have a grammar that matches lists of integers:
grammar Ints
  list   <- "[" number ("," number)* "]" %make_list
  number <- [0-9]+ %make_number
The words following the % signs refer to named functions (called actions) that you can supply to the parser to make it generate the values you want. Rather than the default tree, let’s generate an actual list of integers:
var ints = require('./ints');

var actions = {
  make_list: function(input, start, end, elements) {
    var list = [elements[1]];
    elements[2].forEach(function(el) { list.push(el.number) });
    return list;
  },

  make_number: function(input, start, end, elements) {
    return parseInt(input.substring(start, end), 10);
  }
};

ints.parse('[1,2,3]', {actions: actions})
// -> [ 1, 2, 3 ]
Similar functionality exists for the other target languages. There are a few useful differences between this approach and the still-supported <TypeAnnotation> style.
- Whereas types add new methods to the default tree structure, actions add new functions to the parser itself that allow it to produce any kind of value.
- Actions run while the parser is executing, whereas methods added by types have to be invoked after the parser has finished. This means processing can be interleaved with parsing, and parsing can be halted with an exception if something syntactically valid but semantically meaningless is encountered (see the sketch after this list).
- Actions are run instead of the built-in methods for generating nodes, so the parser doesn’t do work to generate nodes you don’t want.
- For Java, with its static type system, this was more easily realised than dynamically adding new methods to values at runtime. The actions can be grouped into an interface that the caller implements, and an implementation of that interface is passed to the parse() method so it can delegate node construction.
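To illustrate the early-exit point, an action can bail out simply by throwing. Here’s a minimal sketch that reuses the Ints grammar above; the range check is an invented constraint for illustration, not anything built into Canopy:

var ints = require('./ints');

var checkedActions = {
  make_list: function(input, start, end, elements) {
    var list = [elements[1]];
    elements[2].forEach(function(el) { list.push(el.number) });
    return list;
  },

  make_number: function(input, start, end, elements) {
    var n = parseInt(input.substring(start, end), 10);
    // "999" is syntactically valid, but suppose our application only
    // accepts bytes: throwing here halts the parse immediately
    if (n > 255) throw new Error('number out of range: ' + n);
    return n;
  }
};

ints.parse('[1,2,999]', {actions: checkedActions});
// -> throws Error: number out of range: 999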
Also, notice how the actions are supplied to the parser:
ints.parse('[1,2,3]', {actions: actions})
The same style has been adopted for passing in types. Whereas before they were static settings on the parser class:
var arithmetic = require('./arithmetic');

arithmetic.Parser.AdditiveNode = {
  value: function() { ... }
};

arithmetic.parse('1 + 2');
They are now passed into the parse() method just like actions:
var arithmetic = require('./arithmetic');

var types = {
  AdditiveNode: {
    value: function() { ... }
  }
};

arithmetic.parse('1 + 2', {types: types});
Having these passed to the parser on every invocation means that you can have multiple different implementations of your actions and types, and use a different set depending on context. You can use the same language definition for multiple different purposes just by passing in a different object.
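For instance, sticking with the Ints grammar from earlier, one action set could evaluate the input while another re-renders it. This is a sketch of the idea rather than code from the Canopy documentation:

var ints = require('./ints');

// One set of actions evaluates the input to real integers...
var evaluate = {
  make_list: function(input, start, end, elements) {
    var list = [elements[1]];
    elements[2].forEach(function(el) { list.push(el.number) });
    return list;
  },
  make_number: function(input, start, end, elements) {
    return parseInt(input.substring(start, end), 10);
  }
};

// ...while another uses the same grammar to pretty-print it
var prettyPrint = {
  make_list: function(input, start, end, elements) {
    var items = [elements[1]];
    elements[2].forEach(function(el) { items.push(el.number) });
    return '[' + items.join(', ') + ']';
  },
  make_number: function(input, start, end, elements) {
    return input.substring(start, end);
  }
};

ints.parse('[1,2,3]', {actions: evaluate});     // -> [1, 2, 3]
ints.parse('[1,2,3]', {actions: prettyPrint});  // -> '[1, 2, 3]'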
This API also makes supplying types across languages more consistent; since every language has a different set of rules for how constants are loaded and referenced, converting them from global settings into an injected dependency made things simpler.
Finally, I was interested in performance, both in terms of improving the generated code’s footprint and making it run faster. Compared to version 0.2, 0.3 sees some modest improvements. On this JSON grammar:
grammar CanopyJson
  document <- __ (object / array) __
  object   <- "{" pair ("," pair)* "}" / "{" __ "}"
  pair     <- __ string __ ":" value
  array    <- "[" value ("," value)* "]" / "[" __ "]"
  value    <- __ (object / array / string / number / boolean_ / null_) __
  string   <- "\"" ("\\" . / [^"])* "\""
  number   <- "-"? ("0" / [1-9] [0-9]*) ("." [0-9]+)? (("e" / "E") ("+" / "-" / "") [0-9]+)?
  boolean_ <- "true" / "false"
  null_    <- "null"
  __       <- [\s]*
Canopy 0.2 produces a JavaScript file of 67,459 bytes (5,964 bytes gzipped) while 0.3 produces a file of 50,335 bytes (4,475 bytes gzipped). The 0.3 parser runs slightly faster too, according to this benchmark (on Node 0.12.7 on Ubuntu 15.04):
var json2 = require('./json2'),
    json3 = require('./json3'),
    benchmark = require('benchmark');

var pkg = require('fs').readFileSync('package.json', 'utf8'),
    suite = new benchmark.Suite();

suite.add('0.2', function() { json2.parse(pkg) });
suite.add('0.3', function() { json3.parse(pkg) });

suite.on('complete', function() {
  this.forEach(function(result) { console.log(result.toString()) });
});

suite.run();
Running this produces:
0.2 x 1,531 ops/sec ±0.45% (100 runs sampled)
0.3 x 1,863 ops/sec ±0.19% (101 runs sampled)
I also compared it to other PEG libraries for JavaScript, Python and Ruby, using a set of grammars I’ve ported to various tools. Compared to PEG.js, with caching turned on so that the same action is never run against the same input more than once (higher is better):
Canopy JSON x 1,896 ops/sec ±0.56% (101 runs sampled)
PEG.js JSON x 1,303 ops/sec ±0.26% (100 runs sampled)
Canopy Lisp x 25,606 ops/sec ±0.14% (100 runs sampled)
PEG.js Lisp x 11,982 ops/sec ±1.68% (91 runs sampled)
Canopy PEG x 465 ops/sec ±1.18% (93 runs sampled)
PEG.js PEG x 390 ops/sec ±3.08% (86 runs sampled)
Compared to Parsimonious (lower is better):
------ benchmark: 6 tests, min 5 rounds (of min 25.00us), 1.00s max time, timer: time.time -------
Name (time in us) Min Max Mean StdDev Rounds Iterations
--------------------------------------------------------------------------------------------------
test_canopy_parse_json 6659.9846 17949.1043 7536.4686 1579.5468 129 1
test_parsimonious_parse_json 14208.0784 20632.0286 15598.0053 2023.8025 67 1
--------------------------------------------------------------------------------------------------
test_canopy_parse_lisp 367.8799 689.9834 374.6855 16.7133 2088 1
test_parsimonious_parse_lisp 599.8611 993.9671 611.9249 22.9380 1369 1
--------------------------------------------------------------------------------------------------
test_canopy_parse_peg 9866.9529 15202.0454 10825.6472 1577.9357 90 1
test_parsimonious_parse_peg 16062.9749 22645.9503 17493.1793 2203.0655 59 1
--------------------------------------------------------------------------------------------------
And compared to various Ruby libraries (higher is better):
Canopy JSON 178.743 (± 1.1%) i/s - 901.000
Citrus JSON 203.357 (± 2.5%) i/s - 1.020k
Parslet JSON 35.186 (± 5.7%) i/s - 177.000
Treetop JSON 92.870 (± 4.3%) i/s - 468.000
Canopy Lisp 2.862k (± 0.7%) i/s - 14.504k
Citrus Lisp 3.106k (± 1.6%) i/s - 15.756k
Parslet Lisp 562.075 (± 2.8%) i/s - 2.856k
Treetop Lisp 1.563k (± 3.3%) i/s - 7.854k
Canopy PEG 113.604 (± 1.8%) i/s - 572.000
Citrus PEG 9.444 (± 0.0%) i/s - 48.000
Parslet PEG 25.936 (± 3.9%) i/s - 130.000
Treetop PEG 63.826 (± 6.3%) i/s - 318.000
Many of these can be tweaked, either via their grammar facilities or their compiler settings, to optimise things; I have tried to keep each port as close to the Canopy grammar as possible for a fair comparison. While Canopy is not always the fastest (and this depends very much on which grammar you use and which version of a runtime you benchmark on), it’s clearly performing comparably to the popular tools in this space.
Although porting to each language did pose some challenges – Python benefits from pre-compiled regexes, Java generates multiple files and needs static types, attributes work differently everywhere – these languages are still roughly similar. They’re all high-level object-oriented languages with a similar model of computation, common data and control structures, and mutable data. It would be interesting to see what supporting something wildly different like Haskell or Clojure would involve (even though I’ve cheated on Clojure by targeting the JVM already). I don’t expect I’ll get to that any time soon, but when I do come to learn those languages, I’ll have a project to experiment on.