Announcing Canopy v0.4

Today I’m publishing version 0.4 of Canopy, my PEG parser compiler for Java, JS, Python and Ruby. This is long overdue and fixes a number of key usability problems, as well as adding a few new features.

The most urgent problem it fixes is that Canopy depended on behaviour of the mkdirp package that broke when mkdirp v1.0 was released. When this happened, we couldn’t release a patch because Canopy also depended on a neglected home-grown build system, and it would have been a huge effort to get that working again. The mainline development branch had since dropped that build system, but it contained a lot of work in progress that wasn’t ready to ship. The result was that we couldn’t ship a patch for v0.3, and v0.4 wasn’t ready yet. The good news is that you can now run Canopy via a normal installation from npm, rather than building the project from source:

$ npm install -g canopy

The other major thing this release fixes is that when we first released support for different target languages, those languages didn’t have their own test suites. Instead we heavily unit-tested the systems that parse grammars and talk to the code generator, but only the code generator for JavaScript was actually tested. For the other languages we were relying on a few example programs and trying to keep the code as simple as possible, so that all the code generators were doing was “boring” direct translation of high-level logic into particular language syntax.

As you might guess, this didn’t work, and we ended up with a few language-specific bugs, for example:

  • In Java and Python, if a rule attempted to read past the end of the input, you would get a crash rather than a normal failure of that rule.
  • The Java Actions interface would sometimes not include all required actions, and wouldn’t compile if an action was used in multiple places in a grammar.
  • In Python, the generated code could contain a conditional block with no statements, which Python rejects; a pass statement needs to be inserted in these cases.
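
To make the Python bug concrete, here is a minimal sketch of the kind of guard a code generator needs. The function renderPythonIf and the variable names in the sample conditions are hypothetical, made up for illustration; this is not Canopy's actual generator code.

```javascript
// Hypothetical helper, not taken from Canopy's source.
// Renders a Python `if` block from a condition and a list of statement strings.
function renderPythonIf(condition, statements, indent = '    ') {
  // An empty block is invalid Python, so fall back to a single `pass`.
  const body = statements.length > 0 ? statements : ['pass'];
  const lines = [`if ${condition}:`];
  for (const statement of body) {
    lines.push(indent + statement);
  }
  return lines.join('\n');
}

// The first call has no statements, so the block body becomes `pass`.
console.log(renderPythonIf('address0 is FAILURE', []));
console.log(renderPythonIf('address0 is FAILURE', ['self._offset = index1']));
```

A generator that emits target code as plain strings has no way to notice this on its own, which is why a per-language test suite is what catches it.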

Canopy now has a complete copy of all its tests written for all target languages, so we can be sure that all its functionality works on all platforms, and make sure any bugs we find get fixed reliably.

We’ve also added some new features and usability improvements. First, sequence expressions can now mute their elements, if you don’t want those elements to produce nodes in the parse tree. For example, given this grammar:

grammar Hash
  object  <-  @"{" string @" => " number:[0-9]+ @"}"
  string  <-  "'" [^']* "'"

The items with a @ prefix will not generate tree nodes, so we can remove items that don’t contribute useful information and are just syntactic noise:

require('./hash').parse("{'foo' => 36}")

   == { text: "{'foo' => 36}",
        offset: 0,
        elements: [
          { text: "'foo'", offset: 1, elements: [...] },
          { text: '36', offset: 10, elements: [...] }
        ],
        string: { text: "'foo'", offset: 1, elements: [...] },
        number: { text: '36', offset: 10, elements: [...] } }
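
One way to picture what muting does is the sketch below, which builds a sequence node by hand from the pieces each element matched. The helpers leaf and buildSequenceNode, and the muted and label fields, are assumptions for illustration; the generated parser's internals may look quite different.

```javascript
// Hypothetical sketch of assembling a sequence node; not Canopy's generated code.
function leaf(text, offset) {
  return { text, offset, elements: [] };
}

// `parts` describes what each element of the sequence matched, in order.
function buildSequenceNode(offset, parts) {
  const node = {
    // The node's text still covers everything matched, muted or not.
    text: parts.map((part) => part.node.text).join(''),
    offset,
    elements: [],
  };
  for (const part of parts) {
    if (part.muted) continue;            // muted elements produce no child node
    node.elements.push(part.node);
    if (part.label) node[part.label] = part.node;
  }
  return node;
}

const tree = buildSequenceNode(0, [
  { node: leaf('{', 0), muted: true },
  { node: leaf("'foo'", 1), label: 'string' },
  { node: leaf(' => ', 6), muted: true },
  { node: leaf('36', 10), label: 'number' },
  { node: leaf('}', 12), muted: true },
]);

console.log(tree.elements.map((el) => el.text));
```

The muted pieces still have to match and still count towards the parent's text; they are only left out of elements.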

Second, we’ve added numeric repetition to match that found in regular expressions. As well as the * and + operators, we now have:

  • expr{n} matches expr exactly n times
  • expr{n,} matches expr at least n times
  • expr{n,m} matches expr at least n times and at most m times

So for example, [a-z]{3,5} matches 'bad' and 'apple', but not 'no' or 'bananas'.
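
The bounds can be sketched as a small greedy matcher. The function names below are hypothetical and this is not how Canopy implements repetition; it only illustrates the semantics, and it assumes the whole input must be consumed for a string to count as matching.

```javascript
// Hypothetical sketch of bounded repetition, not Canopy's implementation.
// Greedily consumes between `min` and `max` matching characters starting at `offset`.
// Returns the offset after the match, or -1 if fewer than `min` were found.
// expr{n} is min = max = n; expr{n,} leaves max at Infinity.
function matchRepeat(input, offset, matchesChar, min, max = Infinity) {
  let count = 0;
  let pos = offset;
  while (count < max && pos < input.length && matchesChar(input[pos])) {
    count += 1;
    pos += 1;
  }
  return count >= min ? pos : -1;
}

const isLower = (ch) => ch >= 'a' && ch <= 'z';

// Treat a string as matching [a-z]{3,5} only if the repetition consumes all of it.
function matchesWhole(input) {
  return matchRepeat(input, 0, isLower, 3, 5) === input.length;
}

for (const word of ['bad', 'apple', 'no', 'bananas']) {
  console.log(word, matchesWhole(word));
}
```

Here 'no' fails because it falls short of the minimum, while 'bananas' fails because the repetition stops after five characters and leaves input unconsumed.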

Finally, we’ve added an --output option to the command line. Normally, the output location is decided in one of two ways:

  • For languages that generate a single file, replace the .peg suffix of the grammar file with a language-specific extension.
  • For languages that generate multiple files, create a directory named the same as the grammar with the .peg extension removed.

This automatic decision can now be overridden using the --output option. A language-specific extension will be automatically added to this path if needed. In Java, the package name is set to match the output location. In Ruby, the grammar module is still named after the grammar name in the file. For example, if the grammar file begins grammar A.B.C then the Ruby module will be named A::B::C, regardless of where the output is written.
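
As a rough sketch of the default naming rules, the code below maps a grammar path to an output location. The function names, the language keys and the extension table are assumptions made for illustration, and Canopy's real logic may differ in its details.

```javascript
// Hypothetical sketch of the default output rules; not Canopy's actual code.
// Assumed extensions for targets that generate a single file.
const SINGLE_FILE_EXTENSIONS = new Map([
  ['js', '.js'],
  ['python', '.py'],
  ['ruby', '.rb'],
]);

function defaultOutputPath(grammarPath, lang) {
  const base = grammarPath.replace(/\.peg$/, '');
  const ext = SINGLE_FILE_EXTENSIONS.get(lang);
  // Targets with no single-file extension (Java in this sketch) get a directory
  // named after the grammar file with .peg removed.
  return ext === undefined ? base : base + ext;
}

// The Ruby module name comes from the grammar name, not from the output path.
function rubyModuleName(grammarName) {
  return grammarName.split('.').join('::');
}

console.log(defaultOutputPath('url.peg', 'js'));
console.log(defaultOutputPath('url.peg', 'java'));
console.log(rubyModuleName('A.B.C'));
```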

That covers all the functional changes. The last thing to mention is that Canopy v0.4 is published under the Mozilla Public License 2.0 rather than the GPLv3. We had a question about the licensing of derivative works, because parsers generated by Canopy contain snippets of its source code. In v0.3 these were really just tiny fragments glued together by the code generator, but it was enough to make some users cautious. This is even more of an issue in v0.4, because we’ve changed the implementation to use template files to generate more of the code, so the generated files do contain substantial chunks of Canopy’s code.

We’ve switched to the MPL to try to strike a balance. I would like Canopy itself, and modified versions of it, to remain free software, and they must be distributed under the MPL. However, if you use Canopy to build a parser and embed that parser in a larger program, it is our intention that you can distribute the result under whatever terms you want, as long as Canopy itself remains free. The MPL’s terms for Larger Works cover this.

That about covers it. Because of the situation the project was in, it had become very difficult to ship for quite a long time, but this is hopefully now resolved and future changes can be made much more quickly. So do install v0.4 and let me know if you find any issues!