Packrat parsing: a top-down performance improvement

In my introduction to parser combinators, we developed the following module for generating common combinators and used them to construct a simple parser for addition expressions. I give all the code in full here, since in the preceding articles it appears in incremental chunks with various implementations of the same method given, and if we’re going to discuss performance we need to be unambiguous about what code we’re using.

module ParserCombinators
  State = Struct.new(:string, :offset) do
    def peek(n)
      string[offset ... offset + n]
    end

    def read(n)
      State.new(string, offset + n)
    end

    def complete?
      offset == string.size
    end
  end

  def parse(string)
    node, state = root.call(State.new(string, 0))

    if state and state.complete?
      node
    end
  end

  def str(string, &action)
    -> state {
      chunk = state.peek(string.size)

      if chunk == string
        node = [:str, chunk]
        node = action.call(node) if action
        [node, state.read(string.size)]
      end
    }
  end

  def chr(pattern, &action)
    -> state {
      chunk = state.peek(1)

      if chunk =~ %r{[#{pattern}]}
        node = [:chr, chunk]
        node = action.call(node) if action
        [node, state.read(1)]
      end
    }
  end

  def seq(*parsers, &action)
    -> state {
      matches = []

      parsers.each do |parser|
        node, state = state && parser.call(state)
        matches << node if state
      end

      if state
        node = [:seq, *matches]
        node = action.call(node) if action
        [node, state]
      end
    }
  end

  def rep(parser, n, &action)
    -> state {
      matches = []
      last_state = nil

      while state
        last_state = state
        node, state = parser.call(state)
        matches << node if state
      end

      if matches.size >= n
        node = [:rep, *matches]
        node = action.call(action) if action
        [node, last_state]
      end
    }
  end

  def alt(*parsers, &action)
    -> state {
      parsers.each do |parser|
        node, new_state = parser.call(state)
        if new_state
          node = action.call(node) if action
          return [node, new_state]
        end
      end

      nil
    }
  end

  def ref(name)
    -> state {
      __send__(name).call(state)
    }
  end
end

Then we went on to look at precedence and associativity and developed the following richer language for composing addition and multiplication. Although our previous implementation included blocks for transforming the original syntax nodes, I am omitting this for brevity and because it does not have a major impact on performance.

class Arithmetic
  include ParserCombinators

  def root
    expression
  end

  def expression
    alt(ref(:addition), ref(:term))
  end

  def addition
    binary(:term, str("+"))
  end

  def term
    alt(ref(:multiplication), ref(:factor))
  end

  def multiplication
    binary(:factor, str("*"))
  end

  def factor
    alt(ref(:number), ref(:paren_expression))
  end

  def number
    alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0)))
  end

  def paren_expression
    seq(str("("), ref(:_), ref(:expression), ref(:_), str(")"))
  end

  def binary(arg, op)
    seq(ref(arg), rep(seq(ref(:_), op, ref(:_), ref(arg)), 1))
  end

  def _
    rep(str(" "), 0)
  end
end

Now, this parser may be correct in that it recognises and parses the strings we give to it into the trees we expect. However, to be practical we generally want parsers to be fast, which usually means the time they take to run is proportional to the length of the input. We say such a parser runs in linear time. A naive recursive descent parser typically won’t do that, for several reasons. The most common are that, every time it has to make a choice it tries every alternative, producing a large tree of possible code paths, or, that it repeatedly tries to match the same rule at the same offset, effectively recomputing a result that’s already known. Both these effects contribute to simple implementations running in exponential time on the length of the input.

Mitigating the first kind of problem, that of spending too long finding out which choice to make, can be done using predictive parsing. The parser compares the next character of input to the first element of each alternative, only pursuing those alternatives it knows to begin with that character. Unfortunately, this works best when each alternative expression begins with a terminal – a string literal, rather than a reference to another rule – and this means our grammar must be restructured into a less natural form. It also restricts the set of languages such parsers are able to recognise.

However, the second kind of problem, that of performing redundant work for already-known results can be verily easily fixed with some simple and localised changes to our implementation. All that is required is for the results of each parsing rule to be cached, so they are not re-computed. This technique is called memoisation.

Fortunately, we can make some very simple and localised changes to our implementation to stop such things happening. Each use of a rule in the grammar is implemented by a called to ref(name), and that call gives us an opportunity to insert a cache based on the name of the rule we’re invoking.

In our implementation, calling a rule is actually a two-stage process. In ref, we use __send__(name) to invoke the method for a rule, and this uses the methods in ParserCombinators to construct a parser, which is essentially a set of nested lambdas. Then, we call call(state) on the result of the first call to have the newly minted parser process the input. Our first memoisation improvement comes from recognising that the result of the first call – the construction of the parsing expression – will always be the same, since the rules don’t change between invocations. Therefore we can keep a cache of these parsing expressions and have ref populate the cache the first time it encounters each rule.

class RefCachedArithmetic
  def ref(name)
    @ref_cache ||= {}

    -> state {
      @ref_cache[name] ||= __send__(name)
      @ref_cache[name].call(state)
    }
  end
end

To test this, we’ll try parsing the following randomly generated expression:

399 + 422 * (778 * (851 * 867 * 454 * 599 + 408 * 300 +
988 * 610 * 164 * 315 + 930 * 14) * (608 + ((849))) *
(81) + 102 * 768 + (50) + 967 * 263 + 251) + (744)

I’m using the benchmark-ips library to measure how many times per second each parser can execute:

Benchmark.ips do |bm|
  bm.report "combinators" do
    Arithmetic.new.parse(expression)
  end

  bm.report "ref cache" do
    RefCachedArithmetic.new.parse(expression)
  end
end

The results show that just caching the generated parsers gives roughly a 50% performance improvement over the original implementation, on this input.

 combinators    213.980  (± 3.3%) i/s -      1.080k in   5.052557s
   ref cache    333.491  (± 4.8%) i/s -      1.664k in   5.001283s

This parser-caching trick is only a partial solution though. The second step in ref is to call the generated parser with our input, and this is where much of the redundant work happens in these parsers. Because of the way we need to structure grammars to express precedence and repetition, we end up trying the same rule at the same position multiple times.

Say we’re given just the input 42 to parse. We begin in the expression rule, which tries to match addition / term. We try addition first, which has the form term (_ "+" _ term)+, so the first thing addition tries to match is a term. A term is multiplication / factor so we try multiplication first, which is factor (_ "*" _ factor)+, so now we try to match factor. factor is number / paren_expression so we try number first, and finally we reach a rule with terminal expressions. To reach this point we have the following rules on the stack:

expression        <-  addition / term

addition          <-  term (_ "+" _ term)+

term              <-  multiplication / factor

multiplication    <-  factor (_ "*" _ factor)+

factor            <-  number / paren_expression

number            <-  "0" / [1-9] [0-9]*

number matches the input 42 at offset 0 and moves the offset to 2, the end of the string. Since number succeeded, we pop it off the stack and also factor which called it, leaving us partway through the multiplication expression:

expression        <-  addition / term

addition          <-  term (_ "+" _ term)+

term              <-  multiplication / factor

multiplication    <-  factor (_ "*" _ factor)+
                             ^

multiplication now wants to see zero or more spaces followed by a * character, but we have reached the end of the input and so no further matching is possible. Therefore multiplication fails and we pop back up to the next alternative in term:

expression        <-  addition / term

addition          <-  term (_ "+" _ term)+

term              <-  multiplication / factor
                                       ^

term began by attempting to match multiplication at offset 0, so it now tries to match factor at the same position. But we’ve already done this work when we tried to match multiplication: we tried factor which in turn tried number which yielded a match and moved the offset to the end of the input. So we know factor succeeds here, and so we can pop term off the stack and return to trying to match addition:

expression        <-  addition / term

addition          <-  term (_ "+" _ term)+
                           ^

Now the same thing happens with addition that we saw in multiplication: the rule expects more characters but we’ve reached the end of the input. So, addition fails and we try the other alternative in expression:

expression        <-  addition / term
                                 ^

We’re left trying to match term at position 0, and again this is work we’ve already done: term tries to match multiplication, which tries factor, which tries number, which matches, and so factor succeeds, then we pop back up into multiplication and find it fails on the next character, so we go back to term’s next alternative, which is factor, which tries number, which succeeds, and now we can unwind all the way back to the top and discover that expression matched a number.

All in all, we’ve called number four times in order to match the input 42. Each call to term resulted in factor being called twice: once as the first item in multiplication and once as the second alternative in term. Calling expression similarly resulted in calling term twice. We can visualise the above using the following call tree:

                                    expression
                                     /       \
                                   /           \
                                 /               \
                               /                   \
                       addition                     term
                      /                            /    \
                  term                multiplication    factor
                 /    \                         /          \
    multiplication    factor              factor            number
              /          \                  /
        factor            number      number
          /
    number

Rule structures like this, where one alternative is tried, followed by a fallback with the same beginning item, are very common in programming languages and this call pattern results in the parser taking an exponentially longer amount of time as the grammar grows in structure.

Although this seems like a terrible situation, the retrying of the same rule at the same offset provides an excellent opportunity to cache things. We can augment our ref function with a cache based on the rule name and input offset, such that the named parsing function is only invoked if the cache does not already contain an entry for the current rule and offset. Since each parsing function returns both a parse tree and the new state, the cached values will also make the parser continue from the correct offset when they are used.

class NodeCachedArithmetic
  def ref(name)
    @node_cache ||= {}
    @ref_cache  ||= {}

    -> state {
      key = [name, state.offset]

      unless @node_cache.has_key?(key)
        @ref_cache[name] ||= __send__(name)
        @node_cache[key] = @ref_cache[name].call(state)
      end

      @node_cache[key]
    }
  end
end

Now, what happens when we parse 42 is much simpler. As before expression tries addition, which tries term, which tries multiplication, which tries factor, which calls number, and that succeeds. So we store a cache result for (number, 0). factor has succeeded so we also store the same result in the cache for (factor, 0). Then multiplication fails and term tries factor, but now the result for factor can be plucked out of the cache. This in turn makes term succeed so we put the same result in for (term, 0) and so on up the stack. When addition fails and we fall back on term this is again just a cache lookup away. This eliminates most of the function calls in the original implementation, since no function is ever called twice at the same offset:

                              expression
                               /
                       addition
                      /
                  term
                 /
    multiplication
              /
        factor
          /
    number

This modification makes a substantial improvement to performance on our test input:

 combinators    213.625  (± 4.7%) i/s -      1.080k in   5.066901s
   ref cache    343.193  (± 2.9%) i/s -      1.736k in   5.062538s
  node cache    518.276  (± 3.3%) i/s -      2.600k in   5.022265s

In general, this simple caching modification means inputs can be parsed in linear time with arbitrary amounts of lookahead. The technique is known as packrat parsing, and is explained in a brief paper by Bryan Ford, the inventor of parsing expression grammars, and explored more fully in his master’s thesis.

While this makes a big improvement to how the runtime of the parser scales with the input size, the method of implementation itself can still be changed to yield large constant-factor improvements. For example, we can take our grammar and transcribe it using a parser generator, which will generate code for us. I maintain such a tool, called Canopy, in which the grammar would be expressed as follows:

grammar CanopyArithmetic

expression        <-  addition / term
addition          <-  term (_ "+" _ term)+
term              <-  multiplication / factor
multiplication    <-  factor (_ "*" _ factor)+
factor            <-  number / paren_expression
number            <-  "0" / [1-9] [0-9]*
paren_expression  <-  "(" _ expression _ ")"
_                 <-  " "*

Canopy can convert this grammar into a Ruby class, and we can benchmark that against our previous implementations:

 combinators    211.650  (± 4.3%) i/s -      1.064k in   5.036935s
   ref cache    331.979  (± 4.2%) i/s -      1.683k in   5.078639s
  node cache    509.117  (± 3.7%) i/s -      2.550k in   5.015619s
      canopy      1.546k (± 4.7%) i/s -      7.752k in   5.024497s

Canopy makes roughly a 200% improvement over our previous best attempt. Although in an abstract sense it’s doing the exact same computation as our combinator-based implementation, it avoids a lot of the cost the combinators carry in Ruby and some other languages. To run all those nested lambdas means making a lot of function calls, which can be expensive since a stack frame must be allocated and control must jump to the new function and return again. In contrast, Canopy generates all the code for implementing each parsing expression inline within each parsing function. This results in a lot of verbose impenetrable code, but it runs much faster.

Still, despite these improvements, there is still much room for the parser writer to affect things. For example, rather than implementing an if-statement with optional else as:

if_statement  <-  "if" expression "then" expression "else" expression
               /  "if" expression "then" expression

We might want to avoid re-matching the if and then by putting the whole shared section in its own rule, to take advantage of caching:

if_statement  <-  if_then "else" expression
               /  if_then

if_then       <-  "if" expression "then" expression

Or we could use the ? operator to match the else section if present, without having to re-match the first section at all.

if_statement  <-  "if" expression "then" expression ("else" expression)?

There are usually many ways of expressing the same language, and it’s up to the parser writer to make an acceptable trade-off between the clarity of the grammar, the resulting parse tree, and the runtime performance given their chosen parsing strategy. Indeed sometimes, caching does not actually help on common inputs – the cache lookup itself comes with a cost and that can outweigh the benefit if your grammar does not often retry rules at the same offsets. The parser generator PEG.js has an option to turn off caching in its generated parsers for this reason.

Still, despite the performance problems, parser combinators, PEG generators and other top-down techniques remain a great tool for prototyping since a naive but correct parser can be implemented with relatively little machinery. Expressing the grammar using declarative rules makes it easier to switch to a different tool later without having to rewrite too much code.