In my introduction to parser combinators, we developed the following module for generating common combinators and used them to construct a simple parser for addition expressions. I give all the code in full here, since in the preceding articles it appears in incremental chunks with various implementations of the same method given, and if we’re going to discuss performance we need to be unambiguous about what code we’re using.
module ParserCombinators
  # Tracks the input string and the current parsing position within it.
  State = Struct.new(:string, :offset) do
    def peek(n)
      string[offset ... offset + n]
    end

    def read(n)
      State.new(string, offset + n)
    end

    def complete?
      offset == string.size
    end
  end

  # Runs the root rule and returns a tree only if the whole input was consumed.
  def parse(string)
    node, state = root.call(State.new(string, 0))
    if state and state.complete?
      node
    end
  end

  # Matches an exact string literal at the current offset.
  def str(string, &action)
    -> state {
      chunk = state.peek(string.size)
      if chunk == string
        node = [:str, chunk]
        node = action.call(node) if action
        [node, state.read(string.size)]
      end
    }
  end

  # Matches a single character against a character-class pattern.
  def chr(pattern, &action)
    -> state {
      chunk = state.peek(1)
      if chunk =~ %r{[#{pattern}]}
        node = [:chr, chunk]
        node = action.call(node) if action
        [node, state.read(1)]
      end
    }
  end

  # Matches a series of parsers, all of which must succeed in order.
  def seq(*parsers, &action)
    -> state {
      matches = []
      parsers.each do |parser|
        node, state = state && parser.call(state)
        matches << node if state
      end
      if state
        node = [:seq, *matches]
        node = action.call(node) if action
        [node, state]
      end
    }
  end

  # Matches a parser repeatedly, requiring at least n matches to succeed.
  def rep(parser, n, &action)
    -> state {
      matches = []
      last_state = nil
      while state
        last_state = state
        node, state = parser.call(state)
        matches << node if state
      end
      if matches.size >= n
        node = [:rep, *matches]
        node = action.call(node) if action
        [node, last_state]
      end
    }
  end

  # Tries each parser in turn and returns the first successful match.
  def alt(*parsers, &action)
    -> state {
      parsers.each do |parser|
        node, new_state = parser.call(state)
        if new_state
          node = action.call(node) if action
          return [node, new_state]
        end
      end
      nil
    }
  end

  # Looks a rule up by name at parse time, letting rules refer to one another.
  def ref(name)
    -> state {
      __send__(name).call(state)
    }
  end
end
Then we went on to look at precedence and associativity and developed the following richer language for composing addition and multiplication. Although our previous implementation included blocks for transforming the original syntax nodes, I am omitting this for brevity and because it does not have a major impact on performance.
class Arithmetic
  include ParserCombinators

  def root
    expression
  end

  def expression
    alt(ref(:addition), ref(:term))
  end

  def addition
    binary(:term, str("+"))
  end

  def term
    alt(ref(:multiplication), ref(:factor))
  end

  def multiplication
    binary(:factor, str("*"))
  end

  def factor
    alt(ref(:number), ref(:paren_expression))
  end

  def number
    alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0)))
  end

  def paren_expression
    seq(str("("), ref(:_), ref(:expression), ref(:_), str(")"))
  end

  def binary(arg, op)
    seq(ref(arg), rep(seq(ref(:_), op, ref(:_), ref(arg)), 1))
  end

  def _
    rep(str(" "), 0)
  end
end
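As a quick sanity check, here is the kind of raw tree this grammar produces; with the transforming blocks omitted, the node shapes come straight from the str, chr, seq and rep combinators above:
parser = Arithmetic.new

# "42" is matched by the seq(chr("1-9"), rep(chr("0-9"), 0)) branch of `number`
parser.parse("42")
# => [:seq, [:chr, "4"], [:rep, [:chr, "2"]]]

# parse only returns a node if the entire input was consumed
parser.parse("42 +")
# => nil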
Now, this parser may be correct in that it recognises and parses the strings we give to it into the trees we expect. However, to be practical we generally want parsers to be fast, which usually means the time they take to run is proportional to the length of the input. We say such a parser runs in linear time. A naive recursive descent parser typically won’t do that, for several reasons. The most common are that every time it has to make a choice it tries every alternative, producing a large tree of possible code paths, and that it repeatedly tries to match the same rule at the same offset, effectively recomputing a result that’s already known. Both these effects contribute to simple implementations running in exponential time on the length of the input.
Mitigating the first kind of problem, that of spending too long finding out which choice to make, can be done using predictive parsing. The parser compares the next character of input to the first element of each alternative, only pursuing those alternatives it knows to begin with that character. Unfortunately, this works best when each alternative expression begins with a terminal – a string literal, rather than a reference to another rule – and this means our grammar must be restructured into a less natural form. It also restricts the set of languages such parsers are able to recognise.
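To make that concrete, here is a rough sketch (not part of the module above; predictive_alt is an invented name) of what a predictive choice could look like in this style of combinator library:
# Hypothetical sketch: a choice combinator that dispatches on the next input
# character instead of trying every alternative in turn. It only works when we
# know exactly which characters each alternative can start with.
def predictive_alt(table)
  -> state {
    parser = table[state.peek(1)]
    parser && parser.call(state)
  }
end

# Usage sketch: a factor starts either with "(" or with a digit, so we could
# dispatch on that first character rather than trying number then falling back.
# digit_table = ("0".."9").map { |d| [d, number] }.to_h
# factor = predictive_alt(digit_table.merge("(" => paren_expression))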
However, the second kind of problem, that of performing redundant work for already-known results, can be very easily fixed with some simple and localised changes to our implementation. All that is required is for the results of each parsing rule to be cached, so they are not re-computed. This technique is called memoisation.
Each use of a rule in the grammar is implemented by a call to ref(name), and that call gives us an opportunity to insert a cache based on the name of the rule we’re invoking.
In our implementation, calling a rule is actually a two-stage process. In ref, we use __send__(name) to invoke the method for a rule, and this uses the methods in ParserCombinators to construct a parser, which is essentially a set of nested lambdas. Then, we call call(state) on the result of the first call to have the newly minted parser process the input. Our first memoisation improvement comes from recognising that the result of the first call – the construction of the parsing expression – will always be the same, since the rules don’t change between invocations. Therefore we can keep a cache of these parsing expressions and have ref populate the cache the first time it encounters each rule.
class RefCachedArithmetic < Arithmetic
  def ref(name)
    @ref_cache ||= {}

    -> state {
      @ref_cache[name] ||= __send__(name)
      @ref_cache[name].call(state)
    }
  end
end
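One small thing to note about this design: the cache lives in an instance variable, and since the grammar rules never change, it is safe to reuse a single instance across many inputs:
parser = RefCachedArithmetic.new

parser.parse("1 + 2")   # builds and caches each rule's parser on first use
parser.parse("3 * 4")   # reuses the lambdas already stored in @ref_cache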
To test this, we’ll try parsing the following randomly generated expression:
399 + 422 * (778 * (851 * 867 * 454 * 599 + 408 * 300 +
988 * 610 * 164 * 315 + 930 * 14) * (608 + ((849))) *
(81) + 102 * 768 + (50) + 967 * 263 + 251) + (744)
I’m using the benchmark-ips library to measure how many times per second each parser can execute:
require "benchmark/ips"

Benchmark.ips do |bm|
  bm.report "combinators" do
    Arithmetic.new.parse(expression)
  end

  bm.report "ref cache" do
    RefCachedArithmetic.new.parse(expression)
  end
end
The results show that just caching the generated parsers gives roughly a 50% performance improvement over the original implementation, on this input.
combinators 213.980 (± 3.3%) i/s - 1.080k in 5.052557s
ref cache 333.491 (± 4.8%) i/s - 1.664k in 5.001283s
This parser-caching trick is only a partial solution though. The second step in ref is to call the generated parser with our input, and this is where much of the redundant work happens in these parsers. Because of the way we need to structure grammars to express precedence and repetition, we end up trying the same rule at the same position multiple times.
Say we’re given just the input 42 to parse. We begin in the expression rule, which tries to match addition / term. We try addition first, which has the form term (_ "+" _ term)+, so the first thing addition tries to match is a term. A term is multiplication / factor so we try multiplication first, which is factor (_ "*" _ factor)+, so now we try to match factor. factor is number / paren_expression so we try number first, and finally we reach a rule with terminal expressions. To reach this point we have the following rules on the stack:
expression <- addition / term
addition <- term (_ "+" _ term)+
term <- multiplication / factor
multiplication <- factor (_ "*" _ factor)+
factor <- number / paren_expression
number <- "0" / [1-9] [0-9]*
number matches the input 42 at offset 0 and moves the offset to 2, the end of the string. Since number succeeded, we pop it off the stack and also factor which called it, leaving us partway through the multiplication expression:
expression <- addition / term
addition <- term (_ "+" _ term)+
term <- multiplication / factor
multiplication <- factor (_ "*" _ factor)+
                         ^
multiplication now wants to see zero or more spaces followed by a * character, but we have reached the end of the input and so no further matching is possible. Therefore multiplication fails and we pop back up to the next alternative in term:
expression <- addition / term
addition <- term (_ "+" _ term)+
term <- multiplication / factor
                         ^
term began by attempting to match multiplication at offset 0, so it now tries to match factor at the same position. But we’ve already done this work when we tried to match multiplication: we tried factor which in turn tried number which yielded a match and moved the offset to the end of the input. So we know factor succeeds here, and so we can pop term off the stack and return to trying to match addition:
expression <- addition / term
addition <- term (_ "+" _ term)+
                 ^
Now the same thing happens with addition that we saw in multiplication: the rule expects more characters but we’ve reached the end of the input. So, addition fails and we try the other alternative in expression:
expression <- addition / term
                         ^
We’re left trying to match term at position 0, and again this is work we’ve already done: term tries to match multiplication, which tries factor, which tries number, which matches, and so factor succeeds, then we pop back up into multiplication and find it fails on the next character, so we go back to term’s next alternative, which is factor, which tries number, which succeeds, and now we can unwind all the way back to the top and discover that expression matched a number.
All in all, we’ve called number four times in order to match the input 42. Each call to term resulted in factor being called twice: once as the first item in multiplication and once as the second alternative in term. Calling expression similarly resulted in calling term twice. We can visualise the above using the following call tree:
                            expression
                           /          \
                          /            \
                         /              \
                        /                \
                 addition                 term
                    /                    /    \
                 term         multiplication   factor
                /    \               /            \
    multiplication   factor       factor         number
          /             \           /
      factor           number    number
        /
    number
Rule structures like this, where one alternative is tried and then a fallback beginning with the same item is tried, are very common in programming languages, and this call pattern means the parser takes exponentially longer as the grammar gains more layers of this kind.
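If you want to see these redundant calls for yourself, one way (not part of the parser itself; CountingArithmetic and @calls are invented for this sketch) is to subclass Arithmetic and count how often each rule is attempted at each offset:
# Instrumentation sketch: record how many times each (rule, offset) pair is
# attempted during a single parse.
class CountingArithmetic < Arithmetic
  attr_reader :calls

  def parse(string)
    @calls = Hash.new(0)
    super
  end

  def ref(name)
    -> state {
      @calls[[name, state.offset]] += 1
      __send__(name).call(state)
    }
  end
end

counter = CountingArithmetic.new
counter.parse("42")
counter.calls[[:number, 0]]  # => 4, matching the call tree above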
Although this seems like a terrible situation, the retrying of the same rule at the same offset provides an excellent opportunity to cache things. We can augment our ref function with a cache based on the rule name and input offset, such that the named parsing function is only invoked if the cache does not already contain an entry for the current rule and offset. Since each parsing function returns both a parse tree and the new state, the cached values will also make the parser continue from the correct offset when they are used.
class NodeCachedArithmetic < Arithmetic
  def ref(name)
    @node_cache ||= {}
    @ref_cache ||= {}

    -> state {
      key = [name, state.offset]
      unless @node_cache.has_key?(key)
        @ref_cache[name] ||= __send__(name)
        @node_cache[key] = @ref_cache[name].call(state)
      end
      @node_cache[key]
    }
  end
end
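One caveat worth keeping in mind: the cache key is the rule name and input offset, not the input string itself, so a given instance’s node cache is only valid for a single input. The simplest policy, and the one used in the benchmarks here, is a fresh parser instance per parse:
# The node cache is keyed by [rule, offset] only, so cached results from one
# string must not be reused when parsing a different string: start fresh.
NodeCachedArithmetic.new.parse("399 + 422")
NodeCachedArithmetic.new.parse("(81) + 102")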
Now, what happens when we parse 42 is much simpler. As before expression tries addition, which tries term, which tries multiplication, which tries factor, which calls number, and that succeeds. So we store a cache result for (number, 0). factor has succeeded so we also store the same result in the cache for (factor, 0). Then multiplication fails and term tries factor, but now the result for factor can be plucked out of the cache. This in turn makes term succeed so we put the same result in for (term, 0) and so on up the stack. When addition fails and we fall back on term this is again just a cache lookup away. This eliminates most of the function calls in the original implementation, since no function is ever called twice at the same offset:
        expression
            /
        addition
           /
        term
          /
multiplication
        /
    factor
      /
  number
This modification makes a substantial improvement to performance on our test input:
combinators 213.625 (± 4.7%) i/s - 1.080k in 5.066901s
ref cache 343.193 (± 2.9%) i/s - 1.736k in 5.062538s
node cache 518.276 (± 3.3%) i/s - 2.600k in 5.022265s
In general, this simple caching modification means inputs can be parsed in linear time with arbitrary amounts of lookahead. The technique is known as packrat parsing, and is explained in a brief paper by Bryan Ford, the inventor of parsing expression grammars, and explored more fully in his master’s thesis.
While this makes a big improvement to how the runtime of the parser scales with the input size, the method of implementation itself can still be changed to yield large constant-factor improvements. For example, we can take our grammar and transcribe it using a parser generator, which will generate code for us. I maintain such a tool, called Canopy, in which the grammar would be expressed as follows:
grammar CanopyArithmetic
expression <- addition / term
addition <- term (_ "+" _ term)+
term <- multiplication / factor
multiplication <- factor (_ "*" _ factor)+
factor <- number / paren_expression
number <- "0" / [1-9] [0-9]*
paren_expression <- "(" _ expression _ ")"
_ <- " "*
Canopy can convert this grammar into a Ruby class, and we can benchmark that against our previous implementations:
combinators 211.650 (± 4.3%) i/s - 1.064k in 5.036935s
ref cache 331.979 (± 4.2%) i/s - 1.683k in 5.078639s
node cache 509.117 (± 3.7%) i/s - 2.550k in 5.015619s
canopy 1.546k (± 4.7%) i/s - 7.752k in 5.024497s
Canopy makes roughly a 200% improvement over our previous best attempt. Although in an abstract sense it’s doing the exact same computation as our combinator-based implementation, it avoids a lot of the cost the combinators carry in Ruby and some other languages. To run all those nested lambdas means making a lot of function calls, which can be expensive since a stack frame must be allocated and control must jump to the new function and return again. In contrast, Canopy generates all the code for implementing each parsing expression inline within each parsing function. This results in a lot of verbose impenetrable code, but it runs much faster.
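To get a feel for why that is faster, here is a hand-written approximation of what “inlining” the number rule means; this is illustrative only, not Canopy’s actual generated code, which also has to handle caching, error reporting and tree construction:
# Illustrative only, not Canopy's real output: the alt/seq/chr/rep logic of
# the `number` rule expanded into one method, with no nested lambdas and no
# extra function call per parsing expression.
def read_number(input, offset)
  # First alternative: "0"
  return [[:str, "0"], offset + 1] if input[offset] == "0"

  # Second alternative: [1-9] [0-9]*
  first = input[offset]
  return nil unless first && first =~ /[1-9]/

  digits = []
  offset += 1
  while offset < input.size && input[offset] =~ /[0-9]/
    digits << [:chr, input[offset]]
    offset += 1
  end

  [[:seq, [:chr, first], [:rep, *digits]], offset]
end

read_number("42", 0)  # => [[:seq, [:chr, "4"], [:rep, [:chr, "2"]]], 2]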
Still, despite these improvements, there is much room for the parser writer to affect things. For example, rather than implementing an if-statement with optional else as:
if_statement <- "if" expression "then" expression "else" expression
/ "if" expression "then" expression
We might want to avoid re-matching the if and then by putting the whole shared section in its own rule, to take advantage of caching:
if_statement <- if_then "else" expression
              / if_then
if_then <- "if" expression "then" expression
Or we could use the ? operator to match the else section if present, without having to re-match the first section at all.
if_statement <- "if" expression "then" expression ("else" expression)?
There are usually many ways of expressing the same language, and it’s up to the parser writer to make an acceptable trade-off between the clarity of the grammar, the resulting parse tree, and the runtime performance given their chosen parsing strategy. Indeed sometimes, caching does not actually help on common inputs – the cache lookup itself comes with a cost and that can outweigh the benefit if your grammar does not often retry rules at the same offsets. The parser generator PEG.js has an option to turn off caching in its generated parsers for this reason.
Still, despite the performance problems, parser combinators, PEG generators and other top-down techniques remain a great tool for prototyping since a naive but correct parser can be implemented with relatively little machinery. Expressing the grammar using declarative rules makes it easier to switch to a different tool later without having to rewrite too much code.