In the previous article we saw the basic elements of parser combinators, a simple modular technique for parsing context-free languages. In this article we’ll extend the grammar we developed to include other operators, and in so doing we’ll discover how to deal with operator precedence and associativity. But first, we should remind ourselves what these terms mean.
Precedence refers to the order in which operations should be performed when
a program does not give an explicit order. In some languages, all arithmetic
expressions must be explicitly parenthesised to indicate their order, for
example in Lisp dialects we’d typically write (+ (* 2 3) 4)
or (* 2 (+ 3
4))
. However, in many languages, operators can be chained without parentheses,
as in 2 * 3 + 4
. A parser must decide whether this string represents the
operation (2 * 3) + 4
or 2 * (3 + 4)
. By convention, we usually give
multiplication higher precedence than addition, meaning the string corresponds
to the interpretation (2 * 3) + 4
, so that the multiplication is performed
before the addition. In terms of the parser output, it means we would prefer the
parse tree to be ["+", ["*", 2, 3], 4]
rather than ["*", 2, ["+", 3, 4]]
.
Associativity determines how operators are grouped together when many
instances of the same operator are used in a row. An operator ~
is
left-associative if a ~ b ~ c
should be parsed as (a ~ b) ~ c
, and it is
right-associative if the result should be a ~ (b ~ c)
. For arithmetic
addition, this doesn’t matter: (a + b) + c
and a + (b + c)
give the same
result. However for subtraction the difference is important: (a - b) - c
does
not in general equal a - (b - c)
. By convention, arithmetic subtraction is
usually assumed to be left-associative, that is we choose the (a - b) - c
interpretation.
Parser combinators belong to a family of parsing techniques called recursive
descent, meaning the grammar is represented by having a function for each
rule, and the parse progresses via these functions calling each other. This
mechanism imposes a constraint on how we express the structure of languages. Say
we wanted to add multiplication to our language. One way to express a grammar
for the resulting language might look like this: an expression is an addition, a
multiplication or a number, an addition is two expressions separated by +
, a
multiplication is the same but with *
in place of +
, and a number is defined
as before.
<expression> ::= <addition> | <multiplication> | <number>
<addition> ::= <expression> "+" <expression>
<multiplication> ::= <expression> "*" <expression>
<number> ::= "0" | [1-9] [0-9]*
This grammar is ambiguous. That means that given an expression like 2 * 3 +
4
, both (2 * 3) + 4
and 2 * (3 + 4)
are valid parses, that is they both
conform to the grammar. For example, take 2 * (3 + 4)
: 2
is a number, which
is an expression, ditto for 3
and 4
. 3 + 4
is therefore two expressions
separated by +
, which makes it an addition, and therefore also an expression.
An so, 2 * (3 + 4)
is two expressions separated by *
: a multiplication. You
can follow a similar argument for the other parse.
Being ambiguous is not a problem per se. Some parsing techniques can accept ambiguous grammars, they just require extra hints about operator precedence. This is a nice way to write grammars since the grammar just describes the logical structure of the language, and hints about precedence can be done as a separate concern. However, recursive descent is not one of these techniques, and precedence must be expressed within the grammar rules themselves.
In recursive descent, rather than saying everything operates on an expression
,
we need to be more specific about what types of things can appear on the left
and right of each operator. To refresh our memories, here’s the Addition
grammar we developed in the last article:
class Addition
include ParserCombinators
def root
expression
end
def expression
alt(ref(:addition), ref(:number))
end
def addition
seq(ref(:number), ref(:_), str("+"), ref(:_), ref(:expression)) { |node|
["+", node[1], node[5]]
}
end
def number
alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
if node.first == :str
0
else
digit_chrs = [node[1]] + node[2].drop(1)
digit_chrs.map(&:last).join("").to_i
end
}
end
def _
rep(str(" "), 0)
end
end
Since this is Ruby, we can extend this language by making a subclass and adding or overriding methods. To begin with, I’m going to add a helper method for building rules for binary expressions, just because they tend to get repetitive. This is one of the great benefits of this approach: the ability to directly refactor the grammar using normal programming techniques.
class Precedence < Addition
def binary(left, op, right)
seq(ref(left), ref(:_), op, ref(:_), ref(right)) { |node|
[node[3][1], node[1], node[5]]
}
end
def addition
binary(:number, str("+"), :expression)
end
end
Now we can begin changing the grammar to accommodate multiplication. We need to
redefine what expression
is, and the way we do this is to split the grammar
into several layers, one for each level of precedence. We being with the
lowest-precedence operator, addition. We define expression
as either an
addition, or whatever addition operates on, which I’ll call a term
.
class Precedence
def expression
alt(ref(:addition), ref(:term))
end
end
Then, we define addition
as an term
combined with anything in addition’s
precedence level, which is to say an expression
. But what should term
be?
Think about what addition can operate on. Given the string 2 * 3 + 4
, we want
to end up deciding the precedence is (2 * 3) + 4
, so addition can operate on
multiplications and numbers – that’s what an term
is.
class Precedence
def addition
binary(:term, str("+"), :expression)
end
def term
alt(ref(:multiplication), ref(:number))
end
end
term
actually forms the next precedence group under expression
. All that’s
left is to define multiplication
as a number
combined with anything in
multiplication’s precedence group.
class Precedence
def multiplication
binary(:number, str("*"), :term)
end
end
This parser will now process expressions with multiplication and addition with
the correct precedence. In all these examples, the *
operator is the innermost
and +
is the outermost:
Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]
Precedence.new.parse("2 + 3 * 4")
# => ["+", 2, ["*", 3, 4]]
Precedence.new.parse("2 * 3 + 4 * 5 + 6")
# => ["+", ["*", 2, 3], ["+", ["*", 4, 5], 6]]
Sometimes, we’ll want to override the default precedence by explicitly
parenthesising the expression ourselves, as in 2 * (3 + 4)
. Now we see it’s
not sufficient for multiplication to operate merely on numbers – it must be
able to operate on parenthesised expressions too. Therefore, we need to replace
uses of number
in our rules with a new category called factor
, the set of
expressions multiplication can apply to. And, we need a new rule for parsing an
expression in parens and extracting the contents.
class Precedence
def term
alt(ref(:multiplication), ref(:factor))
end
def multiplication
binary(:factor, str("*"), :term)
end
def factor
alt(ref(:number), ref(:paren_expression))
end
def paren_expression
seq(str("("), ref(:_), ref(:expression), ref(:_), str(")")) { |node|
node[3]
}
end
end
Let’s see this parser in action:
Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]
Precedence.new.parse("2 * (3 + 4)")
# => ["*", 2, ["+", 3, 4]]
This new factor
forms a new precedence level in the grammar. I often organise
my parsers so that the rules for precedence levels are together, and the rules
for what each term actually is follow after.
class Precedence
def expression
alt(ref(:addition), ref(:term))
end
def term
alt(ref(:multiplication), ref(:factor))
end
def factor
alt(ref(:number), ref(:paren_expression))
end
def addition # ...
def multiplication # ...
def number # ...
def paren_expression # ...
end
When organised this way, we see a pattern: each precedence group consists of the
terms in that level, plus a fall-through to the next level. For example, the
term
group consists of multiplication
plus the fall-through factor
.
factor
is the highest-precedence group, consisting of number
and
paren_expression
and no fall-through. Terms in a group typically operate on
the fall-through element of the group to avoid consuming too much of the input.
Because this parsing technique is greedy, the order of terms in an alt
can be
important here. For example, if term
was:
alt(ref(:factor), ref(:multiplication))
then parsing 2 + 3 * 4
would stop after the 3
: the term
rule will try to
parse using factor
, which tries number
, which matches the 3
, and so
factor
and term
in turn both return without trying their other alternatives.
So, it’s important to list alternatives so that when two of them could begin
with the same kind of character, the longer one is tried first.
That’s precedence taken care of, now let’s tackle associativity. You’ll notice that our current parser produces right-associative parses:
Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]
That’s fine for addition, but not for subtraction: we’d really prefer 3 - 4 -
5
to parse as (3 - 4) - 5
, or ["-", ["-", 3, 4], 5]
, a left-associative
parse. The current parsing outcome is a direct result of how addition
is
defined:
def addition
binary(:term, str("+"), :expression)
end
This puts term
, which can be a multiplication, number or paren expression, on
the left, and expression
– the part that can be another addition – on the
right. We can’t make both sides expression
as that would be ambiguous, and we
can’t make both term
as that would prevent chained addition entirely. But can
we just swap their positions?
def addition
binary(:expression, str("+"), :term)
end
Unfortunately, that also doesn’t work. Remember how expression
is defined:
def expression
alt(ref(:addition), ref(:term))
end
The first thing expression
does is try to match addition
, and the first
thing the swapped addition
does is try to match expression
, so we get an
infinite recursion leading to a stack overflow. This situation is called left
recursion, and recursive descent and most other top-down parsing
techniques can’t handle it without special modifications.
However, a grammar that contains left-recursion can be rewritten to remove it, by restructuring the rules. In our case, the relevant grammar rules in their current right-recursive form are:
<expression> ::= <addition> | <term>
<addition> ::= <term> "+" <expression>
Rather than using recursion here, we can instead use repetition. Rather than
recursing into expression
, addition
can be written as a repeated chain of +
term
expressions.
<addition> ::= <term> ("+" <term>)+
Or, in Ruby:
def addition
seq(ref(:term), rep(seq(ref(:_), str("+"), ref(:_), ref(:term)), 1))
end
Let’s convert this into a general rule for binary operations, and also
add a block to extract the arguments. What’s different here is how we combine
those arguments: we’re going to use Ruby’s inject
method, which combines
elements from left to right:
[1, 2, 3, 4].inject { |a, b| ["+", a, b] }
# => ["+", ["+", ["+", 1, 2], 3], 4]
Here’s our modified rule for binary operations. I’ve removed the old right
parameter since it’s not longer used, we just repeat the left
parameter over
and over.
class Associativity < Precedence
def binary(arg, op, *)
seq(ref(arg), rep(seq(ref(:_), op, ref(:_), ref(arg)), 1)) { |node|
args = [node[1]] + node[2].drop(1).map { |n| [n[2][1], n[4]] }
args.inject { |a, (op, b)| [op, a, b] }
}
end
end
Notice the pattern here: in the grammar, we replace a right-recursive structure with a flat list, and then in our action code we convert this flat list into a left-recursive tree. We could leave the grammar unchanged and do this conversion after the whole parser had run, but we’d need to repeat some of the work of the parser in recognising chained operators in the right-recursive structure.
This is enough to change the associativity of our existing operators, and precedence is preserved:
Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]
Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]
Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]
Now we can add subtraction to our grammar. Subtraction has the same precedence
as addition, so let’s try adding it to the expression
group:
class Associativity
def expression
alt(ref(:addition), ref(:subtraction), ref(:term))
end
def subtraction
binary(:term, str("-"))
end
end
This works for some expressions: we can have chained subtraction, and addition and subtraction each compose with multiplication:
Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]
Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]
Associativity.new.parse("6 - 7 - 8")
# => ["-", ["-", 6, 7], 8]
Associativity.new.parse("6 - 7 * 8")
# => ["-", 6, ["*", 7, 8]]
However, mixing addition and subtraction does not work:
Associativity.new.parse("0 - 1 + 2")
# => nil
This is because subtraction and addition are only able to have an term
– a
multiplication, number or paren expression – for each of their operands, and so
they can’t be chained with themselves. We can easily fix this by reverting our
change to the expression
rule, and replacing the single +
operator in
addition
with a choice between +
and -
.
class Associativity
def expression
alt(ref(:addition), ref(:term))
end
def addition
binary(:term, alt(str("+"), str("-")))
end
end
Now, the parser works for all our operators with the correct precedence and associativity.
Associativity.new.parse("0 - 1 + 2")
# => ["+", ["-", 0, 1], 2]
Associativity.new.parse("9 + 5 - 3")
# => ["-", ["+", 9, 5], 3]
Associativity.new.parse("9 + 5 * 3")
# => ["+", 9, ["*", 5, 3]]
The above techniques will help you deal with a lot of parsing problems you’re likely to run into in practice. Let’s recap:
- Represent precedence as sets of equal-precedence terms with a fall-through
- Put the greediest term first when multiple terms have a common first item
- Make each term only operate on the fall-through for the group it’s in
- You can recurse to operate on the group you’re in, but not on the left
- For left-associative terms, use repetition and roll the operators into one rule
- Use helper functions to generate common parser combinations
And with that, you should be able to go out and start implementing languages and
even inventing your own combinators to make things easier to express. In the
next and final article in series I’ll touch on performance concerns, and some
minor changes we can make to ParserCombinators
to make our parsers much more
efficient.