In the previous article we saw the basic elements of parser combinators, a simple modular technique for parsing context-free languages. In this article we’ll extend the grammar we developed to include other operators, and in so doing we’ll discover how to deal with operator precedence and associativity. But first, we should remind ourselves what these terms mean.

Precedence refers to the order in which operations should be performed when
a program does not give an explicit order. In some languages, all arithmetic
expressions must be explicitly parenthesised to indicate their order, for
example in Lisp dialects we’d typically write `(+ (* 2 3) 4)`

or ```
(* 2 (+ 3
4))
```

. However, in many languages, operators can be chained without parentheses,
as in `2 * 3 + 4`

. A parser must decide whether this string represents the
operation `(2 * 3) + 4`

or `2 * (3 + 4)`

. By convention, we usually give
multiplication *higher precedence* than addition, meaning the string corresponds
to the interpretation `(2 * 3) + 4`

, so that the multiplication is performed
before the addition. In terms of the parser output, it means we would prefer the
parse tree to be `["+", ["*", 2, 3], 4]`

rather than `["*", 2, ["+", 3, 4]]`

.

Associativity determines how operators are grouped together when many
instances of the same operator are used in a row. An operator `~`

is
*left-associative* if `a ~ b ~ c`

should be parsed as `(a ~ b) ~ c`

, and it is
*right-associative* if the result should be `a ~ (b ~ c)`

. For arithmetic
addition, this doesn’t matter: `(a + b) + c`

and `a + (b + c)`

give the same
result. However for subtraction the difference is important: `(a - b) - c`

does
not in general equal `a - (b - c)`

. By convention, arithmetic subtraction is
usually assumed to be left-associative, that is we choose the `(a - b) - c`

interpretation.

Parser combinators belong to a family of parsing techniques called recursive
descent, meaning the grammar is represented by having a function for each
rule, and the parse progresses via these functions calling each other. This
mechanism imposes a constraint on how we express the structure of languages. Say
we wanted to add multiplication to our language. One way to express a grammar
for the resulting language might look like this: an expression is an addition, a
multiplication or a number, an addition is two expressions separated by `+`

, a
multiplication is the same but with `*`

in place of `+`

, and a number is defined
as before.

```
<expression> ::= <addition> | <multiplication> | <number>
<addition> ::= <expression> "+" <expression>
<multiplication> ::= <expression> "*" <expression>
<number> ::= "0" | [1-9] [0-9]*
```

This grammar is *ambiguous*. That means that given an expression like ```
2 * 3 +
4
```

, both `(2 * 3) + 4`

and `2 * (3 + 4)`

are valid parses, that is they both
conform to the grammar. For example, take `2 * (3 + 4)`

: `2`

is a number, which
is an expression, ditto for `3`

and `4`

. `3 + 4`

is therefore two expressions
separated by `+`

, which makes it an addition, and therefore also an expression.
An so, `2 * (3 + 4)`

is two expressions separated by `*`

: a multiplication. You
can follow a similar argument for the other parse.

Being *ambiguous* is not a problem per se. Some parsing techniques can accept
ambiguous grammars, they just require extra hints about operator precedence.
This is a nice way to write grammars since the grammar just describes the
logical structure of the language, and hints about precedence can be done as a
separate concern. However, recursive descent is not one of these techniques, and
precedence must be expressed within the grammar rules themselves.

In recursive descent, rather than saying everything operates on an `expression`

,
we need to be more specific about what types of things can appear on the left
and right of each operator. To refresh our memories, here’s the `Addition`

grammar we developed in the last article:

```
class Addition
include ParserCombinators
def root
expression
end
def expression
alt(ref(:addition), ref(:number))
end
def addition
seq(ref(:number), ref(:_), str("+"), ref(:_), ref(:expression)) { |node|
["+", node[1], node[5]]
}
end
def number
alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
if node.first == :str
0
else
digit_chrs = [node[1]] + node[2].drop(1)
digit_chrs.map(&:last).join("").to_i
end
}
end
def _
rep(str(" "), 0)
end
end
```

Since this is Ruby, we can extend this language by making a subclass and adding or overriding methods. To begin with, I’m going to add a helper method for building rules for binary expressions, just because they tend to get repetitive. This is one of the great benefits of this approach: the ability to directly refactor the grammar using normal programming techniques.

```
class Precedence < Addition
def binary(left, op, right)
seq(ref(left), ref(:_), op, ref(:_), ref(right)) { |node|
[node[3][1], node[1], node[5]]
}
end
def addition
binary(:number, str("+"), :expression)
end
end
```

Now we can begin changing the grammar to accommodate multiplication. We need to
redefine what `expression`

is, and the way we do this is to split the grammar
into several layers, one for each level of precedence. We being with the
lowest-precedence operator, addition. We define `expression`

as either an
addition, or whatever addition operates on, which I’ll call a `term`

.

```
class Precedence
def expression
alt(ref(:addition), ref(:term))
end
end
```

Then, we define `addition`

as an `term`

combined with anything in addition’s
precedence level, which is to say an `expression`

. But what should `term`

be?
Think about what addition can operate on. Given the string `2 * 3 + 4`

, we want
to end up deciding the precedence is `(2 * 3) + 4`

, so addition can operate on
multiplications and numbers – that’s what an `term`

is.

```
class Precedence
def addition
binary(:term, str("+"), :expression)
end
def term
alt(ref(:multiplication), ref(:number))
end
end
```

`term`

actually forms the next precedence group under `expression`

. All that’s
left is to define `multiplication`

as a `number`

combined with anything in
multiplication’s precedence group.

```
class Precedence
def multiplication
binary(:number, str("*"), :term)
end
end
```

This parser will now process expressions with multiplication and addition with
the correct precedence. In all these examples, the `*`

operator is the innermost
and `+`

is the outermost:

```
Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]
Precedence.new.parse("2 + 3 * 4")
# => ["+", 2, ["*", 3, 4]]
Precedence.new.parse("2 * 3 + 4 * 5 + 6")
# => ["+", ["*", 2, 3], ["+", ["*", 4, 5], 6]]
```

Sometimes, we’ll want to override the default precedence by explicitly
parenthesising the expression ourselves, as in `2 * (3 + 4)`

. Now we see it’s
not sufficient for multiplication to operate merely on numbers – it must be
able to operate on parenthesised expressions too. Therefore, we need to replace
uses of `number`

in our rules with a new category called `factor`

, the set of
expressions multiplication can apply to. And, we need a new rule for parsing an
expression in parens and extracting the contents.

```
class Precedence
def term
alt(ref(:multiplication), ref(:factor))
end
def multiplication
binary(:factor, str("*"), :term)
end
def factor
alt(ref(:number), ref(:paren_expression))
end
def paren_expression
seq(str("("), ref(:_), ref(:expression), ref(:_), str(")")) { |node|
node[3]
}
end
end
```

Let’s see this parser in action:

```
Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]
Precedence.new.parse("2 * (3 + 4)")
# => ["*", 2, ["+", 3, 4]]
```

This new `factor`

forms a new precedence level in the grammar. I often organise
my parsers so that the rules for precedence levels are together, and the rules
for what each term actually is follow after.

```
class Precedence
def expression
alt(ref(:addition), ref(:term))
end
def term
alt(ref(:multiplication), ref(:factor))
end
def factor
alt(ref(:number), ref(:paren_expression))
end
def addition # ...
def multiplication # ...
def number # ...
def paren_expression # ...
end
```

When organised this way, we see a pattern: each precedence group consists of the
terms in that level, plus a fall-through to the next level. For example, the
`term`

group consists of `multiplication`

plus the fall-through `factor`

.
`factor`

is the highest-precedence group, consisting of `number`

and
`paren_expression`

and no fall-through. Terms in a group typically operate on
the fall-through element of the group to avoid consuming too much of the input.

Because this parsing technique is greedy, the order of terms in an `alt`

can be
important here. For example, if `term`

was:

```
alt(ref(:factor), ref(:multiplication))
```

then parsing `2 + 3 * 4`

would stop after the `3`

: the `term`

rule will try to
parse using `factor`

, which tries `number`

, which matches the `3`

, and so
`factor`

and `term`

in turn both return without trying their other alternatives.
So, it’s important to list alternatives so that when two of them could begin
with the same kind of character, the longer one is tried first.

That’s precedence taken care of, now let’s tackle associativity. You’ll notice that our current parser produces right-associative parses:

```
Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]
```

That’s fine for addition, but not for subtraction: we’d really prefer ```
3 - 4 -
5
```

to parse as `(3 - 4) - 5`

, or `["-", ["-", 3, 4], 5]`

, a left-associative
parse. The current parsing outcome is a direct result of how `addition`

is
defined:

```
def addition
binary(:term, str("+"), :expression)
end
```

This puts `term`

, which can be a multiplication, number or paren expression, on
the left, and `expression`

– the part that can be another addition – on the
right. We can’t make both sides `expression`

as that would be ambiguous, and we
can’t make both `term`

as that would prevent chained addition entirely. But can
we just swap their positions?

```
def addition
binary(:expression, str("+"), :term)
end
```

Unfortunately, that also doesn’t work. Remember how `expression`

is defined:

```
def expression
alt(ref(:addition), ref(:term))
end
```

The first thing `expression`

does is try to match `addition`

, and the first
thing the swapped `addition`

does is try to match `expression`

, so we get an
infinite recursion leading to a stack overflow. This situation is called left
recursion, and recursive descent and most other top-down parsing
techniques can’t handle it without special modifications.

However, a grammar that contains left-recursion can be rewritten to remove it, by restructuring the rules. In our case, the relevant grammar rules in their current right-recursive form are:

```
<expression> ::= <addition> | <term>
<addition> ::= <term> "+" <expression>
```

Rather than using recursion here, we can instead use repetition. Rather than
recursing into `expression`

, `addition`

can be written as a repeated chain of ```
+
term
```

expressions.

```
<addition> ::= <term> ("+" <term>)+
```

Or, in Ruby:

```
def addition
seq(ref(:term), rep(seq(ref(:_), str("+"), ref(:_), ref(:term)), 1))
end
```

Let’s convert this into a general rule for binary operations, and also
add a block to extract the arguments. What’s different here is how we combine
those arguments: we’re going to use Ruby’s `inject`

method, which combines
elements from left to right:

```
[1, 2, 3, 4].inject { |a, b| ["+", a, b] }
# => ["+", ["+", ["+", 1, 2], 3], 4]
```

Here’s our modified rule for binary operations. I’ve removed the old `right`

parameter since it’s not longer used, we just repeat the `left`

parameter over
and over.

```
class Associativity < Precedence
def binary(arg, op, *)
seq(ref(arg), rep(seq(ref(:_), op, ref(:_), ref(arg)), 1)) { |node|
args = [node[1]] + node[2].drop(1).map { |n| [n[2][1], n[4]] }
args.inject { |a, (op, b)| [op, a, b] }
}
end
end
```

Notice the pattern here: in the grammar, we replace a right-recursive structure with a flat list, and then in our action code we convert this flat list into a left-recursive tree. We could leave the grammar unchanged and do this conversion after the whole parser had run, but we’d need to repeat some of the work of the parser in recognising chained operators in the right-recursive structure.

This is enough to change the associativity of our existing operators, and precedence is preserved:

```
Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]
Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]
Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]
```

Now we can add subtraction to our grammar. Subtraction has the same precedence
as addition, so let’s try adding it to the `expression`

group:

```
class Associativity
def expression
alt(ref(:addition), ref(:subtraction), ref(:term))
end
def subtraction
binary(:term, str("-"))
end
end
```

This works for some expressions: we can have chained subtraction, and addition and subtraction each compose with multiplication:

```
Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]
Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]
Associativity.new.parse("6 - 7 - 8")
# => ["-", ["-", 6, 7], 8]
Associativity.new.parse("6 - 7 * 8")
# => ["-", 6, ["*", 7, 8]]
```

However, mixing addition and subtraction does not work:

```
Associativity.new.parse("0 - 1 + 2")
# => nil
```

This is because subtraction and addition are only able to have an `term`

– a
multiplication, number or paren expression – for each of their operands, and so
they can’t be chained with themselves. We can easily fix this by reverting our
change to the `expression`

rule, and replacing the single `+`

operator in
`addition`

with a choice between `+`

and `-`

.

```
class Associativity
def expression
alt(ref(:addition), ref(:term))
end
def addition
binary(:term, alt(str("+"), str("-")))
end
end
```

Now, the parser works for all our operators with the correct precedence and associativity.

```
Associativity.new.parse("0 - 1 + 2")
# => ["+", ["-", 0, 1], 2]
Associativity.new.parse("9 + 5 - 3")
# => ["-", ["+", 9, 5], 3]
Associativity.new.parse("9 + 5 * 3")
# => ["+", 9, ["*", 5, 3]]
```

The above techniques will help you deal with a lot of parsing problems you’re likely to run into in practice. Let’s recap:

- Represent precedence as sets of equal-precedence terms with a fall-through
- Put the greediest term first when multiple terms have a common first item
- Make each term only operate on the fall-through for the group it’s in
- You can recurse to operate on the group you’re in, but not on the left
- For left-associative terms, use repetition and roll the operators into one rule
- Use helper functions to generate common parser combinations

And with that, you should be able to go out and start implementing languages and
even inventing your own combinators to make things easier to express. In the
next and final article in series I’ll touch on performance concerns, and some
minor changes we can make to `ParserCombinators`

to make our parsers much more
efficient.