# Precedence and associativity in recursive descent

In the previous article we saw the basic elements of parser combinators, a simple modular technique for parsing context-free languages. In this article we’ll extend the grammar we developed to include other operators, and in so doing we’ll discover how to deal with operator precedence and associativity. But first, we should remind ourselves what these terms mean.

Precedence refers to the order in which operations should be performed when a program does not give an explicit order. In some languages, all arithmetic expressions must be explicitly parenthesised to indicate their order, for example in Lisp dialects we’d typically write `(+ (* 2 3) 4)` or ```(* 2 (+ 3 4))```. However, in many languages, operators can be chained without parentheses, as in `2 * 3 + 4`. A parser must decide whether this string represents the operation `(2 * 3) + 4` or `2 * (3 + 4)`. By convention, we usually give multiplication higher precedence than addition, meaning the string corresponds to the interpretation `(2 * 3) + 4`, so that the multiplication is performed before the addition. In terms of the parser output, it means we would prefer the parse tree to be `["+", ["*", 2, 3], 4]` rather than `["*", 2, ["+", 3, 4]]`.

Associativity determines how operators are grouped together when many instances of the same operator are used in a row. An operator `~` is left-associative if `a ~ b ~ c` should be parsed as `(a ~ b) ~ c`, and it is right-associative if the result should be `a ~ (b ~ c)`. For arithmetic addition, this doesn’t matter: `(a + b) + c` and `a + (b + c)` give the same result. However for subtraction the difference is important: `(a - b) - c` does not in general equal `a - (b - c)`. By convention, arithmetic subtraction is usually assumed to be left-associative, that is we choose the `(a - b) - c` interpretation.

Parser combinators belong to a family of parsing techniques called recursive descent, meaning the grammar is represented by having a function for each rule, and the parse progresses via these functions calling each other. This mechanism imposes a constraint on how we express the structure of languages. Say we wanted to add multiplication to our language. One way to express a grammar for the resulting language might look like this: an expression is an addition, a multiplication or a number, an addition is two expressions separated by `+`, a multiplication is the same but with `*` in place of `+`, and a number is defined as before.

``````<expression>     ::= <addition> | <multiplication> | <number>
<addition>       ::= <expression> "+" <expression>
<multiplication> ::= <expression> "*" <expression>
<number>         ::= "0" | [1-9] [0-9]*
``````

This grammar is ambiguous. That means that given an expression like ```2 * 3 + 4```, both `(2 * 3) + 4` and `2 * (3 + 4)` are valid parses, that is they both conform to the grammar. For example, take `2 * (3 + 4)`: `2` is a number, which is an expression, ditto for `3` and `4`. `3 + 4` is therefore two expressions separated by `+`, which makes it an addition, and therefore also an expression. An so, `2 * (3 + 4)` is two expressions separated by `*`: a multiplication. You can follow a similar argument for the other parse.

Being ambiguous is not a problem per se. Some parsing techniques can accept ambiguous grammars, they just require extra hints about operator precedence. This is a nice way to write grammars since the grammar just describes the logical structure of the language, and hints about precedence can be done as a separate concern. However, recursive descent is not one of these techniques, and precedence must be expressed within the grammar rules themselves.

In recursive descent, rather than saying everything operates on an `expression`, we need to be more specific about what types of things can appear on the left and right of each operator. To refresh our memories, here’s the `Addition` grammar we developed in the last article:

``````class Addition
include ParserCombinators

def root
expression
end

def expression
end

seq(ref(:number), ref(:_), str("+"), ref(:_), ref(:expression)) { |node|
["+", node, node]
}
end

def number
alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
if node.first == :str
0
else
digit_chrs = [node] + node.drop(1)
digit_chrs.map(&:last).join("").to_i
end
}
end

def _
rep(str(" "), 0)
end
end
``````

Since this is Ruby, we can extend this language by making a subclass and adding or overriding methods. To begin with, I’m going to add a helper method for building rules for binary expressions, just because they tend to get repetitive. This is one of the great benefits of this approach: the ability to directly refactor the grammar using normal programming techniques.

``````class Precedence < Addition
def binary(left, op, right)
seq(ref(left), ref(:_), op, ref(:_), ref(right)) { |node|
[node, node, node]
}
end

binary(:number, str("+"), :expression)
end
end
``````

Now we can begin changing the grammar to accommodate multiplication. We need to redefine what `expression` is, and the way we do this is to split the grammar into several layers, one for each level of precedence. We being with the lowest-precedence operator, addition. We define `expression` as either an addition, or whatever addition operates on, which I’ll call a `term`.

``````class Precedence
def expression
end
end
``````

Then, we define `addition` as an `term` combined with anything in addition’s precedence level, which is to say an `expression`. But what should `term` be? Think about what addition can operate on. Given the string `2 * 3 + 4`, we want to end up deciding the precedence is `(2 * 3) + 4`, so addition can operate on multiplications and numbers – that’s what an `term` is.

``````class Precedence
binary(:term, str("+"), :expression)
end

def term
alt(ref(:multiplication), ref(:number))
end
end
``````

`term` actually forms the next precedence group under `expression`. All that’s left is to define `multiplication` as a `number` combined with anything in multiplication’s precedence group.

``````class Precedence
def multiplication
binary(:number, str("*"), :term)
end
end
``````

This parser will now process expressions with multiplication and addition with the correct precedence. In all these examples, the `*` operator is the innermost and `+` is the outermost:

``````Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]

Precedence.new.parse("2 + 3 * 4")
# => ["+", 2, ["*", 3, 4]]

Precedence.new.parse("2 * 3 + 4 * 5 + 6")
# => ["+", ["*", 2, 3], ["+", ["*", 4, 5], 6]]
``````

Sometimes, we’ll want to override the default precedence by explicitly parenthesising the expression ourselves, as in `2 * (3 + 4)`. Now we see it’s not sufficient for multiplication to operate merely on numbers – it must be able to operate on parenthesised expressions too. Therefore, we need to replace uses of `number` in our rules with a new category called `factor`, the set of expressions multiplication can apply to. And, we need a new rule for parsing an expression in parens and extracting the contents.

``````class Precedence
def term
alt(ref(:multiplication), ref(:factor))
end

def multiplication
binary(:factor, str("*"), :term)
end

def factor
alt(ref(:number), ref(:paren_expression))
end

def paren_expression
seq(str("("), ref(:_), ref(:expression), ref(:_), str(")")) { |node|
node
}
end
end
``````

Let’s see this parser in action:

``````Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]

Precedence.new.parse("2 * (3 + 4)")
# => ["*", 2, ["+", 3, 4]]
``````

This new `factor` forms a new precedence level in the grammar. I often organise my parsers so that the rules for precedence levels are together, and the rules for what each term actually is follow after.

``````class Precedence
def expression
end

def term
alt(ref(:multiplication), ref(:factor))
end

def factor
alt(ref(:number), ref(:paren_expression))
end

def addition # ...

def multiplication # ...

def number # ...

def paren_expression # ...
end
``````

When organised this way, we see a pattern: each precedence group consists of the terms in that level, plus a fall-through to the next level. For example, the `term` group consists of `multiplication` plus the fall-through `factor`. `factor` is the highest-precedence group, consisting of `number` and `paren_expression` and no fall-through. Terms in a group typically operate on the fall-through element of the group to avoid consuming too much of the input.

Because this parsing technique is greedy, the order of terms in an `alt` can be important here. For example, if `term` was:

``````    alt(ref(:factor), ref(:multiplication))
``````

then parsing `2 + 3 * 4` would stop after the `3`: the `term` rule will try to parse using `factor`, which tries `number`, which matches the `3`, and so `factor` and `term` in turn both return without trying their other alternatives. So, it’s important to list alternatives so that when two of them could begin with the same kind of character, the longer one is tried first.

That’s precedence taken care of, now let’s tackle associativity. You’ll notice that our current parser produces right-associative parses:

``````Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]
``````

That’s fine for addition, but not for subtraction: we’d really prefer ```3 - 4 - 5``` to parse as `(3 - 4) - 5`, or `["-", ["-", 3, 4], 5]`, a left-associative parse. The current parsing outcome is a direct result of how `addition` is defined:

``````  def addition
binary(:term, str("+"), :expression)
end
``````

This puts `term`, which can be a multiplication, number or paren expression, on the left, and `expression` – the part that can be another addition – on the right. We can’t make both sides `expression` as that would be ambiguous, and we can’t make both `term` as that would prevent chained addition entirely. But can we just swap their positions?

``````  def addition
binary(:expression, str("+"), :term)
end
``````

Unfortunately, that also doesn’t work. Remember how `expression` is defined:

``````  def expression
end
``````

The first thing `expression` does is try to match `addition`, and the first thing the swapped `addition` does is try to match `expression`, so we get an infinite recursion leading to a stack overflow. This situation is called left recursion, and recursive descent and most other top-down parsing techniques can’t handle it without special modifications.

However, a grammar that contains left-recursion can be rewritten to remove it, by restructuring the rules. In our case, the relevant grammar rules in their current right-recursive form are:

``````<expression> ::= <addition> | <term>
<addition>   ::= <term> "+" <expression>
``````

Rather than using recursion here, we can instead use repetition. Rather than recursing into `expression`, `addition` can be written as a repeated chain of ```+ term``` expressions.

``````<addition> ::= <term> ("+" <term>)+
``````

Or, in Ruby:

``````  def addition
seq(ref(:term), rep(seq(ref(:_), str("+"), ref(:_), ref(:term)), 1))
end
``````

Let’s convert this into a general rule for binary operations, and also add a block to extract the arguments. What’s different here is how we combine those arguments: we’re going to use Ruby’s `inject` method, which combines elements from left to right:

``````[1, 2, 3, 4].inject { |a, b| ["+", a, b] }
# => ["+", ["+", ["+", 1, 2], 3], 4]
``````

Here’s our modified rule for binary operations. I’ve removed the old `right` parameter since it’s not longer used, we just repeat the `left` parameter over and over.

``````class Associativity < Precedence
def binary(arg, op, *)
seq(ref(arg), rep(seq(ref(:_), op, ref(:_), ref(arg)), 1)) { |node|
args = [node] + node.drop(1).map { |n| [n, n] }
args.inject { |a, (op, b)| [op, a, b] }
}
end
end
``````

Notice the pattern here: in the grammar, we replace a right-recursive structure with a flat list, and then in our action code we convert this flat list into a left-recursive tree. We could leave the grammar unchanged and do this conversion after the whole parser had run, but we’d need to repeat some of the work of the parser in recognising chained operators in the right-recursive structure.

This is enough to change the associativity of our existing operators, and precedence is preserved:

``````Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]

Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]

Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]
``````

Now we can add subtraction to our grammar. Subtraction has the same precedence as addition, so let’s try adding it to the `expression` group:

``````class Associativity
def expression
end

def subtraction
binary(:term, str("-"))
end
end
``````

This works for some expressions: we can have chained subtraction, and addition and subtraction each compose with multiplication:

``````Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]

Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]

Associativity.new.parse("6 - 7 - 8")
# => ["-", ["-", 6, 7], 8]

Associativity.new.parse("6 - 7 * 8")
# => ["-", 6, ["*", 7, 8]]
``````

However, mixing addition and subtraction does not work:

``````Associativity.new.parse("0 - 1 + 2")
# => nil
``````

This is because subtraction and addition are only able to have an `term` – a multiplication, number or paren expression – for each of their operands, and so they can’t be chained with themselves. We can easily fix this by reverting our change to the `expression` rule, and replacing the single `+` operator in `addition` with a choice between `+` and `-`.

``````class Associativity
def expression
end

binary(:term, alt(str("+"), str("-")))
end
end
``````

Now, the parser works for all our operators with the correct precedence and associativity.

``````Associativity.new.parse("0 - 1 + 2")
# => ["+", ["-", 0, 1], 2]

Associativity.new.parse("9 + 5 - 3")
# => ["-", ["+", 9, 5], 3]

Associativity.new.parse("9 + 5 * 3")
# => ["+", 9, ["*", 5, 3]]
``````

The above techniques will help you deal with a lot of parsing problems you’re likely to run into in practice. Let’s recap:

• Represent precedence as sets of equal-precedence terms with a fall-through
• Put the greediest term first when multiple terms have a common first item
• Make each term only operate on the fall-through for the group it’s in
• You can recurse to operate on the group you’re in, but not on the left
• For left-associative terms, use repetition and roll the operators into one rule
• Use helper functions to generate common parser combinations

And with that, you should be able to go out and start implementing languages and even inventing your own combinators to make things easier to express. In the next and final article in series I’ll touch on performance concerns, and some minor changes we can make to `ParserCombinators` to make our parsers much more efficient.

If you’ve enjoyed this article, you might enjoy one of my books, such as Building Git or JavaScript Testing Recipes