Precedence and associativity in recursive descent

In the previous article we saw the basic elements of parser combinators, a simple modular technique for parsing context-free languages. In this article we’ll extend the grammar we developed to include other operators, and in so doing we’ll discover how to deal with operator precedence and associativity. But first, we should remind ourselves what these terms mean.

Precedence refers to the order in which operations should be performed when a program does not give an explicit order. In some languages, all arithmetic expressions must be explicitly parenthesised to indicate their order, for example in Lisp dialects we’d typically write (+ (* 2 3) 4) or (* 2 (+ 3 4)). However, in many languages, operators can be chained without parentheses, as in 2 * 3 + 4. A parser must decide whether this string represents the operation (2 * 3) + 4 or 2 * (3 + 4). By convention, we usually give multiplication higher precedence than addition, meaning the string corresponds to the interpretation (2 * 3) + 4, so that the multiplication is performed before the addition. In terms of the parser output, it means we would prefer the parse tree to be ["+", ["*", 2, 3], 4] rather than ["*", 2, ["+", 3, 4]].

Associativity determines how operators are grouped together when many instances of the same operator are used in a row. An operator ~ is left-associative if a ~ b ~ c should be parsed as (a ~ b) ~ c, and it is right-associative if the result should be a ~ (b ~ c). For arithmetic addition, this doesn’t matter: (a + b) + c and a + (b + c) give the same result. However for subtraction the difference is important: (a - b) - c does not in general equal a - (b - c). By convention, arithmetic subtraction is usually assumed to be left-associative, that is we choose the (a - b) - c interpretation.

Parser combinators belong to a family of parsing techniques called recursive descent, meaning the grammar is represented by having a function for each rule, and the parse progresses via these functions calling each other. This mechanism imposes a constraint on how we express the structure of languages. Say we wanted to add multiplication to our language. One way to express a grammar for the resulting language might look like this: an expression is an addition, a multiplication or a number, an addition is two expressions separated by +, a multiplication is the same but with * in place of +, and a number is defined as before.

<expression>     ::= <addition> | <multiplication> | <number>
<addition>       ::= <expression> "+" <expression>
<multiplication> ::= <expression> "*" <expression>
<number>         ::= "0" | [1-9] [0-9]*

This grammar is ambiguous. That means that given an expression like 2 * 3 + 4, both (2 * 3) + 4 and 2 * (3 + 4) are valid parses, that is they both conform to the grammar. For example, take 2 * (3 + 4): 2 is a number, which is an expression, ditto for 3 and 4. 3 + 4 is therefore two expressions separated by +, which makes it an addition, and therefore also an expression. An so, 2 * (3 + 4) is two expressions separated by *: a multiplication. You can follow a similar argument for the other parse.

Being ambiguous is not a problem per se. Some parsing techniques can accept ambiguous grammars, they just require extra hints about operator precedence. This is a nice way to write grammars since the grammar just describes the logical structure of the language, and hints about precedence can be done as a separate concern. However, recursive descent is not one of these techniques, and precedence must be expressed within the grammar rules themselves.

In recursive descent, rather than saying everything operates on an expression, we need to be more specific about what types of things can appear on the left and right of each operator. To refresh our memories, here’s the Addition grammar we developed in the last article:

class Addition
  include ParserCombinators

  def root
    expression
  end

  def expression
    alt(ref(:addition), ref(:number))
  end

  def addition
    seq(ref(:number), ref(:_), str("+"), ref(:_), ref(:expression)) { |node|
      ["+", node[1], node[5]]
    }
  end

  def number
    alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
      if node.first == :str
        0
      else
        digit_chrs = [node[1]] + node[2].drop(1)
        digit_chrs.map(&:last).join("").to_i
      end
    }
  end

  def _
    rep(str(" "), 0)
  end
end

Since this is Ruby, we can extend this language by making a subclass and adding or overriding methods. To begin with, I’m going to add a helper method for building rules for binary expressions, just because they tend to get repetitive. This is one of the great benefits of this approach: the ability to directly refactor the grammar using normal programming techniques.

class Precedence < Addition
  def binary(left, op, right)
    seq(ref(left), ref(:_), op, ref(:_), ref(right)) { |node|
      [node[3][1], node[1], node[5]]
    }
  end

  def addition
    binary(:number, str("+"), :expression)
  end
end

Now we can begin changing the grammar to accommodate multiplication. We need to redefine what expression is, and the way we do this is to split the grammar into several layers, one for each level of precedence. We being with the lowest-precedence operator, addition. We define expression as either an addition, or whatever addition operates on, which I’ll call a term.

class Precedence
  def expression
    alt(ref(:addition), ref(:term))
  end
end

Then, we define addition as an term combined with anything in addition’s precedence level, which is to say an expression. But what should term be? Think about what addition can operate on. Given the string 2 * 3 + 4, we want to end up deciding the precedence is (2 * 3) + 4, so addition can operate on multiplications and numbers – that’s what an term is.

class Precedence
  def addition
    binary(:term, str("+"), :expression)
  end

  def term
    alt(ref(:multiplication), ref(:number))
  end
end

term actually forms the next precedence group under expression. All that’s left is to define multiplication as a number combined with anything in multiplication’s precedence group.

class Precedence
  def multiplication
    binary(:number, str("*"), :term)
  end
end

This parser will now process expressions with multiplication and addition with the correct precedence. In all these examples, the * operator is the innermost and + is the outermost:

Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]

Precedence.new.parse("2 + 3 * 4")
# => ["+", 2, ["*", 3, 4]]

Precedence.new.parse("2 * 3 + 4 * 5 + 6")
# => ["+", ["*", 2, 3], ["+", ["*", 4, 5], 6]]

Sometimes, we’ll want to override the default precedence by explicitly parenthesising the expression ourselves, as in 2 * (3 + 4). Now we see it’s not sufficient for multiplication to operate merely on numbers – it must be able to operate on parenthesised expressions too. Therefore, we need to replace uses of number in our rules with a new category called factor, the set of expressions multiplication can apply to. And, we need a new rule for parsing an expression in parens and extracting the contents.

class Precedence
  def term
    alt(ref(:multiplication), ref(:factor))
  end

  def multiplication
    binary(:factor, str("*"), :term)
  end

  def factor
    alt(ref(:number), ref(:paren_expression))
  end

  def paren_expression
    seq(str("("), ref(:_), ref(:expression), ref(:_), str(")")) { |node|
      node[3]
    }
  end
end

Let’s see this parser in action:

Precedence.new.parse("2 * 3 + 4")
# => ["+", ["*", 2, 3], 4]

Precedence.new.parse("2 * (3 + 4)")
# => ["*", 2, ["+", 3, 4]]

This new factor forms a new precedence level in the grammar. I often organise my parsers so that the rules for precedence levels are together, and the rules for what each term actually is follow after.

class Precedence
  def expression
    alt(ref(:addition), ref(:term))
  end

  def term
    alt(ref(:multiplication), ref(:factor))
  end

  def factor
    alt(ref(:number), ref(:paren_expression))
  end

  def addition # ...

  def multiplication # ...

  def number # ...

  def paren_expression # ...
end

When organised this way, we see a pattern: each precedence group consists of the terms in that level, plus a fall-through to the next level. For example, the term group consists of multiplication plus the fall-through factor. factor is the highest-precedence group, consisting of number and paren_expression and no fall-through. Terms in a group typically operate on the fall-through element of the group to avoid consuming too much of the input.

Because this parsing technique is greedy, the order of terms in an alt can be important here. For example, if term was:

    alt(ref(:factor), ref(:multiplication))

then parsing 2 + 3 * 4 would stop after the 3: the term rule will try to parse using factor, which tries number, which matches the 3, and so factor and term in turn both return without trying their other alternatives. So, it’s important to list alternatives so that when two of them could begin with the same kind of character, the longer one is tried first.

That’s precedence taken care of, now let’s tackle associativity. You’ll notice that our current parser produces right-associative parses:

Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]

That’s fine for addition, but not for subtraction: we’d really prefer 3 - 4 - 5 to parse as (3 - 4) - 5, or ["-", ["-", 3, 4], 5], a left-associative parse. The current parsing outcome is a direct result of how addition is defined:

  def addition
    binary(:term, str("+"), :expression)
  end

This puts term, which can be a multiplication, number or paren expression, on the left, and expression – the part that can be another addition – on the right. We can’t make both sides expression as that would be ambiguous, and we can’t make both term as that would prevent chained addition entirely. But can we just swap their positions?

  def addition
    binary(:expression, str("+"), :term)
  end

Unfortunately, that also doesn’t work. Remember how expression is defined:

  def expression
    alt(ref(:addition), ref(:term))
  end

The first thing expression does is try to match addition, and the first thing the swapped addition does is try to match expression, so we get an infinite recursion leading to a stack overflow. This situation is called left recursion, and recursive descent and most other top-down parsing techniques can’t handle it without special modifications.

However, a grammar that contains left-recursion can be rewritten to remove it, by restructuring the rules. In our case, the relevant grammar rules in their current right-recursive form are:

<expression> ::= <addition> | <term>
<addition>   ::= <term> "+" <expression>

Rather than using recursion here, we can instead use repetition. Rather than recursing into expression, addition can be written as a repeated chain of + term expressions.

<addition> ::= <term> ("+" <term>)+

Or, in Ruby:

  def addition
    seq(ref(:term), rep(seq(ref(:_), str("+"), ref(:_), ref(:term)), 1))
  end

Let’s convert this into a general rule for binary operations, and also add a block to extract the arguments. What’s different here is how we combine those arguments: we’re going to use Ruby’s inject method, which combines elements from left to right:

[1, 2, 3, 4].inject { |a, b| ["+", a, b] }
# => ["+", ["+", ["+", 1, 2], 3], 4]

Here’s our modified rule for binary operations. I’ve removed the old right parameter since it’s not longer used, we just repeat the left parameter over and over.

class Associativity < Precedence
  def binary(arg, op, *)
    seq(ref(arg), rep(seq(ref(:_), op, ref(:_), ref(arg)), 1)) { |node|
      args = [node[1]] + node[2].drop(1).map { |n| [n[2][1], n[4]] }
      args.inject { |a, (op, b)| [op, a, b] }
    }
  end
end

Notice the pattern here: in the grammar, we replace a right-recursive structure with a flat list, and then in our action code we convert this flat list into a left-recursive tree. We could leave the grammar unchanged and do this conversion after the whole parser had run, but we’d need to repeat some of the work of the parser in recognising chained operators in the right-recursive structure.

This is enough to change the associativity of our existing operators, and precedence is preserved:

Precedence.new.parse("3 + 4 + 5")
# => ["+", 3, ["+", 4, 5]]

Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]

Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]

Now we can add subtraction to our grammar. Subtraction has the same precedence as addition, so let’s try adding it to the expression group:

class Associativity
  def expression
    alt(ref(:addition), ref(:subtraction), ref(:term))
  end

  def subtraction
    binary(:term, str("-"))
  end
end

This works for some expressions: we can have chained subtraction, and addition and subtraction each compose with multiplication:

Associativity.new.parse("3 + 4 + 5")
# => ["+", ["+", 3, 4], 5]

Associativity.new.parse("3 + 4 * 5")
# => ["+", 3, ["*", 4, 5]]

Associativity.new.parse("6 - 7 - 8")
# => ["-", ["-", 6, 7], 8]

Associativity.new.parse("6 - 7 * 8")
# => ["-", 6, ["*", 7, 8]]

However, mixing addition and subtraction does not work:

Associativity.new.parse("0 - 1 + 2")
# => nil

This is because subtraction and addition are only able to have an term – a multiplication, number or paren expression – for each of their operands, and so they can’t be chained with themselves. We can easily fix this by reverting our change to the expression rule, and replacing the single + operator in addition with a choice between + and -.

class Associativity
  def expression
    alt(ref(:addition), ref(:term))
  end

  def addition
    binary(:term, alt(str("+"), str("-")))
  end
end

Now, the parser works for all our operators with the correct precedence and associativity.

Associativity.new.parse("0 - 1 + 2")
# => ["+", ["-", 0, 1], 2]

Associativity.new.parse("9 + 5 - 3")
# => ["-", ["+", 9, 5], 3]

Associativity.new.parse("9 + 5 * 3")
# => ["+", 9, ["*", 5, 3]]

The above techniques will help you deal with a lot of parsing problems you’re likely to run into in practice. Let’s recap:

  • Represent precedence as sets of equal-precedence terms with a fall-through
  • Put the greediest term first when multiple terms have a common first item
  • Make each term only operate on the fall-through for the group it’s in
  • You can recurse to operate on the group you’re in, but not on the left
  • For left-associative terms, use repetition and roll the operators into one rule
  • Use helper functions to generate common parser combinations

And with that, you should be able to go out and start implementing languages and even inventing your own combinators to make things easier to express. In the next and final article in series I’ll touch on performance concerns, and some minor changes we can make to ParserCombinators to make our parsers much more efficient.