Introduction to parser combinators

The following is an article written for a meeting of the London Computation Club, a fortnightly meet-up for people interested in learning about, well, computation. We are currently reading Types and Programming Languages, but interspersing our reading with occasional meetings on other topics.

As we have been working on implementing the type systems in TAPL, we’ve found it convenient to write expressions in the syntax used in the book, rather than writing abstract syntax trees by hand: far easier to type λx:Bool. x true than

Term::Abs.new("x", Type::Bool,
  Term::App.new(Term::Var.new("x"), Term::True))

For many members of the group, this is their first experience with writing parsers, and while parsing is not the subject of TAPL it is nevertheless useful for digesting it. As I have some experience writing languages and maintain a parsing toolkit, I thought it would be fun to do an intro to parsing techniques for the group.

So, what follows is a primer on the simplest parsing technique I know: parser combinators. They’re so simple that we can implement a usable framework with them in about 100 lines of Ruby, requiring far less machinery than many other parsing techniques, and they throw up some interesting problems around precedence and associativity that are worth knowing solutions to.

I’m going to try to keep things as minimal as I can; this will be focussed on practical results rather than theoretical formalism. For example, we will not be tokenising our input strings first. Most production parsers split a source text into tokens, meaningful elements of the language like variables, keywords and operators, and then parse the sequence of tokens into a tree. We won’t be doing that; we’ll be working directly with the input text character by character, which is the approach taken by many parsing expression grammar (PEG) tools. The semantics implemented here will also be close to those of PEGs.

The idea behind parser combinators is that each common operation in a parser can be implemented by a function, and then those functions can be combined into more elaborate operations. In general, a combinator is a function that takes an input state, typically the text to be parsed and an offset representing how far into the string you’ve already scanned. If the combinator matches the input at the current offset, it returns a parse tree and a new state: the text with the offset moved to the end of the text matched by the combinator. If it fails, it returns nothing, or the state it was given, depending on what’s most convenient for your implementation.

Parser combinators work particularly well in languages with first-class functions or lambdas, since it means you can generate combinators that do particular things. Let’s look at our first combinator then: one that matches a literal string. To begin with, we’ll need a way to represent the state. I’ll use a struct with three methods: one for looking at the text from the current offset, one for moving the offset forward, and one for checking if we’ve reached the end of the input.

State = Struct.new(:string, :offset) do
  def peek(n)
    string[offset ... offset + n]
  end

  def read(n)
    State.new(string, offset + n)
  end

  def complete?
    offset == string.size
  end
end

Notice read returns a new state rather than mutating the old one. This makes some of the later combinators much easier to implement, since if they fail partway through, they don’t need to rewind or undo any state changes they’ve made during their execution.
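You can check this immutability directly (the State definition is repeated here so the snippet stands on its own): calling read yields an advanced copy, and the original state’s offset is untouched.

```ruby
State = Struct.new(:string, :offset) do
  def peek(n); string[offset ... offset + n]; end
  def read(n); State.new(string, offset + n); end
  def complete?; offset == string.size; end
end

state    = State.new("hello world", 0)
advanced = state.read(5)

state.offset     # => 0, the original state is untouched
advanced.offset  # => 5
```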

Now, here’s our first combinator: one that matches the string hello:

hello = -> state {
  if state.peek(5) == "hello"
    [[:str, "hello"], state.read(5)]
  end
}

We can call this combinator with some input states and see what it does. If we try to match hello world at offset 0, we get a parse tree (just the array containing the matched text, [:str, "hello"]) and a new state. If we match at any other position we get nil:

input = State.new("hello world", 0)

hello.call(input)
# => [[:str, "hello"], #<struct State string="hello world", offset=5>]

hello.call(input.read(1))
# => nil

Rather than having to write this combinator out for every string we ever want to match, it would be nice to generate them. Here’s a function for doing that:

def str(string)
  -> state {
    chunk = state.peek(string.size)

    if chunk == string
      [[:str, chunk], state.read(string.size)]
    end
  }
end

input = State.new("hello world", 0)
hello = str("hello")
world = str("world")

world.call(input.read(6))
# => [[:str, "world"], #<struct State string="hello world", offset=11>]

world.call(input)
# => nil

hello.call(input)
# => [[:str, "hello"], #<struct State string="hello world", offset=5>]

So str is, strictly speaking, a combinator generator: it lets us generate a matcher for any string we like. We’re going to need one more combinator that consumes input directly: one for matching a single character based on a pattern. Whereas the str combinator is useful for recognising fixed things such as operators and keywords, the chr combinator is useful for recognising open-ended things with many possible values, like numbers or variable names. It just reads a single character from the input and matches it against a character-class regular expression:

def chr(pattern)
  -> state {
    chunk = state.peek(1)

    if chunk =~ %r{[#{pattern}]}
      [[:chr, chunk], state.read(1)]
    end
  }
end

input = State.new("12 + 34", 0)
digit = chr("0-9")

digit.call(input.read(1))
# => [[:chr, "2"], #<struct State string="12 + 34", offset=2>]

digit.call(input.read(5))
# => [[:chr, "3"], #<struct State string="12 + 34", offset=6>]

digit.call(input.read(2))
# => nil

str and chr are the only combinators we’ll use to directly consume input. You can invent more, but these are enough to do useful work. The other combinators we’ll see are higher order combinators: they take other combinators and glue them together in useful ways. Rather than consuming input directly themselves, they consume it implicitly via the combinators they invoke.

The first one we need is called seq, the sequencing combinator. It lets us say a string must be composed of matches for various different rules, in order. Here’s an example, where we say a string must match a sequence containing a digit, then the string +, then another digit:

input = State.new("7+8", 0)
addition = seq(chr("0-9"), str("+"), chr("0-9"))

addition.call(input)

# => [ [:seq, [:chr, "7"], [:str, "+"], [:chr, "8"]],
#      #<struct State string="7+8", offset=3> ]

How would such a combinator work? Let’s look at the things it combines. We know that chr("0-9") applied at offset 0 will yield the node [:chr, "7"] and a new state with offset 1.

chr("0-9").call(input.read(0))
# => [[:chr, "7"], #<struct State string="7+8", offset=1>]

Similarly we know that str("+") applied at offset 1 (the offset returned by the previous match) will return:

str("+").call(input.read(1))
# => [[:str, "+"], #<struct State string="7+8", offset=2>]

And finally we apply another chr("0-9") at offset 2 to get another digit:

chr("0-9").call(input.read(2))
# => [[:chr, "8"], #<struct State string="7+8", offset=3>]

So the job of the seq combinator is to feed its input state through a chain of other combinators, recording the nodes produced as it does so. The output state of one combinator becomes the input for the next. If all the combinators are found to match, then we return a sequence node, otherwise the parse fails. Here is a definition of seq; note how if at any point state becomes nil the loop effectively stops doing any work:

def seq(*parsers)
  -> state {
    matches = []

    parsers.each do |parser|
      node, state = state && parser.call(state)
      matches << node if state
    end

    if state
      [[:seq, *matches], state]
    end
  }
end
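Here’s the failure path in action: a self-contained check (with the definitions from above repeated) in which the second step of the sequence fails, so the whole parse returns nil.

```ruby
State = Struct.new(:string, :offset) do
  def peek(n); string[offset ... offset + n]; end
  def read(n); State.new(string, offset + n); end
end

def str(string)
  -> state {
    chunk = state.peek(string.size)
    [[:str, chunk], state.read(string.size)] if chunk == string
  }
end

def chr(pattern)
  -> state {
    chunk = state.peek(1)
    [[:chr, chunk], state.read(1)] if chunk =~ %r{[#{pattern}]}
  }
end

def seq(*parsers)
  -> state {
    matches = []

    parsers.each do |parser|
      node, state = state && parser.call(state)
      matches << node if state
    end

    [[:seq, *matches], state] if state
  }
end

addition = seq(chr("0-9"), str("+"), chr("0-9"))

# str("+") fails at offset 1, state becomes nil, and the sequence gives up
addition.call(State.new("7-8", 0))
# => nil
```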

The next basic operation we need is a way to repeat things. We’ve seen how to match a single digit, but what if we want to match a whole number? That’s where rep comes in.

input = State.new("2017", 0)
number = rep(chr("0-9"), 1)

number.call(input)

# => [ [:rep, [:chr, "2"], [:chr, "0"], [:chr, "1"], [:chr, "7"]],
#      #<struct State string="2017", offset=4>]

The number passed to rep after the combinator it wraps is a minimum repetition count; passing 1 is like using the + operator in a regular expression, and 0 is like the * operator.

rep works a little bit like seq in that it threads a state through a series of combinators. However, rather than using a list of different combinators, it just uses the same one over and over until it stops matching. When that happens, it checks it’s seen the minimum required number of matches and returns a node if so. There’s one subtlety here: notice how it stores off last_state before attempting each match. That’s because when the match finally fails, state will become nil, and we need to return the last good state as part of the result so the next combinator in line knows where we finished matching.

def rep(parser, n)
  -> state {
    matches = []
    last_state = nil

    while state
      last_state = state
      node, state = parser.call(state)
      matches << node if state
    end

    if matches.size >= n
      [[:rep, *matches], last_state]
    end
  }
end
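To see the minimum count at work, here’s a self-contained check (definitions repeated from above): with a minimum of 1 the match fails on non-digit input, while a minimum of 0 succeeds without consuming anything.

```ruby
State = Struct.new(:string, :offset) do
  def peek(n); string[offset ... offset + n]; end
  def read(n); State.new(string, offset + n); end
end

def chr(pattern)
  -> state {
    chunk = state.peek(1)
    [[:chr, chunk], state.read(1)] if chunk =~ %r{[#{pattern}]}
  }
end

def rep(parser, n)
  -> state {
    matches = []
    last_state = nil

    while state
      last_state = state
      node, state = parser.call(state)
      matches << node if state
    end

    [[:rep, *matches], last_state] if matches.size >= n
  }
end

# like /[0-9]+/: at least one digit required, so this fails
rep(chr("0-9"), 1).call(State.new("abc", 0))
# => nil

# like /[0-9]*/: zero digits is fine, the match succeeds at offset 0
rep(chr("0-9"), 0).call(State.new("abc", 0))
# => [[:rep], #<struct State string="abc", offset=0>]
```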

The final fundamental operator we need is a way to make choices. When dealing with programming languages with rich expressions, we want to say an expression could be a simple number, or an addition, or a variable, or any combination of these. The alt combinator, like seq, takes a list of parsers as input; these are the alternatives it has to choose from. It works by simply trying each one in turn until one of them works, and returning the result for that. Note how in this combinator, the state is not threaded through the parsers like it is in seq or rep, but each parser works on the original input state. We want each parser to try and match from the same starting point, rather than trying the first, and then trying the second from where the first left off, and so on. If no parser matches, we must explicitly return nil since the return value of parsers.each is parsers.

def alt(*parsers)
  -> state {
    parsers.each do |parser|
      node, new_state = parser.call(state)
      if new_state
        return [node, new_state]
      end
    end

    nil
  }
end

This method of choice always picks the first parser that matches. If the choice it makes causes some later combinator in the chain to fail, we can’t back-track and try something else using this technique. Once we make a choice, we’re stuck with it. This is the semantics found in PEG parsers, but there exist many parsing techniques that do allow back-tracking or what’s called ambiguous choice. This makes some things easier to express but also causes problems around expressing precedence, as in the dangling else problem. What we’ve implemented here is called ordered choice, and it removes ambiguity at the cost of making some things a bit harder to express clearly.
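Ordered choice has a practical consequence worth seeing concretely: if an early alternative matches a prefix of what a later alternative would have matched, the parse can fail even though a different choice would have succeeded. The usual fix in PEG-style grammars is to order longer alternatives first. Here’s a self-contained check (definitions repeated from above):

```ruby
State = Struct.new(:string, :offset) do
  def peek(n); string[offset ... offset + n]; end
  def read(n); State.new(string, offset + n); end
end

def str(string)
  -> state {
    chunk = state.peek(string.size)
    [[:str, chunk], state.read(string.size)] if chunk == string
  }
end

def seq(*parsers)
  -> state {
    matches = []

    parsers.each do |parser|
      node, state = state && parser.call(state)
      matches << node if state
    end

    [[:seq, *matches], state] if state
  }
end

def alt(*parsers)
  -> state {
    parsers.each do |parser|
      node, new_state = parser.call(state)
      return [node, new_state] if new_state
    end
    nil
  }
end

# alt commits to str("a"), so str("c") then sees "b" and fails,
# even though choosing "ab" first would have let the parse succeed
first = seq(alt(str("a"), str("ab")), str("c"))
first.call(State.new("abc", 0))
# => nil

# ordering the longer alternative first fixes it
second = seq(alt(str("ab"), str("a")), str("c"))
second.call(State.new("abc", 0))
# => [[:seq, [:str, "ab"], [:str, "c"]], #<struct State string="abc", offset=3>]
```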

We’re now in a position to do some reasonably interesting things with the tools we’ve amassed. For example, here’s a little language that recognises numbers or additions of two numbers:

# whitespace is zero or more space characters
_ = rep(str(" "), 0)

# numbers are either a 0, or a digit 1-9 followed by zero or
# more digits 0-9
number = alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0)))

# an addition is a number, followed by whitespace, a plus sign,
# more whitespace, and a number
addition = seq(number, _, str("+"), _, number)

# an expression is an addition or a number
expression = alt(addition, number)

Let’s try it out:

expression.call(State.new("12", 0))

# => [ [:seq, [:chr, "1"], [:rep, [:chr, "2"]]],
#      #<struct State string="12", offset=2> ]

expression.call(State.new("34 + 567", 0))

# => [ [:seq, [:seq, [:chr, "3"], [:rep, [:chr, "4"]]],
#             [:rep, [:str, " "]],
#             [:str,"+"],
#             [:rep, [:str, " "]],
#             [:seq, [:chr, "5"], [:rep, [:chr, "6"], [:chr, "7"]]]],
#      #<struct State string="34 + 567", offset=8> ]

These results have the right structure, but they’re not especially useful. We’d really like a data structure telling us clearly that this string represents the addition of 34 and 567, not this maze of characters buried within arrays.

We’ll make an amendment to our existing generator functions that allows them to accept a block, letting us transform the nodes on the fly. For example, let’s augment the number rule with a block for converting its nested nodes into a numeric value. Remember, a number can either be [:str, "0"], or something like [:seq, [:chr, "5"], [:rep, [:chr, "6"], [:chr, "7"]]] as seen above.

number = alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
  if node.first == :str
    0
  else
    digit_chrs = [node[1]] + node[2].drop(1)
    digit_chrs.map(&:last).join("").to_i
  end
}

All that needs to be done to make this work is to add an &action parameter to alt and to make it apply this block to its result:

def alt(*parsers, &action)
  -> state {
    parsers.each do |parser|
      node, new_state = parser.call(state)
      if new_state
        node = action.call(node) if action
        return [node, new_state]
      end
    end

    nil
  }
end

Note that this is why the conditions in these combinators always check for a new state, not a new node. Our own action blocks might decide nil or false is a reasonable node value, and we don’t want the parser to think that indicates failure.
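To make that concrete, here’s one way the same &action change could look for rep (a sketch of my own, not the only way to write it), together with an action that deliberately returns nil: the parse still succeeds because we check the state, not the node.

```ruby
State = Struct.new(:string, :offset) do
  def peek(n); string[offset ... offset + n]; end
  def read(n); State.new(string, offset + n); end
end

def str(string)
  -> state {
    chunk = state.peek(string.size)
    [[:str, chunk], state.read(string.size)] if chunk == string
  }
end

# rep with an optional &action applied to the [:rep, ...] node
def rep(parser, n, &action)
  -> state {
    matches = []
    last_state = nil

    while state
      last_state = state
      node, state = parser.call(state)
      matches << node if state
    end

    if matches.size >= n
      node = [:rep, *matches]
      node = action.call(node) if action
      [node, last_state]
    end
  }
end

# an action that maps whitespace to nil on purpose, to discard it
ws = rep(str(" "), 0) { |node| nil }

node, state = ws.call(State.new("  x", 0))
node          # => nil
state.offset  # => 2, the match still succeeded
```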

If we make similar changes to all the other combinators, we can also modify the addition rule to produce a compact representation that removes extraneous things like whitespace:

addition = seq(number, _, str("+"), _, number) { |node|
  ["+", node[1], node[5]]
}

And now, parsing an addition expression yields something much more usable:

expression.call(State.new("34 + 567", 0))

# => [ ["+", 34, 567],
#      #<struct State string="34 + 567", offset=8> ]

We’ve now realised the important function of parsers: to take a text format and extract information from it in a concise yet rich form, containing data values we can operate on. However, there is one final piece of machinery we need in order to parse most languages you’d actually want to apply this technique to. So far, everything we’ve done can be accomplished with regular expressions, and not even the fancy extended regular expressions we have today: formally speaking, the above code can only recognise regular languages, so, for example, you couldn’t write a parser that matches an expression wrapped in any number of balanced parentheses. To do that, you need to be able to write recursive rules.

As an example, let’s imagine trying to extend our addition grammar to parse an expression like 7 + 8 + 9. This doesn’t parse completely using the above rules:

expression.call(State.new("7 + 8 + 9", 0))
# => [["+", 7, 8], #<struct State string="7 + 8 + 9", offset=5>]

It recognises the first part "7 + 8", but then it stops, leaving " + 9" unparsed. That’s because our addition rule just recognises a number on either side of the +. We resolve this by recognising that 7 + 8 + 9 is addition of a number, 7, and another addition, 8 + 9, i.e. it’s really 7 + (8 + 9). (It could also be (7 + 8) + 9 but we’ll deal with associativity later.) So, we’d like to modify our addition rule so the right-hand side can be either a number or another addition, i.e. it can be an expression. But how can we write that?

addition   = seq(number, _, str("+"), _, expression)
expression = alt(addition, number)

addition requires expression to be defined first, but expression wants addition to be defined first! How can we write mutually recursive rules in this system? The answer is to introduce a special combinator, ref(rule_name), for referring to other rules, one that only looks up the implementation of the rule it names when it is executed. That is, it’s a combinator for deferring execution. To use it, we implement each rule as a method; ref then looks a rule up by name and calls the combinator that method yields. Here’s our addition grammar converted to this system, where every reference to another rule is made using the new ref combinator:

class Addition
  def expression
    alt(ref(:addition), ref(:number))
  end

  def addition
    seq(ref(:number), ref(:_), str("+"), ref(:_), ref(:expression)) { |node|
      # construct an addition node
    }
  end

  def number
    alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
      # convert the node to a number
    }
  end

  def _
    rep(str(" "), 0)
  end

  # the rule the parser will start from
  def root
    expression
  end
end

Since all the combinator generators are now being used inside a class, I’m going to move them into a module called ParserCombinators and include that into the Addition class, rather than leaving them as global functions:

module ParserCombinators
  def str(string, &action) # ...

  def chr(pattern, &action) # ...

  def seq(*parsers, &action) # ...

  def rep(parser, n, &action) # ...

  def alt(*parsers, &action) # ...
end

class Addition
  include ParserCombinators
end

Finally, here’s the implementation of ref: it returns a combinator that calls a method by name to get the combinator it represents, then calls that combinator with whatever state was passed in:

module ParserCombinators
  def ref(name)
    -> state {
      __send__(name).call(state)
    }
  end
end

To wrap everything up, let’s add a convenience method for parsing a complete string. It wraps the string in a new State, calls root to get the root rule of the grammar, uses that rule to parse the string, and only if the string has been fully read after the parse do we actually return a result:

module ParserCombinators
  def parse(string)
    node, state = root.call(State.new(string, 0))

    if state and state.complete?
      node
    end
  end
end

Tying all this together, we have a working parser for addition:

Addition.new.parse("7 + 8 + 9")
# => ["+", 7, ["+", 8, 9]]
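And recursion buys us real expressive power: here’s a grammar for balanced parentheses, something plain regular expressions can’t recognise. The Parens class and its rule names are my own invention, but it uses only the combinators developed above (repeated here so the snippet runs on its own, and without &action parameters since we only care whether the parse succeeds).

```ruby
State = Struct.new(:string, :offset) do
  def peek(n); string[offset ... offset + n]; end
  def read(n); State.new(string, offset + n); end
  def complete?; offset == string.size; end
end

module ParserCombinators
  def str(string)
    -> state {
      chunk = state.peek(string.size)
      [[:str, chunk], state.read(string.size)] if chunk == string
    }
  end

  def seq(*parsers)
    -> state {
      matches = []
      parsers.each do |parser|
        node, state = state && parser.call(state)
        matches << node if state
      end
      [[:seq, *matches], state] if state
    }
  end

  def alt(*parsers)
    -> state {
      parsers.each do |parser|
        node, new_state = parser.call(state)
        return [node, new_state] if new_state
      end
      nil
    }
  end

  def ref(name)
    -> state { __send__(name).call(state) }
  end

  def parse(string)
    node, state = root.call(State.new(string, 0))
    node if state and state.complete?
  end
end

class Parens
  include ParserCombinators

  def root
    ref(:parens)
  end

  # balanced parens are "(", some balanced parens, ")", or the empty string
  def parens
    alt(seq(str("("), ref(:parens), str(")")), str(""))
  end
end

Parens.new.parse("((()))")  # => a nested [:seq, ...] tree, to any depth
Parens.new.parse("(()")     # => nil, unbalanced
```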

So that’s the basics of parser combinators. We’ve seen six simple building blocks that are enough to express most context-free languages: str, chr, seq, rep, alt and ref. We’ve used these to build a parser for a binary operation where we can chain the operation to get a recursive structure. In the next article, we’ll look more at how these combinators are used, including how we handle precedence and associativity.