The following is an article written for a meeting of the London Computation Club, a fortnightly meet-up for people interested in learning about, well, computation. We are currently reading Types and Programming Languages, but interspersing our reading with occasional meetings on other topics.
As we have been working on implementing the type systems in TAPL, we’ve found it convenient to write expressions in the syntax used in the book, rather than writing abstract syntax trees by hand: far easier to type λx:Bool. x true than
Term::Abs.new("x", Type::Bool,
  Term::App.new(Term::Var.new("x"), Term::True))
For many members of the group, this is their first experience with writing parsers, and while parsing is not the subject of TAPL it is nevertheless useful for digesting it. As I have some experience writing languages and maintain a parsing toolkit, I thought it would be fun to do an intro to parsing techniques for the group.
So, what follows is a primer on the simplest parsing technique I know: parser combinators. They’re so simple that we can implement a usable framework with them in about 100 lines of Ruby, requiring far less machinery than many other parsing techniques, and they throw up some interesting problems around precedence and associativity that are worth knowing solutions to.
I’m going to try to keep things as minimal as I can; this will be focussed on practical results rather than theoretical formalism. For example, we won’t be tokenising our input strings first. Most production parsers split a source text into tokens (meaningful elements of the language like variables, keywords, operators and so on) and then parse the sequence of tokens into a tree. We won’t be doing that; we’ll be working directly with the input text character by character, which is the approach taken by many parsing expression grammar (PEG) tools. The semantics implemented here will also be close to that of PEGs.
The idea behind parser combinators is that each common operation in a parser can be implemented by a function, and then those functions can be combined into more elaborate operations. In general, a combinator is a function that takes an input state, typically the text to be parsed and an offset representing how far into the string you’ve already scanned. If the combinator matches the input at the current offset, it returns a parse tree and a new state: the text with the offset moved to the end of the text matched by the combinator. If it fails, it returns nothing, or the state it was given, depending on what’s most convenient for your implementation.
Parser combinators work particularly well in languages with first-class functions or lambdas, since it means you can generate combinators that do particular things. Let’s look at our first combinator then: one that matches a literal string. To begin with, we’ll need a way to represent the state. I’ll use a struct with three methods: one for looking at the text from the current offset, one for moving the offset forward, and one for checking if we’ve reached the end of the input.
State = Struct.new(:string, :offset) do
  def peek(n)
    string[offset ... offset + n]
  end

  def read(n)
    State.new(string, offset + n)
  end

  def complete?
    offset == string.size
  end
end
Notice read returns a new state rather than mutating the old one. This makes some of the later combinators much easier to implement, since if they fail partway through, they don’t need to rewind or undo any state changes they’ve made during their execution.
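A quick demonstration of these three methods, on the same string we’ll be parsing shortly:

input = State.new("hello world", 0)

input.peek(5)             # => "hello"
input.read(6).peek(5)     # => "world"
input.offset              # => 0, read returned a new state and left this one alone
input.read(11).complete?  # => true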
Now, here’s our first combinator: one that matches the string hello:
hello = -> state {
  if state.peek(5) == "hello"
    [[:str, "hello"], state.read(5)]
  end
}
We can call this combinator with some input states and see what it does. If we try to match hello world at offset 0, we get a parse tree (just the array containing the matched text, [:str, "hello"]) and a new state. If we match at any other position we get nil:
input = State.new("hello world", 0)
hello.call(input)
# => [[:str, "hello"], #<struct State string="hello world", offset=5>]
hello.call(input.read(1))
# => nil
Rather than having to write this combinator out for every string we ever want to match, it would be nice to generate them. Here’s a function for doing that:
def str(string)
  -> state {
    chunk = state.peek(string.size)
    if chunk == string
      [[:str, chunk], state.read(string.size)]
    end
  }
end
input = State.new("hello world", 0)
hello = str("hello")
world = str("world")
world.call(input.read(6))
# => [[:str, "world"], #<struct State string="hello world", offset=11>]
world.call(input)
# => nil
hello.call(input)
# => [[:str, "hello"], #<struct State string="hello world", offset=5>]
This gives us our first general-purpose combinator; strictly speaking, str is a combinator generator, since it lets us generate a matcher for any string we like. We’re going to need one more combinator that consumes input directly: one for matching a single character based on a pattern. Whereas the str combinators are useful for recognising fixed things such as operators and keywords, the chr combinator is useful for recognising open-ended things with many values, like numbers or variable names. It just reads a single character from the input and matches it against a character class regular expression:
def chr(pattern)
  -> state {
    chunk = state.peek(1)
    if chunk =~ %r{[#{pattern}]}
      [[:chr, chunk], state.read(1)]
    end
  }
end
input = State.new("12 + 34", 0)
digit = chr("0-9")
digit.call(input.read(1))
# => [[:chr, "2"], #<struct State string="12 + 34", offset=2>]
digit.call(input.read(5))
# => [[:chr, "3"], #<struct State string="12 + 34", offset=6>]
digit.call(input.read(2))
# => nil
str and chr are the only combinators we’ll use to directly consume input. You can invent more, but these are enough to do useful work. The other combinators we’ll see are higher order combinators: they take other combinators and glue them together in useful ways. Rather than consuming input directly themselves, they consume it implicitly via the combinators they invoke.
The first one we need is called seq, the sequencing combinator. It lets us say a string must be composed of matches for various different rules, in order. Here’s an example, where we say a string must match a sequence containing a digit, then the string +, then another digit:
input = State.new("7+8", 0)
addition = seq(chr("0-9"), str("+"), chr("0-9"))
addition.call(input)
# => [ [:seq, [:chr, "7"], [:str, "+"], [:chr, "8"]],
# #<struct State string="7+8", offset=3> ]
How would such a combinator work? Let’s look at the things it combines. We know that chr("0-9") applied at offset 0 will yield the node [:chr, "7"] and a new state with offset 1.
chr("0-9").call(input.read(0))
# => [[:chr, "7"], #<struct State string="7+8", offset=1>]
Similarly we know that str("+") applied at offset 1 (the offset returned by the previous match) will return:
str("+").call(input.read(1))
# => [[:str, "+"], #<struct State string="7+8", offset=2>]
And finally we apply another chr("0-9") at offset 2 to get another digit:
chr("0-9").call(input.read(2))
# => [[:chr, "8"], #<struct State string="7+8", offset=3>]
So the job of the seq combinator is to feed its input state through a chain of other combinators, recording the nodes produced as it does so. The output state of one combinator becomes the input for the next. If all the combinators are found to match, then we return a sequence node, otherwise the parse fails. Here is a definition of seq; note how if at any point state becomes nil, the loop effectively stops doing any work:
def seq(*parsers)
  -> state {
    matches = []
    parsers.each do |parser|
      node, state = state && parser.call(state)
      matches << node if state
    end
    if state
      [[:seq, *matches], state]
    end
  }
end
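It’s also worth checking what happens when a sequence fails partway through; using the same addition rule as above, a mismatched character in the middle makes the whole sequence return nil:

addition = seq(chr("0-9"), str("+"), chr("0-9"))

addition.call(State.new("7-8", 0))
# => nil, str("+") fails at offset 1, so the whole sequence fails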
The next basic operation we need is a way to repeat things. We’ve seen how to match a single digit, but what if we want to match a whole number? That’s where rep comes in.
input = State.new("2017", 0)
number = rep(chr("0-9"), 1)
number.call(input)
# => [ [:rep, [:chr, "2"], [:chr, "0"], [:chr, "1"], [:chr, "7"]],
# #<struct State string="2017", offset=4>]
The number passed to rep after the combinator it wraps is a minimum repetition count; passing 1 is like using the + operator in a regular expression, and 0 is like the * operator.
rep works a little bit like seq in that it threads a state through a series of combinators. However, rather than using a list of different combinators, it just uses the same one over and over until it stops matching. When that happens, it checks it’s seen the minimum required number of matches and returns a node if so. There’s one subtlety here: notice how it stores off last_state before attempting each match. That’s because when the match finally fails, state will become nil, and we need to return the last good state as part of the result so the next combinator in line knows where we finished matching.
def rep(parser, n)
  -> state {
    matches = []
    last_state = nil
    while state
      last_state = state
      node, state = parser.call(state)
      matches << node if state
    end
    if matches.size >= n
      [[:rep, *matches], last_state]
    end
  }
end
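To see the minimum count in action: with a minimum of 1, rep fails on input containing no digits at all, while a minimum of 0 succeeds with an empty match and leaves the offset where it was:

digits = rep(chr("0-9"), 1)

digits.call(State.new("abc", 0))
# => nil

rep(chr("0-9"), 0).call(State.new("abc", 0))
# => [[:rep], #<struct State string="abc", offset=0>]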
The final fundamental operator we need is a way to make choices. When dealing with programming languages with rich expressions, we want to say an expression could be a simple number, or an addition, or a variable, or any combination of these. The alt combinator, like seq, takes a list of parsers as input; these are the alternatives it has to choose from. It works by simply trying each one in turn until one of them works, and returning the result for that. Note how in this combinator, the state is not threaded through the parsers like it is in seq or rep, but each parser works on the original input state. We want each parser to try and match from the same starting point, rather than trying the first, and then trying the second from where the first left off, and so on. If no parser matches, we must explicitly return nil since the return value of parsers.each is parsers.
def alt(*parsers)
  -> state {
    parsers.each do |parser|
      node, new_state = parser.call(state)
      if new_state
        return [node, new_state]
      end
    end
    nil
  }
end
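A quick check that this behaves as expected (boolean here is just an illustrative rule, not part of the grammar we’re building):

boolean = alt(str("true"), str("false"))

boolean.call(State.new("false", 0))
# => [[:str, "false"], #<struct State string="false", offset=5>]

boolean.call(State.new("maybe", 0))
# => nil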
This method of choice always picks the first parser that matches. If the choice it makes causes some later combinator in the chain to fail, we can’t back-track and try something else using this technique. Once we make a choice, we’re stuck with it. This is the semantics found in PEG parsers, but there exist many parsing techniques that do allow back-tracking or what’s called ambiguous choice. This makes some things easier to express but also causes problems around expressing precedence, as in the dangling else problem. What we’ve implemented here is called ordered choice, and it removes ambiguity at the cost of making some things a bit harder to express clearly.
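To make the trade-off concrete, here’s a small hypothetical rule where ordered choice commits too early: because alt returns the first alternative that matches, the longer keyword is never tried, and the surrounding seq fails even though the input looks parseable.

keyword = alt(str("in"), str("int"))
typed   = seq(keyword, str(":"))

typed.call(State.new("int:", 0))
# => nil, alt committed to "in", leaving "t:" for str(":") to fail on

A backtracking parser could recover by revisiting the choice and trying str("int"); with ordered choice, the fix is to list the longer alternative first.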
We’re now in a position to do some reasonably interesting things with the tools we’ve amassed. For example, here’s a little language that recognises numbers or additions of two numbers:
# whitespace is zero or more space characters
_ = rep(str(" "), 0)
# numbers are either a 0, or a digit 1-9 followed by zero or
# more digits 0-9
number = alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0)))
# an addition is a number, followed by whitespace, a plus sign,
# more whitespace, and a number
addition = seq(number, _, str("+"), _, number)
# an expression is an addition or a number
expression = alt(addition, number)
Let’s try it out:
expression.call(State.new("12", 0))
# => [ [:seq, [:chr, "1"], [:rep, [:chr, "2"]]],
# #<struct State string="12", offset=2> ]
expression.call(State.new("34 + 567", 0))
# => [ [:seq, [:seq, [:chr, "3"], [:rep, [:chr, "4"]]],
# [:rep, [:str, " "]],
# [:str,"+"],
# [:rep, [:str, " "]],
# [:seq, [:chr, "5"], [:rep, [:chr, "6"], [:chr, "7"]]]],
# #<struct State string="34 + 567", offset=8> ]
These results have the right structure, but they’re not especially useful. We’d really like a data structure telling us clearly that this string represents the addition of 34 and 567, not this maze of characters buried within arrays.
We’ll make an amendment to our existing generator functions that allows them to accept a block, letting us transform the nodes on the fly. For example, let’s augment the number rule with a block for converting its nested nodes into a numeric value. Remember, a number can either be [:str, "0"], or something like [:seq, [:chr, "5"], [:rep, [:chr, "6"], [:chr, "7"]]] as seen above.
number = alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
  if node.first == :str
    0
  else
    digit_chrs = [node[1]] + node[2].drop(1)
    digit_chrs.map(&:last).join("").to_i
  end
}
All that needs to be done to make this work is to add an &action parameter to alt and to make it apply this block to its result:
def alt(*parsers, &action)
  -> state {
    parsers.each do |parser|
      node, new_state = parser.call(state)
      if new_state
        node = action.call(node) if action
        return [node, new_state]
      end
    end
    nil
  }
end
Note that this is why the conditions in these combinators always check for a new state, not a new node. Our own action blocks might decide nil or false is a reasonable node value, and we don’t want the parser to think that indicates failure.
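With the updated alt in place, the number rule defined above now yields a plain integer rather than a nest of nodes:

number.call(State.new("0", 0))
# => [0, #<struct State string="0", offset=1>]

number.call(State.new("567", 0))
# => [567, #<struct State string="567", offset=3>]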
If we make similar changes to all the other combinators, we can also modify the addition rule to produce a compact representation that removes extraneous things like whitespace:
addition = seq(number, _, str("+"), _, number) { |node|
  ["+", node[1], node[5]]
}
And now, parsing an addition expression yields something much more usable:
expression.call(State.new("34 + 567", 0))
# => [ ["+", 34, 567],
# #<struct State string="34 + 567", offset=8> ]
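As a sketch of what those “similar changes” might look like, here is one way seq could accept an &action block, following the same pattern we used for alt; the other combinators can be extended in the same fashion:

def seq(*parsers, &action)
  -> state {
    matches = []
    parsers.each do |parser|
      node, state = state && parser.call(state)
      matches << node if state
    end
    if state
      # build the raw sequence node, then let the action transform it
      node = [:seq, *matches]
      node = action.call(node) if action
      [node, state]
    end
  }
end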
We’ve now realised the important function of parsers: to take a text format and extract information from it in a concise yet rich format, containing data values we can operate on. However, there is one final piece of machinery we need in order to parse most languages you’d actually want to apply this technique to. So far, everything we’ve done can be accomplished with regular expressions, and not the fancy extended regular expressions we now have. Formally speaking, the above code can only recognise regular languages, so for example, you couldn’t write a parser that matched an expression in any number of balanced parentheses. To do that, you need to be able to write recursive rules.
As an example of this, let’s imagine trying to extend our addition grammar to parse an expression like 7 + 8 + 9. This doesn’t parse completely using the above rules:
expression.call(State.new("7 + 8 + 9", 0))
# => [["+", 7, 8], #<struct State string="7 + 8 + 9", offset=5>]
It recognises the first part "7 + 8", but then it stops, leaving " + 9" unparsed. That’s because our addition rule just recognises a number on either side of the +. We resolve this by recognising that 7 + 8 + 9 is the addition of a number, 7, and another addition, 8 + 9, i.e. it’s really 7 + (8 + 9). (It could also be (7 + 8) + 9 but we’ll deal with associativity later.) So, we’d like to modify our addition rule so the right-hand side can be either a number or another addition, i.e. it can be an expression. But how can we write that?
addition = seq(number, _, str("+"), _, expression)
expression = alt(addition, number)
addition requires expression to be defined first, but expression wants addition to be defined first! How can we write mutually recursive rules in this system? The answer is to introduce a special combinator, ref(rule_name), for referring to other rules, that only looks up the implementation of the rule when the combinator is executed. That is, it’s a combinator for deferring execution. To use this, we need to implement each rule as a method, and invent a combinator for looking up another rule by name, and then calling the combinator it yields. Here’s our addition grammar converted to this system, where every reference to another rule is done using this new ref combinator:
class Addition
  def expression
    alt(ref(:addition), ref(:number))
  end

  def addition
    seq(ref(:number), ref(:_), str("+"), ref(:_), ref(:expression)) { |node|
      # construct an addition node
    }
  end

  def number
    alt(str("0"), seq(chr("1-9"), rep(chr("0-9"), 0))) { |node|
      # convert the node to a number
    }
  end

  def _
    rep(str(" "), 0)
  end
end
Since all the combinator generators are now being used inside a class, I’m going to move them into a module called ParserCombinators and include that into the Addition class, rather than leaving them as global functions:
module ParserCombinators
  def str(string, &action) # ...
  def chr(pattern, &action) # ...
  def seq(*parsers, &action) # ...
  def rep(parser, n, &action) # ...
  def alt(*parsers, &action) # ...
end

class Addition
  include ParserCombinators
end
Finally, here’s the implementation of ref: it returns a combinator that calls a method by name to get the combinator it represents, then calls that combinator with whatever state was passed in:
module ParserCombinators
  def ref(name)
    -> state {
      __send__(name).call(state)
    }
  end
end
To wrap everything up, let’s add a convenience method for parsing a complete string. It wraps the string in a new State, then calls root to get the root rule of the grammar, uses that rule to parse the string, and only if the string is fully read after the parse do we actually return a result:
module ParserCombinators
  def parse(string)
    node, state = root.call(State.new(string, 0))
    if state and state.complete?
      node
    end
  end
end
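One detail the parse method relies on: it calls root to find the grammar’s starting rule, so the Addition class needs to define that method. Presumably it just points at the expression rule, something like:

class Addition
  def root
    ref(:expression)
  end
end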
Tying all this together, we have a working parser for addition:
Addition.new.parse("7 + 8 + 9")
# => ["+", 7, ["+", 8, 9]]
So that’s the basics of parser combinators. We’ve seen six simple building blocks that are enough to express most context-free languages: str, chr, seq, rep, alt and ref. We’ve used these to build a parser for a binary operation where we can chain the operation to get a recursive structure. In the next article, we’ll look more at how these combinators are used, including how we handle precedence and associativity.