Formal Language Processing in Scala, Part 3
August 16th, 2008 | In ScalaThis is the third part in a series of articles on formal language processing in Scala, focusing on parser combinators and language interpretation. In this part I will add a nice operator syntax and support for recursive grammars to the simple parser combinators introduced in part 2 and then move on to the parser combinators from Scala’s core library.
Operator Syntax
After incorporating the changes made in the exercises from part 2 into our parser, we have the following collection of combinators:
def phrase(p: (List[Any] => Option[List[Any]]))(l:List[Any]): Option[List[Any]] = p(l) match {
case x @ Some(Nil) => x
case rest => throw new Exception("Expected end of input but found "+rest)
}
def rep(p: (List[Any] => Option[List[Any]]))(l:List[Any]): Option[List[Any]] = p(l) match {
case None => Some(l)
case Some(rest) => rep(p)(rest)
}
def rep1(p: (List[Any] => Option[List[Any]]))(l:List[Any]): Option[List[Any]] = p(l) match {
case None => None
case Some(rest) => rep(p)(rest)
}
def lift(str:String)(l:List[Any]): Option[List[Any]] = l match {
case s :: rest if(s == str) => Some(rest)
case _ => None
}
def choice(a: (List[Any] => Option[List[Any]]),
b: (List[Any] => Option[List[Any]]))(l:List[Any]): Option[List[Any]] = a(l) match {
case r @ Some(rest) => r
case None => b(l)
}
And here are the parsers from Fun0 which make use of these combinators:
def programUnit = phrase(statements) _
def statements = rep(statement) _
def asteriskOrPlus = choice(asterisk, plus) _
def asterisk = lift("*") _
def plus = lift("+") _
They are semantically very close to EBNF but the syntax is not nearly as nice. I have already shown you a Fun0 parser using Scala’s parser combinator library in part 1, so you know that a better syntax is possible. Our parser combinators are currently implemented as curried functions which take one or two parser arguments, so you always get a prefix notation like “choice(a,b)” (first the combinator, then its operands as function arguments).
Operator syntax in Scala is just a different notation for method calls. “a choice b” gets translated by the compiler to a method call “a.choice(b)”, and similarly, “a | b” gets translated to “a.|(b)”. Both “choice” and “|” are legal method names in Scala and both can be used as operators. Since we want to be able to write a choice as “a | b”, we need to make the choice combinator a method of “a”, which is a parser of the function type “List[Any] => Option[List[Any]]”. We cannot simply add methods to a function type but another feature of Scala’s syntax comes to the rescue. A function type “A => B” is actually a shorthand for the trait “Function1[A,B]” which means that every function is also an object. A function call “a(b)” gets translated by the compiler to “a.apply(b)” where “apply” is a method of the function object. With that in mind it should not come as a surprise that we can define a Parser trait which extends the parser function type:
trait Parser extends (List[Any] => Option[List[Any]])
implicit def lift(str:String) = new Parser {
def apply(l:List[Any]) = l match {
case s :: rest if(s == str) => Some(rest)
case _ => None
}
}
The choice combinator can be rewritten in a similar way and made available through a method of the Parser trait. Note that the method retains only the "b" parameter. The object on which it is called (the "self" alias, because "this" would refer to the inner Parser instance) corresponds to the "a" parameter from the original "choice" method:
trait Parser extends (List[Any] => Option[List[Any]]) {
self =>
def |(b: Parser) = new Parser {
def apply(l:List[Any]) = self(l) match {
case r @ Some(rest) => r
case None => b(l)
}
}
}
We can now define the "asteriskOrPlus" parser with infix syntax:
def asteriskOrPlus = lift("*") | lift("+")
And it gets even better! We marked "lift" as an implicit method, so any method call on a String object or taking a String parameter which is defined for Parser but not for String will automatically lift the String to a Parser. Since "|" is not defined for strings, we can safely rely on this implicit conversion for "asteriskOrPlus":
def asteriskOrPlus = "*" | "+"
Exercise 1: Define the sequence combinator "~" so that you can write the "statement" parser in the following form:
def statement = "print" ~ expression
Postfix operators in Scala work just like infix operators, except that their method counterparts are parameterless (the one operand becomes the receiver of the method), so the new implementation of the "rep" combinator as "*" is straight-forward:
trait Parser extends (List[Any] => Option[List[Any]]) {
self =>
// ...
def * = new Parser {
def apply(l:List[Any]) = self(l) match {
case None => Some(l)
case Some(rest) => apply(rest)
}
}
}
Exercise 2: Rewrite the "rep1" (operator "+") and "phrase" combinators and the "expression" parser in the new style. Hint: You can use the sequence combinator for a very concise implementation of "rep1".
Recursion
Let's introduce some new parser combinators which allow us to define the "expression" parser in an EBNF-like syntax.
def empty = new Parser {
def apply(l:List[Any]) = Some(l)
}
def numericLit = new Parser {
def apply(l:List[Any]): Option[List[Any]] = l match {
case (i:Int) :: rest => Some(rest)
case _ => None
}
}
trait Parser extends (List[Any] => Option[List[Any]]) {
self =>
// ...
def ? = self | empty
}
Here is an iterative definition of "expression" using these combinators:
def expression = numericLit ~ (( asteriskOrPlus ~ numericLit )*)
This works just fine. But what about the right-recursive version?
def expression:Parser =
numericLit ~ (( asteriskOrPlus ~ expression )?)
We have to add a return type annotation to satisfy the compiler because the method is recursive but that alone does not get us very far. The "expression" method still calls itself every time, so you end up with an endless recursion (at least in theory; due to the Java Virtual Machine's lack of proper tail call optimization this call fails very quickly with a stack overflow). But why did it work before we used parser combinators? Have another look at an earlier implementation of "expression":
def expression = new Parser {
def apply(l:List[Any]): Option[List[Any]] = l match {
case (i:Int) :: more => asteriskOrPlus(more) match {
case Some(rest) => apply(rest)
case None => Some(more)
}
case _ => None
}
}
This works because the actual parsing method only calls itself when it needs to, i.e. when there is more input left to parse as an expression. And that input is always finite. The new combinator-based "expression" method on the other hand creates a new parser for each nested sub-expression and since expressions can be nested to an arbitrary depth you have to allow for theoretically infinite nesting. You could use "val" instead of "def" to define a single "expression" parser object but that alone is not sufficient:
lazy val expression:Parser =
numericLit ~ (( asteriskOrPlus ~ expression )?)
This obviously cannot work because the value assigned to "expression" needs that same value in order to be computed. The "lazy" keyword ensures that you can compile this at all (when placed inside a method) and that the "expression" value on the right-hand side is not still null (when placed inside a trait, class or object) but you still need to evaluate "expression" in order to define "expression", so you end up with the same endless recursion. This can be prevented by using Scala's call-by-name syntax in the definition of all combinators:
trait Parser extends (List[Any] => Option[List[Any]]) {
self =>
def |(b: =>Parser) = new Parser {
def apply(l:List[Any]) = self(l) match {
case r @ Some(rest) => r
case None => b(l)
}
}
def ~(b: =>Parser) = new Parser {
def apply(l:List[Any]) = self(l) match {
case Some(rest) => b(rest)
case None => None
}
}
def ? = self | empty
}
All arguments of type Parser have been changed from the default call-by-value to call-by-name, so they do not get evaluated until needed. This happens only when a parser is used, not when it is defined. With these simple changes in place, the right-recursive definition of "expression" (using either "def", "val" or "lazy val") works as it should.
Scala's Parser Combinator Library
We have now defined most of the essential features of a parser combinator library from the bottom up and it is time to move on to the more advanced implementation in Scala's core library which offers important features like:
- Error reporting: As I already mentioned in part 2, our simple parser combinators can only tell if an input is valid or invalid but they cannot produce good error messages to tell the user why an input is invalid. Scala's parser combinators keep track of parse failures deep within the parse tree and return them to the caller so that you always know which part of the input could not be parsed.
- Parse results: When an input is valid, Scala's parser combinators return the complete parse tree for that input which shows you the path through all successfully applied parsers. There are also combinators which allow you to modify individual parts of the parse tree to return your own representation of the input instead.
- Polymorphism: Our simple parsers always used a stream of input tokens of type "List[Any]". Parsers from Scala's parser combinator library are polymorphic over the type of tokens they use and the type of parse result they produce, so you can combine parsers for different types of input and output in a statically type-safe way.
The next part of this series will focus on Parse Trees and Abstract Syntax Trees as the basic data structure on which an interpreter can be built. Until then I can also recommend Debasish Ghosh's and Daniel Spiewak's blog posts about Scala's parser combinators if you have not read them already.