Formal Language Processing in Scala, Part 2

August 2nd, 2008 | In Scala

This is the second part in a series of articles about parser combinators and implementing an interpreter in Scala. In this part I will focus on the development from simple hand-written parsers to parser combinators so that you can see how a combinator-based parser really works and utilize parser combinators to their full potential instead of treating them as merely a DSL for EBNF grammars.

Parsing Strategies

If you have done the exercises from part 1 (or read the solutions) you should have noticed that the hand-written recursive-descent parser cannot handle any left-recursive productions, a limitation shared by all LL parsers. This kind of parser reads the input from left to right and traverses the productions in a top-down fashion. The alternative is an LR parser which uses a bottom-up strategy to build larger non-terminals from smaller ones, starting with the individual tokens. Parser generators frequently use LR parsers (and especially their LALR variety) but they are difficult to write by hand. I will not cover them in any detail in this series but you should be aware that LR parsers do not suffer from the left-recursion problem, so you will frequently encounter left-recursive grammars which were meant to be implemented with LR parsers. Fortunately, it is always possible (and often enough easy) to transform a left-recursive grammar into one which is suitable for an LL parser. The most common case for left-recursion is the implementation of binary operators which you have already seen in their left-recursive, right-recursive and repetitive forms in the exercises from part 1.

From Parsers to Parser Combinators

Let’s go back to the hand-written parser for now. I would like to modify it so that it can represent the following grammar for Fun0 with a separate method for each production:

programUnit = statements
statements = statement*
statement = "print" expression
expression = numericLit ( ( "+" | "*" ) numericLit )*

If you try that in the same way as in the original version, you will see that it gets rather ugly. You cannot easily separate programUnit, statements and statement because the decision if a production should even be tried is made outside its method. Once the method has been called, it can only succeed, or fail by throwing an exception. In the grammar above, the “statements” method would need to know that a single statement begins with the token “print” even though that token is consumed in “statement” and not in “statements”. Alternatively, “statements” could just try to parse a new statement until it reaches the end of the input but that would imply some knowledge about the context in which it is called, which actually belongs into the “programUnit” method.

This dilemma can be solved by moving the check for whether a production is applicable into the method which implements that production. Instead of the remaining tokens of type List[Any], the methods now return a result of type Option[List[Any]]: If the production is not applicable, it returns None, otherwise it returns Some(remainder). Parsing errors are still signaled by throwing an exception.

Here are the productions from the original parser rewritten in this new style:

  def programUnit(l:List[Any]): Option[List[Any]] = l match {
    case Nil => Some(Nil)
    case "print" :: e => expression(e) match {
      case Some(rest) => programUnit(rest)
      case _ => throw new Exception("Expected expression but found "+e)
    }
    case x => throw new Exception("Expected 'print' or end of input but found "+x)
  }

  def expression(l:List[Any]): Option[List[Any]] = l match {
    case (i:Int) :: "*" :: rest => expression(rest)
    case (i:Int) :: "+" :: rest => expression(rest)
    case (i:Int) :: rest => Some(rest)
    case _ => None
  }

Now the task of separating programUnit, statements and statement becomes easy:

  def programUnit(l:List[Any]): Option[List[Any]] = statements(l) match {
    case x @ Some(Nil) => x
    case rest => throw new Exception("Expected end of input but found "+rest)
  }

  def statements(l:List[Any]): Option[List[Any]] = statement(l) match {
    case None => Some(l)
    case Some(rest) => statements(rest)
  }

  def statement(l:List[Any]): Option[List[Any]] = l match {
    case "print" :: e => expression(e) match {
      case None => throw new Exception("Expected expression but found "+e);
      case x => x
    }
    case _ => None
  }

Notice that the programUnit method does not simply call “statements” (as you might expect from the grammar). Because it is the topmost method of the parser, it also needs to ensure that there are no input tokens left after parsing the statements.

Let’s examine the “statements” method more closely. The corresponding production in the grammar is “statements = statement *”, so it is just a repetition of one non-terminal. But what makes it specifically a repetition of statements as opposed to a repetition of some other non-terminal (or terminal)? Since we have properly separated the concerns of the individual productions in the previous step, the only “statement”-specific thing in “statements” is one call to the “statement” method. Imagine we had a syntax for function calls of the form “call = functionName args” with “args” being defined as “args = arg *”. The method to parse “args” would look just like the “statements” method, except that “statement” is replaced by “arg”:

  def args(l:List[Any]): Option[List[Any]] = arg(l) match {
    case None => Some(l)
    case Some(rest) => args(rest)
  }

We can in fact generalize this to a repetition method that can be parameterized with another method for parsing the tokens to repeat:

  def rep(p: (List[Any] => Option[List[Any]]))(
      l:List[Any]): Option[List[Any]] = p(l) match {
    case None => Some(l)
    case Some(rest) => rep(p)(rest)
  }

The method uses a curried arguments list so that we can write parsing methods like “statements” in a very simple way by using partial function application:

def statements = rep(statement) _

We have just defined and used our first parser combinator. A combinator is defined as “a higher-order function which, for defining a result from its arguments, solely uses function application and earlier defined combinators”. This is just what the generalized parsing functions like “rep” do. They return a parsing function (thus they are higher-order functions) which is constructed from their arguments (and possibly from other parser combinators). Now that we have a way of composing parsing functions, the distinction between parsing functions and parsers is no longer needed and we can refer to any parsing function simply as a parser.

Exercise 1: Define the “rep1″ combinator (which corresponds to the “+” operator in EBNF). Hint: Make use of the “rep” combinator.

Let’s move on to the “programUnit” parser where we can make a similar case for generalization. “programUnit” calls the “statements” parser and then ensures that no input is remaining. It does not need to know anything about parsing statements, so we can generalize it into the “phrase” combinator and define “programUnit” in terms of “phrase”:

  def phrase(p: (List[Any] => Option[List[Any]]))(
      l:List[Any]): Option[List[Any]] = p(l) match {
    case x @ Some(Nil) => x
    case rest => throw new Exception("Expected end of input but found "+rest)
  }

  def programUnit = phrase(statements) _

(Note: The error handling would need some improvement for a general parser combinator library. Most parsers should just return None instead of throwing an exception but when you do that you do not get any useful error message to indicate why or where parsing has failed. The parsers from Scala’s parser combinator library return instances of a ParseResult class which can either be Success, Failure (which corresponds to None in our parsers) or Error (reserved for some special cases). Unlike our None value, the Failure and Error results contain an error message and the remaining input.)

Exercise 2: Define some other parser combinators.

2.1 Write the “asteriskOrPlus” parser for the production “asteriskOrPlus = ‘*’ | ‘+’” which is required for this “expression” parser:

  def expression(l:List[Any]): Option[List[Any]] = l match {
    case (i:Int) :: more => asteriskOrPlus(more) match {
      case Some(rest) => expression(rest)
      case None => Some(more)
    }
    case _ => None
  }

2.2 Write two separate parsers “asterisk” and “plus” and rewrite “asteriskOrPlus” in terms of those parsers.

2.3 Generalize “asterisk” and “plus” to the “lift” combinator with the following signature and new definitions for “asterisk” and “plus”:

  def lift(str:String)(l:List[Any]): Option[List[Any]] = ...

  def asterisk = lift("*") _

  def plus = lift("+") _

2.4 Generalize “asteriskOrPlus” to the “choice” combinator with the following signature and new definition for “asteriskOrPlus”:

  def choice(a: (List[Any] => Option[List[Any]]),
      b: (List[Any] => Option[List[Any]]))(l:List[Any]): Option[List[Any]] = ...

  def asteriskOrPlus = choice(asterisk, plus) _

You should now have a pretty clear picture of parser combinators. In part 3 I will introduce a nice operator-based syntax to bring the simple combinator library from this part closer to EBNF and then move on to Scala’s real parser combinators and an interpreter for Fun0.

Leave a Reply


Close
E-mail It