Formal Language Processing in Scala, Part 4

September 20th, 2008 | In Scala

This is the fourth part in a series of articles on formal language processing in Scala, focusing on parser combinators and language interpretation. I wanted to have this part out several weeks ago but setting up my new home file server (running OpenSolaris Indiana) kept me busy. I’ll save the details of that endeavour for another post.

Using Scala’s Parser Combinator Library

At the end of the previous part we had developed some basic parser combinators with the same syntax as Scala’s parser combinator library, so we can now move on to the real thing. Here is a Scala Parser Combinator version of the Fun0 parser with the extension for sub-expressions in parentheses:

import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.input.PagedSeqReader
import scala.collection.immutable.PagedSeq

object Fun0i extends StandardTokenParsers {

  lexical.reserved += "print"
  lexical.delimiters += ("+", "*", "(", ")")

  lazy val programUnit = phrase( statement* )

  lazy val statement = "print" ~ expression

  lazy val expression = sum

  lazy val sum = product ~ ( ("+" ~ product)* )

  lazy val product = simpleExpression ~ ( ("*" ~ simpleExpression)* )

  lazy val simpleExpression:Parser[Any] = numericLit | ("(" ~ expression ~ ")")

  def main(args: Array[String]) {
    val tokens = new lexical.Scanner(
      new PagedSeqReader(PagedSeq fromFile args.first))
    programUnit(tokens) match {
      case Success(tree, _) => println(tree)
      case e: NoSuccess => Console.err.println(e)
    }
  }
}

The individual parsers look just like they did with our own parser combinators. The Parser class and the combinators are contained in scala.util.parsing.combinator.Parsers which all combinator-based parsers extend.

In this case, the Fun0i object extends StandardTokenParsers, a subclass of Parsers which implements a scanner for Scala-like languages. The scanner (also called a lexer or tokenizer) is responsible for separating a stream of characters into individual tokens which are then consumed by the parser. This is not always necessary for a parser, and for simpler languages you might indeed use a parser which consumes single characters directly, but it is a very common setup for parsing a typical programming language syntax. When you are using a parser generator or writing a recursive-descent parser by hand, you often use regular expressions, a simpler subset of a context-free grammar, for the scanner. You can do the same when using parser combinators, or you can just implement the scanner with parser combinators, too (as done in StandardTokenParsers), thus eliminating the need for a different syntax or library.

In our Fun0i parser, the StandardTokenParsers scanner is configured to mark “print” as a reserved word (so that it is recognized as a literal string and not as an identifier) and to add the operators and parentheses used by Fun0 as delimiters (so that they get returned as individual tokens even if they are not separated by whitespace). The main method first creates a stream of tokens from the input file. Note that PagedSeq.fromFile creates a hidden FileReader object which can never be closed properly. This is good enough for our simple example programs but for production-quality code you should rather manage the FileReader on your own and use PagedSet.fromReader.

The programUnit parser is then applied to the tokens, resulting in either a NoSuccess value which is printed as an error message, or a Success value containing the parse result (the tree value in our example) and the remaining input (which we ignore because we know that it must be empty after using the phrase combinator).

Parse Trees and Abstract Syntax Trees

But what exactly is the parse result? The simple combinators we used previously could only return success or failure. The new version also gives us a tree representation of the parsed input which is not yet suitable for writing an interpreter but it’s a good starting point. This tree representation which results “naturally” from parsing is called a parse tree.

Leaving parser combinators aside for a moment, we will first look at a parse tree created by a traditional recursive-descent parser for the same grammar. The input “print 1+(2+3)” leads to the following tree:

Recursive-Descent Parse Tree

We see a blue node for each production and an orange node for each token consumed successfully while parsing. If you do a pre-order traversal of the tree (i.e. start to the left of the root node, move around the entire tree and look at each node from the left side) you find that this reproduces the original tokens “print”, “1″, “+”, “(”, “2″, “+”, “3″ and “)” in the correct order.

The Fun0i parser prints the following parse result for the same input:

List((print~((1~List())~List((+~((((~((2~List())~List((+~(3~List())))))~))~List()))))))

As a diagram it would look like this:

Combintor Parse Tree

This tree is different from the earlier one because the productions themselves are not parsing functions. Instead they use combinators to create parsing functions, and the non-terminal nodes represent those combinators. All terminals simply produce the input token. The sequence combinator “~” produces a “~” node (actually an instance of a class named “~”) with exactly two children, the results of the parsers to the left and right of the “~” operator. The repetition combinators “*” and “+” produce a List of the individual parse results. The choice combinator “|” produces the result of the successfully matched side. All this happens in a statically type-safe way by parameterizing the parsers with the type of their result:

Screenshot

A simpler and more structured tree would be better suited for writing an interpreter. We can use some new combinators to modify the parse tree and create this so-called Abstract Syntax Tree. The first simplification comes from the modified sequence combinators “~>” and “<~”. These work just like the basic “~” combinator except that they only keep the parse result from the side to which the arrow points and discard the other one. We can use them to get rid of unnecessary details like the “print” token (our only statement is “print”, so we don’t need to keep the name) and the parentheses (the grouping of operations is already encoded in the shape of the tree). We also remove the operators for now. This leads to the following parser:

  lazy val programUnit = phrase( statement* )

  lazy val statement = "print" ~> expression

  lazy val expression = sum

  lazy val sum = product ~ ( ("+" ~> product)* )

  lazy val product = simpleExpression ~ ( ("*" ~> simpleExpression)* )

  lazy val simpleExpression:Parser[Any] = numericLit | ("(" ~> expression <~ ")")

And here is the result for our standard input “print 1+(2+3)”:

List(((1~List())~List((((2~List())~List((3~List())))~List()))))

Intermediate AST 1

To further enhance the AST we define some case classes for expressions. We need three different kinds for Fun0: Int literals, additions and multiplications:

  sealed abstract class Expression
  case class IntLiteral(value: Int) extends Expression
  case class Add(exprs: List[Expression]) extends Expression
  case class Mult(exprs: List[Expression]) extends Expression

An IntLiteral simply wraps an Int value. The Add and Mult expressions are n-ary operations for now, so they take a list of sub-expressions to add or multiply. I will cover binary operators in part 5.

In order to create instances of these expression classes in the relevant productions we can use the very generic “^^” combinator which applies a user-defined function to the previous parse result. The numericLit parser (which is provided by StandardTokenParsers) returns a String value. We can transform it into an IntLiteral like this:

numericLit ^^ { s => IntLiteral(s.toInt) }

The “~” objects created by the sequence combinators provide an extractor similar to the list-consing operator “::”, so we can use pattern matching to extract the individual parts of a sequence in the sum and product parsers:

  lazy val sum = product ~ ( ("+" ~> product)* ) ^^ {
    case e ~ l => Add(e :: l)
  }

  lazy val product = simpleExpression ~ ( ("*" ~> simpleExpression)* )  ^^ {
    case e ~ l => Mult(e :: l)
  }

Now that the sum and product parsers also return expressions, we can complete the simpleExpression parser like this:

  lazy val simpleExpression:Parser[Expression] =
    (numericLit ^^ { s => IntLiteral(s.toInt) }) | ("(" ~> expression <~ ")")

Note that it is now typed as Parser[Expression]. If we run the modified parser, we get the following result:

List(Add(List(Mult(List(IntLiteral(1))), Mult(List(Add(List(Mult(List(IntLiteral(2))), Mult(List(IntLiteral(3))))))))))

Intermediate AST 2

This AST still contains some unnecessary Mult nodes because we create Mult (and Add) nodes even if there is only one operand. We only care about returning an Expression anyway, so we can just return the single operand (which is also an Expression) instead of wrapping it unnecessarily:

  lazy val sum = product ~ ( ("+" ~> product)* ) ^^ {
    case e ~ Nil => e
    case e ~ l => Add(e :: l)
  }

  lazy val product = simpleExpression ~ ( ("*" ~> simpleExpression)* )  ^^ {
    case e ~ Nil => e
    case e ~ l => Mult(e :: l)
  }

Now the AST looks like this:

List(Add(List(IntLiteral(1), Add(List(IntLiteral(2), IntLiteral(3))))))

Final AST

Interpretation

With this data structure in place, an interpreter can be written in only a few lines of code. We will use one method for interpreting a program unit (represented by a list of expressions) and another one for computing the value of an expression:

  def interpretProgramUnit(l: List[Expression]) = l foreach {
    e => println(compute(e))
  }

  def compute(e: Expression): Int = e match {
    case IntLiteral(i) => i
    case Add(exprs) => exprs.foldLeft(0) { _ + compute(_) }
    case Mult(exprs) => exprs.foldLeft(1) { _ * compute(_) }
  }

Interpreting a program unit is simply a matter of iterating over all the expressions, computing their value and printing it to the console. The values are computed by matching on the kind of expression. For an IntLiteral the contained value is returned. For Add and Mult expressions the values of the sub-expressions are computed recursively and then folded with the appropriate operator.

Finally, here is the complete interpreter:

import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.input.PagedSeqReader
import scala.collection.immutable.PagedSeq

object Fun0l extends StandardTokenParsers {

  sealed abstract class Expression
  case class IntLiteral(value: Int) extends Expression
  case class Add(exprs: List[Expression]) extends Expression
  case class Mult(exprs: List[Expression]) extends Expression

  lexical.reserved += "print"
  lexical.delimiters += ("+", "*", "(", ")")

  lazy val programUnit = phrase( statement* )

  lazy val statement = "print" ~> expression

  lazy val expression = sum

  lazy val sum = product ~ ( ("+" ~> product)* ) ^^ {
    case e ~ Nil => e
    case e ~ l => Add(e :: l)
  }

  lazy val product = simpleExpression ~ ( ("*" ~> simpleExpression)* )  ^^ {
    case e ~ Nil => e
    case e ~ l => Mult(e :: l)
  }

  lazy val simpleExpression:Parser[Expression] =
    (numericLit ^^ { s => IntLiteral(s.toInt) }) | ("(" ~> expression <~ ")")

  def main(args: Array[String]) {
    val tokens = new lexical.Scanner(
      new PagedSeqReader(PagedSeq fromFile args.first))
    programUnit(tokens) match {
      case Success(tree, _) => interpretProgramUnit(tree)
      case e: NoSuccess => Console.err.println(e)
    }
  }

  def interpretProgramUnit(l: List[Expression]) = l foreach {
    e => println(compute(e))
  }

  def compute(e: Expression): Int = e match {
    case IntLiteral(i) => i
    case Add(exprs) => exprs.foldLeft(0) { _ + compute(_) }
    case Mult(exprs) => exprs.foldLeft(1) { _ * compute(_) }
  }
}

Exercise: Extend the interpreter to support subtraction with the “-” operator. You can either keep the n-ary operators (but note that “+” and “-” have the same precedence) or make them binary.

In the next part I will introduce some new combinators that go beyond EBNF and then move on to the Fun1 language with functions and value assignment.

3 Responses to “Formal Language Processing in Scala, Part 4”

  1. Joakim Ohlrogge says:

    I just wanted to let you know that I am following your series with great interest. I really like how you go to the bottom of explaining the technique behind the fansy DSL and I find that your series drives me to pick up new tricks in scala that is a very new language for me.

    I have toyed with parsers in JavaCC before and it is fascinating how readable the scala DSL becomes. I am really looking forward to see how parser combinators can go beyond EBNF. It also seems very convenient that everything is scala. JavaCC is always a little tedious when it comes to organizing your builds etc. Do you have any input on the performance of the parsers created with Scala? How do they compare to JavaCC ones in that area?

    Keep up the good work!

  2. Stefan Zeiger says:

    Joakim, thanks for the encouragement. If you’re using Eclipse, have a look at the JavaCC plugin which provides an editor with syntax-highlighting and a builder for JavaCC and JJTree sources. I’m using it on a big project with a JavaCC/JJTree parser and it works really well.

    I have to admit that I haven’t done any performance tests with Scala parser combinators or parser generators yet. So far, parsers have always been “fast enough” for my needs. This is probably true for most programming language parsers (where the performance or the interpreter or the quality of the optimizer and code generator will be much more important) but there are certainly cases where parser performance would be very important (e.g. IP header parsing).

  3. Todd Cook says:

    Your articles are excellent and well paced. You’ve taken a difficult topic and explained the many twists and turns admirably.

    JavaCC will probably still have it’s place, but Scala allows for a lightweight approach that probably be good enough for many applications. Thank you for contributing some real world usages of Scala’s advanced libraries.

Leave a Reply


Close
E-mail It