Formal Language Processing in Scala, Part 1
July 27th, 2008 | In ScalaThis is the first part in a series of articles I intend to write about implementing an interpreter for a simple functional programming language in Scala, using Scala’s parser combinator library. I will try to cover the practical aspects of parsing and interpreting step by step but you should already be familiar with the basics of functional programming. Familiarity with Scala is not strictly required but it will be helpful for understanding the code (especially in the later parts of the series).
Series Contents
- Part 1 (this post)
- Solutions to Part 1
- Part 2
- Solutions to Part 2
- Part 3
- Solutions to Part 3
- Part 4
- Solutions to Part 4
- Part 5
- Solutions to Part 5
Language and Syntax
Obviously, before we can think about how to build an interpreter for a language, we first need to decide what that language should be. We will start with a very simple language (let’s call it Fun0, the 0th iteration of our functional language) and gradually add more features to it. Here is an example of a Fun0 program:
print 1+2+3 print 2 + 20 * 2 print 3*4 print 1
In informal terms, the program consists of some “print” statements which take an expression as their only argument. Each expression is either an integer or a calculation on integers, using the operators * and +. Note that there are two “print” statements on the last line. It is always clear where a new statement begins so there is no need for a linefeed character between them. In fact, whitespace can be ignored entirely. There is a nice visualization for the syntax of a formal language called a syntax diagram. Here is a set of such diagrams for Fun0:

This is also known as a railroad diagram because you follow the lines like a train follows its tracks. You start at the left and move along the line in the direction of the arrows. You may go straight forward or take a curve but you cannot take a sharp turn. The rounded boxes represent terminals, the individual symbols the language consists of. For Fun0 they are “print”, “+”, “*” and numbers (called numericLit in the diagram). Each individual diagram represents the production rule for a non-terminal, i.e. a named part of the syntax which is composed of other terminals or non-terminals. A rectangle references such a non-terminal. For example, in order to move through the “expression” rectangle in the first diagram, you have to move through the entire “expression” diagram.
Unlike the informal description of Fun0’s syntax I gave above, the syntax diagram is a precise formal specification of its grammar. While it is easy to read for a human (even without any experience with formal languages), there are more convenient textual representations. The lingua franca of context-free grammars is the Extended Backus-Naur Form, or EBNF. You will be hard-pressed to find a formal specification for a programming language that doesn’t contain an appendix with the language’s grammar in EBNF. There are a few different dialects of the EBNF but they all have the same expressiveness and differ only in details of their syntax. Here is the Fun0 grammar from the syntax diagrams in EBNF. (The diagrams were in fact created from this grammar with a web-based syntax diagram generator):
programUnit = ( "print" expression )* expression = numericLit ( ( "+" | "*" ) numericLit )*
Each line represents a production rule for a non-terminal, with its expansion on the right side of the “=”. Terminals are enclosed in quotes (except for the numericLit terminal which is not a literal string). Multiple terminals and non-terminals can be grouped together with parentheses. An “*” indicates that the preceding terminal, non-terminal or group may be repeated zero or more times (”+” would be one or more times and “?” is zero or one time). Multiple choices can be separated with “|”.
Exercise 1: In order to improve your understanding of grammars and of EBNF, you can paste the Fun0 grammar into the diagram generator, make some modifications to it, and examine the resulting railroad diagrams. I suggest the following modifications. Except for exercise 1.6, they should not change the Fun0 language but simply create different grammars for the same language.
- Integrate a new non-terminal into the grammar:
statement = "print" expression- Write “programUnit” without using the repetition symbols “*” and “+”. Hint: In a program, you can rewrite a loop to a recursive function. The same principle applies here.
- Find a different solution to exercise 1.2.
- Make the grammar reflect the proper precedence of the “+” and “*” operators by integrating this non-terminal:
term = factor ( "*" factor )*- Write the grammar resulting from exercise 1.4 without repetition symbols.
- Extend the grammar resulting from exercise 1.4 to support parentheses to override operator precedence (e.g. “print (20+1)*2″ should print 42)
Parsing
The most basic language processing program is a parser. It reads some source file and checks if it conforms to a specific grammar, possibly extracting on the go a representation of the source file which is better suited to further processing. More advanced language processing tools like type-checkers, interpreters and compilers start with parsing and then use the representation that was built by the parser.
There are several ways of turning a grammar into a parser. If the grammar is simple enough, you can just write a parser for it by hand (which I generally do not recommend in face of the alternative of using parser combinators). This will usually be an LL(1) recursive descent parser which moves through a stream of tokens (which get parsed as terminals) the same way you follow a railroad diagram.
Here is a simple hand-written parser for Fun0 which processes a list of tokens (We will save the process of getting from a character stream to tokens for later):
object Fun0aHandmade {
def main(args : Array[String]) {
val input = List("print", 4, "*", 3, "+", 1, "print", 10)
programUnit(input)
}
def programUnit(l:List[Any]): List[Any] = l match {
case Nil => Nil
case "print" :: e => programUnit(expression(e))
case x => throw new Exception("Expected 'print' or end of input but found "+x)
}
def expression(l:List[Any]): List[Any] = l match {
case Nil => throw new Exception("Expected number but found end of input");
case (i:Int) :: "*" :: rest => expression(rest)
case (i:Int) :: "+" :: rest => expression(rest)
case (i:Int) :: rest => rest
case x => x
}
}
Both production rules have been turned into methods that take a list of tokens from which they process as many as they can and then return the remainder. When an illegal token is encountered, an exception with a proper error message gets thrown.
Exercise 2: Which modification of the grammar from exercise 1 does this parser resemble the most? Can you implement the other versions in a similar way?
The traditional method for implementing more complex parsers is through the use of a parser generator, a tool which takes a grammar in a special (usually EBNF-like) format and produces source code which implements a parser for that grammar. The produced code is not meant to be human-readable. You just program against the parser’s API and when you need to change the grammar, you modifiy its original specification and recreate the parser. Some well-known parser generators which target the Java platform are JavaCC, ANTLR and SableCC.
For example, here is the Fun0 grammar in a format suitable for JavaCC:
PARSER_BEGIN(Fun0Parser)
package com.novocode.fun0;
public class Fun0Parser {}
PARSER_END(Fun0Parser)
SKIP : { " " | "\r" | "\t" | "\n" }
TOKEN : { <PLUS: "+"> | <MULTIPLY: "*"> | <PRINT: "print"> }
TOKEN : { <NUMERIC_LIT: ( <DIGIT> )+ > | <#DIGIT: ["0"-"9"]> }
void programUnit() : {} {
( "print" expression() )*
}
void expression() : {} {
<NUMERIC_LIT> ( ( "+" | "*" ) <NUMERIC_LIT> )*
}
The first paragraph is just boilerplate and the SKIP and TOKEN directives are required for the lexer which turns the character stream into a token stream. The rest is basically EBNF dressed up in a somewhat verbose syntax. Running JavaCC on this simple grammar produces 7 Java source files with a total size of 45 KB.
Parser Combinators
An interesting alternative to hand-written parsers and parser generators are parser combinator libraries. When implemented in a language like Scala that has a rich syntax and is well suited to embedding DSLs, a parser combinator library allows you to write a parser which is embedded directly in the host language but at the same time as close to EBNF as a parser generator’s input. And that is only the beginning. The expressive power of parser combinators reaches well beyond EBNF because you can build your own abstractions with all the features offered by the host language.
Here is a parser for Fun0 implemented with parser combinators:
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.input.PagedSeqReader
import scala.collection.immutable.PagedSeq
object Fun0a extends StandardTokenParsers {
lexical.reserved += "print"
lexical.delimiters += ("+", "*")
def programUnit = ( "print" ~ expression )*
def expression = numericLit ~ ( ( ("+" | "*") ~ numericLit )* )
def main(args: Array[String]) = {
val tokens = new lexical.Scanner(
new PagedSeqReader(PagedSeq fromFile args.first))
phrase(programUnit)(tokens) match {
case Success(tree, _) => println(tree)
case e: NoSuccess => Console.err.println(e)
}
}
}
Just like in the hand-written parser you can see the two productions “programUnit” and “expression” implemented as methods, except now their definition looks almost like EBNF. The required changes are only due to limitations of Scala’s syntax: There is a sequence combinator “~” which is expressed in EBNF simply by adjunction, and the definition of “expression” needs another set of parentheses because the postfix “*” operator does not take precedence over “~”.
I would like to conclude part 1 at this point with the “magical” combinator-based parser implementation. In part 2 we will look behind the scenes and arrive at our own parser combinators by making some changes to the hand-written parser I introduced earlier. For now, I will simply claim in good tradition: You could have invented parser combinators!
[Edit: Corrected two errors in exercise 1]
July 28th, 2008 at 3:39 pm
Nicely written introduction to parsing, I am looking forward to the later parts of the series to directions you take.
August 17th, 2008 at 10:31 pm
[...] ausführliche Serie, theoretisch [...]
August 19th, 2008 at 4:16 am
[...] post on parsing in Scala and would like more technical details Stephan Zeiger has a very thorough 3-part series. Possibly related posts: (automatically generated)Scala for Java Refugees [...]
January 11th, 2010 at 2:47 pm
thank you for the nice article. It helped me to understand the stuff from the basics.
December 12th, 2010 at 4:02 pm
Thanks for these articles, I’ve been looking for something like this for quite some time now.
You cover two topics that i am very interested in: scala, parsing.
I used to have to read big ugly books about compilers but your examples make things easy
Keep up the good work.