Formal Language Processing in Scala, Part 5
July 5th, 2009 | In ScalaThis is the fifth part in a series of articles on formal language processing in Scala. In this part I will introduce some new parser combinators, provide a specification for the Fun1 language and build an interpreter for it.
Binary Operators
I’ve already shown briefly in the solutions to part 4 how to build a left-recursive AST for left-associative operators. I’d like to go back one step and look at this in more detail. We are always using repetitions (the “+” and “*” operators) instead of left-recursive productions in the grammar. Such a repetition creates a List of the individual results which I have so far drawn in the diagrams like an array, e.g. for the expression 1+2+3:

But the lists are really linked lists consisting of :: (cons) and Nil nodes, so we should draw the diagram like this:

Well, it looks like those repetitions are still right-recursive after all! But we are able to process them in a left-associative way in the interpreter for n-ary operators by starting with a 0 and left-folding the values computed from the list elements with the + operator:
def compute(e: Expression): Int = e match {
...
case Add(exprs) => exprs.foldLeft(0) { _ + compute(_) }
}
Using an initial zero value here was for brevity’s sake only. All n-ary Add operations contain at least two elements, so exprs can never be empty and we do not really need the 0. Instead we can start with the first element’s value and just fold the rest:
case Add(e :: l) => l.foldLeft(compute(e)) { _ + compute(_) }
The same approach allows us to construct a left-recursive AST from a list of Expression nodes directly in the parser. Instead of computing the values of the two sub-expressions and performing the addition, we just create a binary Add expression from them:
lazy val sum = product ~ (("+" ~> product)*) ^^ {
case e ~ l => l.foldLeft(e)(Add)
}
This gives us the AST that we want:

Turning chains of expressions in the form e ~ ((op ~ e)*) into a left-recursive tree is a common operation in many parsers. Even in the tiny Fun0 language we need almost the same code again for the product production:
lazy val product = simpleExpr ~ (("+" ~> simpleExpr)*) ^^ {
case e ~ l => l.foldLeft(e)(Mult)
}
If we have to distinguish between several operators, it gets slightly more difficult, but the basic pattern is still the same:
lazy val sum = product ~ ( ( ("+"|"-") ~ product)* ) ^^ {
case e ~ l => l.foldLeft(e) {
case (z, "+" ~ p) => Add(z, p)
case (z, "-" ~ p) => Subtract(z, p)
}
}
Deducing The chainl1 Combinator
When you’re using a parser generator, you have to implement everything in terms of the built-in abstractions. It’s a bit like procedural programming: You can write all kinds of functions (productions in this case) that work on data but you cannot write new abstractions. With parser combinators and Scala’s flexible syntax and support for functional programming, you can. We’ve already developed some basic elements from EBNF in this series but we don’t have to stop there.
Let’s see how we can abstract the parsers for chains of left-associative operators into a reusable combinator by looking at the differences in our implementations of sum (with one or two operators) and product:
- There is a parser p1 of type
Parser[Expression]which parses the operands. The result of the combinator is also aParser[Expression]. In our examples, p1 is eitherproductorsimpleExpr. Of course, there shouldn’t be any code specific to expressions in the combinator, so we’ll use a type variable A and make p1 aParser[A]. - A combining function f of type
(A,A) => Ais used for folding the results of parsing the operands. This is one of theAdd,SubtractandMultconstructors. - A parser p2 parses the operators (”+”, “-” or “*”). It doesn’t produce a result per se but if we have different operators it determines which combining function to use, so we’ll make the combining function its result. That gives it a type of
Parser[(A,A) => A].
Writing the combinator as a function of p1 and p2 is now straight-forward:
def chain[A](p1: => Parser[A], p2: => Parser[(A,A) => A]): Parser[A] =
p1 ~ ( (p2 ~ p1)* ) ^^ {
case e ~ l => l.foldLeft(e) { case (e1, f ~ e2) => f(e1,e2) }
}
This allows us to construct the sum parser as:
def sum = chain(product, "+" ^^ { _ => Add })
I mentioned above that chain is used frequently so it’s worth abstracting into a combinator. Actually, it is so useful that Scala’s parser combinator library provides it out of the box under the names chainl1 and * (as a binary operator), so we can just write:
def sum = product * ("+" ^^ { _ => Add })
The ^^^ Combinator
The parser ("+" ^^ { _ => Add }) which returns a fixed value Add still looks a bit awkward. Fortunately, there is the ^^^ combinator as a shortcut for such cases which allows us to write:
def sum = product * ("+" ^^^ Add)
A version for multiple operators is equally simple:
def sum = product * ("+" ^^^ Add | "-" ^^^ Subtract)
The Fun1 Language
We have dealt long enough with operators in Fun0 now and it is time to move on from this simple calculator to a real functional programming language with variables, first-class functions and conditional expressions. We’ll call this language Fun1. I will provide only an informal specification of it for now:
A Fun1 program consists of a sequence of expressions separated by semicolons. Expressions can be nested arbitrarily and grouped with parentheses, as in Fun0. The following kinds of expressions are allowed:
- String and Int literals
- Operators +, -, * and / for arithmetic. The + operator also does string concatenation.
- Conditional expressions of the form
if e1 then e2 else e3where e1 yields a boolean value - Variables
- Lambda expressions (anonymous functions) with 0 or more parameters of the form
\p1,p2,p3 -> expr - Variable definitions of the form
let v = expr. When such a definition occurs in a sequence, the variable is visible to all following expressions in that sequence. - Function applications of the form
e1(e2,e3,e4)with 0 or more arguments where e1 yields a function value - Equality checks of the form
e1 = e2which yield a boolean value
The syntactic conventions (e.g. case sensitivity, allowed characters in an identifier, syntax of literals, comments) are those of Scala’s StandardTokenParsers which we already used for Fun0. The supported types are integers, strings, functions and boolean values. A pre-defined println function can be used to print a single value to the console (and also return it to the caller).
Here is a simple program in Fun1:
let add1 = \a,b -> a+b; println(add1(3,4))
Exercise: Create a parser for Fun1 which uses the following AST nodes:
sealed abstract class Expression case class Sequence(l: List[Expression]) extends Expression case class Lambda(params: List[String], expr: Expression) extends Expression case class Let(ident: String, expr: Expression) extends Expression case class Var(ident: String) extends Expression case class Lit(v: Any) extends Expression case class App(expr: Expression, args: List[Expression]) extends Expression case class Add(e1: Expression, e2: Expression) extends Expression case class Subtract(e1: Expression, e2: Expression) extends Expression case class Mult(e1: Expression, e2: Expression) extends Expression case class Div(e1: Expression, e2: Expression) extends Expression case class Equal(e1: Expression, e2: Expression) extends Expression case class If(e1: Expression, e2: Expression, e3: Expression) extends ExpressionWhat information do you need for the parser that is missing or unclear in the informal description? Feel free to ask for it in a comment or come up with a sensible answer on your own. I have omitted some details deliberately (and probably forgotten some others).
Let’s start with an interpreter in the same way as for Fun0: A function which takes an Expression parameter and matches on the kind of Expression to compute and return its value. We don’t need a separate method for evaluating statements anymore because there are none in Fun1 — Everything is an expression. Evaluating literals and arithmetic operators is basically the same as before except we now have to take into account that an expression may return any type of value and not just integers.
def eval(e: Expression): Any = e match {
case Lit(v) => v
case Add(e1, e2) => (eval(e1), eval(e2)) match {
case (v1:String, v2) => v1 + v2.toString
case (v1, v2:String) => v1.toString + v2
case (i1:Int, i2:Int) => i1 + i2
case _ => error("'+' requires two Int values or at least one String")
}
case Subtract(e1, e2) => (eval(e1), eval(e2)) match {
case (i1:Int, i2:Int) => i1 - i2
case _ => error("'-' requires two Int values")
}
case Mult(e1, e2) => (eval(e1), eval(e2)) match {
case (i1:Int, i2:Int) => i1 * i2
case _ => error("'*' requires two Int values")
}
case Div(e1, e2) => (eval(e1), eval(e2)) match {
case (i1:Int, i2:Int) => i1 / i2
case _ => error("'/' requires two Int values")
}
// ...more cases...
}
For the new equality expressions we can reuse Scala’s equality checks and return a plain Bool value:
case Equal(e1, e2) => eval(e1) == eval(e2)
The conditional (if) expressions bring us back to the subject of evaluation strategies. In Fun0 we used call-by-value, i.e. in order to evaluate an operator expression, we first evaluated the operands and then applied the operator to the results. This is the default in most programming languages for most kinds of expressions and we will stick with it, but we have to make at least one exception: An if expression must evaluate either the then clause or the else clause but not both. Even if we did not have any side-effects (which we do, in form of the println function) we couldn’t just compute both clauses and drop the result of one of them because lambdas and function application allow us to create inifinite recursion, so evaluating the wrong side of an if expression might turn a terminating program into a non-terminating one. With that out of the way, the actual implementation is quite simple:
case If(e1, e2, e3) => eval(e1) match {
case b:Boolean => eval(if(b) e2 else e3)
case _ => error("Not a boolean value in IF expression")
}
For function calls, we first of all need a representation of function values. Since functions in Fun1 can have different numbers of parameters, we’ll make it a subtrait of the function type List[Any] => Any:
sealed trait Func extends (List[Any] => Any)
We can now implement the println function as:
case object Println extends Func {
def apply(args: List[Any]) = args match {
case v :: Nil => { println(v); v }
case _ => error("println needs 1 argument")
}
}
For interpreting function applications we just evaluate all sub-expressions and then invoke the function:
case App(expr, args) => eval(expr) match {
case f:Func => f(args.map(a => eval(a)))
case _ => error("Only functions can be applied")
}
Environments
In order to do variable assignments (with let expressions) and lookups, we need an environment which contains a mapping of variable names to their current values. We’ll integrate this into the interpreter in a purely functional style now and save the implementation with mutable state for later. Let’s start with a type alias for the environment:
type Environment = Map[String, Any]
Notice that the Map here is a scala.collection.immutable.Map which can not be modified but allows a new map to be created which is the same as the existing one plus some new (or modified) entries.
The environment can be changed by a program and different invocations of the eval function must be able to see those changes. The only way to accomplish this without mutable state is to pass the current environment into eval and to have it return a possibly new environment along with the computed value:
def eval(e: Expression, env: Environment): (Any, Environment) = e match {
// ...
}
In the initial call to eval with the parser’s result we provide an environment which contains only the println function we defined above:
eval(tree, Map("println" -> Println))
We have to modify the existing code to pass the environment down into recursive calls to eval, drop the possibly modified environments they return, and return the result together with the original one. For example, the case for Equal becomes:
case Equal(e1, e2) => (eval(e1, env)._1 == eval(e2, env)._1, env)
The other cases can be rewritten similarly.
We can now use the environment for variable lookups and definitions. The code for Var just performs the lookup while Let returns a new environment with all previous definitions plus the new binding:
case Var(id) =>
(env getOrElse(id, error("Undefined var "+id)), env)
case Let(id, ex) =>
{ val v = eval(ex, env)._1; (v, env + ((id, v))) }
For a sequence we go through all the sub-expressions, evaluating each in the environment returned by the previous one and returning the last value + environment. Notice the initial value () (”unit”) which gets returned for an empty sequence.
case Sequence(l) =>
l.foldLeft(():Any, env) { (z,n) => eval(n, z._2) }
Lambda expressions are supposed to close over the variables which are visible at their point of creation, so we cannot just return a Lambda expression object directly as its own value. Instead we wrap the parameter list and expression of the Lambda together with the current environment in a Closure object:
case Lambda(params, ex) => (Closure(ex, params, env), env)
Since the environment is immutable, there is no need to create a copy. We can just keep this value around and use it to apply the lambda function at a later time.
We have already defined the Func trait and implemented function application for it, so we make Closure implement it, too. In order to apply the lambda function, we take the saved environment, add the function’s parameters on top of it, and use this new environment for evaluating the saved expression:
case class Closure(ex: Expression, params: List[String],
env: Environment) extends Func {
def apply(args: List[Any]) = {
if(args.length != params.length)
error("Wrong number of arguments for function")
eval(ex, env ++ params.zip(args))._1
}
override def toString =
"<lambda \\" + params.mkString(",") + " -> " + ex + ">"
}
This concludes the interpreter for Fun1 and it also brings me to the end of the content I had originally planned for this series almost one year ago. There are still lots of interesting topics left and I intend to pursue some of them in further installments but I have not yet decided what will be next. For now, I’ll leave you with a short Fun1 program to test your implementation:
// Lambda application -- prints "81"
println((\x -> x*x)(9));
// Shadowing s does not change the closure -- prints "foo!"
let s = "foo";
let ls = \x -> s + x;
let s = "bar";
println(ls("!"));
// N-ary and curried functions -- both print "7"
let add1 = \a,b -> a+b;
println(add1(3,4));
let add2 = \a -> \b -> a+b;
println(add2(3)(4));
// The Y combinator allows us to define recursive functions
let Y = \f ->
(\x -> f(\y -> x(x)(y)))
(\x -> f(\y -> x(x)(y)));
// Define a faculty function using Y -- prints "6! = 720"
let fact = Y(\fact -> \n -> if n = 0 then 1 else n*fact(n-1));
println("6! = "+fact(6));
// Print a nullary function "<lambda...>" and its value "42"
let nullary = \-> 42;
println(nullary);
println(nullary())
I’ll make the source code for my reference implementation available in the solutions to this part, coming up in a few days.
July 6th, 2009 at 7:59 pm
Defined using my GLL combinators library (http://github.com/djspiewak/gll-combinators):
import edu.uwm.cs.gll.RegexParsers
object Fun1Parser extends RegexParsers {
// %%
lazy val program = seq
lazy val seq = (expr + “;”) ^^ Sequence
lazy val expr: Parser[Expression] = (
expr ~ “+” ~ term ^^ { (e1, _, e2) => Add(e1, e2) }
| expr ~ “-” ~ term ^^ { (e1, _, e2) => Subtract(e1, e2) }
)
lazy val term: Parser[Expression] = (
term ~ “*” ~ factor ^^ { (e1, _, e2) => Mult(e1, e2) }
| term ~ “/” ~ factor ^^ { (e1, _, e2) => Div(e1, e2) }
)
lazy val factor: Parser[Expression] =
factor ~ “=” ~ exprBase ^^ { (e1, _, e2) => Equal(e1, e2) }
lazy val exprBase = (
“\\” ~> (id + “,”) ~ “->” ~ expr ^^ { (p, _, e) => Lambda(p, e) }
| “let” ~> id ~ “=” ~ expr ^^ { (id, _, e) => Let(id, e) }
| id ~ (”(” ~> (expr + “,”) expr ~ “then” ~ expr ~ “else” ~ expr ^^ { (e1, _, e2, _, e3) => If(e1, e2, e3) }
| id ^^ Var
| strLit ^^ Lit
| intLit ^^ Lit
)
val id = “”"[a-zA-Z_][a-ZA-Z0-9_]*”"”r
val strLit = “”"”([^\\"]|\\.)*”"”"r
val intLit = “”"\d+”"”.r ^^ { _.toInt }
// %%
}
For the record, I had to make-up/assume a lot of things in order to create this. The definition of an identifier, operator precedence, string semantics, etc. A rough E/BNF grammar would help.
July 6th, 2009 at 8:01 pm
Markup fail (correct if you can please). In the meantime, here’s a paste which is properly formatted and colored: http://paste.pocoo.org/show/126941/
July 6th, 2009 at 8:36 pm
Not sure how I messed that one up, but my grammar was very wrong. Correction: http://paste.pocoo.org/show/126951/