Archive for December, 2009

A Zipper for scala.xml

Sunday, December 27th, 2009

In recent discussions on the scala-internals and scala-xml mailing lists, there were calls for a mutable, DOM-like XML model, where nodes hold references to their parent element, so that all standard XPath axes can be supported. In this post I’d like to present a different representation which is based on the immutable and persistent scala.xml model, is completely immutable itself, yet allows navigation along all axes and “mutation”. It is based on the Zipper technique which was first described by Gérard Huet in this paper [PDF].

The Problem

Take this piece of XML (an Eclipse .project file):

<projectDescription>
  <name>scala-query</name>
  <comment></comment>
  <projects></projects>
  <buildSpec>
    <buildCommand>
      <name>ch.epfl.lamp.sdt.core.scalabuilder</name>
      <arguments></arguments>
    </buildCommand>
  </buildSpec>
  <natures>
    <nature>ch.epfl.lamp.sdt.core.scalanature</nature>
    <nature>org.eclipse.jdt.core.javanature</nature>
  </natures>
</projectDescription>

Its scala.xml model looks like this:

XML model

(In reality, a pretty-printed document contains a bunch of additional text nodes with all the whitespace between the elements, but we can safely ignore them and work with a simplified model.)

Imagine you have a reference to the first “nature” element which we will call the context. The only other node reachable from this context is the text node below it. You cannot navigate to its parent or sibling without starting at the root element and searching for the right spot.

An obvious solution to this problem is to add a reference from each node to its parent but that means giving up persistence (the sharing of data between different versions of the tree).

Let’s look at the situation without the parent reference first and assume you wanted to add a third “nature” element before the two existing ones. Since you cannot modify the “natures” element in this immutable data structure, you have to make a copy of it which contains the two original children plus the new one. Now that you have a new “natures” element, you make a copy of the “projectDescription” root element above it which contains all the original children except for the “natures” element which is replaced with the new one. This gives you two separate trees (the original and the modified one) which share most of their nodes:

XML models with shared sub-trees

(For the “natures” element we can even reuse the list of its children because we only prepend an element at the beginning. We need to copy the list of the root element’s children though because we replace its last element.)

Now try the same on a model with parent references. You can create the new “nature” element and a copy of the “natures” element above it. But what about the other two “nature” elements? You cannot reuse them because they contain a reference to the original “natures” element, so you have to create copies of them which point up to their new parent. You cannot even reuse the text nodes below them because they also reference their original parents. And now that you have copied the entire “natures” sub-tree, you face the same problem with the root element. All its children point up to the old root element, so you have to copy them recursively, too. In the end you have copied the tree entirely!

The Zipper

We can do better with the Zipper. The basic idea is to take an immutable data structure for the actual data (in this case XML documents in scala.xml models) and add a representation for a context that allows you to navigate through the data in all directions and modify parts of it (in the same way that you modified a document in the previous section — by creating a new document that shares most of the data).

We start with a type for the context:

sealed case class NodeLoc(node: Node, path: NodePath)

It consists of the node (or sub-tree) which it wraps plus a path to the root element. A path can either be the top (root) of the tree or a hole, i.e. the location of a node without the node itself:

sealed trait NodePath
final case object Top extends NodePath
final case class Hole(left: List[Node],
  parent: NodeLoc, right: List[Node]) extends NodePath

For example, here is a NodeLoc for the first “nature” element:

XML model with Zipper

You can see that the original tree (in grey and yellow) is still intact. There are two NodeLocs with holes which point up to a parent NodeLoc and contain lists of their preceding and following siblings. Note that the list of preceding siblings is in reverse order (compared to the corresponding list of children on the parent node) so that the first node in the list is the next element to the left of the hole.

We can now define the basic navigation methods on NodeLoc. In order to move to the left sibling, we create a new NodeLoc which contains the head of the “left” list as its node and a copy of the current hole with the remaining nodes on the “left” list and the current node appended to the “right” list:

  protected def create(node: Node, path: Hole) = NodeLoc(node, path)

  final def leftOpt = path match {
    case Hole(tl :: l, p, r) =>
      Some(create(tl, Hole(l, p, node :: r)))
    case _ => None
  }

Moving to the right is symmetrical to that:

  final def rightOpt = path match {
    case Hole(l, p, tr :: r) =>
      Some(create(tr, Hole(node :: l, p, r)))
    case _ => None
  }

Methods left and right which return a bare NodeLoc or throw an exception, and isFirst and isLast to ensure that you don’t run into such an exception, can be defined in the same way. I’m using Option here because it will be useful later on for creating the XPath axes.

Moving down to a child with the specified index position is accomplished by splitting the list of the current node’s children at the correct position, removing the child node and creating a Hole for the other children which points up to this NodeLoc:

  final def downOpt(idx: Int) = {
    val ch = node.child
    if(ch.isEmpty) None
    else Some(create(ch.head,
      Hole(ch.tail.take(idx).reverse.toList,
           this, ch.drop(idx+1).toList)))
  }

Mutable Yet Immutable

Replacing the node at the current location is extremely simple. We just create a NodeLoc with the new Node and the current path:

  final def set(n: Node) = NodeLoc(n, path)

There is no need to propagate the change up to the root element at this point. We could, for example, move through all siblings, replacing each one with a new node, all in constant time (per sibling). Parent elements are not copied until we move up and “unzip” the path into a new tree. This is done by constructing a new list of children from the current node and its left and right siblings, and creating a copy of the old parent with the new children:

  def upOpt = path match {
    case h : Hole =>
      val l = h.left.reverse ++ (node :: h.right)
      Some(NodeLoc(h.parent.node match {
        case e: Elem => e.copy(child = l)
        case _: Group => new Group(l)
      }, h.parent.path))
    case _ => None
  }

This is slightly messy with scala.xml because we have to match on the type of node and copy Elem and Group nodes (the only ones which can contain children) in different ways.

Always recreating the parent when moving up is a waste of time when no changes have been made to the current sub-tree. We can optimize for this case with a special CachedParentNodeLoc which returns the original parent when moving up:

final class CachedParentNodeLoc(node: Node, path: Hole)
extends NodeLoc(node, path) {
  override def upOpt = Some(path.parent)
  override protected def create(node: Node, path: Hole) =
    new CachedParentNodeLoc(node, path)
}

final class CachedTopNodeLoc(node: Node) extends NodeLoc(node, Top) {
  override def upOpt = None
  override protected def create(node: Node, path: Hole) =
    new CachedParentNodeLoc(node, path)
}

object NodeLoc {
  def apply(node: Node): NodeLoc = new CachedTopNodeLoc(node)
}

Note that the leftOpt, rightOpt and downOpt methods use create(...) instead of NodeLoc(...), so that navigating left, right or down from a cached location returns another cached location.

The XPath Axes

The XPath specification defines the following axes:

  • the child axis contains the children of the context node
  • the descendant axis contains the descendants of the context node; a descendant is a child or a child of a child and so on; thus the descendant axis never contains attribute or namespace
    nodes
  • the parent axis contains the parent of the context node, if there is one
  • the ancestor axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of context node and the parent’s parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node
  • the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, the following-sibling axis is empty
  • the preceding-sibling axis contains all the preceding siblings of the context node; if the context node is an attribute node or namespace node, the preceding-sibling
    axis is empty
  • the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes
  • the preceding axis contains all nodes in the same document as the context node that are before the context node in document order, excluding any ancestors and excluding attribute nodes and namespace nodes
  • the attribute axis contains the attributes of the context node; the axis will be empty unless the context node is an element
  • the namespace axis contains the namespace nodes of the context node; the axis will be empty unless the context node is an element
  • the self axis contains just the context node itself
  • the descendant-or-self axis contains the context node and the descendants of the context node
  • the ancestor-or-self axis contains the context node and the ancestors of the context node; thus, the ancestor axis will always include the root node

The attribute and namespace axes do not fit well into our model because scala.xml does not represent attributes and namespace declarations as Node values. All other axes can be defined as methods on NodeLoc.

We will represent an axis as an Iterator[NodeLoc]. The following helper method creates an iterator by starting with a given value and continually applying a function to it until it returns None. It is similar to Iterator.iterate in Scala 2.8 except the latter can only create infinite iterators.

  private final def iterate[T](start: Option[T])(f: T => Option[T]): Iterator[T] = new Iterator[T] {
    private[this] var acc = start
    def hasNext = acc.isDefined
    def next() = acc match {
      case None => throw new NoSuchElementException("next on empty iterator");
      case Some(x) => val res = x ; acc = f(x) ; res
    }
  }

We can use it like this to create the child axis: Start with the first child, then go right node by node until you reach the last child:

  final def childAxis = iterate(downOpt(0))(_.rightOpt)

The descendant-or-self axis consists of the current node plus the descendant-or-self axes of all children:

  final def descendantOrSelfAxis: Iterator[NodeLoc] =
    Iterator.single(this) ++ childAxis.flatMap(_.descendantOrSelfAxis)

And the descendant axis consists only of the descendant-or-self axes of all children, without the current node:

  final def descendantAxis = childAxis.flatMap(_.descendantOrSelfAxis)

The parent, ancestor-or-self and ancestor axes work similarly:

  final def parentAxis = upOpt.iterator

  final def ancestorOrSelfAxis: Iterator[NodeLoc] =
    Iterator.single(this) ++ upOpt.iterator.flatMap(_.ancestorOrSelfAxis)

  final def ancestorAxis = upOpt.iterator.flatMap(_.ancestorOrSelfAxis)

The following-sibling and preceding-sibling axes are straight-forward:

  final def followingSiblingAxis = iterate(rightOpt)(_.rightOpt)

  final def precedingSiblingAxis = iterate(leftOpt)(_.leftOpt)

For the following and preceding axes we need to define some helper methods for moving to the following or preceding node in document order:

  final def followingAxis = iterate(followingOpt)(_.followingOpt)

  final def followingOpt: Option[NodeLoc] =
    downOpt(0) orElse rightOutOpt

  private final def rightOutOpt: Option[NodeLoc] =
    rightOpt orElse upOpt.flatMap(_.rightOutOpt)

  final def precedingAxis = iterate(precedingOpt)(_.precedingOpt)

  final def precedingOpt: Option[NodeLoc] =
    leftOpt.map(n => n.downLastTransitiveOpt.getOrElse(n)) orElse upOpt

  private final def downLastTransitiveOpt: Option[NodeLoc] =
    downLastOpt.flatMap(_.downLastTransitiveOpt)

  final def downLastOpt = {
    val ch = node.child
    if(ch.isEmpty) None
    else Some(NodeLoc(ch.head, Hole(ch.reverse.toList.tail, this, Nil)))
  }

And finally the trivial self axis:

  final def selfAxis = Iterator.single(this)

Methods \ and \\ for finding all child or descendant elements with a given name can be defined by filtering the child and descendant axes:

  final def \ (Name: String) =
    for(n @ NodeLoc(Elem(_, Name, _, _, _*), _) <- childAxis) yield n

  final def \\ (Name: String) =
    for(n @ NodeLoc(Elem(_, Name, _, _, _*), _) <- descendantAxis) yield n

You can find the complete source code, including more convenience methods and a short demo here.

New ScalaQuery Features

Monday, December 21st, 2009

It’s been almost 3 months since my last summary of new features in ScalaQuery. I implemented some important changes in the following weeks but I only had little time to work on ScalaQuery after my extended one-month vacation ended in October.

Build system

Builds against newer versions of Scala 2.8 have been going well thanks to sbt 0.6 which separates the Scala version used by the build system from the version used by the project being built. You need to use at least version 0.6.7 of the new xsbt launcher to build ScalaQuery. Other dependencies are downloaded automatically by sbt. I wrote an implementation of sbt’s test interface for JUnit which you can find here (binaries are here on scala-tools.org). It allows you to run ScalaQuery’s JUnit test cases from sbt with the test command.

All published artifacts now contain the Scala version in the artifact ID. You can find the latest ScalaQuery snapshot in scala-tools.org’s snapshots repository as:

groupId:
com.novocode
artifactId:
scala-query_2.8.0.Beta1-RC4
version:
1.0.0-SNAPSHOT

JDBC escape syntax

ScalaQuery now uses JDBC escape syntax and scalar functions where possible for better portability between DB engines:

  • String column concatenation (with the ++ operator) and the startsWith and endsWith methods on String columns have been moved up to BasicProfile.
  • Escape characters for LIKE expressions are added to the queries.
  • You can add other functions to be called with the {fn ...} syntax through SimpleScalarFunction() and SimpleScalarFunction.nullary().
  • Various scalar functions are supported: CONVERT, USER, DATABASE, CURDATE, CURTIME, PI, MOD, ABS, CEILING, FLOOR, SIGN, DEGREES, RADIANS, IFNULL, UCASE, LCASE, LTRIM, RTRIM.
  • Literals for Date, Time and Timestamp values use the JDBC escape syntax.
  • Explicit inner, left outer, right outer and full outer joins are supported. For example:
  val q = for {
    Join(c,p) <- Categories innerJoin Posts on (_.id is _.category)
    _ <- Query orderBy p.id
  } yield p.id ~ c.id ~ c.name ~ p.title

Complex inserts

Queries and columns can now be inserted into tables (using SQL’s INSERT...SELECT syntax). This allows you to use scalar functions when inserting individual records and to copy data produced by a query. For example:

  val q2 = for(s <- Src1 if s.id <= 2) yield s
  Dst2.insert(q2)

Various changes

  • String columns are created as VARCHAR(254) by default (except in H2 where no size is needed).
  • The operators === and != can be used instead of is and isNot. They have the proper precedence when used with other operators like && and ||.

Close
E-mail It