A Zipper for scala.xml
Sunday, December 27th, 2009In recent discussions on the scala-internals and scala-xml mailing lists, there were calls for a mutable, DOM-like XML model, where nodes hold references to their parent element, so that all standard XPath axes can be supported. In this post I’d like to present a different representation which is based on the immutable and persistent scala.xml model, is completely immutable itself, yet allows navigation along all axes and “mutation”. It is based on the Zipper technique which was first described by Gérard Huet in this paper [PDF].
The Problem
Take this piece of XML (an Eclipse .project file):
<projectDescription>
<name>scala-query</name>
<comment></comment>
<projects></projects>
<buildSpec>
<buildCommand>
<name>ch.epfl.lamp.sdt.core.scalabuilder</name>
<arguments></arguments>
</buildCommand>
</buildSpec>
<natures>
<nature>ch.epfl.lamp.sdt.core.scalanature</nature>
<nature>org.eclipse.jdt.core.javanature</nature>
</natures>
</projectDescription>
Its scala.xml model looks like this:
(In reality, a pretty-printed document contains a bunch of additional text nodes with all the whitespace between the elements, but we can safely ignore them and work with a simplified model.)
Imagine you have a reference to the first “nature” element which we will call the context. The only other node reachable from this context is the text node below it. You cannot navigate to its parent or sibling without starting at the root element and searching for the right spot.
An obvious solution to this problem is to add a reference from each node to its parent but that means giving up persistence (the sharing of data between different versions of the tree).
Let’s look at the situation without the parent reference first and assume you wanted to add a third “nature” element before the two existing ones. Since you cannot modify the “natures” element in this immutable data structure, you have to make a copy of it which contains the two original children plus the new one. Now that you have a new “natures” element, you make a copy of the “projectDescription” root element above it which contains all the original children except for the “natures” element which is replaced with the new one. This gives you two separate trees (the original and the modified one) which share most of their nodes:
(For the “natures” element we can even reuse the list of its children because we only prepend an element at the beginning. We need to copy the list of the root element’s children though because we replace its last element.)
Now try the same on a model with parent references. You can create the new “nature” element and a copy of the “natures” element above it. But what about the other two “nature” elements? You cannot reuse them because they contain a reference to the original “natures” element, so you have to create copies of them which point up to their new parent. You cannot even reuse the text nodes below them because they also reference their original parents. And now that you have copied the entire “natures” sub-tree, you face the same problem with the root element. All its children point up to the old root element, so you have to copy them recursively, too. In the end you have copied the tree entirely!
The Zipper
We can do better with the Zipper. The basic idea is to take an immutable data structure for the actual data (in this case XML documents in scala.xml models) and add a representation for a context that allows you to navigate through the data in all directions and modify parts of it (in the same way that you modified a document in the previous section — by creating a new document that shares most of the data).
We start with a type for the context:
sealed case class NodeLoc(node: Node, path: NodePath)
It consists of the node (or sub-tree) which it wraps plus a path to the root element. A path can either be the top (root) of the tree or a hole, i.e. the location of a node without the node itself:
sealed trait NodePath final case object Top extends NodePath final case class Hole(left: List[Node], parent: NodeLoc, right: List[Node]) extends NodePath
For example, here is a NodeLoc for the first “nature” element:
You can see that the original tree (in grey and yellow) is still intact. There are two NodeLocs with holes which point up to a parent NodeLoc and contain lists of their preceding and following siblings. Note that the list of preceding siblings is in reverse order (compared to the corresponding list of children on the parent node) so that the first node in the list is the next element to the left of the hole.
We can now define the basic navigation methods on NodeLoc. In order to move to the left sibling, we create a new NodeLoc which contains the head of the “left” list as its node and a copy of the current hole with the remaining nodes on the “left” list and the current node appended to the “right” list:
protected def create(node: Node, path: Hole) = NodeLoc(node, path)
final def leftOpt = path match {
case Hole(tl :: l, p, r) =>
Some(create(tl, Hole(l, p, node :: r)))
case _ => None
}
Moving to the right is symmetrical to that:
final def rightOpt = path match {
case Hole(l, p, tr :: r) =>
Some(create(tr, Hole(node :: l, p, r)))
case _ => None
}
Methods left and right which return a bare NodeLoc or throw an exception, and isFirst and isLast to ensure that you don’t run into such an exception, can be defined in the same way. I’m using Option here because it will be useful later on for creating the XPath axes.
Moving down to a child with the specified index position is accomplished by splitting the list of the current node’s children at the correct position, removing the child node and creating a Hole for the other children which points up to this NodeLoc:
final def downOpt(idx: Int) = {
val ch = node.child
if(ch.isEmpty) None
else Some(create(ch.head,
Hole(ch.tail.take(idx).reverse.toList,
this, ch.drop(idx+1).toList)))
}
Mutable Yet Immutable
Replacing the node at the current location is extremely simple. We just create a NodeLoc with the new Node and the current path:
final def set(n: Node) = NodeLoc(n, path)
There is no need to propagate the change up to the root element at this point. We could, for example, move through all siblings, replacing each one with a new node, all in constant time (per sibling). Parent elements are not copied until we move up and “unzip” the path into a new tree. This is done by constructing a new list of children from the current node and its left and right siblings, and creating a copy of the old parent with the new children:
def upOpt = path match {
case h : Hole =>
val l = h.left.reverse ++ (node :: h.right)
Some(NodeLoc(h.parent.node match {
case e: Elem => e.copy(child = l)
case _: Group => new Group(l)
}, h.parent.path))
case _ => None
}
This is slightly messy with scala.xml because we have to match on the type of node and copy Elem and Group nodes (the only ones which can contain children) in different ways.
Always recreating the parent when moving up is a waste of time when no changes have been made to the current sub-tree. We can optimize for this case with a special CachedParentNodeLoc which returns the original parent when moving up:
final class CachedParentNodeLoc(node: Node, path: Hole)
extends NodeLoc(node, path) {
override def upOpt = Some(path.parent)
override protected def create(node: Node, path: Hole) =
new CachedParentNodeLoc(node, path)
}
final class CachedTopNodeLoc(node: Node) extends NodeLoc(node, Top) {
override def upOpt = None
override protected def create(node: Node, path: Hole) =
new CachedParentNodeLoc(node, path)
}
object NodeLoc {
def apply(node: Node): NodeLoc = new CachedTopNodeLoc(node)
}
Note that the leftOpt, rightOpt and downOpt methods use create(...) instead of NodeLoc(...), so that navigating left, right or down from a cached location returns another cached location.
The XPath Axes
The XPath specification defines the following axes:
- the
childaxis contains the children of the context node- the
descendantaxis contains the descendants of the context node; a descendant is a child or a child of a child and so on; thus the descendant axis never contains attribute or namespace
nodes- the
parentaxis contains the parent of the context node, if there is one- the
ancestoraxis contains the ancestors of the context node; the ancestors of the context node consist of the parent of context node and the parent’s parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node- the
following-siblingaxis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, thefollowing-siblingaxis is empty- the
preceding-siblingaxis contains all the preceding siblings of the context node; if the context node is an attribute node or namespace node, thepreceding-sibling
axis is empty- the
followingaxis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes- the
precedingaxis contains all nodes in the same document as the context node that are before the context node in document order, excluding any ancestors and excluding attribute nodes and namespace nodes- the
attributeaxis contains the attributes of the context node; the axis will be empty unless the context node is an element- the
namespaceaxis contains the namespace nodes of the context node; the axis will be empty unless the context node is an element- the
selfaxis contains just the context node itself- the
descendant-or-selfaxis contains the context node and the descendants of the context node- the
ancestor-or-selfaxis contains the context node and the ancestors of the context node; thus, the ancestor axis will always include the root node
The attribute and namespace axes do not fit well into our model because scala.xml does not represent attributes and namespace declarations as Node values. All other axes can be defined as methods on NodeLoc.
We will represent an axis as an Iterator[NodeLoc]. The following helper method creates an iterator by starting with a given value and continually applying a function to it until it returns None. It is similar to Iterator.iterate in Scala 2.8 except the latter can only create infinite iterators.
private final def iterate[T](start: Option[T])(f: T => Option[T]): Iterator[T] = new Iterator[T] {
private[this] var acc = start
def hasNext = acc.isDefined
def next() = acc match {
case None => throw new NoSuchElementException("next on empty iterator");
case Some(x) => val res = x ; acc = f(x) ; res
}
}
We can use it like this to create the child axis: Start with the first child, then go right node by node until you reach the last child:
final def childAxis = iterate(downOpt(0))(_.rightOpt)
The descendant-or-self axis consists of the current node plus the descendant-or-self axes of all children:
final def descendantOrSelfAxis: Iterator[NodeLoc] =
Iterator.single(this) ++ childAxis.flatMap(_.descendantOrSelfAxis)
And the descendant axis consists only of the descendant-or-self axes of all children, without the current node:
final def descendantAxis = childAxis.flatMap(_.descendantOrSelfAxis)
The parent, ancestor-or-self and ancestor axes work similarly:
final def parentAxis = upOpt.iterator
final def ancestorOrSelfAxis: Iterator[NodeLoc] =
Iterator.single(this) ++ upOpt.iterator.flatMap(_.ancestorOrSelfAxis)
final def ancestorAxis = upOpt.iterator.flatMap(_.ancestorOrSelfAxis)
The following-sibling and preceding-sibling axes are straight-forward:
final def followingSiblingAxis = iterate(rightOpt)(_.rightOpt) final def precedingSiblingAxis = iterate(leftOpt)(_.leftOpt)
For the following and preceding axes we need to define some helper methods for moving to the following or preceding node in document order:
final def followingAxis = iterate(followingOpt)(_.followingOpt)
final def followingOpt: Option[NodeLoc] =
downOpt(0) orElse rightOutOpt
private final def rightOutOpt: Option[NodeLoc] =
rightOpt orElse upOpt.flatMap(_.rightOutOpt)
final def precedingAxis = iterate(precedingOpt)(_.precedingOpt)
final def precedingOpt: Option[NodeLoc] =
leftOpt.map(n => n.downLastTransitiveOpt.getOrElse(n)) orElse upOpt
private final def downLastTransitiveOpt: Option[NodeLoc] =
downLastOpt.flatMap(_.downLastTransitiveOpt)
final def downLastOpt = {
val ch = node.child
if(ch.isEmpty) None
else Some(NodeLoc(ch.head, Hole(ch.reverse.toList.tail, this, Nil)))
}
And finally the trivial self axis:
final def selfAxis = Iterator.single(this)
Methods \ and \\ for finding all child or descendant elements with a given name can be defined by filtering the child and descendant axes:
final def \ (Name: String) =
for(n @ NodeLoc(Elem(_, Name, _, _, _*), _) <- childAxis) yield n
final def \\ (Name: String) =
for(n @ NodeLoc(Elem(_, Name, _, _, _*), _) <- descendantAxis) yield n
You can find the complete source code, including more convenience methods and a short demo here.


