So, how do you transform a plain string into DOM nodes? It sounds like a complicated task, but thanks to the strict and simple syntax of XML-based languages, it isn't. Just remember two of the basic rules:
The XML syntax differentiates between two types of elements, empty and non-empty. A non-empty element is an element that can have some content, i.e., it can contain a set of child nodes; thus a non-empty element always has a start and an end tag.
The second type is the empty element. It consists of a single tag and can't have child nodes other than attribute nodes. Empty elements are easily recognized by the slash that precedes the closing angle bracket, as in <br />.
The syntactical difference between the two types is very important for automated processing of XML-formatted data; more on that later.
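As a minimal sketch of that difference, assume the text of a single tag has already been extracted from the string. The function name classifyTag and the labels it returns are made up for this illustration and are not part of the article's implementation. The sketch also assumes that attribute values are quoted.

```javascript
// Hypothetical helper: classify the text of a single tag,
// e.g. "<p>", "</p>" or "<br />".
function classifyTag(tag) {
  if (tag.charAt(0) !== "<" || tag.charAt(tag.length - 1) !== ">") {
    throw new Error("not a tag: " + tag);
  }
  var inner = tag.slice(1, -1).trim();   // text between the angle brackets
  if (inner.charAt(0) === "/") {
    return "end";                        // slash right after "<", e.g. </p>
  }
  if (inner.charAt(inner.length - 1) === "/") {
    return "empty";                      // slash right before ">", e.g. <br />
  }
  return "start";                        // anything else, e.g. <p>
}
```

Calling classifyTag("<br />") yields "empty", while classifyTag("<p>") yields "start".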
As tags are always delimited by a leading and a trailing angle bracket, they are easy to find in a given string. Just scan for an opening angle bracket, then for the next closing one, and take everything in between.
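The scan just described might look roughly like the following. This is only a sketch: the name findTags is an assumption, and the code presumes that no ">" character appears inside an attribute value and that the input contains no comments or CDATA sections.

```javascript
// Hypothetical helper: return every tag found in `source`, in document order.
function findTags(source) {
  var tags = [];
  var open = source.indexOf("<");
  while (open !== -1) {
    var close = source.indexOf(">", open + 1);
    if (close === -1) {
      break;                                  // unterminated tag: stop scanning
    }
    tags.push(source.substring(open, close + 1));
    open = source.indexOf("<", close + 1);    // continue after this tag
  }
  return tags;
}

var found = findTags("<p>Hello <br /> world</p>");
// found is ["<p>", "<br />", "</p>"]
```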
This simple algorithm finds any tag regardless of its type. However, for building your own DOM tree, you only need to create nodes for the start tags. An end tag is nothing more than an indicator marking the end of an element. This means that the content following the end tag of a given element can only belong to this element's parent node or to other descendants of that parent.
Building the tree
Let's have a closer look at the process of building a tree. Every tree has a single root node, which forms the base of the tree.
As we want to create a method that writes into an element, we already have this root. What we actually need to do is build a smaller sub-tree inside the document's complete tree and treat the element we're going to write into as the root node of this sub-tree.
Here's an example of a visualized DOM tree:
The first step for an algorithm is to search for all tags (as described above) and textual data. In terms of computer science, this process is called scanning.
This scanning happens in a linear fashion, from left to right. If we consider the string as a one-dimensional object, our task is to add a second dimension, height, that expresses the different levels of the tree.
How would that look in a very simple example -- let's say adding some text under a paragraph element?
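A left-to-right scanner that collects both tags and the text between them could be sketched as follows. The function name scan and the token shape ({ type, value }) are assumptions chosen for this example. The result is still a flat list; the levels of the tree are added in a later step.

```javascript
// Hypothetical scanner: split `source` into a flat list of tag and text tokens.
function scan(source) {
  var tokens = [];
  var pos = 0;
  while (pos < source.length) {
    var open = source.indexOf("<", pos);
    if (open === -1) {
      open = source.length;                   // no further tags: the rest is text
    }
    if (open > pos) {
      tokens.push({ type: "text", value: source.substring(pos, open) });
    }
    if (open === source.length) {
      break;
    }
    var close = source.indexOf(">", open + 1);
    if (close === -1) {
      throw new Error("unterminated tag at position " + open);
    }
    tokens.push({ type: "tag", value: source.substring(open, close + 1) });
    pos = close + 1;                          // resume right after the tag
  }
  return tokens;
}

var tokens = scan("<p>Hello <b>world</b></p>");
// tokens holds, in order: tag <p>, text "Hello ", tag <b>,
// text "world", tag </b>, tag </p>
```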
An algorithm must take care to add characters to text nodes only. So, when we start with a
If we had some (non-empty) element following the text, we'd have to climb up the tree again and append the element to the parent node. Then we climb down the tree again into the appended node and expect further text or markup. Everything before the end tag of this element is a descendant node of the current element.
Empty elements, on the other hand, only need to be appended to the current pointer node (which holds our position in the tree), since they can't have child nodes.
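The climbing described above can be sketched with plain objects standing in for real DOM nodes. Everything here is an assumption made for illustration: the name buildTree, the token shapes, and the use of an array of open elements whose last entry plays the role of the pointer node. Unlike the description above, the sketch never moves the pointer into a text node; text is simply appended to the current element.

```javascript
// Hypothetical tree builder working on plain objects instead of DOM nodes.
// Each token is assumed to look like one of:
//   { type: "start", name: "p" }    { type: "end", name: "p" }
//   { type: "empty", name: "br" }   { type: "text", value: "Hello" }
function buildTree(rootName, tokens) {
  var root = { name: rootName, children: [] };
  var path = [root];                 // open elements; the last one is the pointer node
  tokens.forEach(function (token) {
    var pointer = path[path.length - 1];
    if (token.type === "text") {
      pointer.children.push({ text: token.value });
    } else if (token.type === "empty") {
      // Empty elements are appended, but the pointer stays where it is.
      pointer.children.push({ name: token.name, children: [] });
    } else if (token.type === "start") {
      var element = { name: token.name, children: [] };
      pointer.children.push(element);
      path.push(element);            // climb down into the new element
    } else if (token.type === "end") {
      if (path.length === 1 || pointer.name !== token.name) {
        throw new Error("unexpected end tag: " + token.name);
      }
      path.pop();                    // climb back up to the parent
    }
  });
  if (path.length !== 1) {
    throw new Error("unclosed element: " + path[path.length - 1].name);
  }
  return root;
}

var tree = buildTree("div", [
  { type: "start", name: "p" },
  { type: "text", value: "Hello " },
  { type: "start", name: "b" },
  { type: "text", value: "world" },
  { type: "end", name: "b" },
  { type: "empty", name: "br" },
  { type: "end", name: "p" }
]);
// tree.children[0] is the p element with three children:
// the text "Hello ", the b element, and the empty br element.
```

Keeping the open elements in an array makes "climbing up" a single pop and avoids parent references, which keeps the resulting tree easy to print or serialize.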
As you can see, building a tree is rather simple after becoming familiar with the structure of the DOM. Now we must be able to create elements from an extracted substring. An algorithm with this behavior is implemented in