

Processing XML with Xerces and the DOM

Xerces uses its own UTF-16 character type, XMLCh, instead of plain char or std::string. The function transcode() converts between char* and XMLCh* strings. It relies on the caller to free the memory it allocates for strings, hence the call to XMLString::release().

main() itself is very small. The XMLConfigData class encapsulates most of the Xerces calls. The accessors and mutators roughly match the contents of the XML file:

class XMLConfigData {
public:

  void load() throw( std::runtime_error ) ;
  void commit() throw( std::runtime_error ) ;

  std::ostream& print( std::ostream& s ) const ;

  const std::string& getLastUpdate() const throw() ;

  void setLastUpdate( const std::string& in ) ;
  const std::string& getLoginUser() const throw() ;

  void setLoginUser( const std::string& in ) ;
  void setLoginPassword( const std::string& in ) ;

  int getReportCount() const throw() ;
  void addReport( const std::string& report ) ;

} ;

In addition to the configuration properties, the XMLConfigData constructor initializes two Xerces-related objects. The first is a XercesDOMParser named parser_. As its name implies, XercesDOMParser parses an XML document into a tree of DOMNode structures.

The second such object, tags_, is of the custom TagNames class. This convenience class holds XMLCh* versions of the element and attribute names, so that code can refer to them uniformly without repeated calls to XMLString's transcode() and release(). It may be tempting to make these values static constants, but C++ does not guarantee the order in which static objects in different translation units are initialized, so there is no way to make sure nothing uses the Xerces classes before the call to XMLPlatformUtils::Initialize().

All the magic happens in XMLConfigData::load(). It first configures the parser object to disable validation:

parser_.setValidationScheme(
  xercesc::XercesDOMParser::Val_Never ) ;

parser_.setDoSchema( false ) ;
parser_.setLoadExternalDTD( false ) ;

(I'll revisit validation in my next article.)

The call to parser_.parse() parses the XML document into a DOMDocument object called xmlDoc. step1 passes parse() a filename, so behind the scenes Xerces creates a LocalFileInputSource to read data from a local file. parse() is overloaded to accept other input, such as a buffer of memory (MemBufInputSource), standard input (StdInInputSource), and data loaded via URL (URLInputSource).

step1 calls DOMDocument::getDocumentElement to fetch the top-level element (here, <config>) as a DOMElement object:

DOMElement* elementConfig = xmlDoc->getDocumentElement() ;

It then calls DOMElement::getAttribute() to fetch some attribute values:

// "tags_.ATTR_LASTUPDATE" is the
// XMLCh* version of "lastupdate"

const XMLCh* lastUpdateXMLCh =
  elementConfig->getAttribute( tags_.ATTR_LASTUPDATE ) ;

Given the document's tree structure, many nodes have child nodes. DOMNode::getChildNodes() returns a DOMNodeList, which is useful to iterate through those immediate children:

xercesc::DOMNodeList* children =
  elementConfig->getChildNodes() ;

const XMLSize_t nodeCount = children->getLength() ;
for( XMLSize_t ix = 0 ; ix < nodeCount ; ++ix ){
  xercesc::DOMNode* currentNode = children->item( ix ) ;
  // ... do something with currentNode ...
}

Use getChildNodes() to walk the document, one level at a time. Call DOMElement's getTagName() function to see the name of the current element:

if( XMLString::equals(
  tags_.TAG_LOGIN ,
  currentElement->getTagName()
) ){
  // ... it's a <login> tag ...
}

(Of course, getTagName() is available only on element-type nodes. For an arbitrary DOMNode, call getNodeName() instead.)

Remember, nodes can be more than elements. Be careful to avoid blindly downcasting a DOMNode to a subclass thereof. Compare a DOMNode constant with the result of a node's getNodeType() member to determine its type:

if( DOMNode::ELEMENT_NODE == currentNode->getNodeType() ){
  DOMElement* currentElement =
        dynamic_cast< DOMElement* >( currentNode ) ;
  // ... use currentElement ...
}

In theory, you could just dynamic_cast<> the node to an element and check the result for NULL. Explicitly checking the node type is more in tune with DOM style. (Remember, DOM is a standard meant to work with several different languages.) It's also a little more verbose, which serves as a maintenance hint.

Whereas getChildNodes() returns every child node, the parent DOMDocument's createTreeWalker() and createNodeIterator() functions accept a filter that limits which nodes you see. A DOMTreeWalker lets you navigate the document hierarchy by sibling or parent/child relationship. A DOMNodeIterator, by comparison, is more like a database result set or cursor: you can scroll forward or backward over the list of returned nodes.

Walking node children is a brute-force means to find elements of interest. You can also call DOMElement::getElementsByTagName() to fetch a list of descendant elements of a certain name. step1 uses this to find the <report> child elements of the <reports> element:

// tags_.TAG_REPORT is the XMLCh* version of "report"

xercesc::DOMNodeList* reportNodes =
  element->getElementsByTagName( tags_.TAG_REPORT ) ;

  // ... iterate through the node list ...

Though it returns a DOMNodeList, Xerces guarantees that getElementsByTagName() will return only element nodes. As such, it's safe to dynamic_cast<> the list's items to DOMElement* without first checking their node type.
