

Processing XML with Xerces and SAX
Pages: 1, 2, 3

From XML to Objects

Programs don't usually work with raw XML. Instead, they use XML as a storage and transport medium and parse it into objects. That is to say, an XML document typically reflects some data structure(s).

The sample program step2 is a more practical example, and it shows how to use SAX to turn XML into objects. As a bonus, it demonstrates SAX's power to extract small pieces of information from a large document.

The XML Package Metadata project defines an XML format that describes all the RPMs in a yum repository. step2's job is to extract the name, license, and vendor of each package defined in the file.

Instead of unmarshaling the entire document--roughly 5MB--you can use SAX to extract only the pieces of interest. Think of the package metadata file as a database and the handler as the query.

To quote a colleague, a SAX handler is just an event-driven state machine. Parsing a document into usable objects thus requires objects that you can assemble in stages, and a plan to match states--SAX events--to calls on those objects.

It's easier to start with the objects. Ideally, they can be simple data-holder classes with accessors (getXyz) and mutators (setXyz) to get and set properties, respectively.

step2 uses the RPMInfo object:

class RPMInfo {

  public:

  // accessors return a const reference; mutators copy the argument
  const std::string& getName() const throw() ;
  void setName( const std::string& in ) ;

  const std::string& getVersion() const throw() ;
  void setVersion( const std::string& in ) ;

  const std::string& getLicense() const throw() ;
  void setLicense( const std::string& in ) ;

  const std::string& getVendor() const throw() ;
  void setVendor( const std::string& in ) ;

} ;
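The listing above shows only the interface. As a minimal sketch of how such a data holder might be filled in, the class below adds private std::string members and inline method bodies. The member names (name_, version_, and so on) and the use of noexcept in place of throw() are assumptions made for illustration; they are not taken from step2's source.

```cpp
#include <string>

// Hypothetical completion of the RPMInfo data holder.
// Member names and inline bodies are assumptions, not step2's code.
class RPMInfo {

  public:

  const std::string& getName() const noexcept { return name_ ; }
  void setName( const std::string& in ) { name_ = in ; }

  const std::string& getVersion() const noexcept { return version_ ; }
  void setVersion( const std::string& in ) { version_ = in ; }

  const std::string& getLicense() const noexcept { return license_ ; }
  void setLicense( const std::string& in ) { license_ = in ; }

  const std::string& getVendor() const noexcept { return vendor_ ; }
  void setVendor( const std::string& in ) { vendor_ = in ; }

  private:

  // each property starts out as an empty string
  std::string name_ ;
  std::string version_ ;
  std::string license_ ;
  std::string vendor_ ;

} ;
```

Because every property can be set independently, a handler can build the object in stages, one mutator call per SAX event, which is what the rest of this section relies on.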

Next, the plan of attack. At a high level, the plan to parse the package metadata file is as follows:

namespace   element    action

{default}   <package>  rpm = new RPMInfo()
{default}   <name>     rpm->setName()
rpm         <license>  rpm->setLicense()
rpm         <vendor>   rpm->setVendor()

This isn't a completely accurate depiction of what happens, though. step2 looks at the body content of the <name>, <license>, and <vendor> elements. Unlike DOM parsing, where you get the entire element at once--body content and all--SAX requires a little more creativity. Going after body content requires that you use start and end tags to mark state, then collect what you need in the characters() handler callback.

In other words, a handler must keep track of the current tag so that it can decide what to do. There are many ways to do this. I use a function to match element names to symbolic constants, and push those constants onto a std::stack of int. In turn, each handler callback is a switch() block:

switch( ... top element on stack ... ){

  ... handle case for
    <license> element ...

  ... handle case for
    <vendor> element ...

  ... and so on ...

}
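To make the tag-tracking idea concrete, here is a small sketch of a name-to-constant function and a switch() on the top of a std::stack of int. The constant names, elementToId(), and describeTop() are all hypothetical, invented for this illustration. The sketch also matches on the local element name only and ignores the namespace, which is a simplification.

```cpp
#include <stack>
#include <string>

// Symbolic constants for the elements of interest (assumed names).
enum ElementId {
  ELEMENT_OTHER = 0 ,
  ELEMENT_PACKAGE ,
  ELEMENT_NAME ,
  ELEMENT_LICENSE ,
  ELEMENT_VENDOR
} ;

// Map a local element name to its constant.
// Any element the handler does not care about maps to ELEMENT_OTHER.
int elementToId( const std::string& localName ) {
  if( localName == "package" ) return ELEMENT_PACKAGE ;
  if( localName == "name" )    return ELEMENT_NAME ;
  if( localName == "license" ) return ELEMENT_LICENSE ;
  if( localName == "vendor" )  return ELEMENT_VENDOR ;
  return ELEMENT_OTHER ;
}

// Stand-in for a handler callback: decide what to do based on
// the element currently on top of the tag stack.
std::string describeTop( const std::stack<int>& tags ) {
  if( tags.empty() ) return "outside any element" ;

  switch( tags.top() ){
    case ELEMENT_PACKAGE: return "start a new package" ;
    case ELEMENT_NAME:    return "collect name text" ;
    case ELEMENT_LICENSE: return "collect license text" ;
    case ELEMENT_VENDOR:  return "collect vendor text" ;
    default:              return "ignore" ;
  }
}
```

A start-tag callback would call tags.push( elementToId( name ) ), and the matching end-tag callback would call tags.pop(), so the top of the stack always identifies the element being processed.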

The following table matches SAX events to handler actions:

SAX event        handler action

startElement()   Note the element name, and push its corresponding numeric constant onto the stack. If it's a <package> start tag, create a new RPMInfo object and assign it to a member variable.
characters()     If the current element (based on the stack) is one of the targets, append the buffer content to a std::ostringstream.
endElement()     Based on the current element, pass the contents of the std::ostringstream to the matching mutator method of the current RPMInfo object. Pop the tag stack and clear the buffer used by characters().

Generally speaking, start tag events are a good place to create a new object. End tag events are a place to wrap up: assign the character buffer to some property, pop the tag stack, and so on.
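The three callbacks can be sketched as a small state machine. This is not step2's handler: to stay self-contained it does not derive from a Xerces handler class, and its callbacks take plain std::string arguments rather than the arguments Xerces passes. PackageSummary (a stand-in for RPMInfo), PackageHandler, holding the current package by value, and collecting finished packages in a std::vector are all assumptions made for illustration.

```cpp
#include <sstream>
#include <stack>
#include <string>
#include <vector>

// Hypothetical stand-in for RPMInfo, reduced to three fields.
struct PackageSummary {
  std::string name ;
  std::string license ;
  std::string vendor ;
} ;

// Simplified, Xerces-free sketch of the event-driven state machine.
class PackageHandler {

  public:

  enum { OTHER , PACKAGE , NAME , LICENSE , VENDOR } ;

  // Start tag: record the element, and begin a fresh object for <package>.
  void startElement( const std::string& localName ) {
    const int id = toId( localName ) ;
    tags_.push( id ) ;
    if( PACKAGE == id ) current_ = PackageSummary() ;
  }

  // Body text: keep it only while inside a target element.
  // A parser may deliver one element's text in several chunks, so append.
  void characters( const std::string& chunk ) {
    if( tags_.empty() ) return ;
    switch( tags_.top() ){
      case NAME: case LICENSE: case VENDOR:
        buffer_ << chunk ;
        break ;
      default:
        break ;
    }
  }

  // End tag: hand the buffered text to the right property, then clean up.
  void endElement( const std::string& /* localName */ ) {
    if( tags_.empty() ) return ;
    switch( tags_.top() ){
      case NAME:    current_.name = buffer_.str() ;    break ;
      case LICENSE: current_.license = buffer_.str() ; break ;
      case VENDOR:  current_.vendor = buffer_.str() ;  break ;
      case PACKAGE: packages_.push_back( current_ ) ;  break ;
      default:      break ;
    }
    tags_.pop() ;        // leave the current element
    buffer_.str( "" ) ;  // discard text collected for it
    buffer_.clear() ;
  }

  const std::vector<PackageSummary>& packages() const { return packages_ ; }

  private:

  static int toId( const std::string& n ) {
    if( n == "package" ) return PACKAGE ;
    if( n == "name" )    return NAME ;
    if( n == "license" ) return LICENSE ;
    if( n == "vendor" )  return VENDOR ;
    return OTHER ;
  }

  std::stack<int> tags_ ;
  std::ostringstream buffer_ ;
  PackageSummary current_ ;
  std::vector<PackageSummary> packages_ ;

} ;
```

Feeding the sketch the events for one package (start package, start name, characters, end name, and so on through end package) leaves a single PackageSummary in packages(). Text that arrives while any other element is on top of the stack is dropped, which is how the handler acts as a query over a large document.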

Finally, note that step2 requires you to cheat: you must decompress the package metadata file before you can parse it. This data is usually compressed, but writing code to decompress the file before feeding it to Xerces is beyond the scope of this article. I've included a small sample file for demonstration purposes.
