ONLamp.com

Processing XML with Xerces and SAX

by Q Ethan McCallum
11/10/2005

In my previous article, I introduced the Xerces-C++ XML toolkit and explained how to use Xerces for DOM parsing. This time, I'll explain Xerces SAX parsing, plus error handling and validation.

SAX and DOM offer very different approaches to reading XML. Many people say the difference between the two is just about memory efficiency--but it's also a matter of control. With SAX you pluck out exactly what you want from the XML document, instead of wandering the DOM tree.

Xerces provides customizable error handling for both types of parsing. This means you can (in a limited fashion) tell the parser how to react to certain types of problems.

Finally, there's validation using DTD and XML Schema. (Come to think of it, validation is the reason for a lot of error handling.) Letting the parser handle the validation means your code can (safely) assume a certain document structure. That keeps your code clean, because it can focus on the business logic behind what's in the XML instead of watching out for missing elements.

I compiled the sample code under Fedora Core 3/x86 using Xerces-C++ 2.6.0 and GCC 3.4.3. The code uses the helper classes described in the previous article, but you don't need to understand them to follow this article.

SAX at a High Level

Whereas DOM gives you an object graph of the entire document at once, SAX parsing is a streaming model. It reads a document sequentially and hands your code chunks of data to process. You have to catch items of interest when they appear, because you can't rewind the stream to get them later. It's not unlike scrolling movie credits, in that the information is gone once it rolls off the screen. SAX is reminiscent of parsing with lex and yacc, though it asks less of you: the rules of XML define much of the grammar for you.

Because you don't load the entire document into memory, SAX can be quite resource-efficient. On the other hand, it requires more elbow grease than DOM: it's up to your code to churn the stream of data into usable objects.

This makes SAX feel lower-level than DOM, in much the same way that memory management in C feels lower-level than in Java. It's true that SAX puts you closer to the raw parsing, but that's one reason to choose SAX over DOM. What if you have a large document and you're interested in only a few particular elements? DOM's overhead is wasteful in that case. SAX lets you sift through the data and extract just what you need.

SAX Mechanics

At the code level, a SAX parser interprets an XML document as a series of events. The most common events are element start tags and end tags and the body content between them.

Each event is tied to a callback function on a handler object--this is the piece you write--that you register with the parser. When a parser encounters an element's start tag, for example, it passes the element and any attributes to the start tag callback. When it's parsing the body text between the start and end tags, it passes chunks of content to the body content callback.
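
To make the event model concrete, here is a minimal sketch of a push-style parser in plain C++. It is not Xerces code: ToyHandler, toyParse(), and RecordingHandler are hypothetical names invented for this illustration, the callbacks take std::string instead of XMLCh, and the toy parser understands only bare start tags, end tags, and text. The point is just the flow of control: the parser walks the input once and calls back into the handler as it goes.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a SAX content handler.
// Real Xerces callbacks take XMLCh* and an Attributes object instead.
struct ToyHandler {
    virtual ~ToyHandler() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Toy push parser: reads the input once, left to right, and fires one
// callback per tag or text run. Handles only <name>, </name>, and text
// (no attributes, comments, or entities).
void toyParse(const std::string& xml, ToyHandler& handler) {
    std::string::size_type pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            const std::string::size_type close = xml.find('>', pos);
            if (close == std::string::npos) break;  // malformed input: stop
            const std::string tag = xml.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag[0] == '/')
                handler.endElement(tag.substr(1));
            else
                handler.startElement(tag);
            pos = close + 1;
        } else {
            std::string::size_type next = xml.find('<', pos);
            if (next == std::string::npos) next = xml.size();
            handler.characters(xml.substr(pos, next - pos));
            pos = next;
        }
    }
}

// Handler that records each event, in the order it arrives.
struct RecordingHandler : ToyHandler {
    std::vector<std::string> events;
    void startElement(const std::string& name) override { events.push_back("start:" + name); }
    void endElement(const std::string& name) override { events.push_back("end:" + name); }
    void characters(const std::string& text) override { events.push_back("chars:" + text); }
};
```

Passing the string <alias>Roissy airport</alias> to toyParse() with a RecordingHandler leaves three entries in events: start:alias, chars:Roissy airport, and end:alias.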

In fact, the handler does all the work. The main() function of the sample program step1 is very short. There's just enough code to create a parser, assign a handler, and run:

// ... skipping basic Xerces setup explained
// in the previous article ...

// ask the factory for a SAX2 parser
xercesc::SAX2XMLReader* p =
  xercesc::XMLReaderFactory::createXMLReader();

// the handler holds the callbacks the parser will invoke
xercesc::ContentHandler* h =
  new SimpleContentHandler() ;

p->setContentHandler( h ) ;

// ... set some options on the parser ...

// the handler's callbacks fire during this call
p->parse( xmlFile ) ;

SAX handler classes implement the ContentHandler interface. In abbreviated form, it looks like this:

class ContentHandler {

  public:

  virtual void startDocument() = 0 ;
  virtual void endDocument() = 0 ;
  
  virtual void startElement(
    const XMLCh* const uri,
    const XMLCh* const localname,
    const XMLCh* const qname,
    const Attributes&  attrs 
  ) = 0 ;

  virtual void endElement(
    const XMLCh* const uri,
    const XMLCh* const localname,
    const XMLCh* const qname
  ) = 0 ;

  virtual void characters(
    const XMLCh* const chars,
    const unsigned int length
  ) = 0 ;

  virtual void startPrefixMapping(
    const XMLCh* const prefix,
    const XMLCh* const uri
  ) = 0 ;

  virtual void endPrefixMapping(
    const XMLCh* const prefix
  ) = 0 ;

  virtual void processingInstruction (
    const XMLCh* const target,
    const XMLCh* const data
  ) = 0 ;

  // ... a couple of other member functions ...
} ;

Most of the methods are self-explanatory. When the parser encounters body content between a start and end tag, it calls characters(). There may be several calls for the same element's body content, because the parser makes no guarantee that it will hand you all of that content at once. The parser's character buffer may be several kilobytes in size, so you can often get away with assuming one call per element--but that's not a safe assumption. Do yourself a favor and stash characters() data in a buffer, such as a std::ostringstream.
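
The buffering advice might look roughly like the following sketch. It compiles without Xerces, so everything in it is an assumption made for illustration: BufferingHandler is a hypothetical class, plain char stands in for XMLCh, and the callbacks carry only the arguments the example needs. It appends each chunk using the reported length, and treats the text as complete only when the end tag arrives.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical handler using plain char in place of XMLCh. The callback
// shapes loosely mirror startElement(), characters(), and endElement().
class BufferingHandler {
public:
    void startElement(const std::string& /*name*/) {
        // New element: discard whatever text was collected before it.
        resetBuffer();
    }

    void characters(const char* chars, unsigned int length) {
        // May be called several times per element; append each chunk,
        // using the length argument rather than looking for a terminator.
        buffer_.write(chars, length);
    }

    void endElement(const std::string& name) {
        // Only now is the element's text known to be complete.
        if (name == "alias") aliases_.push_back(buffer_.str());
        resetBuffer();
    }

    const std::vector<std::string>& aliases() const { return aliases_; }

private:
    void resetBuffer() {
        buffer_.str("");   // drop the accumulated text
        buffer_.clear();   // reset any stream error flags
    }

    std::ostringstream buffer_;
    std::vector<std::string> aliases_;
};
```

If the text of an <alias> element arrives as two chunks, "Charles de " and "Gaulle airport", the handler still stores the single string "Charles de Gaulle airport".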

ContentHandler is an abstract class made up of pure virtual functions (an interface, in other words), so it can be tedious to implement all of its methods if you're interested only in certain events. (Most developers care only about startElement(), endElement(), and characters().) Xerces includes a convenience class called DefaultHandler that provides do-nothing implementations of ContentHandler's methods. Your handler can simply inherit from DefaultHandler and override the methods of interest.
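
The pattern behind DefaultHandler is easy to reproduce in miniature. The sketch below does not use Xerces; EventInterface, NoOpHandler, and ElementCounter are hypothetical names, and the interface is cut down to four callbacks that take std::string. It shows only the shape of the idea: an adapter supplies empty bodies, and the real handler overrides the one callback it cares about.

```cpp
#include <string>

// Hypothetical cut-down interface: every callback is pure virtual,
// so a class that implements it directly must define all of them.
struct EventInterface {
    virtual ~EventInterface() {}
    virtual void startDocument() = 0;
    virtual void endDocument() = 0;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
};

// Do-nothing adapter, playing the role that DefaultHandler plays in Xerces.
struct NoOpHandler : EventInterface {
    void startDocument() override {}
    void endDocument() override {}
    void startElement(const std::string&) override {}
    void endElement(const std::string&) override {}
};

// Only startElement() matters here, so it is the only override.
struct ElementCounter : NoOpHandler {
    int count = 0;
    void startElement(const std::string&) override { ++count; }
};
```

A caller that holds an EventInterface reference can fire every kind of event at an ElementCounter; all of them are silently ignored except startElement(), which bumps the counter.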

The stub program step1 demonstrates the DefaultHandler approach. It uses a simple "talking" handler whose callback methods announce when they are called. In the listing below, each line of the sample XML file is followed by the indented output that step1 prints for it.

[begin: parse]
[setDocumentLocator()]
[startDocument()]

<airports>
    [startElement( "airports" )]
    [characters( XMLCh[11] ) ]
<airport name="CDG">
    [startElement( "airport" )]
        "name" => "CDG" (type: CDATA)
    [characters( XMLCh[14] ) ]
<aliases>
    [startElement( "aliases" )]
    [characters( XMLCh[13] ) ]
<alias>Charles de Gaulle airport</alias>
    [startElement( "alias" )]
    [characters( XMLCh[25] ) ]
    [endElement( "alias" )]
    [characters( XMLCh[13] ) ]
<alias>Roissy airport</alias>
    [startElement( "alias" )]
    [characters( XMLCh[14] ) ]
    [endElement( "alias" )]
    [characters( XMLCh[10] ) ]
</aliases>
    [endElement( "aliases" )]
    [characters( XMLCh[14] ) ]
<location>Paris, France</location>
    [startElement( "location" )]
    [characters( XMLCh[13] ) ]
    [endElement( "location" )]
    [characters( XMLCh[14] ) ]
<comment> Terminal 3 has a very 1970s sci-fi decor </comment>
    [startElement( "comment" )]
    [characters( XMLCh[61] ) ]
    [endElement( "comment" )]
    [characters( XMLCh[11] ) ]
</airport>
    [endElement( "airport" )]
    [characters( XMLCh[11] ) ]
<!-- ... other airport defs ... -->
</airports>
    [characters( XMLCh[8] ) ]
    [endElement( "airports" )]

[endDocument()]
[end: parse]
[done]

This code is rather dull, even for a learning device. What it demonstrates, though, is that startElement() and endElement() announce the name of the current element. You can use this to create a selective handler, one that reports only a certain subset of SAX events.

Suppose that the sample file included definitions for several airports, and you wanted to see the name and aliases of each one. Your handler's startElement() method would look something like this:

void startElement( ... element name, attributes, etc. ... ){

  if( ... element name is "airport" ... ){

    ... print the element's "name" attribute ...

  } else if( ... element name is "alias" ... ){

    ... print the alias ...

  } else {

    ... ignore it ...

  }

}

Such a handler would show:

CDG
        Charles de Gaulle airport
        Roissy airport
...

(I'm cheating here, because I don't show how to capture the elements' body content. I'll explain that shortly.)
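
For the curious, here is one way the whole selective handler could fit together, written as a standalone sketch rather than against the Xerces API. AirportNameHandler and AttrMap are hypothetical names: a std::map stands in for the Attributes object, std::string stands in for XMLCh, and the report is collected in a string instead of being printed. The handler remembers whether it is inside an <alias> element, so that characters() knows which text to keep.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the attribute list passed to startElement().
typedef std::map<std::string, std::string> AttrMap;

class AirportNameHandler {
public:
    void startElement(const std::string& name, const AttrMap& attrs) {
        if (name == "airport") {
            // The airport's name is an attribute, so it is available now.
            AttrMap::const_iterator it = attrs.find("name");
            if (it != attrs.end()) report_ += it->second + "\n";
        } else if (name == "alias") {
            // The alias is body text; start collecting it.
            inAlias_ = true;
            text_.clear();
        }
        // Every other element is ignored.
    }

    void characters(const std::string& chunk) {
        if (inAlias_) text_ += chunk;  // may arrive in several pieces
    }

    void endElement(const std::string& name) {
        if (name == "alias") {
            report_ += "\t" + text_ + "\n";
            inAlias_ = false;
        }
    }

    const std::string& report() const { return report_; }

private:
    bool inAlias_ = false;
    std::string text_;
    std::string report_;
};
```

Because the events arrive in document order, the report lists each airport name followed by its aliases in the order they appear in the file.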

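Another point made evident by the previous example is that a SAX ContentHandler, unlike DOM, doesn't make XML comments available to your code. In the sample XML file, the phrase ... other airport defs ... is invisible to the handler. (SAX2 does define a separate LexicalHandler interface that reports comments, but ContentHandler never sees them.)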

Also notice that the phrase Terminal 3 has a very 1970s sci-fi decor has only 40 characters, yet the characters() callback reports 61. The remaining characters are whitespace between the phrase and its surrounding tags. To do anything useful with the <comment> element in the example, I would have to trim the whitespace from both ends.
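
Trimming is simple once the text is in a std::string. The helper below is a sketch under that assumption: trimWhitespace() is a hypothetical name, and it expects the XMLCh data to have been transcoded to a narrow string already. It strips spaces, tabs, carriage returns, and newlines from both ends and leaves interior whitespace alone.

```cpp
#include <string>

// Returns a copy of text without leading or trailing whitespace.
// A string that is empty or all whitespace yields an empty string.
std::string trimWhitespace(const std::string& text) {
    const char* const kWhitespace = " \t\r\n";
    const std::string::size_type first = text.find_first_not_of(kWhitespace);
    if (first == std::string::npos) return std::string();
    const std::string::size_type last = text.find_last_not_of(kWhitespace);
    return text.substr(first, last - first + 1);
}
```

A handler would call this on its accumulated buffer in endElement(), once all of the element's characters() chunks have arrived.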
