ONJava.com -- The Independent Source for Enterprise Java

XSLT Processing with Java

The approach

It turns out that writing a SAX parser is quite easy (our examples use SAX 2). All a SAX parser does is read an XML file top to bottom and fire event notifications as various elements are encountered. In our custom parser, we will read the CSV file top to bottom, firing SAX events as we read the file. A program listening to those SAX events will not realize that the data file is CSV rather than XML; it sees only the events. Figure 5-4 illustrates the conceptual model.

Figure 5-4. Custom SAX parser

In this model, the XSLT processor interprets the SAX events as XML data and uses a normal stylesheet to perform the transformation. The interesting aspect of this model is that we can easily write custom SAX parsers for other file formats, making XSLT a useful transformation language for just about any legacy application data.

In SAX, org.xml.sax.XMLReader is a standard interface that parsers must implement. It works in conjunction with org.xml.sax.ContentHandler, which is the interface that listens to SAX events. For this model to work, your XSLT processor must provide a ContentHandler implementation so it can listen to the SAX events that the XMLReader generates. In the case of JAXP, javax.xml.transform.sax.TransformerHandler is used for this purpose.

Obtaining an instance of TransformerHandler requires a few extra programming steps. First, create a TransformerFactory as usual:

TransformerFactory transFact = TransformerFactory.newInstance( );

As before, the TransformerFactory is the JAXP abstraction for some underlying XSLT processor. This underlying processor may not support SAX features, so you have to query it to determine whether you can proceed:

if (transFact.getFeature(SAXTransformerFactory.FEATURE)) {

If this returns false, you are out of luck. Otherwise, you can safely downcast to a SAXTransformerFactory and construct the TransformerHandler instance:

SAXTransformerFactory saxTransFact =
    (SAXTransformerFactory) transFact;
// Create a ContentHandler without specifying a
// stylesheet. Without a stylesheet, raw
// XML is sent to the output.
TransformerHandler transHand = saxTransFact.newTransformerHandler( );

In the code shown here, a stylesheet was not specified. JAXP defaults to the identity transformation stylesheet, which means that the SAX events will be "transformed" into raw XML output. To specify a stylesheet that performs an actual transformation, pass a Source to the method as follows:

Source xsltSource = new StreamSource(myXsltSystemId);
TransformerHandler transHand = saxTransFact.newTransformerHandler(xsltSource);
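
To make the identity case concrete, the following is a minimal sketch that fires a handful of SAX events by hand into a TransformerHandler created without a stylesheet and captures the serialized XML in a string. The class name IdentityHandlerSketch, the method emitGreeting, and the <greeting> element are invented for illustration, and the sketch assumes the XSLT processor bundled with your JDK reports support for SAXTransformerFactory.FEATURE.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of the original example code.
final class IdentityHandlerSketch {

    static String emitGreeting()
            throws TransformerConfigurationException, SAXException {
        TransformerFactory transFact = TransformerFactory.newInstance();
        if (!transFact.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException(
                    "SAX-based transformations are not supported");
        }
        SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;

        // No stylesheet: the handler performs the identity transformation.
        TransformerHandler transHand = saxTransFact.newTransformerHandler();
        // Configure output before attaching the Result.
        transHand.getTransformer()
                 .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transHand.setResult(new StreamResult(out));

        // Fire the SAX events for a single element containing text.
        AttributesImpl noAttrs = new AttributesImpl();
        char[] text = "hello".toCharArray();
        transHand.startDocument();
        transHand.startElement("", "greeting", "greeting", noAttrs);
        transHand.characters(text, 0, text.length);
        transHand.endElement("", "greeting", "greeting");
        transHand.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(emitGreeting());
    }
}
```

Because the events are serialized as they arrive, the handler never needs an XML file at all, which is the property the custom CSV parser relies on.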

Detailed CSV to SAX design

Before delving into the complete example program, let's step back and look at a more detailed design diagram. The conceptual model is straightforward, but quite a few classes and interfaces come into play. Figure 5-5 shows the pieces necessary for SAX-based transformations.

Figure 5-5. SAX and XSLT transformations

This design certainly appears more complex than the previous approaches, but it is similar in many ways. In previous approaches, we used the TransformerFactory to create instances of Transformer; in the SAX approach, we start with SAXTransformerFactory, a subclass of TransformerFactory. Before any work can be done, you must verify that your particular implementation supports SAX-based transformations. The reference implementation of JAXP does support this, although other implementations are not required to do so. In the following code fragment, the getFeature method of TransformerFactory will return true if you can safely downcast to a SAXTransformerFactory instance:

TransformerFactory transFact = TransformerFactory.newInstance( );
if (transFact.getFeature(SAXTransformerFactory.FEATURE)) {
  // downcast is allowed
  SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;

If getFeature returns false, your only option is to look for an implementation that does support SAX-based transformations. Otherwise, you can proceed to create an instance of TransformerHandler:

TransformerHandler transHand = saxTransFact.newTransformerHandler(myXsltSource);

This object now represents your XSLT stylesheet. As Figure 5-5 shows, TransformerHandler extends org.xml.sax.ContentHandler, so it knows how to listen to events from a SAX parser. The series of SAX events will provide the "fake XML" data, so the only remaining pieces of the puzzle are to set the Result and tell the SAX parser to begin parsing. The TransformerHandler also provides a reference to a Transformer, which allows you to set output properties such as the character encoding, whether to indent the output, or any other attribute of <xsl:output>.
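
The remaining steps (setting output properties, setting the Result, and starting the parse) might be wired together as in the sketch below. It makes several assumptions: a stock JAXP XML parser stands in for the custom reader, and the stylesheet, the person/name input, and the names HandlerPipelineSketch and transform are invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the original example code.
final class HandlerPipelineSketch {

    // Illustrative stylesheet: copies the text of <name> into <hello>.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/'>"
        + "<hello><xsl:value-of select='person/name'/></hello>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        TransformerFactory transFact = TransformerFactory.newInstance();
        if (!transFact.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException(
                    "SAX-based transformations are not supported");
        }
        SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;

        Source xsltSource = new StreamSource(new StringReader(XSLT));
        TransformerHandler transHand =
                saxTransFact.newTransformerHandler(xsltSource);

        // Output properties are set through the handler's Transformer.
        transHand.getTransformer()
                 .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transHand.getTransformer()
                 .setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        StringWriter out = new StringWriter();
        transHand.setResult(new StreamResult(out));

        // Any XMLReader can feed the handler; a custom reader plugs in here.
        SAXParserFactory parserFact = SAXParserFactory.newInstance();
        parserFact.setNamespaceAware(true);
        XMLReader reader = parserFact.newSAXParser().getXMLReader();
        reader.setContentHandler(transHand);
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<person><name>Eric</name></person>"));
    }
}
```

Only the lines that obtain the XMLReader would change when a custom parser replaces the stock one; the handler, Result, and output properties stay the same.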

Writing the custom parser

Writing the actual SAX parser sounds harder than it really is. The process basically involves implementing the org.xml.sax.XMLReader interface, which declares numerous methods you can safely ignore for most applications. For example, when parsing a CSV file, it is probably not necessary to deal with namespaces or validation. The code for AbstractXMLReader.java is shown in Example 5-5. This abstract class provides basic implementations of every method in the XMLReader interface except parse(InputSource); the parse(String) overload simply wraps the system ID in an InputSource and delegates to it. This means that all you need to do to write a parser is create a subclass and implement this single method.


Example 5-5: AbstractXMLReader.java

package com.oreilly.javaxslt.util;
 
import java.io.IOException;
import java.util.*;
import org.xml.sax.*;
 
 
/**
* An abstract class that implements the SAX2
* XMLReader interface. The intent of this class
* is to make it easy for subclasses to act as
* SAX2 XMLReader implementations. This makes it
* possible, for example, for them to emit SAX2
* events that can be fed into an XSLT processor
* for transformation.
*/
public abstract class AbstractXMLReader implements org.xml.sax.XMLReader {
 private Map featureMap = new HashMap( );
 private Map propertyMap = new HashMap( );
 private EntityResolver entityResolver;
 private DTDHandler dtdHandler;
 private ContentHandler contentHandler;
 private ErrorHandler errorHandler;
 
 /**
  * The only abstract method in this class. Derived classes can parse
  * any source of data and emit SAX2 events to the ContentHandler.
  */
 public abstract void parse(InputSource input) throws IOException,
   SAXException;
 
 public boolean getFeature(String name)
   throws SAXNotRecognizedException, SAXNotSupportedException {
  Boolean featureValue = (Boolean) this.featureMap.get(name);
  return (featureValue == null) ? false
    : featureValue.booleanValue( );
 }
 
 public void setFeature(String name, boolean value)
   throws SAXNotRecognizedException, SAXNotSupportedException {
  this.featureMap.put(name, Boolean.valueOf(value));
 }
 
 public Object getProperty(String name)
   throws SAXNotRecognizedException, SAXNotSupportedException {
  return this.propertyMap.get(name);
 }
 
 public void setProperty(String name, Object value)
   throws SAXNotRecognizedException, SAXNotSupportedException {
  this.propertyMap.put(name, value);
 }
 
 public void setEntityResolver(EntityResolver entityResolver) {
  this.entityResolver = entityResolver;
 }
 
 public EntityResolver getEntityResolver( ) {
  return this.entityResolver;
 }
 
 public void setDTDHandler(DTDHandler dtdHandler) {
  this.dtdHandler = dtdHandler;
 }
 
 public DTDHandler getDTDHandler( ) {
  return this.dtdHandler;
 }
 
 public void setContentHandler(ContentHandler contentHandler) {
  this.contentHandler = contentHandler;
 }
 
 public ContentHandler getContentHandler( ) {
  return this.contentHandler;
 }
 
 public void setErrorHandler(ErrorHandler errorHandler) {
  this.errorHandler = errorHandler;
 }
 
 public ErrorHandler getErrorHandler( ) {
  return this.errorHandler;
 }
 
 public void parse(String systemId) throws IOException, SAXException {
  parse(new InputSource(systemId));
 }
}


Creating the subclass, CSVXMLReader, involves overriding the parse( ) method and actually scanning through the CSV file, emitting SAX events as elements in the file are encountered. While the SAX portion is very easy, parsing the CSV file is a little more challenging. To make this class as flexible as possible, it was designed to parse through any CSV file that a spreadsheet such as Microsoft Excel can export. For simple data, your CSV file might look like this:

Burke,Eric,M
Burke,Jennifer,L
Burke,Aidan,G

The XML representation of this file is shown in Example 5-6. The only real drawback here is that CSV files are strictly positional, meaning that names are not assigned to each column of data. This means that the XML output merely contains a sequence of three <value> elements for each line, so your stylesheet will have to select items based on position.


Example 5-6: Example XML output from CSV parser

<?xml version="1.0" encoding="UTF-8"?>
<csvFile>
  <line>
    <value>Burke</value>
    <value>Eric</value>
    <value>M</value>
  </line>
  <line>
    <value>Burke</value>
    <value>Jennifer</value>
    <value>L</value>
  </line>
  <line>
    <value>Burke</value>
    <value>Aidan</value>
    <value>G</value>
  </line>
</csvFile>
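
As a rough illustration of both points (a parser that emits these events, and a stylesheet that selects by position), the sketch below reads simple CSV and pipes the events into a TransformerHandler. It is not the CSVXMLReader discussed in this chapter. To stay self-contained it extends the standard org.xml.sax.helpers.XMLFilterImpl instead of AbstractXMLReader, it splits on commas without any quote handling, and the class name SimpleCsvReaderSketch, the method fullNames, and the stylesheet are all invented for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical reader: XMLFilterImpl supplies the XMLReader plumbing and
// forwards startElement() and friends to the registered ContentHandler.
final class SimpleCsvReaderSketch extends XMLFilterImpl {
    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    // Illustrative stylesheet: value[2] is the first name, value[1] the last.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='csvFile/line'>"
        + "<xsl:value-of select='value[2]'/><xsl:text> </xsl:text>"
        + "<xsl:value-of select='value[1]'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each></xsl:template></xsl:stylesheet>";

    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        Reader chars = input.getCharacterStream();
        if (chars == null) {
            throw new IOException("this sketch requires a character stream");
        }
        BufferedReader in = new BufferedReader(chars);
        startDocument();
        startElement("", "csvFile", "csvFile", NO_ATTRS);
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            startElement("", "line", "line", NO_ATTRS);
            // Naive split: quoted fields are NOT handled in this sketch.
            for (String field : line.split(",", -1)) {
                startElement("", "value", "value", NO_ATTRS);
                characters(field.toCharArray(), 0, field.length());
                endElement("", "value", "value");
            }
            endElement("", "line", "line");
        }
        endElement("", "csvFile", "csvFile");
        endDocument();
    }

    static String fullNames(String csv) throws Exception {
        TransformerFactory transFact = TransformerFactory.newInstance();
        if (!transFact.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("SAX transformations unsupported");
        }
        SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
        TransformerHandler transHand = saxTransFact.newTransformerHandler(
                new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transHand.setResult(new StreamResult(out));

        SimpleCsvReaderSketch reader = new SimpleCsvReaderSketch();
        reader.setContentHandler(transHand);
        reader.parse(new InputSource(new StringReader(csv)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                fullNames("Burke,Eric,M\nBurke,Jennifer,L\nBurke,Aidan,G\n"));
    }
}
```

The stylesheet never sees a column name; it relies entirely on value[1] and value[2], which is the positional drawback described above.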

One enhancement would be to design the CSV parser so it could accept a list of meaningful column names as parameters, and these could be used in the XML that is generated. Another option would be to write an XSLT stylesheet that transformed this initial output into another form of XML that used meaningful column names. To keep the code example relatively manageable, these features were omitted from this implementation. The CSV file format does, however, have some complexities that must be handled. For example, fields that contain commas must be surrounded with quotes:

"Consultant,Author,Teacher",Burke,Eric,M
Teacher,Burke,Jennifer,L
None,Burke,Aidan,G

To further complicate matters, fields may also contain quotes ("). In this case, they are doubled up, much in the same way you use double backslash characters (\\) in Java to represent a single backslash. In the following example, the first column contains one quote character, so the entire field is quoted, and the embedded quote is doubled up:

"test""quote",Teacher,Burke,Jennifer,L

This would be interpreted as:

test"quote,Teacher,Burke,Jennifer,L
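
The two quoting rules can be captured in a small state machine. The following is a sketch under stated assumptions rather than the parser developed in this chapter: the names CsvLineSketch and splitLine are hypothetical, and the method handles a single line, so quoted fields that span line breaks are out of scope.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the original example code.
final class CsvLineSketch {

    /** Splits one CSV line into fields, honoring quoted fields. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c); // commas are literal inside quotes
                } else if (i + 1 < line.length()
                        && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote is one literal quote
                    i++;                 // skip the second quote of the pair
                } else {
                    inQuotes = false;    // closing quote ends the quoted run
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // prints [test"quote, Teacher, Burke, Jennifer, L]
        System.out.println(
                splitLine("\"test\"\"quote\",Teacher,Burke,Jennifer,L"));
    }
}
```

Tracking a single inQuotes flag is enough here because CSV quoting does not nest; a comma ends a field only while the flag is false.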
