XSLT Processing with Java
The code in Example 5-7 shows the complete implementation of the CSV parser.
Example 5-7: CSVXMLReader.java
package com.oreilly.javaxslt.util;
import java.io.*;
import java.net.URL;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
/**
* A utility class that parses a Comma
* Separated Values (CSV) file and outputs its
* contents using SAX2 events. The format of CSV
* that this class reads is identical to the export
* format for Microsoft Excel. For simple values, the
* CSV file may look like this:
* <pre>
* a,b,c
* d,e,f
* </pre>
* Quotes are used as delimiters when the values
* contain commas:
* <pre>
* a,"b,c",d
* e,"f,g","h,i"
* </pre>
* And double quotes are used when the values
* contain quotes. This parser is smart enough
* to trim spaces around commas, as well.
*
* @author Eric M. Burke
*/
public class CSVXMLReader extends AbstractXMLReader {
// an empty attribute for use with SAX
private static final Attributes EMPTY_ATTR = new AttributesImpl( );
/**
* Parse a CSV file. SAX events are
* delivered to the ContentHandler
* that was registered via
* <code>setContentHandler</code>.
*
* @param input the comma separated
* values file to parse.
*/
public void parse(InputSource input) throws IOException,
SAXException {
// if no handler is registered to receive events, don't bother
// to parse the CSV file
ContentHandler ch = getContentHandler( );
if (ch == null) {
return;
}
// convert the InputSource into a BufferedReader
BufferedReader br = null;
if (input.getCharacterStream( ) != null) {
br = new BufferedReader(input.getCharacterStream( ));
} else if (input.getByteStream( ) != null) {
br = new BufferedReader(new InputStreamReader(
input.getByteStream( )));
} else if (input.getSystemId( ) != null) {
java.net.URL url = new URL(input.getSystemId( ));
br = new BufferedReader(new InputStreamReader(url.openStream( )));
} else {
throw new SAXException("Invalid InputSource object");
}
ch.startDocument( );
// emit <csvFile>
ch.startElement("","","csvFile",EMPTY_ATTR);
// read each line of the file until EOF is reached
String curLine = null;
while ((curLine = br.readLine( )) != null) {
curLine = curLine.trim( );
if (curLine.length( ) > 0) {
// create the <line> element
ch.startElement("","","line",EMPTY_ATTR);
// output data from this line
parseLine(curLine, ch);
// close the </line> element
ch.endElement("","","line");
}
}
// emit </csvFile>
ch.endElement("","","csvFile");
ch.endDocument( );
}
// Break an individual line into tokens.
// This is a recursive function
// that extracts the first token, then
// recursively parses the
// remainder of the line.
private void parseLine(String curLine, ContentHandler ch)
throws IOException, SAXException {
String firstToken = null;
String remainderOfLine = null;
int commaIndex = locateFirstDelimiter(curLine);
if (commaIndex > -1) {
firstToken = curLine.substring(0, commaIndex).trim( );
remainderOfLine = curLine.substring(commaIndex+1).trim( );
} else {
// no commas, so the entire line is the token
firstToken = curLine;
}
// remove redundant quotes
firstToken = cleanupQuotes(firstToken);
// emit the <value> element
ch.startElement("","","value",EMPTY_ATTR);
ch.characters(firstToken.toCharArray(), 0, firstToken.length( ));
ch.endElement("","","value");
// recursively process the remainder of the line
if (remainderOfLine != null) {
parseLine(remainderOfLine, ch);
}
}
// locate the position of the comma,
// taking into account that
// a quoted token may contain ignorable commas.
private int locateFirstDelimiter(String curLine) {
if (curLine.startsWith("\"")) {
boolean inQuote = true;
int numChars = curLine.length( );
for (int i=1; i<numChars; i++) {
char curChar = curLine.charAt(i);
if (curChar == '"') {
inQuote = !inQuote;
} else if (curChar == ',' && !inQuote) {
return i;
}
}
return -1;
} else {
return curLine.indexOf(',');
}
}
// remove quotes around a token, as well as pairs of quotes
// within a token.
private String cleanupQuotes(String token) {
StringBuffer buf = new StringBuffer( );
int length = token.length( );
int curIndex = 0;
if (token.startsWith("\"") && token.endsWith("\"")) {
curIndex = 1;
length--;
}
boolean oneQuoteFound = false;
boolean twoQuotesFound = false;
while (curIndex < length) {
char curChar = token.charAt(curIndex);
if (curChar == '"') {
twoQuotesFound = oneQuoteFound;
oneQuoteFound = true;
} else {
oneQuoteFound = false;
twoQuotesFound = false;
}
if (twoQuotesFound) {
twoQuotesFound = false;
oneQuoteFound = false;
curIndex++;
continue;
}
buf.append(curChar);
curIndex++;
}
return buf.toString( );
}
}
CSVXMLReader is a subclass of AbstractXMLReader, so it must provide an implementation of the abstract parse method:
public void parse(InputSource input) throws IOException,
SAXException {
// if no handler is registered to receive
// events, don't bother
// to parse the CSV file
ContentHandler ch = getContentHandler( );
if (ch == null) {
return;
}
The first thing this method does is check for the existence of a
SAX ContentHandler. The base class, AbstractXMLReader, provides access to this object, which
is responsible for listening to the SAX events. In our example, an instance of
JAXP's TransformerHandler is used as the SAX ContentHandler implementation. If this handler is not
registered, our parse method simply returns because
nobody is registered to listen to the events. In a real SAX parser, the XML
would be parsed anyway, which provides an opportunity to check for errors in
the XML data. Choosing to return immediately was merely a performance
optimization selected for this class.
The SAX InputSource parameter allows
our custom parser to locate the CSV file. Since an InputSource has many options for reading its data,
parsers must check each potential source in the order shown here:
// convert the InputSource into a BufferedReader
BufferedReader br = null;
if (input.getCharacterStream( ) != null) {
br = new BufferedReader(input.getCharacterStream( ));
} else if (input.getByteStream( ) != null) {
br = new BufferedReader(new InputStreamReader(
input.getByteStream( )));
} else if (input.getSystemId( ) != null) {
java.net.URL url = new URL(input.getSystemId( ));
br = new BufferedReader(new InputStreamReader(url.openStream( )));
} else {
throw new SAXException("Invalid InputSource object");
}
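The lookup order above can also be packaged as a small standalone helper. The sketch below is not part of Example 5-7: the class name InputSourceReaders and the method open are hypothetical. It also assumes that a byte stream should be decoded with the encoding reported by InputSource.getEncoding( ), falling back to UTF-8 when none is set, whereas the original code relies on the platform default encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the book's code.
final class InputSourceReaders {

    // Checks the character stream, then the byte stream, then the system ID.
    static BufferedReader open(InputSource input)
            throws IOException, SAXException {
        if (input.getCharacterStream() != null) {
            return new BufferedReader(input.getCharacterStream());
        }
        // Assumption: default to UTF-8 when the caller set no encoding.
        String enc = (input.getEncoding() != null)
                ? input.getEncoding() : "UTF-8";
        if (input.getByteStream() != null) {
            return new BufferedReader(
                    new InputStreamReader(input.getByteStream(), enc));
        }
        if (input.getSystemId() != null) {
            URL url = new URL(input.getSystemId());
            return new BufferedReader(
                    new InputStreamReader(url.openStream(), enc));
        }
        throw new SAXException("Invalid InputSource object");
    }

    public static void main(String[] args) throws Exception {
        // A character stream takes priority over everything else.
        InputSource chars = new InputSource(new StringReader("a,b,c\n"));
        System.out.println(open(chars).readLine());

        // A byte stream is decoded with the declared encoding.
        byte[] bytes = "d,\u00e9,f\n".getBytes(StandardCharsets.ISO_8859_1);
        InputSource raw = new InputSource(new ByteArrayInputStream(bytes));
        raw.setEncoding("ISO-8859-1");
        System.out.println(open(raw).readLine());
    }
}
```

Returning the reader from a single method keeps the fallback chain in one place, at the cost of one more name for callers to learn.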
Assuming that our InputSource was
valid, we can now begin parsing the CSV file and emitting SAX events. The
first step is to notify the ContentHandler that a
new document has begun:
ch.startDocument( );
// emit <csvFile>
ch.startElement("","","csvFile",EMPTY_ATTR);
The XSLT processor interprets this to mean the following:
<?xml version="1.0" encoding="UTF-8"?>
<csvFile>
Our parser simply ignores many SAX2 features, particularly XML
namespaces. This is why the namespace URI and local name arguments passed to the various ContentHandler methods are simply empty strings. The
EMPTY_ATTR constant indicates that this XML element
does not have any attributes.
The CSV file itself is very straightforward, so we merely loop
over every line in the file, emitting SAX events as we read each line. The
parseLine method is a private helper method that
does the actual CSV parsing:
// read each line of the file until EOF is reached
String curLine = null;
while ((curLine = br.readLine( )) != null) {
curLine = curLine.trim( );
if (curLine.length( ) > 0) {
// create the <line> element
ch.startElement("","","line",EMPTY_ATTR);
parseLine(curLine, ch);
ch.endElement("","","line");
}
}
And finally, we must indicate that the parsing is complete:
// emit </csvFile>
ch.endElement("","","csvFile");
ch.endDocument( );
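To see the whole event sequence in one place, the following sketch feeds the same calls to a JAXP TransformerHandler that performs an identity transformation and writes to a string. It is an illustration only: the class SaxEventDemo and its emit method are hypothetical, and the rows are passed in as an array rather than parsed from a CSV file. The exact serialized output (XML declaration, whitespace) depends on the XSLT processor in use.

```java
import java.io.StringWriter;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, not part of the book's code.
final class SaxEventDemo {
    private static final Attributes EMPTY_ATTR = new AttributesImpl();

    // Emits <csvFile>, <line>, and <value> events and returns the XML text.
    static String emit(String[][] rows)
            throws SAXException, TransformerConfigurationException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        // No stylesheet is supplied, so this handler copies events through.
        TransformerHandler ch = factory.newTransformerHandler();
        StringWriter out = new StringWriter();
        ch.setResult(new StreamResult(out));

        ch.startDocument();
        ch.startElement("", "", "csvFile", EMPTY_ATTR);
        for (String[] row : rows) {
            ch.startElement("", "", "line", EMPTY_ATTR);
            for (String value : row) {
                ch.startElement("", "", "value", EMPTY_ATTR);
                ch.characters(value.toCharArray(), 0, value.length());
                ch.endElement("", "", "value");
            }
            ch.endElement("", "", "line");
        }
        ch.endElement("", "", "csvFile");
        ch.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(emit(new String[][] {{"a", "b,c", "d"}}));
    }
}
```

Every startElement call is paired with an endElement call in reverse order, which is what keeps the resulting document well formed.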
The remaining methods in CSVXMLReader
are not discussed in detail here because they are really just responsible for
breaking down each line in the CSV file and checking for commas, quotes, and
other mundane parsing tasks. One thing worth noting is the code that emits
text, such as the following:
<value>Some Text Here</value>
SAX parsers use the characters method
on ContentHandler to represent text, which has this
signature:
public void characters(char[] ch, int start, int length)
Although this method could have been designed to take a String, using an array allows SAX parsers to preallocate
a large character array and then reuse that buffer repeatedly. This is why an
implementation of ContentHandler cannot simply
assume that the entire ch array contains meaningful
data. Instead, it must read only the specified number of characters beginning
at the start position.
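On the receiving side, this means a handler should copy only the indicated slice of the array. The sketch below shows one way to do that. The class name TextCollector is hypothetical, and the padded array in main merely simulates a parser that reuses a larger buffer.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that accumulates character data safely.
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy only the valid slice; the rest of ch may hold stale data.
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    public static void main(String[] args) {
        // Simulate a shared buffer with unrelated characters at both ends.
        char[] shared = "xxSome Text Herexx".toCharArray();
        TextCollector collector = new TextCollector();
        collector.characters(shared, 2, 14);
        System.out.println(collector.getText()); // prints "Some Text Here"
    }
}
```

Appending to a StringBuilder also copes with parsers that deliver the text of a single element in several characters calls.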
Our parser uses a relatively straightforward approach, simply
converting a String to a character array and
passing that as a parameter to the characters
method:
// emit the <value>text</value> element
ch.startElement("","","value",EMPTY_ATTR);
ch.characters(firstToken.toCharArray(), 0, firstToken.length( ));
ch.endElement("","","value");
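The line-splitting helpers that the text skips over can also be written as a single loop instead of a recursive method. The sketch below is an alternative under stated assumptions, not the book's implementation: the class CsvLineSplitter and its split method are hypothetical, values are returned as a list rather than emitted as SAX events, and every value is trimmed, including quoted ones, which differs slightly from Example 5-7.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical iterative alternative to parseLine and cleanupQuotes.
final class CsvLineSplitter {

    static List<String> split(String line) {
        List<String> values = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuote) {
                if (c != '"') {
                    cur.append(c);
                } else if (i + 1 < line.length()
                        && line.charAt(i + 1) == '"') {
                    cur.append('"'); // a doubled quote is a literal quote
                    i++;
                } else {
                    inQuote = false; // closing quote
                }
            } else if (c == '"' && cur.toString().trim().isEmpty()) {
                cur.setLength(0);    // opening quote at the start of a value
                inQuote = true;
            } else if (c == ',') {
                values.add(cur.toString().trim());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        values.add(cur.toString().trim()); // last value on the line
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",d"));
        System.out.println(split("\"say \"\"hi\"\"\", x"));
    }
}
```

A loop avoids one stack frame per value, which may matter for lines with a very large number of columns.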