ONJava.com -- The Independent Source for Enterprise Java
oreilly.comSafari Books Online.Conferences.

advertisement

AddThis Social Bookmark Button

Opening Microsoft File Formats to Java

by Avik Sengupta and Andrew Oliver
01/22/2003

So you're a developer who creates cool but complicated server-side applications in Java, or you're an architect convinced about J2EE. Do your users care until they see an Excel spreadsheet on their desktops?

Until recently, your options were limited; pay a lot or move your application to a Windows box (see A Short History of POI below). Now, with the Jakarta POI project, you have a third option. POI (Poor Obfuscation Implementation, the name that seemed to describe the format best) is a high-quality application that can read and write Excel and other MS-format files right from inside of your Java application.

In this three-part series, we will see how the Jakarta POI project provides pure Java APIs to let you achieve this dream of interoperability.

Basics

Starting with Office 95, all MS Office applications store their documents in an archive called the OLE2 Compound Document Format (OLE2CDF). It's a bit like the old FAT filesystem: it promotes fragmentation, doesn't support compression, and isn't linear (which would make streaming easier). Other than that, it's great. What is it good for? The Microsoft Foundation Classes allow applications to serialize to this format (often in Property Sets, which we'll cover later), so if you need interoperability with legacy Windows proprietary file formats or Office documents, you have to deal with OLE2CDF.

Each OLE2CDF file on disk contains an entire filesystem, laid out using nested Directory Entries, which contain Entries. We are interested in Entry elements of the Document Entries type. A Document Entry contains application-specific (e.g. Excel) data structures.

Related Reading

Java Enterprise Best Practices
By The O'Reilly Java Authors

POI

POI consists of various parts that fit together to deliver the data in a MS file format to the Java application. At the lowest level is the POIFS (POI FileSystem) API that contains the basic logic to process any OLE2CDF file. Above that sit the various components to process the application data. HSSF (Horrible SpreadSheet Format) understands the Excel structures, while HDF (Horrible Document Format) understands the Microsoft Word structures. In this article, we will look at how to use POIFS to read or write a OLE2CDF file. In future articles of this series, we shall see how to use HSSF and HDF, as well as HPSF (Horrible Property Sheet Format, used to read -- and eventually write -- document property information available through File->Property) and using the HSSF Cocoon Serializer to serialize XML to an Excel file.

POIFS

Appendix 1 contains an example program that shows you how to use the POIFS API to read and write OLE2CDF files. Let's look at the important bits.

Reading a file is pretty easy:

POIFSFileSystem fs; // org.apache.poi.poifs
...
s = new POIFSFileSystem(inputStream);

The root (/) Directory Entry can be accessed by:

DirectoryEntry dir = fs.getRoot()

From there you can traverse the file format, which is structured like a filesystem -- directories, subdirectories and all. We can now iterate through the filesystem and access its Directory Entries:

for (Iterator iter = dir.getEntries(); iter.hasNext(); ) {
    Entry entry = (Entry)iter.next();
    System.out.println("found entry: " + entry.getName());
    if (entry instanceof DirectoryEntry) {
        // .. recurse into this directory
    } else if (entry instanceof DocumentEntry) {
        // entry is a document, which you can read
    } else {
        // We're not interested in any other type
    }
}

For the moment, however, we are not interested in recursing through every entry. In an XLS file, for instance, most of the data is available in an entry called Workbook, in the root directory. We access it like this:

DocumentEntry document = (DocumentEntry)directory.getEntry("Workbook");
DocumentInputStream dstream = new DocumentInputStream(document);

From this point on, you can read the document the way you would any other InputStream in Java. Simple, isn't it?

Writing is equally easy. Watch!

// stream == InputStream of some type
// outfile = output file name
POIFSFileSystem fs = new POIFSFileSystem();
DirectoryEntry createdDir = fs.getRoot().createDocument("Workbook", stream);
OutputStream out = new FileOutputStream(outfile);
fs.write(out);  // don't forget to close your output stream...

With those few lines of code, we can read the innards of an Excel file, and write it out again in a way that Excel can read it. (Note, however, that we are writing out only the Workbook stream, thereby losing information in other streams. Macros and embedded objects are stored in a separate stream.)

There are other ways to do this more efficiently. The project is slowly migrating to a event-based API (or the "Reactor Pattern" of sorts) for reading. However, this is probably best saved for those experienced with the API; the online POIFS documentation covers this in more detail. The benefits of the reactor-based APIs are a smaller memory footprint and speed; the cost is a lower level of abstraction and bending your mind inside-out. Instead of grabbing a document, you have to do something when a document is pushed unto you.

Summary

Microsoft's OLE 2 Compound Document format once prevented Java programmers from reading and writing Office and MFC-generated file formats from pure Java. Java programmers often had to resort to native bridges which limited them to Microsoft Operating Systems. The Jakarta POI Project opens up new worlds to Java developers by allowing them to write to OLE2CDF-based file formats with pure Java -- even on UNIX. This article explained how to work with the underlying OLE 2 Compound Document Format. In the next article, we'll explain how to read, write, and Modify Excel files with HSSF. The final article will cover the HSSFSerializer for Cocoon, as well as HPSF and HDF.

References

Avik Sengupta is the CTO at Itellix Software Solutions and a contributor to the Jakarta POI project.

Andrew Oliver is a professional cat herder who moonlights as a software developer and runs a small consultancy with an ugly retro looking website that one day will be updated when he and his compadres aren't too busy working with clients.

Appendix 1

Example 1

import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;

import java.io.*;  // don't ever do this...done to save space

/**
 *Purpose: reads in an Excel file and outputs just the Workbook stream
 * Usage: POIFSExample1 excelfile.xls workbook.dat
 */

public class POIFSExample1 {
   public POIFSExample1() {};
   public void run(String filename, String outfile) throws IOException {
      InputStream istream = new FileInputStream(filename);

      POIFSFileSystem fs = new POIFSFileSystem(new BufferedInputStream(istream));
      DirectoryEntry directory = fs.getRoot();
      DocumentEntry document = (DocumentEntry)directory.getEntry("Workbook");
      DocumentInputStream dstream = new DocumentInputStream(document); 
      FileOutputStream ostream = new FileOutputStream(outfile);
      byte[] buffer = new byte[1024];
      while (dstream.read(buffer) > 0) {
         ostream.write(buffer);
      }
      ostream.close();
      dstream.close();
      istream.close();
   }

   public static void main(String[] args) {
      try {
         POIFSExample1 example = new POIFSExample1();
         example.run(args[0], args[1]);
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
}

Example 2

import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;

import java.io.*;  // don't ever do this...done to save space

/**
 * Purpose:  Read in the Workbook stream from Example 1 
 * and construct a new and valid Excel file
 * using POIFS.
 * Usage: POIFSExample2 workbook.dat excelfile.xls
 */

public class POIFSExample2 {
   public POIFSExample2() {};
   public void run(String filename, String outfile) throws IOException {
      InputStream istream = new FileInputStream(filename);
      POIFSFileSystem fs = new POIFSFileSystem();
      DirectoryEntry directory = fs.getRoot();
      directory.createDocument("Workbook", new BufferedInputStream(istream));
      FileOutputStream ostream = new FileOutputStream(outfile);
      fs.writeFilesystem(ostream);
      ostream.close();
      istream.close();
   }

   public static void main(String[] args) {
      try {
         POIFSExample2 example = new POIFSExample2();
         example.run(args[0], args[1]);
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
}

Appendix 2

A Short History of POI (by Andrew C. Oliver)

About a year and a half ago, I was fortunate enough to land a short-term contract doing a small Web-based system that would output a few basic types of reports. Having done the project a few times, I knew what was coming over the horizon: customer mapping (ATT = AT&T = A T&T = AT&T Southeastern Division but AT&T Southeastern Division != ATT Northeastern Division), and ultimately, the big issue -- output to Microsoft Excel. That would be a bit easier on Windows, but on a Solaris server ...?

Fortunately, I knew of an inexpensive product from a small company that could write these files affordably and easily. It would be painful, because it was bound to AWT, which was bound to X libraries, but I was positive we could get around this issue.

Unfortunately, I discovered that the little company had been bought by a larger company and they'd raised the price into the low-five-figure range per processor. I had to sell the idea, so I did some competitive analysis and guessed that it would take six months and two developers to create a basic Excel API for Java. Alternatively, they could upgrade everyone to the latest version of Excel and keep their spreadsheets small enough (so as not crash Excel, as tends to happen with larger spreadsheets) and generate the pseudo-HTML XML format that Excel supported at the time. As you might guess, the client bought the package.

Well, I got to thinking, "once this project is over, what if I did this Excel project?" Even better, I was enamored with an Apache project called Cocoon. What if this was a serializer for Cocoon?

I soon discovered that it was not as simple as writing out Excel file format. I would also need to learn about a base file format called OLE 2 Compound Document Format.

I was fortunate enough to have the ever-so-talented Marc Johnson as a member of my local Triangle Java Users Group. He responded to an inquiry of mine, and six months later we had four developers, basic OLE 2 Compound Document Format, and could read and write basic Excel files. Shortly thereafter, we were the newest addition to the Apache Jakarta project. Since then, we've been joined by a number of talented developers. The rest, as they say, is history!


Return to ONJava.com.