ONJava.com    
 Published on ONJava.com (http://www.onjava.com/)
 See this if you're having trouble printing code examples


Reading and Writing Excel Files with POI

by Avik Sengupta, Andrew Oliver and Rainer Klute
04/16/2003

In our previous article, we discussed what the POI project is all about, showed how to read and write OLE 2 Compound Document files, and gave a brief history of the POI project. Probably half of the folks who read that article are scratching their heads now, thinking "How do I write out a spreadsheet?" Good news! In this article we'll show you how to read and write Excel files and how to read any Microsoft file's document properties.

Conventions

The POI project is nearing a 2.0 release and is in a stage of rapid development, with new features and changes integrating nightly. In order to keep this article relevant, we'll refer to the recent 1.9-development release. While compiling the sources with later releases should work, there may be minor changes between now and the 2.0 release.

Excel Basics

The Microsoft Excel 97 file format is also known as "BIFF8." Recent versions of Excel have changed very little about this file format, and writing out the new intricacies would serve no purpose other than to make everyone upgrade. So when we say Excel 97 format, we mean Excel 97-to-XP format.

Related Reading

Head First Java
Your Brain on Java - A Learner's Guide
By Bert Bates, Kathy Sierra

HSSF

Our implementation of the Excel 97 file format is called "HSSF," which stands for, you guessed it, Horrible SpreadSheet Format. (We admire their method of making simple things complicated and oversimplifying things that should have been done with more flexibility.) HSSF may have a comical name, but is a very serious API. HSSF lets you read, write, and modify Excel files using nothing but Java.

What does HSSF have to do with POIFS, which was covered in the previous article? Like all of the other POI APIs, it is built on top of POIFS. So there is code contained within HSSF that is very similar to the examples included with the previous article. Generally, however, you don't need to know about POIFS APIs when writing to HSSF APIs.

HSSF APIs

HSSF has two APIs for reading: usermodel and eventusermodel. The former is most familiar, and the latter is more cryptic but far more efficient. The usermodel consists primarily of the classes in the org.apache.poi.hssf.usermodel package, as well as org.apache.poi.hssf.eventusermodel. (In earlier versions of HSSF, this was in the eventmodel package.) The usermodel package maps the file into familiar structures like Workbook, Sheet, Row, and Cell. It stores the entire structure in memory as a set of objects. The eventusermodel package requires you to become more familiar with the actual low-level structures of the file format. It operates in a manner similar to XML's SAX APIs or the AWT event model (the origin of the name)--and can be trickier to use. It is also read-only, so you cannot modify files using the eventusermodel.

Reading Using usermodel

Reading files using the HSSF usermodel is simple. Create a new inputstream and construct an instance of HSSFWorkbook.

InputStream myxls = new FileInputStream("workbook.xls"));
HSSFWorkbook wb     = new HSSFWorkbook(myxls);

With the HSSFWorkbook instance, you can now retrieve a sheet, its rows, and its cells:

HSSFSheet sheet = wb.getSheetAt(0);       // first sheet
HSSFRow row     = sheet.getRow(2);        // third row
HSSFCell cell   = row.getCell((short)3);  // fourth cell

This fetches the fourth cell of the third row from the first sheet in the workbook. You can retrieve a value from the cell object. Be sure to note the cell type before retrieving its value.

if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
	("The Cell was a String with value " + cell.getStringCellValue());
} else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
	("The cell was a number " + cell.getCellValue());
} else () {
	("The cell was nothing we're interested in");
}

Note that you may get an exception if you try and ask for an unrelated datatype.

Handle dates carefully while using HSSF. Excel stores all dates as numbers, internally. The only way to distinguish a date is by the formatting of the cell. (If you have ever formatted a cell containing a date in Excel, you will know what I mean.)

Therefore, for a cell containing a date, cell.getCellType() will return HSSFCell.CELL_TYPE_NUMERIC. However, you can use a utility function, HSSFDateUtil.isCellDateFormatted(cell), to check if the cell can be a date. This function checks the format against a few internal formats to decide the issue, but by its very nature it is prone to false negatives.

Appendix 1 contains a complete example of using HSSF to create and return a workbook from a servlet.

Example 1 reads in an Excel spreadsheet and converts it to comma-separated values.

Writing Files Using usermodel

Writing XLS files is even simpler. Create a new instance of HSSFWorkbook. At some point, you'll need to create an outputstream to write out the file to disk; however, this can be done at the end.

HSSFWorkbook wb          = new HSSFWorkbook();
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();

To create sheets and rows, you must do so from the parent object. For instance:

HSSFSheet sheet = wb.createSheet();
HSSFRow row     = sheet.createRow((short)0); 
HSSFCell cell   = row.createCell((short)0); 
cell.setCellValue(1); 
row.createCell((short)1).setCellValue(1.2); 
row.createCell((short)2).setCellValue("This is a string");
row.createCell((short)3).setCellValue(true);

To style a cell, create a style and assign it to the cell. Assign that style to as many cells as should be styled in that fashion. This confuses a number of new users of HSSF, as it is currently possible to create an invalidly large number of styles. Ideally, you want to create one style for the same rules that you assign it. If you have a summary row and want to make its cells bold and underlined, create a summaryRowStyle and assign it to any cell that is on a summary row.

It is important to realize that the CellFormat and CellStyle objects are members of the workbook that are referenced by the cell.

...

HSSFCellStyle style = workbook.createCellStyle();
style.setDataFormat(HSSFDataFormat.getBuiltinFormat("($#,##0_);[Red]($#,##0)"));
style.setFillBackgroundColor(HSSFColor.AQUA.index);
style.setFillPattern(HSSFCellStyle.BIG_SPOTS);

...

someCell.setCellStyle(style);
someOtherCell.setCellStyle(style);

Newer editions of HSSF allow you use a limited set of formulas. Note that this is a beta-quality feature. You should do appropriate testing before using it.

A formula assignment looks like this:

someCell.setCellFormula(SUM(A1:A2:);

At present, you can use any built-in function or operator in the formula, except logical operators and functions (such as the IF function). This feature is being worked on even as we write this.

Modifying Files Using usermodel

Files are modified by reading and rewriting. Example 3 takes in an Excel file and changes the values based on rules specified in a .property file.

Reading Files with the eventusermodel

Reading files via the eventusermodel is much harder but is much more memory-efficient, since it expects the application to process the data as it is read. Reading in a file in this way is analogous to reading an XML document using SAX, as opposed to DOM. You have to register your interest in the structures you want, and the library will call back when such structures are encountered. Before using this model, however, you must familiarize yourself with some basic structures in an Excel workbook.

In HSSF, the low-level binary structures are called records. Records come in different types, each of which is modelled as a Java class in the org.apache.poi.hssf.record package. For example, the BOFRecord denotes the start of the Workbook or Sheet section. The RowRecord denotes that a row exists and stores its style information. Any record exposing the CellValueRecordInterface is a cell value. These include NumericRecord, LabelSSTRecord, and FormulaRecord. (There are a few more that are obsolete or for optimization, but generally speaking, HSSF converts them.)

See The Structure of an HSSF Spreadsheet for more details.

private EventRecordFactory factory = new EventRecordFactory();
factory.registerListener(new ERFListener() {
    public boolean processRecord(Record rec) {
        (got BOF Record);              
        return true;              
    }
}, new short[] {BOFRecord.sid});
factory.processRecords(someInputStream);

The Structure of an HSSF Spreadsheet

HSSF, as we mentioned, is built on top of POIFS. More to the point, an Excel 97+ file is an OLE 2 Compound Document. The underlying OLE 2 Compound Document stores a stream, or File. This "stream" is always named Workbook (except in Excel 95, which HSSF does not handle). Currently, HSSF deals strictly with this stream and delegates all functionality to POIFS. In recent builds, HSSF is aware enough to preserve these other nodes. Macros and Images are stored in separate streams and sometimes even separate directories within the OLE 2 CDF file. Macros should be preserved; however, we have no API as of yet to handle them.

Within each stream is a set of records. A record is just an array of bytes, with a header and a body. The header contains the record type (AKA the "id") and the length of the following data. The body is broken up into fields. Fields contain numeric data (including references to other records), character data, or flags.

The following is the top-level structure of an Excel Workbook:

Bla.xls {
    OLE2CDF headers
    "Workbook" stream {
        Workbook {
            Static String Table Record..
            Sheet names... and pointers
        } 
        Sheet {
            ROW
            ROW
            ...
            NUMBER RECORD (cell)
            LABELSST Record (cell)
            ...
        }
        Sheet
    }
}
... images, macros, etc.
Document Summary
Summary

Reading Document Properties with HPSF

Users of Microsoft Word, Excel, or PowerPoint can attach additional information to their documents using the File->Properties... menu. Properties include a document's title, a summary, a category, and keywords. The application itself adds further information: the last author, the date and time of the last modification, the date and time the document was printed, and so on.

Properties are stored separately from the document itself. As you know, an OLE 2 CDF file is a container consisting of directories and files internally, and POIFS gives you access to these files. The files are also called streams. Document properties are stored in streams of their own within a POIFS filesystem. Consider a Word document: while you see a simple file named myfile.doc on your hard disk, internally, it consists of a file called WordDocument and two files called SummaryInformation and DocumentSummaryInformation. There are usually other files, but we won't care about them here.

Can you guess what these streams contain? Yes, WordDocument contains the text you edited with Word. The document properties are stored in the SummaryInformation and DocumentSummaryInformation streams. It would be too simple to have all of the properties in a single stream, so Microsoft decided to use two of them. To make things even more complicated, these streams' names start with the octal \005 character. This is a non-printable character and is omitted above.

The nice thing with the standard Microsoft-defined properties is that they don't care whether the main document is a Word document, an Excel file, a PowerPoint presentation, or some other document. Once you know how to read the properties of an Excel file, you can retrieve the properties of all other documents, too.

Luckily, you don't have to figure out too much, because your Java program can extract these information using POI's HPSF package. HPSF stands for Horrible Property Set Format and is POI's implementation for reading properties. Writing is not yet supported.

HPSF has a simple API for reading Microsoft's standard properties and a more complicated but general API for reading any sort of property sets. We'll focus on the simple API here because it should suffice for most applications.

A Sample Application

Let's have a look at a sample Java program that reads the "title" property of an OLE 2 CDF document:

import java.io.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;

/**
 * <p>Sample application showing how to read a OLE 2 document's
 * title. Call it with the document's file name as command line
 * parameter.</p>
 *
 * @author Rainer Klute (klute@rainer-klute.de)
 */

public class ReadTitle
{
    public static void main(String[] args) throws IOException
    {
        final String filename = args[0];
        POIFSReader r         = new POIFSReader();
        r.registerListener(new MyPOIFSReaderListener(),
            "\005SummaryInformation");
        r.read(new FileInputStream(filename));
    }

    static class MyPOIFSReaderListener implements POIFSReaderListener
    {
        public void processPOIFSReaderEvent(POIFSReaderEvent event)
        {
            SummaryInformation si = null;
            try
            {
                si = (SummaryInformation)
                PropertySetFactory.create(event.getStream());
            }
            catch (Exception ex)
            {
                throw new RuntimeException
                    ("Property set stream \"" + event.getPath() +
                        event.getName() + "\": " + ex);
            }

            final String title = si.getTitle();

            if (title != null)
                System.out.println("Title: \"" + title + "\"");
            else
                System.out.println("Document has no title.");
            }
        }
}

The main() method uses POIFS' eventing filesystem to read a stream named \005SummaryInformation from the OLE 2 document named on the command line. When the POIFSReader encounters this stream, it transfers control to the processPOIFSReaderEvent() method of MyPOIFSReaderListener. You've already learned the details of this mechanism from the first article of this series.

What does processPOIFSReaderEvent() do? As a parameter, it essentially gets an input stream containing some of the document's properties, including the title. To access the properties, the application must create a PropertySet instance from the contents of the input stream, as in the following statement:

si = (SummaryInformation) PropertySetFactory.create(event.getStream());

The statement consists of three steps:

Since several things can go wrong, the statement is encapsulated in a try block. The sample application simply catches all exceptions. A production program will likely differentiate between several possible causes for exceptions. Besides any I/O exceptions, an HPSF-specific exception could occur. For example, the NoPropertySetStreamException is thrown if the input stream does not contain a property set or if the property set is faulty.

A quite unlikely but still possible error is that the \005SummaryInformation stream contains a valid property set, but not the summary information property set. If this should ever happen, the cast to SummaryInformation is not possible and a ClassCastException is thrown.

Once the SummaryInformation instance is established, the rest is simple. The application just calls getTitle() and prints the result.

Besides getTitle(), SummaryInformation contains other convenience methods, including getApplicationName(), getAuthor(), getCharCount(), and getCreateDateTime(). The HPSF Javadoc documentation describes them in more detail.

The Document Summary Information

Unfortunately, not all properties reside in the summary information property set. Many, but not all, OLE 2 files have an additional property set. It is called the document summary information and resides in a stream named \005DocumentSummaryInformation. This property set holds properties like the document's category, the number of multimedia clips in a PowerPoint presentation, and more.

To access this property set, your application should proceed as shown above, except that you should register for the \005DocumentSummaryInformation stream. You probably want to register for both summary information and document summary information. As before, you should also hand over the input stream containing the document summary information to PropertySetFactory.create(). However, the factory method will return a DocumentSummaryInformation object, not a SummaryInformation instance. If you registered for both types of property sets, you should check the type of the returned object. Either use Java's instanceof operator or the query methods isSummaryInformation() and isDocumentSummaryInformation(). Remember that create() always returns a PropertySet, so you can call these methods on it. The PropertySet class provides two methods, because the property set could be custom-defined .

More Property Sets

If you want to deal with custom-defined property sets, or if you want to read user-defined properties from the standard property sets, you will have to use the general property set API. This is more complicated than the API outlined above and is not covered by this article. To learn about it, you should read the HPSF HOW-TO in the POI documentation and consult the Javadocs.

Conclusion

This article covered HSSF and how to output to Excel. We also covered HPSF and how to read property sets and document summary information. Next time, we'll show how to transform from XML to Excel format using the HSSF Serializer and Cocoon, as well as the current HDF and Word format.

References

Avik Sengupta is a committer on the Jakarta POI project. He is currently Director of Technology at Itellix Software Solutions.

Andrew Oliver is a professional cat herder who moonlights as a software developer, and runs a small consultancy with an ugly retro-looking web site that one day will be updated when he and his compadres aren't too busy working with clients.

Rainer Klute is a committer on the Jakarta POI project.

Appendix 1

Example 1. To create and return a workbook from a servlet

package org.apache.poi.hssf.usermodel.examples;

import java.io.*;
import java.net.*;
import javax.servlet.*;
import javax.servlet.http.*;
import org.apache.poi.hssf.usermodel.*;

public class HSSFCreate extends HttpServlet {
    public void init(ServletConfig config) throws ServletException {
        super.init(config);  
    }

    public void destroy() {
    }

    /** Processes requests for both HTTP GET and POST methods.
     * @param request servlet request
     * @param response servlet response
     */

    protected void processRequest(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, IOException {

        response.setContentType("application/vnd.ms-excel");
        HSSFWorkbook wb = new HSSFWorkbook();
        HSSFSheet sheet = wb.createSheet("new sheet");

        // Create a row and put some cells in it. Rows are 0 based.
        HSSFRow row     = sheet.createRow((short)0);

        // Create a cell and put a value in it.
        HSSFCell cell   = row.createCell((short)0);

        cell.setCellValue(1);

        // Or do it on one line.
        row.createCell((short)1).setCellValue(1.2);
        row.createCell((short)2).setCellValue("This is a string");
        row.createCell((short)3).setCellValue(true);
        // Write the output 
        OutputStream out = response.getOutputStream();
        wb.write(out);
        out.close();
    }

    /** Handles the HTTP <code>GET</code> method.
     * @param request servlet request
     * @param response servlet response
     */

    protected void doGet(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, IOException {
        processRequest(request, response);
    }

    /** Handles the HTTP POST method.
     * @param request servlet request
     * @param response servlet response
     */

    protected void doPost(HttpServletRequest request,
        HttpServletResponse response) throws ServletException, IOException {
        processRequest(request, response);
    }

    /** Returns a short description of the servlet.
     */

    public String getServletInfo() {
       return "Example to create a workbook in a servlet using HSSF";
    }
}

Return to ONJava.com.

Copyright © 2009 O'Reilly Media, Inc.