ONJava.com -- The Independent Source for Enterprise Java



Advanced Text Indexing with Lucene

by Otis Gospodnetic

Lucene Index Structure

Lucene is a free text-indexing and -searching API written in Java. To appreciate indexing techniques described later in this article, you need a basic understanding of Lucene's index structure. As I mentioned in the previous article in this series, a typical Lucene index is stored in a single directory in the filesystem on a hard disk.

The core elements of such an index are segments, documents, fields, and terms. Every index consists of one or more segments. Each segment contains one or more documents. Each document has one or more fields, and each field contains one or more terms. Each term is a pair of Strings representing a field name and a value. A segment consists of a series of files. The exact number of files that constitute each segment varies from index to index, and depends on the number of fields that the index contains. All files belonging to the same segment share a common prefix and differ in the suffix. You can think of a segment as a sub-index, although each segment is not a fully-independent index.

-rw-rw-r--    1 otis     otis            4   Nov 22 22:43 deletable
-rw-rw-r--    1 otis     otis      1000000   Nov 22 22:43 _lfyc.f1
-rw-rw-r--    1 otis     otis      1000000   Nov 22 22:43 _lfyc.f2
-rw-rw-r--    1 otis     otis     31030502   Nov 22 22:28 _lfyc.fdt
-rw-rw-r--    1 otis     otis      8000000   Nov 22 22:28 _lfyc.fdx
-rw-rw-r--    1 otis     otis           16   Nov 22 22:28 _lfyc.fnm
-rw-rw-r--    1 otis     otis   1253701335   Nov 22 22:43 _lfyc.frq
-rw-rw-r--    1 otis     otis   1871279328   Nov 22 22:43 _lfyc.prx
-rw-rw-r--    1 otis     otis        14122   Nov 22 22:43 _lfyc.tii
-rw-rw-r--    1 otis     otis      1082950   Nov 22 22:43 _lfyc.tis
-rw-rw-r--    1 otis     otis           18   Nov 22 22:43 segments

Example 1: An index consisting of a single segment.

Note that all files that belong to this segment start with a common prefix: _lfyc. Because this index contains two fields, you will notice two files with the fN suffix, where N is a number. If this index had three fields, a file named _lfyc.f3 would also be present in the index directory.
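The prefix convention makes it easy to see which files belong to which segment. The following toy snippet (plain Java, not Lucene code; the class and method names are illustrative) groups the file names from Example 1 by segment prefix:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: group index file names by their segment prefix.
// Segment files start with '_' and share everything before the first
// '.'; "segments" and "deletable" are index-level bookkeeping files.
public class SegmentFiles {
    static Map<String, List<String>> bySegment(String[] fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            int dot = name.indexOf('.');
            if (name.startsWith("_") && dot > 0) {
                groups.computeIfAbsent(name.substring(0, dot),
                                       k -> new ArrayList<>()).add(name);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] files = { "deletable", "_lfyc.f1", "_lfyc.f2", "_lfyc.fdt",
                           "_lfyc.fdx", "_lfyc.fnm", "_lfyc.frq", "_lfyc.prx",
                           "_lfyc.tii", "_lfyc.tis", "segments" };
        // A single segment, _lfyc, owning nine files
        System.out.println(bySegment(files));
    }
}
```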

The number of segments in an index is fixed once the index is fully built, but it varies while indexing is in progress. Lucene adds segments as new documents are added to the index, and merges segments every so often. In the next section we will learn how to control creation and merging of segments in order to improve indexing speed.

For more information about the files that make up a Lucene index, please see the File Formats document on Lucene's web site. You can find the URL in the Reference section at the end of this article.

Indexing Speed Factors

The previous article demonstrated how to index text using the LuceneIndexExample class. Because the example was so basic, there was no need to think about speed. If you are using Lucene in a non-trivial application, you will want to ensure optimal indexing performance. The bottleneck of a typical text-indexing application is the process of writing index files onto a disk. Therefore, we need to instruct Lucene to be smart about adding and merging segments while indexing documents.

When new documents are added to a Lucene index, they are initially stored in memory instead of being immediately written to the disk. This is done for performance reasons. The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene will store 10 documents in memory before writing them to a single segment on the disk. The mergeFactor value of 10 also means that once the number of same-sized segments on the disk reaches 10, Lucene will merge them into a single larger segment. (There is a small exception to this rule, which I shall explain shortly.)

For instance, if we set mergeFactor to 10, a new segment will be created on the disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1000 documents, and so on. Therefore, at any time, there will be no more than 9 segments of each power-of-10 size in the index.

The exception noted earlier has to do with another IndexWriter instance variable: maxMergeDocs. While merging segments, Lucene will ensure that no segment with more than maxMergeDocs is created. For instance, if we set maxMergeDocs to 1000, when we add the 10,000th document, instead of merging multiple segments into a single segment of size 10,000, Lucene will create a 10th segment of size 1000, and keep adding segments of size 1000 for every 1000 documents added.
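The cascade just described can be sketched as a toy model (plain Java, not Lucene internals; segmentsAfter and mergeTail are illustrative names): a new on-disk segment appears every mergeFactor documents, and mergeFactor equal-sized segments are merged into one larger segment unless the merge would exceed maxMergeDocs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of segment creation and merging as described above.
public class SegmentSim {
    // Returns the sizes of on-disk segments after 'docs' documents.
    static List<Integer> segmentsAfter(int docs, int mergeFactor, int maxMergeDocs) {
        List<Integer> segments = new ArrayList<>(); // segment sizes, newest last
        for (int flushed = mergeFactor; flushed <= docs; flushed += mergeFactor) {
            segments.add(mergeFactor);              // flush a new segment
            mergeTail(segments, mergeFactor, maxMergeDocs);
        }
        return segments;
    }

    // Merge the trailing run of mergeFactor equal-sized segments, cascading
    // upward, but never create a segment larger than maxMergeDocs.
    static void mergeTail(List<Integer> segments, int mergeFactor, int maxMergeDocs) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            boolean run = true;
            for (int i = n - mergeFactor; i < n; i++)
                if (segments.get(i) != size) { run = false; break; }
            if (!run || (long) size * mergeFactor > maxMergeDocs) break;
            for (int i = 0; i < mergeFactor; i++) segments.remove(segments.size() - 1);
            segments.add(size * mergeFactor);
        }
    }

    public static void main(String[] args) {
        // With mergeFactor=10 and no cap, 10,000 docs collapse into one segment.
        System.out.println(segmentsAfter(10_000, 10, Integer.MAX_VALUE));
        // With maxMergeDocs=1000, the same 10,000 docs leave ten segments
        // of 1000, matching the example above.
        System.out.println(segmentsAfter(10_000, 10, 1000));
    }
}
```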

The default value of maxMergeDocs is Integer.MAX_VALUE. In my experience, one rarely needs to change this value.

Now that I have explained how mergeFactor and maxMergeDocs work, you can see that using a higher value for mergeFactor will cause Lucene to use more RAM, but will let Lucene write data to disk less frequently, which will speed up the indexing process. A smaller mergeFactor will use less memory and will cause the index to be updated more frequently, which will make it more up-to-date, but will also slow down the indexing process. Similarly, a larger maxMergeDocs is better suited for batch indexing, and a smaller maxMergeDocs is better for more interactive indexing.

To get a better feel for how different values of mergeFactor and maxMergeDocs affect indexing speed, take a look at the IndexTuningDemo class below. This class takes three arguments on the command line: the total number of documents to add to the index, the value to use for mergeFactor, and the value to use for maxMergeDocs. All three arguments must be specified, must be integers, and must be in this order. In order to keep the code short and clean, there are no checks for improper usage.

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * Creates an index called 'index' in a temporary directory.
 * The number of documents to add to this index, the mergeFactor and
 * the maxMergeDocs must be specified on the command line
 * in that order - this class expects to be called correctly.
 * Note: before running this for the first time, manually create the
 * directory called 'index' in your temporary directory.
 */
public class IndexTuningDemo
{
    public static void main(String[] args) throws Exception
    {
        int docsInIndex  = Integer.parseInt(args[0]);

        // create an index called 'index' in a temporary directory
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index";

        Analyzer    analyzer = new StopAnalyzer();
        IndexWriter writer   = new IndexWriter(indexDir, analyzer, true);

        // set variables that affect speed of indexing
        writer.mergeFactor   = Integer.parseInt(args[1]);
        writer.maxMergeDocs  = Integer.parseInt(args[2]);

        long startTime = System.currentTimeMillis();
        for (int i = 0; i < docsInIndex; i++)
        {
            Document doc = new Document();
            doc.add(Field.Text("fieldname", "Bibamus, moriendum est"));
            writer.addDocument(doc);
        }
        writer.close();
        long stopTime = System.currentTimeMillis();
        System.out.println("Total time: " + (stopTime - startTime) + " ms");
    }
}

Here are some results:

prompt> time java IndexTuningDemo 100000 10 1000000

Total time: 410092 ms

real    6m51.801s
user    5m30.000s
sys     0m45.280s

prompt> time java IndexTuningDemo 100000 1000 100000

Total time: 249791 ms

real    4m11.470s
user    3m46.330s
sys     0m3.660s

As you can see, both invocations created an index with 100,000 documents, but the first one took much longer to complete. That is because it used the default mergeFactor of 10, which caused Lucene to write documents to the disk more often than the mergeFactor of 1000 used in the second invocation.

Note that while these two variables can help improve indexing performance, they also affect the number of file descriptors that Lucene uses, and can therefore cause the "Too many open files" exception. If you get this error, you should first see if you can optimize the index, as will be described shortly. Optimization may help indexes that contain more than one segment. If optimizing the index does not solve the problem, you could try increasing the maximum number of open files allowed on your computer. This is usually done at the operating-system level and varies from OS to OS. If you are using Lucene on a computer that uses a flavor of the UNIX OS, you can see the maximum number of open files allowed from the command line.

Under bash, you can see the current settings with the built-in ulimit command:

prompt> ulimit -n

Under tcsh, the equivalent is:

prompt> limit descriptors

To change the value under bash, use this:

prompt> ulimit -n <max number of open files here>

Under tcsh, use the following:

prompt> limit descriptors <max number of open files here>

To estimate a setting for the maximum number of open files allowed while indexing, keep in mind that the maximum number of files Lucene will open is (1 + mergeFactor) * FilesPerSegment.

For instance, with a default mergeFactor of 10 and an index of 1 million documents, Lucene will require 110 open files on an unoptimized index. When IndexWriter's optimize() method is called, all segments are merged into a single segment, which minimizes the number of open files that Lucene needs.
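The estimate above is simple arithmetic, sketched here for concreteness (maxOpenFiles is an illustrative name, not a Lucene API; the figure of 10 files per segment is what the 110-file example above implies):

```java
// A quick calculation of the open-file estimate above:
// (1 + mergeFactor) * filesPerSegment. The filesPerSegment value
// depends on the index, since each field adds a .fN file.
public class OpenFileEstimate {
    static int maxOpenFiles(int mergeFactor, int filesPerSegment) {
        return (1 + mergeFactor) * filesPerSegment;
    }

    public static void main(String[] args) {
        // Default mergeFactor of 10, assuming 10 files per segment:
        System.out.println(maxOpenFiles(10, 10));  // prints 110
    }
}
```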
