ONJava.com -- The Independent Source for Enterprise Java

Introduction to Text Indexing with Apache Jakarta Lucene

by Otis Gospodnetic
01/15/2003

What Lucene Is

Lucene is a Java library that adds text indexing and searching capabilities to an application. It is not a complete application that one can just download, install, and run. It offers a simple, yet powerful core API. To start using it, one needs to know only a few Lucene classes and methods.

Lucene offers two main services: text indexing and text searching. These two activities are relatively independent of each other, although indexing naturally affects searching. In this article I will focus on text indexing, and we will look at some of the core Lucene classes that provide text indexing capabilities.

Lucene Background

Lucene was originally written by Doug Cutting and was available for download from SourceForge. It joined the Apache Software Foundation's Jakarta family of open source server-side Java products in September of 2001. With each release since then, the project has enjoyed more visibility, attracting more users and developers. As of November 2002, Lucene version 1.2 has been released, with version 1.3 in the works. In addition to those organizations mentioned on the "Powered by Lucene" page, I have heard of FedEx, Overture, Mayo Clinic, Hewlett Packard, New Scientist magazine, Epiphany, and others using, or at least evaluating, Lucene.

Installing Lucene

Like most other Jakarta projects, Lucene is distributed as pre-compiled binaries or in source form. You can download the latest official release from Lucene's release page. There are also nightly builds, if you'd like to use the newest features. To demonstrate Lucene usage, I will assume that you will use the pre-compiled distribution. Simply download the Lucene .jar file and add its path to your CLASSPATH environment variable. If you choose to get the source distribution and build it yourself, you will need Jakarta Ant and JavaCC, which is available as a free download. Although the company that created JavaCC no longer exists, you can still get JavaCC from the URL listed in the References section of this article.
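One quick way to confirm that the .jar file is actually visible to your JVM is to try loading a Lucene class by name. The following is a minimal sketch, not part of Lucene: the ClasspathCheck class and its isOnClasspath method are hypothetical names chosen for illustration, and the class name being probed is simply the IndexWriter class used later in this article.

```java
/**
 * Hypothetical helper (not part of Lucene): reports whether a class
 * can be loaded from the current classpath.
 */
class ClasspathCheck {

    /** Returns true if the named class can be found by the class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // Load without initializing; we only care whether it resolves.
            Class.forName(className, false,
                          ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lucene = "org.apache.lucene.index.IndexWriter";
        System.out.println(isOnClasspath(lucene)
            ? "Lucene found on the classpath"
            : "Lucene NOT found; check your CLASSPATH setting");
    }
}
```

If the second message is printed, the path to the Lucene .jar file is most likely missing from CLASSPATH or misspelled.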

Indexing with Lucene

Before we jump into code, let's look at some of the fundamental Lucene classes for indexing text. They are IndexWriter, Analyzer, Document, and Field.

IndexWriter is used to create a new index and to add Documents to an existing index.

Before text is indexed, it is passed through an Analyzer. Analyzers are in charge of extracting indexable tokens out of the text to be indexed and discarding the rest. Lucene comes with a few different Analyzer implementations. Some of them skip stop words (frequently used words that don't help distinguish one document from another, such as "a," "an," "the," "in," and "on"), some convert all tokens to lowercase so that searches are not case-sensitive, and so on.
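To make the idea concrete, here is a toy sketch of the two steps just described: lowercasing tokens and dropping stop words. It deliberately does not use Lucene's Analyzer API. The ToyAnalyzer class, its analyze method, and the five-word stop list are assumptions made for illustration, and Lucene's real analyzers do considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Toy illustration of what an analyzer does. This is NOT Lucene's
 * Analyzer API; the class name and stop-word list are assumptions.
 */
class ToyAnalyzer {

    // A tiny, illustrative stop-word list (Lucene's own list differs).
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    /** Splits text into lowercase tokens and drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are not letters or digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [cat, sat, mat]
        System.out.println(analyze("The Cat sat on a Mat"));
    }
}
```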

An index consists of a set of Documents, and each Document consists of one or more Fields. Each Field has a name and a value. Think of a Document as a row in an RDBMS table, and Fields as columns in that row.
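The row-and-column analogy can be sketched with plain Java collections. This is only a mental model, not how Lucene stores anything: ToyIndexModel and its document method are hypothetical names, and the field names and values are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: an index as a list of documents, each a set of named fields. */
class ToyIndexModel {

    /** Builds one "document" from alternating field names and values. */
    static Map<String, String> document(String... namesAndValues) {
        if (namesAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value pairs");
        }
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            doc.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return doc;
    }

    public static void main(String[] args) {
        // The "index": each element plays the role of one row.
        List<Map<String, String>> index = new ArrayList<>();
        index.add(document("title", "Lucene Intro",
                           "body", "Indexing text with Lucene"));
        index.add(document("title", "Ant Basics",
                           "body", "Building Java projects"));

        // Each field name plays the role of a column.
        for (Map<String, String> doc : index) {
            System.out.println(doc.get("title"));
        }
    }
}
```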

Now, let's consider the simplest scenario, where you have a piece of text to index, stored in an instance of String. Here is how you could do it, using the classes described above:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * LuceneIndexExample class provides a simple
 * example of indexing with Lucene.  It creates a fresh
 * index called "index-1" in a temporary directory every
 * time it is invoked and adds a single document with a
 * single field to it.
 */
public class LuceneIndexExample
{
    public static void main(String[] args) throws Exception
    {
        String text = "This is the text to index with Lucene";

        // Build the index path: <temp dir>/index-1
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index-1";
        Analyzer analyzer = new StandardAnalyzer();
        // true = create a new index, overwriting any existing one
        boolean createFlag = true;

        IndexWriter writer =
            new IndexWriter(indexDir, analyzer, createFlag);
        // One document with a single field named "fieldname"
        Document document = new Document();
        document.add(Field.Text("fieldname", text));
        writer.addDocument(document);
        // Closing the writer flushes all index changes to disk
        writer.close();
    }
}

Let's step through the code. Lucene stores its indices in directories on the file system. Each index is contained within a single directory, and multiple indices should not share a directory.

The first parameter in IndexWriter's constructor specifies the directory where the index should be stored. The second parameter provides the implementation of Analyzer that should be used for pre-processing the text before it is indexed. This particular implementation, StandardAnalyzer, eliminates stop words, converts tokens to lowercase, and performs a few other small input modifications, such as eliminating periods from acronyms. The last parameter is a boolean flag that, when true, tells IndexWriter to create a new index in the specified directory, or to overwrite the index in that directory if one already exists. A value of false instructs IndexWriter to add Documents to an existing index instead.

We then create a blank Document and add a Field called fieldname to it, with a value of the String that we want to index. Once the Document is populated, we add it to the index via the instance of IndexWriter. Finally, we close the IndexWriter. This is important, as it ensures that all index changes are flushed to disk.
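Because a create flag of true overwrites an existing index, an application that wants to keep adding to an index across runs has to choose the flag at startup. The sketch below shows one simple heuristic using only java.io.File: treat a missing or empty directory as "no index yet." CreateFlagChooser and shouldCreate are hypothetical names, not part of Lucene, and the heuristic is an assumption; it cannot tell a real index from unrelated files in the directory.

```java
import java.io.File;

/** Hypothetical helper (not part of Lucene) for choosing the create flag. */
class CreateFlagChooser {

    /**
     * Returns true when a new index should be created, that is, when the
     * directory does not exist (or is not a directory) or is empty.
     */
    static boolean shouldCreate(File indexDir) {
        // list() returns null if the path is missing or not a directory.
        String[] entries = indexDir.list();
        return entries == null || entries.length == 0;
    }

    public static void main(String[] args) {
        File indexDir =
            new File(System.getProperty("java.io.tmpdir", "tmp"), "index-1");
        boolean createFlag = shouldCreate(indexDir);
        // The flag would then be passed as the third IndexWriter
        // constructor argument, as in the article's example.
        System.out.println("createFlag = " + createFlag);
    }
}
```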

Analyzers

As I already mentioned, Analyzers are components that pre-process input text. They are also used when searching. Because the search string has to be processed the same way that the indexed text was processed, it is crucial to use the same Analyzer for both indexing and searching. If the two differ, the terms produced from a query may not match the terms stored in the index, and searches can miss documents that should have matched.
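The effect of a mismatch is easy to reproduce with a toy inverted index. The sketch below is not Lucene code: AnalyzerMismatchDemo, its analyze, add, and lookup methods, and the lowercase-and-split "analyzer" are all assumptions made for illustration. Text is lowercased at indexing time, so a query term finds the document only if it is lowercased the same way.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index showing why query and index analysis must agree. */
class AnalyzerMismatchDemo {

    // term -> ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** The toy "analyzer": lowercase, then split on non-alphanumerics. */
    static String[] analyze(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Indexes a document by running its text through analyze(). */
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Looks up one term exactly as given; no analysis is applied here. */
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        AnalyzerMismatchDemo index = new AnalyzerMismatchDemo();
        index.add(1, "Indexing text with Lucene");

        // Query analyzed the same way as the indexed text: prints [1]
        System.out.println(index.lookup(analyze("Lucene")[0]));
        // Query left unanalyzed: "Lucene" is not "lucene", prints []
        System.out.println(index.lookup("Lucene"));
    }
}
```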

The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. Should you need to pre-process input text and queries in a way that is not provided by any of Lucene's Analyzers, you will need to implement a custom Analyzer. If you are indexing text with non-Latin characters, for instance, you will most definitely need to do this.
