ONJava.com -- The Independent Source for Enterprise Java



Using Lucene to Search Java Source Code

Lucene has four different types of fields, which can be specified for optimal index creation: Keyword, UnIndexed, UnStored, and Text.

  • Keyword fields are those that are not parsed by the analyzer, but are indexed and stored in the index. JavaSourceCodeIndexer uses this field to store import declarations.
  • UnIndexed fields are neither analyzed nor indexed, but their values are stored in the index, word for word. The Java file name is stored in a field of this type, since we want to record the location of the file but would rarely search for keywords in the file name.
  • UnStored fields are the opposite of UnIndexed fields. Fields of this type are analyzed and indexed, but are not stored in the index. The source code of a method is indexed as an UnStored field, because storing every line of code would require a large amount of space. The code can be retrieved directly from the original Java file when needed, which keeps the index small.
  • Text fields are analyzed, indexed, and stored in the index. The class name is stored as a Text field.

The fields used by JavaSourceCodeIndexer are summarized in the following table:

Field                    Type
Class Name               Text
Import Declarations      Keyword
Method Name              Text
Method Block (Code)      UnStored
File Name                UnIndexed
Method Parameter Type    Text
Return Type              Text
Comments                 UnStored
Extends Class            Text
Implements               Text
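
The four field types differ only in three properties: whether the value is analyzed, whether it is indexed, and whether it is stored. The following is a minimal sketch of that idea in plain Java. It does not use the Lucene API; the names FieldKind, FieldTypeDemo, and indexerFields are hypothetical and exist only for illustration, and the mapping is copied from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the four field types: each is a combination of three flags. */
enum FieldKind {
    KEYWORD(false, true, true),    // not analyzed, indexed, stored
    UNINDEXED(false, false, true), // not analyzed, not indexed, stored
    UNSTORED(true, true, false),   // analyzed, indexed, not stored
    TEXT(true, true, true);        // analyzed, indexed, stored

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }
}

class FieldTypeDemo {
    /** Field-to-type mapping taken from the summary table (illustrative helper). */
    static Map<String, FieldKind> indexerFields() {
        Map<String, FieldKind> fields = new LinkedHashMap<>();
        fields.put("Class Name", FieldKind.TEXT);
        fields.put("Import Declarations", FieldKind.KEYWORD);
        fields.put("Method Name", FieldKind.TEXT);
        fields.put("Method Block (Code)", FieldKind.UNSTORED);
        fields.put("File Name", FieldKind.UNINDEXED);
        fields.put("Method Parameter Type", FieldKind.TEXT);
        fields.put("Return Type", FieldKind.TEXT);
        fields.put("Comments", FieldKind.UNSTORED);
        fields.put("Extends Class", FieldKind.TEXT);
        fields.put("Implements", FieldKind.TEXT);
        return fields;
    }

    public static void main(String[] args) {
        // Print each field with the flags implied by its type.
        for (Map.Entry<String, FieldKind> entry : indexerFields().entrySet()) {
            FieldKind kind = entry.getValue();
            System.out.printf("%-24s %-10s analyzed=%b indexed=%b stored=%b%n",
                    entry.getKey(), kind, kind.analyzed, kind.indexed, kind.stored);
        }
    }
}
```

Read this way, the choice for each field comes down to two questions: will users search on it (indexed), and does the search result need to display it without going back to the source file (stored)?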

The indexes created by Lucene can be viewed and modified using Luke, a useful open source tool for understanding indexes. Luke's snapshot of the indexes created by JavaSourceCodeIndexer is shown in Figure 1.

Figure 1. Snapshot of indexes in Luke

As you can see, the import declarations are stored as is, without tokenizing or analyzing. The class names and method names are converted to lower case and stored.
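
The difference between the two treatments can be sketched in a few lines of plain Java. This is a simplified illustration, not Lucene's analyzer or the analyzer used by JavaSourceCodeIndexer: the class TermSketch and its methods are hypothetical, and the text-style path simply splits on non-alphanumeric characters and lowercases each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TermSketch {
    /** Keyword-style handling: the whole value becomes one term, unchanged. */
    static List<String> keywordTerms(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value);
        return terms;
    }

    /** Text-style handling (simplified assumption): tokenize, then lowercase each token. */
    static List<String> textTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // An import declaration kept as a single, untouched term.
        System.out.println(keywordTerms("org.apache.lucene.document.Field"));
        // A class name lowercased before it is indexed.
        System.out.println(textTerms("JavaSourceCodeIndexer"));
    }
}
```

One practical consequence of keyword-style handling is that a query against such a field has to match the stored value exactly, including case, whereas a field that is lowercased at index time can be matched case-insensitively if the query is lowercased the same way.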
