ONJava.com -- The Independent Source for Enterprise Java

Regular Expressions in J2SE

by Hetal C. Shah
11/26/2003

In Java applications that do text searching and manipulation, the StringTokenizer and String classes are used heavily. This can often result in complex code and lead to a maintenance nightmare.

JDK 1.4 supports regular expressions through the java.util.regex package. Using this package and its supporting classes makes string searching and manipulation much easier. It reduces development effort and, at the same time, makes code significantly easier to maintain. Since the classes in this package are a standard part of core Java, they don't have to be distributed separately and can be assumed to be present. At the end of the article, we will see how regular expressions simplify a task such as email validation.

Applications that rely on String and StringTokenizer often look for an occurrence of a particular character or token in a String, extract the text surrounding it, and then validate the extracted text. A simple example is validation of a web site URL or an email address. To validate an email address, we could check for an occurrence of '@' followed, somewhere later in the string, by at least one '.'. This logic might be implemented in Java as shown below.

String str = "administrator@admin.com";
// Position of the first '@'; must not be the first character.
int indexOfAtChar = str.indexOf("@");

if (indexOfAtChar > 0)
{
    // Look for a '.' somewhere after the '@'.
    int indexOfDotChar = str.indexOf(".", indexOfAtChar);
    if (indexOfDotChar > 0)
    {
        System.out.println("Valid Email Address.");
    }
    else
    {
        System.out.println("Invalid Email Address - " +
                           "Missing character '.' after '@'.");
    }
}
else
{
    System.out.println("Invalid Email Address - " +
                       "Missing character '@'.");
}

This produces the output:

Valid Email Address.

Regular expressions have been around in the software industry for a number of years. They have been heavily used in:

  • Parsing
  • Data validation
  • String manipulation
  • Data extraction and report generation

Many programming languages and operating system tools support regular expressions, such as:

  • Unix tools - grep, awk, sed
  • Programming editors - vi, textpad
  • Scripting languages - Perl, Python, JavaScript
  • ColdFusion Studio

This article explains the benefits of writing regular expressions using the java.util.regex package, and how to use its key components.

What Is a Regular Expression?

First of all, let's define a regular expression in simple terms: a regular expression is a pattern, a template, to be matched against a string.

Users of a command-line operating system like DOS or Unix often use a directory listing command to find a list of files in a directory. On DOS, this would be:

dir *.txt

And on Unix, it would be:

ls *.txt

Here "*.txt" is a command parameter to display the list of files with file extension 'txt', irrespective of file name.

Now, say we want to see a list of files whose names begin with 'a'; then the DOS command will be

dir a*.*

and the Unix command will be

ls a*.*

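Here "a*.*" means a filename starting with 'a', followed by any number of characters, followed by the character '.', followed by any file extension.

These wildcard patterns are not regular expressions in the strict sense (in a regular expression, '*' and '.' have different meanings, as we will see), but they illustrate the same idea: a short pattern that describes a whole set of strings.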
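For comparison, the two wildcard patterns above can be written in regular expression syntax. The following is a minimal sketch, not a general wildcard translator: the class name WildcardAsRegex and its helper methods are hypothetical, and the two regular expressions are one reasonable reading of "*.txt" and "a*.*".

```java
import java.util.regex.Pattern;

class WildcardAsRegex {
    // Counterpart of "*.txt": any characters, a literal dot, then "txt".
    static final Pattern TXT_FILES = Pattern.compile(".*\\.txt");
    // Counterpart of "a*.*": 'a', any characters, a literal dot, any extension.
    static final Pattern A_FILES = Pattern.compile("a.*\\..*");

    // True if the whole file name fits the "*.txt" shape.
    static boolean isTxtFile(String name) {
        return TXT_FILES.matcher(name).matches();
    }

    // True if the whole file name fits the "a*.*" shape.
    static boolean startsWithA(String name) {
        return A_FILES.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTxtFile("notes.txt"));    // true
        System.out.println(isTxtFile("notes.doc"));    // false
        System.out.println(startsWithA("apple.txt"));  // true
        System.out.println(startsWithA("banana.txt")); // false
    }
}
```

Note that the dot has to be escaped to stand for a literal '.', and that '*' on its own is not enough: it has to follow something, here the '.' that stands for "any character".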

Regular Expression Grammar Rules

Before we jump into how to write regular expression code using the java.util.regex package, let's first have a brief look at regular expression syntax in general.

In its simplest form, a regular expression is just a word or phrase to search for. For example, the regular expression 'John' would be found in any string that contains 'John'. Strings like 'John', 'AJohn', and ' Decker John' would all match. Note that matching is case-sensitive by default.
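A literal search of this kind might look as follows in Java. This is a small sketch; the class name LiteralSearch and the method containsJohn are hypothetical. It uses Matcher.find(), which looks for the pattern anywhere in the input, rather than Matcher.matches(), which requires the entire input to match.

```java
import java.util.regex.Pattern;

class LiteralSearch {
    // The simplest kind of regular expression: a literal word.
    static final Pattern JOHN = Pattern.compile("John");

    // True if "John" occurs anywhere in the text (case-sensitive).
    static boolean containsJohn(String text) {
        return JOHN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsJohn("John"));         // true
        System.out.println(containsJohn(" Decker John")); // true
        System.out.println(containsJohn("Johnson"));      // true
        System.out.println(containsJohn("john"));         // false: case differs
    }
}
```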

In regular expressions, some characters serve special purposes. These are called metacharacters. For instance, '.' matches any single character except a line terminator, and the quantifier '*' matches zero or more occurrences of the element that precedes it, so '.*' matches any sequence of characters. Hence, the regular expression '.ine' matches any four-character string that ends with 'ine', including 'line' and 'nine'.
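The behavior of '.' can be checked with a short program. This sketch assumes hypothetical names (DotDemo, isFourCharIne, endsWithIne) and uses Matcher.matches(), so the whole input must fit the pattern. The second pattern, '.*ine', combines '.' with '*' to allow any number of leading characters.

```java
import java.util.regex.Pattern;

class DotDemo {
    // Exactly one arbitrary character followed by "ine".
    static final Pattern DOT_INE = Pattern.compile(".ine");
    // Any number of arbitrary characters (including none) followed by "ine".
    static final Pattern ANY_INE = Pattern.compile(".*ine");

    static boolean isFourCharIne(String s) {
        return DOT_INE.matcher(s).matches();
    }

    static boolean endsWithIne(String s) {
        return ANY_INE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFourCharIne("line"));  // true
        System.out.println(isFourCharIne("nine"));  // true
        System.out.println(isFourCharIne("shine")); // false: five characters
        System.out.println(endsWithIne("shine"));   // true
    }
}
```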

But what if you want to search for a string that contains a period, say, a reference to the value of pi? The following regular expression would not work as intended:

3.141592

This would indeed match "3.141592", but it would also match "3x141592" and "38141592", because the unescaped '.' matches any character. To get around this, we can use another metacharacter, the backslash (\). The backslash indicates that the character immediately to its right is to be taken literally. Thus, to search for the string "3.141592", we would use:

3\.141592
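In Java source code, the backslash is also the escape character for string literals, so the regular expression above is written with a doubled backslash. A brief sketch, with the hypothetical class name EscapedDot and hypothetical method names:

```java
import java.util.regex.Pattern;

class EscapedDot {
    // Unescaped: the '.' matches any single character.
    static final Pattern LOOSE = Pattern.compile("3.141592");
    // Escaped: "\\." in Java source is the regex \. (a literal period).
    static final Pattern STRICT = Pattern.compile("3\\.141592");

    static boolean looseMatch(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strictMatch(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looseMatch("3x141592"));  // true
        System.out.println(strictMatch("3x141592")); // false
        System.out.println(strictMatch("3.141592")); // true
    }
}
```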

Regular Expressions in JDK 1.4

The entire regular expression support is contained in the package java.util.regex and is made up of the following two main classes:

  • java.util.regex.Pattern
    An instance of this class is a compiled representation of a regular expression that was specified as a string.
  • java.util.regex.Matcher
    An engine that performs match operations on a character sequence by interpreting a Pattern.

A typical implementation of text searching and/or manipulation using the java.util.regex package is divided into three steps.

  1. Compile the regular expression into an instance of Pattern.
  2. Use the Pattern object to create a Matcher object.
  3. Use the Matcher object to search and/or manipulate the character sequence.

A typical invocation sequence might look like the following example, which uses a regular expression to match 'cats', followed by any number of characters, followed by 'dogs':

Pattern pat=Pattern.compile("cats.*dogs");
Matcher matcher=pat.matcher("cats and dogs");
boolean flag=matcher.matches();
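Wrapped in a class, the same three steps can be run as a complete program. This is a sketch under assumptions: the class name ThreeSteps and the methods wholeInputMatches and foundSomewhere are hypothetical. It also contrasts matches(), which succeeds only if the entire input fits the pattern, with find(), which succeeds if any part of the input does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeSteps {
    // Step 1: compile the regular expression once.
    static final Pattern PAT = Pattern.compile("cats.*dogs");

    // Steps 2 and 3: create a Matcher and test the entire input.
    static boolean wholeInputMatches(String input) {
        Matcher matcher = PAT.matcher(input);
        return matcher.matches();
    }

    // Steps 2 and 3 again, but searching for the pattern anywhere in the input.
    static boolean foundSomewhere(String input) {
        Matcher matcher = PAT.matcher(input);
        return matcher.find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("cats and dogs"));        // true
        System.out.println(wholeInputMatches("I like cats and dogs")); // false
        System.out.println(foundSomewhere("I like cats and dogs"));    // true
    }
}
```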

We will look at each of the above methods in detail in the next few sections.

Creating Patterns

The Pattern class provides an overloaded static factory method compile() to create Pattern instances.

  • static Pattern compile(String regex)
    Compiles the given regular expression regex into a Pattern.
  • static Pattern compile(String regex, int flags)
    Compiles the given regular expression regex into a pattern with the given flags, where flags is a bit mask of behavior flags, as described below.

Flags

In the java.util.regex package, text matching is case-sensitive by default, and when case-insensitive matching is enabled, it assumes by default that only US-ASCII characters are being matched. To modify the default behavior, you can pass flags to the compile() method. All flags are static int members of Pattern. To combine behaviors, you can bitwise-OR flags together with the "|" operator.

Flag              Purpose
CANON_EQ          Enables canonical equivalence in the search.
CASE_INSENSITIVE  Enables case-insensitive matching.
COMMENTS          Permits whitespace and comments in the pattern. When this flag is set, whitespace is ignored, and embedded comments starting with # are ignored until the end of the line.
DOTALL            By default, the metacharacter '.' does not match a line terminator; with this flag, it matches any character, including a line terminator.
MULTILINE         Enables multiline mode. In this mode, the metacharacters '^' and '$' match just after and just before a line terminator, respectively, as well as at the beginning and end of the input sequence.
UNICODE_CASE      When specified along with the CASE_INSENSITIVE flag, makes case-insensitive matching consistent with the Unicode Standard.
UNIX_LINES        Enables Unix lines mode, in which only '\n' is recognized as a line terminator in the behavior of '.', '^', and '$'.
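As an illustration of combining flags, the following sketch compiles the same regular expression with and without CASE_INSENSITIVE and DOTALL. The class name FlagsDemo and its method names are hypothetical; the flags and the compile(String, int) overload are those of java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Default behavior: case-sensitive, '.' does not match a line terminator.
    static final Pattern DEFAULT = Pattern.compile("cats.*dogs");
    // Two flags combined with the bitwise OR operator.
    static final Pattern RELAXED =
        Pattern.compile("cats.*dogs", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matchesDefault(String input) {
        return DEFAULT.matcher(input).matches();
    }

    static boolean matchesRelaxed(String input) {
        return RELAXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "CATS\nand DOGS";
        System.out.println(matchesDefault(input)); // false
        System.out.println(matchesRelaxed(input)); // true
    }
}
```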

Creating Matchers

Once we have a compiled Pattern, we call its matcher(CharSequence) method to create a Matcher.

  • Matcher matcher(CharSequence input)
    Creates and returns a Matcher that will match the given input against this pattern.

java.lang.CharSequence is an interface to represent a readable sequence of characters. The String, StringBuffer, and CharBuffer classes implement this interface. Typically, we pass Strings to the matcher method:

Pattern pat=Pattern.compile("cats.*dogs");
Matcher matcher=pat.matcher("cats and dogs");
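Because matcher() accepts any CharSequence, the input does not have to be a String. A minimal sketch, with the hypothetical class name CharSequenceDemo and method matchesCatsDogs, passing a String, a StringBuffer, and a CharBuffer to the same pattern:

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

class CharSequenceDemo {
    static final Pattern PAT = Pattern.compile("cats.*dogs");

    // Accepts any CharSequence implementation, not only String.
    static boolean matchesCatsDogs(CharSequence input) {
        return PAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        StringBuffer buffer = new StringBuffer("cats");
        buffer.append(" and dogs");
        System.out.println(matchesCatsDogs("cats and dogs"));                  // true
        System.out.println(matchesCatsDogs(buffer));                           // true
        System.out.println(matchesCatsDogs(CharBuffer.wrap("cats and dogs"))); // true
    }
}
```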
