ONLamp.com




Five Habits for Successful Regular Expressions

by Tony Stubblebine, author of Regular Expression Pocket Reference
08/21/2003

Regular expressions are hard to write, hard to read, and hard to maintain. Plus, they are often wrong, matching unexpected text and missing valid text. The problem stems from the power and expressiveness of regular expressions. Each metacharacter packs power and nuance, making code impossible to decipher without resorting to mental gymnastics.

Most implementations include features that make reading and writing regular expressions easier. Unfortunately, they are hardly ever used. For many programmers, writing regular expressions is a black art. They stick to the features they know and hope for the best. If you adopt the five habits discussed in this article, you will take most of the trial and error out of your regular expression development.

This article uses Perl, PHP, and Python in the code examples, but the advice here is applicable to nearly any regex implementation.

1. Use Whitespace and Comments

Most programmers have no problem adding whitespace and indentation to the code surrounding a regular expression. They would be laughed at or yelled at if they didn't (hopefully, yelled at). Nearly everyone knows that code is harder to read, write, and maintain if it is crammed into one line. Why would that be any different with regular expressions?

The extended whitespace feature of most regex implementations allows programmers to extend their regular expressions over several lines, with comments at the end of each. Why do so few programmers use this feature? Perl 6 regular expressions will even be in extended whitespace mode by default. Until your language makes extended whitespace the default, turn it on yourself.


The only trick to remember with extended whitespace is that the regex engine ignores whitespace. So if you are hoping to match whitespace, you have to say so explicitly, often with \s.

In Perl, add an x to the end of the regex, so m/foo|bar/ becomes:

m/
  foo
  |
  bar
 /x

In PHP, add an x to the end of the regex, so "/foo|bar/" becomes:

"/
  foo
  |
  bar
 /x"

In Python, pass the mode modifier, re.VERBOSE, to the compile function:

import re

pattern = r'''
 foo
 |
 bar
'''

regex = re.compile(pattern, re.VERBOSE)
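
The whitespace caveat mentioned earlier is easy to trip over. As a minimal sketch in Python (the patterns and the sample string are made up for illustration), a literal space in an extended whitespace pattern is silently dropped, while an explicit \s still matches:

```python
import re

text = "foo bar"

# In verbose mode the literal space between foo and bar is ignored,
# so this pattern is really "foobar" and does not match "foo bar".
ignored_space = re.compile(r"foo bar", re.VERBOSE)

# Saying so explicitly with \s restores the intent.
explicit_space = re.compile(r"foo \s bar", re.VERBOSE)

print(ignored_space.search(text))          # None
print(bool(explicit_space.search(text)))   # True
```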

The value of whitespace and comments becomes more important when working with more complex regular expressions. Consider the following regular expression to match a U.S. phone number:

\(?\d{3}\)? ?\d{3}[-.]\d{4}

This regex matches phone numbers like "(314)555-4000". Ask yourself if the regex would match "314-555-4000" or "555-4000". The answer is no in both cases. Writing this pattern on one line conceals both flaws and design decisions. The area code is required, and the only separator the regex allows between the area code and the prefix is an optional space.
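
If you would rather check those answers than reason them out, a few lines of Python will do it. This is only a sketch; the variable names are illustrative and the pattern is the one-line regex copied unchanged:

```python
import re

# The one-line phone pattern from the text, unchanged.
one_line = re.compile(r"\(?\d{3}\)? ?\d{3}[-.]\d{4}")

for candidate in ["(314)555-4000", "314-555-4000", "555-4000"]:
    result = "matches" if one_line.search(candidate) else "does not match"
    print(candidate, result)
```

Only the first candidate should be reported as a match.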

Spreading the pattern out over several lines makes the flaws more visible and the necessary modifications easier.

In Perl this would look like:

/  
    \(?     # optional parentheses
      \d{3} # area code required
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period.
      \d{3} # 3-digit prefix
    [-.]    # another separator
      \d{4} # 4-digit line number
/x

The rewritten regex now has an optional separator after the area code so that it matches "314-555-4000". The area code is still required. However, a new programmer who wants to make the area code optional can quickly see that it is not optional now, and that a small change will fix that.
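
One way that small change might look is sketched below in Python. This is an assumption about how a maintainer could do it, not the only option: the area code and the separator that follows it are wrapped in an optional non-capturing group. The name optional_area_code is hypothetical.

```python
import re

# Hypothetical variant: the area code and the separator after it are
# wrapped in an optional non-capturing group.
optional_area_code = re.compile(r"""
    (?:             # start of optional area code group
        \(?         # optional parentheses
          \d{3}     # area code
        \)?         # optional parentheses
        [-\s.]?     # separator is either a dash, a space, or a period
    )?              # the whole group may be absent
      \d{3}         # 3-digit prefix
    [-.]            # another separator
      \d{4}         # 4-digit line number
""", re.VERBOSE)

for candidate in ["314-555-4000", "(314)555-4000", "555-4000"]:
    print(candidate, bool(optional_area_code.search(candidate)))
```

Because the pattern is already laid out one piece per line, the change touches only two new lines and some indentation.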

2. Write Tests

There are three levels of testing, each adding a higher level of reliability to your code. First, you need to think hard about what you want to match and whether you can deal with false matches. Second, you need to test the regex on example data. Third, you need to formalize the tests into a test suite.

Deciding what to match is a trade-off between making false matches and missing valid matches. If your regex is too strict, it will miss valid matches. If it is too loose, it will generate false matches. Once the regex is released into live code, you probably will not notice either way. Consider the phone regex example above; it would match the text "800-555-4000 = -5355". False matches are hard to catch, so it's important to plan ahead and test.
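
The false match is easy to reproduce. The Python sketch below uses the commented phone pattern from the first section; the variable names are illustrative:

```python
import re

phone = re.compile(r"""
    \(?     # optional parentheses
      \d{3} # area code required
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period
      \d{3} # 3-digit prefix
    [-.]    # another separator
      \d{4} # 4-digit line number
""", re.VERBOSE)

# The text is arithmetic, not a phone number, but the regex cannot tell.
match = phone.search("800-555-4000 = -5355")
print(match.group(0))  # 800-555-4000
```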

Sticking with the phone number example, if you are validating a phone number on a web form, you may settle for ten digits in any format. However, if you are trying to extract phone numbers from a large amount of text, you might want to be more exact to avoid an unacceptable number of false matches.
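
For the web form case, "ten digits in any format" could be as loose as the following Python sketch. The function name has_ten_digits is hypothetical, and counting digits is just one reading of that requirement:

```python
import re

def has_ten_digits(text):
    """Loose check: True if text contains exactly ten digits."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    return len(digits) == 10

print(has_ten_digits("(314) 555-4000"))  # True
print(has_ten_digits("314.555.4000"))    # True
print(has_ten_digits("555-4000"))        # False
```

A check this loose is fine for a single form field, but it would produce far too many false matches if used to pull numbers out of free text.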

When thinking about what you want to match, write down example cases. Then write some code that tests your regular expression against the example cases. Any complicated regular expression is best written in a small test program, as the examples below demonstrate:

In Perl:

#!/usr/bin/perl
use strict;
use warnings;

my @tests = ( "314-555-4000",
              "800-555-4400",
              "(314)555-4000",
              "314.555.4000",
              "555-4000",
              "aasdklfjklas",
              "1234-123-12345"
            );

foreach my $test (@tests) {
    if ( $test =~ m/
                   \(?     # optional parentheses
                     \d{3} # area code required
                   \)?     # optional parentheses
                   [-\s.]? # separator is either a dash, a space, or a period.
                     \d{3} # 3-digit prefix
                   [-\s.]  # another separator
                     \d{4} # 4-digit line number
                   /x ) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}

In PHP:

<?php
$tests = array( "314-555-4000",
           "800-555-4400",
           "(314)555-4000",
           "314.555.4000",
           "555-4000",
           "aasdklfjklas",
           "1234-123-12345"
          );

$regex = "/
            \(?     # optional parentheses
              \d{3} # area code
            \)?     # optional parentheses
            [-\s.]? # separator is either a dash, a space, or a period.
              \d{3} # 3-digit prefix
            [-\s.]  # another separator
              \d{4} # 4-digit line number
           /x";

foreach ($tests as $test) {
    if (preg_match($regex, $test)) { 
        echo "Matched on $test<br />";
    }
    else {
        echo "Failed match on $test<br />";
    }
}
?>

In Python:

import re

tests = ["314-555-4000",
         "800-555-4400",
         "(314)555-4000",
         "314.555.4000",
         "555-4000",
         "aasdklfjklas",
         "1234-123-12345"        
        ]

pattern = r'''
    \(?     # optional parentheses
      \d{3} # area code
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period.
      \d{3} # 3-digit prefix
    [-\s.]  # another separator
      \d{4} # 4-digit line number
'''

regex = re.compile( pattern, re.VERBOSE )

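for test in tests:
    # search() scans the whole string, like the Perl and PHP versions;
    # match() would only try at the start of the string.
    if regex.search(test):
        print("Matched on", test)
    else:
        print("Failed match on", test)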

Running the test script exposes yet another problem in the phone number regex: it matched "1234-123-12345". Include tests that you expect to fail as well as those you expect to match.
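
Reading the output by eye works for seven strings, but pairing each string with the outcome you want makes surprises stand out. The Python sketch below does that; the expected values are an assumption about intent (a bare seven-digit number and over-long digit runs should fail), and the names are illustrative:

```python
import re

phone = re.compile(r"""
    \(?     # optional parentheses
      \d{3} # area code
    \)?     # optional parentheses
    [-\s.]? # separator is either a dash, a space, or a period
      \d{3} # 3-digit prefix
    [-\s.]  # another separator
      \d{4} # 4-digit line number
""", re.VERBOSE)

# Each test string is paired with the outcome we want.
expectations = {
    "314-555-4000": True,
    "800-555-4400": True,
    "(314)555-4000": True,
    "314.555.4000": True,
    "555-4000": False,
    "aasdklfjklas": False,
    "1234-123-12345": False,
}

# Collect every string whose actual result differs from the expected one.
surprises = [text for text, expected in expectations.items()
             if bool(phone.search(text)) != expected]
print(surprises)  # ['1234-123-12345']
```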

Ideally, you would incorporate these tests into the test suite for your entire program. Even if you do not have a test suite already, your regular expression tests are a good foundation for a suite, and now is the perfect opportunity to start on one. Even if now is not the right time (really, it is!), you should make a habit of running your regex tests after every modification. A little extra time here could save you many headaches.
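
As a rough sketch of what formalizing the tests might look like, here is a small suite using Python's standard unittest module. The class and constant names are hypothetical. The pattern also adds two lookaround assertions so that digits embedded in a longer run are rejected; that tightening is an assumption made here so the suite passes, and it is only one possible fix.

```python
import re
import unittest

PHONE = re.compile(r"""
    (?<!\d)   # assumption: not preceded by another digit
    \(?       # optional parentheses
      \d{3}   # area code
    \)?       # optional parentheses
    [-\s.]?   # separator is either a dash, a space, or a period
      \d{3}   # 3-digit prefix
    [-\s.]    # another separator
      \d{4}   # 4-digit line number
    (?!\d)    # assumption: not followed by another digit
""", re.VERBOSE)


class PhoneRegexTest(unittest.TestCase):
    def test_valid_numbers_match(self):
        for text in ["314-555-4000", "800-555-4400",
                     "(314)555-4000", "314.555.4000"]:
            with self.subTest(text=text):
                self.assertIsNotNone(PHONE.search(text))

    def test_invalid_numbers_do_not_match(self):
        for text in ["555-4000", "aasdklfjklas", "1234-123-12345"]:
            with self.subTest(text=text):
                self.assertIsNone(PHONE.search(text))
```

Saved in a file, the suite can be run with python -m unittest after every change to the pattern.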
