ONLamp.com


Five Habits for Successful Regular Expressions

3. Group the Alternation Operator

The alternation operator (|) has a low precedence. This means that it often alternates over more than the programmer intended. For example, a regex to extract email addresses out of a mail file might look like:
^CC:|To:(.*)



The above attempt is incorrect, but the bugs often go unnoticed. The intent of the above regex is to find lines starting with "CC:" or "To:" and then capture any email addresses on the rest of the line.

Unfortunately, the regex doesn't actually capture anything from lines starting with "CC:" and may capture random text if "To:" appears in the middle of a line. In plain English, the regular expression matches lines beginning with "CC:" and captures nothing, or matches any line containing the text "To:" and then captures the rest of the line. Usually, it will capture plenty of addresses and nobody will notice the failings.

If that were the real intent, you should add parentheses to say it explicitly, like this:

(^CC:)|(To:(.*))

However, the real intent of the regex is to match lines starting with "CC:" or "To:" and then capture the rest of the line. The following regex does that:

^(CC:|To:)(.*)

This is a common and hard-to-catch bug. If you develop the habit of wrapping your alternations in parentheses (or non-capturing parentheses -- (?:…)) you can avoid this error.
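The difference between the ungrouped and grouped forms can be checked with a short Python sketch. The sample mail text and the variable names are made up for illustration, and re.MULTILINE is assumed so that ^ matches at the start of every line. The grouped pattern uses non-capturing parentheses so that the rest of the line is still the first captured group.

```python
import re

# Hypothetical sample input: two header lines plus a line with "To:" in the middle.
mail = """To: alice@example.com
CC: bob@example.com
Subject: Reply To: the earlier thread
"""

# Ungrouped alternation: parsed as (^CC:) | (To:(.*))
buggy = re.compile(r'^CC:|To:(.*)', re.MULTILINE)

# Grouped alternation: the anchor and the capture apply to both prefixes.
fixed = re.compile(r'^(?:CC:|To:)(.*)', re.MULTILINE)

buggy_captures = [m.group(1) for m in buggy.finditer(mail)]
fixed_captures = [m.group(1) for m in fixed.finditer(mail)]

print(buggy_captures)  # nothing captured for CC:, stray text captured from Subject:
print(fixed_captures)  # the rest of each To: and CC: line
```

In this sample, the ungrouped pattern captures nothing from the "CC:" line and picks up text from the "Subject:" line, which is exactly the failure described above.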

4. Use Lazy Quantifiers

Most people avoid using the lazy quantifiers *?, +?, and ??, even though they are easy to understand and make many regular expressions easier to write.

Lazy quantifiers match as little text as possible while still allowing the overall match to succeed. If you write foo(.*?)bar, the quantifier will stop matching the first time it sees "bar", not the last time. This matters if you are trying to capture "###" in the text "foo###bar+++bar". A regular (greedy) quantifier, as in foo(.*)bar, would have captured "###bar+++".
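A minimal Python sketch of the same comparison, using the text from the paragraph above (the variable names are illustrative):

```python
import re

text = "foo###bar+++bar"

# Lazy: stops at the first "bar" that lets the match succeed.
lazy = re.search(r'foo(.*?)bar', text).group(1)

# Greedy: runs to the last "bar" in the text.
greedy = re.search(r'foo(.*)bar', text).group(1)

print(lazy)    # ###
print(greedy)  # ###bar+++
```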

Let's say you want to capture all of the phone numbers from an HTML file. You could use the phone number regular expression example we discussed earlier in this article. However, if you know that the file contains all of the phone numbers in the first column of a table, you can write a much simpler regex using lazy quantifiers:

<tr><td>(.+?)</td>

Many beginning regular expression programmers avoid lazy quantifiers by using negated character classes instead. They write the above code as:

<tr><td>([^<]+)</td>

That works in this case, but leads to trouble if the text you are trying to capture contains a character that also appears in your delimiter (in this case, the "<" of </td>). If you use lazy quantifiers, you will spend less time kludging character classes and produce clearer regular expressions.
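The following Python sketch shows where the two approaches diverge. The table rows are invented for illustration, and the second row assumes a phone number cell that contains its own markup. The helper first_cell is hypothetical, not part of any library.

```python
import re

# Hypothetical rows: the second phone number contains a nested <b> tag.
rows = [
    "<tr><td>555-1234</td><td>Bob</td></tr>",
    "<tr><td><b>555</b>-1212</td><td>Alice</td></tr>",
]

lazy = re.compile(r'<tr><td>(.+?)</td>')
negated = re.compile(r'<tr><td>([^<]+)</td>')

def first_cell(pattern, row):
    """Return the captured first cell, or None if the pattern does not match."""
    m = pattern.search(row)
    return m.group(1) if m else None

lazy_cells = [first_cell(lazy, r) for r in rows]
negated_cells = [first_cell(negated, r) for r in rows]

print(lazy_cells)     # both cells captured, nested tag included
print(negated_cells)  # second row fails: the cell contains "<"
```

The negated character class gives up as soon as the cell contains a "<", while the lazy quantifier only cares about where the closing </td> is.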

Lazy quantifiers are most valuable when you know the structure surrounding the text you want to capture.

5. Use Available Delimiters

Perl and PHP often use the forward slash to mark the start and end of a regular expression. Python uses a variety of quotes to mark the start and end of a string, which may then be used as a regular expression. If you stick with the slash delimiter in Perl and PHP, you will have to escape any slashes in your regex. If you use regular quotes in Python, you will have to escape all of your backslashes. Choosing different delimiters or quotes allows you to avoid escaping half of your regex. This makes the regex easier to read and reduces the potential for bugs when you forget to escape something.

Perl and PHP allow you to use nearly any character that is neither alphanumeric nor whitespace as a delimiter. If you switch to a new delimiter, you can avoid having to escape the forward slashes when you are trying to match URLs or HTML tags such as "http://" or "<br />".

For example:

/http:\/\/(\S)*/

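could be rewritten in Perl as:

m#http://(\S)*#

In Perl, any delimiter other than the slash requires the m prefix. In PHP, where the pattern is passed as a string, the m prefix is dropped and the pattern is simply '#http://(\S)*#'.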

Common delimiters are #, !, and |. If you use square brackets, angle brackets, or curly braces, the opening and closing brackets must match. Here are some common uses of delimiters:

#…#
!…!
{…}
s|…|…| (Perl only)
s[…][…] (Perl only)
s<…>/…/ (Perl only)

In Python, regular expressions are treated as strings first. If you use quotes -- the regular string delimiter -- you will have to escape all of your backslashes. However, you can use raw strings, r'', to avoid this. If you use raw triple-quoted strings, you can also spread the regex over several lines, provided you compile it with the re.VERBOSE option so that the newlines and indentation are ignored.

For example:

regex = "(\\w+)(\\d+)"

could be rewritten as:

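import re

regex = r'''
           (\w+)
           (\d+)
         '''
# The whitespace above is ignored only under re.VERBOSE.
pattern = re.compile(regex, re.VERBOSE)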
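As a quick check, the sketch below compares the escaped, raw, and verbose forms. The test string "abc123" and the variable names are illustrative. The point of interest is that the multi-line form only matches when re.VERBOSE is passed.

```python
import re

escaped = "(\\w+)(\\d+)"   # regular string: every backslash doubled
raw = r"(\w+)(\d+)"        # raw string: backslashes written once
verbose = r'''
    (\w+)   # one or more word characters
    (\d+)   # one or more digits
'''

# The escaped and raw forms produce the same pattern string.
same_string = (escaped == raw)

m_plain = re.match(raw, "abc123")
m_verbose = re.match(verbose, "abc123", re.VERBOSE)

# Without re.VERBOSE the newlines, spaces, and comments are literal text.
m_missing_flag = re.match(verbose, "abc123")

print(same_string)
print(m_plain.groups(), m_verbose.groups())
print(m_missing_flag)
```

Under re.VERBOSE, anything from an unescaped # to the end of the line is treated as a comment, which is what allows each group to be annotated in place.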

Conclusion

The advice in this article focuses on making regular expressions readable. In developing habits to achieve this, you will be forced to think more clearly about the design and structure of your regular expressions. This will reduce bugs and ease the life of the code maintainer. You will be especially happy if that code maintainer is you.

I would like to thank Sarah Burcham for advice on this article. Also, thanks to Jeffrey E.F. Friedl for Mastering Regular Expressions. His book serves as the foundation for everything I do with regular expressions.

Tony Stubblebine is an Internet consultant and author of Regular Expression Pocket Reference.


O'Reilly & Associates will soon release (August 2003) Regular Expression Pocket Reference.




