ONLamp.com

Regular Expressions in C++ with Boost.Regex

by Ryan Stephens
04/06/2006

Searching and parsing text is messy business. What, at first, sounds like a simple matter of tokenizing a string and interpreting its structure quickly degenerates into a fog of loops, if/then statements, bugs, and ultimately partial or total insanity. Something as easy as grabbing the host name from a URL quickly becomes unwieldy. Now imagine more complicated tasks, such as parsing a log file or validating a date string.

Regular expressions rescued programmers from this mess long ago. Small, cryptic regular expressions can match nearly any format you like, saving you from writing the nasty parsing code yourself. Not only are regular expressions powerful, they are ubiquitous, thanks in large part to Perl. Contemporary programming languages have had standardized support for regular expressions for a while: of course Perl, but also Java, C#, Ruby, Python, and so on. Sadly, C++ is not in this list; the only standardized support for regular expressions is in a handful of POSIX C API functions. Don't worry, though, there is still hope for C++.

The Boost.Regex library, written by John Maddock, addresses this problem and has become an ad hoc standard. Even better, the C++ standardization committee has accepted the library into its next product, Technical Report 1 (TR1), so that it can appear as part of the C++ standard library in the next standardized version of C++. For now, if you already use regular expressions and want to employ them in C++, this library is for you.

Boost.Regex allows you to do everything from simple matching (validating a phone number, for example) to search and replace. This article explains the basics of using Boost.Regex:

  • Matching
  • Simple parsing
  • Enumeration

When you are done, you should have a good idea of what you can do with Boost.Regex. Before I get into the particulars, you may need a short crash course in regular expressions.

Background and Definitions

Regular expression syntax is a language all its own, and it would take much more than a few paragraphs to give you a comprehensive understanding of it. I will explain the basics, though, and provide a couple of links to good pages where you can explore regular expressions in as much detail as you like.

A regular expression is a string of characters that a regular expression engine (which, essentially, is what Boost.Regex is) interprets and applies to a target string. The engine determines whether the expression matches the entire target string, matches some part of it, or neither. If the expression is satisfied, you get the result either as a Boolean or as the text that actually satisfies the expression (or part of it).

For the purposes of this article, the terms match and search have special meaning. When I say that a given string matches a regular expression, I mean that the entire string satisfies the expression. As an example, consider the regular expression:

a+|b+

This matches any string that consists entirely of one or more a's or entirely of one or more b's. The plus sign means "match the previous character one or more times," and the pipe operator (|) is a logical OR (see Table 1 for more info). Some strings that match that regular expression are:

a
aaaaa
b
bbbb

These won't match:

ab
aac
x

This is because the expression means, in pseudocode, "return true if the string consists only of one or more a's, or if it consists only of one or more b's."
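As a minimal sketch of this whole-string test, the function below applies a+|b+ with regex_match. It assumes a C++11 compiler and uses std::regex from the standard <regex> header so that it builds without Boost installed. With Boost.Regex the corresponding names are boost::regex and boost::regex_match from <boost/regex.hpp>. The function name matches_as_or_bs is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Returns true only if the ENTIRE string satisfies a+|b+.
// (matches_as_or_bs is a hypothetical name used for this sketch.)
bool matches_as_or_bs(const std::string& s) {
    static const std::regex re("a+|b+");
    // regex_match succeeds only when the expression consumes the whole string.
    return std::regex_match(s, re);
}
```

Called on the strings listed above, matches_as_or_bs accepts a, aaaaa, b, and bbbb, and rejects ab, aac, and x.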

On the other hand, the results will be different if I am searching a string for a substring that satisfies a regular expression. A search with the same expression as before succeeds on these strings:

mxaaabfc
aabc
bbbbbxxxx

This is because each target string contains a substring that satisfies the regular expression. This subtle distinction matters when you use Boost.Regex, because you must use different classes and functions depending on whether you are matching or searching.
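A search can be sketched in the same spirit. The function below looks for the first substring that satisfies a+|b+ and returns that text rather than a Boolean. It assumes a C++11 compiler and uses std::regex_search and std::smatch from <regex>. Boost.Regex offers the corresponding boost::regex_search and boost::smatch. The function name first_run_of_as_or_bs is invented for this example.

```cpp
#include <regex>
#include <string>

// Returns the first substring of s that satisfies a+|b+,
// or an empty string if no part of s does.
// (first_run_of_as_or_bs is a hypothetical name used for this sketch.)
std::string first_run_of_as_or_bs(const std::string& s) {
    static const std::regex re("a+|b+");
    std::smatch what;
    // regex_search succeeds if ANY substring satisfies the expression.
    if (std::regex_search(s, what, re)) {
        return what[0].str();  // what[0] holds the text of the whole match
    }
    return std::string();
}
```

For the target string mxaaabfc, the search skips m and x and then returns aaa, the leftmost run that satisfies the expression.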

Table 1 shows a summary of some of the most common regular expression tokens. I give a few examples after the table and throughout the rest of this article, but to really appreciate the expressive power of regular expressions, copy the code from Example 1 and experiment a bit. Once you recognize how easy it is to search and parse text with regular expressions, you won't want to stop.

Table 1. Common Perl-style regular expression tokens

Symbol   Meaning
c        Match the literal character c once, unless it is one of the special characters.
^        Match the beginning of a line.
.        Match any character that isn't a newline.
$        Match the end of a line.
|        Logical OR between expressions.
()       Group subexpressions.
[]       Define a character class.
*        Match the preceding expression zero or more times.
+        Match the preceding expression one or more times.
?        Match the preceding expression zero or one time.
{n}      Match the preceding expression n times.
{n,}     Match the preceding expression at least n times.
{n,m}    Match the preceding expression at least n times and at most m times.
\d       Match a digit.
\D       Match a character that is not a digit.
\w       Match a word character: a letter, a digit, or the underscore.
\W       Match a character that is not a word character.
\s       Match a whitespace character (such as a space, \t, \n, \r, or \f).
\S       Match a non-whitespace character.
\t       Tab.
\n       Newline.
\r       Carriage return.
\f       Form feed.
\m       Escape m, where m is one of the metacharacters described above: ^, ., $, |, (), [], {}, *, +, ?, or \.
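To try a few of these tokens, a small helper such as the one below may be handy. It is only a sketch: full_match is a hypothetical name, and the code uses std::regex from the C++11 <regex> header in place of boost::regex so that it compiles without Boost. Its default syntax agrees with the Perl-style tokens exercised here.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` satisfies `pattern`.
// (full_match is a hypothetical helper name used only for this sketch.)
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// A few tokens from the table. Raw string literals, R"(...)", avoid
// having to double every backslash inside a C++ string:
//   full_match(R"(\d{2,4})", "123")            two to four digits
//   full_match(R"(\w+\s\w+)", "hello world")   word, whitespace, word
//   full_match("colou?r", "color")             the u is optional
//   full_match(R"(a\.b)", "a.b")               an escaped dot is a literal dot
```

Each call listed in the comments returns true. Changing the last one to full_match(R"(a\.b)", "axb") returns false, because the escaped dot no longer stands for "any character."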

I am using Perl-style regular expressions, which are the de facto standard. POSIX has defined both a basic and an extended standardized syntax, both of which Boost.Regex supports, but I don't describe them here. There are more tokens than I list in Table 1, but those listed are enough to get you started. You can find a comprehensive reference to Perl-style regular expressions in the Perl regex documentation, or read about the Boost.Regex-specific options on the Boost documentation pages. You can also do a web search for "regular expression syntax" and find lots of choices.

That's nice. What can you do with them? Consider a social security number, which is nine digits, usually separated by hyphens after the third and fifth digits, looking something like XXX-XX-XXXX, where X is a digit. You can validate it with a simple expression:

\d{3}\-\d{2}\-\d{4}

The \d matches any digit. The following {3} means that the digit expression must match three times in a row, as if you had written \d\d\d instead. The \- matches a literal hyphen. The leading backslash is optional here, because the hyphen acts as a metacharacter only inside a character class (for example, [a-z]), but escaping it is harmless. Everything that follows is a repeat of what I just explained: two digits, then a hyphen, then four digits.
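In C++ source code, the validation might look like the sketch below. It assumes a C++11 compiler and uses std::regex from <regex> as a stand-in for boost::regex, and the function name is_valid_ssn is made up for the example. The hyphens are written without backslashes, since a hyphen outside a character class is an ordinary character.

```cpp
#include <regex>
#include <string>

// Returns true if s has the form XXX-XX-XXXX, where each X is a digit.
// (is_valid_ssn is a hypothetical name used for this sketch.)
bool is_valid_ssn(const std::string& s) {
    // A raw string literal keeps the backslashes intact. In an ordinary
    // literal each one must be doubled: "\\d{3}-\\d{2}-\\d{4}".
    static const std::regex re(R"(\d{3}-\d{2}-\d{4})");
    // regex_match requires the whole string to satisfy the expression,
    // so leading or trailing characters cause a rejection.
    return std::regex_match(s, re);
}
```

The point to watch is the backslash: the C++ compiler processes string escapes before the regular expression engine ever sees the pattern, so \d in the expression must reach the engine either through a raw literal or as \\d.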

Maybe you want to search for someone's name that is either Johnson or Johnston:

Johnst{0,1}on
Johnson|Johnston
Johns(on|ton)

In the first example, the {0,1} applies to the character that precedes it: the t can occur zero times or one time. The second and third examples use the logical OR operator, without and with parentheses respectively, to let the regular expression engine match either name.
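One way to convince yourself that the three expressions are interchangeable is to run them all against the same names. The sketch below does that, and also includes Johnst?on, which uses the ? token from Table 1 as shorthand for {0,1}. The names kJohnsonPatterns and count_matching_patterns are invented for this illustration, and the code uses std::regex from the C++11 <regex> header in place of boost::regex.

```cpp
#include <regex>
#include <string>
#include <vector>

// The three expressions from the text, plus the shorthand t? for t{0,1}.
// (kJohnsonPatterns is a hypothetical name used for this sketch.)
const std::vector<std::string> kJohnsonPatterns = {
    "Johnst{0,1}on", "Johnson|Johnston", "Johns(on|ton)", "Johnst?on"};

// Counts how many of the patterns match the whole of `name`.
int count_matching_patterns(const std::string& name) {
    int count = 0;
    for (const std::string& pattern : kJohnsonPatterns) {
        if (std::regex_match(name, std::regex(pattern))) {
            ++count;
        }
    }
    return count;
}
```

For Johnson and Johnston, all four patterns match. For anything else, such as Johnstton, none of them do.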
