ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Regular Expressions in C++ with Boost.Regex

by Ryan Stephens
04/06/2006

Searching and parsing text is messy business. What, at first, sounds like a simple matter of tokenizing a string and interpreting its structure quickly degenerates into a fog of loops, if/then statements, bugs, and ultimately partial or total insanity. Something as easy as grabbing the host name from a URL quickly becomes unwieldy. Now imagine more complicated tasks, such as parsing a log file or validating a date string.

Regular expressions rescued programmers from this mess long ago. Small, cryptic regular expressions can match nearly any format you like, saving you from writing the nasty parsing code yourself. Not only are regular expressions powerful, they are ubiquitous, thanks in large part to Perl. Contemporary programming languages have had standardized support for regular expressions for a while: of course Perl, but also Java, C#, Ruby, Python, and so on. Sadly, C++ is not in this list; the only standardized support for regular expressions is in a handful of POSIX C API functions. Don't worry, though, there is still hope for C++.

The Boost.Regex library, written by John Maddock, solved this problem by creating an ad hoc standard. Even better, the next product of the C++ standardization committee (known as Technical Report 1 or TR1) has accepted the library to appear as part of the C++ standard library in the next standardized version of C++. For now, if you use regular expressions, and you want to employ them in C++, this library is for you.

Boost.Regex allows you to do everything from simple matching (validating a phone number, for example) to search and replace. This article explains the basics of using Boost.Regex:

When you are done, you should have a good idea of what kinds of things you can do with Boost.Regex. Before I can get into the particulars, you may need a short crash course in regular expressions.

Background and Definitions

Regular expression syntax is a language all its own, and it would take much more than a few paragraphs to give you a comprehensive understanding of it. I will explain the basics, though, and provide a couple of links to good pages where you can explore regular expressions in as much detail as you like.

A regular expression is a string of characters that a regular expression engine (which, essentially, is what Boost.Regex is) interprets and applies to a target string. The engine interprets the expression against the string and determines if the expression is part of the target string, matches the entire string, or neither. If the regular expression is somewhere in the string, you will get the results either as a Boolean or the text that actually satisfies the expression (or part of it).

For the purposes of this article, the terms match and search have special meaning. When I say that a given string matches a regular expression, I mean that the entire string satisfies the expression. As an example, consider the regular expression:

a+|b+

This matches any string that contains one or more as or one or more bs. The plus sign means "match the previous character one or more times," and the pipe operator (|), is a logical OR (see Table 1 for more info). Some strings that match that regular expression are:

a
aaaaa
b
bbbb

These won't match:

ab
aac
x

This is because the expression means, in pseudocode, "return true if the string consists only of one or more as, or if it consists only of one or more bs."

On the other hand, the results will be different if I am searching a string for a substring that satisfies a regular expression. The same expression as before will match the strings:

mxaaabfc
aabc
bbbbbxxxx

This is because there is a substring in the target string that satisfies the regular expression. This subtle distinction is more important when you match or search a regular expression with Boost.Regex, because you must use different classes and functions depending on what you are testing.

Table 1 shows a summary of some of the most common regular expression tokens. I give a few examples after the table and throughout the rest of this article, but to really appreciate the expressive power of regular expressions, copy the code from Example 1 and experiment a bit. Once you recognize how easy it is to search and parse text with regular expressions, you won't want to stop.

Symbol Meaning
c Match the literal character c once, unless it is one of the special characters.
^ Match the beginning of a line.
. Match any character that isn't a newline.
$ Match the end of a line.
| Logical OR between expressions.
() Group subexpressions.
[] Define a character class.
* Match the preceding expression zero or more times.
+ Match the preceding expression one ore more times.
? Match the preceding expression zero or one time.
{n} Match the preceding expression n times.
{n,} Match the preceding expression at least n times.
{n, m} Match the preceding expression at least n times and at most m times.
\d Match a digit.
\D Match a character that is not a digit.
\w Match an alpha character, including the underscore.
\W Match a character that is not an alpha character.
\s Match a whitespace character (any of \t, \n, \r, or \f).
\S Match a non-whitespace character.
\t Tab.
\n Newline.
\r Carriage return.
\f Form feed.
\m Escape m, where m is one of the metacharacters described above: ^, ., $, |, (), [], *, +, ?, \, or /.

I am using Perl-style regular expressions, which is the de facto standard. POSIX has defined both a basic and extended standardized syntax, both of which Boost.Regex supports, but I don't describe them here. There are more tokens than I list in Table 1, but those listed are enough to get you started. You can find a comprehensive reference to Perl-style regular expressions in the Perl regex documentation, or read about the Boost.Regex-specific options on the Boost documentation pages. You can also do a web search for "regular expression syntax" and find lots of choices.

That's nice. What can you do with them? Consider a social security number, which is nine digits, usually separated by hyphens after the third and fifth digits, looking something like XXX-XX-XXXX, where X is a number. You can validate it with a simple expression:

\d{3}\-\d{2}\-\d{4}

The \d matches any digit. The following {3} means that the digit expression must match three in a row, as if you had written \d\d\d instead. The \- matches the hyphen, which includes a leading backslash because the hyphen is a regular expression metacharacter. Everything that follows is a repeat of what I just explained: two digits, then a hyphen, then four digits.

Maybe you want to search for someone's name that is either Johnson or Johnston:

Johnst{0,1}on
Johnson|Johnston
Johns(on|ton)

In the first example, the {0,1} applies to the character that precedes it. It means that the t can occur between zero and one times. The second and third examples use the logical OR operator with and without parentheses to allow the regular expression engine to match alternative names.

Regular Expression Pocket Reference

Related Reading

Regular Expression Pocket Reference
By Tony Stubblebine

Matching

As I said earlier, a string matches a regular expression if the entire string satisfies the expression. Example 1 is a trivial program that accepts a regular expression and a string and tells you whether the string satisfies the expression. Compile and run it to see matching in action and to get a feel for regular expression syntax, if it is new to you.

#include <iostream>
#include <string>
#include <boost/regex.hpp>  // Boost.Regex lib

using namespace std;

int main( ) {

   std::string s, sre;
   boost::regex re;

   while(true)
   {
      cout << "Expression: ";
      cin >> sre;
      if (sre == "quit")
      {
         break;
      }
      cout << "String:     ";
      cin >> s;

      try
      {
         // Set up the regular expression for case-insensitivity
         re.assign(sre, boost::regex_constants::icase);
      }
      catch (boost::regex_error& e)
      {
         cout << sre << " is not a valid regular expression: \""
              << e.what() << "\"" << endl;
         continue;
      }
      if (boost::regex_match(s, re))
      {
         cout << re << " matches " << s << endl;
      }
   }
}

Example 1. A trivial regular expression tester

The first, and most important, part of this example is the inclusion of boost/regex.hpp. This header includes everything you need for the Boost.Regex library. (Unzipping and building the Boost libraries is quick and painless, and it is described on the Boost getting started page.)

The next critical part to this example is the boost::regex class. This class, as you may have guessed, contains the regular expression itself. It has the same semantics as the standard string class, except that, as Maddock says, it is, "a string plus the actual state-machine required by the regular expression algorithms." It is also, like the standard string class, a class template that supports narrow- and wide-character strings with typedefs (regex = basic_regex<char> and wregex = basic_regex<wchar_t>).

Nothing interesting happens with the regex class until you assign it a string that contains a regular expression. Its assign member functions, assignment operator, and constructor will interpret and compile the regular expression, throwing a regex_error exception if it is not formed correctly. I used assign so I could send in the case-insensitivity flag boost::regex_constants::icase.

Finally, regex_match does the work. Its parameters (in the version I used in Example 1) are the target string and the regex object. Not surprisingly, it returns true if the string satisfies the expression, false otherwise.

That's the anatomy of the simplest use of Boost.Regex for matching a string to a regular expression. This mode of matching lends itself nicely to two common use cases: validation and parsing.

Validation

In general, anything a human enters into a computer is suspicious and needs validation. Sure, you can do this with a simple character-by-character comparison and a string of if/then statements, but why? Stop wasting your time writing ugly character-by-character parsing code and use regular expressions to make your code concise and your intent clear.

A URL is a good example of something users routinely enter incorrectly. Consider an application where you read a URL from a config file or some other user-entered location. In the first place, you want to make sure it's valid before you pass it off to your networking library, or perhaps you're writing the network library and you must validate the URL's format before you parse it. A URL has at least three parts: a protocol, a host name, and a path (including the file name itself). Here's an example:

ftp://downloads.foo.com/apps/linux/patch.gz

You need to accept FTP, HTTP, and HTTPS protocols. Here's a regular expression to validate the URL:

(ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+

The first expression in parenthesis validates the protocol, (ftp|http|https). This expression matches one of the three protocols it contains. Next comes the necessary colon and following forward slashes (which are special characters that require backslash escaping). After that, the expression (\w+\.)*(\w*) matches a host name that is of the form foo.com or foo, allowing alpha and numeric characters in the host name. Then comes another forward slash, then ([\w\d]+\/{0,1})+. This matches a pathname of the form /downloads/w32/utils.

From Example 1, you can see that this is easy to code:

try
{
   boost::regex re("(ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+");
   if (!boost::regex_match(url, re))
   {
      throw "Your URL is not formatted correctly!";
   }
}
catch (boost::regex_error& e)
{
   cerr << "The regexp " << re << " is invalid!" << endl;
   throw(e);
}

regex_match has several overloadings to handle strings, char*s, and iterator ranges, a l´ the standard library algorithms.

The previous snippet does a syntactic validation. If the string satisfies the regular expression, then you are happy. This is not the end of this road, though, because often you will need to do more than simply validate the structure of a string--you may need to extract part of the string to do something with it. The hostname, for instance, is useful from the URL example.

Parsing

Not only does regex_match confirm or deny whether a string satisfies some expression, it also lets you parse your string into pieces. It does this by storing the results in a match_results object, which is a sequence (in the sense of a standard library sequence) over which you can iterate to examine the results.

Example 2 is a modified version of Example 1. This new version includes a cmatch object, which is simply a typedef for match_results<const char*>. Boost.Regex, like standard library strings, supports both narrow- and wide-character strings).

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main( ) {

   std::string s, sre;
   boost::regex re;
   boost::cmatch matches;

   while(true)
   {
      cout << "Expression: ";
      cin >> sre;
      if (sre == "quit")
      {
         break;
      }

      cout << "String:     ";
      cin >> s;

      try
      {
         // Assignment and construction initialize the FSM used
         // for regexp parsing
         re = sre;
      }
      catch (boost::regex_error& e)
      {
         cout << sre << " is not a valid regular expression: \""
              << e.what() << "\"" << endl;
         continue;
      }
      // if (boost::regex_match(s.begin(), s.end(), re))
      if (boost::regex_match(s.c_str(), matches, re))
      {
         // matches[0] contains the original string.  matches[n]
         // contains a sub_match object for each matching
         // subexpression
         for (int i = 1; i < matches.size(); i++)
         {
            // sub_match::first and sub_match::second are iterators that
            // refer to the first and one past the last chars of the
            // matching subexpression
            string match(matches[i].first, matches[i].second);
            cout << "\tmatches[" << i << "] = " << match << endl;
         }
      }
      else
      {
         cout << "The regexp \"" << re << "\" does not match \"" << s << "\"" << endl;
      }
   }
}

Example 2. Parsing a string using subexpressions

In Example 2, matches is a sequence of sub_match objects. The sub_match class has the members first and second, which are iterators which refer to the first and one-past-the-last elements in the original string. matches[0] contains the entire original string, and the sub_match objects at indexes matches[1...n] each refer to the substrings n that match the corresponding subexpression in the original expression.

A subexpression is a part of the original regular expression that is contained within parentheses. For example, this regular expression has three subexpressions:

(\d{1,2})\/(\d{1,2})\/(\d{2}|\d{4})

This particular expression will match a date of the form MM/DD/YY or MM/DD/YYYY (of course, it doesn't validate the semantics of the values, so the month can be greater than 12). How do you grab each of the parts? Figure 1 should give you an idea, it shows the what a match_results object will look like if you use the expression above and give it the string 11/5/2005.

the results of a regex_match
Figure 1. The results of a regex_match

After parsing this date, there are four elements in matches. The element at index zero refers to the entire string, and each of the elements in matches refers to the elements in the original string that satisfy the corresponding subexpression (this can vary, though). The entire string successfully matches the regular expression, so each of the subexpressions is available via indexes 1-3, respectively, in the match_results sequence.

Depending on the type of subexpressions you are using, the contents of match_results may surprise you. Consider the URL example above. This regular expression has four emboldened subexpressions:

(ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+

Using repeating subexpressions (for example, (\w+\.)*) means that the subexpression can match any number of times. This, in turn, means that match_results can contain a different number of values based on the string you try to match. Here's what you will see with a sample run of Example 2 using the URL regular expression I just gave:

Expression: (ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+
String:     http://www.foo.com/bar
        matches[0] = http://www.foo.com/bar
        matches[1] = http
        matches[2] = foo.
        matches[3] = com
        matches[4] = bar

You probably noticed right away that the "www." is missing from the results. This is because the repeating subexpression only stores the last subexpression matched. If you want to, for example, grab the full host name out of this URL, you have to add another subexpression, which I have indicated with new bold parentheses below:

(ftp|http|https):\/\/((\w+\.)*(\w*))\/([\w\d]+\/{0,1})+

This will put the entire host name into one of the subexpressions. The order of the corresponding sub_match objects in the match_results sequence is as though the tree of nested subexpressions were traversed depth-first, left to right. Here's the output with this modified regular expression:

Expression: (ftp|http|https):\/\/((\w+\.)*(\w*))\/([\w\d]+\/{0,1})+
String:     http://www.foo.com/bar
        matches[0] = http://www.foo.com/bar
        matches[1] = http
        matches[2] = www.foo.com
        matches[3] = foo.
        matches[4] = com
        matches[5] = bar

The results are the same as before, except this time you also get the match for the host name subexpression in matches[2].

By using these techniques, and perhaps after some practice experimenting with regular expression syntax, you can use Boost.Regex to validate and parse a wide variety of strings. But these examples only provide a glimpse into the expressive power of regular expressions. If you aren't already familiar with regular expressions, experiment some more--you may be surprised how often they do just what you need.

Searching

Matching and parsing a single string in its entirety does not address the important and ubiquitous use case of searching a string that contains a substring you want, but possibly a lot of other characters you don't.

Like matching, Boost.Regex lets you search a string for a regular expression in two ways. In the simplest case, you may just want to know if a given string contains a match for your regular expression. Example 3 is a trivial implementation of the grep program that reads in each line from a file and prints it out if it contains a string that satisfies the regular expression pattern.

#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <fstream>

using namespace std;
const int BUFSIZE = 10000;

int main(int argc, char** argv) {

   // Safety checks omitted...
   boost::regex re(argv[1]);
   string file(argv[2]);
   char buf[BUFSIZE];

   ifstream in(file.c_str());
   while (!in.eof())
   {
      in.getline(buf, BUFSIZE-1);
      if (boost::regex_search(buf, re))
      {
         cout << buf << endl;
      }
   }
}

Example 3. Trivial grep

You can see that you use regex_search in the same way as regex_match.

This comes in handy sometimes, but has limited appeal. More often, you will enumerate over all substrings that match a given pattern. For example, maybe you are writing a web crawler and want to iterate over all anchor tags in a page. Craft a regular expression to grab anchor tags:

<a\s+href="([\-:\w\d\.\/]+)">

You don't want the whole line returned, though, as in the grep example above; you want the target URL. To do this, use the second subexpression in match_results. Example 4, a slightly modified version of Example 3, will do just that.

#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <fstream>

using namespace std;
const int BUFSIZE = 10000;

int main(int argc, char** argv) {

   // Safety checks omitted...
   boost::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
   string file(argv[1]);
   char buf[BUFSIZE];
   boost::cmatch matches;
   string sbuf;
   string::const_iterator begin;
   ifstream in(file.c_str());

   while (!in.eof())
   {
      in.getline(buf, BUFSIZE-1);
      sbuf = buf;
      begin = sbuf.begin();

      while (boost::regex_search(begin, sbuf.end(), matches, re))
      {
         string url(matches[1].first, matches[1].second);
         cout << "URL: " << url << endl;
         // Update the beginning of the range to the character
         // following the match
         begin = matches[1].second;
      }
   }
}

Example 4. Enumerating anchor tags

The hard-coded regular expression in Example 4 contains lots of backslashes. This is necessary because I am escaping certain characters twice: once for the compiler, and once for the regular expression engine.

Example 4 uses a different overload of regex_search than Example 3; this version takes two bidirectional iterator arguments that refer to the beginning and end of a range of characters to be searched. To access every matching substring, all I have to do is update begin to point to the character following the last match, which is in matches[1].second.

This is not the only way to iterate over all occurrences of a pattern. If you prefer (or require) iterator semantics, use a regex_token_iterator, which is an iterator interface to the results from a regular expression search. In Example 4, you could just as easily have iterated over the results of the URL search:

   // Read the HTML file into the string s...
   boost::sregex_token_iterator p(s.begin(), s.end(), re, 0);
   boost::sregex_token_iterator end;

   for (;p != end;count++, ++p)
   {
      string m(p->first, p->second);
      cout << m << endl;
   }

That's not all, though. The first token iterator here passes a zero as the last argument to its constructor. This tells it to iterate over the strings that satisfy the regular expression. Change it to -1 and you get the opposite: iteration over substrings that do not satisfy the expression. In other words, it tokenizes the string, where each token is something that satisfies the regular expression. This is a cool feature, because it lets you tokenize a string of characters based on complex delimiters. To use the example of parsing a web page, you could, for example, break the document into sections by its headers, using header tags such as <h1>...</h1>, <h3>...</h3>, etc.

Stuff to Check Out

There is, of course, more to Boost.Regex than I've presented here, but this should give you a good idea of what you can do with regular expressions in C++. The documentation on the Boost.Regex page is comprehensive, and there are plenty of examples you can copy and experiment with. In addition to searching strings as I did above, you can:

Above all, you should experiment with regular expression syntax. There are different ways to do the same thing, and it's fun to see how concise you can make an expression that does what you want. Once you're a pro at regular expressions, you will be surprised at how often you can use them to validate, search, or parse a string.

Conclusion

Boost.Regex is the library in the Boost project that implements a regular expression engine in C++. You can use it to match, search, or search and replace with regular expressions against a target string, instead of writing ugly and cumbersome string-parsing code. Boost.Regex has been accepted as part of the next C++ standard library, and you will see it appearing in implementations of TR1 (in the tr1 namespace) from standard library vendors very soon. Check out Boost.Regex to get a feel for how useful it is, and while you're at it, take a look at many of the other libraries in Boost--there's a lot of good stuff there.

Ryan Stephens is a software engineer, writer, and student living in Tempe, Arizona. He enjoys programming in virtually any language, especially C++.


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.