ONLamp.com

Regular Expressions in C++ with Boost.Regex
Pages: 1, 2, 3, 4

Parsing

Not only does regex_match confirm or deny whether a string satisfies some expression, it also lets you parse your string into pieces. It does this by storing the results in a match_results object, which is a sequence (in the sense of a standard library sequence) over which you can iterate to examine the results.



Example 2 is a modified version of Example 1. This new version includes a cmatch object, which is simply a typedef for match_results<const char*>. (Boost.Regex, like the standard library string classes, supports both narrow- and wide-character strings.)

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main( ) {

   std::string s, sre;
   boost::regex re;
   boost::cmatch matches;

   while(true)
   {
      cout << "Expression: ";
      cin >> sre;
      if (sre == "quit")
      {
         break;
      }

      cout << "String:     ";
      cin >> s;

      try
      {
         // Assignment and construction initialize the FSM used
         // for regexp parsing
         re = sre;
      }
      catch (const boost::regex_error& e)
      {
         cout << sre << " is not a valid regular expression: \""
              << e.what() << "\"" << endl;
         continue;
      }
      // if (boost::regex_match(s.begin(), s.end(), re))
      if (boost::regex_match(s.c_str(), matches, re))
      {
         // matches[0] refers to the entire match.  matches[n]
         // is the sub_match object for the nth
         // subexpression
         for (boost::cmatch::size_type i = 0; i < matches.size(); i++)
         {
            // sub_match::first and sub_match::second are iterators that
            // refer to the first and one past the last chars of the
            // matching subexpression
            string match(matches[i].first, matches[i].second);
            cout << "\tmatches[" << i << "] = " << match << endl;
         }
      }
      else
      {
         cout << "The regexp \"" << re << "\" does not match \"" << s << "\"" << endl;
      }
   }
}

Example 2. Parsing a string using subexpressions

In Example 2, matches is a sequence of sub_match objects. The sub_match class has the members first and second, which are iterators that refer to the first and one-past-the-last characters of a match in the original string. matches[0] refers to the entire match (with regex_match, that is the entire original string), and the sub_match objects at indexes 1 through n each refer to the substring that matches the corresponding subexpression in the original expression.
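As a small sketch of how the first/second pair can be used, the function below copies every sub_match of a successful match into a std::string. It is written against std::regex from the C++11 standard library, whose match_results and sub_match interface was modeled on Boost.Regex, so that it builds without Boost installed; replacing std:: with boost:: and including <boost/regex.hpp> is expected to behave the same way for these calls. The name capture_strings is a hypothetical helper, not part of either library.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns one string per sub_match, or an empty
// vector if the whole of `text` does not match `pattern`.
std::vector<std::string> capture_strings(const std::string& text,
                                         const std::string& pattern)
{
   std::vector<std::string> out;
   std::regex re(pattern);   // throws std::regex_error on a bad pattern
   std::smatch matches;      // match_results<std::string::const_iterator>

   if (std::regex_match(text, matches, re))
   {
      for (std::size_t i = 0; i < matches.size(); ++i)
      {
         // Build the string from the iterator pair, as in Example 2.
         // A subexpression that took no part in the match has
         // matched == false and an empty range, so this yields "".
         out.push_back(std::string(matches[i].first, matches[i].second));
      }
   }
   return out;
}
```

For example, capture_strings("aab", "(a+)(b*)") returns the three strings "aab", "aa", and "b": the whole match first, then one entry per parenthesized subexpression.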

A subexpression is a part of the original regular expression that is contained within parentheses. For example, this regular expression has three subexpressions:

(\d{1,2})\/(\d{1,2})\/(\d{2}|\d{4})

This particular expression will match a date of the form MM/DD/YY or MM/DD/YYYY (of course, it doesn't validate the semantics of the values, so the month can be greater than 12). How do you grab each of the parts? Figure 1 should give you an idea: it shows what a match_results object will look like if you use the expression above and give it the string 11/5/2005.

Figure 1. The results of a regex_match

After parsing this date, there are four elements in matches. The element at index zero refers to the entire string, and each of the remaining elements refers to the characters in the original string that satisfy the corresponding subexpression (although what each element holds can vary, as the next example shows). The entire string successfully matches the regular expression, so the three subexpressions are available at indexes 1 through 3, respectively, in the match_results sequence.
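The date example can be sketched as a small parsing function. As an assumption made so that the code builds without Boost, it uses std::regex from the C++11 standard library, whose interface was modeled on Boost.Regex; the boost:: equivalents are expected to work the same way here. DateParts and parse_date are hypothetical names introduced only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical aggregate for the three parsed fields.
struct DateParts
{
   int month;
   int day;
   int year;
};

// Returns true and fills `out` if `text` has the form MM/DD/YY or
// MM/DD/YYYY.  As in the article, the values are not range-checked.
bool parse_date(const std::string& text, DateParts& out)
{
   // Same expression as above; the '/' characters are written without
   // the backslash escapes, which are optional for this character.
   static const std::regex re(R"((\d{1,2})/(\d{1,2})/(\d{2}|\d{4}))");
   std::smatch matches;

   if (!std::regex_match(text, matches, re))
   {
      return false;
   }

   // matches[0] is the whole date; matches[1..3] are the fields.
   out.month = std::stoi(matches[1].str());
   out.day   = std::stoi(matches[2].str());
   out.year  = std::stoi(matches[3].str());
   return true;
}
```

Calling parse_date("11/5/2005", d) on a DateParts d returns true with d.month == 11, d.day == 5, and d.year == 2005. The string "13/40/99" is accepted as well, because the expression checks only the shape of the input, not whether the values make sense.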

Depending on the type of subexpressions you are using, the contents of match_results may surprise you. Consider the URL example above. This regular expression has four subexpressions, each enclosed in parentheses:

(ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+

Using repeating subexpressions (for example, (\w+\.)*) means that the subexpression can match any number of times. The match_results object still holds exactly one sub_match per subexpression, however, so what you find in each one depends on the string you try to match. Here's what you will see with a sample run of Example 2 using the URL regular expression I just gave:

Expression: (ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+
String:     http://www.foo.com/bar
        matches[0] = http://www.foo.com/bar
        matches[1] = http
        matches[2] = foo.
        matches[3] = com
        matches[4] = bar

You probably noticed right away that the "www." is missing from the results. This is because a repeating subexpression stores only its last match. If you want to, for example, grab the full host name out of this URL, you have to add another subexpression, which I have indicated with a new pair of parentheses around the host name below:

(ftp|http|https):\/\/((\w+\.)*(\w*))\/([\w\d]+\/{0,1})+

This will put the entire host name into one of the subexpressions. The order of the corresponding sub_match objects in the match_results sequence is as though the tree of nested subexpressions were traversed depth-first, left to right; in other words, subexpressions are numbered by the position of their opening parenthesis. Here's the output with this modified regular expression:

Expression: (ftp|http|https):\/\/((\w+\.)*(\w*))\/([\w\d]+\/{0,1})+
String:     http://www.foo.com/bar
        matches[0] = http://www.foo.com/bar
        matches[1] = http
        matches[2] = www.foo.com
        matches[3] = foo.
        matches[4] = com
        matches[5] = bar

The results are the same as before, except this time you also get the match for the host name subexpression in matches[2], and the subexpressions that follow it each move up one index.
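A minimal sketch of pulling out just the host name with the modified expression follows. It assumes std::regex from the C++11 standard library in place of Boost.Regex so that it builds without Boost; the interface was modeled on Boost.Regex, and the boost:: equivalents are expected to behave the same for these calls. The function name host_name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the host name of `url`, or an empty
// string if `url` does not match the expression.
std::string host_name(const std::string& url)
{
   // The modified expression from the text, with the extra pair of
   // parentheses around the host.  The '/' characters are written
   // unescaped, which matches the same strings.
   static const std::regex re(
      R"re((ftp|http|https)://((\w+\.)*(\w*))/([\w\d]+/{0,1})+)re");
   std::smatch matches;

   if (!std::regex_match(url, matches, re))
   {
      return std::string();
   }

   // Index 1 is the scheme.  Index 2 is the added outer group, which
   // spans every repetition of (\w+\.) plus the trailing (\w*).
   // Index 3 would hold only the last repetition, such as "foo.".
   return matches[2].str();
}
```

With this sketch, host_name("http://www.foo.com/bar") returns "www.foo.com", whereas reading matches[3] instead would give only "foo.".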

By using these techniques, and perhaps after some practice experimenting with regular expression syntax, you can use Boost.Regex to validate and parse a wide variety of strings. But these examples only provide a glimpse into the expressive power of regular expressions. If you aren't already familiar with regular expressions, experiment some more--you may be surprised how often they do just what you need.
