ONLamp.com

Regular Expressions in C++ with Boost.Regex

Searching

Matching and parsing a single string in its entirety does not address an important and ubiquitous use case: searching a string that contains the substring you want, along with possibly many other characters you don't.



As with matching, Boost.Regex lets you search a string for a regular expression in two ways. In the simplest case, you only want to know whether a given string contains a match for your regular expression. Example 3 is a trivial implementation of the grep program: it reads each line of a file and prints the line if it contains a substring that satisfies the regular expression pattern.

#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <fstream>

using namespace std;
const int BUFSIZE = 10000;

int main(int argc, char** argv) {

   // Safety checks omitted...
   boost::regex re(argv[1]);
   string file(argv[2]);
   char buf[BUFSIZE];

   ifstream in(file.c_str());
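   // Loop on the read itself; testing eof() before reading would
   // process a stale buffer once more after the last line
   while (in.getline(buf, BUFSIZE))
   {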
      if (boost::regex_search(buf, re))
      {
         cout << buf << endl;
      }
   }
}

Example 3. Trivial grep

You can see that you use regex_search in the same way as regex_match.
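
To make the yes-or-no form of searching concrete, here is a minimal sketch that filters the lines of an in-memory string rather than a file. It is written against std::regex from the C++11 standard library, which descends from Boost.Regex, so it builds without Boost; I am assuming that swapping in boost::regex and boost::regex_search (and including <boost/regex.hpp>) behaves the same way. The helper name grep_lines is made up for this illustration.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return every line of `text` that contains a match
// for `pattern`. std::regex stands in for boost::regex (assumption).
std::vector<std::string> grep_lines(const std::string& text,
                                    const std::string& pattern) {
   std::regex re(pattern);
   std::istringstream in(text);
   std::vector<std::string> hits;
   std::string line;

   // std::getline returns the stream, which tests false once input runs out
   while (std::getline(in, line)) {
      // regex_search succeeds if any substring of the line matches
      if (std::regex_search(line, re)) {
         hits.push_back(line);
      }
   }
   return hits;
}
```

Called as grep_lines("apple pie\nbanana\npineapple\n", "apple"), this should return the first and third lines. regex_match would reject both of them, because neither line consists of the pattern and nothing else.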

This comes in handy sometimes, but has limited appeal. More often, you will want to enumerate all substrings that match a given pattern. For example, maybe you are writing a web crawler and want to iterate over all anchor tags in a page. You can craft a regular expression to grab anchor tags:

<a\s+href="([\-:\w\d\.\/]+)">

You don't want the whole line returned, though, as in the grep example above; you want the target URL. To get it, use the first parenthesized subexpression, which is the second element of match_results, at index 1 (index 0 holds the entire match). Example 4, a slightly modified version of Example 3, does just that.

#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <fstream>

using namespace std;
const int BUFSIZE = 10000;

int main(int argc, char** argv) {

   // Safety checks omitted...
   boost::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
   string file(argv[1]);
   char buf[BUFSIZE];
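   boost::smatch matches;   // match_results for std::string iterators
   string sbuf;
   string::const_iterator begin, end;
   ifstream in(file.c_str());

   // Loop on the read itself rather than on eof()
   while (in.getline(buf, BUFSIZE))
   {
      sbuf = buf;
      // Both ends of the range must have the same iterator type
      begin = sbuf.begin();
      end = sbuf.end();

      while (boost::regex_search(begin, end, matches, re))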
      {
         string url(matches[1].first, matches[1].second);
         cout << "URL: " << url << endl;
         // Update the beginning of the range to the character
         // following the match
         begin = matches[1].second;
      }
   }
}

Example 4. Enumerating anchor tags

The hard-coded regular expression in Example 4 contains lots of backslashes. This is necessary because I am escaping certain characters twice: once for the compiler, and once for the regular expression engine.
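
If your compiler supports C++11, a raw string literal removes the compiler's layer of escaping, so the pattern appears in the source exactly as the regular expression engine sees it. The sketch below is an aside rather than part of Example 4; the variable names are made up for illustration.

```cpp
#include <string>

// The pattern from Example 4, escaped for the compiler as in the listing
const std::string escaped = "<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">";

// The same pattern as a C++11 raw string literal. The custom delimiter "re"
// is needed because the pattern itself contains the sequence )" which would
// end a raw string that used the default delimiter.
const std::string raw = R"re(<a\s+href="([\-:\w\d\.\/]+)">)re";
```

Both variables hold the same characters, so either one can be passed to the regex constructor.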

Example 4 uses a different overload of regex_search than Example 3; this version takes two bidirectional iterator arguments that refer to the beginning and end of the range of characters to be searched, plus a match_results object to fill in. To access every matching substring, all I have to do is update begin to point just past the text already matched. I use matches[1].second, the end of the captured URL; matches[0].second, the end of the whole match, works equally well.
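
The same loop can be packaged as a function that works on a string already in memory, which makes the roles of the two match_results entries easier to see. This is a sketch under assumptions: extract_urls is a hypothetical name, and std::regex and std::smatch from C++11 stand in for boost::regex and boost::smatch so the code builds without Boost.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the href target of every anchor tag in `html`.
// std::regex stands in for boost::regex (assumption).
std::vector<std::string> extract_urls(const std::string& html) {
   static const std::regex re("<a\\s+href=\"([\\-:\\w\\d\\.\\/]+)\">");
   std::vector<std::string> urls;
   std::smatch m;
   std::string::const_iterator begin = html.begin();
   const std::string::const_iterator end = html.end();

   while (std::regex_search(begin, end, m, re)) {
      urls.push_back(m[1].str());   // m[1]: first captured group, the URL
      begin = m[0].second;          // m[0]: whole match; resume just past it
   }
   return urls;
}
```

Given a fragment containing two anchor tags, the function should return the two href values in document order, and an empty vector when no tag matches.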

This is not the only way to iterate over all occurrences of a pattern. If you prefer (or require) iterator semantics, use a regex_token_iterator, which is an iterator interface to the results from a regular expression search. In Example 4, you could just as easily have iterated over the results of the URL search:

   // Read the HTML file into the string s...
   boost::sregex_token_iterator p(s.begin(), s.end(), re, 0);
   boost::sregex_token_iterator end;

   for (; p != end; ++p)
   {
      string m(p->first, p->second);
      cout << m << endl;
   }

That's not all, though. The first token iterator here passes a zero as the last argument to its constructor. This tells it to iterate over the substrings that satisfy the whole regular expression (pass 1 instead to get only the first subexpression, which here is the URL). Change it to -1 and you get the opposite: iteration over the substrings that do not satisfy the expression. In other words, it tokenizes the string, treating each match of the regular expression as a delimiter and returning the text between matches as the tokens. This is a cool feature, because it lets you tokenize a string of characters based on complex delimiters. When parsing a web page, for example, you could break the document into sections by its headers, using header tags such as <h1>...</h1>, <h3>...</h3>, etc.
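
As a rough illustration of the -1 mode, the following sketch splits a string wherever a delimiter pattern matches. The helper name split and the header pattern in the usage note are made up for this example, and std::regex with std::sregex_token_iterator (C++11) stands in for the Boost equivalents, which I am assuming behave the same way here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` wherever `delimiter` matches.
// std::regex stands in for boost::regex (assumption).
std::vector<std::string> split(const std::string& text,
                               const std::string& delimiter) {
   std::regex re(delimiter);
   // -1 selects the stretches of text between matches, not the matches
   std::sregex_token_iterator p(text.begin(), text.end(), re, -1);
   std::sregex_token_iterator end;
   std::vector<std::string> tokens;

   for (; p != end; ++p) {
      tokens.push_back(p->str());
   }
   return tokens;
}
```

Calling split("intro<h1>One</h1>first<h3>Two</h3>second", "<h[1-6]>.*?</h[1-6]>") should yield the three sections intro, first, and second. Note that a string that begins with a delimiter produces an empty first token.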

Stuff to Check Out

There is, of course, more to Boost.Regex than I've presented here, but this should give you a good idea of what you can do with regular expressions in C++. The documentation on the Boost.Regex page is comprehensive, and there are plenty of examples you can copy and experiment with. In addition to searching strings as I did above, you can:

  • Search and replace using different Perl- and sed-style formatting conventions.
  • Use POSIX basic and extended regular expression format.
  • Use Unicode strings and other non-standard string formats.
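
To give a feel for the first item, here is a small search-and-replace sketch. It uses std::regex_replace from C++11 with its default format syntax, in which $1, $2, and so on refer to captured subexpressions; Boost.Regex provides a regex_replace as well, but I have not shown its formatting flags here. The function name to_dmy and the date pattern are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrite ISO dates (YYYY-MM-DD) as DD/MM/YYYY.
// std::regex stands in for boost::regex (assumption).
std::string to_dmy(const std::string& text) {
   static const std::regex re("(\\d{4})-(\\d{2})-(\\d{2})");
   // $3, $2, $1 insert the third, second, and first captured groups;
   // text that does not match is copied through unchanged
   return std::regex_replace(text, re, "$3/$2/$1");
}
```

Every date in the input is rewritten, not just the first one, because regex_replace replaces all matches by default.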

Above all, you should experiment with regular expression syntax. There are different ways to do the same thing, and it's fun to see how concise you can make an expression that does what you want. Once you're a pro at regular expressions, you will be surprised at how often you can use them to validate, search, or parse a string.

Conclusion

Boost.Regex is the library in the Boost project that implements a regular expression engine in C++. You can use it to match, search, or search and replace with regular expressions against a target string, instead of writing ugly and cumbersome string-parsing code. Boost.Regex has been accepted as part of the next C++ standard library, and you will see it appearing in implementations of TR1 (in the std::tr1 namespace) from standard library vendors very soon. Check out Boost.Regex to get a feel for how useful it is, and while you're at it, take a look at the many other libraries in Boost. There's a lot of good stuff there.

Ryan Stephens is a software engineer, writer, and student living in Tempe, Arizona. He enjoys programming in virtually any language, especially C++.

