ONLamp.com
Regular Expressions in C++ with Boost.Regex

Matching

As I said earlier, a string matches a regular expression if the entire string satisfies the expression. Example 1 is a trivial program that accepts a regular expression and a string and tells you whether the string satisfies the expression. Compile and run it to see matching in action and to get a feel for regular expression syntax, if it is new to you.



#include <iostream>
#include <string>
#include <boost/regex.hpp>  // Boost.Regex lib

using namespace std;

int main( ) {

   string s, sre;
   boost::regex re;

   while(true)
   {
      cout << "Expression: ";
      // Stop on "quit" or when the input ends
      if (!(cin >> sre) || sre == "quit")
      {
         break;
      }
      cout << "String:     ";
      if (!(cin >> s))
      {
         break;
      }

      try
      {
         // Set up the regular expression for case-insensitivity
         re.assign(sre, boost::regex_constants::icase);
      }
      catch (const boost::regex_error& e)
      {
         cout << sre << " is not a valid regular expression: \""
              << e.what() << "\"" << endl;
         continue;
      }
      // regex_match succeeds only if the whole string satisfies the expression
      if (boost::regex_match(s, re))
      {
         cout << re << " matches " << s << endl;
      }
      else
      {
         cout << re << " does not match " << s << endl;
      }
   }
}

Example 1. A trivial regular expression tester

The first, and most important, part of this example is the inclusion of boost/regex.hpp. This header includes everything you need for the Boost.Regex library. (Unzipping and building the Boost libraries is quick and painless, and it is described on the Boost getting started page.)

The next critical part of this example is the boost::regex class. This class, as you may have guessed, contains the regular expression itself. It has the same semantics as the standard string class, except that, as Maddock says, it is "a string plus the actual state-machine required by the regular expression algorithms." It is also, like the standard string class, a class template that supports narrow- and wide-character strings with typedefs (regex = basic_regex<char> and wregex = basic_regex<wchar_t>).

Nothing interesting happens with the regex class until you assign it a string that contains a regular expression. Its assign member functions, assignment operator, and constructor will interpret and compile the regular expression, throwing a regex_error exception if it is not formed correctly. I used assign so I could send in the case-insensitivity flag boost::regex_constants::icase.

Finally, regex_match does the work. Its parameters (in the version I used in Example 1) are the target string and the regex object. Not surprisingly, it returns true if the string satisfies the expression, false otherwise.

That's the anatomy of the simplest use of Boost.Regex for matching a string to a regular expression. This mode of matching lends itself nicely to two common use cases: validation and parsing.

Validation

In general, anything a human enters into a computer is suspicious and needs validation. Sure, you can do this with a simple character-by-character comparison and a string of if/then statements, but why? Stop wasting your time writing ugly character-by-character parsing code and use regular expressions to make your code concise and your intent clear.

A URL is a good example of something users routinely enter incorrectly. Consider an application where you read a URL from a config file or some other user-entered location. In the first place, you want to make sure it's valid before you pass it off to your networking library, or perhaps you're writing the network library and you must validate the URL's format before you parse it. A URL has at least three parts: a protocol, a host name, and a path (including the file name itself). Here's an example:

ftp://downloads.foo.com/apps/linux/patch.gz

You need to accept FTP, HTTP, and HTTPS protocols. Here's a regular expression to validate the URL:

(ftp|http|https):\/\/(\w+\.)*(\w*)\/([\w\d]+\/{0,1})+

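The first expression in parentheses, (ftp|http|https), validates the protocol by matching one of the three alternatives it contains. Next comes the required colon and the two forward slashes. They are escaped with backslashes here, which is harmless, although a forward slash is not a special character in Boost.Regex's default Perl-style syntax. After that, the expression (\w+\.)*(\w*) matches a host name of the form foo.com or foo; \w accepts letters, digits, and the underscore. Then comes another forward slash, followed by ([\w\d]+\/{0,1})+, which matches a pathname of the form downloads/w32/utils (the \d is redundant, since \w already includes digits). Note that this character class does not admit a dot, so a file name with an extension, such as patch.gz in the sample URL above, is rejected as written; add the dot to the class if you need to accept such names.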

From Example 1, you can see that this is easy to code:

boost::regex re;
try
{
   // Backslashes are doubled because this is a C++ string literal
   re.assign("(ftp|http|https):\\/\\/(\\w+\\.)*(\\w*)\\/([\\w\\d]+\\/{0,1})+");
}
catch (const boost::regex_error& e)
{
   cerr << "The regexp is invalid: " << e.what() << endl;
   throw;   // rethrow the original exception
}
if (!boost::regex_match(url, re))
{
   throw "Your URL is not formatted correctly!";
}

regex_match has several overloads to handle strings, char*s, and iterator ranges, à la the standard library algorithms.

The previous snippet does a syntactic validation. If the string satisfies the regular expression, then you are happy. This is not the end of this road, though, because often you will need to do more than simply validate the structure of a string--you may need to extract part of the string to do something with it. In the URL example, for instance, the host name is a part you are likely to want.
