Python DevCenter


Python News

Structured grep and Python


When text files are structured, like HTML, XML, or even news or mail files, you can take advantage of that structure in your search. You can search for words that appear within certain tags, like in the title element of an HTML document, or within the From field of a mail file. All you need is a tool that understands the structure of your text.

Jani Jaakkola and Pekka Kilpeläinen's structured text search and index tool, sgrep, handles all structured text in a generic way. It is generic because you supply information about the structure with each search: sgrep's expression language lets you describe the structure so that sgrep can find exactly what you want. It also has a printf-like format for printing the output. Since typing out all those details can be tedious, sgrep can use a preprocessor like m4 to read macros. For example, in the expression

sgrep -o"%f:%r\n" '"Stephen" IN HTML_TITLE' ~/public_html/*.html

HTML_TITLE is a macro for

(( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or "<TITLE\n") .. ">")) .. ( "</TITLE>" ) ))

This sgrep expression looks for the name "Stephen" in the contents of title tags in the HTML documents in your public_html directory. As specified by the -o option, sgrep prints the file name, a colon, and then the text of the matching region.
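To make the search concrete, here is a rough Python approximation of what that query asks for. It is a sketch built on the standard re module, not on sgrep: the function names are my own, and unlike the macro above it ignores case so that lowercase title tags also match.

```python
import re

# Regex stand-in for the HTML_TITLE macro: "<TITLE>" or "<TITLE ...>" up to "</TITLE>".
# Assumption: case is ignored here, which the sgrep macro itself does not do.
TITLE_RE = re.compile(r"<TITLE(?:\s[^>]*)?>(.*?)</TITLE>", re.IGNORECASE | re.DOTALL)


def find_in_title(word, html):
    """Return every occurrence of `word` that lies inside a title element."""
    hits = []
    for title in TITLE_RE.finditer(html):
        hits.extend(re.findall(re.escape(word), title.group(1)))
    return hits


def search_files(word, paths):
    """Yield 'file:region' strings, one per match, like the -o format above."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for region in find_in_title(word, handle.read()):
                yield f"{path}:{region}"
```

Calling search_files("Stephen", glob.glob(...)) over your HTML files would give output of the same shape, but you would need a new regular expression for every other kind of structure, which is the work sgrep's generic approach saves.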

OK, it isn't pretty, and it isn't a quick thing to learn either. To define those macros you have to use m4, and sgrep's expressions are only slightly more understandable than the equivalent regular expressions would be. Your reward for learning it is a powerful command-line tool for searching your documents and a set of macros that makes searching a breeze. You also have one tool that can work on any kind of structured text.

sgrep isn't new. The last version was released in 1998. What's new is Dave Kuhlman's PySgrep, a Python extension module to call and control sgrep. PySgrep lends the power of Python to sgrep, but it doesn't free you from understanding sgrep's language (although you could write your own Python macros and avoid m4). PySgrep uses a call-back object to handle the results of a query. Whenever there is a carriage return in the result stream, PySgrep calls the call-back object's write method.
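Writing your own Python macros could be as simple as substituting names in the expression string before handing it to sgrep. The following is a minimal sketch of that idea; the MACROS table and the expand function are hypothetical names of mine, not part of PySgrep, and the macro body is the HTML_TITLE definition from above written on one line.

```python
import re

# Hypothetical macro table: each name maps to sgrep expression text.
MACROS = {
    "HTML_TITLE": (
        r'(( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or "<TITLE\n") .. ">"))'
        r' .. ( "</TITLE>" ) ))'
    ),
}


def expand(expression, macros=MACROS):
    """Replace each whole-word macro name with its sgrep expression text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, macros)) + r")\b")
    # A function is used as the replacement so backslashes are copied verbatim.
    return pattern.sub(lambda match: macros[match.group(1)], expression)


print(expand('"Stephen" IN HTML_TITLE'))
```

This does a single pass, so unlike m4 it will not expand macros that appear inside other macros. The expanded string is what you would then pass along as the query.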

Layering the power of Python over the fast searching power of sgrep, you could use the callback to further refine the search, perhaps using a regular expression you couldn't have used with sgrep alone. You could put the information into a list instead of pumping it to standard out. You could use it as the basis for a search engine for your site. There are many possibilities.
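As an illustration, a call-back object that refines results with a regular expression and keeps them in a list might look like the sketch below. The only thing taken from PySgrep here is the idea of an object with a write method; the class name is mine, the sample result lines are made up, and I am assuming each call to write delivers one line of output. The loop at the end stands in for PySgrep making those calls.

```python
import re


class RefiningCollector:
    """Call-back style object: keeps result lines that match a regular expression."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)
        self.matches = []

    def write(self, text):
        # Assumption: each call carries one result line, possibly newline-terminated.
        line = text.rstrip("\r\n")
        if self.pattern.search(line):
            self.matches.append(line)


# Simulated result stream in "file:region" form; keep only hits from index.html files.
collector = RefiningCollector(r"(^|/)index\.html:")
for result in ["index.html:Stephen\n", "about.html:Stephen\n", "blog/index.html:Stephen\n"]:
    collector.write(result)

print(collector.matches)
```

Because the results end up in collector.matches rather than on standard out, the rest of your program can sort them, count them, or feed them to another query.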

One of PySgrep's weak points, however, is how you access the files to be queried. PySgrep will take information from stdin, or it will open files named in a file. I would like to pass a list of file names to the query call, or even pass the text itself in an appropriate data structure like a list, but PySgrep doesn't currently support that.
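Until then, one workaround is to write your list of file names out to a temporary file and point the query at that. This is only a sketch: write_file_list is a helper name of my own, and I am assuming the file list is read as one path per line, so check that against the PySgrep and sgrep documentation before relying on it.

```python
import os
import tempfile


def write_file_list(paths):
    """Write one path per line to a temporary file and return that file's name."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".lst", delete=False, encoding="utf-8"
    ) as handle:
        for path in paths:
            handle.write(path + "\n")
    return handle.name


list_name = write_file_list(["index.html", "about.html"])
print(list_name)
os.remove(list_name)  # the caller cleans up once the query has run
```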

Darrel Gallion's swigSgrep was an earlier attempt to pair sgrep with Python. swigSgrep had better support for handling files but did not support sgrep's m4 macro facility. Its interface is clunkier as well, being a straight SWIG wrapping of the sgrep program. Kuhlman's effort seems clearer to me. If you work with a variety of structured text files, you should take a look at PySgrep.

Stephen Figgins administrates Linux servers for Sunflower Broadband, a cable company.
