Building Recursive Descent Parsers with Python

Listing 3

from pyparsing import *
import urllib

# define basic text pattern for NTP server 
integer = Word("0123456789")
ipAddress = integer + "." + integer + "." + integer + "." + integer
tdStart = Literal("<td>")
tdEnd = Literal("</td>")
timeServer =  tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print srvrtokens

Running the program in Listing 3 gives the token data:

[' <td>', '129', '.', '6', '.', '15', '.', '28', '</td> ', ' <td>', 'NIST, Gaithersburg, Maryland', '</td> ']
[' <td>', '129', '.', '6', '.', '15', '.', '29', '</td> ', ' <td>', 'NIST, Gaithersburg, Maryland', '</td> ']
[' <td>', '132', '.', '163', '.', '4', '.', '101', '</td> ', ' <td>', 'NIST, Boulder, Colorado', '</td> ']
[' <td>', '132', '.', '163', '.', '4', '.', '102', '</td> ', ' <td>', 'NIST, Boulder, Colorado', '</td> ']
:

Looking at these results, a couple of things immediately jump out. One is that the parser records each IP address as a series of separate tokens, one for each subfield and delimiting period. It would be nice if pyparsing were to do a bit of work during the parsing process to combine these fields into a single-string token. Pyparsing's Combine class will do just this. Modify the ipAddress definition to read:

ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )

to get a single-string token returned for the IP address.
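As a quick offline check of what Combine changes, the following minimal sketch parses one sample address (copied from the output above) with and without Combine. The names plain_ip, combined_ip, and sample are illustrative and do not appear in the article's listings; the sketch assumes pyparsing is installed.

```python
from pyparsing import Combine, Word

integer = Word("0123456789")

# The same address pattern built twice: once as-is, once wrapped in Combine.
plain_ip = integer + "." + integer + "." + integer + "." + integer
combined_ip = Combine(integer + "." + integer + "." + integer + "." + integer)

sample = "129.6.15.28"  # sample value copied from the output shown earlier

plain_tokens = plain_ip.parseString(sample).asList()
combined_tokens = combined_ip.parseString(sample).asList()

print(plain_tokens)     # separate tokens for each number and period
print(combined_tokens)  # a single token holding the whole address
```

The two results differ only in how the matched text is grouped; the pattern that must match is the same in both cases.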

The second observation is that the results include the opening and closing HTML tags that mark the table columns. While the presence of these tags is important during the parsing process, the tags themselves are not interesting in the extracted data. To suppress them from the returned token data, call the suppress method on the tag literals as you construct them:

tdStart = Literal("<td>").suppress()
tdEnd   = Literal("</td>").suppress()
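To see the effect of suppress in isolation, here is a small sketch that parses a single table cell twice, once with plain literals and once with suppressed ones. The sample string and the variable names are made up for illustration; only Literal, SkipTo, and suppress come from the article.

```python
from pyparsing import Literal, SkipTo

sample = "<td>NIST, Boulder, Colorado</td>"  # illustrative sample cell

# Plain literals: the tags are matched and also returned as tokens.
td_start = Literal("<td>")
td_end = Literal("</td>")
with_tags = (td_start + SkipTo(td_end) + td_end).parseString(sample).asList()

# Suppressed literals: the tags must still match, but are left out of the results.
quiet_start = Literal("<td>").suppress()
quiet_end = Literal("</td>").suppress()
without_tags = (quiet_start + SkipTo(quiet_end) + quiet_end).parseString(sample).asList()

print(with_tags)
print(without_tags)
```

Note that suppression affects only what is returned; input that lacks the tags still fails to match.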

Listing 4

from pyparsing import *
import urllib

# define basic text pattern for NTP server 
integer = Word("0123456789")
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer =  tdStart + ipAddress + tdEnd + tdStart + SkipTo(tdEnd) + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print srvrtokens

Now run the program in Listing 4. Your returned token data has substantially improved:

['129.6.15.28', 'NIST, Gaithersburg, Maryland']
['129.6.15.29', 'NIST, Gaithersburg, Maryland']
['132.163.4.101', 'NIST, Boulder, Colorado']
['132.163.4.102', 'NIST, Boulder, Colorado']

Finally, add result names to these tokens, so that you can access them by attribute name. The easiest way to do this is in the definition of timeServer:

timeServer = tdStart + ipAddress.setResultsName("ipAddress") + tdEnd + \
             tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd

Now you can neaten up the body of the for loop and access these tokens by name, either as keys in a dictionary or as attributes:

servers = {}

for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print "%(ipAddress)-15s : %(locn)s" % srvrtokens
    servers[srvrtokens.ipAddress] = srvrtokens.locn
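The named results can be tried without downloading anything. The sketch below applies the same grammar to one hand-written table row (an assumption standing in for the real page) and reads the named tokens three ways: as attributes, as keys, and through asDict. The variable names sample_row and tokens are illustrative.

```python
from pyparsing import Combine, Literal, SkipTo, Word

integer = Word("0123456789")
ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = (tdStart + ipAddress.setResultsName("ipAddress") + tdEnd
              + tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd)

# Hand-written row imitating the page's markup (not fetched from the site).
sample_row = "<td>132.163.4.101</td> <td>NIST, Boulder, Colorado</td>"
tokens = timeServer.parseString(sample_row)

by_attribute = tokens.ipAddress   # attribute-style access
by_key = tokens["locn"]           # dictionary-style access
as_dict = tokens.asDict()         # plain dict of all named results

print(by_attribute)
print(by_key)
print(as_dict)
```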

Listing 5 contains the finished running program.

Listing 5

from pyparsing import *
import urllib

# define basic text pattern for NTP server 
integer = Word("0123456789")
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = tdStart + ipAddress.setResultsName("ipAddress") + tdEnd + \
             tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd

# get list of time servers
nistTimeServerURL = "http://tf.nist.gov/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

servers = {}
for srvrtokens,startloc,endloc in timeServer.scanString( serverListHTML ):
    print "%(ipAddress)-15s : %(locn)s" % srvrtokens
    servers[srvrtokens.ipAddress] = srvrtokens.locn

print servers

At this point, you've successfully extracted the NTP servers' IP addresses and locations and stored them in a dictionary, so that your NTP-client application can make use of the parsed results.
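Listing 5 relies on the print statement and urllib.urlopen, which are specific to Python 2; in Python 3, urlopen lives in urllib.request. The following is one possible adaptation, not the author's code: fetch_page and extract_servers are hypothetical helper names, the UTF-8 decoding is an assumption about the page, and the sample rows are hand-written so the sketch runs without network access. It also assumes the installed pyparsing still accepts the setResultsName and scanString names used in the article.

```python
from pyparsing import Combine, Literal, SkipTo, Word

try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib import urlopen          # Python 2 location, as in Listing 5

# Grammar copied from Listing 5.
integer = Word("0123456789")
ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
tdStart = Literal("<td>").suppress()
tdEnd = Literal("</td>").suppress()
timeServer = (tdStart + ipAddress.setResultsName("ipAddress") + tdEnd
              + tdStart + SkipTo(tdEnd).setResultsName("locn") + tdEnd)


def fetch_page(url):
    """Hypothetical helper: download a page and return it as text."""
    page = urlopen(url)
    try:
        raw = page.read()
    finally:
        page.close()
    # Assumption: the page is UTF-8 compatible; undecodable bytes are replaced.
    return raw.decode("utf-8", "replace")


def extract_servers(html):
    """Hypothetical helper: map each IP address found in html to its location."""
    servers = {}
    for tokens, startloc, endloc in timeServer.scanString(html):
        servers[tokens.ipAddress] = tokens.locn
    return servers


# Hand-written rows standing in for the downloaded page.
sample_html = """
<tr><td>129.6.15.28</td><td>NIST, Gaithersburg, Maryland</td></tr>
<tr><td>132.163.4.101</td><td>NIST, Boulder, Colorado</td></tr>
"""
sample_servers = extract_servers(sample_html)
for ip in sorted(sample_servers):
    print("%-15s : %s" % (ip, sample_servers[ip]))
```

To run against the live page, pass the result of fetch_page, called with the URL from Listing 5, to extract_servers. The page's markup may have changed since the article was written, in which case the grammar would need adjusting.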

In Conclusion

Pyparsing provides a basic framework for creating recursive-descent parsers, taking care of the overhead functions of scanning the input string, handling expression mismatches, selecting the longest of matching alternatives, invoking callback functions, and returning the parsed results. This leaves developers free to focus on their grammar design and the design and implementation of corresponding token processing. Pyparsing's nature as a combinator allows developers to scale their applications from simple tokenizers up to complex grammar processors. It is a great way to get started with your next parsing project!

Download pyparsing from SourceForge.

Paul McGuire is a senior manufacturing systems consultant at Alan Weber & Associates. In his spare time, he administers the pyparsing project on SourceForge.

