Pyparsing Is a "Combinator"
With the pyparsing module, you first define the basic pieces of your grammar. Then you combine them into more complex parse expressions for the various branches of the overall grammar syntax. Combine them by defining relationships such as:
- Which expressions should follow each other in the grammar, such as "the keyword
ifis followed by a Boolean expression enclosed in parentheses"
- Which expressions are valid alternatives at a certain point in the grammar, such as "a SQL command may start with
- Which expressions are optional, as in "a phone number may optionally be preceded by an area code, enclosed in parentheses"
- Which expressions are repeatable, such as "an opening XML tag may contain zero or more attributes"
Although some complex grammars can involve dozens or even hundreds of grammar combinations, many parsing tasks are easily performed with only a handful of definitions. Capturing the grammar in BNF helps organize your thoughts and parser design. It also helps you keep track of your progress in implementing the grammar with pyparsing's functions and classes.
Defining a Simple Grammar
The smallest building blocks of most grammars are typically exact strings of characters. For example, here is a simple BNF for parsing a phone number:
number :: '0'.. '9'* phoneNumber :: [ '(' number ')' ] number '-' number
Because this looks for dashes and parentheses within a phone number string, you can also define simple literal tokens for these punctuation marks:
dash = Literal( "-" ) lparen = Literal( "(" ) rparen = Literal( ")" )
To define the groups of numbers in the phone number, you need to handle groups of characters of varying lengths. For this, use the Word token:
digits = "0123456789" number = Word( digits )
The number token will match contiguous sequences made up of characters listed in the string digits; that is, it is a "word" composed of digits (as opposed to a traditional word, which is composed of letters of the alphabet). Now you have enough individual pieces of the phone number, so you can string them together using the
phoneNumber = And( [ lparen, number, rparen, number, dash, number ] )
This is fairly ugly, and unnatural to read. Fortunately, the pyparsing module defines operator methods to combine separate parse elements more easily. A more legible definition uses
phoneNumber = lparen + number + rparen + number + dash + number
For an even cleaner version, the
+ operator will join strings to parse elements, implicitly converting the strings to Literals. This gives the very easy-to-read:
phoneNumber = "(" + number + ")" + number + "-" + number
Finally, to designate that the area code at the beginning of the phone number is optional, use pyparsing's
phoneNumber = Optional( "(" + number + ")" ) + number + "-" + number
Using the Grammar
Once you've defined your grammar, the next step is to apply it to the source text. Pyparsing expressions support three methods for processing input text with a given grammar:
parseStringmethod uses a grammar that completely specifies the contents of an input string, parses the string, and returns a collection of strings and substrings for each grammar construct.
scanStringmethod uses a grammar that may match only parts of an input string, scans the string looking for matches, and returns a tuple that contains the matched tokens and their starting and ending locations within the input string.
transformStringmethod is a variation on
scanString. It applies any changes to the matched tokens and returns a single string representing the original input text, as modified by the individual matches.
Hello, World! parser calls
parseString and returns straightforward token results:
Hello, World! -> ['Hello', ',', 'World', '!']
Although this looks like a simple list of token strings, pyparsing returns data using a
ParseResults object. In the example above, the results variable behaves like a simple Python list. In fact, you can index into the results just like a list:
print results print results[-2]
ParseResults also lets you define names for individual syntax elements, making it easier to retrieve bits and pieces of the parsed text. This is especially helpful when a grammar includes optional elements, which can change the length and offsets of the returned token list. By modifying the definitions of
salute = Word( alphas+"'" ).setResultsName("salute") greetee = Word( alphas ).setResultsName("greetee")
you can reference the corresponding tokens as if they were attributes of the returned results object:
print hello, "->", results print results.salute print results.greetee
Now the program will print:
G'day, Mate! -> ["G'day", ',', 'Mate', '!'] G'day Mate
Results names can greatly help to improve the readability and maintainability of your parsing programs.
In the case of the phone number grammar, you can parse an input string containing a list of phone numbers, one after the next, as:
phoneNumberList = OneOrMore( phoneNumber ) data = phoneNumberList.parseString( inputString )
This will return data as a pyparsing ParseResults object, containing a list of all of the input phone numbers.
Pyparsing includes some helper expressions, such as
delimitedList, so that if your input were a comma-separated list of phone numbers, you could simply change
phoneNumberList = delimitedList( phoneNumber )
This will return the same list of phone numbers that you had before. (
delimitedList supports any custom string or expression as a delimiter, but comma delimiters are the most common, and so they are the default.)
If, instead of having a string containing only phone numbers, you had a complete mailing list of names, addresses, zip codes, and phone numbers, you could extract the phone numbers using
scanString is a Python generator function, so you must use it in a for loop, list comprehension, or generator expression.
for data,dataStart,dataEnd in phoneNumber.scanString( mailingListText ): . . # do something with the phone number tokens, # returned in the 'data' variable . .
Lastly, if you had the same mailing list but wished to hide the numbers from, say, a potential telemarketer, you could transform the string by attaching a parse action that just changes all phone numbers to the string
(000)000-0000. Replacing the input tokens with a fixed string is a common parse action, so pyparsing provides a built-in function
replaceWith to make this very simple:
phoneNumber.setParseAction( replaceWith("(000)000-0000") ) sanitizedList = phoneNumber.transformString( originalMailingListText )