oreilly.comSafari Books Online.Conferences.


Building a Parrot Compiler
Pages: 1, 2, 3

Optimizing the Grammar

Recursive descent parsers do a left-most, depth-first match of the grammar against the input source. While this is a powerful technique, it's also potentially very slow — there's a lot more backtracking with a recursive descent parser than with a LALR parser. Because of this, it's a good idea to optimize your grammars, otherwise you'll find your parser too slow, regardless of the language it's written in, to be useful.

Optimizing recursive descent grammars deserves an article (or more) all by itself, but there are a few easy tricks you can use to get a respectable speedup in parse times. The easiest way to make a recursive descent parser faster is to make it do less work. Since rule failures (that is, rules that don't match) make up a significant amount of the work the parser does, make sure the parser fails as rarely as possible, or, rather, succeeds as often as possible.

For example, if you have a grammar with the following keywords:

abort: abort

else: else

if: if

print: print

read: read

then: then

It's useful to have a grammar element that encompasses them all. As a first try, you'd probably throw something together like:

keyword: abort | else | if | print | read | then

and that will work. It'll also be slower than it has to be.

Why? Well, that's due to the way that recursive descent parsers work. When the parser is looking to see if the keyword rule matches, it tries each sub-rule in turn. In this case the rules are listed alphabetically, so it tries abort first, then else, then if, and so on. Programming languages are no different than any other language. Some words are more common than others. In this case, it's likely that abort is the least common of the words in your program, so you're guaranteed that the keyword rule will fail the first test most of the time. The rule will still work, just not as fast as if you had the subrules ordered by frequency of occurrence. Performing a simple frequency analysis on a representative sample of what you're parsing will give you the information you need to order the subrules more efficiently.

Straight frequency sorting works when each subrule costs about the same amount to evaluate. When your subrules are each constant strings they each take about the same amount of time to evaluate. If you have some subrules that are significantly more expensive — because you've got a complex regular expression, or have to execute code to check if a subrule matched — you may find it better to reorder the subrules based both on frequency and expense. Often this is a matter of trial and error, but proper ordering of subrules can cut parsing time by half or more, depending on how many subrules you have and the frequency distribution of your keywords.

Generating the Code

Once you've transformed your source program into a parsed tree, you need to then turn that tree into executable code. In our case, we generate PIR, which is Parrot's Intermediate Representation, a sort of high-level assembly language. While we could have generated the assembly directly, PIR has some features that are nice for compiler writers, including automatic register allocation, subroutine call handling, and subroutine preamble and postamble handling. If you've never had to deal with any of these things count yourself lucky. If you have, you'll be happy you won't have to do so now.

Our example language is pretty simple. We can declare variables, use them in expressions, store numbers and strings in them, and print them out. There are no loops, decisions, blocks, subroutines, or flow control words. While that's not sufficient for a real language, it's enough for example purposes.

Sample Program

Since we need a source program to go with our grammar, let's use the following program:

declare foo
declare bar

foo = 15
bar = (foo + 8) * 32 7

print bar
print \n
print foo % 3
print \n

Once again, nothing at all fancy. The output should be 729 and 5, each on separate lines. Each line is a single statement, so we don't have to worry about statements spanning lines, multiple statements on a line, or other unpleasant complications. This is a good thing, at least for example purposes.

Source Code

Download the example file,

The basic parser and compiler is in Listing 1. It integrates the grammar we detailed earlier with the code necessary to handle the output of the parser and generate PIR code to feed into Parrot. When this program runs, it parses our source and, if there are no errors, it feeds the results into the compilation code, which produces a file of PIR code. From there you can feed the PIR to Parrot and see it run. This compiler fully parses the source before handing the results off for code generation, but you can easily intersperse parsing and code generation if you want.

Basic Parser Behavior

The way we have Parse::RecDescent set up, it generates a tree for each statement. This is a series of arrays of arrays (of arrays of arrays, and so forth) with each matching rule producing an array whose elements are the rule name and the matching sub-rule elements. If what matched for a rule is another rule the element will be an array reference, while if the rule matches one or more strings and regexes, the elements will be string constants.

For example, consider the statement

declare foo

That matches the declare rule, which has two elements, the string constant declare and a field expression. The field expression itself is just a regex that matches a word, at least as Perl defines it. That means the result is:

['declare', 'declare', ['field', 'foo'] ]

The inner rule for the field produces an array of two elements, the rule name (field) and the string that matched the regex (foo). The outer rule produces a result with three elements, the rule name (declare), the string declare, and the result of the field subrule.

The parser will, when it's done, hand the code generation routine a tree for each statement that it parses. Since each node and subnode is tagged with the rule that it matches, the easiest way to process this tree is with a table of functions for the rules. For our grammar, the table would look like:

my (%handlers) = (
    addval         => \&handle_generic_val,
    assign         => \&handle_assign,
    cmpval         => \&handle_generic_val,
    constant       => \&delegate,
    declare        => \&handle_declare,
    expr           => \&delegate,
    field          => \&handle_field,
    float          => \&handle_float,
    logval         => \&handle_generic_val,
    modval         => \&handle_generic_val,
    mulval         => \&handle_generic_val,
    parenexpr      => \&handle_paren_expr,
    print          => \&handle_print,
    simplevalue    => \&delegate,
    statement      => \&delegate,
    stringconstant => \&handle_stringconstant,

Using a table like this makes the compiler easily extendable — if you add in a new node you just add in a new entry in the table. You'll also note that some of the rules share functions. This is fine, and we'll get into why in a little bit.

Emitting Header Information

Before we start processing the output of the parser, we need to put out some boilerplate header information. This is a common thing for compilers to do, and you'll find yourself with a small catalog of boilerplate code. In this case, the boilerplate looks like:

.sub __MAIN prototyped, @MAIN
.param pmc argv

This declares that we're in a subroutine named __MAIN, which is prototyped, and takes a single PMC parameter argv. Being in a subroutine is mandatory; Parrot requires that all code executed must be in a subroutine, function, or method. For us, with our simple sub-less language, the name is irrelevant, but in languages with real subs and methods the name is often meaningful. The @MAIN tag on the subroutine marks it as the starting subroutine for our program. When parrot executes our program, this is where it starts.

At the moment, Parrot assumes that the first subroutine in the first file that's executed is the driving sub, though we may change that later on.

The parameter, argv, goes unused in our example language, but it's an array PMC that holds the command-line parameters passed into parrot when our program started.

Emitting PIR Code

After the boilerplate is emitted, we can start processing the trees of nodes the parser has generated. If we mandate that each node-handling function returns the PIR code needed to process that node, the code driving code generation can look like:

foreach my $node (@nodes) {
    my (@lines) = process_node(@$node);
    print join("", @lines);

In this case, we assume that the @nodes array holds all the statement nodes from our parsed program. We flatten and process each node in turn, printing out all the code for the node. The function to process a node looks like:

sub process_node {
    my (@elems) = @_;

    if(exists($handlers{$elems[0]})) {
        return $handlers{$elems[0]}->(@elems);
    } else {
        return "***", $elems[0], "***\n";

This function (the version in the listing has some extra error checking) just looks to see if it can find a function in our global handler hash to handle the node that's been passed in. If it can, it calls that function, passing in the flattened node data. If it can't, it emits an easy-to-find string that notes the node that it couldn't find a handler for. This comes in handy when debugging the compiler later on, as it makes it easy to find where you've found a token you don't know how to deal with.

Pages: 1, 2, 3

Next Pagearrow

Sponsored by: