Building a Parrot Compiler

With the driver code and the node processing code in place, it's just a matter of writing the code for the individual nodes. This code can get complex, so it isn't necessarily as simple as it might seem, but it's not that tough. When writing a compiler, it's important to remember that you need to evaluate things from the inside out. When evaluating the node for print, for example, you have to evaluate the expression to be printed before doing the actual print. This implies that every node that gets evaluated produces both a usable value and the code to be executed to produce that value. Since the nodes already return the code needed to produce that value, we need an alternate way to get at the value itself.

The following code in the compiler does this for us:

sub last_expr_val {
    return $::last_expr;
}

sub set_last_expr_val {
    $::last_expr = $_[0];
}

We use a global variable here, which is a perfectly fine thing to do in this case, but we hide it behind a pair of accessor functions. This is less a stylistic convention (as you'll see, we'll introduce more globals later; compilers seem to accumulate them) than a matter of convenience: using accessor functions means we can easily wedge in more code later if necessary.
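
To see how the pieces fit together, here's what a handler for the print statement might look like. This is a hypothetical sketch rather than the example compiler's actual handler, and it assumes the print node carries the matched literal text followed by the expression subnode. It evaluates the expression first, from the inside out, and then prints whatever value that evaluation left behind:

sub handle_print {
    my ($nodetype, undef, $expr) = @_;
    my @lines;

    # Generate the code for the expression first; it leaves its value
    # (a constant, a variable name, or a temp register) behind for us.
    push @lines, process_node(@$expr);

    # Now emit the print itself, plus a trailing newline.
    push @lines, "print " . last_expr_val() . "\n";
    push @lines, "print \"\\n\"\n";
    return @lines;
}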

There are three types of nodes we'll need to deal with in our compiler. The first type is the terminal node. These are nodes that resolve to a single concrete thing. The float, field, and stringconstant rules produce terminal nodes, since what they generate is a single concrete thing: a floating point constant, a field name, or a string constant, respectively. The handlers for nodes like these are generally simple. For example, the handler for a floating point constant is just:

sub handle_float {
    my ($nodetype, $floatval) = @_;
    set_last_expr_val($floatval);
    return;
}

These nodes don't even produce any executable code, since there's no code that actually needs to be executed for them. It's sometimes useful to put validation code in a terminal node handler, though, if there's something you want to check at code generation time. For example, the field node handler may look like:

sub handle_field {
    my ($nodetype, $fieldname) = @_;

    if (!exists $global_vars{$fieldname}) {
        die "undeclared field $fieldname used";
    }

    set_last_expr_val($fieldname);
    return;
}

if you want the compiler to emit a fatal error when a variable is accessed before it has been declared. The field access code could also fetch fields out of a global namespace, if you were using one. (Our compiler uses the named local variable feature of Parrot's PIR compiler, so we don't do this.)

The next type of node you'll deal with is the non-terminal node. These are nodes generated by rules in which multiple subrules, pieces of literal text, or regular expressions match. The declare rule is a good example of this, since it matches both the constant text declare and the field subrule that carries the name of the field being declared. The declare handler looks like:

sub handle_declare {
    my ($nodetype, undef, $var) = @_;
    my @lines;
    my $varname = $var->[1];

    # Does it exist?
    if (defined $global_vars{$varname}) {
        die "Multiple declaration of $varname";
    }

    $global_vars{$varname}++;
    push @lines, " .local pmc $varname\n";
    push @lines, " new $varname, .PerlInt\n";
    return @lines;
}

This handler illustrates an important feature of compilers: they cheat, sometimes extraordinarily. In this case, all we're doing is peeking into the insides of what the field subrule produced, rather than delegating to the field handler. Why? Well, if we delegated to the field handler it would complain that the field hasn't yet been declared. While that's true, it's not a useful error, since we're in the field declaration rule. So we cheat and peek ahead a bit.

This particular cheat isn't strictly necessary, so if you object to it you can get rid of it. There are several ways to do that: making a separate pass over the tree to note the declarations before the compilation pass, tweaking the parser to make entries in the global variable list as it parses, using a different rule for field names in the declaration rule, or adding a context marker that the field handler can check. Consider them exercises for the reader if you're so inclined.
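
The first of those options might look something like the following sketch. It assumes the tree layout the handlers already rely on (each node is an array reference whose first element is the rule name) and that the declaration rule is named declare; if you take this route, handle_declare no longer needs to manage %global_vars itself.

sub note_declarations {
    my ($node) = @_;
    return unless ref $node eq 'ARRAY';
    my ($nodetype, @children) = @$node;

    if ($nodetype eq 'declare') {
        # The children are the literal "declare" text and the field
        # subnode; the field name is the subnode's second element.
        my $varname = $children[1][1];
        # The multiple-declaration check moves here from handle_declare.
        die "Multiple declaration of $varname" if $global_vars{$varname}++;
        return;
    }

    # Otherwise just walk whatever subnodes this node has.
    note_declarations($_) for @children;
}

Run this over the parse tree before the compilation pass, and the field handler's "undeclared field" check is no longer a problem inside declarations.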

The third type of node you'll encounter is a node that has no purpose other than to delegate to another node. The constant and simplevalue rules are examples of this. Each rule has several alternative subrules, but only one of them can match. That means the node generated for the rule will consist of just a single subrule node, so we can punt the processing to that subrule's handler and be done with it. This is done with the delegate handler that's attached to these rules. It looks like:

sub delegate {
    my ($nodetype, $nodeval) = @_;
    return process_node(@$nodeval);
}

All it does is call the subnode handler and return whatever it returns.

Now, you may wonder why we even bother with nodes like this. The reason is partly sheer laziness: we could make a pass over the tree first and splice out nodes like this, but that pass takes some time, so the compiler is a bit faster if we skip it. Leaving the tree alone also gives us some flexibility for the future, since we may change the grammar a bit and make these more than just passthrough nodes.
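
For the record, a pruning pass might look something like this sketch; the rule names in %passthrough and the node layout are assumptions based on the handlers above, so treat it as illustration rather than gospel:

my %passthrough = map { $_ => 1 } qw(constant simplevalue);

sub flatten_tree {
    my ($node) = @_;
    return $node unless ref $node eq 'ARRAY';
    my ($nodetype, @children) = @$node;

    # Flatten the subtree first, then splice this node out entirely if
    # it's a pure passthrough wrapping a single subnode.
    @children = map { flatten_tree($_) } @children;
    return $children[0]
        if $passthrough{$nodetype} and @children == 1 and ref $children[0];

    return [ $nodetype, @children ];
}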

Most of the terminal and nonterminal nodes are simple enough to process; the place you'll probably run into the most trouble is in processing expressions. The grammar does proper precedence-based and parenthesized grouping of expression terms, but processing the expression can still be somewhat tricky. You've probably noticed that the five different expression term rules are essentially identical. That's because they are. The only reason they're split into separate rules is that this is what top-down parsers need in order to handle precedence properly. (As we've said, bottom-up parsers, like the ones yacc generates, don't have this issue.)

Complex Handlers

All our expression rules can be handled by a single handler, and they are. The handler is a bit complex, so we'll look at only a part of it:

sub handle_generic_val {
    my (undef, $terms) = @_;
    my (@terms)        = @$terms;
    my $lhs            = shift @terms;
    my @tokens;

    push @tokens, process_node(@$lhs);

    my ($op, $rhs);

    while (@terms) {
        $op      = shift @terms;
        $rhs     = shift @terms;
        my $val  = last_expr_val();
        my $oper = $op->[1];

        push @tokens, process_node(@$rhs);

        my $other_val = last_expr_val();
        my $dest      = $temps{P}++;

        foreach ($oper) {
            # Simple stuff -- addition, subtraction, multiplication,
            # division, and modulus. Just a quick imcc transform
            /(\+|\-|\*|\/|%)/ && do {
                push @tokens, "new \$P$dest, .PerlInt\n";
                push @tokens, "\$P$dest = $val $oper $other_val\n";
                set_last_expr_val("\$P$dest");
                last;
            };

            # Equality test. Parrot's comparisons are test-and-branch ops,
            # so assume a true result, then skip over the "false" assignment
            # when the two values turn out to be equal. (The text below walks
            # through this pattern.)
            /=/ && do {
                my $label = "eqcheck$tempcount";
                $tempcount++;
                push @tokens, "new \$P$dest, .Integer\n";
                push @tokens, "\$P$dest = 1\n";
                push @tokens, "eq $val, $other_val, $label\n";
                push @tokens, "\$P$dest = 0\n";
                push @tokens, "$label:\n";
                set_last_expr_val("\$P$dest");
                last;
            };
        }
    }

    return @tokens;
}

This code only handles basic math and equality testing, but that's sufficient for illustration. The basic math operations each first create a new temporary variable, perform the requested operation, and then set the last expression value to be the temp we just created.

This takes advantage of the PIR compiler's infinite temporary register feature. While Parrot itself has only 32 each of integer, string, floating point, and PMC registers, PIR code can use as many as it needs, and the PIR compiler maps those temporaries onto real registers for you. Temp registers are noted with a leading dollar sign, so while P3 refers to the fourth PMC register, $P3 is the fourth temporary PMC register. The PIR compiler handles calculating temporary lifetimes, spilling excess temps to backing store, and moving them around as needed for expressions. It's a very useful feature, to be sure, and worth using, as it takes care of the single most annoying characteristic of a register-based system.

Translating Complex Expressions to PIR

Since Parrot's opcodes each do only a single thing, such as adding two values and producing a result, complex expressions have to be broken into simple pieces and the results stitched back together. In its simplest form, the expression a = b + c + d turns into:

$P0 = b + c
$P1 = $P0 + d
a = $P1

though Parrot does need you to explicitly create the temporary PMCs in $P0 and $P1, since math operations assign their results into existing PMCs rather than creating new PMCs to hold them. Our compiler could be clever and see that $P1 is unnecessary, as it's only used in the assignment that immediately follows, but that requires extra work analyzing the results of expressions. You could, if you wanted, add an optimizing pass to detect this and emit the more efficient code.
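
Such a pass doesn't have to be fancy. Here's a purely hypothetical peephole pass over the generated lines: it assumes assignments come out in the form var = $Pn (which may not match your compiler's output exactly), and it doesn't bother cleaning up the now-unused new for the temporary it folds away:

sub peephole {
    my @in = @_;
    my @out;

    while (@in) {
        my $line = shift @in;
        # Fold "$Pn = <expression>" immediately followed by "var = $Pn"
        # into a single "var = <expression>" line.
        if (@in and $line =~ /^\s*(\$P\d+) = (.+)$/) {
            my ($temp, $expr) = ($1, $2);
            if ($in[0] =~ /^\s*(\w+) = \Q$temp\E\s*$/) {
                my $var = $1;
                shift @in;    # drop the copy into the named variable
                push @out, " $var = $expr\n";
                next;
            }
        }
        push @out, $line;
    }
    return @out;
}

You'd run the full set of generated lines through it just before writing them out to the destination file.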

The equality testing rule is a bit trickier, since it needs to do some flow control. While Parrot can do all sorts of comparison testing, all the comparisons are test-and-branch operations, unlike the math operations that just produce a value. That means we need to do a bit of bouncing around to figure out what the results should be. In pseudo-code, it looks like:

temp = true
if equal goto templabel
temp = false

templabel:

What we do is create a temporary variable and set its value to true, which for our compiler means 1. Then we test the two values for equality, and if they're equal we bounce to the temporary label that marks the end of the expression. If they aren't equal we don't branch, and instead fall through to the next statement, which sets our temporary variable to false.
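
The other comparisons follow exactly the same pattern; only the branch op changes. If your grammar grew a less-than operator, for instance, a case like this would slot straight into the foreach in handle_generic_val (a sketch, assuming Parrot's lt test-and-branch op, which works just like eq above):

            /</ && do {
                my $label = "ltcheck$tempcount";
                $tempcount++;
                push @tokens, "new \$P$dest, .Integer\n";
                push @tokens, "\$P$dest = 1\n";
                push @tokens, "lt $val, $other_val, $label\n";
                push @tokens, "\$P$dest = 0\n";
                push @tokens, "$label:\n";
                set_last_expr_val("\$P$dest");
                last;
            };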

Emitting Postamble Code

Finishing off our compiler is just a matter of filling in the code for the rest of the nodes and emitting a bit of postamble code. (The example compiler provides that code, so feel free to use it.) For us, the postamble is:

end
.end

The end op tells Parrot to exit, and the .end directive marks the end of the subroutine. When writing Parrot code you must exit explicitly; you can't just fall off the end of the world, or Parrot will generally crash. (Parrot assumes that compilers emit correct code, which includes exiting properly.)

Running the Program

Running the provided compiler for our example language is simple enough: it takes the names of the source and destination files as command-line parameters. Assuming the source we're compiling is in a file named example.simple, the command would be:

perl compiler.pl example.simple example.imc

This produces a file named example.imc. The .imc extension is standard for Parrot PIR code files, though you could use a .pir extension instead if you prefer.

Once you have the PIR code from your source, run it to see what happens. Doing this is straightforward:

parrot example.imc

Parrot will recognize that the input is PIR code by its extension and will compile and run your program. If you want Parrot to generate bytecode rather than execute your program, use the -o switch:

parrot -o example.pbc example.imc

The generated bytecode file should have a .pbc extension. To execute the bytecode, just hand it to parrot as you did the PIR code:

parrot example.pbc

Compiling to bytecode isn't necessary for Parrot to run your program, but Parrot will start your program a bit faster if you have the bytecode rather than the source.

Conclusion

That's it for our simple compiler. The example language lacks some features of many useful languages, such as control flow, but the compiler we built is fully functional, and you should now have the tools you need to add in control flow capabilities if you want them.

Dan Sugalski is the chief architect for Parrot, the interpreter engine for Perl 6. He's been a Perl 5 core developer for years, writing more than a dozen modules in the process.

