Building a Parrot Compiler
Pages: 1, 2, 3
With the driver code and the node processing code, it's just a matter of
writing the code for the individual nodes. This code can be complex, so it's
not necessarily as simple as it might seem, but its not that tough. It is
important to remember, when writing a compiler, that you need to evaluate
things from the inside out. When evaluating the node for print,
for example, you have to evaluate the expression to be printed before doing the
actual print. This implies that all nodes that get evaluated return a value
that can be used, as well as the code to be executed to produce this value.
Since we're already having the nodes return the code needed to produce that
value, we need an alternate way to get the value produced by a node.
The following code in the compiler does this for us:
sub last_expr_val {
return $::last_expr;
}
sub set_last_expr_val {
$::last_expr = $_[0];
}
We use a global variable, which is a perfectly fine thing to do in this case, but we hide it behind a pair of accessing functions. This is less a stylistic convention (as you'll see, we'll introduce more globals later — compilers seem to accumulate them) as one of convenience. Using accessor functions means we can easily wedge more code in later if necessary.
There are three types of nodes that we'll need to deal with in our compiler.
The first type is the terminal node. These are nodes that resolve to a single
concrete thing. The float, field, and
stringconstant rules produce terminal nodes, as what they generate
is a single concrete thing. (In this case a floating point constant, field
name, or string constant) The handlers for nodes like these are generally
simple. For example, the handler for a floating point constant is just:
sub handle_float {
my ($nodetype, $floatval) = @_;
set_last_expr_val($floatval);
return;
}
These nodes don't even produce any executable code, since theres no code that actually needs to be executed for them. It's sometimes useful to put validation code in a terminal node, if there's something you want to check at code generation time. For example, the field node handler may look like:
sub handle_field {
my ($nodetype, $fieldname) = @_;
if (!exists $global_vars{$fieldname}) {
die "undeclared field $fieldname used";
}
set_last_expr_val($fieldname);
return;
}
if you want the compiler to emit a fatal error if a variable has been accessed before it has been declared. The field access code could also fetch the fields out of the global namespace if you were using it. (Our compiler uses the named local variable feature of Parrot's PIR compiler, so we don't do this.)
The next type of node you'll deal with is the non-terminal node. These are
nodes generated by rules that have multiple subrules, pieces of text, or
regular expressions that match. The declare rule is a good example
of this, as it has both the constant text declare that matches as
well as the field rule to note the name of the field being declared. The
declare handler looks like:
sub handle_declare {
my ($nodetype, undef, $var) = @_;
my @lines;
my $varname = $var->[1];
# Does it exist?
if (defined $global_vars{$varname}) {
die "Multiple declaration of $varname";
}
$global_vars{$varname}++;
push @lines, " .local pmc $varname\n";
push @lines, " new $varname, .PerlInt\n";
return @lines;
}
This handler illustrates an important feature of compilers — they cheat, sometimes extraordinarily. In this case all we're doing is peeking at into the insides of what the field name subrule produced, rather than just delegating to the field handler. Why? Well, if we delegated to the field handler it would complain that the field hasn't yet been declared. While that's true, it's not a useful error as we're in the field declaration rule. So we cheat and peek ahead a bit.
This particular cheat isn't strictly necessary, so if you object to it you can get rid of it. There are several ways to do that, whether making a separate pass over the tree to note the declarations before making the compilation pass, tweaking the parser to make entries into the global variable list as it parses, using a different rule for field names in the declaration rule, or adding in a context marker that the field handler can note. Consider them exercises for the reader if you're so inclined.
The third type of node you'll encounter is a node that has no purpose other
than to delegate to another node. The constant and
simplevalue rules are examples of this. While each rule has
several alternative subrules, each may match exactly one subrule. That means
that the node that's generated for the rule will consist of just a single
subrule node, in which case we can just punt the processing to the subrule
handler and be done with it. This is done with the delegate handler that's
attached to these rules. It looks like:
sub delegate {
my ($nodetype, $nodeval) = @_;
return process_node(@$nodeval);
}
All it does is call the subnode handler and return whatever it returns.
Now, you may wonder why we even bother with nodes like this. The simple reason is partly sheer laziness, as we could make a pass over the tree first and splice out nodes like this. Making that pass over the tree does take some time, though, so it makes the compiler a bit faster if we don't bother. Leaving the tree alone also leaves us open to some flexibility in the future. We may change the grammar a bit and make these nodes not just passthrough nodes.
Most of the terminal and nonterminal nodes are simple enough to process, but the one place you'll probably run into the most trouble is in processing expressions. The grammar does proper precedence-based and parenthesized grouping of expression terms, but processing the expression can still be somewhat tricky. You've probably noticed that the five different expression term rules are essentially identical. That's because they are. The only reason they're split into separate rule is that's what needs to be done with top-down parsers to properly handle precedence. (As we've said, bottom-up parsers, like ones that yacc generates, don't have this issue)
Complex Handlers
All our expression rules can be handled by a single handler, and they are. The handler is a bit complex, so we'll look at only a part of it:
sub handle_generic_val {
my (undef, $terms) = @_;
my (@terms) = @$terms;
my $lhs = shift @terms;
my @tokens;
push @tokens, process_node(@$lhs);
my ($op, $rhs);
while (@terms) {
$op = shift @terms;
$rhs = shift @terms;
my $val = last_expr_val();
my $oper = $op->[1];
push @tokens, process_node(@$rhs);
my $other_val = last_expr_val();
my $dest = $temps{P}++;
foreach ($oper) {
# Simple stuff -- addition, subtraction, multiplication,
# division, and modulus. Just a quick imcc transform
/(\+|\-|\*|\/|%)/ && do {
push @tokens, "new \$P$dest, .PerlInt\n";
push @tokens, "\$P$dest = $val $oper $other_val\n";
set_last_expr_val("\$P$dest");
last;
};
/=/ && do {
my $label = "eqcheck$tempcount";
$tempcount++;
push @tokens, "new \$P$dest, .Integer\n";
push @tokens, "\$P$dest = 1\n";
push @tokens, "eq $val, $other_val, $label\n";
push @tokens, "\$P$dest = 0\n";
push @tokens, "$label:\n";
set_last_expr_val("\$P$dest");
last;
};
}
}
return @tokens;
}
This code only handles basic math and equality testing, but that's sufficient for illustration. The basic math operations each first create a new temporary variable, perform the requested operation, and then set the last expression value to be the temp we just created.
This takes advantage of the PIR compiler's infinite temporary register
feature. While Parrot only has 32 integer, string, floating point, and PMC
registers, PIR code can use as many as it needs, with the PIR compiler mapping
these temp registers to real registers as it needs to. Temp registers are noted
with a leading dollar sign, so while P3 refers to the fourth PMC
register, $P3 is the fourth temporary PMC register. The PIR
compiler handles calculating temporary lifetimes, spilling excess temps to
backing store, and moving them around as needed for expressions. It's a very
useful feature, to be sure, and worth using as it takes care of the single most
annoying characteristic of a register-based system.
Translating Complex Expressions to PIR
Since Parrot's opcodes only do a single thing, such as add two things and
return a result, complex expressions have to be broken into simple pieces and
have the result stitched together. In its simplest form the expression a
= b + c + d turns into:
$P0 = b + c
$P1 = $P0 + d
a = $P1
though Parrot does need you to explicitly create the temporary PMCs that are
in $P0 and $P1, since math operations assign their
result into existing PMCs, rather than creating new PMCs to assign the result
into. Our compiler could be clever and see that $P1 is
unnecessary, as it only gets used in the assignment that immediately follows,
but doing so requires extra work analyzing the results of expressions. You
could, if you wanted, add an optimizing pass to detect this and emit the more
efficient code.
The equality testing rule is a bit trickier, since it needs to do some flow control. While Parrot can do all sorts of comparison testing, all the comparisons are test-and-branch operations, unlike the math operations that just produce a value. That means we need to do a bit of bouncing around to figure out what the results should be. In pseudo-code, it looks like:
temp = true
if equal goto templabel
temp = false
templabel:
What we do is create a temporary variable and set its value to true, which for our compiler means 1. Then we test the result of the equality expression, and if it is true, then we bounce to the temporary label that marks the end of the expression. If our expression is false we don't branch, but instead fall through to the next statement, which sets our temporary variable to false.
Emitting Postamble Code
Finishing off our compiler is just a matter of filling in the code for the rest of the nodes, and emitting a bit of postamble code. (The example compiler provides that code, so feel free to use it) For us, the postamble is:
end
.end
the end tells parrot to exit, and the .end marks
the end of the subroutine. When writing parrot code you must
explicitly exit you can't just fall off the end of the world, as Parrot will
generally crash. (Parrot assumes that compilers emit correct code, which
includes properly exiting.)
Running the Program
The provided compiler for our example language is simple enough — it
takes the names of the source and destination files as parameters on the
command-line. Assuming the source we're compiling is in a file named
example.simple, the command would be:
perl compiler.pl example.simple example.imc
This produces a file named example.imc. The .imc
extension is standard for parrot PIR code files, though you could use a
.pir extension instead if you'd prefer.
Once you have the PIR code from your source, run it to see what happens. Doing this is straightforward:
parrot example.imc
Parrot will recognize that the input is PIR code by the extension and will
compile and run your program. If you want parrot to generate
bytecode rather than execute your program, use the -o switch:
parrot -o example.pbc example.imc
The generated bytecode file should have a .pbc extension. To
execute the bytecode, just hand it to parrot as you did the PIR
code:
parrot example.pbc
Compiling to bytecode isn't necessary for Parrot to run your program, but Parrot will start your program a bit faster if you have the bytecode rather than the source.
Conclusion
That's it for our simple compiler. The example language lacks some features of many useful languages, such as control flow, but the compiler we built is fully functional, and you should now have the tools you need to add in control flow capabilities if you want them.
Dan Sugalski is the chief architect for Parrot, the interpreter engine for Perl 6. He's been a Perl 5 core developer for years, writing more than a dozen modules in the process.
Return to ONLamp.com.
-
DecisionPlus
2005-01-18 13:30:00 osprey [View]
-
try to fix runtime
2004-07-01 00:21:50 SlavaKudinov [View]
- Trackback from http://bundeskanzlerkandidatin.de/weblog//item/265
Parrot Compiler Entwicklung
2004-05-23 07:14:35 [View]
-
Small correction to sample program
2004-04-20 07:51:51 jsacco [View]
-
DecisionPlus
2004-04-19 09:26:22 Dan Sugalski |
[View]
- Trackback from http://www.classy.dk/log/archive/000873.html
Parrot as a production platform!
2004-04-18 07:54:18 [View]
-
DecisionPlus?
2004-04-18 04:14:00 Offer Kaye [View]
- Trackback from http://weblogs.asp.net/astopford/archive/0001/01/01/114518.aspx
Items of interest pt9
2004-04-16 05:39:18 [View]
- Trackback from http://www.sidhe.org/~dan/blog/archives/000325.html
Mmmm, article goodness
2004-04-15 19:58:25 [View]