oreilly.comSafari Books Online.Conferences.


Perl 6 Essentials

Building a Parrot Compiler

by Dan Sugalski, coauthor of Perl 6 Essentials

The Background

The company I work (Yarde) for has a large system written in an '80s vintage 4GL called DecisionPlus. This is one of those languages with integrated database, screen handling, and report generation facilities. The current code base is in excess of 460K lines of (often very nasty) code across nearly 1,300 files. Yarde has a full set of source for the system, with new code and bug fixes for the system rolling out into production on a daily basis.

The problem Yarde faces, like many other companies with custom or semi-custom software, is that limits in the system are starting to affect day-to-day operations. In our case, we're running into problems with the integrated database library, which limits some of our critical data files to around 2 million records. That might sound like a lot, but we're putting nearly 5,000 records a day into those files, and that daily rate is constantly increasing. On the one hand, that's good for the company as it means our business is growing, but on the other hand, we can only keep about 400 business days of data online at once. This hasn't affected the business yet, but sooner or later it will.

This is the equivalent of having the full C source to an application, but not having the source to the C compiler or runtime library, and finding that there's a large and unpleasant bug in one of the essential support routines. There was no flaw in the application—for which we had the source—that was causing the problem; rather, the problem was in the runtime, for which we didn't have source.

The common solution to a problem this big, especially when the original system is written in a language that's fallen out of favor, is to rewrite the system in a new language, or to migrate to another software package. These choices are both risky and expensive, and the experiences of other companies show that projects like this fail more often than succeed. We have more than a decade of customized business logic tied up in the current system, which we'd have to tease out and recreate; hundreds of employees we'd need to retrain; and lots of custom reports to recreate. Given the amount of customization needed, it was doubtful that even if we used another off-the-shelf product, we could finish the project for a reasonable cost (or even an unreasonable cost) before the database issues stopped being annoyances and started being problems.

Related Reading

Perl 6 Essentials
By Allison Randal, Dan Sugalski, Leopold Tötsch

Neither rewriting the system nor ditching it for someone else's software was an appealing option, especially given that the reason to do so was essentially an internal issue and mostly invisible to the users. We decided instead to do the sensible thing—we rewrote DecisionPlus. Or, rather, we wrote a compiler that understood DecisionPlus and, rather than targeting the DecisionPlus virtual machine and runtime that was causing us problems, we targeted a newer and less restrictive system.

That may sound mad, but it really was the simplest thing to do, by far. DecisionPlus the language is, well, let's call it charmingly simplistic. The screen handling code was all terminal based and very straightforward to map to ncurses with its form handling library. The database system required a bit of thought, as the original was a fielded ISAM database; we originally looked at the Berkeley DB, but realized we could use an SQL database with some database trickery and runtime library shims, and settled on PostgreSQL. (We needed writable views and triggers, so MySQL wasn't an option.)

The final project design ultimately has four pieces: a compiler, a database runtime library, a screen runtime library, and some general utility code.

We started writing the compiler in Perl 5, to emit Perl 5 code. Writing the compiler in Perl 5 meant we could use a Parse::RecDescent grammar, which takes care of the tokenizing and some of the parsing. This was a huge win. With Parse::RecDescent we got a preliminary compiler written in less than two months. Unfortunately, because of this it became obvious that Perl 5 wasn't a good fit as a target language. DecisionPlus, because of its age, didn't have the concept of subroutines or blocks. Instead, it made do with line labels, a goto, and a gosub statement. It also has typed variables and strings with predeclared (and enforced) lengths.

Perl could do this. The compiler turned labels into subroutines, gotos into goto &subroutines, and gosubs into sub calls. Typed data was implemented with scalars tied into packages that enforced the typed behavior. The result was messy; even before we wrote any runtime library or support code it was pretty clear that the resulting code would be unwieldy and potentially slow.

Writing a compiler still was the right answer. The speed with which I built it made that clear. Perl 5 was just the wrong back end to target. We just needed a different back end. Conveniently, we had Parrot handy, and the results have been entirely satisfactory.

Our compiler started out as a set of prototype single-function programs. There's a simple C-style preprocessor, a parser, and the actual compiler. Splitting out the functions into separate programs made development and testing easier, and helped keep the different functional parts of the compiler from bleeding into each other. This wasn't really necessary. At some point, we may merge the different pieces into a single program.

The first step in the compilation sequence is preprocessing the source. DecisionPlus has no macro or include functionality, but at some point in the past someone figured out that you could use C's preprocessor if you didn't mind ignoring all the warnings and complaints. Since only a small subset of the preprocessor functionality is needed — constant substitution and includes— we wrote our own C-style preprocessor in Perl so we could cut down on the complaints.

Defining the Grammar

After preprocessing, the resulting source is fed into the parser. The parser is again written in Perl, using Damian Conway's Parse::RecDescent module to do the parsing. We'd considered using lex and yacc or some of the other parser generators that are available, but using a perl-based parser made our prototyping much easier and faster. Recursive descent parsers are notoriously slow, as unoptimized grammars trigger a lot of backtracking. There are some tricks you can use to speed up the parsing, and we'll get to those in a bit.

O'Reilly Open Source Convention.

Parse::RecDescent grammars are pretty standard as grammars go, and the examples you'll find in most any compiler or parsing book will work just fine. (Though you may have to do a bit of translation for LALR grammars, like the ones fed to yacc) Each rule in a grammar has a set of one or more strings, regular expressions, or sub-rules that must be matched in the input stream for the rule to successfully match.

Parse::RecDescent has many handy features you can, for example, have pieces of code that fire off whenever a rule matches, allowing your parser to alter Parse::RecDescent's default tree. By default the tree that's built is based on hashes. I found that this wasn't as easy to process as a more traditional tree, but P::RDs documentation notes that if the $::RD_AUTOACTION variable is set to q{[@items]} you get back an array of arrays instead. I found this easier to process in the compiler, though that is a matter of taste.

Since having an example is useful, we'll use the grammar in Example 1 throughout the article. This is a grammar specifically for a recursive descent parser (and will work fine with Parse::RecDescent). If you've never written a grammar before, or are only familiar with LALR grammars such as the ones that yacc uses, it may seem a little odd.

Example 1. the example mathematical grammar

field: /\b\w+\b/

stringconstant: /'[^']*'/ | /"[^"]*"/

float: /[+-]?(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?/

constant: float | stringconstant

addop: '+' | '-'

mulop: '*' | '/'

modop: '%'

cmpop: '<>' | '>='| '<=' | '<' | '>' | '='

logop: 'and' | 'or'

parenexpr: '(' expr ')'

simplevalue: parenexpr | constant | field

modval: <leftop: simplevalue modop simplevalue>

mulval: <leftop: modval mulop modval>

addval: <leftop: mulval addop mulval>

cmpval: <leftop: addval cmpop addval>

logval: <leftop: cmpval logop cmpval>

expr: logval

declare: declare field

assign: field = expr

print: print expr

statement: assign | print | declare

In this case the grammar is for a mathematical or logical expression, with standard precedence. (That is, modulus comes first, then multiplication and division, then addition and subtraction, then comparisons, then logical operations.) We also have a variable declaration rule, a value assignment rule, and a print rule. There's a separate rule for each level of precedenceadding in more levels is just a matter of adding in more rules. We've also added a variable declaration rule, an assignment rule, and a printing rule so we can hold temporary values and see what we've produced.

If you've used yacc before, you might wonder why we go to all this trouble. This is because an LL, or recursive descent, grammar can't be left-recursive. The leftmost term of a rule can't be the rule itself, as this would lead to infinite recursion when evaluating the grammar. LR and LALR grammars, such as those that yacc uses, don't have this limitation, making precedence rules much simpler.

When this grammar is run over an expression, it will build up a tree that represents the expression. We can feed that tree to our compiler to produce code. In our case we're going to produce Parrot code, but you could easily use this to generate C, Perl, or Python code, or even native machine code if you were so inclined.

Pages: 1, 2, 3

Next Pagearrow

Sponsored by: