INDIVIDUAL ENTRY MAIN
Of Bison, Spirits and Antlrs

I've recently had cause to dip into the wonderful world of full on text parsing again after a couple of years away. Anyone who's tried to do any serious text parsing over the years will probably have either written their own recusive decent parser (probably at great cost to their sanity) or encountered the wonder of LEX and YACC (arguably at an even greater cost to their sanity). For those of you who don't know LEX and YACC are a pair of tools that generate a C parser and lexer for you from some definition files. LEX is based heavily on regular expressions that it uses to match various tokens, whereas YACC is based on the parsing the stream of tokens extracted by LEX. The nice guys at the the Free Software Foundation have been maintaining their own versions of these tools for a good number of years in the form of FLEX and BISON (wonderous pun names there).

Now, seeing as I was looking to write a moderately complicated little language for myself, an automated solution like LEX and YACC seemed like a far saner choice than writing my own parser. There was only one big problem, even the GNU version of these tools still only has very minimal support for C++, and are still rather stuck in the bad old days of C. With a C++ solution firmly in mind, I went off in search of some more up-to-date alternatives.

Anyway, after a little bit of a search I stumbled upon SPIRIT, which is now part of the BOOST library. Spirit takes a really interesting approach to the whole parser generation problem, which is to do away with any sort of external def files to define your parser, and it instead uses templates to pervert C++ in such a way as to allow you to write your BNF parser definition along with your related actions straight in the same .cpp file. It all gets magically transformed into a working parser by the compiler when it evaluates the templates. So, for example using SPIRIT, the grammar for a simple calculator might look something like:

    struct calculator : public grammar
    {
        template 
        struct definition
        {
            definition(calculator const& self)
            {
                expression
                    =   term
                        >> *(   ('+' >> term)[&do_add]
                            |   ('-' >> term)[&do_subt]
                            )
                    ;

term =
factor
>> *( ('*' >> factor)[&do_mult]
| ('/' >> factor)[&do_div]
)
;

factor
= lexeme_d[(+digit_p)[&do_int]]
| '(' >> expression >> ')'
| ('-' >> factor)[&do_neg]
| ('+' >> factor)
;
}

rule expression, term, factor;

rule const&
start() const { return expression; }
};
};

Yes, that is real C++, honest! The cool thing about SPIRIT is that being template based, you can get very heavily inlined and optimised versions of your parser. It's also very modular and allows you to define little grammar classes, that can parse a particular sub-grammar (nice for re-use). However, because its template based, you also sacrifice a few things; decent and sensible error reporting is one of those things, you put yourself rather at the mercy of the compiler you're using to determine what you've done wrong when you have an error in your grammar definition. Also, again because its template based, you can end up with the compiler generating a HUGE amount of code, along with some very very very long symbol names that all combine to give you rather terrible compile and link times. This was unfortunately what caused me to give up with SPIRIT, I think at one point I was generating 30mb debug object files for my parser file, and getting to make serveral cups of tea whilst waiting for the whole thing to link; not really very practical. Having to give up on SPIRIT was a bit disheartening as I'd spent a whole week learning its intricacies, and written a decent amount of code with it. So I was a bit sad to have to throw it all away, but it really wasn't looking like it was actually practical in real everyday use (like so many other wonderful template based insanity libraries).

So I ended up searching around again, and came across a real gem in the form of ANTLR. ANTLR is an external tool very much in the spirit of LEX and YACC, except that it is actually quite modern and fully supports C++ and Java. Its actually written in Java, and was pretty easy to install ,once I'd figured out that J2SE was what I needed to install on my computer to run it (thanks for the obvious naming and non (sic.) confusing web site there, Sun). ANTLR actually generates a LL parser ( SPIRIT too for that matter) rather than a LALR parser like YACC and BISON, so you have to be careful about left recursion in your grammar. Unlike LEX and YACC, ANTLR does everything all in one file, so you can define both your lexer and your parser in one file. It even has support for "Abstract Syntax Tree" generation and parsing, so you can define a tree parser to act on the ASTs that your parser has generated instead of having to deal with this kind of thing youself like in LEX and YACC. A simple ANTLR grammar to implement a simple calculator using ASTs looks like this:


options {
language="Cpp";
}

class CalcParser extends Parser;
options {
genHashLines = true; // include line number information
buildAST = true; // uses CommonAST by default
}

expr
: mexpr (PLUS^ mexpr)* SEMI!
;

mexpr
: atom (STAR^ atom)*
;

atom: INT
;

class CalcLexer extends Lexer;

WS_ : (' '
| ' '
| ' '
| ' ')
{ _ttype = ANTLR_USE_NAMESPACE(antlr)Token::SKIP; }
;

LPAREN: '('
;

RPAREN: ')'
;

STAR: '*'
;

PLUS: '+'
;

SEMI: ';'
;

protected
DIGIT
: '0'..'9'
;

INT : (DIGIT)+
;

class CalcTreeWalker extends TreeParser;

expr returns [float r]
{
float a,b;
r=0;
}
: #(PLUS a=expr b=expr) {r = a+b;}
| #(STAR a=expr b=expr) {r = a*b;}
| i:INT {r = atof(i->getText().c_str());}
;

As you can see the syntax is quite simple and clean, and best of all it generates 3 nice C++ classes for you, instead of lots of horrible C like LEX and YACC. It didn't take me long to get to grips with either, I had my grammar from SPIRIT converted and working in a day or so. Its not quite as flexible as SPIRIT in terms of grammar reuse, but is actually pretty good all in all. I certainly wont be going back to LEX and YACC after discovering ANTLR, finally, a nice clean simple parser generator that produces C++. Yokatta!

Posted by kallisti at 05:05 PM on Thu, 11 September 2003