Parser Generator Tradeoffs

Discussion moving here from AvoidExceptionsWheneverPossible.

BTW, I find I went from recursive descent years ago, to yacc/lex, to AntlrTranslatorGenerator, back to hand-coded recursive descent. --JimPerry
How come? I mean, I find recursive descent seductive - but I have a very strong feeling it's nowhere near as maintainable/efficient as a good parser generator coupled with a good grammar. OTOH, having done both within the last few years, I have no idea why I feel that. So ... how come? --PeterMerel

There's a benefit to having an explicit grammar, especially if you're implementing something with a standard one, but the fact is that parser generators require grammars twiddled for their own preferences, for instance yacc favors left-recursion for repetitive constructs. Also, they require you to intermix the grammar with the implementation language in a way that is neither quite natural to a BackusNaurForm-reader nor a C++-or-whatever reader. The context in which the target-language fragments will occur in the generated parser is not apparent.

Yacc is notoriously hard to get to produce decent diagnostics, and existing versions tend to have limitations like using fixed symbol names (this can be worked around but is a pain). Also, since it builds trees bottom up, it can be unfriendly to C++ object management.

So I looked into AntlrTranslatorGenerator which is more modern than yacc, and generates readable recursive-descent code. It looked promising, but I found that to me the grammar interspersed with C++ fragments was not really a whole lot more readable or maintainable than the emitted parser, and certainly no more so than an equivalent hand-coded parser. (Also, the ANTLR folks were/are putting all their efforts into a Java version). So I eventually dropped that path and opted for the hand-coded parser route.

This has seemed to pay off; it's much easier to make changes, there are fewer grammar hacks and repetitions and so on. Perhaps it's what you're used to. It's certainly more convenient to have the parser right there in C++ like everything else rather than having to run it through a preprocessor (and in our environment we found we needed to pick one vendor's yacc, do the preprocessing on that particular platform, then ship around the preprocessed code).

Efficient? I don't know. My recollection is that it's the lexical analyzer (token scanner) that dominates in parsing, but in any case I'm a strong believer in LazyOptimization and parser efficiency has even never crossed the radar screen in monitoring performance of our current systems. YMMV. --JimPerry

As a followup on this, the desire for an explicit grammar does come up occasionally, so I recently wrote a utility to read parser code and generate a BackusNaurForm grammar for the recognized language. It produces about as readable a BNF as could be written by hand, more so IMO than most yacc source (even if trimmed of action code). This approach has a certain satisfying quality related to TheSourceCodeIsTheDesign.

The particular program I wrote, written in the IconLanguage and primarily a vehicle for learning Icon, is specifically tailored to C++ and in a few minor ways to the particular context, but I think the general approach is pretty broadly applicable. --JimPerry