Parse Using Grammars

Context:

Problem: Forces: Solution: Related Patterns:


Discussion:

Yes, you should always build a grammar before you write a parser. But you don't always have to use a parser generator. If your grammar is simple then it is easy to write a recursive descent parser. To me, it depends on how complicated the grammar is. And how easy to use your parser generator is. And how well you know how to write recursive-descent parsers. -- RalphJohnson

Personally, in the tradition of YouArentGonnaNeedIt I prefer to implement that initial version of the Parser on regular expressions. I am unaware of a modern language that doesn't support them, and your grammar can become surprisingly complex before maintaining them becomes a problem. So for your simple name:value/list/etc file the code is very simple and trivially extensible. Of course when this ceases to be the case, ReFactor and use ParseUsingGrammars.

Of course if the alternative is to hand write your parser by hand? Well I would disagree with RalphJohnson. File formats do change, and I have never seen a hand written parser that isn't ridiculously brittle in the face of change compared to a grammar. -- AndraeMuys

I always start off by writing a parser by hand. And the result is always ugly and hard to maintain, so I always end up refactoring by reimplementing it with lex/yacc or a regular expression library. When will I ever learn? -- KrisJohnson


Use XML. It comes with a free grammar... and lots of parsers.


Another option is to use a language such as LuaLanguage, or any language that can read a representation of data structures at run time and reconstruct them in memory (PerlLanguage and PythonLanguage also qualify, as does LispLanguage). The added advantage of using a language in this role is you can not only incorporate a description of the data, but processing on that data as well.

Let's say you wanted to create a graphics editor and store your graphics as a list of data structures - a circle here, a line there - with attributes on these. No problem, but you can also incorporate code. Instead of filling an object with the color red, you might provide a function that does a procedural fill. Or you might have a shape best described not as drawing primitives, but as a function.

-- JohnPassaniti


I have to disagree with this. Using language specific formats makes it really hard to manipulate the data in another language, worse they can become a living nightmare to support as the language matures and various bugs/features change or disappear. This is especially the case if you are embedding executable code in your data formats.

Worse still, this can quickly lead to serious security problems as any Outlook user will attest to. Really this comes down to the PowerOfPlainText. -- AndraeMuys

Eh... you use PowerOfPlainText and you'll get all the disadvantages of a language specific format. Indeed, I can't think of something much more language-specific than a specific-language - one designed for the particular application. If you require the ability to describe processing, you'll need to obtain it whether you're using a DSL or a generic extension language like LUA.


Save yourself the heartache and megabytes of code involved with a parser generator, and use ReversePolishNotation. You don't need to embed all of ForthLanguage -- just order the data and some simple constructor words. For example, given the following bit of XML:

 <book>
  <title>Come Home</title>
  <author>Margeret McAuther?</author>
  <isbn>0-123-45678-9</isbn>
  <price>15.99</price>
 </book>

you can express this quite effortlessly in RPN:

 "Come Home" title
 "Margeret McAuther?" author
 0 123 45678 9 isbn
 "15.99" price
 book

The lexer for this need only identify strings, numbers, and pre-existing constructors. The constructors build stuff in memory. The result of executing such a "program" is a data structure in memory, suitable for more convenient processing.

I used this technique while working for Hifn, Inc. in a project designed to automate the testing of their HIPP series of chips. It worked REALLY well, despite everyone's abject horror of having to learn a little RPN. I was able to deliver the product in days, not weeks or months, it was fully unit tested (how do you UnitTest the output of yacc?), and I've received zero bug reports on it for the remainder of my time at Hifn. -- SamuelFalvo?


See LexAndYacc, ProcessingMarkupLanguages, AlternativesToRegularExpressions


EditText of this page (last edited July 19, 2009) or FindPage with title or text search