Unit tests for parsers

I’m still tinkering with my ANTLR-based Delphi parser, but I’ve also started playing around with Racc. Today, I found out that there are going to be some fundamental differences in how I write unit tests for Racc.

ANTLR generates a recursive descent parser, which means that it can’t handle any sort of ambiguity in the grammar. (Is “at” a keyword or an identifier? “It depends” is not generally an answer ANTLR can accept.) But there’s an upside, too: the code it generates is very easy to understand (and debug, for that matter). If I have a parser rule “expression ::= atom operator atom”, then ANTLR generates a function called expression(), which consists of calls to atom(), operator(), and atom(). There are some other decorations along the way, but it’s basically that simple.
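
To make that concrete, here’s roughly the shape the generated code takes. This is a hand-written Ruby sketch for illustration only; the class, method, and token names are made up, and real ANTLR output is in the target language, with error recovery and a lot more decoration:

```ruby
# Hand-written sketch of a recursive descent parser for the rule
# "expression ::= atom operator atom" -- illustrative only, not actual
# ANTLR output (all names here are made up).
class ExpressionParser
  def initialize(tokens)
    @tokens = tokens   # e.g. [[:number, 1], [:operator, '+'], [:number, 2]]
  end

  # expression ::= atom operator atom
  def expression
    left  = atom
    op    = operator
    right = atom
    [op, left, right]
  end

  def atom
    expect(:number)
  end

  def operator
    expect(:operator)
  end

  def finished?
    @tokens.empty?
  end

  private

  def expect(type)
    token = @tokens.shift
    raise "expected #{type}, got #{token.inspect}" unless token && token[0] == type
    token[1]
  end
end
```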

This makes it really easy to write unit tests for the expression rule in isolation: I just stuff some tokens into the lexer, call parser.expression(), and then make sure all the lexer’s input has been consumed. It’s not really a unit test of expression(), since I’m also testing everything expression() calls, but it’s been working reasonably well so far.
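
As a sketch, using the hypothetical ExpressionParser above (not my actual test code), that style of test looks something like this:

```ruby
require 'test/unit'

class ExpressionRuleTest < Test::Unit::TestCase
  def test_parses_a_binary_expression
    parser = ExpressionParser.new([[:number, 1], [:operator, '+'], [:number, 2]])

    tree = parser.expression

    assert_equal ['+', 1, 2], tree
    # Not a pure unit test (atom() and operator() run too), but checking
    # that all of the input was consumed catches a lot of mistakes.
    assert parser.finished?, 'expression() should consume all of the input'
  end
end
```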

Racc, on the other hand, makes an LALR parser. It should, like yacc, be able to handle all sorts of ambiguity, as long as there’s some way to eventually resolve it. It generates a table-driven parser, basically a big state machine (so I hope I never have to debug it).

And it doesn’t have entry points for individual parser rules. This presents a problem for unit testing.
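
For comparison, a Racc grammar file looks something like this toy sketch (illustrative only, not my Delphi grammar). Racc compiles it into a parse table plus a list of reduce actions; the entry point, do_parse, always starts from the grammar’s start rule, and there’s nothing like a per-rule expression() to call:

```
# toy.y -- a toy Racc grammar, for illustration only.
# Compile with: racc toy.y -o toy_parser.rb
class ToyParser
rule
  expression : atom operator atom { result = [val[1], val[0], val[2]] }

  atom       : NUMBER

  operator   : PLUS  { result = '+' }
             | MINUS { result = '-' }
end

---- inner
  def parse(tokens)
    @tokens = tokens
    do_parse            # the single entry point; no parser.expression() equivalent
  end

  def next_token
    @tokens.shift       # [type, value] pairs; nil signals end of input
  end
```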

I see several options here:

  • Don’t write unit tests for the parser. (Fat chance.)
  • Force the state machine into the right state, then throw some tokens at it, and check its final state. The problem with this is that I’m pretty sure there’s more than one starting state for “expression”, since it will show up in multiple places in the grammar, each with its own exit conditions. (Which state do you move to if you’re in an expression and you hit a semicolon? Depends entirely on which state you were in when you started…)
  • Feed in a complete source file’s worth of syntax for every test. If I’m testing the expression rule, then each test would prepend “unit Foo; initialization implementation begin” and append “end.” to the expression I’m testing. This might be doable, though it would force me to implement the grammar starting from the unit rule and working down (I had been going the other direction with the ANTLR parser), which would necessitate some really good record-keeping to know how close I was to being done.
  • Whip up some magic tokens that will never be generated by the real tokenizer, but can be injected by tests. Then make the top-level rule able to parse either projects (“program Foo; …” or “library Foo; …”), units (“unit Foo; …”), packages (“package Foo; …”), expressions (“<secret token for expression> 1 + 2”), uses clauses (“<secret token for uses clause> uses Foo, Bar;”), etc. This would also be doable, if I could automate the repetitive parts of the syntax; and it would let me build the grammar from the bottom up, which has its advantages. (There’s a sketch of this option after the list.)
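
To make that last option concrete, here’s a sketch of what the grammar side might look like. The rules and token names are toy stand-ins, not the real Delphi grammar:

```
# Sketch of the "magic token" option (toy rules and made-up token names).
# GOAL_EXPRESSION and GOAL_USES_CLAUSE are never produced by the real
# tokenizer, but a test can prepend one to the token stream to pick which
# rule the parse starts in.
class ToyDelphiParser
rule
  goal        : unit_file
              | GOAL_EXPRESSION  expression  { result = val[1] }
              | GOAL_USES_CLAUSE uses_clause { result = val[1] }

  unit_file   : UNIT IDENTIFIER SEMICOLON END DOT

  uses_clause : USES IDENTIFIER SEMICOLON

  expression  : atom
              | atom operator atom { result = [val[1], val[0], val[2]] }

  atom        : NUMBER

  operator    : PLUS { result = '+' }
end

---- inner
  def parse(tokens)
    @tokens = tokens
    do_parse
  end

  def next_token
    @tokens.shift
  end
```

A test for the expression rule would then call parse([[:GOAL_EXPRESSION, nil], [:NUMBER, 1], [:PLUS, '+'], [:NUMBER, 2]]) and expect to get ['+', 1, 2] back, without any of the surrounding unit boilerplate.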

Has anyone out there written an LALR grammar test-first? Does anyone have any suggestions, or see anything I hadn’t thought about?
