Meta-ANTLR

On my first go-round, I got frustrated with how much duplicate code I had to write in my ANTLR grammar just to build real parse trees. ANTLR is great if you want to parse and evaluate an expression, but it kind of sucks if you want to parse and understand source code.

So I wrote a meta-parser and had another go at it. I’ve now got a Ruby script that reads in my metagrammar, does some chugging, and spits out an ANTLR .g syntax file, as well as some autogenerated .cs files. (All automated as part of every build, of course.)

Here’s the cool stuff my metaparser can do.

  • Node classes. Just about every parser rule now has a corresponding class, which is automatically codegenned by the metagrammar script. That class has properties for each of the syntax elements, so for example, when the Parameter class is automatically generated, it is given these properties:
    • public TokenNode Semicolon { get … }
    • public TokenNode Style { get … } // var, const, out, or nothing
    • public List<IdentItem> Names { get … }
    • public TokenNode Colon { get … }
    • public INode Type { get … }
    • public TokenNode Equal { get … }
    • public INode Default { get … }
  • Readability. In the metagrammar file, token names are in angle brackets, such as <Ident>. Parser rule names, like MethodCall, aren’t in brackets. Everything is Pascal-cased. I find this far easier to read than the capitalization-based TOKEN and parserule convention used by ANTLR.
  • ANTLR bug workaround. Since ANTLR sometimes forgets that EOF is a valid end-of-rule token, my metagrammar script automatically generates the appropriate workaround for every rule in the metagrammar. Suddenly I can write unit tests for everything again, without even having to think about it! Hooray!
  • Remove duplication. To get keywords and semikeywords to work, I had to list them all twice: once in the tokens{} section, and again in the keyword and semikeyword parser rules. (Predictably, when I went to remove the duplication, I found that I had made some mistakes: the two lists weren’t exactly the same.) The metagrammar just lists them once, and it automatically generates the necessary duplication to make ANTLR do the right thing.
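To make the "list them once" idea concrete, here is a minimal Ruby sketch of generating both the tokens{} section and the keyword/semikeyword rules from a single list. It is not the actual metagrammar script: the word lists are placeholders, and the token naming scheme (BeginKeyword and friends), the helper names, and the exact shape of the emitted ANTLR text are all assumptions made up for illustration.

```ruby
# Placeholder word lists standing in for the metagrammar's single listing.
# (Assumption: the real lists and their format are not shown in the post.)
KEYWORDS     = %w[begin end].freeze
SEMIKEYWORDS = %w[out].freeze

# Derive a token name from a word, e.g. "begin" -> "BeginKeyword".
# The naming scheme is hypothetical.
def token_name(word, suffix)
  "#{word.capitalize}#{suffix}"
end

# Emit the tokens{} section, with one entry per keyword and semikeyword.
def tokens_section(keywords, semikeywords)
  entries = keywords.map { |w| "  #{token_name(w, 'Keyword')} = '#{w}';" } +
            semikeywords.map { |w| "  #{token_name(w, 'Semikeyword')} = '#{w}';" }
  "tokens {\n#{entries.join("\n")}\n}"
end

# Emit a parser rule whose alternatives come from the same word list,
# so the two places can never drift apart.
def alternatives_rule(rule_name, words, suffix)
  alts = words.map { |w| token_name(w, suffix) }
  "#{rule_name}\n  : #{alts.join("\n  | ")}\n  ;"
end

# Assemble the generated grammar fragment.
def generate_keyword_grammar(keywords = KEYWORDS, semikeywords = SEMIKEYWORDS)
  [
    tokens_section(keywords, semikeywords),
    alternatives_rule('keyword', keywords, 'Keyword'),
    alternatives_rule('semikeyword', semikeywords, 'Semikeyword')
  ].join("\n\n")
end

puts generate_keyword_grammar
```

Because both outputs are derived from the same arrays, adding or removing a word is a one-line change, which is the whole point of moving the duplication into the generator.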

Honestly, though, I think that the class-for-each-node-type thing is going to be the coolest. Especially when I want to start hanging some hand-coded properties (like a List<string> of all the used units, so I don’t always have to walk the Ident nodes) off of some node types. (Partial classes are going to be so sweet for this.)
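Here is a rough Ruby sketch of what emitting one of those node classes might look like, using the Parameter properties listed earlier. This is only an illustration under stated assumptions: the in-memory rule description, the Property struct, and generate_node_class are hypothetical names, and since the post elides the accessor bodies, the `{ get; private set; }` in the output is a placeholder rather than what the real script generates.

```ruby
# Hypothetical description of one property on a generated node class.
Property = Struct.new(:type, :name)

# Hypothetical in-memory form of the Parameter rule. The property types and
# names come from the post; the data structure itself is an assumption.
PARAMETER_RULE = {
  class_name: 'Parameter',
  properties: [
    Property.new('TokenNode', 'Semicolon'),
    Property.new('TokenNode', 'Style'),
    Property.new('List<IdentItem>', 'Names'),
    Property.new('TokenNode', 'Colon'),
    Property.new('INode', 'Type'),
    Property.new('TokenNode', 'Equal'),
    Property.new('INode', 'Default')
  ]
}.freeze

# Emit C# source for a node class. Marking it `partial` lets hand-coded
# members live in a separate file that the generator never overwrites.
def generate_node_class(rule)
  lines = ["public partial class #{rule[:class_name]}", '{']
  rule[:properties].each do |prop|
    # Accessor body is a placeholder; the post does not show the real one.
    lines << "    public #{prop.type} #{prop.name} { get; private set; }"
  end
  lines << '}'
  lines.join("\n")
end

puts generate_node_class(PARAMETER_RULE)
```

The design choice worth noting is the `partial` keyword: a hand-written file can declare the same partial class and add extra properties, and regenerating the .cs file on every build leaves that hand-written code alone.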

The only downside is that I more or less had to start over. I’ve still got parsing for over half the rules, but I threw away my old tree construction, so only 42% of the rules are completely finished. But the parse trees are going to be way cooler now that I’m building them myself and not worrying about ANTLR’s AST.

(And no, I still can’t release the code. Working on that.)
