(Someone else’s) open-source Delphi parser

Alert reader David Champion told me that there’s another open-source Delphi parser out there. It’s called castaliadelphiparser, and is, as I understand it, the actual parser that’s used in the Castalia add-in for the Delphi IDE. (We’ve got the Borland Edition of their add-in installed on a few of our computers at work. It’s pretty nice, and I need to get around to sending them some feature requests that would make the full version worth buying.)

Anyway, if anyone plays around with their parser, I’d love to hear how well it works, so please post comments here and let me know. If it can parse the entire Delphi language, including conditional compilation, typed constants, records with methods, nested types, class variables, strict private/protected, etc., then it’s way ahead of my parser. (If it can’t do all that yet, then you can take the italics off the “way”.)

If anyone’s wondering about the status of my parser… I’ve been making progress, but I probably won’t make any more progress between now and the end of November, since I’ll be spending the whole month of November writing a novel. And I can’t open-source what I’ve got because I’m still waiting on corporate red tape. (My employer owns my brain unless otherwise specified, but I’m tantalizingly close to getting them to otherwise specify.)

Unit tests for parsers

I’m still tinkering with my ANTLR-based Delphi parser, but I’ve also started playing around with Racc. Today, I found out that there are going to be some fundamental differences in how I write unit tests for Racc.

ANTLR generates a recursive descent parser, which means that it can’t handle any sort of ambiguity in the grammar. (Is “at” a keyword or an identifier? “It depends” is not generally an answer ANTLR can accept.) But there’s an upside, too: the code it generates is very easy to understand (and debug, for that matter). If I have a parser rule “expression ::= atom operator atom”, then ANTLR generates a function called expression(), which consists of calls to atom(), operator(), and atom(). There are some other decorations along the way, but it’s basically that simple.
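
Concretely, the generated method has roughly this shape (a heavily simplified sketch, not ANTLR’s literal output, which also weaves in lookahead checks, AST building, and error handling; “operator” is a C# keyword, so that method is renamed here):

// Heavily simplified sketch of the code ANTLR generates for
// "expression ::= atom operator atom".
public partial class SketchParser
{
    public void expression()
    {
        atom();          // parse the left-hand atom
        operatorRule();  // parse the operator
        atom();          // parse the right-hand atom
    }

    private void atom() { /* match an atom */ }
    private void operatorRule() { /* match an operator token */ }
}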

This makes it really easy to write unit tests for the expression rule in isolation: I just stuff some tokens into the lexer, call parser.expression(), and then make sure all the lexer’s input has been consumed. It’s not really a unit test of expression(), since I’m also testing everything expression() calls, but it’s been working reasonably well so far.
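
In NUnit terms, such a test looks something like this (a sketch; the lexer and parser class names, and the end-of-input check via LT(1), are assumptions about the generated ANTLR C# classes):

using System.IO;
using NUnit.Framework;

[TestFixture]
public class ExpressionTests
{
    [Test]
    public void ParsesBinaryExpression()
    {
        // Stuff some tokens into the lexer...
        DelphiLexer lexer = new DelphiLexer(new StringReader("1 + 2"));
        DelphiParser parser = new DelphiParser(lexer);
        // ...call the rule's entry point...
        parser.expression();
        // ...and make sure all the input has been consumed.
        Assert.AreEqual(antlr.Token.EOF_TYPE, parser.LT(1).Type,
            "expression should consume the whole input");
    }
}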

Racc, on the other hand, makes an LALR parser. It should, like yacc, be able to handle all sorts of ambiguity, as long as there’s some way to eventually resolve it. It generates a table-driven parser, basically a big state machine (so I hope I never have to debug it).

And it doesn’t have entry points for individual parser rules. This presents a problem for unit testing.

I see several options here:

  • Don’t write unit tests for the parser. (Fat chance.)
  • Force the state machine into the right state, then throw some tokens at it, and check its final state. The problem with this is, I’m pretty sure there’s more than one starting state for “expression”, since it will show up multiple places in the grammar, each with its own exit conditions. (Which state do you move to if you’re in an expression and you hit a semicolon? Depends entirely on which state you were in when you started…)
  • Feed in a complete source file’s worth of syntax for every test. If I’m testing the expression rule, then each test would prepend “unit Foo; interface implementation initialization begin” and append “end.” to the expression I’m testing. This might be doable, though it would force me to implement the grammar starting from the unit rule and working down (I had been going the other direction with the ANTLR parser), which would necessitate some really good record-keeping to know how close I was to being done.
  • Whip up some magic tokens that will never be generated by the real tokenizer, but can be injected by tests. Then make the top-level rule able to parse either projects (“program Foo; …” or “library Foo; …”), units (“unit Foo; …”), packages (“package Foo; …”), expressions (“<secret token for expression> 1 + 2”), uses clauses (“<secret token for uses clause> uses Foo, Bar;”), etc. (See the sketch just after this list.) This also would be doable, if I could automate the repetitive parts of the syntax; and it would let me build the grammar from the bottom up, which has its advantages.
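
In grammar terms, that last option would look something like this (a sketch in loose BNF; the *_GOAL token names are made up):

goal:
    project                        // "program Foo; ..." or "library Foo; ..."
  | unit                           // "unit Foo; ..."
  | package                        // "package Foo; ..."
  | EXPRESSION_GOAL expression     // tests inject the secret goal token
  | USES_CLAUSE_GOAL uses_clause;  // likewise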

Has anyone out there written an LALR grammar test-first? Does anyone have any suggestions, or see anything I hadn’t thought about?

Meta-ANTLR

My first go-round, I got frustrated with the amount of duplicate code I had to write in my ANTLR grammar just to build real parse trees. ANTLR is great if you want to parse and evaluate an expression, but it kind of sucks if you want to parse and understand source code.

So I wrote a meta-parser and had another go of it. I’ve now got a Ruby script that reads in my metagrammar, does some chugging, and spits out an ANTLR .g syntax file, as well as some autogenerated .cs files. (All automated as part of every build, of course.)

Here’s the cool stuff my metaparser can do.

  • Node classes. Just about every parser rule now has a corresponding class, which is automatically codegenned by the metagrammar script. That class has properties for each of the syntax elements, so for example, when the Parameter class is automatically generated, it is given these properties:
    • public TokenNode Semicolon { get … }
    • public TokenNode Style { get … } // var, const, out, or nothing
    • public List<IdentItem> Names { get … }
    • public TokenNode Colon { get … }
    • public INode Type { get … }
    • public TokenNode Equal { get … }
    • public INode Default { get … }
  • Readability. In the metagrammar file, token names are in angle brackets, such as <Ident>. Parser rule names, like MethodCall, aren’t in brackets. Everything is Pascal-cased. I find this far easier to read than the capitalization-based TOKEN and parserule convention used by ANTLR.
  • ANTLR bug workaround. Since ANTLR sometimes forgets that EOF is a valid end-of-rule token, my metagrammar script automatically generates the appropriate workaround for every rule in the metagrammar. Suddenly I can write unit tests for everything again, without even having to think about it! Hooray!
  • Remove duplication. To get keywords and semikeywords to work, I had to list them all twice: once in the tokens{} section, and again in the keyword and semikeyword parser rules. (Predictably, when I went to remove the duplication, I found that I had made some mistakes: the two lists weren’t exactly the same.) The metagrammar just lists them once, and it automatically generates the necessary duplication to make ANTLR do the right thing.

Honestly, though, I think that the class-for-each-node-type thing is going to be the coolest. Especially when I want to start hanging some hand-coded properties (like a List<string> of all the used units, so I don’t always have to walk the Ident nodes) off of some node types. (Partial classes are going to be so sweet for this.)
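
Here’s roughly how that partial-class split could look (a sketch with assumed types; the HasType helper is made up to show the idea):

using System.Collections.Generic;

// Minimal stand-ins for the real interfaces (assumed shapes):
public interface INode { }
public class TokenNode : INode { }
public class IdentItem : INode { }

// Parameter.generated.cs -- the half the metagrammar script emits:
public partial class Parameter : INode
{
    private List<IdentItem> names = new List<IdentItem>();
    private INode type;

    public List<IdentItem> Names { get { return names; } }
    public INode Type { get { return type; } }
    // ...Style, Colon, Equal, Default, and Semicolon follow the
    // same pattern...
}

// Parameter.cs -- the hand-written half, where custom properties
// can hang off the generated class:
public partial class Parameter
{
    public bool HasType { get { return Type != null; } }
}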

The only downside is that I more or less had to start over. The parsing logic for over half the rules survived, but I threw away my old tree construction and started from scratch, so only 42% of the rules are completely finished. But the parse trees are going to be way cooler now that I’m building them myself and not worrying about ANTLR’s AST.

(And no, I still can’t release the code. Working on that.)

ANTLR and SmaCC

Don Roberts is consulting for us this week, and I got to talking to him about parsers. I mentioned my problems building logical parse trees with ANTLR.

Turns out I was talking to the right person, because he’s one of the maintainers of SmaCC. And SmaCC has built-in support for building parse trees just the way you want them. You can specify that a particular rule should generate a parse-tree node, or not. You can specify the name of the class you want it to generate for that particular rule. You can even name the individual pieces of the parse rule (e.g., leftside, rightside), and those names become the names of methods on the class, which return other nodes.

SmaCC even generates a base visitor class that you can use to walk the tree. He showed me a simple C-like grammar, and explained how he used multiple visitors: one pass to build the symbol tables, another to do codegen.
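
The multiple-visitor idea, sketched in C# terms (hypothetical names; SmaCC itself generates Smalltalk, and its real API surely differs):

// Named pieces of a parse rule, like "leftside" and "rightside",
// become accessors on the generated node class:
public class AssignmentNode
{
    public object LeftSide;
    public object RightSide;
    public void Accept(Visitor visitor) { visitor.VisitAssignment(this); }
}

// The generated base visitor class:
public abstract class Visitor
{
    public virtual void VisitAssignment(AssignmentNode node) { }
}

// One visitor per pass over the same tree:
public class SymbolTableBuilder : Visitor
{
    public override void VisitAssignment(AssignmentNode node)
    {
        // pass 1: record declarations in the symbol table
    }
}

public class CodeGenerator : Visitor
{
    public override void VisitAssignment(AssignmentNode node)
    {
        // pass 2: emit code, consulting the symbol table
    }
}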

I know very little Smalltalk, which is a big downside in this, since SmaCC (as its name implies) generates Smalltalk code. But that’s going to be changing: by the end of this summer, they’re going to have it able to create parsers in C#. Sweeeet.

So I’m about to go download VisualWorks so I can play with the current version of SmaCC. I’m not sure what this means for my current Delphi parser. I may go ahead and finish it in ANTLR, since it’ll be a few months before SmaCC does C#. Then again, if SmaCC is cool enough, I just might teach myself Smalltalk…

What to do when ANTLR breaks your inner rules

Three or four times now, I’ve written one parser rule, test-first, and gotten everything working; and then used that rule from another rule, and watched the original tests all fail. For example:

parameter:
  parameter_style
  ident_list
  (COLON parameter_type)?
  (EQUAL expression)?;

Worked great. It would match A: Integer, it would match const Foo, it would match A, B: array of const. Passed all the tests I could throw at it.

Then I added a rule that used it:

formal_parameters:
  OPEN_PAREN
  (p:parameter)?
  CLOSE_PAREN;

(This new rule isn’t finished yet. Yes, it will accept multiple parameters by the time it’s done.)

As soon as I defined the formal_parameters rule in my grammar, my parameter tests started failing with errors like this:

antlr.NoViableAltException: unexpected token: ["",<1>,line=1,col=17]

I did some digging and some diffing, and found the problem. To take one particular example, the test was trying to parse var Foo: Integer. var matched parameter_style; Foo matched ident_list; and the colon and Integer matched COLON parameter_type. Then it tried to look for EQUAL expression, and it broke.

It was looking at the next token in the input and deciding what to do. If the next token was EQUAL, then it would take the optional EQUAL expression branch. If the next token was something that could validly come after the parameter rule (which it figured out by looking at every other rule in the grammar that could possibly use parameter), then it would skip the optional branch and return. And if neither of those cases held — if the next token was neither EQUAL nor a token that could possibly come after a parameter — then it would throw a NoViableAltException.
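
In generated-code terms, the decision at that point has roughly this shape (a simplified sketch; the real output uses ANTLR’s bitset token sets, and the names here are approximations):

// Inside the generated parameter() method, after the optional
// "(COLON parameter_type)?" part has been matched:
if (LA(1) == EQUAL)
{
    // next token is EQUAL: take the optional branch
    match(EQUAL);
    expression();
}
else if (followSetOfParameter.member(LA(1)))
{
    // next token can validly come after "parameter": skip the branch
}
else
{
    // neither EQUAL nor a valid follower: give up
    throw new NoViableAltException(LT(1), getFilename());
}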

The problem was, somehow ANTLR got confused, and forgot that EOF was one of those somethings that could validly come after the parameter rule. After all, you might call the parameter rule all on its own, and it’s perfectly legit for it to match everything in the input stream.

ANTLR gets confused this way, every now and then. Most likely some bug in its “look at every other rule in the grammar that could possibly use parameter” logic. Fortunately, I don’t expect to use the parameter rule on its own, except in tests; it’ll always be part of some larger construct that does still work. So I used to just shrug helplessly, and move my parameter unit tests onto formal_parameters. Made the tests messier, because now they were testing two levels deep, but I got adequate coverage.

This latest time, though, a light bulb went off: I could easily force ANTLR to recognize EOF as something that can come after the parameter rule. I wrote another rule, like so, and with no other changes, magically my tests started passing:

parameter_with_eof:
  parameter EOF;

I don’t even need to use this new rule, in my tests or anywhere else. But it gives ANTLR a clue-by-four that yes, EOF can come after parameter. So everywhere I use parameter, it recognizes EOF as a valid signal to skip the optional bits. And everything starts working again.

It’d be nice if ANTLR worked right in the first place, of course. But it is at least encouraging to know that I’ve been able to work around every one of its bugs that I’ve found so far.

What to do when ANTLR moves your variables out of scope

Tidbit I thought was worth passing on…

I was just adding the SimpleType rule to my grammar, and got this error:

Delphi.g:534: warning:nondeterminism between alts 1 and 2 of block upon
Delphi.g:534:     k==1:OPEN_PAREN

Okay, I’ve run into this before. The problem is between SubrangeType and EnumeratedType, because both can start with something in parentheses (SubrangeType could start with a parenthesized subexpression like (WM_USER + 1), and EnumeratedType is something like (alNone, alClient, alCustom)). ANTLR doesn’t like that kind of ambiguity; it wants the next token (here, an open parenthesis) to tell it unambiguously which rule to descend into.

No problem, says I: I’ll just break out my handy-dandy syntactic predicate. If lookahead shows an expression followed by ‘..’, then it’s a SubrangeType. So I made the changes, and promptly got forty-some compiler errors. Things like:

error CS0103: The name 'label' does not exist in the current context

What happened?! I looked at the DelphiParser.cs that ANTLR had generated, and saw things like this:

if (0==inputState.guessing) {
    string label = "";
}
// ...
if (0==inputState.guessing) {
    label = f1_AST.getText(); // <-- compiler error
}

Sure ’nuff! ANTLR had moved the variable declaration inside the if block, so by the time it got time to use it, the variable wasn’t in scope anymore. I beat my head against this for a few minutes before pulling out the ANTLR docs.

The fix was simple: I hadn’t been using quite the right ANTLR syntax. I had been declaring my variables inside the rule, along these lines:

expression:
    { string label = ""; }
    // ...

(The stuff in curly braces is a custom C# code block I provide, which gets inserted into the generated code.)

This had worked fine, until I started using those rules inside a predicate. It turns out that any code blocks after the colon can be if’ed away, if ANTLR decides that it will need to be able to execute that rule in guess mode (i.e., if that rule, or one that references it, appears in the parenthesized part of a syntactic predicate). And since most every rule contained several code blocks (one that declared the variable, followed by one or more that used it), and each code block got moved into its own scope, I had a problem.

The right way is to put the variable declarations before the colon. Then they won’t be if’ed out:

expression
{ string label = ""; } :
    // ...

Worked like a charm.
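
Putting the two pieces together, the working shape of the rule is roughly this (a sketch with my rule names; ELLIPSIS is the “..” token):

simple_type
{ /* C# variable declarations go here, before the colon */ } :
    (expression ELLIPSIS) => subrange_type
  | enumerated_type;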

I gave the ANTLR docs a brief skim before I started, and then dove right in to constructing a lexer and parser. I’ve learned an awful lot this way, but every now and then, it’s good to go back to the docs.

DGrok Delphi grammar

I’ve just posted a copy of Borland’s Delphi grammar with my annotations, corrections, and tweaks. It’s not in EBNF form, but it’s close, and it’s a lot more accurate than Borland’s original grammar document.

This is a file that’s generated from a little Ruby script I wrote this morning. It’s not just a static document; it includes the current status of the Delphi parser as I’m working on it. It shows which language constructs I’ve fully implemented, which I’ve thrown away in favor of something else, and which I haven’t worked on yet. (As of now, out of 121 language constructs, I’ve got 21 finished.)

The colors and such aren’t all self-explanatory yet, but you can probably figure most of them out. I’ll be adding more explanatory text later. If you can’t figure something out, post a question here.

I expect the project to further accelerate as I go, partly because of the “Likely Targets” section I put into this status page. It probably took twenty minutes to add that to the report, but it’ll save me a lot of head-scratching and wondering what I can work on next without having to stop and finish something else first.

I’ll keep the project status page updated as I go (whenever my Net connection is actually working, that is). Go check it out.

Parsing Delphi expressions

It’s commit #111 to my local Subversion repository. I have 234 unit tests so far. And my parser can now handle expressions. Whether it’s 2+2 or Application.Forms[0].ClientWidth, it can build a parse tree for it.

Still far from done. It can’t handle statements yet, so no A := B, no if, no for..in. Nowhere near being able to parse an entire method or a type declaration, much less an entire unit. And it doesn’t have any sort of symbol tables yet, so it doesn’t know Application.Forms from a hole in the wall. But expressions were a big hairy part of the grammar. They took a bit of creative bootstrapping. And they’re working. I’m a happy geek.

No, I can’t post the code yet. Still waiting to hear back from my employer’s legal department on the extent to which they own my soul, and what that means for me in terms of open source.

But I’m psyched about being able to parse expressions. And Jennie is deeply concerned about my mental health.

Making ANTLR able to fail unit tests

I’m writing my ANTLR grammar test-first. I’ll write an NUnit test method that parses a particular string and checks the resulting parse tree; then I’ll run the test and watch it fail first; and then I’ll start writing the lexer and parser rules to make that test pass.

This is an awesome way to write a grammar. It’s already saved me several times (it was my unit tests that caught the hiccup with ANTLR confusing “.” with “..”). But the thing I learned early on is that it doesn’t work out of the box.

ANTLR’s base lexer and parser classes throw exceptions when they encounter errors. This is great; I want my tests to fail if there are parse errors (except for the tests that expect parse errors, naturally). But by default, all of the code ANTLR generates will include try..catch blocks, which catch those exceptions, output messages to the console, and then try to recover and continue parsing past the error. So as far as unit tests (which don’t even look at the console) are concerned, exceptions basically get swallowed, and you can’t tell whether the parse succeeded or not.

Turns out this is easy to change. If you’re writing your grammar test-first, you’ll want to disable the parser’s defaultErrorHandler option. Then ANTLR won’t generate its try..catch blocks all over the place, so any parse errors will propagate back up the call stack, and cause your test to fail like you’d expect.

class DelphiParser extends Parser;
options {
  defaultErrorHandler = false;
}
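
With that option off, a test that feeds the parser garbage fails the way you’d expect (a sketch; the lexer class name, rule name, and setup details are assumptions about my grammar):

using System.IO;
using NUnit.Framework;

[TestFixture]
public class ParseErrorTests
{
    [Test]
    public void InvalidParameterDoesNotParse()
    {
        DelphiLexer lexer = new DelphiLexer(new StringReader("var :"));
        DelphiParser parser = new DelphiParser(lexer);
        try
        {
            parser.parameter();
            Assert.Fail("expected a parse error to propagate");
        }
        catch (antlr.RecognitionException)
        {
            // expected: with defaultErrorHandler=false, the error
            // reaches the test instead of being swallowed
        }
    }
}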

ANTLR: when a token is not a token

Lexers are good at figuring out where tokens start and end. They read one character at a time, but still, they can tell < from <= with ease. They thrive on it.

ANTLR gets stuck occasionally.

If I define a token named DOT that matches a single period, and another token named ELLIPSIS that matches two periods, it can handle them just fine. Most of the time. But it can’t parse 1..2, because it thinks that first dot is a decimal point, not an ellipsis.

When I tell ANTLR that I expect a dot, I want it to check that the next token is a dot. But instead, it checks that the next character is a dot. Which means, if the next two characters in the input stream are “..”, and I call the ANTLR-generated method mDOT(), it will read the first dot out of those two. I want it to fail, because the next thing isn’t a dot, it’s an ellipsis. But it doesn’t work that way.

The workaround is to use something called “syntactic predicates”. (If you understand positive lookahead in regular expressions, it’s the same thing.) Syntactic predicates look like this:

(...) => ...

You put this on one leg of an alternation. It means “if x, then match y.” x is the stuff inside the parentheses; y is the stuff after =>. I can use this to tell it “if you look ahead and see ‘..’, then do something other than matching the first ‘.’ alone.”

This is really flexible. The biggest downside is, you have to provide this context at every call site. In practice, it hasn’t been too bad yet, but boy am I glad I’ve got my unit tests!

My complete NUMBER lexer rule (excluding unit tests) looks like this. If I hit “..”, I just give an empty alternate, and stop parsing the token:

NUMBER:
  ('0'..'9')+
  (
    // if we find "..", we're done parsing the number
    (ELLIPSIS) => |
    // otherwise, look for a decimal point
    (DOT ('0'..'9')+)?
    (('E' | 'e') ('+' | '-')? ('0'..'9')+)?
  )
  ;