ANTLR and SmaCC

Don Roberts is consulting for us this week, and I got to talking to him about parsers. I mentioned my problems building logical parse trees with ANTLR.

Turns out I was talking to the right person, because he’s one of the maintainers of SmaCC. And SmaCC has built-in support for building parse trees just the way you want them. You can specify that a particular rule should generate a parse-tree node, or not. You can specify the name of the class you want it to generate for that particular rule. You can even name the individual pieces of the parse rule (e.g., leftside, rightside), and those names become the names of methods on the class, which return other nodes.

SmaCC even generates a base visitor class that you can use to walk the tree. He showed me a simple C-like grammar, and explained how he used multiple visitors: one pass to build the symbol tables, another to do codegen.

I know very little Smalltalk, which is a big downside in this, since SmaCC (as its name implies) generates Smalltalk code. But that’s going to be changing: by the end of this summer, they’re going to have it able to create parsers in C#. Sweeeet.

So I’m about to go download VisualWorks so I can play with the current version of SmaCC. I’m not sure what this means for my current Delphi parser. I may go ahead and finish it in ANTLR, since it’ll be a few months before SmaCC does C#. Then again, if SmaCC is cool enough, I just might teach myself Smalltalk…

Dragging Boone

I spent the weekend with about 70 teenagers. It was a terrific weekend. Very long, very draining, but terrific just the same.

It was the big UU youth con in Boone, IA, and it really made me realize how much I love helping out with our youth group. Youth group in general, cons in particular, are just a place where the kids are allowed to be who they are. And they shine.

To take just a few examples of what really stood out, at this con in particular:

  • One of the graduating seniors said that she was planning to go into theater in college — and the main reason was cons. She had performed at the coffeehouse (talent show) at several cons, and it was the other youths’ reactions and support that made her realize this was what she wanted to do.
  • Never once, at the entire con, did I see a loner. Everyone got drawn into a larger group. Everyone mixed, and began to hang out with others they hadn’t known before. As a serious introvert, I still marvel at the amount of community they’re able to build in less than forty-eight hours.
  • There were several youth there who were openly gay and/or bi. Some of them wouldn’t dare let that be known in their hometowns, but at the con, it was a non-issue. In some small ways, it was even celebrated.
  • Wink is always a blast. Jennie said that Deathball was terrific fun, too.
  • Both the adults and the youth at this con made a real point to ask the adults to get involved: to participate in touch groups, attend workshops, and even just talk to the youth. (I wasn’t about to wait for an invitation; I joined in with a touch group two cons ago, and loved it. And wasn’t able to join in at the next con, because they always had the adults scheduled to do our own thing, and I really missed it.)
  • Where, but at a Unitarian youth conference, can you have a teenager leading a workshop called “Root Beer and Smut”, and have it attended by about ten youth and two adults? (That was a lot of fun — he brought root beer and some trashy romance novels, and people took turns doing dramatic readings of the sex scenes.)
  • They have some traditions that some people might find in questionable taste, but that I, having been there and seen how they pull the community just that much closer, would definitely put my full support behind. Things like the obscene Yogi Bear song at closing, or the foofing of the con virgins, or getting all of the first-year Boone guys dressed up in drag on Saturday night.

And yes, for the record… this was my first Boone. So I put on a dress, eyeshadow, and lipstick, and strutted across the complex with the rest of them. (And, later, past a group of Camp Fire girls, on the long trek to my cabin to move my stuff. One of my colleagues said it was worth the walk just to see the expressions on their faces.) I’m still a little surprised to say it, but it was actually a lot of fun. I got all theatrical about it, blowing kisses and getting my Angelina pout on. One of the girls told me I was a natural at doing The Walk, and several guys told me I was the hottest one there (yes, beard and all). Jennie just laughed and laughed.

It’s amazing. The more youth cons I go to, the more I love them, just like the youth do. I still wish I’d known about Unitarian cons when I was a teenager, but even being an old (30) guy like I am, I’m starting to get drawn into the community myself. The first con I went to, several months ago, I made it to one touch-group meeting and I think one workshop, and otherwise kept to myself, and didn’t join in on the hug line at the end (because I still wasn’t comfortable with the idea, and wasn’t sure it would even be considered appropriate). The next con, I didn’t get to go to touch groups, but had fun at the dance (for as long as I’m constitutionally capable of having fun at a dance), loved Unrequited Love, and did take tentative part in the hug line. At Boone, my third con, I was giving everyone bear hugs at the end.

To think, I wasn’t really sure I wanted to come to Boone. I don’t think that’s going to happen again. I think, next time, that it’ll take a lot more than being worn out from the last con to keep me from the next one.

Delphi parser: Half done

My Delphi parser is now half done! (Well, okay, halfway towards being able to parse Delphi 6 grammar. Still, it’s a big milestone.)

A few current statistics:

  • Rules completed: 63 out of 125
  • Number of unit tests: 330
  • Last time I fixed a serious error in the Borland-provided grammar definition: 1 hour ago
  • Last rule added: ProcedureType
  • Next likely targets: MethodHeading, PropertyParameterList, PropertySpecifiers
  • Next Really Big Thing that needs to be cleaned up in the grammar: ClassType, which has the visibility stuff completely wrong
  • Latest interesting fix to the official grammar: procedural types (type TFoo = procedure of object;) were in the grammar, but wouldn’t work because they expected a procedure name (e.g., the official grammar would only parse it if it was type TFoo = procedure Foo of object;, which is not in fact something that the compiler will accept)

What to do when ANTLR breaks your inner rules

Three or four times now, I’ve written one parser rule, test-first, and gotten everything working; and then used that rule from another rule, and watched the original tests all fail. For example:

parameter:
  parameter_style
  ident_list
  (COLON parameter_type)?
  (EQUAL expression)?;

Worked great. It would match A: Integer, it would match const Foo, it would match A, B: array of const. Passed all the tests I could throw at it.

Then I added a rule that used it:

formal_parameters:
  OPEN_PAREN
  (p:parameter)?
  CLOSE_PAREN;

(This new rule isn’t finished yet. Yes, it will accept multiple parameters by the time it’s done.)

As soon as I defined the formal_parameters rule in my grammar, my parameter tests started failing with errors like this:

antlr.NoViableAltException: unexpected token: ["",<1>,line=1,col=17]

I did some digging and some diffing, and found the problem. To take one particular example, the test was trying to parse var Foo: Integer. var matched parameter_style; Foo matched ident_list; and the colon and Integer matched COLON parameter_type. Then it tried to look for EQUAL expression, and it broke.

It was looking at the next token in the input and deciding what to do. If the next token was EQUAL, then it would take the optional EQUAL expression branch. If the next token was something that could validly come after the parameter rule (which it figured out by looking at every other rule in the grammar that could possibly use parameter), then it would skip the optional branch and return. And if neither of those cases held — if the next token was neither EQUAL nor a token that could possibly come after a parameter — then it would throw a NoViableAltException.

The problem was, somehow ANTLR got confused, and forgot that EOF was one of those somethings that could validly come after the parameter rule. After all, you might call the parameter rule all on its own, and it’s perfectly legit for it to match everything in the input stream.

ANTLR gets confused this way, every now and then. Most likely some bug in its “look at every other rule in the grammar that could possibly use parameter” logic. Fortunately, I don’t expect to use the parameter rule on its own, except in tests; it’ll always be part of some larger construct that does still work. So I used to just shrug helplessly, and move my parameter unit tests onto formal_parameters. Made the tests messier, because now they were testing two levels deep, but I got adequate coverage.

This latest time, though, a light bulb went off: I could easily force ANTLR to recognize EOF as something that can come after the parameter rule. I wrote another rule, like so, and with no other changes, magically my tests started passing:

parameter_with_eof:
  parameter EOF;

I don’t even need to use this new rule, in my tests or anywhere else. But it gives ANTLR a clue-by-four that yes, EOF can come after parameter. So everywhere I use parameter, it recognizes EOF as a valid signal to skip the optional bits. And everything starts working again.

It’d be nice if ANTLR worked right in the first place, of course. But it is at least encouraging to know that I’ve been able to work around every one of its bugs that I’ve found so far.

What to do when ANTLR moves your variables out of scope

Tidbit I thought was worth passing on…

I was just adding the SimpleType rule to my grammar, and got this error:

Delphi.g:534: warning:nondeterminism between alts 1 and 2 of block upon
Delphi.g:534:     k==1:OPEN_PAREN

Okay, I’ve run into this before. The problem is between SubrangeType and EnumeratedType, because both can start with something in parentheses (SubrangeType could start with a parenthesized subexpression like (WM_USER + 1), and EnumeratedType is something like (alNone, alClient, alCustom)). ANTLR doesn’t like that kind of ambiguity; it likes to know that an open parenthesis means it can move right to this rule.

No problem, says I: I’ll just break out my handy-dandy semantic predicate. If lookahead shows an expression followed by ‘..’, then it’s a SubrangeType. So I made the changes, and promptly got forty-some compiler errors. Things like:

error CS0103: The name 'label' does not exist in the current context

What happened?! I looked at the DelphiParser.cs that ANTLR had generated, and saw things like this:

if (0==inputState.guessing) {
    string label = "";
}
// ...
if (0==inputState.guessing) {
    label = f1_AST.getText(); // <-- compiler error
}

Sure ’nuff! ANTLR had moved the variable declaration inside the if block, so by the time it got time to use it, the variable wasn’t in scope anymore. I beat my head against this for a few minutes before pulling out the ANTLR docs.

The fix was simple: I hadn’t been using quite the right ANTLR syntax. I had been declaring my variables inside the rule, along these lines:

expression:
    { string label = ""; }
    // ...

(The stuff in curly braces is a custom C# code block I provide, which gets inserted into the generated code.)

This had worked fine, until I started using those rules inside a predicate. It turns out that any code blocks after the colon can be if’ed away, if ANTLR decides that it will need to be able to execute that rule in guess mode (i.e., if that rule, or one that references it, appears in the parenthesized part of a semantic predicate). And since most every rule contained several code blocks (one that declared the variable, followed by one or more that used it), and each code block got moved into its own scope, I had a problem.

The right way is to put the variable declarations before the colon. Then they won’t be if’ed out:

expression
{ string label = ""; } :
    // ...

Worked like a charm.

I gave the ANTLR docs a brief skim before I started, and then dove right in to constructing a lexer and parser. I’ve learned an awful lot this way, but every now and then, it’s good to go back to the docs.

DGrok Delphi grammar

I’ve just posted a copy of Borland’s Delphi grammar with my annotations, corrections, and tweaks. It’s not in EBNF form, but it’s close, and it’s a lot more accurate than Borland’s original grammar document.

This is a file that’s generated from a little Ruby script I wrote this morning. It’s not just a static document; it includes the current status of the Delphi parser as I’m working on it. It shows which language constructs I’ve fully implemented, which I’ve thrown away in favor of something else, and which I haven’t worked on yet. (As of now, out of 121 language constructs, I’ve got 21 finished.)

The colors and such aren’t all self-explanatory yet, but you can probably figure most of them out. I’ll be adding more explanatory text later. If you can’t figure something out, post a question here.

I expect the project to further accelerate as I go, partly because of the “Likely Targets” section I put into this status page. It probably took twenty minutes to add that to the report, but it’ll save me a lot of head-scratching and wondering what I can work on next without having to stop and finish something else first.

I’ll keep the project status page updated as I go (whenever my Net connection is actually working, that is). Go check it out.

Parsing Delphi expressions

It’s commit #111 to my local Subversion repository. I have 234 unit tests so far. And my parser can now handle expressions. Whether it’s 2+2 or Application.Forms[0].ClientWidth, it can build a parse tree for it.

Still far from done. It can’t handle statements yet, so no A := B, no if, no for..in. Nowhere near being able to parse an entire method or a type declaration, much less an entire unit. And it doesn’t have any sort of symbol tables yet, so it doesn’t know Application.Forms from a hole in the wall. But expressions were a big hairy part of the grammar. They took a bit of creative bootstrapping. And they’re working. I’m a happy geek.

No, I can’t post the code yet. Still waiting to hear back from my employer’s legal department on the extent to which they own my soul, and what that means for me in terms of open source.

But I’m psyched about being able to parse expressions. And Jennie is deeply concerned about my mental health.