Delphi grammar trivia of the day: double-quoted strings

Quoted strings, in Delphi, use single quotes. Almost always.

But did you know that the Delphi compiler supports double-quoted strings, too?

I didn’t either, until I wrote a Delphi lexer. I set it loose on all of the library source that ships with Delphi, and it complained about this " character that it didn’t understand. In SysUtils.pas, as it turns out, there’s a line that looks like this:

CMP     AL,"'"

Yep. Inside an asm block, you can have a double-quoted character literal. And since I use the same lexer for all source files (i.e., I don’t have a separate lexer for asm blocks), I too have to support double-quoted strings.

But I only support double quotes around a single apostrophe: "'". After all, I don’t want to encourage this sort of thing.
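
For what it's worth, the special case is tiny. Here's a sketch of the shape of it in Ruby (the names Token, LexError, and scan_double_quote are hypothetical, not my actual lexer code):

require 'strscan'

Token = Struct.new(:kind, :value)
class LexError < StandardError; end

# Called when the lexer hits a double quote. The only form accepted is
# a double-quoted apostrophe, "'", like the one in SysUtils.pas.
def scan_double_quote(scanner)
  if scanner.scan(/"'"/)
    Token.new(:string_literal, "'")
  else
    raise LexError, "double quotes are only allowed around a single apostrophe"
  end
end

p scan_double_quote(StringScanner.new(%q{"'"}))
# => #<struct Token kind=:string_literal, value="'">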

Packrat Parsing

Over the past few days, I’ve taken a couple of looks back at parser generators, with an eye toward (possibly) resuming work on DGrok, my project to write a parser for the Delphi language.

Along the way, I ran across a term I wasn’t familiar with: packrat parsing. It claims to offer the simplicity of a recursive-descent parser together with the flexibility of a table-driven LR parser like yacc. It also claims unlimited lookahead in linear time (at the cost of higher memory requirements).

Don’t know yet how promising packrat parsing will be — I’ve only read the first page or so of the paper — but it should at least be an interesting read.
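
From what I can tell so far, the core trick is memoization: a packrat parser is an ordinary recursive-descent parser that caches the result of every rule at every input position, so it can backtrack freely without ever re-parsing the same thing twice. A toy sketch of the idea (my own illustration, not code from the paper):

class PackratParser
  def initialize(input)
    @input = input
    @memo  = {}    # [rule, position] => result of trying rule at position
  end

  # Every rule goes through apply(), so each (rule, position) pair is
  # parsed at most once; after that, it's a hash lookup.
  def apply(rule, pos)
    key = [rule, pos]
    return @memo[key] if @memo.key?(key)
    @memo[key] = yield(pos)
  end

  # digit <- '0'..'9'; returns [value, next_position], or nil on failure
  def digit(pos)
    apply(:digit, pos) do |p|
      c = @input[p]
      c && c =~ /\d/ ? [c, p + 1] : nil
    end
  end

  # sum <- digit '+' digit / digit
  def sum(pos)
    apply(:sum, pos) do |p|
      left = digit(p)
      if left && @input[left[1]] == '+' && (right = digit(left[1] + 1))
        [[:sum, left[0], right[0]], right[1]]
      else
        digit(p)   # backtrack to the shorter alternative; the memo makes it free
      end
    end
  end
end

p PackratParser.new('1+2').sum(0)   # => [[:sum, "1", "2"], 3]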

(Someone else’s) open-source Delphi parser

Alert reader David Champion told me that there’s another open-source Delphi parser out there. It’s called castaliadelphiparser, and is, as I understand it, the actual parser that’s used in the Castalia add-in for the Delphi IDE. (We’ve got the Borland Edition of their add-in installed on a few of our computers at work. It’s pretty nice, and I need to get around to sending them some feature requests that would make the full version worth buying.)

Anyway, if anyone plays around with their parser, I’d love to hear how well it works, so please post comments here and let me know. If it can parse the entire Delphi language, including conditional compilation, typed constants, records with methods, nested types, class variables, strict private/protected, etc., then it’s way ahead of my parser. (If it can’t do all that yet, then you can take the italics off the “way”.)

If anyone’s wondering about the status of my parser… I’ve been making progress, but I probably won’t make any more progress between now and the end of November, since I’ll be spending the whole month of November writing a novel. And I can’t open-source what I’ve got because I’m still waiting on corporate red tape. (My employer owns my brain unless otherwise specified, but I’m tantalizingly close to getting them to otherwise specify.)

Unit tests for parsers

I’m still tinkering with my ANTLR-based Delphi parser, but I’ve also started playing around with Racc. Today, I found out that there are going to be some fundamental differences in how I write unit tests for Racc.

ANTLR generates a recursive-descent parser, which means that it can’t handle any sort of ambiguity in the grammar. (Is “at” a keyword or an identifier? “It depends” is not generally an answer ANTLR can accept.) But there’s an upside, too: the code it generates is very easy to understand (and debug, for that matter). If I have a parser rule “expression ::= atom operator atom”, then ANTLR generates a function called expression(), which consists of calls to atom(), operator(), and atom(). There are some other decorations along the way, but it’s basically that simple.

This makes it really easy to write unit tests for the expression rule in isolation: I just stuff some tokens into the lexer, call parser.expression(), and then make sure all the lexer’s input has been consumed. It’s not really a unit test of expression(), since I’m also testing everything expression() calls, but it’s been working reasonably well so far.

Racc, on the other hand, makes an LALR parser. It should, like yacc, be able to handle all sorts of ambiguity, as long as there’s some way to eventually resolve it. It generates a table-driven parser, basically a big state machine (so I hope I never have to debug it).

And it doesn’t have entry points for individual parser rules. This presents a problem for unit testing.

I see several options here:

  • Don’t write unit tests for the parser. (Fat chance.)
  • Force the state machine into the right state, then throw some tokens at it, and check its final state. The problem with this is, I’m pretty sure there’s more than one starting state for “expression”, since it will show up multiple places in the grammar, each with its own exit conditions. (Which state do you move to if you’re in an expression and you hit a semicolon? Depends entirely on which state you were in when you started…)
  • Feed in a complete source file’s worth of syntax for every test. If I’m testing the expression rule, then each test would prepend “unit Foo; interface implementation initialization” and append “end.” to the expression I’m testing. This might be doable, though it would force me to implement the grammar starting from the unit rule and working down (I had been going the other direction with the ANTLR parser), which would necessitate some really good record-keeping to know how close I was to being done.
  • Whip up some magic tokens that will never be generated by the real tokenizer, but can be injected by tests. Then make the top-level rule able to parse either projects (“program Foo; …” or “library Foo; …”), units (“unit Foo; …”), packages (“package Foo; …”), expressions (“<secret token for expression> 1 + 2”), uses clauses (“<secret token for uses clause> uses Foo, Bar;”), etc. This also would be doable, if I could automate the repetitive parts of the syntax; and it would let me build the grammar from the bottom up, which has its advantages. (There’s a sketch of what the test side might look like just below.)
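
Just to make that last option concrete, here’s roughly what the test side might look like. Everything here is hypothetical: DelphiParser is whatever class Racc generates, tokenize() stands in for my real tokenizer, and EXPRESSION_MARKER is the secret token that only tests ever inject.

class TestableParser < DelphiParser
  def initialize(tokens)
    @tokens = tokens
  end

  # Racc's do_parse pulls tokens by calling next_token, which must
  # return [type, value] pairs, and [false, false] at end of input.
  def next_token
    @tokens.shift
  end
end

def parse_expression(source)
  tokens = [[:EXPRESSION_MARKER, nil]] + tokenize(source) + [[false, false]]
  TestableParser.new(tokens).do_parse
end

# In a test, parse_expression("1 + 2") exercises just the expression rule,
# because the grammar's start rule would have an alternative beginning
# with EXPRESSION_MARKER.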

Has anyone out there written an LALR grammar test-first? Does anyone have any suggestions, or see anything I hadn’t thought about?

On using Rake.run_tests

Okay, an update to my previous Rake post. Rake does have some nice facilities for running tests, but the code I posted isn’t quite the right way to do it.

When you use rake/runtest, you should put the require inside your test task (inside each test task, if you have more than one). Like so:

# RIGHT
task :test do
  require 'rake/runtest'
  Rake.run_tests "**/*tests.rb"
end

# WRONG
require 'rake/runtest'
task :test do
  Rake.run_tests "**/*tests.rb"
end

The reason: the WRONG version will always (indirectly) require ‘test/unit’, and requiring ‘test/unit’ installs an at_exit hook that runs the test suite when the process exits. So your tests always run, even if you go through a code path that doesn’t register any.

Here’s what it looks like if you have the WRONG version and run rake -T (list all the interesting targets to the console, but don’t run any of them). Note that it’s running the tests, even though there aren’t any to run.

C:\svn\dgrok-r>rake -T
(in C:/svn/dgrok-r)
rake all  # Runs everything
Loaded suite c:/Program Files/ruby/bin/rake
Started

Finished in 0.0 seconds.

0 tests, 0 assertions, 0 failures, 0 errors

C:\svn\dgrok-r>

Here’s what it looks like if you use the RIGHT code above. It still runs the tests when you execute the “test” target, but when you’re not doing anything that involves test running, it doesn’t give you all that nonsense about how it ran 0 tests in 0.0 seconds:

C:\svn\dgrok-r>rake -T
(in C:/svn/dgrok-r)
rake all  # Runs everything

C:\svn\dgrok-r>

The coolness that is Rake

Rake is cool enough as a build tool, but wait until you check out its library.

First off, if you use make or ant, you owe it to yourself to look into Rake. The syntax is so much more readable than either, and you have a full-fledged programming language right there in your build script. The docs have a concise introduction to Rakefiles; Martin Fowler has a longer article called “Using the Rake Build Language”.

But wait! There’s more!

So Rake’s documentation is pretty sketchy. Mention is made of some cool libraries that ship with it, but no details. So I decided to RTFS, and I found much of coolness.

Coolness #1: All the coolness comes right out of the box. You don’t even need to require anything from your rakefile (unless you’re doing really fancy stuff). Just write a rakefile containing some tasks, run rake, and you’re off and running.

Coolness #2: They add an ext() method to the String class. It treats the string as a filename and returns a new filename with the extension changed to whatever you specify: the equivalent of Delphi’s ChangeFileExt() or .NET’s Path.ChangeExtension, and something that’s sorely lacking from Ruby out of the box.
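
A quick taste (the filenames are just for illustration; ext() is there as soon as Rake is loaded):

require 'rake'   # Rake's core extensions add String#ext

puts "grammar.y".ext("rb")    # => grammar.rb
puts "grammar.y".ext(".cs")   # => grammar.cs (the leading dot is optional)
puts "README".ext("txt")      # => README.txt (adds an extension if there was none)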

Coolness #3: A built-in command to shell out to Ruby. They have a method to shell out, and another one specifically to shell out to another Ruby process.

When you want to shell out (e.g. to run your compiler), you should use the sh() and ruby() methods, which will automatically fail the task if the process returns a nonzero exit code. (By default, the value returned by a task block is ignored; if you want to fail the task, you need to do it explicitly with fail() or raise or some such, or call something like sh() that does it for you.)

rule '.rb' => ['.y'] do |t|
  sh "racc", t.source, "-o", t.name
end

task :test do
  ruby "test_all.rb"
end

Coolness #4: Built-in support for running unit tests. I already wrote a test_all.rb, but if I’d looked at Rake, I wouldn’t have had to. This works just as well:

[Note: It turns out that this sample code isn’t the right way to do it; see this post for corrected code. Still the same number of lines, just better-behaved.]

require 'rake/runtest'
task :test do
  Rake.run_tests "**/*tests.rb"
end

If you follow their convention of having your tests in test/test*.rb, you don’t need to pass an argument to Rake.run_tests at all. I like my tests to be next to the files they’re testing, so the above example is what I prefer.

There’s also a TestTask if you want fancier options, like modifying your $LOAD_PATH before running the test. (TestTask actually launches a new Ruby process via the ruby() method; Rake.run_tests runs them in-process.)
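
If you do want the fancier version, it looks something like this (the pattern here matches my tests-next-to-code convention, rather than their test/test*.rb default):

require 'rake/testtask'

Rake::TestTask.new(:test) do |t|
  t.libs << 'lib'                          # added to $LOAD_PATH in the child process
  t.test_files = FileList['**/*tests.rb']
  t.verbose = true
end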

Coolness #5: FileList.

FileList is a list of files, much like an array. But instead of giving it individual filenames, you can give it patterns, like ‘lib/**/*’, and it comes back as an array of matching filenames. (It can take both include and exclude patterns.) It lazy-initializes its contents, so if you create a FileList but then don’t run the task that uses it, it’s basically free. By default, it ignores CVS and .svn directories.

But coolest of all, FileList has shorthand methods: sub(), gsub(), ext(), egrep(). Want to return a new copy of a FileList with all the extensions changed? No sweat. Want to grep through every file in a FileList, and do something with each file whose contents match a given regex? egrep() to the rescue. I may never again have to write six whole lines of code to recursively grep through a directory tree of source files.
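
Here’s a taste of all of that together (the file patterns are just for illustration):

require 'rake'

# A lazy list of files: nothing hits the disk until the contents are needed.
sources = FileList['lib/**/*.pas'].exclude('lib/generated/**/*')

# A new FileList with every extension changed:
compiled = sources.ext('dcu')

# Grep through every file in a list; egrep yields filename, line number, line.
FileList['**/*.rb'].egrep(/TODO/) do |filename, line_number, line|
  puts "#{filename}:#{line_number}: #{line}"
end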

Meta-ANTLR

My first go-round, I got frustrated with the amount of duplicate code I had to write in my ANTLR grammar just to build real parse trees. ANTLR is great if you want to parse and evaluate an expression, but it kind of sucks if you want to parse and understand source code.

So I wrote a meta-parser and had another go of it. I’ve now got a Ruby script that reads in my metagrammar, does some chugging, and spits out an ANTLR .g syntax file, as well as some autogenerated .cs files. (All automated as part of every build, of course.)

Here’s the cool stuff my metaparser can do.

  • Node classes. Just about every parser rule now has a corresponding class, which is automatically codegenned by the metagrammar script. That class has properties for each of the syntax elements, so for example, when the Parameter class is automatically generated, it is given these properties:
    • public TokenNode Semicolon { get … }
    • public TokenNode Style { get … } // var, const, out, or nothing
    • public List<IdentItem> Names { get … }
    • public TokenNode Colon { get … }
    • public INode Type { get … }
    • public TokenNode Equal { get … }
    • public INode Default { get … }
  • Readability. In the metagrammar file, token names are in angle brackets, such as <Ident>. Parser names, like MethodCall, aren’t in brackets. Everything is Pascal-cased. I find this far easier to read than the capitalization-based TOKEN and parserule convention used by ANTLR.
  • ANTLR bug workaround. Since ANTLR sometimes forgets that EOF is a valid end-of-rule token, my metagrammar script automatically generates the appropriate workaround for every rule in the metagrammar. Suddenly I can write unit tests for everything again, without even having to think about it! Hooray!
  • Remove duplication. To get keywords and semikeywords to work, I had to list them all twice: once in the tokens{} section, and again in the keyword and semikeyword parser rules. (Predictably, when I went to remove the duplication, I found that I had made some mistakes: the two lists weren’t exactly the same.) The metagrammar just lists them once, and it automatically generates the necessary duplication to make ANTLR do the right thing.

Honestly, though, I think that the class-for-each-node-type thing is going to be the coolest. Especially when I want to start hanging some hand-coded properties (like a List<string> of all the used units, so I don’t always have to walk the Ident nodes) off of some node types. (Partial classes are going to be so sweet for this.)
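
To make the codegen idea concrete, here’s a toy version of the node-class generation step. The rule-description format and the _children storage are invented for illustration; the real script reads the metagrammar file and does considerably more bookkeeping.

# Toy sketch: generate a C# node class from a rule description.
PARAMETER_RULE = {
  :name  => 'Parameter',
  :items => [['Semicolon', 'TokenNode'],
             ['Style',     'TokenNode'],
             ['Names',     'List<IdentItem>'],
             ['Colon',     'TokenNode'],
             ['Type',      'INode'],
             ['Equal',     'TokenNode'],
             ['Default',   'INode']]
}

def emit_node_class(rule)
  props = rule[:items].each_with_index.map do |(name, type), i|
    "    public #{type} #{name} { get { return (#{type}) _children[#{i}]; } }"
  end
  (["public partial class #{rule[:name]}Node : Node", "{"] +
   props + ["}"]).join("\n")
end

puts emit_node_class(PARAMETER_RULE)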

The only downside is that I more or less had to start over. I’ve still got parsing for over half the rules, but I threw away my old tree construction and started over, so I’m only finished with 42% of all the rules. But the parse trees are going to be way cooler now that I’m doing them myself and not worrying about ANTLR’s AST.

(And no, I still can’t release the code. Working on that.)

ANTLR and SmaCC

Don Roberts is consulting for us this week, and I got to talking to him about parsers. I mentioned my problems building logical parse trees with ANTLR.

Turns out I was talking to the right person, because he’s one of the maintainers of SmaCC. And SmaCC has built-in support for building parse trees just the way you want them. You can specify that a particular rule should generate a parse-tree node, or not. You can specify the name of the class you want it to generate for that particular rule. You can even name the individual pieces of the parse rule (e.g., leftside, rightside), and those names become the names of methods on the class, which return other nodes.

SmaCC even generates a base visitor class that you can use to walk the tree. He showed me a simple C-like grammar, and explained how he used multiple visitors: one pass to build the symbol tables, another to do codegen.

I know very little Smalltalk, which is a big downside here, since SmaCC (as its name implies) generates Smalltalk code. But that’s going to change: by the end of this summer, they’re going to have it able to create parsers in C#. Sweeeet.

So I’m about to go download VisualWorks so I can play with the current version of SmaCC. I’m not sure what this means for my current Delphi parser. I may go ahead and finish it in ANTLR, since it’ll be a few months before SmaCC does C#. Then again, if SmaCC is cool enough, I just might teach myself Smalltalk…

Delphi parser: Half done

My Delphi parser is now half done! (Well, okay, halfway towards being able to parse Delphi 6 grammar. Still, it’s a big milestone.)

A few current statistics:

  • Rules completed: 63 out of 125
  • Number of unit tests: 330
  • Last time I fixed a serious error in the Borland-provided grammar definition: 1 hour ago
  • Last rule added: ProcedureType
  • Next likely targets: MethodHeading, PropertyParameterList, PropertySpecifiers
  • Next Really Big Thing that needs to be cleaned up in the grammar: ClassType, which has the visibility stuff completely wrong
  • Latest interesting fix to the official grammar: procedural types (type TFoo = procedure of object;) were in the grammar, but they couldn’t actually parse, because the official grammar expected a procedure name. It would only accept type TFoo = procedure Foo of object;, which is not in fact something that the compiler will accept.

What to do when ANTLR breaks your inner rules

Three or four times now, I’ve written one parser rule, test-first, and gotten everything working; and then used that rule from another rule, and watched the original tests all fail. For example:

parameter:
  parameter_style
  ident_list
  (COLON parameter_type)?
  (EQUAL expression)?;

Worked great. It would match A: Integer, it would match const Foo, it would match A, B: array of const. Passed all the tests I could throw at it.

Then I added a rule that used it:

formal_parameters:
  OPEN_PAREN
  (p:parameter)?
  CLOSE_PAREN;

(This new rule isn’t finished yet. Yes, it will accept multiple parameters by the time it’s done.)

As soon as I defined the formal_parameters rule in my grammar, my parameter tests started failing with errors like this:

antlr.NoViableAltException: unexpected token: ["",<1>,line=1,col=17]

I did some digging and some diffing, and found the problem. To take one particular example, the test was trying to parse var Foo: Integer. var matched parameter_style; Foo matched ident_list; and the colon and Integer matched COLON parameter_type. Then it tried to look for EQUAL expression, and it broke.

It was looking at the next token in the input and deciding what to do. If the next token was EQUAL, then it would take the optional EQUAL expression branch. If the next token was something that could validly come after the parameter rule (which it figured out by looking at every other rule in the grammar that could possibly use parameter), then it would skip the optional branch and return. And if neither of those cases held — if the next token was neither EQUAL nor a token that could possibly come after a parameter — then it would throw a NoViableAltException.

The problem was, somehow ANTLR got confused, and forgot that EOF was one of those somethings that could validly come after the parameter rule. After all, you might call the parameter rule all on its own, and it’s perfectly legit for it to match everything in the input stream.

ANTLR gets confused this way, every now and then. Most likely some bug in its “look at every other rule in the grammar that could possibly use parameter” logic. Fortunately, I don’t expect to use the parameter rule on its own, except in tests; it’ll always be part of some larger construct that does still work. So I used to just shrug helplessly, and move my parameter unit tests onto formal_parameters. Made the tests messier, because now they were testing two levels deep, but I got adequate coverage.

This latest time, though, a light bulb went off: I could easily force ANTLR to recognize EOF as something that can come after the parameter rule. I wrote another rule, like so, and with no other changes, magically my tests started passing:

parameter_with_eof:
  parameter EOF;

I don’t even need to use this new rule, in my tests or anywhere else. But it gives ANTLR a clue-by-four that yes, EOF can come after parameter. So everywhere I use parameter, it recognizes EOF as a valid signal to skip the optional bits. And everything starts working again.

It’d be nice if ANTLR worked right in the first place, of course. But it is at least encouraging to know that I’ve been able to work around every one of its bugs that I’ve found so far.