DGrok 0.1 released

I can successfully parse four of the source files that ship with Delphi. I’d say that’s a major milestone. So I’m releasing version 0.1 (alpha) of DGrok.

The source code is included in the download. DGrok is open-source, under the Open Software License (I’d rather use GPL, but I’m stuck with OSL because of NUnitLite).

I’ve included a GUI demo app that shows off the current capabilities a bit. It has two major screens: Ad-Hoc Parsing and Parse Source Tree.

Screenshot of DGrok's Ad-Hoc Parsing screen

Ad-Hoc Parsing lets you type in some source code, select which parse rule you want to use, and click Parse (Alt+P). The box in the lower right shows either the parse tree (if parsing was successful) or the error message. Additionally, if there was an error, the focus is put back in the edit box, and the cursor is moved to the error location.

If you want to type an entire source file, select the Goal rule (this is selected by default when the app starts). Or you could use Unit, if you know it’s really a unit and not a project, library, or package. If you just want to play around with expression parsing, select Expression. Or whatever. There’s a Grammar.html included in the ZIP that shows which rules are working in this release, and to what extent.

Screenshot of DGrok's Parse Source Tree screen

The Parse Source Tree tab lets you point DGrok at a directory, and set it loose. It will automatically search through subdirectories for .pas files (I should probably make it look at .dpr and .dpk files as well), load them, and try parsing them. (Since it knows they’re entire files, it doesn’t need to ask you which rule to apply; it automatically uses Goal.) If a file parses successfully, it gets listed under the “Passing” node; otherwise, the files are listed by error message. As you can see, there are still a lot of errors, so I must not be done yet.

Note that there isn’t currently a GUI for telling it which $IFDEFs evaluate to “true” and which evaluate to “false”. And if it doesn’t know whether something is true or false, it bombs out with an error. This is on purpose — I wanted to make absolutely sure I didn’t miss anything that should be defined — but it’s probably inconvenient if you’re trying to parse anything other than the Delphi RTL source that I’ve already tuned it for. I’ll get a GUI for this in a future version.

Happy parsing!

Software “Assurance”?

Our department has Software Assurance for all of our Delphi licenses. So we’re supposed to get all the new releases. Right?

Well, not so much.

We’ve been paying for Software Assurance since we bought Delphi 2005. And it still took us six weeks to get our copies of Delphi 2006. Sure, they were shipping physical product, so fulfillment is going to take a little while, but good grief, six weeks?

It should have been better for Delphi 2007, since they had Electronic Software Delivery by then. You’d think that you’d get the product faster if you don’t have to wait for them to package and ship it. But no. Once again, it took us six weeks to get Delphi 2007. And even after those six weeks, we only got our copies after I e-mailed the CodeGear CEO to complain. (Things happened quickly after that.)

Studio 2007 was released on Monday. If we didn’t have Software Assurance, we could have bought it online on Monday afternoon. But because we do have Software Assurance — you know, the thing that’s supposed to assure that we get new releases — we have to wait.

How long? Well, we called CodeGear to find out. And we were told we’d have to wait a month.

(At least they’re improving.)

Later in the call, it came up that the month would be to get the physical media. Once the guy — apparently a trainee — figured out that we could actually download the software, the estimate was revised to a week.

We’ll see if that actually pans out this time.

When people pay in advance for upgrades — providing your company with a reliable revenue stream, mind you — wouldn’t you think they should get those upgrades when they’re released, or perhaps even slightly before? So they’ll, you know, keep providing your company with a reliable revenue stream? Software Assurance may be a bit less expensive than buying every version when it comes out, but it still isn’t exactly cheap.

As Brian remarked to me this morning, “I don’t feel assured.”

Little-known Delphi grammar feature of the day: control-character syntax

Did you know that you can use caret-letter syntax to define a string literal?

const
CR = ^M;
LF = ^J;
TabDelimited = 'Name'^I'Address'^I'City'^I'State'^I'ZIP';
TwoLines = 'First line'^M^J'Second line';

It reads like the classic syntax for “control-M”. It’s valid Delphi grammar, it compiles, and it works.

That said, I have no plans to support it in my parser. I find the string literals during the tokenizing pass, and at that stage, I can’t tell the difference between the control character (^M) and the pointer type (^J) in this snippet:

const
CR = ^M;

type
J = ...;
PJ = ^J;

Pointer syntax is used a lot more often (translate: I’ve only ever seen one source file with ^M string-literal syntax, and that was in-house), so I’ll give preference to being able to handle pointers correctly.

I have thought about doing some string-literal folding at parse time… for example, to join

'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Curabitur ' +
'euismod. Cum sociis natoque penatibus et magnis dis parturient monte' +
's, nascetur ridiculus mus. Sed porta, felis at fermentum pretium, pe' +
'de leo ornare eros, ut ullamcorper turpis arcu id metus.'

into a single StringLiteral token in my parse tree. (This would make it possible to write a frontend that provides a “find in string literals” feature, and make it able to find “pede leo” in the above snippet.)

But don’t expect ^M to work anytime soon.

Grammar details of Delphi’s “type type” feature

If you’re a hardcore VCL geek like me, you probably already know about Delphi’s “type type” feature. But I learned some interesting details about it last night while discovering the Delphi grammar.

The documentation doesn’t give a name to this language construct, so “type type” is a name I made up. It’s when you prepend “type” to a type declaration to give it its own type identity:

type
TColor = type Integer;

I won’t go into any details on what you would use this for, because it’s not that useful for anything outside the Object Inspector. You probably wouldn’t even notice it was a different type if you never used it as a var parameter. But it’s more useful than goto, and I’ve got that in my parser…

Anyway, I found some interesting details about “type type” in my research. Specifically, there are only three type constructs that allow you to prepend “type” to them: identifiers, string, and file.

type
TTypeQualId = type System.Integer; // or any (qualified) identifier
TTypeString = type string;
TTypeFile = type file;

No variations on these are allowed, and no other types are allowed (I tried every type in the Delphi grammar):

type
TTypeStringOfLength = type string[42]; // compiler error
TTypeFileOfType = type file of Integer; // compiler error

I suspect the reason for this is that (according to the documentation) every time you declare something like string[42], it’s automatically considered a distinct type from every other string[42] you’ve ever defined — and therefore has its own RTTI identity, and isn’t var-parameter compatible with the others. You don’t need to declare type string[42] because it’s already distinct.

I found type file to be particularly interesting. Even if you’re dealing with an untyped file — the ones where you have to pass that second parameter to Reset and Rewrite to make them even remotely useful, the ones that have been utterly replaced by TStream — you can still make distinct types, and let the compiler make sure you pass the right ones to your var parameters. That’s actually an interesting feature, since file parameters always have to be var. I wonder if anybody has ever used this.

Update: Sébastien Doeraene, alias sjrd, posted some good details about “type type” in the comments, including how this construct affects the IDE’s autocompletion features and the distinction with type aliases.

Delphi grammar quirk of the day: sealed and abstract classes

Delphi’s syntax for sealed and abstract classes is a bit bizarre. The following all compile:

type
TFoo = class sealed sealed sealed sealed
end;

TBar = class abstract abstract abstract abstract
end;

If you put the sealed keyword more than once, does that make the class somehow extra-sealed?

Interestingly, sealed and abstract are both what I’m calling “semikeywords” — they only have a special meaning in a particular context. Outside that particular context, they can be used as plain old identifiers. So you could actually have a field in the class called sealed or abstract… as long as it’s not the first thing after the class keyword. Add another field first, or a visibility specifier, and you’re fine:

type
TFoo = class sealed
Index: Integer;
Abstract: Boolean;
end;

TBar = class abstract
strict private
Sealed: string;
end;

But there is a bit of sense amidst all this oddity. They at least don’t let you mix and match sealed (cannot descend from this class) with abstract (must descend from this class):

type
TBaz = class abstract sealed
end;

[Pascal Error] Project3.dpr(17): E2383 ABSTRACT and SEALED cannot be used together

Now, if they would just make it so that class abstract actually did something… (see QC#24662)

Unfinished Delphi feature of the day: virtual class helper methods

I’ve been researching the syntax for class helpers, and found some very interesting things. First, that class helpers can descend from other class helpers. And second, that they can have virtual methods.

Class helpers, for anyone not familiar with them, are a way of adding methods to an existing class — or at least, making it look like you do. The existing class is left unchanged; you might just as well be writing unit procedures that take an instance as their first parameter, except that class helpers make the code look nicer because you’re actually saying “Foo.NewThing” rather than “NewThing(Foo, …)”.

Now, since you’re not actually modyfing the existing class, your class helper can’t have any fields (there’s no place to store them). Nor can you override methods from the class you’re helping (since that would involve changing the VMT). So this whole “virtual” thing really surprised me.

So back to the interesting discoveries. First, class helpers can descend from other class helpers, but the syntax isn’t what I would have guessed:

TFooHelper = class helper for TFoo
...
end;
TOtherFooHelper = class helper (TFooHelper) for TFoo
...
end;

Presumably this would only make sense if they were helpers for different classes, but that’s the syntax: the parent class goes after “helper”, not after the entire “class helper for” clause. The parent must be another class helper (not an ordinary class).

Now, the really interesting bit: the compiler lets you put virtual methods on these guys.

TFooHelper = class helper for TFoo
public
procedure DoSomething; virtual;
end;

TBarHelper = class helper (TFooHelper) for TBar
public
procedure DoSomething; override;
end;

Except that as soon as you put “virtual;” on a class helper method, the compiler starts griping:

[Pascal Error] Project3.dpr(29): E2003 Undeclared identifier: 'QueryInterface'
[Pascal Error] Project3.dpr(29): E2003 Undeclared identifier: '_AddRef'
[Pascal Error] Project3.dpr(29): E2003 Undeclared identifier: '_Release'

Do what happened?

I was curious; I added those methods. Since you can’t add fields (e.g. FRefCount) to a class helper, I made _AddRef and _Release both return -1, to indicate that the class wasn’t refcounted. Then I wrote some code that called that virtual method, and ran my app.

The instruction at "0x004112f5" referenced memory at "0x00000000". The memory could not be "read".

Interesting.

So I looked at the assembly code that was generated for that virtual helper call. And sure enough, it was looking for an interface:

Project3.dpr.124: Foo.VirtualHelperMethod;
004112E2 8D4DEC           lea ecx,[ebp-$14]
004112E5 8B15B4004100     mov edx,[$004100b4]
004112EB 8BC3             mov eax,ebx
004112ED E8F629FFFF       call @GetHelperIntf // emphasis mine
004112F2 8B45EC           mov eax,[ebp-$14]
004112F5 8B10             mov edx,[eax]
004112F7 FF520C           call dword ptr [edx+$0c]

Very interesting, says I. That explains why it wanted me to implement IInterface on the helper: it somehow uses interfaces to deal with this “virtual helper method” business. But exactly which interface was it looking for, and why was it crashing? What else did I need to do? What interfaces did I need to implement? How could I implement interfaces, when the compiler doesn’t allow interface syntax (“class helper (TFooHelper, ISomething)” fails with “‘)’ expected but ‘,’ found”)?

So I opened up System.pas to look for this @GetHelperIntf method. Here’s the lone line of code in its implementation, along with the answer to why the app crashed…

function _GetHelperIntf(Instance: TObject; Cls: TClass): IInterface;
begin

Result := nil;
end;

Delphi bug of the day: FPU stack leak

Brian and I spent a couple hours debugging this afternoon. Our code was throwing an EInvalidOp exception: “Invalid floating-point operation”. And it was doing it in code that looked kinda like this:

S := S + '*';
Figure 1

Yyyyep. String concatenation. Was causing a floating-point exception.

It took us a good long while to figure out why, but we finally did. See, this EInvalidOp was a secondary exception. Before getting to the string concatenation, we had run through some code that did something like this:

Result := SomeArray[Index] - SomeArray[Index - 1];
Figure 2

Important details:

  • SomeArray was an array of Double.
  • Index happened to be the lowest index for SomeArray.
  • Range checking was on.

The code in Figure 2 was blowing up with an ERangeError, because SomeArray[Index - 1] was outside the bounds of the array. The code expected this, caught the exception (it was a unit test that specifically tested how the code reacted to boundary conditions), and continued on its way.

And then it did some string concatenation and blew up.

It took a lot of grovelling around in the CPU view and, finally, even the FPU view (which I have never before opened on purpose), to find the answer.

See, the floating-point unit (FPU) has its own stack. A fairly small one: the FPU stack has eight slots. And Delphi code (probably most code, for that matter) appears to pretty much expect the FPU stack to be empty most of the time. As a general thing, any given Delphi statement would start and end with an empty FPU stack. (That’s fairly standard for a stack-based architecture.) So a statement like the one in Figure 2 would break down something like this:

  • Start out with an empty FPU stack.
  • Get the value SomeArray[Index] and push it onto the FPU stack.
  • Get the value SomeArray[Index - 1] and push it onto the FPU stack.
  • Give the FPU a “subtract” instruction, which pops the top two items off the FPU stack, does the subtraction, and puts the result back on the stack. There’s now only one item on the FPU stack.
  • Pop the result off the FPU stack and put it into Result.
  • End the statement with an empty FPU stack again.

Now let’s switch gears for a minute. When you do string concatenation in place (e.g., S := S + something), the runtime tries to reuse the existing memory block. But if the new string doesn’t fit in the old block, it may need to allocate a new block somewhere else on the heap, and copy the existing content into it. (See SysReallocMem, in $(BDS)\source\Win32\rtl\sys\getmem.inc, if you’re really curious about how this works. Bring a flashlight.)

Delphi 2006 ships with the FastMM4 memory manager. Among other things, FastMM has very highly-optimized copy routines as part of its ReallocMem code. So highly-optimized that they have different routines for different block sizes. Here’s the one for a 36-byte block (copied from $(BDS)\source\Win32\rtl\sys\getmem.inc):

procedure Move36(const ASource; var ADest; ACount: Integer);
asm
fild qword ptr [eax]
fild qword ptr [eax + 8]
fild qword ptr [eax + 16]
fild qword ptr [eax + 24]
mov ecx, [eax + 32]
mov [edx + 32], ecx
fistp qword ptr [edx + 24]
fistp qword ptr [edx + 16]
fistp qword ptr [edx + 8]
fistp qword ptr [edx]
end;

In case you can’t read that gibberish, it amounts to, “push four quad-words (8 bytes each) onto the FPU stack, then pop them into a different memory location”. The remaining four bytes are done with a regular 32-bit MOV instruction.

This takes advantage of the fact that the FPU can move around eight bytes at a time, but basically it amounts to abusing the FPU stack as scratch space. And it makes assumptions about how many FPU stack slots are available — four for Move36, the one shown above. Move44 needs five slots, Move52 needs six, Move60 needs seven, and Move68 assumes that all eight slots are available. (Beyond that, they use a different strategy.) And those assumptions should be safe, since the FPU stack is always supposed to stay empty.

Maybe you can see where I’m going with this.

Let’s go through those steps again, this time paying particular attention to what happened in our code today.

  • Start out with an empty FPU stack.
  • Get the value SomeArray[Index] and push it onto the FPU stack.
  • Get the value SomeArray[Index - 1], and — oops. Index - 1 is out of bounds. Raise an ERangeError.
  • And then, of course, you end with an empty FPU stack… right?

Oops.

No, it turns out that if the second argument to a floating-point operation throws an exception, you end up with an FPU stack leak. And when there are only eight slots, leaks are Not a Good Thing.

Especially when your memory manager relies on the FPU stack.

I’ve already logged this in QualityCentral, as QC#51215. If you’re curious, the QC report includes a repro case you can play around with. Happy floating-point erroring.

Generating the grammar document

My Delphi grammar document is built from two pieces: my research, and a tool.

The research

The first step was to do all the research to figure out what the Delphi grammar is. This is not easy. The Delphi 5 documentation included an incomplete, and sometimes wildly inaccurate, BNF grammar. The Delphi 2006 documentation no longer includes the grammar; either the documentation team lost it (along with the docs for the IDE’s regular-expression syntax), or they gave up because it was so far out of date. The language has added loads of features since then: strict private, records with methods, even generics on the way.

So I had a rough sketch to start from, and an undergraduate compiler-design class from ten years ago. The rest — correcting the errors, and filling in the (large) blanks — is trial and error, and a lot of refactoring.

The upshot is, if you see something I’m missing, let me know. Fatih Tolga Ata already put class helpers and generics on my radar — although I can’t really do much with generics yet. Since there is no official (correct) grammar from CodeGear, my main method of discovering the grammar is to type stuff into the IDE and see what compiles (and, often more instructively, what doesn’t), I won’t be able to figure out the generic grammar until I have Highlander.

The tool

As I puzzle out the grammar, I document it in a YAML file. Here’s a snippet from this file:

Atom:
Doc: |
! -> <number>
. -> <stringliteral>
! -> Ident
! -> NIL
. -> '(' Expression ')'
. -> SetLiteral
Block:
Doc: |
. -> BEGIN StatementList END

The ! at the beginning of a line means “I’ve implemented this in my parser”; the . means “I haven’t implemented this yet”. That’s what drives the “(Completed)” and “(In progress)” notations in the grammar document.

I wrote a Ruby script that reads this YAML file and generates the HTML Delphi grammar documentation. That Ruby script is the part that’s cool enough to figure out which rules are fully implemented (shown with a solid underline), which are partially implemented (e.g., Atom, as shown above; shown with a dashed underline), and which ones I haven’t started on yet (no underline). It also figures out the “Likely Targets” — the rules whose dependencies are (fully or partially) done: the ones I can (probably) work on next.

I edit the YAML file frequently — as you can imagine, since it reflects my completion status. And I refer to the generated HTML document just as frequently. So I’ve made the Ruby script part of my Rakefile. It works out fairly well.

Of course, uploading the HTML doc to my Web site happens… a little less frequently. I just uploaded the latest version this morning, but before that, it looks like it had been a little more than a year since my last update. I’ll try to keep it a little more active.

Delphi grammar trivia of the day: double-quoted strings

Quoted strings, in Delphi, use single quotes. Almost always.

But did you know that the Delphi compiler supports double-quoted strings, too?

I didn’t either, until I wrote a Delphi lexer. I set it loose on all of the library source that ships with Delphi, and it complained about this " character that it didn’t understand. In SysUtils.pas, as it turns out, there’s a line that looks like this:

CMP     AL,"'"

Yep. Inside an asm block, you can have a double-quoted character literal. And since I use the same lexer for all source files (i.e., I don’t have a separate lexer for asm blocks), I too have to support double-quoted strings.

But I only support double quotes around a single apostrophe: "'". After all, I don’t want to encourage this sort of thing.

Packrat Parsing

Over the past few days, I’ve taken a couple of looks back at parser generators, with an eye toward (possibly) resuming work on DGrok, my project to write a parser for the Delphi language.

Along the way, I ran across a term I wasn’t familiar with: packrat parsing. It claims to be a way to get the simplicity of a recursive-descent parser, together with the flexibility of an LL or LR parser like yacc. It also claims unlimited lookahead in linear time (at the cost of higher memory requirements).

Don’t know yet how promising packrat parsing will be — I’ve only read the first page or so of the paper — but it should at least be an interesting read.