Joe White’s Blog

Life, .NET, and Cats


Little-known Delphi grammar feature of the day: control-character syntax

Did you know that you can use caret-letter syntax to write control characters in a string literal?

const
  CR = ^M;
  LF = ^J;
  TabDelimited = 'Name'^I'Address'^I'City'^I'State'^I'ZIP';
  TwoLines = 'First line'^M^J'Second line';

It reads like the classic syntax for “control-M”. It’s valid Delphi grammar, it compiles, and it works.
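For readers who haven't met the convention: classic caret notation maps a letter to the control character whose code is the letter's ASCII code with bit 6 cleared, so ^M is character 13 (carriage return), ^J is 10 (line feed), and ^I is 9 (tab). Here is a minimal Python sketch of that mapping. The function name `caret_to_char` is made up for illustration, and restricting the input to ASCII letters is an assumption of this sketch, not a statement about exactly which characters the Delphi compiler accepts after the caret.

```python
def caret_to_char(letter: str) -> str:
    """Return the control character named by caret notation, e.g. 'M' -> chr(13).

    Assumption for this sketch: only single ASCII letters are accepted.
    """
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError(f"expected a single ASCII letter, got {letter!r}")
    # Clearing bit 6 (XOR with 0x40) turns 'A'..'Z' (65..90) into 1..26.
    return chr(ord(letter.upper()) ^ 0x40)


# The TwoLines constant from the snippet above, rebuilt by hand:
two_lines = "First line" + caret_to_char("M") + caret_to_char("J") + "Second line"
```

Under this mapping, `two_lines` holds the same text as 'First line'#13#10'Second line' would in Delphi.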

That said, I have no plans to support it in my parser. I find the string literals during the tokenizing pass, and at that stage, I can’t tell the difference between the control character (^M) and the pointer type (^J) in this snippet:

const
  CR = ^M;

type
  J = ...;
  PJ = ^J;

Pointer syntax is far more common (translation: I’ve only ever seen one source file that used the ^M string-literal syntax, and that was in-house), so I’ll give priority to handling pointers correctly.
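To make the ambiguity concrete, here is a toy context-free tokenizer in Python. It is only a sketch: the token names and the `token_kinds` helper are invented for illustration and are not taken from the actual parser. Fed the two declarations from the snippet, it produces exactly the same sequence of token kinds, which is why a lexer with no knowledge of the enclosing const or type section can't decide whether the caret starts a control character or a pointer type.

```python
import re

# Hypothetical token set, just large enough for the two declarations.
TOKEN_RE = re.compile(
    r"""
    (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<CARET>\^)
  | (?P<EQUALS>=)
  | (?P<SEMI>;)
  | (?P<SKIP>\s+)
    """,
    re.VERBOSE,
)


def token_kinds(source: str) -> list[str]:
    """Return the kinds of the tokens in source, ignoring whitespace."""
    kinds = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            kinds.append(match.lastgroup)
        pos = match.end()
    return kinds


const_decl = token_kinds("CR = ^M;")  # control character in a const section
type_decl = token_kinds("PJ = ^J;")   # pointer type in a type section
```

Both calls yield the kinds IDENT, EQUALS, CARET, IDENT, SEMI. Only the surrounding section keyword tells the two apart, and tracking that is the parser's job.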

I have thought about doing some string-literal folding at parse time… for example, to join

'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Curabitur ' +
'euismod. Cum sociis natoque penatibus et magnis dis parturient monte' +
's, nascetur ridiculus mus. Sed porta, felis at fermentum pretium, pe' +
'de leo ornare eros, ut ullamcorper turpis arcu id metus.'

into a single StringLiteral token in my parse tree. (This would make it possible to write a frontend that provides a “find in string literals” feature, and make it able to find “pede leo” in the above snippet.)
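As a rough illustration of the folding idea, the Python sketch below merges runs of string literals joined by + into one token. Everything in it is an assumption made for the example: the `(kind, text)` tuple layout, the STRING and PLUS kind names, and the choice to work on a flat token list rather than on a parse tree. The folded token stores the decoded text rather than the original source text.

```python
def unquote(literal: str) -> str:
    """Strip the outer quotes of a Delphi string literal and undo '' escapes."""
    return literal[1:-1].replace("''", "'")


def fold_string_literals(tokens):
    """Merge STRING (PLUS STRING)* runs into a single STRING token.

    tokens is a list of (kind, text) pairs; folded STRING tokens carry the
    decoded string value instead of the quoted source text.
    """
    folded = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind != "STRING":
            folded.append((kind, text))
            i += 1
            continue
        value = unquote(text)
        i += 1
        # Absorb each following "+ 'literal'" pair.
        while (
            i + 1 < len(tokens)
            and tokens[i][0] == "PLUS"
            and tokens[i + 1][0] == "STRING"
        ):
            value += unquote(tokens[i + 1][1])
            i += 2
        folded.append(("STRING", value))
    return folded


# Two of the fragments from the snippet above, split in the middle of "pede":
tokens = [
    ("STRING", "'felis at fermentum pretium, pe'"),
    ("PLUS", "+"),
    ("STRING", "'de leo ornare eros'"),
]
folded = fold_string_literals(tokens)
found = "pede leo" in folded[0][1]
```

After folding, a search for "pede leo" succeeds even though the phrase is split across two literals in the source. A real implementation would also need to remember the original source positions so a frontend could highlight the match.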

But don’t expect ^M to work anytime soon.

5 Responses to “Little-known Delphi grammar feature of the day: control-character syntax”

  1. Andreas Hausladen Says:

    - ^M as a char cannot appear after ":"

    - ^M as a char can appear after ":=" where ^M as a pointer declaration cannot appear.

    - ^M as a char cannot appear after "=" in the "type" block, whereas ^M as a pointer cannot appear after "=" outside of a "type" block.

    So where is the problem?

  2. Joe White Says:

    Simple. The lexer doesn’t know where the type blocks start and end. That’s the parser’s job.

  3. Andreas Hausladen Says:

    But you wrote "I have no plans to support it in my parser". So you have a parser.

  4. Joe White Says:

    Sure, I *could* do it. What I can’t do is handle it in the lexer, where it ideally would belong. Having to handle it in the parser (and generate a parse tree I’d be satisfied with) is a little more involved.

    Impossible? Certainly not. An hour or two of work, perhaps, to write the tests, make them pass, and make sure that every parse rule that used to look for TokenType.StringLiteral now looks for ParseRule.StringLiteral.

    But as I said, I’ve never actually seen this syntax in the wild… so I’m more interested in spending that hour or two on something useful.

  5. Marcel Popescu Says:

    I have actually used this syntax a lot. It makes sense to me to define

    const
      CR = ^M;
      LF = ^J;

    However, that doesn’t say much :) Plus, with the general move towards Unicode, I really doubt anyone has any use for control characters.



Joe White's Blog copyright © 2004-2008.