Little-known Delphi grammar feature of the day: control-character syntax
Did you know that you can use caret-letter syntax to define a string literal?
const
  CR = ^M;
  LF = ^J;
  TabDelimited = 'Name'^I'Address'^I'City'^I'State'^I'ZIP';
  TwoLines = 'First line'^M^J'Second line';
It reads like the classic syntax for “control-M”. It’s valid Delphi grammar, it compiles, and it works.
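For anyone who hasn't run into caret notation before, here is a minimal Python sketch of the mapping. The function name `caret_to_char` is made up for illustration, and treating lowercase letters the same as uppercase is an assumption of the sketch, not something checked against the compiler.

```python
def caret_to_char(letter: str) -> str:
    """Return the control character that caret notation assigns to a letter."""
    # Caret notation flips bit 6 of the letter's code: 'M' is 77, so ^M is 13.
    # Upper-casing first is an assumption made by this sketch.
    return chr(ord(letter.upper()) ^ 0x40)


# The three control characters used in the constants above.
cr = caret_to_char("M")   # carriage return, character 13
lf = caret_to_char("J")   # line feed, character 10
tab = caret_to_char("I")  # horizontal tab, character 9

# What the TwoLines constant would hold, spelled out.
two_lines = "First line" + cr + lf + "Second line"
```

So `^M^J` is just a carriage return followed by a line feed, and `^I` is a tab.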
That said, I have no plans to support it in my parser. I find the string literals during the tokenizing pass, and at that stage, I can’t tell the difference between the control character (^M) and the pointer type (^J) in this snippet:
const
  CR = ^M;
type
  J = ...;
  PJ = ^J;
Pointer syntax is used a lot more often (translation: I’ve only ever seen one source file with ^M string-literal syntax, and that was in-house), so I’ll give preference to being able to handle pointers correctly.
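To make the ambiguity concrete, here is a rough sketch of a context-free tokenizer in Python. It is not my actual lexer; the token names and patterns are hypothetical and cover only enough of the grammar to scan the snippet above.

```python
import re

# Hypothetical token patterns, tried in order at each position.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<string>'(?:[^']|'')*')
  | (?P<caret>\^[A-Za-z](?!\w))
  | (?P<word>[A-Za-z_]\w*)
  | (?P<symbol>\.\.\.|:=|[=;:^+])
""", re.VERBOSE)


def tokenize(source):
    """Split source into (kind, text) pairs, looking only at the characters."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.lastgroup != "ws":  # whitespace is skipped, not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("const CR = ^M; type J = ...; PJ = ^J;")
carets = [token for token in tokens if token[0] == "caret"]
```

Both `^M` and `^J` come out with the same token kind. Nothing in the characters themselves says that one is a control character and the other is a pointer type; only the surrounding const or type section does, and a tokenizer like this one never looks at that.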
I have thought about doing some string-literal folding at parse time… for example, to join
'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Curabitur ' +
'euismod. Cum sociis natoque penatibus et magnis dis parturient monte' +
's, nascetur ridiculus mus. Sed porta, felis at fermentum pretium, pe' +
'de leo ornare eros, ut ullamcorper turpis arcu id metus.'
into a single StringLiteral token in my parse tree. (This would make it possible to write a frontend that provides a “find in string literals” feature, and make it able to find “pede leo” in the above snippet.)
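The folding idea could look something like this Python sketch. It assumes tokens arrive as hypothetical (kind, text) pairs, and it stores the joined value rather than the original source text; a real parse tree would probably want to keep both.

```python
def literal_value(text):
    """Return the contents of a quoted literal as the compiler would see them."""
    # Strip the surrounding quotes; a doubled quote inside stands for one quote.
    return text[1:-1].replace("''", "'")


def fold_string_literals(tokens):
    """Join runs of 'literal' + 'literal' into a single string token."""
    folded = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind != "string":
            folded.append((kind, text))
            i += 1
            continue
        value = literal_value(text)
        i += 1
        # Absorb every following pair of "+" and another string literal.
        while (i + 1 < len(tokens)
               and tokens[i] == ("symbol", "+")
               and tokens[i + 1][0] == "string"):
            value += literal_value(tokens[i + 1][1])
            i += 2
        folded.append(("string", value))
    return folded


tokens = [
    ("string", "'felis at fermentum pretium, pe'"),
    ("symbol", "+"),
    ("string", "'de leo ornare eros'"),
    ("symbol", ";"),
]
folded = fold_string_literals(tokens)
found = "pede leo" in folded[0][1]
```

After folding, a search for “pede leo” succeeds even though the phrase is split across two literals in the source.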
But don’t expect ^M to work anytime soon.
September 5th, 2007 at 8:59 am
- ^M as a char cannot appear after ":"
- ^M as a char can appear after ":=", where ^M as a pointer declaration cannot.
- ^M as a char cannot appear after "=" in the "type" block, whereas ^M as a pointer cannot appear after "=" outside of a "type" block.
So where is the problem?
September 5th, 2007 at 5:23 pm
Simple. The lexer doesn’t know where the type blocks start and end. That’s the parser’s job.
September 6th, 2007 at 2:43 am
But you wrote "I have no plans to support it in my parser". So you have a parser.
September 8th, 2007 at 4:51 am
Sure, I *could* do it. What I can’t do is handle it in the lexer, where it ideally would belong. Having to handle it in the parser (and generate a parse tree I’d be satisfied with) is a little more involved.
Impossible? Certainly not. An hour or two of work, perhaps, to write the tests, make them pass, and make sure that every parse rule that used to look for TokenType.StringLiteral now looks for ParseRule.StringLiteral.
But as I said, I’ve never actually seen this syntax in the wild… so I’m more interested in spending that hour or two on something useful.
September 11th, 2007 at 12:20 pm
I have actually used this syntax a lot. It makes sense to me to define
const
  CR = ^M;
  LF = ^J;
However, that doesn’t say much.
Plus, with the general move towards Unicode, I really doubt anyone has any use for control characters.