ANTLR: when a token is not a token

Lexers are good at figuring out where tokens start and end. They read one character at a time, but still, they can tell < from <= with ease. They thrive on it.

ANTLR gets stuck occasionally.

If I define a token named DOT that matches a single period, and another token named ELLIPSIS that matches two periods, it can handle them just fine. Most of the time. But it can’t parse 1..2, because it thinks that first dot is a decimal point, not an ellipsis.

When I tell ANTLR that I expect a dot, I want it to check that the next token is a dot. But instead, it checks that the next character is a dot. Which means, if the next two characters in the input stream are “..“, and I call the ANTLR-generated method mDOT(), it will read the first dot out of those two. I want it to fail, because the next thing isn’t a dot, it’s an ellipsis. But it doesn’t work that way.

The workaround is to use something called “symantic predicates”. (If you understand positive lookahead in regular expressions, it’s the same thing.) Symantic predicates look like this:

(...) => ...

You put this on one leg of an alternation. It means “if x, then match y.” x is the stuff inside the parentheses; y is the stuff after =>. I can use this to tell it “if you look ahead and see “..“, then do something other than matching the first ‘.‘ alone.”

This is really flexible. The biggest downside is, you have to provide this context at every call site. In practice, it hasn’t been too bad yet, but boy am I glad I’ve got my unit tests!

My complete NUMBER lexer rule (excluding unit tests) looks like this. If I hit “..“, I just give an empty alternate, and stop parsing the token:

    // if we find "..", we're done parsing the number
    (ELLIPSIS) => |
    // otherwise, look for a decimal point
    (DOT ('0'..'9')+)?
    (('E' | 'e') ('+' | '-')? ('0'..'9')+)?

Leave a Reply

Your email address will not be published. Required fields are marked *