[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

wishlist: deterministic epsilon token recognition



I have a feature request for the next version of SableCC (3.0?):
deterministic epsilon token recognition.  Having consulted with Etienne,
here's a writeup:

Epsilons are zero-length tokens.  The current syntax:
Tokens
 tkn = ;
means that the token "tkn" will not be recognized, but it is declared (so
it may be used in productions).  I propose changing this syntax to
recognize epsilon tokens whenever this can be done deterministically, and
signal an error when nondeterminism would ensue (see below).  The current
effect (declaring tokens that will not be recognized) may be achieved by
creating a new declarative section ("Unrecognized Tokens" perhaps?),
similar to the "Ignored Tokens" that exists currently.  The difference
between the two is that ignored tokens are consumed off the input and then
forgotten about, while nonrecognized tokens do not consume any input at
all in the first place.

Epsilon tokens may not always be unambiguously recognized, for instance
the grammar
Tokens
 x = ['' + 'x'];
Productions
 xs = x*;
is nondeterministic.  In these cases the user should be clearly warned.

An example of epsilon tokens at work:
States
 initial, rest;
Tokens
 eps = ;
 char = [0x0000 .. 0xffff];
Productions
 whole = eps char*;

Of course, the above in itself is not very useful.  So when is an epsilon
token useful in real life? The application I have in mind is 
disambiguating grammars (using lexer states) without swallowing any input.

Note, again, that we can only get away with an epsilon token here because
it is deterministic: in the "word_end" state, there is no other way to
proceed:

Helpers
 all = [0x00000 .. 0xffff];
 ch = ['a' .. 'z'];
 sym = [all - ch];

States
 initial, one, two, word_end;

Tokens
 {initial -> one, one -> two, two -> word_end} char = ch;
 {initial -> one, one -> two, two -> word_end} other = sym;
 {word_end -> initial} word_end = ;

Productions
 words = word*;
 word = symbol* word_end;
 symbol = {char} char | {other} other;

Here, we are recognizing three-symbol "words", but the length is
controlled solely by lexer states.  We don't need to alter the productions
at all (as long as we keep the set of symbol names intact) if we want to
extend this to four symbols: we just need to add a new state and update
the "char" and "other" token definitions accordingly.

For some applications it is imperative (from a maintainability 
standpoint), to be able to write grammars this way.

Cheers,
VEROK Istvan