
Re: SableCC Thoughts



> Because, this is how the input stream is parsed.  The "+*?" operators
> are first translated into grammar modifications that introduce these
> Xxxx nodes into the grammar.  They are then eliminated after parsing.
> (This is a lie; I do it as soon as they can be eliminated at parse
> time;-)
> 
> That's another reason to push towards my two level error recovery
> proposal.  Anyway, the question is still open.

I think the way I am implementing recovery will work out very cleanly --
it doesn't, as far as I can tell, tie the programmer to the
implementation of the parser too much...  I'll let you judge when it is
done, which should be in a day or two.  I need clearance from my
client's lawyers to donate the code, which could take a little while
longer.

> sense.  To make sure we agree, here are examples:
> z =
>  k b c d;
> 
> results in classes PZ and AZ extends PZ
> 
> How about it?

Not sure about the one above.  Seems like consistency would be better
for ease of learning and so the visitor methods don't need to change
when another alternative is added to production z above.  Either way
it's a feature I'm looking forward to having.
  
> I thought I had already agreed to have case insensitive tokens.  I would
> quote these tokens differently, something like
> 
> 'hello' -> case sensitive
> "hello" -> case insensitive
> 
> Remains the problem of non-ascii characters.  e.g. shouldn't "é" be
> equivalent to ('é' | 'É')?  We could possibly accommodate a "case
> insensitivity" specification section.
> 
> > Case Insensitive Tokens
> > full_speed_ahead = 'full-speed-ahead';
> 
> I prefer a special notation, instead of additional section, as it
> affects the order of declarations.
 
I think using different quotes is not a good idea.  The beauty of
SableCC is that you can look at the grammar and understand it if you
know BNF.  Using different quotes is completely non-intuitive, and it's
easy to use the wrong one without thinking about it.  I like my
different sections idea, but also maybe a keyword that precedes the
states, like

IgnoreCase {bol->normal, normal} end = 'end';

Still, allowing multiple Tokens and Case Ignored Tokens sections seems
cleaner and would allow the tokens to be arbitrarily ordered.

Even better is if you could do this:

Tokens {normal}

end = 'end';
begin = 'begin';

Case Ignored Tokens {bol->normal, eol}

dog = 'dog';
cat = 'cat';

This way you can group tokens that have similar lexical state
transitions together.  This would make the grammar far more readable and
intuitive when you have many tokens that share the same case
sensitivity and / or state transitions.
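Whichever notation wins, a case-ignored token could presumably be
desugared into an ordinary case-sensitive one before the lexer is
generated.  Here is a rough sketch of that idea in Java.  The class and
method names are made up for illustration and are not part of SableCC,
and it assumes Character.toLowerCase/toUpperCase is an acceptable
definition of "case", which also covers the 'é' / 'É' example quoted
above.

```java
/** Hypothetical helper, not part of SableCC: desugars a case-ignored literal. */
final class CaseInsensitivePattern {

    /**
     * Expands each character of the literal into an alternation of its
     * lower- and upper-case forms, e.g. "end" becomes
     * ('e' | 'E') ('n' | 'N') ('d' | 'D').
     * Assumption: the literal contains no quote characters that would
     * need escaping in the grammar.
     */
    static String expand(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            char lower = Character.toLowerCase(c);
            char upper = Character.toUpperCase(c);
            if (i > 0) {
                out.append(' ');
            }
            if (lower == upper) {
                // No case distinction (digits, punctuation): emit as-is.
                out.append('\'').append(c).append('\'');
            } else {
                out.append("('").append(lower).append("' | '")
                   .append(upper).append("')");
            }
        }
        return out.toString();
    }
}
```

With something like this, dog = 'dog' in a Case Ignored Tokens section
would be treated as if it had been written out by hand as
('d' | 'D') ('o' | 'O') ('g' | 'G').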

>> Tokens
>>   eol = cr | lf;
>>   {bol} hello = 'hello';
>>   {bol->normal, fubar} AUTO_DECLARED_TOKENS;
>>   number = digit+;
>>    ...
> I have many reservations about this.  I prefer to keep a separate token
> declaration section, as it leaves much more flexibility to adding
> features into SableCC specifications.
>
> ... and ...
>
>Hmmm... Maybe... Let's say.  How about we solve the other problems first,
>and after that, if it still itches you, we can rethink about it? ;-)))

It is just syntactic sugar, so I'm not sure why there is resistance.  I
think it helps to make the grammar more intuitive and easier to read,
since it makes the tokens section smaller and leaves fewer places where
you have to ask "is 'try' a production that can match different things,
or is it a keyword?"  Nevertheless, I'll drop the subject until things
that are of higher priority to me or are easy for me to do are done.

>> BTW, it seems like Sable should not be using a PushBackReader, because
>> you have to specify in the constructor what the maximum number of
>> characters that can be pushed back is.  It's really easy to make our own
>> PushBackReader without this constraint.  It can just use a StringBuffer
>> instead of the PushBackReader's char[].
>
>Might be a good idea.  Have you tested the relative speed of both
>approaches? 

I finished coding the so-called SableCCPushbackReader yesterday, and
SableCC still works as before.  Besides having unlimited pushback
ability, it also has a readBackwards() command, which >MAY< be useful
for the low-level error recovery.  It does this by keeping track of all
the characters read.  Since SableCC is building a parse tree for the
whole grammar, including all the tokens, it seems that this would not
significantly affect the memory overhead of SableCC (it would double it
as a rough upper bound).  I have not tested the relative speeds.  Surely
the new one is slower, but unless you have already optimized SableCC and
found that to be the 'hot spot', I wouldn't want to prematurely optimize
it.  I made a bunch of changes to Parser.java that may even compensate
for the slowness of the new reader.  We'll see :)
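To make the idea concrete, here is a minimal sketch of a reader with
unbounded pushback.  It is not the actual SableCCPushbackReader code
(that still needs clearance); the class name is invented, and it leaves
out readBackwards() and the history of characters read.  It only shows
how a growable buffer can stand in for PushbackReader's fixed char[].

```java
import java.io.IOException;
import java.io.Reader;

/** Illustrative sketch only: a Reader with no fixed pushback limit. */
final class UnlimitedPushbackReader extends Reader {
    private final Reader in;
    // Pushed-back characters; the LAST char in the buffer is read next.
    private final StringBuilder pushed = new StringBuilder();

    UnlimitedPushbackReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        int len = pushed.length();
        if (len > 0) {
            char c = pushed.charAt(len - 1);
            pushed.setLength(len - 1);
            return c;
        }
        return in.read();
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        // Serve pushed-back characters before touching the underlying reader.
        while (n < len && pushed.length() > 0) {
            buf[off + n++] = (char) read();
        }
        return n > 0 ? n : in.read(buf, off, len);
    }

    /** Pushes back one character; the buffer grows as needed. */
    void unread(int c) {
        pushed.append((char) c);
    }

    /** Pushes back a whole string so it is re-read in its original order. */
    void unread(CharSequence s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            pushed.append(s.charAt(i));
        }
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Storing the pushed-back text in reverse keeps both unread() and read()
cheap, since they only touch the end of the buffer.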

If we're going to optimize things, I'd like to optimize the generation
of files from the grammar.  Every time I change my 600-line grammar, it
takes one or two minutes to generate all the parser, node, and lexer
files.  Ugh.  I change it constantly! ;(

-Dan