Re: SableCC Thoughts

> > I need error-recovery.  Specifically, when a Lexer or Parser exception
> > occurs, I need to be able to read tokens until EOL is reached, put the
> > parser in a known state, and continue parsing.  I've gotten this
> > working, but in a very horrible kludgy way.  The 'right way' I think, is
> > for there to be a callback which receives an object that lets the user
> > manipulate the state of the lexer/parser.  Any thoughts on how to
> > programmatically tell the parser where it should be in its parse?
> Agreed. As said, I think the callback mechanism based on some
> event/listener model would be preferable. In such case the internal state
> needs to be explicitly accessible. Not sure about the details, though.

Yes, I really don't feel that subclassing here is a good solution.  In
fact, I think the existing filter methods should use callbacks rather than
So here are a bunch of questions related to doing this:

1. Should error recovery info be in the grammar file?  For example,
there could be an Error Recovery section where you specify that if you
are 'in' a certain production, you jump to a certain parser and lexer
state. (I don't really have an opinion on this, and how we code this
doesn't depend on this)

2.  How are productions referred to programmatically? I'd say there
should be a class with public static final fields for all the different
productions.  -- Maybe this isn't necessary, see #6.

3.  How can we allow modification to the parser stack (and lexer state)
without betraying so much information about the internals that code
would have to be changed if the parser was LL instead of LALR?

4.  What operations do we want to allow? Reading through the lexer
tokens seems like a must.  Unreading already read tokens seems like it
would be useful.  Taking things off the stack seems like a must. 
Jumping to a certain stack state is a must.  Anything else?

5.  Perhaps an easy way to allow a programmer to go to a certain stack
state is to be able to directly give tokens to the parser (from the
start state).  This way only valid stack states can be reached.  For example:
  parserState.processToken( new TCurlyBrace() );
  parserState.processToken( new TInt() );

6. ** MAJOR:  Right now it seems that when a token is reached that
doesn't fit into the current production, the stack is REDUCED as many
times as possible and then an "EOF expected" error is given.  For error
recovery to be possible, the parser needs to KNOW THAT IT CAN'T REDUCE,
that way the stack will be given to the error handler in the same state
that it was in when the unexpected token was originally reached.

so we have (extremely tentatively)
lexerState.unread (hmm, unread characters to pushbackinputstream, or
unread tokens?)
lexerState.setState ( State from States section )

parserState.getListIterator() (Allows seeing the elements on the stack,
and >ONLY< removing the topmost elements)
parserState.gotoStartState (so all subsequent processToken calls will be
relative to the start state)
parserState.processToken (from wherever we are now)
parserState.continueParsing (for when we are done) This would resume the
paused parsing thread.  So this whole event handler is handled by a
separate thread.
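
To make the tentative list above a bit more concrete, here is a rough Java
sketch.  Everything in it is hypothetical: the classes LexerState and
ParserState, the helpers skipPastEol, view and popTop, and the use of plain
strings as tokens are stand-ins for illustration, not existing SableCC
classes.  processToken only records the token; a real parser would shift
and reduce.  The threading part (continueParsing) is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical view of the lexer handed to an error handler. */
class LexerState {
    private final Iterator<String> source;                    // upstream tokens
    private final Deque<String> unread = new ArrayDeque<>();  // pushed-back tokens

    LexerState(List<String> tokens) { this.source = tokens.iterator(); }

    /** Returns the next token, preferring unread ones; null at end of input. */
    String next() {
        if (!unread.isEmpty()) return unread.pop();
        return source.hasNext() ? source.next() : null;
    }

    /** Pushes a token back so that the next call to next() returns it again. */
    void unread(String token) { unread.push(token); }

    /** Discards tokens up to and including the next "EOL"; returns how many were dropped. */
    int skipPastEol() {
        int dropped = 0;
        String t;
        while ((t = next()) != null) {
            dropped++;
            if (t.equals("EOL")) break;
        }
        return dropped;
    }
}

/** Hypothetical view of the parser stack handed to an error handler. */
class ParserState {
    private final Deque<String> stack = new ArrayDeque<>();

    /** Stand-in for processToken: only records the token on the stack. */
    void processToken(String token) { stack.push(token); }

    /** Read-only snapshot of the stack, topmost element first. */
    List<String> view() { return Collections.unmodifiableList(new ArrayList<>(stack)); }

    /** Removes and returns only the topmost element. */
    String popTop() { return stack.pop(); }

    /** Empties the stack so later processToken calls start from the start state. */
    void gotoStartState() { stack.clear(); }
}
```

An error handler written against this would call lexerState.skipPastEol(),
then parserState.gotoStartState() followed by a few processToken calls, and
only then resume parsing.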

BTW, it seems like Sable should not be using a PushbackReader, because
you have to specify in the constructor the maximum number of characters
that can be pushed back.  It's really easy to make our own PushbackReader
without this constraint.  It can just use a StringBuffer instead of the
PushbackReader's char[].
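
A minimal sketch of such a reader, assuming nothing about how the generated
lexer actually wraps its input.  The class name UnboundedPushbackReader is
made up, and it uses StringBuilder (the unsynchronized sibling of
StringBuffer) as a stack of pushed-back characters:

```java
import java.io.IOException;
import java.io.Reader;

/** Reader with unlimited pushback, buffered in a StringBuilder used as a stack. */
class UnboundedPushbackReader extends Reader {
    private final Reader in;
    // The last character in this buffer is the next one to be read.
    private final StringBuilder pushed = new StringBuilder();

    UnboundedPushbackReader(Reader in) { this.in = in; }

    @Override
    public int read() throws IOException {
        int n = pushed.length();
        if (n > 0) {
            char c = pushed.charAt(n - 1);
            pushed.setLength(n - 1);
            return c;
        }
        return in.read();
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int count = 0;
        // Drain pushed-back characters first; otherwise defer to the wrapped reader.
        while (count < len && pushed.length() > 0) {
            buf[off + count++] = (char) read();
        }
        if (count > 0) return count;
        return in.read(buf, off, len);
    }

    /** Pushes back one character; it will be the next character read. */
    void unread(int c) { pushed.append((char) c); }

    /** Pushes back a string so that it is read again in its original order. */
    void unread(String s) {
        for (int i = s.length() - 1; i >= 0; i--) pushed.append(s.charAt(i));
    }

    @Override
    public void close() throws IOException { in.close(); }
}
```

The buffer grows as needed, so there is no size to pick up front.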

> [...]
> > To me SableCC has two main advantages over its competitors: The
> > grammar's closeness to BNF and the separation of code and grammar.  I
> > think there are a bunch of things that can be done to enhance the
> > former.  My grammar often has productions like this:
> >
> > vertical_direction = {up} up | {down} down;
> > direction = {up} up | {down} down | {left} left | {right} right;
> >
> > It seems to me the names ({up}) only subtract from the clarity of
> > the statements.  Why not make names unnecessary when the
> > alternatives are so simple themselves?  I realize we are trying to
> > minimize the impact of grammar changes on the visitors, but if we change
> > the alternative 'up' to alternative 'straight-up' then presumably the
> > visitor name for alternative Up would need to be changed anyway, so it
> > shouldn't make a difference.
> Agreed. I had similar feelings before. Maybe the explicit {name} tags
> should be optional, with the name of the production used by default?
> I believe there are cases where the {name} construct is unavoidable.

Well, a name could always be created by Sable, but it would definitely
be pretty bad sometimes.  I have a feeling Etienne would hate this -- it
would mean visitor methods would have to change names when you change
productions.  I wonder though, it seems like when you change a
production you will have to change your visitors anyway, if not the
method names, then the method content, so why is this such a big deal?

What is the worst case production we could be dealing with? a = b? d? | c
I'd say that should be equivalent to a = {b} b? d? | {c} c.  I must be
missing a worse case.

> > I know that it is fairly easy to handle case-insensitive tokens by
> > doing:
> > Tokens
> > full_speed_ahead = f u l l '-' s p e e d '-' a h e a d;
> >
> > but this really detracts from grammar readability.  It is also not
> > immediately obvious to those uninitiated in SableCC tricks what the heck
> > that is.  Why not have a section like:
> >
> > Case Insensitive Tokens
> > full_speed_ahead = 'full-speed-ahead';
> >
> > I see no harm.  Token order (since it is significant) can be in the same
> > order as the tokens are in the file, spanning those two sections.
> Agreed. I was even thinking of using a global flag for the whole grammar,
> as I thought mixing case sensitive with case insensitive tokens is rather
> rare, but, well, having separate sections seems pretty clean to me.

Yes, I think it's very rare too, but I wouldn't want to be the first
person who hits that rarity and says Ohh crap.. ;)
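
For what it's worth, a Case Insensitive Tokens section could probably be
handled as a plain rewrite into the helper-per-letter form quoted above.
A small Java sketch of that rewrite follows; the class and method names are
invented, and it assumes one helper per ASCII letter is declared elsewhere
(for example f = 'f' | 'F';).  It does not handle a quote character inside
the literal.

```java
/** Rewrites a literal into the helper-per-letter form used for case-insensitive tokens. */
class CaseInsensitiveExpander {
    /**
     * ASCII letters become lower-case helper names (assumed to be declared
     * elsewhere); every other character is kept as a quoted literal.
     */
    static String expand(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (out.length() > 0) out.append(' ');
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (asciiLetter) {
                out.append(Character.toLowerCase(c));
            } else {
                out.append('\'').append(c).append('\'');
            }
        }
        return out.toString();
    }
}
```

Calling expand("full-speed-ahead") gives back the right-hand side of the
hand-written token: f u l l '-' s p e e d '-' a h e a d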

> > Lastly (for today) why not have the parser automatically add 'floating
> > tokens', like:
> >
> > import_declaration = "import" package_name;
> >
> > This makes the grammar more readable, easier to write, easier to maintain
> > (because every time you want to add a production with a new token, you
> > don't have to add the token to two places).  The only problems are 1)
> > how to do case-insensitive floating tokens and 2) Telling the parser
> > where the floating tokens should be added relative to the existing
> > tokens (since the order of the tokens in the Tokens section matters).
> >
> > Both problems seem easily addressed with options in a new Options
> > section (which SableCC will inevitably need).
> Not sure about that. It is true that adding tokens into two places is not
> that nice, but I am not sure if having two types of tokens, the regular
> and floating ones will make things simpler. I like the model that
> Helpers/Tokens represent declarations, whereas Productions represent
> definitions where all the "atoms" are previously declared. Well, that's
> just my feeling, I am open for discussions ;o)

There would still be only one type of token.  It's just that floating
tokens would be auto-declared.   Maybe I'm the only one who is bothered
by this, but I am parsing a configuration format that has dozens and
dozens of keywords, so it's getting on my nerves.  Actually, when I used
JTB / JavaCC on a previous project, I also found the lack of this
feature annoying.  I've spent a lot of time doing database stuff, so I
guess I am trying to 'normalize' the grammar file :)  Anyone else have
an opinion on this?  This feature could be off by default, and turned on
by placing a special floating token marker in the Tokens section, like

  eol = cr | lf;
  {bol} hello = 'hello';
  {bol->normal, fubar} AUTO_DECLARE_TOKENS;
  number = digit+;
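
The collecting step for this seems simple enough.  Here is a rough Java
sketch, under the assumption that floating tokens are written as
double-quoted literals in the productions, as in the import_declaration
example.  The class FloatingTokens and both of its methods are made up, and
the naming rule in declare (token name = literal text) only works for
literals that are already valid token names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Collects double-quoted literals from production text so they can be auto-declared. */
class FloatingTokens {
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    /** Returns each distinct quoted literal once, in order of first appearance. */
    static List<String> collect(String productions) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = QUOTED.matcher(productions);
        while (m.find()) {
            seen.add(m.group(1));
        }
        return new ArrayList<>(seen);
    }

    /** Hypothetical naming rule: the token is named after its literal text. */
    static String declare(String literal) {
        return literal + " = '" + literal + "';";
    }
}
```

The resulting declarations would be spliced in at the position of the
AUTO_DECLARE_TOKENS marker, which settles the ordering question.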