
Re: help with lexer



Etienne GAGNON writes:
> I have read your e-mail. Unless I am wrong, there might be an easy (and
> relatively elegant) solution to your problem.
> 
> > I have a line-based protocol running over TCP and want to create
> > a parser that the server will use. There is exactly one command
> > per line, where a line is terminated with \n.
> > 
> > There is one twist: backslash continuations are allowed... that is,
> > if a newline immediately follows a backslash, both characters are
> > ignored and the command continues on the next line.
> > 
> > I want the server to be able to parse the client's input, with
> > these things being true:
> > 
> >  o Interactivity - the server should not read any more characters
> >    than absolutely necessary to determine the correct token (ie,
> >    it should not read past the terminating newline character).
> > 
> >  o Error recovery - if there's any lexical or parse error, the server
> >    should read and discard characters up to & including the next newline
> >    character (with flex, you'd define an "error command" that matches
> >    the error token).
> > 
> >  o Repeatability - the server should be able to repeatedly call the
> >    parsing engine to parse out consecutive ASTs (ie, commands) from
> >    the input stream.
> > 
> > I'm having trouble doing this with SableCC, mainly the last point,
> > because it always expects to parse the first production followed by
> > an EOF token. In bison, you could use YYACCEPT after parsing a
> > "command" production to make the parser accept & stop. With SableCC,
> > I'm trying to insert an EOF token after reading the newline, but
> > it seems impossible to insert an EOF token in the filter() routine
> > without violating the interactivity requirement.
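
To make the quoted requirements concrete, here is a rough Java sketch of
reading one logical line under the rules above: a backslash immediately
followed by a newline is dropped, and nothing is read past the terminating
newline. The class and method names (LogicalLineReader, readCommand) are
made up for illustration and have nothing to do with SableCC itself.

```java
import java.io.IOException;
import java.io.InputStream;

public class LogicalLineReader {
    /**
     * Reads one logical command line a byte at a time and returns it
     * without the terminating newline. Returns null if the stream ends
     * before any character is read. (Hypothetical helper, not SableCC API.)
     */
    static String readCommand(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean pendingBackslash = false;
        int c;
        while ((c = in.read()) != -1) {
            if (pendingBackslash) {
                pendingBackslash = false;
                if (c == '\n') {
                    continue;           // backslash-newline: ignore both
                }
                sb.append('\\');        // the backslash was an ordinary character
            }
            if (c == '\\') {
                pendingBackslash = true; // decide once we see the next byte
            } else if (c == '\n') {
                return sb.toString();    // stop at the newline, read no further
            } else {
                sb.append((char) c);
            }
        }
        if (pendingBackslash) {
            sb.append('\\');             // stream ended right after a backslash
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}
```

Because it reads single bytes straight from the stream, whatever follows the
newline is still there for the next call.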

Etienne-
Yes, this looks like it will work, and I'll try it soon.
Thanks very much!

However, one thing has always bothered me about all parsers:
why is EOF considered a special token?!? Why can't it just be
another token like any other? Why does a parser always assume
that the very last token will be EOF?

I realize this "EOF hack" makes sense if you're talking about
parsing FILES, because typically it's wrong to ignore
extra garbage at the end of program source code, etc.

Yes, EOF is special, in that once you see it, that's all you're
going to see from then on. But what does this have to do with
parsing?

In the most generic sense, a parser should just parse one start
symbol and then stop. If you really want EOF to terminate your
start production, you can just stick it on the end of your start
production rule!

In my example, having a more generic parser like this, I could
solve this whole problem very simply...

  start =
    command eol;
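
To illustrate what I mean, here is a hand-written Java sketch (not
SableCC-generated code) of a parser that parses one "command eol" and then
simply stops, so it can be called again for the next command. The command
syntax is an assumption purely for the example: one or more alphanumeric
words separated by spaces. The names CommandParser and parseStart are
hypothetical, and backslash continuations are left out for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;

public class CommandParser {
    private final PushbackInputStream in;

    CommandParser(InputStream raw) {
        this.in = new PushbackInputStream(raw, 1); // one byte of lookahead
    }

    /**
     * Parses "start = command eol" once and returns the command's words.
     * Returns null if the stream has ended. On a bad line, discards input
     * up to and including the next newline, then throws.
     */
    List<String> parseStart() throws IOException {
        List<String> words = new ArrayList<>();
        while (true) {
            int c = in.read();
            if (c == -1) {
                if (words.isEmpty()) return null;
                throw new IllegalArgumentException("stream ended mid-command");
            }
            if (c == '\n') {                 // eol: the start symbol is complete
                if (words.isEmpty()) throw new IllegalArgumentException("empty command");
                return words;                // stop here; no EOF token required
            }
            if (c == ' ') continue;          // separator between words
            if (!Character.isLetterOrDigit(c)) {
                skipToNewline();             // error recovery
                throw new IllegalArgumentException("unexpected character: " + (char) c);
            }
            in.unread(c);
            words.add(readWord());
        }
    }

    private String readWord() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && Character.isLetterOrDigit(c)) {
            sb.append((char) c);
        }
        if (c != -1) in.unread(c);           // give back the terminator (e.g. '\n')
        return sb.toString();
    }

    private void skipToNewline() throws IOException {
        int c;
        do {
            c = in.read();
        } while (c != -1 && c != '\n');
    }
}
```

The server loop would just call parseStart() repeatedly, catching the
exception for a bad line and carrying on with the next one. The only
lookahead is the single byte that ends a word, and that byte is pushed back,
so the parser never consumes anything beyond the newline.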

IMHO, if SableCC wants to claim "purity", it should treat
EOF just like any other token... hint, hint :-) What are
the chances of this change happening?

-Archie

___________________________________________________________________________
Archie Cobbs   *   Whistle Communications, Inc.  *   http://www.whistle.com