
Re: SableCC Thoughts Part II

> I was intending to reply to your message in depth in a few days.
> But, as you need some quick answers, here are brief ones...

Sorry for the rush, but I am working on a project for a client and so
need to get things done quickly.  I will go ahead and code the error
recovery so it does what I need, and I'll just hope that whatever is
later decided on will be a superset of the functionality I implement.

[See my thoughts below]

> Dan Sandberg wrote:
> > I will do the error-recovery functionality this weekend, but in a kludgy
> > way since I'm not sure how to solve the multiple reduction before error
> > problem I mentioned in the last message.  Etienne, please give the
> > go-ahead (or not) for the other enhancements we've been discussing.
> Error recovery is a complex topic.  Before we go into some
> implementation, I wanted to start a more philosophical discussion on the
> topic, namely, "what do we want, out of error recovery?"
> This is important, as different goals or objectives will lead to
> different kinds of error recovery mechanisms.
> Here's some preliminary ideas, on the subject:
> 1- "Trusted recovery?"
> ======================
> Do we expect the compiler to completely recover from some errors?
> [Should we trust a compiler to guess correctly what the programmer
> intended?]
> For example, if a compiler was to recover from a missing ";", it could
> potentially compile an otherwise uncompilable program.  It would simply
> issue a warning to the user, saying that it assumed there was a missing
> semicolon at location (file:line, column).
> In a pretty recent discussion on the comp.compilers newsgroup, somebody
> told the story of such a compiler. When he used it for the first time,
> he didn't know that the compiler was fixing misspelled variable names for
> him and simply proceeded with compiling the code.  Do I need to continue
> the story?...  Apparently, there are even companies where programs are
> left unfixed.  [Everyone knows that many (most? (unfortunately!))
> programmers simply ignore compiler warnings.  They probably assume that
> if it were a serious problem, the compiler would have issued an error
> instead.]
> I personally think that, if we want a compiler to be forgiving for small
> mistakes, the right approach is to modify the language specification to
> be more flexible (while remaining unambiguous!!).  A language could, for
> example, use case insensitive identifiers (or imprecise identifier
> names;-), or require fewer/more semicolons.  E.g. PASCAL uses ";" as a
> statement separator, while C uses ";" as a statement tail.  There have
> been debates on which one is better.  On a first look, PASCAL's approach
> seems more flexible.  On the other hand, when you move statements
> around, C's regular approach saves you some trouble.
> If we take this approach, where the language is specified with the
> required flexibility, then I think that the error recovery mechanism's
> goal should be to:
> 1- extend error detection beyond the first syntax error in a program,
> and
> 2- (more difficult, requires careful design) potentially allow users to
> interactively fix programs in "interpreters". [No automatic fixing: user
> intervention REQUIRED]
> The mechanism in (2-) could be extended to compilers that write the
> fixed source code back to a file.  This could be useful in an
> interactive development environment (IDE).

I think proper error recovery sometimes depends on semantics, rather
than syntax, and so it would be dangerous for the parser to make
assumptions that it knows how to handle errors.  By user intervention,
you mean in customized error handlers written by the programmer, right? 
(Not the end-user).  If so, then I agree.

I don't think the parser itself should ever try to fix parse errors
(unless we add directives in the grammar about how to do this).  This
kind of 'magic' invariably causes programming nightmares.  I understand
the danger of infinite loops caused by programmer recovery, but if
SableCC doesn't offer this, then programmers will be forced to use other
parsers.  It is an absolutely necessary piece of functionality.  The
best we can do is make the API at least discourage behaviors that could
lead to these cases.

BTW, I would guess that most people who use compiler compilers are using
them to parse text, such as configurations or expressions like a==3 or
b > 6, rather than to write compilers.  I, for instance, am parsing a
(mostly, but not completely) line-oriented configuration file, so I know
that to fix an error, I can always just read characters until an EOL and
continue from there.
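
To make that concrete, here is a rough sketch of the skip-to-EOL
recovery I have in mind (the token representation is deliberately
simplified; these are not actual SableCC classes):

```java
import java.util.List;

// Sketch of "panic mode" recovery for a line-oriented config file:
// on a syntax error, discard tokens up to and including the next EOL
// marker, then resume parsing at the start of the following line.
public class LineRecovery {

    // Returns the index of the first token after the next "EOL" marker,
    // or tokens.size() if the error runs to the end of input.
    public static int resumeIndex(List<String> tokens, int errorIndex) {
        int i = errorIndex;
        while (i < tokens.size() && !tokens.get(i).equals("EOL")) {
            i++;
        }
        return Math.min(i + 1, tokens.size());
    }

    public static void main(String[] args) {
        List<String> tokens =
            List.of("key", "=", "??", "EOL", "key2", "=", "3", "EOL");
        // Error detected at index 2 ("??"): skip to after the EOL,
        // so parsing resumes at "key2" (index 4).
        System.out.println(resumeIndex(tokens, 2));
    }
}
```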
> 2- Automatic vs programmer driven
> =================================
> The line of thought is that of "who should do the recovery"?  Does the
> (average) compiler programmer know enough about the parsing process to
> do a good job at fixing syntax errors?  Is the result really better than
> that achieved by some automatic recovery mechanism?
> As you well know, I intend to improve SableCC's parsing to
> linear-approximate LALR(K).  Using the lookahead algorithm, one can
> think of a relatively simple, yet potentially interesting automatic
> error recovery mechanism for syntax errors, based on the following
> idea: usually a programmer attempts to write valid programs. Thus he is
> more likely to forget to write down a "semicolon" than to write down a
> spurious identifier.
> E.g.:  If a program looks like:
>   a = 3 + 5  b + 3;
> The programmer is more likely to have forgotten to write "+", "-"
> between "5" and "b" than to have erroneously put an invalid "b" in the
> middle of the statement "a = 3 + 5 + 3;".
> Error recovery is difficult.  What if we see:
>   a = 3 + 5 * 3 - 2 / 1 + 4);
> It is very difficult to guess what the programmer intended in such a
> case.  For simplicity, the parser could assume that everything already
> parsed is OK.
> The automatic recovery mechanism would thus work as follows.  As soon as
> a "next" invalid token (let say T) is discovered, the parser looks ahead
> to find the shortest distance, from the current parsing position, to a
> lookahead of T. An error is recorded, then either:
> 1- T is discarded, if there is no possible lookahead of T.
> 2- The parse stack is reorganized so that T is now a valid next token.
> So, in the examples above, it would (possibly) look like:
> 1- a = 3 + 5 [OP] b + 3;
> 2- a = 3 + 5 * 3 - 2 / 1 + 4 ["("] [EXP] );
> The advantages of this mechanism are:
> 1- Automatic (no hassle): Compiler writer can concentrate on actually
> compiling the code, not worrying too much about syntax (beyond getting
> an LALR'(K) grammar).
> 2- Fast: For every error, the recovery strategy is pre-encoded in the
> parse table.  No trial/error approach.
> 3- Grammar sensitive [;-)]: the recovery action is based on the actual
> grammar.
> 4- Termination: After [error] tokens are pushed on the stack, we are
> sure that T will be consumed in the next parsing step. (The only case
> where T could still trigger a new error, is if we "reduce" the top of the
> stack, but as the stack has finite depth, this cannot go on forever.  I
> know, I am skipping many details...)
> (4-) is very important.  In fact (4-) + (1-) are complementary.  The
> problem with programmer written error recovery is often that the
> programmer will not anticipate all the possible errors in a given
> grammar context (due to the complexity and the amount of work), and
> often, the custom error recovery mechanism can go into an infinite loop.
> [One usually guards against this by keeping a count of error recovery
> steps, stopping the process after an arbitrary number of steps].
> Drawbacks:
> 1- Very knowledgeable compiler programmers could potentially handle some
> cases better.  [Maybe we could provide a special sablecc specification
> section that allows the description of error recovery special cases].
> 2- Longer parser generation time (to compute the error recovery tables),
> and larger tables...
> 3- When
> =======
> Finally, I would prefer that the AST be unconditionally generated with
> embedded [error] nodes.  Semantic error recovery and/or error message
> issuing would be delayed until parsing is done.  No compiler writer code
> in the parsing process.  [Separation of phases].
> What do you think?
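
Just to make sure I understand the proposal, here is roughly how I read
the insert-or-discard decision, with the repair table precomputed at
parser-generation time (all names here are mine, not SableCC's):

```java
import java.util.Map;

// Sketch of the proposed automatic repair: when the next token T is
// invalid in the current state, consult a precomputed table.  If T can
// never become valid (no reachable lookahead of T), discard T;
// otherwise push the precomputed number of [error] placeholders so
// that T becomes the valid next token and is consumed.
public class AutoRepairSketch {

    // repairTable.get(state) maps an offending token to the number of
    // [error] tokens to insert; absence means "discard the token".
    public static String repair(Map<Integer, Map<String, Integer>> repairTable,
                                int state, String offendingToken) {
        Map<String, Integer> row = repairTable.getOrDefault(state, Map.of());
        Integer inserts = row.get(offendingToken);
        if (inserts == null) {
            return "discard " + offendingToken;
        }
        return "insert " + inserts + " [error] node(s), then shift "
            + offendingToken;
    }

    public static void main(String[] args) {
        // e.g. after "a = 3 + 5", seeing identifier "b": insert one
        // [error] operator so "b" can be shifted (example 1 above).
        Map<Integer, Map<String, Integer>> table =
            Map.of(7, Map.of("identifier", 1));
        System.out.println(repair(table, 7, "identifier"));
        System.out.println(repair(table, 7, "}"));
    }
}
```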

I really dislike the idea of automatic error recovery for the reasons I
mentioned above.  I think it makes the parser's behavior much less
transparent to the programmer, and this seems bad.  However, if it is
only an option that is off by default, then sure, it's great :)

I'm going to code the error-recovery so it is a callback DURING the
parse phase (because it's easier for me) but I wouldn't have any
problems with embedded error nodes as you suggest.  The only problem is
how that fits into the Visitor structure, and where exactly the error
nodes go in the grammar, and how the partially parsed statement is
represented.  This seems like a very difficult thing to get right, and
so I suggest that we start small and have a callback.  When we figure
out how to do the embedded error nodes, we can add that as well.  Having
multiple ways to do something is fine, as long as the default behaviors
allow a beginner to quickly start using the program.
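
As a sketch, the callback I have in mind would look something like this
(the interface name, signature, and the recovery cap are all my
inventions, not SableCC's API):

```java
// Sketch of a parse-time error callback: the parser reports the
// offending token and the programmer's handler decides whether to
// recover.  A hard cap on recovery attempts guards against the
// infinite loops Etienne mentions.
public class CallbackSketch {

    public interface ParserErrorHandler {
        // Return true to tell the parser to drop the token and keep
        // going, false to give up on recovery.
        boolean onSyntaxError(String offendingToken, int line, int column);
    }

    // Returns the number of syntax errors encountered.  "??" stands in
    // for any token that is invalid in the current parse state.
    public static int parseWithRecovery(String[] tokens,
                                        ParserErrorHandler handler) {
        final int MAX_RECOVERIES = 100;
        int errors = 0;
        for (int i = 0; i < tokens.length; i++) {
            boolean valid = !tokens[i].equals("??");
            if (!valid) {
                errors++;
                if (errors > MAX_RECOVERIES
                        || !handler.onSyntaxError(tokens[i], 1, i)) {
                    break;
                }
                // Handler chose to recover: the bad token is skipped
                // and parsing continues with the next one.
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "=", "??", "3"};
        int errors = parseWithRecovery(tokens, (tok, line, col) -> {
            System.out.println("syntax error at column " + col);
            return true; // keep parsing
        });
        System.out.println(errors);
    }
}
```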

I don't think it is possible to handle all the behaviors a programmer
could want in the grammar itself.  I think inevitably people will want
callbacks to implement custom behaviors.  I don't see why the two
methods could not co-exist.  The grammar would handle the common cases
that many people want, and callbacks would handle the strange behaviors
that are too esoteric to handle in the grammar.  I think the important
thing is for SableCC to be flexible enough that people don't find they
need to switch to JavaCC to get the 1% of functionality that is missing
from SableCC. 

BTW, wouldn't you eventually want to support semantic lookahead?  If so
then callbacks seem inevitable.  Maybe a separate parallel visitor
hierarchy that gets used during parsing rather than after parsing?  Does
this idea seem repugnant?  This would give SableCC boatloads of power
but still keep the beautiful separation between grammar and code.

> > I've noticed that the toString() methods of tokens adds spaces.  It
> > seems to me that it would be better if the productions that used these
> > tokens added spaces between the tokens.  This is because often in a
> > grammar you have:
> Use the getText() method.  toString() is simply a convenience for
> debugging [ast.toString()].

If you have a grammar like:

cat = tiger | lion;
dog = wolf | house_dog;
pet = cat | dog;

Since getText() only works on tokens, I would need a visitor for cat, a
visitor for dog, and a visitor for pet that uses a hashtable lookup, or
some other complicated scheme.  If spaces were added by the productions
that use multiple tokens (or multiple other productions), then I could
simply call pet.toString().  Is there any disadvantage to this?  Perhaps
we could just add getText() to the productions?

I would guess this is a common situation for people.
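
To illustrate, here is a sketch of the production-level getText() I'm
asking for (Node/Token/Production are stand-ins for the generated AST
classes, not SableCC's real ones):

```java
import java.util.List;

// Sketch: a production-level getText() that joins the texts of its
// child tokens/productions with single spaces, so a deep alternation
// like pet -> cat -> tiger needs no per-production visitor.
public class SpacedText {

    public static abstract class Node {
        public abstract String getText();
    }

    public static class Token extends Node {
        private final String text;
        public Token(String text) { this.text = text; }
        public String getText() { return text; }
    }

    public static class Production extends Node {
        private final List<Node> children;
        public Production(Node... children) {
            this.children = List.of(children);
        }
        // Join child texts with a single space, no trailing space.
        public String getText() {
            StringBuilder sb = new StringBuilder();
            for (Node child : children) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(child.getText());
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        // pet -> cat -> tiger: one call, no visitors needed.
        Node pet = new Production(new Production(new Token("tiger")));
        System.out.println(pet.getText());
        Node stmt = new Production(new Token("a"), new Token("="),
                                   new Token("3"));
        System.out.println(stmt.getText());
    }
}
```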