[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: SableCC Thoughts Part II

To: Dan Sandberg <dan@siliconpoetry.com>
Subject: Re: SableCC Thoughts Part II
From: "Etienne M. Gagnon" <egagnon@j-meg.com>
Date: Fri, 13 Apr 2001 14:44:47 -0400
CC: sablecc-list@sable.mcgill.ca
Organization: J-Meg inc.
References: <3AD730CE.650D81CE@siliconpoetry.com>
Sender: owner-sablecc-list@sable.mcgill.ca

Hi Dan (and ther other),

I was intending to reply to your message in depth in a few days.

But, as you need some quick answers, here are brief ones...

Dan Sandberg wrote:
> I will do the error-recovery functionality this weekend, but in a kludgy
> way since I'm not sure how to solve the multiple reduction before error
> problem I mentioned in the last message.  Etienne, please give the
> go-ahead (or not) for the other enhancements we've been discussing.

Error recovery is a complex topic.  Before we go into some
implementation, I wanted to start a more philosophical discussion on the
topic, namely, "what do we want, out of error recovery?"

This is important, as, different goals or objectives will lead to a
different kind of error recovery mechanism.

Here's some preliminary ideas, on the subject:

1- "Trusted recovery?"  
======================

Do we expect the compiler to completely recover from some errors? 
[Should we trust a compiler to guess correctly what the programmer
intended?]

For example, if a compiler was to recover from a missing ";", it could
potentially compile an otherwise uncompilable program.  It would simply
issue a warning to the user, saying that it assumed there was a missing
semicolon at location (file:line, column).

In a pretty recent discussion on the comp.compilers newsgroup, somebody
told the story of such a compiler. When he used it for the first time,
he didn't know that the compiler was fixing mispelled variable names for
him and simply proceeded with compiling the code.  Do I need to continue
the story?...  Apparently, there are even companies where programs are
left unfixed.  [Everyone knows that many (most? (unfortunately!))
programmers simply ignore compiler warnings.  They probably assume that
if it was is a serious problem, the compiler would have issued an error
instead.]

I personally think that, if we want a compiler to be forgiving for small
mistakes, the right approach is to modify the language specification to
be more flexible (while remaining unambiguous!!).  A language could, for
example, use case insensitive identifiers (or imprecise identifier
names;-), or require less/more semicolons.  E.g. PASCAL uses ";" as a
statement separator, while C uses ";" as a statement tail.  There have
been debates on which one is better.  On a first look, PASCAL's approach
seems more flexible.  On the other hand, when you move statements
around, C's regular approach saves you some trouble.

If we take this approach, where the language is specified with the
required flexibility, then I think that the error recovery mechanism's
goal should be to allow:
1- extend error detection beyond the first syntax error in a program,
and
2- (more difficult, requires careful design) potentially allow users to
interactively fix programs in "interpreters". [No automatic fixing: user
intervention REQUIRED]

The mechanism in (2-) could be extended to compilers that write the
fixed source code to back a file.  This could be useful in an
interactive development environment (IDE).

2- Automatic vs programmer driven
=================================

The line of thought is that of "who should do the recovery"?  Does the
(average) compiler programmer know enough about the parsing process to
do a good job at fixing syntax errors?  Is the result really better than
that achieved by some automatic recovery mechanism?

As you well know, I intend to improve SableCC's parsing to
linear-approximate LALR(K).  Using the lookahead algorithm, one can
think of a relatively simple, yet potentially interesting automatice
error recovevery mechanism for syntax errors, based on the following
idea: usually a programmer attempts to write valid programs. Thus he is
more likely to forget to write down a "semicolon" than to write down a
spurious identifier.

E.g.:  If a program looks like:

  a = 3 + 5  b + 3;

The programmer is more likely to have forgotten to write "+", "-"
between "5" and "b" than to have erroneously put an invalid "b" in the
middle of the statement "a = 3 + 5 + 3;".

Error recovery is difficult.  What if we see:

  a = 3 + 5 * 3 - 2 / 1 + 4);

It is very difficult to guess what the programmer intended in such a
case.  For simplicity, the parser could assume that everything already
parsed is OK.

The automatic recovery mechanism would thus work as follows.  As soon as
a "next" invalid token (let say T) is discovered, the parser looks ahead
to find the shortest distance, from the current parsing position, to a
lookahead of T. An error is recorded, then either:
1- T is discarded, if there is no possible lookahead of T.
2- The parse stack is reorganized so that T is now a valid next token.

So, in the examples above, it would (possibly) look like:
1- a = 3 + 5 [OP] b + 3;
2- a = 3 + 5 * 3 - 2 / 1 + 4 ["("] [EXP] );

The advantage of this mechanism are:
1- Automatic (no hassle): Compiler writer can concentrate on actually
compiling the code, not worrying too much about syntax (beyond getting
an LALR'(K) grammar).
2- Fast: For every error, the recovery strategy is pre-encoded in the
parse table.  No trial/error approach.
3- Grammar sensitive [;-)]: the recovery action is based on the actual
grammar.
4- Termination: After [error] tokens are pushed on the stack, we are
sure that T will be consumed in the next parsing step. (The only case
where T could still trigger a new error, is if we "reduce" the to of the
stack, but as the stack has finite depth, this cannot go on forever.  I
know, I am skipping many details...)

(4-) is very important.  In fact (4-) + (1-) are complementary.  The
problem with programmer written error recovery is often that the
programmer will not anticipate all the possible errors in a given
grammar context (due to the complexity and the amount of work), and
often, the custom error recovery mechanism could go into infinite loop.
[One usually guards against this by keeping a count of error recovery
steps stopping the process after an arbitrary number of steps].

Inconvenients:
1- Very knwoledgable compiler programmers could potentially handle some
cases better.  [Maybe we could provide a special sablecc specification
section that allows the description of error recovery special cases].
2- Longer parser generation time (to compute the error recovery tables),
and larger tables...

3- When
=======

Finally, I would prefer that the AST be unconditionally generated with
embedded [error] nodes.  Semantic error recover and/or error message
issue would be delayed until parsing is done.  No compiler writer code
in the parsing process.  [Separation of phases].

What do you think?

> I've noticed that the toString() methods of tokens adds spaces.  It
> seems to me that it would be better if the productions that used these
> tokens added spaces between the tokens.  This is because often in a
> grammar you have:

Use the getText() method.  ToString() is simply a convenience for
debugging [ast.toString()].

Later,

Etienne
-- 
----------------------------------------------------------------------
Etienne M. Gagnon, M.Sc.                     e-mail: egagnon@j-meg.com
Author of SableCC:                             http://www.sablecc.org/
and SableVM:                                   http://www.sablevm.org/

Follow-Ups:
- Re: SableCC Thoughts Part II
  - From: "Etienne M. Gagnon" <egagnon@j-meg.com>
- Re: SableCC Thoughts Part II
  - From: Dan Sandberg <dan@siliconpoetry.com>

References:
- SableCC Thoughts Part II
  - From: Dan Sandberg <dan@siliconpoetry.com>

Prev by Date: SableCC Thoughts Part II
Next by Date: Re: SableCC Thoughts Part II
Prev by thread: SableCC Thoughts Part II
Next by thread: Re: SableCC Thoughts Part II
Index(es):
- Date
- Thread