[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Possible extension to SableCC. WAS: Re: reduce conflict on TKAs



Hi,

after reading your examples I realized that I, too, mixed something up
with
parsing in LALR style(well, all working parsers I have ever written are
recurcive descending parsers).

Wouldn'd it be nice to have a custumizable shift/reduce step?

I imagine to have a kind of lookahead buffer and a parse stack and
some builder as helper to craft the ast.

After each token, the parser could be asked to how to proceede, e.g.
shift or reduce.

Or a better way: make each conflict to be handled by a method of the
parser
where the user can descide how to handle it.

All you need is a naming scema for the various cases.

e.g: ReduceReduceConflictInProductionXX(ParseStack, LookAhead, Scanner)
     ShiftReduceConflictInProductionXY(ParseStack, LookAhead, Scanner)

The user can derive a customized parser by deriving and definig the
resolution
for those conflicts.

In fact I do not like to bother to much about the grammar. I want to
describe the 
AST with my grammer and expect the parser to fill/construct the AST
somehow :-)

Regards,
	Angelo

Etienne Gagnon schrieb:
> 
> Hi.
> 
> I just want to give a few tricks I've developed to resolve LALR(1)
> conflicts.
> 
> ----------
> TRICK[1] The first trick is to try to get some high level insight on the
> relation between the parser and the grammar.
> 
> The idea is that an LR parser can only proceed when everything on the
> left is unambiguously determined by looking at the next token.
> 
> example: the (.) indicates the position of the parser.
> 
> some_prod =
>       (.) A B C |
>       (.) A B D;
> 
> That's ok, because there's nothing on the left in all cases. If the next
> token is A then the parser can proceed to:
> 
> some_prod =
>       A (.) B C |
>       A (.) B D;
> 
> That's ok again, because there's an A on the left in all cases. If the
> next token is B then the parser can proceed to:
> 
> some_prod =
>       A B (.) C |
>       A B (.) D;
> 
> Everything is still ok, but the next token determines a specific way to
> proceed:
> if the next token is C:
> 
> program = U some_prod X;
> some_prod =
>       A B C (.) |
>       A B D;
> =>
>  the parser can "backup" (e.g. reduce) and replace [A B C] by
> [some_prod]
> 
> Obviously, the parser has read a U before the A. So now the parse
> "stack" looks like:
> (before) U A B C
> (after)  U some_prod
> 
> So now the parser is at:
> program = U some_prod (.) X;
> some_prod =
>       A B C |
>       A B D;
> 
> Then, if it sees a X it will "shift it" :
> stack: U some_prod X
> 
> program = U some_prod X (.);
> some_prod =
>       A B C |
>       A B D;
> 
> Then it will "accept" the input as a valid program.
> 
> ----------
> TRICK[2] When does a conflict happen?
> 
> A conflict happens when the parser has to make a decision, but it does
> not (yet) have enough information to do it.
> 
> example: (shift/reduce)
> 
> prog =
>   exp END |
>   cast ID END;
> 
> exp =
>   LPAR value STAR value RPAR; // multiplication
> 
> value =
>   ID |
>   NUMBER;
> 
> cast =
>   LPAR type RPAR; // type cast
> 
> type =
>   ID STAR;  // pointer type
> 
> This example is a typical C conflict between expressions and type casts.
> 
> Let's look at the problem with the following input:
> ( name * ...
> [could be:
>   ( name * ) name2 end
>   ( name * name2 ) end
> 
> Let's look at the parser behavior:
> As usual the (.) indicates the parser's current position. Notice that
> (.) is propagated everywhere.
> 
> (stack)
> prog =
>   (.) exp END | // => propagate in front of exp
>   (.) cast ID END; // => propagate in front of cast
> exp =
>   (.) LPAR value STAR value RPAR;
> value =
>   ID |
>   NUMBER;
> cast =
>   (.) LPAR type RPAR;
> type =
>   ID STAR;
> 
> When LPAR is seen, there's nothing on the stack. We shift it.
> 
> (stack) LPAR
> prog =
>   exp END |
>   cast ID END;
> exp =
>   LPAR (.) value STAR value RPAR; // => propagate in front of value
> value =
>   (.) ID |
>   (.) NUMBER;
> cast =
>   LPAR (.) type RPAR; // => propagate in front of type
> type =
>   (.) ID STAR;
> 
> When the ID is seen, there's an LPAR on the stack. We shift.
> 
> (stack) LPAR ID
> prog =
>   exp END |
>   cast ID END;
> exp =
>   LPAR value [STAR] value RPAR;
> value =
>   ID (.) | // => followed by [STAR] in exp.
>   NUMBER;
> cast =
>   LPAR type RPAR;
> type =
>   ID (.) STAR;
> 
> When STAR is seen, the parser has two choices:
> (1) shift STAR, because it is parsing "type".
> (2) reduce ID to value, then proceed parsing exp.
> 
> So the parser's dilemma is that it has to make a decision about whether
> is is parsing a type or a value, but it does not have enough information
> yet to do so.
> 
> The trick to solve this kind of conflict is to delay the decision as
> much as possible by rewriting the grammar.
> 
> Insight:
> - "shifting" is equivalent to delaying decisions
> - "reducing" is equivalent to making a decision.
> 
> So, let's try it:
> Our solution is to delay the decision. => We must eliminate the
> reduction.
> What was reduced? "value".
> Where was the context for this "value" reduction? "exp".
> So let's rewrite "exp" by inlining the left "value":
> 
> exp =
>   LPAR ID STAR value RPAR |
>   LPAR NUMBER STAR value RPAR;
> 
> As you can see, I had to make two alternatives, one for each alternative
> of "value".
> 
> So our final state that had problems becomes:
> (stack) LPAR ID
> prog =
>   exp END |
>   cast ID END;
> exp =
>   LPAR ID (.) STAR value RPAR |
>   LPAR NUMBER STAR value RPAR;
> value =
>   ID |
>   NUMBER;
> cast =
>   LPAR type RPAR;
> type =
>   ID (.) STAR;
> 
> No reduction is needed. We can simply shift:
> (stack) LPAR ID STAR
> prog =
>   exp END |
>   cast ID END;
> exp =
>   LPAR ID STAR (.) value RPAR | // => propagate in front of value
>   LPAR NUMBER STAR value RPAR;
> value =
>   (.) ID |
>   (.) NUMBER;
> cast =
>   LPAR type [RPAR];
> type =
>   ID STAR (.); // followed by [RPAR] in cast.
> 
> Now the next token determines clearly the next move:
> ID, NUMBER => shift
> RPAR => reduce [ID STAR] to [type]
> 
> So, let me repeat myself:
> - "shifting" is GOOD. It delays decisions.
> - "reducing" is BAD. It forces you to make a decision.
> 
> The art of resolving LALR(1) conflicts is the art of delaying decisions
> as much as possible.
> 
> ----------
> TRICK[3] Eliminate unnecessary "single" productions that cause
> conflicts.
> 
> The idea is that sometimes we write productions like:
> type = ID;
> variable = ID;
> 
> But this means that the parser has to decide whether ID is a type or a
> variable without much context. In many cases, just replacing all
> occurrences of "type" and "variable" by ID just resolves the ambiguity.
> In addition, in a SableCC AST, we can still recover the "type" and
> "variable" names by using the []: notation.
> 
> example: (SableCC notation. id, l_par, r_par are tokens).
> 
> prog =
>   {exp} exp |
>   {cast} cast id;
> exp =
>   l_par var r_par |
> cast =
>   l_par type r_par;
> var = id;
> type = id;
> 
> You can easily verify that this causes a reduce/reduce conflict on
> input:
> ( name ) ...
> [2 possibilities}
> ( name ) EOF
> ( name ) name2 EOF
> 
> Here's the conflicting state:
> (state) LPAR ID
> prog =
>   {exp} exp |
>   {cast} cast id;
> exp =
>   l_par var r_par;
> cast =
>   l_par type r_par;
> var = id (.); // followed by r_par in exp.
> type = id (.); // followed by r_par in cast.
> 
> The parser does not know how to reduce ID. Is it a "var" in an "exp" or
> a "type" in a "cast"?
> 
> If, instead, we inlined "var" and "type", we would be in state:
> (state) LPAR ID
> prog =
>   {exp} exp |
>   {cast} cast id;
> exp =
>   l_par [var]:id r_par |
> cast =
>   l_par [type]:id r_par;
> 
> Then there is no conflict. THe parser "shifts" RPAR. Then it can reduce
> to "cast" if the next token is ID or to "exp" if the next token is EOF.
> 
> Notice that the SableCC [var]: notation means that if you have an AExp
> node in your AST, you can do the following:
> AExp exp = ...
> TId var = exp.getVar(); // notice the getVar() instead of getId().
> 
> So, in summary:
> Prefer using the []: notation to adding spurious productions in your
> grammar.
> ----------
> 
> A big summary:
> (1) try to get the insight on the behavior of the parser.
> (2) Shifting is good.
> (3) Reducing is bad.
> (4) Inlining productions often helps resolving conflicts by delaying
> reductions to later.
> (5) Avoid spurious productions.
> 
> Have fun with SableCC!
> 
> Etienne
> --
> 
> ----------------------------------------------------------------------
> Etienne Gagnon, M.Sc.                   e-mail: gagnon@sable.mcgill.ca
> Author of SableCC:                 http://www.sable.mcgill.ca/sablecc/
> ----------------------------------------------------------------------

-- 
---------------------------------------------------------------------
Angelo Schneider           OOAD/UML           Angelo.Schneider@xcc.de
Putlitzstr. 24         Patterns/FrameWorks       Fon: +49 721 9812465
76137 Karlsruhe             C++/JAVA             Fax: +49 721 9812467