[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: reduce conflict on TKAs



Hi.

I just want to give a few tricks I've developed to resolve LALR(1)
conflicts.

----------
TRICK[1] The first trick is to try to get some high level insight on the
relation between the parser and the grammar.

The idea is that an LR parser can only proceed when everything on the
left is unambiguously determined by looking at the next token.

example: the (.) indicates the position of the parser.

some_prod =
      (.) A B C |
      (.) A B D;

That's ok, because there's nothing on the left in all cases. If the next
token is A then the parser can proceed to:

some_prod =
      A (.) B C |
      A (.) B D;

That's ok again, because there's an A on the left in all cases. If the
next token is B then the parser can proceed to:

some_prod =
      A B (.) C |
      A B (.) D;

Everything is still ok, but the next token determines a specific way to
proceed:
if the next token is C:

program = U some_prod X;
some_prod =
      A B C (.) |
      A B D;
=>
 the parser can "backup" (e.g. reduce) and replace [A B C] by
[some_prod]

Obviously, the parser has read a U before the A. So now the parse
"stack" looks like:
(before) U A B C
(after)  U some_prod

So now the parser is at:
program = U some_prod (.) X;
some_prod =
      A B C |
      A B D;

Then, if it sees a X it will "shift it" :
stack: U some_prod X

program = U some_prod X (.);
some_prod =
      A B C |
      A B D;

Then it will "accept" the input as a valid program.

----------
TRICK[2] When does a conflict happen?

A conflict happens when the parser has to make a decision, but it does
not (yet) have enough information to do it.

example: (shift/reduce)

prog = 
  exp END |
  cast ID END;

exp = 
  LPAR value STAR value RPAR; // multiplication

value = 
  ID | 
  NUMBER;

cast = 
  LPAR type RPAR; // type cast

type = 
  ID STAR;  // pointer type

This example is a typical C conflict between expressions and type casts.

Let's look at the problem with the following input:
( name * ...
[could be: 
  ( name * ) name2 end 
  ( name * name2 ) end 

Let's look at the parser behavior:
As usual the (.) indicates the parser's current position. Notice that
(.) is propagated everywhere.

(stack)
prog = 
  (.) exp END | // => propagate in front of exp
  (.) cast ID END; // => propagate in front of cast
exp = 
  (.) LPAR value STAR value RPAR;
value = 
  ID | 
  NUMBER;
cast = 
  (.) LPAR type RPAR; 
type = 
  ID STAR;

When LPAR is seen, there's nothing on the stack. We shift it. 

(stack) LPAR
prog = 
  exp END |
  cast ID END; 
exp = 
  LPAR (.) value STAR value RPAR; // => propagate in front of value
value = 
  (.) ID | 
  (.) NUMBER;
cast = 
  LPAR (.) type RPAR; // => propagate in front of type
type = 
  (.) ID STAR;

When the ID is seen, there's an LPAR on the stack. We shift. 

(stack) LPAR ID
prog = 
  exp END |
  cast ID END; 
exp = 
  LPAR value [STAR] value RPAR; 
value = 
  ID (.) | // => followed by [STAR] in exp.
  NUMBER;
cast = 
  LPAR type RPAR; 
type = 
  ID (.) STAR;

When STAR is seen, the parser has two choices:
(1) shift STAR, because it is parsing "type".
(2) reduce ID to value, then proceed parsing exp.

So the parser's dilemma is that it has to make a decision about whether
is is parsing a type or a value, but it does not have enough information
yet to do so.

The trick to solve this kind of conflict is to delay the decision as
much as possible by rewriting the grammar.

Insight: 
- "shifting" is equivalent to delaying decisions
- "reducing" is equivalent to making a decision.

So, let's try it:
Our solution is to delay the decision. => We must eliminate the
reduction.
What was reduced? "value".
Where was the context for this "value" reduction? "exp".
So let's rewrite "exp" by inlining the left "value":

exp =
  LPAR ID STAR value RPAR |
  LPAR NUMBER STAR value RPAR;

As you can see, I had to make two alternatives, one for each alternative
of "value".

So our final state that had problems becomes:
(stack) LPAR ID
prog = 
  exp END |
  cast ID END; 
exp = 
  LPAR ID (.) STAR value RPAR |
  LPAR NUMBER STAR value RPAR;
value = 
  ID | 
  NUMBER;
cast = 
  LPAR type RPAR; 
type = 
  ID (.) STAR;

No reduction is needed. We can simply shift:
(stack) LPAR ID STAR
prog = 
  exp END |
  cast ID END; 
exp = 
  LPAR ID STAR (.) value RPAR | // => propagate in front of value
  LPAR NUMBER STAR value RPAR;
value = 
  (.) ID | 
  (.) NUMBER;
cast = 
  LPAR type [RPAR]; 
type = 
  ID STAR (.); // followed by [RPAR] in cast.

Now the next token determines clearly the next move:
ID, NUMBER => shift
RPAR => reduce [ID STAR] to [type]

So, let me repeat myself:
- "shifting" is GOOD. It delays decisions.
- "reducing" is BAD. It forces you to make a decision.

The art of resolving LALR(1) conflicts is the art of delaying decisions
as much as possible.

----------
TRICK[3] Eliminate unnecessary "single" productions that cause
conflicts.

The idea is that sometimes we write productions like:
type = ID;
variable = ID;

But this means that the parser has to decide whether ID is a type or a
variable without much context. In many cases, just replacing all
occurrences of "type" and "variable" by ID just resolves the ambiguity.
In addition, in a SableCC AST, we can still recover the "type" and
"variable" names by using the []: notation.

example: (SableCC notation. id, l_par, r_par are tokens).

prog = 
  {exp} exp |
  {cast} cast id;
exp =
  l_par var r_par |
cast =
  l_par type r_par;
var = id;
type = id;

You can easily verify that this causes a reduce/reduce conflict on
input:
( name ) ...
[2 possibilities}
( name ) EOF
( name ) name2 EOF

Here's the conflicting state:
(state) LPAR ID
prog = 
  {exp} exp |
  {cast} cast id;
exp =
  l_par var r_par;
cast =
  l_par type r_par;
var = id (.); // followed by r_par in exp.
type = id (.); // followed by r_par in cast.

The parser does not know how to reduce ID. Is it a "var" in an "exp" or
a "type" in a "cast"?

If, instead, we inlined "var" and "type", we would be in state:
(state) LPAR ID
prog = 
  {exp} exp |
  {cast} cast id;
exp =
  l_par [var]:id r_par |
cast =
  l_par [type]:id r_par;

Then there is no conflict. THe parser "shifts" RPAR. Then it can reduce
to "cast" if the next token is ID or to "exp" if the next token is EOF.

Notice that the SableCC [var]: notation means that if you have an AExp
node in your AST, you can do the following:
AExp exp = ...
TId var = exp.getVar(); // notice the getVar() instead of getId().

So, in summary:
Prefer using the []: notation to adding spurious productions in your
grammar.
----------

A big summary:
(1) try to get the insight on the behavior of the parser.
(2) Shifting is good.
(3) Reducing is bad.
(4) Inlining productions often helps resolving conflicts by delaying
reductions to later.
(5) Avoid spurious productions.

Have fun with SableCC!

Etienne
-- 

----------------------------------------------------------------------
Etienne Gagnon, M.Sc.                   e-mail: gagnon@sable.mcgill.ca
Author of SableCC:                 http://www.sable.mcgill.ca/sablecc/
----------------------------------------------------------------------