[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: ANNOUNCE: SableCC version 2.1

Hi Chris,

We should probably continue this discussion on the SableCC mailing list

See my anwers below.

> Etienne also said:
> > statement =
> >   if expr then statement* |
> >   if expr then statement* else statement*;
> 1) What does "getStatement()" mean to the first alternative (if expr
> then statement*)?  Does it get the first statement, the last
> statement, a linked list of all the statements, something else?  What
> if there were 0 statements?

getStatement() returns a linked list of statements. It would be
preferable to write:

statement = if expr then [statements]:statement*

which gives the name "statements" to the "statement*" element. This lets
you get the linked list of statements with getStatements()

I didn't want to fight with the plural of names;-)

> 2) What does "getStatement()" mean to the second alternative (if expr
> then statement* else statement*) where there are two statements to
> make the issue even more complicated.  Do you have some sort of naming
> scheme to identify the different statements (the then-statements and
> the else-statements)?

If you read the documentation on the SableCC web page, you will discover
that SableCC will require an explicit name for elements that appear more
than one in an alternative. See 1) for an example of explixit name.

> > Do not disallow getting children through indices (such as getChild(3)).
> Yes, many users want to easily walk the tree ignoring its shape.

The two default AST walkers implemented can visit all leaves by simply
overriding the defaultCase() method. Additional AST walkers can be built
to deal with all practical needs. I don't see, yet, the need to access
children by number.

Accessing children by number is a software engineering problem. What
happens when you modify your grammar? You could have to change all
getChild(3) by getChild(4). This is very easy to get wrong. In the case
of a grammar with free use of parentheses, it can get even worse. What
is the significance of getChild(3) in ((a*b+)*|c*)(d|f*)?

The point of SableCC is to build easy to maintain compilers. It should
be easy to change a single keyword by two keywords in an alternative (of
a production) without causing a maintenance nightmare. Accessing
children by name prevents this problem.

> Sreenivas wrote:
> > Also, I don't think there is one right way of building trees and so
> > it should be totally user-configurable.
> This point, was augmented by Nick:
> > (1) eliminating artifacts of operator precedence
> > (2) turning some constructs into proper lists/arrays
> > (3) various flattenings, like turning an IdentifierToken into a
> >     String, or turning a modifier token (like "synchronized" in
> >     Java) into a simple boolean field (isSynchronized)
> > (4) throwing away irrelevant information, especially tokens that act
> >     as delimiters
> These are particularly good points.  It would be interesting to see
> how each tool deals with each of the various flattenings Nick
> mentioned.  I know that Yacc++ can generate exactly the tree Nick
> mentions for case 1.  For case 2, Yacc++ can generate the right tree
> if the rules is written as a reg-expr and not as a recursive rule.
> Yacc++ also does the right thing for case 4 provided that it is
> consistently thrown away.  (On my to-do list is to add an equivalent
> of the PCCTS "!" operator which throws away one token in a specific
> context.)  Yacc++ does not have a solution for 3.  (Terr used to have
> an overview of the Yacc++ model in the proceedings of an OO Compiler
> Workshop we co-sponsored a couple of years ago.  I don't know if he
> still has those papers or not.)

I already answered this question (in part at least). You may want to
search the list archive. 

> Etienne wrote:
> > SableCC accepts only unambiguous grammars with no shift/reduce or
> > reduce/reduce conflicts.
> I think you will find that this should be relaxed (at least to allow
> s/r conflicts as most Yacc grammars have s/r conflicts and quite a few
> use precedence and associativity to resolve them).

I am thinking of a partial support for resolving s/r conflicts. The idea
is to limit the scope of validity of conflict resolution, so that the
result is what you expect. If you want to specify the precedence for *
and + in:
exp = exp + exp | exp + exp
the conflict resolution should have an effect only in this specific
context, not all conflicts of * and +. (Think of * used somewhere else
as a pointer operator. Should it really be considered to have the same
precedence as the multiplication operator relatively to the addition
operator)? I do not think so. Yacc S/R conflict resolution schemes don't
take this into account.

> Etienne added:
> > All the programmer Java code should be in Java classes and not
> > interspersed in the middle of a grammar definition.
> and:
> > What kept me away from LL(k) predicates are the (semantic) predicates
> > themselve. The problem is that you need to do semantic analysis at AST
> > construction time. Not only this obscures the grammar definition, it also
> > requires from you to understand the parsing process. This was one the
> > biggest flaw in YACC (in addition to the automatic resolution of conflicts!)
> Your first reason for avoiding predicates (no interspersed Java code)
> makes sense.  However, I don't know if it is completely practical.
> Unfortunately, there are always some cases where syntax and semantics
> mangle each other in real grammars.  There are numerous "hacks" (lexer
> states, changing token types, semantic predicates to name 3) which are
> used in a variety of grammars to make a real language parseable.  You
> don't have to support all the hacks, but disallowing them all will
> probably make your tool unusable at the practical level in the end.
> There are even semantic hacks you can allow without destroying your
> O(n) analysis time or making the user "understand" the parsing
> process.

They key to hacking a SableCC generated compiler is customized
Lexers/Parsers. These are classes derived from the generated
Lexer/Parser classes. By overriding the filter() method, you can hack
however you want. This method is called on every recognised token
(lexer) and every reduction (parser). This gives you all (or at least
most) of the flexibility you normally get by putting actions in the
grammar. But this way, we keep the grammar "clean" of Java code.

> Etienne wrote:
> > expr = expr + expr  | expr * expr | number;
> >
> > The type system you would get with this ambiguous grammar would allow you to
> > construct an ast for 5 * (2 + 4) * 3 without using parentheses. If you want
> > to serialize this back to disk, you have to add the parentheses.
> There are ways of serializing an arbitrary tree which automatically
> preserve "the parenthesis" without the grammar writer being aware of
> the issue (e.g. rpn).

My wording was wrong. I meant that if you want to write the program (not
the AST!) back to a "text" file, and parse it again later (possibly with
another compiler). Think of a preprocessor for Java, for example. (You
don't necessarily want to rewrite or hack "javac";-)