
Re: Fw: HTML and shift/reduce conflicts

>After verifying the documentation that comes with SableCC, I agree that the
>licensing terms could be confusing. In the next release, I will add a
>specific notice at the top of all Public Domain files clearly stating that
>they can be used without any restriction.

Sounds good.  Thanks!

>I've quickly looked at the HTML grammar. I have a couple of suggestions. I
>would try a different approach. I would define more specific opening tags.
>Here's the idea:
> h = ['h' + 'H'];  // case insensitive 'h'
> t = ['t' + 'T'];
> m = ['m' + 'M'];
> l = ['l' + 'L'];

I was meaning to ask at some point -- is that the only way to do
case-insensitive lexing?

> {normal->tag} b_html = '<'  blank*  h  t  m  l;  // e.g. '<html'
> {normal->tag} e_html = '<'  blank*  '/'  blank*  h  t  m  l; // e.g.
> {tag->normal} end_tag = '>';

I reached a similar conclusion last night.  What I *really* need is a
two-phase parser.  Instead of the usual lex -> parse, I want lex -> parse1
-> parse2.  Parse1 takes tokens like '<', 'html', and '>', and turns them
into "tokens" like '<html>'.  I don't know how to implement this (at least
not without thinking), so I'm going to do the next best thing -- one
grammar that works the way you described (tokens like '<html>'), and a
second grammar to parse the internals of a tag.  Run the first grammar over
the entire input file, and then as a tree-walker run the second grammar
once per tag (i.e., about a million times per file).  Not very elegant, but
it should get the job done.
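
To make the two-pass idea concrete, here is a rough sketch in Python
rather than SableCC, with regular expressions standing in for the two
grammars.  Everything in it (first_pass, parse_tag, two_pass, and the
patterns) is made up for illustration and is not generated by or part of
SableCC.  It assumes no '>' appears inside attribute values and that the
input has no comments or DOCTYPE declarations.

```python
import re

# Pass 1 pattern: either a whole tag ("<...>") or a run of plain text.
# Assumption: no '>' occurs inside an attribute value.
TAG_OR_TEXT = re.compile(r"<[^>]*>|[^<]+")

# Pass 2 patterns: the start of a tag (optional '/', then a name) and
# one attribute (name, optionally followed by '=' and a value).
TAG_HEAD = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")
ATTRIBUTE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?"""
)


def first_pass(source):
    """Split the input into ('tag', text) and ('text', text) tokens."""
    tokens = []
    for match in TAG_OR_TEXT.finditer(source):
        text = match.group(0)
        kind = "tag" if text.startswith("<") else "text"
        tokens.append((kind, text))
    return tokens


def parse_tag(tag_text):
    """Parse the inside of one tag into (is_closing, name, attributes)."""
    head = TAG_HEAD.match(tag_text)
    if head is None:
        raise ValueError("not a tag: %r" % tag_text)
    is_closing = head.group(1) == "/"
    name = head.group(2).lower()  # tag names are case insensitive
    attributes = {}
    rest = tag_text[head.end():-1]  # drop the head and the final '>'
    for match in ATTRIBUTE.finditer(rest):
        value = match.group(2)
        if value is not None and len(value) >= 2 \
                and value[0] in "\"'" and value[-1] == value[0]:
            value = value[1:-1]  # strip matching surrounding quotes
        attributes[match.group(1).lower()] = value
    return is_closing, name, attributes


def two_pass(source):
    """Run pass 1 over the whole input, then pass 2 once per tag."""
    result = []
    for kind, text in first_pass(source):
        if kind == "tag":
            result.append(("tag",) + parse_tag(text))
        else:
            result.append(("text", text))
    return result
```

Calling two_pass('<HTML><a href="x.html">hi</A></html>') runs the tag
parser four times, once per tag, which is the "second grammar once per
tag" cost I mentioned.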

>By the way, isn't there any YACC grammar for HTML?

Yes.  One is available at


It works approximately the way you suggest.

-Nick Kramer