
question





Gentlemen--

I'm very much a newcomer to this list, to sablecc, and to Java, all at
the same time. I'm learning Java, but I need a little help with sablecc.

Now, I'd like very much to use sablecc in my current project, a parser
for GEDCOM 5.5, a genealogical database transfer language.

The trouble is that I don't think I can use the lexer as-is;
I'll demonstrate...

Here is a "taste" of GEDCOM...

0 HEAD
1 SOUR ANSTFILE
2 VERS 4.19
2 NAME Ancestral File
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, Utah 84150
2 DATA Ancestral File
3 DATE 5 January 1998
3 COPR Copyright (c) 1987, June 1998 by Intellectual Reserve, Inc.
4 CONT  All Rights Reserved. 
1 DEST PAF
1 DATE 29 MAR 2001
2 TIME 12:20:03
1 FILE GEDCOM4.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR ANSEL
1 SUBM @SUB01@
1 SUBN @N01@
0 @SUB01@ SUBM
1 NAME Created by FamilySearch (TM) Internet Genealogy Service
1 ADDR 50 East North Temple Street
...



In the above, the first entry on each line is a "level number". The
second (in most cases) is a "tag", and the remainder of the line is
data whose type and significance are indicated by the tag. One other
thing is a by-name reference/definition, which begins and ends with
'@' (for example, the @SUB01@ above). If the ref precedes the tag,
it's a definition; if it follows, it's a reference.

What I'd like to do is have a set of tokens, one per "tag", have
the data on each line turned into a "string" token, and have each
@xxx@ turned into a "ref" token.
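
To make the line layout above concrete, here is a minimal sketch in
plain Java that splits one GEDCOM line into its level, optional
defining ref, tag, and data. It assumes the fields are separated by
single spaces, as in the sample. The class name GedcomLine, its
fields, and isReference() are made up for illustration; nothing here
is sablecc API, and mapping the pieces onto sablecc token classes is
left out.

```java
// Hypothetical helper, not part of sablecc: one parsed GEDCOM line.
final class GedcomLine {
    final int level;          // leading level number
    final String definedRef;  // "@SUB01@" when the ref precedes the tag, else null
    final String tag;         // e.g. "HEAD", "SUBM"
    final String data;        // remainder of the line, possibly empty

    GedcomLine(int level, String definedRef, String tag, String data) {
        this.level = level;
        this.definedRef = definedRef;
        this.tag = tag;
        this.data = data;
    }

    // True when the data is a by-name reference such as "@SUB01@".
    boolean isReference() {
        return data.length() > 2 && data.startsWith("@") && data.endsWith("@");
    }

    // Assumes single-space separators, as in the sample above.
    static GedcomLine parse(String line) {
        String[] head = line.split(" ", 2);
        int level = Integer.parseInt(head[0]);
        String rest = head.length > 1 ? head[1] : "";

        // A ref that comes before the tag is a definition.
        String definedRef = null;
        if (rest.startsWith("@")) {
            String[] parts = rest.split(" ", 2);
            definedRef = parts[0];
            rest = parts.length > 1 ? parts[1] : "";
        }

        // Whatever follows the tag is kept verbatim as the data string.
        String[] parts = rest.split(" ", 2);
        String tag = parts[0];
        String data = parts.length > 1 ? parts[1] : "";
        return new GedcomLine(level, definedRef, tag, data);
    }
}
```

With this sketch, "0 @SUB01@ SUBM" yields a definition of @SUB01@ with
tag SUBM and empty data, while "1 SUBM @SUB01@" yields tag SUBM whose
data is a reference.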

But the level numbers are the catch. I want to turn the level numbers
into an l_par token when the number is one greater than the previous
line, and when the level number is less than the previous line, into
possibly several r_par tokens, one for each level dropped. GEDCOM has
a rule that the levels can only grow by one at a time, but they can
drop any number of levels as necessary. And, of course, if the
level number is the same, no token at all would be returned.
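
The level rule just described can be sketched in a few lines of plain
Java. This is only an illustration of the rule, not a sablecc lexer:
the class name LevelTokens and the method between() are hypothetical,
and tokens are represented as plain strings rather than generated
token classes. Whether to emit closing r_par tokens at end of input
is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: bracket tokens implied by a change in level number.
final class LevelTokens {
    // Returns the tokens to emit when moving from the previous line's
    // level to the current line's level.
    static List<String> between(int previous, int current) {
        if (current > previous + 1) {
            // GEDCOM only lets the level grow by one at a time.
            throw new IllegalArgumentException(
                "level jumped from " + previous + " to " + current);
        }
        List<String> tokens = new ArrayList<>();
        if (current == previous + 1) {
            tokens.add("l_par");          // one level deeper
        }
        for (int i = current; i < previous; i++) {
            tokens.add("r_par");          // one per level dropped
        }
        return tokens;                    // empty when the level is unchanged
    }
}
```

For the sample above, going from "4 CONT ..." to "2 DATA ..." gives
two r_par tokens, going from "1 SOUR ..." to "2 VERS ..." gives one
l_par, and two consecutive level-2 lines give nothing.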

This lexical behavior can't be described with anything I see in
sablecc, but even if I have to craft a small lexer by hand,
using sablecc still seems well worth the price compared with the
yacc equivalents that are the alternatives.

I have the grammar all ready to go; I've defined a set of tokens for
all the tags, and ref, and the data string, and I've invented
fictitious patterns for l_par and r_par. It's pretty big. Sablecc
generates all the files without complaint. Do any of you have
suggestions as to how to handle the lexer?

I appreciate any advice you all can give me.

murf

PS: as a first time user, I have some (hopefully perceived as)
constructive criticism for sablecc:

1. Error messages are not very intuitive. It took me a few hours to
   guess that 237,20 was a reference to a line and character number.


2. The documentation is sketchy. It could really be improved with
   better examples, drawn more from "real" compilers than from the
   simple evaluation engine given. Some example code of walkers
   building data structures, etc., could be valuable...

3. Where in the docs does it say that all token and alternative names
   must be lower case?