Etienne, everyone --

Thank you for your help in getting my parser going. I have two items here: first, a question; then, some conclusions from my project.

Question:

The files I am reading can have several encodings; at least the textual parts can. The main syntax elements, like the keywords, are all ASCII. How would I best convert, say, the ANSEL character set to UTF-8, or vice versa, in the parser? Are Strings supposed to be raw Unicode? When I write them to a file, they seem to come out in UTF-8 or some such 8-bit encoding. How do I tweak the input/output encodings, and how would this affect the parser?

Conclusions:

I guess I do some unique things in this parser that probably no one else on earth would do. But, just in case someone else runs into the same problems I did, here is what I did:

1. Reversed, LONG lists

I had two problems: the DepthFirst traverser reversed all my lists, and on top of that, the lists were so long that they were overflowing the Java stack. I couldn't find a number big enough to pass on the command line to give the stack the room it needed -- and this is with lists only around 20,000 to 50,000 elements long. So, what to do? My grammar looks like this (and by the way, using record* didn't seem to help at all):

file =
    {filewsub} header record_list trailer;

record_list =
    {el}   record |
    {list} record_list record;

record =
    {individual} individual_rec |
    {fam}        fam_rec |
    {note}       note_rec |
    {multimedia} multimedia_rec |
    {repository} repository_rec |
    {source}     source_rec |
    {submitter}  submitter_rec |
    {submission} submission |
    {eventdef}   event_definition;

In the above, record_list is the one that can get pretty big.
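To make the stack problem concrete: each {list} alternative nests one AListRecordList inside the next, so 50,000 records means a chain 50,000 nodes deep, and the generated recursive visitor needs one call frame per node. Here is a toy model of that chain (class and method names are mine, not SableCC's generated ones), showing that an explicit stack gives constant call depth and also yields the elements in original order:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ChainWalk
{
    // Toy model of a left-recursive list node: 'rest' plays the role of
    // getRecordList(), 'value' the role of getRecord().
    static final class Node
    {
        final Node rest;   // null for the innermost ({el}-like) node
        final int value;
        Node(Node rest, int value) { this.rest = rest; this.value = value; }
    }

    // Build a chain for elements 1..n; as with the parser, the last
    // element read ends up at the head of the chain.
    static Node build(int n)
    {
        Node head = null;
        for( int i = 1; i <= n; i++ )
            head = new Node(head, i);
        return head;
    }

    // Iterative traversal with an explicit stack: O(1) call depth, and
    // the elements come out in original (1..n) order, not reversed.
    static List<Integer> walk(Node head)
    {
        Deque<Node> stack = new ArrayDeque<Node>();
        for( Node n = head; n != null; n = n.rest )
            stack.push(n);
        List<Integer> out = new ArrayList<Integer>();
        while( !stack.isEmpty() )
            out.add(stack.pop().value);
        return out;
    }
}
```

This is the same shape as the overridden case method that follows, just stripped of the SableCC machinery.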
So, I overrode all the list-traversal functions to do this:

public void caseAListRecordList(AListRecordList node)
{
    inAListRecordList(node);

    // The generated traversal would have been:
    // if(node.getRecordList() != null)
    //     node.getRecordList().apply(this);
    // if(node.getRecord() != null)
    //     node.getRecord().apply(this);

    // We are taking full control now: walk down the left-recursive
    // chain iteratively, pushing each list node onto an explicit stack.
    final Stack lists = new Stack();
    PRecordList n = node.getRecordList();
    while( n instanceof AListRecordList )
    {
        lists.push(n);
        n = ((AListRecordList)n).getRecordList();
    }

    // OK, we have the final list in this sequence, which is just
    // an El type -- remember that!
    ((AElRecordList)n).getRecord().apply(this);

    // Unwind the stack, visiting the records in original order.
    while( !lists.empty() )
    {
        AListRecordList nl = (AListRecordList)lists.pop();
        nl.getRecord().apply(this);
    }

    // ...and the last one.
    if( node.getRecord() != null )
    {
        node.getRecord().apply(this);
    }

    outAListRecordList(node);
}

Using the above restored the lists to their original order and reduced stack usage to acceptable levels. Etienne supplied me with a rough outline of the above code, but the above is fully debugged and works fine.

2. Pre-processing the token flow

The format I'm parsing (GEDCOM 5.5 and pretty much all its predecessors) uses a level number at the beginning of each line to indicate "level", or containment. Writing a grammar around this is a pain. You have to include the level numbers in the rules so you can keep track of who really owns what. And with recursive ownership (for example, notes can contain source citations, and source citations can contain notes), you have to include all the possible levels that each construct could occur at, and you have to cut it off somewhere... Not nice.

Another problem with the format is that textual data follows the tags on each line, and this data can contain anything, including tags and other keyword-type data. It would have been nice if the format syntactically marked this data with double quotes or some such, but it didn't...
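The fix I settled on can be shown on data first: turn each increase in level number into '(' and each decrease into ')'. This is a much-simplified, line-based sketch of that idea (class and method names are mine; the real pre-processor below works character by character and also handles the quoting). It assumes every line starts with a level number, as the format requires:

```java
import java.util.ArrayList;
import java.util.List;

public class LevelToParens
{
    // Replace leading GEDCOM-style level numbers with nesting parens:
    // a level increase opens '(', a decrease closes ')'.
    public static List<String> transform(List<String> lines)
    {
        List<String> out = new ArrayList<String>();
        int level = 0;
        for( String line : lines )
        {
            // peel off the leading digits (assumes they are present)
            int i = 0;
            while( i < line.length() && Character.isDigit(line.charAt(i)) )
                i++;
            int num = Integer.parseInt(line.substring(0, i));

            StringBuilder sb = new StringBuilder();
            while( num > level ) { sb.append('('); level++; }
            while( num < level ) { sb.append(')'); level--; }
            sb.append(line.substring(i));
            out.add(sb.toString());
        }
        return out;
    }
}
```

With parens standing in for levels, containment becomes ordinary bracket nesting, which is exactly what a context-free grammar is good at.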
BUT, if you replace the level numbers with parens (open and close), the grammar becomes straightforward and simple. AND, if you put double quotes around the text, you solve that parsing problem too (assuming you won't have any double quotes in the text, of course)... This I did with this kind of code:

import java.io.*;

public class PreProcessor extends Reader
{
    int level;
    PushbackReader fil;
    StringBuffer currline, numbuf;
    int atchar;
    int ref;

    public PreProcessor(PushbackReader in)
    {
        int c;
        fil = in;
        currline = new StringBuffer(512);
        numbuf = new StringBuffer(5);
        level = 0;
        atchar = 0;
        ref = 0;
        try
        {
            c = in.read(); // note: reads (and discards) one character up front
        }
        catch(Exception e)
        {
            e.printStackTrace();
            System.out.println(e);
            System.exit(1);
        }
    }

    public int read() throws IOException
    {
        while( currline.length() == 0 || atchar == currline.length() )
        {
            /* time to get a new string */
            if( atchar == currline.length() )
            {
                // zero out and start over
                currline.setLength(0);
                numbuf.setLength(0);
                atchar = 0;
                ref = 0;
            }

            int c, lc;
            Integer num;
            try
            {
                while( (c = fil.read()) >= 0 )
                {
                    int chartype = Character.getType((char)c);
                    if( atchar == 0 && (c == 10 /* LF */ || c == 13 /* CR */) )
                    {
                        if( c == 10 )
                            currline.append((char)c);
                        continue;
                    }
                    if( atchar > 0 && (c == 10 /* LF */ || c == 13 /* CR */) )
                    {
                        if( c == 10 )
                            currline.append((char)c);
                        break;
                    }
                    if( chartype != Character.DECIMAL_DIGIT_NUMBER )
                        break;
                    else
                        numbuf.append((char)c);
                }

                if( c == -1 && atchar == 0 )
                    return -1;
                lc = c;

                if( numbuf.length() > 0 )
                {
                    num = Integer.valueOf(numbuf.toString());
                    // emit '(' for each level deeper, ')' for each level shallower
                    while( (num.intValue() - level) > 0 )
                    {
                        currline.append('(');
                        level++;
                    }
                    while( (level - num.intValue()) > 0 )
                    {
                        currline.append(')');
                        level--;
                    }
                    // Sorry, chopped a chunk of code out of here that did the
                    // actual insertion of quotes, etc. -- hopefully the above
                    // will give you enough framework that you can fill in the
                    // gap for yourselves... Didn't want to bore everyone with
                    // lots of code!
                }
            }
            catch(IOException e)
            {
            }
        }

        atchar++;
        ref = 0;
        return currline.charAt(atchar - 1);
    }

    public int read(char cbuf[], int off, int len) throws IOException
    {
        for( int i = 0; i < len; i++ )
        {
            int c = read();
            if( c == -1 )
            {
                if( i == 0 )
                    return -1;
                else
                    return i; // was "return 1", which under-reported the count
            }
            cbuf[off + i] = (char)c;
        }
        return len;
    }

    public void unread(int oneChar) throws IOException
    {
        atchar--;
    }

    public void close() throws IOException
    {
        fil.close();
        fil = null;
    }
}

3. Removing the darn double quotes

The above code added the double quotes, but they really are worse than useless once the parse is over. So, the following code filters them back out as they go into the parse tree:

import java.io.*;
import java.util.*;

public class MyLexer extends Lexer
{
    public MyLexer(PushbackReader in)
    {
        super(in);
    }

    /* Filter */
    protected void filter()
    {
        if( token instanceof TString )
        {
            String text = token.getText();
            int length = text.length();
            if( length > 0 && text.charAt(0) == '"' )
                token.setText(text.substring(1, length - 1));
        }
    }
}

Assuming, of course, that your grammar says:

Helpers
    cr = 13;
    lf = 10;
    at = 64;
    notcrlf = [[1..255] - [cr + lf]];
    notcrlfat = [[[32..127] - [cr + lf]] - at];
    string = '"' notcrlf* '"';

I hope the above code snippets may be of service to someone else using this system. The fact that I could do the above is witness that SableCC can be used to handle jobs from the fairly simple to the fairly obtuse and complex.

murf
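P.S. For anyone who wants to try the quote-stripping logic by itself, here it is pulled out into a standalone form (class and method names are mine; the extra length check makes it safe even for strings the grammar above would never actually produce, such as a lone quote character):

```java
public class QuoteStripper
{
    // Strip one pair of surrounding double quotes, if present.
    // Mirrors the filter() logic in MyLexer.
    public static String strip(String text)
    {
        int length = text.length();
        if( length > 1 && text.charAt(0) == '"' && text.charAt(length - 1) == '"' )
            return text.substring(1, length - 1);
        return text;
    }
}
```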