
Encoding question; example code for strange situations



Etienne, everyone--

Thank you for your help in getting my parser going.

I have two items here:

First, a question;  then, some conclusions from my project.

Question:

The files I am reading can come in several encodings; at least the
textual parts can. The main syntax elements, like the keywords, are all
ASCII. How would I best convert, say, ANSEL-encoded text to UTF-8 (or
vice versa) in the parser? Java Strings are supposed to be raw Unicode
internally, but when I write them to a file they seem to come out as
UTF-8 or some such 8-bit encoding. How do I control the input and
output encodings, and how would this affect the parser?
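
Just to make the question concrete, here is the kind of thing I mean by
tweaking the in/out encodings -- a minimal sketch with the standard
java.io classes (the file names are made up, and as far as I know Java
has no built-in ANSEL charset, so that direction would need a
hand-written mapping table):

import java.io.*;

public class EncodingSketch
{
    public static void main(String[] args) throws IOException
    {
        // Read the file through a Reader with an explicit encoding.
        // "ISO-8859-1" is just a stand-in; ANSEL is not a standard Java
        // charset and would need a custom byte-to-char mapping.
        Reader in = new InputStreamReader(
                new FileInputStream("family.ged"), "ISO-8859-1");

        // The lexer only ever sees Unicode chars out of the Reader, so the
        // ASCII keywords should come through unchanged no matter what
        // encoding is picked here.
        PushbackReader pin = new PushbackReader(new BufferedReader(in), 1024);

        // On the way out, pick the encoding explicitly as well, instead of
        // relying on the platform default.
        Writer out = new OutputStreamWriter(
                new FileOutputStream("family.out"), "UTF-8");
        out.write("0 HEAD\n");
        out.close();
        pin.close();
    }
}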


Conclusions:

I guess I do some unique things in this parser that probably no one
else on earth would do. But just in case someone else runs into the
same problems I did, here is what I did:

1. Reversed, LONG lists

I had two problems: the DepthFirst traverser reversed all my lists, and
on top of that the lists were so long that they were blowing the Java
stack. I couldn't find a stack-size number big enough to pass on the
command line to give the stacks the room they needed, and this is with
lists only around 20,000 to 50,000 elements long. So, what to do?

My grammar looks like this (oh, and by the way, using record* instead
didn't seem to help at all):


file = {filewsub} header record_list trailer
     ;
record_list = {el}record
            | {list}record_list record
            ;

record = {individual}individual_rec
       | {fam}fam_rec
       | {note}note_rec
       | {multimedia}multimedia_rec
       | {repository}repository_rec
       | {source}source_rec
       | {submitter}submitter_rec
       | {submission} submission
       | {eventdef} event_definition
       ;

In the above, the record_list is the one that can get pretty big.
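
For anyone who has not stared at SableCC's generated classes: the two
record_list alternatives above turn into node classes roughly like the
hand-written stand-ins below (NOT the real generated code, which lives
in the node package and carries a lot more machinery); only the
accessors matter for the code that follows:

// Rough stand-in for the SableCC-generated node classes, not the real thing.
abstract class PRecord {}
abstract class PRecordList {}

// the {el} alternative: just one record
class AElRecordList extends PRecordList
{
    private PRecord record;
    public PRecord getRecord() { return record; }
}

// the {list} alternative: a sub-list plus one more record
class AListRecordList extends PRecordList
{
    private PRecordList recordList;
    private PRecord record;
    public PRecordList getRecordList() { return recordList; }
    public PRecord getRecord() { return record; }
}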

So, I overrode all the list traversal funcs to do this:

    public void caseAListRecordList(AListRecordList node)
    {
        inAListRecordList(node);

        // The generated DepthFirst code would recurse into getRecordList()
        // and getRecord() here; we take full control instead, so a long
        // list does not turn into a deep recursion.

        final Stack lists = new Stack();

        // Walk down the left-recursive chain, remembering each list node.
        PRecordList n = node.getRecordList();
        while( n instanceof AListRecordList )
        {
            lists.push(n);
            n = ((AListRecordList)n).getRecordList();
        }

        // ok, we have the final list in this sequence, which is just an
        // El type, remember that!  Its record is the first one in the file.
        ((AElRecordList)n).getRecord().apply(this);

        // Unwinding the stack visits the remaining records in source order.
        while( ! lists.empty() )
        {
            AListRecordList nl = (AListRecordList)lists.pop();
            nl.getRecord().apply(this);
        }

        // and the last one
        if( node.getRecord() != null )
        {
            node.getRecord().apply(this);
        }

        outAListRecordList(node);
    }

Using the above un-reversed the reversed list and brought stack usage
down to acceptable levels. Etienne supplied me with a rough outline of
the above code, but the version above is fully debugged and works fine.
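
For completeness, the case method above sits in my traversal class (a
subclass of the DepthFirst adapter -- I'll call it MyTraverser here,
the name is made up), and it gets applied to the tree in the usual
SableCC way; imports of the generated parser/lexer/node packages are
omitted, and "family.ged" is just an example file name:

import java.io.*;

public class Run
{
    public static void main(String[] args) throws Exception
    {
        // lexer -> parser -> tree, then apply the traverser with the
        // overridden list handling
        Parser parser = new Parser(new Lexer(
                new PushbackReader(new FileReader("family.ged"), 1024)));
        Start tree = parser.parse();
        tree.apply(new MyTraverser());
    }
}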

2. Pre-processing the token flow.

The format I'm parsing (GEDCOM 5.5, and pretty much all its
predecessors) uses a level number at the beginning of each line to
indicate "level", or containment. Writing a grammar around this is a
pain. You have to include the level numbers in the rules, so you can
keep track of who really owns what. And with recursive ownership (for
example, notes can contain source citations, and source citations can
contain notes), you have to include all the possible levels that a
construct could occur at, and you have to cut it off somewhere... Not
nice.

Another problem with the format is that textual data follows the tag on
each line, and this data can contain anything, including tags and other
keyword-like data. It would have been nice if the format syntactically
marked this data with double quotes or some such, but it doesn't...

BUT, if you replace the level numbers with parens (an open paren when
the level goes up, a close paren when it drops back down), the grammar
becomes straightforward and simple. AND, if you put double quotes
around the text that follows each tag, you solve that parsing problem
too (assuming you won't have any double quotes in the text, of
course)... This I did with this kind of code:

import java.io.*;


public class PreProcessor extends Reader
{
	int level;
	PushbackReader fil;
	StringBuffer currline,numbuf;
	int atchar;
	int ref;
	
	public PreProcessor(PushbackReader in)
	{
		int c;
		fil = in;
		currline = new StringBuffer(512);
		numbuf = new StringBuffer(5);
		level = 0;
		atchar = 0;
		ref = 0;
// System.out.println("PreProcessor called");
		try
		{
			// note: this reads (and discards) the first character of the input
			c = in.read();

		}
		catch(Exception e)
		{
			e.printStackTrace();
			System.out.println(e);
			System.exit(1);
		}
	}
	
	public int read() throws IOException
	{
		while (currline.length() == 0 || atchar == currline.length() )
		{
			/* time to get a new string */
			if( atchar == currline.length())
			{
				// zero out and start over 
				currline.setLength(0);
				numbuf.setLength(0);
				atchar = 0;
				ref = 0;
			}
			int c,lc;
			Integer num;
			try
			{
				while( (c = fil.read()) >= 0 )
				{
					int chartype = Character.getType((char)c);
					
					if( atchar == 0 && (c == 10 /* LF */ || c == 13 /* CR */) )
					{
						if( c == 10 )
							currline.append((char)c);
						continue;
					}
					if( atchar > 0 && (c == 10 /* LF */ || c == 13 /* CR */) )
					{
						if( c == 10 )
							currline.append((char)c);
						break;
					}
					if( chartype != 9 /* DECIMAL_DIGIT_NUMBER */ )
					{
						break;
					}
					else
						numbuf.append((char)c);
				}
				if( c == -1 && atchar == 0 )
					return -1;
				lc = c;
				if( numbuf.length() > 0 )
				{
					num = Integer.valueOf(numbuf.toString());
//		System.out.println("Level: "+numbuf.toString());
					
					while( (num.intValue()-level) > 0 )
					{
						currline.append('(');
//		System.out.println("Appended  '('");
						level++;
					}
					while( (level-num.intValue()) > 0 )
					{
						currline.append(')');
//		System.out.println("Appended  ')'");
						level--;
					}


// Sorry, chopped a chunk of code out of here that did the actual 
// insertion of quotes, etc.-- hopefully the above will give you enough 
// framework that you can fill in the gap for yourselves...
// Didn't want to bore everyone with lots of code!


				}
				
			}
			catch(IOException e)
			{
				// read errors are silently ignored here
			}
		}
		atchar++;
		ref = 0;
		return currline.charAt(atchar-1);
	}

	public int read(char cbuf[], int off, int len) throws IOException
	{
		for(int i = 0; i< len; i++ )
		{
			int c = read();
			
			if(c == -1 )
			{
				if( i == 0 )
					return -1;
				else
					return i;	// return how many chars we actually read
			}
			cbuf[off+i] = (char)c;
		}
		return len;
	}

	public void unread(int oneChar) throws IOException
	{
		atchar--;
	}

	public void close() throws IOException
	{
		fil.close();
		fil = null;
	}
}
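
By the way, the chunk I chopped out above is the part that does the
quote insertion itself. Purely as an illustration of the idea (this is
NOT the removed code, and it ignores details like @xref@ pointers), the
per-line transformation boils down to something like this:

// Hypothetical stand-alone helper, just to show the idea -- not part of the
// PreProcessor above.  Given the rest of a line after its level number
// (e.g. "NAME John /Smith/"), keep the tag and wrap whatever follows it in
// double quotes, so the lexer sees a single string token instead of free
// text that might contain keywords.
public class QuoteSketch
{
    static String quoteValue(String rest)
    {
        int sp = rest.indexOf(' ');
        if( sp < 0 )
            return rest;                    // tag only, nothing to quote
        String tag = rest.substring(0, sp);
        String value = rest.substring(sp + 1);
        return tag + " \"" + value + "\"";
    }

    public static void main(String[] args)
    {
        // prints: NAME "John /Smith/"
        System.out.println(quoteValue("NAME John /Smith/"));
    }
}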

3. Removing the darn double quotes.

   The above code added the double quotes, but they really are
completely worse than useless once the parse is over. So the following
code filters them back out of the string tokens before they go into the
parse tree:


import java.io.*;
import java.util.*;

public class MyLexer extends Lexer
{
    public MyLexer(PushbackReader in)
    {
        super(in);
    }

    /* Filter: strip the double quotes the PreProcessor wrapped around text */
    protected void filter()
    {
        if (token instanceof TString)
        {
            String text = token.getText();
            int length = text.length();
            if( length > 0 && text.charAt(0) == '"' )
                token.setText(text.substring(1, length - 1));
        }
    }
}

Assuming, of course, that your grammar says something like:


Helpers
    cr = 13; lf = 10; at = 64;
    notcrlf = [[1..255]-[cr+lf]];
    notcrlfat = [[[32..127]-[cr+lf]]-at];

Tokens
    string = '"' notcrlf* '"';
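
And to tie the three pieces together, the whole chain ends up looking
roughly like this (PreProcessor, MyLexer, Parser and Start are the
classes above or generated by SableCC; the other names are made up for
the example, and the generated package imports are omitted):

import java.io.*;

public class Main
{
    public static void main(String[] args) throws Exception
    {
        // file -> PreProcessor (level numbers become parens, text gets
        // quoted) -> MyLexer (quotes stripped back off the string tokens)
        // -> Parser -> parse tree -> the list-friendly traverser
        PushbackReader raw = new PushbackReader(
                new BufferedReader(new FileReader("family.ged")), 1024);
        PushbackReader cooked = new PushbackReader(new PreProcessor(raw), 1024);

        Parser parser = new Parser(new MyLexer(cooked));
        Start tree = parser.parse();
        tree.apply(new MyTraverser());
    }
}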

I hope the above code snippets may be of service to someone else using
this system. The fact that I could do all of the above is evidence that
SableCC can handle jobs ranging from the fairly simple to the fairly
obtuse and complex.

murf
