Re: Encoding question; example code for strange situations

Steve Murphy wrote:
> Question:
> The files I am reading can have several encodings; at least the
> textual parts can. The main syntax elements, like the keywords, etc. are
> all ascii. How would I best convert, say, an ANSEL character set to 
> UTF-8, or vice versa in the parser? "String"s are supposed to be raw
> unicode? When I write them to a file, they seem to be in UTF-8 or
> somesuch 8-bit standard. How do I tweak the in/out encodings, and how
> would this affect the parser?

The lexer expects Unicode characters (as streams or strings), so any decoding 
has to happen before the lexer sees the input.  The JDK has encoding-related 
APIs for this in the java.io classes (InputStreamReader and OutputStreamWriter 
take an encoding), and class java.lang.String also has encoding-related 
methods.  If you want to manually implement a very specific encoding, you can 
do as I did for the Java lexer: add a preprocessing stage in front of the lexer 
(and do the reverse in the back-end/writing stage).
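
A minimal sketch of the JDK route, for illustration only.  The class and method 
names (EncodingSketch, decode, encode, readerFor) are hypothetical, and it 
assumes the lexer is constructed from a java.io.PushbackReader, as 
SableCC-generated lexers usually are.  ISO-8859-1 stands in for the input 
encoding; ANSEL is, as far as I know, not among the charsets shipped with the 
JDK, so it would still need the manual preprocessing stage described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Decode raw bytes into a Unicode String using an explicit charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    // Encode a Unicode String back into bytes for the writing stage.
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    // Wrap a byte stream so that a lexer reading from it sees Unicode chars.
    // The pushback buffer size (1024) is an arbitrary choice for this sketch.
    static PushbackReader readerFor(InputStream in, Charset cs) {
        return new PushbackReader(new InputStreamReader(in, cs), 1024);
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = { (byte) 0xE9 };            // U+00E9 in ISO-8859-1
        String text = decode(latin1, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(utf8.length);            // byte count after re-encoding

        // The same bytes, read back the way a lexer would read them.
        try (PushbackReader r =
                 readerFor(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            System.out.println(r.read() == 0xE9);
        }
    }
}
```

With this arrangement the parser itself never deals with encodings: bytes are 
decoded once on the way in and encoded once on the way out, and everything in 
between works on Unicode strings.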

> Conclusions:
> I guess I do some unique things in this parser, that probably no-one
> else on earth would do. But, just in case someone else has the same
> problems I did, here is what I did:

Thanks a lot for your report.  It is always nice to hear about successful 
projects and learn new solutions to difficult problems.

Have fun!

Etienne M. Gagnon                    http://www.info.uqam.ca/~egagnon/
SableVM:                                       http://www.sablevm.org/
SableCC:                                       http://www.sablecc.org/