
Re: Tokens and free text



Bob,

A few suggestions. Why not define a system-independent end-of-line token?
And why not use lexer states to detect the beginning of a line, so that
headers are recognized only there? This would allow you to narrow the
definition of "text".

The trick is to recognize the form "blank* 'to' blank* ':'" only at the
beginning of a line. Inside the line (in the normal state) you recognize only
text chars and end-of-line.

Helpers
   char     = [0x00..0xff];
   cr       = 0x000d;
   lf       = 0x000a;
   cr_lf    = cr lf;   // declared after cr and lf, which it refers to
   colon    = ':';
   blank    = ' ';     // you could add tabs, ...

States
  bol, normal;

Tokens

   {bol, normal->bol} eol = cr | lf | cr_lf;

   /* Note that helpers and tokens don't share the same name space. */
   {bol->normal, normal} text_chars = char;

   {bol->normal} to_header = blank* 'to' blank* colon;
   {bol->normal} from_header = blank* 'from' blank* colon;

/*******************************************************************
* Productions                                                     *
*******************************************************************/
Productions

   message         = lines*;
   lines           = header text*;
   header          =   {to}    to_header |
                       {from}  from_header;

   text            =   {text}  text_chars |
                       {eol}   eol;
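
To see how the two states behave on the sample input, here is a rough sketch
in Python rather than SableCC. It is not what SableCC generates: the function
name tokenize, the RULES table and the regular expressions are hypothetical
stand-ins for the token definitions above, and it assumes the usual lexer
rule that the longest match wins, with ties going to the token declared first.

```python
import re

# Hypothetical stand-ins for the token definitions above.
# Each entry is (token name, pattern, state entered after the match);
# the order within a state encodes declaration priority.
EOL = r"\r\n|\r|\n"
ANY_CHAR = r"[\x00-\xff]"

RULES = {
    "bol": [
        ("eol", EOL, "bol"),
        ("text_chars", ANY_CHAR, "normal"),
        ("to_header", r" *to *:", "normal"),
        ("from_header", r" *from *:", "normal"),
    ],
    "normal": [
        ("eol", EOL, "bol"),
        ("text_chars", ANY_CHAR, "normal"),
    ],
}


def tokenize(source):
    """Split source into (token name, text) pairs using the two states."""
    state, pos, tokens = "bol", 0, []
    while pos < len(source):
        best = None
        for name, pattern, next_state in RULES[state]:
            match = re.compile(pattern).match(source, pos)
            # Strict '>' keeps the earlier rule when two matches tie.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end(), next_state)
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        name, end, next_state = best
        tokens.append((name, source[pos:end]))
        state, pos = next_state, end
    return tokens
```

Calling tokenize("to: i want this to work") yields a to_header token for the
leading "to:" only. The later "to" is lexed in the normal state, where no
header token is active, so it comes out as two ordinary text_chars tokens.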

Anyway, this is another way to look at the problem.

Etienne


-----Original Message-----
From: Bob Hutchison <hutch@RedRock.com>
To: sablecc-list@sable.mcgill.ca <sablecc-list@sable.mcgill.ca>
Date: Thursday, May 28, 1998 2:11 AM
Subject: Tokens and free text


>Hi,
>
>Another question...
>
>Consider the following input:
>
>"to: i want this to work"
>
>The quotes are not part of the input.
>
>I've got this small grammar:
>
>
>Package simple;
>
>/*******************************************************************
> * Helpers                                                         *
> *******************************************************************/
>Helpers
>
>   c_colon    = ':';
>   c_cr       = 0x000d;
>   c_lf       = 0x000a;
>   char       = [0x00..0xff];
>
>/*******************************************************************
> * Tokens                                                          *
> *******************************************************************/
>Tokens
>
>   colon              = c_colon;
>   cr                 = c_cr;
>   lf                 = c_lf;
>   crlf               = c_cr c_lf;
>   text_chars         = char;
>
>   to_header          = 'to';
>   from_header        = 'from';
>
>/*******************************************************************
> * Productions                                                     *
> *******************************************************************/
>Productions
>
>   message         = lines*;
>   lines           = header colon text* crlf;
>   header          =   {to}    to_header
>                     | {from}  from_header
>                     ;
>
>   text            =   {text}  text_chars
>                     | {to}    to_header
>                     | {from}  from_header
>                     | {cr}    cr
>                     | {lf}    lf
>                     | {colon} colon
>                     ;
>
>
>The problem I have with this is the definition of the 'text' production.
>It seems I have to list every token here to recognise the text of the
>token.
>
>Now I really hope I'm missing something... or is this the way it is?
>
>thanks,
>Bob
>
>
>---
>Bob Hutchison, hutch@RedRock.com, (416) 878-3454
>RedRock, Toronto, Canada