
Re: Tokens and free text

To: "Etienne Gagnon" <egagnon@sprynet.com>, sablecc-list@sable.mcgill.ca
Subject: Re: Tokens and free text
From: hutch@RedRock.com (Bob Hutchison)
Date: Thu, 28 May 1998 14:05:42 GMT
In-Reply-To: <008e01bd8a36$15cf0dc0$02e0aec7@default>
Organization: RedRock
References: <008e01bd8a36$15cf0dc0$02e0aec7@default>
Reply-To: hutch@RedRock.com

On Thu, 28 May 1998 08:42:30 -0400, you wrote:

>Bob.
>
>Many details...Why don't you create a system independent end-of-line? Why
>don't you use lexer states to detect the beginning of lines and get a better
>header recognition? This would allow you to narrow the definition of "text".
>

Lexer states are the way to go, I think. I realised this as I was waking
up this morning :-)

>The trick is to recognize the form "blank* 'to' blank* ':'" only at the
>beginning of a line. Inside the line (in normal state) you recognize only
>text chars.
>
>Helpers
>   colon    = ':';
>   cr       = 0x000d;
>   lf       = 0x000a;
>   cr_lf    = cr lf; // defined after cr and lf, which it uses
>   blank    = ' '; // you could add tabs,...
>   char     = [0x00..0xff];
>
>States
>  bol, normal;
>
>Tokens
>
>   {bol, normal->bol} eol = cr | lf | cr_lf;
>
>/* notice that helpers & tokens don't share the same name space */

Thanks for pointing this out (should have realised this when I couldn't
use helpers in the Productions).

>   {bol->normal, normal} text_chars         = char;
>
>   {bol->normal} to_header = blank* 'to' blank* colon;
>   {bol->normal} from_header = blank* 'from' blank* colon;
>
>/*******************************************************************
>* Productions                                                     *
>*******************************************************************/
>Productions
>
>   message         = lines*;
>   lines           = header text*;
>   header          =   {to}    to_header |
>                     {from}  from_header;
>
>   text            =   {text}  text_chars |
>                          {eol} eol;
>
>Anyway, this is another way to look at the problem.

So how do these states work? Is a token like 'to' recognised only in
the bol state, and when it is encountered, does the lexer shift to the
'normal' state?
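
To make the question concrete, here is a minimal sketch of how I understand
the mechanism, written as a small Python simulation rather than anything
SableCC actually generates. The names RULES and tokenize are my own
inventions for illustration, and the matching policy (longest match wins,
ties go to the token listed first) is my assumption about how the lexer
picks a token. Each rule lists the states it is active in and the state to
switch to once it matches, mirroring the {bol->normal} notation above.

```python
import re

# Hypothetical model of the quoted grammar, not SableCC output.
# Each rule: (token name, pattern, {state it is active in: state to enter next}).
RULES = [
    ("eol",         re.compile(r"\r\n|\r|\n"),  {"bol": "bol", "normal": "bol"}),
    ("text_chars",  re.compile(r"[\x00-\xff]"), {"bol": "normal", "normal": "normal"}),
    ("to_header",   re.compile(r" *to *:"),     {"bol": "normal"}),
    ("from_header", re.compile(r" *from *:"),   {"bol": "normal"}),
]


def tokenize(text, state="bol"):
    """Split text into (name, lexeme) pairs, tracking the lexer state."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, transitions in RULES:
            if state not in transitions:
                continue  # this token is not recognised in the current state
            match = pattern.match(text, pos)
            # Assumed policy: longest match wins; on a tie the earlier rule stays.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end(), transitions[state])
        if best is None:
            raise ValueError("no token matches at position %d" % pos)
        name, end, next_state = best
        tokens.append((name, text[pos:end]))
        pos, state = end, next_state
    return tokens


if __name__ == "__main__":
    for token in tokenize("to: i want this to work\r\n"):
        print(token)
```

In this model the leading "to:" comes out as a single to_header token because
the lexer starts in bol, while the later "to" is just two text_chars tokens
because to_header is not active in the normal state.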

If you remember my other post about case sensitivity, you might notice a
slight complication. If there weren't a separator (':' in my case)
between the header and the rest, would states still work? I still vote
for "to" :)

Is this documented anywhere? I still think I must be missing some
documentation.

Thanks,
Bob

>
>Etienne
>
>
>-----Original Message-----
>From: Bob Hutchison <hutch@RedRock.com>
>To: sablecc-list@sable.mcgill.ca <sablecc-list@sable.mcgill.ca>
>Date: Thursday, May 28, 1998 2:11 AM
>Subject: Tokens and free text
>
>
>>Hi,
>>
>>Another question...
>>
>>Consider the following input:
>>
>>"to: i want this to work"
>>
>>The quotes are not part of the input.
>>
>>I've got this small grammar:
>>
>>
>>Package simple;
>>
>>/*******************************************************************
>> * Helpers                                                         *
>> *******************************************************************/
>>Helpers
>>
>>   c_colon    = ':';
>>   c_cr       = 0x000d;
>>   c_lf       = 0x000a;
>>   char       = [0x00..0xff];
>>
>>/*******************************************************************
>> * Tokens                                                          *
>> *******************************************************************/
>>Tokens
>>
>>   colon              = c_colon;
>>   cr                 = c_cr;
>>   lf                 = c_lf;
>>   crlf               = c_cr c_lf;
>>   text_chars         = char;
>>
>>   to_header          = 'to';
>>   from_header        = 'from';
>>
>>/*******************************************************************
>> * Productions                                                     *
>> *******************************************************************/
>>Productions
>>
>>   message         = lines*;
>>   lines           = header colon text* crlf;
>>   header          =   {to}    to_header
>>                     | {from}  from_header
>>                     ;
>>
>>   text            =   {text}  text_chars
>>                     | {to}    to_header
>>                     | {from}  from_header
>>                     | {cr}    cr
>>                     | {lf}    lf
>>                     | {colon} colon
>>                     ;
>>
>>
>>The problem I have with this is the definition of the 'text' production.
>>It seems I have to list every token here to recognise the text of the
>>token.
>>
>>Now I really hope I'm missing something... or is this the way it is?
>>
>>thanks,
>>Bob
>>
>>
>>---
>>Bob Hutchison, hutch@RedRock.com, (416) 878-3454
>>RedRock, Toronto, Canada
>

---
Bob Hutchison, hutch@RedRock.com, (416) 878-3454
RedRock, Toronto, Canada
