Re: Grammar for URLs
Hi Kim,
> > Maybe I should tell you that regexps would not be
> > enough, because I'm analysing log files from a
> > usability tool, and they have certain complexity. They
> > include URLs, integers, dates, strings, quoted
> > strings, lists and tons of events information.
> The format remains the same.
I think what Cesar means is that the _logfiles_ contain all of the above
mentioned information. He doesn't mean that a single URL contains that
information.
> protocol://domain.tld/somesite.format?attr1=val1&attr2=val2...attrN=valN
That is actually not the complete format of a URL.
You have to be careful when dealing with this stuff if you want to do it
right, so I suggest reading RFC 1738 before writing the grammar.
In particular, URLs can look like this:
protocol://user:password@host:port/path
Where "user:password@", ":password", ":port" and "/path" are
optional. Note that you can have a URL like this:
protocol://@host:port/path
Which means an empty username and no password (which, per RFC 1738, is
different from having no username at all).
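To see the empty-versus-absent distinction concretely, here is a small sketch using Python's standard urllib.parse module. This is only an illustration: the helper name userinfo and the example host are made up, and urllib.parse is a general-purpose parser rather than a strict RFC 1738 implementation. In its results, None marks a component that is absent and '' marks one that is present but empty.

```python
from urllib.parse import urlsplit

def userinfo(url):
    """Return (username, password) as reported by urllib.parse.

    Hypothetical helper for illustration only.
    None means the component is absent; '' means present but empty.
    """
    parts = urlsplit(url)
    return parts.username, parts.password

# Example URLs are invented for this sketch.
for url in (
    "ftp://host.example:21/path",       # no user information at all
    "ftp://@host.example:21/path",      # "@" present, nothing before it
    "ftp://foo:@host.example:21/path",  # user name followed by a bare ":"
):
    print(url, userinfo(url))
```

A grammar for the log files probably needs to preserve the same three cases if the tool ever has to tell them apart.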
Also, URLs can belong to protocols that do not follow the regular
"host"-based pattern, for example:
news:comp.dcom.sys.cisco
or
mailto:someone@somewhere.invalid
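A quick way to check how such URLs decompose is again Python's urllib.parse (an illustrative sketch, not a full RFC 1738 parser; the helper name has_host_part is made up). URLs without "//" after the scheme have no host part at all, so everything after the colon ends up in the path:

```python
from urllib.parse import urlsplit

def has_host_part(url):
    """Hypothetical helper: True if the URL carries a '//host' section."""
    return urlsplit(url).netloc != ""

for url in (
    "news:comp.dcom.sys.cisco",
    "mailto:someone@somewhere.invalid",
    "http://www.mermaidconsulting.com/",
):
    parts = urlsplit(url)
    # For the non-host forms, netloc is empty and the rest lands in path.
    print(parts.scheme, repr(parts.netloc), repr(parts.path), has_host_part(url))
```

Note that the "@" in the mailto URL is part of the scheme-specific data, not a user/host separator, so a grammar has to decide on the scheme first.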
The easiest way to go about specifying this is simply to read
RFC 1738, which includes a BNF-like grammar for URLs.
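As a rough idea of what transcribing the host-based form might look like, here is a deliberately simplified Python sketch. It only covers scheme://[user[:password]@]host[:port][/path]; it does not validate host names, percent-encoding, or any scheme-specific rules, so treat it as a starting point rather than as the RFC grammar. The names URL_RE and parse_login_url are invented for the example.

```python
import re

# Simplified pattern, NOT the full RFC 1738 grammar:
#   scheme "://" [ user [ ":" password ] "@" ] host [ ":" port ] [ "/" path ]
URL_RE = re.compile(
    r"^(?P<scheme>[a-z0-9+.-]+)://"
    r"(?:(?P<user>[^:@/]*)(?::(?P<password>[^@/]*))?@)?"
    r"(?P<host>[^:/@]+)"
    r"(?::(?P<port>[0-9]+))?"
    r"(?:/(?P<path>.*))?$",
    re.IGNORECASE,
)

def parse_login_url(url):
    """Return a dict of components, or None if the URL is not host-based.

    Components that are absent come back as None; components that are
    present but empty come back as ''.
    """
    match = URL_RE.match(url)
    return match.groupdict() if match else None

print(parse_login_url("protocol://user:password@host:8080/path"))
print(parse_login_url("protocol://@host:80/path"))
print(parse_login_url("news:comp.dcom.sys.cisco"))  # not host-based
```

For log files that mix URLs with other token types, a pattern like this would more likely become one rule in the lexer than a standalone parser.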
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@mermaidconsulting.dk,
http://www.mermaidconsulting.com/