Re: Grammar for URLs



Hi Kim,

> > Maybe I should tell you that regexps would not be
> > enough, because I'm analysing log files from a
> > usability tool, and they have a certain complexity. They
> > include URLs, integers, dates, strings, quoted
> > strings, lists and tons of event information.

> The format remains the same.

I think what Cesar means is that the _logfiles_ contain all of the above
mentioned information. He doesn't mean that a single URL contains that
information.

> protocol://domain.tld/somesite.format?attr1=val1&attr2=val2...attrN=valN

That is actually not the complete format of a URL.

You have to be careful when dealing with this stuff if you want to do it
right. So I suggest reading RFC 1738 before writing the grammar.

In particular, URLs can look like this:

  protocol://user:password@host:port/path

Where "user:password@", ":password", ":port" and "/path" are optional.
Note that this means you can have a URL like this:

  protocol://@host:port/path

Which means an empty username and no password (which is completely
different from having no username at all). Likewise,
protocol://user:@host has a username and an empty password.
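
As a rough illustration of the optional parts above, here is a minimal
Python sketch. It is not the RFC 1738 grammar: the pattern COMMON_URL and
the helper split_url are hypothetical names, and the character classes
are simplified assumptions. It only shows how "absent" (None) and
"empty" ('') can be kept apart.

```python
import re

# Simplified sketch of: scheme://[user[:password]@]host[:port][/path]
# Assumption: the character classes are much looser than RFC 1738's.
COMMON_URL = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?:(?P<user>[^:@/]*)(?::(?P<password>[^@/]*))?@)?"  # optional login
    r"(?P<host>[^:/]+)"
    r"(?::(?P<port>\d+))?"                                # optional port
    r"(?P<path>/.*)?$"                                    # optional path
)

def split_url(url):
    """Return a dict of URL parts, or None if the pattern does not match.

    A part that is absent is None; a part that is present but empty is ''.
    """
    match = COMMON_URL.match(url)
    return match.groupdict() if match else None

print(split_url("ftp://@host:21/path"))  # user is '', password is None
print(split_url("ftp://host/path"))      # user is None
print(split_url("ftp://foo:@host/"))     # user is 'foo', password is ''
```

A real grammar would also have to handle escapes and the per-scheme
rules, which is exactly why reading the RFC first pays off.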

Also, URLs can use protocols that do not follow the regular
"host"-based pattern, for example:

  news:comp.dcom.sys.cisco

or

  mailto:someone@somewhere.invalid
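
One way to cope with both kinds is to split off the scheme first and only
then decide how to read the rest. The sketch below assumes that a "//"
right after the scheme marks the host-based form; classify is a
hypothetical helper name, not something taken from the RFC.

```python
def classify(url):
    """Split a URL into (scheme, kind, rest).

    Assumption: a rest starting with '//' is treated as host-based;
    anything else is left as an opaque scheme-specific part.
    """
    scheme, sep, rest = url.partition(":")
    if not sep or not scheme:
        raise ValueError("no scheme found in %r" % url)
    kind = "host-based" if rest.startswith("//") else "scheme-specific"
    return scheme, kind, rest

print(classify("news:comp.dcom.sys.cisco"))
# ('news', 'scheme-specific', 'comp.dcom.sys.cisco')
print(classify("mailto:someone@somewhere.invalid"))
# ('mailto', 'scheme-specific', 'someone@somewhere.invalid')
print(classify("http://www.mermaidconsulting.com/"))
# ('http', 'host-based', '//www.mermaidconsulting.com/')
```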

The easiest way to go about specifying this is simply to read RFC 1738,
which includes a BNF-like grammar for URLs.

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@mermaidconsulting.dk,
http://www.mermaidconsulting.com/