URL parsing in HTML5


URL parsing in HTML5

Peter Saint-Andre-2
After chatting during TPAC 2011 with Addison, Larry, Richard, Ian, Mike,
Ted, Julian (etc.), I'd like to share some thoughts about a possible
compromise / resolution regarding Issue 56 in the HTML WG:

http://www.w3.org/html/wg/tracker/issues/56

Some observations and opinions:

1. It is unlikely that existing browsers will change their current URL
parsing behavior. (I am not judging whether that behavior is good or bad.)

2. Documentation of that behavior is out of scope for the revisions to
RFC 3987, and outside the charter of the IRI WG, because it's a matter
of URI [pre-]processing (RFC 3986) and not IRI processing (RFC 3987).

3. It is unlikely that RFC 3986 will ever be modified to recommend the
current behavior, and simply impossible before HTML5 is advanced at the
W3C (even if such modifications were desirable).

4. As far as I can see, the current behavior is in fact out of scope for
RFC 3986 and any future possible revisions to RFC 3986 because:

   (a) it is mostly or completely a matter of pre-processing of strings
   that look like URIs/URLs/"web-addresses" -- we could call these
   "candidate strings" or "proto-URLs" or somesuch to disambiguate them
   from URIs

   (b) this pre-processing behavior is applied only in the web context
   by browsers and software applications that want to be consistent
   with browsers

   (c) because of (b), there is no great danger that this behavior will
   "leak" into processing of URIs in general (mailto:, sip:, tel:,
   URNs, and so on)

5. There's no necessity for work on documentation of the current URL
parsing behavior to happen at the IETF, given that it's out of scope for
the IRI WG. Although this work could be done as an individual (non-WG)
I-D at the IETF, I think it could more easily be done at the W3C, either
as part of the HTML specification or as a separate document (the latter
might be preferable so that it can be reviewed in a more focused manner
and referenced more easily by other W3C specifications, but naturally I
would leave such decisions up to folks at the W3C). [The IRI WG is still
responsible for rfc3987bis, but that's off-topic for this email message.]

If folks can agree on the foregoing points, then I think it would be
productive to work on proposed revisions to the current text (or at
least what I believe is the current text):

http://www.w3.org/TR/html5/Overview.html#parsing-urls

I would be happy to make concrete suggestions during that revision
process if someone from the W3C could point to the preferred venue or
process (e.g., wiki page or bugzilla comments).

I look forward to discussing this further tomorrow morning during the
HTML WG session:

http://lists.w3.org/Archives/Public/public-html/2011Nov/0013.html

Peter

--
Peter Saint-Andre
https://stpeter.im/




Re: URL parsing in HTML5

Martin J. Dürst
Hello Peter, others,

On 2011/11/04 13:21, Peter Saint-Andre wrote:

> After chatting during TPAC 2011 with Addison, Larry, Richard, Ian, Mike,
> Ted, Julian (etc.), I'd like to share some thoughts about a possible
> compromise / resolution regarding Issue 56 in the HTML WG:
>
> http://www.w3.org/html/wg/tracker/issues/56
>
> Some observations and opinions:
>
> 1. It is unlikely that existing browsers will change their current URL
> parsing behavior. (I am not judging whether that behavior is good or bad.)

I agree that it's very unlikely that they change it in areas where they
all agree on a particular behavior. Discussions in the IRI WG have often
very quickly come up with examples where major browsers differ, and (at
least) in these areas, some change seems desirable.


> 2. Documentation of that behavior is out of scope for the revisions to
> RFC 3987, and outside the charter of the IRI WG, because it's a matter
> of URI [pre-]processing (RFC 3986) and not IRI processing (RFC 3987).

I have to say that I'm very surprised to see such an "out of scope"
statement. Of course, I haven't been party to the discussions you
mention, and I admit that coming from you as the responsible Area
Director, such a statement carries a lot of weight.

However, as far as I can remember, the issue of how browsers deal with
IRIs was always an important part in the deliberations that led up to
the formation of the WG, and also during the WG.

Also, saying that browsers do URI (pre-)processing but not IRI
(pre-)processing surprises me quite a lot, because the single most
important difference between URIs and IRIs is that the latter allow
non-ASCII characters, and browsers definitely do that. This is despite
the fact that the HTML5 spec likes to call these "URL"s (which is
neither URI nor IRI).


> 3. It is unlikely that RFC 3986 will ever be modified to recommend the
> current behavior, and simply impossible before HTML5 is advanced at the
> W3C (even if such modifications were desirable).

Fully agreed.


> 4. As far as I can see, the current behavior is in fact out of scope for
> RFC 3986 and any future possible revisions to RFC 3986 because:
>
>     (a) it is mostly or completely a matter of pre-processing of strings
>     that look like URIs/URLs/"web-addresses" -- we could call these
>     "candidate strings" or "proto-URLs" or somesuch to disambiguate them
>     from URIs
>
>     (b) this pre-processing behavior is applied only in the web context
>     by browsers and software applications that want to be consistent
>     with browsers
>
>     (c) because of (b), there is no great danger that this behavior will
>     "leak" into processing of URIs in general (mailto:, sip:, tel:,
>     URNs, and so on)

Mostly agree, except for (c). URI/IRI/URL processing isn't a matter of
schemes; browsers handle mailto: schemes, and some deal with tel:
schemes and others.


> 5. There's no necessity for work on documentation of the current URL
> parsing behavior to happen at the IETF, given that it's out of scope for
> the IRI WG.

As said above, I disagree with the latter part of the sentence, and
therefore have to disagree with the overall conclusion.

It may very well be that for various and potentially even very good
reasons, it is better to do this work somewhere else than at the IETF,
but "it's out of scope for the IRI WG" doesn't really make a good
reason, because the IRI WG was formed and until now has worked under the
assumption that it's in scope.

[That the IRI WG hasn't made much progress on this issue may be a good
reason to decide it shouldn't be part of the IRI WG's work, and should be
done somewhere else, but that would be a different reason.]


Regards,   Martin.

> Although this work could be done as an individual (non-WG)
> I-D at the IETF, I think it could more easily be done at the W3C, either
> as part of the HTML specification or as a separate document (the latter
> might be preferable so that it can be reviewed in a more focused manner
> and referenced more easily by other W3C specifications, but naturally I
> would leave such decisions up to folks at the W3C). [The IRI WG is still
> responsible for rfc3987bis, but that's off-topic for this email message.]
>
> If folks can agree on the foregoing points, then I think it would be
> productive to work on proposed revisions to the current text (or at
> least what I believe is the current text):
>
> http://www.w3.org/TR/html5/Overview.html#parsing-urls
>
> I would be happy to make concrete suggestions during that revision
> process if someone from the W3C could point to the preferred venue or
> process (e.g., wiki page or bugzilla comments).
>
> I look forward to discussing this further tomorrow morning during the
> HTML WG session:
>
> http://lists.w3.org/Archives/Public/public-html/2011Nov/0013.html
>
> Peter
>
> --
> Peter Saint-Andre
> https://stpeter.im/
>
>
>
>


Re: URL parsing in HTML5

Peter Saint-Andre-2
Hi Martin, thank you for your comments. I have time for only a quick
reply at the moment.

On 11/3/11 10:23 PM, "Martin J. Dürst" wrote:

> Hello Peter, others,
>
> On 2011/11/04 13:21, Peter Saint-Andre wrote:
>> After chatting during TPAC 2011 with Addison, Larry, Richard, Ian, Mike,
>> Ted, Julian (etc.), I'd like to share some thoughts about a possible
>> compromise / resolution regarding Issue 56 in the HTML WG:
>>
>> http://www.w3.org/html/wg/tracker/issues/56
>>
>> Some observations and opinions:
>>
>> 1. It is unlikely that existing browsers will change their current URL
>> parsing behavior. (I am not judging whether that behavior is good or
>> bad.)
>
> I agree that it's very unlikely that they change it in areas where they
> all agree on a particular behavior. Discussions in the IRI WG have often
> very quickly come up with examples where major browsers differ, and (at
> least) in these areas, some change seems desirable.

Yes, that's true.

>> 2. Documentation of that behavior is out of scope for the revisions to
>> RFC 3987, and outside the charter of the IRI WG, because it's a matter
>> of URI [pre-]processing (RFC 3986) and not IRI processing (RFC 3987).
>
> I have to say that I'm very surprised to see such an "out of scope"
> statement. Of course, I haven't been party to the discussions you
> mention, and I admit that coming from you as the responsible Area
> Director, such a statement carries a lot of weight.

You are right. I was voicing my impression from discussions this week.
My impression might be wrong.

> However, as far as I can remember, the issue of how browsers deal with
> IRIs was always an important part in the deliberations that led up to
> the formation of the WG, and also during the WG.
>
> Also, saying that browsers do URI (pre-)processing but not IRI
> (pre-)processing surprises me quite a lot, because the single most
> important difference between URIs and IRIs is that the latter allow
> non-ASCII characters, and browsers definitely do that. This is despite
> the fact that the HTML5 spec likes to call these "URL"s (which is
> neither URI nor IRI).

I meant among other things that behaviors like "remove whitespace from
the front and back of a proto-URL" are not specific to URIs or IRIs
because they are a matter of preprocessing. You are right that HTML
files can include UTF-8 encoded Unicode characters, so in theory we are
talking about IRI processing. However, many of the heuristics appear to
be related to things other than percent-encoding and such, so the topics
cross many boundaries. I think this has led to much of the confusion
about roles and responsibilities.
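
As a rough illustration of the kind of preprocessing meant here, the sketch below strips leading and trailing whitespace from a candidate string before handing it to a generic parser. It assumes Python's standard urllib.parse as the generic parser. The names ASCII_WHITESPACE, preprocess, and parse_candidate are hypothetical, and the exact set of characters that browsers strip is an assumption, not something established in this thread.

```python
from urllib.parse import urlsplit

# Assumption: the characters treated as strippable whitespace.
# Real browser behavior may use a different set.
ASCII_WHITESPACE = " \t\n\f\r"


def preprocess(candidate):
    """Strip leading and trailing whitespace from a candidate string.

    This step knows nothing about URI or IRI syntax; it only trims
    the ends of the string and leaves the interior untouched.
    """
    return candidate.strip(ASCII_WHITESPACE)


def parse_candidate(candidate):
    """Preprocess first, then hand the result to a generic parser."""
    return urlsplit(preprocess(candidate))


parts = parse_candidate("  http://example.com/path?q=1 \n")
print(parts.scheme, parts.netloc, parts.path, parts.query)
```

The point of keeping preprocess separate is that the trimming step could be documented on its own, independently of whatever parser follows it.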

>> 3. It is unlikely that RFC 3986 will ever be modified to recommend the
>> current behavior, and simply impossible before HTML5 is advanced at the
>> W3C (even if such modifications were desirable).
>
> Fully agreed.
>
>
>> 4. As far as I can see, the current behavior is in fact out of scope for
>> RFC 3986 and any future possible revisions to RFC 3986 because:
>>
>>     (a) it is mostly or completely a matter of pre-processing of strings
>>     that look like URIs/URLs/"web-addresses" -- we could call these
>>     "candidate strings" or "proto-URLs" or somesuch to disambiguate them
>>     from URIs
>>
>>     (b) this pre-processing behavior is applied only in the web context
>>     by browsers and software applications that want to be consistent
>>     with browsers
>>
>>     (c) because of (b), there is no great danger that this behavior will
>>     "leak" into processing of URIs in general (mailto:, sip:, tel:,
>>     URNs, and so on)
>
> Mostly agree, except for (c). URI/IRI/URL processing isn't a matter of
> schemes; browsers handle mailto: schemes, and some deal with tel:
> schemes and others.

What I meant to say is that, given how many browsers are implemented,
such specialized "web processing" would not necessarily leak into
generic URI parsing code. This hunch would need to be validated, but my
sense is that some folks have been worried that documenting browser
behavior would cause all URIs/IRIs in all applications to be processed
in ways that are not fully consistent with RFC 3986 / RFC 3987. Right
now I think such a fear might be misplaced.

>> 5. There's no necessity for work on documentation of the current URL
>> parsing behavior to happen at the IETF, given that it's out of scope for
>> the IRI WG.
>
> As said above, I disagree with the latter part of the sentence, and
> therefore have to disagree with the overall conclusion.
>
> It may very well be that for various and potentially even very good
> reasons, it is better to do this work somewhere else than at the IETF,
> but "it's out of scope for the IRI WG" doesn't really make a good
> reason, because the IRI WG was formed and until now has worked under the
> assumption that it's in scope.
>
> [That the IRI WG hasn't made much progress on this issue may be a good
> reason to decide it shouldn't be part of the IRI WG's work, and should be
> done somewhere else, but that would be a different reason.]

I see your point and I look forward to discussing the matter further in
an open fashion so that we can figure out a way forward.

Peter

--
Peter Saint-Andre
https://stpeter.im/




Re: URL parsing in HTML5

Anne van Kesteren-2
In reply to this post by Peter Saint-Andre-2
On Thu, 03 Nov 2011 21:21:50 -0700, Peter Saint-Andre <[hidden email]>  
wrote:
> [...]

The outcome you sketch will also result in all other W3C specifications
implemented by browsers (and even HTTP, if it were to be defined in a
non-fiction manner) depending on HTML for their definition of URL processing.

And as stated before I think user agents other than browsers are already  
affected when it comes to e.g. Location header handling if that header  
were to contain a space somewhere or an "invalid" character. URLs leak
and as such the way they have been implemented in browsers leaks too. E.g.  
search engines will most certainly want to implement them identically.
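
To make the Location example concrete, here is a minimal sketch contrasting a strict character check with a lenient repair step. The function names and the choice to percent-encode offending characters as UTF-8 are assumptions for illustration only. The thread does not establish what any particular user agent does, and host names are not treated specially here.

```python
import string
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so existing escapes survive.
RESERVED_AND_PERCENT = ":/?#[]@!$&'()*+,;=%"
# RFC 3986 unreserved characters.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
ALLOWED = set(RESERVED_AND_PERCENT) | set(UNRESERVED)


def has_only_uri_characters(value):
    """Strict view: every character must be one RFC 3986 permits."""
    return all(ch in ALLOWED for ch in value)


def lenient_escape(value):
    """Lenient view (assumed): percent-encode anything not permitted.

    quote() leaves letters, digits, "_.-~" and the given safe
    characters alone and encodes the rest as UTF-8 octets.
    """
    return quote(value, safe=RESERVED_AND_PERCENT)


location = "http://example.com/a b"
print(has_only_uri_characters(location))
print(lenient_escape(location))
```

A consumer that applies only the strict check rejects the header value, while one that applies the lenient step follows the redirect, which is how behavior implemented in one place ends up mattering elsewhere.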

The only piece of software I can think of where it does not matter is a  
piece of software that deals with walled garden content and when it comes  
to the web I think that should be the least of our priorities.


--
Anne van Kesteren
http://annevankesteren.nl/


Re: URL parsing in HTML5

Julian Reschke
On 2011-11-04 16:34, Anne van Kesteren wrote:
> On Thu, 03 Nov 2011 21:21:50 -0700, Peter Saint-Andre
> <[hidden email]> wrote:
>> [...]
>
> The outcome you sketch will also result in all other W3C specifications
> implemented by browsers (and even HTTP, if it were to be defined in a
> non-fiction manner) depending on HTML for their definition of URL processing.

Please stop the "fiction" rhetoric. There's also a lot of fiction in
HTML5 (such as requiring rewriting of \ for all URI schemes), and I
don't see you arguing about *that*.

I do agree that URIs leak, but that doesn't necessarily mean that we can
have the same processing requirements everywhere. For instance, there
are cases where whitespace acts as a delimiter and thus will not be
accepted as URI character, no matter how much you want it to.

> ...

Best regards, Julian



Re: URL parsing in HTML5

Anne van Kesteren-2
On Fri, 04 Nov 2011 08:50:07 -0700, Julian Reschke <[hidden email]>  
wrote:
> On 2011-11-04 16:34, Anne van Kesteren wrote:
>> The outcome you sketch will also result in all other W3C specifications
>> implemented by browsers (and even HTTP, if it were to be defined in a
>> non-fiction manner) depending on HTML for their definition of URL
>> processing.
>
> Please stop the "fiction" rhetoric. There's also a lot of fiction in  
> HTML5 (such as requiring rewriting of \ for all URI schemes), and I  
> don't see you arguing about *that*.

I think it only rewrites it for a certain class of URL schemes, but the  
details of URL processing are beside the point here. The point is that
URL processing should be uniform. What the exact details of URL processing  
should be is indeed not completely figured out just yet, but it is clear  
that the IETF specifications on the matter are fiction.
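
For readers unfamiliar with the backslash rewriting under discussion, a minimal sketch of rewriting limited to a class of schemes might look like the following. The set SERVER_BASED_SCHEMES and the function name are hypothetical stand-ins chosen for illustration. Which schemes actually belong to that class, and how relative references are handled, is not settled here.

```python
# Assumption: an illustrative stand-in for the class of schemes
# that gets backslash rewriting. The real list may differ.
SERVER_BASED_SCHEMES = {"http", "https", "ftp"}


def rewrite_backslashes(candidate):
    """Replace "\\" with "/" only when the scheme is in the chosen class.

    Strings without a recognizable scheme are returned unchanged,
    which is a simplification: a fuller version would first resolve
    them against a base URL.
    """
    scheme, sep, rest = candidate.partition(":")
    if sep and scheme.lower() in SERVER_BASED_SCHEMES:
        return scheme + sep + rest.replace("\\", "/")
    return candidate


print(rewrite_backslashes("http:\\\\example.com\\a"))
print(rewrite_backslashes("mailto:a\\b@example.com"))
```

In this sketch the mailto: string passes through untouched, which is the distinction between rewriting for all schemes and rewriting for a certain class of them.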


> I do agree that URIs leak, but that doesn't necessarily mean that we can  
> have the same processing requirements everywhere. For instance, there  
> are cases where whitespace acts as a delimiter and thus will not be  
> accepted as URI character, no matter how much you want it to.

You keep bringing this example up and I will remind you once again that  
obviously you would have to split on whitespace characters first in such  
cases. This does not affect uniform URL processing in the slightest,
it just means we should either require whitespace characters in URLs to  
always be escaped, or require whitespace characters in URLs to be escaped  
in cases where URLs are whitespace separated.
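
A small sketch of the two-step approach described above: split the attribute value on whitespace first, then run the same per-URL processing on each token. The function names are hypothetical, Python's urllib.parse stands in for "uniform URL processing", and str.split() with no arguments is assumed to be an acceptable definition of whitespace for the purpose of the example.

```python
from urllib.parse import urlsplit


def split_url_list(value):
    """Step 1: split a whitespace-separated list into tokens.

    str.split() with no arguments drops leading and trailing
    whitespace and treats any run of whitespace as one separator.
    """
    return value.split()


def parse_url_list(value):
    """Step 2: apply the same per-URL processing to every token."""
    return [urlsplit(token) for token in split_url_list(value)]


# An escaped space stays inside its URL.
print(split_url_list(" http://a.example/x  http://b.example/y%20z\n"))
# A raw space is taken as a separator, so the URL falls apart.
print(split_url_list("http://a.example/a b"))
```

The second call shows why whitespace inside a URL would have to be escaped in any context where URLs are whitespace separated.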


--
Anne van Kesteren
http://annevankesteren.nl/


Re: URL parsing in HTML5

Elisabeth Robson
In reply to this post by Julian Reschke
Sorry, joining the conversation a bit late, can you explain more specifically what you mean by URIs "leaking" in this context?

Thanks

On Fri, Nov 4, 2011 at 8:50 AM, Julian Reschke <[hidden email]> wrote:

I do agree that URIs leak, but that doesn't necessarily mean that we can have the same processing requirements everywhere. For instance, there are cases where whitespace acts as a delimiter and thus will not be accepted as URI character, no matter how much you want it to.


Re: URL parsing in HTML5

Julian Reschke
On 2011-11-04 16:59, Elisabeth Robson wrote:
> Sorry, joining the conversation a bit late, can you explain more
> specifically what you mean by URIs "leaking" in this context?
> ...

By "leaking" we mean that once an identifier "works" inside an HTML
link, it's likely to surface in other protocol elements (such as the HTTP
Location header field) or API parameters (such as XMLHttpRequest).

Best regards, Julian


Re: URL parsing in HTML5

Julian Reschke
In reply to this post by Anne van Kesteren-2
On 2011-11-04 16:58, Anne van Kesteren wrote:

> On Fri, 04 Nov 2011 08:50:07 -0700, Julian Reschke
> <[hidden email]> wrote:
>> On 2011-11-04 16:34, Anne van Kesteren wrote:
>>> The outcome you sketch will also result in all other W3C specifications
>>> implemented by browsers (and even HTTP, if it were to be defined in a
>>> non-fiction manner) depending on HTML for their definition of URL
>>> processing.
>>
>> Please stop the "fiction" rhetoric. There's also a lot of fiction in
>> HTML5 (such as requiring rewriting of \ for all URI schemes), and I
>> don't see you arguing about *that*.
>
> I think it only rewrites it for a certain class of URL schemes, but the

...yes: "If result uses a scheme with a server-based naming authority..."

> details of URL processing are beside the point here. The point is that
> URL processing should be uniform. What the exact details of URL
> processing should be is indeed not completely figured out just yet, but
> it is clear that the IETF specifications on the matter are fiction.

Well, so is what the HTML spec says. The problem is to pretend that it's
possible to agree on the same error handling for everybody.

We spent tons of emails on the IRI mailing list to figure out *which*
"willful violations" of RFC 3986 UA implementers agree on, and didn't
really find a lot.

>> I do agree that URIs leak, but that doesn't necessarily mean that we
>> can have the same processing requirements everywhere. For instance,
>> there are cases where whitespace acts as a delimiter and thus will not
>> be accepted as URI character, no matter how much you want it to.
>
> You keep bringing this example up and I will remind you once again that
> obviously you would have to split on whitespace characters first in such
> cases. This does not affect uniform URL processing in the slightest,
> it just means we should either require whitespace characters in URLs to
> always be escaped, or require whitespace characters in URLs to be
> escaped in cases where URLs are whitespace separated.

It means that you have at least *two* processing algorithms, no matter
how you rephrase it .-)

Best regards, Julian


Re: URL parsing in HTML5

Anne van Kesteren-2
On Fri, 04 Nov 2011 09:25:15 -0700, Julian Reschke <[hidden email]>  
wrote:
> On 2011-11-04 16:58, Anne van Kesteren wrote:
>> details of URL processing are beside the point here. The point is that
>> URL processing should be uniform. What the exact details of URL
>> processing should be is indeed not completely figured out just yet, but
>> it is clear that the IETF specifications on the matter are fiction.
>
> Well, so is what the HTML spec says. The problem is to pretend that it's
> possible to agree on the same error handling for everybody.

We have crossed that bridge for much more complex problems, such as HTML  
parsing, so I think it should be doable.


> We spent tons of emails on the IRI mailing list to figure out *which*
> "willful violations" of RFC 3986 UA implementers agree on, and didn't
> really find a lot.

That does not mean we do not want to converge.


>> You keep bringing this example up and I will remind you once again that
>> obviously you would have to split on whitespace characters first in such
>> cases. This does not affect uniform URL processing in the slightest,
>> it just means we should either require whitespace characters in URLs to
>> always be escaped, or require whitespace characters in URLs to be
>> escaped in cases where URLs are whitespace separated.
>
> It means that you have at least *two* processing algorithms, no matter  
> how you rephrase it .-)

Yes, you need two algorithms because standalone URLs and whitespace  
separated URLs are distinct. What are you trying to say?


--
Anne van Kesteren
http://annevankesteren.nl/


Re: URL parsing in HTML5

Michael[tm] Smith
In reply to this post by Peter Saint-Andre-2
Peter Saint-Andre <[hidden email]>, 2011-11-03 21:21 -0700:

> If folks can agree on the foregoing points, then I think it would be
> productive to work on proposed revisions to the current text (or at
> least what I believe is the current text):
>
> http://www.w3.org/TR/html5/Overview.html#parsing-urls
>
> I would be happy to make concrete suggestions during that revision
> process if someone from the W3C could point to the preferred venue or
> process (e.g., wiki page or bugzilla comments).

For proposed revisions to the HTML spec, the preferred place is the HTML WG
product in the W3C bugzilla -

  http://w3.org/brief/MjA2

--
Michael[tm] Smith
http://people.w3.org/mike/+


RE: URL parsing in HTML5

Paul Cotton
I just thought that I would add that the WebSockets API document also contains "some" information on parsing of URLs:
http://dev.w3.org/html5/websockets/#parse-a-websocket-url-s-components 

/paulc

Paul Cotton, Microsoft Canada
17 Eleanor Drive, Ottawa, Ontario K2E 6A3
Tel: (425) 705-9596 Fax: (425) 936-7329


-----Original Message-----
From: Michael[tm] Smith [mailto:[hidden email]]
Sent: Friday, November 04, 2011 1:40 PM
To: Peter Saint-Andre
Cc: [hidden email]; [hidden email]; Sam Ruby; Paul Cotton; Ian Hickson; Adam Barth; Edward O'Connor
Subject: Re: URL parsing in HTML5

Peter Saint-Andre <[hidden email]>, 2011-11-03 21:21 -0700:

> If folks can agree on the foregoing points, then I think it would be
> productive to work on proposed revisions to the current text (or at
> least what I believe is the current text):
>
> http://www.w3.org/TR/html5/Overview.html#parsing-urls
>
> I would be happy to make concrete suggestions during that revision
> process if someone from the W3C could point to the preferred venue or
> process (e.g., wiki page or bugzilla comments).

For proposed revisions to the HTML spec, the preferred place is the HTML WG product in the W3C bugzilla -

  http://w3.org/brief/MjA2

--
Michael[tm] Smith
http://people.w3.org/mike/+

