Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Martin J. Dürst
Hello Anne,

[Removed [hidden email], because I'm discussing details.]

Sorry to be late with answering. I'm blaming a conference and the
followup jetlag.

On 2012/10/25 22:54, Anne van Kesteren wrote:

> I aligned it with IRI now,

Great.

> apart from private Unicode ranges. Not
> really sure why we should ban them in one place and not in another.

Private Unicode ranges were originally banned everywhere, because they
are not intended for public interchange. We allowed them in the query
part, because sometimes you may want to use them as a payload. That's
how we got to where we are. [If it interests you, this happened in
August 2003, see http://tools.ietf.org/html/draft-duerst-iri-03 and
http://tools.ietf.org/rfcdiff?url2=draft-duerst-iri-03.txt.]

If you have a good reason to change that, please tell us.

Looking at the bigger picture, there are literally dozens of groups of
characters/codepoints like private use characters in Unicode that are
almost never used, and almost always a bad idea, in IRIs. We could spend
lots of hours discussing the merit of including or excluding them, but I
think we can use our time for better stuff.

Regards,   Martin.


Marginal codepoints in IRIs/URLs

Martin J. Dürst
Hello Anne,

On 2012/11/05 19:31, Anne van Kesteren wrote:

> On Mon, Nov 5, 2012 at 10:53 AM, "Martin J. Dürst"
> <[hidden email]>  wrote:
>> Private Unicode ranges were originally banned everywhere, because they are
>> not intended for public interchange. We allowed them in the query part,
>> because sometimes you may want to use them as a payload. That's how we got
>> to where we are. [If it interests you, this happened in August 2003, see
>> http://tools.ietf.org/html/draft-duerst-iri-03 and
>> http://tools.ietf.org/rfcdiff?url2=draft-duerst-iri-03.txt.]
>>
>> If you have a good reason to change that, please tell us.
>
> Alignment with HTML.

> There's actually another change required for
> that, see https://www.w3.org/Bugs/Public/show_bug.cgi?id=19743 for
> details.

That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
strictly reserved for internal processing; I think MS Word, among others,
uses these. A browser that wanted to use these to simplify internal
implementation would have trouble accepting them from the outside. Of
course, in IRIs/URLs, they make even less sense. I'd have somebody with
some MS software try and see what happens if they open an HTML document
with some of these inside.

U+FFFD is the replacement character. It's difficult to disallow that in
the text. In identifiers, it doesn't make much sense.

> I'm also happy for HTML to change, but it seems to me that
> for code points higher than U+007F we should have some kind of
> consistent set of rules across syntaxes, unless the code points
> are problematic for that particular format.

Yes, that makes sense. On the other hand, for implementers that work
independently of HTML, you need a standalone definition.

>> Looking at the bigger picture, there are literally dozens of groups of
>> characters/codepoints like private use characters in Unicode that are almost
>> never used, and almost always a bad idea, in IRIs. We could spend lots of
>> hours discussing the merit of including or excluding them, but I think we
>> can use our time for better stuff.
>
> I'm not interested in a code-point-by-code-point discussion, just the
> bigger picture, and consistency in requirements across the formats we
> develop.

Consistency across formats is definitely a good thing. But there are
some serious differences between text and identifiers. Something that's
harmless in text (e.g. a zero-width space) may be hopeless in an IRI/URL
(because it creates a different address, leading to confusion).
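
A minimal sketch of the zero-width space problem, in Python. The host and path are made up for illustration; the only point is that two strings that render identically are different addresses.

```python
# Illustrative only: the URL below is hypothetical.
plain = "http://example.org/my-page"
# Same text, with U+200B ZERO WIDTH SPACE inserted after "my".
sneaky = "http://example.org/my\u200b-page"


def extra_code_points(a, b):
    """Return the code points of b that do not occur anywhere in a."""
    return [f"U+{ord(ch):04X}" for ch in b if ch not in a]


print(plain == sneaky)                   # the two strings compare unequal
print(extra_code_points(plain, sneaky))  # shows which code point differs
```

In running text the extra code point would be invisible and harmless; as an identifier it names a different resource.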

Of course, I have to admit that in the IRI spec, we only excluded the
most egregious of these (private use characters in most parts,
U+FFF0-FFFD everywhere).

Actually, the characters that I currently would like to exclude most
(not just in a spec, but actually in the browser implementations) are
bidi control characters. RFC 3987 disallows them, but not in the syntax.
Moving the restrictions to the syntax would give them more prominence.
Allowing them in IRIs/URLs is just a wide open door for scams and phishers.
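
To make the bidi case concrete, here is a rough Python sketch of a check that rejects the bidirectional formatting characters listed in RFC 3987, section 4.1 (LRM, RLM, LRE, RLE, LRO, RLO, PDF). The function names and the example URL are hypothetical, and this is not how any browser actually handles them.

```python
# Bidi formatting characters named in RFC 3987, section 4.1.
RFC3987_BIDI = {
    0x200E,  # LRM, LEFT-TO-RIGHT MARK
    0x200F,  # RLM, RIGHT-TO-LEFT MARK
    0x202A,  # LRE, LEFT-TO-RIGHT EMBEDDING
    0x202B,  # RLE, RIGHT-TO-LEFT EMBEDDING
    0x202C,  # PDF, POP DIRECTIONAL FORMATTING
    0x202D,  # LRO, LEFT-TO-RIGHT OVERRIDE
    0x202E,  # RLO, RIGHT-TO-LEFT OVERRIDE
}


def find_bidi_controls(iri):
    """Return (index, 'U+XXXX') for each bidi formatting character in iri."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(iri)
        if ord(ch) in RFC3987_BIDI
    ]


def reject_bidi_controls(iri):
    """Raise ValueError if iri contains a bidi formatting character."""
    hits = find_bidi_controls(iri)
    if hits:
        raise ValueError(f"bidi formatting characters found: {hits}")
    return iri


# Hypothetical example: RLO makes the following characters display reversed.
suspicious = "http://example.com/\u202egpj.exe"
print(find_bidi_controls(suspicious))
```

Putting the restriction in the syntax would amount to running a check like this at parse time instead of leaving it to a prose requirement.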

Regards,   Martin.


Re: Marginal codepoints in IRIs/URLs

Anne van Kesteren
On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
<[hidden email]> wrote:
> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
> strictly reserved for internal processing; I think MS Word, among others, uses
> these. A browser that wanted to use these to simplify internal
> implementation would have trouble accepting them from the outside.

Given the way strings in browsers are really 16-bit code units
(Mozilla's Rust might change that, I hear) with no restrictions I
doubt that's a problem. And given that the input to the URL parser can
certainly contain one of those code points you have to handle them
somehow.


> Consistency across formats is definitely a good thing. But there are some
> serious differences between text and identifiers. Something that's harmless
> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
> creates a different address, leading to confusion).

Unicode has lots of space for confusion. I'll note that HTML defines
an identifier too and it takes any code point except for ASCII
whitespace: http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
Incidentally for text/html, URL fragments can be used to refer to
it...
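
As a sketch of how permissive that rule is, the following Python check assumes the HTML requirement is "at least one character, and no space characters" (U+0009, U+000A, U+000C, U+000D, U+0020). The function name is made up for illustration.

```python
# The code points HTML treats as space characters (assumption stated above).
HTML_SPACE_CHARACTERS = {"\t", "\n", "\f", "\r", " "}


def is_valid_html_id(value):
    """True if value is non-empty and contains no HTML space character."""
    return len(value) > 0 and not any(
        ch in HTML_SPACE_CHARACTERS for ch in value
    )


print(is_valid_html_id("section-1"))      # ordinary ASCII id
print(is_valid_html_id("\ue000\ufffd"))   # private use + U+FFFD
print(is_valid_html_id("two words"))      # contains U+0020
```

Under this rule a private use code point or U+FFFD is an acceptable id, so a fragment that is meant to refer to it has to be able to carry those code points as well.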


> Actually, the characters that I currently would like to exclude most (not
> just in a spec, but actually in the browser implementations) are bidi
> control characters. RFC 3987 disallows them, but not in the syntax. Moving
> the restrictions to the syntax would give them more prominence. Allowing
> them in IRIs/URLs is just a wide open door for scams and phishers.

I don't really have an opinion on this. I can certainly assist filing
bugs on implementors, but I doubt they are interested in taking this
potential compatibility hit (if I understand correctly what you're
proposing).


--
http://annevankesteren.nl/


Re: Marginal codepoints in IRIs/URLs

Martin J. Dürst
Hello Anne,

Sorry to be late with my reply.

On 2012/11/06 0:20, Anne van Kesteren wrote:

> On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
> <[hidden email]>  wrote:
>> That's for U+FFF0 to U+FFFD. U+FFF0 to U+FFFC are characters that are
>> strictly reserved for internal processing; I think MS Word, among others, uses
>> these. A browser that wanted to use these to simplify internal
>> implementation would have trouble accepting them from the outside.
>
> Given the way strings in browsers are really 16-bit code units
> (Mozilla's Rust might change that, I hear) with no restrictions I
> doubt that's a problem. And given that the input to the URL parser can
> certainly contain one of those code points you have to handle them
> somehow.

Yes. But that also applies to a space, very obviously (Web pages without
spaces would be really bad, except potentially in Chinese, Japanese,
Thai,...:-), but still these are not part of valid URLs.

Anyway, just for your information, here is what
http://tools.ietf.org/html/draft-ietf-iri-3987bis currently says about
the two classes of characters in question, in the LEIRI section
(http://tools.ietf.org/html/draft-ietf-iri-3987bis-13#section-6.3,
Characters Allowed in Legacy Extended IRIs but not in IRIs).

       Private use code points (U+E000-F8FF, U+F0000-FFFFD, U+100000-
       10FFFD): Display and interpretation of these code points is by
       definition undefined without private agreement.  Therefore, these
       code points are not suited for use on the Internet.  They are not
       interoperable and may have unpredictable effects.

       Specials (U+FFF0-FFFD): These code points provide functionality
       beyond that useful in a Legacy Extended IRI, for example byte
       order identification, annotation, and replacements for unknown
       characters and objects.  Their use and interpretation in a Legacy
       Extended IRI serves no purpose and may lead to confusing display
       variations.

(actually "byte order identification" is wrong, because that's U+FFFE; I
have fixed that in my internal copy).
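
For cross-checking, the two classes quoted above can be written down as plain code point ranges. This is only a Python sketch of the ranges as they appear in the draft text; the names are made up for illustration.

```python
# Ranges copied from the draft text quoted above.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]
SPECIALS_RANGE = (0xFFF0, 0xFFFD)


def classify(ch):
    """Return 'private-use', 'special', or 'other' for a single character."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES):
        return "private-use"
    if SPECIALS_RANGE[0] <= cp <= SPECIALS_RANGE[1]:
        return "special"
    return "other"


for sample in ["\ue000", "\U000f0000", "\ufffd", "\ufffe", "a"]:
    print(f"U+{ord(sample):04X}", classify(sample))
```

Note that U+FFFE falls outside the Specials range given here, which matches the correction about byte order identification.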

While we are at it, could you go through the list in the LEIRI section
(http://tools.ietf.org/html/draft-ietf-iri-3987bis-13#section-6.3) as an
easy way to cross-check whether there are any other differences?


Anyway, I have created two issues in our tracker:

http://trac.tools.ietf.org/wg/iri/trac/ticket/136
(Allow U+FFF0-FFFD to align with HTML)

http://trac.tools.ietf.org/wg/iri/trac/ticket/136
(Allow private-use characters outside query part to align with HTML)

Please feel free to add any additional information.

Personally, I'm fine either way. If somebody has implementations that
have problems with adding these, they should speak up.


>> Consistency across formats is definitely a good thing. But there are some
>> serious differences between text and identifiers. Something that's harmless
>> in text (e.g. a zero-width space) may be hopeless in an IRI/URL (because it
>> creates a different address, leading to confusion).
>
> Unicode has lots of space for confusion. I'll note that HTML defines
> an identifier too and it takes any code point except for ASCII
> whitespace: http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
> Incidentally for text/html, URL fragments can be used to refer to
> it...

Noted.

>> Actually, the characters that I currently would like to exclude most (not
>> just in a spec, but actually in the browser implementations) are bidi
>> control characters. RFC 3987 disallows them, but not in the syntax. Moving
>> the restrictions to the syntax would give them more prominence. Allowing
>> them in IRIs/URLs is just a wide open door for scams and phishers.
>
> I don't really have an opinion on this. I can certainly assist filing
> bugs on implementors, but I doubt they are interested in taking this
> potential compatibility hit (if I understand correctly what you're
> proposing).

Only scammers should have any reason to use these. It's way more a
security issue (in which browsers often show a very strong interest)
than a compatibility issue. I'll try to follow up on this in a separate
mail, but that may not be this week, sorry.

Regards,    Martin.


P.S.: Thanks for the pointer to Rust. Very interesting project.


Re: Marginal codepoints in IRIs/URLs

Anne van Kesteren
On Thu, Nov 8, 2012 at 5:16 AM, "Martin J. Dürst"
<[hidden email]> wrote:
> Sorry to be late with my reply.

No worries!


> On 2012/11/06 0:20, Anne van Kesteren wrote:
>> On Mon, Nov 5, 2012 at 12:19 PM, "Martin J. Dürst"
>> <[hidden email]>  wrote:
>> Given the way strings in browsers are really 16-bit code units
>> (Mozilla's Rust might change that, I hear) with no restrictions I
>> doubt that's a problem. And given that the input to the URL parser can
>> certainly contain one of those code points you have to handle them
>> somehow.
>
> Yes. But that also applies to a space, very obviously (Web pages without
> spaces would be really bad, except potentially in Chinese, Japanese,
> Thai,...:-), but still these are not part of valid URLs.

My current view is that it mostly makes sense to restrict certain code
points in the ASCII range as those are used as delimiters throughout
the ecosystem. HTML/Python use quotation marks, HTTP uses the colon
and whitespace, etc. So by putting the restrictions there, you make it
easy to copy and paste a URL around.


> While we are at it, could you go through the list in the LEIRI section
> (http://tools.ietf.org/html/draft-ietf-iri-3987bis-13#section-6.3) as an
> easy way to cross-check whether there are any other differences?

So LEIRIs are an even larger superset of IRIs. "\" seems problematic
as passing that to a URL parser results in it being handled as if it
were a "/". (I suppose we could make the parser handle that via a
flag, or before handing it to the parser you replace "\" with "%5C".)

U+0009, U+000A, and U+000D are pretty much always dropped on the floor
by a URL parser so those would be problematic too.
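
A rough Python sketch of the "replace before handing it to the parser" option, combined with the dropping of U+0009, U+000A, and U+000D. The function name and the flag are hypothetical; this is not the URL Standard's algorithm, just the two behaviours described above written as string operations.

```python
def prepare_for_url_parser(text, escape_backslash=True):
    """Pre-process a LEIRI-like string before passing it to a URL parser.

    Drops tab, LF and CR (which a URL parser would discard anyway) and,
    if escape_backslash is set, percent-encodes "\\" so that the parser
    does not treat it as "/".
    """
    # Map U+0009, U+000A and U+000D to None to delete them.
    text = text.translate({0x09: None, 0x0A: None, 0x0D: None})
    if escape_backslash:
        text = text.replace("\\", "%5C")
    return text


# Hypothetical input containing a backslash, a newline and a tab.
raw = "http://example.org/a\\b\n/c\td"
print(prepare_for_url_parser(raw))
```

With escape_backslash=False the backslash is left alone, which corresponds to letting the parser handle it via a flag instead.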

I am surprised [ and ] are not allowed. mailto:a@b?subject=[test]%20
is something I semi-frequently write and where I keep forgetting I
need to escape [ and ] to make it valid (I never had it fail anything
but the validator though).
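
For reference, a small Python sketch of the escaping the validator wants, using urllib.parse.quote from the standard library. The address and subject are the ones from the example above.

```python
from urllib.parse import quote

subject = "[test] "
# safe="" makes quote() percent-encode every reserved character,
# including "[" and "]".
url = "mailto:a@b?subject=" + quote(subject, safe="")
print(url)
```

This turns "[" into %5B and "]" into %5D, and the trailing space into %20.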


>> I don't really have an opinion on this. I can certainly assist filing
>> bugs on implementors, but I doubt they are interested in taking this
>> potential compatibility hit (if I understand correctly what you're
>> proposing).
>
> Only scammers should have any reason to use these. It's way more a security
> issue (in which browsers often show a very strong interest) than a
> compatibility issue. I'll try to follow up on this in a separate mail, but
> that may not be this week, sorry.

What would be interesting is the affected code points and the expected
results. There are a few cases currently where the URL parser has a hard
fail. E.g. if you resolve "/test" against "about:blank". We could
expand that to include these code points I suppose, but it seems like
a major risk.


--
http://annevankesteren.nl/