Standardizing on IDNA 2003 in the URL Standard

Standardizing on IDNA 2003 in the URL Standard

Anne van Kesteren-4

Re: Standardizing on IDNA 2003 in the URL Standard

Peter Saint-Andre-2
On 8/19/13 6:37 AM, Anne van Kesteren wrote:
> http://lists.w3.org/Archives/Public/www-archive/2013Aug/0008.html
> might be of interest to readers of these lists.

Hi Anne, thanks for the heads-up.

Given that IDNA 2003 is tied to Unicode 3.2 (via stringprep), I'm
curious to know more about what you mean by "IDNA 2003 ... without
restrictions to a particular Unicode version".

Do you have a preferred venue for discussion of this topic?

Peter

--
Peter Saint-Andre
https://stpeter.im/




Re: Standardizing on IDNA 2003 in the URL Standard

Anne van Kesteren-4
On Mon, Aug 19, 2013 at 5:35 PM, Peter Saint-Andre <[hidden email]> wrote:
> Given that IDNA 2003 is tied to Unicode 3.2 (via stringprep), I'm
> curious to know more about what you mean by "IDNA 2003 ... without
> restrictions to a particular Unicode version".

As far as I can tell from implementations what it means is that the
NFKC normalization algorithm from Unicode is the one defined in the
latest edition of Unicode rather than that of Unicode 3.2. I don't
think the other tables from Stringprep have been modified, but I
haven't exhaustively tested that. I probably should.
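
The version dependence described here can be observed with Python's standard library, which ships both its current Unicode data and a frozen Unicode 3.2.0 copy (`unicodedata.ucd_3_2_0`). This is only a sketch to illustrate the point, not a description of what any browser does; the sample character U+1D2C (added after Unicode 3.2) and the helper names are assumptions chosen for the example.

```python
import unicodedata

# U+1D2C MODIFIER LETTER CAPITAL A was added after Unicode 3.2,
# so the frozen 3.2 database treats it as unassigned.
SAMPLE = "\u1d2c"


def nfkc_current(text: str) -> str:
    """NFKC using the Unicode data bundled with this Python build."""
    return unicodedata.normalize("NFKC", text)


def nfkc_unicode_3_2(text: str) -> str:
    """NFKC using Python's frozen Unicode 3.2.0 database."""
    return unicodedata.ucd_3_2_0.normalize("NFKC", text)


if __name__ == "__main__":
    print("runtime Unicode version:", unicodedata.unidata_version)
    print("current NFKC :", ascii(nfkc_current(SAMPLE)))
    print("3.2-only NFKC:", ascii(nfkc_unicode_3_2(SAMPLE)))
```

With current data the character normalizes to a plain "A", while the 3.2 tables have no decomposition for it, so two implementations that differ only in Unicode version can disagree on the resulting label.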


> Do you have a preferred venue for discussion of this topic?

Not really. Wherever people pay attention I suppose :-)


--
http://annevankesteren.nl/


Re: Standardizing on IDNA 2003 in the URL Standard

Mark Davis ☕

On Mon, Aug 19, 2013 at 7:01 PM, Anne van Kesteren <[hidden email]> wrote:
> As far as I can tell from implementations what it means is that the
> NFKC normalization algorithm from Unicode is the one defined in the

Rather than promoting different, arbitrary modifications of IDNA2003, I would recommend instead using the TR46 specification, which provides a migration path from IDNA2003 to IDNA2008. It is, with some small exceptions, compatible with IDNA2003.

"To satisfy user expectations for mapping, and provide maximal compatibility with IDNA2003, this document specifies a mapping for use with IDNA2008. In addition, to transition more smoothly to IDNA2008, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.2 and later. It also incorporates the repertoire extensions provided by IDNA2008."



— Il meglio è l’inimico del bene —

RE: Standardizing on IDNA 2003 in the URL Standard

Shawn Steele

I concur.  We use the IDNA2008 + TR46 behavior.

-Shawn

From: [hidden email] [mailto:[hidden email]] On Behalf Of Mark Davis ☕
Sent: Monday, August 19, 2013 10:32 AM
To: Anne van Kesteren
Cc: Peter Saint-Andre; [hidden email]; [hidden email]; www-tag.w3.org
Subject: Re: Standardizing on IDNA 2003 in the URL Standard

> On Mon, Aug 19, 2013 at 7:01 PM, Anne van Kesteren <[hidden email]> wrote:
> > As far as I can tell from implementations what it means is that the
> > NFKC normalization algorithm from Unicode is the one defined in the
>
> Rather than promoting different, arbitrary modifications of IDNA2003, I would recommend instead using the TR46 specification, which provides a migration path from IDNA2003 to IDNA2008. It is, with some small exceptions, compatible with IDNA2003.
>
> "To satisfy user expectations for mapping, and provide maximal compatibility with IDNA2003, this document specifies a mapping for use with IDNA2008. In addition, to transition more smoothly to IDNA2008, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.2 and later. It also incorporates the repertoire extensions provided by IDNA2008."
>
> — Il meglio è l’inimico del bene —


Re: Standardizing on IDNA 2003 in the URL Standard

Anne van Kesteren-4
In reply to this post by Mark Davis ☕
On Mon, Aug 19, 2013 at 6:31 PM, Mark Davis ☕ <[hidden email]> wrote:
> Rather than promoting different, arbitrary modifications of IDNA2003, I
> would recommend instead using the TR46 specification, which provides a
> migration path from IDNA2003 to IDNA2008. It is, with some small exceptions,
> compatible with IDNA2003.

Last I checked with implementers there was not much interest in that.
And to be clear, it's not different and arbitrary. The modifications
have been in place since IDNA2003 support landed in browsers. As
should have been clear to the original authors of IDNA2003 too. Nobody
is going to arbitrarily freeze their Unicode implementation.

(Aside: ToASCII in IDNA2003 applies to domain labels. It applying to
domain names in UTS #46 is somewhat confusing.)
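
To make the label/name distinction concrete: Python's standard library exposes the IDNA2003 `ToASCII` operation in `encodings.idna`, and it works on a single label. The wrapper below is a hypothetical helper written for this illustration; it splits only on U+002E and passes empty labels through, which is a simplification.

```python
from encodings import idna  # stdlib implementation of IDNA2003 (RFC 3490)


def domain_to_ascii(domain: str) -> str:
    """Apply IDNA2003 ToASCII to each label, then rejoin with dots."""
    encoded = []
    for label in domain.split("."):
        if label:
            # ToASCII is defined per label and returns bytes
            encoded.append(idna.ToASCII(label).decode("ascii"))
        else:
            # Empty label, e.g. from a trailing dot: keep as-is
            encoded.append("")
    return ".".join(encoded)


if __name__ == "__main__":
    print(domain_to_ascii("bücher.example"))
```

Whether the splitting into labels happens inside or outside the operation called "ToASCII" is exactly the difference in framing being pointed out.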


On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
<[hidden email]> wrote:
> I concur.  We use the IDNA2008 + TR46 behavior.

Interesting. Last I checked Internet Explorer that was not the case.
Since which version is this deployed? Does it depend on the operating
system? What variation of TR46 is implemented?


On Mon, Aug 19, 2013 at 11:36 PM, Vint Cerf <[hidden email]> wrote:

> It seems to me that we would serve the community well if we work towards a
> well-defined and timely transition to IDNA2008. It has a key property of
> independence from any particular version of UNICODE (which was the primary
> reason for moving in that direction). It also has a canonical representation
> of domain labels which is also a powerful standardizing element. We are all
> aware of the potential for some backward incompatibility with IDNA2003 but
> the committee that developed IDNA2008 discussed these issues at length and
> obviously concluded that the features of IDNA2008 were superior over all to
> the status quo. It is a disservice in the long run to delay adoption of the
> newer design, especially given the huge expansion of the TLD space - all
> these TLDs should be developed and evolved on the IDNA2008 principles.

I don't think the committee has carefully considered the compatibility
impact. Deployed domains would become invalid. Long-standing practice
of case folding (e.g. the idea that http://EXAMPLE.COM/ and
http://example.com/ are identical) is suddenly something that is no
longer decided upon by IDNA but needs to be decided somehow at the
application-level. And when the Unicode consortium provided such
profiling for applications in the form of
http://unicode.org/reports/tr46/ that was frowned upon. It's not at
all clear what transition path is envisioned here.
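
As an illustration of the case mapping that IDNA2003 performs inside the protocol, Python's standard-library `encodings.idna` (an IDNA2003 implementation) can be used. This is a sketch of the behaviour being discussed, not of any browser's code; the sample labels are arbitrary.

```python
from encodings import idna  # stdlib IDNA2003 implementation

# Nameprep folds case for non-ASCII labels, so both spellings yield the
# same A-label.
upper = idna.ToASCII("BÜCHER")
lower = idna.ToASCII("bücher")

# Pure-ASCII labels skip Nameprep in ToASCII; their case-insensitivity
# comes from DNS name matching itself rather than from IDNA.
ascii_only = idna.ToASCII("EXAMPLE")

if __name__ == "__main__":
    print(upper, lower, ascii_only)
    print(idna.nameprep("BÜCHER"))
```

Under IDNA2008 this folding step is not part of the protocol, so something outside it would have to produce the lowercase form before lookup.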


--
http://annevankesteren.nl/


Re: Standardizing on IDNA 2003 in the URL Standard

Jungshik SHIN (신정식)-2


On Aug 20, 2013 5:33 AM, "Anne van Kesteren" <[hidden email]> wrote:
>
> On Mon, Aug 19, 2013 at 6:31 PM, Mark Davis ☕ <[hidden email]> wrote:
> > Rather than promoting different, arbitrary modifications of IDNA2003, I
> > would recommend instead using the TR46 specification, which provides a
> > migration path from IDNA2003 to IDNA2008. It is, with some small exceptions,
> > compatible with IDNA2003.
>
> Last I checked with implementers there was not much interest in that.

Chrome is interested. It is very long overdue.

> And to be clear, it's not different and arbitrary. The modifications
> have been in place since IDNA2003 support landed in browsers. As
> should have been clear to the original authors of IDNA2003 too. Nobody
> is going to arbitrarily freeze their Unicode implementation.
>
> (Aside: ToASCII in IDNA2003 applies to domain labels. It applying to
> domain names in UTS #46 is somewhat confusing.)
>
>
> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
> <[hidden email]> wrote:
> > I concur.  We use the IDNA2008 + TR46 behavior.
>
> Interesting. Last I checked Internet Explorer that was not the case.
> Since which version is this deployed? Does it depend on the operating
> system? What variation of TR46 is implemented?
>
>
> On Mon, Aug 19, 2013 at 11:36 PM, Vint Cerf <[hidden email]> wrote:
> > It seems to me that we would serve the community well if we work towards a
> > well-defined and timely transition to IDNA2008. It has a key property of
> > independence from any particular version of UNICODE (which was the primary
> > reason for moving in that direction). It also has a canonical representation
> > of domain labels which is also a powerful standardizing element. We are all
> > aware of the potential for some backward incompatibility with IDNA2003 but
> > the committee that developed IDNA2008 discussed these issues at length and
> > obviously concluded that the features of IDNA2008 were superior over all to
> > the status quo. It is a disservice in the long run to delay adoption of the
> > newer design, especially given the huge expansion of the TLD space - all
> > these TLDs should be developed and evolved on the IDNA2008 principles.
>
> I don't think the committee has carefully considered the compatibility
> impact. Deployed domains would become invalid. Long-standing practice
> of case folding (e.g. the idea that http://EXAMPLE.COM/ and
> http://example.com/ are identical) is suddenly something that is no
> longer decided upon by IDNA but needs to be decided somehow at the
> application-level. And when the Unicode consortium provided such
> profiling for applications in the form of
> http://unicode.org/reports/tr46/ that was frowned upon. It's not at
> all clear what transition path is envisioned here.
>
>
> --
> http://annevankesteren.nl/
>


Re: Standardizing on IDNA 2003 in the URL Standard

Andrew Sullivan-9
In reply to this post by Anne van Kesteren-4
I'm pretty sure I'm not on many of these lists, so I bet this mail
won't go everywhere.  Nevertheless,

On Tue, Aug 20, 2013 at 01:32:23PM +0100, Anne van Kesteren wrote:
> (Aside: ToASCII in IDNA2003 applies to domain labels. It applying to
> domain names in UTS #46 is somewhat confusing.)

Or "broken".  It can't apply to domain names, of course, because
that's not how the DNS works; but one might be forgiven for wondering
whether not understanding the details of an underlying technical
problem is a barrier to having an opinion in this space.

> I don't think the committee has carefully considered the compatibility
> impact. Deployed domains would become invalid.

The IDNABIS wg did not take that decision lightly.  In my opinion, we
concluded that some deployed domains were just _broken_, and that we
were eventually going to endure this pain, and that it would be better
to do it earlier rather than later.

> Long-standing practice
> of case folding (e.g. the idea that http://EXAMPLE.COM/ and
> http://example.com/ are identical) is suddenly something that is no
> longer decided upon by IDNA but needs to be decided somehow at the
> application-level.

Well, sort of.  There's nothing in IDNA2008 that prevents the OS from
providing a generic facility for this (which is apparently what the
current generation of Windows does).  

The point was to take this mapping out of the _protocol_ and put it
into local rules that could be made locale-sensitive.  The reason for
this is that, while it is impossible in general to provide case
folding rules where lower-case accented characters get mapped to upper
case without accents and then get case folded again (thereby losing
data), it _might_ be possible to do this in a locale-sensitive way if
one knew enough about the environment.  For instance, in some writing
systems for French, it is standard practice to fold LATIN SMALL LETTER
E WITH ACUTE to LATIN CAPITAL LETTER E (not all French systems, of
course.  Some fold to LATIN CAPITAL LETTER E WITH ACUTE).  Now, if the
LATIN CAPITAL LETTER E is next downcased, what should you get?  The
general rule will of course be LATIN SMALL LETTER E, but if you had a
clever program that could do intelligent things with the string
"ECOLE", the folding might be LATIN SMALL LETTER E WITH ACUTE, or the
folding might try both and see what happens.  This example is a little
contrived -- the French example seems silly -- but examples in other
scripts and languages are in my view considerably more compelling.  I
don't think that UTS#46 is actually different in this regard, although
it proposes uniform mapping rules in all cases.  
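
The data loss in the French example can be reproduced in a few lines. This is a minimal sketch assuming a hypothetical "uppercase without accents" convention implemented by stripping combining marks; the helper names are made up for the illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def upcase_without_accents(text: str) -> str:
    """Uppercase the way some French conventions do: no accents on capitals."""
    return strip_accents(text.upper())


original = "école"
shouted = upcase_without_accents(original)  # accent is lost here
round_trip = shouted.lower()                # generic lowercasing cannot restore it

if __name__ == "__main__":
    print(original, "->", shouted, "->", round_trip)
```

A locale-aware lookup layer could try both "ecole" and "école"; a single global mapping table has to pick one.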

IDNA2003 doesn't handle this case real well, because it can't
possibly.  There's simply no room for locale in IDNA2003.

> And when the Unicode consortium provided such
> profiling for applications in the form of
> http://unicode.org/reports/tr46/ that was frowned upon.

I think the history is a little more complicated than that.

Best regards,

A

--
Andrew Sullivan
[hidden email]


RE: Standardizing on IDNA 2003 in the URL Standard

Shawn Steele
In reply to this post by Anne van Kesteren-4
>>> I concur.  We use the IDNA2008 + TR46 behavior.
>>
>> Interesting. Last I checked Internet Explorer that was not the case.
>
> At this side of the keyboard, ß is still not supported in IE10/Win7-SP1

Yes, that's the + TR46 behavior.  

We're not changing the spoofable entries at this time due to security concerns.  You can register the ss version and it'll get there.  As a complete digression, IMO IDN/DNS should allow for a "display" form mechanism, because the NFKC part of the mapping is more than a bit destructive and there are a lot of other inputs that aren't going to look like the output.

-Shawn

RE: Standardizing on IDNA 2003 in the URL Standard

Shawn Steele
In reply to this post by Anne van Kesteren-4
> At this side of the keyboard, ß is still not supported in IE10/Win7-SP1

(To be clear: with the transitional support of TR46, www.Fußball.de will take you to www.fussball.de, which I strongly suspect is what the current owners of www.fussball.de expect, particularly for any of their Swiss visitors. We don't want IE to hijack those users and take them to another site, particularly if the target is a bank.)
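
The IDNA2003 behaviour referred to here can be checked with Python's built-in `idna` codec, which implements IDNA2003. This is a sketch of that mapping only; it says nothing about how Internet Explorer implements it.

```python
# Python's "idna" codec applies IDNA2003 ToASCII to each label.
host = "www.Fußball.de"
ascii_host = host.encode("idna").decode("ascii")

if __name__ == "__main__":
    # Nameprep case-folds the label and maps sharp s to "ss", so the
    # result is plain ASCII and no Punycode is needed.
    print(ascii_host)
```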

-Shawn

Re: Standardizing on IDNA 2003 in the URL Standard

John C Klensin-3
In reply to this post by Anne van Kesteren-4


--On Tuesday, August 20, 2013 15:55 +0200 Marcos Sanz
<[hidden email]> wrote:

> [hidden email] wrote on 20/08/2013 14:32:23:
>
>> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
>> <[hidden email]> wrote:
>> > I concur.  We use the IDNA2008 + TR46 behavior.
>>
>> Interesting. Last I checked Internet Explorer that was not
>> the case.
>
> At this side of the keyboard, ß is still not supported in
> IE10/Win7-SP1

But that is completely consistent with IDNA2008 + UTR46 when the
most IDNA2003-like profile (or, if you prefer, stage of
transition) of UTR46 is used.   One can debate endlessly whether
UTR46 is a good idea (and the IDNABIS WG did), but ultimately
[1] it was intended to provide an environment as much like that
of IDNA2003 as possible.  That includes:
       
-- strict backward compatibility with the interpretation
        of strings that are valid with either IDNA2003 or
        IDNA2008,   and

-- continued support for strings that were valid in
        IDNA2003 but that mapped into other strings before being
        converted to ASCII strings using Punycode where those
        target strings are valid under IDNA2008

If one accepts that kind of compatibility as a primary goal,
then the fact that "ß" was mapped to "ss" in IDNA2003 means
that mapping must be preserved forever and one will never [2]
actually be able to store an Eszett in the DNS.  
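
The effect of the two processing modes on Eszett can be sketched as follows. This is a toy model that covers only the four UTS #46 deviation characters and assumes already-lowercased input; real UTS #46 processing also does case mapping, normalization and validity checks. The function names are hypothetical.

```python
# The four UTS #46 deviation characters and their transitional mappings.
DEVIATIONS = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S (Eszett)
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u200c": "",        # ZERO WIDTH NON-JOINER, removed
    "\u200d": "",        # ZERO WIDTH JOINER, removed
}


def map_deviations(label: str, transitional: bool) -> str:
    """Apply the IDNA2003-style mappings only in transitional mode."""
    if not transitional:
        return label
    return "".join(DEVIATIONS.get(ch, ch) for ch in label)


def to_a_label(label: str) -> str:
    """Punycode-encode a label unless it is already ASCII."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    for mode in (True, False):
        print("transitional" if mode else "nontransitional",
              to_a_label(map_deviations("fußball", mode)))
```

In transitional mode the lookup goes to the "ss" name; in nontransitional mode the Eszett survives into a distinct Punycode label, which is the case where the two names can resolve differently.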

The bottom line, at least IMO, is that one can adopt either of
two philosophical models.   In one, whatever decisions were made
in building the IDNA2003 standard and the name strings those
decisions allowed are inviolable.  Arguments that errors were
made, that those strings create risks, or that the rules
prohibit orthographically-reasonable strings are simply
irrelevant if they conflict with absolute compatibility.  The
other (at the risk of showing my biases) is to assume that we are
human, that mistakes will get made, and that, if they are
significant, we should figure out how to correct them and move
on.  

As others have suggested, the latter includes realizing that
some labels and practices that were allowed under IDNA2003 were
simply a bad idea and we should move away from them as soon as
possible rather than encouraging their use in even more
contexts.  Coming back to the comment that started this note, it
also means that, if the relevant language communities decide,
for example, that Eszett is important as a character or that
zero-width joiners and non-joiners are critical, we need to
figure out how to accommodate them even if the accommodation is
not perfect and doesn't solve all problems.  And, in each case,
we need to remember that the Internet is growing and reaching
more communities and more people within almost every community,
making transition now, even if painful, much less painful than
transition in the future.

FWIW, without at least some measure of the latter model, we
would be stuck with HTTP 1.0, HTML 1 (or at least 3), and ISO
8859-1 forever.  The decision to interpret a string of non-ASCII
octets in content as, by default, a good candidate for UTF-8
rather than Latin-1 is, at least IMO, ultimately an incompatible
change of far more sweeping impact and consequences than this
IDNA2003 -> IDNA2008 transition.

In an odd way, while I would have preferred to see a much more
rapid transition, I think that exactly what should be happening
is happening.  The various registries --both the
ICANN-supervised ones and many others at the root and various
other levels-- are prohibiting (and not renewing) strings that
do not conform with IDNA2008.  Registries that want to support
labels that are problematic from a transition standpoint have
devised, or are devising, procedures to lower the odds of
strings that pose difficulties falling into hostile hands, just
as many of them do for potentially-confusing strings.  The right
time to transition systems that look up names involves tricky
questions including the "pain now or more pain later"
considerations mentioned above.   And where UTR 46 and/or RFC
5895 fit into transition strategies (as distinct from localized
mapping strategies), or not, is obviously part of that
transition question.

Anne, coming back to your original question, I don't know what
question you and your colleagues asked that got the "everyone is
still on IDNA2003" answer.  Especially given the information
from Microsoft, I suspect it was close to "are you fully
supporting IDNA2008" for which a "no" answer might lead to a
"using IDNA2003" answer despite their telling us that they are
running IDNA2008 with UTR 46.  Others have pointed out that
"IDNA2003 with the version restriction eliminated" may be a
sensible statement in individual cases but, because the Nameprep
profile of Stringprep is not simply Unicode Case Folding plus
NFKC, it leaves enough open to local interpretation that it is
not a plausible candidate for a statement in a standard that is
intended to promote interoperability.
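
To illustrate why Nameprep is more than case folding plus NFKC, the sketch below walks through its stages using the tables in Python's standard-library `stringprep` module. It is deliberately partial: only two of the prohibition tables are checked and the bidirectional check is omitted. The function name and the returned dictionary are assumptions made for this example.

```python
import stringprep
from unicodedata import ucd_3_2_0  # Nameprep is defined against Unicode 3.2


def nameprep_stages(label: str) -> dict:
    """Return the intermediate result of each Nameprep stage (partial)."""
    # Stage 1a: table B.1, characters that are "mapped to nothing"
    kept = [ch for ch in label if not stringprep.in_table_b1(ch)]
    # Stage 1b: table B.2, case folding for use with NFKC
    mapped = "".join(stringprep.map_table_b2(ch) for ch in kept)
    # Stage 2: NFKC normalization
    normalized = ucd_3_2_0.normalize("NFKC", mapped)
    # Stage 3: prohibited output (only tables C.1.2 and C.2.2 checked here)
    prohibited = [
        ch for ch in normalized
        if stringprep.in_table_c12(ch) or stringprep.in_table_c22(ch)
    ]
    return {"mapped": mapped, "normalized": normalized, "prohibited": prohibited}


if __name__ == "__main__":
    print(nameprep_stages("Exam\u00adple"))  # soft hyphen is dropped
    print(nameprep_stages("a\u200db"))       # zero-width joiner is dropped
    print(nameprep_stages("a\u2028b"))       # line separator is prohibited
```

Lifting the Unicode version restriction leaves open which of these tables get extended to newer characters and how, which is the interoperability gap being described.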

Against that backdrop, I believe you should interpret what you
are seeing, not as "everyone is committed to IDNA2003"
(obviously not true as soon as exceptions are introduced) and
"IDNA2003 with exceptions forever" but as slow transition.  If
you want a standard that works going forward, make the
assumption that the folks who designed IDNA2008 were not fools
and that browsers should be moving, and eventually will move
(unless you discourage them) in the IDNA2008 direction.  Whether
you want to discuss transition or not is up to you.  If you want
to follow Mark's recommendation (and Microsoft's lead) and
suggest IDNA2008 plus UTR 46, I suggest you do so in a way that
really constitutes a transition strategy rather than an "IDNA
2003 forever" one, i.e., that you address the issues of when
"transition processing" gets turned off and the localization
issues (especially about case folding) mentioned by others.  If
not, you and your working group put us all at risk of many
internationalized email applications working differently than
web browsers do, in a fork between IETF and W3C i18n standards,
divergence between assumptions and norms used by those who
create DNS names and those who look them up, and so on.  I hope
we can agree that those would be bad outcomes.

regards,
    john

 -----------

[1] I hope Mark will more or less agree with this
characterization; it is as accurate and neutral as I know how to
make it.

[2] This is associated with one of the key criticisms of UTR 46
that has not been discussed so far:  It has been described as a
transition strategy, but there is really no mechanism in it for
deciding when to adopt the IDNA2008 model and rules in favor of
strict backward-compatibility with as many names that were valid
under IDNA2003 as possible.   In reality, saying "we use UTR 46"
or "we conform to UTR 46" is somewhat underspecified because UTR
46 can be used strictly for local mapping, with what it calls
"transition processing" (which is where Eszett disappears),
and/or with other optional features such as flagging, but
continuing to look up, strings that contain punctuation or
symbol characters.  Either of those latter options makes a
so-called "IDNA2008 + UTR46" implementation non-conforming with
IDNA2008.



Re: Standardizing on IDNA 2003 in the URL Standard

Gervase Markham
In reply to this post by Jungshik SHIN (신정식)-2
[I'm also not on many of these lists...]

On 20/08/13 15:46, Jungshik SHIN (신정식) wrote:
>
> On Aug 20, 2013 5:33 AM, "Anne van Kesteren" <[hidden email]
> <mailto:[hidden email]>> wrote:
>> Last I checked with implementers there was not much interest in that.

In the case of Mozilla, if it was something I said which gave you that
impression, I apologise. That's not correct.

> Chrome is interested. It is very long overdue.

We are also interested. Sticking with a single version of Unicode is
untenable; given that, implementing anything other than IDNA2008 would
just be some mish-mash which would behave differently to everyone else.
Our implementation was held up for quite some time by licensing problems
with idnkit2 (now resolved), and it's now held up (I believe) due to
lack of time on the part of the main engineer in this area. (Patches
welcome.) But, insofar as I have any say, we do want to move to
IDNA2008, perhaps with some compatibility mitigations from TR46. (We've
not yet developed a precise plan.)

With regard to any incompatibilities, particularly around sharp-S and
final sigma, my understanding and expectation is that the registries
most concerned with those characters (e.g. the Greek registry for final
sigma) were in agreement that IDNA2008 was the correct way forward, and
that any breakage caused by the switch was better than the breakage
caused by not moving. If I became aware that this was not the case, my
view might perhaps change. But I believe that it is. If there is a
phishing problem in any particular TLD due to this change, then I place
the blame for that squarely on the registry concerned.

This is https://bugzilla.mozilla.org/show_bug.cgi?id=479520 .

Gerv


Re: Standardizing on IDNA 2003 in the URL Standard

Mark Davis ☕
In reply to this post by John C Klensin-3




— Il meglio è l’inimico del bene —


On Tue, Aug 20, 2013 at 9:33 PM, John C Klensin <[hidden email]> wrote:
> --On Tuesday, August 20, 2013 15:55 +0200 Marcos Sanz
> <[hidden email]> wrote:
>
> > [hidden email] wrote on 20/08/2013 14:32:23:
> >
> >> On Mon, Aug 19, 2013 at 9:32 PM, Shawn Steele
> >> <[hidden email]> wrote:
> >> > I concur.  We use the IDNA2008 + TR46 behavior.
> >>
> >> Interesting. Last I checked Internet Explorer that was not
> >> the case.
> >
> > At this side of the keyboard, ß is still not supported in
> > IE10/Win7-SP1
>
> But that is completely consistent with IDNA2008 + UTR46 when the
> most IDNA2003-like profile (or, if you prefer, stage of
> transition) of UTR46 is used.   One can debate endlessly whether
> UTR46 is a good idea (and the IDNABIS WG did), but ultimately
> [1] it was intended to provide an environment as much like that
> of IDNA2003 as possible.  That includes:
>
> -- strict backward compatibility with the interpretation
>         of strings that are valid with either IDNA2003 or
>         IDNA2008,   and
>
> -- continued support for strings that were valid in
>         IDNA2003 but that mapped into other strings before being
>         converted to ASCII strings using Punycode where those
>         target strings are valid under IDNA2008
>
> If one accepts that kind of compatibility as a primary goal,
> then the fact that "ß" was mapped to "ss" in IDNA2003 means
> that mapping must be preserved forever and one will never [2]
> actually be able to store an Eszett in the DNS.
>
> The bottom line, at least IMO, is that one can adopt either of
> two philosophical models.   In one, whatever decisions were made
> in building the IDNA2003 standard and the name strings those
> decisions allowed are inviolable.  Arguments that errors were
> made, that those strings create risks, or that the rules
> prohibit orthographically-reasonable strings are simply
> irrelevant if they conflict with absolute compatibility.  The
> other (at the risk of showing my biases) is to assume that we are
> human, that mistakes will get made, and that, if they are
> significant, we should figure out how to correct them and move
> on.
>
> As others have suggested, the latter includes realizing that
> some labels and practices that were allowed under IDNA2003 were
> simply a bad idea and we should move away from them as soon as
> possible rather than encouraging their use in even more
> contexts.  Coming back to the comment that started this note, it
> also means that, if the relevant language communities decide,
> for example, that Eszett is important as a character or that
> zero-width joiners and non-joiners are critical, we need to
> figure out how to accommodate them even if the accommodation is
> not perfect and doesn't solve all problems.  And, in each case,
> we need to remember that the Internet is growing and reaching
> more communities and more people within almost every community,
> making transition now, even if painful, much less painful than
> transition in the future.

The key migration issue is whether people are comfortable having implementations go to different IP addresses for IDNs containing 'ß' (or the other 3 related characters). The transitional form in TR46 is for those who are concerned with that problem. If the registries either bundled 'ss' with 'ß' or blocked (once either was registered, the other could not be), then the ambiguous addressing issue would not be a problem. So it is a matter of waiting for the significant registries to do that.
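
The two registry policies mentioned here, blocking and bundling, can be sketched with a toy in-memory registry. Everything in this example is hypothetical (class name, policy flag, variant key); it only illustrates the idea that names differing in a deviation character share one registration slot, and is not how any real registry works.

```python
# Transitional mappings for the four deviation characters.
DEVIATION_MAP = {"\u00df": "ss", "\u03c2": "\u03c3", "\u200c": "", "\u200d": ""}


def variant_key(label: str) -> str:
    """Collapse a label to the form IDNA2003 would have looked up."""
    return "".join(DEVIATION_MAP.get(ch, ch) for ch in label.lower())


class Registry:
    """Toy registry: one owner per variant key."""

    def __init__(self, bundle_same_owner: bool = False):
        # False = blocking: the variant is unavailable to everyone.
        # True  = bundling: the variant is available to the same owner only.
        self.bundle_same_owner = bundle_same_owner
        self.owners = {}  # variant key -> owner
        self.labels = {}  # registered label -> owner

    def register(self, label: str, owner: str) -> bool:
        key = variant_key(label)
        holder = self.owners.get(key)
        if holder is not None and not (self.bundle_same_owner and holder == owner):
            return False
        self.owners[key] = owner
        self.labels[label] = owner
        return True


if __name__ == "__main__":
    blocking = Registry()
    print(blocking.register("fussball", "club"))   # first registration succeeds
    print(blocking.register("fußball", "other"))   # variant is blocked
```

Under either policy, "fussball" and "fußball" can never end up with different owners, so it no longer matters to users which of the two a client looks up.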


> FWIW, without at least some measure of the latter model, we
> would be stuck with HTTP 1.0, HTML 1 (or at least 3), and ISO
> 8859-1 forever.  The decision to interpret a string of non-ASCII
> octets in content as, by default, a good candidate for UTF-8
> rather than Latin-1 is, at least IMO, ultimately an incompatible
> change of far more sweeping impact and consequences than this
> IDNA2003 -> IDNA2008 transition.

That's not a particularly good analogy. ASCII is and remains ASCII in UTF-8; that's one of its virtues. Latin 1 was just one of many encodings that used the high bit for different purposes, so UTF-8 was simply one of many such encodings. It did not represent a backwards incompatibility with existing standards.
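
Both sides of this exchange can be seen in a few lines. The sketch below shows that ASCII text has identical bytes in ASCII, Latin-1 and UTF-8, and that bytes with the high bit set are read differently depending on which encoding is assumed. The `decode_guess` helper is a hypothetical "try UTF-8 first" heuristic written for this example, not a description of any particular software.

```python
ascii_text = "example.com"
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)

latin1_bytes = "café".encode("latin-1")  # one byte with the high bit set
utf8_bytes = "café".encode("utf-8")      # two bytes for the accented letter


def decode_guess(data: bytes) -> str:
    """Prefer UTF-8 and fall back to Latin-1 if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


if __name__ == "__main__":
    print(same_bytes)
    print(decode_guess(latin1_bytes), decode_guess(utf8_bytes))
    # Reading UTF-8 bytes as Latin-1 silently produces mojibake:
    print(utf8_bytes.decode("latin-1"))
```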
​​

> In an odd way, while I would have preferred to see a much more
> rapid transition, I think that exactly what should be happening
> is happening.  The various registries --both the
> ICANN-supervised ones and many others at the root and various
> other levels-- are prohibiting (and not renewing) strings that
> do not conform with IDNA2008.  Registries that want to support
> labels that are problematic from a transition standpoint have
> devised, or are devising, procedures to lower the odds of
> strings that pose difficulties falling into hostile hands, just
> as many of them do for potentially-confusing strings.  The right
> time to transition systems that look up names involves tricky
> questions including the "pain now or more pain later"
> considerations mentioned above.   And where UTR 46 and/or RFC
> 5895 fit into transition strategies (as distinct from localized
> mapping strategies), or not, is obviously part of that
> transition question.

I agree with that, and it is the scenario envisioned for TR46. That is, once all (significant) registries move to IDNA2008, then clients can impose stricter controls on the characters, excluding the characters that are disallowed in IDNA2008. Because the registries will have moved, the number of failing URLs would be acceptable.
​​

> Anne, coming back to your original question, I don't know what
> question you and your colleagues asked that got the "everyone is
> still on IDNA2003" answer.  Especially given the information
> from Microsoft, I suspect it was close to "are you fully
> supporting IDNA2008" for which a "no" answer might lead to a
> "using IDNA2003" answer despite their telling us that they are
> running IDNA2008 with UTR 46.  Others have pointed out that
> "IDNA2003 with the version restriction eliminated" may be a
> sensible statement in individual cases but, because the Nameprep
> profile of Stringprep is not simply Unicode Case Folding plus
> NFKC, it leaves enough open to local interpretation that it is
> not a plausible candidate for a statement in a standard that is
> intended to promote interoperability.
>
> Against that backdrop, I believe you should interpret what you
> are seeing, not as "everyone is committed to IDNA2003"
> (obviously not true as soon as exceptions are introduced) and
> "IDNA2003 with exceptions forever" but as slow transition.  If
> you want a standard that works going forward, make the
> assumption that the folks who designed IDNA2008 were not fools
> and that browsers should be moving, and eventually will move
> (unless you discourage them) in the IDNA2008 direction.  Whether
> you want to discuss transition or not is up to you.  If you want
> to follow Mark's recommendation (and Microsoft's lead) and
> suggest IDNA2008 plus UTR 46, I suggest you do so in a way that
> really constitutes a transition strategy rather than an "IDNA
> 2003 forever" one, i.e., that you address the issues of when
> "transition processing" gets turned off and the localization
> issues (especially about case folding) mentioned by others.  If
> not, you and your working group put us all at risk of many
> internationalized email applications working differently than
> web browsers do, in a fork between IETF and W3C i18n standards,
> divergence between assumptions and norms used by those who
> create DNS names and those who look them up, and so on.  I hope
> we can agree that those would be bad outcomes.
>
> regards,
>     john
>
>  -----------
>
> [1] I hope Mark will more or less agree with this
> characterization; it is as accurate and neutral as I know how to
> make it.

Yes, thanks.
​​

> [2] This is associated with one of the key criticisms of UTR 46
> that has not been discussed so far:  It has been described as a
> transition strategy, but there is really no mechanism in it for
> deciding when to adopt the IDNA2008 model and rules in favor of
> strict backward-compatibility with as many names that were valid
> under IDNA2003 as possible.   In reality, saying "we use UTR 46"
> or "we conform to UTR 46" is somewhat underspecified because UTR
> 46 can be used strictly for local mapping, with what it calls
> "transition processing" (which is where Eszett disappears),
> and/or with other optional features such as flagging, but
> continuing to look up, strings that contain punctuation or
> symbol characters.  Either of those latter options makes a
> so-called "IDNA2008 + UTR46" implementation non-conforming with
> IDNA2008.

Yes, it is the latter two options that can disappear under the right conditions (as above).


Re: Standardizing on IDNA 2003 in the URL Standard

Anne van Kesteren-4
On Wed, Aug 21, 2013 at 4:01 PM, Mark Davis ☕ <[hidden email]> wrote:
> I agree with that, and it is the scenario envisioned for TR46. That is, once
> all (significant) registries move to IDNA2008, then clients can impose
> stricter controls on the characters, excluding the characters that are
> disallowed in IDNA2008. Because the registries will have moved, the number
> of failing URLs would be acceptable.

I doubt that would be true for subdomains. E.g. I know people using
http://☺.example.com/ as a domain (forgot whether that particular code
point is excluded, but you get the idea).
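
For reference, a rough way to check a code point like this one. This is only a sketch: the real IDNA2008 derivation (RFC 5892) also has exception tables, contextual rules and stability checks, and the function name below is made up for illustration. It only tests the "LetterDigits" general categories that IDNA2008 starts from, which a symbol such as U+263A does not fall into.

```python
import unicodedata

# General categories that RFC 5892 groups as "LetterDigits".
# Assumption: this ignores the exception and context rules of the
# real IDNA2008 derivation, so it is only a first approximation.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_letter_digit(ch: str) -> bool:
    """Return True if ch has one of the LetterDigits general categories."""
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES


if __name__ == "__main__":
    for ch in ("a", "\u00df", "\u263a"):  # a, sharp s, white smiling face
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter_digit(ch))
```

The smiley has general category So (Symbol, other), so it fails this first test, while it was accepted under IDNA2003.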

It's also not true for URLs in resources that depend on the mapping to
happen. Especially for uppercase/lowercase I would expect that to be
fairly common. And URLs in resources should remain
locale-insensitive. That they depend on encodings to some extent is
bad enough.


--
http://annevankesteren.nl/


Re: Standardizing on IDNA 2003 in the URL Standard

Mark Davis ☕
> It's also not true for URLs in resources that depend on the mapping to
> happen.

TR46 really has 3 parts:
  1. transitional handling for the 4 ambiguous characters
  2. inclusion of symbols
  3. client-side mapping (aka lowercasing)
Parts #1 and #2 are transitional in supporting IDNA2003 on the path to IDNA2008. 

Part #3 (client-side mapping) is something that is permitted by IDNA2008, and is thus optional for even a fully IDNA2008-compliant implementation.
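
A minimal sketch of how parts #1 and #3 interact, assuming the four characters in question are the UTS 46 "deviation" characters (sharp s, final sigma, ZWJ, ZWNJ). The names DEVIATIONS and map_label are invented for this illustration, and per-character lowercasing stands in for the full UTS 46 mapping table, which is much larger.

```python
import unicodedata

# The four deviation characters and their transitional (IDNA2003-style)
# mappings: sharp s -> "ss", final sigma -> sigma, joiners removed.
DEVIATIONS = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> SMALL SIGMA
    "\u200c": "",        # ZERO WIDTH NON-JOINER
    "\u200d": "",        # ZERO WIDTH JOINER
}


def map_label(label: str, transitional: bool) -> str:
    """Client-side mapping of one label (simplified stand-in for UTS 46)."""
    out = []
    for ch in label:
        if transitional and ch in DEVIATIONS:
            out.append(DEVIATIONS[ch])
        else:
            # Lowercase one character at a time so no contextual
            # casing rule is applied by the runtime.
            out.append(ch.lower())
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(map_label("Fu\u00dfball", transitional=True))
    print(map_label("Fu\u00dfball", transitional=False))
```

With transitional=True the sharp s is mapped away as in IDNA2003; with transitional=False only the lowercasing (part #3) remains and the sharp s survives.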




— Il meglio è l’inimico del bene —




RE: Standardizing on IDNA 2003 in the URL Standard

Shawn Steele
In reply to this post by Gervase Markham
> But I believe that it is. If there is a phishing problem in any particular TLD due to this change, then I place the blame for that squarely on the registry concerned.

Historically users blamed the browsers, not the registrars for things like the paypal-with-cyrillic-a homograph.

-Shawn

RE: Standardizing on IDNA 2003 in the URL Standard

John C Klensin-3


--On Wednesday, August 21, 2013 16:14 +0000 Shawn Steele
<[hidden email]> wrote:

>> But I believe that it is. If there is a phishing problem in
>> any particular TLD due to this change, then I place the blame
>> for that squarely on the registry concerned.
>
> Historically users blamed the browsers, not the registrars for
> things like the paypal-with-cyrillic-a homograph.

Shawn, you can generalize from that to "historically, users
blame either the software with which they directly interact or
blame their first-hop ISP" without any loss of
information.  Taking the Eszett problem as an example, if a
registry decides to register a label containing an Eszett but
block a similar one containing an "ss" (a rational, but probably
not optimal, strategy by Mark's reasoning or mine), then the
complaints will be about inaccessibility from an IDNA2003 or
IDNA2008-with-UTR46-transition=on browser.  If they allow "ss"
but not Eszett, then someone using an IDNA2008 browser (with no
transition tools) will be happy but someone expecting "ss" to
just work will be unhappy with all browsers.
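
To make the divergence concrete, here is a small sketch of what the two kinds of client would look up for the same typed name. Assumptions: Python's built-in "idna" codec is used as the IDNA2003 client (it implements RFC 3490 with Nameprep), and the second function is only a stand-in for an IDNA2008 client, since it Punycode-encodes non-ASCII labels without doing any IDNA2008 validation. Both function names are invented for this example.

```python
def idna2003_lookup_name(domain: str) -> str:
    """ASCII form an IDNA2003 client would look up (built-in codec)."""
    return domain.encode("idna").decode("ascii")


def idna2008_like_lookup_name(domain: str) -> str:
    """Rough stand-in for an IDNA2008 client with no transition mapping.

    Assumption: input labels are already lowercase and valid; no
    IDNA2008 validity checks are performed here.
    """
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


if __name__ == "__main__":
    name = "fu\u00dfball.de"
    print(idna2003_lookup_name(name))       # Eszett mapped to "ss"
    print(idna2008_like_lookup_name(name))  # Eszett kept, label is xn--...
```

The two results are different DNS names, so which site the user reaches depends on which label(s) the registry delegated and on which client is in use.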

That situation of course has the potential to provide clear
feedback to registries, even though well down in the tree.  If
they sell or otherwise allocate and delegate names that often
don't "work", they are likely to have trouble with their own
customers and constituencies.  Whether it is better to have
browsers (and other UIs) lead or follow is not a simple question
(although I clearly have biases about the right answer).

This is ultimately a "lose either way" situation, a problem that
was reasonably well understood and accepted when the IDNABIS WG
made its decisions.  The question is where on the curve one
wants to fall and when.  That question has no easy answers
although it is clear to me that "IDNA2003 forever" isn't one of
the reasonable ones.

best,
   john



Re: Standardizing on IDNA 2003 in the URL Standard

Gervase Markham
In reply to this post by Shawn Steele
On 21/08/13 17:14, Shawn Steele wrote:
>> But I believe that it is. If there is a phishing problem in any
>> particular TLD due to this change, then I place the blame for that
>> squarely on the registry concerned.
>
> Historically users blamed the browsers, not the registrars for things
> like the paypal-with-cyrillic-a homograph.

Historically, this is true. If it happens again, we plan to put up a
significantly more robust defence, based on a decade of experience since
then of what the problem is and who should be solving it.

Gerv



RE: Standardizing on IDNA 2003 in the URL Standard

Shawn Steele
In reply to this post by John C Klensin-3
IMO, the eszett & even more so, final sigma, are somewhat display issues.  My personal opinion is we need a display standard (yes, that's not easy).

A non-final sigma isn't (my understanding) a valid form of the word, so you shouldn't ever have both registered.  It could certainly be argued that 2003 shouldn't have done this mapping.  If these are truly mutually exclusive, then the biggest problem with 2003 isn't a confusing canonical form, but rather that it doesn't look right in the 2003 canonical form.  However there's no guarantee in DNS that I can have a perfect canonical form for my label.  Microsoft for example, is a proper noun, however any browser nowadays is going to display microsoft.com, not Microsoft.com.  (Yes, that's probably not "as bad" as the final sigma example).

Eszett is less clear, because using eszett or ss influences the pronunciation (at least in Germany, in Switzerland that can be different).  I imagine it's rather worse if you're Turkish and prefer different i's.  For German, nobody is ever going to expect fußball.ch and fussball.ch to go to different places.  And nobody's going to be surprised if fußball.de and fussball.de end up at the same site.  (On the contrary, they'd probably be surprised otherwise).  IMO, this is kind of like dove.com (a bird site) vs dove.com (a swimming site), they have different pronunciations.

For words that happen to be similar, there's no expectation that a DNS name is available.  AAA Plumbing and all the other AAA whatever's out there aren't going to be surprised that AAA.com is already taken.  So why's German more special than Turkish or English?  And particularly at the expense of spoofability?

I'd much prefer a mechanism to suggest a preferred display form.  That'd solve things like the Turkish I issue as well.

-Shawn



Re: Standardizing on IDNA 2003 in the URL Standard

John Cowan-3
Shawn Steele scripsit:

> A non-final sigma isn't (my understanding) a valid form of the word,

Alas, things are not so simple.  φιλος would be appropriate if the
semantic is 'friendship', but φιλοσ, with a non-final sigma, would
be appropriate as an abbreviation of φιλοσοφία 'philosophy'.
The Unicode rule is to downcase capital sigma to a non-final form if
a letter follows and to a final form otherwise, but this is just a
convention that dumb computers can follow rather than the whole truth.
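
A quick illustration of that convention, using Python's str.lower() (which applies the Unicode final-sigma rule) and str.casefold() (which folds both sigmas together). The helper names are made up; this is only a sketch of the behaviour, not of any IDNA mapping.

```python
PHILOS_UPPER = "\u03a6\u0399\u039b\u039f\u03a3"  # capital phi iota lambda omicron sigma
PHILOSOPHIA_UPPER = PHILOS_UPPER + "\u039f\u03a6\u0399\u0391"  # + omicron phi iota alpha

FINAL_SIGMA = "\u03c2"
SMALL_SIGMA = "\u03c3"


def downcase(word: str) -> str:
    """Lowercase with the contextual final-sigma rule applied."""
    return word.lower()


def fold(word: str) -> str:
    """Case-fold for comparison; final sigma becomes ordinary sigma."""
    return word.casefold()


if __name__ == "__main__":
    # Word-final capital sigma always comes out as final sigma, even if
    # the writer meant the string as an abbreviation.
    print(downcase(PHILOS_UPPER))
    # A capital sigma followed by a letter comes out as non-final sigma.
    print(downcase(PHILOSOPHIA_UPPER))
    # Folding erases the distinction in either direction.
    print(fold(downcase(PHILOS_UPPER)))
```

The computer cannot tell the full word from the abbreviation once the text is in capitals; it just applies the positional rule.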

> Eszett is less clear, because using eszett or ss influences the
> pronunciation (at least in Germany, in Switzerland that can be
> different).  I imagine it's rather worse if you're Turkish and prefer
> different i's.

Actually, missing diacritics aren't a big problem in Turkish for native
speakers, because of the vowel-harmony rules, which mean that most
words contain either the front vowels e, i, ö, and ü, or else the back
vowels a, ı (dotless i), o, and u, but not both in the same word.

> For German, nobody is ever going to expect fußball.ch and fussball.ch
> to go to different places.  And nobody's going to be surprised if
> fußball.de and fussball.de end up at the same site.

Well, there are minimal pairs like Buße 'fine' vs. Busse 'buses', but
that's livable, particularly because in Switzerland and Liechtenstein
they are both spelled "Busse" anyway.

--
"But I am the real Strider, fortunately,"       John Cowan
he said, looking down at them with his face     [hidden email]
softened by a sudden smile.  "I am Aragorn son  http://www.ccil.org/~cowan
of Arathorn, and if by life or death I can
save you, I will."  --LotR Book I Chapter 10
