Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together


David Singer
Hi

This was an informal email sent to a perhaps rather random collection of people at the IETF, W3C and Unicode Consortium, to see whether we need to kick off or re-open a conversation, now being re-posted to public-iri to enable conversation there (if the chairs approve).

I think that we have here one of those awkward areas that straddle technology and policy, and historically the technology groups have steered away from ‘policy questions’. Unfortunately, I think it’s a grey area and some policy answers have technology impacts, and that it’s possible to conceive of some that are ‘bad for the Internet’ or ‘break the web’. I think we need to find a way to enable the technical community to get more involved.

* * *

As I am sure you are aware, ICANN has introduced, and will introduce more, top-level domains, of which a number are or will be non-Latin-script.

I have a suspicion that some of the RFCs and other documents that exist were written ‘knowing’ that the top-level domains were essentially just the historic 6 (com, mil, net, org, edu and arpa) and the geographic ones.

It also seems that some of the treatment of ‘structured text’ — that has a structure and meaning associated with that structure, such as URLs and mail addresses — was defined assuming that we would not, or did not need to, treat it differently from regular text.

Attached you will find a PDF document (sorry, since appearance is an important part of the discussion, PDF seemed best; I hope that the formatting and so on has not got messed up), outlining some issues we have noticed recently, and concluding with some recommendations based on those issues. I rather suspect that there are more issues than I outline.  I do wonder if we should be taking more positive steps to build up a shared set of test cases as well, that check for resolution, presentation, entry, selection, and other problems in domain names. I am also aware that in some places the Public Suffix List is used for a secondary purpose, as a way to sanity check host names.




The document doesn’t “dig deeper” into motivations or policy. But I think it’s worth asking “what would we prefer we had done; how could we have met the needs better?”. We might not be able to get there from here any more, but it’s always (in my opinion) worth knowing what you’d really like and what its characteristics are. There is also a presumption in the ICANN community that Universal Acceptance of whatever is introduced is desirable and expected, and I fear that there may be real technical or human issues (e.g. readability, phishing) that call that assumption into question.

* * *

As an example of a direction we might have taken, I wonder whether we’d be better off if we did something like the following.

We have an undeniable need to change the net so that it is not Latin-script (really English) centric. The response at the moment is to introduce new domains with names written in other scripts. However, this seems to assume that people will only be exposed to their ‘local’ internet — that names written in Korean will not ‘leak out’ of Korea. I think this is both unrealistic and contrary to the spirit of ‘one web’. In the current regime, anyone in the world can read any email or URL address as long as they can read English. But we seem to be heading towards a world in which everyone will need to be able to read every script, which is, I think, unrealistic. Simple questions — is this the right address? is it plausible? is it phishing? — may be unanswerable if the user cannot read the script(s) the URL is written in.

What could we do? We could introduce into DNS the possibility that a domain name can have ‘aliases’, and that those aliases can be in other scripts (maybe using CNAME/DNAME, but probably a new record type that carries a script code). Then it would be *possible* to take a hostname and look, for each domain name in it, for an alias written in a script preferred by the user we’re presenting to, and we’d know that the resulting hostname would be, from a resolution point of view, functionally identical. (It probably would not be functionally identical if the URL or email address is used as an identifier.) (There are obvious issues here with what the canonical form is, how long such lookup and translation would take, and so on.)

This is a ‘thought experiment’, surely not a proposal.
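
To make the thought experiment concrete, here is a minimal Python sketch of the lookup. Everything in it is an assumption for illustration: no such alias record exists in DNS, the in-memory table stands in for whatever a resolver might return, and the labels, the script code "Latn" and the function name are invented.

```python
from typing import Dict, Tuple

# Hypothetical alias data: (parent domain, label) -> {script code: alias label}.
# In the thought experiment this would come from a new DNS record type;
# here it is a hard-coded table with invented names.
AliasTable = Dict[Tuple[str, str], Dict[str, str]]

ALIASES: AliasTable = {
    ("", "예시"): {"Latn": "yesi"},       # invented top-level label
    ("예시", "가게"): {"Latn": "gage"},   # invented second-level label
}


def present_in_script(hostname: str, script: str,
                      table: AliasTable = ALIASES) -> str:
    """Rewrite each label to its alias in `script`, where one is known."""
    shown = []
    parent = ""
    # Walk from the top-level label downwards, as delegation does.
    for label in reversed(hostname.split(".")):
        shown.append(table.get((parent, label), {}).get(script, label))
        parent = label if not parent else f"{label}.{parent}"
    return ".".join(reversed(shown))


print(present_in_script("www.가게.예시", "Latn"))
```

Labels with no alias in the requested script are left as they are, so the sketch says nothing about which form is canonical or what the lookups would cost.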

* * *

In summary: do we need to raise the level of discussion in the technical community, and if so, how?


David Singer
Manager, Software Standards, Apple Inc.


Attachment: TLD_Challenges_R2.pdf (325K)

Re: Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together

masinter

References: you might want to look at the (expired) draft
https://tools.ietf.org/html/draft-ruby-url-problem-01
(from https://github.com/webspecs/url; also see the issues
labeled IETF) as (dated) background.

Until the implementors of URL processing are willing to
work on the many interoperability problems in general,
it seems just wishful thinking that they would implement
improved presentation of bidi URLs.

I think the best that can be accomplished in the
current environment would be to advise ICANN and
domain registrars to avoid bidi in URLs.




________________________________________
From: [hidden email] <[hidden email]> on behalf of David Singer <[hidden email]>
Sent: Sunday, October 25, 2015 5:06 PM
To: [hidden email]
Subject: Concerns about new domain names,  particularly non-Latin-scripts -- getting the tech community together


RE: Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together

Shawn Steele
It wouldn't hurt to improve guidance around bidi IRIs. We're trying to break them into labels and arrange the labels consistently in right-to-left (or left-to-right) order. E.g., a bidi rendering of http://www.microsoft.com/some/url.html?this=that would be that=this?html.url/some/com.microsoft.www//:http. (Presumably that would only be triggered if there was bidi content in the IRI, and perhaps be overridden by user locale or preference.)

Our investigation has shown that users expect the precedence to be consistently ordered from left to right or right to left, regardless of the RTL/LTR letters in any particular label or component of the IRI.

Of course this leads to other interesting questions.  The one thing that is clear is that "mixing" RTL and LTR runs within the same IRI is just tremendously confusing, and potentially spoofable.
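
The reordering in the example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual implementation: the function name and the set of delimiter characters are assumptions, and a real renderer would work on display order rather than on the stored string.

```python
import re

# Assumed delimiter set; the thread does not say which characters
# are treated as separating the parts of an IRI.
TOKEN = re.compile(r"[^:/.?=&#]+|[:/.?=&#]")


def reverse_components(iri: str) -> str:
    """Reverse the order of labels and delimiters, keeping each label intact."""
    return "".join(reversed(TOKEN.findall(iri)))


print(reverse_components("http://www.microsoft.com/some/url.html?this=that"))
# prints: that=this?html.url/some/com.microsoft.www//:http
```

Each delimiter is its own token, which is why "://" comes out as "//:" while the characters inside every label keep their original order.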

-Shawn

-----Original Message-----
From: Larry Masinter [mailto:[hidden email]]
Sent: October 27, 2015 10:16 PM
To: David Singer <[hidden email]>; [hidden email]
Cc: Sam Ruby <[hidden email]>
Subject: Re: Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together



Re: Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together

John C Klensin-4
In reply to this post by masinter


--On Wednesday, October 28, 2015 05:16 +0000 Larry Masinter
<[hidden email]> wrote:

> I think the best that can be accomplished in the
> current environment would be to advise ICANN and
> domain registrars to avoid bidi in URLs.

Just so I understand what you are suggesting, do you mean:

 (1) "don't use { Arabic, Hebrew, etc. } at all"
 (2) "don't use { Arabic, Hebrew, etc. } if digits appear
        anywhere (in any label) in the domain name"
        or
 (3) "don't use { Arabic, Hebrew, etc. } unless the FQDN
        is entirely in that (or some other nominally RtoL)
        script _and_ do not allow digits anywhere in the domain
        name"
 (4) and (5) Either (2) or (3) but with "anywhere in"
        replaced by "at the beginning or end of".
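
As a rough illustration of how the readings above differ, the following Python sketch reports which of them a given name would fall foul of. It is a simplification under stated assumptions: "RtoL script" is approximated by the Unicode bidi classes R and AL, digits by EN and AN, and "at the beginning or end of" is taken to mean the edge of any label. The function name is invented.

```python
import unicodedata

RTL = {"R", "AL"}     # right-to-left letters (Hebrew-like, Arabic-like)
DIGIT = {"EN", "AN"}  # European and Arabic-Indic digits


def violated_readings(fqdn: str) -> set:
    """Return the numbers of the readings (1-5) that would reject this name."""
    labels = [[unicodedata.bidirectional(ch) for ch in label]
              for label in fqdn.split(".") if label]
    flat = [cls for label in labels for cls in label]
    if not any(cls in RTL for cls in flat):
        return set()  # no RTL characters, so none of the readings apply
    has_ltr = "L" in flat
    digit_anywhere = any(cls in DIGIT for cls in flat)
    digit_at_edge = any(label[0] in DIGIT or label[-1] in DIGIT
                        for label in labels)
    violated = {1}  # reading (1) rejects any RTL use at all
    if digit_anywhere:
        violated.add(2)
    if has_ltr or digit_anywhere:
        violated.add(3)
    if digit_at_edge:
        violated.add(4)  # reading (2) with "at the beginning or end of"
    if has_ltr or digit_at_edge:
        violated.add(5)  # reading (3) with "at the beginning or end of"
    return violated


# Hebrew label followed by a Latin label: mixed direction, no digits.
print(sorted(violated_readings("\u05e9\u05dc\u05d5\u05dd.example")))
```

The strings use escapes so that the source itself is not reordered by a bidi-aware editor.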
 
--john





Re: Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together

masinter
In reply to this post by Shawn Steele
>> It wouldn't hurt to improve guidance around Bidi IRIs.... We're trying to break them into labels and arrange each label in right to left (or left to right) order.

(I'm using URL = IRI)

I think there are two kinds of guidance:

(1) Technical specifications for implementors
(2) Advice we might give to ICANN, domain name registrars and registrants about requests for RTL in TLDs and inside domain names, and to application developers hoping to use RTL components in their IRIs (URLs).

Unless we have a critical mass of the implementors of popular IRI processors who are willing to resolve conflicts and make URLs work uniformly (Bidi only being one case), there's not much point in trying to do (1).

The IRI working group closed because the implementors weren't playing. And, unless things have changed in the last several months, there's not much interest in fixing URLs to work consistently, even without RTL components. Each browser has its own hacks.

Secondly, URLs are used in running text, where the text display will use whatever it uses for text display, and there's no opportunity to introduce anything else. And in this world, the natural way you'd put together a URL is

<method> : // <host> / <path>

where <host> and <path> are allowed to be RTL.

In many cases, there's no way to modify software to both make the URLs look nice (as users expect) and be consistent with the Unicode standard.

If there's no way to make it work everywhere, we should first advise registrars to be cautious about selling domain names which will cause the buyers problems when they go to deploy, namely, anything other than LTR.

Larry
--
http://larry.masinter.net





RE: Concerns about new domain names, particularly non-Latin-scripts -- getting the tech community together

Shawn Steele
> <method> : // <host> / <path>
> where <host> and <path> are allowed to be RTL

> In many cases, there's no way to modify software to both make the URLs look nice (as users expect) and be consistent with the Unicode standard.

Which is why we (Microsoft) intend to arrange the sections consistently, either all to the right or all to the left, for a single link. The trick is what delimits the parts. That led to examples like http://ltra.ltrb and LTRB.LTRA//:http - there's no way to predict how flipping the important bits is going to confuse the reading of the label if you start making runs of LTR and RTL that cross the logical sections of the URL.

Mark Davis & I chatted about whether Unicode should suggest a way of handling things like this in BIDI.  If you extrapolate the problem, this isn't the only issue with the BIDI algorithm in contexts like these.  For example, I could have a list of winners of a contest:  The top finishers are apple, banana, carrot, durian, and eggplant.  If a couple of adjacent ones in that list are in a different script, then the list is going to get confused without helper bidi marks.  Perhaps there would be a general way to tweak the algorithm for runs of ordered text?
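
One existing form of "helper bidi marks" for the list case is the pair of isolate controls, FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069). The Python sketch below wraps each list item in that pair; it is an assumption that this is the kind of helper meant here, the function name is invented, and it does not change the bidi algorithm itself.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_items(items, separator=", "):
    """Wrap each item in an isolate so its direction stays local to the item."""
    return separator.join(f"{FSI}{item}{PDI}" for item in items)


# The second item is a Hebrew word written with escapes.
finishers = ["apple", "\u05d1\u05e0\u05e0\u05d4", "carrot"]
print(repr(isolate_items(finishers)))
```

With each item isolated, a renderer that supports isolates should lay the list out in the paragraph direction even when adjacent items are in a right-to-left script.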

-Shawn