why use IRIs?


why use IRIs?

Peter Saint-Andre-2
<hat type='individual'/>

I've been thinking about IRIs, and I'm wondering: why would a protocol
"upgrade" from URIs to IRIs? (If it really is an "upgrade" -- a topic
for another time.)

Consider HTTP. It has always used URIs for retrieving documents and
linking and such. Why would it change to use IRIs? Section 1.2 of
3987bis describes some necessary conditions for such a change, but
doesn't really motivate why the HTTP community would want to do so. Yes,
there is text in Section 1.1 about representing the words of natural
languages, but URIs can be used to represent those words right now. I
grant that the current mechanism for such representation isn't pretty,
but do the addressing elements of a protocol like HTTP need to be
pretty, or can we simply depend on the presentation software (e.g., web
browsers) to make things look nice for the user? (Certainly we do that
with structural elements like the HTML document format, why not also
with addressing elements like URIs?) I realize that these questions get
back to the matter of "protocol element" vs. "presentation", but I guess
what I'm saying is that I don't yet think we've really explained why we
need to make IRIs a first-class protocol element (or why a given
protocol would want to make the switch from URI-only to IRI).
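(For concreteness: the not-so-pretty mechanism I mean is percent-encoding
of UTF-8 octets. A rough Python sketch, purely illustrative:)

```python
from urllib.parse import quote, unquote

# A non-ASCII word can be carried in a URI today by percent-encoding
# its UTF-8 bytes -- legal, just not pretty.
word = "héllo"
encoded = quote(word)            # 'h%C3%A9llo'
assert unquote(encoded) == word  # and it decodes back losslessly
```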

Furthermore, 3987bis doesn't really explain what would be involved in
the change from URI-only to IRI in any given protocol. I suppose spec
writers in a technology community like HTTP would need to figure it out,
but IMHO some guidelines would be helpful.

Peter

--
Peter Saint-Andre
https://stpeter.im/





Re: why use IRIs?

Bjoern Hoehrmann
* Peter Saint-Andre wrote:
>I've been thinking about IRIs, and I'm wondering: why would a protocol
>"upgrade" from URIs to IRIs? (If it really is an "upgrade" -- a topic
>for another time.)

Looking at http://lists.w3.org/Archives/Public/uri/2001Jun/0027.html I
wonder whether you are two days late or one day early with the question,
depending on whether you ignore leap days. The "upgrade" question does
not seem very relevant to me, I would rather ask about new protocols and
go from there. I see no reason why http://björn.höhrmann.de/ should be
an error in any new protocol or format that does not suffer
compatibility problems if it allows non-ASCII literals in "other places".

>I guess what I'm saying is that I don't yet think we've really explained
>why we need to make IRIs a first-class protocol element (or why a given
>protocol would want to make the switch from URI-only to IRI).

URIs are technical debt. If we could wish them away, we would, as having
them and also IRIs as "first-class" protocol elements is very expensive.

How would you like it if URIs could use only 20 of the 26 letters in the
English alphabet, and you would have to encode, decode, and convert them
all the time, or use awkward transliterations to avoid having to do so?

>Furthermore, 3987bis doesn't really explain what would be involved in
>the change from URI-only to IRI in any given protocol. I suppose spec
>writers in a technology community like HTTP would need to figure it out,
>but IMHO some guidelines would be helpful.

You just change specifications and software and content as needed. If
there are problems in doing so, there does not seem to be much that we
could say on how to address those as they would be technology-specific.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: why use IRIs?

Martin J. Dürst
In reply to this post by Peter Saint-Andre-2
Hello Peter,

I think Björn already gave very good answers to your questions.

On 2012/06/22 3:28, Peter Saint-Andre wrote:
> <hat type='individual'/>
>
> I've been thinking about IRIs, and I'm wondering: why would a protocol
> "upgrade" from URIs to IRIs?

As Björn said, it's really more about new protocols than about upgrades.
Also, different protocols (and formats) can upgrade in different ways.
Sometimes, this can be done formally with extensions, at other times
it's done gradually and sooner or later gets accepted in a spec. For
other cases, of course, it may never happen.

> (If it really is an "upgrade" -- a topic
> for another time.)
>
> Consider HTTP. It has always used URIs for retrieving documents and
> linking and such.

[There are some reports of clients just sending UTF-8, which I think
would mean using IRIs. But that has never reached the spec.]


> Why would it change to use IRIs? Section 1.2 of
> 3987bis describes some necessary conditions for such a change, but
> doesn't really motivate why the HTTP community would want to do so. Yes,
> there is text in Section 1.1 about representing the words of natural
> languages, but URIs can be used to represent those words right now. I
> grant that the current mechanism for such representation isn't pretty,
> but do the addressing elements of a protocol like HTTP need to be
> pretty, or can we simply depend on the presentation software (e.g., web
> browsers) to make things look nice for the user?

I think the real motivation would be people looking at HTTP traces and
preferring to see Unicode rather than lots of %HH strings. Of course the
number of people looking at HTTP traces is low, and they are not end users.

In general, the motivation to use IRIs is highest closer to end users
and content-oriented people such as document authors, and gets lower the
lower one gets in the protocol stack.

Another motivation may be compression.
http://ja.wikipedia.org/wiki/青山学院大学 is quite a bit shorter than
http://ja.wikipedia.org/wiki/%E9%9D%92%E5%B1%B1%E5%AD%A6%E9%99%A2%E5%A4%A7%E5%AD%A6.
So maybe we can sell that to HTTP 2.0. But I'm somewhat skeptical. Only
a tiny bit of creative thinking would have been needed to transition
various header fields in HTTP from the hopelessly outdated iso-8859-1
(Latin-1) to UTF-8, but it didn't happen :-(.
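To put numbers on the compression point (a quick Python check of the two
paths above; nothing normative):

```python
from urllib.parse import quote

iri_path = "青山学院大学"
uri_path = quote(iri_path)  # the %HH form of the same path segment
# 6 characters versus 54: each character is 3 UTF-8 bytes, and each
# byte becomes a three-character %HH triplet.
print(len(iri_path), len(uri_path))  # → 6 54
```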

The best motivation would be streamlining. EAI does a lot of
streamlining for e-mail; if it weren't for all the legacy baggage, it
would be a joy to implement. For HTTP, if browsers use Unicode
internally, and servers use it internally, what's the need for this
weird %HH stuff anyway? (It's still needed to escape reserved
characters, though.)


> (Certainly we do that
> with structural elements like the HTML document format, why not also
> with addressing elements like URIs?) I realize that these questions get
> back to the matter of "protocol element" vs. "presentation", but I guess
> what I'm saying is that I don't yet think we've really explained why we
> need to make IRIs a first-class protocol element (or why a given
> protocol would want to make the switch from URI-only to IRI).
>
> Furthermore, 3987bis doesn't really explain what would be involved in
> the change from URI-only to IRI in any given protocol. I suppose spec
> writers in a technology community like HTTP would need to figure it out,
> but IMHO some guidelines would be helpful.

As I said at the start of this mail, I think it depends a lot on the
specific protocol. The conditions we give in Section 1.2 are general
considerations that apply to any protocol/format. Protocol-specific
considerations should do the rest, and I'm not sure it makes sense to
write much about this.

But when looking at Section 1.2, I realized that the first sentence
might have been the motivation for your mail. This sentence says:
    IRIs are designed to allow protocols and software that deal with URIs
    to be updated to handle IRIs.
I think that this puts too much emphasis on "update", but I'm not yet
sure how to fix that.

Regards,   Martin.


Re: why use IRIs?

Peter Saint-Andre-2
Hi Martin, thanks for the clarification. I have a few comments inline.

On 6/25/12 3:22 AM, "Martin J. Dürst" wrote:

> Hello Peter,
>
> I think Björn already gave very good answers to your questions.
>
> On 2012/06/22 3:28, Peter Saint-Andre wrote:
>> <hat type='individual'/>
>>
>> I've been thinking about IRIs, and I'm wondering: why would a protocol
>> "upgrade" from URIs to IRIs?
>
> As Björn said, it's really more about new protocols than about upgrades.
> Also, different protocols (and formats) can upgrade in different ways.
> Sometimes, this can be done formally with extensions, at other times
> it's done gradually and sooner or later gets accepted in a spec. For
> other cases, of course, it may never happen.
>
>> (If it really is an "upgrade" -- a topic
>> for another time.)
>>
>> Consider HTTP. It has always used URIs for retrieving documents and
>> linking and such.
>
> [There are some reports of clients just sending UTF-8, which I think
> would mean using IRIs. But that has never reached the spec.]

Do you think it should reach the spec?

>> Why would it change to use IRIs? Section 1.2 of
>> 3987bis describes some necessary conditions for such a change, but
>> doesn't really motivate why the HTTP community would want to do so. Yes,
>> there is text in Section 1.1 about representing the words of natural
>> languages, but URIs can be used to represent those words right now. I
>> grant that the current mechanism for such representation isn't pretty,
>> but do the addressing elements of a protocol like HTTP need to be
>> pretty, or can we simply depend on the presentation software (e.g., web
>> browsers) to make things look nice for the user?
>
> I think the real motivation would be people looking at HTTP traces and
> preferring to see Unicode rather than lots of %HH strings. Of course the
> number of people looking at HTTP traces is low, and they are not end users.
>
> In general, the motivation to use IRIs is highest closer to end users
> and content-oriented people such as document authors, and gets lower the
> lower one gets in the protocol stack.

It seems to me that end users can be shielded from what you call "this
weird %HH stuff" (after all, we don't show them "this weird
angle-bracket stuff" either), but what you say about document authors
and operations people makes sense. Perhaps it would be good to capture
that in the spec.

> Another motivation may be compression.
> http://ja.wikipedia.org/wiki/青山学院大学 is quite a bit shorter than
> http://ja.wikipedia.org/wiki/%E9%9D%92%E5%B1%B1%E5%AD%A6%E9%99%A2%E5%A4%A7%E5%AD%A6.
> So maybe we can sell that to HTTP 2.0. But I'm somewhat skeptical. Only
> a tiny bit of creative thinking would have been needed to transition
> various header fields in HTTP from the hopelessly outdated iso-8859-1
> (Latin-1) to UTF-8, but it didn't happen :-(.
>
> The best motivation would be streamlining. EAI does a lot of
> streamlining for e-mail; if it weren't for all the legacy baggage, it
> would be a joy to implement. For HTTP, if browsers use Unicode
> internally, and servers use it internally, what's the need for this
> weird %HH stuff anyway? (It's still needed to escape reserved
> characters, though.)
>
>
>> (Certainly we do that
>> with structural elements like the HTML document format, why not also
>> with addressing elements like URIs?) I realize that these questions get
>> back to the matter of "protocol element" vs. "presentation", but I guess
>> what I'm saying is that I don't yet think we've really explained why we
>> need to make IRIs a first-class protocol element (or why a given
>> protocol would want to make the switch from URI-only to IRI).
>>
>> Furthermore, 3987bis doesn't really explain what would be involved in
>> the change from URI-only to IRI in any given protocol. I suppose spec
>> writers in a technology community like HTTP would need to figure it out,
>> but IMHO some guidelines would be helpful.
>
> As I said at the start of this mail, I think it depends a lot on the
> specific protocol. The conditions we give in Section 1.2 are general
> considerations that apply to any protocol/format. Protocol-specific
> considerations should do the rest, and I'm not sure it makes sense to
> write much about this.
>
> But when looking at Section 1.2, I realized that the first sentence
> might have been the motivation for your mail. This sentence says:
>    IRIs are designed to allow protocols and software that deal with URIs
>    to be updated to handle IRIs.
> I think that this puts too much emphasis on "update", but I'm not yet
> sure how to fix that.

Well, "update" is not "upgrade", so perhaps I have read too much into
the text. However, I think we could change it to read:

   IRIs are designed to allow protocols and software that deal with URIs
   to also handle IRIs if desired.

Peter

--
Peter Saint-Andre
https://stpeter.im/






Re: why use IRIs?

John C Klensin
In reply to this post by Peter Saint-Andre-2
(sorry - sent from wrong address)

--On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
<[hidden email]> wrote:

> Hello Peter,
>
> I think Björn already gave very good answers to your
> questions.

Martin, Björn, Peter,

> On 2012/06/22 3:28, Peter Saint-Andre wrote:
>> <hat type='individual'/>
>>
>> I've been thinking about IRIs, and I'm wondering: why would a
>> protocol "upgrade" from URIs to IRIs?
>
> As Björn said, it's really more about new protocols than
> about upgrades. Also, different protocols (and formats) can
> upgrade in different ways. Sometimes, this can be done
> formally with extensions, at other times it's done gradually
> and sooner or later gets accepted in a spec. For other cases,
> of course, it may never happen.
>...

For whatever it is worth, I don't find that answer particularly
helpful.  My problem with it is one that we have discussed
pieces of before.  If the requirement were to make something
that was coupled closely enough to URIs to be a UI overlay, then
we have one set of issues.  The WG has moved beyond that into
precisely what you are commenting on above and what the key
draft seems to reflect -- a new protocol element to be used
primarily in new, or radically updated/upgraded, protocols.  

But, if we are going to define a new protocol element for new
uses, then why stick with the basic URI syntax framework?  We
already know that causes problems.  It is hard to localize
because it contains a lot of ASCII characters that are special
sometimes and not others, that may have non-Latin-script
lookalikes, and because parsing is method-dependent.  That
method-dependency makes it very hard to create variations that
are appropriate to the local writing system because one has to
be method-sensitive at too many different points.  If some
protocols are to permit only IRIs, some only URIs, and some
both, it would also be beneficial to be able to determine which
is which, rather than wondering whether an IRI that actually
contains only ASCII characters (and no escapes) is actually an
IRI or is just the URI it looks like.  Again, as long as IRIs
were just a UI overlay, it made no difference.  But it does
matter as a protocol element.

I continue to believe that makes a strong case for doing
something that gets us internationalization by moving away from
the URI syntax model, probably to something that explicitly
identifies the data elements that make up a particular URI.  If,
for example, one insisted that domain names be identified as
such wherever they appear, the mess about whether something can
or should be given IDNA treatment (even if only to verify
U-label syntax) and the associated RFC 6055 considerations
become much easier to handle than if one has to guess whether
something might be a domain name or something else with periods
in it.

Stated a little differently, if IRIs are protocol elements that
are intended to support new protocols, then it seems to me that
it is not obvious that the URI syntax is a constraint.
Certainly the WG has not had a serious discussion about what the
advantages of that constraint are and whether they outweigh the
disadvantages.

best,
    john





Re: why use IRIs?

Martin J. Dürst
In reply to this post by John C Klensin
Hello John,

On 2012/07/02 10:03, John C Klensin wrote:
> (sorry - sent from wrong address)

[Sorry for forwarding as a moderator, I missed this one at first.
I have added the other address to the ignore list, so you should be able
to post from either in the future.]

> --On Monday, June 25, 2012 18:22 +0900 "\"Martin J. Dürst\""
> <[hidden email]>  wrote:

>> As Björn said, it's really more about new protocols than
>> about upgrades. Also, different protocols (and formats) can
>> upgrade in different ways. Sometimes, this can be done
>> formally with extensions, at other times it's done gradually
>> and sooner or later gets accepted in a spec. For other cases,
>> of course, it may never happen.
>> ...
>
> For whatever it is worth, I don't find that answer particularly
> helpful.  My problem with it is one that we have discussed
> pieces of before.  If the requirement were to make something
> that was coupled closely enough to URIs to be a UI overlay, then
> we have one set of issues.  The WG has moved beyond that into
> precisely what you are commenting on above and what the key
> draft seems to reflect -- a new protocol element to be used
> primarily in new, or radically updated/upgraded, protocols.

It looks as if some of the discussion in the IRI WG might have led to
the assumption that we are moving to calling IRIs a "Protocol Element"
starting with the revision of RFC 3987. This is wrong.

RFC 3987 defines IRIs as a protocol element. Please see the first line
of the abstract at http://tools.ietf.org/html/rfc3987.

Also, please note that IRIs have been working, and are working, for a
long time in protocols/formats that are in no way new. The prime
example here is HTML (of course, there they work with some warts, but
no more warts than the average HTML feature).


> But, if we are going to define a new protocol element for new
> uses, then why stick with the basic URI syntax framework?  We
> already know that causes problems.

First, as said above, your presumption is wrong. Second, other solutions
have been shown to have problems too.


> It is hard to localize
> because it contains a lot of ASCII characters that are special
> sometimes and not others, that may have non-Latin-script
> lookalikes, and because parsing is method-dependent.  That
> method-dependency makes it very hard to create variations that
> are appropriate to the local writing system because one has to
> be method-sensitive at too many different points.

The fact that URI/IRI characters are sometimes special and sometimes not
comes from the fact that URIs/IRIs combine a lot of different
components, and from the desire of people to not have to escape more
than absolutely necessary. You can always just go ahead and escape all
delimiters, and be on the safe side, if you don't want to complicate
your life. This is completely independent of IRIs.
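A small Python sketch of the "escape all delimiters" option (illustrative
only):

```python
from urllib.parse import quote

# With safe="" even the reserved delimiters are escaped, so no
# character in the result is context-sensitive any more.
quote("a=b&c?d", safe="")   # → 'a%3Db%26c%3Fd'
```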

The problem with non-Latin-script (or for that matter, even Latin
script) lookalikes is already present (and not solved (*)) in domain
names. It's also a problem in internationalized email addresses, because
there's ＠ (U+FF20), a full-width variant of @.

[(*) IDNA 2003 had a partial solution, but IDNA 2008 abandoned it.]

As for method-dependent parsing, do you mean scheme-dependent parsing?
Given the wide variety of different syntax that all the various URI/IRI
schemes deal with, the amount of parsing that can be done generically is
actually pretty amazing, I'd think.

> If some
> protocols are to permit only IRIs, some only URIs, and some
> both, it would also be beneficial to be able to determine which
> is which, rather than wondering whether an IRI that actually
> contains only ASCII characters (and no escapes) is actually an
> IRI or is just the URI it looks like.

There is no "only IRIs". IRIs always include URIs. With that tweak,
let's rewrite the above sentence in two different ways:

If some protocols/formats/applications are to permit only ASCII domain
names, and others both ASCII and internationalized domain names, it
would also be beneficial to be able to determine which is which, rather
than wondering whether an IDN that actually contains only ASCII
characters is actually an IDN or is just the ASCII domain name it looks
like.

If some protocols/formats/applications are to permit only ASCII email
addresses, and others both ASCII and internationalized email addresses,
it would also be beneficial to be able to determine which is which,
rather than wondering whether an internationalized email address that
actually contains only ASCII characters is actually an internationalized
email address or is just the ASCII email address it looks like.

I don't see a problem, but if IRIs have a problem, so do IDNs and
internationalized email addresses.


> I continue to believe that makes a strong case for doing
> something that gets us internationalization by moving away from
> the URI syntax model, probably to something that explicitly
> identifies the data elements that make up a particular URI.  If,
> for example, one insisted that domain names be identified as
> such wherever they appear, the mess about whether something can
> or should be given IDNA treatment (even if only to verify
> U-label syntax) and the associated RFC 6055 considerations
> become much easier to handle than if one can to guess whether
> something might be a domain name or something else with periods
> in it.

This problem has three levels of difficulty.

1) For those schemes that follow the generic syntax (e.g. http,
ftp,...), the domain name is easy to find.

2) There are a few schemes that don't use generic syntax, but use
domain names. A typical example is mailto:. Here you need
scheme-specific processing.

3) Many URI schemes are open-ended. The typical example is the query
part of the http scheme, which can contain domain names or even
(suitably encoded) whole URIs. This is an example, please not the
"www.ietf.org" at the end:
http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org
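The contrast between levels 1 and 3 can be sketched with Python's stdlib
parser (illustrative; the URL is the one above):

```python
from urllib.parse import urlparse, parse_qs

u = urlparse("http://www.google.com/search?as_q=URI&as_sitesearch=www.ietf.org")
# Level 1: for a generic-syntax scheme, the domain name is easy to find.
assert u.hostname == "www.google.com"
# Level 3: the query also carries a domain name, but nothing in the
# syntax identifies it as one -- it is just an opaque parameter value.
assert parse_qs(u.query)["as_sitesearch"] == ["www.ietf.org"]
```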

It is rather trivial to come up with a kind of format/data structure for
this. I'll give a concrete example using XML, but of course, JSON or
some other popular format would also work. The details are mostly
bike-shedding.

<IRI>
   <scheme>http</scheme>
   <host type='dns'>
     <label>www</label>
     <label>google</label>
     <label>com</label>
   </host>
   <path>
     <segment>search</segment>
   </path>
   <query>
     <parameter>
       <name>as_q</name>
       <value>URI</value>
     </parameter>
     <parameter>
       <name>as_sitesearch</name>
       <value type='dns'>
         <label>www</label>
         <label>ietf</label>
         <label>org</label>
       </value>
     </parameter>
   </query>
</IRI>

Note that this duly identifies DNS 'stuff'. It's probably not too
difficult for anybody to figure out why
people/applications/formats/protocols use URIs/IRIs rather than
something like the example above. I'm leaving this as an "exercise for
the reader".


> Stated a little differently, if IRIs are protocol elements that
> are intended to support new protocols, then it seems to me that
> it is not obvious that the URI syntax is a constraint.
> Certainly the WG has not had a serious discussion about what the
> advantages of that constraint are and whether they outweigh the
> disadvantages.

I hesitate to refer to the charter of the IRI WG
(http://datatracker.ietf.org/wg/iri/charter/) because some aspects of it
(in particular the milestones) are hopelessly out of date. I see no
indication whatsoever about removing the URI syntax constraint, and many
indications that strongly (although not explicitly) contradict
such a proposal.


Please note that while IRIs are intended for new protocols (in the sense
that new protocols should preferably use IRIs and not just URIs), they
are also intended for "gradual" updates where that's appropriate, and
they are already used in many protocols/formats.


Regards,    Martin.


Re: why use IRIs?

Mark Nottingham-2
In reply to this post by Peter Saint-Andre-2
I tend to agree with Peter.

The experience of using IRIs as identifiers in Atom was, IME, a disaster. Identifiers need to be resistant to spoofing and mistakes. Exposing a significant portion of the Unicode character plane in them doesn't do anyone any good.

As a presentation element? Fine, but AFAIK we Don't Do That Here. In places where Users touch (e.g., HTML)? Sure, but We Don't Do That here.

There may be a *few* places in protocols that are user-visible, but AFAICT we're not doing a lot of new protocols recently (thank goodness).

Björn said:

> How would you like it if URIs could use only 20 of the 26 letters in the
> English alphabet, and you would have to encode, decode, and convert them
> all the time, or use awkward transliterations to avoid having to do so?

URIs already have a constrained syntax; you can't use certain characters in certain places. As long as people can put IRIs into HTML and browser address bars, I don't think they'll care.

Martin said:

> I think the real motivation would be people looking at HTTP traces and
> preferring to see Unicode rather than lots of %HH strings. Of course the
> number of people looking at HTTP traces is low, and they are not end users.

Is this use case really worth the pain, inefficiency, and very likely security vulnerabilities caused by transcoding from IRIs to URIs and back when hopping from HTTP 2.0 to 1.1 and back? I don't think so.
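The transcoding step Mark worries about can be sketched in a few lines of Python (hypothetical paths, path component only; host labels would go through IDNA instead):

```python
# Sketch of the IRI <-> URI transcoding step under discussion, for the
# path component only.
from urllib.parse import quote, unquote

def iri_path_to_uri(path: str) -> str:
    # Percent-encode the UTF-8 bytes of any character outside the URI
    # character set; '/' and unreserved characters pass through.
    return quote(path, safe="/")

def uri_path_to_iri(path: str) -> str:
    return unquote(path)

iri = "/münchen/straße"
uri = iri_path_to_uri(iri)          # '/m%C3%BCnchen/stra%C3%9Fe'
assert uri_path_to_iri(uri) == iri  # this direction round-trips cleanly
# ...but the reverse direction is lossy: two distinct URIs collapse
# into the same IRI, which is where the aliasing trouble starts.
assert uri_path_to_iri("/a%2Fb") == uri_path_to_iri("/a/b")
```

The IRI-to-URI direction round-trips, but going URI-to-IRI and back collapses distinct URIs into one string, which is the kind of hop-by-hop hazard at issue here.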


My English-centric .02; ŸṀṂṼ.

Regards,


--
Mark Nottingham   http://www.mnot.net/





Re: why use IRIs?

Bjoern Hoehrmann
* Mark Nottingham wrote:
>I tend to agree with Peter.
>[...]

This doesn't really help me understand where you see problems with IRIs.
Could you take a simple example like http://björn.höhrmann.de/ and tell
me of some places where I should be unable to use that even though I can
use http://bjoern.hoehrmann.de/ in the same place, without arguing about
limitations of deployed protocols, software, or hardware, and without
arguing about issues that would arise anyway when displaying URIs, and
why I should be unable to use the non-URI IRI there?

Unhelpful arguments in the sense above would be "HTTP/2.0 should stick
to URIs because using IRIs there is a hassle when HTTP/2.0
implementations interact with HTTP/1.1 implementations", as that relies
on limitations of HTTP/1.1 implementations, or "IRIs with zero-width
spaces can be confused with ones without such spaces" as you'd have the
same issue when you turn URIs into IRIs "for display", and so on.
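The zero-width-space case is easy to make concrete (hypothetical path): the two strings render alike in many fonts yet remain distinct even after Unicode normalization, so the confusion exists wherever URIs are displayed as Unicode, exactly as argued above.

```python
# Visually near-identical IRIs; the second hides U+200B ZERO WIDTH
# SPACE, which NFC normalization does not remove.
import unicodedata

plain  = "http://example.org/päivä"
spoofy = "http://example.org/päi\u200bvä"

assert plain != spoofy
# Normalization does not help: the zero-width space survives NFC.
assert unicodedata.normalize("NFC", plain) != unicodedata.normalize("NFC", spoofy)
```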
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: why use IRIs?

Martin J. Dürst
In reply to this post by Mark Nottingham-2
Hello Mark,

On 2012/07/04 15:13, Mark Nottingham wrote:
> I tend to agree with Peter.
>
> The experience of using IRIs as identifiers in Atom was, IME, a disaster.

Can you be specific? Can you provide pointers?


> Identifiers need to be resistant to spoofing and mistakes.

It's easy to create spoofed identifiers using ASCII/English only.

It's also not too difficult to create spoofing/mistake-resistant
identifiers in other scripts or languages, for people who are better
versed in these scripts/languages. This may be difficult to understand
for "English-centric" people, but it's indeed the case.


> Björn said:
>
>> How would you like it if URIs could use only 20 of the 26 letters in the
>> english alphabet and you would have to encode, decode and convert them
>> all the time, or use awkward transliterations to avoid having to do so?
>
> URIs already have a constrained syntax; you can't use certain characters in certain places.

Yes. But not being able to use certain punctuation is different from not
being able to use characters in the basic alphabet/character repertoire
of the language. It's easy to replace spaces with hyphens or whatever.
It's a different thing to replace one letter with another, or just drop it.

> As long as people can put IRIs into HTML and browser address bars, I don't think they'll care.
>
> Martin said:
>
>> I think the real motivation would be people looking at HTTP traces and
>> preferring to see Unicode rather than lots of %HH strings. Of course the
>> number of people looking at HTTP traces is low, and they are not end users.
>
> Is this use case really worth the pain,

For that specific case, I'm not sure. That's why I used "would". But I
also don't think the pain would be that high.


> inefficiency,

Conversion would indeed cost some cycles. But using raw bytes instead of
%-encoding would save bytes (which, these days, as far as I have
followed the SPDY debates so far, seems to be the more important side of
the tradeoff).
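The byte-count side of that trade-off is easy to quantify (a rough Python illustration):

```python
# Each non-ASCII UTF-8 byte costs three octets when pct-encoded.
from urllib.parse import quote

word = "日本語"                      # 3 characters, 9 UTF-8 bytes
raw_len = len(word.encode("utf-8"))  # 9 octets on the wire, raw
pct_len = len(quote(word))           # 27 octets in '%E6%97%A5...' form
assert pct_len == 3 * raw_len
```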

> and very likely security vulnerabilities caused by transcoding from IRIs to URIs and back when hopping from HTTP 2.0 to 1.1 and back? I don't think so.

There are quite a lot of places where security blunders can happen. That
conversion step wouldn't be the first one and wouldn't be the last one.
And using %-encoding for basic ASCII characters is already allowed
today, so the basic security vulnerability (firewalls can't just check
on character strings) already exists today.
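That existing vulnerability can be shown in two lines (hypothetical URIs): pct-encoding an unreserved ASCII letter is already legal, so a filter comparing raw URI strings misses an alias of the very path it is meant to block.

```python
# Two spellings of the same resource; naive string matching sees two
# different URIs, but they decode to the same thing.
from urllib.parse import unquote

a = "http://example.org/admin"
b = "http://example.org/%61dmin"   # '%61' is just 'a'

assert a != b                      # a string-matching firewall misses this
assert unquote(a) == unquote(b)    # both identify the same resource
```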

> My English-centric .02; ŸṀṂṼ.

您里可变 (this is not real Chinese, but just four roughly corresponding
characters put together).

Regards,   Martin.


Re: why use IRIs?

David Clarke-6
I've been reading this thread with interest. I'm wondering how the
originator would feel if URIs had been defined to use digits and
punctuation only with no alphabetic characters?

 From the point of view of someone who doesn't natively use the Latin
alphabet, that is equivalent to what he is proposing. Most literate
people in the world are able to use the Latin alphabet, but will be
better at recognising errors in their native script (when programming
etc), and more likely to be able to remember host names, without error,
that are in their native script.

As far as spoofing goes, in most typefaces, there are already confusions
between 1 (DIGIT ONE), l (LOWER CASE LATIN LETTER L), I (UPPER CASE
LATIN LETTER I) and between O (UPPER CASE LATIN LETTER O) and 0 (DIGIT
ZERO). Would it be reasonable to propose removing those characters from
URLs to reduce spoofing?

On 04/07/2012 09:49, "Martin J. Dürst" wrote:
> [...]






Re: why use IRIs?

Bjoern Hoehrmann
* David Clarke wrote:
>I've been reading this thread with interest. I'm wondering how the
>originator would feel if URIs had been defined to use digits and
>punctuation only with no alphabetic characters?

The spoofing problem seems to be a sidetrack here; as far as humans go,
only the "cookie domain" really matters, and for machines there is less
of a spoofing and more of a robustness problem: machines would not be
fooled, but they might implement conversions and comparisons
incorrectly. And domain names can have non-ASCII even in URIs; whether
you display them and how is an issue either way.
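The point about non-ASCII domain names in URIs relies on IDNA; a quick sketch with Python's (IDNA-2003) codec:

```python
# A non-ASCII host already has a pure-ASCII form usable in URIs, via
# IDNA's "xn--" (Punycode) encoding of each label.
ascii_host = "björn.höhrmann.de".encode("idna")
assert ascii_host.startswith(b"xn--")                    # ACE first label
assert ascii_host.decode("idna") == "björn.höhrmann.de"  # and back again
```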

>As far as spoofing goes, in most typefaces, there are already confusions
>between 1 (DIGIT ONE), l (LOWER CASE LATIN LETTER L), I (UPPER CASE
>LATIN LETTER I) and between 0 (UPPER CASE LATIN LETTER O) and 0 (DIGIT
>ZERO). Would it be reasonable propose removal of those characters from
>URLs to reduce spoofing?

The typical response to that would be that people do not want to make it
any worse. And as you note, forcing people to choose characters from a
rather limited set might actually make it harder for them to avoid some
spoofed address as they do not readily recognize what is being encoded.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: why use IRIs?

Mark Nottingham-2
In reply to this post by Bjoern Hoehrmann

On 04/07/2012, at 5:42 PM, Bjoern Hoehrmann wrote:

> This doesn't really help me understand where you see problems with IRIs.
> Could you take a simple example like http://björn.höhrmann.de/ and tell
> me of some places where I should be unable to use that even though I can
> use http://bjoern.hoehrmann.de/ in the same place, without arguing about
> limitations of deployed protocols, software, or hardware, and without
> arguing about issues that would arise anyway when displaying URIs, and
> why I should be unable to use the non-URI IRI there?

In protocols as identifiers, like an entry ID in Atom. They aren't exposed to end users, and really not to authors either.

Humans have lots of ideas about equivalence and transcription in text which machines are blissfully unaware of.

Regards,

--
Mark Nottingham   http://www.mnot.net/





Re: why use IRIs?

Roy T. Fielding
In reply to this post by Bjoern Hoehrmann
On Jul 4, 2012, at 12:42 AM, Bjoern Hoehrmann wrote:

> This doesn't really help me understand where you see problems with IRIs.
> Could you take a simple example like http://björn.höhrmann.de/ and tell
> me of some places where I should be unable to use that even though I can
> use http://bjoern.hoehrmann.de/ in the same place, without arguing about
> limitations of deployed protocols, software, or hardware, and without
> arguing about issues that would arise anyway when displaying URIs, and
> why I should be unable to use the non-URI IRI there?

The harm in the above example is how many aliases are created by
inconsistent encoding of the characters, how difficult we make
it for servers to route based on Host (or equivalents), and how
much risk we want to allow for less-interoperable forms.  These
are all trade-offs; not hard rules.

The main problem with IRIs as protocol elements is aliasing and invalid
characters, not spoofing.  Aliases create security holes if various
routines within the server + OS normalize them in different ways,
reduce cache efficiency, and interfere with page rank.  Invalid UTF-8
sometimes results in the whole code sequence being ignored and other
times results in only the valid part of the sequence being ignored
(leaving the next byte to be misinterpreted by the next round of
parsing).
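That divergence in invalid-UTF-8 handling is easy to reproduce (Python's error modes used purely as an illustration): the same bad byte sequence yields different survivors under different policies, so two decoders can disagree about what the reference "really" was.

```python
# One malformed sequence, handled two common ways: the bad lead byte
# is either silently dropped or replaced with U+FFFD.
bad = b"\xc3(evil"   # lead byte expecting a continuation, then ASCII

assert bad.decode("utf-8", errors="ignore") == "(evil"
assert bad.decode("utf-8", errors="replace") == "\ufffd(evil"
```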

These problems can exist with pct-encoded UTF-8 as well, but they are
usually harmless if the origin server consistently redirects non-encoded
non-ASCII to the pct-encoded form and then uses a consistent routine
to do name mapping from URI form to native labels.  In other words,
they are less of a problem because only the origin server needs to
deal with invalid or aliased pct-encodes, and intermediaries that
secure or load-balance based on the target URI can just work on the
pct-encoded patterns (leaving the UTF-8 form to be redirected by the
origin or some server-side intermediary).
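A minimal sketch of the kind of consistent origin-side routine described above (not any server's actual code; note that this naive version also collapses escaped delimiters like %2F, one of the trade-offs involved):

```python
# Canonicalize a request path to one pct-encoded UTF-8 spelling and
# redirect anything that arrives spelled differently.
from urllib.parse import quote, unquote

def canonical_path(raw: str) -> str:
    # Decode existing escapes, then re-encode uniformly, so every
    # alias of a path maps to a single canonical form.
    return quote(unquote(raw), safe="/")

def needs_redirect(raw: str) -> bool:
    return raw != canonical_path(raw)

assert canonical_path("/caf\u00e9") == "/caf%C3%A9"   # raw UTF-8 in
assert canonical_path("/caf%C3%A9") == "/caf%C3%A9"   # already canonical
assert needs_redirect("/caf\u00e9") and not needs_redirect("/caf%C3%A9")
```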

IRIs are not used in HTML or XML.  All references in those languages
are parsed as arbitrary strings with language-specific delimiting
and then converted to either a URI or something vaguely like it.
IRIs are not used in browser Location bars -- those are just arbitrary
string parsers that occasionally spit out a URI reference as a result.
IRIs are not used in waka because they would make gateways and fast
pattern matching more difficult and error-prone, which I consider
more of a concern than the potential saving in bytes.

In short, I believe that what potential users of the IRI protocol want
is a set of consistent presentation rules for displaying arbitrary
strings that might include pct-encodes and IDNA, and a simple routine
for converting an arbitrary string reference to a URI reference.
I think the idea of treating IRIs as a separate identifier space has
been harmful to its adoption by folks who already implement non-ASCII
identifiers via presentation and conversion.  It is also confusing
to those who want to create new URI schemes but think that they also
need to define IRI schemes.

....Roy