Re: Fwd: Re: HRRIs, IRIs, etc


Re: Fwd: Re: HRRIs, IRIs, etc

Martin J. Dürst

Dear IRI and XML experts,

Some additional comments on the issues raised by the HRRI draft.

I discovered these when trying to create some definitions in
the new IRI draft for XML and friends to use.

The core of the issue is the following:

- The XML Core WG wants to concentrate the definitions of
  IRI-like syntax in a single document, without having to
  normatively change XML or the various related specs that
  currently use an "any Unicode character goes" definition
  for what's allowed in a resource identifier.

- The HRRI draft
  (http://www.ietf.org/internet-drafts/draft-walsh-tobin-hrri-01.txt)
  gives the following for the conversion procedure:
   To convert a Human Readable Resource Identifier to an IRI reference,
   the following characters MUST be percent encoded:

   o  the control characters #x0 to #x1F and #x7F to #x9F
   o  space #x20
   o  the delimiters "<" #x3C, ">" #x3E, and """ #x22
   o  the unwise characters "{" #x7B, "}" #x7D, "|" #x7C, "\" #x5C,
      "^" #x5E, and "`" #x60

   These characters are percent encoded by applying steps 2.1 to 2.3 of
   Section 3.1 of RFC 3987[3] to them.

  It also says: "A string is a legal Human Readable Resource Identifier
  if and only if the string generated by applying the encoding rules
  above is a legal IRI."

- The current XML spec gives the following procedure of how to convert
  from a system identifier to a URI (summarized):
  Convert all the above characters, plus all characters above 0x7F,
  to %HH-encoding via UTF-8.

- The IRI spec excludes private use characters from all but the query part.
  (there are other smaller differences, but for the moment, this is enough)

As a consequence, what we end up with is that the definition in the HRRI
draft isn't backwards compatible with the definition in the XML spec,
or in other words, it results in a normative change.
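
To make the incompatibility concrete, here is a minimal Python sketch of the
two conversions as summarized above. The function names (percent_encode,
hrri_to_iri, xml_sysid_to_uri) are made up for this illustration and appear in
neither spec; the escaped set is simply the list quoted from the HRRI draft.

```python
# Characters the HRRI draft (as quoted above) says MUST be percent encoded.
HRRI_ESCAPED = (
    set(range(0x00, 0x20))          # controls #x0-#x1F
    | set(range(0x7F, 0xA0))        # controls #x7F-#x9F
    | {0x20}                        # space
    | {ord(c) for c in '<>"'}       # delimiters
    | {ord(c) for c in '{}|\\^`'}   # "unwise" characters
)


def percent_encode(ch):
    """Encode one character as UTF-8 and write each octet as %HH."""
    return "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))


def hrri_to_iri(text):
    """HRRI draft rule: escape only the listed characters."""
    return "".join(
        percent_encode(c) if ord(c) in HRRI_ESCAPED else c for c in text
    )


def xml_sysid_to_uri(text):
    """XML rule as summarized: the same list plus everything above 0x7F."""
    return "".join(
        percent_encode(c) if ord(c) in HRRI_ESCAPED or ord(c) > 0x7F else c
        for c in text
    )


sample = "a b/\ue000"  # a space and the private use character U+E000
print(ascii(hrri_to_iri(sample)))       # 'a%20b/\ue000'
print(ascii(xml_sysid_to_uri(sample)))  # 'a%20b/%EE%80%80'
```

The first result still carries a raw private use character in the path, which
the IRI spec does not allow there, while the XML procedure never leaves such a
character unescaped.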

There are various ways to deal with this:

- Accept a normative change to XML. In that case, my guess would
  be that at least general control characters should also be removed.
  Neither general control characters nor private use characters
  should be used at all in the wild, at least not on purpose.

- Refine the definition of conversion to an IRI in the HRRI spec.
  My guess is that this can be done, but will look ugly.

- Change the IRI spec, to allow private use characters in other places.

Any comments welcome.

Regards,     Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:[hidden email]    



Re: Fwd: Re: HRRIs, IRIs, etc

Richard Tobin-2

>[...]
>   It also says: "A string is a legal Human Readable Resource Identifier
>   if and only if the string generated by applying the encoding rules
>   above is a legal IRI."

> - The current XML spec gives the following procedure of how to convert
>   from a system identifier to a URI (summarized):
>   Convert all the above characters, plus all characters above 0x7F,
>   to %HH-encoding via UTF-8.

> - The IRI spec excludes private use characters from all but the query part.
>   (there are other smaller differences, but for the moment, this is enough)

I don't think we realised that there was a difference here.  We just
thought that we could shorten the description by converting to IRIs
instead of URIs.

> - Refine the definition of conversion to an IRI in the HRRI spec.
>   My guess is that this can be done, but will look ugly.

Or we could go back to converting to URIs.

Presumably the IRI spec allows %HH sequences that correspond to
private use characters?  If so, HRRI could add private use characters
to the list to be encoded to produce an IRI.

-- Richard


Re: Fwd: Re: HRRIs, IRIs, etc

Norman Walsh
In reply to this post by Martin J. Dürst
/ Martin Duerst <[hidden email]> was heard to say:
| Dear IRI and XML experts,
[...]
| - The IRI spec excludes private use characters from all but the query part.

We have attempted to address this concern[1] by adding

  * characters in the Unicode private use area (#xE000-#xF8FF), except
    where they appear in the query part of the resulting IRI.

to the list.
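
A rough Python sketch of how that extra rule might be applied. The function
names, and the way the query is located (everything between the first "?" and
the first "#"), are assumptions made for illustration and are not text from
the draft. Only the #xE000-#xF8FF range named above is handled.

```python
def percent_encode(ch):
    """Encode one character as UTF-8 and write each octet as %HH."""
    return "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))


def is_private_use(ch):
    """True for the BMP private use area named in the draft text."""
    return 0xE000 <= ord(ch) <= 0xF8FF


def escape_private_use_outside_query(reference):
    """Percent encode private use characters everywhere except the query."""
    # Split off the fragment first, then the query (assumed layout).
    rest, hash_mark, fragment = reference.partition("#")
    before_query, question_mark, query = rest.partition("?")

    def escape(part):
        return "".join(
            percent_encode(c) if is_private_use(c) else c for c in part
        )

    return (escape(before_query) + question_mark + query
            + hash_mark + escape(fragment))


result = escape_private_use_outside_query("a\ue000?b\ue000#c\ue000")
print(ascii(result))  # 'a%EE%80%80?b\ue000#c%EE%80%80'
```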

|   (there are other smaller differences, but for the moment, this is enough)

Could you, please, provide a more exhaustive account of the
differences which concern you? The Core WG thinks it would be most
efficient if we could consider as many of them as possible at the same
time.

                                        Be seeing you,
                                          norm

--
Norman Walsh <[hidden email]> | Everything should be made as simple as
http://nwalsh.com/            | possible, but no simpler.


Re: Fwd: Re: HRRIs, IRIs, etc

Norman Walsh
/ Norman Walsh <[hidden email]> was heard to say:
| / Martin Duerst <[hidden email]> was heard to say:
| | Dear IRI and XML experts,
| [...]
| | - The IRI spec excludes private use characters from all but the query part.
|
| We have attempted to address this concern[1] by adding

Oops, dangling reference. Sorry about that

 [1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html

                                        Be seeing you,
                                          norm

--
Norman Walsh <[hidden email]> | Everything should be made as simple as
http://nwalsh.com/            | possible, but no simpler.


RE: Fwd: Re: HRRIs, IRIs, etc

Grosso, Paul
In reply to this post by Norman Walsh

Martin et al.,

Please check out
http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html
and let us know whether you have any further comments or are
satisfied with this (draft) ID.  In either case, please send
a response.  Several specs are on hold awaiting progression
of this to RFC, and we would like to be sure to make progress.

We would prefer to have a definite response from you, but if
we have not heard by 11:00 ET (Boston time) this Wednesday,
we will assume you have no more comments on this spec.

paul

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Norman Walsh
> Sent: Tuesday, 2007 June 12 12:25
> To: Martin Duerst
> Cc: [hidden email]; Richard Ishida; Felix Sasaki;
> [hidden email]; [hidden email]
> Subject: Re: Fwd: Re: HRRIs, IRIs, etc
>
> / Martin Duerst <[hidden email]> was heard to say:
> | Dear IRI and XML experts,
> [...]
> | - The IRI spec excludes private use characters from all but
> the query part.
>
> We have attempted to address this concern[1] by adding
>
>   * characters in the Unicode private use area (#xE000-#xF8FF), except
>     where they appear in the query part of the resulting IRI.
>
> to the list.
>
> |   (there are other smaller differences, but for the moment,
> this is enough)
>
> Could you, please, provide a more exhaustive account of the
> differences which concern you? The Core WG thinks it would be most
> efficient if we could consider as many of them as possible at the same
> time.


 [1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html


RE: Fwd: Re: HRRIs, IRIs, etc

Martin J. Dürst

Hello Paul, others,

First, I'd prefer a bit more notice if you want to set such a
hard deadline, and I guess others would do so, too.

Second, with respect to minor differences between the IRI spec
and the HRRI draft, I think that you should be able to look
at the differences carefully as well as I am able to look at
your draft. Nevertheless, I have tried to give it another look.

Here is what I have found (with no guarantee for completeness,
of course, please check again on your side).

http://www.w3.org/TR/REC-xml/#charsets allows (although, at least
in newer versions, discourages):
[#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF]

In the IRI spec, these are excluded:
   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

so you have to add them to your list in section 3. This in essence
seems to amount to adding 'iprivate', the list of characters above,
and the list of characters you already have to 'ucschar'. Please
check this understanding as a cross-check.
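
As one way to do that cross-check, a small Python sketch can test individual
code points against the ranges. The ucschar ranges are copied from the ABNF
quoted above and the iprivate ranges from RFC 3987; the helper names are
invented for this example.

```python
# ucschar from RFC 3987, as quoted above.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD through %xD0000-DFFFD.
UCSCHAR += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 0xE)]
# Plane 14 starts at E1000 in the ABNF.
UCSCHAR.append((0xE1000, 0xEFFFD))

# iprivate from RFC 3987.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def in_ranges(code_point, ranges):
    """True if code_point lies in any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(code_point):
    return in_ranges(code_point, UCSCHAR)


def is_iprivate(code_point):
    return in_ranges(code_point, IPRIVATE)


for cp in (0xE9, 0xFDD0, 0x1FFFE, 0xE000, 0x10FFFF):
    print("U+{:04X} ucschar={} iprivate={}".format(
        cp, is_ucschar(cp), is_iprivate(cp)))
```

U+FDD0, U+1FFFE and U+10FFFF fall in neither set, which is the gap described
above: XML allows them, but they match neither 'ucschar' nor 'iprivate'.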


Apart from these small fixes, I have to very clearly note that most
of my more general concerns haven't been addressed. These are, somewhat
reworded/completed:

- The overall usefulness (seen from an overall W3C or overall IETF
  standpoint) of having separate definitions, in separate documents,
  for two essentially extremely closely related protocol elements.
  [I have proposed to integrate your material into an update of the
  IRI spec.]

- The choice of name, which is highly suggestive instead of descriptive,
  inappropriate on several accounts (for the largest part of URIs/IRIs,
  HRRIs are only marginally more readable, if at all, and the overall
  syntax still poses a lot of problems for average human users (http://...)).

- The overall description. I note e.g. the following:
  "However, it is often inconvenient for authors to encode these characters."
  How often? Unless somebody is authoring a lot of XPointers by hand,
  this can't happen that often (maybe with the exception of the space,
  but then you discourage that (correctly!) yourself).
  I suggest rewording "often" to "occasionally". There are similar
  examples elsewhere.

- The classification as a BCP. Procedurally, it's unclear to me why the
  IETF would classify a protocol element spec as a BCP when the related
  ones (URI, IRI) are standards track. Content-wise, it's unclear why
  the IETF would call something a BEST current practice if in earlier
  discussion, they have clearly preferred to disallow or marginalize
  this practice (and that was only for spaces and such, not for controls).

- The security section now mentions the issues with control characters.
  This should definitely be a bit more specific, and should contain
  explicit recommendations. I'd write that receivers may want to
  filter out such characters, or URIs with such characters, and
  therefore including them in the first place is discouraged.

- You have some advice against using raw spaces ("Also, authors of HRRIs
  are advised to percent encode space characters themselves, rather than
  rely on the processor to do so, because spaces are often used to
  separate HRRIs in a sequence"), but not against others, where similar
  arguments apply:
  - tabs and CR/LF are removed/merged/converted to spaces in attribute values
    (merging also occurs for spaces)
  - <> are often used to delimit URIs/IRIs
  - arbitrary controls may trigger some security filter
  - private use characters are not interoperable
  - non-characters (the list above) are discouraged in XML itself
  (not sure this list is complete, but I guess it's getting close)

- The last paragraph of Section 3 is somewhat problematic. In general,
  it's okay, but the second half of the last sentence
  ("nor the process of passing a Human Readable Resource Identifier to a
   process or software component responsible for dereferencing it SHOULD
   trigger percent encoding") may suggest that resolution interfaces come
   with three different entry points. I think it would be better to have
   done this work by the XML side when resolving something.

Not only have these concerns not yet been addressed, but I also do not
remember having received any kind of reply on these issues.

Looking forward to hearing from you again.

Regards,     Martin.


At 07:37 07/06/19, Grosso, Paul wrote:

>
>
>
>
>Martin et al.,
>
>Please check out
>http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html
>and let us know whether you have any further comments or are
>satisfied with this (draft) ID.  In either case, please send
>a response.  Several specs are on hold awaiting progression
>of this to RFC, and we would like to be sure to make progress.
>
>We would prefer to have a definite response from you, but if
>we have not heard by 11:00 ET (Boston time) this Wednesday,
>we will assume you have no more comments on this spec.
>
>paul
>
>> -----Original Message-----
>> From: [hidden email]
>> [mailto:[hidden email]] On Behalf Of Norman Walsh
>> Sent: Tuesday, 2007 June 12 12:25
>> To: Martin Duerst
>> Cc: [hidden email]; Richard Ishida; Felix Sasaki;
>> [hidden email]; [hidden email]
>> Subject: Re: Fwd: Re: HRRIs, IRIs, etc
>>
>> / Martin Duerst <[hidden email]> was heard to say:
>> | Dear IRI and XML experts,
>> [...]
>> | - The IRI spec excludes private use characters from all but
>> the query part.
>>
>> We have attempted to address this concern[1] by adding
>>
>>   * characters in the Unicode private use area (#xE000-#xF8FF), except
>>     where they appear in the query part of the resulting IRI.
>>
>> to the list.
>>
>> |   (there are other smaller differences, but for the moment,
>> this is enough)
>>
>> Could you, please, provide a more exhaustive account of the
>> differences which concern you? The Core WG thinks it would be most
>> efficient if we could consider as many of them as possible at the same
>> time.
>
>
> [1] http://www.w3.org/XML/2007/04/hrri/draft-walsh-tobin-hrri-01c.html


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[hidden email]    



RE: Fwd: Re: HRRIs, IRIs, etc

Grosso, Paul

 

> -----Original Message-----
> From: Martin Duerst [mailto:[hidden email]]
> Sent: Tuesday, 2007 June 19 20:24
> To: Grosso, Paul
> Cc: [hidden email]; Richard Ishida; Felix Sasaki;
> [hidden email]; [hidden email];
> [hidden email]
> Subject: RE: Fwd: Re: HRRIs, IRIs, etc
>
> Hello Paul, others,

>
> Not only have these concerns not yet been addressed, but I also do not
> remember having received any kind of reply on these issues.
>
> Looking forward to hearing from you again.
>
> Regards,     Martin.

Martin,

Thank you for your detailed comments.

I think you may have sent some private email in the
past that never made it to the WG's attention.

The only email I see from you in the archive is
http://lists.w3.org/Archives/Public/www-xml-linking-comments/2007AprJun/0000
which doesn't mention any of the issues you remind us
of above, hence our apparent lack of response to you.

Because the XLink 1.1 and XML Base PER are both on hold
for this issue, and because we were just pulling out the
definition already in XML, XLink, and other specs and
putting it into this RFC, we were hoping to do this in
an expeditious manner.

I'm not sure how the IRI spec and the words in the XLink
spec (which we were attempting to copy into this HRRI
spec) ended up so out of sync given that we thought what
we put in the XLink spec was a copy of what was in the
IRI spec (before the IRI spec was officially available),
so I'm somewhat surprised by your long list of issues.

We don't want to rush out something that is wrong or
confusing.  We thought issuing an RFC to define HRRIs
(or whatever you want to call these things) was the
easiest and best route.  If that isn't going to work,
we may fall back to defining them in a W3C Note or in
a separate mini-Rec or something else, but for the time
being, we will review your comments and try to figure
out where to go from here.

paul


Re: Fwd: Re: HRRIs, IRIs, etc

Bjoern Hoehrmann

* Grosso, Paul wrote:
>I think you may have sent some private email in the
>past that never made it to the WG's attention.

It's <http://lists.w3.org/Archives/Public/public-iri/2007May/0000.html>.

>We don't want to rush out something that is wrong or
>confusing.  We thought issuing an RFC to define HRRIs
>(or whatever you want to call these things) was the
>easiest and best route.  If that isn't going to work,
>we may fall back to defining them in a W3C Note or in
>a separate mini-Rec or something else, but for the time
>being, we will review your comments and try to figure
>out where to go from here.

You should simply drop this effort and use IRI References instead. There
is a high cost associated with yet another notion of resource identifier
technology, while the value is near zero, especially since you do not in
any way attempt to standardize common error handling for e.g. \ chars in
the relevant contexts. Simply prohibit anything but IRI references and,
if necessary, specify "utf-8-percent-escape all disallowed characters"
as error recovery method.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Fwd: Re: HRRIs, IRIs, etc

Richard Tobin-2
In reply to this post by Martin J. Dürst

> You should simply drop this effort and use IRI References instead. There
> is a high cost associated with yet another notion of resource identifier
> technology

This is not another notion of resource identifier.  It is the existing
notion used for XML system identifier, XLink href, and several other
things.  We are merely providing a name and a single place for a
definition that already exists in multiple specs.

> Simply prohibit anything but IRI references and,
> if necessary, specify "utf-8-percent-escape all disallowed characters"
> as error recovery method.

That would constitute a normative change to several specs.

-- Richard


RE: Fwd: Re: HRRIs, IRIs, etc

Grosso, Paul
In reply to this post by Martin J. Dürst

Martin,

The XML Core WG discussed this message of yours during
our telcon today.  I'd like to thank you for your input
and give some preliminary responses.

[We have only just now noticed your email at
http://lists.w3.org/Archives/Public/public-iri/2007May/0000
that most of us on the XML Core WG never saw before,
so we have not yet discussed those points.]

[I'm not sure I have permission to cross post to all the
various lists, but I hesitate to remove anyone, so we'll
have to see how this works.]

> -----Original Message-----
> From: Martin Duerst [mailto:[hidden email]]
> Sent: Tuesday, 2007 June 19 20:24
> To: Grosso, Paul
> Cc: [hidden email]; Richard Ishida; Felix Sasaki;
> [hidden email]; [hidden email];
> [hidden email]
> Subject: RE: Fwd: Re: HRRIs, IRIs, etc

> In the IRI spec, these are excluded:
>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD
>
> so you have to add them to your list in section 3.

We'll plan to add them.

> - The overall usefulness (seen from an overall W3C or overall IETF
>   standpoint) of having separate definitions, in separate documents,
>   for two essentially extremely closely related protocol elements.
>   [I have proposed to integrate your material into an update of the
>   IRI spec.]

The XML Base PER went out in December, and the XLink 1.1 CR
ended a year ago (July 2006), and these are both awaiting
resolution of this issue.

Both the basic idea as well as most of the actual wording
for what we are now calling HRRIs currently exist in several
Recs including XML, XLink, XML Base, and maybe others.  Our
attempt here was just to pull that wording out of the
various specs and reference a definition in one place.
We were hoping to do this in an expeditious manner.

We discussed the options with our team contact who discussed
it with W3T, and we agreed that a short RFC was the best approach.

>
> - The choice of name, which is highly suggestive instead of
> descriptive,
>   inappropriate on several accounts (for the largest part of
> URIs/IRIs,
>   HRRIs are only marginally more readable, if at all, and the overall
>   syntax still poses a lot of problems for average human
> users (http://...)).

We had a hard time coming up with a name ourselves, and
we'd consider another name if we can find one more generally
acceptable.  We do think that allowing spaces (as is the case
with HRRIs) does improve readability a bit, but we'd be happy
with any name that works.  We had called these XML Resource
Identifiers earlier, but (1) the XRI acronym is already taken
and (2) these have meaning and usefulness outside of XML.

If anyone has suggestions, we're interested in considering them.

>
> - The overall description. I note e.g. the following:
>   "However, it is often inconvenient for authors to encode
> these characters."
>   How often? Unless somebody is authoring a lot of XPointers by hand,
>   this can't happen that often (maybe with the exception of the space,
>   but then you discourage that (correctly!) yourself).
>   I suggest rewording "often" to "occasionally". There are similar
>   examples elsewhere.

As you say, the space character is the most common.  We would be
happy to tone down or remove this sentence; our motivation for
defining HRRIs is not that they are a good thing but that they
already exist in multiple standards.

>
> - The classification as a BCP. Procedurally, it's unclear to
> me why the
>   IETF would classify a protocol element spec as a BCP when
> the related
>   ones (URI, IRI) are standards track. Content-wise, it's unclear why
>   the IETF would call something a BEST current practice if in earlier
>   discussion, they have clearly preferred to disallow or marginalize
>   this practice (and that was only for spaces and such, not
> for controls).

I think this may be a "typo".  I believe we intended
this to become an RFC.

Actually, we don't care what it becomes as long as it is
referenceable from the various W3C XML-related specs.

>
> - The security section now mentions the issues with control
> characters.
>   This should definitely be a bit more specific, and should contain
>   explicit recommendations. I'd write that receivers may want to
>   filter out such characters, or URIs with such characters, and
>   therefore including them in the first place is discouraged.

Most of us in the XML Core WG don't feel as strongly as you
appear to that we need to go on at great length about security
issues, but we are happy to expand this section along the lines
you suggest.

>
> - You have some advice against using raw spaces ("Also,
> authors of HRRIs
>   are advised to percent encode space characters themselves,
> rather than
>   rely on the processor to do so, because spaces are often used to
>   separate HRRIs in a sequence"), but not against others,
> where similar
>   arguments apply:
>   - tabs and CR/LF are removed/merged/converted to spaces in
> attribute values
>     (merging also occurs for spaces)
>   - <> are often used to delimit URIs/IRIs
>   - arbitrary controls may trigger some security filter
>   - private use characters are not interoperable
>   - non-characters (the list above) are discouraged in XML itself
>   (not sure this list is complete, but I guess it's getting close)

These are all good points.  We will expand the document
along the lines you suggest.

>
> - The last paragraph of Section 3 is somewhat problematic. In general,
>   it's okay, but the second half of the last sentence
>   ("nor the process of passing a Human Readable Resource
> Identifier to a
>    process or software component responsible for
> dereferencing it SHOULD
>    trigger percent encoding") may suggest that resolution
> interfaces come
>    with three different entry points. I think it would be
> better to have
>    done this work by the XML side when resolving something.

The above quoted phrase is in the XLink 1.1 CR, but we are not
sure at this time exactly why it is in there.

We are discussing this and will try to figure out what to do
about this wording and let you know.

paul


Re: Fwd: Re: HRRIs, IRIs, etc

Addison Phillips-2

The following note is PERSONAL and does not represent the
Internationalization Core WG.

Hi Paul,

I'm concerned about this discussion. I note that it has been a long
standing (perhaps mythological) belief by many of us in the
internationalization activity that XLink, XML Base, et al, represented
an instance of IRI. I thought that was a Good Thing and have been
distressed to discover that, rather than developing in the direction of
normatively referencing IRI, the issue has become murkier.

In fact, CharMod says that XLink 1.0 is meant to be IRI:

   http://www.w3.org/TR/charmod-resid/#sec-URIs

I personally support Martin's desire to avoid fragmentation into various
flavors of IRI by incorporating the necessary few minor changes into IRI
itself. I'm not sure there is a sound case for IRI-with-space: I need to
study your reasons further, myself.

 > I think this may be a "typo".  I believe we intended
 > this to become an RFC.

There is no such thing as "just an RFC". The document has to have an
intended status (which is where you probably got your BCP). Your choices
at the IETF include Informational, BCP, and standards-track. The closest
to "just an RFC" is the "Informational" category. It's not uncommon to
use this format for the purpose you have in mind.

I would tend, personally, to recommend against an Informational RFC,
simply because the other xRI formats are on the STD track. The
additional scrutiny of the STD track would, I think, benefit everyone
involved. Note that the draft can be published as an RFC (and thus
referenced) prior to attaining STD status.

I am sure that the Internationalization Core WG will shortly take up
this topic (since it is already scheduled for our next teleconference).
However, I felt that it would be wise to respond personally in advance,
noting some concern exists. Also, I note that the I18N Arch WG is
probably concerned here, since they maintain CharMod and CharMod-Resid.

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.


Re: Fwd: Re: HRRIs, IRIs, etc

cowan

Addison Phillips scripsit:

> I'm concerned about this discussion. I note that it has been a long
> standing (perhaps mythological) belief by many of us in the
> internationalization activity that XLink, XML Base, et al, represented
> an instance of IRI.

It's always been true that random ASCII characters that are forbidden
in URI/IRIs have "worked" in XML system identifiers, as well as the
other things derived from it.  That didn't turn out to be what IRIs
are -- they have the same restrictions within the ASCII repertoire
as URIs.

This is quite independent of the status of SPACE.

--
John Cowan  [hidden email]  http://ccil.org/~cowan
And now here I was, in a country where a right to say how the country should
be governed was restricted to six persons in each thousand of its population.
For the nine hundred and ninety-four to express dissatisfaction with the
regnant system and propose to change it, would have made the whole six
shudder as one man, it would have been so disloyal, so dishonorable, such
putrid black treason.  --Mark Twain's Connecticut Yankee


RE: Fwd: Re: HRRIs, IRIs, etc

Martin J. Dürst
In reply to this post by Grosso, Paul

Hello Paul, others,

Many thanks for your detailed comments.

At 22:55 07/06/20, Grosso, Paul wrote:

>> -----Original Message-----
>> From: Martin Duerst [mailto:[hidden email]]
>> Sent: Tuesday, 2007 June 19 20:24
>> To: Grosso, Paul
>> Cc: [hidden email]; Richard Ishida; Felix Sasaki;
>> [hidden email]; [hidden email];
>> [hidden email]
>> Subject: RE: Fwd: Re: HRRIs, IRIs, etc
>>
>> Hello Paul, others,
>
>>
>> Not only have these concerns not yet been addressed, but I also do not
>> remember having received any kind of reply on these issues.
>>
>> Looking forward to hearing from you again.
>>
>> Regards,     Martin.
>
>Martin,
>
>Thank you for your detailed comments.
>
>I think you may have sent some private email in the
>past that never made it to the WG's attention.
>
>The only email I see from you in the archive is
>http://lists.w3.org/Archives/Public/www-xml-linking-comments/2007AprJun/0000
>which doesn't mention any of the issues you remind us
>of above, hence our apparent lack of response to you.

My comments originally were private. Norm asked me to make
them public, and I did so, and he said he would try to make
sure your group knew about them.


>Because the XLink 1.1 and XML Base PER are both on hold
>for this issue, and because we were just pulling out the
>definition already in XML, XLink, and other specs and
>putting it into this RFC, we were hoping to do this in
>an expeditious manner.

I very much understand that you don't want to spend more
time than necessary. But please believe me, publishing
something as an RFC is bound to take time.

>I'm not sure how the IRI spec and the words in the XLink
>spec (which we were attempting to copy into this HRRI
>spec) ended up so out of sync given that we thought what
>we put in the XLink spec was a copy of what was in the
>IRI spec (before the IRI spec was officially available),
>so I'm somewhat surprised by your long list of issues.

You can find some analysis of that in the first half of
http://www.w3.org/mid/6.0.0.20.2.20070530103108.050daec0@localhost

Simply speaking, input from the IETF on this issue was
rather strong, and I tried to make sure that XLink and
friends were accommodated in some way.

>We don't want to rush out something that is wrong or
>confusing.  We thought issuing an RFC to define HRRIs
>(or whatever you want to call these things) was the
>easiest and best route.  If that isn't going to work,
>we may fall back to defining them in a W3C Note or in
>a separate mini-Rec or something else,

In terms of "time to publication", that would probably
be the fastest. However, unless very carefully written,
it wouldn't address one of my main comments (fragmentation).

>but for the time
>being, we will review your comments and try to figure
>out where to go from here.

Thanks a lot.

Here are two more comments:

1. (already mentioned in the mail referenced above, and
   not the job of the XML Core WG, but a job that I think
   has to be done): What other (W3C) specs have 'circumscriptions'
   of what's now called IRI? Are there other syntax variants
   around? How have other specs managed (or not) to move
   from circumscriptions to a reference to the IRI spec?

2. I think I got the syntactic differences resulting from
   the IRI syntax correct in my last mail, but there are
   some more restrictions in section 4 (Bidi). In particular,
   there is "IRIs MUST NOT contain bidirectional formatting characters
   (LRM, RLM, LRE, RLE, LRO, RLO, and PDF)." There are
   other restrictions, but they are shoulds, not musts.
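
A check for that MUST NOT could be sketched in Python as follows. The code
points are the standard Unicode assignments for the seven characters named
above; the function name is hypothetical, and the sketch only detects the
characters, it does not decide what a processor should do about them.

```python
# The bidirectional formatting characters named in RFC 3987, section 4.
BIDI_FORMATTING = {
    "\u200E": "LRM",  # LEFT-TO-RIGHT MARK
    "\u200F": "RLM",  # RIGHT-TO-LEFT MARK
    "\u202A": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202B": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202C": "PDF",  # POP DIRECTIONAL FORMATTING
    "\u202D": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202E": "RLO",  # RIGHT-TO-LEFT OVERRIDE
}


def find_bidi_formatting(text):
    """Return (index, name) pairs for each bidi formatting character found."""
    return [
        (index, BIDI_FORMATTING[ch])
        for index, ch in enumerate(text)
        if ch in BIDI_FORMATTING
    ]


print(find_bidi_formatting("http://example.org/\u202Eabc"))  # [(19, 'RLO')]
```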

Number 2 above is a good example for understanding some of the
differences: The general idea of IRIs was easy to put down,
but as I guess you all know from lots of experience, things
can get more complicated when looked at closely. The area of
bidirectional IRIs is something that took a long time to
work out. It would have been rather infeasible to add this
to the XML or XLink specification, and it is clear that
we would have no chance of getting this reasonably well
baked if we had given it a shot. The result in the IRI spec,
as far as I understand, is the best solution that we could
possibly come up with given all the constraints we had to
work with.

It's up to the XML community to decide whether they want to
accept this conclusion wholeheartedly (maybe resulting in a
normative change) or to find a way to 'fudge' things
(hopefully making sure that they at least recommend against
e.g. bidi formatting characters). But it would be a pity to
simply ignore the work that has been done on this (and other)
issues, and the input from the IETF community.
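
As a rough illustration of the bidi restriction quoted in comment 2, the
following Python sketch scans a string for the seven formatting characters
named there. The function name and the example identifier are hypothetical,
and the sketch covers only that one MUST NOT, not the SHOULD-level rules of
section 4.

```python
# Bidirectional formatting characters named in the sentence quoted from
# section 4 of the IRI spec, keyed by code point.
BIDI_FORMATTING = {
    "\u200E": "LRM",
    "\u200F": "RLM",
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u202C": "PDF",
}


def find_bidi_formatting(identifier):
    """Return (index, name) pairs for each bidi formatting character found.

    Hypothetical helper: an empty list means the string passes this one check.
    """
    return [
        (index, BIDI_FORMATTING[ch])
        for index, ch in enumerate(identifier)
        if ch in BIDI_FORMATTING
    ]


# An identifier with an RLO hidden in the path is flagged.
print(find_bidi_formatting("http://example.org/a\u202Eb"))
```

A spec that wants to 'fudge' things could use a check like this to warn
rather than reject.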

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[hidden email]    



RE: Fwd: Re: HRRIs, IRIs, etc

Martin J. Dürst
In reply to this post by Grosso, Paul

Hello Paul, others,

At 05:11 07/06/21, Grosso, Paul wrote:
>Martin,
>
>The XML Core WG discussed this message of yours during
>our telcon today.  I'd like to thank you for your input
>and give some preliminary responses.

Great, thanks.

>[We have only just now noticed your email at
>http://lists.w3.org/Archives/Public/public-iri/2007May/0000
>that most of us on the XML Core WG never saw before,
>so we have not yet discussed those points.]
>
>[I'm not sure I have permission to cross post to all the
>various lists, but I hesitate to remove anyone, so we'll
>have to see how this works.]

For the moment, it seems to work.

>> -----Original Message-----
>> From: Martin Duerst [mailto:[hidden email]]
>> Sent: Tuesday, 2007 June 19 20:24
>> To: Grosso, Paul
>> Cc: [hidden email]; Richard Ishida; Felix Sasaki;
>> [hidden email]; [hidden email];
>> [hidden email]
>> Subject: RE: Fwd: Re: HRRIs, IRIs, etc
>
>> In the IRI spec, these are excluded:
>>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>>                   / %xD0000-DFFFD / %xE1000-EFFFD
>>
>> so you have to add them to your list in section 3.
>
>We'll plan to add them.
>
>> - The overall usefulness (seen from an overall W3C or overall IETF
>>   standpoint) of having separate definitions, in separate documents,
>>   for two essentially extremely closely related protocol elements.
>>   [I have proposed to integrate your material into an update of the
>>   IRI spec.]
>
>The XML Base PER went out in December, and the XLink 1.1 CR
>ended a year ago (July 2007), and these are both awaiting
>resolution of this issue.

Should that be July 2006? Anyway, that's a long time ago.
It's a pity that we haven't learned about this earlier.

>Both the basic idea as well as most of the actual wording
>for what we are now calling HRRIs currently exist in several
>Recs including XML, XLink, XML Base, and maybe others.  Our
>attempt here was just to pull that wording out of the
>various specs and reference a definition in one place.
>We were hoping to do this in an expeditious manner.

I understand.


>We discussed the options with our team contact who discussed
>it with W3T, and we agreed that a short RFC was the best approach.

This seems to be like a typical example of locally optimal
advice. Good on one level, problematic on a higher level.
[I'm sure I have given such advice in the past when I was
on the W3C Team.]


>> - The choice of name, which is highly suggestive instead of
>> descriptive,
>>   inappropriate on several accounts (for the largest part of
>> URIs/IRIs,
>>   HRRIs are only marginally more readable, if at all, and the overall
>>   syntax still poses a lot of problems for average human
>> users (http://...).
>
>We had a hard time coming up with a name ourselves, and
>we'd consider another name if we can find one more generally
>acceptable.  We do think that allowing spaces (as is the case
>with HRRIs) does improve readability a bit,

I'd probably have to agree. I think of the characters in question,
spaces are also those that one sees most in the wild.

>but we'd be happy
>with any name that works.  We had called these XML Resource
>Identifiers earlier, but (1) the XRI acronym is already taken
>and (2) these have meaning and usefulness outside of XML.
>
>If anyone has suggestions, we're interested in considering them.

I made some in a previous mail.

>> - The overall description. I note e.g. the following:
>>   "However, it is often inconvenient for authors to encode
>> these characters."
>>   How often? Unless somebody is authoring a lot of XPointers by hand,
>>   this can't happen that often (maybe with the exception of the space,
>>   but then you discourage that (correctly!) yourself).
>>   I suggest to reword "often" to "occasionally". There are similar
>>   examples elsewhere.
>
>As you say, the space character is the most common.  We would be
>happy to tone down or remove this sentence; our motivation for
>defining HRRIs is not that they are a good thing but that they
>already exist in multiple standards.

Okay, fine.

>> - The classification as a BCP. Procedurally, it's unclear to
>> me why the
>>   IETF would classify a protocol element spec as a BCP when
>> the related
>>   ones (URI, IRI) are standards track. Content-wise, it's unclear why
>>   the IETF would call something a BEST current practice if in earlier
>>   discussion, they have clearly preferred to disallow or marginalize
>>   this practice (and that was only for spaces and such, not
>> for controls).
>
>I think this may be a "typo".  I believe we intended
>this to become an RFC.
>
>Actually, we don't care what it becomes as long as it is
>referenceable from the various W3C XML-related specs.

Understood.

>> - The security section now mentions the issues with control
>> characters.
>>   This should definitely be a bit more specific, and should contain
>>   explicit recommendations. I'd write that receivers may want to
>>   filter out such characters, or URIs with such characters, and
>>   therefore including them in the first place is discouraged.
>
>Most of us in the XML Core WG don't feel as strongly as you
>appear to that we need to go on at great length about security
>issues, but we are happy to expand this section along the lines
>you suggest.

Well, security sections are always looked at carefully when an RFC
is published. Preparation saves time later.
 

>> - You have some advice against using raw spaces ("Also,
>> authors of HRRIs
>>   are advised to percent encode space characters themselves,
>> rather than
>>   rely on the processor to do so, because spaces are often used to
>>   separate HRRIs in a sequence"), but not against others,
>> where similar
>>   arguments apply:
>>   - tabs and CR/LF are removed/merged/converted to spaces in
>> attribute values
>>     (merging also occurs for spaces)
>>   - <> are often used to delimit URIs/IRIs
>>   - arbitrary controls may trigger some security filter
>>   - private use characters are not interoperable
>>   - non-characters (the list above) are discouraged in XML itself
>>   (not sure this list is complete, but I guess it's getting close)
>
>These are all good points.  We will expand the document
>along the lines you suggest.
>
>>
>> - The last paragraph of Section 3 is somewhat problematic. In general,
>>   it's okay, but the second half of the last sentence
>>   ("nor the process of passing a Human Readable Resource
>> Identifier to a
>>    process or software component responsible for
>> dereferencing it SHOULD
>>    trigger percent encoding") may suggest that resolution
>> interfaces come
>>    with three different entry points. I think it would be
>> better to have
>>    done this work by the XML side when resolving something.
>
>The above quoted phrase is in the XLink 1.1 CR, but we are not
>sure at this time exactly why it is in there.

A similar phrase at least was at one point in the XML spec, and
something similar is in the IRI spec. For resolution, it's not
terribly important, because for resolution, the average %-escaped
stuff and its unescaped counterpart are equivalent. But for
things such as namespace processing, it's important not to
change %-encodings because e.g. http://www.w3.org/A and
http://www.w3.org/%41 are two different namespaces.
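
A minimal Python sketch of that difference, assuming that namespace names
are compared as plain strings and using percent-decoding as a crude stand-in
for "equivalent for resolution" (both function names are hypothetical, and
real resolution equivalence is more subtle than decoding everything):

```python
from urllib.parse import unquote


def same_namespace(a, b):
    # Namespace names are compared character by character,
    # so percent-escapes are NOT decoded first.
    return a == b


def same_after_unescaping(a, b):
    # Simplifying assumption: treat two references as equivalent for
    # resolution if they match after percent-decoding.
    return unquote(a) == unquote(b)


a = "http://www.w3.org/A"
b = "http://www.w3.org/%41"
print(same_namespace(a, b))         # the two strings differ
print(same_after_unescaping(a, b))  # %41 decodes to "A"
```

This is why a processor that silently rewrites %-encodings before namespace
comparison would change which namespace a name belongs to.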


Regards,    Martin.


>We are discussing this and will try to figure out what to do
>about this wording and let you know.
>
>paul


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[hidden email]    



Re: Fwd: Re: HRRIs, IRIs, etc

Martin J. Dürst
In reply to this post by cowan

Hello John,

At 06:42 07/06/21, John Cowan wrote:

>Addison Phillips scripsit:
>
>> I'm concerned about this discussion. I note that it has been a long
>> standing (perhaps mythological) belief by many of us in the
>> internationalization activity that XLink, XML Base, et al, represented
>> an instance of IRI.
>
>It's always been true that random ASCII characters that are forbidden
>in URI/IRIs have "worked" in XML system identifiers, as well as the
>other things derived from it.  That didn't turn out to be what IRIs
>are -- they have the same restrictions within the ASCII repertoire
>as IRIs.

I guess you meant "URIs" in the last line.

This is true, and is also true for HTML.

There are several ways to explain this:

- Implementers carefully implemented the spec.

- Implementers did what worked with the least effort.

- Implementers understood that it's a well-held principle for URIs
  and IRIs that there shouldn't (or can't) be any detailed syntax checks.
  About the only thing you can check reliably without going down the
  scheme specific road is that if it contains a ':', then the characters
  before the first ':' need to match the scheme production:
       scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
  I think the XML Schema WG tried to come up with a regexp, but they
  gave up. Please also see the following note at
  http://www.w3.org/TR/xmlschema-2/#anyURI
  Note:  Each URI scheme imposes specialized syntax rules for URIs in that
     scheme, including restrictions on the syntax of allowed fragment identifiers.
     Because it is impractical for processors to check that a value is a
     context-appropriate URI reference, this specification follows the lead of
     [RFC 2396] (as amended by [RFC 2732]) in this matter: such rules and
     restrictions are not part of type validity and are not checked by
     ·minimally conforming· processors. Thus in practice the above definition
     imposes only very modest obligations on ·minimally conforming· processors.
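
The scheme check described in the third point above can be sketched in
Python roughly as follows. The function name is hypothetical, and one detail
is an added assumption rather than part of the text: a ':' is only treated
as a scheme delimiter if it comes before any '/', '?' or '#', so that a
relative reference such as "a/b:c" is not rejected.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
# First character that could end a scheme or start another component.
FIRST_DELIM = re.compile(r"[:/?#]")


def scheme_is_plausible(reference):
    """Generic, scheme-independent check on the part before the first ':'."""
    match = FIRST_DELIM.search(reference)
    if match is None or match.group() != ":":
        return True  # no scheme present, nothing to check generically
    return SCHEME.fullmatch(reference[:match.start()]) is not None


print(scheme_is_plausible("http://www.w3.org/TR/xmlschema-2/#anyURI"))
print(scheme_is_plausible("1http://www.w3.org/"))
```

Everything after the scheme is left unchecked here, in line with the
xmlschema-2 note quoted above.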

>This is quite independent of the status of SPACE.

Can you explain how this is independent? Isn't space just one of these
characters?

Regards,    Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[hidden email]    



Re: Fwd: Re: HRRIs, IRIs, etc

Martin J. Dürst
In reply to this post by Richard Tobin-2

Hello Richard,


At 00:59 07/06/21, Richard Tobin wrote:

>> You should simply drop this effort and use IRI References instead. There
>> is a high cost associated with yet another notion of resource identifier
>> technology
>
>This is not another notion of resource identifier.  It is the existing
>notion used for XML system identifier, XLink href, and several other
>things.  We are merely providing a name and a single place for a
>definition that already exists in multiple specs.

If these things are not resource identifiers, then what are they?

>> Simply prohibit anything but IRI references

That would constitute a normative change to several specs.
In my opinion, that may be inappropriate for spaces and
a few other characters, in particular in the context of XPointer,
but it would definitely be highly appropriate for arbitrary
control characters (if you have ever encountered a URI/IRI
with an arbitrary control character (not TAB/CR/LF), I'd really
like to know).

>> and,
>> if necessary, specify "utf-8-percent-escape all disallowed characters"
>> as error recovery method.

That would not, at least not if you consider observable behavior
to be the relevant criterion.

At least for the XML spec itself, there may be a point of
view that simply saying "it's an IRI" won't change anything.
I'll try to explain this below.

Looking at the definition of a PubidLiteral for a moment
(http://www.w3.org/TR/REC-xml/#NT-PubidLiteral), it just
specifies a range of characters that can be used, nothing
more in terms of syntax, although it could be argued that
some syntaxes (those with several // included) are much
more likely, or even highly expected to make a PubidLiteral
usable in a wider context (which 'Public' suggests in the
first place).

Likewise, the syntax for SystemLiteral is specified simply
as a string of characters (from a much wider repertoire).
To say that this is an IRI does not restrict this syntax.

It is a well acknowledged fact that URI and IRI syntax are
very difficult to check (because there are scheme-dependent
restrictions, and so on) and that therefore, any strict
checking (in the way e.g. the XML syntax is checked for
well-formedness) is not appropriate for URIs or IRIs.

The rest (namely conversion of unallowed characters to
%hh-encoding) seems to already be covered under the following
paragraph from the IRI spec:

   Systems accepting IRIs MAY also deal with the printable characters in
   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
   characters are found but are not converted, then the conversion
   SHOULD fail.  Please note that the number sign ("#"), the percent
   sign ("%"), and the square bracket characters ("[", "]") are not part
   of the above list and MUST NOT be converted.  Protocols and formats
   that have used earlier definitions of IRIs including these characters
   MAY require percent-encoding of these characters as a preprocessing
   step to extract the actual IRI from a given field.  This
   preprocessing MAY also be used by applications allowing the user to
   enter an IRI.

I'm not saying that this interpretation is the only one possible,
and I'm not sure how it would apply to XLink and others, but
I wanted to show it here as one point of view.
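
For concreteness, the preprocessing step allowed by the quoted paragraph
might look like the following Python sketch. It percent-encodes only the ten
printable US-ASCII characters listed there, via their UTF-8 bytes, and
leaves "#", "%", "[" and "]" alone. The function name is hypothetical, and
control characters and non-ASCII characters are deliberately out of scope.

```python
# The printable US-ASCII characters the quoted paragraph lists:
# < > " space { } | \ ^ `
EXTRA_ASCII = set('<>" {}|\\^`')


def escape_extra_ascii(text):
    """Percent-encode only the characters in EXTRA_ASCII (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in EXTRA_ASCII:
            # One %HH triplet per UTF-8 byte (a single byte for ASCII).
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
        else:
            out.append(ch)  # "#", "%", "[" and "]" pass through unchanged
    return "".join(out)


print(escape_extra_ascii("http://example.org/my doc.xml#xpointer(id('a'))"))
print(escape_extra_ascii("http://example.org/%41[1]"))
```

Because "%" is never touched, existing escapes such as %41 survive the
step unchanged.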

Regards,    Martin.

>That would constitute a normative change to several specs.
>
>-- Richard


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[hidden email]    



RE: Fwd: Re: HRRIs, IRIs, etc

Richard Tobin-2
In reply to this post by Martin J. Dürst

Can I clarify the status of some of the characters Martin
listed, please?

> http://www.w3.org/TR/REC-xml/#charsets allows (although, at least
> in newer versions, discourages):
> [#xFDD0-#xFDDF],
> [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
> [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
> [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
> [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
> [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
> [#x10FFFE-#x10FFFF]
>
> In the IRI spec, these are excluded:
>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD

I see XML discourages FDD*, but the ucschar excludes both FDD* and
FDE*.  Does anyone know the reason for this discrepancy?  FDE* seem to
be also "not a character".

ucschar also excludes E0***, which seem to be "tags" - what does that
mean?

ucschar also excludes FFF*, but XML makes no mention of them, except
of course FFFE and FFFF which aren't allowed in XML at all.
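
The ranges in question can be checked mechanically. Here is a small Python
sketch that transcribes the ucschar production quoted above into a table and
tests individual code points against it (the names are hypothetical; the
ranges are copied from the quote):

```python
# ucschar ranges as quoted above, as inclusive (low, high) pairs.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
# Plane 14 starts only at E1000.
UCSCHAR.append((0xE1000, 0xEFFFD))


def is_ucschar(code_point):
    """True if the code point falls in one of the quoted ucschar ranges."""
    return any(low <= code_point <= high for low, high in UCSCHAR)


for cp in (0xFDD0, 0xFDE0, 0xFDF0, 0xE0065, 0xFFF9, 0x1FFFE):
    print(format(cp, "X"), is_ucschar(cp))
```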

-- Richard


Re: Fwd: Re: HRRIs, IRIs, etc

Richard Tobin-2
In reply to this post by Martin J. Dürst

> >> You should simply drop this effort and use IRI References instead. There
> >> is a high cost associated with yet another notion of resource identifier
> >> technology

> >This is not another notion of resource identifier.  It is the existing
> >notion used for XML system identifier, XLink href, and several other
> >things.  We are merely providing a name and a single place for a
> >definition that already exists in multiple specs.

> If these things are not resource identifiers, then what are they?

I did not mean that they were not resource identifiers.  It was the
"yet another" part that I was disputing.  They are an *existing* form
of resource identifiers, which does not have a name and whose
definition is currently replicated in several places.

-- Richard


RE: Fwd: Re: HRRIs, IRIs, etc

Richard Tobin-2
In reply to this post by Martin J. Dürst

> 2. I think I got the syntactic differences resulting from
>    the IRI syntax correct in my last mail, but there are
>    some more restrictions in section 4 (Bidi). In particular,
>    there is "IRIs MUST NOT contain bidirectional formatting characters
>    (LRM, RLM, LRE, RLE, LRO, RLO, and PDF)."

Am I right in thinking that the use of these characters in HRRIs would
be a security risk, because they might change the appearance of the
HRRI just as a control character might?

-- Richard


Re: Fwd: Re: HRRIs, IRIs, etc

cowan
In reply to this post by Richard Tobin-2

Richard Tobin scripsit:

> I see XML discourages FDD*, but the ucschar excludes both FDD* and
> FDE*.  Does anyone know the reason for this discrepancy?  FDE* seem to
> be also "not a character".

Almost certainly a blunder on my part.  The correct range of
non-characters is FDD0-FDEF.

> ucschar also excludes E0***, which seem to be "tags" - what does that
> mean?

E0000-E007F are a clone of ASCII, dedicated to encoding language tags in
plain text in a context where language tagging is considered essential but
full markup too complex or expensive.  Thus the language tag "en" would
be encoded as E0065 E006E.  These characters were born deprecated, and
served to stave off the attempt of a certain IETF WG to abuse otherwise
reserved UTF-8 forms to the same purpose.
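
The "clone of ASCII" mapping can be sketched in a few lines of Python: each
ASCII character is shifted up by E0000. The function name is hypothetical,
and the sketch produces only the shifted letters; as far as I know a
complete tag sequence in plain text would also begin with the LANGUAGE TAG
character, E0001, which is left out here.

```python
def to_tag_characters(tag):
    """Map an ASCII language tag to the matching E00xx tag code points."""
    if not tag.isascii():
        raise ValueError("tag characters only mirror ASCII")
    # Each tag character is the ASCII code point shifted up by 0xE0000.
    return [0xE0000 + ord(ch) for ch in tag]


print([format(cp, "X") for cp in to_tag_characters("en")])
# prints ['E0065', 'E006E']
```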

E0100-E01EF are variation selectors, attached to ordinary characters to
specify variant forms of characters that are important or unpredictable
in certain contexts, but in other contexts are equivalent to the
forms without variation selectors.  E01F0-E0FFF are reserved for other
"default-ignorable" characters; processes that do not understand these
characters ought to ignore them (and not render them as boxes, etc.).

> ucschar also excludes FFF*, but XML makes no mention of them, except
> of course FFFE and FFFF which aren't allowed in XML at all.

FFF0-FFF8 are currently unassigned.  FFF9-FFFB are used to do ruby in
plain text, FFFC is a placeholder for a non-character object, and FFFD
is used to replace an incoming character whose value is unknown or has
no Unicode equivalent.

We should issue an erratum for XML 1.0/1.1 adding FDE0-FDEF, E0000-E007F,
and FFF0-FFFD to the discouraged characters list, as all of them have
better equivalents in markup.  Likewise, the characters 0340, 0341,
17A3, 17D3, and 206A-206F should be discouraged, as they are in Unicode.
E0100-E01EF are still useful in XML character content, though probably
not in *RIs.

--
What has four pairs of pants, lives             John Cowan
in Philadelphia, and it never rains             http://www.ccil.org/~cowan
but it pours?                                   [hidden email]
        --Rufus T. Firefly
