HRRI vs IRI in XML

HRRI vs IRI in XML

Norman Walsh
Hi,

Sorry I was out of the loop for a bit. I see from the email threads
that we've got some improved wording proposed for the list of
characters that have to be escaped if they appear in HRRI and some
improved wording for the security considerations section. I'll
incorporate those as soon as I can.

However, as far as I can tell, we still don't have a clear
understanding about whether we need HRRI or not.

Here's how I see it. Sorry if this is a little repetitive; I'm hoping
that considering this issue from a higher level again will help.

1. The XML Recommendation says that a system identifier consists of a
single or double quote followed by any characters followed by a
matching quote:

  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")

Any attempt to limit the characters allowed in a system identifier
would be a backwards incompatible change to XML. That is simply not an
option.
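
To make the breadth of that production concrete, here is a minimal Python sketch that matches it literally. The names SYSTEM_LITERAL and is_system_literal are invented for this illustration, and the sketch deliberately ignores the separate question of which characters the XML Char production permits.

```python
import re

# Literal transcription of the production quoted above:
#   SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
# SYSTEM_LITERAL is a hypothetical name used only in this sketch.
SYSTEM_LITERAL = re.compile(r'"[^"]*"|\'[^\']*\'')


def is_system_literal(text: str) -> bool:
    """Return True if the whole string matches the SystemLiteral production."""
    return SYSTEM_LITERAL.fullmatch(text) is not None


# Spaces and braces are accepted; only the delimiting quote is excluded.
print(is_system_literal('"this {string}.html"'))
print(is_system_literal('"unterminated'))
```

Under this reading, the only character a system identifier cannot contain is the quote that delimits it.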

2. Because we knew that system identifiers allowed characters that
couldn't appear in URIs, we added some wording to clarify how
processors must escape those characters if they needed URIs.

Over time, this text was refined, using fragments taken from drafts of
the IRI spec, and is now "cut-and-pasted" into several
recommendations.

It's become clear that this cut-and-paste approach is tedious and
error-prone and does not scale. Asking future specs to continue this
cut-and-paste process from one or another of the existing specs is
just not helpful to the community.

3. The HRRI spec proposes to instantiate the very liberal repertoire
of characters allowed in a system identifier (and all the other
places) in a short, stand-alone specification. This specification will
have a name and will be available for normative reference.

I understand that perhaps the world would be a better place if we
didn't need another name for another flavor of a string that serves
the role of identifying a resource. But that's not an option, see
point 1.

Martin's message that quoted this paragraph from the IRI spec gave a
glimmer of hope that perhaps we could avoid 3.

   Systems accepting IRIs MAY also deal with the printable characters in
   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
   characters are found but are not converted, then the conversion
   SHOULD fail.  Please note that the number sign ("#"), the percent
   sign ("%"), and the square bracket characters ("[", "]") are not part
   of the above list and MUST NOT be converted.  Protocols and formats
   that have used earlier definitions of IRIs including these characters
   MAY require percent-encoding of these characters as a preprocessing
   step to extract the actual IRI from a given field.  This
   preprocessing MAY also be used by applications allowing the user to
   enter an IRI.
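
As a rough illustration of the optional preprocessing that paragraph describes, the following Python sketch percent-encodes exactly the ten listed US-ASCII characters and leaves "#", "%", "[" and "]" untouched. LEGACY_ASCII and escape_legacy_ascii are names made up for this example; nothing here is taken from an existing implementation.

```python
# The ten printable US-ASCII characters named in the quoted paragraph:
# < > " space { } | \ ^ `
# LEGACY_ASCII is a hypothetical name used only in this sketch.
LEGACY_ASCII = '<>" {}|\\^`'


def escape_legacy_ascii(identifier: str) -> str:
    """Percent-encode only the listed characters; leave everything else alone."""
    return "".join(
        f"%{ord(ch):02X}" if ch in LEGACY_ASCII else ch
        for ch in identifier
    )


print(escape_legacy_ascii("this {string}.html"))
print(escape_legacy_ascii("a#b%20[c]"))
```

Note that this covers only the US-ASCII cases; it does nothing for characters elsewhere in Unicode.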

Unfortunately, our problem is that system identifiers can contain not
just "printable characters in US-ASCII that are not allowed in URIs"
but a wide range of characters from elsewhere in Unicode that are not
allowed in URIs (or IRIs).

Question: Is the paragraph from the IRI spec above intended to be
broader than a literal reading would suggest? Is it the intent of the
IRI spec that systems accepting IRIs MAY also deal with characters not
allowed in URIs by converting them?

If so, then perhaps we can simply say that system identifiers are IRIs
and note this provision in the IRI spec for what I'll call "legacy"
identifiers.

If not, then I think we must proceed with the HRRI spec.

Thoughts?

                                        Be seeing you,
                                          norm

--
Norman Walsh <[hidden email]> | A great deal may be done by severity,
http://nwalsh.com/            | more by love, but most by clear
                              | discernment and impartial
                              | justice.--Goethe

Re: HRRI vs IRI in XML

Norman Walsh
Ping?



Re: HRRI vs IRI in XML

Martin J. Dürst
In reply to this post by Norman Walsh

Hello Norm, others,

Sorry for the delay in responding; in summer, everything moves
a bit slower.

At 00:50 07/07/19, Norman Walsh wrote:

>Hi,
>
>Sorry I was out of the loop for a bit. I see from the email threads
>that we've got some improved wording proposed for the list of
>characters that have to be escaped if they appear in HRRI and some
>improved wording for the security considerations section. I'll
>incorporate those as soon as I can.
>
>However, as far as I can tell, we still don't have a clear
>understanding about whether we need HRRI or not.
>
>Here's how I see it. Sorry if this is a little repetitive; I'm hoping
>that considering this issue from a higher level again will help.

I think laying out the issues clearly can only help. Thanks for doing this.

>1. The XML Recommendation says that a system identifier consists of a
>single or double quote followed by any characters followed by a
>matching quote:
>
>  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
>
>Any attempt to limit the characters allowed in a system identifier
>would be a backwards incompatible change to XML. That is simply not an
>option.

Well, it would sure look like a backwards-incompatible change on
the spec level. But how many XML documents would indeed turn
non-well-formed if one e.g. disallowed general control characters
in the C0 area (I'm not speaking about TAB/CR/LF)?

As far as I understand, these characters cannot appear in XML 1.0.
They can appear, in the form of numeric character references (NCRs),
in XML 1.1, but the above grammar rule doesn't allow NCRs in
System Literals. The XML REC mentions this explicitly, as follows:
"Note that a SystemLiteral can be parsed without scanning for markup."

So in fact changing the SystemLiteral production to exclude general
C0 control characters wouldn't change anything at all.
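
For reference, a small Python sketch of the XML 1.0 Char production shows which characters a document can contain at all. The function names are invented for this illustration; the ranges are those of the Char production in the XML 1.0 Recommendation.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 production
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def non_xml10_chars(text: str) -> list:
    """Return the characters in text that XML 1.0 does not allow anywhere."""
    return [ch for ch in text if not is_xml10_char(ch)]


# TAB, LF and CR pass; other C0 controls do not; C1 controls such as U+0085 do.
print(non_xml10_chars("a\tb\nc"))
print(non_xml10_chars("a\x01b"))
print(is_xml10_char("\x85"))
```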

[There is potentially another interpretation of the grammar in the
XML spec, which is that the Char production
(http://www.w3.org/TR/REC-xml/#NT-Char) does not restrict the
contents of SystemLiteral, but in that case, it would also not
restrict the contents of http://www.w3.org/TR/REC-xml/#NT-CharData,
which would mean that arbitrary element content could contain
such control characters including NUL characters/bytes.
I think it would probably be best to fix this by explicitly
using the Char production in SystemLiteral and the other
relevant places. If I need to submit an erratum, please tell
me where.]

This is of course different for e.g. C1 control characters and for
URI-like fields in XML attributes or element content. But even for
these, the question remains of how many XML documents there are really
out there that use any of these characters (for any other purpose
than to prove that there are indeed such documents).

>2. Because we knew that system identifiers allowed characters that
>couldn't appear in URIs, we added some wording to clarify how
>processors must escape those characters if they needed URIs.

Well, I think it's actually slightly different. Because we wanted
System Literals to accept something like IRIs (which didn't have
that name yet at that time), we added wording to clarify how to
convert these into URIs. I do not remember the SystemLiteral
production ever having been brought up in the discussion, neither
in the way above (we need to describe the conversion because
SystemLiteral allows anything) nor the other way round (to make
sure that we can use more than just URIs, we have to make the
SystemLiteral production more general than US-ASCII). But
these things happened a long time ago.

My guess is that the main motivation for having the SystemLiteral
production the way it is is that people who wrote the XML spec
understood one of the general principles of URI/IRI syntax, which
is that it's a bad idea to unnecessarily restrict the syntax in specs
that carry URIs/IRIs, because doing so creates unnecessary dependencies.

>Over time, this text was refined, using fragments taken from drafts of
>the IRI spec, and is now "cut-and-pasted" into several
>recommendations.
>
>It's become clear that this cut-and-paste approach is tedious and
>error-prone and does not scale. Asking future specs to continue this
>cut-and-paste process from one or another of the existing specs is
>just not helpful to the community.

I agree. However, please note that many other W3C specs currently
have circumscriptive texts. In some cases, these have been written
in expectation of the IRI spec being available as an RFC, in other
cases, they are there to allow the use of old terminology (URI) with
new meaning (IRI). For some examples, please see
http://www.w3.org/International/iri-edit/spec-use-survey.html,
a page I have started to put together to get an overview of the
different ways the issue we are discussing here is addressed in
W3C specs. Please feel free to add to that page (if you have
access rights) or to suggest additions.

>3. The HRRI spec proposes to instantiate the very liberal repertoire
>of characters allowed in a system identifier (and all the other
>places) in a short, stand-alone specification. This specification will
>have a name and will be available for normative reference.

The "all" in "all the other places" is misleading, because it very
much depends on the scale at which things are looked at.

>I understand that perhaps the world would be a better place if we
>didn't need another name for another flavor of a string that serves
>the role of identifying a resource. But that's not an option, see
>point 1.

I don't think it's productive to write "that's not an option" without
actual backup technical arguments. I'm still waiting for the first
XML document that contains any of the characters in question in
any of the URI/IRI-like slots under discussion here (of course
this would exclude documents that have been created just to show
that such documents exist, but I haven't even seen one of these).
I'm also still waiting for anybody to come forward and claim that they
actually need or want to use any of the obscure "characters"
(not talking about printable US-ASCII or TAB/CR/LF/Space here).

If the XML Core WG said "we think that the risk is extremely
low, but we don't want to take this risk", I could to some
extent understand this, and it's ultimately the job of the
XML Core WG to decide how they want to proceed with their
specs. However, I think that the overall effect on the community
should be considered when looking at the benefits and problems
of different approaches. For the overall community, the
benefit of having a single concept, defined by a single
specification, is very high compared to the issue of the XML
Core WG wanting to save a few lines in a few specs that otherwise
may be needed to avoid a risk that is extremely small.


>Martin's message that quoted this paragraph from the IRI spec gave a
>glimmer of hope that perhaps we could avoid 3.
>
>   Systems accepting IRIs MAY also deal with the printable characters in
>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
>   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
>   characters are found but are not converted, then the conversion
>   SHOULD fail.  Please note that the number sign ("#"), the percent
>   sign ("%"), and the square bracket characters ("[", "]") are not part
>   of the above list and MUST NOT be converted.  Protocols and formats
>   that have used earlier definitions of IRIs including these characters
>   MAY require percent-encoding of these characters as a preprocessing
>   step to extract the actual IRI from a given field.  This
>   preprocessing MAY also be used by applications allowing the user to
>   enter an IRI.
>
>Unfortunately, our problem is that system identifiers can contain not
>just "printable characters in US-ASCII that are not allowed in URIs"
>but a wide range of characters from elsewhere in Unicode that are not
>allowed in URIs (or IRIs).
>
>Question: Is the paragraph from the IRI spec above intended to be
>broader than a literal reading would suggest? Is it the intent of the
>IRI spec that systems accepting IRIs MAY also deal with characters not
>allowed in URIs by converting them?

This is a very interesting thought. What I have said earlier is that
I think it would be possible to extend the above paragraph to other
kinds of characters in an (already started) update of the IRI spec.

I'm quite a bit more sceptical about dealing with this just as an
erratum, because looking at all the drafts of the IRI spec listed at
http://www.w3.org/International/iri-edit/#Published, there never seems
to have been any question about whether control characters (both
general C0 and all of C1) should be allowed or not. Nobody ever
came up and requested that these be allowed, in any way, and I'm
still not seeing any actual need at all. The above note was specifically
put in to address the actual and expressed needs of some people in
the XML community (see my earlier email with references to the
email archive).


>If so, then perhaps we can simply say that system identifiers are IRIs
>and note this provision in the IRI spec for what I'll call "legacy"
>identifiers.

This is essentially what I proposed, except that this would happen
in a new version of the IRI spec. There is a huge difference between
using an erratum (with very little feedback possibilities from the
community on whether this was indeed intended, and very little
room for adding additional warning text), and an updated spec,
where we can make sure we spend all the necessary time on
getting the wording correct and adding all the necessary
warnings.

>If not, then I think we must proceed with the HRRI spec.

"must" is quite strong. What about looking at what other specs
did? What about going for something along the lines of:

A SystemLiteral SHOULD be an IRI [RFC3987 (or its successor)].
Note: This includes the provision in the IRI spec for dealing with
      printable characters in US-ASCII that are not allowed in URIs.
Note: XML processors MUST/SHOULD also convert characters outside
      the repertoire of characters allowed in IRIs according to
      Section 3.1 of [RFC 3987].

With the erratum, you would have used the first three lines.
Without the erratum, your text gets a bit longer. You might
even want to tweak the second note to cover some of what the
RDF specs say (essentially, processors may issue warnings
when they see something that doesn't conform to the IRI spec).
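
One way to read the second note is sketched below in Python: every character that is not a URI character is converted to its UTF-8 octets and percent-encoded, in the manner of the mapping step in Section 3.1 of RFC 3987. This is a simplification under stated assumptions: it skips Unicode normalization and any special handling of host names, and URI_CHARS and to_uri are names invented for this example.

```python
import string

# Characters that may appear literally in a URI per RFC 3986:
# unreserved, gen-delims, sub-delims, plus "%" for existing escapes.
# URI_CHARS is a hypothetical name used only in this sketch.
URI_CHARS = set(
    string.ascii_letters + string.digits + "-._~"
    + ":/?#[]@"
    + "!$&'()*+,;="
    + "%"
)


def to_uri(identifier: str) -> str:
    """Percent-encode the UTF-8 octets of every non-URI character."""
    out = []
    for ch in identifier:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # One "%HH" triplet per UTF-8 octet of the character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


print(to_uri("this {string}.html"))
print(to_uri("r\u00e9sum\u00e9.xml"))
```

A processor that also wanted to warn, as the RDF specs allow, would additionally need to tell characters the IRI spec permits from those it does not; the sketch does not attempt that.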

If you wait for a new version of the IRI spec, you should be
able to then use some text such as:

A SystemLiteral is an IRI according to [RFCXXXX], including
the provisions in section Y.Z of [RFCXXXX]. We will be able
to make sure that Y.Z covers your needs, and hopefully the
needs of other W3C (and other) specs, and we will greatly
reduce the confusion for the overall community and have
technology converge to what's really needed, rather than
diverge for the sake of non-existing backwards compatibility
needs.

Hope this helps.


Regards,     Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:[hidden email]    



Re: HRRI vs IRI in XML

Norman Walsh
/ Martin Duerst <[hidden email]> was heard to say:
| At 00:50 07/07/19, Norman Walsh wrote:
|>1. The XML Recommendation says that a system identifier consists of a
|>single or double quote followed by any characters followed by a
|>matching quote:
|>
|>  SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
|>
|>Any attempt to limit the characters allowed in a system identifier
|>would be a backwards incompatible change to XML. That is simply not an
|>option.
|
| Well, it would sure look like a backwards-incompatible change on
| the spec level. But how many XML documents would indeed turn
| non-well-formed if one e.g. disallowed general control characters
| in the C0 area (I'm not speaking about TAB/CR/LF)?
|
| As far as I understand, these characters cannot appear in XML 1.0.
| They can appear, in the form of numeric character references (NCRs),
| in XML 1.1, but the above grammar rule doesn't allow NCRs in
| System Literals. The XML REC mentions this explicitly, as follows:
| "Note that a SystemLiteral can be parsed without scanning for markup."
|
| So in fact changing the SystemLiteral production to exclude general
| C0 control characters wouldn't change anything at all.

Fair enough. That still leaves the non-IRI space character (and other
non-IRI characters?) and a bunch of non-URI characters.

| [There is potentially another interpretation of the grammar in the
| XML spec, which is that the Char production
| (http://www.w3.org/TR/REC-xml/#NT-Char) does not restrict the
| contents of SystemLiteral, but in that case, it would also not
| restrict the contents of http://www.w3.org/TR/REC-xml/#NT-CharData,
| which would mean that arbitrary element content could contain
| such control characters including NUL characters/bytes.
| I think it would probably be best to fix this by explicitly
| using the Char production in SystemLiteral and the other
| relevant places. If I need to submit an erratum, please tell
| me where.]

This message is probably sufficient.

| This is of course different for e.g. C1 control characters and for
| URI-like fields in XML attributes or element content. But even for
| these, the question remains of how many XML documents there are really
| out there that use any of these characters (for any other purpose
| than to prove that there are indeed such documents).

The XML 1.1 experience has (absolutely, utterly) convinced me that no
backwards-incompatible change to XML, no matter how negligible the
practical impact, is acceptable. Backwards incompatibility is simply
not an option.

|>2. Because we knew that system identifiers allowed characters that
|>couldn't appear in URIs, we added some wording to clarify how
|>processors must escape those characters if they needed URIs.
|
| Well, I think it's actually slightly different.

Fine. We can argue about the history over a beer sometime :-)

|>Over time, this text was refined, using fragments taken from drafts of
|>the IRI spec, and is now "cut-and-pasted" into several
|>recommendations.
|>
|>It's become clear that this cut-and-paste approach is tedious and
|>error-prone and does not scale. Asking future specs to continue this
|>cut-and-paste process from one or another of the existing specs is
|>just not helpful to the community.
|
| I agree. However, please note that many other W3C specs currently
| have circumscriptive texts.

I don't see how that helps. We have to describe strings that aren't
URIs or IRIs. We don't want IRIs, we don't want to use the term URI
and mean IRI, we want to use a term that means "this {string}".

|>3. The HRRI spec proposes to instantiate the very liberal repertoire
|>of characters allowed in a system identifier (and all the other
|>places) in a short, stand-alone specification. This specification will
|>have a name and will be available for normative reference.
|
| The "all" in "all the other places" is misleading, because it very
| much depends on the scale at which things are looked at.

I meant "all the other places that currently refer to the XLink 1.0
spec for their description of what characters are allowed".

|>I understand that perhaps the world would be a better place if we
|>didn't need another name for another flavor of a string that serves
|>the role of identifying a resource. But that's not an option, see
|>point 1.
|
| I don't think it's productive to write "that's not an option" without
| actual backup technical arguments.

The principal argument isn't technical, it's political. I'm not saying
it's "not an option" on the basis of some self-righteous personal
opinion, I'm saying it because I have scars and burns from the last
time we did something backwards incompatible to XML. The community
will not stand for it.

I'd be *delighted* to be proved wrong. Get consensus from the
community that it would be ok to change the definition of system
literal or the value of href attributes so that "this {string}.html"
no longer accesses the resource we'd identify with the URI
"this%20%7bstring%7d.html" but is instead an error and I'll be the
first in line to fix the specs.

Specs have to reflect reality. The reality is system identifiers and
href attributes contain values that aren't valid in URIs or IRIs so we
need to do something.

|>Martin's message that quoted this paragraph from the IRI spec gave a
|>glimmer of hope that perhaps we could avoid 3.
|>
|>   Systems accepting IRIs MAY also deal with the printable characters in
|>   US-ASCII that are not allowed in URIs, namely "<", ">", '"', space,
|>   "{", "}", "|", "\", "^", and "`", in step 2 above.  If these
|>   characters are found but are not converted, then the conversion
|>   SHOULD fail.  Please note that the number sign ("#"), the percent
|>   sign ("%"), and the square bracket characters ("[", "]") are not part
|>   of the above list and MUST NOT be converted.  Protocols and formats
|>   that have used earlier definitions of IRIs including these characters
|>   MAY require percent-encoding of these characters as a preprocessing
|>   step to extract the actual IRI from a given field.  This
|>   preprocessing MAY also be used by applications allowing the user to
|>   enter an IRI.
|>
|>Unfortunately, our problem is that system identifiers can contain not
|>just "printable characters in US-ASCII that are not allowed in URIs"
|>but a wide range of characters from elsewhere in Unicode that are not
|>allowed in URIs (or IRIs).
|>
|>Question: Is the paragraph from the IRI spec above intended to be
|>broader than a literal reading would suggest? Is it the intent of the
|>IRI spec that systems accepting IRIs MAY also deal with characters not
|>allowed in URIs by converting them?
|
| This is a very interesting thought. What I have said earlier is that
| I think it would be possible to extend the above paragraph to other
| kinds of characters in an (already started) update of the IRI spec.

If the IRI spec can be extended to cover the characters we need, then
I think we could say that these things are IRIs. That means we need
to answer two questions:

1. Will the folks maintaining the IRI spec agree to extend that
   paragraph to cover all of the characters we need to have in order
   to maintain perfect backwards-compatibility with XML 1.0 and 1.1?

2. What is a realistic timeline for IRI v.Next?

|>If so, then perhaps we can simply say that system identifiers are IRIs
|>and note this provision in the IRI spec for what I'll call "legacy"
|>identifiers.
|
| This is essentially what I proposed, except that this would happen
| in a new version of the IRI spec. There is a huge difference between
| using an erratum (with very little feedback possibilities from the
| community on whether this was indeed intended, and very little
| room for adding additional warning text), and an updated spec,
| where we can make sure we spend all the necessary time on
| getting the wording correct and adding all the necessary
| warnings.
|
|>If not, then I think we must proceed with the HRRI spec.
|
| "must" is quite strong. What about looking at what other specs
| did? What about going for something along the lines of:

Perhaps. I'll see what the Core WG says next week.

| A SystemLiteral SHOULD be an IRI [RFC3987 (or its successor)].
| Note: This includes the provision in the IRI spec for dealing with
|       printable characters in US-ASCII that are not allowed in URIs.
| Note: XML processors MUST/SHOULD also convert characters outside
|       the repertoire of characters allowed in IRIs according to
|       Section 3.1 of [RFC 3987].
|
| With the erratum, you would have used the first three lines.
| Without the erratum, your text gets a bit longer. You might
| even want to tweak the second note to cover some of what the
| RDF specs say (essentially, processors may issue warnings
| when they see something that doesn't conform to the IRI spec).
|
| If you wait for a new version of the IRI spec, you should be
| able to then use some text such as:
|
| A SystemLiteral is an IRI according to [RFCXXXX], including
| the provisions in section Y.Z of [RFCXXXX]. We will be able
| to make sure that Y.Z covers your needs, and hopefully the
| needs of other W3C (and other) specs, and we will greatly
| reduce the confusion for the overall community and have
| technology converge to what's really needed, rather than
| diverge for the sake of non-existing backwards compatibility
| needs.
|
| Hope this helps.

Thanks, Martin.

                                        Be seeing you,
                                          norm

--
Norman Walsh <[hidden email]> | There is no such thing as an absolute
http://nwalsh.com/            | certainty, but there is assurance
                              | sufficient for the purposes of human
                              | life.--John Stuart Mill


Re: HRRI vs IRI in XML

Bjoern Hoehrmann
In reply to this post by Martin J. Dürst

* Martin Duerst wrote:

>[There is potentially another interpretation of the grammar in the
>XML spec, which is that the Char production
>(http://www.w3.org/TR/REC-xml/#NT-Char) does not restrict the
>contents of SystemLiteral, but in that case, it would also not
>restrict the contents of http://www.w3.org/TR/REC-xml/#NT-CharData,
>which would mean that arbitrary element content could contain
>such control characters including NUL characters/bytes.
>I think it would probably be best to fix this by explicitly
>using the Char production in SystemLiteral and the other
>relevant places. If I need to submit an erratum, please tell
>me where.]

You missed section six of the specification which defines that character
classes only ever match something that matches the Char production. As
an aside, I do not think the question of system identifiers is relevant
to whether we need "HRRIs" or not, as Norman put it.

My understanding is that the XML Core Working Group wants to use "HRRIs"
in many other places, like xml:base, XLink, XInclude, etc. We would then
have elements and attributes in XML languages which use resource
identifiers as values, but differ in what strings you can put there and how
they would be processed: some would take HRRIs, some IRIs, some anyURIs,
and so on.

It seems clear to me that having a "HRRI" specification would make sense only
if having multiple definitions for resource identifier-valued attributes
and elements is a good thing, or if there is community consensus to use
"HRRIs" everywhere.
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: HRRI vs IRI in XML

Bjoern Hoehrmann
In reply to this post by Norman Walsh

* Norman Walsh wrote:
>I'd be *delighted* to be proved wrong. Get consensus from the
>community that it would be ok to change the definition of system
>literal or the value of href attributes so that "this {string}.html"
>no longer accesses the resource we'd identify with the URI
>"this%20%7bstring%7d.html" but is instead an error and I'll be the
>first in line to fix the specs.

You are denying the perfectly valid option of graceful error recovery if
such values include disallowed characters. XML processors, for example,
may continue normal processing if a system literal contains '#' even
though such a document is in error. I note that this particular part of
the XML specification is not well-implemented in many processors; it is
common for them to treat a literal '\' differently from '%5C' while the
specification requires them to treat them identically.
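
A hedged Python sketch of what "treat them identically" could look like: both spellings are reduced to one escaped form before comparison. escaped_form is a name invented for this example, and the choice of which characters to leave unescaped is an assumption, not a quotation from the XML specification.

```python
import re
from urllib.parse import quote

# Assumption for this sketch: URI reserved characters and existing "%"
# escapes are left alone, so already-escaped input is not escaped twice.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def escaped_form(system_id: str) -> str:
    """Escape disallowed characters and upper-case the hex digits of escapes."""
    escaped = quote(system_id, safe=_SAFE)
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), escaped)


# A literal backslash and its escaped spelling compare equal.
print(escaped_form("dir\\file.dtd") == escaped_form("dir%5cfile.dtd"))
```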

>If the IRI spec can be extended to cover the characters we need, then
>I think we could say that these things are IRIs. That means we need
>to answer two questions:

That is impossible as you'd want to include the space character while a
number of deployed formats use space-separated lists of resource
identifiers, like the xsi:schemaLocation attribute. At the least you would
need to have IRIs-with-white-space and IRIs-without-white-space where
the ones with white space would have a host of problems.
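
The ambiguity can be shown with a short Python sketch that splits an xsi:schemaLocation-style value into (namespace, location) pairs on white space. schema_location_pairs is a name invented for this example and the URLs are placeholders.

```python
def schema_location_pairs(value: str) -> list:
    """Split a white-space-separated list into (namespace, location) pairs."""
    tokens = value.split()
    if len(tokens) % 2:
        raise ValueError("odd number of tokens in schemaLocation value")
    return list(zip(tokens[0::2], tokens[1::2]))


# Well-formed value: one namespace, one location.
print(schema_location_pairs("http://example.org/ns schema.xsd"))

# A location containing a literal space is split in two, so the pairing
# is silently wrong as soon as the token count happens to be even.
print(schema_location_pairs("http://example.org/ns my schema.xsd extra"))
```

With an odd token count the sketch raises ValueError; with an even count it returns pairs that no longer correspond to what the author meant.
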
--
Björn Höhrmann · mailto:[hidden email] · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/