[iri] #131: Using document charset causes interoperability problems

[iri] #131: Using document charset causes interoperability problems

iri issue tracker
#131: Using document charset causes interoperability problems

 As reported by Dave Thaler...

 URIs and/or IRIs can appear in many contexts: in normal text in an email
 message, in a PDF file or Word doc, or wherever else.

 Allowing the charset to vary complicates frameworks considerably, since the
 doc charset now has to be passed from whatever extracts the URI from the
 document (HTML or otherwise) to whatever else needs to know the
 interpretation (normalizer code, comparison code, whatever).  Various API frameworks
 already have various sorts of "Uri" classes that take in a URI-like string
 and let you do things like get the URI form or the IRI form, or various
 components or whatever.   This means the constructor needs to change since
 you cannot correctly interpret an IRI(bis) without knowing the document
 charset.
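The constructor change described above might look roughly like the following. This is only a sketch: the `Uri` class, its `doc_charset` parameter, and the `to_uri` method are hypothetical names that do not come from any particular framework, and the query string is assumed to be the two characters U+00C3 U+00A9 (what %C3%83%C2%A9 decodes to as UTF-8).

```python
from urllib.parse import quote, urlsplit, urlunsplit


class Uri:
    """Hypothetical 'Uri' wrapper; not modeled on any real framework's class."""

    def __init__(self, text: str, doc_charset: str = "utf-8"):
        # doc_charset is the extra argument the issue is worried about:
        # callers that never knew the document charset silently get UTF-8.
        self._parts = urlsplit(text)
        self._doc_charset = doc_charset

    def to_uri(self) -> str:
        p = self._parts
        # Path: always converted to UTF-8, then percent-encoded.
        path = quote(p.path, safe="/%", encoding="utf-8")
        # Query: percent-encode the bytes of the document charset
        # (the charset-dependent step under discussion).
        query = quote(p.query, safe="=&%", encoding=self._doc_charset)
        return urlunsplit((p.scheme, p.netloc, path, query, p.fragment))


# Assumed example input: query is U+00C3 U+00A9.
iri = "http://www.sw.it.aoyama.ac.jp/non-existent?\u00c3\u00a9"
as_utf8 = Uri(iri).to_uri()
as_latin1 = Uri(iri, doc_charset="iso-8859-1").to_uri()
print(as_utf8)
print(as_latin1)
```

The same input string yields two different URIs depending on an argument that most existing call sites have no way to supply.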

 I'm not yet convinced that's a change worth making.   Currently everything
 assumes UTF-8.   With this change, we'll get random behavior until
 everything is updated, which is a state worse than today in my view.

 Example:
 http://www.sw.it.aoyama.ac.jp/non-existent?Ã©

 If the charset were iso-8859-1 then under RFC 3987 as I understand it,
 this would become:

 http://www.sw.it.aoyama.ac.jp/non-existent?%C3%83%C2%A9

 In other words, you have to convert iso-8859-1 to UTF-8 and then pct-
 encode the UTF-8.

 But as I understand 3987bis it would become:

 http://www.sw.it.aoyama.ac.jp/non-existent?%C3%A9

 which would then be passed around via various APIs and protocols that
 would not pass the charset along with it. As such it would be interpreted
 by the receiving code as pct-encoded UTF-8:

 http://www.sw.it.aoyama.ac.jp/non-existent?é

 which of course it isn't.

 As such, we should make the RFC 3987 behavior (UTF-8, NOT the doc charset)
 required for everything that doesn't explicitly pass the charset along
 with the URI.
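The byte-level difference in the example can be checked with a short script. It is a sketch under one assumption: the query in the ISO-8859-1 document is the two characters U+00C3 U+00A9, which is the only input for which %C3%83%C2%A9 is the UTF-8 result. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

# Assumed query characters as they appear in the ISO-8859-1 document.
query = "\u00c3\u00a9"

# RFC 3987 reading: convert the characters to UTF-8, then percent-encode.
rfc3987_form = quote(query.encode("utf-8"))

# 3987bis reading (as understood in this issue): percent-encode the
# document-charset bytes directly.
bis_form = quote(query.encode("iso-8859-1"))

# A receiver that is not told the charset decodes the octets as UTF-8.
misread = unquote(bis_form, encoding="utf-8")

print(rfc3987_form)
print(bis_form)
print(misread == query)
```

The receiver ends up with a single character, U+00E9, rather than the two characters that were in the document, so the round trip is lossy unless the charset travels with the URI.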

--
-----------------------+--------------------------------------
 Reporter:  stpeter@…  |      Owner:  draft-ietf-iri-3987bis@…
     Type:  defect     |     Status:  new
 Priority:  major      |  Milestone:
Component:  3987bis    |    Version:
 Severity:  -          |   Keywords:
-----------------------+--------------------------------------

Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/131>
iri <http://tools.ietf.org/wg/iri/>



RE: [iri] #131: Using document charset causes interoperability problems

masinter
I hate this feature, and would love to get rid of it, but let's acknowledge at least somewhere that it happens. That is, the interoperability problems are real, but not documenting it here doesn't solve the problem.

I think what the text in the document intended was that whether there _was_ a "document charset" at all depended on the format of the document... yes, for HTML, maybe for Word (up to word), no for PDF, maybe (not yet defined) for text/plain.

I can see two choices that might work:

* Any document format that wishes this kind of processing has to say that what they are using aren't really IRIs, they're funny strings that get preprocessed to turn them into IRIs or URIs.
* The IRI spec (continues to) explicitly defines this document-charset-dependent behavior, but is more explicit about the rules for where "document charset" comes from.

I could go with either one of those. How do those seem to the group?



Re: [iri] #131: Using document charset causes interoperability problems

Peter Saint-Andre-2
<hat type='individual'/>

On 7/21/12 10:06 AM, Larry Masinter wrote:

> I hate this feature, and would love to get rid of it, but let's acknowledge at least somewhere that it happens. That is, the interoperability problems are real, but not documenting it here doesn't solve the problem.
>
> I think what the text in the document intended was that whether there _was_ a "document charset" at all depended on the format of the document... yes, for HTML, maybe for Word (up to word), no for PDF, maybe (not yet defined) for text/plain.
>
> I can see two choices that might work:
>
> * Any document format that wishes this kind of processing has to say that what they are using aren't really IRIs, they're funny strings that get preprocessed to turn them into IRIs or URIs.
> * The IRI spec (continues to) explicitly defines this document-charset-dependent behavior, but is more explicit about the rules for where "document charset" comes from.
>
> I could go with either one of those. How do those seem to the group?

In the interest of calling a spade a spade, I'd be in favor of the first
option: they're not really IRIs, but they can be turned into IRIs.

Peter

--
Peter Saint-Andre
https://stpeter.im/






RE: [iri] #131: Using document charset causes interoperability problems

Dave Thaler-2
In reply to this post by masinter
> -----Original Message-----
> From: Larry Masinter [mailto:[hidden email]]
> Sent: Saturday, July 21, 2012 9:07 AM
> To: [hidden email]; [hidden email]
> Cc: [hidden email]
> Subject: RE: [iri] #131: Using document charset causes interoperability
> problems
>
> I hate this feature, and would love to get rid of it, but let's acknowledge at least
> somewhere that it happens. That is, the interoperability problems are real, but
> not documenting it here doesn't solve the problem.
>
> I think what the text in the document intended was that whether there _was_ a
> "document charset" at all depended on the format of the document... yes, for
> HTML, maybe for Word (up to word), no for PDF, maybe (not yet defined) for
> text/plain.
>
> I can see two choices that might work:
>
> * Any document format that wishes this kind of processing has to say that what
> they are using aren't really IRIs, they're funny strings that get preprocessed to
> turn them into IRIs or URIs.
> * The IRI spec (continues to) explicitly defines this document-charset-dependent
> behavior, but is more explicit about the rules for where "document charset"
> comes from.
>
> I could go with either one of those. How do those seem to the group?

I'd argue for the first (and against the second).

-Dave



Re: [iri] #131: Using document charset causes interoperability problems

John C Klensin
In reply to this post by Peter Saint-Andre-2


--On Saturday, July 21, 2012 15:37 -0600 Peter Saint-Andre
<[hidden email]> wrote:

>...
> On 7/21/12 10:06 AM, Larry Masinter wrote:
>> I hate this feature, and would love to get rid of it, but
>> let's acknowledge at least somewhere that it happens. That
>> is, the interoperability problems are real, but not
>> documenting it here doesn't solve the problem.
>...
>> I can see two choices that might work:
>>
>> * Any document format that wishes this kind of processing has
>> to say that what they are using aren't really IRIs, they're
>> funny strings that get preprocessed to turn them into IRIs or
>> URIs.
>> * The IRI spec (continues to) explicitly defines this
>> document-charset-dependent behavior, but is more explicit
>> about the rules for where "document charset" comes from.
>>
>> I could go with either one of those. How do those seem to the
>> group?
>
> In the interest of calling a spade a spade, I'd be in favor of
> the first option: they're not really IRIs, but they can be
> turned into IRIs.

I agree, but couldn't the same argument be made about IRIs
themselves?  I.e., "they aren't really URIs, but they can be
turned into URIs".

    john