[iri] #128: use of the term 'origin'

[iri] #128: use of the term 'origin'

iri issue tracker
#128: use of the term 'origin'

 While reviewing 3987bis for i18n terminology, I came across this
 paragraph (Section 3.5):

    For compatibility with existing deployed HTTP infrastructure, the
    following special case applies for schemes "http" and "https" and
    IRIs whose origin has a document charset other than one which is UCS-
    based (e.g., UTF-8 or UTF-16).  In such a case, the "query" component
    of an IRI is mapped into a URI by using the document charset rather
    than UTF-8 as the binary representation before pct-encoding.  This
    mapping is not applied for any other scheme or component.

 The term 'origin' could be ambiguous here. It doesn't seem to be
 referencing the Web Origin Concept (RFC 6454) but instead seems to be
 based on the "document" (broadly construed) in which the http or https
 URL is found (e.g., as a hyperlink in an HTML document or perhaps as
 running text in an email message). It would be good to make that clear.

--
-----------------------+--------------------------------------
 Reporter:  stpeter@…  |      Owner:  draft-ietf-iri-3987bis@…
     Type:  defect     |     Status:  new
 Priority:  minor      |  Milestone:
Component:  3987bis    |    Version:
 Severity:  -          |   Keywords:
-----------------------+--------------------------------------

Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/128>
iri <http://tools.ietf.org/wg/iri/>



Re: [iri] #128: use of the term 'origin'

iri issue tracker
#128: use of the term 'origin'

Comment(by stpeter@…):

 One way to remove the ambiguity would be to change "origin" here to
 something else, but even then I think we'd need additional text. I
 tentatively propose the following:

    For compatibility with existing deployed HTTP infrastructure, the
    following special case applies for the schemes "http" and "https"
    when an IRI is found in a document whose charset is not based on UCS
    (e.g., not UTF-8 or UTF-16).  In such a case, the "query" component
    of an IRI is mapped into a URI by using the document charset rather
    than UTF-8 as the binary representation before pct-encoding.  This
    mapping is not applied for any other scheme or component.
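
 A minimal sketch of how the proposed special case might look in code,
 assuming Python's urllib.parse. The helper name map_query and the
 UCS_BASED set are hypothetical, introduced only for illustration; the
 draft does not enumerate UCS-based charsets, and the sketch ignores
 input that is already percent-encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Charsets treated as "UCS-based" in this sketch (an assumption, not a list from the draft).
UCS_BASED = {"utf-8", "utf-16", "utf-16le", "utf-16be", "utf-32"}


def map_query(iri: str, document_charset: str) -> str:
    """Percent-encode only the query component of an IRI (hypothetical helper).

    For http/https IRIs found in a non-UCS document, the query is encoded
    with the document charset; in every other case UTF-8 is used.
    """
    parts = urlsplit(iri)
    charset = document_charset.lower()
    special_case = parts.scheme in ("http", "https") and charset not in UCS_BASED
    encoding = charset if special_case else "utf-8"
    # Keep "=" and "&" literal so the key/value structure of the query survives.
    query = quote(parts.query, safe="=&", encoding=encoding)
    return urlunsplit(parts._replace(query=query))


iri = "http://example.org/p?r=r\u00e9sum\u00e9"  # query contains "résumé"
print(map_query(iri, "iso-8859-1"))
print(map_query(iri, "utf-8"))
```

 Other components (path, fragment) are left untouched here; a full
 IRI-to-URI mapping would encode those as UTF-8 regardless of the
 document charset.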

--
-----------------------+---------------------------------------
 Reporter:  stpeter@…  |       Owner:  draft-ietf-iri-3987bis@…
     Type:  defect     |      Status:  new
 Priority:  minor      |   Milestone:
Component:  3987bis    |     Version:
 Severity:  -          |  Resolution:
 Keywords:             |
-----------------------+---------------------------------------

Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/128#comment:1>
iri <http://tools.ietf.org/wg/iri/>



Re: [iri] #128: use of the term 'origin'

masinter
In reply to this post by iri issue tracker
Does this apply to any format other than HTML? I'm not sure that it applies to anything else. Within image/svg+xml, for example? The notion of a document charset doesn't apply to some formats.


Re: [iri] #128: use of the term 'origin'

Peter Saint-Andre-2
Perhaps it doesn't belong in 3987bis, then, but instead in a spec about
internationalization in HTML.

On 6/16/12 9:28 AM, Larry Masinter wrote:

> does this apply to any format other than HTML? I'm not sure that this
> applies to anything else... Within image/svg+xml, for example? The
> notion of document charset doesn't apply to some formats.

Re: [iri] #128: use of the term 'origin'

iri issue tracker
In reply to this post by iri issue tracker
#128: use of the term 'origin'

Changes (by duerst@…):

 * owner:  draft-ietf-iri-3987bis@… => duerst@…


Comment:

 This change looks good to me. I have included this in the SVN copy with
 revision 120. I'm leaving the issue open just in case somebody has some
 better idea.

--
-----------------------+-----------------------
 Reporter:  stpeter@…  |       Owner:  duerst@…
     Type:  defect     |      Status:  new
 Priority:  minor      |   Milestone:
Component:  3987bis    |     Version:
 Severity:  -          |  Resolution:
 Keywords:             |
-----------------------+-----------------------

Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/128#comment:2>
iri <http://tools.ietf.org/wg/iri/>



Re: [iri] #128: use of the term 'origin'

Martin J. Dürst
In reply to this post by masinter
On 2012/06/17 0:28, Larry Masinter wrote:
> does this apply to any format other than HTML? I'm not sure that this applies to anything else... Within image/svg+xml, for example? The notion of document charset doesn't apply to some formats.

Hello Larry,

Very good idea to test this. I tested the various browsers that I have,
looking at the actual requests in Wireshark, everything on Windows 7.
The test consisted of the attached SVG file in iso-8859-1 with a link to
an existing domain but a non-existing page with a query part with
non-ASCII characters.

Here are the results:

Opera 12:
GET /non-existent?r%C3%A9sum%C3%A9 HTTP/1.1\r\n
This means the query part is sent as percent-encoded UTF-8.

Safari (5.1.7):
GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
This means that the query part is sent as percent-encoded iso-8859-1.

IE9:
GET /non-existent?r\351sum\351 HTTP/1.1\r\n
This means that the query part is sent as raw (not percent-encoded)
iso-8859-1; \351 is the octal escape for the byte 0xE9.

Firefox 13.0.1:
GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
This means that the query part is sent as percent-encoded iso-8859-1.

Chrome 20:
GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
This means that the query part is sent as percent-encoded iso-8859-1.
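
The three forms seen on the wire can be reproduced with a short sketch,
assuming Python's urllib.parse and taking "résumé" as the query text
(which is what the captured requests decode to):

```python
from urllib.parse import quote

query = "r\u00e9sum\u00e9"  # "résumé", written with escapes to avoid source-encoding issues

utf8_form = quote(query, encoding="utf-8")         # form reported above for Opera 12
latin1_form = quote(query, encoding="iso-8859-1")  # form reported for Safari, Firefox, Chrome
raw_latin1 = query.encode("iso-8859-1")            # unescaped bytes, as reported for IE9

print(utf8_form)
print(latin1_form)
print(raw_latin1)
```

The only difference between the first two is the charset used to turn
characters into bytes before percent-encoding; the third skips the
percent-encoding step entirely.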

With the exception of Opera, SVG seems to follow HTML. But there are SVG
user agents that are not browsers. If somebody has one of these, please
run this test and tell us what you got.

Also, there are formats other than HTML and SVG.

Regards,   Martin.



Attachment: svg_test_query.svg (812 bytes)

RE: [iri] #128: use of the term 'origin'

Dave Thaler-2
Personally I dislike the change to allow using the document charset and prefer the 3987
behavior.

On the question of "other than HTML", URIs and/or IRIs can appear in many contexts...
In normal text in an email message, or in a PDF file or Word doc or whatever else.
Allowing it to vary complicates frameworks considerably since now the doc charset
has to be passed from whatever extracts the URI from the document (HTML or otherwise)
to whatever else needs to know the interpretation (normalizer code, comparison code,
whatever).   Various API frameworks already have various sorts of "Uri" classes that
take in a URI-like string and let you do things like get the URI form or the IRI form,
or various components or whatever.   Of course those would have to change for
any bis, but this also means the constructor needs to change since you cannot
correctly interpret an IRI(bis) without knowing the document charset.
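
A small sketch of the comparison problem, assuming Python's urllib.parse;
same_query is a hypothetical helper, not part of any existing "Uri" class.
It shows that two query strings can only be compared as IRIs if the
charset travels with each of them:

```python
from urllib.parse import unquote


def same_query(query_a: str, charset_a: str, query_b: str, charset_b: str) -> bool:
    """Compare two percent-encoded queries after decoding each with its own charset."""
    return unquote(query_a, encoding=charset_a) == unquote(query_b, encoding=charset_b)


latin1_query = "r%E9sum%E9"      # encoded from an iso-8859-1 document
utf8_query = "r%C3%A9sum%C3%A9"  # encoded as UTF-8

# Equal only when the right charset is supplied for each side.
print(same_query(latin1_query, "iso-8859-1", utf8_query, "utf-8"))
# Assuming UTF-8 for both yields replacement characters and a mismatch.
print(same_query(latin1_query, "utf-8", utf8_query, "utf-8"))
```

Without the charset argument the function has no way to tell which
interpretation is intended, which is the extra state a constructor
would have to accept.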

I'm not yet convinced that's a change worth making.

-Dave


Re: [iri] #128: use of the term 'origin'

iri issue tracker
In reply to this post by iri issue tracker
#128: use of the term 'origin'

Changes (by stpeter@…):

 * status:  new => closed
 * resolution:   => fixed


--
-----------------------+-----------------------
 Reporter:  stpeter@…  |       Owner:  duerst@…
     Type:  defect     |      Status:  closed
 Priority:  minor      |   Milestone:
Component:  3987bis    |     Version:
 Severity:  -          |  Resolution:  fixed
 Keywords:             |
-----------------------+-----------------------

Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/128#comment:3>
iri <http://tools.ietf.org/wg/iri/>



Re: [iri] #128: use of the term 'origin'

Martin J. Dürst
In reply to this post by Dave Thaler-2
Hello Dave,

Sorry to be very late with my answer.

On 2012/07/11 9:05, Dave Thaler wrote:
> Personally I dislike the change to allow using the document charset and prefer the 3987
> behavior.

I also very much dislike this! I very much wish we could fix this!
Just in case you know a way to convince the IE team at Microsoft to fix
this, please tell us.

Some background on how browsers got to where they are with query
encoding is in a P.S. to this mail.


> On the question of "other than HTML", URIs and/or IRIs can appear in many contexts...
> In normal text in an email message, or in a PDF file or Word doc or whatever else.

Yes indeed.

> Allowing it to vary complicates frameworks considerably since now the doc charset
> has to be passed from whatever extracts the URI from the document (HTML or otherwise)
> and whatever else needs to know the interpretation (normalizer code, comparison code,
> whatever).   Various API frameworks already have various sorts of "Uri" classes that
> take in a URI-like string and let you do things like get the URI form or the IRI form,
> or various components or whatever.   Of course those would have to change for
> any bis, but this also means the constructor needs to change since you cannot
> correctly interpret an IRI(bis) without knowing the document charset.

This is indeed a very important point. Libraries and tooling are too
often overlooked.

I think the current draft also doesn't say anything about cases where
"document charset" information is not available (e.g. when you type a
query part into a browser bar, or when a query part appears on a napkin).
We should make sure it says to use UTF-8 in that case.
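
As a rough sketch of that fallback (assuming Python; query_charset is a
hypothetical helper and the set of UCS-based names is an assumption for
illustration only):

```python
from typing import Optional

# Names treated as UCS-based in this sketch (assumption, not taken from the draft).
_UCS_BASED = {"utf-8", "utf-16", "utf-16le", "utf-16be", "utf-32"}


def query_charset(document_charset: Optional[str]) -> str:
    """Choose the charset for percent-encoding an http(s) query component."""
    if document_charset is None:
        # No document at all: address bar, napkin, spoken aloud, ...
        return "utf-8"
    name = document_charset.strip().lower()
    # UCS-based documents also go through UTF-8.
    return "utf-8" if name in _UCS_BASED else name


print(query_charset(None))
print(query_charset("ISO-8859-1"))
print(query_charset("UTF-16"))
```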


> I'm not yet convinced that's a change worth making.

Do you see a chance to convince the IE team to fix this?
We'd then also have to convince Mozilla and Webkit folks.

If we can't convince them, then our only hope is that UTF-8 content is
increasing steadily on the Web (IEEE Spectrum showed a graph provided by
Mark Davis that had UTF-8 (without pure ASCII) at 60%). I think we
should be careful to make sure that we write the spec so that it doesn't
make things overly complicated in a world where essentially all Web
pages are UTF-8.


Regards,    Martin.

P.S.: And here is the story of why query parts are treated the (odd!)
way they are in browsers.

In the mid-'90s, Web pages in all kinds of encodings started to show
up. CGI scripts took in data from forms, and there was a serious
problem: In what encoding should the form data be sent back to the
server? It was the most frequent question asked on mailing lists related
to I18N and the Web, and at the Unicode conference.

RFC 2070 (HTML I18N, now historic) introduced the accept-charset
attribute (see http://tools.ietf.org/html/rfc2070#section-5.1), but that
was not implemented by browsers. A convention started to emerge, which
was that the character encoding of the document containing the form
would be used.

This was taken over by HTML4 (see
http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset (*)),
although the accept-charset attribute was moved from individual fields
to the form element itself. The accept-charset attribute for a long time
was not implemented, but it finally got implemented when Mozilla got
totally re-implemented, mostly according to spec, and then it spread to
other browsers, to the extent that it ended up in HTML5 (see
http://www.w3.org/TR/html5/the-form-element.html#attr-form-accept-charset).

So for forms, we are all set: you can have a page in Shift_JIS with a
form that uses UTF-8 for application/x-www-form-urlencoded, which means
that you can display the query part as an IRI, or you could have the
reverse, which means that you have to use %-encoding for the Shift_JIS
bytes.
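
To make the two cases concrete, here is a small sketch assuming Python's
urllib.parse; the field name "q" and its value are made up for
illustration:

```python
from urllib.parse import parse_qs, urlencode

fields = {"q": "\u65e5\u672c\u8a9e"}  # "日本語"

# A form whose accept-charset is UTF-8: the result maps back to a readable IRI query.
utf8_body = urlencode(fields, encoding="utf-8")

# The same form submitted as Shift_JIS: the bytes only make sense %-encoded.
sjis_body = urlencode(fields, encoding="shift_jis")

print(utf8_body)
print(sjis_body)

# The receiver has to decode with the charset the sender used.
print(parse_qs(sjis_body, encoding="shift_jis") == parse_qs(utf8_body, encoding="utf-8"))
```

Both bodies carry the same field value; which one the server receives
depends only on the charset chosen for application/x-www-form-urlencoded.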

The problem with all this is that browser makers thought that a query
part in a complete IRI (e.g. in the href attribute of an <a> element or
the src attribute of an <img> element) is just like a form, and so
should use the document charset. RFC 3987 nowhere mentions that the
query part should be treated differently from the rest of the IRI, but
in hindsight, it might have been a good idea to put a big reminder into
RFC 3987, saying "all this also applies to the query part". And of
course there's no accept-charset attribute for <a> or <img>.

(*) There was a small tweak, in that in RFC 2070, the accept-charset
attribute was on each (textual) form element, but for HTML4, we moved it
to the form element itself.





Re: [iri] #128: use of the term 'origin'

Peter Saint-Andre-2
<hat type='individual'/>

On 7/18/12 5:08 AM, "Martin J. Dürst" wrote:

> I think we
> should be careful to make sure that we write the spec so that it doesn't
> make things overly complicated in a world where essentially all Web
> pages are UTF-8.

I agree with that goal.

Peter