Encoding interaction of HTTP response header and META tag

Wayne Pollock
In HTML 4 and 5, there is a glaring, annoying error.  Or at least
it seems that way to me.  (This is related to, but not identical
to, issue 148.)

The standard says (implicitly in 5.2.2, explicitly in 8.2.2.1, and
in 10.2.2.1 of the WHATWG HTML 5 document) that if the web server sets
the character encoding in the HTTP response header, then that is used,
and everything later in the encoding sniffing algorithm (e.g., the BOM
check), including the author's META tag of (say):

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

or the HTML 5 version:

  <meta charset="UTF-8">

is ignored.  However, this is backward.  If document authors
go to the trouble of stating the charset in the HEAD of their document,
then that should override any default set by the web server.

The rationale for this is given in section 5.2.2:

"Some servers examine the first few bytes of the document, or check against a
database of known files and encodings. Many modern servers give Web masters more
control over charset configuration than old servers do. Web masters should use these
mechanisms to send out a "charset" parameter whenever possible, but should take care
not to identify a document with the wrong "charset" parameter value."

Here's why I think this is wrong and should be changed:

In today's world a single website may have multiple web pages written by
multiple authors.  Each could be using a different charset.  A web
server typically has a setting to return the charset in the HTTP
response header, by looking only at the file extension.  This is what
Apache does, for instance.  It is a huge burden to webmasters everywhere
to have to manually set the charset for every update to their website.

TO OVERRIDE THE DEFAULT CHARSET RETURNED BY APACHE, A PER-FILE DIRECTIVE MUST
BE USED TO SPECIFY EACH FILE'S CHARSET.  Such overriding is possible, but
to give web authors the ability to do so, per-directory settings must be
enabled (the ".htaccess" files).  Doing so severely impacts server performance,
and many sites simply can't do so, so web pages WILL be sent with the wrong
charset.  I'm sure some other web servers are similar and do not sniff the
MIME type or charset.  However, most browsers do.

The alternative is to not have the web server return any such header,
allowing the browser to examine the document for a BOM and then META tag
that specifies the charset used.

But allowing web page authors to override the (default) charset sent by
a web server with the appropriate META tag seems entirely reasonable to me.

How many times have you visited some web page only to find curly quotes,
bullets, etc., don't render correctly because despite a correct META tag,
the CMS used sent a default value in the HTTP response header?  This is
a problem that need not exist.

Part of the problem seems to be that very early on there was no way for document
authors to include charset info in their documents, so web browsers evolved
to use what the web server said.  When the META tag to allow web authors
to set the charset was added, the rules were written not to break existing
practice.  HTML 5 seems to be following this vicious cycle: browsers follow
the standard, and the standard follows the browser practice.

This should be a simple fix.  The issue was raised on the WHATWG list and
elsewhere, and nobody could think of an objection to this proposal.  (The
only web pages that could "break" with this change were already broken.)

Summary:  Change existing determination of charset by moving step two
  2. If the transport layer specifies an encoding, and it is supported,
     return that encoding with the confidence certain, and abort these steps.
to follow (existing) step 5.
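
The ordering at issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the specification's algorithm: the function name `pick_encoding`, the `header_first` flag, the 1024-byte search window, the simplified META pattern, and the `windows-1252` fallback are all hypothetical, and it assumes the steps between the transport-layer check and step 5 are the BOM check and the META scan. With `header_first=True` it follows the order described above; with `header_first=False` it follows the proposed order.

```python
import codecs
import re

# Byte order marks recognised by this sketch (UTF-8 and UTF-16 only).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern covering both META forms shown earlier in this message.
_META_RE = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def pick_encoding(header_charset, body, header_first=True, default="windows-1252"):
    """Return (encoding, source) for a response body given as bytes."""

    def from_header():
        # Transport layer: charset parameter of the HTTP Content-Type header.
        if header_charset:
            return header_charset.lower(), "http-header"
        return None

    def from_bom():
        # Byte order mark at the very start of the document.
        for bom, name in _BOMS:
            if body.startswith(bom):
                return name, "bom"
        return None

    def from_meta():
        # Author's META declaration, searched only near the top of the file.
        match = _META_RE.search(body[:1024])
        if match:
            return match.group(1).decode("ascii").lower(), "meta"
        return None

    if header_first:
        steps = [from_header, from_bom, from_meta]   # order described above
    else:
        steps = [from_bom, from_meta, from_header]   # order proposed here
    for step in steps:
        result = step()
        if result is not None:
            return result
    return default, "default"


page = b'<html><head><meta charset="ISO-8859-1"></head></html>'
print(pick_encoding("UTF-8", page))                      # header wins
print(pick_encoding("UTF-8", page, header_first=False))  # META wins
```

The only difference between the two modes is where the header check sits in the list of steps, which is the whole of the proposed change.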

===================================================================

On a related note, the new structural tags that denote articles and such
should allow an optional CHARSET attribute.  A web page with ARTICLEs
etc. may be (and may likely be) composed of content from many sources,
e.g., a "mash-up".  While CMS and blogging software could force a single
charset so there is only one charset per web page,  that seems an
unnecessary restriction (and I don't know that most blogging software
works that way).

From what I know of how modern web browsers work, it would be relatively
easy to allow.  However, I know this one isn't an oversight (like my main
point above) but a deliberate decision; the A tag's CHARSET attribute has
been deprecated.  However, a CHARSET attribute was *added* to the SCRIPT tag,
so what is the rationale for not allowing one on different parts of a document that
clearly are intended to represent content from different sources?

I believe a CHARSET attribute should be allowed on some block level tags,
including DIV, ARTICLE, and possibly SECTION, IFRAME, and INPUT.  If my main
suggestion is followed, there is probably no need for a CHARSET attribute on
an A tag.

--
Wayne Pollock

Re: Encoding interaction of HTTP response header and META tag

Jukka K. Korpela-2
Wayne Pollock wrote:

> If document authors
> go to the trouble of stating the charset in the HEAD of their
> document,
> then that should override any default set by the web server.

I have much sympathy for the idea, for reasons you gave, especially the
reason that web server admins often disallow the effects of .htaccess files,
effectively enforcing their settings on every author.

However, I'm afraid it's too late; the change would break a long tradition
and would break existing pages.

> It is a huge
> burden to webmasters everywhere to have to manually set the charset
> for every update to their website.

I can't see what you mean by that. The settings need to be checked when you
start creating a site, not after every update.

> TO OVERRIDE THE DEFAULT CHARSET RETURNED BY APACHE, A PER FILE
> DIRECTIVE MUST
> BE USED TO SPECIFY EACH FILE'S CHARSET.

Pardon? Apache settings operate per filename extension, and mostly it
suffices to set the encoding for just one extension, ".html".
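
A rough sketch of what "per filename extension" means in practice: the header is chosen from the file name alone, without reading the file. The table and the function name `content_type_for` are hypothetical and only model HTML files; this is not Apache's code or configuration syntax.

```python
import os

# Hypothetical server-side table: one charset per filename extension.
CHARSET_BY_EXTENSION = {".html": "utf-8", ".htm": "utf-8"}


def content_type_for(path, table=CHARSET_BY_EXTENSION):
    """Build a Content-Type value from the extension only; the bytes are never read."""
    extension = os.path.splitext(path)[1].lower()
    charset = table.get(extension)
    if charset is None:
        # No entry for this extension: send no charset parameter at all.
        return "text/html"
    return f"text/html; charset={charset}"


print(content_type_for("/blog/2011/post.html"))
print(content_type_for("/legacy/page.shtml"))
```

Under this model, two files with the same extension always get the same charset, whatever encoding each was actually saved in.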

> Such overriding is possible but
> to allow web authors the ability to do so, per directory settings
> must be enabled (the ".htaccess" files).  Doing so severely impacts
> server
> performance

I don't think it has any significant impact on performance.

> and many sites simply can't do so, so web pages WILL be sent with the
> wrong charset.

Well, I would put it this way: If the server admin disallows the effects of
your .htaccess file, then it's just something you need to live with. If
they force your HTML documents to be served with headers saying that the
encoding is iso-8859-1, or utf-8, or whatever, then just make it so.

> This should be a simple fix.  The issue was raised on the WHATWG list
> and elsewhere, and nobody could think of an objection to this
> proposal.

I think a more specific citation of previous discussions would be needed.

> (The
> only web pages that could "break" with this change were already
> broken.)

99% of web pages are broken, in the sense of not complying with HTML, CSS,
WCAG 1.0, or other relevant recommendations. When we worry about what
happens to existing pages, we need to worry about more or less broken pages,
mostly.

Consider a page on a server that forces Content-Type: text/html;
charset=utf-8 on all HTML files. Such servers are increasingly common.
Authors have had to accommodate to that, for example saving documents in
utf-8 encoding if needed. The pages may well have <meta> tags announcing
iso-8859-1 or something else, maybe because some web page editing software
emitted it, or it belonged to a sample file used as a starting point, or the
author copied it from somewhere, with little or no understanding of its
effect.

Your proposal, if accepted and implemented, would imply that all such pages
stopped working, if they (literally) contain any character outside the ASCII
range. This might mean a mess that everyone can see, or just one character
might be wrong, or anything between these extremes.
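
A small sketch of that breakage, assuming a page saved as UTF-8 whose <meta> tag claims iso-8859-1. The helper name `render` and the sample text are made up for illustration.

```python
def render(raw_bytes, charset):
    """Decode the body the way a browser would once it has settled on a charset."""
    return raw_bytes.decode(charset)


saved_text = "caf\u00e9"               # "café", as the author typed it
raw = saved_text.encode("utf-8")       # what is actually stored on the server

as_header_says = render(raw, "utf-8")      # header wins: text survives
as_meta_says = render(raw, "iso-8859-1")   # META wins: two wrong characters

print(as_header_says)
print(as_meta_says)
```

Pure ASCII text decodes identically under both labels, which is why only pages containing characters outside the ASCII range would be affected.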

> On a related note, the new structural tags that denote articles and
> such
> should allow an optional CHARSET attribute.  A web page with ARTICLEs
> etc. may be (and may likely be) composed of content from many sources,
> e.g., a "mash-up".  While CMS and blogging software could force a
> single
> charset so there is only one charset per web page,  that seems an
> unnecessary restriction (and I don't know that most blogging software
> works that way).

No, it's an inherent restriction. The idea of allowing different character
encodings within a single document has often been suggested, but it's based
on a misunderstanding. Changing the encoding at a higher protocol level
conflicts with the basic modern model of using character data. Recognizing
encoding from meta tags is admittedly in conflict with it, too, but it was a
more or less unavoidable exception, which has been separately defined (and
is still known to cause problems, especially when people don't understand
how it works and place it too late in the document). - "Mash-up" simply
needs to recode when needed.

--
Yucca, http://www.cs.tut.fi/~jkorpela/ 


RE: Encoding interaction of HTTP response header and META tag

Harley Rosnow
In reply to this post by Wayne Pollock
When I first encountered this order of precedence in HTML, I had the exact same response as you, Wayne.  In general, when designing software systems, I like to give higher precedence to information that's closer to the content which it describes.  Your points about the efficiency and difficulty of managing HTTP response headers vs. the META tag (and the BOM for Unicode files) are well taken.  But I still think the spec should not change.

>
> Wayne Pollock wrote:
>
> > If document authors
> > go to the trouble of stating the charset in the HEAD of their
> > document, then that should override any default set by the web server.
>
> I have much sympathy for the idea, for reasons you gave, especially
> the reason that web server admins often disallow the effects of
> .htaccess files, effectively enforcing their settings on every author.
>
> However, I'm afraid it's too late; the change would break a long
> tradition and would break existing pages.

Yes, it would cause compatibility issues to change this.  However, that's not the reason to keep it.

The reason for this precedence order is security against XSS attacks.  While hackers can sometimes insert content into web pages or even the very beginning of the stream in some vulnerabilities, it's much more difficult for hackers to manipulate HTTP headers.  In many cases I've encountered vulnerabilities completely mitigated by the presence of an HTTP header specifying the encoding of the web page (or XHR response).  The order of precedence of the HTTP header is essential for security and that's the reason we *really* cannot change it.
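
One commonly cited illustration of this class of attack uses UTF-7; it is offered here as an assumption and is not necessarily the case described above. Markup-filtering code sees no angle brackets in the injected bytes, but they turn into a script element if the attacker can get the page decoded as UTF-7. The payload below is made up for the sketch.

```python
# Injected content that passes a filter looking for "<" and ">" in the raw bytes.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# If the HTTP header pins the encoding to UTF-8, the bytes stay harmless text.
as_utf8 = payload.decode("utf-8")

# If in-document content could switch the encoding to UTF-7, markup appears.
as_utf7 = payload.decode("utf-7")

print("<" in as_utf8)   # no markup when decoded as UTF-8
print(as_utf7)          # markup when decoded as UTF-7
```

With the header taking precedence, content injected into the body cannot change how the rest of the page is decoded.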

The other reason I've encountered for this order of precedence is that it allows for the transcoding of the content as it moves through proxy servers and other network entities between the client and the server.  I've not really encountered this in my work, but that doesn't mean it's not happening on the Web.

> > It is a huge
> > burden to webmasters everywhere to have to manually set the charset
> > for every update to their website.
>
> I can't see what you mean by that. The settings need to be checked
> when you start creating a site, not after every update.

Yes, I agree.  It's often trivial to associate an HTTP header with a particular type of file and difficult to set a different header for each file.  Often, files are created by different authors and the encoding of the file can be changed by selecting a different option in a tool by the author.  At the same time, getting the HTTP header into sync with the author can require the web administrator to make the change.  This precedence scheme can cause an increasing burden as the scale of the web site and organization grows, if multiple encodings are supported.

My one recommendation here is to require the use of UTF-8 everywhere.  This approach scales up well with minimal burden.

>
> > TO OVERRIDE THE DEFAULT CHARSET RETURNED BY APACHE, A PER FILE
> > DIRECTIVE MUST BE USED TO SPECIFY EACH FILE'S CHARSET.
>
> Pardon? Apache settings operate per filename extension, and mostly it
> suffices to set the encoding for just one extension, ".html".
>
> > Such overriding is possible but
> > to allow web authors the ability to do so, per directory settings
> > must be enabled (the ".htaccess" files).  Doing so severely impacts
> > server performance
>
> I don't think it has any significant impact on performance.
>
> > and many sites simply can't do so, so web pages WILL be sent with
> > the wrong charset.
>
> Well, I would put it this way: If the server admin disallows the
> effects of your .htaccess file, then it's just something you need to
> live with. If they force your HTML documents to be served with
> headers saying that the encoding is iso-8859-1, or utf-8, or whatever,
> then just make it so.

I've more experience with IIS than Apache, but web authors need to live within the requirements set by their web admins.  

> > This should be a simple fix.  The issue was raised on the WHATWG
> > list and elsewhere, and nobody could think of an objection to this
> > proposal.
>
> I think a more specific citation of previous discussions would be needed.
>
> > (The
> > only web pages that could "break" with this change were already
> > broken.)
>
> 99% of web pages are broken, in the sense of not complying with HTML,
> CSS, WCAG 1.0, or other relevant recommendations. When we worry about
> what happens to existing pages, we need to worry about more or less
> broken pages, mostly.
>
> Consider a page on a server that forces Content-Type: text/html;
> charset=utf-8 on all HTML files. Such servers are increasingly common.
> Authors have had to accommodate to that, for example saving documents
> in
> utf-8 encoding if needed. The pages may well have <meta> tags
> announcing
> iso-8859-1 or something else, maybe because some web page editing
> software emitted it, or it belonged to a sample file used as a
> starting point, or the author copied it from somewhere, with little or
> no understanding of its effect.
>
> Your proposal, if accepted and implemented, would imply that all such
> pages stopped working, if they (literally) contain any character
> outside the ASCII range. This might mean a mess that everyone can see,
> or just one character might be wrong, or anything between these extremes.
>
> > On a related note, the new structural tags that denote articles and
> > such should allow an optional CHARSET attribute.  A web page with
> > ARTICLEs etc. may be (and may likely be) composed of content from
> > many sources, e.g., a "mash-up".  While CMS and blogging software
> > could force a single charset so there is only one charset per web
> > page, that seems an unnecessary restriction (and I don't know that
> > most blogging software works that way).
>
> No, it's an inherent restriction. The idea of allowing different
> character encodings within a single document has often been suggested,
> but it's based on a misunderstanding. Changing the encoding at a
> higher protocol level conflicts with the basic modern model of using
> character data. Recognizing encoding from meta tags is admittedly in
> conflict with it, too, but it was a more or less unavoidable
> exception, which has been separately defined (and is still known to
> cause problems, especially when people don't understand how it works
> and place it too late in the document). - "Mash-up" simply needs to recode when needed.
>

I agree with Jukka that multiple encodings within the same file must be rejected.  Thanks,

Harley Rosnow
Internet Explorer
Microsoft Corporation


Re: Encoding interaction of HTTP response header and META tag

Wayne Pollock-2
Thanks for the informative replies.  I see now it cannot be changed.  But I would like to see a note added to this part of the standard, explaining the security rationale.

As was noted, Apache only allows one charset per extension easily.  Adding .htaccess files must be done carefully to avoid security issues.  And it really does have a huge impact on Apache performance; every access must check every directory on the file's path for .htaccess files---that info isn't cached as far as I know.
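
To make the per-directory lookup concrete, here is a sketch of which .htaccess files would be consulted for one request, assuming overrides are enabled from the document root downward. The function name `htaccess_candidates` and the paths are hypothetical, and the sketch says nothing about how expensive each check actually is.

```python
from pathlib import PurePosixPath


def htaccess_candidates(document_root, requested_file):
    """List the .htaccess paths on the way from the document root to the file."""
    root = PurePosixPath(document_root)
    target = PurePosixPath(requested_file)
    # relative_to raises ValueError if the file is not under the document root.
    relative = target.relative_to(root)

    candidates = [root / ".htaccess"]
    current = root
    # Every directory component of the path contributes one more lookup.
    for part in relative.parts[:-1]:
        current = current / part
        candidates.append(current / ".htaccess")
    return [str(path) for path in candidates]


for path in htaccess_candidates("/var/www", "/var/www/blog/2011/post.html"):
    print(path)
```

The number of lookups grows with the depth of the requested file, which is the cost being debated here.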

But security is vital, so I agree it is not wise to change the order.  I guess the best procedure is to convert all files on a server to a single encoding.  (While I like UTF-8 too, I can see where that is biased against non-Western languages and no single byte encoding will work well for everyone.)

--
Wayne Pollock

On Mar 4, 2011, at 7:41 PM, Harley Rosnow <[hidden email]> wrote:

> [full quote of the preceding message snipped]

Re: Encoding interaction of HTTP response header and META tag

Anne van Kesteren-2
On Sat, 05 Mar 2011 08:50:06 +0100, Wayne Pollock <[hidden email]>
wrote:
> But security is vital, so I agree it is not wise to change the order.  I  
> guess the best procedure is to convert all files on a server to a single  
> encoding.  (While I like UTF-8 too, I can see where that is biased  
> against non-Western languages and no single byte encoding will work well  
> for everyone.)

Using UTF-8 always is fine, even for non-Western languages:

http://lists.w3.org/Archives/Public/www-style/2009Feb/0087.html

There are also some other reasons why you would want to use UTF-8:

http://annevankesteren.nl/2009/09/utf-8-reasons
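
A quick sketch of the coverage side of this point: each single-byte encoding handles some scripts and fails on others, while UTF-8 can represent all of them. The helper name `encodable` and the sample strings are made up for illustration, and the sketch does not measure byte size, which is a separate question covered by the links above.

```python
def encodable(text, charset):
    """Return True if every character of text can be represented in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "French": "caf\u00e9",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",
    "Japanese": "\u65e5\u672c\u8a9e",
}

for language, text in samples.items():
    print(
        language,
        encodable(text, "iso-8859-1"),
        encodable(text, "iso-8859-5"),
        encodable(text, "utf-8"),
    )
```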


--
Anne van Kesteren
http://annevankesteren.nl/