Auto-detect and encodings in HTML5


Auto-detect and encodings in HTML5

Travis Leithead-2

Ian, UA vendors, and HTML/I18n mailing list folks:

I'd like to present the following feedback from one of our lead
Trident developers on the IE team. He and I work on a number of
parts of the web platform, the encoding and auto-detect subsystem
being the one most relevant to this mail. I'd really like to
generate some discussion from the other browser UAs on this
topic.

The basic idea is that we feel there are a few places where
the HTML5 spec could make assertions to improve the web's
international support and future ease of interoperability
regarding encodings and auto-detect. We recognize the need to be
as compatible as possible with currently deployed web sites, and
the technique proposed to maintain compatibility is to leverage
the "HTML5 doctype". I don't want to focus too much on that
particular aspect of the proposal (though it's important), but
also to consider the implications and scenarios.

 

The proposal is straightforward. Only in pages with the HTML5 doctype:

1.  Forbid the use of auto-detect heuristics for HTML encodings.

2.  Forbid the use of problematic encodings such as UTF-7 and EBCDIC.

    Essentially, get rid of the classes of encodings in which
    JScript and tags do not correspond to simple ASCII characters
    in the raw byte stream.

3.  Only handle the encoding in the first META tag within the
    HEAD, and require the HEAD and META tags to appear within
    a well-defined, fixed byte distance into the file to take effect.

4.  Require the default HTML encoding to be UTF-8.

 

I realize these changes depart somewhat from current practice and
may seem constraining. But I was very pleased to see UTF-7 already
excluded and EBCDIC discouraged in the HTML5 draft. The META tag
is supposed to be the first tag after the HEAD tag according to the draft.
But if we could get substantial agreement from the various user
agents to tighten up the behavior covering this handling, we could
greatly improve the Internet in the following regards:

A.  HTML5 would no longer be vulnerable to script injection via
    encodings such as UTF-7 and EBCDIC, where injected content tricks
    the auto-detection code into reinterpreting the entire page and
    running the injected script.

    (Harley: I’ve had to fix a number of issues related to these
    security vulnerabilities, but the problem is systemic in the
    products and the standard doesn’t help.)

B.  HTML5 would be able to process markup more efficiently by
    reducing the scanning and computation required merely to
    determine the encoding of the file.

C.  Since the heuristics or default encoding sometimes use
    information about the user’s environment, we often see pages
    that display quite differently from one region to another.
    As much as possible, browsing from across the globe should
    give a consistent experience for a given page. (Basically, I
    want my children to one day stop seeing garbage when they
    browse Japanese web sites from the US.)

D.  We’d greatly increase the consistency of implementation of
    markup handling by the various user agents. These openings
    for UA-specific heuristics and decisions undermine the
    benefits of standards and standardization.

 

Thanks,

Travis and Harley

Internet Explorer Program Management/Development
Microsoft Corporation

 


RE: Auto-detect and encodings in HTML5

Phillips, Addison

Hello Travis,

The Internationalization WG is, of course, quite interested in the problem of encoding management and detection in HTML5.

I have added your note to the Internationalization WG’s agenda for our upcoming teleconference.

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Travis Leithead
Sent: Tuesday, May 26, 2009 4:46 PM
To: [hidden email]; [hidden email]; Richard Ishida; Ian Hickson
Cc: Chris Wilson; Harley Rosnow
Subject: Auto-detect and encodings in HTML5

 


RE: Auto-detect and encodings in HTML5

Jonathan Rosenne

EBCDIC and its national language variants, including visual encoding of bidi languages, are in use and will continue to be in use as long as mainframes are in use. A large quantity of data is stored in mainframes in EBCDIC and its variants, and the easiest way of interfacing this data to an HTML UI is by using the encoding features of HTML.

 

I have no objection to banning auto-detection and to making the default HTML encoding UTF8.

 

Jony Rosenne

 

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Phillips, Addison
Sent: Wednesday, May 27, 2009 2:54 AM
To: Travis Leithead; [hidden email]; [hidden email]; Richard Ishida; Ian Hickson
Cc: Chris Wilson; Harley Rosnow
Subject: RE: Auto-detect and encodings in HTML5

 


Re: Auto-detect and encodings in HTML5

cowan
Jonathan Rosenne scripsit:

> EBCDIC and its national language variants, including visual encoding
> of bidi languages, are in use and will continue to be in use as long as
> mainframes are in use. A large quantity of data is stored in mainframes
> in EBCDIC and its variants, and the easiest way of interfacing this
> data to an HTML UI is by using the encoding features of HTML.

It certainly would be simple, but is anyone actually doing this?
I suspect that most mainframe-backed Web servers are actually translating
to ASCII-compatible formats, not sending out EBCDIC HTML.

--
How they ever reached any  conclusion at all    <[hidden email]>
is starkly unknowable to the human mind.        http://www.ccil.org/~cowan
        --"Backstage Lensman", Randall Garrett


Re: Auto-detect and encodings in HTML5

Henri Sivonen
In reply to this post by Travis Leithead-2
On May 27, 2009, at 02:45, Travis Leithead wrote:

> The proposal is straight-forward. Only in pages with the HTML5  
> doctype:

Scoping new behavior to the HTML5 doctype would be contrary to the goal
of specifying HTML processing in such a way that the same processing
rules work for both legacy content and new HTML5 content.

> 1.  Forbid the use of auto-detect heuristics for HTML encodings.

IIRC, Firefox ships with the heuristic detector set to off. I don't  
have enough first-hand experience of browsing content in the affected  
languages (mainly CJK and Cyrillic) to be competent to say what the  
user experience impact of removing the detector altogether would be.

With the HTML5 parser, though, I've changed things so that the  
detector only runs for the first 512 bytes when enabled (the same 512  
bytes as the <meta> prescan). The HTML5 parser-enabled Gecko builds  
haven't had enough testing globally to tell yet whether this is  
enough. My cursory testing of CJK sites suggests 512 bytes is  
sufficient. (Previously in Gecko, the heuristic detector continued to  
inspect the data much further into the stream possibly triggering a  
reparse later on.)

I don't want to make the use of the detector conditional on doctype,  
because the doctype hasn't been parsed yet when the detector runs.

> 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
>
>     Essentially, get rid of the classes of encodings in which
>     Jscript and tags do not correspond to simple ASCII characters
>     in the raw byte stream.

I support this change (for all of text/html).

My understanding is that Firefox and Opera have never had EBCDIC  
decoders, so getting away with not having them thus far suggests that  
IE and Safari could get rid of those decoders without ill effects (at  
least outside intranets).

The security issues with UTF-7 are well-documented, so it seems like a  
good idea to ban it. (HTML email declared as UTF-7 might be an  
exception. Are there popular MUAs in the real world that send HTML  
email as UTF-7?)
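
To make the UTF-7 concern concrete, here is a minimal sketch. It assumes Python's built-in `utf-7` codec as a stand-in for a browser decoder, and `naive_filter` is a hypothetical byte-level check invented for this illustration. The payload contains no `<` or `>` bytes, yet it decodes to a script element.

```python
# Illustration only: Python's built-in "utf-7" codec stands in for a
# browser's decoder, and naive_filter is a hypothetical byte-level check.

def naive_filter(raw: bytes) -> bool:
    """Return True if the raw bytes contain no ASCII angle brackets."""
    return b"<" not in raw and b">" not in raw

# "+ADw-" and "+AD4-" are the UTF-7 spellings of "<" and ">".
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

looks_safe = naive_filter(payload)   # the filter finds no angle brackets
decoded = payload.decode("utf-7")    # what a UTF-7 consumer would see

print(looks_safe)
print(decoded)
```

If a consumer can be steered into treating the page as UTF-7, a filter that only inspects raw bytes gives no protection, which is the case for dropping the encoding altogether.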

> 3.  Only handling the encoding in the first META tag within the
>     HEAD and

I think requiring the meta to be in HEAD for the purposes of the  
encoding sniffing would complicate things, so I don't support such a  
consumer implementation requirement. The spec already makes the  
containment of the encoding meta in head an authoring conformance  
requirement.

> requiring that the HEAD and META tags appear within
>     a well-defined, fixed byte distance into the file to take effect.

I support making the number of bytes that the prescan applies to a  
fixed number. I think the number should not be smaller than 512 bytes  
and not be larger than 1024 bytes.
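
A fixed-size prescan could be sketched roughly as follows. This is a deliberately simplified regular-expression match, not the prescan algorithm from the HTML5 draft. The names `prescan_charset` and `PRESCAN_LIMIT` are made up for the example, and the 1024-byte limit is only one of the values under discussion.

```python
import re
from typing import Optional

PRESCAN_LIMIT = 1024  # assumed value; the thread discusses 512 to 1024 bytes

# Simplified pattern: a <meta ...> tag carrying charset=..., either as an
# attribute of its own or inside a content="...; charset=..." value.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)

def prescan_charset(body: bytes, limit: int = PRESCAN_LIMIT) -> Optional[str]:
    """Return the charset a <meta> declares within the first `limit` bytes."""
    match = _META_CHARSET.search(body[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

early = b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>x</title>'
late = b"<!-- " + b"x" * 2000 + b" -->" + b'<meta charset="utf-8">'

print(prescan_charset(early))
print(prescan_charset(late))
```

A declaration that starts after the limit is simply not seen by the prescan, so the choice of limit decides which existing pages depend on a later fallback.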

Could someone working on WebKit please comment on the experiences on  
choosing the number? (I have a vague recollection that WebKit  
increased the number from 512 to 1024 for some reason.)

However, due to existing content, I don't think we can remove the tree  
builder detecting later encoding <meta>s and causing renavigation to  
the document. The renavigation is unfortunate, but detecting the  
situation in the tree builder seems to have a negligible cost on the  
load times of pages that don't cause the renavigation.

> 4.  Require the default HTML encoding to be UTF8.

This isn't feasible for text/html in general. Given existing content,  
you need to either allow a locale-dependent default or default to  
windows-1252 if you want a single global default.
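
As a small illustration of the problem with existing content (a sketch only; the sample text is invented), consider an undeclared legacy page whose bytes are windows-1252. Decoding them under a UTF-8 default damages the text, while the windows-1252 fallback recovers it.

```python
# Bytes as a legacy Western European page might serve them: "café" in
# windows-1252, with no charset declared anywhere.
legacy_bytes = "caf\u00e9".encode("windows-1252")

# Fallback that matches what the author actually used.
as_windows_1252 = legacy_bytes.decode("windows-1252")

# A UTF-8 default: the lone 0xE9 byte is not valid UTF-8, so a lenient
# decoder substitutes U+FFFD (the replacement character).
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

print(as_windows_1252)
print(as_utf8)
```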

It doesn't make sense to change this for HTML5 doctype documents only,
because the HTML5 doctype is an author opt-in mechanism and authors
already have three widely supported mechanisms for opting into UTF-8:
charset=utf-8 in the HTTP header, charset=utf-8 in <meta>, and the BOM.
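
The three opt-in signals can be recognized with a few lines of code. The sketch below only reports which signals are present and makes no claim about precedence between them. `utf8_opt_ins` is a hypothetical helper, and the regular expressions are simplified for illustration.

```python
import codecs
import re
from typing import List, Optional

def utf8_opt_ins(content_type: Optional[str], body: bytes) -> List[str]:
    """List which of the three UTF-8 opt-in signals a response carries."""
    signals = []
    # 1. charset parameter on the HTTP Content-Type header
    if content_type and re.search(r"charset\s*=\s*\"?utf-8", content_type, re.I):
        signals.append("http")
    # 2. <meta> declaration near the start of the document (simplified match)
    if re.search(rb"<meta[^>]*charset\s*=\s*[\"']?utf-8", body[:1024], re.I):
        signals.append("meta")
    # 3. UTF-8 byte order mark (EF BB BF) at the very start
    if body.startswith(codecs.BOM_UTF8):
        signals.append("bom")
    return signals

print(utf8_opt_ins("text/html; charset=UTF-8", b"<p>hi</p>"))
print(utf8_opt_ins("text/html", b'<meta charset="utf-8">'))
print(utf8_opt_ins(None, b"\xef\xbb\xbf<p>hi</p>"))
```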

> But, if we could get substantial agreement from the various user
> agents to tighten up the behavior covering this handling, we can
> greatly improve the Internet in the following regards:

Would Microsoft be willing to tighten the IE 5.5, IE7, IE8 Almost  
Standards and IE8 Standards modes likewise in IE9? What about  
tightening them in a point release of IE8?

> A.  HTML5 would no longer be vulnerable to script injection from
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>     detection code to reinterpret the entire page and run the
>     injected script.

It makes sense to eliminate this attack vector, but it doesn't make  
sense to eliminate it for documents with the HTML5 doctype only,  
because doing so would still leave browsers vulnerable when the  
attacker targets a system that emits legacy doctypes.

> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.

Authors can already opt-in to more efficient computation by declaring  
on the HTTP layer that they use UTF-8.

Adding different rules for HTML5 would increase code complexity.

> C.  Since sometimes the heuristics or default encoding uses
>     information about the user’s environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)

This is indeed a problem. However, I don't see a way for browsers to
force CJK and Cyrillic sites into making the experience better for
out-of-locale readers without losing market share within the locale
when doing so.

On May 27, 2009, at 07:39, Jonathan Rosenne wrote:

> EBCDIC and its national language variants, including visual encoding  
> of bidi languages, are in use and will continue to be in use as long  
> as mainframes are in use. A large quantity of data is stored in  
> mainframes in EBCDIC and its variants, and the easiest way of  
> interfacing this data to an HTML UI is by using the encoding  
> features of HTML.

It doesn't follow that mainframes should use EBCDIC variants for  
interchange with other systems. I think the burden to perform  
conversion should be on mainframes and other systems shouldn't take on  
the security risk of supporting encodings that aren't rough ASCII  
supersets.
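
The "ASCII superset" distinction can be seen directly in the bytes. This is a minimal sketch that uses Python's `cp037` codec, one EBCDIC variant chosen only because Python ships it.

```python
# "cp037" is one EBCDIC code page available in Python's standard codecs;
# it is used here purely as an example of a non-ASCII-compatible encoding.
markup = "<script>"

ascii_bytes = markup.encode("ascii")
utf8_bytes = markup.encode("utf-8")
ebcdic_bytes = markup.encode("cp037")

# In an ASCII superset, markup characters keep their ASCII byte values.
print(ascii_bytes.hex(), utf8_bytes.hex())
# In EBCDIC the same characters map to entirely different bytes.
print(ebcdic_bytes.hex())

# A byte-level scan for "<" finds nothing in the EBCDIC stream.
print(b"<" in ascii_bytes, b"<" in ebcdic_bytes)
```

Any code that inspects the raw stream for tags before the encoding is settled will therefore see different things depending on which decoder is eventually chosen.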

--
Henri Sivonen
[hidden email]
http://hsivonen.iki.fi/




RE: Auto-detect and encodings in HTML5

Jonathan Rosenne
In reply to this post by Jonathan Rosenne

My mistake – I confused it with old DOS code pages such as IBM862.

 

The communication software converts EBCDIC to PC code pages, e.g. from 424 to IBM862.

 

Jony

 

From: [hidden email] [mailto:[hidden email]] On Behalf Of Jonathan Rosenne
Sent: Wednesday, May 27, 2009 7:40 AM
To: 'Phillips, Addison'; 'Travis Leithead'; [hidden email]; [hidden email]; 'Richard Ishida'; 'Ian Hickson'
Cc: 'Chris Wilson'; 'Harley Rosnow'; Yair Shmuel
Subject: RE: Auto-detect and encodings in HTML5

 


Re: Auto-detect and encodings in HTML5

Anne van Kesteren-2
In reply to this post by Travis Leithead-2
On Wed, 27 May 2009 01:45:53 +0200, Travis Leithead <[hidden email]> wrote:
> A.  HTML5 would no longer be vulnerable to script injection from
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>     detection code to reinterpret the entire page and run the
>     injected script.

Opera 10 does not support UTF-7, UTF-32, and EBCDIC for Web pages, regardless of rendering mode. So far we haven't run into issues. (I'm not sure EBCDIC was ever supported and UTF-32 support might have been removed earlier on.)


> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.

As Henri indicates this might be possible for all pages.


> C.  Since sometimes the heuristics or default encoding uses
>     information about the user's environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)

This is something I'd like to see solved as well, but I'd really like it solved in a way that also works for the pages already deployed.


> D.  We'd greatly increase the consistency of implementation of
>     markup handling by the various user agents. These openings
>     for UA-specific heuristics and decisions, undermines the
>     benefits of standards and standardization.

Yeah, ideally we document the exact algorithms used and have a fixed set of encodings user agents must support and also forbid any other encodings. Define exactly how a byte stream labeled with an encoding maps to Unicode, etc. Unfortunately I haven't found much time to look into this more.


--
Anne van Kesteren
http://annevankesteren.nl/


Re: Auto-detect and encodings in HTML5

Erik van der Poel
In reply to this post by Travis Leithead-2
Hi Travis,

First of all, I am really happy to see a browser vendor offer to get
stricter. :-)

I wonder whether the doctype is a very clean way to move forward in
this area, given that the HTTP charset ought to disable the
auto-detector, but if many authors prefer the META charset, then the
doctype might be a reasonable compromise. I am still thinking about
this part.

However, I object quite strongly to the UTF-8 default. If an HTML5
document includes the doctype but excludes the charset, old clients
might use their auto-detector and get it wrong. So I'd prefer to make
the charset mandatory with HTML5 doctype, and keep the rule that the
HTTP charset overrides the META charset for compatibility with old
clients.

Erik

On Tue, May 26, 2009 at 4:45 PM, Travis Leithead
<[hidden email]> wrote:

> Ian, UA venders, and HTML/I18n mailing list folks:
>
>
>
> I'd like to present the following feedback from one of our lead
>
> Trident developers on the IE team. He and I work on a number of
>
> parts of the web platform; the encoding and auto-detect subsystem
>
> being the one most relevant to this mail. I'd really like to
>
> generate some discussion from the other browser UAs on the this
>
> topic.
>
>
>
> The basic idea is that we feel like there are a few places that
>
> the HTML5 spec could make assertions to improve the web's
>
> international support and future ease of interoperability
>
> regarding encodings and auto-detect. We recognize the need to be
>
> as compatible as possible with currently deployed web sites, and
>
> the technique proposed to maintain compatibility is by leveraging
>
> the "HTML5 doctype". I don't want to focus too much on that
>
> particular aspect of the proposal (though it's important), but to
>
> also consider the implications and scenarios as well.
>
>
>
> The proposal is straight-forward. Only in pages with the HTML5 doctype:
>
>
>
> 1.  Forbid the use of auto-detect heuristics for HTML encodings.
>
>
>
> 2.  Forbid the use problematic encodings such as UTF7 and EBCDIC.
>
>
>
>     Essentially, get rid of the classes of encodings in which
>
>     Jscript and tags do not correspond to simple ASCII characters
>
>     in the raw byte stream.
>
>
>
> 3.  Only handling the encoding in the first META tag within the
>
>     HEAD and requiring that the HEAD and META tags to appear within
>
>     a well-defined, fixed byte distance into the file to take effect.
>
>
>
> 4.  Require the default HTML encoding to be UTF8.
>
>
>
> I realize these changes depart somewhat from current practice and
>
> may seem constraining.  But, I was very pleased to see UTF7 already
>
> excluded and EBCDIC discouraged in the HTML5 draft.  The META tag
>
> is supposed to be the first after the HEAD according to the draft.
>
> But, if we could get substantial agreement from the various user
>
> agents to tighten up the behavior covering this handling, we can
>
> greatly improve the Internet in the following regards:
>
>
>
>
>
> A.  HTML5 would no longer be vulnerable to script injection from
>
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>
>     detection code to reinterpret the entire page and run the
>
>     injected script.
>
>
>
>     (Harley: I’ve had to fix a number of issues related to these
>
>     security vulnerabilities but the problem is systemic in the
>
>     products and the standard doesn’t help.)
>
>
>
> B.  HTML5 would be able to process markup more efficiently by
>
>     reducing the scanning and computation required to merely
>
>     determine the encoding of the file.
>
>
>
> C.  Since sometimes the heuristics or default encoding uses
>
>     information about the user’s environment, we often see pages
>
>     that display quite differently from one region to another.
>
>     As much as possible, browsing from across the globe should
>
>     give a consistent experience for a given page.  (Basically, I
>
>     want my children to one day stop seeing garbage when they
>
>     browse Japanese web sites from the US.)
>
>
>
> D.  We’d greatly increase the consistency of implementation of
>
>     markup handling by the various user agents. These openings
>
>     for UA-specific heuristics and decisions, undermines the
>
>     benefits of standards and standardization.
>
>
>
> Thanks,
>
>
>
> Travis and Harley
>
>
>
> Internet Explorer Program Management/Development
>
> Microsoft Corporation
>
>


Re: Auto-detect and encodings in HTML5

Jungshik SHIN (신정식)
Hi,



2009/5/27 Erik van der Poel <[hidden email]>
> Hi Travis,
>
> First of all, I am really happy to see a browser vendor offer to get
> stricter. :-)

So am I :-)

> I wonder whether the doctype is a very clean way to move forward in
> this area, given that the HTTP charset ought to disable the
> auto-detector, but if many authors prefer the META charset, then the
> doctype might be a reasonable compromise. I am still thinking about
> this part.

My responses inlined below are also contingent on this issue. I'm on the fence about it.

> However, I object quite strongly to the UTF-8 default. If an HTML5
> document includes the doctype but excludes the charset, old clients
> might use their auto-detector and get it wrong. So I'd prefer to make
> the charset mandatory with HTML5 doctype, and keep the rule that the
> HTTP charset overrides the META charset for compatibility with old
> clients.
>
> Erik

On Tue, May 26, 2009 at 4:45 PM, Travis Leithead
<[hidden email]> wrote:
> Ian, UA venders, and HTML/I18n mailing list folks:
>
>
>
> I'd like to present the following feedback from one of our lead
>
> Trident developers on the IE team. He and I work on a number of
>
> parts of the web platform; the encoding and auto-detect subsystem
>
> being the one most relevant to this mail. I'd really like to
>
> generate some discussion from the other browser UAs on the this
>
> topic.
>
>
>
> The basic idea is that we feel like there are a few places that
>
> the HTML5 spec could make assertions to improve the web's
>
> international support and future ease of interoperability
>
> regarding encodings and auto-detect. We recognize the need to be
>
> as compatible as possible with currently deployed web sites, and
>
> the technique proposed to maintain compatibility is by leveraging
>
> the "HTML5 doctype". I don't want to focus too much on that
>
> particular aspect of the proposal (though it's important), but to
>
> also consider the implications and scenarios as well.
>
>
>
> The proposal is straight-forward. Only in pages with the HTML5 doctype:
>
>
>
> 1.  Forbid the use of auto-detect heuristics for HTML encodings.
>
>

As far as I know (Simon will correct me if I'm not up-to-date), Firefox's charset autodetector kicks in only when both of the following two conditions are satisfied:

1) Auto-detection is turned on explicitly by a user. It's OFF by default
2) No charset is specified anywhere.

Even if it's turned ON, Firefox does honor the explicitly specified charset (http or meta).
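
The gating described above can be sketched in a few lines. This is only an illustration of the two conditions, not Firefox's actual code; the function name and its parameters are assumptions made for the example.

```python
from typing import Optional


def should_run_autodetector(user_enabled: bool,
                            http_charset: Optional[str],
                            meta_charset: Optional[str]) -> bool:
    """Return True only when both conditions from the list above hold."""
    # Condition 2: no charset was declared anywhere (neither HTTP nor meta).
    no_declared_charset = not http_charset and not meta_charset
    # Condition 1: the user explicitly switched detection on (off by default).
    return user_enabled and no_declared_charset


# An explicit declaration always wins, even with detection switched on.
print(should_run_autodetector(True, "euc-kr", None))
print(should_run_autodetector(True, None, None))
```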

Webkit does the same except that it tries to detect the encoding when one of the Japanese encodings is specified (which I think has to be removed; Chrome 2.0 removed this in its copy of Webkit, so Chrome 2.0's behavior is identical to Firefox's).

IE's behavior seems to be different, but I haven't managed to figure out when its auto-detector kicks in. Could you tell us what IE does with auto-detection?



 

>
> 2.  Forbid the use problematic encodings such as UTF7 and EBCDIC.
>
>
>
>     Essentially, get rid of the classes of encodings in which
>
>     Jscript and tags do not correspond to simple ASCII characters
>
>     in the raw byte stream.
>

I wholeheartedly support this. Firefox never supported EBCDIC encodings.

 I'm tempted to go a step further to forbid ISO-2022-XX and GB-HZ as well, but there might be a compatibility concern here. However, if that prohibition is triggered by HTML5 doctype, it should be ok.
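
To make the concern concrete, here is a small sketch using Python's built-in "utf-7" and "cp037" codecs (cp037 is just one EBCDIC code page picked for illustration). It shows bytes that contain no ASCII "<" or ">" and yet turn into a script element once a UA decides to decode them as UTF-7, and the reverse problem for EBCDIC.

```python
# Bytes that a byte-level ASCII filter would consider harmless:
# there is no "<" or ">" anywhere in them.
utf7_payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
has_ascii_brackets = b"<" in utf7_payload or b">" in utf7_payload
print(has_ascii_brackets)

# Decoded as UTF-7, the same bytes become markup.
decoded = utf7_payload.decode("utf-7")
print(decoded)

# EBCDIC has the mirror-image problem: the markup is there, but not as
# the ASCII bytes a scanner working on the raw stream would look for.
ebcdic_markup = "<script>".encode("cp037")
print(b"<" in ebcdic_markup)
```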

 

>
>
> 3.  Only handling the encoding in the first META tag within the
>
>     HEAD and requiring that the HEAD and META tags to appear within
>
>     a well-defined, fixed byte distance into the file to take effect.
>
>


There are some web sites with meta tags deeply buried (> 512 bytes from the beginning). Webkit even has a layout test for this (currently, it scans the first 1024 bytes).

I'm by no means happy with those web pages. So I agree with you on this, except that I'm not sure about requiring the meta charset declaration to be inside <head>.
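
A rough sketch of what a fixed-distance scan means in practice follows. It is a deliberately simplified regex, not the HTML5 prescan algorithm (which is a hand-written tokenizer), and the names `prescan` and `limit` are made up for the example.

```python
import re
from typing import Optional

# Simplified: any <meta ...> tag that contains charset=VALUE.
META_CHARSET = re.compile(
    rb"<meta\b[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def prescan(document: bytes, limit: int = 1024) -> Optional[str]:
    """Return the first meta-declared charset found within `limit` bytes."""
    match = META_CHARSET.search(document[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# A declaration buried behind 600 bytes of padding: missed by a 512-byte
# scan, found by a 1024-byte scan.
buried = b"<html><head>" + b" " * 600 + b'<meta charset="euc-kr">'
print(prescan(buried, limit=512))
print(prescan(buried, limit=1024))
```

Note that a hard cut-off can also land in the middle of a declaration; a real implementation would have to decide what to do in that case.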





>
> 4.  Require the default HTML encoding to be UTF8.

Although I wish every web page were in UTF-8, I think I'm with Erik (mandating meta charset with http taking a higher priority).

Aha.. you may have had something else in mind. Even if HTML5 mandates meta charset with http taking a higher priority, some HTML5 pages are likely to be non-compliant with the standard. In that case, we have to define the UA behavior, and you want UAs always to assume UTF-8 instead of the user-configurable default encoding, which is the current practice (in Firefox and Webkit) when the auto-detector is OFF.


 

>
>
> I realize these changes depart somewhat from current practice and
>
> may seem constraining.  But, I was very pleased to see UTF7 already
>
> excluded and EBCDIC discouraged in the HTML5 draft.  The META tag
>
> is supposed to be the first after the HEAD according to the draft.
>
> But, if we could get substantial agreement from the various user
>
> agents to tighten up the behavior covering this handling, we can
>
> greatly improve the Internet in the following regards:
>
>
>
>
>
> A.  HTML5 would no longer be vulnerable to script injection from
>
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>
>     detection code to reinterpret the entire page and run the
>
>     injected script.
>
>
>
>     (Harley: I’ve had to fix a number of issues related to these
>
>     security vulnerabilities but the problem is systemic in the
>
>     products and the standard doesn’t help.)
>
>
>
> B.  HTML5 would be able to process markup more efficiently by
>
>     reducing the scanning and computation required to merely
>
>     determine the encoding of the file.
>
>
>
> C.  Since sometimes the heuristics or default encoding uses
>
>     information about the user’s environment, we often see pages
>
>     that display quite differently from one region to another.
>
>     As much as possible, browsing from across the globe should
>
>     give a consistent experience for a given page.  (Basically, I
>
>     want my children to one day stop seeing garbage when they
>
>     browse Japanese web sites from the US.)

 

>
>
>
> D.  We’d greatly increase the consistency of implementation of
>
>     markup handling by the various user agents. These openings
>
>     for UA-specific heuristics and decisions, undermines the
>
>     benefits of standards and standardization.
>
>
>
> Thanks,
>
>
>
> Travis and Harley
>
>
>
> Internet Explorer Program Management/Development
>
> Microsoft Corporation

Jungshik


Re: Auto-detect and encodings in HTML5

Philip Taylor-5
Jungshik SHIN (신정식) wrote:
> There are some web sites with meta tags deeply buried ( > 512 bytes from the
> beginning). Webkit even has a layout test for this (currently, it scans the
> first 1024 bytes).
>
> By no means, I'm happy with those web pages. So, I agree with you on this
> except that I'm not sure of requiring the meta charset declaration to be
> inside <head>.

Some possibly relevant data for this:
http://philip.html5.org/data/encoding-detection.svg shows how many bytes
have to be read before HTML5's <meta> charset sniffing algorithm finds
an answer, based on 130K pages downloaded from dmoz.org (with a heavy
American/European bias). (http://philip.html5.org/data/charsets.html has
other charset data.)

--
Philip Taylor
[hidden email]


Re: Auto-detect and encodings in HTML5

Henri Sivonen
In reply to this post by Jungshik SHIN (신정식)

On May 27, 2009, at 21:37, Jungshik SHIN (신정식) wrote:

> 2009/5/27 Erik van der Poel <[hidden email]>
>> However, I object quite strongly to the UTF-8 default. If an HTML5
>> document includes the doctype but excludes the charset, old clients
>> might use their auto-detector and get it wrong. So I'd prefer to make
>> the charset mandatory with HTML5 doctype, and keep the rule that the
>> HTTP charset overrides the META charset for compatibility with old
>> clients.

When the document has non-ASCII bytes, an explicit encoding  
declaration (or BOM) is required for document conformance. But  
implementations still need to deal with the violation of the  
requirement.

> As far as I know (Simon will correct me if I'm not up-to-date),  
> Firefox's charset autodetector kicks in only when both of the
> following two conditions are satisfied:
>
> 1) Auto-detection is turned on explicitly by a user. It's OFF by  
> default
> 2) No charset is specified anywhere.
>
> Even if it's turned ON, Firefox does honor the explicitly specified  
> charset (http or meta).

This also holds true in the HTML5 parser-enabled Gecko builds  
currently. The difference is how far the heuristic detector looks when  
it does kick in.

> I'm tempted to go a step further to forbid ISO-2022-XX and GB-HZ as  
> well, but there might be a compatibility concern here. However, if  
> that prohibition is triggered by HTML5 doctype, it should be ok.

The decoder needs to be instantiated before the doctype is parsed.  
Changing this would be a pain. Let's not make the encoding stuff
dependent on doctype.

> There are some web sites with meta tags deeply buried ( > 512 bytes  
> from the beginning). Webkit even has a layout test for this  
> (currently, it scans the first 1024 bytes).

The HTML5 parsing algorithm deals with the late <meta> case by causing  
a renavigation to the document. The question is how far the prescan  
should look. Philip's data shows the diminishing returns have kicked  
in even before 512.

--
Henri Sivonen
[hidden email]
http://hsivonen.iki.fi/




Re: Auto-detect and encodings in HTML5

cowan
In reply to this post by Philip Taylor-5
Philip Taylor scripsit:

> (http://philip.html5.org/data/charsets.html has other charset data.)

It certainly does.  Who would have thought there was a charset named
"beagle kennel van der liniehoeve"?

--
John Cowan                              <[hidden email]>
            http://www.ccil.org/~cowan
                .e'osai ko sarji la lojban.
                Please support Lojban!          http://www.lojban.org


Re: Auto-detect and encodings in HTML5

Leif Halvard Silli-2
John Cowan On 09-05-28 16.24:
> Philip Taylor scripsit:
>
>  
>> (http://philip.html5.org/data/charsets.html has other charset data.)
>>    
>
> It certainly does.  Who would have thought there was a charset named
> "beagle kennel van der liniehoeve"?
>  

It seems the data analysis jumped to a conclusion in this case:

http://www.liniehoeve.nl/

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Content-Language" content="nl">
<meta name="Webmaster" content="Rick Bleeker">
<meta name="Copyright" content="Hans Verbeek, 2008">
<meta name="Title" charset="Beagle Kennel van der Liniehoeve">
--
leif halvard silli


Re: Auto-detect and encodings in HTML5

cowan
Leif Halvard Silli scripsit:

> <meta name="Title" charset="Beagle Kennel van der Liniehoeve">

Well, this does say "charset" rather than "content".

--
John Cowan    http://www.ccil.org/~cowan   <[hidden email]>
    "Any legal document draws most of its meaning from context.  A telegram
    that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
    5-bit Baudot code plus appropriate headers) is as good a legal document
    as any, even sans digital signature." --me


Re: Auto-detect and encodings in HTML5

Leif Halvard Silli-2
John Cowan On 09-05-28 23.08:
> Leif Halvard Silli scripsit:
>
>  
>> <meta name="Title" charset="Beagle Kennel van der Liniehoeve">
>>    
>
> Well, this does say "charset" rather than "content".
>  

Yes, currently HTML doesn't have any @charset attribute. @charset is
only a new invention of the HTML 5 draft. (Maybe this page tries to
document how often the _current_ use of this illegal attribute is
correct, charset-wise?)

What I meant to point out, though, was that since the debate was about
how deep into the page one should sniff, this page had a correct
"charset tag" as the first element of the <head> element. It can't
get any better than that, can it? That it also has this - in every
sense [except in the HTML 5 sense] - meaningless meta element further
down in the <head> does not matter to the issue that was debated, I think.

But if I read the data correctly, then the HTML 5 draft algorithm that
Philip used, was unable to decode the correct charset info in the
_first_ meta element. I wonder why.

Measured against HTML 4, there seem to be _several_ errors in the
analysis/findings presented on that page. For instance, roughly
all the pages mentioned under the following fragment seem to have OK
charset info in their meta elements (and there are many other examples
of the same) - despite Philip's page saying there were errors:

http://philip.html5.org/data/charsets.html#charset-en

I don't know if this represents errors in the HTML 5 algorithm [back
then], or if Philip just wasn't critical enough towards the errors he
believed that his analysis tool had found. (There are some who see any
error in currently deployed HTML as a justification for HTML 5.)

But maybe I just don't understand what the page is trying to say.
--
leif halvard silli



Re: Auto-detect and encodings in HTML5

Philip Taylor-5
Leif Halvard Silli wrote:
> John Cowan On 09-05-28 23.08:
>> Leif Halvard Silli scripsit:
>>
>>> <meta name="Title" charset="Beagle Kennel van der Liniehoeve">
>>
>> Well, this does say "charset" rather than "content".
>
> Yes, currently HTML doesn't have any @charset attribute. @charset is
> only a new invention of the HTML 5 draft.

(It's newly specified in HTML 5, but it's been supported by the major
web browsers for practically forever.)

> if I read the data correctly, then the HTML 5 draft algorithm that
> Philip used, was unable to decode the correct charset info in the
> _first_ meta element.

I looked for the first charset in a <meta content>, and independently
looked for the first <meta charset>, so that particular page was counted
in both of those columns of the table. The "sniffer" column is the one
that matched the algorithm in HTML 5, which stops after finding the
first thing that looks like a charset specification, and for this page
it reported windows-1252.
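
For readers trying to follow the three columns, here is a rough reconstruction of the idea, run against the page head quoted earlier in this thread. The regular expressions are simplified stand-ins written for this example; they are not the code that produced the published data.

```python
import re
from typing import Optional

PAGE = b"""<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Content-Language" content="nl">
<meta name="Webmaster" content="Rick Bleeker">
<meta name="Copyright" content="Hans Verbeek, 2008">
<meta name="Title" charset="Beagle Kennel van der Liniehoeve">
"""

PATTERNS = {
    # First charset=... found inside a content attribute.
    "meta content": rb'<meta\b[^>]*\bcontent\s*=\s*"[^">]*charset=([^\s";>]+)',
    # First quoted charset attribute on a meta element.
    "meta charset": rb'<meta\b[^>]*?\scharset\s*=\s*"([^"]*)"',
    # First thing in a meta tag that looks like a charset at all.
    "sniffer": rb'<meta\b[^>]*?charset\s*=\s*["\']?([^\s"\';>]+)',
}


def first(pattern: bytes) -> Optional[str]:
    """Return the first captured value for `pattern` in PAGE, if any."""
    match = re.search(pattern, PAGE, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else None


results = {name: first(pattern) for name, pattern in PATTERNS.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

The two independent searches disagree on this page, while the sniffer-style search stops at the first declaration and never sees the later element.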

> Measured against HTML 4, there seems to be _several_ errors in the
> analysis/findings that is presented on that page. For instance, roughly
> all the pages mentioned under the following fragment seems to have OK
> charset info in their meta elements (and there are many other examples
> of the same) - despite Philip's page saying there were errors:
>
> http://philip.html5.org/data/charsets.html#charset-en

Most of those pages are sending HTTP headers like "Content-Type:
text/html; charset=en" - the HTML has nothing to do with it. They're
marked as 'invalid' because "en" is not a known character encoding.
('invalid' in that data just means the page's bytes couldn't be decoded
with the specified encoding.)
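
A minimal sketch of that kind of check, using only Python's standard library: the helper names are made up for the example, and Python's codec registry is merely a stand-in for whatever list of known encodings a browser or survey tool actually uses.

```python
import codecs
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def is_known_encoding(label: Optional[str]) -> bool:
    """True if Python's codec registry recognises the label."""
    if not label:
        return False
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


label = declared_charset("text/html; charset=en")
print(label, is_known_encoding(label))
print(is_known_encoding("windows-1252"))
```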

--
Philip Taylor
[hidden email]


Re: Auto-detect and encodings in HTML5

Leif Halvard Silli-2
Philip Taylor On 09-05-29 02.13:

> Leif Halvard Silli wrote:
>> John Cowan On 09-05-28 23.08:
>>> Leif Halvard Silli scripsit:
>>>
>>>> <meta name="Title" charset="Beagle Kennel van der Liniehoeve">
>>>
>>> Well, this does say "charset" rather than "content".
>>
>> Yes, currently HTML doesn't have any @charset attribute. @charset is
>> only a new invention of the HTML 5 draft.
>
> (It's newly specified in HTML 5, but it's been supported by the major
> web browsers for practically forever.)

Interesting how few pages used it, though. I really don't
know if speccing it makes anything any clearer for anyone.

>> if I read the data correctly, then the HTML 5 draft algorithm that
>> Philip used, was unable to decode the correct charset info in the
>> _first_ meta element.
>
> I looked for the first charset in a <meta content>, and independently
> looked for the first <meta charset>, so that particular page was counted
> in both of those columns of the table. The "sniffer" column is the one
> that matched the algorithm in HTML 5, which stops after finding the
> first thing that looks like a charset specification, and for this page
> it reported windows-1252.

... maybe I just don't understand the presentation: the caption
of the table says: "Number of pages declaring encoding (% decoded
without errors)". About the "beagle" page in particular, the
different columns say:

HTTP: U0; meta content: 0; Sniffer: 0; meta charset: 1 (0%);

?
--
leif halvard silli


Re: Auto-detect and encodings in HTML5

M.T. Carrasco Benitez
In reply to this post by Travis Leithead-2

Close to Erik's position, but with UTF8 as the worst case:

1) Best: HTTP charset; unambiguous and "external"
2) Agree on ONE public detection algorithm
3) Mandatory declaration as near to the top as possible; if in META, the first in HEAD; within a certain range of bytes (e.g., 512)
4) Default UTF8 could be part of the algorithm; perhaps the last option
5) No BOM or similar

Regards
Tomas


--- On Wed, 27/5/09, Erik van der Poel <[hidden email]> wrote:

> From: Erik van der Poel <[hidden email]>
> Subject: Re: Auto-detect and encodings in HTML5
> To: "Travis Leithead" <[hidden email]>
> Cc: "[hidden email]" <[hidden email]>, "[hidden email]" <[hidden email]>, "Richard Ishida" <[hidden email]>, "Ian Hickson" <[hidden email]>, "Chris Wilson" <[hidden email]>, "Harley Rosnow" <[hidden email]>
> Date: Wednesday, 27 May, 2009, 7:30 PM
> Hi Travis,
>
> First of all, I am really happy to see a browser vendor offer to get
> stricter. :-)
>
> I wonder whether the doctype is a very clean way to move forward in
> this area, given that the HTTP charset ought to disable the
> auto-detector, but if many authors prefer the META charset, then the
> doctype might be a reasonable compromise. I am still thinking about
> this part.
>
> However, I object quite strongly to the UTF-8 default. If an HTML5
> document includes the doctype but excludes the charset, old clients
> might use their auto-detector and get it wrong. So I'd prefer to make
> the charset mandatory with HTML5 doctype, and keep the rule that the
> HTTP charset overrides the META charset for compatibility with old
> clients.
>
> Erik


RE: Auto-detect and encodings in HTML5

masinter
I believe the stance of most of the participants in the
HTML working group is that no "version indicator" for
HTML5 is necessary, and there is no specific
"HTML5 doctype", against which newer, or stricter,
behavior can be keyed.

If charset defaulting is a reason for having a specific
HTML5 version indicator, in order to trigger a stricter
interpretation, say, of the default charset, that would
be interesting.

Larry
--
http://larry.masinter.net


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of M.T. Carrasco Benitez
Sent: Sunday, May 31, 2009 1:18 AM
To: Travis Leithead; Erik van der Poel
Cc: [hidden email]; [hidden email]; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
Subject: Re: Auto-detect and encodings in HTML5


Near to Erik, but UTF8 in worse case:

1) Best: HTTP charset; unambiguous and "external"
2) Agree on ONE public detection algorithm
3) Mandatory declaration as near to the top as possible; if in META, the first in HEAD; within a certain range of bytes (e.g., 512)
4) Default UTF8 could be part of the algorithm; perhaps the last option
5) No BOM or similar

Regards
Tomas




Re: Auto-detect and encodings in HTML5

Erik van der Poel
I agree that it would be interesting if major HTML5 implementers and
(the) HTML5 spec writer(s) would agree on a UTF-8 default charset.

Just to make the HTML5 "version indicator" a bit more explicit, might
this be something like the following HTTP response header?

Content-Type: text/html; version=5; charset=gb2312
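
For what it's worth, generic media-type parameter parsing would already carry such a value. The sketch below uses Python's standard email.message module; "version" here is only the hypothetical parameter floated above, not a registered parameter of text/html.

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/html; version=5; charset=gb2312"

media_type = msg.get_content_type()   # the type/subtype part
version = msg.get_param("version")    # hypothetical parameter from above
charset = msg.get_content_charset()   # charset parameter, lowercased

print(media_type, version, charset)
```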

Erik

On Sun, May 31, 2009 at 8:05 AM, Larry Masinter <[hidden email]> wrote:

> I believe the stance of most of the participants in the
> HTML working group is that no "version indicator" for
> HTML5 is necessary, and there is no specific
> "HTML5 doctype", against which newer, or stricter,
> behavior can be keyed.
>
> If charset defaulting is a reason for having a specific
> HTML5 version indicator, in order to trigger a stricter
> interpretation, say, of the default charset, that would
> be interesting.
>
> Larry
> --
> http://larry.masinter.net
>
>
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of M.T. Carrasco Benitez
> Sent: Sunday, May 31, 2009 1:18 AM
> To: Travis Leithead; Erik van der Poel
> Cc: [hidden email]; [hidden email]; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
> Subject: Re: Auto-detect and encodings in HTML5
>
>
> Near to Erik, but UTF8 in worse case:
>
> 1) Best: HTTP charset; unambiguous and "external"
> 2) Agree on ONE public detection algorithm
> 3) Mandatory declaration as near to the top as possible; if in META, the first in HEAD; within a certain range of bytes (e.g., 512)
> 4) Default UTF8 could be part of the algorithm; perhaps the last option
> 5) No BOM or similar
>
> Regards
> Tomas
>
>
> --- On Wed, 27/5/09, Erik van der Poel <[hidden email]> wrote:
>
>> From: Erik van der Poel <[hidden email]>
>> Subject: Re: Auto-detect and encodings in HTML5
>> To: "Travis Leithead" <[hidden email]>
>> Cc: "[hidden email]" <[hidden email]>, "[hidden email]" <[hidden email]>, "Richard Ishida" <[hidden email]>, "Ian Hickson" <[hidden email]>, "Chris Wilson" <[hidden email]>, "Harley Rosnow" <[hidden email]>
>> Date: Wednesday, 27 May, 2009, 7:30 PM
>> Hi Travis,
>>
>> First of all, I am really happy to see a browser vendor offer to get
>> stricter. :-)
>>
>> I wonder whether the doctype is a very clean way to move forward in
>> this area, given that the HTTP charset ought to disable the
>> auto-detector, but if many authors prefer the META charset, then the
>> doctype might be a reasonable compromise. I am still thinking about
>> this part.
>>
>> However, I object quite strongly to the UTF-8 default. If an HTML5
>> document includes the doctype but excludes the charset, old clients
>> might use their auto-detector and get it wrong. So I'd prefer to make
>> the charset mandatory with HTML5 doctype, and keep the rule that the
>> HTTP charset overrides the META charset for compatibility with old
>> clients.
>>
>> Erik
>>
>> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead <[hidden email]> wrote:
>> > Ian, UA vendors, and HTML/I18n mailing list folks:
>> >
>> > I'd like to present the following feedback from one of our lead
>> > Trident developers on the IE team. He and I work on a number of
>> > parts of the web platform; the encoding and auto-detect subsystem
>> > being the one most relevant to this mail. I'd really like to
>> > generate some discussion from the other browser UAs on this
>> > topic.
>> >
>> > The basic idea is that we feel like there are a few places that
>> > the HTML5 spec could make assertions to improve the web's
>> > international support and future ease of interoperability
>> > regarding encodings and auto-detect. We recognize the need to be
>> > as compatible as possible with currently deployed web sites, and
>> > the technique proposed to maintain compatibility is by leveraging
>> > the "HTML5 doctype". I don't want to focus too much on that
>> > particular aspect of the proposal (though it's important), but to
>> > also consider the implications and scenarios as well.
>> >
>> > The proposal is straight-forward. Only in pages with the HTML5 doctype:
>> >
>> > 1.  Forbid the use of auto-detect heuristics for HTML encodings.
>> >
>> > 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
>> >
>> >     Essentially, get rid of the classes of encodings in which
>> >     Jscript and tags do not correspond to simple ASCII characters
>> >     in the raw byte stream.
>> >
>> > 3.  Only handling the encoding in the first META tag within the
>> >     HEAD and requiring the HEAD and META tags to appear within
>> >     a well-defined, fixed byte distance into the file to take effect.
>> >
>> > 4.  Require the default HTML encoding to be UTF8.
>> >
>> > I realize these changes depart somewhat from current practice and
>> > may seem constraining.  But, I was very pleased to see UTF7 already
>> > excluded and EBCDIC discouraged in the HTML5 draft.  The META tag
>> > is supposed to be the first after the HEAD according to the draft.
>> > But, if we could get substantial agreement from the various user
>> > agents to tighten up the behavior covering this handling, we can
>> > greatly improve the Internet in the following regards:
>> >
>> > A.  HTML5 would no longer be vulnerable to script injection from
>> >     encodings such as UTF7 and EBCDIC which then tricks the auto-
>> >     detection code to reinterpret the entire page and run the
>> >     injected script.
>> >
>> >     (Harley: I’ve had to fix a number of issues related to these
>> >     security vulnerabilities but the problem is systemic in the
>> >     products and the standard doesn’t help.)
>> >
>> > B.  HTML5 would be able to process markup more efficiently by
>> >     reducing the scanning and computation required to merely
>> >     determine the encoding of the file.
>> >
>> > C.  Since sometimes the heuristics or default encoding uses
>> >     information about the user’s environment, we often see pages
>> >     that display quite differently from one region to another.
>> >     As much as possible, browsing from across the globe should
>> >     give a consistent experience for a given page.  (Basically, I
>> >     want my children to one day stop seeing garbage when they
>> >     browse Japanese web sites from the US.)
>> >
>> > D.  We’d greatly increase the consistency of implementation of
>> >     markup handling by the various user agents.  These openings
>> >     for UA-specific heuristics and decisions undermine the
>> >     benefits of standards and standardization.
>> >
>> > Thanks,
>> >
>> > Travis and Harley
>> >
>> > Internet Explorer Program Management/Development
>> > Microsoft Corporation
