FW: UTF-16 and Byte Order Mark

Grosso, Paul

Forwarding to the public comments list.

paul

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of John Cowan
Sent: Wednesday, 2006 December 20 14:52
To: [hidden email]
Cc: [hidden email]
Subject: Re: UTF-16 and Byte Order Mark


Our apologies for the long delay in responding to your message.
The content of this message has been approved by the XML Core WG.

You wrote at
<http://lists.w3.org/Archives/Public/xml-editor/2006JulSep/0007.html>:

> Appendix F.1 of the XML specs presents examples about how to
> automatically detect the encoding of an entity from the first
> characters of an XML encoding declaration without a byte order mark.
> These examples include UTF-16BE and UTF-16LE. However, section 4.3.3
> says that entities encoded in UTF-16 MUST begin with a byte order
> mark.

That is strictly limited to the UTF-16 encoding, and excludes the
related UTF-16LE and UTF-16BE encodings, in which BOMs are not present.
Note that "UTF-16LE" does not mean "UTF-16 encoding whose BOM shows it
to be little-endian" but rather "UTF-16-like encoding in little-endian
order without a BOM."  If U+FEFF appears at the beginning of a UTF-16LE
or UTF-16BE document, it is not a BOM but a ZWNBSP character (and
therefore the document cannot be well-formed XML).
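
As a rough illustration of the same distinction outside XML, Python's
standard codecs treat the three names differently. This is only a sketch
of how one library behaves, not a statement about any XML parser; the
sample text and variable names are made up for the example.

```python
import codecs

text = "<?xml version='1.0'?><a/>"

# "utf-16": the encoder writes a BOM (in the platform's byte order)
# and the decoder consumes it again.
with_bom = text.encode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
decoded_utf16 = with_bom.decode("utf-16")

# "utf-16-le": the encoder writes no BOM at all.
le_bytes = text.encode("utf-16-le")

# If the bytes FF FE are nevertheless placed in front, the "utf-16-le"
# decoder does not strip them; they come back as an ordinary U+FEFF
# character at the start of the text.
decoded_le = (codecs.BOM_UTF16_LE + le_bytes).decode("utf-16-le")
```

Here decoded_utf16 equals the original text, while decoded_le begins with
U+FEFF followed by the original text.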

> In the light of the examples it seems that the intention of the specs
> is to demand a UTF-16 byte order mark only when no XML declaration is
> used. Is this interpretation of the specs correct?

No.  If the encoding is UTF-16, a BOM is mandatory, whether or not an
XML declaration is present.

> If the answer is "no", I would suggest to remove the two incriminated
> examples from Appendix F.1 and to add an appropriate warning.

The examples are not in error, because they refer to the UTF-16LE and
UTF-16BE encodings rather than the UTF-16 encoding.
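
The two cases can be told apart from the first bytes of an entity. The
following is a minimal sketch limited to the 16-bit rows of the Appendix
F.1 tables; the function name and the returned labels are hypothetical,
and a real processor would still have to read the encoding declaration
to learn which BOM-less encoding is actually in use.

```python
def sniff_utf16(prefix: bytes) -> str:
    """Classify the leading bytes of an entity (16-bit cases only)."""
    # FE FF or FF FE followed by two bytes that are not both zero:
    # a BOM is present, so this is the UTF-16 encoding proper.
    if prefix[:2] == b"\xfe\xff" and prefix[2:4] != b"\x00\x00":
        return "UTF-16, BOM, big-endian"
    if prefix[:2] == b"\xff\xfe" and prefix[2:4] != b"\x00\x00":
        return "UTF-16, BOM, little-endian"
    # No BOM, but the bytes spell "<?" in 16-bit code units:
    # UTF-16BE / UTF-16LE (or another 16-bit encoding in that order).
    if prefix[:4] == b"\x00<\x00?":
        return "no BOM, big-endian (e.g. UTF-16BE)"
    if prefix[:4] == b"<\x00?\x00":
        return "no BOM, little-endian (e.g. UTF-16LE)"
    return "not covered by this sketch"
```

For example, sniff_utf16(b"<\x00?\x00") reports the BOM-less little-endian
case, whereas sniff_utf16(b"\xff\xfe<\x00") reports UTF-16 with a BOM.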

The Core WG will be adding language to 4.3.3 stating that UTF-16BE and
UTF-16LE are specifically not UTF-16.

--
I marvel at the creature: so secret and         John Cowan
so sly as he is, to come sporting in the pool   [hidden email]
before our very window.  Does he think that     http://www.ccil.org/~cowan
Men sleep without watch all night?