Clarification of CharMod C045

Doug Schepers-3
Hi, Folks-

While reviewing DOM3 Events, Richard Ishida pointed out that the use of
surrogate pairs in escaped character strings is frowned upon, citing
C045 [1]:

[[
C045  [S]  Whenever specifications define character escapes that allow
the representation of characters using a number, the number MUST
represent the Unicode code point of the character and SHOULD be in
hexadecimal notation.
]]

A superficial reading of that point doesn't make a clear distinction
between surrogate pairs and Unicode code points, since surrogate pairs
are Unicode code points as well.

His explanation was that the surrogate code points are not the code
point of the character, but rather they are code points of two surrogate
characters; the code point of the character is only and always a single
number.
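
To make the "single number" concrete, here is a minimal Python sketch of the standard UTF-16 arithmetic that maps a surrogate pair back to the one code point it encodes. The helper name combine_surrogates and the XML-style "&#x...;" escape syntax are assumptions chosen for illustration; neither comes from CharMod or DOM3 Events.

```python
# Illustrative sketch only: the helper name and the escape syntax are assumptions.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogate range


def combine_surrogates(high: int, low: int) -> int:
    """Return the single code point encoded by a UTF-16 surrogate pair."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# U+1D11E MUSICAL SYMBOL G CLEF is encoded in UTF-16 as D834 DD1E.
cp = combine_surrogates(0xD834, 0xDD1E)

# An escape in the spirit of C045 names the character's own code point, in hex.
escape = f"&#x{cp:X};"
```

Under this reading, an escape built from the two numbers D834 and DD1E would name two surrogate code points, while the escape built from cp names the character itself.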

While I now understand and agree with his point, I think a clarifying
erratum might benefit people like me who want to be good citizens but
might not get the implications immediately.

[1] http://www.w3.org/TR/charmod/#C045

Regards-
-Doug Schepers
W3C Team Contact, SVG and WebApps WGs

Re: Clarification of CharMod C045

Martin J. Dürst
Hello Doug,

Thanks for your comment.

On 2009/10/30 5:59, Doug Schepers wrote:

> Hi, Folks-
>
> While reviewing DOM3 Events, Richard Ishida pointed out that the use of
> surrogate pairs in escaped character strings is frowned upon, citing
> C045 [1]:
>
> [[
> C045 [S] Whenever specifications define character escapes that allow the
> representation of characters using a number, the number MUST represent
> the Unicode code point of the character and SHOULD be in hexadecimal
> notation.
> ]]
>
> A superficial reading of that point doesn't make a clear distinction
> between surrogate pairs and Unicode code points, since surrogate pairs
> are Unicode code points as well.

Yes, surrogates are code points as well, but they are not characters.
Therefore, as far as I understand, "MUST represent the Unicode code
point of the *character*" (emphasis added) makes it clear that surrogate
code points (whether in pairs or not) are not allowed.
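
As a rough illustration of that reading, a parser for numeric escapes could reject any number in the surrogate range. This is only a sketch in Python; the function name is_allowed_escape_number is made up for this example, and the check covers only the surrogate question discussed here, not unassigned code points or noncharacters.

```python
# Illustrative sketch: the function name is hypothetical, not from any spec.
def is_allowed_escape_number(n: int) -> bool:
    """True if n is a Unicode code point outside the surrogate range."""
    in_unicode_range = 0 <= n <= 0x10FFFF
    is_surrogate = 0xD800 <= n <= 0xDFFF
    return in_unicode_range and not is_surrogate


clef_ok = is_allowed_escape_number(0x1D11E)       # code point of a character
high_half_ok = is_allowed_escape_number(0xD834)   # surrogate code point
low_half_ok = is_allowed_escape_number(0xDD1E)    # surrogate code point
```

Here clef_ok is True, while both halves of the pair are rejected, whether they appear together or alone.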

> His explanation was that the surrogate code points are not the code
> point of the character, but rather they are codepoints of two surrogate
> characters; the codepoint of the character is only and always a single
> number.

Actually, there's no such thing as a "surrogate character". Surrogates
don't have character names, they don't have representative glyphs, nor
do they have anything else that characters typically have. A good place
to understand this is Table 2-3 on page 27 of Unicode Version 5.
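
The same point can be checked against the Unicode Character Database, for example with Python's standard unicodedata module. This is a small sketch; the variable names are arbitrary, and the exact database version depends on the Python build.

```python
import unicodedata

surrogate = chr(0xD834)   # a lone high surrogate code point
g_clef = chr(0x1D11E)     # the character that D834 DD1E encodes in UTF-16

# Surrogates fall in general category "Cs" and have no character name.
surrogate_category = unicodedata.category(surrogate)
surrogate_name = unicodedata.name(surrogate, None)   # None: no name exists

# The supplementary character itself has both a name and a category.
clef_category = unicodedata.category(g_clef)
clef_name = unicodedata.name(g_clef)
```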

> While I now understand and agree with his point, I think a clarifying
> errata might benefit people like me who want to be good citizens but
> might not get the implications immediately.

Can you propose actual text?

Regards,   Martin.


> [1] http://www.w3.org/TR/charmod/#C045
>
> Regards-
> -Doug Schepers
> W3C Team Contact, SVG and WebApps WGs
>
>

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:[hidden email]

Re: Clarification of CharMod C045

Doug Schepers-3
Hi, Martin-

"Martin J. Dürst" wrote (on 10/29/09 10:19 PM):

> On 2009/10/30 5:59, Doug Schepers wrote:
>> Hi, Folks-
>>
>> While reviewing DOM3 Events, Richard Ishida pointed out that the use of
>> surrogate pairs in escaped character strings is frowned upon, citing
>> C045 [1]:
>>
>> [[
>> C045 [S] Whenever specifications define character escapes that allow the
>> representation of characters using a number, the number MUST represent
>> the Unicode code point of the character and SHOULD be in hexadecimal
>> notation.
>> ]]
>>
>> While I now understand and agree with his point, I think a clarifying
>> errata might benefit people like me who want to be good citizens but
>> might not get the implications immediately.
>
> Can you propose actual text?

Maybe something as simple as "Note that a UTF-16 surrogate pair is
composed of two separate Unicode code points, and that an appropriate
single Unicode code point (its UTF-32 value) always exists, which
SHOULD (MUST?) be used instead."

But honestly, I would feel more comfortable if someone brainier than me
came up with it.

Regards-
-Doug Schepers
W3C Team Contact, SVG and WebApps WGs
