uri templates: NFKC or NFC

uri templates: NFKC or NFC

Roy T. Fielding
The URI Templates draft currently requires use of NFKC for
normalization of Unicode strings.  I've never understood why
that is, considering that IRI does not require it and
browsers appear to use NFC (if anything).  Also, it should only
apply to the expansions -- the literal parts don't need to be
normalized.

Should I change it to NFC?

....Roy
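
For readers unfamiliar with the two forms, here is a minimal Python
sketch of the difference in question (the string is illustrative; it
relies only on the standard unicodedata module):

    import unicodedata

    s = "\uFB01le"  # "file" spelled with U+FB01 LATIN SMALL LIGATURE FI

    # NFC composes/decomposes canonical equivalents only, so the
    # compatibility ligature survives:
    print(unicodedata.normalize("NFC", s))   # 'ﬁle' (ligature intact)

    # NFKC additionally folds compatibility characters, so the ligature
    # becomes plain 'f' + 'i':
    print(unicodedata.normalize("NFKC", s))  # 'file' (four code points)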

Re: uri templates: NFKC or NFC

Chris Weber-4
On 7/14/2011 4:31 PM, Roy T. Fielding wrote:

> The URI Templates draft currently requires use of NFKC for
> normalization of Unicode strings.  I've never understood why
> that is, considering that IRI does not require it and
> browsers appear to use NFC (if anything).  Also, it should only
> apply to the expansions -- the literal parts don't need to be
> normalized.
>
> Should I change it to NFC?
>
> ....Roy
>

From my recent test results, Safari was the only browser applying NFC
to the IRI path, query, and fragment parts.  Chrome applied NFC to the
fragment part only, and the others did not apply NFC anywhere.  Needless
to say, this results in an interop problem.  An overview of the results
is up at:

https://spreadsheets.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5

And raw results including the test case fragments are up at:

https://spreadsheets.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=3

I was testing browsers in HTML Quirks mode using a UTF-8 charset
declaration set by the HTTP Content-Type header.  My observations were
based on a) the way an anchor href was parsed in the DOM, and b) the way
the HTTP GET request was sent on the wire.
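
To make the observable difference concrete, here is a small Python
sketch (the path value is hypothetical) of what applying NFC changes on
the wire for a decomposed path segment:

    import unicodedata
    from urllib.parse import quote

    # Hypothetical path segment: "cafe" + U+0301 COMBINING ACUTE ACCENT,
    # i.e. the decomposed spelling of "café".
    path = "cafe\u0301"

    # A browser applying NFC sends the composed code point's UTF-8 bytes:
    print(quote(unicodedata.normalize("NFC", path).encode("utf-8")))
    # caf%C3%A9

    # One that does not sends the decomposed bytes unchanged:
    print(quote(path.encode("utf-8")))
    # cafe%CC%81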

Best regards,
Chris


RE: uri templates: NFKC or NFC

Phillips, Addison-2
In reply to this post by Roy T. Fielding
>
> The URI Templates draft currently requires use of NFKC for normalization
> of Unicode strings.  I've never understood why that is, considering that IRI does
> not require it and browsers appear to use NFC (if anything).  Also, it should only
> apply to the expansions -- the literal parts don't need to be normalized.
>
> Should I change it to NFC?
>

Most definitely! NFKC destroys some real semantic differences (whereas NFC is generally considered fairly benign). It could even introduce some visual oddities, such as the character U+00BC (VULGAR FRACTION ONE QUARTER) becoming the three-character sequence "1/4" (where the slash is not the ASCII "/" that %2F encodes, but U+2044 FRACTION SLASH).
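
That expansion is easy to verify in Python (a minimal sketch using the
standard unicodedata module):

    import unicodedata

    quarter = "\u00BC"  # VULGAR FRACTION ONE QUARTER

    # NFC leaves the character alone:
    print(unicodedata.normalize("NFC", quarter) == quarter)  # True

    # NFKC expands it to '1', U+2044 FRACTION SLASH, '4':
    expanded = unicodedata.normalize("NFKC", quarter)
    print([hex(ord(c)) for c in expanded])  # ['0x31', '0x2044', '0x34']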

That said, the trend in IRI (and elsewhere) is away from mandatory normalization in processing. If IRIs do not require NFC (which I don't believe they will or should), then having a requirement for it in URI Templates will mean that there are some IRIs that cannot be represented using a template (because the difference in the IRI is normalized away in the template).

The main reason to have normalization for templates would appear to me to be the normalization of character sequences in a variable name. It might be better to just treat sequences that don't match as not matching (i.e. the user is responsible for normalization), or perhaps to reference UAX #31 http://www.unicode.org/reports/tr31/ on what makes a valid identifier. Note that normalization does not eliminate the potential for problems such as a combining mark starting a sequence.
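
For that last case, a hypothetical check (the function name is
illustrative, not from any spec); normalization cannot repair a sequence
that begins with a combining mark, so it has to be rejected explicitly:

    import unicodedata

    def starts_with_combining_mark(name: str) -> bool:
        # The Unicode general categories Mn, Mc, and Me are the
        # combining marks.
        return bool(name) and unicodedata.category(name[0]).startswith("M")

    print(starts_with_combining_mark("\u0301abc"))  # True: leading acute accent
    print(starts_with_combining_mark("abc"))        # False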

Thanks,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


Re: uri templates: NFKC or NFC

Martin J. Dürst
In reply to this post by Roy T. Fielding
On 2011/07/15 8:31, Roy T. Fielding wrote:
> The URI Templates draft currently requires use of NFKC for
> normalization of Unicode strings.  I've never understood why
> that is, considering that IRI does not require it and
> browsers appear to use NFC (if anything).  Also, it should only
> apply to the expansions -- the literal parts don't need to be
> normalized.
>
> Should I change it to NFC?

Hello Roy,

NFKC is heavily used in IDNA, but it's not the whole story, and just
doing NFKC for the domain-name part doesn't make sense to me. For the
rest, I'd definitely go with NFC (as opposed to NFKC), for the reasons
you gave.

Regards,    Martin.

RE: uri templates: NFKC or NFC

Sampo Syreeni
In reply to this post by Phillips, Addison-2
On 2011-07-14, Phillips, Addison wrote:

> NFKC destroys some real semantic differences (whereas NFC is generally
> considered fairly benign).

Unicode characters are not supposed to carry any semantics beyond what
is encoded in them by the standard(s). Thus, canonical equivalence means
that any two sequences related by it are exactly the same. If they're
handled in any way differently from each other, anywhere, the
implementation is by definition not Unicode/ISO 10646 conformant. The
different compliance levels kind of muck up this basic idea, true, but
this is how it was meant to be.
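
In concrete terms, the claim is that these two spellings of "é" must be
treated as the same string (a minimal Python sketch):

    import unicodedata

    composed = "\u00E9"     # 'é' as a single code point
    decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT

    print(composed == decomposed)  # False: different code point sequences
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))  # True: canonically equivalent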

As for compatibility equivalence, it's basically an interim measure: a
concession to existing character encodings which do carry meaning, and
to round-tripping between Unicode and those existing, stupider
encodings. It's not something you should espouse when working primarily
in Unicode, but something you should do away with in favor of explicit
tagging. In fact, most of the time you should just drop the difference
altogether without any further tagging and treat compatibility-equivalent
characters as the same. But if you really, really can't, you should
still compatibility-decompose and move the semantics onto a higher-level
protocol, like HTML or whatnot.

As such, in the end, what Unicode is supposed to be like in its pure
form is what follows from putting everything into NFKD, without
exception, and raising an ill-formed-encoding error every time you see
something that is not in compliance. If you need anything beyond that,
you're supposed to relegate it to some higher-level protocol, while flat
out refusing to input or output anything that isn't formally and
verifiably in NFKD (i.e. in True Unicode).

> It could even introduce some visual oddities, such as the character
> U+00BC (vulgar fraction one quarter) becoming the sequence "1/4"
> (albeit the / is not %2F it is U+2044 FRACTION SLASH)

That is then by design: that sort of thing isn't part of the character
model, but a matter of how characters might be used as part of some
higher-level protocol or syntax, such as MathML or whatnot. Fractions
and the like do not belong in Unicode, and the only reason they have
been allowed in is as an interim blemish, hopefully soon to go away for
good.

If NFKD leads to "visual oddities", it's because your software for some
reason doesn't implement the proper higher-level protocol correctly,
and/or misunderstands what Unicode is about.

> [...] The main reason to have normalization for templates would appear
> to me to be the normalization of character sequences in a variable
> name. [...]

To me it seems there is a definite disconnect between how the
Unicode/ISO folks think about the character model and how it is being
used in practice. If the original intent behind the character model were
the real aim, we wouldn't be having these sorts of discussions in the
first place. We'd only wonder about how to deal with NFKD, with its
unorthodox, open-ended, suffix form. It could then be tackled purely by
technical means, without these kinds of policy debates, even if it led
to some rather nasty string-parsing code in the process.

> It might be better to just handle sequences that don't match as not
> matching (e.g. the user is responsible for normalization) or perhaps
> referencing UAX#31 http://www.unicode.org/reports/tr31/ on what makes
> a valid identifier. Note that normalization does not eliminate the
> potential for problems such as combining marks to start a sequence.

Such things are ungrammatical with respect to Unicode, so I'd say just
fail gracefully on them. After that, either a) fail on any NFKD
violation in either comparand, and after that on any bitwise or
lengthwise mismatch, or (more usually) b) always normalize to strict,
formal NFKD and fail upon the first unmatched bit and/or string-length
mismatch. That's how it's supposed to work, it's easier/cheaper to
implement than most alternatives, and as a matter of fact it already
shields you from half of the homograph attacks which things like
stringprep try to defend against. Not to mention all of the
Unicode-specific attacks...
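
A minimal sketch of option a), assuming Python's unicodedata module (the
function names are illustrative only):

    import unicodedata

    def require_nfkd(s: str) -> str:
        # Reject any comparand that is not already in NFKD.
        if unicodedata.normalize("NFKD", s) != s:
            raise ValueError("string is not in NFKD")
        return s

    def nfkd_equal(a: str, b: str) -> bool:
        # Once both inputs are known to be in NFKD, comparison reduces
        # to a plain bitwise/length match.
        return require_nfkd(a) == require_nfkd(b)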
--
Sampo Syreeni, aka decoy - [hidden email], http://decoy.iki.fi/front
+358-50-5756111, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2