draft-newman-i18n-collation-09.txt just posted


draft-newman-i18n-collation-09.txt just posted

Arnt Gulbrandsen

As far as I know, this addresses, ignores or adds open issues for all
requests. If something is ignored, that's because other people wanted
the opposite, or because I overlooked it when I went over all the mail
this week. I'm sorry about it in either case.

Review, please.

Arnt


Re: draft-newman-i18n-collation-09.txt just posted

cowan


Arnt Gulbrandsen scripsit:

> As far as I know, this addresses, ignores or adds open issues for all
> requests. If something is ignored, that's because other people wanted
> the opposite, or because I overlooked it when I went over all the mail
> this week. I'm sorry about it in either case.
>
> Review, please.

Posted where?  Neither rfc-editor.org nor ietf.org seems to have it.

--
John Cowan  [hidden email]  http://ccil.org/~cowan
The penguin geeks is happy / As under the waves they lark
The closed-source geeks ain't happy / They sad cause they in the dark
But geeks in the dark is lucky / They in for a worser treat
One day when the Borg go belly-up / Guess who wind up on the street.


Re: draft-newman-i18n-collation-09.txt just posted

Mark Davis-2
In reply to this post by Arnt Gulbrandsen

The release of this is timely (we didn't get notified of a 07 or 08
draft), since the Unicode Technical Committee is meeting next week, and
can discuss it.

Could you indicate which of the items raised in the email of 2006-02-21
from the Unicode Technical Committee have been addressed in this release
(and if not accepted then why)? That would help greatly with the review.
(I couldn't find any archive for discussion of
draft-newman-i18n-comparator where that email could be publicly linked
from, so I am appending it at the end of this message.) At a quick
glance, it appears that a number of comments have been incorporated.

Mark

BTW, despite the subject of the message, the document is at
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt.
It helps to send out a link, especially if the name (comparator vs
collation) is wrong ;-)

BTW, it was pointed out to us that the original email shouldn't have
been sent to "Network Working Group", even though that is the name at
the top of
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt

Arnt Gulbrandsen wrote:
>
> As far as I know, this addresses, ignores or adds open issues for all
> requests. If something is ignored, that's because other people wanted
> the opposite, or because I overlooked it when I went over all the mail
> this week. I'm sorry about it in either case.
>
> Review, please.
>
> Arnt

=================

Mark Davis wrote:

> To: Network Working Group
> Re: draft-newman-i18n-comparator
> Date: 2006-02-21
> From: Unicode Technical Committee
>
> The Unicode Technical Committee has reviewed the document
> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt.
> While UTC is in favor of the goal, there are a number of problems with
> the document. The main problems are outlined below. Once these are
> addressed, then further review can continue.
>
>
>     Details
>
>
>       > 2.1 Definitions
>
>
>         Content
>
> The document needs to include the definitions of the technical terms
> used in the document,  including all those that may not be familiar to
> implementers, such as "trichotomous" and "collation identifiers". In
> particular, the notion of a substring is /prima facie/ quite simple,
> but there are complications that require a clear definition. The text
> in the document does not make clear that there may be more than one
> match for a substring in a string, and that the matches can overlap.
> It says "the starting offset", for example, when there may be multiple
> ones.
>
> Moreover, language sensitive matches have additional complications
> which need to be called out. For more information, see
> http://www.unicode.org/reports/tr10/#Searching
>
>
>         Format
>
> If there is a "Definitions" section, readers have a reasonable
> expectation that that section should contain all the required
> definitions. However, a number of definitions are scattered within the
> text. One of two approaches should be taken:
>
>    1. Move all the definitions into this section.
>    2. Remove the definitions section, but clearly call out in the text
>       the definition of each term on its own line.
>
> Mixing these two styles is needlessly confusing for readers.
>
>
>       > 2.4 Sort Keys
>
> The use of the term "collation canonicalization" to refer to sort keys
> is very misleading. The term "canonicalization" implies that the
> results are still text in some fashion, whereas a sortkey is simply a
> string of octets generated from a given string by a specific
> comparator, whereby the binary comparison (ordering) of two sort keys
> is guaranteed to match *that* comparator's compare function for the
> original strings. The octets may have no readily discernable relation
> to the original text. For example, the ICU sort keys generated for the
> following strings are:
>
> cote 2c 44 4e 30 01 08 01 08 00
> côté 2c 44 4e 30 01 85 93 85 8d 01 0a 00
> Αραβικά 5c 20 52 20 22 36 3a 20 01 80 8d 01 8f 0b 00
>
> See
> http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=el&d_=en&x=col
> for other examples.
>
> > 3.2
>
> This specifies that clients that support disconnected operation should
> not use wildcards while clients that provide collation operations only
> when connected to the server may use wildcards.
>
> It appears that the restrictions may not really be needed and may
> need to be deleted from the draft. Otherwise, it would be really
> helpful if the rationale behind the restrictions were provided in
> the draft.
>
> The EBNF syntax shown in section 3.2 says that the collation-wild must
> not exceed 255 characters total while the section 3.1 specifies that
> the collation name must not exceed 254 characters.
>
> It seems having the same maximum possible length for both collation
> name and wildcard string would be desirable for actual implementations.
>
>
>       > 4.2.1 Equality
>
> It needs to be made clear that the return values are not physically
> the strings "match", etc. but enumerated values such as /equal/ and  
> /not_equal/. The document could describe a notation used for them,
> such as single quotes, since italic is not available in RFCs.
> Similarly, the results of the ordering function should be specified as
> an enumeration with three values: /less/, /equal/, /greater/. The
> mapping of actual API return values in implementations to these
> enumerated values can be outside of the scope of this document. For
> example, the mapping might take -1 onto /less/ in one implementation,
> or anything negative onto /less/ in another implementation.
>
> One extremely important point is that for a given comparator, the
> equality function must be synchronized with the ordering function.
> That is, it must return 'equal' if and only if the ordering function
> returns 'equal'. Otherwise any coordinated usage of the functions will
> fail. This also implies that either 'error' is allowed for both
> functions or for neither.
>
> The term 'error' is also problematic, since what is really at issue is
> a question of domain. For all those strings in the domain, either
> 'equal' or 'not_equal' should be returned from the equality function.
> For any string not in the domain, 'undefined' should be returned. That
> avoids coherency problems. Then the requirements are clear:
>
>     * if A and B are in the domain, then the result of an equality
>       test is either /equal/ or /not_equal/
>     * if A or B (or both) are not in the domain, then the result of an
>       equality test is /undefined/.
>
> There is a typo on the 4th line of the second paragraph of
> section 4.2 saying "... For example, an collation" which should be
> changed to "... For example, a collation" instead.
>
>
>       > 4.2.2 Substring
>
> Prefix and suffix matching are not fully spelled out. The operations
> and their results must be clarified. And as noted before, it is very
> important to precisely define the substring operations, especially the
> starting offset and ending offset. It also must be clarified whether
> what is being asked for is the first possible matching location in the
> string, the last, or the nth one.
>
>
>       > 4.3.3 Ordering
>
> > It MUST be transitive and trichotomous.
>
> As above, these should be defined. The exposition in this section
> would be simpler if you also defined "reversible", whereby f(a,b) =
> less iff f(b,a) = greater. Then the statement would be:
>
>     It MUST be transitive, trichotomous, and reversible.
>
> >When the collation is used with a
>    "-" prefix, the result of the ordering function of the collation MUST
>    be reversed.
>
> => When the collation is used with a
>    "-" prefix, the result of the ordering function of the collation
> when applied to two strings A and B  MUST
>    be the same as the result with a "+" prefix applied to B and A.
>
> An 'undefined' value can be allowed if, as per equality above, it
> means that at least one of the operands is outside of the domain. The
> function then imposes a total order on all strings in the domain;
> moreover, a wrapper can easily convert the function to a total order
> over all strings by putting all items outside the domain either below
> or above the ones in the domain -- or even excluding them, at its
> choice.
>
>  > In general, collations SHOULD NOT return "0" unless the two strings
> are identical.
>
> => The ordering function MUST return 'equal' if and only if the equality function returns 'equal'
>
> [Note: it is very important to avoid the confusion between "identical"
> and "equal". According to a caseless compare, "Mark" and "mark" are
> equal; however, the strings are not identical.]
>
> [Either 'ordering function' or 'comparison function' should be used
> consistently, not sometimes 'collations'].
>
>
>       > 4.3.  Internal Canonicalization Algorithm
>
> This section is difficult to understand. It appears that the goal is that
> any registration must specify sufficient detail, both data and
> algorithm, so as to enable someone to reproduce the results. But it is
> not at all clear that that is the goal. And that would make the
> registration require, in some cases, a huge accompanying document. To
> duplicate the results of CLDR collators, for example, would require
> the UCA specification, plus the LDML specification, plus all the
> relevant data in the CLDR repository.
>
>
>       > 4.4.  Use of Lookup Tables
>
> It is not at all clear what is meant by "customizable lookup tables".
>
>
>       > 4.5.  Multi-Value Attributes
>
> This is very unclear. It describes attributes as applying to only
> equality (since it only refers to "match" vs "no-match" (and
> forgetting "error")).
>
> This is a very important feature that needs to be spelled out in
> detail, and clearly reflected in the template for registration. In
> particular, the template should have provision for multiple
> attributes, with the ability to specify the acceptable operands for
> that attribute. (See below). The specification of the operands could
> be either a list of values, or a regular expression (with the former
> preferred). Suggested regular expression syntax would be Perl or XML
> Schema.
>
>
>       > 5.1 Character Encoding
>
>    The protocol specification has to make sure that it is clear on which
>    characters (rather than just octets) the collations are used.  This
>    can be done by specifying the protocol itself in terms of characters
>    (e.g. in the case of a query language), by specifying a single
>    character encoding for the protocol (e.g.  UTF-8 [3]), or by
>    carefully describing the relevant issues of character encoding
>    labeling and conversion.  In the later case, details to consider
>    include how to handle unknown charsets, any charsets which are
>    mandatory-to-implement, any issues with byte-order that might apply,
>    and any transfer encodings which need to be supported.
>
> If a collation is able to advertise itself as being able to handle,
> say, SJIS and UTF-8, then there should be a required description of a
> protocol for indicating that and for communicating which encodings are
> handled, and how it handles error conditions (such as a charset
> outside of those it can handle). Otherwise, it is difficult to
> understand how this paragraph would be applied in practice.
>
>
>       > 5.3
>
> The section 5.3 specifies:
>
>     The protocol MUST specify how comparisons behave in the absence of
>     explicit collation negotiation or when a collation of "*" is
>     requested. The protocol MAY specify that the default collation
>     used in such circumstances is sensitive to server configuration.
>
> and the section 3.2 specifies:
>
>     ... If the wildcard string matches multiple collations, the server
>     SHOULD select the collation with the broadest scope (preferably
>     international scope), the most recent table versions and the
>     greatest number of supported operations. A single wildcard
>     character ("*") refers to the application protocol collation
>     behavior that would occur if no explicit negotiation were used.
>
> These appear inconsistent.
>
>
>       7.5.  Example Initial Registry Summary
>
> The sample registry would suffer a combinatorial explosion if
> parameters are not handled differently. For example, with CLDR
> collations, there can be hundreds of locales, six different strength
> settings; four different case-first settings; three different
> alternate settings, backwards settings, normalization settings, case
> level settings, hiragana settings, and numeric settings; plus a
> variable-top setting which has a string as an operand. Registering the
> combinations that people are allowed to use would be untenable.
>
> http://www.unicode.org/draft/reports/tr35/tr35.html#Setting_Options
>
> Instead, as remarked above, the allowable attribute values need to be
> associated with the registered name in a machine-readable form.
>
> > 11.  Security Considerations
>
> This is insufficient. It should at least point to the problems related
> in UCA and in http://www.unicode.org/reports/tr36/tr36-4.html (note
> that that document has been approved by the UTC and will be posted as
> an approved version soon.)
>
>
>     General
>
> One of the real problems with the IANA character registry is that the
> entries are underspecified. It quite often occurs that two vendors
> implement the same IANA charset conversion different ways, leading to
> significant interoperability problems and text corruption. See, for
> example, http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
>
> We have the real concern that this registry could lead down the same path.
>
> > collation, it has to say so
>
> There are places where the text should be clarified, as to whether a
> MUST or SHOULD is implied; this is just an example.
>
> > "comparator" vs "collator"
>
> Either one term or the other should be used consistently.
>
> > Unicode 3.2
>
> Unicode 3.2 is obsolete; the reference versions for the Collation
> Registry should be Unicode 5.0 and UCA 5.0, since those will be
> approved and published by the time the Internet Application Protocol
> Collation Registry has completed its review and been approved.
>
> Because of the use of NamePrep, it is probably the case that Unicode
> 3.2 also needs to be included, but strongly recommended for usage only
> by protocols or systems dependent on NamePrep. Note that as of UCA 4.0
> and beyond, the version number of UCA is guaranteed to be identical
> with the version number of Unicode that it is defined for.
>
> > Versioning
>
> This is tricky, and should be clarified. In many instances, it is
> sufficient to use an unversioned collator, such as simply "UCA". In
> other cases, there are requirements to use a specific version, or a
> version of at least X. This needs to be described.
>



Re: draft-newman-i18n-collation-09.txt just posted

Arnt Gulbrandsen
In reply to this post by cowan

John Cowan writes:
> Arnt Gulbrandsen scripsit:
>>  As far as I know, this addresses, ignores or adds open issues for
>>  all requests. If something is ignored, that's because other people
>>  wanted the opposite, or because I overlooked it when I went over
>>  all the mail this week. I'm sorry about it in either case.
>>
>>  Review, please.
>
> Posted where? Neither rfc-editor.org nor ietf.org seems to have it.

http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt 
has it. Sorry. I posted that only an hour or two after the i-d editor
told me it was posted (at that URL), which must have been too quickly
for most mirrors.

Arnt


Re: draft-newman-i18n-collation-09.txt just posted

Arnt Gulbrandsen
In reply to this post by Mark Davis-2

Mark Davis writes:

> The release of this is timely (we didn't get notified of a 07 or 08
> draft), since the Unicode Technical Committee is meeting next week,
> and can discuss it.
>
> Could you indicate which of the items raised in the email of
> 2006-02-21 from the Unicode Technical Committee have been addressed
> in this release (and if not accepted then why)? That would help
> greatly with the review. (I couldn't find any archive for discussion
> of draft-newman-i18n-comparator where that email could be publicly
> linked from, so I am appending it at the end of this message.) At a
> quick glance, it appears that a number of comments have been
> incorporated.

Lots. Some not. See below.

It is possible that some of my changes don't satisfy you. I had
conflicting requests for many things. Feel free to repeat, rephrase or
add arguments.

> Mark
>
> BTW, despite the subject of the message, the document is at
> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt.
> It helps to send out a link, especially if the name (comparator vs
> collation) is wrong ;-)

Mea culpa. My apologies.

...

>> To:   Network Working Group
>> Re:   draft-newman-i18n-comparator
>> Date:         2006-02-21
>> From:         Unicode Technical Committee
>>
>> The Unicode Technical Committee has reviewed the document
>> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt.
>> While UTC is in favor of the goal, there are a number of problems
>> with the document. The main problems are outlined below. Once these
>> are addressed, then further review can continue.
>>
>>     Details
>>
>>       > 2.1 Definitions
>>
>>         Content
>>
>> The document needs to include the definitions of the technical terms
>> used in the document,  including all those that may not be familiar
>> to implementers, such as "trichotomous" and "collation identifiers".
>> In particular, the notion of a substring is /prima facie/ quite
>> simple, but there are complications that require a clear definition.
>> The text in the document does not make clear that there may be more
>> than one match for a substring in a string, and that the matches can
>> overlap. It says "the starting offset", for example, when there may
>> be multiple ones.

Changed.
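
For readers following the substring point above, here is a minimal sketch of why "the starting offset" is ambiguous: one needle can match at several offsets, and the matches can overlap. It assumes plain code-point comparison rather than any registered collation, and the function name `all_match_offsets` is made up for illustration.

```python
from typing import List


def all_match_offsets(haystack: str, needle: str) -> List[int]:
    """Return every starting offset where needle occurs, overlaps included."""
    if not needle:
        return []
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one position, not by len(needle), so that
        # overlapping matches are also reported.
        start = haystack.find(needle, start + 1)
    return offsets


print(all_match_offsets("aaaa", "aa"))
```

A substring operation that returns a single offset therefore has to say which of these it means (first, last, or nth).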

>> Moreover, language sensitive matches have additional complications
>> which need to be called out. For more information, see
>> http://www.unicode.org/reports/tr10/#Searching

Not really changed. As I recall, I added a little bit of text.

>>         Format
>>
>> If there is a "Definitions" section, readers have a reasonable
>> expectation that that section should contain all the required
>> definitions. However, a number of definitions are scattered within
>> the text. One of two approaches should be taken:
>>
>>    1. Move all the definitions into this section.
>>    2. Remove the definitions section, but clearly call out in the text
>>       the definition of each term on its own line.
>>
>> Mixing these two styles is needlessly confusing for readers.

Not changed; I'm going by what confuses reviewers.

>>       > 2.4 Sort Keys
>>
>> The use of the term "collation canonicalization" to refer to sort
>> keys is very misleading. ...

Changed; the text now speaks of sort keys. I'm afraid there still are
instances of the old term around; I found one today.
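
The sort-key property from the UTC comment (binary comparison of two sort keys matches the comparator's result for the original strings) can be sketched as follows. This is a toy caseless collation, not ICU or UCA; `compare_caseless` and `sort_key` are hypothetical names, and the sketch relies on the fact that UTF-8 byte order agrees with code-point order.

```python
import functools


def compare_caseless(a: str, b: str) -> int:
    """Toy comparator: order strings by their case-folded code points."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)


def sort_key(s: str) -> bytes:
    """Octet string whose binary order matches compare_caseless."""
    return s.casefold().encode("utf-8")


words = ["cote", "Zebra", "apple", "Cote", "côté"]
by_comparator = sorted(words, key=functools.cmp_to_key(compare_caseless))
by_sort_key = sorted(words, key=sort_key)
print(by_comparator == by_sort_key)
```

Note that the key is a byte string, not text, which is the reason the UTC comment objects to calling it a "canonicalization".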

>> > 3.2
>>
>> This specifies that clients that support disconnected operation
>> should not use wildcards while clients that provide collation
>> operations only when connected to the server may use wildcards.

This restriction has been lifted.

>> The EBNF syntax shown in section 3.2 says that the collation-wild
>> must not exceed 255 characters total while the section 3.1 specifies
>> that the collation name must not exceed 254 characters.

Brought into sync.

>> It seems having the same maximum possible length for both collation
>> name and wildcard string would be desirable for actual
>> implementations.

I picked 254, not 255, but I confess I cannot remember why.

>>       > 4.2.1 Equality
>>
>> It needs to be made clear that the return values are not physically
>> the strings "match", etc. but enumerated values such as /equal/ and
>> /not_equal/.

Changed. Also other similar changes.

>> One extremely important point is that for a given comparator, the
>> equality function must be synchronized with the ordering function.

I've done this and all the other equivalences/connections/implications I
could see.

>> The term 'error' is also problematic, since what is really at issue
>> is a question of domain. For all those strings in the domain, either
>> 'equal' or 'not_equal' should be returned from the equality
>> function. For any string not in the domain, 'undefined' should be
>> returned.

Not changed. Back in February, I agreed that "error" was not ideal, but
did not see "undefined" as better, and could not find a really apt
term. The collations were a little too well-defined in the "undefined"
cases then.

However, in -10, I think they really will be undefined outside their
domain, so I'll change to using "undefined" instead of "error". (I'm
removing the bits that fall back to i;octet.)
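
A minimal sketch of the two requirements discussed above: the equality function returns 'equal' if and only if the ordering function does, and both return 'undefined' when an operand is outside the domain. The enumerations, the ASCII-only domain, and all function names are assumptions made for illustration; the draft does not define them this way.

```python
from enum import Enum


class Ordering(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"
    UNDEFINED = "undefined"


class Equality(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"
    UNDEFINED = "undefined"


def in_domain(s: str) -> bool:
    """Hypothetical domain: this toy collation only handles ASCII."""
    return s.isascii()


def ordering(a: str, b: str) -> Ordering:
    """Caseless ordering inside the domain, UNDEFINED outside it."""
    if not (in_domain(a) and in_domain(b)):
        return Ordering.UNDEFINED
    fa, fb = a.lower(), b.lower()
    if fa < fb:
        return Ordering.LESS
    if fa > fb:
        return Ordering.GREATER
    return Ordering.EQUAL


def equality(a: str, b: str) -> Equality:
    """Derived from ordering(), so the two can never disagree."""
    result = ordering(a, b)
    if result is Ordering.UNDEFINED:
        return Equality.UNDEFINED
    return Equality.EQUAL if result is Ordering.EQUAL else Equality.NOT_EQUAL


print(equality("Mark", "mark"), equality("a", "é"))
```

Deriving equality from ordering is only one way to keep them synchronized; an implementation could equally test the property rather than build it in.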

>> There is a typo on the 4th line of the second paragraph of
>> section 4.2 saying "... For example, an collation" which should be
>> changed to "... For example, a collation" instead.

Fixed.

>>       > 4.2.2 Substring
>>
>> Prefix and suffix matching are not fully spelled out.

I think they are now.

>> The operations and their results must be clarified. And as noted
>> before, it is very important to precisely define the substring
>> operations, especially the starting offset and ending offset. It
>> also must be clarified whether what is being asked for is the first
>> possible matching location in the string, the last, or the nth one.

Partly changed. I didn't do the bits you ask for in the last sentence. I
can add an open issue.

>>       > 4.3.3 Ordering
>>
>> > It MUST be transitive and trichotomous.
>>
>> As above, these should be defined.

I did not, since I think this document is the wrong place to define
these terms.

>> The exposition in this section would be simpler if you also defined
>> "reversible", whereby f(a,b) = less iff f(b,a) = greater.

The exposition changed enough as a result of other comments that I
disregarded this comment.
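
For reference, the "reversible" property and the UTC's proposed wording for the "-" prefix (the result for A and B equals the "+" result applied to B and A) can be sketched like this. The toy caseless ordering and the names `order_plus`, `order_minus` and `is_reversible` are assumptions, not text from the draft.

```python
from itertools import product


def order_plus(a: str, b: str) -> str:
    """Toy '+' ordering: compare case-folded strings by code point."""
    fa, fb = a.casefold(), b.casefold()
    if fa < fb:
        return "less"
    if fa > fb:
        return "greater"
    return "equal"


def order_minus(a: str, b: str) -> str:
    """'-' ordering, defined as the '+' ordering applied to (b, a)."""
    return order_plus(b, a)


def is_reversible(order, samples) -> bool:
    """Check f(a, b) == 'less' iff f(b, a) == 'greater' on all pairs."""
    return all(
        (order(a, b) == "less") == (order(b, a) == "greater")
        for a, b in product(samples, repeat=2)
    )


samples = ["Mark", "mark", "apple", "Zebra"]
print(order_plus("apple", "Zebra"), order_minus("apple", "Zebra"))
print(is_reversible(order_plus, samples))
```

Defining "-" by swapping the operands leaves 'equal' untouched, which a bare "the result MUST be reversed" does not make explicit.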

>> An 'undefined' value can be allowed if, as per equality above, it
>> means that at least one of the operands is outside of the domain.
>> The function then imposes a total order on all strings in the
>> domain; moreover, a wrapper can easily convert the function to a
>> total order over all strings by putting all items outside the domain
>> either below or above the ones in the domain -- or even excluding
>> them, at its choice.

I'm doing something like this in -10. (Removing the fallback to i;octet.)
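
The wrapper idea in the quoted paragraph can be sketched as below: a comparison that is undefined outside its domain is extended to a total order over all strings by sorting out-of-domain items below the in-domain ones. The ASCII-only domain, the code-point tie-break for two out-of-domain strings, and the function names are all assumptions for illustration.

```python
import functools
from typing import Optional


def in_domain(s: str) -> bool:
    """Hypothetical domain: ASCII-only strings."""
    return s.isascii()


def base_compare(a: str, b: str) -> Optional[int]:
    """Caseless compare: -1/0/1 inside the domain, None (undefined) outside."""
    if not (in_domain(a) and in_domain(b)):
        return None
    fa, fb = a.lower(), b.lower()
    return (fa > fb) - (fa < fb)


def total_compare(a: str, b: str) -> int:
    """Wrapper that orders every pair of strings."""
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in and b_in:
        return base_compare(a, b)
    if a_in != b_in:
        # Exactly one operand is outside the domain: it sorts below.
        return 1 if a_in else -1
    # Both outside the domain: fall back to code-point order (an assumption).
    return (a > b) - (a < b)


print(sorted(["b", "é", "a", "ü"], key=functools.cmp_to_key(total_compare)))
```

Placing the out-of-domain items above instead, or filtering them out before sorting, are the other two choices the paragraph mentions.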

>> [Note: it is very important to avoid the confusion between
>> "identical" and "equal". According to a caseless compare, "Mark" and
>> "mark" are equal; however, the strings are not identical.]

Changed all over the place.

>> [Either 'ordering function' or 'comparison function' should be used
>> consistently, not sometimes 'collations'].

Changed.

>>       > 4.3.  Internal Canonicalization Algorithm
>>
>> This section is difficult to understand.

Changed; I hope the new text is better.

>>       > 4.4.  Use of Lookup Tables
>>
>> It is not at all clear what is meant by "customizable lookup tables".

Clarified and partly removed.

>>       > 4.5.  Multi-Value Attributes
>>
>> This is very unclear.

Deleted.

>> This is a very important feature that needs to be spelled out in
>> detail, and clearly reflected in the template for registration. In
>> particular, the template should have provision for multiple
>> attributes, with the ability to specify the acceptable operands for
>> that attribute. (See below). The specification of the operands could
>> be either a list of values, or a regular expression (with the former
>> preferred). Suggested regular expression syntax would be Perl or XML
>> Schema.

I asked Martin Dürst and you to provide a new DTD. Martin said okay; I
don't remember whether you answered. I think the DTD should come before
this.

>>       > 5.1 Character Encoding
>>
>>    The protocol specification has to make sure that it is clear on which
>>    characters (rather than just octets) the collations are used.  This
>>    can be done by specifying the protocol itself in terms of characters
>>    (e.g. in the case of a query language), by specifying a single
>>    character encoding for the protocol (e.g.  UTF-8 [3]), or by
>>    carefully describing the relevant issues of character encoding
>>    labeling and conversion.  In the later case, details to consider
>>    include how to handle unknown charsets, any charsets which are
>>    mandatory-to-implement, any issues with byte-order that might apply,
>>    and any transfer encodings which need to be supported.
>>
>> If a collation is able to advertise itself as being able to handle,
>> say, SJIS and UTF-8, then there should be a required description of a
>> protocol for indicating that and for communicating which encodings
>> are handled, and how it handles error conditions (such as a charset
>> outside of those it can handle). Otherwise, it is difficult to
>> understand how this paragraph would be applied in practice.
>>
>>       > 5.3
>>
>> The section 5.3 specifies:
>>
>>     The protocol MUST specify how comparisons behave in the absence of
>>     explicit collation negotiation or when a collation of "*" is
>>     requested. The protocol MAY specify that the default collation
>>     used in such circumstances is sensitive to server configuration.
>>
>> and the section 3.2 specifies:
>>
>>     ... If the wildcard string matches multiple collations, the server
>>     SHOULD select the collation with the broadest scope (preferably
>>     international scope), the most recent table versions and the
>>     greatest number of supported operations. A single wildcard
>>     character ("*") refers to the application protocol collation
>>     behavior that would occur if no explicit negotiation were used.
>>
>> These appear inconsistent.

Changed.

>>       7.5.  Example Initial Registry Summary
>>
>> The sample registry would suffer a combinatorial explosion if
>> parameters are not handled differently.
...

This is the DTD issue.
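
To give a feel for the size of the problem, here is a rough count based on the setting counts in the UTC comment quoted earlier in the thread. It is only a sketch under stated assumptions: it reads "three different" as applying to each of the six settings listed after it, it uses 100 locales as a placeholder for "hundreds", and it ignores variable-top, whose operand is a string. It is not the UTC's own calculation.

```python
import math

# Number of values per setting, as read from the UTC comment.
# Reading "three different" as applying to each of the last six
# settings is an assumption.
settings = {
    "strength": 6,
    "case_first": 4,
    "alternate": 3,
    "backwards": 3,
    "normalization": 3,
    "case_level": 3,
    "hiragana": 3,
    "numeric": 3,
}

# Placeholder for "hundreds of locales"; the real figure is not given here.
ASSUMED_LOCALES = 100

per_locale = math.prod(settings.values())
total = ASSUMED_LOCALES * per_locale
print(f"{per_locale} combinations per locale, {total} in total")
```

Even under these conservative assumptions, registering each combination as a separate name is clearly impractical.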

>> > 11.  Security Considerations
>>
>> This is insufficient. It should at least point to the problems
>> related in UCA and in
>> http://www.unicode.org/reports/tr36/tr36-4.html (note that that
>> document has been approved by the UTC and will be posted as an
>> approved version soon.)

It now refers.

>>     General
>>
>> One of the real problems with the IANA character registry is that the
>> entries are underspecified. It quite often occurs that two vendors
>> implement the same IANA charset conversion different ways, leading
>> to significant interoperability problems and text corruption. See,
>> for example,
>> http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
>>
>> We have the real concern that this registry could lead down the same path.

Noted.

>> > collation, it has to say so
>>
>> There are places where the text should be clarified, as to whether a
>> MUST or SHOULD is implied; this is just an example.
>>
>> > "comparator" vs "collator"
>>
>> Either one term or the other should be used consistently.

Collator, now.

>> > Unicode 3.2
>>
>> Unicode 3.2 is obsolete; the reference versions for the Collation
>> Registry should be Unicode 5.0 and UCA 5.0, since those will be
>> approved and published by the time the Internet Application Protocol
>> Collation Registry has completed its review and been approved.

I'll update to the then-current versions immediately before submitting
the final draft as an RFC.

>> Because of the use of NamePrep, it is probably the case that Unicode
>> 3.2 also needs to be included, but strongly recommended for usage
>> only by protocols or systems dependent on NamePrep. Note that as of
>> UCA 4.0 and beyond, the version number of UCA is guaranteed to be
>> identical with the version number of Unicode that it is defined for.
>>
>> > Versioning
>>
>> This is tricky, and should be clarified. In many instances, it is
>> sufficient to use an unversioned collator, such as simply "UCA". In
>> other cases, there are requirements to use a specific version, or a
>> version of at least X. This needs to be described.

IETF documents should have only immutable references. Thus, I can
reference "UCAv14", but not "UCA", because the latter moves to v15, v16
and onwards.

Arnt


Re: draft-newman-i18n-collation-09.txt just posted

cowan

I propose that the procedure specified in draft-newman-i18n-collation-09
for getting new collations approved should be changed to the procedure
used for new language tags.  Instead of the requestor sending the
request to IANA, who sends it to the Collation Reviewer for discussion
on the list, and then the Collation Reviewer sends it back to IANA for
registration (or doesn't), remove the first pass through IANA.

Have people post directly to the list and work out the details.  When the
requestor thinks it's ready, the Collation Reviewer wakes up and then
either sends the latest draft of the request to IANA or else sends a
rejection (with reasons) to the list.  This lowers the load on IANA.

This scheme has worked very well for [hidden email] for the
past 11 years.

--
John Cowan  [hidden email]   http://ccil.org/~cowan
"The exception proves the rule."  Dimbulbs think: "Your counterexample proves
my theory."  Latin students think "'Probat' means 'tests': the exception puts
the rule to the proof."  But legal historians know it means "Evidence for an
exception is evidence of the existence of a rule in cases not excepted from."


Re: draft-newman-i18n-collation-09.txt just posted

Arnt Gulbrandsen

John Cowan writes:
> Instead of the requestor sending the request to IANA, who sends it to
> the Collation Reviewer for discussion on the list, and then the
> Collation Reviewer sends it back to IANA for registration (or
> doesn't), remove the first pass through IANA.

Fine. Done.

Arnt


Re: draft-newman-i18n-collation-09.txt just posted

Arnt Gulbrandsen
In reply to this post by Arnt Gulbrandsen

Arnt Gulbrandsen writes:
> Mark Davis writes:
>> ...
>> At a quick glance, it appears that a number of comments have been
>> incorporated.
>
> It is possible that some of my changes don't satisfy you. I had
> conflicting requests for many things. Feel free to repeat, rephrase
> or add arguments.

In -10 (which I'll send off once I finish work this evening) I've made
another few changes.

>>>       > 2.4 Sort Keys
>>>
>>> The use of the term "collation canonicalization" to refer to sort
>>> keys is very misleading. ...
>
> Changed; the text now speaks of sort keys. I'm afraid there still are
> instances of the old term around, I found one today.

In -10, all should be dead.

>>> The term 'error' is also problematic, since what is really at issue
>>> is a question of domain. For all those strings in the domain,
>>> either 'equal' or 'not_equal' should be returned from the equality
>>> function. For any string not in the domain, 'undefined' should be
>>> returned.
>
> Not changed. Back in February, I agreed that "error" was not ideal,
> but did not see "undefined" as better, and could not find a really
> apt term. The collations were a little too well-defined in the
> "undefined" cases then.
>
> However, in -10, I think they really will be undefined outside their
> domain, so I'll change to using "undefined" instead of "error". (I'm
> removing the bits that fall back to i;octet.)

Changed. The fallback to i;octet is now in the server, if the protocol
requires it.

This means that if a server can escape implementing i;octet, it can keep
all its strings in UCS-2 or UCS-4 internally, even as it implements
collations which are defined in terms of octets.
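
As a rough illustration of the split described above, here is a minimal Python sketch. It is not taken from the draft: the toy collation, its ASCII-only domain, and the names toy_casemap_equal, server_equal and fallback_to_octet are all assumptions made for the example. The collation itself only ever answers equal, not_equal or undefined; the octet fallback lives in the server and is used only when the protocol asks for it.

```python
from enum import Enum


class Result(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"
    UNDEFINED = "undefined"


def toy_casemap_equal(a: bytes, b: bytes) -> Result:
    """Hypothetical collation whose domain is ASCII-only octet strings."""
    if not (a.isascii() and b.isascii()):
        # Outside the domain: the collation does not define an answer.
        return Result.UNDEFINED
    return Result.EQUAL if a.upper() == b.upper() else Result.NOT_EQUAL


def server_equal(a: bytes, b: bytes, fallback_to_octet: bool) -> Result:
    """Server-side wrapper: falls back to plain octet comparison on request."""
    result = toy_casemap_equal(a, b)
    if result is Result.UNDEFINED and fallback_to_octet:
        # Assumed protocol requirement; a protocol without it never gets here.
        return Result.EQUAL if a == b else Result.NOT_EQUAL
    return result
```

A server for a protocol that never sets fallback_to_octet does not need the octet comparison at all, which is the point made in the paragraph above.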

Arnt

Re: draft-newman-i18n-collation-09.txt just posted

Mark Davis-2
Thanks. I'm at the UTC meeting right now, and we were just talking about this. In general the committee is quite happy with the changes and direction.

The remaining serious issue is the combinatorics. You have a question on the open issue asking if the new DTD solves the issue, but as far as I can tell it does not.

The combinatorics are very large for CLDR. (See http://www.unicode.org/draft/reports/tr35/tr35.html#Locale_IDs). Here is a back-of-the-envelope calculation:
- There are a couple of hundred locales and growing (http://unicode.org/cldr/apps/survey)
- Many have variants; so all of the German ones can have "phonebook" for example. That probably adds another 50 combinations, so call it around 300.
- All can be parameterized, with the following numbers of settings:
 5 strength, 2 alternates, 2 directions, 2 normalizations, 2 caseLevels, 3 caseFirsts, 2 hiragana, 2 numeric, for a total of 960 combinations.
 Since these are parameters, they can each be combined with all the locales, so that gives about 300 x 960 = 288,000 different registrations.
- This doesn't account for the variableTop parameter, which takes a string and is, at least theoretically, unbounded.

I'm sure we don't want to see nearly 300,000 entries in the form discussed in 7.5, "Example Initial Registry Summary".
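
The back-of-the-envelope figures above can be multiplied out directly. This is only a check of the arithmetic: the setting counts are the ones listed in the message, and the value of 300 locale/variant combinations is the rough estimate given there, not a measured number.

```python
from math import prod

# Number of values per collation setting, as listed in the message.
settings = {
    "strength": 5,
    "alternate": 2,
    "direction": 2,
    "normalization": 2,
    "caseLevel": 2,
    "caseFirst": 3,
    "hiragana": 2,
    "numeric": 2,
}

parameter_combinations = prod(settings.values())

# Rough estimate from the message: ~250 locales plus ~50 variants.
locale_variants = 300

total = parameter_combinations * locale_variants

print(parameter_combinations)  # 960
print(total)                   # 288000
```

With these inputs the product is on the order of 300,000, before variableTop is taken into account.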

What I suggest be done instead is that a set of parameters be registered. Thus we could have the equivalent of:

CLDR-Parameters:
  locale=aa, aa_DJ, aa_ER, aa_ER_SAAHO, aa_ET, ..., zu_ZA
  collation=phonebook, ..., gb2312han
  colStrength=primary, secondary, tertiary, quaternary, identical
  ...
  variableTop=(u<unicodeCodePoint>)+

Then the corresponding line in 7.5 could be:

     i;basic;uca=5.0.0;uv=5.0.0;CLDR-Parameters    e, o, s   i18n
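
To make the idea concrete, here is a small Python sketch of how a registered parameter set could be checked, instead of enumerating every combination. It is an illustration only. The value sets are copied from the example above and are incomplete wherever the example says "...", the function name is_valid is made up, and the assumption that <unicodeCodePoint> means 4 to 6 hex digits is mine.

```python
import re

# Values copied from the example; the lists there are abbreviated with "...",
# so these sets are deliberately incomplete.
REGISTERED = {
    "locale": {"aa", "aa_DJ", "aa_ER", "aa_ER_SAAHO", "aa_ET", "zu_ZA"},
    "collation": {"phonebook", "gb2312han"},
    "colStrength": {"primary", "secondary", "tertiary", "quaternary", "identical"},
}

# Assumption: <unicodeCodePoint> is written as 4 to 6 hexadecimal digits.
VARIABLE_TOP = re.compile(r"(u[0-9A-Fa-f]{4,6})+")


def is_valid(params: dict) -> bool:
    """Return True if every parameter name and value is covered by the set."""
    for name, value in params.items():
        if name == "variableTop":
            # Open-ended parameter: checked by pattern, not by enumeration.
            if not VARIABLE_TOP.fullmatch(value):
                return False
        elif value not in REGISTERED.get(name, set()):
            return False
    return True
```

For example, is_valid({"locale": "aa_DJ", "colStrength": "primary"}) is accepted while is_valid({"colStrength": "extreme"}) is rejected, and the registry needs one entry per parameter rather than one per combination.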

BTW, I'm going to be gone until June 7, so I won't be able to respond until then.

There are a few other areas where I have remarks on the language. In particular, the handling of the error strings needs some further fixes. For example, the following statement is false:

4.2.3.  Ordering
...
It MUST be transitive and trichotomous.

The way to handle this is to say that a collation is defined over a domain of strings. Any string outside of that domain is called "invalid", and returns an error value when used with any operation. Then you can truly say:

For all strings in its domain, it MUST be transitive and trichotomous.
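
A short Python sketch of the distinction, under stated assumptions: the toy ordering below (case-insensitive, with a domain of ASCII strings) and the names compare and trichotomous_and_transitive are invented for the example and do not come from the draft. The check passes when it is restricted to the domain and fails as soon as an invalid string is included, which is why the unrestricted statement cannot hold.

```python
from itertools import product


def compare(a: str, b: str) -> str:
    """Toy ordering whose domain is ASCII strings; anything else is invalid."""
    if not (a.isascii() and b.isascii()):
        return "error"
    ka, kb = a.lower(), b.lower()
    if ka < kb:
        return "less"
    if ka > kb:
        return "greater"
    return "equal"


def trichotomous_and_transitive(strings) -> bool:
    """Brute-force check of both properties over a finite set of strings."""
    mirror = {"less": "greater", "greater": "less", "equal": "equal"}
    # Trichotomy: every pair gets exactly one of the three answers,
    # and the answer for (b, a) is the mirror image of the one for (a, b).
    for a, b in product(strings, repeat=2):
        result = compare(a, b)
        if result not in mirror or compare(b, a) != mirror[result]:
            return False
    # Transitivity of the strict order: a < b and b < c imply a < c.
    for a, b, c in product(strings, repeat=3):
        if compare(a, b) == "less" and compare(b, c) == "less":
            if compare(a, c) != "less":
                return False
    return True
```

Calling trichotomous_and_transitive(["a", "B", "c", "A"]) succeeds, while adding a non-ASCII string such as "é" makes it fail because compare returns "error" for that pair.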

Mark


On 5/16/06, Arnt Gulbrandsen <[hidden email]> wrote:
Arnt Gulbrandsen writes:
> Mark Davis writes:
>> ...
>> At a quick glance, it appears that a number of comments have been
>> incorporated.
>
> It is possible that some of my changes don't satisfy you. I had
> conflicting requests for many things. Feel free to repeat, rephrase
> or add arguments.

In -10 (which I'll send off once I finish work this evening) I've made
another few changes.

>>>       > 2.4 Sort Keys
>>>
>>> The use of the term "collation canonicalization" to refer to sort
>>> keys is very misleading. ...
>
> Changed; the text now speaks of sort keys. I'm afraid there still are
> instances of the old term around, I found one today.

In -10, all should be dead.

>>> The term 'error' is also problematic, since what is really at issue
>>> is a question of domain. For all those strings in the domain,
>>> either 'equal' or 'not_equal' should be returned from the equality
>>> function. For any string not in the domain, 'undefined' should be
>>> returned.
>
> Not changed. Back in February, I agreed that "error" was not ideal,
> but did not see "undefined" as better, and could not find a really
> apt term. The collations were a little too well-defined in the
> "undefined" cases then.
>
> However, in -10, I think they really will be undefined outside their
> domain, so I'll change to using "undefined" instead of "error". (I'm
> removing the bits that fall back to i;octet.)

Changed. The fallback to i;octet is now in the server, if the protocol
requires it.

This means that if a server can escape implementing i;octet, it can keep
all its strings in UCS-2 or UCS-4 internally, even as it implements
collations which are defined in terms of octets.

Arnt