Feedback on

Feedback on

Doug Schepers
Hi, Addison, Richard, I18n–

(BCCing the Web Annotation WG mailing list, to keep them in the loop)

I'd like to schedule a liaison telcon between the Internationalization
WG and the Web Annotation WG, to discuss issues around a client-side API
for searching for strings in a web document.

The Web Annotation WG is chartered to deliver a spec for "fuzzy
anchoring", which basically means a way to link to a specific passage in
a document, even if there is no ID and even if the document may have
changed.

One manifestation of this is my Rangefinder API spec [1], which is
basically a find-in-page API with fuzzy matching (e.g. case folding,
Levenshtein distance tolerance, Unicode normalization [2]) and location
scoping.

For the Unicode normalization, we'd like to refer normatively to the
updated Charmod-Norm [3]. In any case, we'd like to discuss our use
cases and requirements around i18n with you, for your best advice on how
we should proceed.

I spoke with Richard today, and he suggested the best next step would be
to have you take a look at my rough early draft of the Rangefinder API, so
we have some basis for discussion. Please excuse the sketchy nature of
the spec, and note that the examples are illustrative but out of date
with the spec's development.

If you want to meet, would you want to join us, or have some of us join
you? We normally meet on Wednesdays at 11am ET.


[1] http://w3c.github.io/rangefinder/
[2] http://w3c.github.io/rangefinder/#widl-RangeFinder-unicodeFolding
[3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/

Regards–
–Doug


Feedback on i18n in Rangefinder API

Doug Schepers
Hi, Addison, Richard, I18n–

Oops, hit send too soon, sorry... resending.

(BCCing the Web Annotation WG mailing list, to keep them in the loop)

I'd like to schedule a liaison telcon between the Internationalization
WG and the Web Annotation WG, to discuss issues around a client-side API
for searching for strings in a web document.

The Web Annotation WG is chartered to deliver a spec for "fuzzy
anchoring", which basically means a way to link to a specific passage in
a document, even if there is no ID and even if the document may have
changed.

One manifestation of this is my Rangefinder API spec [1], which is
basically a find-in-page API with fuzzy matching (e.g. case folding,
Levenshtein distance tolerance, Unicode normalization [2]) and location
scoping.

For the Unicode normalization, we'd like to refer normatively to the
updated Charmod-Norm [3]. In any case, we'd like to discuss our use
cases and requirements around i18n with you, for your best advice on how
we should proceed.

I spoke with Richard today, and he suggested the best next step would be
to have you take a look at my rough early draft of the Rangefinder API, so
we have some basis for discussion. Please excuse the sketchy nature of
the spec, and note that the examples are illustrative but out of date
with the spec's development.

If you want to meet, would you want to join us, or have some of us join
you? We normally meet on Wednesdays at 11am ET.


[1] http://w3c.github.io/rangefinder/
[2] http://w3c.github.io/rangefinder/#widl-RangeFinder-unicodeFolding
[3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/

Regards–
–Doug



RE: Feedback on i18n in Rangefinder API

Phillips, Addison
Hello Doug,

Thanks for this. This is an interesting problem you're working on. It certainly dovetails with our work on Charmod-Norm.

We could certainly meet. Note that I18N meets on Thursdays at 10am ET, which is after your meeting this week. In addition, we haven't had time to review your materials. Would it be possible for representatives from I18N to participate in your call next week (20 May)? Or would you prefer to have someone(s) dial into our meeting this week (14 May)?

Addison

> -----Original Message-----
> From: Doug Schepers [mailto:[hidden email]]
> Sent: Tuesday, May 12, 2015 11:47 AM
> To: i18n WG; Richard Ishida; Phillips, Addison; W3C Public Annotation List
> Subject: Feedback on i18n in Rangefinder API
>
> Hi, Addison, Richard, I18n–
>
> Oops, hit send too soon, sorry... resending.
>
> (BCCing the Web Annotation WG mailing list, to keep them in the loop)
>
> I'd like to schedule a liaison telcon between the Internationalization WG and
> the Web Annotation WG, to discuss issues around a client-side API for
> searching for strings in a web document.
>
> The Web Annotation WG is chartered to deliver a spec for "fuzzy anchoring",
> which basically means a way to link to a specific passage in a document, even
> if there is no ID and even if the document may have changed.
>
> One manifestation of this is my Rangefinder API spec [1], which is basically a
> find-in-page API with fuzzy matching (e.g. case folding, Levenshtein distance
> tolerance, Unicode normalization [2]) and location scoping.
>
> For the Unicode normalization, we'd like to refer normatively to the updated
> Charmod-Norm [3]. In any case, we'd like to discuss our use cases and
> requirements around i18n with you, for your best advice on how we should
> proceed.
>
> I spoke with Richard today, and he suggested the best next step would be
> to have you take a look at my rough early draft of the Rangefinder API, so we
> have some basis for discussion. Please excuse the sketchy nature of the spec,
> and note that the examples are illustrative but out of date with the spec's
> development.
>
> If you want to meet, would you want to join us, or have some of us join you?
> We normally meet on Wednesdays at 11am ET.
>
>
> [1] http://w3c.github.io/rangefinder/
> [2] http://w3c.github.io/rangefinder/#widl-RangeFinder-unicodeFolding
> [3] http://www.w3.org/TR/2014/WD-charmod-norm-20140715/
>
> Regards–
> –Doug
>


Re: Feedback on i18n in Rangefinder API

Doug Schepers
Hi, Addison–

On 5/12/15 3:04 PM, Phillips, Addison wrote:
>
> Thanks for this. This is an interesting problem you're working on. It
> certainly dovetails with our work on Charmod-Norm.

Yes, I was pleased when I stumbled on Charmod-Norm. Good timing, it
seems. I appreciate that you're taking that work on, since referring to
the Unicode docs is not as clear, from an implementation perspective.


> We could certainly meet. Note that I18N meets on Thursdays at 10am
> ET, which is after your meeting this week. In addition, we haven't
> had time to review your materials. Would it be possible for
> representatives from I18N to participate in your call next week (20
> May)? Or would you prefer to have someone(s) dial into our meeting
> this week (14 May)?

I leave it up to the chairs of the Web Annotation WG to determine the
best date for you to join us.

I'd be happy to join you on your call this week to help provide context
for your review. Or if you prefer, given the late timing, I could join
a future telcon.

Regards–
–Doug



RE: Feedback on i18n in Rangefinder API

Phillips, Addison
In reply to this post by Doug Schepers
Some comments from reading the document through initially. I understand that this is a work in progress.

'caseFolding': There is a default Unicode case folding. However, it is not applicable in all cases. For example, see the note box in [1]. Certainly the default Unicode case folding could be the default behavior here, but there should be a means of tailoring the case fold using a language tag.
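
To make the tailoring point concrete, here is a minimal Python sketch. The function name `fold_case` and its `lang` parameter are hypothetical and not taken from the Rangefinder draft. Python's built-in `str.casefold()` stands in for the default Unicode case folding, and the only tailoring shown is the Turkish/Azerbaijani dotted and dotless i.

```python
def fold_case(text: str, lang: str = "") -> str:
    """Case-fold text, optionally tailored by a language tag.

    Hypothetical helper: only the Turkic dotted/dotless i tailoring is
    sketched; a real implementation would cover more languages.
    """
    primary = lang.split("-")[0].lower()
    if primary in ("tr", "az"):
        # Turkic tailoring: U+0130 (dotted capital I) folds to "i",
        # and plain "I" folds to U+0131 (dotless small i).
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    # Default (untailored) Unicode case folding.
    return text.casefold()


# Default folding treats these as a match.
print(fold_case("TITLE") == fold_case("title"))
# With Turkish tailoring, "I" folds to dotless i, so they no longer match.
print(fold_case("TITLE", "tr") == fold_case("title", "tr"))
```

The design choice here is to apply the language-specific mapping first and then fall back to the default folding for everything else, so an untagged search behaves exactly as the default does.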

'unicodeFolding': This also presents a number of difficulties. Not just canonical (NFC/NFD) equivalence but also compatibility equivalence (NFKC/NFKD) is sometimes useful. In addition, there are textual variations that are not related to Unicode character properties that searches may wish to deal with. For example, Japanese uses both katakana and hiragana phonetic scripts: one might wish to normalize these differences away when searching text. In other words, I think probably this parameter needs more thought.
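
As a rough illustration of the difference between canonical and compatibility equivalence, and of a script-level fold that Unicode normalization does not provide, consider the following Python sketch. The helper names `equivalent` and `katakana_to_hiragana` are made up for this example; the kana fold simply shifts the main katakana block onto hiragana and ignores everything else.

```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def katakana_to_hiragana(text: str) -> str:
    """Map katakana U+30A1..U+30F6 onto hiragana U+3041..U+3096.

    Illustrative only: this is not a Unicode normalization form, and
    marks such as the prolonged sound mark are left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(chr(cp - 0x60) if 0x30A1 <= cp <= 0x30F6 else ch)
    return "".join(out)


# Precomposed e-acute vs. "e" + combining acute: canonically equivalent.
print(equivalent("\u00e9", "e\u0301", "NFC"))
# The "fi" ligature U+FB01 only matches "fi" under compatibility forms.
print(equivalent("\ufb01", "fi", "NFC"), equivalent("\ufb01", "fi", "NFKC"))
# Kana folding is a separate, language-specific step.
print(katakana_to_hiragana("カタカナ"))
```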

As an aside, there are other things that you note that users might want to ignore/not ignore when searching. This is discussed at length in UTS#10, Chapter 8 [2] and language-specific tailoring and different "weights" come into play.

'wholeWord': This seems simple at first, but some languages (Thai, Japanese, Chinese) that do not use spaces between words have a difficult relationship with this feature. This doesn't make the feature invalid, but does require a health warning that the items selected may not, in fact, always be words.
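
A small Python sketch of why this is hard: a naive whole-word test based on regular-expression word boundaries works for space-delimited text but fails for Japanese, where adjacent words are not separated. The function names are hypothetical, and the `\b`-based approach is only one simple way such a flag might be implemented, not what the draft specifies.

```python
import re


def contains(needle: str, text: str) -> bool:
    """Plain substring search."""
    return needle in text


def contains_whole_word(needle: str, text: str) -> bool:
    """Naive whole-word search using regex word boundaries."""
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, text) is not None


# Space-delimited text: the boundary test behaves as expected.
print(contains_whole_word("cat", "the cat sat"))
print(contains_whole_word("cat", "concatenate"))

# Japanese: the word is present, but every neighbor is a "word
# character", so no boundary is found around it.
sentence = "私は日本語を話します"
print(contains("日本語", sentence))
print(contains_whole_word("日本語", sentence))
```

Handling such text properly generally needs dictionary-based or language-aware segmentation, which is outside what this sketch attempts.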

Normalization in general: it may be possible that the searched text is itself not provided in a normalized form. Health warnings or solid implementation guidance is certainly necessary here.

The discussion of using Unicode decomposition in section 9 might need to be carefully thought through. For example, the Korean Hangul script decomposes in a way that might interfere with searching operations (a character that had a Levenshtein distance of '1' when composed might have a distance as large as '4' when decomposed).
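
The effect can be sketched in Python with a textbook Levenshtein implementation over code points (an assumption: the draft does not say which unit the distance is measured in). In this particular example, replacing one Hangul syllable with another costs 1 in composed form and 3 once both strings are decomposed with NFD.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


composed_a, composed_b = "한", "글"
decomposed_a = unicodedata.normalize("NFD", composed_a)
decomposed_b = unicodedata.normalize("NFD", composed_b)

# One syllable swapped for another: a single substitution.
print(levenshtein(composed_a, composed_b))
# After decomposition each syllable is three jamo, all of which differ.
print(len(decomposed_a), levenshtein(decomposed_a, decomposed_b))
```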

The example 'character count': what exactly would be counted here? Unicode code points? Graphemes?
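
To show how the answers can differ, here is a short Python sketch. Both function names are hypothetical. The "grapheme" count is a deliberately rough approximation that only skips combining marks; it is not the UAX #29 grapheme cluster algorithm and will miscount emoji sequences, conjoining jamo, and similar cases.

```python
import unicodedata


def count_code_points(text: str) -> int:
    """Number of Unicode code points (what Python's len() reports)."""
    return len(text)


def count_approx_graphemes(text: str) -> int:
    """Rough user-perceived character count.

    Assumption for illustration: a character with a non-zero canonical
    combining class attaches to the previous character.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "cafe\u0301"  # "café" with a combining acute accent
print(count_code_points(decomposed))       # counts the accent separately
print(count_approx_graphemes(decomposed))  # counts "é" as one unit
```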

There are invisible characters in Unicode, such as variation selectors or the new emoji skin tone characters, which may not meaningfully affect the user's intention, but might prevent searches from being successful.
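
One possible mitigation, sketched in Python under stated assumptions: strip such characters from both the search string and the searched text before comparing. The function names and the exact set of ranges treated as ignorable (variation selectors, the variation selectors supplement, and the emoji skin tone modifiers) are choices made for this example, not something the draft defines.

```python
def _is_ignorable(ch: str) -> bool:
    """True for characters this sketch chooses to ignore when matching."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F        # variation selectors VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF   # variation selectors supplement
        or 0x1F3FB <= cp <= 0x1F3FF   # emoji skin tone modifiers
    )


def strip_ignorables(text: str) -> str:
    """Remove the characters that _is_ignorable() flags."""
    return "".join(ch for ch in text if not _is_ignorable(ch))


def loose_contains(needle: str, text: str) -> bool:
    """Substring search that ignores the characters above on both sides."""
    return strip_ignorables(needle) in strip_ignorables(text)


needle = "\u2764\ufe0f you"  # heart followed by a variation selector
text = "I \u2764 you"        # heart without the selector
print(needle in text)                # exact search misses it
print(loose_contains(needle, text))  # loose search finds it
```

Note that offsets found in the stripped text would still have to be mapped back to positions in the original text, which this sketch does not attempt.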

Anyway, food for thought. I look forward to further discussion.

~Addison

[1] http://w3c.github.io/charmod-norm/#definitionCaseFolding 
[2] http://www.unicode.org/reports/tr10/#Searching 



Re: Feedback on i18n in Rangefinder API

Robert Sanderson

Dear all,

Apologies from Frederick and myself for letting the timing for the discussion fall off the radar.

Would it be possible to join a call next week on Wednesday June 6 at 8am PST / 11am EST / 4pm UK / 5pm Europe to discuss internationalization issues regarding annotation?

In particular, it would be great to make progress on the points that Addison made and also the issue that Takeshi brought up at the F2F regarding different lengths of character strings in different (programming) languages.
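
Assuming the string-length issue refers to the encoding unit a language counts in (this thread does not spell it out), a small Python sketch shows how the same text yields different lengths. The `lengths` helper is hypothetical. Python 3 strings count code points, while JavaScript's `length` counts UTF-16 code units, so offsets computed in one environment may not line up in another.

```python
def lengths(text: str) -> dict:
    """Report the length of text in three common counting units."""
    return {
        # What Python 3's len() reports.
        "code_points": len(text),
        # What a UTF-16 based language such as JavaScript reports.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # What a byte-oriented API would report for UTF-8 data.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# "a" followed by U+1F600, which lies outside the Basic Multilingual Plane.
print(lengths("a\U0001F600"))
```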

Thanks!

Rob



On Tue, May 12, 2015 at 1:09 PM, Phillips, Addison <[hidden email]> wrote:
Some comments from reading the document through initially. I understand that this is a work in progress.

'caseFolding': There is a default Unicode case folding. However, it is not applicable in all cases. For example, see the note box in [1]. Certainly the default Unicode case folding could be the default behavior here, but there should be a means of tailoring the case fold using a language tag.

'unicodeFolding': This also presents a number of difficulties. Not just canonical (NFC/NFD) equivalence but also compatibility equivalence (NFKC/NFKD) is sometimes useful. In addition, there are textual variations that are not related to Unicode character properties that searches may wish to deal with. For example, Japanese uses both katakana and hiragana phonetic scripts: one might wish to normalize these differences away when searching text. In other words, I think probably this parameter needs more thought.

As an aside, there are other things that you note that users might want to ignore/not ignore when searching. This is discussed at length in UTS#10, Chapter 8 [2] and language-specific tailoring and different "weights" come into play.

'wholeWord': This seems simple at first, but some languages (Thai, Japanese, Chinese) that do not use spaces between words have a difficult relationship with this feature. This doesn't make the feature invalid, but does require a health warning that the items selected may not, in fact, always be words.

Normalization in general: it may be possible that the searched text is itself not provided in a normalized form. Health warnings or solid implementation guidance is certainly necessary here.

The discussion of using Unicode decomposition in section 9 might need to be carefully thought through. For example, the Korean Hangul script decomposes in a way that might interfere with searching operations (a character that had a Levenshtein distance of '1' when composed might have a distance as large as '4' when decomposed).

The example 'character count': what exactly would be counted here? Unicode code points? Graphemes?

There are invisible characters in Unicode, such as variation selectors or the new emoji skin tone characters, which may not meaningfully affect the user's intention, but might prevent searches from being successful.

Anyway, food for thought. I look forward to further discussion.

~Addison

[1] http://w3c.github.io/charmod-norm/#definitionCaseFolding
[2] http://www.unicode.org/reports/tr10/#Searching


--
Rob Sanderson
Information Standards Advocate
Digital Library Systems and Services
Stanford, CA 94305

Re: Feedback on i18n in Rangefinder API

Benjamin Young
Wednesday June 3 at 8am PST / 11am EST / 4pm UK / 5pm Europe

Rather. :)

On Wed, May 27, 2015 at 11:27 AM, Robert Sanderson <[hidden email]> wrote:

>
> Dear all,
>
> Apologies from Frederick and myself for letting the timing for the
> discussion fall off the radar.
>
> Would it be possible to join a call next week on Wednesday June 6 at 8am PST
> / 11am EST / 4pm UK / 5pm Europe to discuss internationalization issues
> regarding annotation?
>
> In particular, it would be great to make progress on the points that
> Addison made and also the issue that Takeshi brought up at the F2F regarding
> different lengths of character strings in different (programming) languages.
>
> Thanks!
>
> Rob