Re: HTML5 Paragraphs, Sentences and Phrases

Thomas A. Fine

This is in response to Benjamin Hawkes-Lewis' response to
Adam Sobieski's proposal for sentence and phrase tags.

Speaking to the "necessity" of these tags: I'm not sure that any tag,
or HTML, or the web, or even a good slice of pizza can really be
described as necessary.  But these tags can definitely be useful, and
most likely they can be important.  Sentence and phrase markings can
be very useful to:
  People relying on audio conversion to access the web.
  People relying on automated translation.
  People who are just learning to read.
  People who are reading an article not in their native language.
  People who are interested in inter-sentence spacing or inter-phrase spacing.
  People with commercial interests, looking to maximize their reach.

Of course, simply adding tags won't really help any of these people.
The real point is that such tags can facilitate tools that help
these people.

The problem with using span tags is that they won't facilitate tool
development.  In the absence of a real standard, no one is going
to develop software to process sentences by searching for spans
that might be labeled "sentence" or "sent" or "stc" or who knows
what else.  Only in the presence of a standard tag can developers use
these tags to improve translation, or emphasize phrasing and sentence
structure for improved readability.
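
To make the span problem concrete, here is a minimal sketch (Python,
standard library only) of what a tool faces without a standard tag.
The class names in GUESSED_CLASSES, the helper collect_sentences and
the sample markup are all hypothetical, chosen only for illustration.
Any such list is a guess, and it silently misses authors who picked a
different label.

```python
from html.parser import HTMLParser

# Hypothetical class names a tool might guess; no standard defines them.
GUESSED_CLASSES = {"sentence", "sent", "stc"}


class SentenceSpanCollector(HTMLParser):
    """Collect the text of <span> elements whose class is in GUESSED_CLASSES."""

    def __init__(self):
        super().__init__()
        self.sentences = []
        self._stack = []   # one bool per open <span>: is it a guessed sentence span?
        self._buffer = []  # text gathered inside the current sentence span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = set((dict(attrs).get("class") or "").split())
        self._stack.append(bool(classes & GUESSED_CLASSES))

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        # Flush only when the outermost sentence span closes.
        if self._stack.pop() and not any(self._stack):
            self.sentences.append("".join(self._buffer).strip())
            self._buffer = []

    def handle_data(self, data):
        if any(self._stack):
            self._buffer.append(data)


def collect_sentences(html):
    parser = SentenceSpanCollector()
    parser.feed(html)
    parser.close()
    return parser.sentences


sample = (
    '<p><span class="sentence">First one.</span> '
    '<span class="sent">Second one.</span> '
    '<span class="s">Third one.</span></p>'
)
print(collect_sentences(sample))  # ['First one.', 'Second one.']
```

The third sentence is lost because its author used a label the tool
did not anticipate.  With a single standard element there would be
nothing to guess.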

Mr. Hawkes-Lewis wrote:
>The web corpus is not going to get marked up with phrases and
>sentences in the absence of NLP advances that would make such markup
>mostly redundant.

Natural Language Processing is riddled with problems, and there is
nothing to suggest that this will change in the near future.  On
the other hand, someone who is authoring content is in the perfect
situation to accurately identify sentences or phrases.  NLP can be
an aid to that user, and can provide hints to help them select
sentence structure.  But as I said above, no such software would
ever be developed to use NLP to aid users in marking sentence
structure unless there were already dedicated sentence and phrase
tags.  So in essence, you are correct, but only because your
argument is a self-fulfilling prophecy.

You also suggest simply using a CSS pseudo-element, and relying on the
Unicode sentence breaking conventions.  However, looking at these
conventions, they are just another attempt at some sort of automated
processing, and they acknowledge that this will not work for all cases.
This is just one more argument in favor of giving content providers
the ability to accurately mark up sentence structure.
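
As an illustration of why purely rule-based breaking falls short, here
is a deliberately naive splitter in Python.  It is not the Unicode
algorithm, only a simplified stand-in that breaks after ".", "!" or "?"
when followed by whitespace and a capital letter; naive_split is a
hypothetical helper name.

```python
import re

# Break after ., ! or ? when followed by whitespace and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text):
    """Split text into 'sentences' using the single rule above."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


text = "Mr. Hawkes-Lewis wrote a reply. It was long."
print(naive_split(text))
# ['Mr.', 'Hawkes-Lewis wrote a reply.', 'It was long.']
```

The rule wrongly breaks after the abbreviation "Mr.", a case that an
author marking up the text by hand would get right without effort.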

I'll further note that any form of automated NLP is wholly inadequate
for users who are simply interested in control over formatting.
A mechanism that leaves the decision of where and when content is
formatted to some outside algorithm the author doesn't control does
not provide any real control over formatting.

If you are saying that you don't think most people will bother, that is
probably true.  But that doesn't mean that there aren't people with
a legitimate and important interest.

So back to the original question, are these tags necessary?  I would
now say yes, these tags are necessary to the development of software
tools to aid users in marking sentence structure, and they are
necessary to the development of tools that allow content providers
to improve readability of their web pages for several classes of
web users.

     tom


Arthur Clifford
At what point does one reach a point of absurdity?

Should there be a <punct char="." name="period" role="full-stop" /> and a <punct char="," name="comma" role="pause" />?

Should there be a word tag? <word role="verb"  tense="past">ran</word>

I am sure these things would be great to have, but ultimately if somebody wants to make content available with that level of detail, they should work on a conversion program that generates tagged content in XML. It would probably be something like NLML, a Natural Language Markup Language.

If HTML is for markup of presentation content in browsers or similar user agents, then div and span are adequate for the job. You could namespace your divs and spans to accomplish what you want, for example <span id="word_verb_past">ran</span>, and have a reading technology know how to process the ids of spans to determine how to present, read, or interact with the user.

If HTML is supposed to be semantic, then the arguments in favor of sentence, sentence_fragment, phrase, and word are not unreasonable, because they do after all explain what you are seeing, at least for English speakers. Then again, I know it's semantics, but a div with a specially formatted id, name, or perhaps a role attribute (if you really needed to add something) would semantically suggest what you are looking at. As would span. They suggest you are looking at either a block of content (div) or a fragment or subset of a block (span); the only thing missing is the role of the div or span. While styles do imply role, style semantically suggests visualization.

I don't think the problem here is one of reading as much as writing. Nobody in their right mind wants to sit down and mark up their sentences unless they are working on something to teach someone about sentence structure. In which case they are better off learning XML, XSLT, and HTML, and how to really use them, and working with dedicated/controlled content. Frankly, the majority of content creators are not interested in teaching anybody how to read, but rather want to sell a product, or blog about something, tweet their brainfart of the moment, or even share research, as was originally the purpose of the web. However, if Word or other programs can tell me my grammar is wrong, then they should be able to export my document in an XML format that marks up my content with grammatical markup. XSLT could transform that for use in a browser or translate it for use in other technologies. This request needs to start at the places where we produce content. Honestly, most of us still don't even use Word correctly (do you bold or italicize individual words, or do you apply a style?).

Based on how I've seen folks respond here, the HTML standard is based on what people are doing. So, rather than asking for something which may help something possibly do something, I think the key is to ask the right sector in the industry to actually build something that produces a dedicated markup language that HTML 6 can incorporate later. While I don't always agree with decisions made by folks here, I can understand their perspective that this is a fringe use case and not compelling enough to warrant new tags, especially when you can do that yourself with XHTML.


Art C.



In reply to Thomas A. Fine's post of Apr 12, 2012, 1:48 PM.

Thomas A. Fine
In reply to this post by Thomas A. Fine

For 800 years, people set type by hand, and during that time they could
choose to format sentences as separate elements (by adding extra space),
and most typesetters in fact chose to do that.

So what's absurd is that with all of our wondrous modern technologies,
authors today still do not have this most basic of abilities in any
usable form.  We can do 100 amazing things that typesetters could not,
but we still can't handle the most fundamental typesetting task that
was routine to a typesetter.

Imagine if we were designing HTML for a traditional typesetter.
The two most important tags would be a paragraph tag and a sentence
tag.  The majority of traditional typesetting could have been
accomplished with only these two tags.  This is why the lack of
a sentence tag is absurd.

Semantic tags are a problem?  Well, we already had "paragraph".  But
HTML5 adds "header", "footer", "section", "figure", "aside", "article",
and others.  Again, the absurdity here is to provide all of these tags,
and yet offer no clear standardized way to tag the most fundamental
and most common element.  Especially since a sentence is BOTH a semantic
element AND (traditionally) an element with particular styling needs.

And what exactly is the harm in adding a tag that many people might
not use?  The days when HTML could simply address the basic common
case are long gone, folks.  HTML is arguably the most popular form
of one-to-many written communication in use today.  It needs to fill
that role, and serve the diverse needs of its large audience and many
authors.

      tom


T.J. Crowder
All,

No special point to make other than to say that a sentence tag just makes obvious sense to me. I could present paragraphs as styled divs. I don't, because we have a tag that more clearly identifies the text therein as a paragraph.

So I support adding a sentence tag, provided it's a nice short tag (too bad <s> was wasted on strikethrough), recognizing that a lot of people won't use it and software that wants to interpret the structure of written text will continue to have to infer sentences in the majority of texts. That fact in no way diminishes the value of an author being able, when willing, to clearly identify the sentences.
--
T.J. Crowder

In reply to Thomas A. Fine's post of 13 April 2012, 17:41.