[MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS


Frédéric Wang-2
Hi Math WG,

Continuing on feedback for a future MathML specification, here is a
(probably non-exhaustive) list of inconsistencies between MathML and
HTML5/CSS regarding whitespace and attributes canonicalization. As a
rule of thumb, it would be better for web engines if MathML can align on
HTML5 so that we can reuse as much code as possible and avoid extra code
to handle MathML special cases. Also people familiar with HTML5 will be
less surprised when handling MathML.

1) Whitespace collapsing/trimming
   https://www.w3.org/TR/MathML/chapter2.html#fund.collapse

   Whitespace collapsing is consistent with the default CSS property
"white-space" and people are familiar with it.

   Removing "whitespace at the beginning and end of the content" is less
expected. Gecko has some code to handle this but it would be very
helpful to avoid this additional complexity. WebKit does not handle it
at the moment and it's not clear it's worth doing it... Except in the
MathML spec/test, everybody seems to just write <mo>(</mo> and not <mo>
( </mo>. Can we deprecate this behavior in MathML4? Or maybe you should
work with the HTML5 WG to define such collapsing rules during document
parsing, so that the MathML rendering code no longer needs to handle it?

2) In MathML, white spaces are understood as XML spaces (U+0020), tabs
(U+0009), line feeds (U+000A), and carriage returns (U+000D) while HTML5
also includes "form feed" (U+000C).

    https://www.w3.org/TR/html5/infrastructure.html#space-character
   
3) MathML attributes are case-sensitive while HTML5 attributes are
case-insensitive. Case sensitivity is probably not a problem for users
and it makes parsing easier. However, WebKit developers writing or
reviewing patches have often considered doing case-insensitive
comparisons as that's consistent with the rest of the code base.

4) MathML boolean attributes take value "true" and "false". In HTML5,
the boolean value is given by the presence/absence of the attribute and
the only allowed values are the empty string and the name of the
attribute. This allows a more compact syntax like <mo largeop stretchy>
instead of <mo
largeop="true" stretchy="true">. However, Web engines and authoring
tools will continue to support the true/false syntax anyway, so it's
probably not worth adding complexity here...

   https://www.w3.org/TR/html5/infrastructure.html#boolean-attributes

5) As I said in a previous message, the values "small", "normal", "big"
of mathsize do not exist for CSS font-size. Removing them would simplify
the parsing code a bit.

6) The definition of numbers is also not very accurate in the MathML
recommendation compared to HTML5. One has to check the RelaxNG schemas
and the predefined RelaxNG types to know the exact syntax. Again, I
think it would be best to rely on the HTML5 definitions. For example,
<math><mspace width="1E1em" height="10em" mathbackground="red"/></math>
draws a red square in WebKit but Gecko says "1E1em" is invalid.

   https://www.w3.org/TR/html5/infrastructure.html#numbers

Frédéric



Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

David Carlisle
On 01/08/2016 16:31, Frédéric Wang wrote:
> Hi Math WG,

Some personal "first thought" replies ...

>
> Continuing on feedback for a future MathML specification, here is a
> (probably non-exhaustive) list of inconsistencies between MathML and
> HTML5/CSS regarding whitespace and attributes canonicalization. As a
> rule of thumb, it would be better for web engines if MathML can align on
> HTML5 so that we can reuse as much code as possible and avoid extra code
> to handle MathML special cases. Also people familiar with HTML5 will be
> less surprised when handling MathML.
>
> 1) Whitespace collapsing/trimming
>    https://www.w3.org/TR/MathML/chapter2.html#fund.collapse
>
>    Whitespace collapsing is consistent with the default CSS property
> "white-space" and people are familiar with it.
>
>    Removing "whitespace at the beginning and end of the content" is less
> expected. Gecko has some code to handle this but it would be very
> helpful to avoid this additional complexity. WebKit does not handle it
> at the moment and it's not clear it's worth doing it... Except in the
> MathML spec/test, everybody seems to just write <mo>(</mo> and not <mo>
> ( </mo>. Can we deprecate this behavior in MathML4? Or maybe you should
> work with the HTML5 WG to define such collapsing rules during document
> parsing, so that the MathML rendering code no longer need to handle it?

white space is always a problem:-) but I'd be sorry to just drop this
completely, it's a well established feature of math typesetting (in TeX
and elsewhere) that user-whitespace is ignored and the math typesetter
re-adds white space as needed.  That said, I agree that the fact that
TeX treats 1+2 like 1 + 2 doesn't necessarily mean that mathml should
treat <mo>+</mo> like <mo> + </mo>. If the trimming could happen during
text/html parsing that would simplify some things.


>
> 2) In MathML, white spaces are understood as XML spaces (U+0020), tabs
> (U+0009), line feeds (U+000A), and carriage returns (U+000D) while HTML5
> also includes "form feed" (U+000C).
>
>     https://www.w3.org/TR/html5/infrastructure.html#space-character
>

Probably we should just change that. Either always include U+000C or
specify white space characters are XML white space in application/xml
parsing and html white space in text/html parsing or something ...
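
As a rough illustration of the second option (a white space set that
depends on the parser), here is a small Python sketch. The helper names
whitespace_for and trim are invented for this note and are not taken
from any specification or engine; the two character sets are the ones
listed in point 2 above.

```python
# XML white space: space, tab, line feed, carriage return.
XML_WHITESPACE = frozenset("\u0020\u0009\u000A\u000D")
# HTML5 space characters add form feed (U+000C) to the XML set.
HTML_WHITESPACE = XML_WHITESPACE | {"\u000C"}


def whitespace_for(media_type: str) -> frozenset:
    """Pick the white space set for the parser in use (hypothetical helper)."""
    if media_type == "text/html":
        return HTML_WHITESPACE
    # Assumption: every other media type goes through an XML parser.
    return XML_WHITESPACE


def trim(text: str, media_type: str) -> str:
    """Strip leading and trailing white space as defined for that parser."""
    return text.strip("".join(whitespace_for(media_type)))
```

With this sketch, a form feed around "+" is trimmed for text/html but
kept for application/xml, which is exactly the inconsistency under
discussion.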



> 3) MathML attributes are case-sensitive while HTML5 attributes are
> case-insensitive. case-sensitiveness is probably not a problem for users
> and it's easier for the parsing. However, WebKit developers writing or
> reviewing patches have often considered doing case-insensitive
> comparisons as that's consistent with the rest of the code base.

Do you mean the attribute values or the attribute names? For the latter
my understanding is that it's the same as (x)html in that the text/html
parser will normalise the case of the attribute name (to lower case
except for definitionURL), giving an appearance of case insensitivity.

>
> 4) MathML boolean attributes take value "true" and "false". In HTML5,
> the boolean value is given by the presence/absence of the attribute and
> the only allowed value is the name of the attribute. This allows to get
> more compact syntax like <mo largeop stretchy> instead of <mo
> largeop="true" stretchy="true">. However, Web engines and authoring
> tools will continue to support the true/false syntax anyway, so it's
> probably not worth adding complexity here...

I don't think allowing stretchy=stretchy as an alternative to
stretchy=true would break anything on the XML side of things, and would
potentially, as you say, allow just stretchy in text/html using its
version of the old SGML shorttag feature. You could say more than me
whether that would simplify or complicate things at implementation level.

>
>    https://www.w3.org/TR/html5/infrastructure.html#boolean-attributes
>
> 5) As I said in a previous message, the values "small", "normal", "big"
> of mathsize do not exist for CSS font-size. Removing them will simplify
> a bit the parsing code.

Are these conceptually more difficult than css names like
small,medium,large,x-large? (just asking:-)

>
> 6) The definition of numbers is also not very accurate in the MathML
> recommendation compared to HTML5. One has to check the RelaxNG schemas
> and the predefined RelaxNG types to know the exact syntax.

Well hopefully section 2.1.5.1
https://www.w3.org/Math/draft-spec/mathml.html#chapter2_id.2.1.5.1
is reasonably exact (but the main point that it's not exactly the same
as HTML5 is of course undeniable)


> Again, it
> think it would be best to rely on the HTML5 definitions. For example,
> <math><mspace width="1E1em" height="10em" mathbackground="red"/></math>
> draws a red square in WebKit but Gecko says "1E1em" is invalid.

Certainly scope for documenting the syntaxes there and seeing whether
any differences are giving extra functionality or just historical, I
suspect that we should be able to specify a profile of mathml for
text/html parsing that brings things more in to line with html/css
numeric syntax if that's needed.


>
>    https://www.w3.org/TR/html5/infrastructure.html#numbers
>
> Frédéric
>
>

David

________________________________


The Numerical Algorithms Group Ltd is a company registered in England and Wales with company number 1249803. The registered office is:

Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, United Kingdom.



This e-mail has been scanned for all viruses by Microsoft Office 365.

________________________________


Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

Frédéric Wang-2
Le 01/08/2016 à 18:45, David Carlisle a écrit :
>
> white space is always a problem:-) but I'd be sorry to just drop this
> completely, it's a well established feature of math typesetting (in TeX
> and elsewhere) that user-whitespace is ignored and the math typesetter
> re-adds white space as needed.  That said, I agree that the fact that
> TeX treats 1+2 like 1 + 2 doesn't necessarily mean that mathml should
> treat <mo>+</mo> like <mo> + </mo>. If the trimming could happen during
> text/html parsing that would simplify some things.
TeX-to-MathML converters trim and collapse the whitespace, and that's
probably the same for other MathML generators. That's why I don't see a
need for renderers to do it again.
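
For reference, the trimming and collapsing in question can be sketched
in a few lines of Python. This is only an illustration of the rule
linked in point 1 (#fund.collapse), assuming the XML white space set
named in point 2; the function name normalize_token_text is made up for
this note and is not the code of Gecko, WebKit, or any converter.

```python
import re

# XML white space as listed in the thread: space, tab, line feed, carriage return.
XML_WHITESPACE = " \t\n\r"


def normalize_token_text(text: str) -> str:
    """Trim leading/trailing XML white space, then collapse inner runs to one space."""
    trimmed = text.strip(XML_WHITESPACE)
    return re.sub(r"[ \t\n\r]+", " ", trimmed)
```

Under this sketch the content of <mo> ( </mo> normalizes to the same
string as the content of <mo>(</mo>. A generator that applies such a
step before serializing leaves nothing for the renderer to trim.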


> Do you mean the attribute values or the attribute names?
Values.

>
>> 4) MathML boolean attributes take value "true" and "false". In HTML5,
>> the boolean value is given by the presence/absence of the attribute and
>> the only allowed value is the name of the attribute. This allows to get
>> more compact syntax like <mo largeop stretchy> instead of <mo
>> largeop="true" stretchy="true">. However, Web engines and authoring
>> tools will continue to support the true/false syntax anyway, so it's
>> probably not worth adding complexity here...
>
> I don't think allowing stretchy=stretchy as an alternative to
> stretch=true would break anything on the XML side of things, and would
> potentially, as you say, allow just stretchy in text/html using its
> version of the old SGML shorttag feature. You could say more than me
> whether that would simplify or complicate things at implementation level.
My guess is that it will add more code if we do not remove the current
MathML syntax (which I guess we don't want as that will break all
existing documents). But maybe the new syntax is not too much to
add if it's really something users want.

>>
>>    https://www.w3.org/TR/html5/infrastructure.html#boolean-attributes
>>
>> 5) As I said in a previous message, the values "small", "normal", "big"
>> of mathsize do not exist for CSS font-size. Removing them will simplify
>> a bit the parsing code.
>
> Are these conceptually more difficult than css names like
> small,medium,large,x-large? (just asking:-)
True, I forgot these keywords. I'll have to read the Gecko/WebKit code
to check where these values are resolved. But as far as I can see, the
lists of keywords are different, so we won't be able to use exactly the same code
anyway.

>
>>
>> 6) The definition of numbers is also not very accurate in the MathML
>> recommendation compared to HTML5. One has to check the RelaxNG schemas
>> and the predefined RelaxNG types to know the exact syntax.
>
> Well hopefully section 2.1.5.1
> https://www.w3.org/Math/draft-spec/mathml.html#chapter2_id.2.1.5.1
> is reasonably exact (but the main point that it's not exactly the same
> as HTML5 is of course undeniable)
Yes. Note that HTML5 is also very accurate about the parsing steps, so
it could be worth just reusing the HTML5 definitions so that web engine
implementers don't have to check the differences.



Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

William F Hammond
In reply to this post by Frédéric Wang-2

Frédéric Wang <[hidden email]> writes in part:


A specific point:

> ...
> For example, <math><mspace width="1E1em"
> height="10em" mathbackground="red"/></math> draws a red
> square in WebKit but Gecko says "1E1em" is invalid.

1E1 is ridiculous.  For one thing, to my eye, it's 10.0
(floating point) -- implied by the E notation -- rather than
simply 10

                                    -- Bill



Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

David Carlisle
On 01/08/2016 22:33, William F Hammond wrote:

>
> Frédéric Wang <[hidden email]> writes in part:
>
>
> A specific point:
>
>> ...
>> For example, <math><mspace width="1E1em"
>> height="10em" mathbackground="red"/></math> draws a red
>> square in WebKit but Gecko says "1E1em" is invalid.
>
> 1E1 is ridiculous.  For one thing, to my eye, it's 10.0
> (floating point) -- implied by the E notation -- rather than
> simply 10
>
>                                     -- Bill

? the length is a floating point quantity here. 1E1em isn't valid mathml
syntax but it seems a perfectly reasonable suggested extension, isn't it?
10em is same as 10.0em and could have been the same as 1e1em if we'd
specified it that way couldn't it?

David



Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

Frédéric WANG
In reply to this post by David Carlisle
Le 01/08/2016 à 18:45, David Carlisle a écrit :
> I don't think allowing stretchy=stretchy as an alternative to
> stretch=true would break anything on the XML side of things, and would
> potentially, as you say, allow just stretchy in text/html using its
> version of the old SGML shorttag feature. You could say more than me
> whether that would simplify or complicate things at implementation level.
One additional complexity is that MathML boolean attributes really have
three values: "false" or "true" when they are explicit and "automatic"
(computed from the operator dictionary etc) when they are not specified.
So I'm not sure the HTML5 syntax will work.
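
To make the three-state behaviour concrete, here is a minimal Python
sketch. Everything in it is illustrative: OPERATOR_DEFAULTS is a tiny
hypothetical stand-in for the operator dictionary, and the function
names are invented for this note rather than taken from any engine.

```python
from typing import Optional

# Hypothetical two-entry stand-in for the operator dictionary.
OPERATOR_DEFAULTS = {
    "(": {"stretchy": True},
    "+": {"stretchy": False},
}


def parse_boolean_attribute(value: Optional[str]) -> Optional[bool]:
    """Return True or False for an explicit value, None when absent or invalid."""
    if value == "true":
        return True
    if value == "false":
        return False
    return None


def resolve_stretchy(operator: str, attribute_value: Optional[str]) -> bool:
    """Use the explicit value when there is one, else the dictionary default."""
    explicit = parse_boolean_attribute(attribute_value)
    if explicit is not None:
        return explicit
    # "Automatic" case: fall back to the dictionary, assuming False if unknown.
    return OPERATOR_DEFAULTS.get(operator, {}).get("stretchy", False)
```

The difficulty shows up with an operator such as "(" in this sketch:
its default is stretchy, so an author needs a way to say "false"
explicitly. Presence or absence of the attribute alone cannot
distinguish "false" from "not specified".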




Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

Frédéric Wang-2
In reply to this post by William F Hammond
Le 01/08/2016 à 23:33, William F Hammond a écrit :
>
> 1E1 is ridiculous.  For one thing, to my eye, it's 10.0
> (floating point) -- implied by the E notation -- rather than
> simply 10
>
>                                     -- Bill
Not sure I understand your point either. As David said, lengths use
floating point numbers. Gecko's MathML code implements its own parsing to
verify that the number matches the MathML syntax before converting to
float while WebKit's parsing code is simpler and just calls an internal
toFloat method immediately (letting it decide what's the valid syntax).
If MathML aligns on HTML5 and the typical syntax for floats then Gecko's
code could be simplified a bit. Maybe that will also help converters
that generate lengths via some calculations, I don't know.
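
The two strategies can be contrasted with a short Python sketch. It is
not the Gecko or WebKit code: parse_strict and parse_lenient are
hypothetical names, Python's float() merely stands in for an internal
toFloat method, and the strict pattern is one reading of the MathML
number syntax (optional "-", decimal digits with at most one decimal
point, no exponent). Unitless values and named spaces are left out.

```python
import re
from typing import Optional, Tuple

UNITS = "em|ex|px|in|cm|mm|pt|pc|%"

# Strict: validate the number against the assumed MathML syntax first.
STRICT_LENGTH = re.compile(rf"(-?(?:\d+\.?\d*|\.\d+))({UNITS})")
# Lenient: split off the unit and let float() decide what a number is.
LENIENT_LENGTH = re.compile(rf"(.+?)({UNITS})")


def parse_strict(value: str) -> Optional[Tuple[float, str]]:
    """Return (number, unit), or None if the syntax check fails."""
    match = STRICT_LENGTH.fullmatch(value)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def parse_lenient(value: str) -> Optional[Tuple[float, str]]:
    """Return (number, unit), accepting whatever float() accepts."""
    match = LENIENT_LENGTH.fullmatch(value)
    if match is None:
        return None
    try:
        return float(match.group(1)), match.group(2)
    except ValueError:
        return None
```

Here parse_strict("1E1em") returns None while parse_lenient("1E1em")
returns (10.0, "em"), mirroring the mspace example. The lenient version
also inherits every other form float() happens to accept, which is the
cost of leaving the syntax to a generic conversion routine.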



Re: [MathML4] Whitespace and attributes canonicalization in MathML VS HTML5/CSS

Hammond, William F
My intention was to defend the Gecko behavior and to say that 'E' notation should not be used with human-scale lengths


> On Aug 2, 2016, at 9:30 AM, Frédéric Wang <[hidden email]> wrote:
>
>> Le 01/08/2016 à 23:33, William F Hammond a écrit :
>>
>> 1E1 is ridiculous.  For one thing, to my eye, it's 10.0
>> (floating point) -- implied by the E notation -- rather than
>> simply 10
>>
>>                                    -- Bill
> Not sure I understand your point either. As David said, lengths use
> floating point numbers. Gecko's MathML code implement its own parsing to
> verify that the number matches the MathML syntax before converting to
> float while WebKit's parsing code is simpler and just calls an internal
> toFloat method immediately (letting it decide what's the valid syntax).
> If MathML aligns on HTML5 and the typical syntax for floats then Gecko's
> code could be simplified a bit. Maybe that will also help converters
> that generate lengths from via some calculations, I don't know.
>
>