Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition)

Daniel van Vugt
ERROR #1: Ambiguous grammar

These rules make the grammar ambiguous:

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
[43] content ::= CharData? ((element | Reference | CDSect | PI |
Comment) CharData?)*

CharData can match the empty string because it uses "*". However, it is
referenced as "CharData?", which makes this potentially empty string
optional as well. Therefore, if content is empty, it is ambiguous
whether CharData matched the empty string or was omitted completely.

Functionally this is low severity. However, grammar-driven parsers such
as my own will find both interpretations and report an error because
the grammar is ambiguous.

The fix is simple. Change:
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
to:
[14] CharData ::= [^<&]+ - ([^<&]* ']]>' [^<&]*)
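
As a rough illustration of the report above, the following sketch counts how many ways "CharData?" can derive a given string, once with the published "*" form of production [14] and once with the proposed "+" form. It assumes Python's re module as a stand-in for a grammar parser, and the function names chardata_matches and count_derivations are hypothetical, not taken from any real parser.

```python
import re

def chardata_matches(text, quantifier):
    """True if `text` is derivable from CharData written with `quantifier`.

    quantifier is "*" (production [14] as published) or "+" (the proposed fix).
    """
    if "]]>" in text:  # the "- (... ']]>' ...)" exclusion in production [14]
        return False
    return re.fullmatch(r"[^<&]" + quantifier, text) is not None

def count_derivations(text, quantifier):
    """Count the distinct ways `CharData?` can derive `text`."""
    ways = 0
    if text == "":
        ways += 1  # the optional CharData is omitted
    if chardata_matches(text, quantifier):
        ways += 1  # CharData is present and matches `text`
    return ways

print(count_derivations("", "*"))     # 2: omitted, or present but empty
print(count_derivations("", "+"))     # 1: omitted only
print(count_derivations("abc", "*"))  # 1: non-empty text is unaffected
```

Only the empty string has two derivations, which is why the ambiguity shows up for empty content and nowhere else.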


ERROR #2: CharData supports, and doesn't support, character references

Section 2.4 seems to suggest that Character Data may contain character
references such as &amp;. However at the same time, the grammar rule
[14] for CharData does not appear to be able to match ampersand
character references at all:

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)


Regards,

Daniel van Vugt



RE: Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition)

Grosso, Paul
Daniel,

Thank you for your interest in the XML spec and your
comments [1,2,3] on the XML 1.0 5th edition.

The XML Core Working Group discussed them and came to the
following conclusion:

Regarding the several ambiguous grammar reports
-----------------------------------------------
You are correct that the productions as written do not themselves
specify a non-ambiguous grammar, and the alterations you are
suggesting are exactly the kind that a parser writer should
be making if a non-ambiguous grammar is needed or desired.

However, the technical ambiguities in the productions in the XML
specification have been there since the first edition in 1998,
and it was never the intention to imply that the productions
in the document can be used without change as a non-ambiguous
grammar.  The original authors of the specification felt that
logical clarity was better served by the productions as written,
and parser writers are free to translate them into an equivalent
non-ambiguous grammar.

Perhaps that sentiment should have been spelled out explicitly
in the document, but it does not seem necessary or prudent to
do that or to alter the productions at this late date.

Regarding the CharData construct
--------------------------------
CharData does not include character references.

The discussion in section 2.4 starts with "_Text_ consists of
intermingled character data and markup."  The discussion in
the next few paragraphs about character references is talking
about character references in _Text_.  The CharData term that,
as you note, does not allow the < or & character, is only
referenced from production [43] for "content" which is the
production for _text_, and that production defines "content"
as being CharData interspersed with various markup constructs
including Reference (which includes entity and character
references).
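
The split described above can be sketched roughly as follows. This is a simplified illustration, not the specification's definition: it handles only CharData and Reference (no elements, CDATA sections, PIs or comments), it approximates Name with an ASCII-only pattern, and the names REFERENCE, CHARDATA and split_text are hypothetical.

```python
import re

# Reference ::= EntityRef | CharRef, with Name approximated as ASCII only
# (an assumption made for brevity; the real Name production is much wider).
REFERENCE = re.compile(r"&(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_:][A-Za-z0-9_.:-]*);")
# A non-empty run of CharData: anything except "<" and "&".
CHARDATA = re.compile(r"[^<&]+")

def split_text(text):
    """Split markup-free content into (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = REFERENCE.match(text, pos)
        kind = "Reference"
        if match is None:
            match = CHARDATA.match(text, pos)
            kind = "CharData"
        if match is None:
            raise ValueError(f"unexpected {text[pos]!r} at offset {pos}")
        if kind == "CharData" and "]]>" in match.group():
            raise ValueError("']]>' is not allowed in CharData")
        tokens.append((kind, match.group()))
        pos = match.end()
    return tokens

print(split_text("AT&amp;T"))
# [('CharData', 'AT'), ('Reference', '&amp;'), ('CharData', 'T')]
```

The reference is a separate token sitting between two CharData runs, so no CharData token ever contains "&".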


Paul Grosso, co-chair of the XML Core WG

[1] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0000
[2] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0001
[3] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0002

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On
> Behalf Of Daniel van Vugt
> Sent: Thursday, 2011 October 20 0:20
> To: [hidden email]
> Subject: Errata in section 2.4 of Extensible Markup Language (XML) 1.0
> (Fifth Edition)


Re: Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition)

Daniel van Vugt
I am very surprised you are not accepting corrections to the standard,
for mistakes that you acknowledge do exist. Especially a correction such
as this which only requires changing a single character.

However, this is not the first time I have encountered an official
language specification with BNF grammar where the authors have stated
they don't guarantee the grammar to be technically accurate...

For the benefit of the wider community, I think it would be helpful to
still publish the errata, even indefinitely, and even if you have no
intention of ever resolving the problems in the main document.

- Daniel


On 04/11/11 04:49, Grosso, Paul wrote:


Re: Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition)

Henry S. Thompson
Daniel van Vugt writes:

> I am very surprised you are not accepting corrections to the standard,
> for mistakes that you acknowledge do exist. Especially a correction
> such as this which only requires changing a single character.

Sorry, not a mistake.  An ambiguous grammar defines a language just
fine.  Non-ambiguity is not a requirement.

> However, this is not the first time I have encountered an official
> language specification with BNF grammar where the authors have stated
> they don't guarantee the grammar to be technically accurate...

Again, "technically accurate" is not a defined term wrt context-free
grammars.  I'm not aware of any suggestion that saying a grammar is
expressed in BNF implies it is unambiguous.  The ambiguities you
identified are benign, in that they have no impact on the semantics of
the relevant expressions.
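
One way to see why the ambiguity is benign is to check that the "*" and "+" forms of CharData accept the same strings. The sketch below is only a spot check on a handful of samples, not a proof. It restricts content to CharData and Reference, ignores the "]]>" exclusion, approximates Name with ASCII, and uses the hypothetical names REF, content_pattern and accepts.

```python
import re

# Reference, with Name approximated as ASCII only (an assumption).
REF = r"&(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_:][A-Za-z0-9_.:-]*);"

def content_pattern(quantifier):
    """Build `CharData? (Reference CharData?)*` for the given CharData quantifier."""
    chardata_opt = r"(?:[^<&]" + quantifier + r")?"
    return re.compile(chardata_opt + r"(?:" + REF + chardata_opt + r")*")

def accepts(text, quantifier):
    """True if the whole of `text` matches the restricted content production."""
    return content_pattern(quantifier).fullmatch(text) is not None

samples = ["", "abc", "AT&amp;T", "&lt;&gt;", "a & b", "a<b", "x&#x41;y&#65;"]
for s in samples:
    # Both variants should agree on every sample.
    print(repr(s), accepts(s, "*"), accepts(s, "+"))
```

On these samples the two variants agree on acceptance and rejection alike; they differ only in how many derivations exist for the strings they accept.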

> For the benefit of the wider community, I think it would be helpful to
> still publish the errata, even indefinitely, and even if you have no
> intention of ever resolving the problems in the main document.

If we ever issue another edition, a Note to the effect that the
grammar represented by the BNF is ambiguous should be considered, I
agree.

ht
--
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: [hidden email]
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]