ERROR #1: Ambiguous grammar
These rules make the grammar ambiguous:

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
[43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*

CharData is allowed to match an empty string due to its use of "*". However CharData is referenced as CharData? meaning this potentially empty string is optional. Therefore, if content is blank, it is ambiguous as to whether CharData is matched as the empty string or if CharData is omitted completely.

Functionally this is low severity. However grammar parsers such as my own will find both interpretations and treat it as an error because the grammar is ambiguous.

The fix is simple. Change:
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
to:
[14] CharData ::= [^<&]+ - ([^<&]* ']]>' [^<&]*)

ERROR #2: CharData supports, and doesn't support, character references

Section 2.4 seems to suggest that Character Data may contain character references such as &amp;. However at the same time, the grammar rule [14] for CharData does not appear to be able to match ampersand character references at all:

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

Regards,

Daniel van Vugt
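The ambiguity reported under ERROR #1 can be made concrete by counting derivations. The following is only a minimal sketch under stated assumptions: the letter M is a hypothetical stand-in for any one markup construct (element, Reference, CDSect, PI or Comment), the ']]>' exclusion in production [14] is ignored, and the function names are invented for illustration.

```python
import re

AS_WRITTEN = re.compile(r"[^<&]*")  # CharData body from production [14]
PROPOSED = re.compile(r"[^<&]+")    # CharData body after the suggested change


def slot_derivations(segment: str, chardata: re.Pattern) -> int:
    """Count the ways one `CharData?` slot can derive `segment`."""
    ways = 0
    if segment == "":
        ways += 1  # the optional CharData is omitted
    if chardata.fullmatch(segment):
        ways += 1  # CharData is present and matches the segment
    return ways


def content_derivations(text: str, chardata: re.Pattern) -> int:
    """Count derivations of `text` under content ::= CharData? (M CharData?)*.

    'M' is an assumed placeholder for one markup construct, so n
    occurrences of 'M' leave n + 1 `CharData?` slots to fill.
    """
    total = 1
    for segment in text.split("M"):
        total *= slot_derivations(segment, chardata)
    return total


print(content_derivations("", AS_WRITTEN))     # 2: omitted, or present but empty
print(content_derivations("", PROPOSED))       # 1
print(content_derivations("M", AS_WRITTEN))    # 4: two empty slots, two ways each
print(content_derivations("aMb", AS_WRITTEN))  # 1: non-empty text is never ambiguous
```

Under these assumptions, every empty slot doubles the number of derivations with "*", while "+" leaves exactly one derivation for every accepted input.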
Daniel,
Thank you for your interest in the XML spec and your comments [1,2,3] on the XML 1.0 5th edition.

The XML Core Working Group discussed them and came to the following conclusion:

Regarding the several ambiguous grammar reports
-----------------------------------------------
You are correct that the productions as written do not themselves specify a non-ambiguous grammar, and the alterations you are suggesting are exactly the kind that a parser writer should be making if a non-ambiguous grammar is needed or desired.

However, the technical ambiguities in the productions in the XML specification have been there since the first edition in 1998, and it was never the intention to imply that the productions in the document can be used without change as a non-ambiguous grammar. The original authors of the specification felt that logical clarity was better served by the productions as written, and parser writers are free to translate them into an equivalent non-ambiguous grammar.

Perhaps that sentiment should have been spelled out explicitly in the document, but it does not seem necessary or prudent to do that or to alter the productions at this late date.

Regarding the CharData construct
--------------------------------
CharData does not include character references.

The discussion in section 2.4 starts with "_Text_ consists of intermingled character data and markup." The discussion in the next few paragraphs about character references is talking about character references in _Text_. The CharData term that, as you note, does not allow the < or & character, is only referenced from production [43] for "content" which is the production for _text_, and that production defines "content" as being CharData interspersed with various markup constructs including Reference (which includes entity and character references).

Paul Grosso, co-chair of the XML Core WG

[1] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0000
[2] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0001
[3] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0002

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Daniel van Vugt
> Sent: Thursday, 2011 October 20 0:20
> To: [hidden email]
> Subject: Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition)
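The point that CharData and Reference are separate constructs interleaved by production [43] can be sketched as a small tokenizer. This is an illustration only, not a conforming XML parser: Name is simplified to ASCII, the ']]>' exclusion is ignored, element, CDSect, PI and Comment are not handled, and the names TOKEN and tokenize are made up for this example.

```python
import re

# One alternative per construct; CharData can never start at '&' or '<'.
TOKEN = re.compile(
    r"(?P<CharRef>&#[0-9]+;|&#x[0-9a-fA-F]+;)"
    r"|(?P<EntityRef>&[A-Za-z_:][A-Za-z0-9_.:-]*;)"  # ASCII-only Name (assumption)
    r"|(?P<CharData>[^<&]+)"
)


def tokenize(content: str) -> list[tuple[str, str]]:
    """Split content into (kind, text) pairs of CharData and references."""
    tokens = []
    pos = 0
    while pos < len(content):
        match = TOKEN.match(content, pos)
        if match is None:
            # '<' (other markup) or a bare '&' is outside this sketch.
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, text in tokenize("AT&amp;T &#38; co"):
    print(kind, repr(text))
```

The references come out as their own tokens between the CharData runs, which is how "content" can contain them even though CharData itself cannot.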
I am very surprised you are not accepting corrections to the standard, for mistakes that you acknowledge do exist. Especially a correction such as this which only requires changing a single character.

However, this is not the first time I have encountered an official language specification with BNF grammar where the authors have stated they don't guarantee the grammar to be technically accurate...

For the benefit of the wider community, I think it would be helpful to still publish the errata, even indefinitely, and even if you have no intention of ever resolving the problems in the main document.

- Daniel

On 04/11/11 04:49, Grosso, Paul wrote:
> Perhaps that sentiment should have been spelled out explicitly
> in the document, but it does not seem necessary or prudent to
> do that or to alter the productions at this late date.

Daniel van Vugt writes:
> I am very surprised you are not accepting corrections to the standard,
> for mistakes that you acknowledge do exist. Especially a correction
> such as this which only requires changing a single character.

Sorry, not a mistake. An ambiguous grammar defines a language just fine. Non-ambiguity is not a requirement.

> However, this is not the first time I have encountered an official
> language specification with BNF grammar where the authors have stated
> they don't guarantee the grammar to be technically accurate...

Again, "technically accurate" is not a defined term wrt context-free grammars. I'm not aware of any suggestion that saying a grammar is expressed in BNF implies it is unambiguous. The ambiguities you identified are benign, in that they have no impact on the semantics of the relevant expressions.

> For the benefit of the wider community, I think it would be helpful to
> still publish the errata, even indefinitely, and even if you have no
> intention of ever resolving the problems in the main document.

If we ever issue another edition, a Note to the effect that the grammar represented by the BNF is ambiguous should be considered, I agree.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: [hidden email]
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
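The claim that the ambiguity is benign, in that both forms of the production define the same language, can be spot-checked by brute force on a toy version of production [43]. This is a sketch under assumptions, not a proof: the literal "&x;" is a hypothetical stand-in for the only markup construct, the ']]>' exclusion is ignored, and only strings up to a small length over a five-character alphabet are enumerated.

```python
import re
from itertools import product


def content_regex(chardata: str) -> re.Pattern:
    """Build content ::= CharData? ('&x;' CharData?)* as a regular expression.

    '&x;' is an assumed placeholder for one Reference.
    """
    return re.compile(rf"(?:{chardata})?(?:&x;(?:{chardata})?)*")


AS_WRITTEN = content_regex(r"[^<&]*")  # production [14] with "*"
PROPOSED = content_regex(r"[^<&]+")    # production [14] with "+"


def language(pattern: re.Pattern, alphabet: str = "a&x;<", max_len: int = 5) -> set[str]:
    """Return every string over `alphabet` up to `max_len` that `pattern` accepts."""
    words = set()
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            word = "".join(chars)
            if pattern.fullmatch(word):
                words.add(word)
    return words


print(language(AS_WRITTEN) == language(PROPOSED))
```

Within the enumerated strings, the two variants accept exactly the same inputs; they differ only in how many derivations an accepted input has.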