>> -----Original Message-----
>> From: bacchi raffaele [mailto:[hidden email]]
>> Sent: Monday, 2011 December 12 3:45
>> To: [hidden email]
>> Subject: XML grammar error?
>>
>> Hi,
>> I think that rule  (and other similar) are wrong:
>> CData ::= (Char* - (Char* ']]>' Char*))
>> The purpose of the rule is to match (reduce) any Char sequence not
>> containing ']]>'.
>> But this result is not achieved since the Char definition includes ']'
>> and '>' so the exception part of the rule:
>> -(Char* ']]>' Char*)
>> is ambiguous. Most parsers solve the ambiguity by applying the rule
>> "reduce as soon, as much as possible"
>> thus the rule will always mismatch because the first Char* reduces also
>> the sequence ']]>' and the next terminal ']]>' will never match.
>
>
>There is no ambiguity here. A - B matches if A matches, provided B does
>not also match what A matches. The regular expression (in conventional
>notation) /^.*]]>.*$/ matches any string that contains at least one ']]>'.
>It is ambiguous in the sense that if there are multiple tokens of ']]>'
>in the string, different matchers will match ']]>' in the pattern against
>the first or the last. But that makes no difference to the meaning of
>the pattern.
>
>Specifically, a leftmost-longest matcher will first match the first
>Char* against the whole string, then attempt to match ']' and fail.
>It will then reduce the Char* by one character and try again to match
>']'. Iff there is a ']]>' in the string, it will eventually be matched
>as a result of the shortening of the first Char*; the second Char* will
>then match whatever is left. If there is more than one, the rightmost
>will be the one that matches.
>
>By way of contrast, a DFA matcher will match the leftmost occurrence
>of ']]>'. But as stated, exactly which ']]>' is matched is irrelevant.
>
>
>> I think the rule (and other similar) should be written:
>> Cdata ::= ( Char - ']]>' )*
>
>This will not work since it says to match a single character which is
>not a three-character sequence. No single character can be three
>characters, so it will match every character.
>
>
>Paul Grosso
>for the XML Core WG
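
For reference, the set-difference reading in the quoted reply can be sketched in Python. This is only an illustration, not text from the XML specification: the helper names matches_exception and matches_cdata are made up here, every Python character is assumed to count as a Char, and Python's re module is a backtracking matcher, so it behaves like the "shorten Char* and try again" description above.

```python
import re

# The regex from the quoted reply, with ']' escaped and DOTALL so that '.'
# also covers newlines (assumption: any character counts as Char).
CONTAINS_END = re.compile(r"^.*\]\]>.*$", re.DOTALL)

def matches_exception(s: str) -> bool:
    """B = Char* ']]>' Char*: true iff s contains ']]>' somewhere."""
    return CONTAINS_END.match(s) is not None

def matches_cdata(s: str) -> bool:
    """A - B with A = Char*: every string matches A, so only B decides."""
    return not matches_exception(s)

for sample in ["plain text", "a ]] > b", "x ]]> y", "]]> first ]]> second"]:
    print(repr(sample), matches_cdata(sample))
```

Which occurrence of ']]>' the matcher happens to bind does not affect the yes/no answer, which is the point made in the quoted reply.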
Hi,
I think the term (Char* ']]>' Char*) would be fine for a nondeterministic parser that tries every plausible parse of the input. However, a deterministic parser (for example, leftmost-longest) does not "...reduce the Char* by one character and try again...". If the rule were (xxx ']]>' xxx), with xxx defined elsewhere, the parser would have to try every possible length of xxx, which would make it effectively nondeterministic for xxx and all of its nested rules.
By contrast, Cdata ::= ( Char - ']]>' )* is correct. According to Extended BNF (ISO/IEC 14977:1996):
"Syntactic-term ... When a syntactic-term is a syntactic-factor followed by an except-symbol followed by a syntactic-exception it represents any sequence of symbols that satisfies both of the conditions: a) it is a sequence of symbols represented by the syntactic-factor, b) it is not a sequence of symbols represented by the syntactic-exception. ... "
The standard places no constraint requiring the syntactic-factor to have the same length as the syntactic-exception. The parser can match a Char, check that the sequence ']]>' does not match at that position, reduce the matched Char (advance 1 char in the source), and repeat while both conditions are satisfied.
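
The loop I have in mind can be sketched as follows. This is a minimal illustration of my reading of ( Char - ']]>' )*, in which the exception is tested as a lookahead at the current position; the quoted reply reads the same production differently. The function name scan_cdata is hypothetical, and every Python character is assumed to count as a Char.

```python
def scan_cdata(source: str, start: int = 0) -> int:
    """Return the index just past the longest run of characters, beginning
    at `start`, that stops before the first ']]>' (or at end of input)."""
    pos = start
    # Condition a): there is a Char at pos.
    # Condition b): the sequence ']]>' does not begin at pos.
    while pos < len(source) and not source.startswith("]]>", pos):
        pos += 1  # reduce the matched Char: advance one character
    return pos

text = "a < b ]] > c]]>rest"
end = scan_cdata(text)
print(text[:end])
```

The scan is a single left-to-right pass with a fixed three-character lookahead, so no backtracking over Char* is needed.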