RE: XML grammar error?

bacchi raffaele
>> -----Original Message-----
>> From: bacchi raffaele [mailto:[hidden email]]
>> Sent: Monday, 2011 December 12 3:45
>> To: [hidden email]
>> Subject: XML grammar error?
>> Hi,
>> I think that rule [20] (and other similar) are wrong:
>> CData ::= (Char* - (Char* ']]>' Char*))
>> The purpose of the rule is to match (reduce) any Char sequence not
>> containing ']]>'.
>> But this result is not achieved since the Char definition includes ']'
>> and '>' so the exception part of the rule:
>> -(Char* ']]>' Char*)
>> is ambiguous. Most parsers solve the ambiguity by applying the rule
>> "reduce as soon, as much as possible"
>> thus the rule will always mismatch because the first Char* reduces also
>> the sequence ']]>' and the next terminal ']]>' will never match.
>There is no ambiguity here.  A - B matches if A matches, provided B does
>not also match what A matches.  The regular expression (in conventional
>notation) /^.*]]>.*$/ matches any string that contains at least one ']]>'.
>It is ambiguous in the sense that if there are multiple tokens of ']]>'
>in the string, different matchers will match ']]>' in the pattern against
>the first or the last.  But that makes no difference to the meaning of
>the pattern.
>Specifically, a leftmost-longest matcher will first match the first
>Char* against the whole string, then attempt to match ']' and fail.
>It will then reduce the Char* by one character and try again to match
>']'.  Iff there is a ']]>' in the string, it will eventually be matched
>as a result of the shortening of the first Char*; the second Char* will
>then match whatever is left.  If there is more than one, the rightmost
>will be the one that matches.
>By way of contrast, a DFA matcher will match the leftmost occurrence
>of ']]>'.  But as stated, exactly which ']]>' is matched is irrelevant.
>> I think the rule (and other similar) should be written:
>> Cdata ::= ( Char - ']]>' )*
>This will not work since it says to match a single character which is
>not a three-character sequence.  No single character can be three
>characters, so it will match every character.
>Paul Grosso
>for the XML Core WG

I think the term
(Char* ']]>' Char*)
would be fine for a nondeterministic parser, one that tries every plausible parse of the input.
However, a deterministic parser (for example, leftmost-longest) does not
"...reduce the Char* by one character and try again..."
because if the rule were
(xxx ']]>' xxx)
it would have to try every possible length of xxx (defined elsewhere),
which forces the parser to be nondeterministic for xxx and all of its nested rules.
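
For reference, the backtracking behaviour quoted above can be observed with a small sketch in Python, whose re module is a backtracking matcher. This is only an illustration under assumptions: Char is approximated by "." with DOTALL rather than the real XML Char production, and the names is_cdata and split_on_end_marker are hypothetical.

```python
import re

# Exception part of rule [20]: Char* ']]>' Char*
# Assumption: Char is approximated by "any character" (DOTALL).
CONTAINS_END = re.compile(r".*\]\]>.*", re.DOTALL)

# Same pattern with groups, to see which ']]>' a greedy matcher settles on.
SPLIT = re.compile(r"(.*)\]\]>(.*)", re.DOTALL)


def is_cdata(text):
    """Set-difference reading of rule [20]: Char* minus (Char* ']]>' Char*)."""
    return CONTAINS_END.fullmatch(text) is None


def split_on_end_marker(text):
    """Return (before, after) around the ']]>' chosen by the matcher, or None."""
    m = SPLIT.fullmatch(text)
    return None if m is None else (m.group(1), m.group(2))


print(is_cdata("plain text"))            # no ']]>' anywhere
print(is_cdata("a]]>b"))                 # contains ']]>'
print(split_on_end_marker("a]]>b]]>c"))  # greedy Char* backs off from the right
```

The first Char* initially takes the whole string and is then shortened one character at a time, which is exactly the retrying that a strictly deterministic parser would not perform.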

By contrast,
Cdata ::= ( Char - ']]>' )*
is correct. According to Extended BNF (ISO/IEC 14977:1996):
When a syntactic-term is a syntactic-factor followed by
an except-symbol followed by a syntactic-exception it
represents any sequence of symbols that satisfies both of
the conditions:
a) it is a sequence of symbols represented by the syntactic-factor,
b) it is not a sequence of symbols represented by the syntactic-exception.
Nothing in this definition requires the syntactic-factor and the syntactic-exception to have the same length.
The parser can match a Char, check that the input at that position does not match the sequence ']]>', then reduce the matched Char
(advance 1 char in the source), and repeat while both conditions are satisfied.
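
The procedure just described can be sketched as a simple loop. This is a minimal illustration in Python, assuming the exception is checked as a lookahead at the current position and that every character of the input counts as a Char; the function name scan_cdata is hypothetical and not taken from any specification.

```python
def scan_cdata(source, start=0):
    """Return the index just past the longest run matched from `start`.

    At each position: fail if the input begins with ']]>' there,
    otherwise accept one Char and advance by one character.
    """
    pos = start
    while pos < len(source) and not source.startswith("]]>", pos):
        pos += 1  # reduce the matched Char, advance 1 char in the source
    return pos


end = scan_cdata("abc]]>def")
print(end, repr("abc]]>def"[:end]))
```

The loop never backtracks: each character is examined with a fixed lookahead of three characters, so the scan stays deterministic.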

Best regards
Bacchi Raffaele