multiple pattern facet conjunction


Syd Bauman

Previous-subject: "Re: [oXygen-user] wrong conjunction for multiple pattern facets?"

The following is a follow-on of a discussion that has been occurring
on the oxygen-user mailing list
(http://oxygenxml.com/mailman/listinfo/oxygen-user/).


It seems pretty clear that in RelaxNG, multiple occurrences of a
<param name="pattern"> inside a single <data> element (whose type=
must be a W3C datatype that allows the pattern facet) must all be
met, i.e., they are ANDed together. The following is from section 2
of "Guidelines for using W3C XML Schema Datatypes with RELAX NG"[1]

   If the 'pattern' parameter is specified more than once for a
   single 'data' element, then a string matches the 'data' element
   only if it matches all of the patterns.

I think this means that if I have

        <rng:element name="duck" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
          <rng:data type="token">
            <rng:param name="pattern">R1</rng:param>
            <rng:param name="pattern">R2</rng:param>
          </rng:data>
        </rng:element>

then the content of <duck> must match both R1 and R2 in order to be
valid. This seems to make a lot of sense. After all, if I had wanted
a string to be a valid <duck> if it matched R1 *or* R2, I could have
written

          <rng:data type="token">
            <rng:param name="pattern">(R1)|(R2)</rng:param>
          </rng:data>
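
The difference between the two readings can be sketched in a few lines
of Python. This is only a model of the logic, not of any validator:
the R1 and R2 values below are hypothetical stand-ins, Python's `re`
syntax is not identical to the W3C Schema regular expression syntax,
and `re.fullmatch` is used on the assumption that a pattern facet must
match the whole string.

```python
import re

def matches_all(patterns, value):
    """AND reading: value must fully match every pattern."""
    return all(re.fullmatch(p, value) for p in patterns)

def matches_alternation(patterns, value):
    """OR reading: the patterns joined as (R1)|(R2)|... in one regex."""
    combined = "|".join(f"({p})" for p in patterns)
    return re.fullmatch(combined, value) is not None

# Hypothetical stand-ins for R1 and R2 (not from the original post).
R1 = r"[a-z]+"   # lowercase letters only
R2 = r".{3,5}"   # three to five characters of any kind

for candidate in ("duck", "mallards", "D-1"):
    print(candidate,
          matches_all([R1, R2], candidate),
          matches_alternation([R1, R2], candidate))
```

With these stand-ins, "duck" satisfies both readings, while "mallards"
(R1 only) and "D-1" (R2 only) satisfy only the alternation.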


But in W3C XML Schema things seem a lot less clear, although this may
be because I am close to the furthest thing there is from an expert.
I was referred to section 4.3.4.3 of the spec[2]. I had never heard
of, let alone read, 4.3.4.3 before today. But upon reading it, I have
to admit I don't quite understand what it means, and whether or not
it has any significance with respect to RelaxNG validation. (I
suspect not.)

The text of 4.3.4.3 seems problematic.

   If multiple <pattern> element information items appear as
   [children] of a <simpleType>, the [value]s should be combined as
   if they appeared in a single regular expression as separate
   branches.

First, I am under the (perhaps erroneous) impression that a <pattern>
element can not be the child of a <simpleType> element. Although
perhaps the infoset definition of "children" includes descendants? (I
don't think it does -- I had thought "appearing immediately within
the current element" meant child, not descendant.)

Second, the idea seems unhelpful. If I wanted two regular expressions
R1 and R2 to appear in a single regular expression as separate
branches, I could have just written "R1|R2", no? So my gut instinct
is that this rule isn't useful, but I may be missing something.
(E.g., perhaps this is a general idea which, although not very useful
with regular expressions, is expected to be useful with some future
structures not yet devised?)

The note attached to 4.3.4.3 says

   ... pattern facets specified on the same step in a type derivation
   are ORed together, while pattern facets specified on different
   steps of a type derivation are ANDed together.

but I have yet to really figure out what a "step" is. However,
playing around a bit with the output of `trang`[3] is potentially
very instructive.

The following is the above RelaxNG schema fragment transformed into
W3C Schema; in the one test I performed (using xmllint) it validated
as I wanted: the contents of <duck> must match both R1 and R2.

  <xs:element name="duck">
    <xs:simpleType>
      <xs:restriction>
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:pattern value="R1"/>
          </xs:restriction>
        </xs:simpleType>
        <xs:pattern value="R2"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

A minor change, as follows, caused strings matching either R1 or R2
to be considered valid.

  <xs:element name="duck">
    <xs:simpleType>
      <xs:restriction>
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:pattern value="R1"/>
            <xs:pattern value="R2"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

My instinct is that this could be simplified to

  <xs:element name="duck">
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:pattern value="R1"/>
        <xs:pattern value="R2"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

without any change to the set of documents that would be considered
valid.

Have I got any of this right?

Notes
-----
[1] Which I found at http://relaxng.org/xsd-20010907.html; it is
    linked to from the main RelaxNG home page.
[2] http://www.w3.org/TR/xmlschema-2/#src-multiple-patterns
[3] Version 20030619.



Re: multiple pattern facet conjunction

C. M. Sperberg-McQueen


On 30 Dec 2006, at 04:46 , Syd Bauman wrote:

> The text of 4.3.4.3 seems problematic.
>
>    If multiple <pattern> element information items appear as
>    [children] of a <simpleType>, the [value]s should be combined as
>    if they appeared in a single regular expression as separate
>    branches.
>
> First, I am under the (perhaps erroneous) impression that a <pattern>
> element can not be the child of a <simpleType> element.

I think that's true; Schema 1.0 had a typo ('simpleType' for
'restriction' -- not 'children' for 'descendant', though, since simple
type definitions can nest).  That may be one reason that the paragraph
in question has been deleted from the current draft of XML Schema 1.1
and the rule has been reworded.

> Second, the idea seems unhelpful. If I wanted two regular expressions
> R1 and R2 to appear in a single regular expression as separate
> branches, I could have just written "R1|R2", no?

Yes.  But not if you wished to annotate the two branches
separately, either for a human reader or for a machine.

> So my gut instinct
> is that this rule isn't useful, but I may be missing something.

It doesn't enlarge the expressive power of the language, as
regards validation, no.

> The note attached to 4.3.4.3 says
>
>    ... pattern facets specified on the same step in a type derivation
>    are ORed together, while pattern facets specified on different
>    steps of a type derivation are ANDed together.
>
> but I have yet to really figure out what a "step" is.

A step is one derivation in a derivation chain.

When one defines type T1 as a restriction of some primitive
type, and T2 as a restriction of T1, and T3 as a restriction of
T2, one has a derivation chain with three steps.  If patterns
P1 and P2 are specified as part of the definition of T1, and
P3 and P4 as part of the definition of T2 and T3 respectively,
then the lexical space of T3 contains only character
sequences which match (P1|P2) and P3 and P4.
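
As a rough illustration of that rule (a sketch only, not how any
schema processor is implemented), a derivation chain can be modelled
in Python as a list of steps, each step holding the patterns declared
at that step. The function name, the P1..P4 values, and the assumption
that a step with no pattern facets adds no constraint are all
illustrative choices here; Python's `re` syntax also differs from the
W3C Schema regular expression syntax.

```python
import re

def in_lexical_space(steps, value):
    """steps: one list of patterns per derivation step.

    Patterns within a step are ORed; the steps are ANDed.
    A step that declares no patterns is treated as adding no constraint.
    """
    return all(
        (not step) or any(re.fullmatch(p, value) for p in step)
        for step in steps
    )

# Hypothetical stand-ins for P1..P4 (not from the original post).
P1 = r"[a-z]+"   # letters only
P2 = r"[0-9]+"   # digits only
P3 = r".{2,6}"   # two to six characters
P4 = r"[^x]*"    # no letter x anywhere

# T1 declares P1 and P2, T2 declares P3, T3 declares P4.
T3 = [[P1, P2], [P3], [P4]]

for candidate in ("duck", "1234", "ab12", "fox", "mallards"):
    print(candidate, in_lexical_space(T3, candidate))
```

In the same model, the trang output with R1 and R2 on separate nested
restrictions corresponds to `[[R1], [R2]]`, and a single restriction
carrying both patterns corresponds to `[[R1, R2]]`; wrapping the
latter in a further restriction with no patterns, `[[R1, R2], []]`,
accepts exactly the same strings.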

>   <xs:element name="duck">
>     <xs:simpleType>
>       <xs:restriction>
>         <xs:simpleType>
>           <xs:restriction base="xs:token">
>             <xs:pattern value="R1"/>
>             <xs:pattern value="R2"/>
>           </xs:restriction>
>         </xs:simpleType>
>       </xs:restriction>
>     </xs:simpleType>
>   </xs:element>
>
> My instinct is that this could be simplified to
>
>   <xs:element name="duck">
>     <xs:simpleType>
>       <xs:restriction base="xs:token">
>         <xs:pattern value="R1"/>
>         <xs:pattern value="R2"/>
>       </xs:restriction>
>     </xs:simpleType>
>   </xs:element>
>
> without any change to the set of documents that would be considered
> valid.

Yes.  In the second formulation, 'duck' is a restriction of token; in
the first formulation, 'duck' is a vacuous restriction of an
anonymous type which is a restriction of token.

I hope this helps.

--C. M. Sperberg-McQueen




Re: multiple pattern facet conjunction

Syd Bauman

Thanks for the quick and helpful reply, Michael.


> I think that's true; Schema 1.0 had a typo ('simpleType' for
> 'restriction' ...)

Good to know I'm not going nuts ... I did check the errata before
posting.


> That may be one reason that the paragraph in question has been
> deleted from the current draft of XML Schema 1.1 and the rule has
> been reworded.

Ah, check. I see that the paragraph has been deleted from
http://www.w3.org/TR/xmlschema11-2/#pattern-rep-constr, but where has
the rule about handling multiple patterns been moved to?


> > Second, the idea seems unhelpful. If I wanted two regular
> > expressions R1 and R2 to appear in a single regular expression as
> > separate branches, I could have just written "R1|R2", no?

> Yes.  But not if you wished to annotate the two branches
> separately, either for a human reader or for a machine.

Good point. (Where's that /x modifier when you need it?)


> [Explanation of 'step' and of my examples.]
> I hope this helps.

Yes, it does; thanks again.


Now, can anyone verify my belief that a Relax NG validator should
require that the content of <duck> match *both* pattern R1 *and*
pattern R2 given the following schema?

datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
start =
  element test {
    element duck {
      xsd:token {
         pattern = "R1"
         pattern = "R2"
      }
    }+
  }



Re: multiple pattern facet conjunction

cowan

Syd Bauman scripsit:

> Now, can anyone verify my belief that a Relax NG validator should
> require that the content of <duck> match *both* pattern R1 *and*
> pattern R2 given the following schema?
>
> datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
> start =
>   element test {
>     element duck {
>       xsd:token {
>          pattern = "R1"
>          pattern = "R2"
>       }
>     }+
>   }

Easily.  See the non-normative, but widely accepted,
"Guidelines for using W3C XML Schema Datatypes with RELAX NG" at
http://relaxng.org/xsd-20010907.html .  Section 2 says (in part):

# If the pattern parameter is specified more than once for a single
# data element, then a string matches the data element only if it
# matches all of the patterns. It is an error to specify a parameter
# other than pattern more than once for a single data element.

The point is, of course, that the union of two patterns can be
easily expressed in RELAX NG using a choice, whereas the intersection
of two patterns cannot, since RELAX NG has no general intersection
operator.

--
LEAR: Dost thou call me fool, boy?      John Cowan
FOOL: All thy other titles              http://www.ccil.org/~cowan
             thou hast given away:      [hidden email]
      That thou wast born with.


Re: multiple pattern facet conjunction

Syd Bauman

> Easily. See the non-normative, but widely accepted, "Guidelines for
> using W3C XML Schema Datatypes with RELAX NG" at
> http://relaxng.org/xsd-20010907.html . Section 2 says (in part):

> # If the pattern parameter is specified more than once for a single
> # data element, then a string matches the data element only if it
> # matches all of the patterns.

Right, the same sentence I quoted in the original post. Thanks for
verifying that it means what I think it means, John. (I didn't think
I was going nuts, but sometimes it's good to be sure :-)