Re: Is there a tool which tells me if my XML is "fully normalized"?

Paul Grosso

> -------- Original Message --------
> Subject: Is there a tool which tells me if my XML is "fully normalized"?
> Resent-Date: Sat, 16 Feb 2013 22:57:06 +0000
> Resent-From: [hidden email]
> Date: Sat, 16 Feb 2013 22:56:36 +0000
> From: Costello, Roger L. <[hidden email]>
> To: [hidden email] <[hidden email]>
> Hi Folks,

Hi Roger,

By way of generalities:

* As you know, the Character Model spec [1] defines and discusses
fully-normalized text.

* The XML specifications mostly define what XML *processors* should
and must do, and only occasionally suggest what XML *applications*
should (but never must) do. I've tried to use these terms precisely
in this response.

* XML 1.0 doesn't say anything about Unicode normalization (the word
"normalization" in XML 1.0 refers to attribute value normalization,
which has nothing to do with Unicode normalization).

* XML 1.1 says [2] that the relevant constructs of all XML input
should be fully normalized, and it lists the relevant constructs
as those constructs in an XML document containing character data
plus the constructs containing Names and Nmtokens. Note that this
implies that markup is recognized before considering the normalization
of the character content, so things like combining characters do not
combine with markup characters as far as XML processors are concerned.

It does also say that:

    XML processors SHOULD provide a user option to verify that the
    document being processed is in fully normalized form, and report
    to the application whether it is or not.

The only processor of which we are aware that currently provides
such a user option is the RXP processor (more detail below).

Finally, it says that:

    XML processors MUST NOT transform the input to be in fully
    normalized form. XML applications that create XML 1.1 output
    from either XML 1.1 or XML 1.0 input SHOULD ensure that the
    output is fully normalized....

[1] http://www.w3.org/TR/charmod-norm/#sec-FullyNormalized
[2] http://www.w3.org/TR/xml11/#sec-normalization-checking

> 1. Is there a tool which evaluates an XML document and returns an
> indication of whether it is fully normalized or not?

The RXP processor [3] (Unix man page at [4]) can optionally
check whether an XML 1.1 document is fully normalized. It
has a -U flag that controls Unicode normalization checking,
but this flag is only relevant when parsing XML 1.1 documents.
If its value is 0, no checking is done. If it is 1, RXP checks
that the document is fully normalized as defined by the
W3C character model. If it is 2, the document is checked
and any unknown characters (which may be ones corresponding
to a newer version of Unicode than RXP knows about) will
also cause an error.

Google found at [5] a mention of a project to add normalization
checking to Xerces, but I could not find any definitive evidence
that such a project was completed.

At [6], the CharMod spec lists some "freely available programming
resources related to normalization".

[3] http://www.cogsci.ed.ac.uk/~richard/rxp.html
[4] http://www.cogsci.ed.ac.uk/~richard/rxp.txt
[6] http://www.w3.org/TR/charmod-norm/#sec-n11n-resources
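
Where RXP is not at hand, a very rough check of a single piece of
text can be sketched in Python. This is only an approximation under
stated assumptions: the function name is hypothetical, a "composing
character" is approximated by a nonzero canonical combining class,
character references are not expanded, and the caller is assumed to
have already extracted the relevant construct as a string. It is not
a substitute for a processor that implements the XML 1.1 check.

```python
import unicodedata


def looks_fully_normalized(construct: str) -> bool:
    """Rough approximation only; see the assumptions stated above."""
    # The text must already be in Unicode Normalization Form C.
    if unicodedata.normalize("NFC", construct) != construct:
        return False
    # The construct must not start with a combining character
    # (approximated here by a nonzero canonical combining class).
    if construct and unicodedata.combining(construct[0]) != 0:
        return False
    return True


print(looks_fully_normalized("caf\u00e9"))   # precomposed e-acute
print(looks_fully_normalized("cafe\u0301"))  # e followed by combining acute
print(looks_fully_normalized("\u0338"))      # lone combining solidus overlay
```

Only the first call is expected to report True; the other two fail
the NFC test and the leading-character test respectively.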

> 2. This element:
> <comment>&#x338;</comment>
> is not fully normalized, right? (Since the content of the <comment>
> element begins with a combining character and "content" is defined
> to be a "relevant construct.") Note: hex 338 is the combining solidus
> overlay character.

That element is fully normalized--see below.

> 3. Section 2.13 of the XML 1.1 specification says:
> XML applications that create XML 1.1 output from either XML 1.1 or
> XML 1.0 input SHOULD ensure that the output is fully normalized
> What should an XML application output, given this non-fully-normalized
> input:
> <comment>&#x0338;</comment>
> How does an XML application "ensure that the output is fully normalized"?

An application that produces

<comment>&#x0338;</comment>

has produced fully normalized output. There's nothing that isn't
Unicode normalized about that sequence of 27 characters.

An application that produced

<comment>X</comment>

where "X" is a single U0338 character would not be producing
normalized output.
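
The difference between the two serializations can be seen by looking
at them purely as character sequences, before any XML parsing. The
following Python sketch is illustrative only; the variable and helper
names are made up for the example, and NFC is used as a stand-in for
the Unicode part of "fully normalized".

```python
import unicodedata

escaped = "<comment>&#x0338;</comment>"  # character reference, plain ASCII
literal = "<comment>\u0338</comment>"    # a literal U+0338 after the start-tag


def is_nfc(text: str) -> bool:
    # True when NFC normalization would leave the text unchanged.
    return unicodedata.normalize("NFC", text) == text


print(len(escaped), escaped.isascii(), is_nfc(escaped))
print(is_nfc(literal))
```

The first string is 27 ASCII characters, so normalization has nothing
to change. The second is not in NFC, because the literal U+0338
directly follows a ">" with which it can compose.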

Note that the above quote from section 2.13 of XML 1.1 is talking
about applications that create XML. In your question, you are
asking what an application (that presumably will output XML) should
do when given (presumably XML) input that is not fully normalized.
So the application that produced the original non-normalized XML
did something it "shouldn't" have done, and your question is what
"should" the downstream application do about that.

No XML specification says anything about that, so the downstream
application is free to do as it wishes. This is just like an XML
editor that may adjust white space within character data or emit
double quotes around attribute values where the input may have had
single quotes, etc.

> 4. If the combining solidus overlay character follows a greater-than
> character in element content:
> <comment> &gt;&#x0338; </comment>
> then normalizing XML applications will combine them to create the
> not-greater-than character:
> <comment> ≯ </comment>

As mentioned above, the input you show is normalized, so there
are really two questions here:

4a. What should an application do with:

<comment> &gt;&#x0338; </comment>

4b. What should an application do with:

<comment> &gt;X </comment>

where X is the single U0338 character.

4a isn't a normalization issue; 4b is. But as discussed under 3
above, an application is free to do anything reasonable with either
of those inputs.

Given 4a, we have found XML applications (e.g., Saxon) that produce:
<comment> &gt;/ </comment>
as well as those (e.g., MarkLogic, Arbortext Editor) that produce:
<comment> ≯ </comment>

Similarly, given text input of "e&acute;", some XML editors
write out é while others leave it as e´ (an e followed by the
individual acute character). (Arbortext Editor has an option
setting to get either behavior.)

All such behaviors are allowable.
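
To make the two kinds of behavior concrete, here is a small Python
sketch of a hypothetical application that parses the 4a input and
writes it back out, once as parsed and once after normalizing the
character data to NFC. It uses the standard library's ElementTree
(an XML 1.0 parser) purely for illustration and does not claim to
reproduce what any of the products named above do internally.

```python
import unicodedata
import xml.etree.ElementTree as ET

source = "<comment> &gt;&#x0338; </comment>"
elem = ET.fromstring(source)

# After parsing, the references are expanded: '>' followed by U+0338.
as_parsed = ET.tostring(elem, encoding="unicode")

# A hypothetical application that normalizes character data before output.
elem.text = unicodedata.normalize("NFC", elem.text)
normalized = ET.tostring(elem, encoding="unicode")

print(as_parsed)
print(normalized)
```

The first output keeps "&gt;" followed by a separate U+0338; the
second contains the single character U+226F (not greater-than).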

> However, if the combining solidus overlay character follows a
> greater-than
> character that is part of a start-tag:
> <comment>&#x0338;</comment>
> then normalizing XML applications do not combine them:
> <comment>/</comment>
> There must be some W3C document which says, "The long solidus combining
> character shall not combine with the '>' in a start tag but it shall
> combine with the '>' if it is located elsewhere."

Again, there are two questions:

4c. What should an application do with:

<comment>&#x0338;</comment>

4d. What should an application do with:

<comment>X</comment>

where X is the single U0338 character.

In the 4c case as you show above, there is no normalization issue.
Recognizing markup boundaries takes place before--or, at the very
latest, at the same time as--entity expansion. So there is no
">" in front of the &#x0338; when the entity is expanded.

In the 4d case, there is a normalization issue. But an XML
processor MUST NOT normalize its input, so when an XML processor
is handed 4d as input, it will recognize markup boundaries as
usual so that the comment element will end up with character
data content consisting of the single U0338 character which
will have nothing with which to combine.
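
As an illustration of both cases, the sketch below feeds each form to
Python's standard ElementTree parser and inspects the resulting
character data. ElementTree is an XML 1.0 parser and performs no
normalization checking, so this only demonstrates that markup is
recognized first; the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

with_reference = "<comment>&#x0338;</comment>"  # case 4c
with_literal = "<comment>\u0338</comment>"      # case 4d

# Parse both forms and collect the element content the parser reports.
contents = [ET.fromstring(src).text for src in (with_reference, with_literal)]

for text in contents:
    print(len(text), f"U+{ord(text):04X}")
```

In both cases the element content is the single character U+0338;
the ">" of the start-tag is consumed as markup and is never part of
the character data.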

Note the previous paragraph talked about XML processors. In
theory, an XML application could have a lexicographic layer--that
preceded the parsing by the XML processor--in which normalization
was done. In this case, the U0338 character would presumably be
combined with the > resulting in

<comment≯</comment>

which would not be well-formed XML and would therefore
presumably be rejected by the XML processor. While there is no
W3C specification that forbids such behavior by an XML application,
one would expect users of such an application to file bug reports
or stop using such an application.
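
A short Python sketch of that hypothetical pre-parse normalization
layer shows the failure. The helper name is made up for the example,
and ElementTree (an XML 1.0 parser) stands in for "the XML processor".

```python
import unicodedata
import xml.etree.ElementTree as ET

raw = "<comment>\u0338</comment>"
# Hypothetical lexical layer: normalize the whole document before parsing.
pre_normalized = unicodedata.normalize("NFC", raw)


def is_well_formed(source: str) -> bool:
    # Report whether the parser accepts the text as well-formed XML.
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(raw))
print(pre_normalized == "<comment\u226f</comment>")
print(is_well_formed(pre_normalized))
```

The raw text parses, but after normalization the ">" that closed the
start-tag has been absorbed into U+226F and the parser rejects the
result.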