'Leading White Space' Topic


Steve Fogoros
On 27 June, 2008, I wrote to [hidden email] regarding the XML Recommendation's (V1.0, Editions 2-5) description of how leading white space is defined in well-formed documents. I contend that the recommendation allows leading white space; that is, white space before the prolog. Yet many implementations fail to consider an XML document with leading white space as well-formed, and claim productions [22], [23], and [1] completely describe their implementation [while also relying on the non-normative section F]. Section 2.4 clearly describes any white space outside the document entity as markup, and as allowed.
 
There has been no activity on this topic to clarify leading white space in well-formed XML documents as allowed or prohibited. I think namespace has everybody's time and attention at the moment.
 
Since there doesn't appear to be any interest in revising the recommendation regarding leading white space, and many current implementations follow the non-normative description, under which a well-formed document must have no leading white space, I would like to discuss this here to explore whether the test cases match the recommendation regarding leading white space outside the document entity. I believe they don't adequately test for this and should contain test cases where leading white space outside the document entity validates as well-formed.
 
Here is the text of my email on 27 June, 2008:
 
>> Subject: XML Recommendation Inconsistencies Regarding Leading White Space in Well-Formed Documents
>>
>> There appears to be some difficulty interpreting the Recommendation's
>> specification regarding leading white space that occurs prior to the xml
>> declaration as being prohibited or well-formed. Researching the Internet
>> indicates that leading white space is a frequent error at the
>> application level. In discussions on the expat mailing list, it is claimed
>> that expat, for example, is following the XML recommendation as specified
>> regarding leading white space in that it is not allowed. Typically,
>> productions [22] prolog, and [23] XMLDecl, are cited as the formal
>> specification that prohibits leading white space.
>> 
>> On reviewing the latest XML recommendation (Fifth Edition), I found
>> this to be not true. Section 2.4 (as far back as the Second Edition) is
>> very clear that any white space at the top level of the document entity
>> can exist in a well-formed xml document. I found other sections that
>> support this. If this email leads to further discussions, I will be
>> happy to enumerate in detail.
>>  
>> I did find one reference in Section F Autodetection of Character
>> Encodings (Non-Normative), that stated '... the XML encoding declaration
>> is restricted in position and content in order ...', but nowhere else in
>> the recommendation exists such a restriction, except in Section F.1
>> Detection Without External Encoding Information, where it states,
>> 'Because each XML entity not accompanied by external encoding
>> information and not in UTF-8 or UTF-16 encoding must begin with an XML
>> encoding declaration, in which the first characters must be '<?xml',
>> ....'. As this is a Non-Normative exception case, I don't interpret it as
>> a restriction in position and content of the normative case.
>>  
>> Depending on the intent of the recommendation regarding leading white
>> space being prohibited or well-formed, I would like to contribute
>> suggestions that make this more concise. 
Thank you for considering this topic,
 
Steve Fogoros
Manager of Academic Systems and Programming
Academic Information Services
University of North Texas Health Science Center


** Confidentiality Notice: This e-mail and any files transmitted with it are confidential to the extent permitted by law and intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the originator of the message and destroy all copies. **

Re: 'Leading White Space' Topic

Steve Fogoros
I so much want to agree, and I wish the recommendation to be concise on this. I'm reading XML 1.0, Fifth Edition, Section 2.4. Here is a cut/paste of the first paragraph:
 
Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]

It says that '... any white space that is at the top level of the document entity (that is, outside the document element and ...' is markup and is allowed.
 
Production [1] defines the document element as document ::= prolog element Misc*
 
I understand this to mean that 'any white space' outside the document element includes any white space before the prolog. How could this be interpreted any other way?
 
Steve Fogoros

>>> Daniel Veillard <[hidden email]> 9/17/2009 2:43 PM >>>
On Thu, Sep 17, 2009 at 01:42:06PM -0500, Steve Fogoros wrote:
> On 27 June, 2008, I wrote to [hidden email] regarding XML
> Recommendation (V1.0, Editions 2-5) description of how leading white
> space is defined in well-formed documents. I contend that the
> recommendation allows leading white space; that is white space before
> the prolog.

Where did you read that? Which section, which paragraph?

> Yet, many implementations fail to consider an XML document
> with leading white space as well-formed, and claim productions [22],
> [23], and [1] completely describe their implementation [while also
> relying on the non-normative section F]. Section 2.4 clearly describes
> any white space outside the document entity as markup and is allowed.

  Production [1] defines what a document is, and leading white space
before the XMLDecl is forbidden (an optional BOM is not really
white space but an encoding indication). If you have no XMLDecl you
can stack all the white space you want, as it is consumed by [27]
Misc, but that's definitely *in* the prolog, not before.

  If you have spaces before your document, you will have to discard them
outside of the parsing process. I totally agree with the expat devel
on this. Any other interpretation is clearly contradicting the spec.
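
A minimal sketch of both points, assuming Python's standard xml.etree.ElementTree module (which is built on expat). The helper names is_well_formed and strip_leading_whitespace are hypothetical, and the byte-level strip assumes an ASCII-compatible encoding such as UTF-8 with no Byte Order Mark:

```python
import xml.etree.ElementTree as ET

# The four characters matched by production [3] S, as bytes.
XML_WHITESPACE = b" \t\r\n"


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


def strip_leading_whitespace(data: bytes) -> bytes:
    """Discard leading white space outside of the parsing process.

    Assumption: the input is in an ASCII-compatible encoding (e.g. UTF-8)
    and does not start with a Byte Order Mark.
    """
    return data.lstrip(XML_WHITESPACE)


# White space before an XML declaration: not derivable from the grammar.
with_decl = b'  \n<?xml version="1.0"?><doc/>'
# White space with no XML declaration: consumed by Misc inside the prolog.
without_decl = b'  \n<doc/>'

print(is_well_formed(with_decl))
print(is_well_formed(without_decl))
print(is_well_formed(strip_leading_whitespace(with_decl)))
```

With an expat-based parser the first document should be rejected and the other two accepted, which matches the reading of productions [1], [22] and [27] given above.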

Daniel

--
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[hidden email]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/




Re: 'Leading White Space' Topic

Steve Fogoros
In reply to this post by Steve Fogoros
Daniel,

Thank you for taking the time to understand my question and explain what
was wrong with my interpretation. I believed the authors used entity and
element interchangeably, possibly due to different authors at different
times. I feel much better now that I've been shown that the
recommendation and implementations are consistent and well written.

It is clear to me now that white space at the document entity level
means before and/or after the document element which always comes after
the prolog, if it exists.

Thanks again,

Steve Fogoros

>>> Daniel Veillard <[hidden email]> 09/17/09 3:52 PM >>>
On Thu, Sep 17, 2009 at 03:32:56PM -0500, Steve Fogoros wrote:
> I so much want to agree, and I wish the recommendation to be concise on
> this. I'm reading XML 1.0, Fifth Edition, Section 2.4. Here is a
> cut/paste of the first paragraph:
>  
> Text consists of intermingled character data and markup. [Definition:
> Markup takes the form of start-tags, end-tags, empty-element tags,
> entity references, character references, comments, CDATA section
> delimiters, document type declarations, processing instructions, XML
> declarations, text declarations, and any white space that is at the top
> level of the document entity (that is, outside the document element and
> not inside any other markup).]

This is about white space at the top level of the document entity, as
in not within the subtree of the top-level element, i.e. the Misc
production called after the top-level element in [1] and as part of
[22] prolog.

> It says that '... any white space that is at the top level of the
> document entity (that is, outside the document element and ...' is
> markup and is allowed.

  Yes, those spaces exist; the fact that you're assuming they can come
before the prolog is where you get this wrong.

> Production [1] defines the document element as document ::= prolog
> element Misc*
>
> I understand this to mean that 'any white space' outside the document
> element includes any white space before the prolog. How could this be
> interpreted any other way?

  By the obvious fact that the document element is production [39]
and that prolog allows white space as part of its Misc derivation, but
only in certain places.
  The Backus-Naur grammar is normative; it defines what an XML
document can be. Leading spaces can be consumed by that grammar
only if there is no XMLDecl.
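
The derivation argument can be sketched in code. This is only a rough approximation, assuming heavily simplified regular expressions in place of productions [23] XMLDecl, [15] Comment, [16] PI and [27] Misc. All names are hypothetical, and it inspects only the start of the document, not the whole of production [1]:

```python
import re

# Simplified stand-ins for the real productions (assumptions, not the full grammar).
XMLDECL = r"<\?xml[ \t\r\n][^?]*\?>"              # [23] XMLDecl, loosely
COMMENT = r"<!--(?:[^-]|-[^-])*-->"                # [15] Comment
# [16] PI: the target must not be the reserved name "xml" (any case).
PI = r"<\?(?![Xx][Mm][Ll][ \t\r\n?])[^?]*(?:\?(?!>)[^?]*)*\?>"
# [27] Misc: one white space character, a comment, or a PI.
MISC = rf"(?:[ \t\r\n]|{COMMENT}|{PI})"

# [22] prolog, without the doctype part: XMLDecl? Misc*, followed by the
# start of an element or a doctype declaration.
PROLOG_START = re.compile(rf"(?:{XMLDECL})?{MISC}*(?=<[A-Za-z_:!])")


def prolog_start_is_derivable(text: str) -> bool:
    """True if the start of `text` can be derived as XMLDecl? Misc*."""
    return PROLOG_START.match(text) is not None


print(prolog_start_is_derivable('<?xml version="1.0"?>\n<doc/>'))
print(prolog_start_is_derivable('  \n<doc/>'))
print(prolog_start_is_derivable(' <?xml version="1.0"?><doc/>'))
```

The XMLDecl is optional, but it can match only at offset 0. Once white space has been consumed by Misc, a following `<?xml ...?>` cannot be derived as an XMLDecl, and it cannot be derived as a PI either because the target name is reserved.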

  There is no confusion about this in implementations; there is actually
a test in the test suite making sure that implementations reject this:

  ./sun/not-wf/sgml02.xml

The case is clear and unambiguous, with the exception of a leading Byte Order
Mark allowed by some encodings. In any case a parser accepting leading
space is just not conformant, i.e. not an XML parser.

Daniel

--
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[hidden email]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/





Re: 'Leading White Space' Topic

Henry S. Thompson
Steve Fogoros wrote [4 years ago]:

> It is clear to me now that white space at the document entity level
> means before and/or after the document element which always comes after
> the prolog, if it exists.

For the record, tests not-wf-sa-147 and not-wf-sa-148 test this.

ht
--
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: [hidden email]
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
