Lack of 'fatal error' tests for invalid encodings


Leif Halvard Silli-4
Per a discussion with John Cowan,[1][2] it seems reasonable to
conclude,

   FIRSTLY, that the XML test suite is lacking many relevant
            encoding tests
  SECONDLY, that there is a shortage of tests where there is
            external encoding information (read: HTTP).
   THIRDLY, as a result, many testable 'fatal error' situations
            described in XML 1.0 do not have tests.

Practical Questions:

 1) Do I send the test cases that I see as needed directly to this list?
 2) Can we create some specific HTTP tests, online?

Test Descriptions:

All the bugs/tests I have in mind tend to be related to the UTF-8 BOM.
To illustrate the kind of tests/bugs, here are some bugs in the RXP
parser:

1) HTTP bugs

* Parsers must obey the charset parameter in the Content-Type:
  header to the extent that they must ignore the BOM and the
  encoding declaration when they determine the encoding.
  Thus, if the charset parameter is incorrect, parsers should emit
  a fatal error.

  But RXP simply ignores the HTTP Content-Type: charset parameter.
  Instead RXP treats HTTP-served files as if they were files on
  the hard disk. As a result, RXP fails to emit a 'fatal error'
  if, for instance, HTTP says "ISO-8859-1" when the served document
  is UTF-8 and begins with the BOM.

  (Of course, if the Content-Type header does not have a charset
  parameter, then the charset must be determined as if the file were
  located on the hard disk.)
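
A minimal sketch of the HTTP check described above, in Python. The function
name http_bom_conflict is hypothetical and not taken from RXP or any other
parser. It follows the reading given in this post: a UTF-8 BOM combined with
a charset parameter that names some other encoding is a conflict. Treating
an unrecognised charset label as a conflict is an assumption of the sketch.

```python
import codecs
from email.message import Message


def http_bom_conflict(content_type, body):
    """Return True if body starts with a UTF-8 BOM but the Content-Type
    charset parameter names an encoding other than UTF-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased label, or None
    if charset is None:
        # No external information: the file rules apply instead.
        return False
    if not body.startswith(codecs.BOM_UTF8):
        return False
    try:
        # codecs.lookup normalises aliases such as "utf8" to "utf-8".
        return codecs.lookup(charset).name != "utf-8"
    except LookupError:
        # Assumption: an unknown label cannot be confirmed as UTF-8.
        return True
```

A test harness could call this on the response headers and the first bytes
of the response body, and then expect the parser under test to report a
fatal error whenever the function returns True.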

2) File bugs

* Parsers must emit a 'fatal error' if the BOM disagrees with the
  XML encoding declaration.

  But for a UTF-8 encoded file with the BOM but which has been
  labeled with <?xml version="1.0" encoding="ISO-8859-5"?>, RXP
  simply ignores the BOM and reports the file to be encoded as
  ISO-8859-5.
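
The file case can be sketched in Python as follows. The names ENCODING_DECL
and bom_conflicts_with_declaration are hypothetical. The regular expression
is a simplification of the XML declaration grammar and only covers the UTF-8
BOM, so it is an illustration of the check rather than a conforming
detector.

```python
import codecs
import re

# Simplified XML declaration: version first, then an encoding name
# made of the characters XML allows in an encoding name.
ENCODING_DECL = re.compile(
    rb"<\?xml\s+version\s*=\s*(?:'[^']*'|\"[^\"]*\")"
    rb"\s+encoding\s*=\s*"
    rb"(?:'([A-Za-z][A-Za-z0-9._-]*)'|\"([A-Za-z][A-Za-z0-9._-]*)\")"
)


def bom_conflicts_with_declaration(data):
    """Return True if data starts with a UTF-8 BOM but its encoding
    declaration names an encoding other than UTF-8."""
    if not data.startswith(codecs.BOM_UTF8):
        return False
    match = ENCODING_DECL.match(data[len(codecs.BOM_UTF8):])
    if match is None:
        return False  # no encoding declaration, so nothing to contradict
    name = (match.group(1) or match.group(2)).decode("ascii")
    try:
        return codecs.lookup(name).name != "utf-8"
    except LookupError:
        # Assumption: an unknown label cannot be confirmed as UTF-8.
        return True
```

For the file described above (UTF-8 BOM followed by a declaration that says
ISO-8859-5) the function returns True, which is the situation where a
'fatal error' would be expected.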

Conclusion: RXP accepts non-well-formed XML files.

As a matter of fact, errors related to the UTF-8 BOM are common. I have
so far not found a single parser which emits a 'fatal error' if there
is a UTF-8 BOM which conflicts with either the charset parameter of the
Content-Type: header or with the XML encoding declaration. The parsers
in the common Web browsers tend to obey the BOM and ignore both HTTP
and the encoding declaration, whereas RXP and the XMLmind editor
instead seem to ignore the BOM.

The errors are so common that perhaps XML 1.0 should be changed?
Effectively, that is what I have proposed in bug 12897 against the HTML5
specification. [3] If XML 1.0 were to change so that these currently
non-well-formed documents are considered well-formed, then there are two
choices: the RXP behaviour (= ignoring the BOM) or the behaviour that Web
browsers show (adhering to the BOM). The most logical thing seems to be
to change what is in XML's domain, and not to touch the BOM.

Thus, my proposal is that XML parsers MUST ignore the HTTP charset
parameter as well as the XML encoding declaration *if* the document
begins with the UTF-8 BOM. It seems to me that this will be more
fruitful than the current rules, which are broken the one way (RXP) or
the other (web browsers). There is also already a precedent for
ignoring the XML encoding declaration. Namely, it must be ignored if
HTTP says so. And finally, I believe there is a prospect for better
convergence with HTML, if the UTF-8 BOM always has priority.
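
The proposed precedence can be written down as a short Python sketch. The
function choose_encoding and its parameters are hypothetical. It covers only
the UTF-8 BOM, leaves UTF-16 detection out, and assumes the caller has
already extracted the HTTP charset parameter and the declared encoding.

```python
import codecs


def choose_encoding(body, http_charset=None, declared=None):
    """Pick an encoding label using the precedence proposed in this post:
    UTF-8 BOM first, then the HTTP charset, then the XML declaration."""
    if body.startswith(codecs.BOM_UTF8):
        # Proposal: the BOM overrides both HTTP and the declaration.
        return "utf-8"
    if http_charset:
        # Existing rule: external information overrides the declaration.
        return http_charset.lower()
    if declared:
        return declared.lower()
    # With no BOM, no external information and no declaration,
    # XML 1.0 requires the entity to be UTF-8.
    return "utf-8"
```

Under this ordering no combination of BOM, HTTP charset and declaration is
treated as a conflict. That is the point of the proposal, and it is also
what the browsers mentioned above appear to do already.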

Of course, it is not this mailing list which eventually effects this
change to XML 1.0 revision 6 ... Nevertheless it is worth keeping these
things in mind.

[1] http://lists.w3.org/Archives/Public/www-international/2011AprJun/0094
[2] http://lists.w3.org/Archives/Public/www-international/2011AprJun/0095
[3] http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897
--
Leif Halvard Silli


Re: Lack of 'fatal error' tests for invalid encodings

Henry S. Thompson
Leif Halvard Silli wrote [2 years ago]:

> Per a discussion with John Cowan,[1][2] it seems reasonable to
> conclude,
>
>    FIRSTLY, that the XML test suite is lacking many relevant
>             encoding tests
>   SECONDLY, that there is a shortage of tests where there is
>             external encoding information (read: HTTP).
>    THIRDLY, as a result, many testable 'fatal error' situations
>             described in XML 1.0 do not have tests.
>
> Practical Questions:
>
>  1) Do I send the test cases that I see as needed directly to this list?

Sure, but see below.

>  2) Can we create some specific HTTP tests, online?

Maybe, but not until 3023bis [1] is settled.


> * Parsers must emit a 'fatal error' if the BOM disagrees with the
>   XML encoding declaration.
>
>   But for a UTF-8 encoded file with the BOM but which has been
>   labeled with <?xml version="1.0" encoding="ISO-8859-5"?>, RXP
>   simply ignores the BOM and reports the file to be encoded as
>   ISO-8859-5.

I've added a test for this, and a couple for similar examples wrt
UTF-16.

They'll be in the new release appearing shortly.

Thanks for your input (and patience :-),

ht
--
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: [hidden email]
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
