UTF-8 Errors on file upload, not by URI

Kessler,Nathan

I'm trying to validate http://worldcat.org. If I run a scan by URI or by direct input, the scan runs as expected. However, when the HTML source is saved in a file and uploaded, this error is reported on line 651:


"The error was: utf8 "\xED" does not map to Unicode" and the scan doesn't run.


The specific character in question: http://www.fileformat.info/info/unicode/char/ed/index.htm


If this character is removed, the scan then fails on the "ó" in "traducción", so the problem is not limited to the character above. The encoding of the page is UTF-8 and it is saved as UTF-8 before being uploaded. The scan works when the encoding is set to UTF-16, but not when it reads UTF-8 from the HTML.


Can anyone provide any advice here? We have an automated system that downloads web pages and runs them against our local validator via a file upload and this page won't scan due to this error. Is the encoding set improperly on the web page? Am I missing something else here?


Thanks for your time and work,

Nathan Kessler

Re: UTF-8 Errors on file upload, not by URI

Jukka K. Korpela
2014-09-09 22:40, Kessler,Nathan wrote:

> I'm trying to validate http://worldcat.org. If I
> run a scan by URI or by direct input, the scan runs as expected.

The validator reports 53 errors and 22 warnings.

> However, when the HTML source is saved in a file and uploaded, this
> error is reported on line 651:

I saved the page, using Firefox, and tried validation by file upload.
There is no error reported on line 651 or anywhere close to it.

> "The error was: utf8 "\xED" does not map to Unicode" and the scan
> doesn't run.

This sounds like you saved the page somehow in windows-1252 encoding and
are then trying to validate it as utf-8 encoded.
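
To illustrate the mismatch, here is a minimal Python sketch. It assumes nothing about the validator's internals; the helper name decodes_as_utf8 is made up for the example. It only shows that windows-1252 stores "í" as the single byte 0xED, which is not valid UTF-8 on its own, whereas UTF-8 stores it as two bytes.

```python
# "í" is U+00ED and "ó" is U+00F3; escapes are used so the example
# does not depend on how this snippet itself is saved.
I_ACUTE = "\u00ed"
O_ACUTE = "\u00f3"

cp1252_bytes = I_ACUTE.encode("cp1252")  # single byte 0xED
utf8_bytes = I_ACUTE.encode("utf-8")     # two bytes 0xC3 0xAD


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The windows-1252 form of "traducción" contains the lone byte 0xF3,
# which is rejected in the same way as the lone 0xED.
word_cp1252 = ("traducci" + O_ACUTE + "n").encode("cp1252")

print(cp1252_bytes, decodes_as_utf8(cp1252_bytes))
print(utf8_bytes, decodes_as_utf8(utf8_bytes))
print(word_cp1252, decodes_as_utf8(word_cp1252))
```

If the saved file contains a bare 0xED byte where the page has "í", the file is not UTF-8, whatever the page declares.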

> We have an automated system that
> downloads web pages and runs them against our local validator via a file
> upload and this page won't scan due to this error. Is the encoding set
> improperly on the web page?

The encoding of the page appears to be properly declared in an HTTP
header. So the problem is with the automated system. It seems to change
the encoding or otherwise mess up the character data.
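
One way to check what the automated system actually wrote to disk is to look for the first byte sequence that is not valid UTF-8. The following is only a sketch under assumptions: the function name first_invalid_utf8 is hypothetical, and the file would have to be read in binary mode so that Python does not transcode it.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int, int]]:
    """Locate the first byte that breaks UTF-8 decoding.

    Returns (line_number, byte_offset, byte_value), or None when the
    data is well-formed UTF-8. Line numbers are 1-based and counted
    by newline bytes before the offending offset.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        offset = err.start
        line_number = data.count(b"\n", 0, offset) + 1
        return line_number, offset, data[offset]
    return None


# Sample input: the second line holds a windows-1252 "í" (byte 0xED).
sample = b"ok\nl\xednea\n"
print(first_invalid_utf8(sample))

# For a saved page, read the raw bytes first, for example:
#     with open("saved_page.html", "rb") as f:   # hypothetical file name
#         print(first_invalid_utf8(f.read()))
```

If this reports a problem on the same line as the validator does, the bytes were already changed before upload, which points at the download or save step rather than at the page.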

Yucca