Fwd: [pedantic-web] Encoding issues when dereferencing "formats:" URIs



Damian Steer
Forwarded from the pedantic-web list.

Initially this was (erroneously) reported as an issue with ARP and UTF-8 BOMs, but there's no BOM involved and ARP has never had an issue with BOMs.

It seems that validating (all?) RDF files under www.w3.org results in errors of the form:

"An attempt to load the RDF from URI 'http://www.w3.org/ns/formats/data/RDF_XML' failed. (Undecodable data when reading URI at byte 0 using encoding 'UTF-8'. Please check encoding and encoding declaration of your document.)"

But the byte value may vary, e.g. 24574 for http://www.w3.org/ns/ma-ont.rdf.

I understand that the same file (RDF_XML) validated without issue when copied to a remote server.

The code in question is presumably:

        try {// read whole file as characters
            int c;
            while ((c = isr.read()) != -1) {
                sb.append((char)c);
                bytenum++;
            }
        }
        catch (IOException e){
            throw new getRDFException("Undecodable data when reading URI at byte "+bytenum+" using encoding '"+finalCharset+"'."+" Please check encoding and encoding declaration of your document.");
        }

<http://dev.w3.org/cvsweb/2006/RDFValidator/WEB-INF/src/org/w3c/rdfvalidator/ARPServlet.java?rev=1.6>

So the issue may not be an encoding problem at all: the same message is reported for any IOException, whatever its cause.
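
To illustrate the point about error reporting, here is a minimal sketch of how a read loop could tell a genuine decoding failure apart from any other I/O failure. It is not the validator's code: the names ReadDiagnostics, ReadFailure, and readAll are hypothetical, and ReadFailure merely stands in for the servlet's getRDFException. The sketch assumes a decoder configured with CodingErrorAction.REPORT, because a reader built from just a stream and a charset name substitutes replacement characters for bad bytes rather than throwing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of ARPServlet.
final class ReadDiagnostics {

    /** Hypothetical stand-in for the servlet's getRDFException. */
    static final class ReadFailure extends Exception {
        ReadFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Reads the whole stream as characters, keeping the real cause on failure. */
    static String readAll(InputStream in, Charset charset) throws ReadFailure {
        // REPORT makes the decoder throw on malformed or unmappable input
        // instead of silently replacing it.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        // Counts decoded characters, not bytes. Because the reader buffers
        // internally, this may lag behind the position of the actual error.
        long charsRead = 0;
        try (Reader reader = new InputStreamReader(in, decoder)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
                charsRead++;
            }
        } catch (CharacterCodingException e) {
            // Only this branch is really an encoding problem.
            throw new ReadFailure("Undecodable data after " + charsRead
                    + " characters using encoding '" + charset.name() + "'", e);
        } catch (IOException e) {
            // Network resets, truncated responses, timeouts and so on.
            throw new ReadFailure("I/O error after " + charsRead
                    + " characters: " + e, e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bad = {'o', 'k', (byte) 0xFF}; // 0xFF never occurs in valid UTF-8
        try {
            readAll(new ByteArrayInputStream(bad), StandardCharsets.UTF_8);
        } catch (ReadFailure e) {
            System.out.println(e.getMessage()
                    + " [cause: " + e.getCause().getClass().getName() + "]");
        }
    }
}
```

With the cause attached, a log entry or error page could show whether the failure was a decoding error or something like a reset connection, which is the distinction the current message hides.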

Thanks for your help,

Damian Steer

Begin forwarded message:

> From: Damian Steer <[hidden email]>
> Subject: Re: [pedantic-web] Encoding issues when dereferencing "formats:" URIs
> Date: 25 April 2012 16:07:07 GMT+01:00
> To: [hidden email]
> Reply-To: [hidden email]
>
> On 25/04/12 15:49, Andreas Radinger wrote:
>> Hi,
>>
>> I don't think any of these files (neither .rdf nor .ttl) have a BOM at
>> the beginning of the file.
>> http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf
>>
>> The W3C RDF Validator has also no bug in dealing with RDF/XML files that
>> have a BOM.
>
> +1.
>
> I tried another file under ns/:
>
> <http://www.w3.org/ns/ma-ont.rdf>
>
> => "Undecodable data when reading URI at byte 24574 using encoding 'UTF-8'."
>
> And then the rdf namespace:
>
> => "... byte 0 ..."
>
> But <http://people.w3.org/simon/foaf.rdf> was fine.
>
> Hypothesis: validating rdf under the www.w3.org domain is broken.
>
> It may be unrelated to encoding. The error is triggered by any
> IOException reading characters from an input stream reader.
>
> Damian



RDF Validator issue when validating www.w3.org URLs (was: Re: [pedantic-web] Encoding issues when dereferencing "formats:" URIs)

Richard Cyganiak-2
Dear RDF Validator team,

The issue that Damian reports looks like some bizarre caching/networking/load-balancer thing in the W3C infrastructure.

1. It has nothing to do with RDF, encoding, or BOMs. It affects all URLs (incl. non-RDF files) from certain domains. The reported byte value depends on the size of the file and is always somewhere within the last couple of kilobytes.

2. The domains that don't work are www.w3.org, dev.w3.org, and all the various aliases for www.w3.org such as web4.w3.org, web5.w3.org, www-mit.w3.org, hans.w3.org, and ipv4.w3.org. I couldn't find any non-w3.org domains that exhibit the problem, but that doesn't mean they don't exist. Other w3.org subdomains like people.w3.org work fine.

3. Curiously, for all the alias subdomains listed above, I was able to validate *the first* URL successfully. After that, re-validating the same URL, or any other URL from the same domain, results in the usual error.

4. Furthermore, there is *one* URL on www.w3.org that can always be successfully validated, and that's the URL of the validator servlet itself: http://www.w3.org/RDF/Validator/ARPServlet . This is probably related to the fact that the ARPServlet doesn't run on the main W3C webserver(s), but on a separate machine at http://smithers.w3.org/servlet/ARPServlet that seems to be reverse-proxied into the www.w3.org URL space.

5. I ran the troublesome function ARPServlet.getRDFFromURI() locally on my own machine, and it can read from all the affected URLs just fine. (Assuming the link that Damian posted is the right version of ARPServlet.) There's nothing suspicious in the code; it all looks very normal and just uses standard Java APIs, with no potentially buggy libraries or anything. So I doubt the answer is in there. (It should have better error reporting for the IOException, as Damian pointed out; seeing what the actual exception is *might* reveal a clue.)
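
The observations above (failures near the end of the file, the same code working from another machine) would be consistent with responses being cut short somewhere between the web servers and the validator host. A rough way to check from the affected machine is sketched below. It is only a sketch: TruncationProbe, Report, and drain are hypothetical names, and it assumes the server sends a Content-Length header to compare against. It reads raw bytes, so no character decoding is involved at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical diagnostic for illustration; not part of the validator.
final class TruncationProbe {

    /** Outcome of draining one response body. */
    static final class Report {
        final long bytesRead;
        final long expectedLength; // -1 if the server gave no Content-Length
        final IOException failure; // null if end of stream was reached cleanly

        Report(long bytesRead, long expectedLength, IOException failure) {
            this.bytesRead = bytesRead;
            this.expectedLength = expectedLength;
            this.failure = failure;
        }

        /** True if the read failed or stopped short of the declared length. */
        boolean looksTruncated() {
            return failure != null
                    || (expectedLength >= 0 && bytesRead < expectedLength);
        }

        @Override
        public String toString() {
            return "read " + bytesRead + " of " + expectedLength + " bytes"
                    + (failure == null ? "" : ", failed with " + failure);
        }
    }

    /** Reads raw bytes until end of stream or the first IOException. */
    static Report drain(InputStream in, long expectedLength) {
        long bytesRead = 0;
        byte[] buffer = new byte[4096];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytesRead += n;
            }
            return new Report(bytesRead, expectedLength, null);
        } catch (IOException e) {
            // Keep the exception so its class and message can be inspected.
            return new Report(bytesRead, expectedLength, e);
        }
    }

    /** Usage: java TruncationProbe.java URL [URL ...] */
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(arg).openConnection();
            try (InputStream in = conn.getInputStream()) {
                System.out.println(arg + " -> "
                        + drain(in, conn.getContentLengthLong()));
            } finally {
                conn.disconnect();
            }
        }
    }
}
```

Passing the same URL more than once on the command line would also show whether the first fetch behaves differently from later ones, as described in point 3.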

Now I hope someone at W3C can make sense of this!

Best,
Richard



On 26 Apr 2012, at 16:33, Damian Steer wrote:

> Forwarded from the pedantic-web list.
>
> Initially this was (erroneously) reported as an issue with ARP and UTF-8 BOMs, but there's no BOM involved and ARP has never had an issue with BOMs.
>
> It seems that validating (all?) RDF files under www.w3.org results in errors of the form:
>
> "An attempt to load the RDF from URI 'http://www.w3.org/ns/formats/data/RDF_XML' failed. (Undecodable data when reading URI at byte 0 using encoding 'UTF-8'. Please check encoding and encoding declaration of your document.)"
>
> But the byte value may vary, e.g. 24574 for http://www.w3.org/ns/ma-ont.rdf.
>
> I understand that the same file (RDF_XML) validated without issue when copied to a remote server.
>
> The code in question is presumably:
>
> try {// read whole file as characters
>     int c;
>     while ((c = isr.read()) != -1) {
>         sb.append((char)c);
>         bytenum++;
>     }
> }
> catch (IOException e){
>     throw new getRDFException("Undecodable data when reading URI at byte "+bytenum+" using encoding '"+finalCharset+"'."+" Please check encoding and encoding declaration of your document.");
> }
>
> <http://dev.w3.org/cvsweb/2006/RDFValidator/WEB-INF/src/org/w3c/rdfvalidator/ARPServlet.java?rev=1.6>
>
> So the issue may not be an encoding problem at all: the same message is reported for any IOException, whatever its cause.
>
> Thanks for your help,
>
> Damian Steer
>
> Begin forwarded message:
>
>> From: Damian Steer <[hidden email]>
>> Subject: Re: [pedantic-web] Encoding issues when dereferencing "formats:" URIs
>> Date: 25 April 2012 16:07:07 GMT+01:00
>> To: [hidden email]
>> Reply-To: [hidden email]
>>
>> On 25/04/12 15:49, Andreas Radinger wrote:
>>> Hi,
>>>
>>> I don't think any of these files (neither .rdf nor .ttl) have a BOM at
>>> the beginning of the file.
>>> http://people.w3.org/rishida/utils/bomtester/index.php?filename=http%3A%2F%2Fwww.w3.org%2Fns%2Fformats%2Fdata%2FRDF_XML.rdf
>>>
>>> The W3C RDF Validator has also no bug in dealing with RDF/XML files that
>>> have a BOM.
>>
>> +1.
>>
>> I tried another file under ns/:
>>
>> <http://www.w3.org/ns/ma-ont.rdf>
>>
>> => "Undecodable data when reading URI at byte 24574 using encoding 'UTF-8'."
>>
>> And then the rdf namespace:
>>
>> => "... byte 0 ..."
>>
>> But <http://people.w3.org/simon/foaf.rdf> was fine.
>>
>> Hypothesis: validating rdf under the www.w3.org domain is broken.
>>
>> It may be unrelated to encoding. The error is triggered by any
>> IOException reading characters from an input stream reader.
>>
>> Damian
>