Fwd: Re: Document fragment vocabulary


Fwd: Re: Document fragment vocabulary

Sebastian Hellmann
Dear all,
a topic came up on [hidden email], which might best be posted here.
I will summarize it briefly here; the details can be found attached.

Currently, I am working on an interchange format in RDF for Natural Language Processing (NLP), called NIF [2] (slides [1]), which is part of LOD2 [3].
It relies heavily on URI fragment IDs that address substrings of a text/plain document.
Although RFC 5147 [4] exists, it does not cover some requirements of the NLP use case.

RFC 5147 provides integrity checks, but there is no proposal that produces robust fragment IDs, i.e. something that works on the surrounding context rather than on line or character position. A change at position 0 of the document might render all fragment IDs obsolete: "#range=(574,585)" would no longer be valid if a single character were inserted at the beginning of the document, shifting the index.
The RFC was already extended for CSV [5], but I would go even further, allow more extensions, and then collect them all in a structured format such as an RDF/OWL vocabulary. We have already done this for our use cases [6].

For our purposes, we defined two fragment recipes, used here to annotate the third occurrence of "Semantic Web":
http://www.w3.org/DesignIssues/LinkedData.html#offset__Semantic+Web_14406-14418
http://www.w3.org/DesignIssues/LinkedData.html#hash_md5_4__12_Semantic+Web_abeb272fe2deadd2cd486c4cea6cddf1
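To make the two recipes above concrete, here is a minimal sketch of how such fragment IDs could be generated. The offset recipe follows the URI shown above; the exact hashing rule of the context-hash recipe (here: md5 over the phrase plus 4 characters of context on each side) is an assumption for illustration, not taken from the NIF spec:

```python
import hashlib
from urllib.parse import quote_plus

def nth_occurrence(doc: str, phrase: str, occurrence: int) -> int:
    """Return the start index of the n-th occurrence of phrase in doc."""
    start = -1
    for _ in range(occurrence):
        start = doc.index(phrase, start + 1)
    return start

def offset_fragment(doc: str, phrase: str, occurrence: int) -> str:
    """Offset-based recipe: encodes the phrase and its character range."""
    start = nth_occurrence(doc, phrase, occurrence)
    return f"#offset__{quote_plus(phrase)}_{start}-{start + len(phrase)}"

def context_hash_fragment(doc: str, phrase: str, occurrence: int, ctx: int = 4) -> str:
    """Context-hash recipe (hashing rule assumed): md5 over the phrase
    plus ctx characters of left/right context, so the ID can survive
    edits elsewhere in the document."""
    start = nth_occurrence(doc, phrase, occurrence)
    window = doc[max(0, start - ctx):start + len(phrase) + ctx]
    digest = hashlib.md5(window.encode("utf-8")).hexdigest()
    return f"#hash_md5_{ctx}__{len(phrase)}_{quote_plus(phrase)}_{digest}"

doc = "aaa Semantic Web bbb Semantic Web ccc Semantic Web ddd"
frag = offset_fragment(doc, "Semantic Web", 3)  # "#offset__Semantic+Web_38-50"
```

The point of the second recipe is robustness: inserting a character at position 0 changes the offset fragment but leaves the context hash untouched.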

I'm quite unsure how to proceed now: use our own fragment recipes, write another RFC, or try to generalize the approach with the help of a vocabulary.
RFC 5147 would then need to be extended with a "type=RFC5147", "type=offset", or "type=hash" parameter, and you would be able to look up what "RFC5147", "offset", or "hash" means. Could you give us some suggestions, as we do not want to invent the 15th competing standard [7]?

Another problem we have is that the fragment ID is not sent to the server. Has this ever played a practical role up to now? For Linked Data it can be cumbersome: say you have a 200 MB text file with an average of 3 annotations per line (200,000 lines, 600,000 triples).
Somebody attached an annotation to line 20,000:
<http://example.com/text.txt#line=20000> my:comment "Please remove this line. It is so negative!" .
When making a request with an RDF/XML Accept header, you would always need to retrieve the annotations for all lines.
Then, after transferring the 600,000 triples, the client would throw away all triples except the ones for this line.
Currently, we do not care whether we will use "?nif=" or "/" or "#", and leave this up to the implementer.
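Since the fragment never reaches the server, the client can only filter after the full transfer. A small sketch of that waste, with plain Python tuples standing in for real RDF triples (URIs and predicate names illustrative):

```python
def triples_for_fragment(triples, doc_uri, fragment):
    """Client-side filtering: the server answered with *all* annotations
    for the document, because '#line=20000' was never sent to it."""
    wanted = f"{doc_uri}#{fragment}"
    return [t for t in triples if t[0] == wanted]

# one my:comment triple per line of the 200,000-line file
triples = [
    (f"http://example.com/text.txt#line={i}", "my:comment", f"note {i}")
    for i in range(1, 200_001)
]

# after transferring everything, only one triple is actually needed
hits = triples_for_fragment(triples, "http://example.com/text.txt", "line=20000")
```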

This summary got quite long, and even more aspects are mentioned below. I hope this is not too confusing.
All the best,
Sebastian


[1] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
[2] http://aksw.org/Projects/NIF
[3] http://lod2.eu
[4] http://tools.ietf.org/html/rfc5147
[5] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
[6] http://nlp2rdf.lod2.eu/schema/string/
[7] http://xkcd.com/927/
-------- Original Message --------
Subject: Re: Document fragment vocabulary
Resent-Date: Tue, 16 Aug 2011 06:15:14 +0000
Resent-From: [hidden email]
Date: Tue, 16 Aug 2011 15:09:21 +0900
From: Sebastian Hellmann [hidden email]
To: Michael Hausenblas [hidden email]
CC: Michael Martin [hidden email], public-lod [hidden email], Alexander Dutton [hidden email]


On 16.08.2011 14:12, Michael Hausenblas wrote:
>> It is not really LinkedData friendly.
>
> Why?
>
It does not scale for large documents. Let's say you have a 200 MB text 
file with an average of 3 annotations per line (200,000 lines, 600,000 triples).
Somebody attached an annotation to line 20,000:

<http://example.com/text.txt#line=20000>  my:comment "Please remove this line. It is so negative!" .

When making a request with an RDF/XML Accept header, you would always need to 
retrieve all annotations for all lines.
Then, after transferring the 600,000 triples, the client would throw away all 
other triples but the one.

>> @Michael: is there some standardisation respective URIs for text  
>> going on?
>
> As you've rightly identified, an RFC already exists. What would this 
> new standardisation activity be chartered for?
>
> As an aside, this reminds me a bit of http://xkcd.com/927/
Hm, actually you created an extra standard yourself for CSV, because the 
approach by Wilde and Dürst did not cover your use case.
It does not cover mine 100% either. Potentially, there are a lot of 
text-based formats, so there should be a way to extend the pattern somehow.
>> The approach by Wilde and Dürst[1] seems to lack stability.
> I don't know what you mean by this. Lack of take-up, yes. Stability, 
> what's that?
Wilde and Dürst provide integrity checks, but there is no proposal that 
produces robust fragment IDs, i.e. something that works on the surrounding 
context rather than on line or position. A change at position 0 of the 
document might render all fragment IDs obsolete: "#range=(574,585)" would 
no longer be valid if a single character were inserted at the beginning of 
the document.

>> Do you think we could do such standardisation for document fragments 
>> and text fragments within the Media Fragments Group[3] ?
> No. Disclaimer: I'm a MF WG member. Look at our charter [1] ...
>
Ok, thanks for clarifying that.
>
> Maybe this thread should slowly be moved over to [hidden email] [2]?
>
The # part not being sent to the server might be interesting for this 
list, as it is a Linked Data problem. Also, I think we should create an 
OWL vocabulary to describe, document, and standardize different fragment 
identifiers, as Alexander has started. But we should only do it with the 
W3C; otherwise it will truly become "competing standard 15".
The ontology could also just be descriptive, reflecting the RFCs.
Should we cross-post? Alternatively, I could just start another thread there.
Sebastian

>
> Cheers,
>     Michael
>
> [1] http://www.w3.org/2008/01/media-fragments-wg.html
> [2] http://lists.w3.org/Archives/Public/uri/
> -- 
> Dr. Michael Hausenblas, Research Fellow
> LiDRC - Linked Data Research Centre
> DERI - Digital Enterprise Research Institute
> NUIG - National University of Ireland, Galway
> Ireland, Europe
> Tel. +353 91 495730
> http://linkeddata.deri.ie/
> http://sw-app.org/about.html
>
> On 16 Aug 2011, at 05:40, Sebastian Hellmann wrote:
>
>> Hi Michael and Alex,
>> sorry to answer so late, I was on holiday in France.
>> I looked at the three provided resources [1,2,3] and there are still 
>> some comments and questions I have.
>>
>> 1. The part after the # is actually not sent to the server. Are there 
>> any solutions for this? It is not really Linked Data friendly.
>> Compare 
>> http://linkedgeodata.org/triplify/near/51.033333,13.733333/1000/class/Amenity
>> (Currently not working, but it gives all points within a 1000m radius)
>>
>> The client would be required to calculate the subset of triples from 
>> the resource that are addressed.
>>
>> 2. [1] is quite basic; it essentially uses positions and 
>> lines. I made a qualitative comparison of different fragment ID 
>> approaches for text in [4], slide 7.
>> I was wondering if anybody has researched such properties of URI 
>> fragments. Currently, I am benchmarking the stability of these URIs 
>> using Wikipedia changes.
>> Has such work been done before?
>>
>> 3. @Alex: In my opinion, your proposed fragment ontology can only be 
>> used to provide documentation for different fragments.
>> I would rather propose using just one triple:
>> <http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418> 
>> a <http://nlp2rdf.lod2.eu/schema/string/OffsetBasedString>
>> The ontology I made for strings might be generalized for formats 
>> other than text-based ones [5].
>> One triple is much shorter. As you can see, I also tried to encode the 
>> type of fragment right into the fragment ("offset"), although a 
>> notation like "type=offset" might be better.
>>
>> 4. @Michael: is there any standardisation regarding URIs for text 
>> going on?
>> I heard there would be a Language Technology W3C group. The approach 
>> by Wilde and Dürst [1] seems to lack stability.
>> Do you think we could do such standardisation for document fragments 
>> and text fragments within the Media Fragments Group [3]?
>> I really thought the liveUrl project was quite good, but it seems 
>> dead [6].
>>
>>
>> In LOD2 [7] and NIF [8] we will need some fragment identifiers to 
>> standardize NLP tools for the LOD2 stack.
>> It would be great to reuse existing work instead of starting from scratch. 
>> I had to extend [1], for example, because it did not produce stable URIs 
>> and also did not contain the type of algorithm used to produce the 
>> URI.
>>
>> All the best,
>> Sebastian
>>
>>
>> [1] http://tools.ietf.org/html/rfc5147
>> [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>> [3] http://www.w3.org/TR/media-frags/
>> [4] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
>> [5] http://nlp2rdf.lod2.eu/schema/string/
>> [6] http://liveurls.mozdev.org/index.html
>> [7] http://lod2.eu
>> [8] http://aksw.org/Projects/NIF
>>
>> On 04.08.2011 22:37, Michael Hausenblas wrote:
>>>
>>>
>>> Alex,
>>>
>>>> Has someone already done this? Is it even (mostly?) sane?
>>>
>>> Sane yes, IMO. Done, sort of, see:
>>>
>>> + URI Fragment Identifiers for the text/plain [1]
>>> + URI Fragment Identifiers for the text/csv [2]
>>>
>>> Cheers,
>>>     Michael
>>>
>>> [1] http://tools.ietf.org/html/rfc5147
>>> [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>>
>>> -- 
>>> Dr. Michael Hausenblas, Research Fellow
>>> LiDRC - Linked Data Research Centre
>>> DERI - Digital Enterprise Research Institute
>>> NUIG - National University of Ireland, Galway
>>> Ireland, Europe
>>> Tel. +353 91 495730
>>> http://linkeddata.deri.ie/
>>> http://sw-app.org/about.html
>>>
>>> On 4 Aug 2011, at 14:22, Alexander Dutton wrote:
>>>
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> Say I have an XML document, <http://example.org/something.xml>, and I
>>>> want to talk about some part of it in RDF. As this is XML, being
>>>> able to point into it using XPath sounds ideal, leading to 
>>>> something like:
>>>>
>>>> <#fragment> a fragment:Fragment ;
>>>>  fragment:within <http://example.org/something.xml> ;
>>>>  fragment:locator "/some/path[1]"^^fragment:xpath .
>>>>
>>>> (For now we can ignore whether we wanted a nodeset or a single node,
>>>> and how to handle XML namespaces.)
>>>>
>>>> More generally, we might want other ways of locating fragments
>>>> (probably with a datatype for each):
>>>>
>>>> * character offsets / ranges
>>>> * byte offsets / ranges
>>>> * line numbers / ranges
>>>> * some sub-rectangle of an image
>>>> * XML node IDs
>>>> * page ranges of a paginated document
>>>>
>>>> Some of these will be IMT-specific and may need some more thinking
>>>> about, but the idea is there.
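A rough sketch of resolving the xpath locator from the Turtle above on the client (invented helper; the standard library's ElementTree, with its limited XPath subset, stands in for a full XPath engine):

```python
import xml.etree.ElementTree as ET

XML = "<some><path>first</path><path>second</path></some>"

def resolve_xpath_locator(xml_text, locator):
    """Resolve a fragment:locator of datatype fragment:xpath against the
    document. ElementTree cannot evaluate absolute paths, so the leading
    '/root' step is matched against the root element by hand."""
    root = ET.fromstring(xml_text)
    steps = locator.strip("/").split("/")
    if steps[0] != root.tag:
        return None
    rest = "/".join(steps[1:])
    return root.find(rest) if rest else root

node = resolve_xpath_locator(XML, "/some/path[1]")
```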
>>>>
>>>>
>>>> Has someone already done this? Is it even (mostly?) sane?
>>>>
>>>>
>>>> Yours,
>>>>
>>>> Alex
>>>>
>>>>
>>>> NB. Our actual use-case is having pointers into an NLM XML file
>>>> (embodying a journal article) so we can hook up our in-text reference
>>>> pointer¹ URIs to the original XML elements (<xref/>s) they were
>>>> generated from. This will allow us to work out the context of each
>>>> citation for use in further analysis of the relationship between the
>>>> citing and cited articles.
>>>>
>>>> ¹ See
>>>> <http://opencitations.wordpress.com/2011/07/01/nomenclature-for-citations-and-references/> 
>>>>
>>>> for an explanation of the terminology.
>>>>
>>>> - -- 
>>>> Alexander Dutton
>>>> Developer, data.ox.ac.uk, InfoDev, Oxford University Computing 
>>>> Services
>>>>           Open Citations Project, Department of Zoology, University
>>>> of Oxford
>>>>
>>>>
>>>
>>>
>>
>>
>> -- 
>> Dipl. Inf. Sebastian Hellmann
>> Department of Computer Science, University of Leipzig
>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>> Research Group: http://aksw.org
>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org



Re: Fwd: Re: Document fragment vocabulary

Erik Wilde-3
hello.

On 2011-08-16 10:36 , Sebastian Hellmann wrote:
> RFC5147 provides integrity checks, but there is no proposal that
> produces robust fragment IDs. e.g. something that works on the context
> and not on line or position. A change in the document on position 0
> might render all fragment ids obsolete. E.g. "#range=(574,585)" would
> not be valid any more, if one character was inserted at the beginning of
> the document, changing the index.

being one of the authors of this RFC, i'd like to point out that the
initial ideas were quite a bit more complicated and included features
similar to what you are looking for. however, during the process of
getting community support, it became clear that the preference of most
people was to have simpler and easier to implement fragment identifier
features. this does make them more brittle, but things on the web can
break, and even a more complicated feature set would only have made them
less likely to break. in the end, i think it was good that the final RFC
ended up being simple and easy to understand and implement, but it
definitely may not be enough for your use cases.

we're still working on the CSV version, and any feedback about this is
highly welcome, and this list is exactly the right place for this.
here's our announcement about the CSV fragid draft:

http://lists.w3.org/Archives/Public/uri/2011Apr/0003.html

kind regards,

erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
            | UC Berkeley  -  School of Information (ISchool) |
            | http://dret.net/netdret http://twitter.com/dret |


Re: Fwd: Re: Document fragment vocabulary

Erik Wilde-3
In reply to this post by Sebastian Hellmann
hello sebastian.

> Another problem, we have is that the fragment id is not sent to the
> server. Did this ever play a practical role up to now? For Linked Data
> it can be cumbersome: Let's say you have a 200 MB text file, with
> average 3 annotations per line (200,000 lines, 600,000 triples ).
> Somebody attached an annotation on line 20000:
> <http://example.com/text.txt#line=20000> my:comment "Please remove this
> line. It is so negative!" .
> When making a query with RDF/XML Accept Header. You would always need to
> retrieve all annotations for all lines.
> Then after transferring the 900k triples, the client would throw away
> all other triples, except the one for this line.

the fact that fragment identifiers are client-side only is something
that it pretty deeply engrained in web architecture. interactions on the
web are based on resources, and if you're unhappy with interaction
granularity (as you're indicating above), then this does not necessarily
mean that you have to change web architecture, but instead that you may
have a problem with your resource model. if you want interactions to be
finer grained, then identify and build interactions around those
finer-grained resources. linking can help you to find links from
coarse-grained to fine-grained and vice versa, if you model it in a way
where there are possible interactions with both finer and more coarsely
grained resources.

speaking from the REST perspective, i think there's still very
interesting and pretty much unexplored territory there. the question is
how to come up with general RESTful models of how a service can expose
resources at varying levels of granularity. but this is not so much a
URI issue than more something that probably could be solved by a RESTful
design pattern.

(as a side note: if you want to change resources in a diff-like way, you
may want to look at the HTTP PATCH method, which allows you to request
that a resource should be changed in a certain way.)
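For illustration, such a PATCH request could look like this on the wire (a sketch only: the diff media type and body are invented for the example; the PATCH method itself is specified in RFC 5789):

```python
def build_patch_request(host, path, diff_body, media_type="text/x-diff"):
    """Assemble the raw bytes of an HTTP PATCH request that asks the
    server to apply a diff-like change to a resource."""
    body = diff_body.encode("utf-8")
    head = (
        f"PATCH {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Type: {media_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body

# e.g. remove the offending line 20000 of the annotated text file
req = build_patch_request("example.com", "/text.txt",
                          "@@ -20000,1 +20000,0 @@\n-It is so negative!\n")
```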

cheers,

erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
            | UC Berkeley  -  School of Information (ISchool) |
            | http://dret.net/netdret http://twitter.com/dret |


Re: Fwd: Re: Document fragment vocabulary

Sebastian Hellmann
In reply to this post by Erik Wilde-3
Hello Erik,

On 16.08.2011 20:38, Erik Wilde wrote:

> hello.
>
> On 2011-08-16 10:36 , Sebastian Hellmann wrote:
>> RFC5147 provides integrity checks, but there is no proposal that
>> produces robust fragment IDs. e.g. something that works on the context
>> and not on line or position. A change in the document on position 0
>> might render all fragment ids obsolete. E.g. "#range=(574,585)" would
>> not be valid any more, if one character was inserted at the beginning of
>> the document, changing the index.
>
> being one of the authors of this RFC, i'd like to point out that the
> initial ideas were quite a bit more complicated and included features
> similar to what you are looking for. however, during the process of
> getting community support, it became clear that the preference of most
> people was to have simpler and easier to implement fragment identifier
> features. this does make them more brittle, but things on the web can
> break, and even a more complicated feature set would only have made
> them less likely to break. in the end, i think it was good that the
> final RFC ended up being simple and easy to understand and implement,
> but it definitely may not be enough for your use cases.

Ease of implementation is only one aspect, and I can understand that this 
was one of the major criteria for the community, as it seems to be an 
easy common denominator. The format we are creating for LOD2 is for a 
Natural Language Processing developer community. I doubt that they 
would be scared off by a more complex URI pattern; rather, they would embrace 
any offered advantages, such as a tool annotating a web page whose 
frag IDs either stay robust or can be corrected automatically. The 
different patterns will be implemented for several dozen NLP tools over 
the project lifetime of LOD2.

What would you suggest we do, then? We consider 
addressing fragments of text documents in general, with CSV, XML, and 
XHTML being specialisations. We might just add an additional 
"type=RFC5147" to the fragment and then add several other types 
ourselves: a stable one, one for morpho-syntax, etc.

I still have the following questions:
- Do you know of any systems that implement RFC 5147?
- What was your original use case for designing the frag IDs?
- Can you point me to a site where the less brittle versions you 
suggested are discussed? Or could you give an example? My proposal for 
this is here:
http://aksw.org/Projects/NIF#context-hash-nif-uri-recipe
- Do you know of any benchmarking of the different URI approaches w.r.t. 
robustness, uniqueness, etc.? I'm currently doing an evaluation, so 
please tell me if I should include anything. I might include your 
CSV frag IDs, but I would need some data that is changing (although I 
could simulate it).
- What does "proposed standard" mean? Does it mean that the RFC is not a 
standard, but only "proposed"?

Thanks for your answers,
Sebastian


--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org



Re: Fwd: Re: Document fragment vocabulary

Sebastian Hellmann
In reply to this post by Erik Wilde-3
Hello Erik,
(I liked how you separated the topics...)

On 16.08.2011 20:49, Erik Wilde wrote:

> hello sebastian.
>
>> Another problem, we have is that the fragment id is not sent to the
>> server. Did this ever play a practical role up to now? For Linked Data
>> it can be cumbersome: Let's say you have a 200 MB text file, with
>> average 3 annotations per line (200,000 lines, 600,000 triples ).
>> Somebody attached an annotation on line 20000:
>> <http://example.com/text.txt#line=20000> my:comment "Please remove this
>> line. It is so negative!" .
>> When making a query with RDF/XML Accept Header. You would always need to
>> retrieve all annotations for all lines.
>> Then after transferring the 900k triples, the client would throw away
>> all other triples, except the one for this line.
>
> the fact that fragment identifiers are client-side only is something
> that it pretty deeply engrained in web architecture. interactions on
> the web are based on resources, and if you're unhappy with interaction
> granularity (as you're indicating above), then this does not
> necessarily mean that you have to change web architecture, but instead
> that you may have a problem with your resource model. if you want
> interactions to be finer grained, then identify and build interactions
> around those finer-grained resources. linking can help you to find
> links from coarse-grained to fine-grained and vice versa, if you model
> it in a way where there are possible interactions with both finer and
> more coarsely grained resources.
The problem is not with our modelling. We are working on a format for 
NLP tools that everyone can then implement, so the modelling should be 
up to the developer. I think the core of the problem is that the URIs 
should serve two use cases: 1. serve as RDF subjects and allow for 
Linked Data without too much overhead; 2. allow highlighting in a 
browser/client.
We might assume equality of both and just allow, in the NIF format [1], 
that developers use both as they like; but then, when querying Linked 
Data, they should replace all # with ? and vice versa for 
browser/highlighting clients.

Would you think this is too hacky? It might also be that the whole 
problem is rather hypothetical at the moment, so # might be the choice 
now, and we will just wait until the problems arise...
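A sketch of that substitution rule (hypothetical helper; the "nif=" query parameter is taken from the "?nif=" option mentioned earlier in the thread):

```python
from urllib.parse import urlsplit, urlunsplit

def to_query_form(uri):
    """Turn a highlighting URI (fragment form) into a URI whose locator
    actually reaches the server (query form), per the substitution rule
    discussed above."""
    s = urlsplit(uri)
    if not s.fragment:
        return uri
    return urlunsplit((s.scheme, s.netloc, s.path, f"nif={s.fragment}", ""))

q = to_query_form("http://example.com/text.txt#line=20000")
```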

Thanks,
Sebastian

[1] http://aksw.org/Projects/NIF


--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org



Re: Fwd: Re: Document fragment vocabulary

Erik Wilde-3
In reply to this post by Sebastian Hellmann
hello sebastian.

On 2011-08-16 09:22 , Sebastian Hellmann wrote:
> What is your suggestion then, what we should be doing? We consider
> addressing fragments of text documents in general, with CSV and XML and
> XHTML being specialisations. We might just add an additional
> "type=RFC5147" to the fragment and then add several other types
> ourselves: a stable one, one for morpho-syntax, etc.

i am not quite sure how you could see CSV and XML and XHTML as 
specializations of plain text. they do have different metamodels (at 
least plain text and CSV and *ML) and thus need pretty different 
approaches when it comes to fragment identification. i think the problem 
you're having may be a well-known ugliness in web architecture: fragment 
identifiers are specific to the media type, but URIs are (often) not. 
this is just a design defect of the web, and there's no easy way around 
it. sometimes people try to engineer around it somehow, but as soon as 
you're starting to think about decentralization and redirections, things 
typically fall apart. all sorts of things have been proposed over the 
years to fix this defect, but it's a hard problem to solve in the 
general case and without breaking backwards compatibility.

> I still have the following questions:
> - Do you know of any systems, that implement RFC5147?

i've seen it being used for annotations locally, but i haven't seen
support in any widely used pieces of software.

> - What was your original use case for designing the frag-ids?

the ability to create hyperlinks for plain text files. creating a link
between a fragment of a plain text file and something else, for example
an annotation system for log files (which conveniently grow very stable
only by adding text at the end), saying "this line really looks like
something suspicious may have happened".
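That log-file use case maps directly onto RFC 5147's line ranges. A sketch (helper name invented; the zero-origin position counting, where editor line n becomes line=n-1,n, follows my reading of the RFC):

```python
def line_fragment_uri(base_uri, editor_line_no):
    """Link to a single line of a text/plain resource using the RFC 5147
    'line=' fragment scheme; positions count line boundaries from 0."""
    return f"{base_uri}#line={editor_line_no - 1},{editor_line_no}"

# "this line really looks like something suspicious may have happened"
uri = line_fragment_uri("http://example.com/app.log", 102)
```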

> - Can you point me to a site where the less brittle version you
> suggested are discussed? Or could you give an example? My proposal for
> this is here: http://aksw.org/Projects/NIF#context-hash-nif-uri-recipe

i would have to go back to earlier versions of the draft which i have 
somewhere in my local archive, they may not be online anymore. it has 
been a while, and all i know is that we had some regex-based approach, 
which of course created the problem that *authoring* these identifiers 
can become quite a challenge with a lot of decisions to be made. the 
advantage of the regex approach is that most programming environments 
have regex implementations, so implementation would have been easier 
than with a completely proprietary method.
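The abandoned regex idea might have looked roughly like this on the resolving side (syntax and helper entirely invented for illustration, not from the early draft):

```python
import re

def resolve_regex_locator(text, pattern, occurrence=1):
    """Hypothetical regex-based locator: the fragment carries a pattern,
    and the client resolves it to a character range in the retrieved
    text, so the link survives edits that do not touch the match."""
    matches = list(re.finditer(pattern, text))
    if len(matches) < occurrence:
        return None
    m = matches[occurrence - 1]
    return (m.start(), m.end())

span = resolve_regex_locator("foo bar foo baz", r"foo", occurrence=2)
```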

> - Do you know of any benchmarking of the different URI approaches w.r.t.
> to robustness, uniqueness, etc? I'm currently doing an evaluation so
> please tell me, if I should include anything. I might include your
> CSV-Frag Ids, but I would need some data that is changing (although I
> could simulate it)

i don't think you can do benchmarking without being very specific
about the scenario and use cases. which means you would need to have a
sample dataset of resources changing over time that would reflect the
scenario you are interested in, and then you could start comparing
approaches. without that, benchmarking would be pointless.

> - What does "proposed standard" mean? This means, that the RFC is not a
> standard, but only "proposed" ?

that's just IETF terminology, don't worry about it.

cheers,

dret.

--
erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
            | UC Berkeley  -  School of Information (ISchool) |
            | http://dret.net/netdret http://twitter.com/dret |


Re: Fwd: Re: Document fragment vocabulary

Erik Wilde-3
In reply to this post by Sebastian Hellmann
hello sebastian.

On 2011-08-16 09:36 , Sebastian Hellmann wrote:
> The problem is not with our modelling. We are working on a format for
> NLP tools, that everyone can then implement. So the modelling should be
> up to the developer.

i don't understand that. if you're working on the format, doesn't that
imply you're defining the model? developers then simply implement the
model that is defined by your format, right?

> I think the core of the problem is, that the uris
> should serve two use cases: 1. serve as RDF subjects and allow for
> LinkedData without too much overhead 2. highlight it in a browser/client
> We might assume equality of both and just allow in the NIF format[1],
> that developers can use both as they like, but then when querying
> LinkedData they should replace all # with ? and vice versa for
> browser/highlighting clients.

i don't think i can follow you here, but the substitution rules you are
mentioning don't look very nice. URI-wise, # and ? serve different
purposes, and creating such a substitution rule to me looks as if you're
pretty much guaranteeing that things will break for anybody not aware of
your special rules. if for passing around URIs you also have to pass
around special rules how to handle them, that's not a good sign.

> Would you think this is too hacky? It might also be that the whole
> problem is rather hypothetical at the moment, so # might be the choice
> now and then we will just wait until the problems arise...

i think i don't fully understand what you're trying to do and what
you're proposing to solve your problem, but like i said above, the
special handling rules look a little suspicious. maybe michael has a
better idea of your scenario and can help.

cheers,

dret.

--
erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
            | UC Berkeley  -  School of Information (ISchool) |
            | http://dret.net/netdret http://twitter.com/dret |


Re: Document fragment vocabulary

Michael Hausenblas
In reply to this post by Erik Wilde-3

> i am not quite sure how you could see CSV and XML and XHTML as  
> specializations of plain text. they do have different metamodels (at  
> least plain text and CSV and *ML) and thus need pretty different  
> approaches when it comes to fragment identification. i think the  
> problem you're having may be a well-known ugliness in web  
> architecture: fragment identifiers are specific to the media type,  
> but URIs are (often) not. this is just a design defect of the web,  
> and there's no easy way around it.
+1

Cheers,
        Michael
--
Dr. Michael Hausenblas, Research Fellow
LiDRC - Linked Data Research Centre
DERI - Digital Enterprise Research Institute
NUIG - National University of Ireland, Galway
Ireland, Europe
Tel. +353 91 495730
http://linkeddata.deri.ie/
http://sw-app.org/about.html

On 23 Aug 2011, at 22:46, Erik Wilde wrote:

> hello sebastian.
>
> On 2011-08-16 09:22 , Sebastian Hellmann wrote:
>> What is your suggestion then, what we should be doing? We consider
>> addressing fragments of text documents in general, with CSV and XML  
>> and
>> XHTML being specialisations. We might just add an additional
>> "type=RFC5147" to the fragment and then add several other types
>> ourselves: a stable one, one for morpho-syntax, etc.
>
> i am not quite sure how you could see CSV and XML and XHTML as  
> specialization of plain text. they do have different metamodels (at  
> least plain text and CSV and *ML) and thus need pretty different  
> approaches when it comes to fragment identification. i think the  
> problem you're having may be a well-know ugliness in web  
> architecture: fragment identifiers are specific for the media type,  
> but URIs are (often) not. this is just a design defect of the web,  
> and there;'s no easy way around it. sometimes people try to engineer  
> around it somehow, but as soon as you're starting to think about  
> decentralization and redirections, things typically fall apart. all  
> sorts of things have been proposed over the years to fix this  
> defect, but there it's a hard problem to solve in the general case  
> and without breaking backwards compatibility.
>
>> I still have the following questions:
>> - Do you know of any systems, that implement RFC5147?
>
> i've seen it being used for annotations locally, but i haven't seen  
> support in any widely used pieces of software.
>
>> - What was your original use case for designing the frag-ids?
>
> the ability to create hyperlinks for plain text files. creating a  
> link between a fragment of a plain text file and something else, for  
> example an annotation system for log files (which conveniently grow  
> very stable only by adding text at the end), saying "this line  
> really looks like something suspicious may have happened".
>
>> - Can you point me to a site where the less brittle version you
>> suggested are discussed? Or could you give an example? My proposal  
>> for
>> this is here: http://aksw.org/Projects/NIF#context-hash-nif-uri- 
>> recipe
>
> i would have to go back to earlier versions of the draft which i  
> have somewhere in my local archive, they may not be online anymore.  
> it has been a while, and all i know is that we had some regex-based  
> approach, which of course created the problem that *authoring* these  
> identifiers can become quite a challenge with a lot of decisions
> to be made. the advantage for the regex approach is that most  
> programming environments have regex implementations, so  
> implementation would have been easier than with a completely  
> proprietary method.
>
>> - Do you know of any benchmarking of the different URI approaches  
>> w.r.t.
>> to robustness, uniqueness, etc? I'm currently doing an evaluation so
>> please tell me, if I should include anything. I might include your
>> CSV-Frag Ids, but I would need some data that is changing (although I
>> could simulate it)
>
> i don't think you can make benchmarking without being very specific  
> about the scenario and use cases. which means you would need to have  
> a sample dataset of resources changing over time that would reflect  
> the scenario you are interested in, and then you could start  
> comparing approaches. without that, benchmarking would be pointless.
>
>> - What does "proposed standard" mean? This means, that the RFC is  
>> not a
>> standard, but only "proposed" ?
>
> that's just IETF terminology, don't worry about it.
>
> cheers,
>
> dret.
>
> --
> erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
>           | UC Berkeley  -  School of Information (ISchool) |
>           | http://dret.net/netdret http://twitter.com/dret |



Re: Document fragment vocabulary

Michael Hausenblas
In reply to this post by Erik Wilde-3

Disclaimer: both Sebastian and I are working in a large-scale EU  
research project called LOD2 [1] and IIRC this is part of it, right,  
Sebastian?

> i think i don't fully understand what you're trying to do and what  
> you're proposing to solve your problem, but like i said above, the  
> special handling rules look a little suspicious. maybe michael has a  
> better idea of your scenario and can help.

I must honestly admit that I'm not sure what exactly you're after,  
Sebastian. Can you please provide us with some concrete markup along  
with a simple use case? Something along the lines of: "Emil User has an  
XXX document and wants to do YYY with it, etc."?

Cheers,
        Michael

[1] http://lod2.eu/
--
Dr. Michael Hausenblas, Research Fellow
LiDRC - Linked Data Research Centre
DERI - Digital Enterprise Research Institute
NUIG - National University of Ireland, Galway
Ireland, Europe
Tel. +353 91 495730
http://linkeddata.deri.ie/
http://sw-app.org/about.html

On 23 Aug 2011, at 22:59, Erik Wilde wrote:

> hello sebastian.
>
> On 2011-08-16 09:36 , Sebastian Hellmann wrote:
>> The problem is not with our modelling. We are working on a format for
>> NLP tools, that everyone can then implement. So the modelling  
>> should be
>> up to the developer.>
>
> i don't understand that. if you're working on the format, doesn't  
> that imply you're defining the model? developers then simply  
> implement the model that is defined by your format, right?
>
>> I think the core of the problem is, that the uris
>> should serve two use cases: 1. serve as RDF subjects and allow for
>> LinkedData without too much overhead 2. highlight it in a browser/
>> client
>> We might assume equality of both and just allow in the NIF format[1],
>> that developers can use both as they like, but then when querying
>> LinkedData they should replace all # with ? and vice versa for
>> browser/highlighting clients.
>
> i don't think i can follow you here, but the substitution rules you  
> are mentioning don't look very nice. URI-wise, # and ? serve  
> different purposes, and creating such a substitution rule to me  
> looks as if you're pretty much guaranteeing that things will break  
> for anybody not aware of your special rules. if for passing around  
> URIs you also have to pass around special rules how to handle them,  
> that's not a good sign.
>
>> Would you think this is too hacky? It might also be that the whole
>> problem is rather hypothetical at the moment, so # might be the  
>> choice
>> now and then we will just wait until the problems arise...
>
> i think i don't fully understand what you're trying to do and what  
> you're proposing to solve your problem, but like i said above, the  
> special handling rules look a little suspicious. maybe michael has a  
> better idea of your scenario and can help.
>
> cheers,
>
> dret.
>
> --
> erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
>           | UC Berkeley  -  School of Information (ISchool) |
>           | http://dret.net/netdret http://twitter.com/dret |



Re: Fwd: Re: Document fragment vocabulary

Sebastian Hellmann
In reply to this post by Erik Wilde-3
Hi Erik and Michael,
sorry for the delay, I was sick last week.
In this email I have tried to state my problems more clearly, and I hope
we are getting closer to the core of the issue now.

Am 23.08.2011 23:46, schrieb Erik Wilde:

> hello sebastian.
>
> On 2011-08-16 09:22 , Sebastian Hellmann wrote:
>> What is your suggestion then, what we should be doing? We consider
>> addressing fragments of text documents in general, with CSV and XML and
>> XHTML being specialisations. We might just add an additional
>> "type=RFC5147" to the fragment and then add several other types
>> ourselves: a stable one, one for morpho-syntax, etc.
>
> i am not quite sure how you could see CSV and XML and XHTML as
> specialization of plain text. they do have different metamodels (at
> least plain text and CSV and *ML) and thus need pretty different
> approaches when it comes to fragment identification. i think the
> problem you're having may be a well-known ugliness in web architecture:
> fragment identifiers are specific to the media type, but URIs are
> (often) not. this is just a design defect of the web, and there's no
> easy way around it. sometimes people try to engineer around it
> somehow, but as soon as you're starting to think about
> decentralization and redirections, things typically fall apart. all
> sorts of things have been proposed over the years to fix this defect,
> but it's a hard problem to solve in the general case and without
> breaking backwards compatibility.

The basic problem seems to be the definition of what plain text is. I
guess you are talking about the media type, while I am talking about
plain text in general. My definition would be a bit broader, such as:
"Plain text is basically anything that makes sense to open in a text
editor", or negatively, "not a binary format", or "a character
sequence". CSV and *ML then impose certain rules upon the plain text and
require certain patterns of characters.
The easiest way to show that it is a specialisation is that
http://www.w3.org/DesignIssues/LinkedData.html#range=14406-14418 would
theoretically work and point to the fragment "Semantic Web" based on the
HTML source. The problem here is again that plain text according to RFC
2046/3676 should not contain any markup or other things. But this is
actually not an intuitive definition, and I was not aware that this
separation is made here.
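To make the offset idea concrete, here is a minimal sketch of how an RFC 5147-style character range resolves against the raw source. The HTML snippet and the offsets are invented for illustration, not the real values for the W3C page:

```python
# Sketch: resolving a character-range fragment against the raw document text.
# RFC 5147 counts positions between characters, so (start, end) behaves like
# a 0-based, end-exclusive slice.

def resolve_char_range(document: str, start: int, end: int) -> str:
    """Return the substring addressed by a #range=start,end style fragment."""
    if not (0 <= start <= end <= len(document)):
        raise ValueError("range outside document")
    return document[start:end]

# Illustrative source; 21-33 happens to cover "Semantic Web" here.
html_source = "<p>The <a href='...'>Semantic Web</a> is ...</p>"
print(resolve_char_range(html_source, 21, 33))  # -> Semantic Web
```

Note that the range addresses the character sequence of the HTML source, not the rendered text, which is exactly the "specialisation of plain text" reading.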


The main use case I have is for Natural Language Processing. An application
A creates an annotation in RDF (e.g. part-of-speech tags):

@base <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sso: <http://nlp2rdf.lod2.eu/schema/sso/> .
<range=14406-14418> sso:posTag "JJ" ;
    rdf:type <http://purl.oclc.org/olia/olia.owl#Adjective> .

A second application B can now read and understand this RDF, because it
is produced according to NIF. Such understanding is reached because
1. common ontologies such as olia.owl#Adjective are used, and 2. URIs
like <range=14406-14418> have well-defined semantics.

When it comes to RDF, however, the semantics of the URIs suddenly seem
to differ:
1. Is <range=14406-14418> the same as
<range=14406-14418;md5=43tz8sfel8jilfeu8sfejkl>?
";md5=43tz8sfel8jilfeu8sfejkl" is just an integrity check, so if the
integrity is given, an application can assume that both URIs have the
same meaning and point to "Semantic Web".
2. Furthermore, using "?",
<http://www.w3.org/DesignIssues/LinkedData.html?range=14406-14418>
might refer to the same thing. It is an RDF subject, and the reference
can be arbitrarily defined.
3. As the DesignIssues web site is HTML, we could also use XLink or
XPointer to mean the same thing.
4. If the annotations are expensive to calculate, it would be nice if
they could stay valid as long as possible, thus using more robust
identifiers:
<contextlength=4&length=8&text=Semantic&md5=438jil89sfdkljise79>

All URI variants could be defined as equal using owl:sameAs. As NLP is a
hacky business sometimes, some of the modelling should not be fixed;
e.g. applications could also define that ranges that differ by just one
character are still the same, or other fuzziness.
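A rough sketch of recipes 1 and 4 above, assuming the md5 integrity parameter hashes the addressed substring and the context hash covers a few characters on either side plus the string itself; the exact NIF recipe (see the aksw.org link) may differ in detail:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def check_integrity(document: str, start: int, end: int, expected_md5: str) -> bool:
    """Integrity check (recipe 1): if the hash of the addressed substring
    still matches, the offsets are assumed to still be valid."""
    return md5_hex(document[start:end]) == expected_md5

def context_hash_uri(base: str, document: str, start: int, end: int, ctx: int = 4) -> str:
    """Robust identifier (recipe 4, sketched): hash over `ctx` characters of
    left and right context plus the string itself, so the identifier can
    survive edits elsewhere in the document."""
    s = document[start:end]
    window = document[max(0, start - ctx):end + ctx]
    return f"{base}#hash_md5_{ctx}__{len(s)}_{s.replace(' ', '+')}_{md5_hex(window)}"

doc = "... the Semantic Web is an extension ..."
start = doc.index("Semantic Web")
end = start + len("Semantic Web")
print(check_integrity(doc, start, end, md5_hex("Semantic Web")))  # True
print(context_hash_uri("http://example.org/doc", doc, start, end))
```

Inserting text before the range breaks the plain offset but leaves the context hash computable by searching for a matching window.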

So for RDF it does not really matter what exactly is used. But for the
Web it seems to matter. So I am currently looking for common ground here,
as it would be nice to have compatibility. E.g. applications
implementing NIF might be able to understand your annotations using RFC
5147, and also more specialised URIs like the ones for CSV and LiveUrls
[1]. Furthermore, the chance increases that annotations produced by NIF
tools can be highlighted in a browser by default.

I am a little worried that fragment ids are so restricted to media types,
especially since you could easily reuse them, i.e. plain-text RFC 5147
fragments for CSV and *ML.
It seems difficult to find a good definition of what the URIs used in the
RDF actually denote. It might also not be possible to make this coherent
with fragment id semantics as defined by the W3C. What do you think?

>> - Do you know of any benchmarking of the different URI approaches w.r.t.
>> to robustness, uniqueness, etc? I'm currently doing an evaluation so
>> please tell me, if I should include anything. I might include your
>> CSV-Frag Ids, but I would need some data that is changing (although I
>> could simulate it)
>
> i don't think you can make benchmarking without being very specific
> about the scenario and use cases. which means you would need to have a
> sample dataset of resources changing over time that would reflect the
> scenario you are interested in, and then you could start comparing
> approaches. without that, benchmarking would be pointless.

Let's say you apply a spell checker to Wikipedia pages. You find a page
with 10 spelling mistakes. If the first one is
<http://en.wikipedia.org/wiki/Fragment_identifier#range=102,105> <is>
"potion", <shouldBe> "portion" .
and the last one is
<http://en.wikipedia.org/wiki/Fragment_identifier#range=992,997> <is>
"exlamation", <shouldBe> "exclamation" .
then the URI scheme is poorly chosen, unless you edit the page backwards
or fix all mistakes one at a time.
But what would be the best URI scheme for this use case?
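The point about editing backwards can be sketched as follows. The misspelled words are from the example above, but the document and the offsets are invented:

```python
# Sketch: applying offset-addressed corrections. Editing front-to-back would
# shift every later offset; applying the fixes back-to-front keeps every
# previously computed range valid.

def apply_fixes(text: str, fixes: list[tuple[int, int, str]]) -> str:
    """fixes: (start, end, replacement) character ranges into `text`."""
    for start, end, replacement in sorted(fixes, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

doc = "a potion of the text ... an exlamation mark"
fixes = [
    (doc.index("potion"), doc.index("potion") + len("potion"), "portion"),
    (doc.index("exlamation"), doc.index("exlamation") + len("exlamation"), "exclamation"),
]
print(apply_fixes(doc, fixes))
# -> a portion of the text ... an exclamation mark
```

This works only because all ranges were resolved against the same document version before any edit; it does not help a third party holding stale offset URIs.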

I can understand your use case of annotating log files, but I guess it
would be nice to be able to annotate Wikipedia pages.
This is what I would benchmark, as it might produce a best practice for
Web annotation.
As I said, I would also try to benchmark the CSV URIs if you have a CSV
corpus that I could use.

All the best and thanks for all your answers,
Sebastian

[1] http://liveurls.mozdev.org/index.html

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org



Re: Fwd: Re: Document fragment vocabulary

John Cowan-3
Sebastian Hellmann scripsit:

> The basic problem seems to be the definition of what plain text is. I
> guess you are talking about the media type, while I am talking about  
> plain text in general. My definition would be a bit broader such      
> as:  "Plain text is basically anything that makes sense to open in a  
> text editor. " or negatively "Not a binary format." or "a character  
> sequence".                                                            

I would say that what can be edited in a text editor is text.  Plain
text, then, is a particular form of text that doesn't have any explicit
presentation or semantic markup, with the significant exception of
horizontal and vertical whitespace encoded as characters.  There may be
markup, but it's implicit.

> I am a little worried as fragment-ids are so restricted to media
> types, especially since you could easily reuse them, i.e. plain text
> RFC 5147 for CSV and *ML

You could, but fragment ids that match the semantic model would be much
more robust and useful, like row/column for CSV and XPath for XML.

--
All Norstrilians knew that humor was            John Cowan
"pleasurable corrigible malfunction".          [hidden email]
        --Cordwainer Smith, Norstrilia


Re: Fwd: Re: Document fragment vocabulary

Sebastian Hellmann
Am 29.08.2011 17:34, schrieb John Cowan:

> Sebastian Hellmann scripsit:
>
>> The basic problem seems to be the definition of what plain text is. I
>> guess you are talking about the media type, while I am talking about
>> plain text in general. My definition would be a bit broader such
>> as:  "Plain text is basically anything that makes sense to open in a
>> text editor. " or negatively "Not a binary format." or "a character
>> sequence".
> I would say that what can be edited in a text editor is text.  Plain
> text, then, is a particular form of text that doesn't have any explicit
> presentation or semantic markup, with the significant exception of
> horizontal and vertical whitespace encoded as characters.  There may be
> markup, but it's implicit.
Fair enough, I guess you can define it any way you want. I am not 100%
sure what exactly "semantic markup" is.
Wikipedia makes several distinctions:
http://en.wikipedia.org/wiki/Text_%28disambiguation%29
http://en.wikipedia.org/wiki/Plain_text
http://en.wikipedia.org/wiki/Formatted_text
http://en.wikipedia.org/wiki/Enriched_text

Although plain text is defined as the opposite of formatted text,
programming source code is considered plain text as well.
Maybe http://en.wikipedia.org/wiki/Text_file
is closest to your definition of text, i.e. what can be edited in a text
editor.


>> I am a little worried as fragment-ids are so restricted to media
>> types, especially since you could easily reuse them, i.e. plain text
>> RFC 5147 for CSV and *ML
> You could, but fragment ids that match the semantic model would be much
> more robust and useful, like row/column for CSV and XPath for XML.
>
I would argue against this directly.
If e.g. file://myfile.csv or file://myfile.xml
has a syntax error (is not well-formed), then #line=10,11 or #range=88,105
will perform much better than CSV-specific identifiers or XPath, which
no longer work at all.
Furthermore, it would be much more interoperable, as implementors could
implement fragment identification once and it would work for many other
formats.
So there is another usefulness to it. I agree that matching the semantic
model has certain benefits; reusing general fragment ids, however,
should also be considered.
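A small sketch of this argument, using an invented, deliberately non-well-formed XML snippet: structural addressing fails at parse time, while a character range still selects something (though, as the reply below this notes, not necessarily the right thing):

```python
import xml.etree.ElementTree as ET

broken = "<root><item>first</item><item>second</root>"  # missing </item>

# Structural addressing (XPath and friends) needs a successful parse first:
try:
    ET.fromstring(broken)
    structural_ok = True
except ET.ParseError:
    structural_ok = False

print(structural_ok)   # False: the document cannot enter an XML pipeline
# A character range ignores the metamodel and still resolves:
print(broken[12:17])   # -> first
```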
Cheers,
Sebastian


--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org



Re: Fwd: Re: Document fragment vocabulary

Erik Wilde-3
hello.

On 2011-08-29 10:08 , Sebastian Hellmann wrote:
> Maybe http://en.wikipedia.org/wiki/Text_file
> is closest to your definition of text, i.e. what can be edited in a text
> editor.

in that case XML would be plain text, which does not make a whole lot of
sense. XML is a tree which happens to be text-encoded, but there is a
reason why all XML technologies are based on the tree (XDM) and not on
the text serialization. if something has a text-based serialization
that's convenient, but if the standard application-level access to that
data uses parsing into some form of higher-level data structure, then
it's not plain text anymore.

> I would argue this directly.
> If e.g. file://myfile.csv or file://myfile.xml
> have a syntax error (not well-formed) then #line=10,11 or #range=88,105
> will perform much better than CSV specific things or XPath, which do not
> work any more.

that simply depends on how you define "working". ranges/lines always
select something, but not necessarily what you wanted them to select.
the fact that (some) fragment identifiers can break is a good thing, in
the same way as it is good that the web has 404s. in decentralized
systems things change and break and you have to deal with it. if an XML
document is broken, you cannot feed it into an XML pipeline, and
therefore it's just not suitable for processing anymore.

> Furthermore, it will be much more interoperable as implementors could
> implement fragment identification once and it will work for many other
> formats.

how would that work? even if you had some cross-media-type fragment
identifiers, the actual mapping of identifiers to fragments would need
to be implemented for each individual media type.

> So there is another usefulness to it. I agree that matching the semantic
> model has certain benefits, reusing general Fragment Ids , however,
> should also be considered.

it's a good idea in theory, but very hard in practice. pretty much the
only thing you can probably do would be to have ids, and even then the
lexical structure of these probably would start to interfere badly with
some of the targeted media types. i think there's an important reason
why cross-media-type fragment identifiers never got off the ground: it
would make the decentralized nature of media type definition much harder
(they would need to be coordinated to support fragment identifiers of a
certain kind), and it would be impossible to enforce retroactively.

this is just my opinion, of course, and i am looking forward to see what
you will end up doing. cheers,

dret.

--
erik wilde | mailto:[hidden email]  -  tel:+1-510-6432253 |
            | UC Berkeley  -  School of Information (ISchool) |
            | http://dret.net/netdret http://twitter.com/dret |


Re: Document fragment vocabulary

Michael Hausenblas

> that simply depends on how you define "working". ranges/lines always  
> select something, but not necessarily what you wanted them to  
> select. the fact that (some) fragment identifiers can break is a  
> good thing, in the same way as it is good that the web has 404s. in  
> decentralized systems things change and break and you have to deal  
> with it. if an XML document is broken, you cannot feed it into an  
> XML pipeline, and therefore it's just not suitable for processing  
> anymore.

+1


> this is just my opinion, of course, and i am looking forward to see  
> what you will end up doing.


Agreed, same here ...


Cheers,
        Michael
--
Dr. Michael Hausenblas, Research Fellow
LiDRC - Linked Data Research Centre
DERI - Digital Enterprise Research Institute
NUIG - National University of Ireland, Galway
Ireland, Europe
Tel. +353 91 495730
http://linkeddata.deri.ie/
http://sw-app.org/about.html




Re: Fwd: Re: Document fragment vocabulary

Erik Wilde-3
In reply to this post by Sebastian Hellmann
hello.

On 2011-08-29 06:43 , Sebastian Hellmann wrote:
> It seems difficult to find a good definition what the URIs used in the
> RDF actually denote. It might also not be possible to make this coherent
> with Fragment ID Semantics as defined by W3C. What do you think?

i don't think there are "fragment identifier semantics". there is the
URI syntax, and then things get ugly because the semantics depend on the
media type of the resource representation that you
might GET. you could in theory create a framework that would "map"
fragment identifiers based on the result of a retrieval, but that sounds
awfully brittle, in particular when resources can change.

> Then the URI scheme is poorly choosing unless you edit the page either
> backwards or fix all mistakes one at a time.
> But what would be the best URI scheme for this Use Case ?

you mean "fragment identifier" here. HTML only has id-based
identification, so you cannot identify words. it sounds like the
application you want to build has to do its own thing. at some point in
time i suggested to the HTML5 group to improve on HTML's fragment
identification capabilities by at least allowing child paths (something
like #1/2/1/3/12, counting child element nodes down the tree), but they
had/have too many other things to do. i still think that HTML5 would be
a great opportunity to make HTML a better hypertext citizen, but HTML5
mainly focuses on web apps and scripting and not so much on hypertext.
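The child-path idea can be sketched as follows, assuming 1-based steps that each select the n-th child *element* node; the exact semantics of the original proposal are not spelled out in this mail:

```python
import xml.etree.ElementTree as ET

def resolve_child_path(root: ET.Element, path: str) -> ET.Element:
    """Resolve a '#1/2/1'-style child path: each number picks the n-th
    child element (1-based), counting only element nodes, walking down
    from the document element."""
    node = root
    for step in path.strip("#").split("/"):
        children = list(node)  # ElementTree children are element nodes
        node = children[int(step) - 1]
    return node

doc = ET.fromstring("<html><head/><body><p>one</p><p>two</p></body></html>")
print(resolve_child_path(doc, "#2/2").text)  # 2nd child of <html>, then 2nd <p> -> two
```

Such paths break as soon as elements are inserted or removed above the target, which is the same brittleness the offset-based identifiers have, just at the tree level.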

> I can understand your use case of annotating log files, but I guess it
> would be nice to be able to annotate Wikipedia pages.

i see your point, but the big difference is that log files are plain
text, and wikipedia pages are not.

> This is what I would benchmark as it might produce a best practice in
> Web Annotation.

it would be great to have that, but then you probably want to focus on
the web's main content type, HTML.

> As I said, I would also try to benchmark the CSV URIs if you have a CSV
> corpus that I could use.

i don't have that, and even if you had such a corpus, you would also
need a change model and a model of how to deal with breaking changes and
how useful/appropriate it is to "fix" them.

cheers,

dret.

--
erik wilde | mailto:[hidden email]  -  tel:+1-510-2061079 |
            | UC Berkeley  -  School of Information (ISchool) |
            | http://dret.net/netdret http://twitter.com/dret |

Reply | Threaded
Open this post in threaded view
|

Re: Fwd: Re: Document fragment vocabulary

Sebastian Hellmann
Hello all,
this discussion really helped me a lot to get a different perspective on
the whole issue.
I also see that my terminology was wrong with regard to plain text. I
thought about it, and I think I understand now that it is not so easy to
make universal fragment identifiers for the Web.

For my main use case (interoperability of NLP tools) this fact is not
really relevant, as the focus is on text and text annotation. One big
problem in this domain is, for example, to have multi-layered and
overlapping annotations, sometimes solved with milestones embedded in
XML. [1] proposes a docuverse, which seems to be a little overkill.
Overall, however, the question of the media type is not so relevant in
this domain. I also intend to make it possible to embed the text into
the RDF as an RDF literal. Then the media type would be fixed.
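A sketch of the embedding idea, printing Turtle from Python; the property names (ex:sourceText, ex:anchorOf) are hypothetical placeholders for illustration, not part of any published vocabulary:

```python
# Sketch: embedding the annotated context as an RDF literal, so offsets are
# anchored to a fixed string rather than a mutable web resource.

context = "The Semantic Web is an extension of the current Web."
start = context.index("Semantic Web")
end = start + len("Semantic Web")

turtle = f"""\
@prefix ex: <http://example.org/nif-sketch#> .
<#range={start}-{end}>
    ex:sourceText \"\"\"{context}\"\"\" ;
    ex:anchorOf "{context[start:end]}" .
"""
print(turtle)
```

Because the literal travels with the annotation, a consumer can re-verify the offsets against it regardless of what happens to the original document.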


To wrap up this discussion, here is what I plan to do:

• First, I will collect most approaches to fragment identifiers and make
a table "media type vs. possible fragment ids"; then, in a next step, I
will write down some use cases and derive criteria for fragment ids.
Then I will do some benchmarking with that and create a comparison
table. I have just submitted the LOD2 EU deliverable, so some of this is
already done.
• Based on the deliverable, we will specify NIF version 1.0 and then
implement it for several tools and do a field test. Results will be
collected in a NIF 2.0 draft. NIF 1.0 will have the recipes I already
mentioned, offset-based and context-hash-based. I think we will also
fix the '#' and not leave the choice of #, ?nif=, or / to the implementor.
During NIF 1.0 we will see if any problems come up doing it this way.
• At the end of September I will give a presentation at a W3C workshop [2].
There I will try to talk to David Filip (LRC/CNGL/LT-Web, LT-Web:
Meta-data interoperability between Web CMS, Localization tools and
Language Technologies at the W3C).
• We hope to submit the NIF 2.0 draft to an organization that
standardizes it (W3C and ISO are both options to be considered).
• Lastly, if we have time, we might pick up and continue/extend the
LiveUrls project [3]. Maybe we could also implement some of the RFCs
along with it. It is of course just one plugin for one browser, but it
would be a start. This will need some time, though, before we pick up on
it. If anybody would be willing to join, please mail me ;)


Overall, I am a little bit sad that compatibility with the RFCs cannot
be achieved so easily. Especially the "optional" parts need to be
stripped, because of the "owl:sameAs" dilemma I sketched in a previous
email. For now, I will probably make a page that describes the relation
between the NIF URIs and the different RFCs. Maybe it is possible to
find some convergence along the way.

Thanks a lot for having all this patience and answering all my questions,
Sebastian

[1]  http://palindrom.es/phd/research/earmark/
[2]
http://www.multilingualweb.eu/documents/limerick-workshop/limerick-program
[3] http://liveurls.mozdev.org/index.html