My RDF Manifesto

Grant Robertson
I have been watching the RDF working group's discussion about layers and
surfaces and named graphs with great interest. This is a topic of genuine
concern for me. When I first learned of RDF I felt it was very limited. More
like a CS 201 homework assignment run amok than anything that could be
applied to the real world. I only recently became interested in RDFa as a
means of embedding citation information within HTML (family) formatted
content. Even though there are now triplestore databases that can store
billions of triples and reasoners that can suss out who is a friend of whom,
I still feel the same way about RDF.  In my view, the "triples can do
everything" model is seriously limited. It really does seem as though the
triple model was created primarily to make it easier to write programs and
ever since then people have been trying to cram the real world into a
three-cornered box.

So it is nice to see the RDF WG thinking about expanding the RDF model
beyond just a bunch of triples. However, I really feel that you aren't going
nearly far enough. Up to this point, I have kept my thoughts on this matter
to myself - limiting my comments to how to write good documentation for
RDFa. However, now that the RDF WG is tossing around ideas - and in the
spirit of Dan Brickley's post about not making hasty decisions that could
block future ideas - I thought it would be a good time to send you this
"manifesto" on my thoughts about RDF and where it should go in the future.

=============================================
First, the limitations of RDF, as I see them:
=============================================
I can understand the notion of using a simple construct to build more
complex constructs. This does make programming easier and it certainly makes
embedding data within XML documents more feasible. But triples - as they are
currently used - don't tell the whole story.

1) There is no meta-metadata.
-----------------------------
In other words, triples encode metadata about other things but there is no
way to encode metadata about the triples themselves. There is no way to
indicate where a triple came from, how well it is trusted, how old the
reference is, how much influence it should have on reasoning software, or
anything else.

2) RDF is entirely Boolean.
---------------------------
I can see how an entirely Boolean system would appeal to computer
scientists. However - just as the world is not flat - the world is not
Boolean. The world is full of "somewhat"s, "probably"s, and "kinda-sorta"s.
Under RDF I either foaf:knows you or I do not. There is no way to tell if we
are like blood-brothers or if I just met you at a conference a couple of
times. If one wants to express different levels of "knowledge" - from
acquaintance up through "carnal" - then one has to create an entirely
different predicate for each different level. Sure, it is possible to create
an entire vocabulary expressing a dozen different levels of knowing someone
and then use RDFS or OWL to rank them using some predicate that means "is
stronger than" and then make them all subproperties of foaf:knows. However,
if a reasoner has access to some RDF data which uses this vocabulary but
doesn't have access to the vocabulary definition files themselves, then it
will have no idea that "acquaintance" is similar to "Buddy," differing
primarily by degree.

3) RDF is fragile and impermanent.
----------------------------------
RDF is based upon IRIs. People in the internet community like to think of
these IRIs as relatively permanent, but they are suffering from a delusion.
As far as I know, every IRI (which uses the http:// URI-Scheme) is dependent
upon the owner of the domain reregistering that domain name on a regular
basis. I am unaware of any means of registering a domain name for
perpetuity. This means that RDF data I create today could be useless as
early as tomorrow. (Not likely, but possible.) In addition, if a domain name
is forfeited and taken over by someone else, that second party could
redefine the vocabulary, completely changing the meaning of legacy RDF data.
I am primarily concerned with scientific and educational information,
therefore I am thinking in terms of hundreds of years. Finally, IRIs are
often at the mercy of web site administrators who may not know that a URL on
their site has been used as an IRI in some RDF data somewhere. So, said
administrator may reorganize a web site and totally destroy years of work in
one afternoon.

Now, someone could attempt to maintain an archive of all older vocabularies
as well as the IRIs from which they were retrieved and the date-range within
which they were valid. However: A) How would any RDF reasoner know which
date range to use if the stored triples don't have date metadata attached to
them? And B) How can such a system know which IRIs to archive?

4) Blank Nodes are too ambiguous yet not fuzzy enough.
------------------------------------------------------
Within one document or file it is possible to know with certainty that two
blank nodes are the same node. However, across documents - even if those
documents have been "merged" into one data store - it is impossible to know
for sure if two blank nodes refer to the exact same entity. Especially if
you consider very long term storage of RDF data. Many tutorials on RDF give
the example of using e-mail addresses to "pin down" a blank node. The
reasoning goes: "If two blank nodes from two different sources are
associated with the same e-mail address then it can be assumed the blank
nodes refer to the same entity." However, it is entirely possible for one
person to give up an e-mail address and then - after what seems like a
reasonable period - that e-mail address could be assigned to a different
person. But a reasonable period for daily use by people is different from a
reasonable period for very long term archival of data. A hundred years from
now, your current e-mail address may have been used by five different
people.

There does not seem to be any mechanism for indicating - with certainty -
that two different blank nodes from two different original sources both
refer to the exact same entity. All that reasoners can do is conjecture.
And, while I have nothing against conjecture, there is also no means to
indicate the degree of certainty with which said conjectures are stated.
Either two blank nodes are assumed to be the same or they are not associated
at all, which goes back to the Boolean nature of RDF.

5) I hate RDF-Linked-Lists too!
-------------------------------
Like Manu, I am really NOT a fan of RDF-Linked-Lists. And, by extension, I am
not very interested in the RDF-Linked-List part of the RDFa Core 1.1 spec at
all. First of all, what's up with all those extra blank nodes? You could
have just left them out entirely. Secondly, as Manu stated, most web
designers don't know from linked lists. The data structure that is used is
so complicated, as far as they will be concerned, that they will just assume
that lists are "Too Hard" and ignore them altogether. If there was ever a
part of a spec that was doomed to be largely ignored, this is it. Third, the
attribute chosen in the RDFa Core 1.1 spec to indicate that something should
be in an RDF-Linked-List is inappropriate. By using "inlist" you make it seem
as if that is the only type of list that could ever occur. You are closing
the door to simple ordered lists, etcetera.
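
For readers who have not looked at what such a list turns into, here is a
rough Python sketch of how a three-item list expands into the standard
rdf:first / rdf:rest / rdf:nil chain. Only those three terms come from the
RDF vocabulary; the function name collection_triples, the blank node labels,
and the ex: item names are made up for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def collection_triples(items):
    """Return (head, triples) for an RDF linked list holding `items`."""
    if not items:
        # An empty list is just rdf:nil, with no triples at all.
        return RDF + "nil", []
    # One blank node per list cell (labels are hypothetical).
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        # The last cell points at rdf:nil instead of another cell.
        rest = nodes[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return nodes[0], triples

head, triples = collection_triples(["ex:a", "ex:b", "ex:c"])
for t in triples:
    print(t)
```

Three items cost three blank nodes and six triples, which is the overhead
being complained about here.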

One should never choose a generic term to refer to a very specific entity.
What, then, does one do when yet another entity that also falls under that
generic term comes into use? Do you then use a more specific term for that
entity? That is guaranteed to lead to confusion. Unfortunately,
RDF-Linked-Lists are now part of the RDFa Candidate Recommendation and it
would be specification suicide to try to take them out now. However, the
situation can still be salvaged by simply changing the name of the attribute
used to something more specific like "rdf-linked-list" that conveys what is
really going to be done with that data. Then, future specifications can use
"orderedlist" (and "unorderedlist") to mean a simple list that is ordered
based on the order of appearance within the document (or not, respectively).
These are terms with which regular web developers are intimately familiar.
In fact, it may even be possible to figure out a way so that web designers
could incorporate an "orderedlist" or "unorderedlist" directly into an <ol>
or <ul> tag structure in an HTML document.


==================================================
Next, the limitations of - or issues I have with - the current proposed
surface, layers, named graph model being discussed.
==================================================
I have mentioned how it seems as if RDF is like a three-cornered box. The
current discussion makes me think that the RDF community has gotten so used
to that box that they can no longer think outside of it. The changes being
discussed are radical changes, to be sure, but they have the feel of merely
expanding the box from the inside, using what cardboard you have lying
around. I may have missed it, but I have not seen any discussion about where
you really want to go with these changes; what people will do with them; or
how these changes will allow for creating more accurate models of the real
world. Before you go any further, it seems to me that you all need to sit
down and map out a long term vision for what you want people to be able to
encode in RDF data as well as a pragmatic look at what will and won't be
possible.


A) N-Quads is only a single step in the right direction.
--------------------------------------------------------
N-Quads allow one to assign a "layer" (or "level" or "tag" or "name" or
whatever you want to call it) to a triple. However, what if a triple needs
to be in more than one layer? Do you repeat the triple? How many times? Now,
I know that this repetition can be normalized in a data store but a
serialized version of this data structure would be excessively verbose. It
seems it would be better to allow any number of additional "layer IRIs" to
be listed along with a triple. Let the author of the document decide which
is more efficient: repeating triples or repeating additional "layer IRIs."

B) N-Quads seem to be telling the users of the data
   what they can do with it.
---------------------------------------------------
If a triple has a particular "layer IRI" associated with it this seems to
imply that it must go on that "layer" in a data model. I think it would be
best to merely call these additional values "tags" and let users of the data
do with them whatever they choose. Leave it up to developers of RDF analysis
software to invent new metaphors like layers and surfaces and such. They can
take the information in the triples along with the information in the
associated tags and filter, sort, display, or otherwise rejigger it any way
they please. By pre-imposing a metaphor, and then choosing your terminology
and writing your specs around that metaphor, you are locking in people's
thinking about what they can do or how they can re-envision that data.

C) Metaphors are nice, but don't lock people in.
------------------------------------------------
Discussed above. I just wanted to reiterate this point. Choose terminology
that is as UNlimiting as possible, rather than picking a metaphor, and
choosing terminology to match that metaphor. Just provide means to express
as much information as possible and let people decide what they want to
express and what they don't. Leave it to user documentation writers like me
to explain to web authors or data managers all the different ways they can
make use of the information that may or may not be stored in RDF data (as
well as encourage them to be creative and invent new ways to envision that
same information). Specifications should only tell people how to encode the
information, not tell them what to do with it or how to think about it.


============
My proposal:
============
Fortunately, it is possible to solve most of these issues with relatively
simple changes.

Terminology:
------------
Call the extra values, which you are adding with N-Quads, "tags" and leave
it at that.

Don't try to apply any metaphor at all. Let software developers and users do
that.

N-Tuples:
---------
Basically just allow any number of additional tags instead of only one.

If two identical triples exist ANYWHERE but with different tags, treat them
as one triple with a union of all the tags.
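
A minimal Python sketch of that union rule, assuming each statement arrives
as a (subject, predicate, object, tags) tuple. The function name
merge_tagged and the ex: names are hypothetical; none of this is part of
N-Quads or any other spec.

```python
from collections import defaultdict

def merge_tagged(statements):
    """Collapse identical triples, keeping the union of their tags.

    `statements` is an iterable of (subject, predicate, object, tags),
    where `tags` is any iterable of tag IRIs (possibly empty).
    """
    merged = defaultdict(set)
    for s, p, o, tags in statements:
        # Identical (s, p, o) keys land in the same bucket.
        merged[(s, p, o)].update(tags)
    return dict(merged)

statements = [
    ("ex:grant", "foaf:knows", "ex:danny", ["ex:tag/conference"]),
    ("ex:grant", "foaf:knows", "ex:danny",
     ["ex:tag/mailing-list", "ex:tag/conference"]),
    ("ex:grant", "foaf:name", '"Grant"', []),
]
result = merge_tagged(statements)
```

The author of a document can then repeat the triple or repeat the tags,
whichever is shorter; after merging, the two forms look the same.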

Permanent IRIs:
---------------
Create some new official URI-Schemes[1]. Under those create a new "domain"
system, organized based on the purpose of the domain, rather than who paid
money to register a name. Create a legal structure that allows you to
register sub-domains in perpetuity.

For instance: For my DEMML project[2], I intend to register an official
URI-Scheme called "demml://" under which I will organize a vast tree of
topic codes for every subject that may need to be learned about by anyone.
(Yes, it will be a huge tree but I have figured out how to encode around
10^31 topics with only 40 characters, including slashes[3].) When someone
attempts to dereference a demml:// URI, plugins or other software on their
system will first look in a specified folder on their hard drive, then look
at a near-by server, then a more distant mirror, etc., automatically
following links made available on each of those machines to find the next
machine up in the chain where the sought-after material may exist. Those
topic codes could also be used as IRIs to specify the topic of a web page or
something. However, the IRI may not be directly dereferenceable. Instead, it
may only be able to be dereferenced via the aforementioned mechanism and
possibly only asynchronously[4]. At the same time, that IRI may be
used to mark millions of other documents which could be found by searching
for that unique string of characters. This unique string of characters will
be valid for all of perpetuity because I didn't have to register a domain
through ICANN.

Reduce dependence on DNS system:
--------------------------------
At any point the DNS system can be co-opted by governments or hackers. IRIs
could be redirected or made entirely unavailable. However, very rarely is
the entire internet blocked from availability. Redundantly posting RDF
data and vocabularies on various different servers makes it much more
difficult for malicious agents to block access to this information. It also
simply protects information from being destroyed by catastrophic disasters.
Then, instead of relying on the DNS to locate the ONE copy of that data
stored on ONE vulnerable server, we can rely on search engines to index this
information for us. I know, it sounds radical and even more fragile but
think about it. There can be an almost infinite number of search engines
available to choose from. It would be nearly impossible to block them all.
Sure, any one search engine may filter out some content, but that search
engine would not be very popular. Or it may filter content in just the way
you want, something the DNS doesn't do. At first people would have to browse
to a specific web site to search in a specific search engine. However,
software would soon be written to automatically search via a selection of
preferred search engines so that the user of RDF analysis software would
never notice that the file had not necessarily been directly dereferenced.
Besides, a lot of IRIs used in RDF are never meant to be directly
dereferenced in normal use anyway. They are, in fact, merely labels. The
only difference here, is that - if that IRI needs to be dereferenced -
software can do so even if the original server is blocked or down.

The best part is that all that is really required is to change one's
conception of what an IRI is used for. Simply use IRIs like labels to be
searched for rather than addresses representing locations (real or
imaginary). You can still use the same formatting. And there is nothing to
prevent one from storing an original copy at the location specified by the
IRI. It is just that you don't depend upon that IRI to be dereferenceable at
all. Instead, you expect to be able to find many instances of it used within
metadata on a page, spread all over the internet. Additional metadata will
indicate whether a found instance of an IRI is a label for an exact copy of
the original content or is a subset (like a quote) or simply a reference
back to that original content.

This search-engine-based dereferencing system can be combined with the above
notion of "permanent IRIs via newly-created URI-Schemes" by the creation of
a special URI-Scheme with the following simple rule: First try to
dereference via HTTP or HTTPS but if that fails, then dereference via search
engines. Leave it up to software developers to determine various innovative
means to optimize this searching process. They could cache previous results
on the desktop, or use a master list stored on some server. Software users
will vote on the best methods with their dollars and their feet (or their
clicks).
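
The rule can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the "fetchers" are plain callables
that take an IRI and return bytes or None, so that no particular HTTP
library or search engine API is implied. The names dereference, blocked,
and mirror are hypothetical.

```python
from typing import Callable, Iterable, Optional

# A fetcher takes an IRI and returns the content, or None if not found.
Fetcher = Callable[[str], Optional[bytes]]

def dereference(iri: str, direct: Fetcher,
                searchers: Iterable[Fetcher]) -> Optional[bytes]:
    """Try direct retrieval first, then each search-based resolver in order."""
    for fetch in (direct, *searchers):
        try:
            found = fetch(iri)
        except OSError:
            # Treat network-style failures as "not found here" and move on.
            found = None
        if found is not None:
            return found
    return None

def blocked(iri):
    # Stand-in for an HTTP fetch against a server that is down or blocked.
    raise ConnectionError("host unreachable")

# Stand-in for a search engine index or a local cache of copies.
mirror = {"http://example.org/vocab#term": b"cached copy"}

result = dereference("http://example.org/vocab#term", blocked, [mirror.get])
missing = dereference("http://example.org/other", blocked, [mirror.get])
```

How the search-based fetchers are ordered, cached, or trusted is exactly
the part left open to software developers.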

Give RDF more granularity while making it all warm and fuzzy:
-------------------------------------------------------------
My primary beef with RDF is the Boolean nature mentioned above. There is no
way to create a weighted graph. This can be solved by making yet another
relatively simple conceptual change. Currently, the entire string of an IRI
is considered when comparing two IRIs for equivalence. (At least according
to answers to my question on Stack Overflow[5].) All that is necessary is
to, instead, say that only the path and fragment parts of an IRI are
considered the official resource identifier for RDF purposes. Any query
string or CGI data is treated as additional metadata about that IRI. This
allows a weight factor to be assigned to a specific edge in a graph simply
by adding a key-value pair to the IRI of the predicate. Now, instead of an
all-or-nothing "foaf:knows" I can say I kind-of know you by using
"foaf:knows?weight=.5". One might list a good friend using
"foaf:knows?weight=.9" and perhaps claim "foaf:knows?weight=1.1" for one's
wife (though she might say it is closer to "foaf:knows?weight=.4").
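
A minimal sketch of that split, in Python. It assumes the scheme and
authority stay part of the identifier along with the path and fragment, so
that only the query string is peeled off as metadata. The function name
split_iri and the "weight" key are hypothetical; no RDF spec defines
anything like this today.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

def split_iri(iri):
    """Return (identity, metadata) for an IRI under the proposed rule."""
    parts = urlsplit(iri)
    # Rebuild the IRI without its query string; this is the part that
    # would be compared for equivalence.
    identity = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "", parts.fragment)
    )
    # Everything in the query string becomes key-value metadata.
    metadata = dict(parse_qsl(parts.query))
    return identity, metadata

identity, metadata = split_iri("http://xmlns.com/foaf/0.1/knows?weight=.5")
weight = float(metadata.get("weight", 1.0))  # default of 1.0 is an assumption
```

Two predicates that differ only in their weight would then share the same
identity and differ only in the metadata dictionary.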

Additional key-value pairs could be added to indicate additional metadata.
One could indicate the original source of an IRI; the dc:creator of a
triple; or the date that IRI was born on. Some metadata would be more
appropriate for use in the subject-IRI and other metadata would be best in
the predicate or object IRIs. Only certain keys would be used for IRI
matching purposes. For instance, two subject IRIs with identical paths and
fragments but different "born on" dates might be considered to be two
distinct nodes. However, two triples with identical paths & fragments for
each of the subject, predicate, and object but with different "triple
created on" and "source" values listed in the predicate might be considered
to be identical triples that just happened to say the same thing in
different places, on different days. Kind of like retweeting. Naturally, two
predicates with different weight values would be considered the same but
different. So, if two identical triples have different weight values then
some kind of calculation could be done to determine the weight to use for
reasoning purposes. Different reasoners could use different calculations, as
they see fit.
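
One possible calculation, sketched in Python: treat statements with the
same subject, predicate, and object as one triple and average their
weights. It assumes the query strings have already been parsed into
dictionaries, and that an unweighted statement counts as weight 1.0. Both
choices, like the name combine_weights and the ex: names, are illustrative
assumptions; a reasoner could just as well take the maximum or the most
recent value.

```python
from collections import defaultdict
from statistics import mean

def combine_weights(statements, default_weight=1.0):
    """Group statements by (s, p, o) and average their weights.

    `statements` is an iterable of (subject, predicate, object, metadata),
    where `metadata` is a dict of the predicate's key-value pairs.
    """
    grouped = defaultdict(list)
    for s, p, o, meta in statements:
        grouped[(s, p, o)].append(float(meta.get("weight", default_weight)))
    return {triple: mean(weights) for triple, weights in grouped.items()}

statements = [
    ("ex:grant", "foaf:knows", "ex:danny",
     {"weight": ".5", "source": "ex:page1"}),
    ("ex:grant", "foaf:knows", "ex:danny",
     {"weight": ".75", "source": "ex:page2"}),
    ("ex:grant", "foaf:knows", "ex:manu", {}),
]
combined = combine_weights(statements)
```

The "source" values differ between the first two statements but play no
part in matching, which is the retweeting case described above.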

Some may say that this change could break some IRIs currently in use. But
does anyone really use the query string in their IRIs? Considering that all
that would currently achieve is creating an entirely different IRI, I cannot
see anyone choosing that method of creating different IRIs over the method
of simply using fragments. Therefore, I can't see that making this change
would break more than a handful of IRIs.

Give blank nodes identifiers and make them fuzzy too:
-----------------------------------------------------
This seems counterintuitive. A blank node is supposed to be blank, right?
But how do you tell the difference between one blank node and another? Well,
within parsing and reasoning software, blank nodes are assigned identifiers.
This is similar to the index number that is added to a row in a database
table but is rarely seen by users of the database. Rather than saying that
internal identifiers should be ignored when transmitting blank nodes from
one system to another, provide a means of labeling those blank nodes with a
globally unique serial number. This can be achieved by, again, using the
query string to add metadata to the existing blank node IRI. Instead of just
'_:reference' (where "reference" is usually a short string that is unique
within a particular document), we could allow '_:reference?query="string"'.
Using this query string, one can add metadata to a blank node indicating
date and time of creation, a global serial number, etcetera. Then reasoners
can use this additional metadata along with the information in the triples
that contain the blank nodes and calculate a probability that those two
blank nodes are, in fact, the exact same node. This probability can then be
expressed by a weight factor in the query string of the predicate connecting
the two blank nodes.

Now, web authors should not be required to devise and insert a globally
unique serial number into the metadata of each and every one of their blank
nodes. This information could instead be derived from the unique IRI of the
document and the short name given within the document. So a processor, when
parsing the RDF data embedded within a document would simply convert
something like '_:john' to
'_:john?source="http://example.com/aboutjohn.html"&dateretrieved="2012-03-14^^xsd:date"'
(with appropriate escaping and conversion to a format
compatible with query strings). Implicitly created blank nodes (created
through chaining, etc.) could simply be given a document-unique serial
number to use along with the IRI of the document as above. However, I have
yet to derive an algorithm that could ensure that the same blank nodes
within a document get assigned the same serial number upon different
parsings, even if the document has been edited between parsings. So that
will require some more thought. Perhaps web authors could assign IDs to
blank nodes through the use of an additional predicate and literal object
each time they intentionally use chaining. I know, this would be a pain, but
software tools could make this easier.
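
For explicitly named blank nodes, the conversion step might look something
like the following Python sketch. It is a simplification: values are
percent-encoded rather than quoted, and the ^^xsd:date datatype suffix is
left out. The function name qualify_bnode is hypothetical, and the
resulting string is not a valid blank node label in any current RDF syntax.

```python
from urllib.parse import urlencode

def qualify_bnode(label, source_iri, date_retrieved):
    """Append source and retrieval date to a document-local blank node label."""
    # urlencode percent-escapes ':' and '/' so the document IRI can sit
    # safely inside a query string.
    query = urlencode({"source": source_iri, "dateretrieved": date_retrieved})
    return f"{label}?{query}"

qualified = qualify_bnode(
    "_:john", "http://example.com/aboutjohn.html", "2012-03-14"
)
print(qualified)
```

This covers only labels the author wrote by hand; the open problem of
giving implicitly created blank nodes a stable serial number across edits
is not addressed by it.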


===========
Conclusion:
===========
From its conception RDF was both genius and incredibly limited. By making a
few conceptual changes, throwing in some outside help in the form of special
URI-Schemes, and opening up to the possibility that the DNS is not
infallible, it is possible to grow RDF into a data standard that truly
models the real world instead of some tri-cornered, cookie-cutter version of
it. No one should be required to use any of these additional features.
However, allowing the use of this additional metadata, embedded within the
query string of an IRI, would give RDF reasoners a vast amount of additional
information with which to, well, reason. Yes, this will then require the
definition of certain standard query-string field-names and the approved
datatypes that can be used with them. But just enough to avoid chaos. Allow
web designers and experimenters to come up with their own field names as
they see fit (with the warning that any one of those could be co-opted and
put into a standard some day) and then sit back and see what happens. As has
been mentioned many times, the best innovation sometimes occurs outside of
the working groups. Give people the incredible flexibility that these
changes will provide, and I believe a whole new world of
data-interoperability will open up.


[1] http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
[2] www.demml.org
[3] http://www.demml.org/standard/classification/
[4] http://www.ideationizing.com/2009/07/intelligent-epidemic-routing.html
[5] http://stackoverflow.com/questions/9171416/is-a-query-string-allowed-in-a-uri-used-in-rdf


Re: My RDF Manifesto

Danny Ayers
Hi Grant,

I'm not a member of the group, but would like to respond to some of
your comments.

tl;dr
I believe RDF with named graphs does actually cover most of your
issues when viewed as part of a layered system. Add the bits you
believe are missing on top.
But I too have been worrying a little about the IRI/HTTP/DNS issue.
I'm looking into how the magnet: URI scheme may help. (It's
essentially like a data: URI with a hash of the representation).

1) There is no meta-metadata.

Named graphs appear to be the best solution here. Whatever they
ultimately wind up looking like, there is always the HTTP model to
fall back on: put an RDF-friendly format representation of a resource
online (set of triples), talk about that.

2) RDF is entirely Boolean.

Not quite, "true" vs. "unknown" is a slightly different species. But I
had exactly the same kind of issue with RDF when I first encountered
it, and others have been there before. I believe Aaron Swartz came up
with a property :kindaLike. But the fact that you can put numeric
values in literals means that it is possible to connect to
number-crunching systems. I'd suggest that putting

3) RDF is fragile and impermanent.

It's more or less as fragile and impermanent as Platonic solids
(though YMMV as far as specs and mindshare are concerned :) But I
agree very much that dependency on HTTP's dependency on DNS is
troublesome.

4) Blank Nodes are too ambiguous yet not fuzzy enough

Then don't use them :)

5) I hate RDF-Linked-Lists too!

Me too! Had so much grief with them. But the underlying list model is
clearly sound. The projection into syntax gets a bit messy, but I
can't see a better alternative. Having said that, in some recent code
for pragmatic reasons I'm leaning towards using lists expressed as
property-numbered resources, more like the old rdf:Seq.
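
A minimal Python sketch of the property-numbered style, for comparison with
linked lists. The rdf:Seq class and the rdf:_1, rdf:_2, ... membership
properties are part of the RDF vocabulary; the function name seq_triples
and the ex: names are only illustrative.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def seq_triples(container, items):
    """Describe `items` as numbered members of one rdf:Seq resource."""
    triples = [(container, RDF + "type", RDF + "Seq")]
    for i, item in enumerate(items, start=1):
        # Membership properties are numbered from rdf:_1 upwards.
        triples.append((container, f"{RDF}_{i}", item))
    return triples

seq = seq_triples("ex:playlist", ["ex:a", "ex:b", "ex:c"])
```

Each item costs one triple on a single resource, and the position is
readable straight from the property name.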

Incidentally, re. registering "demml://" - new URI schemes are rarely
the best approach, check
http://www.w3.org/TR/webarch/#URI-registration

Cheers,
Danny.


On 11 May 2012 03:32, Grant Robertson <[hidden email]> wrote:

> I have been watching the RDF working group's discussion about layers and
> surfaces and named graphs with great interest. This is a topic of genuine
> concern for me. When I first learned of RDF I felt it was very limited. More
> like a CS 201 homework assignment run amok than anything that could be
> applied to the real world. I only recently became interested in RDFa as a
> means of embedding citation information within HTML (family) formatted
> content. Even though there are now triplestore databases that can store
> billions of triples and reasoners that can sus out who is a friend of whom,
> I still feel the same way about RDF.  In my view, the "triples can do
> everything" model is seriously limited. It really does seem as though the
> triple model was created primarily to make it easier to write programs and
> ever since then people have been trying to cram the real world into a
> three-cornered box.
>
> So it is nice to see the RDF WG thinking about expanding the RDF model
> beyond just a bunch of triples. However, I really feel that you aren't going
> nearly far enough. Up to this point, I have kept my thoughts on this matter
> to myself - limiting my comments to how to write good documentation for
> RDFa. However, now that the RDF WG is tossing around ideas - and in the
> spirit of Dan Brickley's post about not making hasty decisions that could
> block future ideas - I thought it would be a good time to send you this
> "manifesto" on my thoughts about RDF and where it should go in the future.
>
> =============================================
> First, the limitations of RDF, as I see them:
> =============================================
> I can understand the notion of using a simple construct to build more
> complex constructs. This does make programming easier and it certainly makes
> embedding data within XML documents more feasible. But triples - as they are
> currently used - don't tell the whole story.
>
> 1) There is no meta-metadata.
> -----------------------------
> In other words, triples encode metadata about other things but there is no
> way to encode metadata about the triples themselves. There is no way to
> indicate where a triple came from, how well it is trusted, how old is the
> reference, how much influence it should have on reasoning software, or
> anything else.
>
> 2) RDF is entirely Boolean.
> ---------------------------
> I can see how an entirely Boolean system would appeal to computer
> scientists. However - just as the world is not flat - the world is not
> Boolean. The world is full of "somewhat"s, "probably"s, and "kinda-sorta"s.
> Under RDF I either foaf:knows you or I do not. There is no way to tell if we
> are like blood-brothers or if I just met you at a conference a couple of
> times. If one wants to express different levels of "knowledge" - from
> acquaintance up through "carnal" - then one has to create an entirely
> different predicate for each different level. Sure, it is possible to create
> an entire vocabulary expressing a dozen different levels of knowing someone
> and then use RDFS or OWL to rank them using some predicate that means "is
> stronger than" and then subclass them all under foaf:knows However, if a
> reasoner has access to some RDF data which uses this vocabulary but doesn't
> have access to the vocabulary definition files themselves, then it will have
> no idea that "acquaintance" is similar to "Buddy," differing primarily by
> degree.
>
> 3) RDF is fragile and impermanent.
> ----------------------------------
> RDF is based upon IRIs. People in the internet community like to think of
> these IRIs as relatively permanent, but they are suffering from a delusion.
> As far as I know, every IRI (which uses the http:// URI-Scheme) is dependent
> upon the owner of the domain reregistering that domain name on a regular
> basis. I am unaware of any means of registering a domain name for
> perpetuity. This means that RDF data I create today could be useless as
> early as tomorrow. (Not likely, but possible.) In addition, if a domain name
> is forfeited and taken over by someone else, that second party could
> redefine the vocabulary, completely changing the meaning of legacy RDF data.
> I am primarily concerned with scientific and educational information,
> therefore I am thinking in terms of hundreds of years. Finally, IRIs are
> often at the mercy of web site administrators who may not know that a URL on
> their site has been used as an IRI in some RDF data somewhere. So, said
> administrator may reorganize a web site and totally destroy years of work in
> one afternoon.
>
> Now, someone could attempt to maintain an archive of all older vocabularies
> as well as the IRIs from which they were retrieved and the date-range within
> which they were valid. However: A) How would any RDF reasoner know which
> date range to use if the stored triples don't have date metadata attached to
> them? And B) How can such a system know which IRIs to archive.
>
> 4) Blank Nodes are too ambiguous yet not fuzzy enough.
> ------------------------------------------------------
> Within one document or file it is possible to know with certainty that two
> blank nodes are the same node. However, across documents - even if those
> documents have been "merged" into one data store - it is impossible to know
> for sure if two blank nodes refer to the exact same entity. Especially if
> you consider very long term storage of RDF data. Many tutorials on RDF give
> the example of using e-mail addresses to "pin down" a blank node. The
> reasoning goes: "If two blank nodes from two different sources are
> associated with the same e-mail address then it can be assumed the blank
> nodes refer to the same entity." However, it is entirely possible for one
> person to give up an e-mail address and then - after what seems like a
> reasonable period - that e-mail address could be assigned to a different
> person. But a reasonable period for daily use by people is different from a
> reasonable period for very long term archival of data. A hundred years from
> now, your current e-mail address may have been used by five different
> people.
>
> There does not seem to be any mechanism for indicating - with certainty -
> that two different blank nodes from two different original sources both
> refer to the exact same entity. All that reasoners can do is conjecture.
> And, while I have nothing against conjecture, there is also no means to
> indicate the degree of certainty with which said conjectures are stated.
> Either two blank nodes are assumed to be the same or they are not associated
> at all, which goes back to the Boolean nature of RDF.
>
> 5) I hate RDF-Linked-Lists too!
> -------------------------------
> Like Manu, I am really NOT a fan of RDF-Linked-Lists and, by extension, not
> at all interested in the RDF-Linked-List part of the RDFa Core 1.1 spec.
> First of all, what's up with all those extra blank nodes? You could
> have just left them out entirely. Secondly, as Manu stated, most web
> designers don't know from linked lists. The data structure that is used is
> so complicated, as far as they will be concerned, that they will just assume
> that lists are "Too Hard" and ignore them altogether. If there was ever a
> part of a spec that was doomed to be largely ignored, this is it. Third, the
> attribute chosen in the RDFa Core 1.1 spec to indicate that something should
> be in a RDF-Linked-List is inappropriate. By using "inlist" you make it seem
> as if that is the only type of list that could ever occur. You are closing
> the door to simple ordered lists, etcetera.
>
> One should never choose a generic term to refer to a very specific entity.
> What, then, does one do when yet another entity that also falls under that
> generic term comes into use? Do you then use a more specific term for that
> entity? That is guaranteed to lead to confusion. Unfortunately,
> RDF-Linked-Lists are now part of the RDFa Candidate Recommendation and it
> would be specification suicide to try to take them out now. However, the
> situation can still be salvaged by simply changing the name of the attribute
> used to something more specific like "rdf-linked-list" that conveys what is
> really going to be done with that data. Then, future specifications can use
> "orderedlist" (and "unorderedlist") to mean a simple list that is ordered
> based on the order of appearance within the document (or not, respectively).
> These are terms with which regular web developers are intimately familiar.
> In fact, it may even be possible to figure out a way so that web designers
> could incorporate an "orderedlist" or "unorderedlist" directly into an <ol>
> or <ul> tag structure in an HTML document.
>
>
> ==================================================
> Next, the limitations of - or issues I have with - the current proposed
> surface, layers, named graph model being discussed.
> ==================================================
> I have mentioned how it seems as if RDF is like a three-cornered box. The
> current discussion makes me think that the RDF community has gotten so used
> to that box that they can no longer think outside of it. The changes being
> discussed are radical, to be sure, but they have the feel of merely
> expanding the box from the inside, using what cardboard you have lying
> around. I may have missed it, but I have not seen any discussion about where
> you really want to go with these changes; what people will do with them; or
> how these changes will allow for creating more accurate models of the real
> world. Before you go any further, it seems to me that you all need to sit
> down and map out a long term vision for what you want people to be able to
> encode in RDF data as well as a pragmatic look at what will and won't be
> possible.
>
>
> A) N-Quads is only a single step in the right direction.
> --------------------------------------------------------
> N-Quads allow one to assign a "layer" (or "level" or "tag" or "name" or
> whatever you want to call it) to a triple. However, what if a triple needs
> to be in more than one layer? Do you repeat the triple? How many times? Now,
> I know that this repetition can be normalized in a data store but a
> serialized version of this data structure would be excessively verbose. It
> seems it would be better to allow any number of additional "layer IRIs" to
> be listed along with a triple. Let the author of the document decide which
> is more efficient: repeating triples or repeating additional "layer IRIs."
>
> B) N-Quads seem to be telling the users of the data
>   what they can do with it.
> ---------------------------------------------------
> If a triple has a particular "layer IRI" associated with it this seems to
> imply that it must go on that "layer" in a data model. I think it would be
> best to merely call these additional values "tags" and let users of the data
> do with them whatever they choose. Leave it up to developers of RDF analysis
> software to invent new metaphors like layers and surfaces and such. They can
> take the information in the triples along with the information in the
> associated tags and filter, sort, display, or otherwise rejigger it any way
> they please. By pre-imposing a metaphor, and then choosing your terminology
> and writing your specs around that metaphor, you are locking in people's
> thinking about what they can do or how they can re-envision that data.
>
> C) Metaphors are nice, but don't lock people in.
> ------------------------------------------------
> Discussed above. I just wanted to reiterate this point. Choose terminology
> that is as UNlimiting as possible, rather than picking a metaphor, and
> choosing terminology to match that metaphor. Just provide means to express
> as much information as possible and let people decide what they want to
> express and what they don't. Leave it to user documentation writers like me
> to explain to web authors or data managers all the different ways they can
> make use of the information that may or may not be stored in RDF data (as
> well as encourage them to be creative and invent new ways to envision that
> same information). Specifications should only tell people how to encode the
> information, not tell them what to do with it or how to think about it.
>
>
> ============
> My proposal:
> ============
> Fortunately, it is possible to solve most of these issues with relatively
> simple changes.
>
> Terminology:
> ------------
> Call the extra values, which you are adding with N-Quads, "tags" and leave
> it at that.
>
> Don't try to apply any metaphor at all. Let software developers and users do
> that.
>
> N-Tuples:
> ---------
> Basically just allow any number of additional tags instead of only one.
>
> If two identical triples exist ANYWHERE but with different tags, treat them
> as one triple with a union of all the tags.
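
The merge rule above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the tuple layout (subject,
predicate, object, then any number of tags) and the name
merge_tagged_triples are hypothetical and do not come from any RDF
specification.

```python
from collections import defaultdict

def merge_tagged_triples(statements):
    """Collapse identical triples into one entry holding the union of their tags.

    Each statement is a tuple: (subject, predicate, object, *tags).
    The tuple layout is an assumption made for this sketch.
    """
    merged = defaultdict(set)
    for subject, predicate, obj, *tags in statements:
        # Touching the key creates an empty tag set for untagged triples.
        merged[(subject, predicate, obj)].update(tags)
    return dict(merged)

statements = [
    ("ex:jim", "foaf:knows", "ex:sally", "ex:layer1"),
    ("ex:jim", "foaf:knows", "ex:sally", "ex:layer2", "ex:layer3"),
    ("ex:jim", "foaf:knows", "ex:bill"),  # a plain triple with no tags
]
merged = merge_tagged_triples(statements)
```

Here the two statements about Jim and Sally end up as a single triple
carrying all three tags, while the untagged triple keeps an empty tag set.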
>
> Permanent IRIs:
> ---------------
> Create some new official URI-Schemes[1]. Under those create a new "domain"
> system, organized based on the purpose of the domain, rather than who paid
> money to register a name. Create a legal structure that allows you to
> register sub-domains in perpetuity.
>
> For instance: For my DEMML project[2], I intend to register an official
> URI-Scheme called "demml://" under which I will organize a vast tree of
> topic codes for every subject that may need to be learned about by anyone.
> (Yes, it will be a huge tree but I have figured out how to encode around
> 10^31 topics with only 40 characters, including slashes[3].) When someone
> attempts to dereference a demml:// URI, plugins or other software on their
> system will first look in a specified folder on their hard drive, then look
> at a near-by server, then a more distant mirror, etc., automatically
> following links made available on each of those machines to find the next
> machine up in the chain where the sought-after material may exist. Those
> topic codes could also be used as IRIs to specify the topic of a web page or
> something. However, the IRI may not be directly dereferenceable. Instead, it
> may only be able to be dereferenced via the aforementioned mechanism and
> possibly only asynchronously[4]. While at the same time, that IRI may be
> used to mark millions of other documents which could be found by searching
> for that unique string of characters. This unique string of characters will
> remain valid in perpetuity because I didn't have to register a domain
> through ICANN.
>
> Reduce dependence on DNS system:
> --------------------------------
> At any point the DNS system can be co-opted by governments or hackers. IRIs
> could be redirected or made entirely unavailable. However, very rarely is
> the entire internet blocked from availability. By redundantly posting RDF
> data and vocabularies on various different servers it makes it much more
> difficult for malicious agents to block access to this information. It also
> simply protects information from being destroyed by catastrophic disasters.
> Then, instead of relying on the DNS to locate the ONE copy of that data
> stored on ONE vulnerable server, we can rely on search engines to index this
> information for us. I know, it sounds radical and even more fragile but
> think about it. There can be an almost infinite number of search engines
> available to choose from. It would be nearly impossible to block them all.
> Sure, any one search engine may filter out some content, but that search
> engine would not be very popular. Or it may filter content in just the way
> you want, something the DNS doesn't do. At first people would have to browse
> to a specific web site to search in a specific search engine. However,
> software would soon be written to automatically search via a selection of
> preferred search engines so that the user of RDF analysis software would
> never notice that the file had not necessarily been directly dereferenced.
> Besides, a lot of IRIs used in RDF are never meant to be directly
> dereferenced in normal use anyway. They are, in fact, merely labels. The
> only difference here, is that - if that IRI needs to be dereferenced -
> software can do so even if the original server is blocked or down.
>
> The best part is that all that is really required is to change one's
> conception of what an IRI is used for. Simply use IRIs like labels to be
> searched for rather than addresses representing locations (real or
> imaginary). You can still use the same formatting. And there is nothing to
> prevent one from storing an original copy at the location specified by the
> IRI. It is just that you don't depend upon that IRI being dereferenceable at
> all. Instead, you expect to be able to find many instances of it used within
> metadata on a page, spread all over the internet. Additional metadata will
> indicate whether a found instance of an IRI is a label for an exact copy of
> the original content or is a subset (like a quote) or simply a reference
> back to that original content.
>
> This search-engine-based dereferencing system can be combined with the above
> notion of "permanent IRIs via newly-created URI-Schemes" by the creation of
> a special URI-Scheme with the following simple rule: First try to
> dereference via HTTP or HTTPS but if that fails, then dereference via search
> engines. Leave it up to software developers to determine various innovative
> means to optimize this searching process. They could cache previous results
> on the desktop, or use a master list stored on some server. Software users
> will vote on the best methods with their dollars and their feet (or their
> clicks).
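
The "first HTTP, then search engines" rule can be sketched as an ordered
chain of resolvers. Everything named here is a hypothetical stand-in
(fetch_over_http, find_via_search_engines, MIRRORED_COPIES); no network
access takes place, and a real implementation would be free to add the
caching or master-list optimizations mentioned above.

```python
def dereference(iri, resolvers):
    """Try each (name, resolver) pair in order and return the first hit.

    A resolver either returns the content, returns None when it finds
    nothing, or raises LookupError when the route itself is unavailable.
    """
    for name, resolve in resolvers:
        try:
            content = resolve(iri)
        except LookupError:
            continue  # this route failed; fall through to the next one
        if content is not None:
            return name, content
    raise LookupError(f"no resolver could locate {iri}")

# Hypothetical stand-ins for illustration only.
def fetch_over_http(iri):
    # Simulates the original server being blocked or down.
    raise LookupError("original server blocked or down")

MIRRORED_COPIES = {
    "http://example.org/vocab#knows": "<copy of the vocabulary entry>",
}

def find_via_search_engines(iri):
    # Simulates finding a copy labeled with the IRI somewhere else.
    return MIRRORED_COPIES.get(iri)

RESOLVERS = [("http", fetch_over_http), ("search", find_via_search_engines)]
route, content = dereference("http://example.org/vocab#knows", RESOLVERS)
```

In this run the HTTP route fails and the copy is found through the search
route instead, so the caller never needs to know which route answered.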
>
> Give RDF more granularity while making it all warm and fuzzy:
> -------------------------------------------------------------
> My primary beef with RDF is the Boolean nature mentioned above. There is no
> way to create a weighted graph. This can be solved by making yet another
> relatively simple conceptual change. Currently, the entire string of an IRI
> is considered when comparing two IRIs for equivalence. (At least according
> to answers to my question on Stack Overflow[5].) All that is necessary is
> to, instead, say that only the path and fragment parts of an IRI are
> considered the official resource identifier for RDF purposes. Any query
> string or CGI data is treated as additional metadata about that IRI. This
> allows a weight factor to be assigned to a specific edge in a graph simply
> by adding a key-value pair to the IRI of the predicate. Now, instead of an
> all-or-nothing "foaf:knows" I can say I kind-of know you by using
> "foaf:knows?weight=.5". One might list a good friend using
> "foaf:knows?weight=.9" and perhaps claim "foaf:knows?weight=1.1" for one's
> wife (though she might say it is closer to "foaf:knows?weight=.4").
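
A minimal sketch of the comparison rule proposed above, assuming full IRIs
rather than prefixed names. The function names split_identifier and
same_resource are hypothetical; the parsing relies only on Python's
standard urllib.parse module. Note that this is the proposed behavior, not
how RDF compares IRIs today.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

def split_identifier(iri):
    """Separate an IRI into its identifying part and its query-string metadata."""
    parts = urlsplit(iri)
    # Everything except the query string identifies the resource.
    identity = urlunsplit(parts._replace(query=""))
    metadata = dict(parse_qsl(parts.query))
    return identity, metadata

def same_resource(iri_a, iri_b):
    """Under the proposal, two IRIs match when they differ only in query string."""
    return split_identifier(iri_a)[0] == split_identifier(iri_b)[0]

kind_of_know = "http://xmlns.com/foaf/0.1/knows?weight=.5"
good_friend = "http://xmlns.com/foaf/0.1/knows?weight=.9"

identity, metadata = split_identifier(kind_of_know)
weight = float(metadata["weight"])
```

Both predicates above reduce to the same foaf:knows identifier, and the
weight travels along as metadata that a processor is free to ignore.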
>
> Additional key-value pairs could be added to indicate additional metadata.
> One could indicate the original source of an IRI; the dc:creator of a
> triple; or the date that IRI was born on. Some metadata would be more
> appropriate for use in the subject-IRI and other metadata would be best in
> the predicate or object IRIs. Only certain keys would be used for IRI
> matching purposes. For instance, two subject IRIs with identical paths and
> fragments but different "born on" dates might be considered to be two
> distinct nodes. However, two triples with identical paths & fragments for
> each of the subject, predicate, and object but with different "triple
> created on" and "source" values listed in the predicate might be considered
> to be identical triples that just happened to say the same thing in
> different places, on different days. Kind of like retweeting. Naturally, two
> predicates that differ only in weight would be considered the same
> predicate carrying different weights. So, if two identical triples have
> different weight values then
> some kind of calculation could be done to determine the weight to use for
> reasoning purposes. Different reasoners could use different calculations, as
> they see fit.
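
One way a reasoner might reconcile differing weights is sketched below. The
choice of an arithmetic mean is purely an assumption for illustration; the
text above deliberately leaves the calculation up to each reasoner, so the
combining function is passed in as a parameter. The names strip_query and
combine_weights are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlsplit, urlunsplit, parse_qsl

def strip_query(iri):
    """Return (IRI without its query string, query-string metadata as a dict)."""
    parts = urlsplit(iri)
    return urlunsplit(parts._replace(query="")), dict(parse_qsl(parts.query))

def combine_weights(triples, combine=mean):
    """Group triples that match once query strings are set aside,
    then reduce each group's weights with the supplied function."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        bare_predicate, metadata = strip_query(predicate)
        if "weight" in metadata:
            grouped[(subject, bare_predicate, obj)].append(float(metadata["weight"]))
    return {triple: combine(weights) for triple, weights in grouped.items()}

KNOWS = "http://xmlns.com/foaf/0.1/knows"
triples = [
    ("ex:jim", KNOWS + "?weight=.9", "ex:sally"),  # stated in one place
    ("ex:jim", KNOWS + "?weight=.5", "ex:sally"),  # stated somewhere else
    ("ex:jim", KNOWS + "?weight=.5", "ex:bill"),
]
averaged = combine_weights(triples)
most_confident = combine_weights(triples, combine=max)
```

Swapping mean for max (or any other function over a list of numbers) is all
it takes for a different reasoner to apply a different policy.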
>
> Some may say that this change could break some IRIs currently in use. But
> does anyone really use the query string in their IRIs? Considering that all
> that would currently achieve is creating an entirely different IRI, I cannot
> see anyone choosing that method of creating different IRIs over the method
> of simply using fragments. Therefore, I can't see that making this change
> would break more than a handful of IRIs.
>
> Give blank nodes identifiers and make them fuzzy too:
> -----------------------------------------------------
> This seems counterintuitive. A blank node is supposed to be blank, right?
> But how do you tell the difference between one blank node and another? Well,
> within parsing and reasoning software, blank nodes are assigned identifiers.
> This is similar to the index number that is added to a row in a database
> table but is rarely seen by users of the database. Rather than saying that
> internal identifiers should be ignored when transmitting blank nodes from
> one system to another, provide a means of labeling those blank nodes with a
> globally unique serial number. This can be achieved by, again, using the
> query string to add metadata to the existing blank node identifier. Instead
> of just '_:reference' (where "reference" is usually a short string that is
> unique within a particular document), we could allow
> '_:reference?query="string"'.
> Using this query string, one can add metadata to a blank node indicating
> date and time of creation, a global serial number, etcetera. Then reasoners
> can use this additional metadata along with the information in the triples
> that contain the blank nodes and calculate a probability that those two
> blank nodes are, in fact, the exact same node. This probability can then be
> expressed by a weight factor in the query string of the predicate connecting
> the two blank nodes.
>
> Now, web authors should not be required to devise and insert a globally
> unique serial number into the metadata of each and every one of their blank
> nodes. This information could instead be derived from the unique IRI of the
> document and the short name given within the document. So a processor, when
> parsing the RDF data embedded within a document would simply convert
> something like '_:john' to
> '_:john?source="http://example.com/aboutjohn.html"&dateretrieved="2012-03-14^^xsd:date"'
> (with appropriate escaping and conversion to a format
> compatible with query strings). Implicitly created blank nodes (created
> through chaining, etc.) could simply be given a document-unique serial
> number to use along with the IRI of the document as above. However, I have
> yet to derive an algorithm that could ensure that the same blank nodes
> within a document get assigned the same serial number upon different
> parsings, even if the document has been edited between parsings. So that
> will require some more thought. Perhaps web authors could assign IDs to
> blank nodes through the use of an additional predicate and literal object
> each time they intentionally use chaining. I know, this would be a pain, but
> software tools could make this easier.
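
The conversion a processor would perform on an explicitly named blank node
can be sketched as follows. The field names "source" and "dateretrieved"
are taken from the example above; the function names and the decision to
drop the ^^xsd:date suffix in favor of a plain ISO date are assumptions
made for this sketch. Escaping is handled by urllib.parse.urlencode.

```python
from datetime import date
from urllib.parse import urlencode, parse_qsl

def globalize_blank_node(label, document_iri, retrieved):
    """Append document-derived metadata to a document-local blank node label."""
    query = urlencode({
        "source": document_iri,                  # IRI of the containing document
        "dateretrieved": retrieved.isoformat(),  # ISO 8601 calendar date
    })
    return f"{label}?{query}"

def read_blank_node(global_label):
    """Recover the local label and the metadata from a globalized label."""
    label, _, query = global_label.partition("?")
    return label, dict(parse_qsl(query))

global_label = globalize_blank_node(
    "_:john", "http://example.com/aboutjohn.html", date(2012, 3, 14)
)
label, metadata = read_blank_node(global_label)
```

This covers only blank nodes that the author named explicitly. It does not
address the open problem described above of assigning stable serial numbers
to implicitly created blank nodes across edits of the document.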
>
>
> ===========
> Conclusion:
> ===========
> From its conception RDF was both genius and incredibly limited. By making a
> few conceptual changes, throwing in some outside help in the form of special
> URI-Schemes, and opening up to the possibility that the DNS is not
> infallible, it is possible to grow RDF into a data standard that truly
> models the real world instead of some tri-cornered, cookie-cutter version of
> it. No one should be required to use any of these additional features.
> However, allowing the use of this additional metadata, embedded within the
> query string of an IRI, would give RDF reasoners a vast amount of additional
> information with which to, well, reason. Yes, this will then require the
> definition of certain standard query-string field-names and the approved
> datatypes that can be used with them. But just enough to avoid chaos. Allow
> web designers and experimenters to come up with their own field names as
> they see fit (with the warning that any one of those could be co-opted and
> put into a standard some day) and then sit back and see what happens. As has
> been mentioned many times, the best innovation sometimes occurs outside of
> the working groups. Give people the incredible flexibility that these
> changes will provide, and I believe a whole new world of
> data-interoperability will open up.
>
>
> [1] http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
> [2] www.demml.org
> [3] http://www.demml.org/standard/classification/
> [4] http://www.ideationizing.com/2009/07/intelligent-epidemic-routing.html
> [5] http://stackoverflow.com/questions/9171416/is-a-query-string-allowed-in-a-uri-used-in-rdf
>
>



--
http://dannyayers.com

http://webbeep.it  - text to tones and back again


RE: My RDF Manifesto

Grant Robertson
Danny, sorry to have taken so long to respond. Other life stuff sometimes
interferes...

> -----Original Message-----
> From: Danny Ayers [mailto:[hidden email]]
>
> 1) There is no meta-metadata.
>
> Named graphs appear to be the best solution here. Whatever
> they ultimately wind up looking like, there is always the
> HTTP model to fall back on: put an RDF-friendly format
> representation of a resource online (set of triples), talk about that.

I disagree. I have been watching the discussion of the possible uses for
that last member of a Quad and it seems there is disagreement as to whether
that member should be A) An IRI pointing to additional metadata (as you seem
to suggest here), or B) merely a label to indicate which "layer" a triple
should be on or which "space" said triple should be considered to be in. The
latter limits each triple to being in only one layer. Or, if two quads exist
that are identical except for the last member, how is one to know whether
they are truly intended to be the same triple spanning two different layers
or are supposed to be independent? Quads add more data but also simply
move the ambiguity a step or two to the right while actually multiplying
said ambiguity.

Relying on the "HTTP model" - wherein I suppose you mean the IRI of the
fourth member of the quad is actually a URL pointing to metadata about the
triple - utterly depends on either A) all those other web pages being
available each and every time the data needs to be processed or B) the data
on those web pages being cached within some extension of an RDF data store.
It also requires either a separate web page for each triple posted on the
internet or a set of different fragments within one large web page
describing a set of RDF data.

I have to say here: Seriously? Does the RDF community really think it is
simpler to: A) Require processors and data stores to handle a possible
additional element (that fourth element that makes a triple into a quad)? B)
Require the creation and subsequent repeated dereferencing of perhaps
thousands of additional separate web pages which hold that metadata? C)
Cause ambiguity as to whether that IRI should be dereferenced for metadata
or merely used as a label for a "layer" or "space"? ... All as opposed to
simply allowing use of the query string in existing IRIs? A change that will
be immensely easier for developers of processors to program around - if they
choose to ignore it - because all they have to do is ignore the query string
if they want to maintain the status quo (remember, no one uses the query
string because it is currently of no useful significance). A change that
will keep the meta-metadata with the triple itself. And a change that will
allow meta-metadata to be included with the triple while ALSO allowing
labels to be applied to the triple to indicate which "layer" they should be
on. Yes, it is possible to turn left by making three rights - and it may be
necessary every once in a while - but would you really want to get around
town that way all the time?


>
> 2) RDF is entirely Boolean.
>
> Not quite, "true" vs. "unknown" is a slightly different
> species. But I had exactly the same kind of issue with RDF
> when I first encountered it, and others have been there
> before. I believe Aaron Swartz came up with a property
> :kindaLike. But the fact that you can put numeric values in
> literals means that it is possible to connect to
> number-crunching systems. I'd suggest that putting

Adding a numeric value in a literal only connects that value to a subject.
It does not associate it with a particular triple that contains said
subject. If I want to say that Jim foaf:knows(100%) Sally but that Jim
foaf:knows(50%) Bill, your solution provides no means of doing that other
than putting the metadata in an entirely separate file. Please explain to me
how that is simpler to process and easier for web authors to write into
their web pages or RDF files.



> 3) RDF is fragile and impermanent.
>
> It's more or less as fragile and impermanent as Platonic
> solids (though YMMV as far as specs and mindshare are
> concerned :) But I agree very much that dependency on HTTP's
> dependency on DNS is troublesome.

I fail to see how the stability of the internet can be compared to a
geometrical construct.


> 4) Blank Nodes are too ambiguous yet not fuzzy enough
>
> Then don't use them :)

That is not a solution. Blank nodes are often absolutely required in order
to create the more complicated data structures that TBL promised we would be
able to create using the much simpler and easier to program around "triple."
Yet, once one has created a blank node - either implicitly through chaining
or explicitly through the use of _:name - it is impossible to tell for
certain whether a blank node in one data source (be it a web page or a
triple store) is really, REALLY the same as another blank node in another
data source... Especially after long periods of time.


Please remember, I am not thinking in terms of simply adding a bit of
citation information to a quote on my web page. Nor am I interested in
telling the world who my friends are (let alone through the least
user-friendly means possible). I am considering the implications of using RDF and
RDFa to store scientific and research data that can be used by as many
people as possible, over the course of potentially hundreds of years. I am
thinking in terms of what will meet the needs of scientists and real people.
Whether this system matches up with what someone's Discrete Math teacher
taught them in graduate school is not even on my priority list. Nor do I think it
is on the priority list of the people who will really use RDF, if it ever
finally meets their needs. I don't think even scientists are ever going to
say, "Well this system doesn't really meet my needs and it is exceedingly
difficult to store and use my data in this format... but hell, it is a
mathematically based model, so I guess I should use it."  I know I am not a
gray-beard in this community, nor do I have all kinds of math or CS degrees,
but this whole thing really seems like a huge case of a whole group of
people having gone down a path that looked interesting at first, but now
refusing to backtrack regardless of how treacherous that path has become or
how far it has taken them from their original goal.