Feedback on new rdf intro vignette #13
Hey @amoeba , Thanks much, this is great and very helpful. I think it would be very nice to flesh these issues out further in both the package functions & vignettes, and then sketch up a general-purpose introduction to RDF and related packages for R Journal with you and @annakrystalli. A few quick replies while I read through this:
Yes, I agree entirely; I haven't quite figured out how to introduce the whole concept of namespaces; but it would be much better if I could avoid them creeping in at all until I address the concept properly. Maybe the issue of resolving vs non-resolving URIs should just be discussed at this stage?
Yeah, this might be a good idea; the one thing that gives me pause is that write methods are serialization-specific: e.g.
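To make that concrete, here's a sketch of the issue (argument names per the current `rdf_serialize()` API; the exact set of `format` values depends on the redland build):

```r
library(rdflib)

rdf <- rdf()
rdf <- rdf_add(rdf, "http://example.org/Bryce",
               "http://schema.org/name", "Bryce")

# Each write commits to one serialization, so a single generic
# write method would still have to pick (or guess) a format:
rdf_serialize(rdf, "bryce.ttl", format = "turtle")
rdf_serialize(rdf, "bryce.nq",  format = "nquads")
```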
Sure, sounds reasonable. Related, but the current
Ha, I blame you for that! (You suggested the argument to
Definitely; I have a sketch of this now here: https://github.com/cboettig/rdflib/blob/patch-performance/vignettes/data-lake.Rmd Again feedback would be great. This vignette tries to focus on the more complex graph query examples and applications, whereas the
Great idea. Re nesting, I think the more natural "R" way is just to use lists! Maybe something like:

```r
list("@id" = "http://example.org/Bryce",
     "pets:hasDog" = list("pets:isBreed" = "Alaskan Husky")) %>%
  as_rdf()
```

(this works with the experimental …) gotta run, but more thoughts later
A few more quick thoughts: Re:
Yup, this works, and in fact you can omit the
This is a great idea. Though actually I think the whole namespace prefix thing isn't an optimal API -- it's familiar of course to anyone with an XML background, but I think it's a stumbling block for most users. I think what JSON-LD did with
The latter is most interesting because on the face of it it looks more cumbersome; if you're steeped in RDF or XML, you're used to the notion that all terms come from specific vocabularies, and you'd rather be explicit about it, saying this is a … So I think the namespace implementation should emulate the ability to essentially define a context (using an R list), rather than a prefix per se. Of course you can define prefixes in a context as well (and also define types, e.g. a URI vs a literal), which is super nice. So this needs some work to implement, but I'd like to support bare names with a full mapping rather than only supporting prefixes.
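For comparison, this is the JSON-LD pattern being described: a context maps bare names to IRIs, and can also declare value types (the example below is illustrative, not from the vignette):

```json
{
  "@context": {
    "schema": "http://schema.org/",
    "name": "schema:name",
    "homepage": { "@id": "schema:url", "@type": "@id" }
  },
  "name": "Jane Doe",
  "homepage": "http://example.org/jane"
}
```

Here `name` expands to a full IRI via the context, and the `"@type": "@id"` entry says `homepage`'s value is a URI rather than a literal.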
Scott has https://github.com/ropensci/ghql which wraps Jeroen's bindings at https://github.com/ropensci/graphql . I think GraphQL is really elegant, but the backend setup to create a GraphQL database looks complex to me: it basically asks you to create your own ontology of types and for some reason does not reuse any RDF notions. But I haven't looked closely enough at it yet to honestly judge...
The vignette should definitely mention this somewhere to define Ontology, OWL, and the role of a reasoner. But I'm still warming up to OWL, with the following objections:
Other examples might include inferring membership of a containing class; e.g. identifying all frogs from a species list also seems more natural as a function, or as a table join of species ids against a table of ranks. Making the user specify their inferences in their own functions also seems like the best way to avoid the dangers of poorly specified OWL relationships leading to nonsensical conclusions. But I could be entirely wrong on this -- I'd love to have a compelling example or two of OWL use that we could compare against doing the same inference with functions.

I also suspect that a philosophical difference between the "data science" mindset of today and the reasoning behind the design of OWL (and also SQL, as it applies to being a user interface) is the assumption that the user does not have a convenient general-purpose programming language at their disposal, and thus doesn't have the option of just writing on-the-fly functions to perform inference...
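As a concrete (made-up) version of the "table join instead of reasoner" idea, the frog example might just be:

```r
# Hypothetical tables: a species list and a lookup of which families are frogs.
species <- data.frame(id     = c("sp1", "sp2", "sp3"),
                      family = c("Ranidae", "Bufonidae", "Felidae"))
frog_families <- data.frame(family = c("Ranidae", "Bufonidae"))

# The "inference" is an ordinary join:
frogs <- merge(species, frog_families, by = "family")
sort(frogs$id)
#> [1] "sp1" "sp2"
```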
Wow, how did nearly a month go by without me responding to this? Sorry about that! I'll take a look this week for sure.
ha, no problem, we all have day jobs. Meanwhile, I've added support for the other backends for storing large triplestores (SQLite, MySQL, Postgres, Virtuoso). I still need to test the Virtuoso link; I'm thinking I can get it working through https://hub.docker.com/r/tenforce/virtuoso/. The others are already tested on CircleCI with …
Awesome! I use the tenforce image for all my virtuoso work and it's super easy to use, especially for initial bulk RDF data loading.
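For anyone following along, spinning that image up looks roughly like this (the ports and env var are from my memory of the image docs, so double-check them there):

```
# Assumed flags for the tenforce/virtuoso image:
#   8890 = HTTP/SPARQL endpoint, 1111 = ISQL/ODBC port
docker run -d --name virtuoso \
  -p 8890:8890 -p 1111:1111 \
  -e DBA_PASSWORD=rdflib \
  tenforce/virtuoso
```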
Oh, that's great to hear! Actually, maybe you can help me figure out how to set up the Virtuoso connection? It's not entirely obvious to me how that works. E.g. with Postgres or MySQL, here's how I do the usual thing of passing the host, user, password, etc. (see rdflib/tests/testthat/test-rdf_storage.R, lines 45 to 48 @ a66cfa9).

(In the above example, the MySQL server is running in a separate, linked container named …) Here I try basically the same thing to connect to the tenforce container, but no go (see rdflib/tests/testthat/test-rdf_storage.R, lines 73 to 74 @ a66cfa9). How do I set the …?

Here's the rdflib C library docs for the Virtuoso link (interestingly, no …): lines 14 to 31 @ 349abee
If you do …
Not sure off-hand but I can sure take a look.
At a quick glance at the redland docs, have you tried passing a DSN in the …

```r
...
virtuoso = list(dsn = dsn, user = user, password = password,
                database = database, host = host, charset = charset)
...
virtuoso = new("Storage", world, "virtuoso", name = name, options = options)
...
```

So the ultimate call is something like this, in effect:

```r
new("Storage", world, "virtuoso", "dsn='Local Virtuoso',user='demo',password='demo'")
```

I'm going to recompile redland here so I can test it out.
yeah, that's basically what it's already doing under the hood, but no go; see https://github.com/ropensci/rdflib/blob/master/R/rdf.R . Could be that I don't have redland compiled quite right, but librdf at least recognizes virtuoso as a storage type once I compiled redland with …
Ah, I skipped over …
Okay, at the very least I see how things are working and get to this failure case. I think I'm catching up :)

```r
library(redland)
world <- new("World")
storage <- new("Storage", world, "virtuoso", "db",
               "dsn='Virtuoso', user='dba', password='rdflib'")
model <- new("Model", world, storage, options = "")
queryString <- "select * where { ?s ?p ?o . }"
query <- new("Query", world, queryString, base_uri = NULL,
             query_language = "sparql", query_uri = NULL)
queryResult <- executeQuery(query, model)
```

which fails with:

```
librdf error - Virtuoso SQLConnect() failed [IM002] [iODBC][Driver Manager]Data source name not found and no default driver specified. Driver could not be loaded
```

Gotta run for now but this is interesting.
that was fast. yeah, that's exactly what I'm seeing too. Seems to be an ODBC problem, but not sure why that wasn't an issue for MySQL or Postgres. Maybe http://docs.openlinksw.com/virtuoso/virtmanconfodbcdsnunix/ is related, but really I have no idea.
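The `[IM002]` error means the iODBC driver manager can't find a DSN named `Virtuoso`. A sketch of what the corresponding `odbc.ini` entry might look like (the driver path is an assumption and varies by install; the Virtuoso docs linked above cover the details):

```ini
; ~/.odbc.ini -- the section name must match the dsn= passed to the storage
[Virtuoso]
Driver  = /usr/local/lib/virtodbc.so
Address = localhost:1111
```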
@amoeba cool, looks like that works. A bit weird to set that up with an external container though, since the … Anyway, I just dropped a basic virtuoso install into my … (lines 2 to 6 @ 34b251f)

That seems to let my Virtuoso test pass. Yay, it will be nice to have a high-performance database.
Yes, though I think it would be good to have a scientist-oriented metaphor to make this more palatable. I also expect scientists won't understand the idea of a URL "resolving" without some help. Maybe the nearest thing would be something like ISBNs. Or, for the ecologically-minded, taxonomic or genetics IDs.
That feels right.
Ah, the perennial "wtf do I call my data variable" problem. I think I prefer names more specific than "my_data" (here, "my_rdf") that actually describe the variable(s).
I think I can get behind this. That said, I think this type of approach has some usability wins:

```r
schema_org <- rdf_ns("https://schema.org")  # grabs a list of schema.org terms
schema_org$<TAB>                            # autocompletes terms from schema_org
```

The list of classes and properties would just be a list. The downside of this is that it's not really how RDF works. The binding in this case would be to a name, rather than an IRI. So it's just a hack to get autocomplete.
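A minimal sketch of how such a helper could work (`rdf_ns()` here is hypothetical, and this version takes the term list as an argument rather than scraping the vocabulary):

```r
# Hypothetical helper: map bare term names to full IRIs in a named list,
# so `$` autocompletion works at the console.
rdf_ns <- function(base, terms) {
  as.list(setNames(paste0(base, terms), terms))
}

schema_org <- rdf_ns("https://schema.org/", c("Dataset", "name", "author"))
schema_org$Dataset
#> [1] "https://schema.org/Dataset"
```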
Yeah, this doesn't exist. I think it's safest to assume the user of rdflib is a user who knows how to author RDF and just needs to do it in R. So they know their ontologies.
Yes, perhaps.
Sure.
I don't get this but I also can't profess to understand what
My understanding is that the biomedical community makes great use of OWL, so maybe that's the place to look. I'll look into this. To your taxonomy example, my take on OWL is that it would allow one group to describe a bunch of taxa in RDF, then another group could come along and define, in OWL, what it means to be "more derived than" or "sister taxa to", and then they could query for taxa conforming to those relationships. It feels powerful, but you're right in pointing out that we have many less abstruse ways to do this already in database land.
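The kind of OWL axiom being described is compact to state; e.g. in Turtle (all IRIs below are made up):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/taxa#> .

# One group defines relationships over another group's terms;
# a reasoner can then answer queries phrased in either vocabulary.
ex:sisterTaxonTo a owl:SymmetricProperty .
ex:Frog owl:equivalentClass ex:Anuran .
```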
You know, I think you're probably right here. It's clear you're thinking hard about this, which is really nice to see.
Hey Carl, I think this is fantastic. You've hit the nail on the head.
I thought I'd drop my feedback here rather than over on ropensci/software-review#169. I took a read-through for content and was very impressed with the current state of it and also with the sections you've penciled out for future work. I made a variety of minor notes as a list (immediately below) and made some high level comments in sub-sections below.
- The `tidyr::gather` equivalence is fantastic because I think it is a good bridge for users.
- `iris` data, using local IDs, the problem that arises, and how an RDF mindset would solve them.
- but I think a better example to show graph queries (and thus graph thinking) might be useful. It kinda looks like you've put in a placeholder for this to happen already, so 👍 I'd be happy to work on this part with the more graph-like tidyverse datasets.
- `rdf_serialize` to `rdf_write`, to match the `{xml,html,etc}_write`/`_read` pattern in the tidyverse?
- `rdf_serialize` print to stdout by default? I realize `print.rdf` and `format.rdf` effectively do this, combined with the `rdf_print_format` option. This would be like how `writeLines` uses `con = stdout()` as a default.
- `rdf <- rdf()`, which was a bit confusing as I didn't know that, somehow, R lets you do this. Do you think this could be a bit confusing?

`rdf_add` API

I noticed this:
What do you think about adding support for first-class namespaces the way librdf does? In R, this'd look similar, but you'd write `n("bob")` instead. Or maybe some more syntax-fu could be used, such as pipes or some other syntax magrittr provides. Maybe list assignment (`iris["obs1"]`) or `/` (`iris / "obs1"`) could be overridden?
) can be overridden?The power (or utility?) of graph query langs
I think you do a nice job of showing graph queries, but when I've seen graph query languages (specifically GraphQL) demoed, the "why" is usually that, in graph query languages, we describe the shape of the result rather than instructing the database engine how to generate the result. SPARQL is very much one of these graph query languages and I think going into this could be useful. I wonder, as I write this, whether GraphQL is being adopted in R much yet.
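As a small illustration of "describing the shape of the result": a SPARQL query just states the pattern to match, and the engine decides how to produce it (the vocabulary here is illustrative):

```sparql
PREFIX schema: <http://schema.org/>

SELECT ?name ?email
WHERE {
  ?person a schema:Person ;
          schema:name  ?name ;
          schema:email ?email .
}
```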
SPARQL, inference, and reasoners
A big problem with the RDF world is getting all of OWL's features to work.
When people create data, they often use OWL to create their own domain-specific ontology, rather than making use of a pre-existing one. Many people would say this is a good thing (and many others would groan), because they can then map their ontology onto other ontologies and the linked data graph can grow. For example, they might publish their data with an additional triple stating that their class, `theirontology:Dataset`, is equivalent to, say, `Schema.org/Dataset`. Then any query asking for a `Schema.org/Dataset` would also return any `theirontology:Dataset`s, in addition to any `Schema.org/Dataset`s. The rub is that this requires an OWL reasoner to be running behind the SPARQL query engine, which I'm not sure is happening everywhere we run SPARQL, such as with redland.

If this could be made to work for users of `rdflib`, I think that would be a really useful feature. Perhaps this means helping the user get a triple store set up from within the package and using the RStudio Connection API to make it all fit within the RStudio ecosystem.

More complex uses of `rdf_add`

`rdf_add` works great for adding single statements and can obviously be used to create as complex a data model as RDF can support. But could there be an easier way to create complex data models? Say I want to say that I have a husky, and want to use a URI for myself but not for my dog:

Or if I want to quickly add multiple statements about the same subject, like:
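For reference, the status-quo way to make several statements about one subject is just repeated `rdf_add()` calls (the URIs and the blank-node spelling below are illustrative):

```r
library(rdflib)

rdf <- rdf()
me <- "http://example.org/Bryce"
rdf <- rdf_add(rdf, me, "http://schema.org/name", "Bryce")
rdf <- rdf_add(rdf, me, "http://example.org/pets#hasDog", "_:dog1")
rdf <- rdf_add(rdf, "_:dog1", "http://example.org/pets#isBreed", "Alaskan Husky")
```

A `data.frame` of subject/predicate/object columns, looped over with something like `apply`, would at least remove the repetition.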
Is the best way of inputting this type of stuff just a `data.frame`?