Feedback on new rdf intro vignette #13

Open

amoeba opened this issue Feb 19, 2018 · 14 comments

@amoeba commented Feb 19, 2018

Hey Carl, I think this is fantastic. You've hit the nail on the head.

I thought I'd drop my feedback here rather than over on ropensci/software-review#169. I took a read-through for content and was very impressed with the current state of it, and also with the sections you've penciled out for future work. I made a variety of minor notes in the list immediately below, plus some higher-level comments in the sub-sections that follow.

  • I like the human tone the intro starts out on, and I think that style is maintained throughout, which was nice.
  • Starting with the tidyr::gather equivalence is fantastic because I think it is a good bridge for users.
  • You continually draw equivalencies for the user to technologies they are likely much more familiar with, which is great.
  • I really like Table 1!
  • I really liked how you walked the user through tidying the iris data, using local IDs, the problems that arise, and how an RDF mindset would solve them.
  • In the section "Subject URIs", when you start moving into namespaces, you offer up a namespace of "http://example.com/iris#" in order to world-wide-ify your IDs in the iris dataset. I think this might be a place you could start to lose users. For example, what is example.com/iris, why does this thing that looks like a URL not resolve to anything, and why would using broken URLs in my data be useful? I don't have any of the answers to these questions and I constantly struggle with and argue about this in my own work. The W3C intros use examples of Bob and Alice which, I think, make use of resolvable URIs. Do you think this stuff matters at this point in the intro?
  • It's great that you show the example SPARQL query to pull the iris data into a tabular form, but I think a better example to show graph queries (and thus graph thinking) might be useful. It kinda looks like you've put in a placeholder for this to happen already, so 👍. I'd be happy to work on this part with the more graph-like tidyverse datasets.
  • What do you think about renaming rdf_serialize to rdf_write to match the {xml,html,etc}_write/_read pattern in the tidyverse?
  • What do you think about making rdf_serialize print to stdout by default? I realize print.rdf and format.rdf effectively do this, combined with the rdf_print_format option. This would be like how writeLines uses con = stdout() as a default. (I've sketched both of these ideas just after this list.)
  • You use code like rdf <- rdf(), which tripped me up because I didn't know that, somehow, R lets you do this. Do you think it could confuse others too?
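
A rough sketch of what I mean by those two points, purely hypothetical (rdf_write is not part of the current API, and I'm assuming the underlying redland serializer wants a file path, hence the tempfile round-trip):

library(rdflib)

# Hypothetical rdf_write(): tidyverse-style name that, like writeLines(),
# echoes to stdout when no destination is given.
rdf_write <- function(rdf, doc = NULL, format = "nquads", ...) {
  if (is.null(doc)) {
    tmp <- tempfile(fileext = ".nq")          # serializer writes to paths
    rdf_serialize(rdf, doc = tmp, format = format, ...)
    writeLines(readLines(tmp))                # then echo to the console
  } else {
    rdf_serialize(rdf, doc = doc, format = format, ...)
  }
}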

rdf_add API

I noticed this:

rdf <- rdf()
rdf %>% rdf_add("",   
                "iris:Sepal.Length", 
                object = 5.1)
rdf

What do you think about adding support for first-class namespaces the way Python's rdflib does:

n = Namespace("http://example.org/people/")
n.bob # = rdflib.term.URIRef(u'http://example.org/people/bob')

In R, this'd look similar, but you'd write n("bob") instead:

iris <- rdf_ns("http://example.com/iris#")

rdf <- rdf()
rdf %>% rdf_add("",   
                iris("Sepal.Length"),
                object = 5.1)
rdf

Or maybe some more syntax-fu could be used, such as pipes or some other magrittr magic. Maybe list indexing (iris["obs1"]) or the / operator (iris / "obs1") could be overridden?
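
For what it's worth, the function-call version seems implementable with a one-line closure. A minimal sketch (rdf_ns is made up here, not an existing function):

library(rdflib)
library(magrittr)

# Hypothetical rdf_ns(): returns a closure that pastes a term onto a base URI.
rdf_ns <- function(base) {
  function(term) paste0(base, term)
}

iris_ns <- rdf_ns("http://example.com/iris#")
iris_ns("Sepal.Length")
#> [1] "http://example.com/iris#Sepal.Length"

rdf <- rdf()
rdf %>% rdf_add("", iris_ns("Sepal.Length"), object = 5.1)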

The power (or utility?) of graph query langs

I think you do a nice job of showing graph queries, but when I've seen graph query languages (specifically GraphQL) demoed, the "why" is usually that, in graph query languages, we describe the shape of the result rather than instructing the database engine how to generate the result. SPARQL is very much one of these graph query languages and I think going into this could be useful. I wonder, as I write this, whether GraphQL is being adopted in R much yet.
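
To make that "shape, not steps" point concrete, here's about the smallest possible demo (if I remember right, dc.rdf ships with the redland package; the query is just a triple pattern with holes in it):

library(rdflib)

doc <- system.file("extdata", "dc.rdf", package = "redland")
rdf <- rdf_parse(doc, format = "rdfxml")

# The SELECT describes the answer's shape; the engine figures out the how.
rdf_query(rdf,
  "SELECT ?title WHERE { ?s <http://purl.org/dc/elements/1.1/title> ?title }")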

SPARQL, inference, and reasoners

A big problem with the RDF world is getting all of OWL's features to work.
When people create data, they often use OWL to create their own domain-specific ontology, rather than making use of a pre-existing one. Many people would say this is a good thing (and many others would groan), because then they can map their ontology onto other ontologies and the linked data graph can grow. For example, they might publish their data with an additional triple stating that their class, theirontology:Dataset, is equivalent to, say, Schema.org/Dataset. Then any query asking for a Schema.org/Dataset would also return any theirontology:Dataset instances, in addition to any Schema.org/Dataset instances. The rub is that this requires an OWL reasoner to be running behind the SPARQL query engine, which I'm not sure is happening everywhere we run SPARQL, such as with redland.

If this could be made to work for users of rdflib I think that would be a really useful feature. Perhaps this means helping the user get a triple store set up from within the package and using the RStudio Connections API to make it all fit within the RStudio ecosystem.
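
Stating the mapping itself is trivial, by the way; it's acting on it that needs the reasoner. Something like this (URIs made up, and I'm assuming rdf_add treats a URI-shaped object as a resource rather than a literal):

library(rdflib)

rdf <- rdf()

# One extra triple declares the class equivalence; without a reasoner
# behind the query engine, nothing downstream actually uses it.
rdf_add(rdf,
        subject   = "http://theirontology.example.org/Dataset",
        predicate = "http://www.w3.org/2002/07/owl#equivalentClass",
        object    = "http://schema.org/Dataset")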

More complex uses of rdf_add

rdf_add works great for adding single statements and can obviously be used to create as complex a data model as RDF can support. But could there be an easier way to create complex data models? Say I want to say that I have a husky, and want to use a URI for myself but not for my dog:

rdf_data <- rdf()

# Manual blank nodes, probably already works?
rdf_data %>%
  rdf_add("http://example.org/Bryce",
          "pets:hasDog",
           "_:dog") %>%
  rdf_add("_:dog",
          "pets:isBreed",
           "Alaskan Husky")

# Nested `rdf_add` statements?
rdf_data %>%
  rdf_add("http://example.org/Bryce",
          "pets:hasDog",
           rdf_add(predicate = "pets:isBreed",
                   object = "Alaskan Husky"))

Or if I want to quickly add multiple statements about the same subject, like:

rdf %>%
  rdf_add("ex:Me", c("ex:isSpecies", "ex:hasFirstName"), c("Human", "Bryce"))

Is the best way of inputting this type of stuff just a data.frame?
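
If so, a dumb-but-workable version is just a row-wise loop (the column names here are only a suggested convention):

library(rdflib)

# One triple per row of a subject/predicate/object data.frame.
triples <- data.frame(
  subject   = c("ex:Me", "ex:Me"),
  predicate = c("ex:isSpecies", "ex:hasFirstName"),
  object    = c("Human", "Bryce"),
  stringsAsFactors = FALSE
)

rdf <- rdf()
for (i in seq_len(nrow(triples))) {
  rdf_add(rdf, triples$subject[i], triples$predicate[i], triples$object[i])
}
rdf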

@cboettig

Hey @amoeba, thanks much, this is great and very helpful. I think it would be very nice to flesh these issues out further in both the package functions and vignettes, and then sketch up a general-purpose introduction to RDF and related packages for R Journal with you and @annakrystalli.

A few quick replies while I read through this:

you offer up a namespace of "http://example.com/iris#" in order to world-wide-ify your IDs in the iris dataset. I think this might be a place you could start to lose users.

Yes, I agree entirely; I haven't quite figured out how to introduce the whole concept of namespaces; but it would be much better if I could avoid them creeping in at all until I address the concept properly. Maybe the issue of resolving vs non-resolving URIs should just be discussed at this stage?

renaming rdf_serialize to rdf_write?

Yeah, this might be a good idea; the one thing that gives me pause is that write methods are serialization-specific: e.g. readr provides write_csv, write_tsv, write_delim for each possible serialization of the same data.frame structure. Should we introduce write_nquads, write_rdfxml, write_turtle etc to match this pattern? (I think getting the user comfortable with thinking about RDF as an abstraction with multiple serializations, rather than as something with a single write_rdf, is important to communicate in the API design here, but not sure which is best).
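
e.g. the readr-style version would just be thin wrappers, something like (a sketch only; these functions don't exist yet):

write_nquads <- function(rdf, doc) rdf_serialize(rdf, doc, format = "nquads")
write_turtle <- function(rdf, doc) rdf_serialize(rdf, doc, format = "turtle")
write_rdfxml <- function(rdf, doc) rdf_serialize(rdf, doc, format = "rdfxml")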

rdf_serialize print to stdout by default

Sure, sounds reasonable. Relatedly, the current print methods don't work with head() and don't stop printing, so they have issues (see #14).

you use code like rdf <- rdf() ...

Ha, I blame you for that! (You suggested the argument to rdf_query, rdf_serialize be called rdf instead of x, and so it made sense to me to use the same variable name as the argument). As you noticed, this is fine (package function names belong to the package namespace / environment, not the global environment), but probably isn't a good practice. Other choices for generic examples? my_rdf <- rdf()? x <- rdf()?

... better example to show graph queries ...

Definitely; I have a sketch of this now here: https://github.com/cboettig/rdflib/blob/patch-performance/vignettes/data-lake.Rmd

Again, feedback would be great. This vignette tries to focus on the more complex graph query examples and applications, whereas rdfintro.Rmd is really more about figuring out how to introduce basic RDF concepts to the typical R user. I think the "Data Lake" and "Schema on Read" metaphors are useful here, though these examples could definitely be made more interesting and show more complex cases (in particular, it really should have an example of combining related but previously un-integrated data).

complex uses of rdf_add()

Great idea. Re nesting, I think the more natural "R" way is just to use lists! Maybe something like:

  list("@id" = "http://example.org/Bryce",
         "pets:hasDog" = list("pets:isBreed" = "Alaskan Husky")) %>%
as_rdf()

(This works with the experimental as_rdf() function, routing through JSON-LD, of course...)
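
If you want to see the plumbing that routing implies, it's roughly this (assuming rdf_parse() accepts a literal string rather than only a path; the @context pins down the pets: prefix so the compact IRIs expand properly):

library(rdflib)
library(jsonlite)

doc <- toJSON(
  list("@context"    = list(pets = "http://example.org/pets#"),
       "@id"         = "http://example.org/Bryce",
       "pets:hasDog" = list("pets:isBreed" = "Alaskan Husky")),
  auto_unbox = TRUE
)

# Nested list -> JSON -> JSON-LD -> triples
rdf <- rdf_parse(doc, format = "jsonld")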

gotta run, but more thoughts later

@cboettig

A few more quick thoughts:

Re:

# Manual blank nodes, probably already works?
rdf_data %>%
  rdf_add("http://example.org/Bryce",
          "pets:hasDog",
           "_:dog") %>%
  rdf_add("_:dog",
          "pets:isBreed",
           "Alaskan Husky")

Yup, this works, and in fact you can omit the _: and just do "dog" and you still get a blank node. (Redland doesn't support literal subjects)

namespaces for rdf_add()

This is a great idea. Though actually I think the whole namespace prefix thing isn't an optimal API -- it's familiar, of course, to anyone with an XML background, but I think it's a stumbling block for most users. I think what JSON-LD did with @context would actually be more intuitive (though I could be talked out of that). That is:

  • you can define a default @vocab and then all your properties (predicates and URI-type objects) default to that namespace without a prefix.
  • you can define a @base (same as most RDF serializations?) which becomes the base URI for any blank node (e.g. a subject with a string value instead of an absolute URI).
  • you can associate any individual term with a URI, e.g. "name": "http://schema.org/name".

The latter is most interesting because on the face of it this looks more cumbersome; if you're steeped in RDF or XML you're used to the notion that all terms come from specific vocabularies and you'd rather be explicit about it, saying this is a dc:title and that is a schema:name. But for most programming / developer use it's much more natural to work with bare names (not least because in R you can say list(name = "bob") but need quotes to say list("schema:name" = "bob")).

So I think the namespace implementation should emulate the ability to essentially define a context (using an R list), rather than a prefix per se. Of course you can define prefixes in a context as well (and also define types, e.g. a URI vs literal), which is super nice. So this needs some work to implement but I'd like to support bare names with a full mapping rather than only support prefixes.
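
Roughly what I have in mind, as a sketch (the context is a plain R list; the expand_term helper is made up):

context <- list(
  "@vocab" = "http://schema.org/",
  name     = "http://schema.org/name"
)

# Bare names expand through the mapping; anything unmapped falls back
# to the default @vocab namespace.
expand_term <- function(term, context) {
  if (!is.null(context[[term]])) return(context[[term]])
  paste0(context[["@vocab"]], term)
}

expand_term("name", context)
#> [1] "http://schema.org/name"
expand_term("givenName", context)
#> [1] "http://schema.org/givenName"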

GraphQL ...

Scott has https://github.com/ropensci/ghql which wraps Jeroen's bindings at https://github.com/ropensci/graphql . I think GraphQL is really elegant; but the backend setup to create a GraphQL database looks complex to me: it basically asks you to create your own ontology of types and for some reason does not reuse any RDF notions. But I haven't looked closely enough at it yet to honestly judge...

OWL ....

The vignette should definitely mention this somewhere to define Ontology, OWL, and the role of a reasoner. But I'm still warming up to OWL, with the following objections:

  1. As you know, we don't have access to any reasoner without a Java wrapper. Would love a good reasoner as a C library ...

  2. For most use cases, the user won't have access to a good ontology. (Even finding a good namespace is often hard.) Of course it would still make sense to support ontologies when they are available. An ideal approach should probably also support easy construction of common OWL clauses that a user could write on the fly to add reasoning, but:

  3. I'm still skeptical of the utility of general-purpose reasoning (probably my own ignorance; I could definitely do a 180 on this). My current thinking is that R users (or me, at any rate) would find it easier to introduce reasoning on triples by writing specific functions that operate on their triplestore than by writing out OWL logic. For instance, the phenology ontology gives the example of inferring the triple hasFlowers = true from a statement of flowerCount = 5 -- this seems to me the kind of thing that would be more natural to do as a function call on a database, and it's hard to imagine an ontology successfully anticipating all such possible deductions ahead of time. More generally, any mutate operation in dplyr is essentially a logical inference from the existing triplestore that adds more triples, right? OWL feels to me like it is essentially a subset of such functions?

Other examples, like inferring membership of a containing class (e.g. identifying all frogs from a species list), also seem more natural as functions or as table joins of species ids to a table of ranks. Making the user specify their inferences in their own functions also seems like the best way to avoid the dangers of poorly specified OWL relationships leading to nonsensical conclusions. But I could be entirely wrong on this -- I'd love to have a compelling example or two of OWL use that we could compare against doing the same inference with functions.
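
For concreteness, here's what I mean by "reasoning as a function" for the flowerCount example above -- an ordinary query-then-add, no reasoner involved (URIs made up, and I'm assuming logical objects become typed literals):

library(rdflib)

# Infer hasFlowers = TRUE for any subject with a positive flowerCount.
infer_has_flowers <- function(rdf) {
  hits <- rdf_query(rdf, paste(
    "SELECT ?s WHERE {",
    "  ?s <http://example.org/ppo#flowerCount> ?n .",
    "  FILTER(?n > 0)",
    "}"))
  for (s in hits$s) {
    rdf_add(rdf, s, "http://example.org/ppo#hasFlowers", TRUE)
  }
  rdf
}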

I also suspect that a philosophical difference between the "data science" mindset of today and the reasoning behind the design of OWL (and also SQL, as it applies to being a user interface) is the assumption that the user does not have a convenient general-purpose programming language at their disposal, and thus doesn't have the option of just writing on-the-fly functions to perform inference... dplyr (and pandas) brought the speed and abstractions of SQL to users inside of a general-purpose programming environment rather than a stand-alone RDBMS client, which I think has proven very powerful. I think there are similarly powerful abstractions in RDF which are really just an extension of concepts already in relational database design we can leverage.

@amoeba commented Mar 19, 2018

Wow, how did nearly a month go by without me responding to this? Sorry about that! I'll take a look this week for sure.

@cboettig

ha, no problem, we all have day jobs.

Meanwhile, I've added support for other backends for storing large triplestores (SQLite, MySQL, Postgres, Virtuoso). I still need to test the Virtuoso link; I'm thinking I can get it working through https://hub.docker.com/r/tenforce/virtuoso/. The others are already tested on circleci with docker-compose, and I'm drafting a quick vignette on this.

@amoeba commented Mar 19, 2018

Awesome! I use the tenforce image for all my virtuoso work and it's super easy to use, especially for initial bulk rdf data loading.

@cboettig

Oh, that's great to hear! Actually, maybe you can help me figure out how to set up the Virtuoso connection? It's not entirely obvious to me how that works.

e.g. with postgres or mysql, here's how I do the usual thing of passing the host, user, password etc:

rdf <- rdf(storage = "mysql",
           host = "mariadb",
           user = "root",
           password = "rdflib",
           database = "mysql",
           new_db = TRUE)

(In the above example, the MySQL server is running in a separate, linked container named mariadb; see the docker-compose.yml.)

Here I try basically the same thing to connect to the tenforce container, but no go. How do I set the user in the tenforce docker image?

r <- rdf(storage = "virtuoso",
         host = "virtuoso",
         user = "demo",
         password = "rdflib",
         new_db = TRUE)

Here are the redland C library docs for the virtuoso storage (interestingly, no port option like you see in the mysql docs). Here's my docker-compose.yml bit for the virtuoso:

virtuoso:
  image: tenforce/virtuoso:1.3.1-virtuoso7.2.2
  environment:
    DBA_PASSWORD: "rdflib"
    SPARQL_UPDATE: "true"
    DEFAULT_GRAPH: "http://www.example.com/my-graph"
  volumes:
    - ./data/virtuoso:/data

rdflib:
  image: ropensci/rdflib
  volumes:
    - .:/data
  working_dir: /data
  links:
    - postgres
    - mariadb
    - virtuoso

If you do docker-compose run rdflib R that should drop you into a connected R session where you can test stuff.

@amoeba commented Mar 19, 2018

Not sure off-hand but I can sure take a look.

@amoeba commented Mar 20, 2018

From a quick glance at the redland docs, have you tried passing a DSN in the options slot, like "dsn='Local Virtuoso',user='demo',password='demo'"? I see your current code passes the options as a list:

...
virtuoso = list(dsn = dsn, user = user, password = password, 
                    database = database, host = host, charset = charset)
...
virtuoso =  new("Storage", world, "virtuoso", name = name, options = options)
...

So the ultimate call is something like this, in effect:

new("Storage", world, "virtuoso", "dsn='Local Virtuoso',user='demo',password='demo'")

I'm going to recompile redland here so I can test it out.

@cboettig

yeah, that's basically what it's already doing under the hood but no go; see https://github.com/ropensci/rdflib/blob/master/R/rdf.R .

Could be that I don't have redland compiled quite right, but librdf at least recognizes virtuoso as a storage type once I compiled redland with librdf-storage-virtuoso https://github.com/ropensci/rdflib/blob/master/inst/docker/Dockerfile#L25 (feel free to use the docker-compose I mentioned above to test as well, though maybe I'm missing something from my Dockerfile recipe).

@amoeba commented Mar 20, 2018

Ah, I skipped over options_to_str. Got it. I'll keep on banging away.

@amoeba commented Mar 20, 2018

Okay, at the very least I now see how things work and can get to this failure case. I think I'm catching up :)

library(redland)

world <- new("World")
storage <- new("Storage", world, "virtuoso", "db", "dsn='Virtuoso', user='dba', password='rdflib'")
model <- new("Model", world, storage, options="")
queryString <- "select * where { ?s ?p ?o . }"
query <- new("Query", world, queryString, base_uri=NULL, query_language="sparql", query_uri=NULL)
queryResult <- executeQuery(query, model)
#> librdf error - Virtuoso SQLConnect() failed [IM002] [iODBC][Driver Manager]Data source name not found and no default driver specified. Driver could not be loaded

Gotta run for now but this is interesting.

@cboettig

That was fast. Yeah, that's exactly what I'm seeing too. Seems to be an ODBC problem, but I'm not sure why that wasn't an issue for MySQL or Postgres. Maybe http://docs.openlinksw.com/virtuoso/virtmanconfodbcdsnunix/ is related, but really I have no idea.

@cboettig

@amoeba cool, looks like that works. A bit weird to set that up with an external container though, since the /etc/odbc.ini file needs to be on the app container (i.e. where R is running), and thus the .so driver also needs to be on the app container; so I don't see how to do this when linking an external virtuoso container (linking a random .so via volume sharing seems like a bad idea).

Anyway, I just dropped a basic virtuoso install into my ropensci/rdflib docker image, and start virtuoso up before running the tests:

Sys.sleep(5)
system("virtuoso-t -c /etc/virtuoso-opensource-6.1/virtuoso.ini")
Sys.sleep(5)
devtools::load_all()
x <- testthat::test_file("tests/testthat/test-rdf_storage.R")

That seems to let my virtuoso test pass. Yay, it will be nice to have a high-performance database.

@amoeba commented Apr 2, 2018

Maybe the issue of resolving vs non-resolving URIs should just be discussed at this stage?

Yes, though I think it would be good to have a scientist-oriented metaphor to make this more palatable. I also expect scientists won't understand the idea of a URL 'resolving' without some help. Maybe the nearest thing would be ISBNs, or, for the ecologically-minded, taxonomic or genetic IDs.

Should we introduce write_nquads, write_rdfxml, write_turtle etc to match this pattern?

That feels right. write_{csv|tsv} are specializations of write_delim. For packages attempting to mimic the tidyverse naming approach, I think serialize is a word to be avoided.

Other choices for generic examples? my_rdf <- rdf()? x <- rdf()?

Ah, the perennial "wtf do I call my data variable" problem. I think I prefer names more specific than "my_data" (here, "my_rdf") that actually describe the data they hold.

I'd like to support bare names with a full mapping rather than only support prefixes.

I think I can get behind this. That said, I think this type of approach has some usability wins:

schema_org <- rdf_ns("https://schema.org") # grabs a list of schema.org terms
schema_org$<TAB> # autocompletes terms from schema_org

The list of classes and properties would just be a list. The downside of this is that it's not really how RDF works. The binding in this case would be to a name, rather than an IRI. So it's just a hack to get autocomplete.
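
Here's the hack spelled out, just to show how little machinery it needs (in real use the term list would be fetched from the vocabulary; here it's abbreviated by hand, and this rdf_ns variant returns a list rather than a closure):

rdf_ns <- function(base, terms) {
  setNames(as.list(paste0(base, terms)), terms)
}

schema_org <- rdf_ns("https://schema.org/", c("Dataset", "name", "author"))
schema_org$name   # `$` is what gives you <TAB> completion in RStudio
#> [1] "https://schema.org/name"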

For most use cases, the user won't have access to a good ontology.

Yeah, this doesn't exist. I think it's safest to assume the user of rdflib is a user who knows how to author RDF and just needs to do it in R. So they know their ontologies.

My current thinking is that R users (or me at any rate) would find it easier to introduce reasoning on triples by writing specific functions that operate on their triplestore than by writing out OWL logic.

Yes, perhaps. mutate_if(has_flowers).

More generally, any mutate operation in dplyr is essentially a logical inference from the existing triplestore that adds more triples, right?

Sure.

OWL feels to me like it is essentially a subset of such functions?

I don't get this, but I also can't profess to understand what mutate_if(has_flowers) is beyond some R code, or what OWL is besides a thing I use. I think OWL involves things called Description Logics, and I think I have no idea what those are.

But I could be entirely wrong on this -- I'd love to have a compelling example or two of OWL use that we could compare against doing the same inference with functions.

My understanding is that the biomedical community makes great use of OWL, so maybe that's the place to look. I'll look into this. To your taxonomy example, my take on OWL is that it would allow one group to describe a bunch of taxa in RDF, then another group could come along and define, in OWL, what it means to be "more derived than" or "sister taxa to", and then they could query for taxa conforming to those relationships. It feels powerful, but you're right in pointing out that we have many less abstruse ways to do this already in database land.

I think there are similarly powerful abstractions in RDF which are really just an extension of concepts already in relational database design we can leverage.

You know, I think you're probably right here. It's clear you're thinking hard about this which is really nice to see.
