Support multiple resources #12

Closed
pheyvaer opened this issue Aug 7, 2023 · 14 comments · Fixed by #21
Labels: enhancement (New feature or request), priority

Comments

@pheyvaer
Contributor

pheyvaer commented Aug 7, 2023

It's possible to read from and write to a single resource, but certain use cases need multiple resources.
Reading from multiple resources is supported by Comunica, but for writing we need to take a closer look at how we should do this in YARRRML/RML.

@sevrijss
Collaborator

https://developers.google.com/sheets/api/guides/metadata could be used to keep track of where each piece of information comes from.

@pheyvaer
Contributor Author

@bjdmeest For this, do we want to try/do something specific with Targets?

@bjdmeest

Currently, the single source also isn't done via Targets, so I'd first try to figure out how to do it (without thinking too much about RML), and then have a look to see whether RML targets make sense for this or not. (I think we'll first need to add a Solid target or equivalent to RMLMapper; Els is working on that.)

@pheyvaer
Contributor Author

So what do you suggest as the next step? Can you maybe put that in a separate issue?

@bjdmeest

I think just doing this issue will be tricky enough on its own. For next steps I'd first need to see the outcome of this one.

@pheyvaer
Contributor Author

But I don't understand from your initial comment what we should use? Something with Targets? Or something else in the RML space? Or just something engineered that works?

@bjdmeest

I suggest that @sevrijss analyses the problem and suggests a course of action. We can then review and decide whether that course of action makes sense as is or should be enhanced (e.g. by using an RML target to generate multiple local files). I don't know what the potential pitfalls may be, so I can't give a clearer direction yet.

sevrijss linked a pull request on Aug 21, 2023 that will close this issue
@sevrijss
Collaborator

Possible courses of action:

  • 1 rules file / resource
    If there is 1 rules file / resource, every resource can be dealt with separately. The rules file should contain the rules to go from the sheet data to the triples that need to be stored in that specific resource. All the paths could be specified in the config file. A disadvantage is that there will be a lot of files when you have a lot of resources.

  • 1 rules file using targets from RML spec
    I've done some research into the target functionality in the RML spec. It seems promising, but I think there are some problems:

    • multiple subjects
      There is no guarantee that all the resources use the same subject identifiers. You would have to define a subject-target link in the rules file.
    • complex to set up
      If everything is contained in a single rules file with multiple subject entries and targets, it becomes very difficult to set up and maintain.

    I've taken a look at the spec and tried a couple of things in Matey. The targets seem to work fine, but I'm having trouble with the multiple subjects functionality.

Both approaches have advantages and disadvantages. The first is easier to set up but will require a lot of files. The second is more RML-based, but will be more complex to set up and maintain.
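
As a rough illustration of the first option, the config could list one rules file per resource. This is only a sketch: the `ResourceEntry` type, its field names, the URLs, and the file paths are all made up for the example and are not the project's actual config format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceEntry:
    """One config entry: a resource and the rules file that produces its triples."""
    resource_url: str  # hypothetical pod resource URL
    rules_file: str    # hypothetical path to a YARRRML rules file


def index_by_resource(entries: list[ResourceEntry]) -> dict[str, str]:
    """Map each resource URL to its rules file, rejecting duplicate resources."""
    index: dict[str, str] = {}
    for entry in entries:
        if entry.resource_url in index:
            raise ValueError(f"resource configured twice: {entry.resource_url}")
        index[entry.resource_url] = entry.rules_file
    return index


# Hypothetical config: two resources, each with its own rules file.
entries = [
    ResourceEntry("https://pod.example/shows", "rules/shows.yml"),
    ResourceEntry("https://pod.example/actors", "rules/actors.yml"),
]
rules_for = index_by_resource(entries)
```

Each write would then run the rules file registered for the resource being updated, which is simple to reason about but means the number of files grows with the number of resources.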

@pheyvaer
Contributor Author

Interesting! What would happen if you do a query over multiple resources, but you don't know which data is coming from which data source? How do you know which resources to update for which specific data?

@sevrijss
Collaborator

sevrijss commented Aug 22, 2023

In a specific case, that might pose a problem.
Say you have 2 resources, 1 containing data about a TV show and another about actors. Those 2 can be easily linked using a SPARQL query and Comunica.
But as soon as the same data is spread out over multiple pods, e.g. 2 pods containing book information, I don't think it's possible to know where data came from when using the RML spec. Also, where would data go?

@pheyvaer
Contributor Author

I can imagine that you would want to write the changes to the resource the data originally came from. But indeed, I don't think you can specify that with RML. Comunica might be able to tell us where every triple came from, but then we still need to see how that information can be used by RML.

@bjdmeest What are your thoughts?
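
To make the routing problem concrete: if the source of each triple were recorded at read time, changed triples could be sent back to the resource they came from. The sketch below uses plain Python data and does not call Comunica; whether Comunica can supply this per-triple provenance is exactly the open question. The function name, the triples, and the pod URLs are hypothetical.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


def route_by_source(
    provenance: dict[Triple, str], triples: list[Triple]
) -> tuple[dict[str, list[Triple]], list[Triple]]:
    """Group triples by the resource they were read from.

    Triples without a recorded source (e.g. newly added data) are returned
    separately, because nothing tells us where they should be written.
    """
    routed: dict[str, list[Triple]] = defaultdict(list)
    unrouted: list[Triple] = []
    for triple in triples:
        source = provenance.get(triple)
        if source is None:
            unrouted.append(triple)
        else:
            routed[source].append(triple)
    return dict(routed), unrouted


# Hypothetical provenance recorded while reading from two book pods.
provenance = {
    ("ex:book1", "ex:title", "Dune"): "https://alice.pod.example/books",
    ("ex:book2", "ex:title", "Emma"): "https://bob.pod.example/books",
}
changed = [
    ("ex:book1", "ex:title", "Dune"),     # known source
    ("ex:book3", "ex:title", "Ulysses"),  # new data, no known source
]
routed, unrouted = route_by_source(provenance, changed)
```

Existing triples can be routed this way, but the `unrouted` list shows the remaining gap: new data has no origin, so something else (config or metadata) has to say where it goes.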

@bjdmeest

For me, the most relevant case is when a single query result/row contains data from multiple sources. So I'd start with specifying which column updates should be written to which source.

E.g. a sheet listing my favorite TV shows and my personal ratings: the TV show metadata comes from DBpedia, the ratings come from my pod. My rating updates should be persisted in my pod; TV show metadata updates should (ideally) be fed back to DBpedia, but in practice they will currently just fail, and that's OK :)

You can probably figure that out by combining the RML and the SPARQL query (?tv_rating comes from term map <tm_001>, which comes from logical source <myPersonalPod>), but adding that as separate metadata as an initial test is fine for me.
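
The "separate metadata" variant could look roughly like the sketch below: a column-to-source table, and a helper that groups the changed cells of a row by the source they should be written to. The column names, source URLs, and function name are assumptions for illustration only.

```python
# Hypothetical metadata: which sheet column belongs to which source.
COLUMN_SOURCES = {
    "tv_show": "https://dbpedia.org/sparql",
    "tv_rating": "https://me.pod.example/ratings",
}


def group_updates_by_source(
    column_sources: dict[str, str],
    old_row: dict[str, str],
    new_row: dict[str, str],
) -> dict[str, dict[str, str]]:
    """Return {source: {column: new_value}} for every cell that changed."""
    updates: dict[str, dict[str, str]] = {}
    for column, new_value in new_row.items():
        if old_row.get(column) == new_value:
            continue  # unchanged cell, nothing to write
        source = column_sources[column]  # KeyError if a column has no source
        updates.setdefault(source, {})[column] = new_value
    return updates


old_row = {"tv_show": "Dark", "tv_rating": "4"}
new_row = {"tv_show": "Dark", "tv_rating": "5"}
updates = group_updates_by_source(COLUMN_SOURCES, old_row, new_row)
```

Here only the rating changed, so only the pod receives an update. A change to `tv_show` would be grouped under the DBpedia source, where the write is expected to fail for now.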

@pheyvaer
Contributor Author

@sevrijss Based on Ben's answer, do you think that the RML Target-based solution will work, ignoring the potential complexity of the files?

Also, about this from your earlier comment:

there is no guarantee that all the resources use the same subject identifiers

This is also an issue when querying, so I would not worry about that now.

@sevrijss
Collaborator

I think such a solution will definitely work. A lot of the heavy lifting will be done by the RML API endpoint and not by the code.
But I might need some help writing such a query, since I never managed to combine targets and multiple subjects in Matey.

3 participants