Batch JSON-LD Scoring from spreadsheet #1
Comments
@dev-aravind Please advance this and try to get it working without waiting for the other issue (culturecreates/artsdata-orion#70) to migrate to GitHub Artifacts, because I am blocked on using GitHub artifact download URLs.
@saumier The development part is complete, but the initial step of calling the fetch-data workflow is failing: artsdata-score cannot commit the JSON-LD that artsdata-orion generated because of a permission issue. This is blocking us from adding the score to the JSON-LD. sample inputs:
@dev-aravind I propose the following enhancement to the fetch-data workflow. Let's pass an optional parameter to instruct the workflow on how to save the artifact. I propose the following choices: none | S3 | artifact. If you implement the "none" parameter in the workflow then we can complete this project. The default (when no param is passed) would be the same as it is today which is 'commit'. If this makes sense then please proceed so we can get the project completed and delivered to CAPACOA. Thx.
@dev-aravind This is working pretty well. I made a couple of small changes to the report-sparql to get a single line per webpage. I noticed that the Artifact contains ALL the websites from Orion, and not just the website being scored. It would be nice to create the Artifact with only the website being scored. It would also be good to not have to add it to Orion at all, but only create the artifact in the artsdata-score workflow. This is a low priority since we don't have many sites for now...
@saumier The artifact now contains just the file that the current run is working on and I also updated the workflow and the ruby code to only create the JSON-LD file as an artifact and not commit it. Please review this and let me know if you need any changes.
@dev-aravind Can you try this workflow to crawl the website sandersoncentre.ca and let me know why it is not working. |
@saumier This happened because a source with pagination needs the baseurl to include the page param. In this case https://calendar.sandersoncentre.ca was changed to https://calendar.sandersoncentre.ca/?page= so that the page number can be appended. You can find the report for ~70 events here.
@dev Excellent. Thanks for the link to the report. I was wondering how the page param was getting passed in and now I understand. |
There is a structured data score algorithm created using SHACL and SPARQL (in this repo culturecreates/artsdata-score). Here is the full back story: culturecreates/artsdata-data-model#120.
This issue is to build a solution that will batch score events. This is a prototype.
We should start with Approach 1. I have included a second alternative approach using Google sheets but that is not needed yet.
Prerequisites:
Approach 1 - Workflow creates a report
Step 1: Create an interactive workflow in this repo
Step 2: The workflow calls Orion's new Action to fetch-data (not including the push to Artsdata) with the page_url and entity_identifier to extract the webpage urls, plus options like headless and is_paginated. This creates a GitHub artifact (instead of a commit).
Step 3: The workflow calls a new ruby code in this repo:
graph = RDF::Graph.load(artifact)
SHACL.open(shacl.ttl).execute(graph)
and the existing construct SPARQL
SPARQL.execute(score_sparql, graph, update: true)
to insert the score.
report = SPARQL.execute(report_sparql, graph)
to create a report with url, event URI, score, breakdown for each webpage.
Alternate Approach - Google sheet App script [NOT NEEDED YET]
Build an App script to call the structured data score API in Artsdata from a Google sheet. This Google sheet would have a column of webpage urls for event details. So if a website had 50 events there would be 50 webpage urls, one per event.
The idea is to load the score (a number from 0 to 60) into the cell next to the url. For example, if a person enters "=structuredDataScore(A1)" into the cell, it should call the App script to get the url from cell A1, then call the score API and display the score (or a fail message).
The webpage url is passed to the score API as a parameter &uri=. The API returns a graph in JSON-LD where each event has a property "score". The other parameters &post_sparql= and &shacl= should be constant and are used to pass in the file for validating the data and the sparql to run afterwards, which adds the score to the graph.
Here is an example API call:
https://kg.artsdata.ca/en/dereference/external.jsonld?post_sparql=https%3A%2F%2Fraw.githubusercontent.com%2Fculturecreates%2Fartsdata-score%2Fmain%2Fsparql%2Fscore_algorithm.sparql&shacl=https%3A%2F%2Fraw.githubusercontent.com%2Fculturecreates%2Fartsdata-score%2Fmain%2Fshacl%2Fshacl_for_scoring.ttl&uri=https%3A%2F%2Fimperialtheatre.ca%2Fevent%2Fretro-film-national-lampoons-vacation-1983%2F
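As a rough sketch of how a caller could assemble that URL, the Ruby below keeps &post_sparql= and &shacl= constant and percent-encodes the webpage url into &uri=. The method name score_api_url and the constant names are hypothetical, chosen only for illustration; the endpoint and file URLs are the ones from the example call above.

```ruby
require "uri"

# Endpoint and constant parameter values taken from the example call above.
SCORE_API   = "https://kg.artsdata.ca/en/dereference/external.jsonld"
POST_SPARQL = "https://raw.githubusercontent.com/culturecreates/artsdata-score/main/sparql/score_algorithm.sparql"
SHACL_FILE  = "https://raw.githubusercontent.com/culturecreates/artsdata-score/main/shacl/shacl_for_scoring.ttl"

# Hypothetical helper: builds the score API URL for one event webpage.
# URI.encode_www_form percent-encodes each value, so the nested URLs are safe
# to pass as query parameters.
def score_api_url(page_url)
  query = URI.encode_www_form([
    ["post_sparql", POST_SPARQL],
    ["shacl", SHACL_FILE],
    ["uri", page_url]
  ])
  "#{SCORE_API}?#{query}"
end

puts score_api_url("https://imperialtheatre.ca/event/retro-film-national-lampoons-vacation-1983/")
```

Only the uri value changes between rows of the spreadsheet, so the other two parameters can stay as constants.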
The score API can also be used from the Artsdata nebula user interface by clicking "Compute score" after dereferencing any URL. The only difference in the call is that the path ending with the method external is without the format .jsonld. So instead of /dereference/external.jsonld? it is /dereference/external? to display a human readable webpage.
In the resulting graph the score property is
http://example.org/score
So, using an RDF library, the score can be extracted using the pattern
[nil, RDF::URI("http://example.org/score"), nil]
and the resulting object.value can be displayed in the cell. If there is more than one score (because there is more than one event entity in the webpage), then all solutions should be displayed as a comma separated list in the spreadsheet cell.
Here is an example output in JSON-LD, trimmed to include only relevant data:
[
  ...
  {
    "@id": "http://top.blank.node/82565b60-3f58-40c0-ab0e-f6a61ed0a5c6",
    "@type": [
      "http://schema.org/TheaterEvent"
    ],
    ...
    "http://example.org/score": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#integer",
        "@value": "55"
      }
    ]
  }
  ...
]
NOTE: If there is no RDF library then I can enhance the API to return a simple JSON that will be easier to parse, so be sure to let me know if you need this.
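For reference, a minimal sketch of reading the scores without an RDF library, assuming the API response has the expanded JSON-LD shape shown in the example above (a top-level array of nodes, each score stored as an object with "@value"). The method names extract_scores and cell_text and the "no score found" message are hypothetical, and the sample document is a cut-down copy of the example, not real API output.

```ruby
require "json"

SCORE_PROPERTY = "http://example.org/score"

# Hypothetical helper: collects the score values from every node in the document.
# Assumes expanded JSON-LD, i.e. an array of node objects.
def extract_scores(jsonld_text)
  nodes = JSON.parse(jsonld_text)
  nodes = [nodes] unless nodes.is_a?(Array)
  nodes.flat_map do |node|
    next [] unless node.is_a?(Hash)
    # Wrap and flatten so a single literal and a list of literals are handled alike.
    [node[SCORE_PROPERTY]].flatten.compact.map do |literal|
      (literal.is_a?(Hash) ? literal["@value"] : literal).to_s
    end
  end
end

# Hypothetical helper: the text to show in the spreadsheet cell.
def cell_text(jsonld_text)
  scores = extract_scores(jsonld_text)
  scores.empty? ? "no score found" : scores.join(", ")
end

sample = <<~JSON
  [
    {
      "@id": "http://top.blank.node/82565b60-3f58-40c0-ab0e-f6a61ed0a5c6",
      "@type": ["http://schema.org/TheaterEvent"],
      "http://example.org/score": [
        { "@type": "http://www.w3.org/2001/XMLSchema#integer", "@value": "55" }
      ]
    }
  ]
JSON

puts cell_text(sample)
```

A webpage with several event entities would yield several values, which the sketch joins with commas as described above.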