Batch JSON-LD Scoring from spreadsheet #1

Closed
1 task
saumier opened this issue Sep 10, 2024 · 9 comments

@saumier
Member

saumier commented Sep 10, 2024

A structured data score algorithm has been created using SHACL and SPARQL (in this repo, culturecreates/artsdata-score). The full backstory is in culturecreates/artsdata-data-model#120.

This issue is to build a solution that will batch score events. This is a prototype.

We should start with Approach 1. I have included a second, alternative approach using Google Sheets, but that is not needed yet.

Prerequisites:

Approach 1 - Workflow creates a report

Step 1: Create an interactive workflow in this repo
Step 2: The workflow calls Orion's new fetch-data Action (not including the push to Artsdata) with the page_url and entity_identifier to extract the webpage urls, plus options like headless and is_paginated --> creates a GitHub artifact (instead of a commit)
Step 3: The workflow calls new Ruby code in this repo that:

  • loads the JSON-LD dump from the GitHub artifact into a graph: graph = RDF::Graph.load(artifact)
  • runs the existing SHACL, SHACL.open(shacl.ttl).execute(graph), and the existing construct SPARQL, SPARQL.execute(score_sparql, graph, update: true), to insert the score
  • runs a new select SPARQL, report = SPARQL.execute(report_sparql, graph), to create a report with url, event URI, score and breakdown for each webpage
  • commits the report to the GitHub repo using the website name
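
The last two bullets (one report row per webpage, with the file named after the website) can be sketched with the Ruby standard library alone. This is only an illustration: it assumes the rows have already been pulled out of the select SPARQL result, and the column names, the helper names report_file_name and build_report, and the "host.csv" naming rule are all hypothetical.

```ruby
require "csv"
require "uri"

# Hypothetical column order, mirroring the report fields listed above.
REPORT_COLUMNS = %w[url event_uri score breakdown].freeze

# Derive a report file name from the website host (assumed naming rule).
def report_file_name(page_url)
  host = URI.parse(page_url).host.to_s.sub(/\Awww\./, "")
  "#{host}.csv"
end

# Turn an array of hashes (one per webpage) into CSV text with a header row.
def build_report(rows)
  CSV.generate do |csv|
    csv << REPORT_COLUMNS
    rows.each { |row| csv << REPORT_COLUMNS.map { |col| row[col] } }
  end
end
```

The workflow could then write build_report(rows) to report_file_name(page_url) and commit that file.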

Alternate Approach - Google Sheets Apps Script [NOT NEEDED YET]

Build an Apps Script to call the structured data score API in Artsdata from a Google sheet. This Google sheet would have a column of webpage urls for event details. So if a website had 50 events there would be 50 webpage urls, one per event.

The idea is to load the score (a number from 0 to 60) into the cell next to the url. For example, if a person enters "=structuredDataScore(A1)" into the cell, it should call the Apps Script to get the url from cell A1, then call the score API and display the score (or a fail message).

The webpage url is passed to the score API as the parameter &uri=. The API returns a graph in JSON-LD where each event has a property "score". The other parameters, &post_sparql= and &shacl=, should be constant: they pass in the SPARQL to run afterwards, which adds the score to the graph, and the file for validating the data.

Here is an example API call:

https://kg.artsdata.ca/en/dereference/external.jsonld?post_sparql=https%3A%2F%2Fraw.githubusercontent.com%2Fculturecreates%2Fartsdata-score%2Fmain%2Fsparql%2Fscore_algorithm.sparql&shacl=https%3A%2F%2Fraw.githubusercontent.com%2Fculturecreates%2Fartsdata-score%2Fmain%2Fshacl%2Fshacl_for_scoring.ttl&uri=https%3A%2F%2Fimperialtheatre.ca%2Fevent%2Fretro-film-national-lampoons-vacation-1983%2F
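
For reference, a call like the one above can be assembled from a webpage url with Ruby's standard URI module. This is a minimal sketch: the endpoint and the two constant file urls are copied from the example call, while the constant names and the score_api_url helper are made up for illustration.

```ruby
require "uri"

# Endpoint and constant parameters, copied from the example call above.
SCORE_API   = "https://kg.artsdata.ca/en/dereference/external.jsonld"
POST_SPARQL = "https://raw.githubusercontent.com/culturecreates/artsdata-score/main/sparql/score_algorithm.sparql"
SHACL_FILE  = "https://raw.githubusercontent.com/culturecreates/artsdata-score/main/shacl/shacl_for_scoring.ttl"

# Build the score API url for one event webpage (hypothetical helper).
def score_api_url(page_url)
  query = URI.encode_www_form([
    ["post_sparql", POST_SPARQL],
    ["shacl", SHACL_FILE],
    ["uri", page_url]
  ])
  "#{SCORE_API}?#{query}"
end

puts score_api_url("https://imperialtheatre.ca/event/retro-film-national-lampoons-vacation-1983/")
```

Only the uri parameter changes from one spreadsheet row to the next.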

The score API can also be used from the Artsdata nebula user interface by clicking "Compute score" after dereferencing any URL. The only difference in the call is that the path ending with the method external drops the format .jsonld. So instead of /dereference/external.jsonld? it is /dereference/external? to display a human-readable webpage.

In the resulting graph the score property is http://example.org/score

So using an RDF library the score can be extracted using the pattern [nil, RDF::URI("http://example.org/score"), nil], and the resulting object.value can be displayed in the cell. If there is more than one score (because there is more than one event entity in the webpage), then all solutions should be displayed as a comma-separated list in the spreadsheet cell.

Here is an example output in JSON-LD trimmed to include only relevant data:
[
  ...
  {
    "@id": "http://top.blank.node/82565b60-3f58-40c0-ab0e-f6a61ed0a5c6",
    "@type": [
      "http://schema.org/TheaterEvent"
    ],
    ...
    "http://example.org/score": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#integer",
        "@value": "55"
      }
    ]
  }
  ...
]
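
If no RDF library is at hand, output in this shape can also be read as plain JSON. The sketch below assumes the response is an expanded JSON-LD array of nodes, exactly as in the trimmed example; other JSON-LD serializations (compacted, or wrapped in @graph) would need a real JSON-LD processor. The helper names extract_scores and scores_for_cell are hypothetical.

```ruby
require "json"

SCORE_PROPERTY = "http://example.org/score"

# Collect every score found in an expanded JSON-LD array of nodes.
# Nodes without the score property are skipped.
def extract_scores(jsonld_text)
  doc = JSON.parse(jsonld_text)
  nodes = doc.is_a?(Array) ? doc : [doc]
  nodes.flat_map do |node|
    next [] unless node.is_a?(Hash) && node.key?(SCORE_PROPERTY)

    values = node[SCORE_PROPERTY]
    values = [values] unless values.is_a?(Array)
    # Each value is expected to be a value object like {"@value": "55"}.
    values.map { |v| Integer((v.is_a?(Hash) ? v["@value"] : v).to_s) }
  end
end

# Format all scores as the comma-separated text for one spreadsheet cell.
def scores_for_cell(jsonld_text)
  extract_scores(jsonld_text).join(", ")
end
```

A non-numeric @value raises ArgumentError here, which a caller could turn into the fail message shown in the cell.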

NOTE: If there is no RDF library then I can enhance the API to return a simple JSON that will be easier to parse, so be sure to let me know if you need this.

@saumier
Member Author

saumier commented Sep 16, 2024

@dev-aravind Please advance this and try to get it to work without waiting for the other issue to migrate to GitHub Artifacts (culturecreates/artsdata-orion#70), because I am blocked on being able to use GitHub artifact download urls.

@dev-aravind
Contributor

@saumier The development part is complete, but the initial step of calling the fetch-data workflow is failing because artsdata-score cannot commit the JSON-LD that artsdata-orion generated because of a permission issue. This is blocking us from adding the score to the JSON-LD.

workflow: https://github.com/culturecreates/artsdata-score/actions/workflows/generate_report.yml

sample inputs:
page_url: "https://agoradanse.com/evenement/"
entity_identifier: "div.x-container.max a"
file_name: "agoradanse-events.jsonld"
is_paginated: "false"
headless: "false"

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Sep 19, 2024
@saumier
Member Author

saumier commented Sep 20, 2024

@dev-aravind I propose the following enhancement to the fetch-data workflow. Let's pass an optional parameter to instruct the workflow on how to save the artifact. I propose the following choices: none | S3 | artifact. If you implement the "none" parameter in the workflow then we can complete this project. The default (when no param is passed) would be the same as it is today which is 'commit'. If this makes sense then please proceed so we can get the project completed and delivered to CAPACOA. Thx.
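
As a rough sketch of how the optional parameter could be resolved in Ruby code: the three choices and the 'commit' default come from the proposal above, while the names SAVE_MODES and resolve_save_mode, the case-insensitive matching, and accepting an explicit "commit" are assumptions, not existing Orion behaviour.

```ruby
# Choices from the proposal, plus the current default behaviour.
SAVE_MODES = %w[none S3 artifact commit].freeze
DEFAULT_SAVE_MODE = "commit"

# Resolve the optional save parameter (hypothetical helper).
# A missing or blank value keeps today's behaviour, which is a commit.
def resolve_save_mode(param)
  return DEFAULT_SAVE_MODE if param.nil? || param.strip.empty?

  mode = SAVE_MODES.find { |m| m.casecmp?(param.strip) }
  raise ArgumentError, "unknown save mode: #{param.inspect}" unless mode

  mode
end
```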

@saumier saumier assigned dev-aravind and unassigned saumier Sep 20, 2024
@dev-aravind
Contributor

@saumier The scores can now be generated as a CSV file. An example for the Agoradance data score can be found here.

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Sep 20, 2024
@saumier saumier assigned dev-aravind and unassigned saumier Sep 25, 2024
@saumier
Member Author

saumier commented Sep 25, 2024

@dev-aravind This is working pretty well. I made a couple of small changes to the report-sparql to get a single line per webpage.

I noticed that the Artifact contains ALL the websites from Orion, and not just the website being scored. It would be nice to create the Artifact with only the website being scored. It would also be good to not have to add it to Orion at all, but only create the artifact in the artsdata-score workflow. This is a low priority since we don't have many sites for now...

@dev-aravind
Contributor

@saumier The artifact now contains just the file that the current run is working on. I also updated the workflow and the Ruby code to only create the JSON-LD file as an artifact and not commit it. Please review this and let me know if you need any changes.

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Sep 27, 2024
@saumier
Member Author

saumier commented Oct 1, 2024

@dev-aravind Can you try this workflow to crawl the website sandersoncentre.ca and let me know why it is not working?
https://github.com/culturecreates/artsdata-score/actions/workflows/sandersoncentre-ca.yml

@saumier saumier assigned dev-aravind and unassigned saumier Oct 1, 2024
@dev-aravind
Contributor

@saumier This failed because a source with pagination needs the base URL to be given with the page param. In this case https://calendar.sandersoncentre.ca was changed to https://calendar.sandersoncentre.ca/?page= . You can find the report for ~70 events here.
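
To illustrate why the trailing ?page= matters, here is a minimal sketch that assumes the crawler simply appends a page number to the base URL. The page_urls helper and the page range are hypothetical; how Orion picks the first page and decides when to stop is not covered here.

```ruby
# Build the url of each listing page by appending the page number
# to a base URL that already ends with the page param.
def page_urls(base_url, pages)
  pages.map { |n| "#{base_url}#{n}" }
end

puts page_urls("https://calendar.sandersoncentre.ca/?page=", 1..3)
```

With the original value https://calendar.sandersoncentre.ca, the same concatenation would yield https://calendar.sandersoncentre.ca1, which is not a valid page address.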

@dev-aravind dev-aravind assigned saumier and unassigned dev-aravind Oct 3, 2024
@saumier
Member Author

saumier commented Oct 4, 2024

@dev Excellent. Thanks for the link to the report. I was wondering how the page param was getting passed in and now I understand.

@saumier saumier closed this as completed Oct 4, 2024