Batch JSON-LD Scoring from spreadsheet #1

saumier · 2024-09-10T12:01:37Z

A structured data score algorithm built with SHACL and SPARQL already exists in this repo (culturecreates/artsdata-score). The full backstory is in culturecreates/artsdata-data-model#120.

This issue is to build a solution that will batch score events. This is a prototype.

We should start with Approach 1. I have also included an alternative approach using Google Sheets, but it is not needed yet.

Prerequisites:

Migrate to GitHub artifacts artsdata-orion#70

Approach 1 - Workflow creates a report

Step 1: Create an interactive workflow in this repo
Step 2: The workflow calls Orion's new fetch-data Action (without the push to Artsdata), passing page_url and entity_identifier to extract the webpage urls, plus options such as headless and is_paginated. This creates a GitHub artifact instead of a commit.
Step 3: The workflow calls new Ruby code in this repo, which:

- loads the JSON-LD dump from the GitHub artifact into a graph: graph = RDF::Graph.load(artifact)
- runs the existing SHACL, SHACL.open(shacl.ttl).execute(graph), and the existing construct SPARQL, SPARQL.execute(score_sparql, graph, update: true), to insert the score
- runs a new select SPARQL, report = SPARQL.execute(report_sparql, graph), to create a report with the url, event URI, score, and breakdown for each webpage
- commits the report to the GitHub repo using the website name

Alternate Approach - Google Sheets Apps Script [NOT NEEDED YET]

Build an Apps Script to call the structured data score API in Artsdata from a Google Sheet. The sheet would have a column of webpage urls for event details, one url per event. So if a website had 50 events, there would be 50 webpage urls.

The idea is to load the score (a number from 0 to 60) into the cell next to the url. For example, if a person enters "=structuredDataScore(A1)" into a cell, the Apps Script should get the url from cell A1, call the score API, and display the score (or a failure message).

The webpage url is passed to the score API in the &uri= parameter. The API returns a graph in JSON-LD where each event has a "score" property. The other parameters, &post_sparql= and &shacl=, should stay constant: &shacl= points to the file used to validate the data, and &post_sparql= points to the SPARQL that runs afterwards and adds the score to the graph.

Here is an example API call:

https://kg.artsdata.ca/en/dereference/external.jsonld?post_sparql=https%3A%2F%2Fraw.githubusercontent.com%2Fculturecreates%2Fartsdata-score%2Fmain%2Fsparql%2Fscore_algorithm.sparql&shacl=https%3A%2F%2Fraw.githubusercontent.com%2Fculturecreates%2Fartsdata-score%2Fmain%2Fshacl%2Fshacl_for_scoring.ttl&uri=https%3A%2F%2Fimperialtheatre.ca%2Fevent%2Fretro-film-national-lampoons-vacation-1983%2F
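The example call above can be assembled from its three parameters. Here is a minimal sketch in Ruby (the language of the Step 3 code) using only the standard library. The helper name score_api_url is hypothetical; the endpoint and the two constant file urls are copied from the example call. An Apps Script version would follow the same shape.

```ruby
require "uri"

# Endpoint and constant parameters, copied from the example call above.
ENDPOINT = "https://kg.artsdata.ca/en/dereference/external.jsonld"
SCORE_SPARQL = "https://raw.githubusercontent.com/culturecreates/artsdata-score/main/sparql/score_algorithm.sparql"
SCORE_SHACL = "https://raw.githubusercontent.com/culturecreates/artsdata-score/main/shacl/shacl_for_scoring.ttl"

# Hypothetical helper (not part of any Artsdata library): builds the score API
# url for one event webpage. Only the uri parameter changes between calls.
def score_api_url(page_url)
  query = URI.encode_www_form(
    post_sparql: SCORE_SPARQL,
    shacl: SCORE_SHACL,
    uri: page_url
  )
  "#{ENDPOINT}?#{query}"
end

puts score_api_url("https://imperialtheatre.ca/event/retro-film-national-lampoons-vacation-1983/")
```

URI.encode_www_form percent-encodes each value, so the webpage url can be passed in as it appears in the spreadsheet cell.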

The score API can also be used from the Artsdata nebula user interface by clicking "Compute score" after dereferencing any URL. The only difference in the call is that the path ends with the method external, without the .jsonld format. So instead of /dereference/external.jsonld? it is /dereference/external? to display a human-readable webpage.

In the resulting graph the score property is http://example.org/score

So, using an RDF library, the score can be extracted by querying the pattern [nil, RDF::URI("http://example.org/score"), nil] and displaying each resulting object.value in the cell. If there is more than one score (because the webpage contains more than one event entity), all of them should be displayed as a comma-separated list in the spreadsheet cell.

Here is an example output in JSON-LD trimmed to include only relevant data:
[
  ...
  {
    "@id": "http://top.blank.node/82565b60-3f58-40c0-ab0e-f6a61ed0a5c6",
    "@type": [
      "http://schema.org/TheaterEvent"
    ],
    ...
    "http://example.org/score": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#integer",
        "@value": "55"
      }
    ]
  }
  ...
]

NOTE: If there is no RDF library then I can enhance the API to return a simple JSON that will be easier to parse, so be sure to let me know if you need this.
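Even without an RDF library, the score can probably be read with a plain JSON parser. The sketch below is in Ruby with only the standard library. It assumes the API returns expanded JSON-LD shaped like the trimmed example above, that is, an array of node objects keyed by the full property URI. The helper name extract_scores is hypothetical.

```ruby
require "json"

SCORE_PROPERTY = "http://example.org/score"

# Hypothetical helper: returns every score found in an expanded JSON-LD
# document as an array of integers, one per event entity that has a score.
def extract_scores(jsonld_text)
  nodes = JSON.parse(jsonld_text)
  nodes = [nodes] unless nodes.is_a?(Array)
  nodes.flat_map do |node|
    next [] unless node.is_a?(Hash)
    values = node.fetch(SCORE_PROPERTY, [])
    values = [values] unless values.is_a?(Array)
    # Each value is normally an object with "@value"; tolerate a bare literal.
    values.map { |value| Integer(value.is_a?(Hash) ? value["@value"] : value) }
  end
end

sample = <<~JSON
  [
    {
      "@id": "http://top.blank.node/82565b60-3f58-40c0-ab0e-f6a61ed0a5c6",
      "@type": ["http://schema.org/TheaterEvent"],
      "http://example.org/score": [
        { "@type": "http://www.w3.org/2001/XMLSchema#integer", "@value": "55" }
      ]
    }
  ]
JSON

# Join the scores the way they would be shown in a spreadsheet cell.
puts extract_scores(sample).join(", ")
```

If the API returned compacted JSON-LD with a context, the property key would differ and this lookup would need to change.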


saumier · 2024-09-16T22:38:38Z

@dev-aravind Please advance this and try to get it to work without waiting for the other issue, the migration to GitHub Artifacts (culturecreates/artsdata-orion#70), because I am blocked on using GitHub artifact download urls.

dev-aravind · 2024-09-19T12:38:36Z

@saumier The development part is complete, but the initial step of calling the fetch-data workflow is failing: because of a permission issue, artsdata-score cannot commit the JSON-LD that artsdata-orion generated. This is blocking us from adding the score to the JSON-LD.

workflow: https://github.com/culturecreates/artsdata-score/actions/workflows/generate_report.yml

sample inputs:
page_url: "https://agoradanse.com/evenement/"
entity_identifier: "div.x-container.max a"
file_name: "agoradanse-events.jsonld"
is_paginated: "false"
headless: "false"

saumier · 2024-09-20T00:43:08Z

@dev-aravind I propose the following enhancement to the fetch-data workflow. Let's pass an optional parameter to instruct the workflow on how to save the artifact. I propose the following choices: none | S3 | artifact. If you implement the "none" parameter in the workflow then we can complete this project. The default (when no param is passed) would be the same as it is today which is 'commit'. If this makes sense then please proceed so we can get the project completed and delivered to CAPACOA. Thx.
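The proposed parameter handling could look roughly like the Ruby sketch below. The helper name resolve_save_mode is hypothetical, and the sketch assumes the value arrives as a plain string; only the choices (none | S3 | artifact) and the 'commit' default come from the proposal above.

```ruby
# Explicit choices from the proposal; 'commit' is only the fallback default.
SAVE_CHOICES = %w[none S3 artifact].freeze
DEFAULT_SAVE_MODE = "commit"

# Hypothetical helper: returns the save mode the workflow should use.
def resolve_save_mode(param)
  # No parameter passed: behave as the workflow does today.
  return DEFAULT_SAVE_MODE if param.nil? || param.strip.empty?

  unless SAVE_CHOICES.include?(param)
    raise ArgumentError, "unknown save mode: #{param.inspect}"
  end
  param
end

puts resolve_save_mode(nil)
puts resolve_save_mode("none")
```

Rejecting unknown values early keeps a typo in the workflow inputs from silently falling back to a commit.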

dev-aravind · 2024-09-20T09:48:33Z

@saumier The scores can now be generated as a CSV file. An example for the Agoradanse data score can be found here.

saumier · 2024-09-25T13:19:46Z

@dev-aravind This is working pretty well. I made a couple of small changes to the report-sparql to get a single line per webpage.

I noticed that the Artifact contains ALL the websites from Orion, and not just the website being scored. It would be nice to create the Artifact with only the website being scored. It would also be good to not have to add it to Orion at all, but only create the artifact in the artsdata-score workflow. This is a low priority since we don't have many sites for now...

dev-aravind · 2024-09-27T11:37:42Z

@saumier The artifact now contains just the file that the current run is working on. I also updated the workflow and the Ruby code to only create the JSON-LD file as an artifact and not commit it. Please review this and let me know if you need any changes.

saumier · 2024-10-01T14:22:52Z

@dev-aravind Can you try this workflow to crawl the website sandersoncentre.ca and let me know why it is not working?
https://github.com/culturecreates/artsdata-score/actions/workflows/sandersoncentre-ca.yml

dev-aravind · 2024-10-03T12:24:48Z

@saumier This happened because, for a paginated source, you have to provide the base url including the page param. In this case https://calendar.sandersoncentre.ca was changed to https://calendar.sandersoncentre.ca/?page= . You can find the report for ~70 events here.
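To illustrate why the trailing page param matters, here is a small Ruby sketch. It assumes the crawler appends the page number directly to the base url; the helper name page_urls and the page range are hypothetical, and the actual fetch-data implementation may number or build pages differently.

```ruby
# Hypothetical helper: builds one url per page by appending the page number
# to the base url, which is why the base url must end with "?page=".
def page_urls(base_url, pages)
  pages.map { |number| "#{base_url}#{number}" }
end

# The page range 1..3 is an assumption chosen for illustration only.
puts page_urls("https://calendar.sandersoncentre.ca/?page=", 1..3)
```

With the original value https://calendar.sandersoncentre.ca, the same concatenation would produce a malformed host name such as sandersoncentre.ca1 instead of a page url.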

saumier · 2024-10-04T11:50:21Z

@dev Excellent. Thanks for the link to the report. I was wondering how the page param was getting passed in and now I understand.

saumier assigned dev-aravind Sep 10, 2024

saumier mentioned this issue Sep 12, 2024

Migrate to GitHub artifacts culturecreates/artsdata-orion#70

Closed

dev-aravind assigned saumier and unassigned dev-aravind Sep 19, 2024

saumier assigned dev-aravind and unassigned saumier Sep 20, 2024

dev-aravind assigned saumier and unassigned dev-aravind Sep 20, 2024

saumier assigned dev-aravind and unassigned saumier Sep 25, 2024

dev-aravind assigned saumier and unassigned dev-aravind Sep 27, 2024

saumier assigned dev-aravind and unassigned saumier Oct 1, 2024

dev-aravind assigned saumier and unassigned dev-aravind Oct 3, 2024

saumier closed this as completed Oct 4, 2024
