Extract software metadata from the web (service endpoints and/or webpages) #92

Closed
3 tasks done
proycon opened this issue Mar 3, 2022 · 3 comments

Comments

@proycon
Member

proycon commented Mar 3, 2022

The harvesting pipeline currently being implemented (#33) is set up so that the source code is always the most authoritative place for software metadata descriptions.

However, there is a distinction between the software source code and service instances of that software, and the latter may add some metadata that is not applicable to the source as such. Instances are hosted on a particular URL and may have particular access limitations. We want to make that distinction explicit.

In the tool source registry for the harvester, we therefore provide the link to the source code alongside the web endpoints. The harvester first queries the source code repositories and converts the metadata it finds there to schema.org/codemeta's @SoftwareSourceCode. It then queries the web endpoints and enriches the metadata in the way proposed in codemeta/codemeta#271.

How can websites and webservices provide metadata? I want to support the following for the harvester pipeline:

proycon changed the title from "Extract software metadata from websites (service endpoints and/or webpages)" to "Extract software metadata from the web (service endpoints and/or webpages)" on Mar 3, 2022
proycon self-assigned this on Mar 3, 2022
proycon added the FAIR Tool Discovery label on Mar 3, 2022
@proycon
Member Author

proycon commented Mar 3, 2022

It may be worth identifying whether there are already CLARIAH services and websites that make their tool metadata available in other harvestable ways (i.e. published by the web endpoint itself, not by some higher-order registry). An important example is CLAM, which is widely used for WP3 webservices and outputs metadata in its own XML format; I will make CLAM output an OpenAPI Info block as well (proycon/clam#32).

Please comment if you know which metadata descriptions CLARIAH partners are currently using.
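
To make the OpenAPI Info block idea a bit more concrete, here is a minimal sketch of how a harvester might map such a block onto schema.org-style properties. The function name `info_to_schema`, the chosen property mapping and the example URL are assumptions for illustration only; this is not how CLAM or the harvester actually implements it.

```python
from typing import Any


def info_to_schema(openapi_doc: dict[str, Any], endpoint_url: str) -> dict[str, Any]:
    """Map the Info object of an OpenAPI document to a few schema.org-style properties.

    Hypothetical helper: the mapping below is an assumption, not the harvester's own.
    """
    info = openapi_doc.get("info", {})
    described: dict[str, Any] = {"url": endpoint_url}
    # "title" and "version" are required in an OpenAPI Info object; "description" is optional.
    mapping = (("title", "name"), ("version", "version"), ("description", "description"))
    for openapi_key, schema_key in mapping:
        if openapi_key in info:
            described[schema_key] = info[openapi_key]
    # The optional License object carries a "name" field.
    license_name = info.get("license", {}).get("name")
    if license_name:
        described["license"] = license_name
    return described


# Made-up document standing in for what a service endpoint might return.
example_doc = {
    "openapi": "3.0.3",
    "info": {
        "title": "Example Service",
        "version": "1.2",
        "license": {"name": "GPL-3.0"},
    },
}
example_metadata = info_to_schema(example_doc, "https://example.org/service")
```

Anything beyond these descriptive fields (paths, parameters, responses) is deliberately ignored in this sketch.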

@ddeboer
Contributor

ddeboer commented Mar 11, 2022

Should the type of service instance be documented with the software and/or be derived from the service definition as it is retrieved over HTTP by the harvester? Example: the fact that software x has an OpenAPI endpoint available at URL y and a SPARQL endpoint at URL z.

@proycon
Member Author

proycon commented Mar 11, 2022

I am indeed hoping that the type of the service can be extracted automatically. Once extracted, I want to represent these webservices using the pending WebAPI proposal (schemaorg/schemaorg#2635, schemaorg/schemaorg#1423). The type of instance would fit their conformsTo property. This will be fairly minimal though. I think that's an important limit to our 'tool discovery' scope: we will merely link to these existing API specifications, and will not try to redo or reinvent them, or convert all aspects of them. Anybody wanting to actually interface with the service (input parameters, output types, return codes, etc.) needs to dig deeper and parse the linked specification themselves.
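
As a rough illustration of what such automatic extraction could look like for the OpenAPI case, the sketch below checks a retrieved JSON document for the top-level `openapi` (3.x) or `swagger` (2.0) field and, if found, emits a minimal WebAPI description. The function names, the example URL and the exact value put in `conformsTo` are assumptions; the pending proposal may well expect a specification URL there instead.

```python
from typing import Any, Optional


def detect_openapi_version(doc: dict[str, Any]) -> Optional[str]:
    """Return the declared OpenAPI/Swagger version, or None if the document declares none."""
    # OpenAPI 3.x documents carry a top-level "openapi" field, 2.0 documents a "swagger" field.
    for key in ("openapi", "swagger"):
        value = doc.get(key)
        if isinstance(value, str):
            return value
    return None


def describe_web_api(doc: dict[str, Any], url: str) -> Optional[dict[str, Any]]:
    """Build a deliberately minimal WebAPI description, or None if the type is unknown."""
    version = detect_openapi_version(doc)
    if version is None:
        return None
    return {
        "@type": "WebAPI",  # pending schema.org type from the proposal discussed above
        "url": url,
        # Assumed label; a real harvester might link to the specification instead.
        "conformsTo": f"OpenAPI {version}",
    }


# Made-up input standing in for a document fetched from a service endpoint.
api = describe_web_api({"openapi": "3.0.3", "info": {}}, "https://example.org/openapi.json")
```

Note that nothing about paths, parameters or return codes is copied over, in line with the scope limit described above.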

I must also add that describing web services is still relatively low on the priority list. Describing the schema:WebApplication (i.e. a web interface for human end-users) has higher priority.

From the perspective of the harvester and the metadata it produces, I see the source code metadata as the primary representation. This schema:SoftwareSourceCode will be linked to service instances (e.g. a schema:WebApplication, a schema:WebAPI or even a schema:WebPage) via the schema:targetProduct property (codemeta/codemeta#271). As I envision it now, the tool store API (#34) will serve a set of JSON files (and also have a SPARQL endpoint), one per tool, each representing a software source code that links to all of its service instances (bottom up). I hope this makes some sense :)
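
A minimal sketch of what one such per-tool record could look like, assuming a simplified `@context` and made-up names and URLs. The helper `link_instances` is hypothetical and only illustrates the bottom-up link via targetProduct; it is not the harvester's actual code.

```python
import json
from typing import Any


def link_instances(source_code: dict[str, Any], instances: list[dict[str, Any]]) -> dict[str, Any]:
    """Return a copy of the source code description that links to its service instances."""
    linked = dict(source_code)  # shallow copy, so the input description is not modified
    existing = linked.get("targetProduct", [])
    if isinstance(existing, dict):
        existing = [existing]  # normalise a single value to a list
    linked["targetProduct"] = [*existing, *instances]
    return linked


# All names and URLs below are placeholders.
tool = {
    "@context": "https://schema.org",  # simplified; real records would likely use the codemeta context
    "@type": "SoftwareSourceCode",
    "name": "example-tool",
    "codeRepository": "https://example.org/example-tool.git",
}
instances = [
    {"@type": "WebApplication", "name": "example-tool", "url": "https://example.org/app"},
    {"@type": "WebAPI", "name": "example-tool API", "url": "https://example.org/api"},
]
record = link_instances(tool, instances)
print(json.dumps(record, indent=2))
```

The source code description stays the root of the record, and each instance only appears nested under it.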

proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 15, 2022
proycon added a commit to proycon/codemetapy that referenced this issue Mar 26, 2022
proycon added a commit to proycon/codemetapy that referenced this issue Mar 26, 2022
proycon added a commit to proycon/codemetapy that referenced this issue Mar 26, 2022
proycon added a commit to proycon/codemetapy that referenced this issue Apr 13, 2022
@proycon proycon closed this as completed May 18, 2022
Repository owner moved this from In Progress to Done in CLARIAH+ Shared Service: FAIR Tool Discovery May 18, 2022