Extract software metadata from the web (service endpoints and/or webpages) #92

proycon · 2022-03-03T12:38:47Z

The harvesting pipeline currently being implemented (#33) is set up in such a way that the source code is always the most authoritative place for holding software metadata descriptions.

However, there is a distinction between the software source code and service instances of that software, and the latter may add some metadata that is not applicable to the source as such. Instances are hosted on a particular URL and may have particular access limitations. We want to make that distinction explicit.

In the tool source registry for the harvester, we therefore provide the link to the source code alongside the web endpoints. The harvester first queries the source code repositories and converts the metadata it finds there to schema.org/codemeta's @SoftwareSourceCode; it then queries the web endpoints and enriches the metadata in the way proposed in codemeta/codemeta#271.

How can websites and webservices provide metadata? I want to support the following for the harvester pipeline:

- Support inline schema.org metadata in a <script type="application/ld+json"> block, with @type any subclass of schema:SoftwareApplication or any of the other types proposed in codemeta/codemeta#271 ("Linking source code to software applications (entrypoints and service endpoints aka SaaS) with regard for interface type"), including schema:WebAPI and schema:WebPage.
  - This is the most explicit way to provide metadata and the only one that ensures that all metadata ends up in the harvested end-product.
  - See also https://developers.google.com/search/docs/advanced/structured-data/sd-policies
  - Support for microdata will be deferred to a later stage (https://schema.org/docs/gs.html)
- Support webservices that provide an OpenAPI specification (in JSON or YAML): parse it and convert at least the "Info" block to codemeta.
- Support a fallback option: parse certain meta tags in the HTML head.
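The first option above, inline JSON-LD, can be sketched with only the Python standard library. This is a minimal illustration, not the harvester's actual implementation: the class `JSONLDExtractor`, the function `extract_software_metadata` and the `WANTED_TYPES` set are hypothetical names, and the set of accepted types is an assumption based on the types named in this issue.

```python
import json
from html.parser import HTMLParser

# Types of interest. The exact set the harvester accepts is an assumption.
WANTED_TYPES = {"SoftwareApplication", "WebApplication", "WebAPI", "WebPage"}


class JSONLDExtractor(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_software_metadata(html):
    """Return the JSON-LD objects in `html` whose @type is one of WANTED_TYPES."""
    parser = JSONLDExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for item in parser.items:
        if not isinstance(item, dict):
            continue
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]
        # Accept "WebAPI", "schema:WebAPI" and "https://schema.org/WebAPI" alike.
        names = {
            t.split(":")[-1].rsplit("/", 1)[-1] for t in types if isinstance(t, str)
        }
        if names & WANTED_TYPES:
            results.append(item)
    return results


page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebApplication", "name": "Demo tool"}
</script>
<script type="application/ld+json">{"@type": "Person", "name": "Someone"}</script>
</head><body></body></html>"""

found = extract_software_metadata(page)
```

Note that this matches type names literally; it does not resolve subclasses of schema:SoftwareApplication beyond those listed in `WANTED_TYPES`, which a real implementation would need to handle.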


proycon · 2022-03-03T12:46:18Z

It may be worth identifying if there are already CLARIAH services and websites that make their tool metadata available in other ways that may be harvestable (i.e. published by the web endpoint itself, not some other higher-order registry). An important example currently is CLAM, widely used for WP3 webservices and outputting metadata in its own XML format; I will make that output an OpenAPI Info block too (proycon/clam#32).

If you know which metadata descriptions CLARIAH partners are currently using, please comment.

ddeboer · 2022-03-11T08:45:59Z

Should the type of service instance be documented with the software and/or be derived from the service definition as it is retrieved over HTTP by the harvester? Example: the fact that software x has an OpenAPI endpoint available at URL y and a SPARQL endpoint at URL z.

proycon · 2022-03-11T13:09:10Z

I am indeed hoping that the type of the service can be automatically extracted, and once extracted I want to represent these webservices using the pending WebAPI proposal (schemaorg/schemaorg#2635, schemaorg/schemaorg#1423). The type of instance would fit their conformsTo property. This will be fairly minimal though. I think that's an important limit to our 'tool discovery' scope; we will merely link to these existing API specifications and not try to redo or reinvent them, or convert all aspects of them. Anybody wanting to actually interface with the service (input parameters, output types, return codes etc.) needs to dig deeper and parse the linked specification themselves.
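A minimal sketch of what such a deliberately shallow conversion could look like, assuming the OpenAPI document has already been loaded as a dict (YAML input would need a third-party parser). The function `openapi_to_webapi`, the constant `OPENAPI_SPEC_IRI`, and the mapping of Info fields onto schema.org properties are all assumptions made for illustration; only the use of conformsTo for the service type comes from the discussion above.

```python
import json

# Placeholder identifier for the OpenAPI specification. Which value should
# actually go into conformsTo is an assumption, not something this issue settles.
OPENAPI_SPEC_IRI = "https://spec.openapis.org/oas/"


def openapi_to_webapi(spec, endpoint_url):
    """Build a minimal WebAPI description from the Info block of an OpenAPI document.

    Only the Info block is read; paths, parameters and responses are ignored,
    so consumers must fetch the linked specification for those details.
    """
    info = spec.get("info", {})
    record = {
        "@type": "WebAPI",
        "url": endpoint_url,
        "conformsTo": OPENAPI_SPEC_IRI,
    }
    # Hypothetical mapping from OpenAPI Info fields to schema.org properties.
    for openapi_key, schema_key in (
        ("title", "name"),
        ("description", "description"),
        ("version", "version"),
    ):
        if openapi_key in info:
            record[schema_key] = info[openapi_key]
    # Prefer the license URL when given, otherwise fall back to its name.
    license_info = info.get("license") or {}
    if license_info.get("url"):
        record["license"] = license_info["url"]
    elif license_info.get("name"):
        record["license"] = license_info["name"]
    return record


spec = json.loads("""{
  "openapi": "3.0.3",
  "info": {
    "title": "Example service",
    "version": "1.2.0",
    "license": {"name": "GPL-3.0-only"}
  },
  "paths": {}
}""")

webapi = openapi_to_webapi(spec, "https://example.org/api/")
print(json.dumps(webapi, indent=2))
```

Everything outside the Info block is ignored on purpose, in line with the scope limit described above.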

I must also add that describing web services is still relatively low on the priority list. Describing the schema:WebApplication (i.e. a web interface for human end-users) has higher priority.

From the perspective of the harvester and the metadata it produces, I see the source code metadata as the primary representation. This schema:SoftwareSourceCode will be linked to service instances (e.g. a schema:WebApplication, a schema:WebAPI or even a schema:WebPage) via the schema:targetProduct property (codemeta/codemeta#271). As I envision it now, the tool store API (#34) will serve a whole bunch of JSON files (and also have a SPARQL endpoint), one per tool, each representing a software source code that links to all its service instances (bottom up). I hope this makes some sense :)
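To make the bottom-up linking concrete, here is a small sketch of how one per-tool record could be assembled. The helper `link_instances`, the example names and the example.org URLs are hypothetical; only the use of targetProduct to connect a SoftwareSourceCode to its service instances is taken from the description above.

```python
import json


def link_instances(source_metadata, instances):
    """Return a copy of the source code record with instances attached via targetProduct."""
    record = dict(source_metadata)
    record.setdefault("@type", "SoftwareSourceCode")
    existing = record.get("targetProduct", [])
    if isinstance(existing, dict):
        existing = [existing]  # normalise a single object to a list
    record["targetProduct"] = list(existing) + list(instances)
    return record


# Metadata as it might come out of the source repository (illustrative values).
source = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "name": "example-tool",
    "codeRepository": "https://example.org/git/example-tool",
}

# Service instances found at the web endpoints listed in the source registry.
instances = [
    {"@type": "WebApplication", "name": "example-tool", "url": "https://example.org/tool/"},
    {"@type": "WebAPI", "name": "example-tool API", "url": "https://example.org/tool/api/"},
]

tool_record = link_instances(source, instances)
print(json.dumps(tool_record, indent=2))
```

The source record stays the top-level object and the instances hang off it, which matches the idea that the source code metadata is the primary representation.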


proycon changed the title ~~Extract software metadata from websites (service endpoints and/or webpages)~~ Extract software metadata from the web (service endpoints and/or webpages) Mar 3, 2022

proycon self-assigned this Mar 3, 2022

proycon added the FAIR Tool Discovery label Mar 3, 2022

proycon added a commit to proycon/codemetapy that referenced this issue Mar 15, 2022

implemented initial support for parsing and adding remote services (CLARIAH/clariah-plus#92)

60c9a76

proycon added a commit to proycon/codemeta-harvester that referenced this issue Mar 15, 2022

implemented support for parsing remote services (CLARIAH/clariah-plus#92)

e27a5e0

proycon added a commit to proycon/codemetapy that referenced this issue Mar 26, 2022

fixes for web parsing (CLARIAH/clariah-plus#92)

9f8d4f0

proycon added a commit to proycon/codemetapy that referenced this issue Mar 26, 2022

Implemented support for extracting metadata from CLAM (CLARIAH/clariah-plus#92) WebAPI still needs to be worked out in more detail

8bd46f1

proycon added a commit to proycon/codemetapy that referenced this issue Mar 26, 2022

fix for missing URL in CLAM parse (CLARIAH/clariah-plus#92)

2c517df

proycon added a commit to proycon/codemetapy that referenced this issue Apr 13, 2022

parsing webpage metadata: parse only head and support itemprop (CLARIAH/clariah-plus#92)

701191b

proycon added a commit to proycon/codemetapy that referenced this issue Apr 13, 2022

html parsing: support itemtype (CLARIAH/clariah-plus#92)

e7c3ddc

proycon closed this as completed May 18, 2022

Repository owner moved this from In Progress to Done in CLARIAH+ Shared Service: FAIR Tool Discovery May 18, 2022

proycon added this to the [Tool Discovery] Phase 1: Harvesting pipeline & Tool store milestone Jun 27, 2022

proycon mentioned this issue Sep 8, 2022

Implement OpenAPI support proycon/codemetapy#25

Open
