
Automatic ingestion of datasets from "instruments" (including crawled data): EOSC OS I NRP Task 5.3 related #1169

Open
hajic opened this issue Feb 18, 2025 · 1 comment

hajic commented Feb 18, 2025

This issue concerns the (semi)automatic ingestion of data produced by "instruments", the driving use case being internet crawls that are cleaned and filtered by source- and goal-specific scripts.

The idea is to create a data-ingesting API that can accept metadata and data (or ref(s) to the data) to ingest. The initiative for each ingested item will come from scripts running anywhere in the world (e.g., on HPC clusters). However, it is assumed that the initialization will be done on the repository side, where a data steward fills in the basic information about the script, the invariant metadata items, the periodicity, etc. This information is communicated to the "instrument" wrapper code (the conduit) through a standard API running in the conduit, which normalizes the otherwise instrument-specific (= cleaning script, etc.) communication. That API could also provide status information that the repository can use in various ways, from error detection to commands such as cancelling a run or redirecting it to another instance or another repository, using a newly designed protocol.

For example, the "cleaning crawled text" use case would run, say, at CESNET, cleaning every new Common Crawl release every month and packaging the data every three months for deposit in the CLARIN DSpace repository instance. The core metadata will be entered on the LINDAT DSpace repository side by the person responsible for this continuously created dataset: the name of the dataset series, the authors, the required formats, the license to be used, funding credits, etc. This will be communicated to the cleaning scripts through a config file and the script wrapper API. The script will then run every month to clean the CC release. Every three months it will assemble a bitstream set, complete the core metadata with the date, a description of the source datasets, a provenance description (script reference, parameters, etc.), the total size, and so on, and use the new ingestion API of the CLARIN DSpace repository to send it the completed metadata and ref(s) to the data, so that the repository can actually ingest it: check the metadata, assign a PID, and send it through the usual approval process.
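To make the initialization step concrete, here is a minimal sketch of the invariant, steward-provided metadata as a config file that the conduit could read before each run. All keys, values, and file names are illustrative placeholders, not an agreed schema.

```python
# Hypothetical sketch: invariant metadata supplied by the data steward on the
# repository side and read by the instrument wrapper (conduit). All keys and
# values below are illustrative placeholders, not an existing format.
import json
from pathlib import Path

INVARIANT_CONFIG = {
    "series_name": "Common Crawl, cleaned",       # name of the dataset series
    "authors": ["<responsible person>"],          # placeholder
    "license": "CC BY 4.0",                       # license to be used (example)
    "funding": ["<project / grant credits>"],     # credits to funding
    "required_formats": ["text/plain"],           # expected bitstream formats
    "periodicity": {"clean": "monthly", "deposit": "quarterly"},
    "target_collection": "<collection handle>",   # e.g. a dedicated DSpace collection
}

def write_config(path: str = "conduit_config.json") -> None:
    """Repository side: write the invariant metadata for the conduit to pick up."""
    Path(path).write_text(json.dumps(INVARIANT_CONFIG, indent=2))

def load_config(path: str = "conduit_config.json") -> dict:
    """Conduit side: read the invariant metadata before each run."""
    return json.loads(Path(path).read_text())
```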
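A similarly hedged sketch of the conduit-side status/command API mentioned above; the endpoint names and the command set (cancel, redirect) are assumptions derived from the description, not an existing protocol.

```python
# Hypothetical sketch of the conduit-side status/command API that the
# repository could poll or call. Endpoints and commands are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

STATE = {"status": "idle", "last_run": None, "errors": [], "redirect_to": None}

@app.get("/status")
def status():
    # Repository side: error detection / progress monitoring.
    return jsonify(STATE)

@app.post("/command")
def command():
    cmd = request.get_json(force=True)
    if cmd.get("action") == "cancel":
        STATE["status"] = "cancelled"
    elif cmd.get("action") == "redirect":
        # Point future deposits at another instance or repository.
        STATE["redirect_to"] = cmd.get("target")
    else:
        return jsonify({"error": "unknown command"}), 400
    return jsonify(STATE)

if __name__ == "__main__":
    app.run(port=8080)
```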
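And a sketch of the quarterly deposit step: the script completes the metadata and sends it, together with references to the data, to a hypothetical ingestion endpoint of the CLARIN DSpace repository. The endpoint path, payload shape, token handling, and response are assumptions for illustration only.

```python
# Hypothetical sketch of the quarterly deposit: complete the metadata and send
# it, with refs to the data, to a (not yet existing) ingestion endpoint.
import datetime
import requests

REPO_INGEST_URL = "https://repo.example.org/api/ingest"       # placeholder URL
ACCESS_TOKEN = "<token scoped to the dedicated collection>"   # placeholder

def deposit(invariant: dict, data_refs: list[str], total_size: int, provenance: dict) -> str:
    payload = {
        **invariant,
        "date": datetime.date.today().isoformat(),
        "source_datasets": ["Common Crawl releases of the last 3 months"],
        "provenance": provenance,   # script reference, parameters, versions, ...
        "total_size": total_size,   # bytes
        "data_refs": data_refs,     # refs the repository can fetch, not the data itself
    }
    resp = requests.post(
        REPO_INGEST_URL,
        json=payload,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=60,
    )
    resp.raise_for_status()
    # The repository would check the metadata, assign a PID and start the usual
    # approval workflow; assume here that it returns the PID in the response.
    return resp.json()["pid"]
```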

@kosarko kosarko added the NRP label Feb 21, 2025
@kosarko kosarko added this to the 2025 milestone Feb 21, 2025

kosarko commented Feb 21, 2025

Just a thought... if we create a new collection, say "Common Crawl data", we can set it up with a "template item". This should prefill the metadata.

Do we want to version these with our replaces/replaced by approach?
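For illustration, the kind of prefilled metadata such a template item might carry; the field names follow the usual Dublin Core style used in DSpace, but the exact set here is just an assumption.

```python
# Hypothetical sketch of a "template item" for a dedicated "Common Crawl data"
# collection, so every new item starts with the invariant fields prefilled.
# Field names and values are illustrative only.
TEMPLATE_ITEM_METADATA = {
    "dc.title": "Common Crawl, cleaned (series)",
    "dc.publisher": "LINDAT/CLARIAH-CZ",                       # placeholder
    "dc.rights": "CC BY 4.0",                                  # example license
    "dc.description.provenance": "Produced by the CC cleaning pipeline",
    "dc.relation.replaces": "",   # filled per release if we version with replaces/replaced by
}
```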

> from scripts running anywhere in the world (e.g., on HPC clusters)

This requirement somewhat complicates authorization/authentication. But I guess if we have a dedicated collection, we can create a system account that only has access to this collection; this should, to a degree, limit the impact of a potentially leaked access token.
In e-INFRA-connected environments this should, in theory, be the role of e-INFRA AAI.
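A tiny sketch of what the collection-scoped authorization could look like on the repository side; purely illustrative pseudologic, not existing DSpace behaviour.

```python
# Hypothetical sketch: a system account whose token is only valid for the
# dedicated collection, limiting the blast radius of a leaked token.
ALLOWED_COLLECTIONS_BY_TOKEN = {
    "cc-ingest-token": {"common-crawl-data"},   # token -> collections it may write to
}

def is_authorized(token: str, target_collection: str) -> bool:
    """Accept a deposit only if the token is scoped to the target collection."""
    return target_collection in ALLOWED_COLLECTIONS_BY_TOKEN.get(token, set())
```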

> ...and ref(s) to the data

So the instrument does not actually push the data via the repository API. Does it have some mechanism to make the data "downloadable" by the repository? How is the repository authorized to fetch that data?
