
Automatic ingestion of datasets from "instruments" (including crawled data): EOSC OS I NRP Task 5.3 related #1169

Open
hajic opened this issue Feb 18, 2025 · 1 comment

hajic commented Feb 18, 2025

This issue concerns the (semi)automatic ingestion of data produced by "instruments", the driving use case being internet crawls that are cleaned and filtered by source- and goal-specific scripts.

The idea is to create a data-ingesting API that can accept metadata and data (or ref(s) to the data) to ingest. The initiative for each ingested item will come from scripts running anywhere in the world (e.g., on HPC clusters). However, it is assumed that the initialization will be done on the repository side, where a data steward fills in the basic information about the script, the invariant metadata items, the periodicity, etc. This information is communicated to the "instrument" wrapper code (the conduit) through a standard API running in the conduit, which normalizes the otherwise instrument-specific (= cleaning script, etc.) communication. That API could also provide status information that the repository can use in various ways, from error detection to commands such as cancelling a run or redirecting it to another instance or another repository, using a newly designed protocol.

For example, the "cleaning crawled text" use case would run, say, at CESNET, cleaning every new Common Crawl release every month and packaging the data every three months for deposit in the CLARIN DSpace repository instance. The core metadata will be entered on the LINDAT DSpace repository side by the person responsible for this continuously created dataset: the name of the dataset series, the authors, the required formats, the license to be used, funding credits, etc. This will be communicated to the cleaning scripts through a config file and the script wrapper API. The script will then run every month to clean the CC release. Every three months it will assemble a bitstream set, complete the core metadata with the date, a description of the source datasets, a provenance description (script reference, parameters, etc.), the total size, and so on, and use the new ingestion API of the CLARIN DSpace repository to send it the completed metadata and ref(s) to the data, so that the repository can actually ingest it: check the metadata, assign a PID, and send it through the usual approval process.
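To make the initialization step concrete, here is a minimal sketch of the invariant, steward-provided metadata as a config file that the conduit could read before each run. All keys, values, and file names are illustrative placeholders, not an agreed schema.

```python
# Hypothetical sketch: invariant metadata supplied by the data steward on the
# repository side and read by the instrument wrapper (conduit). All keys and
# values below are illustrative placeholders, not an existing format.
import json
from pathlib import Path

INVARIANT_CONFIG = {
    "series_name": "Common Crawl, cleaned",       # name of the dataset series
    "authors": ["<responsible person>"],          # placeholder
    "license": "CC BY 4.0",                       # license to be used (example)
    "funding": ["<project / grant credits>"],     # credits to funding
    "required_formats": ["text/plain"],           # expected bitstream formats
    "periodicity": {"clean": "monthly", "deposit": "quarterly"},
    "target_collection": "<collection handle>",   # e.g. a dedicated DSpace collection
}

def write_config(path: str = "conduit_config.json") -> None:
    """Repository side: write the invariant metadata for the conduit to pick up."""
    Path(path).write_text(json.dumps(INVARIANT_CONFIG, indent=2))

def load_config(path: str = "conduit_config.json") -> dict:
    """Conduit side: read the invariant metadata before each run."""
    return json.loads(Path(path).read_text())
```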
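A similarly hedged sketch of the conduit-side status/command API mentioned above; the endpoint names and the command set (cancel, redirect) are assumptions derived from the description, not an existing protocol.

```python
# Hypothetical sketch of the conduit-side status/command API that the
# repository could poll or call. Endpoints and commands are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

STATE = {"status": "idle", "last_run": None, "errors": [], "redirect_to": None}

@app.get("/status")
def status():
    # Repository side: error detection / progress monitoring.
    return jsonify(STATE)

@app.post("/command")
def command():
    cmd = request.get_json(force=True)
    if cmd.get("action") == "cancel":
        STATE["status"] = "cancelled"
    elif cmd.get("action") == "redirect":
        # Point future deposits at another instance or repository.
        STATE["redirect_to"] = cmd.get("target")
    else:
        return jsonify({"error": "unknown command"}), 400
    return jsonify(STATE)

if __name__ == "__main__":
    app.run(port=8080)
```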
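And a sketch of the quarterly deposit step: the script completes the metadata and sends it, together with references to the data, to a hypothetical ingestion endpoint of the CLARIN DSpace repository. The endpoint path, payload shape, token handling, and response are assumptions for illustration only.

```python
# Hypothetical sketch of the quarterly deposit: complete the metadata and send
# it, with refs to the data, to a (not yet existing) ingestion endpoint.
import datetime
import requests

REPO_INGEST_URL = "https://repo.example.org/api/ingest"       # placeholder URL
ACCESS_TOKEN = "<token scoped to the dedicated collection>"   # placeholder

def deposit(invariant: dict, data_refs: list[str], total_size: int, provenance: dict) -> str:
    payload = {
        **invariant,
        "date": datetime.date.today().isoformat(),
        "source_datasets": ["Common Crawl releases of the last 3 months"],
        "provenance": provenance,   # script reference, parameters, versions, ...
        "total_size": total_size,   # bytes
        "data_refs": data_refs,     # refs the repository can fetch, not the data itself
    }
    resp = requests.post(
        REPO_INGEST_URL,
        json=payload,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=60,
    )
    resp.raise_for_status()
    # The repository would check the metadata, assign a PID and start the usual
    # approval workflow; assume here that it returns the PID in the response.
    return resp.json()["pid"]
```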

@kosarko kosarko added the NRP label Feb 21, 2025
@kosarko kosarko added this to the 2025 milestone Feb 21, 2025

kosarko commented Feb 21, 2025

Just a thought... if we create a new collection, say "Common Crawl data", we can set it up with a "template item". This should prefill the metadata.

Do we want to version these with our replaces/replaced by approach?
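For illustration, the kind of prefilled metadata such a template item might carry; the field names follow the usual Dublin Core style used in DSpace, but the exact set here is just an assumption.

```python
# Hypothetical sketch of a "template item" for a dedicated "Common Crawl data"
# collection, so every new item starts with the invariant fields prefilled.
# Field names and values are illustrative only.
TEMPLATE_ITEM_METADATA = {
    "dc.title": "Common Crawl, cleaned (series)",
    "dc.publisher": "LINDAT/CLARIAH-CZ",                       # placeholder
    "dc.rights": "CC BY 4.0",                                  # example license
    "dc.description.provenance": "Produced by the CC cleaning pipeline",
    "dc.relation.replaces": "",   # filled per release if we version with replaces/replaced by
}
```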

> from scripts running anywhere in the world (e.g., on HPC clusters)

This requirement somewhat complicates authorization/authentication. But I guess if we have a dedicated collection, we can create a system account that only has access to this collection; this should, to a degree, limit the impact of a potentially leaked access token.
In e-INFRA-connected environments this should, in theory, be the role of e-INFRA AAI.
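A tiny sketch of what the collection-scoped authorization could look like on the repository side; purely illustrative pseudologic, not existing DSpace behaviour.

```python
# Hypothetical sketch: a system account whose token is only valid for the
# dedicated collection, limiting the blast radius of a leaked token.
ALLOWED_COLLECTIONS_BY_TOKEN = {
    "cc-ingest-token": {"common-crawl-data"},   # token -> collections it may write to
}

def is_authorized(token: str, target_collection: str) -> bool:
    """Accept a deposit only if the token is scoped to the target collection."""
    return target_collection in ALLOWED_COLLECTIONS_BY_TOKEN.get(token, set())
```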

> ...and ref(s) to the data

So the instrument does not actually push the data via the repository API. Does it have some mechanism to make the data "downloadable" by the repository? How is the repository authorized to fetch that data?
