This issue concerns the (semi-)automatic ingestion of data produced by "instruments", the use case being internet crawls, cleaned and filtered by source- and goal-specific scripts.
The idea is to create a data-ingestion API that will be able to accept metadata and data (or references to data) to ingest. The initiative for each ingested item will come from scripts running anywhere in the world (e.g., on HPCs). However, it is assumed that the initialization will be done on the repository side, where the data steward will fill in the basic information about the script, the invariant metadata items, the periodicity, etc. This will be communicated to the "instrument" wrapper code (the conduit) via a standard API running in the conduit, which normalizes the otherwise specific communication of each "instrument" (= cleaning script etc.). That API could also provide status information that the repository can use in various ways, from error detection to commands such as cancelling a run or redirecting it to another instance of the repository, using a newly designed protocol.

For example, the "cleaning crawled text" use case would run, say, at CESNET, cleaning every new Common Crawl release every month and packaging the data every three months for deposit in the CLARIN DSpace repository instance. The core metadata will be entered from the LINDAT DSpace repository side by the person responsible for this continuously created dataset, filling in the name of the dataset series, authors, required formats, the license to be used, credits to funding, etc.; this will be communicated to the cleaning scripts through a config file and the scripts' wrapper API. The script will then be run every month to clean the CC release. Every three months, the script will assemble a bitstream set, complete the core metadata with the date, the source datasets' description, the provenance description (script reference, parameters, etc.), the total size, and so on, and use the new ingestion API of the CLARIN DSpace repo to send it the completed metadata and reference(s) to the data, so that the repository actually ingests it: checking the metadata, assigning it a PID, and sending it into the usual approval process.
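To make the intended flow more concrete, here is a minimal sketch of what the conduit might do once a three-monthly package is ready. It assumes a hypothetical endpoint, JSON layout, field names and token handling; none of this exists yet, and designing the real protocol is exactly what this issue is about.

```python
"""Sketch of the conduit ("instrument" wrapper) side of the proposed ingestion API.

Everything below is illustrative: the endpoint path, the JSON layout, the field
names and the environment variable are assumptions, since the protocol and the
metadata schema have not been designed yet.
"""
import os

import requests

INGEST_URL = "https://lindat.example.org/api/ingest/items"  # hypothetical endpoint
API_TOKEN = os.environ.get("INGEST_API_TOKEN", "placeholder-token")  # scoped system-account token

# Invariant metadata filled in by the data steward on the repository side and
# shipped to the conduit in a config file.
core_metadata = {
    "series": "Common Crawl, cleaned",
    "authors": ["Responsible Person (placeholder)"],
    "license": "CC BY 4.0 (placeholder)",
    "funding": ["Project XYZ (placeholder)"],
}

# Per-release metadata completed by the conduit once the three-monthly package is ready.
release_metadata = {
    **core_metadata,
    "date": "2024-06-30",
    "source_datasets": ["CC-MAIN-2024-18", "CC-MAIN-2024-22", "CC-MAIN-2024-26"],
    "provenance": {
        "script": "https://example.org/cc-cleaner",  # reference to the cleaning script
        "parameters": {"languages": ["cs"], "dedup": True},
    },
    "total_size_bytes": 123_456_789_012,
}

# References to the data rather than the data itself; how the repository is
# authorized to fetch these is still an open question (see the comments below).
data_refs = [
    {
        "url": "https://hpc-storage.example.org/cc-clean/2024Q2/part-000.warc.gz",
        "checksum": "sha256:<placeholder>",
    },
]

response = requests.post(
    INGEST_URL,
    json={"metadata": release_metadata, "data_refs": data_refs},
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=60,
)
response.raise_for_status()
# The repository side would then validate the metadata, assign a PID and push
# the item into the usual approval workflow; the conduit only records the result.
print("Submitted:", response.json())
```

The symmetric part (status reporting, cancel/redirect commands) would sit behind a similar, to-be-designed endpoint on the conduit side.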
Just a thought... if we create a new collection, say "Common Crawl data", we can set it up with a "template item". This should prefill the metadata.
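For illustration, the template item could prefill the invariant fields, roughly in the shape DSpace uses for item metadata. All field choices and values below are placeholders, not a finished metadata profile:

```python
# Placeholder sketch of what a "Common Crawl data" collection template item
# might prefill for every new submission (field choices and values are illustrative).
template_item_metadata = {
    "dc.title": [{"value": "Common Crawl, cleaned (REPLACE: covered period)"}],
    "dc.contributor.author": [{"value": "REPLACE: responsible person / team"}],
    "dc.rights": [{"value": "REPLACE: licence agreed with the data steward"}],
    "dc.description.sponsorship": [{"value": "REPLACE: funding credit"}],
    "dc.description": [{"value": "Web text cleaned and filtered from a Common Crawl release"}],
}
```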
Do we want to version these with our replaces/replaced by approach?
from the scripts running anywhere in the world (e.g., in HPCs)
This requirement somewhat complicates authorization/authentication. But I guess if we have a dedicated collection, we can create a system account that only has access to that collection (this should, to a degree, limit the potential impact of a leaked access token).
In e-infra-connected environments this should, in theory, be the role of the e-infra AAI.
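Just to picture the scoping idea (a conceptual sketch, not DSpace code): the ingest endpoint would bind each token to a single collection and reject everything else, so a leaked token could at most flood that one collection's approval queue.

```python
# Conceptual sketch only: a server-side check that ties an access token to a
# single collection, limiting what a leaked token can do.
ALLOWED_COLLECTIONS = {
    # token fingerprint -> collection the system account may submit to (placeholders)
    "sha256:<fingerprint-of-conduit-token>": "collection:common-crawl-data",
}

def is_authorized(token_fingerprint: str, target_collection: str) -> bool:
    """Accept a submission only if the token is known and bound to the target collection."""
    return ALLOWED_COLLECTIONS.get(token_fingerprint) == target_collection
```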
...and ref(s) to data
So the instrument does not actually push the data via the repository API. Does it have some mechanism to make the data "downloadable" by the repository? How is the repository authorized to fetch this data?