ARC versioning #9
-
The DOI process is designed to work with Git-LFS. Once a DOI is assigned, S3 versioning is enabled, ensuring that the data cannot be deleted, even if the Git repository itself is removed. This provides a robust mechanism for preserving the dataset's integrity. To avoid data duplication, we will not trigger a Git release. Instead, the DOI will be linked to the ARC RO-Crate, which is stored as an Invenio record (along with the corresponding S3-hosted data). The comprehensive ARC RO-Crate representation is sufficient to fully recreate the ARC repository if necessary, which resolves any issues related to repository reconstruction. The link to the Git repository will be provided solely for user convenience. I hope that clarifies the discussion a bit more...
-
This came up in today's Data Steward meeting and I just wanted to keep the discussion alive and involve more people.
How should ARC versioning work?
What should be part of this versioning: metadata only, or raw data as well?
How do we ensure the data will remain available?
How do we avoid data duplication?
For example, when I publish my ARC, the published page in ARChive simply points to my DataHUB repository for the raw data. Should the publishing process include creating a tag or release on the DataHUB repo, so that different versions of the published ARC can point to specific tags/releases on the original repo in case I change the data? (This is how I thought it would work.)
Nothing is stopping me from deleting my DataHUB repository after publishing my ARC, which contradicts the expectation set by minting a DOI. So how do we ensure a copy of all the data in the ARC will always be available but also in a way that avoids data duplication and supports versioning?
ARCs themselves can also point to external data. Do we want to impose some restriction here to keep intact the expectation that data will remain available (e.g. only allow linking out to other DOIs or a set of predefined 'trusted' data sources)? What is the balance here between giving users a lot of freedom and safeguarding the FAIR principles/expectations of long-term data availability?
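As a rough illustration of the "only DOIs or trusted sources" idea, here is a minimal sketch of how an external reference could be checked before publishing. Everything in it is an assumption for discussion: the function name `is_persistent_reference`, the `TRUSTED_HOSTS` allowlist (placeholder hosts, not an agreed list), and the loose DOI pattern are all hypothetical and not part of any existing ARC tooling.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist: placeholder hosts only, no such list has been agreed on.
TRUSTED_HOSTS = {"doi.org", "zenodo.org"}

# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_persistent_reference(ref: str) -> bool:
    """Return True if ref is a bare DOI or an http(s) URL on a trusted host."""
    ref = ref.strip()
    # Accept the common "doi:" prefix in front of a bare DOI.
    if ref.lower().startswith("doi:"):
        ref = ref[4:]
    if DOI_PATTERN.match(ref):
        return True
    parsed = urlparse(ref)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Match the host itself or any of its subdomains, but not look-alike domains.
    return host in TRUSTED_HOSTS or any(
        host.endswith("." + trusted) for trusted in TRUSTED_HOSTS
    )
```

A check like this only looks at the form of the link; it says nothing about whether the target actually stays available, so it would be one building block of a policy rather than a guarantee.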
I am sure I missed more parts of the discussion we had today, so please add your thoughts @Brilator @kMutagene @HLWeil @Freymaurer (and anybody else I may have missed or don't know a GitHub ID for).