The UMCU Uploader is an application for describing research data sets, preserving them in Archivematica and publishing them with Dataverse.
The University Medical Centre Utrecht (UMCU) sponsored the development of this application to further the objectives of Open Science and FAIR research data management.
The application is currently a minimum viable product to provide an end to end process that can be tested with real data sets.
Apache License Version 2.0 - Copyright Artefactual Systems Inc (2023)
See CONTRIBUTING for guidelines on how to contribute to the project.
The UMCU Uploader is a web-based application that is built using the Python Flask framework. It has been tested using Python 3.6.9.
- Researchers upload a directory of files containing the research data.
- Metadata describing the dataset is added (title, description, related publications, and so on).
- Access rights are assigned to all of the files in the dataset (Public, Restricted or Private).
- Metadata is stored in a metadata.json file in a metadata directory (using a standard Archivematica format)
- The dataset is copied to the Archivematica Transfer Source Directory
- An Archivematica user can then select the Transfer for transfer and ingest processing in Archivematica, which will create an Archival Information Package (AIP)
- The user provides the AIP UUID in the UMCU Uploader application.
- The Uploader will retrieve the AIP from Archivematica (using the Archivematica API).
- The Uploader will then select files from the AIP that have Public or Restricted Access Rights (any files classified as Private are not published).
- The Uploader will then upload the dataset to Dataverse (using the Dataverse API).
- The research data manager will review and approve the dataset before it is visible to public users.
-
Clone repository:
git clone https://github.com/artefactual-labs/umcu-uploader.git cd umcu-uploader
-
Clone submodules:
git submodule update --init --recursive
-
Set up virtualenv in the project root directory:
virtualenv -p python3 venv
-
Activate virtualenv:
source venv/bin/activate
-
Install requirements:
pip install -r requirements/base.txt
-
In a terminal window, start the Flask server:
python run.py
-
The application runs on HTTP port 5000. Confirm that the Flask server and application are up and running at
localhost:5000
in your browser.
Configuration is specified, using YAML, either in .config.yaml
in the
application directory or in /etc/umcu-uploader.yaml
. The various configuration
settings are detailed below:
Setting | Description | Default |
---|---|---|
host | Host to run app on | 0.0.0.0 |
port | HTTP port to listen on | 5000 |
debug | If True , run with built-in debugger |
False |
secret_key | Key used to sign cookies[1] | you-shall-not-pass🧙<200d>♂️ |
data_directory | Directory in which use files will be stored | system temp directory |
transfer_source_directory | Archivematica transfer source directory | none |
dataverse_server | Dataverse server to upload to | https://dataverse.nl/dataverse/ |
dataverse_demo_server | Demo Dataverse server to upload to | https://demo.dataverse.nl/dataverse/ |
dataverse_api_key | Dataverse API key | none |
demo_mode | If True , run using demo Dataverse server |
True |
depositor_name | Name of depositor | ANON |
divisions | Division-specific Archivematica transfer source directories, etc.[2] | none |
storage_server_url | Archivematica Storage Server URL | none |
storage_server_user | Archivematica Storage Server username | none |
storage_server_api_key | Archivematica Storage Server API key | none |
storage_server_basic_auth_user | Archivematica Storage Server basic auth user | none |
storage_server_basic_auth_password | Archivematica Storage Server basic auth password | none |
[1] Cookie signing details: https://stackoverflow.com/questions/22463939/demystify-flask-app-secret-key
[2] The configuration values are single values, except for the divisions
field.
Example value of the divisions
setting:
divisions:
ed:
name: Example Division
transfer_source_directory: /path/to/directory
...
The application itself, as opposed to application data, should be given 1 GB, or so, of disk space (providing room for the source code and the application's SQLite database).
Application data will likely require more disk space. Application data needs to be manually deleted currently. In order to estimate disk space requirements this equation can be used:
Required Disk Space = Estimated Average Size of Dataset * 3 * Estimated Number of Datasets
Here, the "Estimated Average Size of Dataset" is multiplied by 3 and then by the "Estimated Number of Datasets" for whatever period of time passes before old application data is cleaned up.
A fairly simple way of deploying the app is to proxy it through Nginx. This allows the app to be accessed via TLS/SSL, basic access authentication, etc.
Instructions for deploying using uWSGI proxied through Nginx:
-
Add, to the server block of an Nginx configuration, directives to proxy to WSGI:
location = /uploader { rewrite ^ /uploader/; } location /uploader { try_files $uri @uploader; } location @uploader { uwsgi_pass unix:/tmp/uploader.sock; include uwsgi_params; }
-
Run the app using the included config file:
uwsgi uploader.ini
If deploying in Ubuntu you'll likely want to run it using the www-data user:
sudo -u www-data uwsgi uploader.ini
-
Create an application data directory somewhere on your filesystem. This is where research data uploaded to the app will be put.
If deploying in Ubuntu this directory could be put somewhere like, for example,
/var/uploader
. You'll want to make this directory owned by the same user running the app so the app can write data to it.Example:
sudo mkdir /var/umc-uploader sudo chown www-data:www-data /var/umc-uploader
Make sure to set the
data_directory
configuration setting to your chosen application directory.
The UMC uploader needs to copy research data to Archivematica so that AIPs can be created from them.
For the UMC uploader to be able to copy to Archivematica transfer source
directories, however, requires that the file permissions on these directories
allow the user that the UMC uploader is running under (likely www-data
if
running in Ubuntu) to write to them.
If only one Archivematica transfer source directory exists it can be specified
using the transfer_source_directory
configuration setting. To allow
per-department transfer source directories to be specified then the divisions
configuration setting must be populated (see example in Configuration
section of this document).
For the UMC uploader to be able to download created AIPs afterwards, before exporting to Dataverse, requires that Archivematica Storage Service access be configured.
The following configuration settings must be specified:
storage_server_url
(Archivematica Storage Server URL)storage_server_user
(Archivematica Storage Server username)storage_server_api_key
(Archivematica Storage Server API key)
If Archivematica Storage Service is running behind HTTP basic authentication then the configuration settings must also be specified:
storage_server_basic_auth_user
(Archivematica Storage Server basic auth user)storage_server_basic_auth_password
(Archivematica Storage Server basic auth password)
Exporting downloaded AIPs to Dataverse requires the following configuration options be specified:
dataverse_server
(Dataverse server to upload to)dataverse_demo_server
(Demo Dataverse server to upload to)dataverse_api_key
(Dataverse API key)demo_mode
(IfTrue
, run using demo Dataverse server)
A divisions.csv
file must also be present in the application directory for
exports to Dataverse to export properly.