qqmyers edited this page Mar 28, 2023 · 3 revisions

DVWebloader - a bulk file uploader for Dataverse

Overview

DVWebLoader is a small web application that can be configured with Dataverse to allow upload of a whole directory/folder tree of files into a Dataverse dataset, retaining their relative paths within the directory/folder in the dataset. Before uploading, DVWebLoader will check the dataset contents and will, by default, not upload files that already exist in it. Users can modify the default selection by checking/unchecking specific files before initiating the upload.

Limitations

DVWebloader currently works with S3 stores with direct upload enabled and will not work with other types of stores in Dataverse. (S3 with direct upload is the most efficient mechanism for bulk uploads built into Dataverse. Other methods could be supported but would likely reduce upload speed and increase the load on the Dataverse server.)

Also, with Dataverse versions before ~5.14, DVWebloader only uses the MD5 algorithm for file fixity (sent as the checksum with uploads), regardless of whether the Dataverse installation has been configured to use a different algorithm. As of Dataverse ~5.14, which adds an API call that lets DVWebloader determine which algorithm to use, DVWebloader will use the configured algorithm.
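
As an illustration of the fixity behavior described above, here is a minimal Python sketch (DVWebloader itself runs in the browser, so this is not its actual code). The function name `file_checksum` and the algorithm names in `HASHLIB_NAMES` are assumptions made for this example. The checksum falls back to MD5 when no configured algorithm is known.

```python
import hashlib
from typing import Optional, Tuple

# Assumed, illustrative mapping from fixity algorithm labels to hashlib names.
HASHLIB_NAMES = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-512": "sha512",
}


def file_checksum(data: bytes, configured_algorithm: Optional[str] = None) -> Tuple[str, str]:
    """Return (algorithm label, hex digest) for the given file content.

    If the installation's configured algorithm is unknown (None),
    fall back to MD5, mirroring the older behavior described above.
    """
    algorithm = configured_algorithm or "MD5"
    digest = hashlib.new(HASHLIB_NAMES[algorithm], data).hexdigest()
    return algorithm, digest
```

For example, `file_checksum(content)` yields an MD5 digest, while `file_checksum(content, "SHA-256")` uses the configured algorithm instead.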

Administration

For Dataverse v5.12.1 and earlier, DVWebloader can be installed as an external tool. Instructions are currently listed in the repository README file. This method is not ideal for several reasons:

  • external tools are only available after at least one file has been uploaded to a dataset
  • external tools appear on all datasets, even if the dataset is not using an S3 store with direct upload, and
  • external tools appear in the Access menu rather than in the Upload panel.

Once Dataverse is updated (support is expected in v5.13), DVWebloader can instead be enabled by adding 'dvwebloader' as one of the entries in the :UploadMethods setting and setting :WebloaderUrl to the base URL where DVWebloader is installed. (https://gdcc.github.io/dvwebloader/src/dvwebloader.html is one such location, which you can use without having to manage a local copy.)
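
The two settings could be applied with something like the following sketch, which only builds the HTTP requests and does not send them. It assumes the Dataverse admin settings API is reachable at `http://localhost:8080/api/admin/settings/` and that `native/http` is an upload method you already have enabled. Both are assumptions; adjust them to your installation. The helper names are hypothetical.

```python
import urllib.request


def build_setting_request(base_url: str, name: str, value: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets one database setting."""
    url = f"{base_url}/api/admin/settings/{name}"
    return urllib.request.Request(url, data=value.encode("utf-8"), method="PUT")


def webloader_setting_requests(base_url, webloader_url, other_methods=("native/http",)):
    """Requests that add 'dvwebloader' to :UploadMethods and set :WebloaderUrl.

    other_methods is an assumed list of upload methods that are already enabled.
    """
    methods = ",".join([*other_methods, "dvwebloader"])
    return [
        build_setting_request(base_url, ":UploadMethods", methods),
        build_setting_request(base_url, ":WebloaderUrl", webloader_url),
    ]


if __name__ == "__main__":
    requests_to_send = webloader_setting_requests(
        "http://localhost:8080",  # assumed local admin endpoint
        "https://gdcc.github.io/dvwebloader/src/dvwebloader.html",
    )
    for req in requests_to_send:
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # urllib.request.urlopen(req)  # uncomment to actually apply the setting
```

Printing the requests first makes it easy to review the values before changing a live installation.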

Use

When configured, and when in a dataset that is using an S3 direct upload store, the DVWebLoader will appear as a button within the Upload pane. Clicking the button will open the DVWebLoader in a new browser tab.

The initial interface shows only a 'Select a Directory' button. Clicking it lets you navigate to a specific directory/folder on your computer to upload from. The selected directory itself does not map to a path in the dataset, i.e. files directly in that directory will not have a path in Dataverse. Files in any subdirectories will have a path in the dataset corresponding to their local path relative to the directory you select.
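
The path rule above can be sketched in Python as follows. This is an illustration only, not DVWebloader's code, and the function name `dataset_path` is hypothetical. Files directly in the selected directory get an empty path, and files in subdirectories get their path relative to the selected directory.

```python
from pathlib import PurePosixPath


def dataset_path(selected_dir: str, local_file: str) -> str:
    """Return the path a file would have in the dataset.

    The path is the file's parent directory relative to the selected
    directory, or an empty string for files directly inside it.
    """
    relative = PurePosixPath(local_file).relative_to(PurePosixPath(selected_dir))
    parent = str(relative.parent)
    return "" if parent == "." else parent
```

With `/home/me/project` selected, `/home/me/project/readme.txt` maps to an empty path, while `/home/me/project/data/raw/a.csv` maps to `data/raw`.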

Once you select a directory, you will be prompted to allow uploads. This dialog is a security measure built into the browser. Once you agree, DVWebloader will list all of the files in your selected directory and all subdirectories. Any files whose path and name match files already in the dataset will be unchecked and color-coded and, by default, will not be uploaded. If there are new files to upload, a 'Start Upload' button will appear.
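
The default selection logic can be pictured with this small Python sketch. It assumes each file is identified by a (path, name) pair. The function names are hypothetical and this is not the tool's actual implementation.

```python
from typing import Dict, Iterable, Tuple

FileKey = Tuple[str, str]  # (path within the dataset, file name)


def default_selection(local_files: Iterable[FileKey],
                      existing_files: Iterable[FileKey]) -> Dict[FileKey, bool]:
    """Map each local file to True (checked) or False (unchecked).

    Files whose path and name both match a file already in the dataset
    start out unchecked, so they are skipped by default.
    """
    existing = set(existing_files)
    return {key: key not in existing for key in local_files}


def start_upload_visible(selection: Dict[FileKey, bool]) -> bool:
    """The 'Start Upload' button is shown while at least one file is checked."""
    return any(selection.values())
```

A file with the same name but a different path counts as new, because the comparison uses both the path and the name.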

Prior to hitting the 'Start Upload' button, a user can check or uncheck any file(s) to refine which files will be uploaded. (Checking a file already in the dataset will result in a duplicate copy in the dataset, e.g. with a '_1' added to the filename.) The 'Start Upload' button will be visible as long as one or more files are checked.

When 'Start Upload' is clicked, DVWebLoader will start uploading files to the S3 store, with ~4 files being transferred in parallel. Progress on these uploads will be shown next to each file. After each upload, DVWebLoader also calculates the MD5 hash of the local file so that Dataverse will be able to verify, e.g. automatically prior to publication, that the copy in S3 is exactly the same as the local one.
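
The upload step can be approximated by the following Python sketch, which limits concurrency to four transfers and computes an MD5 hash for each file once its upload finishes. The `upload` callable stands in for the actual transfer to S3 and is purely hypothetical, as are the other names. DVWebloader does this in the browser, not in Python.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

MAX_PARALLEL = 4  # roughly the number of simultaneous transfers described above


def upload_and_hash(name: str, data: bytes,
                    upload: Callable[[str, bytes], None]) -> tuple:
    """Upload one file, then compute the MD5 of the local content."""
    upload(name, data)  # hypothetical transfer to the S3 store
    return name, hashlib.md5(data).hexdigest()


def upload_all(files: Dict[str, bytes],
               upload: Callable[[str, bytes], None],
               max_parallel: int = MAX_PARALLEL) -> Dict[str, str]:
    """Upload all files with bounded parallelism; return name -> MD5 hex digest."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(upload_and_hash, name, data, upload)
                   for name, data in files.items()]
        return dict(future.result() for future in futures)
```

The returned digests are what would later be reported alongside each file so that the stored copy can be compared against the local one.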

In the final step, DVWebLoader will report all of the uploads to Dataverse and add them to the dataset. This step can take significant time for larger numbers of files. Once it is complete, DVWebLoader will indicate success and suggest returning to the dataset page and refreshing to see the new files.
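
To give a rough idea of what such a report might contain, the sketch below assembles a JSON list describing the uploaded files. The field names (`storageIdentifier`, `fileName`, `mimeType`, `checksum`, `directoryLabel`) follow the Dataverse direct upload API as best recalled and should be treated as assumptions to check against the Dataverse API guide. The input dictionary keys and the function name are hypothetical.

```python
import json
from typing import Dict, List


def build_add_files_payload(uploaded: List[Dict[str, str]]) -> str:
    """Build a JSON string describing uploaded files for registration.

    Each input item is a hypothetical record with keys:
    storage_identifier, name, md5, and optionally mime_type and path.
    """
    entries = []
    for item in uploaded:
        entry = {
            "storageIdentifier": item["storage_identifier"],
            "fileName": item["name"],
            "mimeType": item.get("mime_type", "application/octet-stream"),
            "checksum": {"@type": "MD5", "@value": item["md5"]},
        }
        # Files directly in the selected directory have no path in the dataset.
        if item.get("path"):
            entry["directoryLabel"] = item["path"]
        entries.append(entry)
    return json.dumps(entries)
```

Sending one list for the whole batch, rather than one request per file, fits the description of a single final step that can take a while for large numbers of files.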

Note: In the future, DVWebLoader may offer a 'Cancel' option to terminate uploads in progress. Until then, users should either allow uploads to complete (and then delete any unwanted files via the dataset page), or notify their Dataverse administrators so that unwanted files and partial uploads can be removed from S3 storage.

Example Interface Picture:

image

Initial Design/Development sponsored by UiT/DataverseNO
