
Hierarchical upload API optimized for folders & collections. #5220

Merged
19 commits merged on Mar 9, 2018

Conversation

jmchilton
Member

@jmchilton commented Dec 14, 2017

This new API endpoint allows describing hierarchical data in JSON or inferring structure from archives or directories.

Datasets or archive sources can be specified via uploads, URLs, paths (if admin && allow_path_paste), library_import_dir/user_library_import_dir, and/or FTP imports. Unlike the existing library API endpoint, these sources can be mixed on a per-file basis.

Supported "archives" include gzip, zip, BagIt directories, and BagIt archives (with fetching and validation of downloads).

The existing upload API endpoint is quite rough to work with, both in terms of adding parameters (e.g. the file type and dbkey handling in #4563 was difficult to implement, terribly hacky, and should seemingly have been trivial) and in terms of building requests (one needs to build a tool form rather than describe sensible inputs in JSON). This API is built to be intelligible from an API standpoint instead of being constrained to the older-style tool form. Additionally, it is built with hierarchical data in mind in a way that would not be at all easy to achieve by enhancing tool form components we don't even render.
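As a concrete illustration of the kind of request body this enables - a sketch only, mixing the per-item source types that come up later in this thread (url, ftp_import, and path); the field names follow the examples in this conversation rather than a definitive schema:

# Sketch of a hierarchical request mixing sources on a per-item basis.
# Field names mirror the examples in this thread; treat this as
# illustrative, not as the authoritative payload schema.
request = {
    "destination": {"type": "library", "name": "Training Material"},
    "items": [
        # fetched from a URL
        {"src": "url",
         "url": "https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq",
         "ext": "fastqsanger"},
        # pulled from the user's FTP import directory
        {"src": "ftp_import", "ftp_path": "relative_ftp_path/to/file.fastq.gz"},
        # read from a server path (admin && allow_path_paste only)
        {"src": "path", "path": "/data/shared/example.fastq"},
    ],
}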

This implements #5159 by allowing BagIt directories and BagIt archives (e.g. zip or tar.gz files containing BagIt directories) to be specified as the source of items for a library folder, though a much simpler YAML description of data libraries can also be translated directly into API calls against the new endpoint. To demonstrate the latter I've included an example script, fetch_to_library.py. The example input file included demonstrates the format:

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
items:
  - name: Quality Control
    description: |
      Data for sequence quality control tutorial at http://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html.
      10.5281/zenodo.61771
    items:
      - src: url
        url: https://zenodo.org/record/61771/files/GSM461178_untreat_paired_subset_1.fastq
        name: GSM461178_untreat_paired_subset_1
        ext: fastqsanger
        info: Untreated subseq of GSM461178 from 10.1186/s12864-017-3692-8
      - src: url
        url: https://zenodo.org/record/61771/files/GSM461182_untreat_single_subset.fastq
        name: GSM461182_untreat_single_subset
        ext: fastqsanger
        info: Untreated subseq of GSM461182 from 10.1186/s12864-017-3692-8
  - name: Small RNA-Seq
    description: |
      Data for small RNA-seq tutorial available at http://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/srna/tutorial.html
      10.5281/zenodo.826906
    items:
      - src: url
        url: https://zenodo.org/record/826906/files/Symp_RNAi_sRNA-seq_rep1_downsampled.fastqsanger.gz
        name: Symp RNAi sRNA Rep1
        ext: fastqsanger.gz
        info: Downsample rep1 from 10.1186/s12864-017-3692-8
      - src: url
        url: https://zenodo.org/record/826906/files/Symp_RNAi_sRNA-seq_rep2_downsampled.fastqsanger.gz
        name: Symp RNAi sRNA Rep2
        ext: fastqsanger.gz
        info: Downsample rep2 from 10.1186/s12864-017-3692-8
      - src: url
        url: https://zenodo.org/record/826906/files/Symp_RNAi_sRNA-seq_rep3_downsampled.fastqsanger.gz
        name: Symp RNAi sRNA Rep3
        ext: fastqsanger.gz
        info: Downsample rep3 from 10.1186/s12864-017-3692-8

This example demonstrates creating a library via this endpoint, but that destination element could be used to create an HDCA in a history or populate the contents of an existing library folder just as easily.
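For a rough sense of that translation, here is a minimal sketch in the spirit of fetch_to_library.py (not the actual script): it loads a YAML description like the one above and hands it to a Galaxy instance. The endpoint route, payload handling, and API-key authentication shown here are assumptions for illustration - the script included in this PR shows the real invocation.

import requests
import yaml  # PyYAML

GALAXY_URL = "https://galaxy.example.org"  # hypothetical instance
API_KEY = "your-api-key"                   # hypothetical key

# Load the YAML description of the destination plus its nested items.
with open("training_material.yml") as fh:
    payload = yaml.safe_load(fh)

# Hand the description to the new endpoint more or less verbatim. The
# route and auth handling are assumptions; see fetch_to_library.py for
# the real call.
response = requests.post(
    GALAXY_URL + "/api/tools/fetch",
    params={"key": API_KEY},
    json=payload,
)
response.raise_for_status()
print(response.json())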

Once this lands in a stable release, I'll port a cleaned-up version of this example script as a replacement for (or augmentation of) the data library script in Ephemeris. Unlike the script in Ephemeris and the existing library folder upload options, this handles much more metadata and allows the creation of nested folders instead of just flat lists of contents.

In future PRs I'll add filtering options to this and it will serve as the backend to #4733.

Implements #4734.

@bgruening
Member

@jmchilton this is super cool and such a yaml file turned out to be super useful in our training community. Looking forward to these nice enhancements.

What do you think about adding permissions also to this schema, so that I can specify during upload the group/role?

@jmchilton force-pushed the upload_2.0 branch 3 times, most recently from ba2f029 to 24d15bb on December 20, 2017 at 14:23
@martenson
Member

> What do you think about adding permissions also to this schema, so that I can specify during upload the group/role?

Making permissions part of the schema would make the 'bag' not portable though, since it would rely on a given instance's namespace. Or did you mean something like a 'set access to the bag to these four users on import' feature?

@jmchilton
Member Author

jmchilton commented Dec 20, 2017

So a syntax for bag imports currently would be:

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
items_from: bagit_archive
src: url
url: http://example.org/coolbag.zip

or

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
items_from: bagit_archive
src: ftp_import
ftp_path: relative_ftp_path/to/coolbag.zip

or some other stuff for direct path imports, server_dir imports, uploads, etc... At any rate I would imagine one would stash the roles needed for library creation in the destination part:

destination:
  type: library
  name: Training Material
  description: Data for selected tutorials from https://training.galaxyproject.org.
  roles:
    admin:
      - name_of_role
    read: 
      - name_or_ro_role
items_from: bagit_archive
src: ftp_import
ftp_path: relative_ftp_path/to/coolbag.zip

Or however we would identify the different kinds of roles and the roles themselves - so I think this could make sense. The destination is pretty independent of what is being imported. Another thing is that existing library folders can already be targeted (that is actually how I started implementing this).

destination:
  type: library_folder
  library_folder_id: folder_id
items_from: bagit_archive
src: ftp_import
ftp_path: relative_ftp_path/to/coolbag.zip

so certainly this can be used to upload bags or zips or files or directories to existing folders with custom permissions as well.

I have a TODO list for this PR that definitely has permissions on it - but it might be relegated to a follow up issue unless someone feels strongly this needs to be in the first attempt.

@martenson
Member

martenson commented Dec 20, 2017

I see, so the roles here would be user emails which can be portable to some extent?

edit: because the bag does not know about groups and custom role names/ids

@jmchilton
Member Author

> I see, so the roles here would be user emails which can be portable to some extent?

There are two things - the API and the YAML format. They correspond pretty directly, and role IDs or role names certainly make a lot of sense in the API. I wouldn't really expect sample YAML files with library descriptions to be portable if they defined such roles though. Perhaps CLI parameters for setting those up, or a GUI that you can paste the generic config into and then customize.

@martenson
Member

> GUI that you can paste the generic config into and then customize.

That is what I was thinking. Load your bag definition, and then you would have the ability to choose instance-specific settings like permissions, the parent folder, etc., before you click the big 'Import' button.

@@ -809,6 +809,9 @@ def execute(self, trans, progress, invocation_step, use_cached_job=False):
invocation = invocation_step.workflow_invocation
step = invocation_step.workflow_step
tool = trans.app.toolbox.get_tool(step.tool_id, tool_version=step.tool_version)
if tool.is_workflow_compatible:
Member

if not?

Member Author

Yes - thanks!

@jmchilton
Member Author

jmchilton commented Mar 7, 2018

@anatskiy So I hit a few problems getting your example to work but I've come up with solutions and test cases:

I was right that this broke down quite a bit when I merged #5609 back into this branch, but that breakage should all be fixed with 6252a9e.

Then I noticed that Galaxy doesn't sniff CSV files consistently - the result depends on how you are uploading - so I opened #5643 for the existing uploader (which is included in this PR) and added a fix for this new uploader.

Finally, I added c695112, which has a bunch more tests for the new uploader, including CSV tests derived from your file.

So

destination:
  type: hdas
  name: Some dataset
  description: some description
items:
  - name: Dataset 1
    src: url 
    url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv

should now upload just fine. The one caveat I did learn is that your file has carriage returns in it, which Python's CSV sniffing doesn't handle by default, so the file ends up as a txt input. This isn't a problem for the regular uploader because it always converts carriage returns to POSIX newlines. This API doesn't do that by default because it is less eager to mess with your data - but it can be switched on. So to get this to sniff as CSV you can use:

destination:
  type: hdas
  name: Some dataset
  description: some description
items:
  - name: Dataset 1
    src: url 
    url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv
    to_posix_lines: true

or just

destination:
  type: hdas
  name: Some dataset
  description: some description
items:
  - name: Dataset 1
    src: url 
    url: http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv
    to_posix_lines: true
    ext: csv

Thanks again for trying it out, hope these fixes help.

@anatskiy
Contributor

anatskiy commented Mar 8, 2018

@jmchilton thanks a lot! It works perfectly now!

I have just one question. Is it possible to upload data with just a POST request? Or does such an API not exist?

@bgruening
Member

@jmchilton if you can resolve the conflict please, I think we are ready to merge.

…on_util.

I need to use this from upload code.
Allows describing hierarchical data in JSON or inferring structure from archives or directories.

Datasets or archive sources can be specified via uploads, URLs, paths (if admin && allow_path_paste), library_import_dir/user_library_import_dir, and/or FTP imports. Unlike existing API endpoints, a mix of these on a per-file basis is allowed, and they work seamlessly between libraries and histories.

Supported "archives" include gzip, zip, BagIt directories, and BagIt archives (with fetching and validation of downloads).

The existing upload API endpoint is quite rough to work with, both in terms of adding parameters (e.g. the file type and dbkey handling in 4563 was difficult to implement, terribly hacky, and should seemingly have been trivial) and in terms of building requests (one needs to build a tool form rather than describe sensible inputs in JSON). This API is built to be intelligible from an API standpoint instead of being constrained to the older-style tool form. Additionally, it is built with hierarchical data in mind in a way that would not be at all easy to achieve by enhancing tool form components we don't even render.

This implements 5159, though much simpler YAML descriptions of data libraries should be possible - essentially mirroring the API descriptions. We can replace the data library script in Ephemeris (https://github.com/galaxyproject/ephemeris/blob/master/ephemeris/setup_data_libraries.py) with one that converts a simple YAML file into an API call and gains many new options for free.

In future PRs I'll add filtering options to this and it will serve as the backend to 4733.
In the case of data-fetch there is extra validation that is done, so this is somewhat important.
Concerning the fact that they sometimes get deleted in production with default settings - see 5361.
Trying to improve the user experience of the rule-based uploader by placing HDAs and HDCAs in the history at the outset, so that the history panel can poll them and we can turn them red if the upload fails.

From Marius' PR review:

> I can see that a job launched in my logs, but it failed and there were no visual indications of this in the UI

Not every HDA can be pre-created - for example, when datasets are read from a zip file that still happens on the backend. Likewise, HDCAs that don't define a collection type up front cannot be pre-created (for instance when the collection type is inferred from a folder structure). Library items aren't pre-created at all in this commit. There is room to pre-create more, but I think this is an atomic commit as it is now, and it will hopefully improve the user experience for the rule-based uploader considerably.
- Remove seemingly unneeded hack in upload_common.
- Remove stray debug statement.
- Add more comments in the output collection code related to different destination types.
- Restructure if/else in data_fetch to avoid assertion with constant.
Previously, sniffing would happen on the original file (before carriage returns were converted and spaces were converted to tabs) if in_place was false, and on the converted file if it was true.
@jmchilton
Member Author

@bgruening Awesome - thanks for the review. I've rebased and resolved the conflicts.

@anatskiy Yeah - it is possible via the API - that publication script doesn't work that way though. There are examples in the test code - https://github.com/galaxyproject/galaxy/pull/5220/files#diff-a1e38277f26f32a21c0878b6b640accbR191.

So any dictionary in the API that can accept, for instance, {"src": "path", "path": "/path/to"} or {"src": "url", "url": "http://example.com"} can also accept {"src": "files"}. This means: treat the multi-part upload parameters files_0, files_1, etc. as an iterator and grab "the next one". So the number of occurrences of {"src": "files"} should match the number of files attached via multi-part POSTs.
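A rough requests-based sketch of that pairing - the endpoint route and the form field carrying the JSON description are assumptions here; the point is only that each {"src": "files"} entry is consumed, in order, from files_0, files_1, and so on:

import json
import requests

GALAXY_URL = "https://galaxy.example.org"  # hypothetical instance
API_KEY = "your-api-key"                   # hypothetical key

# Two items declared as {"src": "files"} ...
description = {
    "destination": {"type": "hdas"},
    "items": [
        {"src": "files", "name": "Dataset 1", "ext": "csv", "to_posix_lines": True},
        {"src": "files", "name": "Dataset 2", "ext": "fastqsanger"},
    ],
}

# ... so exactly two files ride along as multi-part parts, matched in order.
with open("sales.csv", "rb") as f0, open("reads.fastq", "rb") as f1:
    response = requests.post(
        GALAXY_URL + "/api/tools/fetch",
        params={"key": API_KEY},
        # How the JSON description is embedded in the multi-part body is an
        # assumption ("payload" is a made-up field name) - the linked test
        # code shows the real request shape.
        data={"payload": json.dumps(description)},
        files={"files_0": f0, "files_1": f1},
    )
response.raise_for_status()
print(response.json())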

It doesn't yet support {"src": "paste"} - so you can't put the content directly in a simple non-multi-part POST, I think. It will need that at some point to serve as the main upload GUI backend, and it also doesn't yet tackle composite files, which will be needed someday as well. I think the structure of the JSON will allow both of those more naturally than the older endpoint does, though.

@bgruening
Member

Cool thanks a lot @jmchilton.
ping @sneumann and @pcm32 this will make you happy I suppose :)
