feat: allow wildcards in output paths #185

uniqueg · 2022-08-11T01:27:21Z

Fixes #77

Description

This PR adds provisions that allow clients to specify pathname matching wildcards ("globs") when specifying task outputs.

It addresses the discussion points summarized in this comment in the following ways:

Should TES implementations be required to support pathname matching in task outputs?
Yes. I feel that the use case is sufficiently common that it justifies the added complexity incurred by providing native support in TES, especially since the lack of support for globbing has been a blocker for a more widespread adoption of TES for some upstream implementers, e.g., Nextflow. This proposal adds the provisions for such native support.
How to indicate that TES should interpret a specified output as a glob?
This proposal requires TES implementations to always apply pathname matching rules to tesOutput.path values, unless wildcards are explicitly escaped. This behavior is familiar from (most) shells, completely removes any ambiguities and only alters the tesOutput.path description, so no structural changes to the specification are necessary. However, depending on a TES implementer's interpretation of the current specification, it may constitute a breaking change. Specifically, tesOutput.path values containing pathname matching wildcards would be treated differently by a TES implementation adopting the proposed change if (and only if) it previously interpreted wildcards literally. However, the expected behavior for paths including wildcards is currently underspecified (implementations may choose to expand or interpret them literally), so the proposed change rather adds clarity to a potential source of ambiguity across implementations.
Which globbing rules to prescribe?
This proposal suggests prescribing POSIX/Open Group pathname/pattern matching rules, because they are formally specified (see here and here), making them compatible with any POSIX-compatible shell environment, including Bash.
How to constuct remote storage URLs?
This proposal requires clients to do the following, whenever wildcards are used in tesOutput.path:
1. specify a directory for tesOuput.url
2. provide a tesOutput.path_prefix (new property) to indicate to TES implementations a prefix to remove from matched pathnames in order to identify the subdirectory tree that is then recreated at the directory specified for tesOuput.url
This solution fully supports wildcards in any part of tesOutput.path while avoiding name clashes, maximizing the control clients have over the eventual location of outputs and being straightforward to implement on the server side.

Examples

Use wildcards to copy files from multiple directories whose names are not known (new behavior).

Outputs to be copied:

 /path/in/container/unknown_subdir_1/output.bam
 /path/in/container/unknown_subdir_1/output.sam
 /path/in/container/unknown_subdir_2/output.bam
 /path/in/container/unknown_subdir_2/output.sam

Schema values to supply:

tesOutput.path: /path/in/container/*/*.?am
tesOutput.path_prefix: /path/in/container
tesOutput.url: https://my.storage.org/bucket/

URLs of outputs:

https://my.storage.org/bucket/unknown_subdir_1/output.bam
https://my.storage.org/bucket/unknown_subdir_1/output.sam
https://my.storage.org/bucket/unknown_subdir_2/output.bam
https://my.storage.org/bucket/unknown_subdir_2/output.sam

Copy a file whose name contains wildcard characters (behavior previously unspecified).

Output to be copied:

/path/with/wildc*rd_char?

Schema values to supply:

tesOutput.path: /path/with/wildc\*rd_char\?
tesOutput.path_prefix: # ignored
tesOutput.url: https://my.storage.org/bucket/wildc*ard_char?

URL of output:

https://my.storage.org/bucket/wildc*ard_char?

Copy output directory whose name is known (behavior unchanged).

Output to be copied:

/path/on/container/output/

Schema values to supply:

tesOutput.path: /path/on/container/output/
tesOutput.path_prefix: # ignored
tesOutput.url: https://my.storage.org/bucket/output/

URL of output:

https://my.storage.org/bucket/output/

Develop 1.1

Update link for Azure TES implementation.

uniqueg · 2022-08-11T01:33:04Z

Would appreciate feedback @kellrott @aniewielska @pditommaso @mr-c @MattMcL4475 @vsmalladi @wleepang @geoffjentry 🙏🏻

mr-c · 2022-08-11T10:33:09Z

Looks good to me; tagging @tetron for his Arvados perspective and additional CWL perspective

wleepang · 2022-08-12T16:26:42Z

I think this looks good, but I'm curious to see some concrete examples. For instance, if I set:

tesOutput.path: /path/to/folder
tesOutput.path_prefix: /path/to/folder
yes output.url: s3://bucketname/my/results

Should I expect the contents of /path/to/folder to be copied recursively to s3://bucketname/my/results?

vsmalladi · 2022-08-12T16:37:37Z

@wleepang If i understand this correctly this is what I would expect

tesOutput.path: /path/to/folder/data
tesOutput.path_prefix: /path/to/folder
output.url: s3://bucketname/my/results

URL: s3://bucketname/my/results/data

@uniqueg can you confirm this is what is expected with the combination of output_url and tesOutput.path_prefix?

uniqueg · 2022-08-12T17:46:18Z

@wleepang @vsmalladi

I have now added three examples to the PR description (my bad for not doing so earlier!).

Note that in the examples you give, the path doesn't include any wildcards, so a TES implementation should ignore tesOutput.path_prefix and interpret the value of tesOutput.url as a fully qualified URL at which the object created at tesOutput.path should be made accessible.

So in your example, @wleepang, contents of /path/to/folder would indeed be recursively copied to s3://bucketname/my/results.

In your case though, @vsmalladi, contents of /path/to/folder/data would be copied to s3://bucketname/my/results, not s3://bucketname/my/results/data.

Or at least this is my interpretation of the current specs.

vsmalladi · 2022-08-19T14:50:34Z

@uniqueg Thanks for the clarification.

Then LGTM.

kellrott and others added 4 commits April 20, 2022 12:58

Merge pull request #173 from ga4gh/develop-1.1

36181ff

Develop 1.1

Update link for Azure TES implementation.

d523f90

Merge pull request #177 from vsmalladi/fix_azure_links

fa8d96f

Update link for Azure TES implementation.

feat: allow wildcards in output paths

328fe1f

mr-c approved these changes Aug 11, 2022

View reviewed changes

mr-c approved these changes Aug 26, 2022

View reviewed changes

kellrott changed the base branch from develop to develop-1.1 September 21, 2022 05:21

kellrott merged commit 70b8270 into ga4gh:develop-1.1 Sep 21, 2022

vsmalladi added this to the 1.1 milestone Jan 11, 2023

This was referenced Jan 23, 2023

WIP Add V1.1 feature tests elixir-cloud-aai/openapi-test-runner#14

Closed

v1.1: allow wildcards in output paths elixir-cloud-aai/openapi-test-runner#17

Closed

uniqueg mentioned this pull request Aug 17, 2023

Clarification required: tesOutput.path_prefix #194

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow wildcards in output paths #185

feat: allow wildcards in output paths #185

uniqueg commented Aug 11, 2022 •

edited

Loading

uniqueg commented Aug 11, 2022

mr-c commented Aug 11, 2022

wleepang commented Aug 12, 2022

vsmalladi commented Aug 12, 2022

uniqueg commented Aug 12, 2022 •

edited

Loading

vsmalladi commented Aug 19, 2022

feat: allow wildcards in output paths #185

feat: allow wildcards in output paths #185

Conversation

uniqueg commented Aug 11, 2022 • edited Loading

Description

Examples

uniqueg commented Aug 11, 2022

mr-c commented Aug 11, 2022

wleepang commented Aug 12, 2022

vsmalladi commented Aug 12, 2022

uniqueg commented Aug 12, 2022 • edited Loading

vsmalladi commented Aug 19, 2022

uniqueg commented Aug 11, 2022 •

edited

Loading

uniqueg commented Aug 12, 2022 •

edited

Loading