feat: allow wildcards in output paths #185
Develop 1.1
Update link for Azure TES implementation.
Would appreciate feedback @kellrott @aniewielska @pditommaso @mr-c @MattMcL4475 @vsmalladi @wleepang @geoffjentry 🙏🏻
Looks good to me; tagging @tetron for his Arvados perspective and additional CWL perspective
I think this looks good, but I'm curious to see some concrete examples. For instance, if I set:
Should I expect the contents of
@wleepang If I understand this correctly, this is what I would expect: tesOutput.path: /path/to/folder/data, URL: s3://bucketname/my/results/data. @uniqueg, can you confirm this is what is expected with the combination of
I have now added three examples to the PR description (my bad for not doing so earlier!). Note that in the examples you give, the path doesn't include any wildcards, so a TES implementation should ignore So in your example, @wleepang, contents of In your case though, @vsmalladi, contents of Or at least this is my interpretation of the current specs.
@uniqueg Thanks for the clarification. Then LGTM.
Fixes #77
Description
This PR adds provisions that allow clients to specify pathname matching wildcards ("globs") when specifying task outputs.
It addresses the discussion points summarized in this comment in the following ways:
Should TES implementations be required to support pathname matching in task outputs?
Yes. I feel that the use case is sufficiently common that it justifies the added complexity incurred by providing native support in TES, especially since the lack of support for globbing has been a blocker for a more widespread adoption of TES for some upstream implementers, e.g., Nextflow. This proposal adds the provisions for such native support.
How to indicate that TES should interpret a specified output as a glob?
This proposal requires TES implementations to always apply pathname matching rules to tesOutput.path values, unless wildcards are explicitly escaped. This behavior is familiar from (most) shells, completely removes any ambiguities and only alters the tesOutput.path description, so no structural changes to the specification are necessary. However, depending on a TES implementer's interpretation of the current specification, it may constitute a breaking change. Specifically, tesOutput.path values containing pathname matching wildcards would be treated differently by a TES implementation adopting the proposed change if (and only if) it previously interpreted wildcards literally. However, the expected behavior for paths including wildcards is currently underspecified (implementations may choose to expand or interpret them literally), so the proposed change rather adds clarity to a potential source of ambiguity across implementations.
Which globbing rules to prescribe?
This proposal suggests prescribing POSIX/Open Group pathname/pattern matching rules, because they are formally specified (see here and here), making them compatible with any POSIX-compatible shell environment, including Bash.
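The "always match unless escaped" rule can be illustrated with a small sketch. It assumes that escaping is done with a backslash, as in POSIX shells, and that the wildcard characters of interest are `*`, `?` and `[`; the function names `has_unescaped_wildcard` and `unescape` are hypothetical helpers, not part of the TES specification or of any TES implementation.

```python
def has_unescaped_wildcard(path: str) -> bool:
    """Return True if `path` contains *, ? or [ not preceded by a backslash.

    Assumption: a backslash escapes the character that follows it.
    """
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch in "*?[":
            return True
        i += 1
    return False


def unescape(path: str) -> str:
    """Drop escaping backslashes to recover the literal pathname."""
    out = []
    i = 0
    while i < len(path):
        if path[i] == "\\" and i + 1 < len(path):
            out.append(path[i + 1])  # keep the escaped character literally
            i += 2
        else:
            out.append(path[i])
            i += 1
    return "".join(out)
```

Under these assumptions, a server could call `has_unescaped_wildcard` to decide whether a path needs pattern matching at all, and `unescape` to obtain the literal name of a file such as `/path/with/wildc*rd_char?` when the client supplied it with escaped wildcards.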
How to construct remote storage URLs?
This proposal requires clients to use the following properties whenever wildcards are used in tesOutput.path:
- tesOutput.url to specify the directory at which the matched outputs are recreated
- tesOutput.path_prefix (new property) to indicate to TES implementations a prefix to remove from matched pathnames in order to identify the subdirectory tree that is then recreated at the directory specified for tesOutput.url
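The prefix-stripping step described above can be sketched as follows. This is one possible reading of the proposal, not a reference implementation: the helper name `output_url` is hypothetical, and the paths and URL in the usage example are made up for illustration rather than taken from the PR's own examples.

```python
def output_url(base_url: str, path_prefix: str, matched_path: str) -> str:
    """Build the storage URL for one pathname matched by a wildcard path.

    The prefix is removed from the matched pathname and the remaining
    subdirectory tree is appended to the directory given as `base_url`.
    """
    prefix = path_prefix.rstrip("/") + "/"
    if not matched_path.startswith(prefix):
        raise ValueError(f"{matched_path!r} does not start with {prefix!r}")
    relative = matched_path[len(prefix):]  # subtree to recreate remotely
    return base_url.rstrip("/") + "/" + relative


# Usage with hypothetical values:
url = output_url(
    "https://my.storage.org/bucket/",
    "/path/on/container",
    "/path/on/container/run1/out.txt",
)
# url == "https://my.storage.org/bucket/run1/out.txt"
```

Keeping the relative subtree rather than only the file's basename is what avoids name clashes when files with the same name are matched in different directories.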
This solution fully supports wildcards in any part of tesOutput.path while avoiding name clashes, maximizing the control clients have over the eventual location of outputs and being straightforward to implement on the server side.
Examples
Use wildcards to copy files from multiple directories whose names are not known (new behavior).
Outputs to be copied:
Schema values to supply:
URLs of outputs:
Copy a file whose name contains wildcard characters (behavior previously unspecified).
Output to be copied:
/path/with/wildc*rd_char?
Schema values to supply:
URL of output:
https://my.storage.org/bucket/wildc*rd_char?
Copy output directory whose name is known (behavior unchanged).
Output to be copied:
/path/on/container/output/
Schema values to supply:
URL of output:
https://my.storage.org/bucket/output/