Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow wildcards in output paths #185

Merged
merged 4 commits into from
Sep 21, 2022
Merged

feat: allow wildcards in output paths #185

merged 4 commits into from
Sep 21, 2022

Conversation

uniqueg
Copy link
Contributor

@uniqueg uniqueg commented Aug 11, 2022

Fixes #77

Description

This PR adds provisions that allow clients to specify pathname matching wildcards ("globs") when specifying task outputs.

It addresses the discussion points summarized in this comment in the following ways:

  • Should TES implementations be required to support pathname matching in task outputs?
    Yes. I feel that the use case is sufficiently common that it justifies the added complexity incurred by providing native support in TES, especially since the lack of support for globbing has been a blocker for a more widespread adoption of TES for some upstream implementers, e.g., Nextflow. This proposal adds the provisions for such native support.

  • How to indicate that TES should interpret a specified output as a glob?
    This proposal requires TES implementations to always apply pathname matching rules to tesOutput.path values, unless wildcards are explicitly escaped. This behavior is familiar from (most) shells, completely removes any ambiguities and only alters the tesOutput.path description, so no structural changes to the specification are necessary. However, depending on a TES implementer's interpretation of the current specification, it may constitute a breaking change. Specifically, tesOutput.path values containing pathname matching wildcards would be treated differently by a TES implementation adopting the proposed change if (and only if) it previously interpreted wildcards literally. However, the expected behavior for paths including wildcards is currently underspecified (implementations may choose to expand or interpret them literally), so the proposed change rather adds clarity to a potential source of ambiguity across implementations.

  • Which globbing rules to prescribe?
    This proposal suggests prescribing POSIX/Open Group pathname/pattern matching rules, because they are formally specified (see here and here), making them compatible with any POSIX-compatible shell environment, including Bash.

  • How to constuct remote storage URLs?
    This proposal requires clients to do the following, whenever wildcards are used in tesOutput.path:

    1. specify a directory for tesOuput.url
    2. provide a tesOutput.path_prefix (new property) to indicate to TES implementations a prefix to remove from matched pathnames in order to identify the subdirectory tree that is then recreated at the directory specified for tesOuput.url

    This solution fully supports wildcards in any part of tesOutput.path while avoiding name clashes, maximizing the control clients have over the eventual location of outputs and being straightforward to implement on the server side.

Examples

  • Use wildcards to copy files from multiple directories whose names are not known (new behavior).

    Outputs to be copied:

     /path/in/container/unknown_subdir_1/output.bam
     /path/in/container/unknown_subdir_1/output.sam
     /path/in/container/unknown_subdir_2/output.bam
     /path/in/container/unknown_subdir_2/output.sam

    Schema values to supply:

    tesOutput.path: /path/in/container/*/*.?am
    tesOutput.path_prefix: /path/in/container
    tesOutput.url: https://my.storage.org/bucket/

    URLs of outputs:

    https://my.storage.org/bucket/unknown_subdir_1/output.bam
    https://my.storage.org/bucket/unknown_subdir_1/output.sam
    https://my.storage.org/bucket/unknown_subdir_2/output.bam
    https://my.storage.org/bucket/unknown_subdir_2/output.sam
  • Copy a file whose name contains wildcard characters (behavior previously unspecified).

    Output to be copied:

    /path/with/wildc*rd_char?

    Schema values to supply:

    tesOutput.path: /path/with/wildc\*rd_char\?
    tesOutput.path_prefix: # ignored
    tesOutput.url: https://my.storage.org/bucket/wildc*ard_char?

    URL of output:

    https://my.storage.org/bucket/wildc*ard_char?
  • Copy output directory whose name is known (behavior unchanged).

    Output to be copied:

    /path/on/container/output/

    Schema values to supply:

    tesOutput.path: /path/on/container/output/
    tesOutput.path_prefix: # ignored
    tesOutput.url: https://my.storage.org/bucket/output/

    URL of output:

    https://my.storage.org/bucket/output/

@uniqueg
Copy link
Contributor Author

uniqueg commented Aug 11, 2022

Would appreciate feedback @kellrott @aniewielska @pditommaso @mr-c @MattMcL4475 @vsmalladi @wleepang @geoffjentry 🙏🏻

@mr-c
Copy link

mr-c commented Aug 11, 2022

Looks good to me; tagging @tetron for his Arvados perspective and additional CWL perspective

@wleepang
Copy link

I think this looks good, but I'm curious to see some concrete examples. For instance, if I set:

tesOutput.path: /path/to/folder
tesOutput.path_prefix: /path/to/folder
yes output.url: s3://bucketname/my/results

Should I expect the contents of /path/to/folder to be copied recursively to s3://bucketname/my/results?

@vsmalladi
Copy link
Contributor

@wleepang If i understand this correctly this is what I would expect

tesOutput.path: /path/to/folder/data
tesOutput.path_prefix: /path/to/folder
output.url: s3://bucketname/my/results

URL: s3://bucketname/my/results/data

@uniqueg can you confirm this is what is expected with the combination of output_url and tesOutput.path_prefix?

@uniqueg
Copy link
Contributor Author

uniqueg commented Aug 12, 2022

@wleepang @vsmalladi

I have now added three examples to the PR description (my bad for not doing so earlier!).

Note that in the examples you give, the path doesn't include any wildcards, so a TES implementation should ignore tesOutput.path_prefix and interpret the value of tesOutput.url as a fully qualified URL at which the object created at tesOutput.path should be made accessible.

So in your example, @wleepang, contents of /path/to/folder would indeed be recursively copied to s3://bucketname/my/results.

In your case though, @vsmalladi, contents of /path/to/folder/data would be copied to s3://bucketname/my/results, not s3://bucketname/my/results/data.

Or at least this is my interpretation of the current specs.

@vsmalladi
Copy link
Contributor

@uniqueg Thanks for the clarification.

Then LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: add a GLOB value to FileType enum
5 participants