Tag different process output files when using Fusion #4031

Closed
jordeu opened this issue Jun 16, 2023 · 7 comments · Fixed by #3892

Comments

@jordeu
Collaborator

jordeu commented Jun 16, 2023

Description

When using Fusion, files stored in S3 are always tagged following this pattern:

final static public String DEFAULT_TAGS = "[.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true),[*](nextflow.io/temporary=true)"

This allows us to differentiate only two groups of files:

  • Nextflow generated files
  • Process generated files.
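
To make the two groups concrete, here is a minimal Python sketch of how such a tag pattern could be interpreted. It is not Fusion's actual implementation: the rule syntax is inferred from the `DEFAULT_TAGS` string above, and "first matching rule wins" is an assumption suggested by the trailing catch-all `[*]` rule. The helper names `parse_rules` and `tag_for` are hypothetical.

```python
import re
from fnmatch import fnmatchcase

DEFAULT_TAGS = (
    "[.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true),"
    "[*](nextflow.io/temporary=true)"
)

# Assumed rule shape: "[glob1|glob2](key=value)", rules separated by commas.
RULE_RE = re.compile(r"\[([^\]]*)\]\(([^)=]+)=([^)]*)\)")


def parse_rules(pattern):
    """Return a list of (globs, key, value) tuples in declaration order."""
    return [
        (globs.split("|"), key, value)
        for globs, key, value in RULE_RE.findall(pattern)
    ]


def tag_for(filename, rules):
    """Return (key, value) of the first rule matching filename, else None.

    Assumption: rules are evaluated in order and the first match wins.
    """
    for globs, key, value in rules:
        if any(fnmatchcase(filename, glob) for glob in globs):
            return key, value
    return None


rules = parse_rules(DEFAULT_TAGS)
print(tag_for(".command.sh", rules))  # ('nextflow.io/metadata', 'true')
print(tag_for("sample.bam", rules))   # ('nextflow.io/temporary', 'true')
```

Under these assumptions every process-generated file, whether a declared output or a leftover intermediate, falls through to the catch-all rule and gets the same `nextflow.io/temporary=true` tag.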

Given that we run with scratch=false when using Fusion, it would be useful to differentiate which of the process-generated files are outputs of the process and which are just intermediate or temporary files that the process does not clean up at the end.

It would be useful to tag them differently.

Usage scenario

This would allow defining an S3 cleanup policy to remove all intermediate files that are not needed, even to resume the pipeline.
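
As an illustration of such a policy, the sketch below builds an S3 lifecycle configuration that expires objects carrying the `nextflow.io/temporary=true` tag. The rule ID and the 7-day expiration are arbitrary values chosen for the example, and the function name is hypothetical. Note that this is only safe once process outputs carry a different tag: with the current default pattern, outputs are tagged as temporary too and would also expire.

```python
import json


def temporary_files_lifecycle(days):
    """Build an S3 lifecycle configuration that expires tagged temporary files.

    The rule ID and the expiration period are illustrative choices.
    """
    return {
        "Rules": [
            {
                "ID": "expire-nextflow-temporary",
                "Filter": {
                    "Tag": {"Key": "nextflow.io/temporary", "Value": "true"}
                },
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = temporary_files_lifecycle(days=7)
print(json.dumps(config, indent=2))
```

A dictionary of this shape is what boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument.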

Another future use case could be that Fusion does not upload them at all, to reduce storage usage.

@pditommaso
Member

My take is that any file in the task workdir is temporary by default. Therefore I'm not sure we should implement this.

@bentsherman
Member

Since we are already working on automatic cleanup for task directories, I think we should focus here on not uploading intermediate files. Also, as Paolo says, really all of these files are temporary because of the publish dir.

Therefore I propose that Nextflow simply provides the output patterns to Fusion, I guess as an environment variable. Then Fusion can try to avoid uploading intermediate files as long as they can be cached locally.

@pditommaso
Member

> Therefore I propose that Nextflow simply provides the output patterns to Fusion, I guess as an environment variable

Indeed, I think we talked about doing this in the past with @jordeu

@jordeu
Collaborator Author

jordeu commented Jun 29, 2023

I understand that the automatic cleanup is going to clean all the task files (including the outputs). Is resume still going to work with this cleanup?

This issue was intended to address two different things:

  1. From a comment by @robsyme, I understand that current users of Fusion v2 would like to be able to clean up, with an S3 policy, all the "non-task-output" files (that is, to keep only the same files that are kept when running with scratch=true). To achieve this, imagine that we have an output: *.fasta. The only change needed is for Nextflow to pass a tag pattern like [.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true),[*.fasta](nextflow.io/output=true),[*](nextflow.io/temporary=true) and it will work without any change on the Fusion side.

  2. Then, if Fusion has this information, we can improve Fusion performance in different ways:
    - avoid uploading files tagged as temporary if we have enough temporary cache
    - do not compact chunked temporary files at shutdown
    - at shutdown, remove any temporary files to save space
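
The pattern in point 1 could be derived from a process's declared output globs roughly as follows. This is a sketch only, not Nextflow code: `build_tag_pattern` is a hypothetical helper, and it assumes the output rule must be placed before the catch-all `[*]` rule so that it is evaluated first.

```python
METADATA_RULE = "[.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true)"
TEMPORARY_RULE = "[*](nextflow.io/temporary=true)"


def build_tag_pattern(output_globs):
    """Insert an output rule between the metadata rule and the catch-all rule.

    With no output globs, the result equals the current default pattern.
    """
    rules = [METADATA_RULE]
    if output_globs:
        rules.append("[" + "|".join(output_globs) + "](nextflow.io/output=true)")
    rules.append(TEMPORARY_RULE)
    return ",".join(rules)


print(build_tag_pattern(["*.fasta"]))
```

For a process declaring `*.fasta` as its output, this yields the same string as the example pattern in point 1.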

We could pass the same glob patterns using a different environment variable, but for Fusion it would be just another tag: if a file matches the "temporary" tag, Fusion treats that file a bit differently. I do not see any benefit in using a different environment variable to pass these glob patterns to Fusion.

@bentsherman
Member

The automatic cleanup should be able to resume a task even if the task outputs were deleted, because it will check the task's consumers and skip the task if the consumers are also cached. But right now this is only theoretical; there might be edge cases I haven't considered yet.
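
The consumer check described above can be pictured with a toy model. This is not Nextflow's implementation: the function name, the data structures, and the recursive rule are assumptions made for illustration, and the task graph is assumed to be acyclic. The sketch also treats a cached task with deleted outputs and no consumers as skippable, which presumes its results were published elsewhere; that is exactly the kind of edge case that would need care.

```python
def can_skip(task, cached, outputs_present, consumers):
    """Return True if `task` does not need to be re-run on resume.

    cached:          set of tasks that have a cache entry
    outputs_present: set of tasks whose output files still exist
    consumers:       dict mapping a task to the tasks that read its outputs
    """
    if task not in cached:
        return False
    if task in outputs_present:
        return True
    # Outputs were deleted: skipping is only safe if no consumer needs them,
    # i.e. every consumer can itself be skipped.
    return all(
        can_skip(consumer, cached, outputs_present, consumers)
        for consumer in consumers.get(task, [])
    )


# Linear pipeline A -> B -> C where only C's outputs survived cleanup.
consumers = {"A": ["B"], "B": ["C"]}
print(can_skip("A", {"A", "B", "C"}, {"C"}, consumers))  # True
print(can_skip("A", {"A", "B"}, set(), consumers))       # False
```

In the second call C has no cache entry, so C must run, which in turn requires B's and A's deleted outputs to be regenerated.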

I see your point, we might as well include the patterns in the tags rather than another environment variable.

Also, the automatic cleanup might make the nextflow.io/temporary tag unnecessary.


stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 15, 2023
@ewels
Member

ewels commented Mar 22, 2024

I believe that Fusion now tags quite a few files in this way, but some created by Nextflow (such as the nf-{id}-reports.tsv file from the nf-tower plugin) are not. A request came up again to detect when the workdir is an S3 bucket, and append a nextflow.io/metadata=true tag to the reports mapping file generated at the root of the workdir.

@bentsherman bentsherman linked a pull request Apr 25, 2024 that will close this issue
12 tasks