Tag different process output files when using Fusion #4031
Comments
My take is that any file in the task workdir is temporary by default. Therefore I'm not sure we should implement this.
Since we are already working on automatic cleanup for task directories, I think we should focus here on not uploading intermediate files. Also, as Paolo says, all of these files are really temporary because of the publish dir. Therefore I propose that Nextflow simply provides the output patterns to Fusion, I guess as an environment variable. Then Fusion can try to avoid uploading intermediate files as long as they can be cached locally.
Indeed, I think we talked about doing this in the past with @jordeu
I understand that the automatic cleanup is going to clean all the task files (including the outputs). Is resume still going to work with this cleanup? This issue was intended for two different things:
We could pass the same glob patterns using a different environment variable, but for Fusion it would be just another tag, and if a file matches the tag "temporary" then Fusion is going to treat it a bit differently. I do not see any benefit in using a different environment variable to pass these glob patterns to Fusion.
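As a rough illustration of the idea discussed above (glob patterns mapped to tags, with unmatched files treated as temporary), here is a minimal Python sketch. The environment variable name `EXAMPLE_FUSION_TAGS`, the `globs=tag;globs=tag` rule syntax, and the tag names are all assumptions made up for this example; they are not the actual Nextflow or Fusion configuration format.

```python
import fnmatch
import os

# Hypothetical variable name and rule syntax, used only for this sketch.
ENV_VAR = "EXAMPLE_FUSION_TAGS"


def parse_rules(spec):
    """Parse 'glob1|glob2=tag;glob3=tag2' into a list of (patterns, tag)."""
    rules = []
    for part in spec.split(";"):
        part = part.strip()
        globs, sep, tag = part.rpartition("=")
        if not sep or not globs or not tag:
            continue  # skip empty or malformed entries
        patterns = [g.strip() for g in globs.split("|") if g.strip()]
        rules.append((patterns, tag.strip()))
    return rules


def tag_for(path, rules, default="temporary"):
    """Return the tag of the first rule whose glob matches the file name."""
    name = os.path.basename(path)
    for patterns, tag in rules:
        # fnmatchcase keeps matching case-sensitive on every platform.
        if any(fnmatch.fnmatchcase(name, p) for p in patterns):
            return tag
    return default


if __name__ == "__main__":
    rules = parse_rules(os.environ.get(ENV_VAR, "*.bam|*.bai=output"))
    for f in ["work/ab/12cd/sample.bam", "work/ab/12cd/tmp.sam"]:
        print(f, "->", tag_for(f, rules))
```

With a single mechanism like this, the process output patterns become one more rule in the same list, which is the argument made above against a separate environment variable.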
The automatic cleanup should be able to resume a task even if the task outputs were deleted, because it will check the task's consumers and skip the task if the consumers are also cached. But right now this is only theoretical; there might be edge cases I haven't considered yet. I see your point, we might as well include the patterns in the tags rather than another environment variable. Also, the automatic cleanup might make the
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I believe that Fusion now tags quite a few files in this way, but some created by Nextflow (such as the
Description
When using Fusion, files stored in S3 are always tagged following this pattern:
nextflow/modules/nextflow/src/main/groovy/nextflow/fusion/FusionConfig.groovy
Line 31 in cb95920
This allows us to differentiate only two groups of files:
Given that when using Fusion we run with scratch=false, it would be useful to differentiate which of the process-generated files are outputs of the process and which are just intermediate or temporary files that the process does not clean up at the end. It would be useful to tag them differently.
Usage scenario
This would allow us to define an S3 cleanup policy that removes all intermediate files that are not needed even to resume the pipeline.
Another future use case could be that Fusion does not upload them at all, to reduce storage usage.
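To make the cleanup policy scenario concrete, here is a hedged Python sketch that builds an S3 lifecycle rule expiring objects that carry a given tag. The tag key and value (`example.io/intermediate=true`), the bucket name, and the 7-day retention are placeholders chosen for this example, not the tags Fusion actually writes. The rule is only built as a plain dictionary; applying it requires `boto3` and credentials, so that step is kept in a separate function.

```python
def expire_tagged_rule(tag_key, tag_value, days, prefix=None):
    """Build one S3 lifecycle rule that expires objects with the given tag."""
    tag = {"Key": tag_key, "Value": tag_value}
    if prefix:
        # Combining a prefix with a tag requires the "And" filter form.
        rule_filter = {"And": {"Prefix": prefix, "Tags": [tag]}}
    else:
        rule_filter = {"Tag": tag}
    return {
        "ID": f"expire-{tag_value}-after-{days}d",
        "Status": "Enabled",
        "Filter": rule_filter,
        "Expiration": {"Days": days},
    }


def apply_rules(bucket, rules):
    """Apply the rules to a bucket. This replaces any existing lifecycle
    configuration on that bucket, so merge with existing rules first."""
    import boto3  # imported here so the rule builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


if __name__ == "__main__":
    # Placeholder tag and retention period, for illustration only.
    rule = expire_tagged_rule("example.io/intermediate", "true", days=7)
    print(rule)
    # apply_rules("my-example-bucket", [rule])  # needs AWS credentials
```

Because the rule filters on the object tag rather than on a path, it only becomes useful once outputs and intermediate files are tagged differently, which is what this issue asks for.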