Support more retryable exit codes for Google Cloud Batch #5136

Closed
siddharthab opened this issue Jul 11, 2024 · 6 comments

Comments

@siddharthab
Contributor

New feature

According to https://cloud.google.com/batch/docs/troubleshooting, there are more retryable exit codes that we should add to the lifecycle policy of jobs launched by Nextflow. In particular, our jobs sometimes fail with exit code 50002. Currently, Nextflow hardcodes only exit code 50001 in the lifecycle policy.

Usage scenario

Usage would be for default invocations of Google Cloud Batch pipelines.

Suggest implementation

A simple implementation would be to add the exit codes to the policy directly. An extended implementation would be to parse the exit codes like here, so that retries become user-configurable for different error conditions.

I am happy to send a PR for either of these implementations if the idea sounds OK to the Nextflow team.

@VasLem

VasLem commented Jul 11, 2024

Hi, the issue has also been mentioned here

@pditommaso
Member

Which error codes, other than 50001, do you think are legitimately retryable?

@siddharthab
Contributor Author

Maybe 50006? At first I thought 50002 could also be one. But it turns out that, at least in our case, it was because of the machine becoming unresponsive due to memory pressure. So maybe not eligible for automatic retries, at least not without adjusting the resource requirements on the task.

How about I send in a PR to parse these error codes, so that users can then reference them in their process.errorStrategy? That is, if I understand the purpose of the code below correctly.

if( lastEvent?.getDescription()?.contains('due to Spot VM preemption with exit code 50001') ) {
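
To make the parsing idea concrete, here is a minimal sketch of how a reserved Batch exit code might be pulled out of a status event description instead of matching one fixed string. The class name `BatchExitCodes`, the method `parse`, and the assumption that the description always contains the text `exit code 500xx` are all hypothetical; this is not Nextflow's actual implementation.

```groovy
import java.util.regex.Pattern

// Hypothetical helper, for illustration only (not part of Nextflow).
class BatchExitCodes {
    // Assumption: Batch reports reserved codes as "exit code 500xx"
    // somewhere in the status event description.
    static final Pattern EXIT_CODE = ~/exit code (500\d\d)/

    // Returns the 500xx code found in the description, or null if none.
    static Integer parse(String description) {
        if( description == null )
            return null
        def m = EXIT_CODE.matcher(description)
        return m.find() ? Integer.valueOf(m.group(1)) : null
    }
}

// Usage with the message quoted above
println BatchExitCodes.parse('due to Spot VM preemption with exit code 50001')  // prints 50001
```

With something like this, the handler would no longer need a separate string check per code, and the decision to retry could be left to the user.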

@pditommaso
Member

Ok, I've created a PR #5141 for this.

The above change captures all 5000x error codes and allows Nextflow to handle them using the usual errorStrategy, instead of reporting the error "Process terminated for an unknown reason -- Likely it has been terminated by the external system".
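
For illustration, a user-side configuration along these lines might look like the sketch below. It assumes that the captured Batch code is exposed to the error strategy as `task.exitStatus`; the choice of 50001 and 50006 as retryable codes and the `maxRetries` value are examples taken from this discussion, not recommendations.

```groovy
// nextflow.config (sketch)
process {
    // Retry only on the Batch exit codes considered transient in this thread;
    // anything else terminates the run.
    errorStrategy = { task.exitStatus in [50001, 50006] ? 'retry' : 'terminate' }
    maxRetries = 3
}
```

Codes such as 50002, which in our case pointed to memory pressure, are deliberately left out here, since retrying them without adjusting resource requirements is unlikely to help.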

@siddharthab
Contributor Author

Thank you!

@siddharthab
Contributor Author

The merged PR sufficiently addresses this issue, so please feel free to close.
