Support more retryable exit codes for Google Cloud Batch #5136
Comments
Hi, the issue has also been mentioned here |
What do you think are legit retriable error codes other |
Maybe 50006? At first I thought 50002 could also be one, but it turns out that, at least in our case, it was caused by the machine becoming unresponsive under memory pressure. So it is probably not eligible for automatic retries, at least not without adjusting the resource requirements on the task. How about I send in a PR to parse these error codes, so that the user can then mention these error codes in their
nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy Line 508 in 572f211
Ok, I've created PR #5141 for this. The change captures all 5000x error codes and allows Nextflow to handle them using the usual |
Thank you! |
The merged PR sufficiently addresses this issue, so please feel free to close. |
New feature
According to https://cloud.google.com/batch/docs/troubleshooting, there are more retryable exit codes that we should add to the lifecycle policy of jobs launched by Nextflow. In particular, our jobs are hit by 50002 sometimes. Currently, Nextflow has hardcoded only the exit code 50001 to the lifecycle policy.
nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy
Line 269 in 572f211
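To illustrate what extending the lifecycle policy would mean, here is a minimal Python sketch of the policy shape. It is not the Nextflow (Groovy) code. The helper name `retry_policy` is hypothetical, and the field names (`action`, `actionCondition`, `exitCodes`) are assumed to match the lifecycle policy object in the Batch REST API.

```python
def retry_policy(exit_codes):
    """Build a dict shaped like a Batch lifecycle policy (assumed field names)
    that asks Batch to retry a task when it exits with one of `exit_codes`."""
    return {
        "action": "RETRY_TASK",
        # De-duplicate and sort so the resulting policy is deterministic.
        "actionCondition": {"exitCodes": sorted(set(exit_codes))},
    }


# Today only 50001 is listed; the request is to allow more codes, e.g. 50002.
policy = retry_policy([50001, 50002])
print(policy)
```

Whether a given code is actually safe to retry automatically is a separate question, as the discussion about 50002 shows.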
Usage scenario
Usage would be for default invocations of Google Cloud Batch pipelines.
Suggest implementation
A simple implementation would be to add the exit codes to the policy directly. An extended implementation would be to parse the exit codes like here to make retries user configurable for different error conditions.
I am happy to send a PR for either of these implementations if the idea sounds OK to the Nextflow team.
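The extended option could be sketched as follows. This is only an illustration in Python, not the Groovy code in the plugin: the function name `parse_batch_exit_code`, the regular expression, and the sample message text are all assumptions about how a Batch status description might carry the exit code.

```python
import re
from typing import Optional

# Assumption: the task status description contains "exit code 5000x".
EXIT_CODE_RE = re.compile(r"exit code (5000\d)\b")


def parse_batch_exit_code(description: str) -> Optional[int]:
    """Return the 5000x exit code found in a status description, or None."""
    match = EXIT_CODE_RE.search(description)
    return int(match.group(1)) if match else None


# Hypothetical status message, for illustration only.
sample = "Task state is updated from RUNNING to FAILED with exit code 50002."
code = parse_batch_exit_code(sample)
print(code)
```

Once the code is surfaced as the task's exit status, retry behaviour can be decided per error condition by the user rather than being hardcoded in the policy.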