Improve Google Batch 5000x error class handling #5141

Merged
merged 3 commits into from
Jul 15, 2024


pditommaso
Member

This PR improves the handling of the Google Batch 5000x error class so that:

  1. All error codes from 50001 to 50006 are handled, see here
  2. When one of these error conditions arises, it is converted into a ProcessException that includes the source error description
  3. The usual Nextflow error strategies (such as retry) can be applied to handle the error condition
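
As a rough illustration of point 3, a minimal `nextflow.config` sketch might look like the following. It assumes a 5000x failure surfaces as an ordinary task failure, as described above; the retry count of 3 is an arbitrary value chosen for the example, not something taken from this PR.

```groovy
// Sketch only: retry any failed task, including those that failed
// with a Google Batch 5000x error, up to 3 times (assumed value).
process {
    executor      = 'google-batch'
    errorStrategy = 'retry'
    maxRetries    = 3
}
```

Each retry triggered this way counts as a new task attempt, so `task.attempt` increases on every re-execution.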

Signed-off-by: Paolo Di Tommaso <[email protected]>

netlify bot commented Jul 12, 2024

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit cb63ec9
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/6690fe184f75e4000875c90e
😎 Deploy Preview https://deploy-preview-5141--nextflow-docs-staging.netlify.app

@bentsherman
Member

See also:

// retry on spot reclaim
if( executor.config.maxSpotAttempts ) {
    // Note: Google Batch uses the special exit status 50001 to signal
    // the execution was terminated due to a spot reclaim. When this happens
    // the policy re-executes the job automatically up to `maxSpotAttempts` times
    taskSpec
        .setMaxRetryCount( executor.config.maxSpotAttempts )
        .addLifecyclePolicies(
            LifecyclePolicy.newBuilder()
                .setActionCondition(
                    LifecyclePolicy.ActionCondition.newBuilder()
                        .addExitCodes(50001)
                )
                .setAction(LifecyclePolicy.Action.RETRY_TASK)
        )
}

@VasLem

VasLem commented Jul 12, 2024

Just a user opinion: I would prefer that these errors not change the task attempt number, because with dynamic configurations this would lead to an unnecessary increase in resources, or to the process being terminated or ignored unnecessarily.
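
To make the concern concrete, here is a hedged sketch of the kind of dynamic configuration in question. The base memory of `4.GB` and the retry limit are made-up values for illustration only.

```groovy
// Sketch only: resources scale with the attempt number.
// If an infrastructure error (e.g. a 5000x code) bumps task.attempt,
// the next run requests more memory even though the task itself
// never ran out of memory.
process {
    errorStrategy = 'retry'
    maxRetries    = 2                          // assumed value
    memory        = { 4.GB * task.attempt }    // 4 GB, then 8 GB, then 12 GB
}
```

With a setup like this, two infrastructure failures in a row would also exhaust `maxRetries`, at which point the task is treated as failed even though it may never have hit an application error.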

@pditommaso
Member Author

This change makes it possible to recover from errors 50002 to 50006, similarly to 50001. Currently, they cause the pipeline execution to terminate.

@bentsherman Regarding the Batch automatic recovery, I intentionally left it unchanged, because the semantics of that setting are specific to spot interruptions, as its name maxSpotAttempts indicates.

These failures can be recovered with the usual error strategy.

@bentsherman
Member

Following the meeting discussion, we suggested adding a new config setting, e.g. google.batch.autoRetryExitCodes, to provide a list of exit codes that are retried natively by Google Batch (without increasing the task attempt number).
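
For illustration, the proposed setting might be used roughly as follows. This is a sketch of the suggestion only: `autoRetryExitCodes` is not part of this PR, and both its final name and the accepted value format are assumptions here.

```groovy
// Sketch only: hypothetical usage of the proposed setting.
// Exit codes listed here would be retried by Google Batch itself,
// so the Nextflow task attempt number would not increase.
google {
    batch {
        autoRetryExitCodes = [50001, 50002, 50003]   // proposed, not yet implemented
    }
}
```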

@pditommaso
Member Author

Merging this. I'll open a separate PR for the retry logic described in this comment.

@pditommaso pditommaso merged commit 61b2205 into master Jul 15, 2024
21 checks passed
@pditommaso pditommaso deleted the google-batch-error-retry branch July 15, 2024 15:34