Fix Google Batch hang when internal error during scheduling #5567

Merged

Conversation

jorgee
Contributor

@jorgee jorgee commented Dec 4, 2024

When Google Batch hits an internal error while scheduling a job, the job goes straight from queued to failed. This happens when the requested VM type is not available in the selected region. In that case the batch job has no tasks, and NF gets stuck in an infinite loop caused by the following part of the code.

nextflow/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchTaskHandler.groovy

This PR modifies that part to include a job status check, so that NF stops polling once the job has failed.
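As a rough illustration of the idea (not the actual handler code), the check can be sketched as below. The `JobState` enum, the `resolveTaskState` method and the string task states are hypothetical stand-ins made up for this sketch; they are not the Google Batch client types or the real `GoogleBatchTaskHandler` API. The sketch assumes that a job with no tasks and a non-failed (or null) status is still treated as PENDING.

```java
import java.util.List;

public class JobStateSketch {

    // Hypothetical, simplified job states; the real Google Batch API defines its own type.
    enum JobState { QUEUED, SCHEDULED, RUNNING, SUCCEEDED, FAILED }

    // Hypothetical helper: decide which state to report for a task.
    // taskStates holds the states of the tasks currently listed for the job.
    static String resolveTaskState(List<String> taskStates, JobState jobState) {
        if (!taskStates.isEmpty()) {
            // Tasks exist, so report the task state as before.
            return taskStates.get(0);
        }
        if (jobState == JobState.FAILED) {
            // No tasks and the job already failed (e.g. during scheduling):
            // report FAILED instead of waiting forever for tasks to appear.
            return "FAILED";
        }
        // No tasks yet and the job has not failed (or its status is unknown):
        // assume the tasks are still being created.
        return "PENDING";
    }

    public static void main(String[] args) {
        System.out.println(resolveTaskState(List.of(), JobState.QUEUED));          // PENDING
        System.out.println(resolveTaskState(List.of(), JobState.FAILED));          // FAILED
        System.out.println(resolveTaskState(List.of("RUNNING"), JobState.RUNNING)); // RUNNING
    }
}
```

Without the `FAILED` branch, the second call would keep returning PENDING, which is the hang described above.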

I still have to decide what to do if the job has no tasks and the job status is null. I think we should abort, but @bentsherman do you remember why you returned PENDING when there are no tasks in the job? Is there some initial state where GCP is still creating the tasks and we should consider it PENDING?

@jorgee jorgee linked an issue Dec 4, 2024 that may be closed by this pull request

netlify bot commented Dec 4, 2024

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 32fc8a3
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/6750995c8d61ff000897dffa

jorgee and others added 2 commits December 4, 2024 13:29
…atchTaskHandlerTest.groovy

Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
…atchTaskHandlerTest.groovy

Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
@pditommaso pditommaso marked this pull request as ready for review December 4, 2024 12:46
jorgee and others added 2 commits December 4, 2024 15:11
…atchTaskHandlerTest.groovy [ci skip]

Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
jorgee and others added 4 commits December 4, 2024 15:54
…atchTaskHandler.groovy

Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
…atchTaskHandler.groovy

Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
…atchTaskHandler.groovy [ci skip]

Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
@bentsherman
Member

I still have to decide what to do if the job has no tasks and the job status is null. I think we should abort, but @bentsherman do you remember why you returned PENDING when there are no tasks in the job? Is there some initial state where GCP is still creating the tasks and we should consider it PENDING?

Yes, initially the job might not have any tasks, but we need to check the task state now to support job arrays, so when there are no tasks we just wait for tasks to be created. Checking the job status for failure makes sense.

@pditommaso
Member

@bentsherman approve if you think it's good

@bentsherman
Member

It would be nice to test against a real error case, but I'm not sure it can be easily replicated. Aside from that, the changes look good to me.

@pditommaso
Member

Good point. Esha reported the issue, so she has a test case.

@pditommaso
Member

Tested internally

@pditommaso pditommaso merged commit 18f7de1 into master Dec 9, 2024
21 checks passed
@pditommaso pditommaso deleted the 5550-google-batch-run-hangs-when-a-job-fail-to-start branch December 9, 2024 14:42
Development

Successfully merging this pull request may close these issues.

Google Batch run hangs when a job fail to start
3 participants