Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CANCELING state to delete cloud resources #189

Merged

Conversation

MattMcL4475
Copy link
Contributor

@MattMcL4475 MattMcL4475 commented Sep 30, 2022

Currently TES has a CANCELED state. However, cancelling a task means the associated cloud resources need to be deleted, which can take a few minutes to complete - much longer than is reasonable for an HTTP request timeout.

Proposal: Add a new TES state: CANCELING. When a caller calls the CancelTask operation, TES shall set the task's state to CANCELED if and only if all dependent/downstream resources have been deleted. If it cannot delete the downstream cloud resources during the HTTP request lifetime, it shall set the task state to CANCELING, until all downstream resources have been deleted.

Example: in Cromwell on Azure, a user can submit a workflow that runs one task. If a user cancels that workflow while the task is still running, currently, Cromwell will call CancelTask, and the TES state will be set to CANCELED. However, at this point, the Azure Batch job and pool have NOT been deleted yet, since this is done by a background thread running on the TES web server. This is problematic because Cromwell cannot observe whether TES' expensive resources have been deleted yet, which can leave the system in a broken and expensive state, with VMs sitting idle in Azure Batch. A control plane that is managing the Cromwell environment (such as Terra) can also therefore not observe whether the system has successfully completely cancelled all running workflows.

Copy link
Contributor

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! For consistency with WES, I would prefer to name the state CANCELING though. See enum definition and PR in WES.

@kellrott: seems that the CI only works for PRs that do not originate from forks. I'll see if I find some time to fix that in the coming days (and by fix, I probably mean disabling the failing actions for PRs from forks)

@kellrott
Copy link
Member

kellrott commented Oct 3, 2022

Can you re-target the PR to develop-1.1? I'm trying to collect all changes into that, so I can run on a final merged PR with CI before merging into develop.

@MattMcL4475 MattMcL4475 changed the base branch from develop to develop-1.1 October 3, 2022 18:36
@MattMcL4475 MattMcL4475 changed the title Add CANCELLATION_REQUESTED state to delete cloud resources Add CANCELING state to delete cloud resources Oct 3, 2022
@MattMcL4475
Copy link
Contributor Author

Thank you @uniqueg and @kellrott for the quick feedback! Changed it to CANCELING and re-targeted develop-1.1

Copy link
Contributor

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Maybe you could add a full stop to the description of the new state, as it seems the other state descriptions have one (at least the ones I could see in the diff)

@MattMcL4475
Copy link
Contributor Author

Thanks @uniqueg ; @kellrott looks like it's ready to merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants