-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CANCELING state to delete cloud resources #189
Add CANCELING state to delete cloud resources #189
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! For consistency with WES, I would prefer to name the state CANCELING
though. See enum definition and PR in WES.
@kellrott: seems that the CI only works for PRs that do not originate from forks. I'll see if I find some time to fix that in the coming days (and by fix, I probably mean disabling the failing actions for PRs from forks)
Can you re-target the PR to develop-1.1? I'm trying to collect all changes into that, so I can run on a final merged PR with CI before merging into develop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Maybe you could add a full stop to the description of the new state, as it seems the other state descriptions have one (at least the ones I could see in the diff)
Currently TES has a
CANCELED
state. However, cancelling a task means the associated cloud resources need to be deleted, which can take a few minutes to complete - much longer than is reasonable for an HTTP request timeout.Proposal: Add a new TES state:
CANCELING
. When a caller calls theCancelTask
operation, TES shall set the task's state toCANCELED
if and only if all dependent/downstream resources have been deleted. If it cannot delete the downstream cloud resources during the HTTP request lifetime, it shall set the task state toCANCELING
, until all downstream resources have been deleted.Example: in Cromwell on Azure, a user can submit a workflow that runs one task. If a user cancels that workflow while the task is still running, currently, Cromwell will call
CancelTask
, and the TES state will be set toCANCELED
. However, at this point, the Azure Batch job and pool have NOT been deleted yet, since this is done by a background thread running on the TES web server. This is problematic because Cromwell cannot observe whether TES' expensive resources have been deleted yet, which can leave the system in a broken and expensive state, with VMs sitting idle in Azure Batch. A control plane that is managing the Cromwell environment (such as Terra) can also therefore not observe whether the system has successfully completely cancelled all running workflows.