-
Notifications
You must be signed in to change notification settings - Fork 98
fix: Have GKEPodOperator delete running pods during cleanup (DENG-8290)
#2192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
4b103ea
642194b
637bdde
649da8d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,12 +1,18 @@ | ||
| import logging | ||
|
|
||
| import kubernetes.client as k8s | ||
| from airflow.providers.cncf.kubernetes.callbacks import KubernetesPodOperatorCallback | ||
| from airflow.providers.cncf.kubernetes.utils.pod_manager import OnFinishAction, PodPhase | ||
| from airflow.providers.google.cloud.links.kubernetes_engine import KubernetesEnginePodLink | ||
| from airflow.providers.google.cloud.operators.kubernetes_engine import ( | ||
| GKEStartPodOperator as UpstreamGKEPodOperator, | ||
| ) | ||
| from airflow.utils.context import Context | ||
|
|
||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class GKEPodOperatorCallbacks(KubernetesPodOperatorCallback): | ||
| @staticmethod | ||
| def on_pod_completion( | ||
|
|
@@ -92,3 +98,34 @@ def get_or_create_pod(self, pod_request_obj: k8s.V1Pod, context: Context) -> k8s | |
| self.pod = pod | ||
| KubernetesEnginePodLink.persist(context=context, task_instance=self) | ||
| return pod | ||
|
|
||
| def process_pod_deletion(self, pod: k8s.V1Pod, *, reraise=True): | ||
| if pod is None: | ||
| return | ||
|
|
||
| # Since we default to on_finish_action="keep_pod" the pod may be left running if the task | ||
| # was stopped for a reason other than the pod succeeding/failing, like a pod startup timeout | ||
| # or a task execution timeout (this could be considered a bug in the Kubernetes provider). | ||
| # As a workaround we delete the pod during cleanup if it's still running. | ||
| try: | ||
| remote_pod: k8s.V1Pod = self.client.read_namespaced_pod( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the difference between Can you just use that?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
So here I'm setting While this isn't technically required for the core functionality of cleaning up unterminated pods, I figured it was worth doing for recording what actually happened. However, if you think it's overkill I'm not opposed to simplifying this and using the out-of-date
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok that makes sense. I guess you would need the updated state for the |
||
| pod.metadata.name, pod.metadata.namespace | ||
| ) | ||
| if ( | ||
| remote_pod.status.phase not in PodPhase.terminal_states | ||
| and self.on_finish_action != OnFinishAction.DELETE_POD | ||
| ): | ||
| logger.info( | ||
| f"Deleting {remote_pod.status.phase.lower()} pod: {pod.metadata.name}" | ||
| ) | ||
| self.pod_manager.delete_pod(remote_pod) | ||
| else: | ||
| super().process_pod_deletion(pod, reraise=reraise) | ||
| except Exception as e: | ||
| if isinstance(e, k8s.ApiException) and e.status == 404: | ||
| # Ignore "404 Not Found" errors. | ||
|
sean-rose marked this conversation as resolved.
|
||
| logger.warning(f'Pod "{pod.metadata.name}" not found.') | ||
| elif reraise: | ||
| raise | ||
| else: | ||
| logger.exception(e) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have preferred to use an
on_pod_cleanupcallback rather than overriding this internal-ish method, but unfortunatelyon_pod_cleanupcallbacks aren't called when Airflow tasks fail (maybe a bug?).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any idea why
on_pod_cleanupcallbacks aren't called in the event of a task failure? Did you look in the Airflow project to see if this is a known issue/bug?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need the
*boundary?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because the
on_pod_cleanupcallback is only called after thecleanupmethod completes, but thecleanupmethod raises an exception if the task failed.Yes. I didn't see an existing issue for this, and the problem still exists on the
mainbranch so it hasn't been fixed yet (though it's possible this is an intentional design decision).I did just submit this Airflow PR with a possible fix, so we'll so how that goes (though even if that's accepted it'd likely be a while before it's included in an
apache-airflow-providers-cncf-kubernetespackage release and we can upgrade to that).This matches the argument signature of the original
process_pod_deletionmethod that this is overriding, and the*makes it so that subsequent arguments can only be specified as keyword arguments.