@turboFei
Member

@turboFei turboFei commented Sep 12, 2024

🔍 Description

Issue References 🔗

To close #6686


The pod is already in the Failed state, but the driver container is still in the waiting state.

We should mark the application as terminated and ignore the container state.
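
The rule described above can be sketched roughly as follows. This is a minimal illustration only, not the code in `KubernetesApplicationOperation.scala`: the object `PodPhaseFirstSketch`, the `ContainerStatus` and `AppState` types, and the `toAppState` helper are all hypothetical names, and the pod phase is modeled as a plain string rather than a Kubernetes client object.

```scala
object PodPhaseFirstSketch {
  // Hypothetical, simplified stand-ins for the container and application states.
  sealed trait ContainerStatus
  case object Waiting extends ContainerStatus
  case object Running extends ContainerStatus
  final case class Terminated(exitCode: Int) extends ContainerStatus

  sealed trait AppState
  case object Pending extends AppState
  case object Active extends AppState
  case object Finished extends AppState
  case object Failed extends AppState

  // A terminal pod phase ("Succeeded" or "Failed") decides the state by itself;
  // the driver container state is consulted only while the pod is not terminated.
  def toAppState(podPhase: String, driver: ContainerStatus): AppState =
    podPhase match {
      case "Succeeded" => Finished
      case "Failed" => Failed
      case _ =>
        driver match {
          case Waiting => Pending
          case Running => Active
          case Terminated(0) => Finished
          case Terminated(_) => Failed
        }
    }
}
```

With this sketch, a pod in the Failed phase whose driver container is still waiting maps to a terminated application state instead of staying pending.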

@turboFei turboFei changed the title [KYUUBI #6686] Check whether pod based application state terminated [KYUUBI #6686] Mark application terminated if pod is terminated Sep 12, 2024
@turboFei turboFei self-assigned this Sep 12, 2024
@turboFei turboFei changed the title [KYUUBI #6686] Mark application terminated if pod is terminated [KYUUBI #6686] Ignore container state if pod is terminated Sep 12, 2024
@turboFei turboFei changed the title [KYUUBI #6686] Ignore container state if pod is terminated [KYUUBI #6686] Ignore Spark pod container state if pod is terminated Sep 12, 2024
@codecov-commenter

codecov-commenter commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 0% with 16 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (db5ce0c) to head (0d4c8a2).
Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...kyuubi/engine/KubernetesApplicationOperation.scala | 0.00% | 15 Missing ⚠️ |
| ...in/scala/org/apache/kyuubi/config/KyuubiConf.scala | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #6690   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         684     684           
  Lines       42233   42244   +11     
  Branches     5755    5756    +1     
======================================
- Misses      42233   42244   +11     


@turboFei turboFei closed this in f8431da Sep 14, 2024
@turboFei turboFei added this to the v1.10.0 milestone Sep 14, 2024
@turboFei turboFei deleted the pod_state branch September 14, 2024 19:28
@turboFei
Member Author

Thanks, merged to 1.10.0.

turboFei added a commit that referenced this pull request Apr 16, 2025
…pp state than terminated pod state

### Why are the changes needed?

I found the following for a Kyuubi batch on Kubernetes:

1. The batch had already reached `FINISHED`.
2. I then deleted the pod manually and checked k8s-audit.log: the appState had become `FAILED`.

```
2025-04-15 11:16:30.453 INFO [-675216314-pool-44-thread-839] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: label=61e7d8c1-e5a9-46cd-83e7-c611003f0224     context=97      namespace=dls-prod      pod=kyuubi-spark-61e7d8c1-e5a9-46cd-83e7-c611003f0224-driver podState=Running        containers=[microvault->ContainerState(running=ContainerStateRunning(startedAt=2025-04-15T18:13:48Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}),spark-kubernetes-driver->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://72704f8e7ccb5e877c8f6b10bf6ad810d0c019e07e0cb5975be733e79762c1ec, exitCode=0, finishedAt=2025-04-15T18:14:22Z, message=null, reason=Completed, signal=null, startedAt=2025-04-15T18:13:49Z, additionalProperties={}), waiting=null, additionalProperties={})]   appId=spark-228c62e0dc37402bacac189d01b871e4    appState=FINISHED       appError=''
:2025-04-15 11:16:30.854 INFO [-675216314-pool-44-thread-840] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: label=61e7d8c1-e5a9-46cd-83e7-c611003f0224     context=97      namespace=dls-prod      pod=kyuubi-spark-61e7d8c1-e5a9-46cd-83e7-c611003f0224-driver podState=Failed containers=[microvault->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://91654e3ee74e2c31218e14be201b50a4a604c2ad15d3afd84dc6f620e59894b7, exitCode=2, finishedAt=2025-04-15T18:16:30Z, message=null, reason=Error, signal=null, startedAt=2025-04-15T18:13:48Z, additionalProperties={}), waiting=null, additionalProperties={}),spark-kubernetes-driver->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://72704f8e7ccb5e877c8f6b10bf6ad810d0c019e07e0cb5975be733e79762c1ec, exitCode=0, finishedAt=2025-04-15T18:14:22Z, message=null, reason=Completed, signal=null, startedAt=2025-04-15T18:13:49Z, additionalProperties={}), waiting=null, additionalProperties={})]    appId=spark-228c62e0dc37402bacac189d01b871e4    appState=FAILED appError='{
```

This PR is a follow-up to #6690, which ignores the container state if the pod is terminated.

It is more reasonable to prefer the terminated container state over the terminated pod state.
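
A minimal sketch of that precedence, under stated assumptions: the object `ContainerFirstSketch`, its types, and `toAppState` are hypothetical and do not reproduce the actual patch. The driver container name `spark-kubernetes-driver` is taken from the audit log above, and treating it as a fixed constant is an assumption made here for illustration.

```scala
object ContainerFirstSketch {
  // Hypothetical, simplified stand-ins for the container and application states.
  sealed trait ContainerStatus
  case object Waiting extends ContainerStatus
  case object Running extends ContainerStatus
  final case class Terminated(exitCode: Int) extends ContainerStatus

  sealed trait AppState
  case object Pending extends AppState
  case object Active extends AppState
  case object Finished extends AppState
  case object Failed extends AppState

  // Container name as it appears in the audit log; a fixed name is an assumption.
  val DriverContainer = "spark-kubernetes-driver"

  def toAppState(podPhase: String, containers: Map[String, ContainerStatus]): AppState =
    containers.get(DriverContainer) match {
      // 1. A terminated driver container wins, even if the pod phase is Failed.
      case Some(Terminated(0)) => Finished
      case Some(Terminated(_)) => Failed
      case other =>
        podPhase match {
          // 2. Otherwise a terminal pod phase decides.
          case "Succeeded" => Finished
          case "Failed" => Failed
          // 3. Otherwise fall back to the live driver container state.
          case _ =>
            other match {
              case Some(Running) => Active
              case _ => Pending
            }
        }
    }
}
```

In the scenario from the logs, the pod phase is Failed because the `microvault` sidecar exited with code 2 after the pod was deleted, while the driver container had already completed with exit code 0. Under this sketch the application maps to the finished state, and a Failed pod whose driver container never terminated still maps to the failed state.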

### How was this patch tested?

Integration testing.

```
:2025-04-15 13:53:24.551 INFO [-1077768163-pool-36-thread-3] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: eventType=DELETE	label=e0eb4580-3cfa-43bf-bdcc-efeabcabc93c	context=97	namespace=dls-prod	pod=kyuubi-spark-e0eb4580-3cfa-43bf-bdcc-efeabcabc93c-driver	podState=Failed	containers=[microvault->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://66c42206730950bd422774e3c1b0f426d7879731788cea609bbfe0daab24a763, exitCode=2, finishedAt=2025-04-15T20:53:22Z, message=null, reason=Error, signal=null, startedAt=2025-04-15T20:52:00Z, additionalProperties={}), waiting=null, additionalProperties={}),spark-kubernetes-driver->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://9179a73d9d9e148dcd9c13ee6cc29dc3e257f95a33609065e061866bb611cb3b, exitCode=0, finishedAt=2025-04-15T20:52:28Z, message=null, reason=Completed, signal=null, startedAt=2025-04-15T20:52:01Z, additionalProperties={}), waiting=null, additionalProperties={})]	appId=spark-578df0facbfd4958a07f8d1ae79107dc	appState=FINISHED	appError=''
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7025 from turboFei/container_terminated.

Closes #7025

Closes #6686

a3b2a5a [Wang, Fei] comments
4356d1b [Wang, Fei] fix the app state logical

Authored-by: Wang, Fei <[email protected]>
Signed-off-by: Wang, Fei <[email protected]>
turboFei added a commit that referenced this pull request Apr 16, 2025
…pp state than terminated pod state

(cherry picked from commit 7e199d6)
Signed-off-by: Wang, Fei <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Bug] Kyuubi batch state abnormal - batch failed but marked as finished