ScaledJob Scaler does not take "Terminating" Pods into consideration #2910

Closed
FullyScaled opened this issue Apr 14, 2022 · 16 comments · Fixed by #3338
Labels
bug Something isn't working

Comments

@FullyScaled
Contributor

FullyScaled commented Apr 14, 2022

Report

When a new ScaledJob spec is rolled out with the default rollout strategy, all existing Pods for those Jobs receive a SIGTERM. This is desired behavior for us and fits our use case.

Since we are doing long-running calculations in those Pods, we ignore the SIGTERM and let the Pod finish its work. The status of those signalled Pods is "Terminating".

KEDA only seems to consider Pods that are either Pending or Running. As a consequence, KEDA does not count the "Terminating" Pods towards the already running and healthy Pods, so the number of Jobs created by KEDA is off by the number of Pods in state "Terminating".

Is there a way to configure KEDA so that "Terminating" Pods are counted as running Pods as well? We had a look at pendingPodConditions, but it does not seem to work with the actual Pod status (Terminating).
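For illustration, this is roughly the kind of configuration we tried (the resource name is made up; listing "Terminating" here is exactly what did not help, since, as far as we can tell, pendingPodConditions matches pod condition types rather than the Terminating state shown by kubectl):

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: long-running-worker        # illustrative name
spec:
  # ... jobTargetRef, triggers, etc. ...
  scalingStrategy:
    pendingPodConditions:
      - "Ready"
      - "PodScheduled"
      - "Terminating"              # no effect: "Terminating" is not a pod condition type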

Expected Behavior

KEDA should create N pods minus the number of Terminating pods from the previous version.

Actual Behavior

KEDA creates N pods plus the number of Terminating pods from the previous version.

Steps to Reproduce the Problem

  1. Delay the SIGTERM shutdown in your workload pod, i.e. the one created by the ScaledJob (see the sketch after this list)
  2. Roll out a new ScaledJob spec, or send a SIGTERM to one of your workload pods
  3. The number of created Jobs is off by the number of "Terminating" Pods
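A minimal sketch (purely illustrative, assuming a Go worker) of the delayed shutdown in step 1; the pod's terminationGracePeriodSeconds has to be long enough, otherwise the kubelet eventually sends a SIGKILL:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	done := make(chan struct{})
	go func() {
		// Placeholder for the long-running calculation.
		time.Sleep(10 * time.Minute)
		close(done)
	}()

	for {
		select {
		case <-sigs:
			// Ignore the signal and keep working; the pod now shows up as
			// "Terminating" until the work is done.
			fmt.Println("received SIGTERM, delaying shutdown until work completes")
		case <-done:
			return
		}
	}
}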

KEDA Version

2.5.0

Kubernetes Version

1.21

Platform

Microsoft Azure

Scaler Details

Redis

Anything else?

Slack-Discussion: https://kubernetes.slack.com/archives/CKZJ36A5D/p1649865034815469

@FullyScaled FullyScaled added the bug Something isn't working label Apr 14, 2022
@tomkerkhove tomkerkhove moved this to Proposed in Roadmap - KEDA Core Apr 14, 2022
@FullyScaled
Contributor Author

Terminating Pods are still in Phase "Running".
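Illustrative snippet (not KEDA code): a terminating pod is only recognizable by its deletionTimestamp, the phase stays Running.

package main

import corev1 "k8s.io/api/core/v1"

// isTerminating reports whether the pod has been asked to shut down but is
// still running: Kubernetes has no "Terminating" phase, it only sets
// metadata.deletionTimestamp while the phase remains Running.
func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil && pod.Status.Phase == corev1.PodRunning
}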

@stale

stale bot commented Jun 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Jun 13, 2022
@JorTurFer
Member

Hey,
sorry for the super slow response, I have to apologize because I didn't see the issue...
Did you try adding Terminating to the pendingPodConditions? In theory you can specify the status you want there.

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Jun 13, 2022
@FullyScaled
Contributor Author

Hi Jorge,
we already tried adding "Terminating" to those pendingPodConditions. pendingPodConditions seems to be something different from the pod status, which is "Terminating" in our case.

@JorTurFer
Member

JorTurFer commented Jun 14, 2022

Your use case makes sense, I'd say. Are you willing to contribute it?
I mean, adding support for specifying it as a pending status as well.

@FullyScaled
Contributor Author

FullyScaled commented Jun 14, 2022

Hi, yes, we could imagine contributing this feature.

A question on our side:
When KEDA receives a ScaledJob update, it deletes the old Jobs. In that case we end up with some Pods in state "Terminating" that no longer have a parent Kubernetes Job. We would need some mechanism that maps those "zombie" Pods back to the ScaledJob resource, right?

@JorTurFer
Member

Sorry again for the slow response. I have read your use case again, and I don't get the current problem. If you are using ScaledJob, your pods shouldn't receive the termination signal, at least not from KEDA. KEDA creates the jobs and monitors them, but KEDA doesn't kill them (correct me if I'm wrong, @zroubalik).
Are you sure that it's KEDA that is killing the jobs? Does KEDA update already existing jobs with the new spec, @zroubalik?
Maybe it's enough if we don't touch already existing jobs and only modify the new ones.

@zroubalik
Member

Yeah, you are right on this one @JorTurFer

@FullyScaled
Contributor Author

FullyScaled commented Jul 4, 2022

When KEDA is not configured to use the "gradual" rollout feature, it does delete the existing Jobs and creates new Jobs instead. In this case Kubernetes sends a SIGTERM to the underlying Pods, which then end up in state "Terminating". As a consequence, the number of running Pods is off by the number of "Terminating" Pods that no longer have a parent Job.

In our case we actually want that SIGTERM in our Pods, because logic running inside the Pod can decide whether the Pod should terminate itself or whether shutdown needs to be delayed (which causes the "Terminating" status).

@JorTurFer
Member

JorTurFer commented Jul 4, 2022

But in that case, KEDA doesn't know whether those pods belong to its jobs or not :(
Isn't the gradual rollout strategy enough? I mean, as you know, it keeps the old jobs, so you wouldn't need to handle anything...

@FullyScaled
Contributor Author

Yes, you are right. This is what adds the complexity here :/...

In order to achieve this, we would need some logic that finds the Pods labeled with the corresponding scaledjob.keda.sh/name that no longer have a parent Job, and adds that number to the already running Jobs for a ScaledJob (see the sketch below). This is somewhat ugly, and I don't think it is an implementation direction you would be willing to accept.
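Roughly along these lines (illustrative only, not a proposal for the final code; it assumes the pods carry the scaledjob.keda.sh/name label):

package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// orphanedTerminatingPods lists the pods carrying the scaledjob.keda.sh/name
// label and keeps those that are terminating and whose parent Job is gone.
func orphanedTerminatingPods(ctx context.Context, c client.Client, ns, scaledJobName string) ([]corev1.Pod, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(ns),
		client.MatchingLabels{"scaledjob.keda.sh/name": scaledJobName},
	); err != nil {
		return nil, err
	}

	var orphans []corev1.Pod
	for _, p := range pods.Items {
		if p.DeletionTimestamp == nil {
			continue // pod is not terminating
		}
		hasLivingJob := false
		for _, ref := range p.OwnerReferences {
			if ref.Kind != "Job" {
				continue
			}
			var job batchv1.Job
			// If the referenced Job still exists, the pod is not an orphan.
			if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: ref.Name}, &job); err == nil {
				hasLivingJob = true
			}
		}
		if !hasLivingJob {
			orphans = append(orphans, p)
		}
	}
	return orphans, nil
}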

I am currently doing some reading on ownership relations in Kubernetes.

Scaled-Job Definition (Custom Resource) ----> Kubernetes Jobs ----> Kubernetes Pods

KEDA uses foreground ownership as a default. This means that whenever the parent Job is deleted, the corresponding child Pods are deleted first. All we have to achieve now is that the Job deletion is delayed until the Pods can be deleted as well.
If this is achievable via a Kubernetes mechanism, we don't need to add this logic to KEDA at all.

@FullyScaled
Contributor Author

FullyScaled commented Jul 5, 2022

According to the Kubernetes documentation, the deletion of the parent resource (Job) is delayed until the child (Pod) has been deleted, when ownerReference.blockOwnerDeletion=true is set:

Once the “deletion in progress” state is set, the garbage collector deletes the object’s dependents. Once the garbage collector has deleted all “blocking” dependents (objects with ownerReference.blockOwnerDeletion=true), it deletes the owner object.
Source: https://people.wikimedia.org/~jayme/k8s-docs/v1.16/docs/concepts/workloads/controllers/garbage-collection/

KEDA sets this by default, so to be honest I am curious why we observe the issue described here in the first place.

Edit: It seems that background propagation is the default. We will do some further tests with foreground propagation.

@FullyScaled
Contributor Author

I guess all we have to do is make the propagation policy configurable here:

err = r.Client.Delete(ctx, &job, client.PropagationPolicy(metav1.DeletePropagationBackground))

I will prepare a PR for this. This will solve the issue described here.
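A minimal sketch of what that could look like (parameter and value names are illustrative, not the final API):

package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteJob removes an old Job using the propagation policy requested in the
// ScaledJob spec, falling back to the current default (Background).
func deleteJob(ctx context.Context, c client.Client, job *batchv1.Job, configuredPolicy string) error {
	policy := metav1.DeletePropagationBackground
	if configuredPolicy == "foreground" { // hypothetical spec value
		policy = metav1.DeletePropagationForeground
	}
	return c.Delete(ctx, job, client.PropagationPolicy(policy))
}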

@JorTurFer
Member

Nice investigation!
Maybe add a parameter to configure it in the ScaledJob?
WDYT, @zroubalik?

@FullyScaled
Contributor Author

Yes, we are currently preparing such a parameter.

Do you prefer something like this:

rolloutStrategy:
  strategy: default
  propagationPolicy: background

or something like this:

rolloutStrategy: default
defaultRolloutStrategyPropagationPolicy: background

We would prefer option 1, but that change would be backwards-incompatible.

@JorTurFer
Copy link
Member

JorTurFer commented Jul 6, 2022

I'd say that we can keep both. I mean, we can create another section named rollout with the keys strategy and propagationPolicy. In that case, we need to log a message about the deprecation of the previous value if it's set, and support reading the value from both places (prioritizing one of them).

Something like:

rolloutStrategy: default #Marking this as deprecated
rollout:
  strategy: default
  propagationPolicy: background
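A rough sketch of that resolution logic (struct and field names are purely illustrative, not the actual ScaledJob API):

package main

import "log"

// Hypothetical shapes of the old and new spec fields, for illustration only.
type RolloutSection struct {
	Strategy          string
	PropagationPolicy string
}

type ScaledJobSpecSketch struct {
	RolloutStrategy string         // deprecated flat field
	Rollout         RolloutSection // new section
}

// resolveStrategy prefers the new rollout.strategy and logs a deprecation
// warning when only the old rolloutStrategy field is set.
func resolveStrategy(spec ScaledJobSpecSketch) string {
	if spec.Rollout.Strategy != "" {
		return spec.Rollout.Strategy
	}
	if spec.RolloutStrategy != "" {
		log.Println("rolloutStrategy is deprecated, use rollout.strategy instead")
		return spec.RolloutStrategy
	}
	return "default"
}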
