Job backoffLimit does not cap pod restarts when restartPolicy: OnFailure #54870

Closed
ritsok opened this issue Oct 31, 2017 · 21 comments · Fixed by #58972
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@ritsok

ritsok commented Oct 31, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
When creating a Job with backoffLimit: 2 and restartPolicy: OnFailure, the pod created from the configuration included below continued to restart and was never marked as failed:

$ kubectl get pods
NAME               READY     STATUS    RESTARTS   AGE
failed-job-t6mln   0/1       Error     4          57s
$ kubectl describe job failed-job
Name:           failed-job
Namespace:      default
Selector:       controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
Labels:         controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
                job-name=failed-job
Annotations:    kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"failed-job","namespace":"default"},"spec":{"backoffLimit":2,"template":{"met...
Parallelism:    1
Completions:    1
Start Time:     Tue, 31 Oct 2017 10:38:46 -0700
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=58c6d945-be62-11e7-86f7-080027797e6b
           job-name=failed-job
  Containers:
   nginx:
    Image:  nginx:1.7.9
    Port:   <none>
    Command:
      bash
      -c
      exit 1
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  1m    job-controller  Created pod: failed-job-t6mln

What you expected to happen:

We expected only 2 attempted pod restarts before the job was marked as failed. Instead, the pod kept restarting, its status was never set to failed, and the job remained active.

How to reproduce it (as minimally and precisely as possible):

 apiVersion: batch/v1
 kind: Job
 metadata:
   name: failed-job
   namespace: default
 spec:
   backoffLimit: 2
   template:
     metadata:
       name: failed-job
     spec:
       containers:
       - name: nginx
         image: nginx:1.7.9
         command: ["bash", "-c", "exit 1"]
       restartPolicy: OnFailure

Create the above job and observe the number of pod restarts.

Anything else we need to know?:

The backoffLimit flag works as expected when restartPolicy: Never.
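For comparison, here is a minimal sketch of the same Job with restartPolicy: Never (the failed-job-never name is just illustrative); with this variant the controller creates a replacement pod for each failure and, as noted above, the limit is respected:

apiVersion: batch/v1
kind: Job
metadata:
  name: failed-job-never
  namespace: default
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["bash", "-c", "exit 1"]
      restartPolicy: Never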

Environment:

  • Kubernetes version (use kubectl version): 1.8.0 using minikube
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 31, 2017
@ritsok
Author

ritsok commented Oct 31, 2017

@kubernetes/sig-api-machinery-bugs

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. labels Oct 31, 2017
@k8s-ci-robot
Contributor

@ritsok: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-bugs

In response to this:

@kubernetes/sig-api-machinery-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 31, 2017
@weiwei04
Contributor

weiwei04 commented Nov 2, 2017

I think this is because, for restartPolicy: "OnFailure", the kubelet will try to restart the failed pod in place, and the pod status remains Running even when the container fails, so the job controller is under the illusion that the pod is still running.

Here are my naive thoughts: as a short-term workaround, add a spec.activeDeadlineSeconds; for the long term, we need to consider how to recognize a Pod that keeps restarting and count that kind of Pod as failed.
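To illustrate the short-term workaround, a sketch of the reproduction Job from this issue with an activeDeadlineSeconds added (the 60-second value is only an example); once the deadline passes, the job controller terminates the pod and marks the job failed even if the kubelet has been restarting the container in place:

apiVersion: batch/v1
kind: Job
metadata:
  name: failed-job
spec:
  backoffLimit: 2
  activeDeadlineSeconds: 60   # example value: fail the job after 60s regardless of restarts
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        command: ["bash", "-c", "exit 1"]
      restartPolicy: OnFailure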

cc @soltysh

@lavalamp lavalamp added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Nov 2, 2017
@timoreimann
Contributor

timoreimann commented Nov 7, 2017

Just ran into the issue as well. It might be useful to amend the documentation to warn about the fact that backoffLimit won't be effective if OnFailure is chosen as restart policy. At least, it wasn't obvious to me and I didn't catch any relevant note in the docs. Happy to do a quick PR if that's a desirable short-term action.

For Pods failing as reported in this issue, what's the difference in behavior between a Never restart policy and a hypothetical scenario where backoffLimit is respected by OnFailure? Anything other than failing Pods being replaced by newly created ones in the former case vs. existing Pods being restarted in the latter?

@weiwei04
Contributor

weiwei04 commented Nov 8, 2017

For OnFailure, the failed Pod restarts on the same kubelet host, so emptyDir/hostPath data is still there. For Never, the failed Pod gets deleted and a replacement is created on another (possibly the same) kubelet host; if the Pod uses emptyDir/hostPath to store temporary data, that data may be lost.

For Pods that don't use emptyDir/hostPath, I think there's not much practical difference (Never involves the controller, apiserver, and scheduler doing extra work to create each replacement, and the backoffLimit policy keeps a bad jobSpec or container from creating so many replacements that they consume control plane bandwidth).
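To make the emptyDir point concrete, a sketch (name and command are hypothetical) of a Job whose container appends to an emptyDir scratch volume; with OnFailure the in-place restarts keep /scratch on the same node, whereas with Never each replacement pod starts with a fresh, empty volume:

apiVersion: batch/v1
kind: Job
metadata:
  name: scratch-job            # hypothetical name
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "date >> /scratch/attempts; exit 1"]
        volumeMounts:
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: scratch
        emptyDir: {}
      restartPolicy: OnFailure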

@timoreimann
Contributor

Might it be a reasonable idea to move the backoffLimit parameter into the Pod? I'd think it could prove valuable to have an upper bound on the restarts on Pods themselves independently of Jobs, and it may solve the "how does the Cronjob controller determine that the limit has been reached" problem more elegantly if the Pod's state automatically transitions to failed once the limit is hit.

WDYT?

@innovia

innovia commented Dec 13, 2017

Can anyone tell me: if I set restartPolicy: Never and the backoff limit to 3, will it retry 3 times? This is very confusing.

@timoreimann
Contributor

@innovia yes, it will. In fact, setting it to Never is the only way right now to have a back off limit be applied and effective.

@timoreimann
Contributor

I should mention that my statement is based on Kubernetes 1.8. I don't know if any fixes possibly made it into the upcoming 1.9. Then again, if there were, this issue should have been updated or at least referenced.

@innovia

innovia commented Dec 25, 2017

@timoreimann this is not working for a CronJob's job spec (I am running 1.8.6).
In the screenshot below I still get new pods when I kill the previous one (5 were created; only 3 should have been created and then the job should have failed).

any idea how to get this to actually work?

jobTemplate:
  spec:
    backoffLimit: 3
    template:
      spec:
        containers:
        - name: cron-job
          command:
          - "/bin/bash"
          - "-c"
          - |
            WORKER_PID=""

            handle_sig_term(){
              echo -e "Sending SIGTERM to $WORKER_PID"
              kill -SIGTERM $WORKER_PID
              echo -e "Waiting for $WORKER_PID"
              wait $WORKER_PID
              exit 143  # exit code of SIGTERM is 128 + 15
            }

            trap "handle_sig_term" TERM

            x=1
            while [ $x -le 100 ]; do
              echo "Welcome $x times"
              x=$(( $x + 1 ))
              sleep 5
            done & WORKER_PID=$!

            echo "Job PID: ${WORKER_PID}"

            # wait <n> waits until the process with ID <n> is complete
            # (it will block until the process completes)
            wait $WORKER_PID

            EXIT_CODE=$?

            if [ $EXIT_CODE -ne 0 ]; then
              echo "Job failed with exit code: $EXIT_CODE"
              exit $EXIT_CODE
            else
              echo "Job done!"
            fi
        restartPolicy: Never

[screenshot: 2017-12-25 12:07 PM]

@timoreimann
Contributor

timoreimann commented Dec 25, 2017

@innovia what's your spec's schedule? I don't see it in your manifest.

@innovia

innovia commented Dec 25, 2017

every minute just for testing

@timoreimann
Contributor

Aren't you then risking that restarting jobs from a failed schedule overlap with those from a succeeding schedule? What does it look like when you set the interval to, say, 10 minutes?

@innovia

innovia commented Dec 25, 2017

@timoreimann same result with 10 minutes interval

@timoreimann
Contributor

timoreimann commented Dec 25, 2017

@innovia I ran a simplified version of your manifest against my Kubernetes 1.8.2 cluster with the command field replaced by /bin/bash exit 1 so that every run fails, the schedule set to every 5 minutes, and the back-off limit reduced to 2. Here's my manifest:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
          - name: cron-job
            image: ubuntu
            command:
            - "/bin/bash"
            - "-c"
            - "exit 1"
          restartPolicy: Never

Initially, I get a series of three failing pods:

Mon Dec 25 23:15:28 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514240100-7mktd   0/1       Error     0          19s
hello-1514240100-bf6rw   0/1       Error     0          5s
hello-1514240100-h44kz   0/1       Error     0          15s

Then it takes pretty much 5 minutes for the next batch to start up and fail:

Mon Dec 25 23:20:23 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514240400-cj4bn   0/1       Error     0          4s
hello-1514240400-shnph   0/1       Error     0          14s

Interestingly, I only get two new containers instead of three. The next batch though comes with three again:

Mon Dec 25 23:25:16 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE
hello-1514240700-jr9sp   0/1       Error     0          3s
hello-1514240700-jxck9   0/1       Error     0          13s
hello-1514240700-x26kg   0/1       Error     0          16s

and so does the final batch I tested:

Mon Dec 25 23:30:16 CET 2017
NAME                     READY     STATUS    RESTARTS   AGE

hello-1514241000-mm8c9   0/1       Error     0          2s
hello-1514241000-pqx8f   0/1       Error     0          16s
hello-1514241000-qjtn5   0/1       Error     0          12s

Apart from the supposed off-by-one error, things seem to work for me.

@timoreimann
Contributor

Maybe you can double-check using my manifest?

@soltysh
Contributor

soltysh commented Jan 29, 2018

In the original proposal we stated it should work for both restart policies; I've addressed the problem in #58972. It counts the pods' restartCount: if the overall number (i.e. across all containers in all pods) exceeds the backoff limit, the job will fail.
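If you want to watch the counter the fix relies on, a one-liner sketch (the job-name label value is taken from the reproduction at the top of this issue) that prints each pod's per-container restartCount:

kubectl get pods -l job-name=failed-job \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'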

k8s-github-robot pushed a commit that referenced this issue Apr 19, 2018
Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Fix job's backoff limit for restart policy OnFailure

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #54870

**Release note**:
```release-note
NONE
```

/assign janetkuo
@zhangguanzhang

I think the backoffLimit doesn't work at all. My kubernetes version is 1.10.3

[root@k8s-m1 k8s]# cat job.yml 
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: pi
        image: busybox
        command: ["exit 1"]
      restartPolicy: Never
[root@k8s-m1 k8s]# kubectl apply -f job.yml 
job.batch "test-job" created
[root@k8s-m1 k8s]# kubectl get pod
NAME             READY     STATUS               RESTARTS   AGE
pod-test         1/1       Running              0          6h
test-affinity    1/1       Running              0          6h
test-job-g4r8w   0/1       ContainerCreating    0          1s
test-job-tqbvl   0/1       ContainerCannotRun   0          3s
[root@k8s-m1 k8s]# kubectl get pod
NAME             READY     STATUS               RESTARTS   AGE
pod-test         1/1       Running              0          6h
test-affinity    1/1       Running              0          6h
test-job-2h5vp   0/1       ContainerCannotRun   0          13s
test-job-2klzs   0/1       ContainerCannotRun   0          15s
test-job-4c2xz   0/1       ContainerCannotRun   0          24s
test-job-6bcqc   0/1       ContainerCannotRun   0          8s
test-job-6r7wq   0/1       ContainerCannotRun   0          11s
test-job-7cz7c   0/1       Terminating          0          8s
test-job-7mkdn   0/1       Terminating          0          16s
test-job-88ws8   0/1       ContainerCannotRun   0          26s
test-job-bqfk4   0/1       ContainerCannotRun   0          22s
test-job-jh7dp   0/1       ContainerCannotRun   0          4s
test-job-k2c4r   0/1       ContainerCannotRun   0          18s
test-job-qfj7m   0/1       ContainerCannotRun   0          6s
test-job-r8794   0/1       ContainerCreating    0          1s
test-job-r9gz6   0/1       Terminating          0          6s
test-job-w4f9r   0/1       Terminating          0          1s
[root@k8s-m1 k8s]# kubectl delete  -f job.yml 
job.batch "test-job" deleted

@soltysh
Contributor

soltysh commented Jun 5, 2018

See #62382.

diegodelemos pushed a commit to diegodelemos/reana-job-controller that referenced this issue Jun 6, 2018
* Upgrades code to support Kubernetes 1.9.4 (1.10 compatible already).

* Uses restartpolicy never. This will cause that for each retry a new pod
  will be started instead of trying with the same all the time. Even though
  it seems to be fixed kubernetes/kubernetes#54870 (comment)
  in Kubernetes 1.9.4 backoffLimit is ignored with OnFailure policy.
@lowang-bh
Member

lowang-bh commented Dec 6, 2018

kubectl get pod

NAME READY STATUS RESTARTS AGE
job-backofflimit-2-9n74x 0/1 Error 0 34s
job-backofflimit-2-bt796 0/1 Error 0 1m
job-backofflimit-2-fgktc 0/1 Error 0 49s
job-backofflimit-2-gfhlx 0/1 Error 0 1m
job-backofflimit-2-gzd54 0/1 ContainerCreating 0 8s
job-backofflimit-2-ljlpx 0/1 Error 0 1m
job-backofflimit-2-p5fxm 0/1 Error 0 1m
job-backofflimit-2-w6s5d 0/1 Error 0 21s

cat job-with-second.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: job-backofflimit-2
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: job-with-python2
        image: registry.laincloud.test/k8s-job/jobtest:0.0
      restartPolicy: Never

k8s version

v1.10.4

@moshucong

I should mention that my statement is based on Kubernetes 1.8. I don't know if any fixes possibly made it into the upcoming 1.9. Then again, if there were, this issue should have been updated or at least referenced.

1.9 has the same problem. My cluster is 1.9.6, and I ran into the same issue.

dleehr added a commit to Duke-GCB/calrissian that referenced this issue Jan 17, 2019
This should prevent them from re-starting when the workflow fails
Note that on kubernetes before 1.12, kubernetes/kubernetes#54870 exists and these fields are not interpreted