@@ -28,42 +28,50 @@ You should already be familiar with the basic use of [Job](/docs/concepts/worklo
 
 {{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
 
-## Using Pod failure policy to avoid unnecessary Pod retries
+## Usage scenarios
+
+Consider the following usage scenarios for Jobs that define a Pod failure policy:
+- [Avoiding unnecessary Pod retries](#pod-failure-policy-failjob)
+- [Ignoring Pod disruptions](#pod-failure-policy-ignore)
+- [Avoiding unnecessary Pod retries based on custom Pod Conditions](#pod-failure-policy-config-issue)
+- [Avoiding unnecessary Pod retries per index](#backoff-limit-per-index-failindex)
+
+### Using Pod failure policy to avoid unnecessary Pod retries {#pod-failure-policy-failjob}
 
 With the following example, you can learn how to use Pod failure policy to
 avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
 software bug.
 
-First, create a Job based on the config:
+1. Examine the following manifest:
 
-{{% code_sample file="/controllers/job-pod-failure-policy-failjob.yaml" %}}
+   {{% code_sample file="/controllers/job-pod-failure-policy-failjob.yaml" %}}
 
-by running:
+1. Apply the manifest:
 
-```sh
-kubectl create -f job-pod-failure-policy-failjob.yaml
-```
+   ```sh
+   kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-failjob.yaml
+   ```
 
-After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
+1. After around 30 seconds the entire Job should be terminated. Inspect the status of the Job by running:
 
-```sh
-kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
-```
+   ```sh
+   kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
+   ```
 
-In the Job status, the following conditions display:
-- `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
-  a `message` field with more information about the termination, like
-  `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
-  The Job controller adds this condition as soon as the Job is considered a failure.
-  For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
-- `Failed` condition: same `reason` and `message` as the `FailureTarget`
-  condition. The Job controller adds this condition after all of the Job's Pods
-  are terminated.
+   In the Job status, the following conditions display:
+   - `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
+     a `message` field with more information about the termination, like
+     `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+     The Job controller adds this condition as soon as the Job is considered a failure.
+     For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
+   - `Failed` condition: same `reason` and `message` as the `FailureTarget`
+     condition. The Job controller adds this condition after all of the Job's Pods
+     are terminated.
 
-For comparison, if the Pod failure policy was disabled it would take 6 retries
-of the Pod, taking at least 2 minutes.
+   For comparison, if the Pod failure policy was disabled it would take 6 retries
+   of the Pod, taking at least 2 minutes.
 
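+If you only want to see the conditions described above, rather than the full YAML,
+the following is a minimal sketch (assuming the Job above is the only one matching
+the label) that prints them directly:
+
+```sh
+# Print only the status conditions of the example Job (output is a JSON list)
+kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o jsonpath='{.items[0].status.conditions}'
+```
+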
-### Clean up
+#### Clean up
 
 Delete the Job you created:
 
@@ -73,7 +81,7 @@ kubectl delete jobs/job-pod-failure-policy-failjob
 
 The cluster automatically cleans up the Pods.
 
-## Using Pod failure policy to ignore Pod disruptions
+### Using Pod failure policy to ignore Pod disruptions {#pod-failure-policy-ignore}
 
 With the following example, you can learn how to use Pod failure policy to
 ignore Pod disruptions from incrementing the Pod retry counter towards the
@@ -85,35 +93,35 @@ execution. In order to trigger a Pod disruption it is important to drain the
 node while the Pod is running on it (within 90s since the Pod is scheduled).
 {{< /caution >}}
 
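+Optionally, to help with that timing, you can watch the Pod from a separate terminal as
+soon as you create the Job below. This is an extra aid, not part of the original steps:
+
+```sh
+# Watch the example Job's Pods; -o wide also shows the node each Pod runs on.
+# Press Ctrl+C to stop watching.
+kubectl get pods -l job-name=job-pod-failure-policy-ignore -o wide --watch
+```
+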
-1. Create a Job based on the config:
+1. Examine the following manifest:
 
    {{% code_sample file="/controllers/job-pod-failure-policy-ignore.yaml" %}}
 
-   by running:
+1. Apply the manifest:
 
    ```sh
-   kubectl create -f job-pod-failure-policy-ignore.yaml
+   kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-ignore.yaml
    ```
 
-2. Run this command to check the `nodeName` the Pod is scheduled to:
+1. Run this command to check the `nodeName` the Pod is scheduled to:
 
    ```sh
    nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
    ```
 
-3. Drain the node to evict the Pod before it completes (within 90s):
-
+1. Drain the node to evict the Pod before it completes (within 90s):
+
    ```sh
    kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
    ```
 
-4. Inspect the `.status.failed` to check the counter for the Job is not incremented:
+1. Inspect the `.status.failed` to check that the counter for the Job is not incremented:
 
    ```sh
    kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
    ```
 
-5. Uncordon the node:
+1. Uncordon the node:
 
    ```sh
    kubectl uncordon nodes/$nodeName
@@ -124,7 +132,7 @@ The Job resumes and succeeds.
 For comparison, if the Pod failure policy was disabled the Pod disruption would
 result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).
 
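+As a shortcut for the check in step 4, you can print just the counter with a jsonpath
+expression. This is a minimal sketch; the field is typically omitted from the status
+while it is zero, so empty output also means that no Pod failures were counted:
+
+```sh
+# Print only the failed counter of the example Job (empty output means 0)
+kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].status.failed}{"\n"}'
+```
+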
-### Cleaning up
+#### Cleaning up
 
 Delete the Job you created:
 
@@ -134,7 +142,7 @@ kubectl delete jobs/job-pod-failure-policy-ignore
 
 The cluster automatically cleans up the Pods.
 
-## Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions
+### Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions {#pod-failure-policy-config-issue}
 
 With the following example, you can learn how to use Pod failure policy to
 avoid unnecessary Pod restarts based on custom Pod Conditions.
@@ -145,19 +153,19 @@ deleted pods, in the `Pending` phase, to a terminal phase
 (see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)).
 {{< /note >}}
 
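+The rule in this scenario matches on a Pod condition rather than an exit code. If you
+want to read the API documentation for that part of the Job spec directly from your
+cluster, you can use `kubectl explain` (assuming your cluster serves a Job API that
+includes the Pod failure policy fields):
+
+```sh
+# Show the API documentation for Pod-condition based rules in a Pod failure policy
+kubectl explain job.spec.podFailurePolicy.rules.onPodConditions
+```
+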
-1. First, create a Job based on the config:
+1. Examine the following manifest:
 
    {{% code_sample file="/controllers/job-pod-failure-policy-config-issue.yaml" %}}
 
-   by running:
+1. Apply the manifest:
 
    ```sh
-   kubectl create -f job-pod-failure-policy-config-issue.yaml
+   kubectl create -f https://k8s.io/examples/controllers/job-pod-failure-policy-config-issue.yaml
    ```
 
    Note that the image is misconfigured, as it does not exist.
 
-2. Inspect the status of the job's Pods by running:
+1. Inspect the status of the job's Pods by running:
 
    ```sh
    kubectl get pods -l job-name=job-pod-failure-policy-config-issue -o yaml
@@ -181,7 +189,7 @@ deleted pods, in the `Pending` phase, to a terminal phase
    image could get pulled. However, in this case, the image does not exist so
    we indicate this fact by a custom condition.
 
-3. Add the custom condition. First prepare the patch by running:
+1. Add the custom condition. First prepare the patch by running:
 
    ```sh
    cat <<EOF > patch.yaml
@@ -210,13 +218,13 @@ deleted pods, in the `Pending` phase, to a terminal phase
    pod/job-pod-failure-policy-config-issue-k6pvp patched
    ```
 
-4. Delete the pod to transition it to `Failed` phase, by running the command:
+1. Delete the Pod to transition it to the `Failed` phase, by running the command:
 
    ```sh
    kubectl delete pods/$podName
    ```
 
-5. Inspect the status of the Job by running:
+1. Inspect the status of the Job by running:
 
    ```sh
    kubectl get jobs -l job-name=job-pod-failure-policy-config-issue -o yaml
@@ -232,7 +240,7 @@ In a production environment, the steps 3 and 4 should be automated by a
 user-provided controller.
 {{< /note >}}
 
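+The following is an illustrative sketch of one check such automation could perform
+before applying the patch from step 3 and the deletion from step 4. The approach and
+the reasons it looks for are assumptions about one possible design, not a published
+controller:
+
+```sh
+# Find the example Job's Pod and print why its first container is waiting.
+# A controller could react to reasons such as ErrImagePull or ImagePullBackOff
+# by adding the custom condition (step 3) and then deleting the Pod (step 4).
+podName=$(kubectl get pods -l job-name=job-pod-failure-policy-config-issue -o jsonpath='{.items[0].metadata.name}')
+kubectl get pod "$podName" -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}{"\n"}'
+```
+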
-### Cleaning up
+#### Cleaning up
 
 Delete the Job you created:
 
@@ -242,6 +250,66 @@ kubectl delete jobs/job-pod-failure-policy-config-issue
 
 The cluster automatically cleans up the Pods.
 
+### Using Pod failure policy to avoid unnecessary Pod retries per index {#backoff-limit-per-index-failindex}
+
+To avoid unnecessary Pod restarts per index, you can use the _Pod failure policy_ and
+_backoff limit per index_ features. This section of the page shows how to use these features
+together.
+
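+If you want to read what the two fields used below do, straight from your cluster's API
+schema, you can use `kubectl explain`. This is an optional aid and assumes that your
+cluster version serves a Job API that includes the backoff limit per index field:
+
+```sh
+# Show the API documentation for the per-index backoff limit and for rule actions
+# (FailIndex is one of the listed actions)
+kubectl explain job.spec.backoffLimitPerIndex
+kubectl explain job.spec.podFailurePolicy.rules.action
+```
+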
+1. Examine the following manifest:
+
+   {{% code_sample file="/controllers/job-backoff-limit-per-index-failindex.yaml" %}}
+
+1. Apply the manifest:
+
+   ```sh
+   kubectl create -f https://k8s.io/examples/controllers/job-backoff-limit-per-index-failindex.yaml
+   ```
+
+1. After around 15 seconds, inspect the status of the Pods for the Job by running:
+
+   ```sh
+   kubectl get pods -l job-name=job-backoff-limit-per-index-failindex
+   ```
+
+   You will see output similar to this:
+
+   ```none
+   NAME                                            READY   STATUS      RESTARTS   AGE
+   job-backoff-limit-per-index-failindex-0-4g4cm   0/1     Error       0          4s
+   job-backoff-limit-per-index-failindex-0-fkdzq   0/1     Error       0          15s
+   job-backoff-limit-per-index-failindex-1-2bgdj   0/1     Error       0          15s
+   job-backoff-limit-per-index-failindex-2-vs6lt   0/1     Completed   0          11s
+   job-backoff-limit-per-index-failindex-3-s7s47   0/1     Completed   0          6s
+   ```
+
+   Note that the output shows the following:
+
+   * Two Pods have index 0, because the backoff limit per index allows one
+     retry of that index.
+   * Only one Pod has index 1, because the exit code of its failed Pod matched
+     the Pod failure policy rule with the `FailIndex` action, so that index was
+     not retried.
+
+1. Inspect the status of the Job by running:
+
+   ```sh
+   kubectl get jobs -l job-name=job-backoff-limit-per-index-failindex -o yaml
+   ```
+
+   In the Job status, note that the `failedIndexes` field shows `0,1`, because
+   both indexes failed. Because index 1 was not retried, the number of failed
+   Pods, indicated by the `failed` status field, equals 3.
+
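+A minimal sketch to print just these two fields, instead of reading the full YAML
+(assuming the Job above is the only one matching the label):
+
+```sh
+# Print the failedIndexes and failed fields of the example Job's status
+kubectl get jobs -l job-name=job-backoff-limit-per-index-failindex -o jsonpath='{.items[0].status.failedIndexes}{"\n"}{.items[0].status.failed}{"\n"}'
+```
+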
+#### Cleaning up
+
+Delete the Job you created:
+
+```sh
+kubectl delete jobs/job-backoff-limit-per-index-failindex
+```
+
+The cluster automatically cleans up the Pods.
+
 ## Alternatives
 
 You could rely solely on the