Commit 81347ff

Blog for container restart policy
1 parent eaf7596 commit 81347ff

1 file changed: 197 additions & 0 deletions

---
layout: blog
title: "Finer-Grained Control Over Container Restarts"
date: 2025-07-30
draft: false
slug: per-container-restart-policy
author: >
  [Yuan Wang](https://github.com/yuanwang04)
---

With the release of Kubernetes 1.34, we are introducing a new alpha feature
that gives you more granular control over container restarts within a Pod. This
feature, named **Container Restart Policy and Rules**, allows you to specify a
restart policy for each container individually, overriding the Pod's global
restart policy. It also lets you restart individual containers conditionally,
based on their exit codes. This feature is available behind the alpha feature
gate `ContainerRestartRules`.

This has been a long-requested feature, and we're excited to finally bring it
to you. Let's dive into how it works and how you can use it.

## The Problem with a Single Restart Policy

Before this feature, the `restartPolicy` was set at the Pod level, so all
containers in a Pod shared the same restart policy (`Always`, `OnFailure`, or
`Never`). While this works for many use cases, it can be limiting in others.

For example, consider a Pod with a main application container and an init
container that performs some initial setup. You might want the main container
to always be restarted when it fails, but the init container to run only once
and never restart. With a single Pod-level restart policy, this wasn't possible.

## Introducing Per-Container Restart Policies

With the new `ContainerRestartRules` feature gate enabled, you can specify a
`restartPolicy` for each container in your Pod's spec. You can also define
`restartPolicyRules` to control restarts based on exit codes. This gives you
the fine-grained control you need to handle complex scenarios.
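
To make the override relationship concrete, here is a minimal sketch of a Pod in
which one container inherits the Pod-level policy and another overrides it. The
Pod and container names are hypothetical, and the manifest assumes a cluster
where the `ContainerRestartRules` feature gate is enabled.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-sketch      # hypothetical name, for illustration only
spec:
  restartPolicy: OnFailure         # Pod-level policy, used when a container sets none
  containers:
  - name: uses-pod-policy          # no container-level restartPolicy: OnFailure applies
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 30 && exit 0']
  - name: overrides-pod-policy
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 30 && exit 42']
    restartPolicy: Never           # container-level policy overrides the Pod-level one
    restartPolicyRules:            # exit-code rule: restart only on exit code 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
```

In this sketch the first container exits successfully and is left alone under
`OnFailure`, while the second exits with code 42 and should be restarted in
place by its rule, even though its own `restartPolicy` is `Never`.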

## Use Cases

Let's look at some real-life use cases where per-container restart policies can
be beneficial.

### In-place restarts for training jobs

In ML research, it's common to orchestrate a large number of long-running AI/ML
training workloads. In these scenarios, workload failures are unavoidable. When
a workload fails with a retriable exit code, you want the container to restart
quickly without rescheduling the entire Pod, because rescheduling consumes a
significant amount of time and resources. Restarting the failed container
"in-place" is critical for better utilization of compute resources. The
container should only restart "in-place" if it failed due to a retriable error;
otherwise, the container and Pod should terminate and possibly be rescheduled.

This can now be achieved with container-level `restartPolicyRules`. The workload
can exit with different codes to represent retriable and non-retriable errors.
With `restartPolicyRules`, the workload can be restarted in-place quickly, but
only when the error is retriable.

### Try-once init containers

Init containers are often used to perform initialization work for the main
container, such as setting up environments and credentials. Sometimes, you want
the main container to always be restarted, but you don't want to retry
initialization if it fails.

With a container-level `restartPolicy`, this is now possible. The init container
can be executed only once, and its failure is treated as a Pod failure. If the
initialization succeeds, the main container can always be restarted.

### Pods with multiple containers

For Pods that run multiple containers, you might have different restart
requirements for each container. Some containers might have a clear definition
of success and should only be restarted on failure. Others might always need to
be restarted.

This is now possible with a container-level `restartPolicy`, allowing individual
containers to have different restart policies.

## How to Use It

To use this new feature, you need to enable the `ContainerRestartRules` feature
gate on both the control plane and the worker nodes of a cluster running
Kubernetes 1.34 or later. Once enabled, you can specify the `restartPolicy` and
`restartPolicyRules` fields in your container definitions.
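
If you want to try the feature on a local test cluster, one option is
[kind](https://kind.sigs.k8s.io/), which can turn on a feature gate for all
cluster components from its cluster configuration. The sketch below assumes you
use kind with a node image for Kubernetes 1.34 or later; kind is not required by
the feature, and other installers configure feature gates differently.

```yaml
# kind cluster configuration (assumption: kind is your local cluster tool)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  ContainerRestartRules: true   # the alpha feature gate named in this post
```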

Here are some examples:

### Example 1: Restarting on specific exit codes

In this example, we want to restart the container if and only if it fails with a
retriable error, represented by exit code 42.

To achieve this, we have a container with `restartPolicy: Never`, and a restart
policy rule that tells Kubernetes to restart the container in-place if it exits
with code 42.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes
spec:
  restartPolicy: Never
  containers:
  - name: restart-on-exit-codes
    image: docker.io/library/busybox:1.28
    # This command exits with code 0, so the rule below does not match and the
    # container is not restarted. Change it to `exit 42` to see an in-place restart.
    command: ['sh', '-c', 'sleep 60 && exit 0']
    restartPolicy: Never     # Container restart policy must be specified if rules are specified
    restartPolicyRules:      # Only restart the container if it exits with code 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
```
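
As a variation, a rule can also be written the other way around: restart on
every exit code except a known set. The sketch below assumes the alpha API's
`NotIn` operator for `exitCodes`; the Pod name and the choice of 0 and 1 as
"do not restart" codes are hypothetical, picked only for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-unless-fatal       # hypothetical name, for illustration only
spec:
  restartPolicy: Never
  containers:
  - name: restart-unless-fatal
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 60 && exit 3']
    restartPolicy: Never           # applies when no rule matches (exit codes 0 and 1 here)
    restartPolicyRules:            # restart on any exit code other than 0 or 1
    - action: Restart
      exitCodes:
        operator: NotIn
        values: [0, 1]
```

Listing 0 in `values` keeps a successful exit from matching the rule, so the
container is only restarted after a failure that is not the designated fatal
code.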

### Example 2: A try-once init container

In this example, we want a Pod whose main container is always restarted once
initialization succeeds. However, the initialization should only be tried once.

To achieve this, we have a Pod with an `Always` restart policy. The `init-once`
init container will only try once. If it fails, the Pod will fail. This allows
the Pod to fail if the initialization fails, but keep running once the
initialization succeeds.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fail-pod-if-init-fails
spec:
  restartPolicy: Always
  initContainers:
  - name: init-once      # This init container will only try once. If it fails, the Pod will fail.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1']
    restartPolicy: Never
  containers:
  - name: main-container # This container will always be restarted once initialization succeeds.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800 && exit 0']
```

### Example 3: Containers with different restart policies

In this example, we have two containers with different restart requirements. One
should always be restarted, while the other should only be restarted on failure.

To achieve this, we use a different container-level `restartPolicy` on each of
the two containers.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: on-failure-pod
spec:
  containers:
  - name: restart-on-failure
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Not restarting after success" && sleep 10 && exit 0']
    restartPolicy: OnFailure
  - name: restart-always
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Always restarting" && sleep 1800 && exit 0']
    restartPolicy: Always
```

## Learn More

- Read the documentation for
  [container restart policy](/docs/concepts/workloads/pod-lifecycle/#container-restart-rules).
- Read the KEP for
  [Container Restart Rules](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5307-container-restart-policy).

## Roadmap

More actions and signals to restart Pods and containers are coming! Notably, we
are looking into adding support for restarting the entire Pod. Planning and
discussions on these features are in progress. Feel free to share feedback or
requests with the SIG Node community!

## We Want Your Feedback!

This is an alpha feature, and we'd love to hear your feedback. Please try it out
and let us know what you think. This feature is driven by
[SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md).
If you are interested in helping develop this feature, sharing feedback, or
participating in any other ongoing SIG Node projects, please reach out to us!

You can reach SIG Node by several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
