From 041e363bc0e588cf3c2c8437e7e7aa0973830e12 Mon Sep 17 00:00:00 2001 From: Yuan Wang Date: Wed, 30 Jul 2025 23:19:53 +0000 Subject: [PATCH 1/2] Blog for container restart policy --- ...2025-0X-XX-Per-Container-Restart-Policy.md | 204 ++++++++++++++++++ 1 file changed, 204 insertions(+) create mode 100644 content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md diff --git a/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md b/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md new file mode 100644 index 0000000000000..8b8d07e632eb8 --- /dev/null +++ b/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md @@ -0,0 +1,204 @@ +--- +layout: blog +title: "Kubernetes v1.34: Finer-Grained Control Over Container Restarts" +date: 2025-0X-XX +draft: false +slug: kubernetes-v1-34-per-container-restart-policy +author: > + [Yuan Wang](https://github.com/yuanwang04) +--- + +With the release of Kubernetes 1.34, a new alpha feature is introduced +that gives you more granular control over container restarts within a Pod. This +feature, named **Container Restart Policy and Rules**, allows you to specify a +restart policy for each container individually, overriding the Pod's global +restart policy. In addition, it also allows you to conditionally restart +individual containers based on their exit codes. This feature is available +behind the alpha feature gate `ContainerRestartRules`. + +This has been a long-requested feature. Let's dive into how it works and how you +can use it. + +## The problem with a single restart policy + +Before this feature, the `restartPolicy` was set at the Pod level. This meant +that all containers in a Pod shared the same restart policy (`Always`, +`OnFailure`, or `Never`). While this works for many use cases, it can be +limiting in others. + +For example, consider a Pod with a main application container and an init +container that performs some initial setup. You might want the main container +to always restart on failure, but the init container should only run once and +never restart. With a single Pod-level restart policy, this wasn't possible. + +## Introducing per-container restart policies + +With the new `ContainerRestartRules` feature gate, you can now specify a +`restartPolicy` for each container in your Pod's spec. You can also define +`restartPolicyRules` to control restarts based on exit codes. This gives you +the fine-grained control you need to handle complex scenarios. + +## Use cases + +Let's look at some real-life use cases where per-container restart policies can +be beneficial. + +### In-place restarts for training jobs + +In ML research, it's common to orchestrate a large number of long-running AI/ML +training workloads. In these scenarios, workload failures are unavoidable. When +a workload fails with a retriable exit code, you want the container to restart +quickly without rescheduling the entire Pod, which consumes a significant amount +of time and resources. Restarting the failed container "in-place" is critical +for better utilization of compute resources. The container should only restart +"in-place" if it failed due to a retriable error; otherwise, the container and +Pod should terminate and possibly be rescheduled. + +This can now be achieved with container-level `restartPolicyRules`. The workload +can exit with different codes to represent retriable and non-retriable errors. +With `restartPolicyRules`, the workload can be restarted in-place quickly, but +only when the error is retriable. + +### Try-once init containers + +Init containers are often used to perform initialization work for the main +container, such as setting up environments and credentials. Sometimes, you want +the main container to always be restarted, but you don't want to retry +initialization if it fails. + +With a container-level `restartPolicy`, this is now possible. The init container +can be executed only once, and its failure would be considered a Pod failure. If +the initialization succeeds, the main container can be always restarted. + +### Pods with multiple containers + +For Pods that run multiple containers, you might have different restart +requirements for each container. Some containers might have a clear definition +of success and should only be restarted on failure. Others might need to be +always restarted. + +This is now possible with a container-level `restartPolicy`, allowing individual +containers to have different restart policies. + +## How to use it + +To use this new feature, you need to enable the `ContainerRestartRules` feature +gate on your Kubernetes cluster control-plane and worker nodes running +Kubernetes 1.34+. Once enabled, you can specify the `restartPolicy` and +`restartPolicyRules` fields in your container definitions. + +Here are some examples: + +### Example 1: Restarting on specific exit codes + +In this example, the container should restart if and only if it fails with a +retriable error, represented by exit code 42. + +To achieve this, the container has `restartPolicy: Never`, and a restart +policy rule that tells Kubernetes to restart the container in-place if it exits +with code 42. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: restart-on-exit-codes + annotations: + kubernetes.io/description: "This Pod only restart the container only when it exits with code 42." +spec: + restartPolicy: Never + containers: + - name: restart-on-exit-codes + image: docker.io/library/busybox:1.28 + command: ['sh', '-c', 'sleep 60 && exit 0'] + restartPolicy: Never # Container restart policy must be specified if rules are specified + restartPolicyRules: # Only restart the container if it exits with code 42 + - action: Restart + exitCodes: + operator: In + values: [42] +``` + +### Example 2: A try-once init container + +In this example, a Pod should always be restarted once the initialization succeeds. +However, the initialization should only be tried once. + +To achieve this, the Pod has an `Always` restart policy. The `init-once` +init container will only try once. If it fails, the Pod will fail. This allows +the Pod to fail if the initialization failed, but also keep running once the +initialization succeeds. + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: fail-pod-if-init-fails + annotations: + kubernetes.io/description: "This Pod has an init container that runs only once. After initialization succeeds, the main container will always be restarted." +spec: + restartPolicy: Always + initContainers: + - name: init-once # This init container will only try once. If it fails, the Pod will fail. + image: docker.io/library/busybox:1.28 + command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1'] + restartPolicy: Never + containers: + - name: main-container # This container will always be restarted once initialization succeeds. + image: docker.io/library/busybox:1.28 + command: ['sh', '-c', 'sleep 1800 && exit 0'] +``` + +### Example 3: Containers with different restart policies + +In this example, there are two containers with different restart requirements. One +should always be restarted, while the other should only be restarted on failure. + +This is achieved by using a different container-level `restartPolicy` on each of +the two containers. +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: on-failure-pod + annotations: + kubernetes.io/description: "This Pod has two containers with different restart policies." +spec: + containers: + - name: restart-on-failure + image: docker.io/library/busybox:1.28 + command: ['sh', '-c', 'echo "Not restarting after success" && sleep 10 && exit 0'] + restartPolicy: OnFailure + - name: restart-always + image: docker.io/library/busybox:1.28 + command: ['sh', '-c', 'echo "Always restarting" && sleep 1800 && exit 0'] + restartPolicy: Always +``` + +## Learn more + +- Read the documentation for + [container restart policy](/docs/concepts/workloads/pod-lifecycle/#container-restart-rules). +- Read the KEP for the + [Container Restart Rules](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5307-container-restart-policy) + +## Roadmap + +More actions and signals to restart Pods and containers are coming! Notably, +there are plans to add support for restarting the entire Pod. Planning and +discussions on these features are in progress. Feel free to share feedback or +requests with the SIG Node community! + +## Your feedback is welcome! + +This is an alpha feature, and the Kubernetes project would love to hear your feedback. +Please try it out. This feature is driven by the +[SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md). +If you are interested in helping develop this feature, sharing feedback, or +participating in any other ongoing SIG Node projects, please reach out to the +SIG Node community! + +You can reach SIG Node by several means: +- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node) +- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node) +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode) From a223ceaa2006165f3c81f3c81aad3f9276c32c40 Mon Sep 17 00:00:00 2001 From: Tim Bannister <193443691+lmktfy@users.noreply.github.com> Date: Sun, 10 Aug 2025 22:49:43 +0100 Subject: [PATCH 2/2] Ensure article is marked as draft The article must be draft until the Kubernetes v1.34 release has happened, and also must stay draft until it has a (future) publication date set. --- .../en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md b/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md index 8b8d07e632eb8..7ac3b7967e4f3 100644 --- a/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md +++ b/content/en/blog/_posts/2025-0X-XX-Per-Container-Restart-Policy.md @@ -2,7 +2,7 @@ layout: blog title: "Kubernetes v1.34: Finer-Grained Control Over Container Restarts" date: 2025-0X-XX -draft: false +draft: true slug: kubernetes-v1-34-per-container-restart-policy author: > [Yuan Wang](https://github.com/yuanwang04)