KEP-5683: Specialized Lifecycle Management#5769

Open
rthallisey wants to merge 1 commit into kubernetes:master from rthallisey:specialized-lifecycle-management

Conversation

@rthallisey rthallisey commented Jan 5, 2026

  • One-line PR description: A standardized, declarative API for coordinating lifecycle management that follows the architectural patterns of PersistentVolumeClaims and Dynamic Resource Allocation
  • Other comments:
    • This proposal would improve observability of Node and Device lifecycle states
    • The scope is deliberately small to focus on providing some value while we prove this technique is viable
    • Our expectation is this framework will help Graceful Node Shutdown reach GA, but that work will be done separately - Graceful node shutdown #2000
    • This KEP is meant to replace KEP-4212: Declarative Node Maintenance #4213

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 5, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 5, 2026
@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch 3 times, most recently from 43adeb5 to 6c465f2 Compare January 8, 2026 20:42
@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch from 6c465f2 to e3deff1 Compare January 13, 2026 16:54
@alaypatel07 (Contributor) left a comment

+1. The KEP is strictly scoped to things that can be achieved in a single release; having been following this space, this seems like a great start. Added some questions inline.

```go
// LifecycleEvent represents a binding between a LifecycleTransition
// and the Kubelet/Driver responsible for executing it.
type LifecycleEvent struct {
	// ... (fields elided in the review quote)
}
```
Contributor

I have some clarifying questions for this API:

  1. Is this a namespace-scoped or cluster-scoped API? What kinds of users should have access to this? It seems only an admin-like persona, or someone with the elevated privileges needed to disrupt workloads, should have access.
  2. How long do we expect objects to live? Do we have additional use cases, like looking up whether a node reboot was triggered by a lifecycle event or something else?
  3. Doing some napkin math, in a 100-node cluster, how many such objects do we expect to be present in the apiserver/etcd?
  4. Looking at the design, it seems this API is tied to the Node. Is that intended? If this lifecycle event is for some arbitrary object where a kubelet is not present, how will binding work?

Author

Is this a namespace scoped or cluster scoped API?

I've been going back and forth on this because there are use cases for both. I'll make it more clear in the design.

How long do we expect objects to live

Depends on the SLA. From my experience, that's hours to days.

Doing some napkin math, in a 100 node cluster, how many such objects do we expect to be present in apiserver/etcd?

<5% of Nodes being unhealthy is generally what I've come across. So around 5 objects.
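The napkin math above can be sketched in a few lines of Go (a trivial illustration; the 5% figure is an operational estimate from this thread, not a number defined by the KEP):

```go
package main

import "fmt"

// expectedEvents estimates how many lifecycle objects would exist at once,
// given a cluster size and the percentage of Nodes undergoing a transition.
// Integer ceiling division avoids floating-point rounding surprises.
func expectedEvents(nodes, unhealthyPercent int) int {
	return (nodes*unhealthyPercent + 99) / 100
}

func main() {
	// ~5% of 100 Nodes unhealthy -> around 5 objects in the apiserver/etcd.
	fmt.Println(expectedEvents(100, 5)) // prints 5
}
```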

Author

Looking at the design it seems this API is tied to node? Is that intended? If this lifecycle event is for some arbitrary object, assuming that kubelet is not present, how will binding work in that case?

What scenario would require us to lifecycle something that doesn't have a kubelet? The only case I can come up with would be a NetworkSwitch or DPU, but that device would still be attached to a Node and have a driver capable of reconciling it.

I think I'd need some more context.

@alaypatel07 (Contributor) commented Jan 14, 2026

What scenario would require us to lifecycle something that doesn't have a kubelet? The only case I can come up with would be a NetworkSwitch or DPU, but that device would still be attached to a Node and have a driver capable of reconciling it.

I think I'd need some more context.

I think I was thinking more like a device on a node, but in an offline discussion we agreed that even in case of a device it could be tied back to a node.

Member

It is also about fulfilling the contract when the kubelet is not running on the node: please see #5769 (comment)


```go
// Start identifies the initial state of the lifecycle transition.
// This value is reflected as a Condition on the target K8s resource API.
Start string
```
Contributor

Do we have some specific states that we intend to allow? The current API allows for doing unexpected things. For example, say a user wants to save cost and, to do this via this API, implements a start state of Suspend, an end state of Resume, and an SLA of 48h. While this is a legitimate use case, it is unclear whether such unexpected uses will be allowed/supported.

Author

I modeled the LifecycleTransition after the ResourceSlice in DRA. The kubelet will fill out these fields with whatever a Lifecycle Driver says it can support, making it so only vendors define these transitions, not users.
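A minimal Go sketch of that constraint, assuming a simple lookup of vendor-advertised transitions (all names here are illustrative, not from the KEP):

```go
package main

import "fmt"

// Transition pairs a start and end state. In the proposal these pairs are
// advertised by a Lifecycle Driver via the Kubelet, never defined by users.
type Transition struct {
	Start, End string
}

// supported mirrors the two example transitions from this thread; a real
// driver would publish these through its LifecycleTransition objects.
var supported = map[Transition]bool{
	{"DrainStarted", "DrainComplete"}:      true,
	{"Uncordoning", "MaintenanceComplete"}: true,
}

// allowed reports whether a requested transition was vendor-advertised,
// rejecting ad-hoc states like Suspend/Resume unless a driver registers them.
func allowed(t Transition) bool { return supported[t] }

func main() {
	fmt.Println(allowed(Transition{"DrainStarted", "DrainComplete"})) // true
	fmt.Println(allowed(Transition{"Suspend", "Resume"}))             // false
}
```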

- DrainStarted to DrainComplete
  - Uncordoning to MaintenanceComplete
- The _DrainStarted_ to _DrainComplete_ transition will run `kubectl drain --ignore-daemonsets --timeout $SLA`
- The _Uncordoning_ to _MaintenanceComplete_ transition will uncordon the Node
Contributor

This creates an interesting edge in the node object.

Until now, the node object would be considered ready only if the Ready condition on it was present. With this, do we expect users to check for the Ready condition as well as the MaintenanceComplete condition? Clients that don't upgrade will be able to recognize Node Ready but not the MaintenanceComplete state.

@rthallisey (Author) commented Jan 14, 2026

I don't expect there will be a change for the majority of users. A Node change impacting user capacity still means the Node should be cordoned by the admin. So these new conditions aren't meant to replace the Ready condition, taints, or other commonly used techniques. They are meant to provide additional lifecycle detail, if the user is interested in that. So it's an opt-in model.

@atiratree (Member)

I think the API portion of the KEP is moving in the right direction towards solving the problem domain. However, I can see the following disadvantages to using the Kubelet to facilitate communication (based on the DRA model).

  1. Different Targets: this API could be used to handle the lifecycle of objects beyond the Node, but as mentioned above, we don't have concrete use cases yet.
  2. Scope and Timing: even if they are bound to a node, the lifecycle transitions/events may not coincide with the kubelet or container runtime running. For example, handling and updating the lifecycle of the underlying machine or a VM, e.g. during shutdown, restart, provisioning, or an OS upgrade.
  3. Resource Consumption: there could be a significant number of nodes undergoing maintenance (e.g. during a cluster upgrade). Running multiple DaemonSets to support multiple events/drivers would result in higher resource usage.
  4. Cleanup: we need to have a termination order for these drivers, since shutting down a node will affect their ability to do the work. This is difficult to solve, especially between multiple owners of these drivers and lifecycle transitions/events.
  5. High Availability: we lose the high availability benefit. This is more important for a lifecycle that goes beyond the node.
  6. Cross-cutting work: who should handle the responsibility of coordinating multi-node maintenance/events?
  7. Observability and Kubelet Responsibility: over time, we see an increasing need for observability into what a component is doing. I expect this to be the case for driver behavior as well. Currently, we would process the observability part in the kubelet and report certain conditions to a Node. However, with increased demands for observability, the kubelet could become a bottleneck.

Overall, I think it would be better for most of this communication to flow through the API server.

I understand that the kubelet approach offers certain benefits such as security, Node status ownership, and fault tolerance (the ability to continue node maintenance if the network is down). I believe we can achieve the same with the API server approach and additional enhancements (even kubelet-oriented ones) as well.

@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch from e3deff1 to 4e326dd Compare January 15, 2026 15:25
@rthallisey (Author)

Overall, I think it would be better for most of this communication to flow through the API server.

I understand that the kubelet approach offers certain benefits such as security, Node status ownership, and fault tolerance (the ability to continue node maintenance if the network is down). I believe we can achieve the same with the API server approach and additional enhancements (even kubelet-oriented ones) as well.

I think we agree both the API-server approach and the kubelet approach are valid. Each can support certain use cases the other cannot; e.g., the API-server approach can track RMA or hardware in the factory, whereas the Kubelet approach can do local storage and device cleanup. I've included 'lifecycle controllers' in the diagram specifically to acknowledge that the API-server path remains open.

The primary challenge with the API-server approach is the lack of a strict coordination boundary. Drivers would operate in a pull-model, requiring broad read/write access to both the LifecycleEvent and the target objects to function. This significantly expands the RBAC footprint and blurs the lines of ownership.

From a vendor perspective, this is problematic. One of the Kubelet approach’s strengths is that it provides a defined interface for vendors to plug into without needing full access to the cluster state.

That said, I'm of the opinion we should enable both approaches, but I think your point is about which of these should be built in-tree. For the API server approach, we would likely need a separate KEP to define a 'strict handshake' mechanism that limits driver scope. The current Kubelet-oriented design provides that boundary today, so it wouldn't be blocked. Also, this approach doesn't preclude anyone from implementing an API-driven flow out-of-tree. So if admins are willing to hand out such permissions, then it can be done that way.

That's my opinion, so maybe there are others out there. We should continue to discuss.

@atiratree (Member)

versus the Kubelet approach can do local storage and device cleanup

The controller talking to the API server can also run on the node and do the local cleanup.

Also this approach doesn't preclude anyone from implementing an API-driven flow out-of-tree. So if admins are willing to hand out such permissions, then it can be done that way.

As far as I can tell, the API and the claiming process are currently designed/scoped for kubelet only and do not permit an external controller. The API should be designed not to clash with kubelet if we want to support external driver selection. Ideally, there would also be an in-tree controller implementing the claiming and external driver discoverability/registration.

To ensure the API can evolve with additional future capabilities, it would be helpful to expand the use cases and user stories.

@lmktfy (Member) left a comment

People might want to process a set of nodes, for example draining and then rebooting at most two nodes at a time.

Does this proposal allow for that approach?


Over the past several years, it has become a common pattern to use Taints, PDBs, Labels, and Annotations to coordinate lifecycle operations between controllers. While this technique is flexible and easy to use, it is brittle, has little reusability across projects, and limited upside. Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging, overwhelming these commonly used techniques.

This KEP proposes establishing a lifecycle management framework. A standardized, declarative API for coordinating lifecycle management that follows the architectural patterns of PersistentVolumeClaims (PVC) and Dynamic Resource Allocation (DRA). This API would:
Member

Can I use this to define a lifecycle for arbitrary resources (eg Namespace, ClusterRoleBinding, ConfigMap) or just for Nodes?

Author

The design decision to use a meta object like Conditions for tracking transitions was to leave room for expansion. Therefore, you can use this feature for lifecycling arbitrary resources; however, I've scoped this KEP to Nodes only because those use cases are well-defined.

```yaml
end: DrainComplete
sla: 12h
allNodes: true
driver: server_side_kubectl_drain.example.com
```
Member

Suggested change:

```diff
- driver: server_side_kubectl_drain.example.com
+ driver: example.com/server-side-node-drain
```

If it's server side, it's not kubectl doing the drain.

The `LifecycleTransition` object is heavily inspired by the DRA `ResourceSlice` object. We would use the same methods for the same reasons described in the [publishing-node-resources](https://github.com/pohly/enhancements/blob/624bec4521a2ad67642bebd315006623f9bd66a3/keps/sig-node/4381-dra-structured-parameters/README.md#publishing-node-resources) section of the dra-structured-parameters KEP.

### Driver Registration
A user will create their own specialized Lifecycle Driver that runs as a Daemonset and registers with the Kubelet through the plugin manager interface. The Driver will register two functions: `StartLifecycleTransition(...)` and `EndLifecycleTransition(...)`, each corresponding to the start and end fields from the `LifecycleTransition` spec. It will also register its name and the start + end transitions it will be responsible for.
Member

Does it have to be a DaemonSet? If so, why?

Author

The design leverages the Kubelet for binding and gRPC to a local Pod, similar to the PVC/DRA architecture. So drivers that consume that pattern would need to be DaemonSets.

If writing a driver using a custom implementation of bind, then the driver does not need to be a DaemonSet.

@lmktfy (Member) commented Jan 20, 2026

What puzzles me is: let's say we want to extend this beyond Node, and maybe we decide we want advanced lifecycle management for PersistentVolume.

A design that expects a local driver won't be appropriate there, because you can't execute code on storage.

Author

It's not required to have a local driver. Drivers only need to be local when the admin uses the Kubelet for binding.

The LifecycleEvent and LifecycleTransition objects are building blocks. When the Kubelet can't be relied on for binding (for whatever reason), the admin can fall back to a custom binding solution. That could look like a dedicated controller, binding Events to drivers in its own way.


### Names
- `LifecycleTransition` - The specification that encapsulates a single, complete lifecycle transition (start to end) within a defined location and time period
- `LifecycleEvent` - A binding API, used to indicate ownership of the active `LifecycleTransition` by the Kubelet
Member

Would a LifecycleEvent have a spec? If it doesn't, what intent is it recording or defining?

Member

Maybe we're really thinking of a NodeLifecycleTransition?

Author

A LifecycleEvent would have a spec. Since it's a binding object, it references back to the LifecycleTransition and has a reference to which Kubelet can claim the event (i.e. which Node).

A LifecycleTransition is meant to expand to any K8s resource API. With that in mind, much of the writing for this KEP is scoped for Node, since those use cases are well-defined.
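Read literally, the binding object might look something like this sketch (all field names are guesses based on this reply, not the KEP's actual schema):

```go
package main

import "fmt"

// LifecycleEventSpec is a guess at the binding object's shape based on the
// reply above: a reference back to the LifecycleTransition being claimed,
// plus the Node whose Kubelet may claim the event. None of these field
// names come from the KEP itself.
type LifecycleEventSpec struct {
	TransitionName string // the LifecycleTransition this event binds
	NodeName       string // the Node whose Kubelet may claim this event
}

// LifecycleEvent pairs a name with the hypothetical spec.
type LifecycleEvent struct {
	Name string
	Spec LifecycleEventSpec
}

func main() {
	ev := LifecycleEvent{
		Name: "drain-node-a",
		Spec: LifecycleEventSpec{TransitionName: "node-drain", NodeName: "node-a"},
	}
	fmt.Printf("%+v\n", ev)
}
```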

Member

OK, but see #5769 (comment)

If this is just for Node, it feels like we are overcomplicating it.

If it is general we are assuming the existence of local compute in a way that may prove problematic.

```yaml
spec:
  start: DrainStarted
  end: DrainComplete
  sla: 12h
```
Member

Avoid calling this a service level agreement. Maybe a transitionActiveDeadline?

@rthallisey (Author)

People might want to process a set of nodes, for example draining and then rebooting at most two nodes at a time.

Does this proposal allow for that approach?

@lmktfy, yes. A person can use these building blocks to make a tool that lifecycles groups of Nodes.

@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch from 4e326dd to 7df99eb Compare January 19, 2026 17:35

## Summary

Over the past several years, it has become a common pattern to use Taints, PDBs, Labels, and Annotations to coordinate lifecycle operations between controllers. While this technique is flexible and easy to use, it is brittle, has little reusability across projects, and limited upside. Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging, overwhelming these commonly used techniques.


Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging

Could we defend this statement? It's not obvious how the introduction of DRA makes day 2 ops more challenging. (Perhaps we are suggesting that "legacy" strategies for managing node lifecycle, taints, etc. don't always work well w/ DRA?)

Author

The strategies of taints, etc., have not worked out well for managing node lifecycle in general. And DRA decomposes a Node, adding new primitives that users will need to lifecycle (e.g. devices).

My argument is that we shouldn't extend these lifecycle strategies to DRA, and should go in a different direction instead. Lifecycling a multi-device, multi-node Job requires the utmost precision, much more than the average CPU application. It raises questions like: how do we lifecycle a single failing GPU without evicting a 50-node distributed training job? Current taint/drain patterns are too blunt for this.

@dchen1107 (Member)

Please ensure @pwschuurman being included as one of the reviewers. cc/ @yujuhong @wangzhen127 for vis.

- "@rthallisey"
owning-sig: sig-node
participating-sigs:
- sig-apps
Contributor

This section usually means impact of this KEP on these sigs.

Are you planning on adding code for this KEP in 5 different sigs? Or is this just leftovers from the working group?

Author

@kannon92, many of these SIGs would be consumers of this KEP. As in, I'd expect there will be code managed by those sigs that uses this lifecycle framework, most likely in the form of a driver.

Contributor

But is that in scope for this KEP? I.e., does sig-cli have code you are planning to add as part of this API?

Or they will eventually use the API?

Author

Most sigs on this list will eventually use the API. The work to consume these APIs would be delivered in later KEPs, so those sigs are reviewers. For sig-cli, this KEP would deliver code to enhance kubectl drain. Sig-node is the other sig that will have code delivered.

Contributor

Usually for a KEP we want to know which SIGs will own the code you are making changes in.

It may be hard to get reviews from each tech lead on these SIGs for an alpha implementation. I would suggest limiting this to the actual API / who owns these changes. Consumers are future KEPs imo.

Or if you want sig-apps to leverage this you could consider how the controller would interact with this API.

Contributor

You are calling out the kubelet and kube-apiserver as the major components for this change, so I expect maybe one or two SIGs to be required for review.

@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch from 7df99eb to 7791142 Compare February 5, 2026 22:06
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rthallisey
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.


@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch from 7791142 to 68f107b Compare February 16, 2026 21:25
Signed-off-by: Ryan Hallisey <rhallisey@nvidia.com>
@rthallisey rthallisey force-pushed the specialized-lifecycle-management branch from a0fae7a to 7c3a565 Compare February 25, 2026 15:45
@k8s-ci-robot (Contributor)

@rthallisey: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 7c3a565 | link | true | /test pull-enhancements-verify |


Create the `LifecycleTransition` and `LifecycleEvent` APIs. Here’s a sample for doing Node drain:
```yaml
apiVersion: v1alpha1
kind: LifecycleTransition
# ... (remainder of the sample elided in the review quote)
```


How do we reconcile multiple LifecycleTransition objects for the same resource?

Author

This object doesn't have any status. It's used for accounting purposes to advertise lifecycle driver capabilities. So I wouldn't expect anything to reconcile this object.


- Introduce the `LifecycleTransition` API to express intent for lifecycle state changes that require external coordination


Does this need to be its own API? For the various objects that need lifecycle management, is it simpler to use those objects' specs as the source of truth for what lifecycle state the object is transitioning into?

Author

I'm going to broaden your question: should resource APIs have dedicated spec/status fields for lifecycle desired/current state? Maybe. The biggest obstacle is whether we can define a state machine capable of handling a full resource lifecycle.

My approach is to slowly reserve these lifecycle states over time. So I'm starting with something that is intentionally low cost, Conditions, but is enough to prove the thesis.

Does this need to be its own API?

Regarding the LifecycleTransition API, we still need something to declare ownership and intent. Who is reconciling this lifecycle state? What is being reconciled?
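To make the "Conditions as a low-cost starting point" idea concrete, here is a simplified sketch of mirroring a lifecycle state onto a target resource. The `Condition` type below is a plain stand-in for Kubernetes' `metav1.Condition`; the state names and update logic are illustrative assumptions, not the KEP's design:

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition, used here to
// illustrate reflecting a lifecycle state onto a target resource.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

// setCondition updates an existing condition by Type, or appends a new one.
// This is how a driver's progress could surface on a Node's status without
// adding new spec fields to the Node API.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var nodeConds []Condition
	nodeConds = setCondition(nodeConds, Condition{
		Type: "DrainStarted", Status: "True", Reason: "LifecycleTransition",
		LastTransitionTime: time.Now(),
	})
	fmt.Println(len(nodeConds), nodeConds[0].Type)
}
```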

```yaml
apiVersion: v1alpha1
kind: LifecycleEvent
# ... (remainder of the sample elided in the review quote)
```


Why is this different from the Events API?

Author

This is a binding object. It tracks whether something is working on the LifecycleTransition and who is working on it.

This object is very similar to the resourceClaim in DRA - https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#terminology.


Existing techniques are also imprecise for complex lifecycle management. A Label is extensible in that it is an arbitrary string, but it is limited in its ability to express the state of a multi-node remediation. Such an expression would likely require a structured API with embedded fields.

CRDs are an often-used technique [^1], but they also have limitations. CRDs give the user a structured and extensible API, but they cannot capture all the necessary lifecycle state. Certain states exist on the objects themselves and must be mirrored back to the CRD - e.g., a Node is `NotReady`. This leads to CRDs supporting end-user business logic and states, limiting their ecosystem reusability.


How motivated are we by re-usability? IMO one of the benefits of single-tenant CRDs is that they are single-tenant, which means that maintainers of projects and end users don't need to reason about interoperability failures.

Author

Without a native lifecycle API in K8s, end users often create their own lifecycle solutions. The list in the footnote is nowhere near exhaustive.

Having reusability as a requirement brings on more challenges, but I think it's worth taking those on.
