KEP-5683: Specialized Lifecycle Management #5769
rthallisey wants to merge 1 commit into kubernetes:master
Conversation
force-pushed from 43adeb5 to 6c465f2
force-pushed from 6c465f2 to e3deff1
alaypatel07 left a comment:
+1. The KEP is strictly scoped to things that can be achieved in a single release. Having been following this space, this seems like a great start. Added some questions inline.
```go
// LifecycleEvent represents a binding between a LifecycleTransition
// and the Kubelet/Driver responsible for executing it.
type LifecycleEvent struct {
```
I have some clarifying questions for this API:
- Is this a namespace-scoped or cluster-scoped API? What kinds of users should have access to it? It seems only an admin-like persona, or someone with privileges high enough to disrupt workloads, should have access.
- How long do we expect objects to live? Do we have additional use cases, such as looking up whether a node reboot was triggered by a lifecycle event or by something else?
- Doing some napkin math, in a 100-node cluster, how many such objects do we expect to be present in apiserver/etcd?
- Looking at the design, it seems this API is tied to the Node. Is that intended? If a lifecycle event is for some arbitrary object where no kubelet is present, how will binding work in that case?
> Is this a namespace scoped or cluster scoped API?

I've been going back and forth on this because there are use cases for both. I'll make it clearer in the design.

> How long do we expect objects to live?

Depends on the SLA. From my experience, that's hours to days.

> Doing some napkin math, in a 100 node cluster, how many such objects do we expect to be present in apiserver/etcd?

<5% of Nodes being unhealthy is generally what I've come across, so around 5 objects.
> Looking at the design it seems this API is tied to node? Is that intended? If this lifecycle event is for some arbitrary object, assuming that kubelet is not present, how will binding work in that case?

What scenario would require lifecycling something that doesn't have a kubelet? The only case I can come up with is a NetworkSwitch or a DPU, but that device would still be attached to a Node and have a driver capable of reconciling it.
I think I'd need some more context.
> What scenario would require lifecycling something that doesn't have a kubelet? The only case I can come up with is a NetworkSwitch or a DPU, but that device would still be attached to a Node and have a driver capable of reconciling it.

I was thinking more of a device on a node, but in an offline discussion we agreed that even in the case of a device, it could be tied back to a node.
It is also about fulfilling the contract when the kubelet is not running on the node: please see #5769 (comment).
```go
// Start identifies the initial state of the lifecycle transition.
// This value is reflected as a Condition on the target K8s resource API.
Start string
```
Do we have specific states that we intend to allow? The current API allows for doing unexpected things. For example, say a user wants to save cost via this API: they define a start state of Suspend, an end state of Resume, and an SLA of 48h. While this is a legitimate use case, it is unclear whether such unexpected uses will be allowed/supported.
I modeled the LifecycleTransition after the ResourceSlice in DRA, so the kubelet will fill out these fields with whatever a Lifecycle Driver says it can support. That way, only vendors define these transitions, not users.
> - DrainStarted to DrainComplete
> - Uncordoning to MaintenanceComplete
> - The _DrainStarted_ to _DrainComplete_ transition will run `kubectl drain --ignore-daemonsets --timeout $SLA`
> - The _Uncordoning_ to _MaintenanceComplete_ transition will uncordon the Node
This creates an interesting edge case in the Node object.
Until now, the Node would be considered ready only if the Ready condition on it was present. With this, do we expect users to check for the MaintenanceComplete condition as well as Ready? Clients who don't upgrade will recognize Node Ready but not the MaintenanceComplete state.
I don't expect there will be a change for the majority of users. A Node change impacting user capacity still means the Node should be cordoned by the admin. These new conditions aren't meant to replace the Ready condition, taints, or other commonly used techniques; they provide additional lifecycle detail if the user is interested in it. It's an opt-in model.
I think the API portion of the KEP is moving in the right direction towards solving the problem domain. However, I can see the following disadvantages to using the Kubelet to facilitate communication (based on the DRA model).

Overall, I think it would be better for most of this communication to flow through the API server. I understand that the kubelet approach offers certain benefits, such as security, Node status ownership, and fault tolerance (the ability to continue node maintenance if the network is down). I believe we can achieve the same with the API-server approach plus additional enhancements (even kubelet-oriented ones).
force-pushed from e3deff1 to 4e326dd
I think we agree both the API-server approach and the kubelet approach are valid. Each can support use cases the other cannot: the API-server approach can track RMA or hardware in the factory, while the kubelet approach can do local storage and device cleanup. I've included "lifecycle controllers" in the diagram specifically to acknowledge that the API-server path remains open.

The primary challenge with the API-server approach is the lack of a strict coordination boundary. Drivers would operate in a pull model, requiring broad read/write access to both the LifecycleEvent and the target objects to function. This significantly expands the RBAC footprint and blurs the lines of ownership. From a vendor perspective, this is problematic. One of the kubelet approach's strengths is that it provides a defined interface for vendors to plug into without needing full access to the cluster state.

That said, I'm of the opinion we should enable both approaches, but I think your point is about which of these should be built in-tree. For the API-server approach, we would likely need a separate KEP to define a "strict handshake" mechanism that limits driver scope. The current kubelet-oriented design provides that boundary today, so it wouldn't be blocked. This approach also doesn't preclude anyone from implementing an API-driven flow out-of-tree; if admins are willing to hand out such permissions, it can be done that way. That's my opinion; maybe there are others out there. We should continue to discuss.
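To make the RBAC-footprint concern concrete, a pull-model driver talking directly to the API server would need something like the following ClusterRole. The API group and resource names here are assumptions for illustration, not part of the KEP:

```yaml
# Hypothetical ClusterRole for an API-server-driven lifecycle driver under a
# pull model. Broad read/write on both the lifecycle objects and the target
# resources is what blurs ownership; the kubelet path avoids granting this.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: lifecycle-driver-pull-model
rules:
- apiGroups: ["lifecycle.k8s.io"]   # assumed API group
  resources: ["lifecycleevents", "lifecycletransitions"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
  resources: ["nodes"]              # plus every other target resource type
  verbs: ["get", "list", "watch", "patch"]
```

Under the kubelet-bound design, a vendor driver instead talks only to the local kubelet interface and never needs cluster-wide grants like these.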
The controller talking to the API server can also run on the node and do the local cleanup.
As far as I can tell, the API and the claiming process are currently designed/scoped for the kubelet only and do not permit an external controller. The API should be designed not to clash with the kubelet if we want to support external driver selection. Ideally, there would also be an in-tree controller implementing the claiming and external driver discoverability/registration. To ensure the API can evolve with additional future capabilities, it would be helpful to expand the use cases and user stories.
lmktfy left a comment:
People might want to process a set of nodes, for example draining and then rebooting at most two nodes at a time.
Does this proposal allow for that approach?
> Over the past several years, it has become a common pattern to use Taints, PDBs, Labels, and Annotations to coordinate lifecycle operations between controllers. While this technique is flexible and easy to use, it is brittle, has little reusability across projects, and limited upside. Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging, overwhelming these commonly used techniques.
> This KEP proposes establishing a lifecycle management framework. A standardized, declarative API for coordinating lifecycle management that follows the architectural patterns of PersistentVolumeClaims (PVC) and Dynamic Resource Allocation (DRA). This API would:
Can I use this to define a lifecycle for arbitrary resources (eg Namespace, ClusterRoleBinding, ConfigMap) or just for Nodes?
The design decision to use a meta-object like Conditions for tracking transitions was to leave room for expansion. You can therefore use this feature for lifecycling arbitrary resources; however, I've scoped this KEP to only Nodes because those use cases are well-defined.
```yaml
  end: DrainComplete
  sla: 12h
  allNodes: true
  driver: server_side_kubectl_drain.example.com
```
Suggested change:

```yaml
  driver: example.com/server-side-node-drain
```

If it's server-side, it's not kubectl doing the drain.
> The `LifecycleTransition` object is heavily inspired by the DRA `ResourceSlice` object. We would use the same methods for the same reasons described in the [publishing-node-resources](https://github.com/pohly/enhancements/blob/624bec4521a2ad67642bebd315006623f9bd66a3/keps/sig-node/4381-dra-structured-parameters/README.md#publishing-node-resources) section of the dra-structured-parameters KEP.
> ### Driver Registration
> A user will create their own specialized Lifecycle Driver that runs as a Daemonset and registers with the Kubelet through the plugin manager interface. The Driver will register two functions: `StartLifecycleTransition(...)` and `EndLifecycleTransition(...)`, each corresponding to the start and end fields from the `LifecycleTransition` spec. It will also register its name and the start + end transitions it will be responsible for.
Does it have to be a DaemonSet? If so, why?
The design leverages the Kubelet for binding and gRPC to a local Pod, similar to the PVC/DRA architecture, so drivers that consume that pattern would need to be DaemonSets.
If a driver uses a custom implementation of binding, it does not need to be a DaemonSet.
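A minimal sketch of that DaemonSet-based deployment, mirroring how DRA drivers surface a registration socket to the kubelet plugin manager. The image name is hypothetical and the mount path is an assumption based on the standard kubelet plugin registry:

```yaml
# Illustrative DaemonSet skeleton for a kubelet-bound Lifecycle Driver.
# The driver registers via the kubelet plugin registry, similar to DRA drivers.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-lifecycle-driver
spec:
  selector:
    matchLabels:
      app: example-lifecycle-driver
  template:
    metadata:
      labels:
        app: example-lifecycle-driver
    spec:
      containers:
      - name: driver
        image: example.com/lifecycle-driver:v0.1.0   # hypothetical image
        volumeMounts:
        - name: plugin-registry
          mountPath: /var/lib/kubelet/plugins_registry
      volumes:
      - name: plugin-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
```

Running one Pod per Node is what lets each driver instance serve the start/end transition callbacks for its local kubelet.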
What puzzles me is: let's say we want to extend this beyond Node, and maybe we decide we want advanced lifecycle management for PersistentVolume.
A design that expects a local driver won't be appropriate there, because you can't execute code on storage.
It's not required to have a local driver; drivers need to be local only when the admin uses the Kubelet for binding.
The LifecycleEvent and LifecycleTransition objects are building blocks. When the Kubelet can't be relied on for binding (for whatever reason), the admin can fall back to a custom binding solution. That could look like a dedicated controller binding Events to drivers in its own way.
> ### Names
> - `LifecycleTransition` - The specification that encapsulates a single, complete lifecycle transition (start to end) within a defined location and time period
> - `LifecycleEvent` - A binding API, used to indicate ownership of the active `LifecycleTransition` by the Kubelet
Would a LifecycleEvent have a spec? If it doesn't, what intent is it recording or defining?
Maybe we're really thinking of a NodeLifecycleTransition?
A LifecycleEvent would have a spec. Since it's a binding object, it references back to the LifecycleTransition and records which Kubelet (i.e., which Node) can claim the event.
A LifecycleTransition is meant to extend to any K8s resource API. With that in mind, much of the writing for this KEP is scoped to Node, since those use cases are well-defined.
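Based on that description, a LifecycleEvent might look roughly like the following. Only the back-reference and the Node reference come from the discussion here; every field name is an assumption:

```yaml
# Hypothetical LifecycleEvent sketch, not the KEP's actual schema.
apiVersion: v1alpha1
kind: LifecycleEvent
metadata:
  name: worker-1-drain
spec:
  transitionName: node-drain   # back-reference to the LifecycleTransition
  nodeName: worker-1           # which Kubelet may claim this event
status:
  claimedBy: server_side_kubectl_drain.example.com  # recorded once a driver claims it
```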
OK, but see #5769 (comment).
If this is just for Node, it feels like we are overcomplicating it.
If it is general, we are assuming the existence of local compute in a way that may prove problematic.
```yaml
spec:
  start: DrainStarted
  end: DrainComplete
  sla: 12h
```
Avoid calling this a service level agreement. Maybe a transitionActiveDeadline?
@lmktfy, yes. A person can use these building blocks to make a tool that lifecycles groups of Nodes.
force-pushed from 4e326dd to 7df99eb
> ## Summary
>
> Over the past several years, it has become a common pattern to use Taints, PDBs, Labels, and Annotations to coordinate lifecycle operations between controllers. While this technique is flexible and easy to use, it is brittle, has little reusability across projects, and limited upside. Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging, overwhelming these commonly used techniques.
> Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging

Could we defend this statement? It's not obvious how the introduction of DRA makes day-2 ops more challenging. (Perhaps we are suggesting that "legacy" strategies for managing node lifecycle, taints, etc., don't always work well with DRA?)
The strategies of taints, etc. have not worked out well for managing node lifecycle in general. And DRA decomposes a Node, adding new primitives that users would lifecycle (e.g. devices).
My argument is that we shouldn't extend these lifecycle strategies to DRA; we should go in a different direction. Lifecycling a multi-device, multi-node Job requires the utmost precision, much more than the average CPU application. It raises questions like: how do we lifecycle a single failing GPU without evicting a 50-node distributed training job? Current taint/drain patterns are too blunt for this.
Please ensure @pwschuurman is included as one of the reviewers. cc @yujuhong @wangzhen127 for visibility.
```yaml
  - "@rthallisey"
owning-sig: sig-node
participating-sigs:
  - sig-apps
```
This section usually describes the impact of this KEP on these SIGs.
Are you planning on adding code for this KEP in 5 different SIGs? Or is this just leftover from the working group?
@kannon92, many of these SIGs would be consumers of this KEP. That is, I'd expect there will be code managed by those SIGs that uses this lifecycle framework, most likely in the form of a driver.
But is that in scope of this KEP? I.e., does sig-cli have code you are planning to add as part of this API?
Or will they eventually use the API?
Most SIGs on this list will eventually use the API. The work to consume these APIs would be delivered in later KEPs, so those SIGs are listed as reviewers. For sig-cli, this KEP would deliver code to enhance kubectl drain. sig-node is the other SIG that will have code delivered.
Usually for a KEP we want to know which SIGs will own the code you are changing.
It may be hard to get reviews from each tech lead on this list for an alpha implementation. I would suggest limiting this to the actual API and whoever owns these changes; consumers are future KEPs, IMO.
Or, if you want sig-apps to leverage this, you could consider how their controllers would interact with this API.
You are calling out the kubelet and kube-apiserver as the major components for this change, so I expect maybe one or two SIGs to be required for review.
force-pushed from 7df99eb to 7791142
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rthallisey.
force-pushed from 7791142 to 68f107b
force-pushed from 909163c to a5c9cd7
force-pushed from a5c9cd7 to a0fae7a
Signed-off-by: Ryan Hallisey <rhallisey@nvidia.com>
force-pushed from a0fae7a to 7c3a565
@rthallisey: The following test failed.
> Create the `LifecycleTransition` and `LifecycleEvent` APIs. Here's a sample for doing Node drain:

```yaml
apiVersion: v1alpha1
kind: LifecycleTransition
```
How do we reconcile multiple LifecycleTransition objects for the same resource?
This object doesn't have any status. It's used for accounting purposes, to advertise lifecycle driver capabilities, so I wouldn't expect anything to reconcile it.
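Assembling the fragments quoted across this review (start/end, sla, allNodes, driver), a complete sample might look like the following. The exact layout is a guess, since the full manifest is collapsed in the diff view:

```yaml
# Illustrative reconstruction from fragments quoted in this review thread.
apiVersion: v1alpha1
kind: LifecycleTransition
metadata:
  name: node-drain
spec:
  start: DrainStarted   # reflected as a Condition on the target resource
  end: DrainComplete
  sla: 12h              # deadline for completing the transition
  allNodes: true
  driver: server_side_kubectl_drain.example.com
```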
> - Introduce the `LifecycleTransition` API to express intent for lifecycle state changes that require external coordination
Does this need to be its own API? For the various objects that need lifecycle management, is it simpler to use those object's specs as the source of truth for what lifecycle state the object is transitioning into?
I'm going to broaden your question: should resource APIs have dedicated spec/status fields for lifecycle desired/current state? Maybe. The biggest obstacle is defining a state machine capable of handling a full resource lifecycle.
My approach is to slowly reserve these lifecycle states over time. So I'm starting with something intentionally low cost, Conditions, which is enough to prove the thesis.

> Does this need to be its own API?

Regarding the LifecycleTransition API, we still need something to declare ownership and intent: who is reconciling this lifecycle state, and what is being reconciled?
```yaml
apiVersion: v1alpha1
kind: LifecycleEvent
```
Why is this different from the Events API?
This is a binding object: it tracks whether something is working on the LifecycleTransition and who is working on it.
This object is very similar to the ResourceClaim in DRA - https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#terminology.
> Existing techniques are also imprecise for complex lifecycle management. A Label is extensible in that it is an arbitrary string, but it is limited in its ability to express the state of a multi-node remediation. Such an expression would likely require a structured API with embedded fields.
> CRDs are an often-used technique [^1], but they also have limitations. CRDs give the user a structured and extensible API, but they cannot capture all the necessary lifecycle state. Certain states exist on the objects themselves and must be mirrored back to the CRD - e.g., a Node is `NotReady`. This leads to CRDs supporting end-user business logic and states, limiting their ecosystem reusability.
How motivated are we by reusability? IMO, one of the benefits of single-tenant CRDs is precisely that they are single tenant: maintainers of projects and end users don't need to reason about interoperability failures.
Without a native lifecycle API in K8s, end users often create their own lifecycle solutions. The list in the footnote is nowhere near exhaustive.
Having reusability as a requirement brings on more challenges, but I think it's worth taking those on.