KEP-5683: Specialized Lifecycle Management #5769
rthallisey wants to merge 1 commit into kubernetes:master
Conversation
force-pushed from 43adeb5 to 6c465f2
force-pushed from 6c465f2 to e3deff1
alaypatel07 left a comment:
+1. The KEP is strictly scoped to things that can be achieved in a single release. Having been following this space, this seems like a great start. Added some questions inline.
```go
// LifecycleEvent represents a binding between a LifecycleTransition
// and the Kubelet/Driver responsible for executing it.
type LifecycleEvent struct {
```
I have some clarifying questions for this API:
- Is this a namespace-scoped or cluster-scoped API? What kinds of users should have access to it? It seems only an admin-like persona, or someone with privileges high enough to disrupt workloads, should have access.
- How long do we expect objects to live? Do we have additional use cases, such as looking up whether a node reboot was triggered by a lifecycle event or by something else?
- Doing some napkin math, in a 100-node cluster, how many such objects do we expect to be present in apiserver/etcd?
- Looking at the design, it seems this API is tied to the Node. Is that intended? If a lifecycle event is for some arbitrary object where no kubelet is present, how will binding work in that case?
> Is this a namespace scoped or cluster scoped API?

I've been going back and forth on this because there are use cases for both. I'll make it clearer in the design.

> How long do we expect objects to live?

Depends on the SLA. From my experience, that's hours to days.

> Doing some napkin math, in a 100 node cluster, how many such objects do we expect to be present in apiserver/etcd?

<5% of Nodes being unhealthy is generally what I've come across, so around 5 objects.
> Looking at the design it seems this API is tied to node? Is that intended? If this lifecycle event is for some arbitrary object, assuming that kubelet is not present, how will binding work in that case?

What scenario would require lifecycling something that doesn't have a kubelet? The only case I can come up with is a NetworkSwitch or a DPU, but that device would still be attached to a Node and have a driver capable of reconciling it.
I think I'd need some more context.
> What scenario would require lifecycling something that doesn't have a kubelet? The only case I can come up with is a NetworkSwitch or a DPU, but that device would still be attached to a Node and have a driver capable of reconciling it.

I was thinking more of a device on a node, but in an offline discussion we agreed that even in the case of a device, it could be tied back to a node.
It is also about fulfilling the contract when the kubelet is not running on the node: please see #5769 (comment).
```go
// Start identifies the initial state of the lifecycle transition.
// This value is reflected as a Condition on the target K8s resource API.
Start string
```
Do we have specific states that we intend to allow? The current API allows for doing unexpected things. For example, say a user wants to save cost via this API: they define a start state of Suspend, an end state of Resume, and an SLA of 48h. While this is a legitimate use case, it is unclear whether such unexpected uses will be allowed/supported.
I modeled the LifecycleTransition after the ResourceSlice in DRA, so the kubelet will fill out these fields with whatever a Lifecycle Driver says it can support. That way, only vendors define these transitions, not users.
> - DrainStarted to DrainComplete
> - Uncordoning to MaintenanceComplete
> - The _DrainStarted_ to _DrainComplete_ transition will run `kubectl drain --ignore-daemonsets --timeout $SLA`
> - The _Uncordoning_ to _MaintenanceComplete_ transition will uncordon the Node
This creates an interesting edge case in the Node object.
Until now, the Node would be considered ready only if the Ready condition on it was present. With this, do we expect users to check for the MaintenanceComplete condition as well as Ready? Clients who don't upgrade will recognize Node Ready but not the MaintenanceComplete state.
I don't expect there will be a change for the majority of users. A Node change impacting user capacity still means the Node should be cordoned by the admin. These new conditions aren't meant to replace the Ready condition, taints, or other commonly used techniques; they provide additional lifecycle detail if the user is interested in it. It's an opt-in model.
I think the API portion of the KEP is moving in the right direction towards solving the problem domain. However, I can see the following disadvantages to using the Kubelet to facilitate communication (based on the DRA model).

Overall, I think it would be better for most of this communication to flow through the API server. I understand that the kubelet approach offers certain benefits, such as security, Node status ownership, and fault tolerance (the ability to continue node maintenance if the network is down). I believe we can achieve the same with the API-server approach plus additional enhancements (even kubelet-oriented ones).
force-pushed from e3deff1 to 4e326dd
I think we agree both the API-server approach and the kubelet approach are valid. Each can support use cases the other cannot: the API-server approach can track RMA or hardware in the factory, while the kubelet approach can do local storage and device cleanup. I've included "lifecycle controllers" in the diagram specifically to acknowledge that the API-server path remains open.

The primary challenge with the API-server approach is the lack of a strict coordination boundary. Drivers would operate in a pull model, requiring broad read/write access to both the LifecycleEvent and the target objects to function. This significantly expands the RBAC footprint and blurs the lines of ownership. From a vendor perspective, this is problematic. One of the kubelet approach's strengths is that it provides a defined interface for vendors to plug into without needing full access to the cluster state.

That said, I'm of the opinion we should enable both approaches, but I think your point is about which of these should be built in-tree. For the API-server approach, we would likely need a separate KEP to define a "strict handshake" mechanism that limits driver scope. The current kubelet-oriented design provides that boundary today, so it wouldn't be blocked. This approach also doesn't preclude anyone from implementing an API-driven flow out-of-tree; if admins are willing to hand out such permissions, it can be done that way. That's my opinion; maybe there are others out there. We should continue to discuss.
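To make the RBAC-footprint concern concrete, a pull-model driver talking directly to the API server would need something like the following ClusterRole. The API group and resource names here are assumptions for illustration, not part of the KEP:

```yaml
# Hypothetical ClusterRole for an API-server-driven lifecycle driver under a
# pull model. Broad read/write on both the lifecycle objects and the target
# resources is what blurs ownership; the kubelet path avoids granting this.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: lifecycle-driver-pull-model
rules:
- apiGroups: ["lifecycle.k8s.io"]   # assumed API group
  resources: ["lifecycleevents", "lifecycletransitions"]
  verbs: ["get", "list", "watch", "update"]
- apiGroups: [""]
  resources: ["nodes"]              # plus every other target resource type
  verbs: ["get", "list", "watch", "patch"]
```

Under the kubelet-bound design, a vendor driver instead talks only to the local kubelet interface and never needs cluster-wide grants like these.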
The controller talking to the API server can also run on the node and do the local cleanup.
As far as I can tell, the API and the claiming process are currently designed/scoped for the kubelet only and do not permit an external controller. The API should be designed not to clash with the kubelet if we want to support external driver selection. Ideally, there would also be an in-tree controller implementing the claiming and external driver discoverability/registration. To ensure the API can evolve with additional future capabilities, it would be helpful to expand the use cases and user stories.
lmktfy left a comment:
People might want to process a set of nodes, for example draining and then rebooting at most two nodes at a time.
Does this proposal allow for that approach?
> Over the past several years, it has become a common pattern to use Taints, PDBs, Labels, and Annotations to coordinate lifecycle operations between controllers. While this technique is flexible and easy to use, it is brittle, has little reusability across projects, and limited upside. Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging, overwhelming these commonly used techniques.
> This KEP proposes establishing a lifecycle management framework. A standardized, declarative API for coordinating lifecycle management that follows the architectural patterns of PersistentVolumeClaims (PVC) and Dynamic Resource Allocation (DRA). This API would:
Can I use this to define a lifecycle for arbitrary resources (eg Namespace, ClusterRoleBinding, ConfigMap) or just for Nodes?
The design decision to use a meta-object like Conditions for tracking transitions was to leave room for expansion. You can therefore use this feature for lifecycling arbitrary resources; however, I've scoped this KEP to only Nodes because those use cases are well-defined.
```yaml
  end: DrainComplete
  sla: 12h
  allNodes: true
  driver: server_side_kubectl_drain.example.com
```
Suggested change:

```yaml
  driver: example.com/server-side-node-drain
```

If it's server-side, it's not kubectl doing the drain.
> The `LifecycleTransition` object is heavily inspired by the DRA `ResourceSlice` object. We would use the same methods for the same reasons described in the [publishing-node-resources](https://github.com/pohly/enhancements/blob/624bec4521a2ad67642bebd315006623f9bd66a3/keps/sig-node/4381-dra-structured-parameters/README.md#publishing-node-resources) section of the dra-structured-parameters KEP.
> ### Driver Registration
> A user will create their own specialized Lifecycle Driver that runs as a Daemonset and registers with the Kubelet through the plugin manager interface. The Driver will register two functions: `StartLifecycleTransition(...)` and `EndLifecycleTransition(...)`, each corresponding to the start and end fields from the `LifecycleTransition` spec. It will also register its name and the start + end transitions it will be responsible for.
Does it have to be a DaemonSet? If so, why?
The design leverages the Kubelet for binding and gRPC to a local Pod, similar to the PVC/DRA architecture, so drivers that consume that pattern would need to be DaemonSets.
If a driver uses a custom implementation of binding, it does not need to be a DaemonSet.
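A minimal sketch of that DaemonSet-based deployment, mirroring how DRA drivers surface a registration socket to the kubelet plugin manager. The image name is hypothetical and the mount path is an assumption based on the standard kubelet plugin registry:

```yaml
# Illustrative DaemonSet skeleton for a kubelet-bound Lifecycle Driver.
# The driver registers via the kubelet plugin registry, similar to DRA drivers.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-lifecycle-driver
spec:
  selector:
    matchLabels:
      app: example-lifecycle-driver
  template:
    metadata:
      labels:
        app: example-lifecycle-driver
    spec:
      containers:
      - name: driver
        image: example.com/lifecycle-driver:v0.1.0   # hypothetical image
        volumeMounts:
        - name: plugin-registry
          mountPath: /var/lib/kubelet/plugins_registry
      volumes:
      - name: plugin-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
```

Running one Pod per Node is what lets each driver instance serve the start/end transition callbacks for its local kubelet.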
What puzzles me is: let's say we want to extend this beyond Node, and maybe we decide we want advanced lifecycle management for PersistentVolume.
A design that expects a local driver won't be appropriate there, because you can't execute code on storage.
It's not required to have a local driver; drivers need to be local only when the admin uses the Kubelet for binding.
The LifecycleEvent and LifecycleTransition objects are building blocks. When the Kubelet can't be relied on for binding (for whatever reason), the admin can fall back to a custom binding solution. That could look like a dedicated controller binding Events to drivers in its own way.
> ### Names
> - `LifecycleTransition` - The specification that encapsulates a single, complete lifecycle transition (start to end) within a defined location and time period
> - `LifecycleEvent` - A binding API, used to indicate ownership of the active `LifecycleTransition` by the Kubelet
Would a LifecycleEvent have a spec? If it doesn't, what intent is it recording or defining?
Maybe we're really thinking of a NodeLifecycleTransition?
A LifecycleEvent would have a spec. Since it's a binding object, it references back to the LifecycleTransition and records which Kubelet (i.e., which Node) can claim the event.
A LifecycleTransition is meant to extend to any K8s resource API. With that in mind, much of the writing for this KEP is scoped to Node, since those use cases are well-defined.
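Based on that description, a LifecycleEvent might look roughly like the following. Only the back-reference and the Node reference come from the discussion here; every field name is an assumption:

```yaml
# Hypothetical LifecycleEvent sketch, not the KEP's actual schema.
apiVersion: v1alpha1
kind: LifecycleEvent
metadata:
  name: worker-1-drain
spec:
  transitionName: node-drain   # back-reference to the LifecycleTransition
  nodeName: worker-1           # which Kubelet may claim this event
status:
  claimedBy: server_side_kubectl_drain.example.com  # recorded once a driver claims it
```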
OK, but see #5769 (comment).
If this is just for Node, it feels like we are overcomplicating it.
If it is general, we are assuming the existence of local compute in a way that may prove problematic.
```yaml
spec:
  start: DrainStarted
  end: DrainComplete
  sla: 12h
```
Avoid calling this a service level agreement. Maybe a transitionActiveDeadline?
@lmktfy, yes. A person can use these building blocks to make a tool that lifecycles groups of Nodes.
force-pushed from 4e326dd to 7df99eb
> ## Summary
>
> Over the past several years, it has become a common pattern to use Taints, PDBs, Labels, and Annotations to coordinate lifecycle operations between controllers. While this technique is flexible and easy to use, it is brittle, has little reusability across projects, and limited upside. Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging, overwhelming these commonly used techniques.
> Now with new workloads entering the Kubernetes ecosystem through the introduction of Dynamic Resource Allocation (DRA), the day-2 lifecycle will get even more challenging

Could we defend this statement? It's not obvious how the introduction of DRA makes day-2 ops more challenging. (Perhaps we are suggesting that "legacy" strategies for managing node lifecycle, taints, etc., don't always work well with DRA?)
The strategies of taints, etc. have not worked out well for managing node lifecycle in general. And DRA decomposes a Node, adding new primitives that users would lifecycle (e.g. devices).
My argument is that we shouldn't extend these lifecycle strategies to DRA; we should go in a different direction. Lifecycling a multi-device, multi-node Job requires the utmost precision, much more than the average CPU application. It raises questions like: how do we lifecycle a single failing GPU without evicting a 50-node distributed training job? Current taint/drain patterns are too blunt for this.
Please ensure @pwschuurman is included as one of the reviewers. cc @yujuhong @wangzhen127 for visibility.
```yaml
  - "@rthallisey"
owning-sig: sig-node
participating-sigs:
  - sig-apps
```
This section usually describes the impact of this KEP on these SIGs.
Are you planning on adding code for this KEP in 5 different SIGs? Or is this just leftover from the working group?
@kannon92, many of these SIGs would be consumers of this KEP. That is, I'd expect there will be code managed by those SIGs that uses this lifecycle framework, most likely in the form of a driver.
But is that in scope of this KEP? I.e., does sig-cli have code you are planning to add as part of this API?
Or will they eventually use the API?
Most SIGs on this list will eventually use the API. The work to consume these APIs would be delivered in later KEPs, so those SIGs are listed as reviewers. For sig-cli, this KEP would deliver code to enhance kubectl drain. sig-node is the other SIG that will have code delivered.
Usually for a KEP we want to know which SIGs will own the code you are changing.
It may be hard to get reviews from each tech lead on this list for an alpha implementation. I would suggest limiting this to the actual API and whoever owns these changes; consumers are future KEPs, IMO.
Or, if you want sig-apps to leverage this, you could consider how their controllers would interact with this API.
You are calling out the kubelet and kube-apiserver as the major components for this change, so I expect maybe one or two SIGs to be required for review.
force-pushed from 7df99eb to 7791142
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: rthallisey.
force-pushed from 7791142 to 68f107b
force-pushed from 909163c to a5c9cd7
force-pushed from a5c9cd7 to a0fae7a
Signed-off-by: Ryan Hallisey <rhallisey@nvidia.com>
force-pushed from a0fae7a to 7c3a565
@rthallisey: The following test failed.
> Create the `LifecycleTransition` and `LifecycleEvent` APIs. Here's a sample for doing Node drain:

```yaml
apiVersion: v1alpha1
kind: LifecycleTransition
```
How do we reconcile multiple LifecycleTransition objects for the same resource?
This object doesn't have any status. It's used for accounting purposes, to advertise lifecycle driver capabilities, so I wouldn't expect anything to reconcile it.
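Assembling the fragments quoted across this review (start/end, sla, allNodes, driver), a complete sample might look like the following. The exact layout is a guess, since the full manifest is collapsed in the diff view:

```yaml
# Illustrative reconstruction from fragments quoted in this review thread.
apiVersion: v1alpha1
kind: LifecycleTransition
metadata:
  name: node-drain
spec:
  start: DrainStarted   # reflected as a Condition on the target resource
  end: DrainComplete
  sla: 12h              # deadline for completing the transition
  allNodes: true
  driver: server_side_kubectl_drain.example.com
```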
> - Introduce the `LifecycleTransition` API to express intent for lifecycle state changes that require external coordination
Does this need to be its own API? For the various objects that need lifecycle management, is it simpler to use those object's specs as the source of truth for what lifecycle state the object is transitioning into?
I'm going to broaden your question: should resource APIs have dedicated spec/status fields for lifecycle desired/current state? Maybe. The biggest obstacle is defining a state machine capable of handling a full resource lifecycle.
My approach is to slowly reserve these lifecycle states over time. So I'm starting with something intentionally low cost, Conditions, which is enough to prove the thesis.

> Does this need to be its own API?

Regarding the LifecycleTransition API, we still need something to declare ownership and intent: who is reconciling this lifecycle state, and what is being reconciled?
```yaml
apiVersion: v1alpha1
kind: LifecycleEvent
```
Why is this different from the Events API?
This is a binding object: it tracks whether something is working on the LifecycleTransition and who is working on it.
This object is very similar to the ResourceClaim in DRA - https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#terminology.
> Existing techniques are also imprecise for complex lifecycle management. A Label is extensible in that it is an arbitrary string, but it is limited in its ability to express the state of a multi-node remediation. Such an expression would likely require a structured API with embedded fields.
> CRDs are an often-used technique [^1], but they also have limitations. CRDs give the user a structured and extensible API, but they cannot capture all the necessary lifecycle state. Certain states exist on the objects themselves and must be mirrored back to the CRD - e.g., a Node is `NotReady`. This leads to CRDs supporting end-user business logic and states, limiting their ecosystem reusability.
How motivated are we by reusability? IMO, one of the benefits of single-tenant CRDs is precisely that they are single tenant: maintainers of projects and end users don't need to reason about interoperability failures.
Without a native lifecycle API in K8s, end users often create their own lifecycle solutions. The list in the footnote is nowhere near exhaustive.
Having reusability as a requirement brings on more challenges, but I think it's worth taking those on.