KEP-5647: Initial KEP for Stale Controller Detection #5649
base: master
Conversation
michaelasp commented on Oct 9, 2025
- One-line PR description: Add initial KEP for stale controller metrics
- Issue link: Stale Controller Detection and Mitigation #5647
- Other comments:
/cc @serathius
(force-pushed from fbf7156 to 84877d0)
a principled approach with rules and guardrails.

See:
As Kubernetes increases in scale, greater pressure is being put on all
While staleness usually shows up at scale, it's not exclusive to it. It's a consequence of building reconciliation on the watch protocol, which is eventually consistent. There is no guarantee of how far behind the watch is; the next event might arrive in a second, or in an hour. The problem is that we currently don't have any way to measure it.
Makes sense; removing the section about scale and just talking generally about eventual consistency.
components. One issue that arises is controllers falling out of sync with the
apiserver as they cannot keep up. When this happens, a controller may act on
information that is old without realizing and get stuck in an irrecoverable
state where it keeps trying to act on the world and does not see its own writes
I would expand this more: the problem of a controller acting on outdated information can, in the best case, cause write conflicts that increase error rates, or, in the worst case, invalid behavior like duplicated objects that overwhelm the control plane.
Done!
- Prevent flaking, unreliable tests
- Ensure result reporting is structured
- Must not impact the conformance test suite
The goal of this KEP is to add a set of metrics that we define for a certain set
Let's expand the goals: we want to be able to measure the staleness of controller reconciliation so that administrators and controllers can take action if a threshold is reached.
- Enable completely arbitrary checks
- Targeting integration tests.
- We are specifically aiming for end to end tests for this purpose.
We also will focus on metrics in this KEP and not propose solutions to the
I don't think we should have a KEP for just a metric; when I suggested cutting scope to metrics, I meant that it's a good first step and we can expand on it.
Ack, added metrics and detection as the two parts of the overall KEP.
#### Story 1

How will UX be reviewed, and by whom?
I am a cluster administrator, I want to be able to check my metrics and see if
Again, overloading is not the only cause of staleness, and increasing resources is not the only mitigation. I would skip describing a specific mitigation and just focus on being able to monitor and alert on the issue so an oncall can take action.
Yep, just focused on staleness monitoring and mitigation.
If implemented poorly, this could result in tests flaking in any number of e2e
test CI jobs that are now running these tests.
I am a user and am trying to optimize my workloads, I look at my usage patterns
Not sure I understand the story. What can the user do here?
Yeah, removed this for now; it's too similar to the admin story and there's not a great story here.
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.
### User Stories (Optional)
Please add a story for a controller developer who wants to ensure reliable behavior of a controller regardless of watch delay.
to flaking invariant tests in a timely fashion will result in demoting or
removing them.
```
&compbasemetrics.HistogramOpts{
```
I don't think an exact metric definition is needed here; the important parts are describing the thresholds we want to detect and the sampling resolution (buckets) we need to detect them.
When working on kubernetes/kubernetes#123448 I used the following watch latency thresholds to distinguish the state of a watch:
- < 100ms: GREAT
- < 1s: GOOD
- < 10s: SLOW, but acceptable for large clusters
- > 10s: STALE
They might need to be adjusted for controller reconciliation, which is further up the stack, and we can use those values to decide how often we should sample the RV and which metric buckets we should pick.
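For concreteness, here is what a bucket layout derived from those thresholds might look like. This is a minimal sketch, not the KEP's definition: the subsystem, metric name, and labels are assumptions.

```
package metrics

import (
	compbasemetrics "k8s.io/component-base/metrics"
)

// controllerStaleness buckets are chosen so that each threshold above
// (100ms, 1s, 10s) falls on a bucket boundary, with extra resolution in
// between and a tail to distinguish "stale" from "very stale".
var controllerStaleness = compbasemetrics.NewHistogramVec(
	&compbasemetrics.HistogramOpts{
		Subsystem:      "controller_manager",
		Name:           "reconciliation_staleness_seconds", // hypothetical name
		Help:           "Delay between the newest resource version observed by a controller's informer and the current resource version sampled from the apiserver.",
		Buckets:        []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60},
		StabilityLevel: compbasemetrics.ALPHA,
	},
	[]string{"controller", "resource"},
)
```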
Added to the design section; we can iterate off that.
If not run in many CI jobs, there will be limited benefit to the signal.
We will aim to generally introduce these as default selected tests.

By adding a new prober like this, we introduce more APIServer requests to the
Risk: measuring latency on watch requires knowing the current RV in etcd, which means periodically polling the RV from the apiserver. Making a LIST request every X seconds for each controller/resource is too much load.
Mitigation: start with only pods for the KCM controllers, as the most problematic case.
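One way to keep the per-sample cost down: cap the LIST at a single object, since the response's ListMeta still carries the collection's current resourceVersion, which is all the prober needs. A minimal sketch under that assumption (the function name is hypothetical):

```
package prober

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// sampleCurrentRV issues a LIST capped at one pod and returns the
// collection-level resourceVersion from the response's ListMeta.
func sampleCurrentRV(ctx context.Context, client kubernetes.Interface) (string, error) {
	list, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
	if err != nil {
		return "", err
	}
	return list.ResourceVersion, nil
}
```

Even so, this is one quorum read per resource per sampling interval, which is why starting with only pods is a sensible scope cut.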
KCM controllers and report when they have not been updated for a certain amount
of time. We do this by adding probers into common codepaths that run into
scalability limits, such as statefulsets, and expose metrics that give an idea of
how out of sync a controller is with the apiserver itself.
Maybe mention your KEP for comparing RVs; with it we now finally have a way to measure the delay experienced by controllers.
Added to motivation.
A shared system will be introduced to the e2e framework to enable this form of
testing.
We propose the exposure of several key metrics, a histogram metric for every
When trying to address an open issue like this, it's good to structure the proposal in 3 themes: Prevent, Detect, Mitigate.
Preventing controller staleness issues would require performance improvements, which would be outside the scope of this KEP. So it would be good to discuss Detect and Mitigate:
- Detect: monitor controller reconciliation delay to allow administrators to configure alerting and act when staleness appears. Solved by adding a metric.
- Mitigate: improve controller resilience to staleness by preventing reconciliation on stale data. Here the idea proposed by @liggitt is to not sync when the actions from the previous attempt have not been observed (see the sketch after this list).
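A minimal sketch of that Mitigate idea, assuming hypothetical names and numeric RV comparison (an etcd implementation detail the real design would have to guard):

```
package controller

import "sync"

// Controller tracks, per object key, the resourceVersion returned by its
// own last write so that a sync can tell whether that write has been
// observed through the informer yet.
type Controller struct {
	mu            sync.Mutex
	lastWrittenRV map[string]uint64
}

// shouldSync returns false while the informer's view of the object is older
// than our last write; reconciling in that window would act on stale data.
func (c *Controller) shouldSync(key string, observedRV uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	last, ok := c.lastWrittenRV[key]
	return !ok || observedRV >= last
}

// recordWrite is called with the resourceVersion from a successful write
// response so the next sync can be gated on observing it.
func (c *Controller) recordWrite(key string, writtenRV uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastWrittenRV[key] = writtenRV
}
```

A controller would call shouldSync at the top of its sync loop and requeue with backoff while it returns false.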
Idea for Prevent: use the metric to measure the latency in K8s scalability tests, define thresholds to detect regressions, and define it as an SLO for the K8s project.
We don't need to design everything in detail for Alpha, just specify these as the steps needed to address the issue comprehensively. Let's propose a plan that makes high-level sense and tackle the first step.
We will then identify controllers (DaemonSet, StatefulSet, Deployment, ...)
that are the highest churn and most at risk of running stale and will compare
the current latest read resource version to the probers. We will run through the
prober's list of resource versions until we find the first object that is older
I don't understand which "first object" you mean.
Updated the description to be clearer; lmk if it makes more sense now.
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
Please structure the proposal into the high-level changes you want to make and only go into details when needed. For me there are 3 changes (see the sketch after this list):
- Sample an RV from the apiserver: a periodic loop that requests LIST on a resource to get the current RV and stores it in a queue.
- Update the informer code to store the latest RV.
- For a set of controllers, measure the latency by comparing the RV returned by the pod informer to the RV sampled from the apiserver.
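To illustrate the comparison in the last bullet (type and function names are hypothetical): the sampler keeps a time-ordered queue of (RV, timestamp) pairs, and the measured staleness is the age of the oldest sample the informer has not yet caught up to. Comparing RVs numerically matches etcd today but is, again, an implementation detail to guard.

```
package prober

import "time"

// rvSample is one entry in the sampler's queue: the resourceVersion read
// from the apiserver and when it was read.
type rvSample struct {
	rv        uint64
	sampledAt time.Time
}

// stalenessOf returns a lower bound on how far behind the informer is: the
// age of the oldest sampled RV the informer has not yet observed, or zero
// if it has caught up with every sample.
func stalenessOf(samples []rvSample, informerRV uint64, now time.Time) time.Duration {
	for _, s := range samples { // ordered oldest to newest
		if s.rv > informerRV {
			return now.Sub(s.sampledAt)
		}
	}
	return 0
}
```

The result would be observed into the histogram on each measurement tick, and samples at or below the informer's RV can be dropped from the queue.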
Yeah, essentially the same as my update: the prober samples the RVs on high-churn resources and the informer stores its latest RV. Made it much higher level; lmk what you think.
title: Remove gogo protobuf dependency
kep-number: 5589
title: Stale Controller Handling
kep-number: 5647
I think the KEP number should come from the number of the issue used to track progress across releases, not the PR number. Please open an enhancement tracking issue.
- sig-architecture
status: implementable
creation-date: 2025-09-29
status: provisional
Status provisional is not enough to start work on this in the release. We could quickly merge it as provisional, but it would need another iteration to get it to "implementable".
Keeping it provisional; we have agreement that these initial fixes are not tied to the KEP.
Please add me as PRR reviewer (in the requisite files) and I'll do a pass.
(force-pushed from 84877d0 to df20fe6)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: michaelasp. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
(force-pushed from 314d589 to 4af5987)
(force-pushed from 4af5987 to 77435fe)