KEP-5647: Initial KEP for Stale Controller Detection #5649
base: master
Conversation
michaelasp commented on Oct 9, 2025
- One-line PR description: Add initial KEP for stale controller metrics
- Issue link: Stale Controller Detection and Mitigation #5647
- Other comments:
/cc @serathius
(force-pushed from fbf7156 to 84877d0)
a principled approach with rules and guardrails.

See:
As Kubernetes increases in scale, greater pressure is being put on all
While staleness usually shows up at scale, it's not exclusive to it. It's a consequence of building reconciliation on the watch protocol, which is eventually consistent. There is no guarantee of how far behind the watch is; the next event might arrive in a second, or in an hour. The problem is that we currently don't have any way to measure it.
Makes sense; removing the section about scale and just talking generally about eventual consistency.
components. One issue that arises is controllers falling out of sync with the
apiserver as they cannot keep up. When this happens, a controller may act on
information that is old without realizing and get stuck in an irrecoverable
state where it keeps trying to act on the world and does not see its own writes
I would expand this more: the problem of a controller acting on outdated information can, in the best case, cause write conflicts that increase error rates, or, in the worst case, invalid behavior like duplicated objects that overwhelm the control plane.
Done!
- Prevent flaking, unreliable tests
- Ensure result reporting is structured
- Must not impact the conformance test suite
The goal of this KEP is to add a set of metrics that we define for a certain set
Let's expand the goals: we want to be able to measure the staleness of controller reconciliation so that administrators and controllers can take action if a threshold is reached.
- Enable completely arbitrary checks
- Targeting integration tests.
- We are specifically aiming for end to end tests for this purpose.
We also will focus on metrics in this KEP and not propose solutions to the
I don't think we should have a KEP for just a metric; when I suggested cutting scope to metrics, I meant that it's a good first step and we can expand on it.
Ack, added metrics and detection as the two parts of the overall KEP.
#### Story 1

How will UX be reviewed, and by whom?
I am a cluster administrator, I want to be able to check my metrics and see if
Again, overloading is not the only cause of staleness, and increasing resources is not the only mitigation. I would skip describing a specific mitigation and just focus on being able to monitor and alert on the issue so an oncall can take action.
Yep, just focused on staleness monitoring and mitigation.
If implemented poorly, this could result in tests flaking in any number of e2e
test CI jobs that are now running these tests.
I am a user and am trying to optimize my workloads, I look at my usage patterns
Not sure I understand the story. What can the user do here?
Yeah, removed this for now; it's too similar to the admin story and there's not a great story here.
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.
### User Stories (Optional)
Please add a story for a controller developer who wants to ensure reliable behavior of a controller regardless of watch delay.
to flaking invariant tests in a timely fashion will result in demoting or
removing them.
```
&compbasemetrics.HistogramOpts{
```
I don't think an exact metric definition is needed here; the important parts are describing the thresholds we want to detect and the sampling resolution (buckets) we need to detect them.
When working on kubernetes/kubernetes#123448 I used the following watch latency thresholds to distinguish the state of a watch:
- < 100ms: GREAT
- < 1s: GOOD
- < 10s: SLOW, but acceptable for large clusters
- > 10s: STALE
They might need to be adjusted for controller reconciliation, which is further up the stack, and we can use those values to decide how often we should sample the RV and which metric buckets we should pick.
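For concreteness, here is what a bucket layout derived from those thresholds might look like. This is a minimal sketch, not the KEP's definition: the subsystem, metric name, and labels are assumptions.

```
package metrics

import (
	compbasemetrics "k8s.io/component-base/metrics"
)

// controllerStaleness buckets are chosen so that each threshold above
// (100ms, 1s, 10s) falls on a bucket boundary, with extra resolution in
// between and a tail to distinguish "stale" from "very stale".
var controllerStaleness = compbasemetrics.NewHistogramVec(
	&compbasemetrics.HistogramOpts{
		Subsystem:      "controller_manager",
		Name:           "reconciliation_staleness_seconds", // hypothetical name
		Help:           "Delay between the newest resource version observed by a controller's informer and the current resource version sampled from the apiserver.",
		Buckets:        []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60},
		StabilityLevel: compbasemetrics.ALPHA,
	},
	[]string{"controller", "resource"},
)
```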
Added to the design section; we can iterate off that.
If not run in many CI jobs, there will be limited benefit to the signal.
We will aim to generally introduce these as default selected tests.

By adding a new prober like this, we introduce more APIServer requests to the
Risk: measuring latency on watch requires knowing the current RV in etcd, which means periodically polling the RV from the apiserver. Making a LIST request every X seconds for each controller/resource is too much load.
Mitigation: start with only pods for the KCM controllers, as the most problematic case.
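One way to keep the per-sample cost down: cap the LIST at a single object, since the response's ListMeta still carries the collection's current resourceVersion, which is all the prober needs. A minimal sketch under that assumption (the function name is hypothetical):

```
package prober

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// sampleCurrentRV issues a LIST capped at one pod and returns the
// collection-level resourceVersion from the response's ListMeta.
func sampleCurrentRV(ctx context.Context, client kubernetes.Interface) (string, error) {
	list, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
	if err != nil {
		return "", err
	}
	return list.ResourceVersion, nil
}
```

Even so, this is one quorum read per resource per sampling interval, which is why starting with only pods is a sensible scope cut.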
KCM controllers and report when they have not been updated for a certain amount
of time. We do this by adding probers into common codepaths that run into
scalability limits, such as statefulsets, and expose metrics that give an idea of
how out of sync a controller is with the apiserver itself.
Maybe mention your KEP for comparing RVs; with it we now finally have a way to measure the delay experienced by controllers.
Added to motivation.
A shared system will be introduced to the e2e framework to enable this form of
testing.
We propose the exposure of several key metrics, a histogram metric for every
When trying to address an open issue like this, it's good to structure the proposal in 3 themes: Prevent, Detect, Mitigate.
Preventing controller staleness issues would require performance improvements, which would be outside the scope of this KEP. So it would be good to discuss Detect and Mitigate:
- Detect: monitor controller reconciliation delay to allow administrators to configure alerting and act when staleness appears. Solved by adding a metric.
- Mitigate: improve controller resilience to staleness by preventing reconciliation on stale data. Here the idea proposed by @liggitt is to not sync when the actions from the previous attempt have not been observed (see the sketch after this list).
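A minimal sketch of that Mitigate idea, assuming hypothetical names and numeric RV comparison (an etcd implementation detail the real design would have to guard):

```
package controller

import "sync"

// Controller tracks, per object key, the resourceVersion returned by its
// own last write so that a sync can tell whether that write has been
// observed through the informer yet.
type Controller struct {
	mu            sync.Mutex
	lastWrittenRV map[string]uint64
}

// shouldSync returns false while the informer's view of the object is older
// than our last write; reconciling in that window would act on stale data.
func (c *Controller) shouldSync(key string, observedRV uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	last, ok := c.lastWrittenRV[key]
	return !ok || observedRV >= last
}

// recordWrite is called with the resourceVersion from a successful write
// response so the next sync can be gated on observing it.
func (c *Controller) recordWrite(key string, writtenRV uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastWrittenRV[key] = writtenRV
}
```

A controller would call shouldSync at the top of its sync loop and requeue with backoff while it returns false.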
Idea for Prevent: use the metric to measure the latency in K8s scalability tests, define thresholds to detect regressions, and define it as an SLO for the K8s project.
We don't need to design everything in detail for Alpha, just specify these as the steps needed to address the issue comprehensively. Let's propose a plan that makes high-level sense and tackle the first step.
We will then identify controllers (DaemonSet, StatefulSet, Deployment, ...)
that are the highest churn and most at risk of running stale and will compare
the current latest read resource version to the probers. We will run through the
prober's list of resource versions until we find the first object that is older
I don't understand which "first object" you mean.
Updated the description to be clearer; lmk if it makes more sense now.
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
Please structure the proposal into the high-level changes you want to make and only go into details when needed. For me there are 3 changes (see the sketch after this list):
- Sample an RV from the apiserver: a periodic loop that requests LIST on a resource to get the current RV and stores it in a queue.
- Update the informer code to store the latest RV.
- For a set of controllers, measure the latency by comparing the RV returned by the pod informer to the RV sampled from the apiserver.
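To illustrate the comparison in the last bullet (type and function names are hypothetical): the sampler keeps a time-ordered queue of (RV, timestamp) pairs, and the measured staleness is the age of the oldest sample the informer has not yet caught up to. Comparing RVs numerically matches etcd today but is, again, an implementation detail to guard.

```
package prober

import "time"

// rvSample is one entry in the sampler's queue: the resourceVersion read
// from the apiserver and when it was read.
type rvSample struct {
	rv        uint64
	sampledAt time.Time
}

// stalenessOf returns a lower bound on how far behind the informer is: the
// age of the oldest sampled RV the informer has not yet observed, or zero
// if it has caught up with every sample.
func stalenessOf(samples []rvSample, informerRV uint64, now time.Time) time.Duration {
	for _, s := range samples { // ordered oldest to newest
		if s.rv > informerRV {
			return now.Sub(s.sampledAt)
		}
	}
	return 0
}
```

The result would be observed into the histogram on each measurement tick, and samples at or below the informer's RV can be dropped from the queue.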
Yeah, essentially the same as my update: the prober samples the RVs on high-churn resources and the informer stores its latest RV. Made it much higher level; lmk what you think.
title: Remove gogo protobuf dependency
kep-number: 5589
title: Stale Controller Handling
kep-number: 5647
I think the KEP number should come from the number of the issue used to track progress across releases, not the PR number. Please open an enhancement tracking issue.
- sig-architecture
status: implementable
creation-date: 2025-09-29
status: provisional
Status provisional is not enough to start work on this in the release. We could quickly merge it as provisional, but it would need another iteration to get it to "implementable".
Keeping it provisional; we have agreement that these initial fixes are not tied to the KEP.
Please add me as PRR reviewer (in the requisite files) and I'll do a pass.
(force-pushed from 84877d0 to df20fe6)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: michaelasp. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
(force-pushed from 314d589 to 4af5987)
(force-pushed from 4af5987 to 77435fe)