KEP-5055: DRA device taints: document state in 1.35 and plan beta in 1.36#5716
Conversation
|
Skipping CI for Draft Pull Request. |
e92153d to
725970b
Compare
| Automated upgrade/downgrade testing verifies that: | ||
| - A DeviceTaintRule created before a downgrade prevents pod scheduling after a downgrade. | ||
| - A pod which gets scheduled because of a toleration after the downgrade | ||
| is kept running after an upgrade. |
There was a problem hiding this comment.
I've written this in present tense ("verifies") because I am assuming that it will be true soon, maybe even before this PR gets merged.
However, it is not true yet right now. I'm currently working on rewriting the upgrade/downgrade tests in kubernetes/kubernetes#135664 and will add the test case described here soon.
There was a problem hiding this comment.
I have it implemented and noticed one small difference: the KEP asks about "upgrade->downgrade->upgrade" (three cluster changes). The automated test only does "upgrade->downgrade" (two). Does this really matter?
Maybe, but as I cannot image what else can go wrong for "upgrade->downgrade->upgrade" that doesn't already go wrong for just "upgrade" I am not sure what I should test for.
My simplified test does:
- A pod which gets scheduled on the previous release because of a toleration is kept running after an upgrade.
- A DeviceTaintRule created to evict the pod before a downgrade prevents pod scheduling after a downgrade.
There was a problem hiding this comment.
Has this been completed? Looks that way since kubernetes/kubernetes#135664 has merged. Asking both to check whether this comment should be updated to reflect the current state and if this has been updated so we can use the pattern for other features.
There was a problem hiding this comment.
kubernetes/kubernetes#135664 is just preparation, and it had to be reverted. New attempt in kubernetes/kubernetes#136156.
The test described here is pending in kubernetes/kubernetes#135732, so not fully completed yet.
Asking both to check whether this comment should be updated to reflect the current state and if this has been updated so we can use the pattern for other features.
I'm struggling with how to handle this in KEPs. If we write this in future tense during the KEP planning period before KEP freeze, then we express the intent. I'm skeptical about such a KEP being updated later after the work has been done (low priority work never gets done, plus it needs another review round...).
If we write this in present tense, it's not quite correct yet, but the same is true for "stage: beta", which we also need to merge now. I'm leaning towards present tense and then during the final beta promotion PR checking that everything that was meant to be done really is done.
The same problem occurs further down when it asks for proof that tests are executed. That only works for tests which were already implemented in the previous release.
Ultimately the problem is that the KEP review process is trying to verify that a feature is ready for promotion at a time when that verification cannot be done completely yet - IMHO.
There was a problem hiding this comment.
At this point in time (given that kubernetes/kubernetes#136156 merged, and kubernetes/kubernetes#135732 is still WIP) it's ok to leave it as is. But after the other merges it's probably best to link to these tests.
There was a problem hiding this comment.
So do another update of the KEP README.md in a future PR?
liggitt
left a comment
There was a problem hiding this comment.
updates reflecting API decisions made during 1.35 lgtm
|
|
||
| Usage of DeviceTaintRules can be seen in the apiserver's | ||
| `apiserver_resource_objects` metric with labels `group=resource.k8s.io` and | ||
| `resource=deviceTaintRules`. |
There was a problem hiding this comment.
| `resource=deviceTaintRules`. | |
| `resource=devicetaintrules`. |
There was a problem hiding this comment.
Nit, but it would be good to fix it.
There was a problem hiding this comment.
Agreed, I was merely waiting for more change requests.
Updated now:
|
/approve |
|
/assign |
|
/wg device-management |
|
/assign @soltysh For PRR review/approval. |
| Automated upgrade/downgrade testing verifies that: | ||
| - A DeviceTaintRule created before a downgrade prevents pod scheduling after a downgrade. | ||
| - A pod which gets scheduled because of a toleration after the downgrade | ||
| is kept running after an upgrade. |
There was a problem hiding this comment.
At this point in time (given that kubernetes/kubernetes#136156 merged, and kubernetes/kubernetes#135732 is still WIP) it's ok to leave it as is. But after the other merges it's probably best to link to these tests.
|
|
||
| Usage of DeviceTaintRules can be seen in the apiserver's | ||
| `apiserver_resource_objects` metric with labels `group=resource.k8s.io` and | ||
| `resource=deviceTaintRules`. |
There was a problem hiding this comment.
Nit, but it would be good to fix it.
|
|
||
| ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
| As for normal pod scheduling of pods using ResourceClaims there is no SLO for | ||
| scheduling with taints. |
There was a problem hiding this comment.
Is it much different from the default scheduling SLOs?
There was a problem hiding this comment.
We only have SLOs for "stateless pods" - in particular, no volumes, no extra devices. As soon as those enter the picture, it becomes impossible to predict what additional latency might be introduced by the additional, non-Kubernetes components.
One could argue that pure scheduling should have an SLO, but even that then depends on the complexity of the ResourceSlices published by driver(s).
|
|
||
| - 1.33: first KEP revision and implementation | ||
| - 1.35: revised alpha with `effect: None` and DeviceTaintRule status | ||
|
|
There was a problem hiding this comment.
Nit: missing entry for 1.36 update.
There was a problem hiding this comment.
1.36 isn't "history" yet?
I can add "- 1.36: graduation to beta (tentative)" if you want something.
There was a problem hiding this comment.
I always treat it as history when you start working on something within given timeline 😉
725970b to
6c37484
Compare
|
|
||
| - 1.33: first KEP revision and implementation | ||
| - 1.35: revised alpha with `effect: None` and DeviceTaintRule status | ||
|
|
There was a problem hiding this comment.
I always treat it as history when you start working on something within given timeline 😉
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: macsko, pohly, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold cancel |
One-line PR description: DRA device taints: document state in 1.35 and plan beta in 1.36
Issue link: DRA: device taints and tolerations #5055
Other comments:
The actual implementation in 1.35 was a bit different than planned. This gets reflected here, plus the changes for beta.