Merged
Changes from 7 commits
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-scheduling/3094.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
kep-number: 3094
alpha:
approver: "@wojtek-t"
beta:
approver: "@wojtek-t"
Original file line number Diff line number Diff line change
@@ -494,7 +494,6 @@ enhancement:
- Previously configured values will be ignored.

### Version Skew Strategy
N/A

<!--
If applicable, how will the component handle version skew with other
@@ -509,6 +508,8 @@ enhancement:
CRI or CNI may require updating that component before the kubelet.
-->

kube-scheduler generally runs the same version as the API server, so no version skew strategy is needed.

## Production Readiness Review Questionnaire

<!--
@@ -555,7 +556,7 @@ Pick one of these and delete the rest.
Any change of default behavior may be surprising to users or break existing
automations, so be extremely careful here.
-->
No.
No, it's backwards compatible.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

@@ -565,13 +566,27 @@ feature, can it break the existing applications?).

NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
-->
Yes.
Yes, we can just disable the feature gate.

###### What happens if we reenable the feature if it was previously rolled back?
The policies are respected again.

###### Are there any tests for feature enablement/disablement?
No, appropriate unit tests will be added for Alpha.
Yes, both unit tests and integration tests have been added.
Member

Can you please link them here?

Member Author

Updated @wojtek-t

Member

Those aren't typical enablement/disablement tests; they test the feature while it is enabled and while it is disabled.
Enablement/disablement tests switch the feature gate on or off in the middle of the test.

On the scheduler side this seems good enough, because in the scheduler this is basically an "in-memory" feature.
So maybe please add something like:

"In the scheduler, this is an in-memory feature, so only tests checking the feature both enabled and disabled were added."

However, this KEP also introduces an API change, so a test similar to this one would be useful on the registry side:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282

Would you be able to add something like that?
I don't want to block this PR, so if you could state here that a strategy test will be added and add that to the beta graduation criteria, I would be fine.
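
To make the distinction concrete, here is a minimal, self-contained Go sketch of what a registry-side enablement/disablement check exercises. It is not the Kubernetes code: `Constraint`, `inUse`, and `dropDisabledFields` are hypothetical stand-ins, and the rule modeled (drop a gated field on write when the gate is off, unless the stored object already uses it) is an assumption based on the usual pattern for gated API fields.

```go
package main

import "fmt"

// Constraint is a hypothetical, stripped-down stand-in for a topology spread
// constraint; only the gated field matters for this sketch.
type Constraint struct {
	TopologyKey      string
	NodeTaintsPolicy *string // nil means "not set"
}

// inUse reports whether any constraint already sets NodeTaintsPolicy.
func inUse(cs []Constraint) bool {
	for _, c := range cs {
		if c.NodeTaintsPolicy != nil {
			return true
		}
	}
	return false
}

// dropDisabledFields clears NodeTaintsPolicy on the incoming object when the
// gate is off, unless the stored (old) object already uses the field.
func dropDisabledFields(gateEnabled bool, newCS, oldCS []Constraint) {
	if gateEnabled || inUse(oldCS) {
		return
	}
	for i := range newCS {
		newCS[i].NodeTaintsPolicy = nil
	}
}

func main() {
	honor := "Honor"
	mk := func() []Constraint {
		return []Constraint{{TopologyKey: "kubernetes.io/hostname", NodeTaintsPolicy: &honor}}
	}

	// Gate on: a create keeps the field.
	created := mk()
	dropDisabledFields(true, created, nil)
	fmt.Println("gate on, create keeps field:", created[0].NodeTaintsPolicy != nil)

	// Gate switched off mid-test: updating the stored object keeps the field.
	updated := mk()
	dropDisabledFields(false, updated, created)
	fmt.Println("gate off, update keeps field:", updated[0].NodeTaintsPolicy != nil)

	// Gate off: a fresh create drops the field.
	fresh := mk()
	dropDisabledFields(false, fresh, nil)
	fmt.Println("gate off, create keeps field:", fresh[0].NodeTaintsPolicy != nil)
}
```

The middle case is the one that a test of "feature enabled" plus a separate test of "feature disabled" would not cover, because it depends on state written before the gate was switched.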

Member Author

I added the test in kubernetes/kubernetes#112805, PTAL @wojtek-t

Member

@kerthcet - I don't want to block Beta on it. Can you please state here that such a test will be added (and add that to the beta graduation criteria)? I will approve the PRR then.

Member Author

Updated the enablement tests again. PTAL.

Member

@kerthcet the KEP freeze is tomorrow. Just update this PR to say that the tests will be added (linking to the PR is fine too).

Member Author

Updated.


Unit tests:

- pkg/api/pod/util_test.go#TestDropNodeInclusionPolicyFields
- pkg/scheduler/framework/plugins/podtopologyspread/filtering_test.go#TestPreFilterState
- pkg/scheduler/framework/plugins/podtopologyspread/filtering_test.go#TestSingleConstraint
- pkg/scheduler/framework/plugins/podtopologyspread/filtering_test.go#TestMultipleConstraints
- pkg/scheduler/framework/plugins/podtopologyspread/filtering_test.go#TestPreScoreStateEmptyNodes
- pkg/scheduler/framework/plugins/podtopologyspread/filtering_test.go#TestPodTopologySpreadScore

Integration tests:

- test/integration/scheduler/filters/filters_test.go#TestPodTopologySpreadFilter
- test/integration/scheduler/scoring/priorities_test.go#TestPodTopologySpreadScoring

<!--
The e2e framework does not currently support enabling or disabling feature
@@ -611,14 +626,87 @@ that might indicate a serious problem?
- A spike on failure events with keyword "failed spreadConstraint" in scheduler log.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
No. This will be tested upon beta graduation.

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

Not yet, but it will be tested manually prior to upgrade, following the steps below:
Member

The scenario LGTM - please update the PRR once you run it (fine to do this after feature freeze, but please do that before graduating the feature to beta in k/k).

Member Author

Then let's merge this PR first, since I may be away for several days. I'll update this section afterwards.

Member

SG


1. Install a Kubernetes v1.24 cluster with two worker nodes using an installation tool such as Kind.
2. Let's name these nodes node1 and node2, both labelled with the key `kubernetes.io/hostname`.
3. Add a taint to node1, such as `foo=bar:NoSchedule`.
4. Apply a deployment like:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
```

5. We'll see one pod pending.
6. Delete the deployment via `kubectl delete -f`.
7. Configure the api-server with feature-gate `NodeInclusionPolicyInPodTopologySpread` enabled.
8. Redeploy the deployment with `nodeTaintsPolicy: Honor` set.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        # API field names are lowerCamelCase.
        nodeTaintsPolicy: Honor
        labelSelector:
          matchLabels:
            foo: bar
```

9. All pods will be allocated successfully.
10. Delete the deployment.
11. Disable the feature gate and restart the api-server.
12. Apply the deployment a third time; we'll see one pod pending again.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No

@@ -661,7 +749,9 @@ Recall that end users cannot usually observe component logs or access metrics.
- Other field:
- [ ] Other (treat as last resort)
- Details: -->
N/A

- [x] Other (treat as last resort)
- Details: We can only observe the behaviors based on pod scheduling results.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

@@ -711,7 +801,9 @@ Pick one more of these and delete the rest.
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
N/A

Yes, we plan to improve observability via metrics (see [this issue](https://github.com/kubernetes/kubernetes/issues/110643)),
but the work is still in progress.

### Dependencies

@@ -748,7 +840,6 @@ For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
No

###### Will enabling / using this feature result in any new API calls?

@@ -831,7 +922,8 @@ details). For now, we leave it here.
-->

###### How does this feature react if the API server and/or etcd is unavailable?
N/A

This feature only takes effect during pod scheduling; if the API server or etcd is down, pods cannot be scheduled at all.

###### What are other known failure modes?

@@ -851,7 +943,10 @@ For each of them, fill in the following information by copying the below templat
Configuration errors are logged to stderr.

###### What steps should be taken if SLOs are not being met to determine the problem?
N/A

If we see obvious performance degradation or a rising error rate with this feature gate enabled,
we should disable it as soon as possible and restart the apiserver. If there are only a few workloads,
we can instead remove the policy from their `PodTopologySpread` constraints one by one as an emergency measure.

## Implementation History

@@ -868,13 +963,15 @@ Major milestones might include:

- 2021.01.12: KEP proposed for review, including motivation, proposal, risks,
test plan and graduation criteria.
- 2022.09.22: Graduate to Beta in v1.26.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->
N/A

None. It's a backward-compatible feature; users who don't want it don't need to configure anything.

## Alternatives

@@ -896,4 +993,5 @@ Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->
N/A

No
Original file line number Diff line number Diff line change
@@ -3,7 +3,6 @@ kep-number: 3094
authors:
- "@kerthcet"
owning-sig: sig-scheduling
participating-sigs:
status: implementable
creation-date: 2021-12-30
reviewers:
@@ -25,7 +24,7 @@ see-also:
#prr-approvers:

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively