Commit 63d1ce8: reopen KEP 1860
1 parent: 35befff
3 files changed: +308 -11 lines
@@ -0,0 +1,6 @@
+# The KEP must have an approver from the
+# "prod-readiness-approvers" group
+# of http://git.k8s.io/enhancements/OWNERS_ALIASES
+kep-number: 1860
+alpha:
+  approver: "@wojtek-t"

keps/sig-network/1860-kube-proxy-IP-node-binding/README.md (+296 -5)
@@ -23,6 +23,13 @@
 - [Beta/GA](#betaga)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -32,13 +39,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
-- [x] (R) Graduation criteria is in place
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
-- [ ] Production readiness review approved
+- [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
 - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


 ## Summary
@@ -122,7 +133,6 @@ API changes to Service:
 Unit tests:
 - unit tests for the ipvs and iptables rules
 - unit tests for the validation
-- unit tests for a new util in pkg/proxy

 E2E tests:
 - The default behavior for `ipMode` does not break any existing e2e tests
@@ -149,3 +159,284 @@ On downgrade, the feature gate will simply be disabled, and as long as `kube-pro
 ### Version Skew Strategy

 Version skew from the control plane to `kube-proxy` should be trivial since `kube-proxy` will simply ignore the `ipMode` field.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: LoadBalancerIPMode
+  - Components depending on the feature gate: kube-proxy, kube-apiserver
+
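As an illustrative aside (a sketch, not part of the diff): the gate above can be toggled like any other Kubernetes feature gate. The `--feature-gates` flag and the `featureGates` field of `KubeProxyConfiguration` are standard; assume the usual caveat that both components must agree on the setting.

```yaml
# Enabling on the API server is done via the standard flag:
#   kube-apiserver --feature-gates=LoadBalancerIPMode=true
# kube-proxy can take the same gate through its configuration file:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  LoadBalancerIPMode: true
```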
+###### Does enabling the feature change any default behavior?
+
+No.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+It works. The forwarding rules for Services whose `ipMode` had been set to "Proxy" will be removed by kube-proxy.
+
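For illustration (a sketch, not text from the KEP), a Service status carrying the new field might look like the following; the IP address is a documentation example.

```yaml
# Hypothetical Service status as set by a cloud provider. With ipMode: Proxy,
# kube-proxy does not bind the load-balancer IP to the node, so traffic to
# that IP always traverses the external load balancer.
status:
  loadBalancer:
    ingress:
      - ip: 203.0.113.10   # example address
        ipMode: Proxy      # the other value, "VIP", preserves today's behavior
```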
+###### Are there any tests for feature enablement/disablement?
+
+Yes. There are unit tests and an integration test covering feature enablement/disablement.
+
+### Rollout, Upgrade and Rollback Planning
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+<!--
+Try to be as paranoid as possible - e.g., what if some components will restart
+mid-rollout?
+
+Be sure to consider highly-available clusters, where, for example,
+feature flags will be enabled on some API servers and not others during the
+rollout. Similarly, consider large clusters and how enablement/disablement
+will rollout across nodes.
+-->
+
+###### What specific metrics should inform a rollback?
+
+<!--
+What signals should users be paying attention to when the feature is young
+that might indicate a serious problem?
+-->
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+<!--
+Describe manual testing that was done and the outcomes.
+Longer term, we may want to require automated upgrade/rollback tests, but we
+are missing a bunch of machinery and tooling and can't do that now.
+-->
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+<!--
+Even if applying deprecation policies, they may still surprise some users.
+-->
+
+### Monitoring Requirements
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### How can someone using this feature know that it is working for their instance?
+
+<!--
+For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
+for each individual pod.
+Pick one or more of these and delete the rest.
+Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
+and operation of this feature.
+Recall that end users cannot usually observe component logs or access metrics.
+-->
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+<!--
+This is your opportunity to define what "normal" quality of service looks like
+for a feature.
+
+It's impossible to provide comprehensive guidance, but at the very
+high level (needs more precise definitions) those may be things like:
+- per-day percentage of API calls finishing with 5XX errors <= 1%
+- 99% percentile over day of absolute value from (job creation time minus expected
+  job creation time) for cron job <= 10%
+- 99.9% of /health requests per day finish with 200 code
+
+These goals will help you determine what you need to measure (SLIs) in the next
+question.
+-->
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+<!--
+Pick one or more of these and delete the rest.
+-->
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+<!--
+Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+implementation difficulties, etc.).
+-->
+
+### Dependencies
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### Does this feature depend on any specific services running in the cluster?
+
+<!--
+Think about both cluster-level services (e.g. metrics-server) as well
+as node-level agents (e.g. specific version of CRI). Focus on external or
+optional services that are needed. For example, if this feature depends on
+a cloud provider API, or upon an external software-defined storage or network
+control plane.
+
+For each of these, fill in the following, thinking about running existing user workloads
+and creating new ones, as well as about cluster-level services (e.g. DNS):
+- [Dependency name]
+  - Usage description:
+    - Impact of its outage on the feature:
+    - Impact of its degraded performance or high-error rates on the feature:
+-->
+
+### Scalability
+
+<!--
+For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them.
+
+For beta, this section is required: reviewers must answer these questions.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### Will enabling / using this feature result in any new API calls?
+
+<!--
+Describe them, providing:
+- API call type (e.g. PATCH pods)
+- estimated throughput
+- originating component(s) (e.g. Kubelet, Feature-X-controller)
+Focusing mostly on:
+- components listing and/or watching resources they didn't before
+- API calls that may be triggered by changes of some Kubernetes resources
+  (e.g. update of object X triggers new updates of object Y)
+- periodic API calls to reconcile state (e.g. periodic fetching state,
+  heartbeats, leader election, etc.)
+-->
+
+###### Will enabling / using this feature result in introducing new API types?
+
+<!--
+Describe them, providing:
+- API type
+- Supported number of objects per cluster
+- Supported number of objects per namespace (for namespace-scoped objects)
+-->
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+<!--
+Describe them, providing:
+- Which API(s):
+- Estimated increase:
+-->
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+<!--
+Describe them, providing:
+- API type(s):
+- Estimated increase in size: (e.g., new annotation of size 32B)
+- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+-->
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+<!--
+Look at the [existing SLIs/SLOs].
+
+Think about adding additional work or introducing new steps in between
+(e.g. need to do X to start a container), etc. Please describe the details.
+
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+-->
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+<!--
+Things to keep in mind include: additional in-memory state, additional
+non-trivial computations, excessive access to disks (including increased log
+volume), significant amount of data sent and/or received over network, etc.
+Think through this both in small and large cases, again with respect to the
+[supported limits].
+
+[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
+-->
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+<!--
+For each of them, fill in the following information by copying the below template:
+- [Failure mode brief description]
+  - Detection: How can it be detected via metrics? Stated another way:
+    how can an operator troubleshoot without logging into a master or worker node?
+  - Mitigations: What can be done to stop the bleeding, especially for already
+    running user workloads?
+  - Diagnostics: What are the useful log messages and their required logging
+    levels that could help debug the issue?
+    Not required until feature graduated to beta.
+  - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?

keps/sig-network/1860-kube-proxy-IP-node-binding/kep.yaml

+6-6
@@ -14,18 +14,18 @@ approvers:
 - "@thockin"
 - "@andrewsykim"

-# latest-milestone: "v1.21"
+stage: "alpha"
+
+latest-milestone: "v1.29"

 milestone:
-  alpha: "v1.21"
-  beta: "v1.22"
+  alpha: "v1.29"
+  beta: "v1.30"
+  stable: "v1.31"

 feature-gates:
   - name: LoadBalancerIPMode
     components:
       - kube-apiserver
       - kube-proxy
 disable-supported: true
-
-
-latest-milestone: "0.0"
-stage: "alpha"
