2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-autoscaling/4951.yaml
@@ -1,3 +1,5 @@
kep-number: 4951
alpha:
  approver: "@soltysh"
beta:
  approver: "@soltysh"
171 changes: 154 additions & 17 deletions keps/sig-autoscaling/4951-configurable-hpa-tolerance/README.md
@@ -67,6 +67,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Upgrade](#upgrade)
- [Downgrade](#downgrade)
@@ -106,16 +107,16 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [x] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -352,12 +353,15 @@ https://storage.googleapis.com/k8s-triage/index.html
We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

Existing e2e tests ensure the autoscaling behavior uses the default tolerance when no
configurable tolerance is specified.

The new [e2e autoscaling tests] covering this feature are:

- [Test with large configurable tolerance](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/test/e2e/autoscaling/horizontal_pod_autoscaling_behavior.go#L509): [SIG autoscaling](https://testgrid.k8s.io/sig-autoscaling-hpa#gci-gce-autoscaling-hpa-cpu-alpha-beta-pull&include-filter-by-regex=HPAConfigurableTolerance.*large%20configurable%20tolerance), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=HPAConfigurableTolerance.*large%20configurable%20tolerance)

Before the graduation to beta, we will add an integration test verifying the autoscaling
behavior when smaller and larger than default tolerances are set on an HPA.

[e2e autoscaling tests]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/autoscaling

@@ -430,6 +434,12 @@ in back-to-back releases.
- Feature implemented behind a `HPAConfigurableTolerance` feature flag
- Initial e2e tests completed and enabled

#### Beta

- All tests described in the [`e2e tests` section](#e2e-tests) are implemented
and linked in this KEP.
- We have monitored for negative user feedback and addressed relevant concerns.

### Upgrade / Downgrade Strategy

#### Upgrade
@@ -543,7 +553,7 @@ You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->

[Unit tests have been added](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/apis/autoscaling/validation/validation_test.go#L1648) to verify that HPAs with and without the new fields are
properly validated, both when the feature gate is enabled or not.

### Rollout, Upgrade and Rollback Planning
@@ -564,13 +574,20 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

This feature does not introduce new failure modes: during a rollout or rollback, some
API servers will accept the new `tolerance` field while others reject it. The new
field may be ignored until the controller manager is fully updated.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

A high `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
metric can indicate a problem related to this feature.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
@@ -579,12 +596,105 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

The upgrade→downgrade→upgrade testing was done manually using a 1.33 cluster with the following steps:

1. Start the cluster with the HPA enabled:

```sh
kind create cluster --name configurable-tolerance --image kindest/node:v1.33.0 --config config.yaml
```
with the following `config.yaml` file content:
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "HPAConfigurableTolerance": true
nodes:
- role: control-plane
- role: worker
```

Install metrics-server:
```sh
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.2/components.yaml
kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
```

Create a deployment whose Pods consume 50% of their CPU request, and an associated HPA with a very large tolerance:
```sh
kubectl apply -f configurable-tolerance-test.yaml
```
with the following `configurable-tolerance-test.yaml` file content:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress-deployment
  labels:
    app: cpu-stressor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-stressor
  template:
    metadata:
      labels:
        app: cpu-stressor
    spec:
      containers:
      - name: cpu-stressor
        image: alpine:latest
        command: ["/bin/sh"]
        args: # Load: 1% (10 milliCPU)
        - "-c"
        - "apk add --no-cache stress-ng && stress-ng --cpu 1 --cpu-load 1 --cpu-method=crc16 --timeout 3600s"
        resources:
          requests:
            cpu: "20m"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-stress-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 10
  behavior:
    scaleUp:
      tolerance: 20. # 2000%
```

Check that, after 5 minutes, `kubectl describe hpa cpu-stress-hpa` displays `ScalingLimited: False` (i.e.
the HPA doesn't recommend scaling up, due to the large tolerance).

2. Simulate a downgrade by disabling the feature for the API server and control plane (update the
`config.yaml` file to set the gate to false). Follow the procedure described in step 1, and observe
that this time `kubectl describe hpa cpu-stress-hpa` displays `ScalingLimited: True`.

3. Simulate an upgrade by re-enabling the feature for the API server and control plane. Follow the
procedure described in step 1, and observe that the HPA description mentions `ScalingLimited: False`,
demonstrating that the feature is working again.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

No.

### Monitoring Requirements

<!--
@@ -625,9 +735,9 @@ values. Users can get both values using
and use them to verify that scaling events are triggered when their ratio is out
of tolerance.

The [controller-manager logs have been updated](https://github.com/kubernetes/kubernetes/blob/07142400ecd02126602ffaa6f91712cd3f1e170c/pkg/controller/podautoscaler/horizontal.go#L846)
to help users understand the behavior of the autoscaler. The data added to the
logs includes the tolerance used for each scaling decision.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

@@ -646,7 +756,9 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

Although the absolute value of the `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`
metric depends on the cluster's HPA configuration, it should be unaffected by this feature, and should
not vary by more than 5%.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

@@ -658,8 +770,7 @@ This KEP is not expected to have any impact on SLIs/SLOs as it doesn't introduce
a new HPA behavior, but merely allows users to easily change the value of a
parameter that's otherwise difficult to update.

The standard HPA metric `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` can
be used to verify the HPA controller health.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -698,6 +809,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the feature:
-->

No, this feature does not depend on any specific service.

### Scalability

<!--
@@ -817,6 +930,8 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

API server or etcd issues do not impact this feature.

###### What are other known failure modes?

<!--
@@ -832,8 +947,20 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

We do not expect any new failure mode. Setting `tolerance` below the 10% default can
cause HPAs to scale up and down as frequently as every 30s, and large values might stop
scaling altogether while the metric remains within the tolerance band, but in both cases
the feature works as intended: decrease the tolerance to make HPAs respond faster, or
increase it to make them respond more slowly.
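To illustrate this trade-off (a sketch using a simplified model of the tolerance rule,
not the controller's code): the tolerance directly sets the metric band in which the
HPA takes no action, so widening it damps reactions and narrowing it speeds them up.

```python
def no_scale_band(target, tolerance):
    """Metric range (in target units) where |1 - current/target| <= tolerance,
    i.e. where the HPA leaves the replica count unchanged."""
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    return round(low, 6), round(high, 6)

print(no_scale_band(50, 0.1))  # default tolerance -> (45.0, 55.0)
print(no_scale_band(50, 0.5))  # wide tolerance    -> (25.0, 75.0)
```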

###### What steps should be taken if SLOs are not being met to determine the problem?

If possible, increase the log level of the kube-controller-manager and check its logs:
1. Search for "Proposing desired replicas", verify that the tolerance is set as expected,
   and check (using `kubectl describe hpa`) whether the ratio between the _current_ and
   _desired_ metric values is within tolerance.
2. Look for warnings and errors which might point to where the problem lies.

## Implementation History

<!--
@@ -848,13 +975,18 @@ Major milestones might include:
-->

- 2025-01-21: KEP PR merged.
- 2025-03-24: [Implementation PR](https://github.com/kubernetes/kubernetes/pull/130797) merged.
- 2025-05-15: Kubernetes v1.33 released (includes this feature).
- 2025-05-16: This KEP updated for beta graduation.

## Drawbacks

<!--
Why should this KEP _not_ be implemented?
-->

No major drawbacks have been identified.

## Alternatives

<!--
@@ -863,10 +995,15 @@ not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

On non-managed Kubernetes instances, users can update the cluster-wide
`--horizontal-pod-autoscaler-tolerance` kube-controller-manager flag, but this setting
applies to every HPA in the cluster and cannot be tuned per workload.

## Infrastructure Needed (Optional)

<!--
Use this section if you need things from the project/SIG. Examples include a
new subproject, repos requested, or GitHub details. Listing these here allows a
SIG to get the process for these resources started right away.
-->

N/A.
6 changes: 3 additions & 3 deletions keps/sig-autoscaling/4951-configurable-hpa-tolerance/kep.yaml
@@ -17,17 +17,17 @@ see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.34"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.33"
beta: "v1.34"
stable: TBD

# The following PRR answers are required at alpha release
Expand Down