diff --git a/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost/README.md b/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost/README.md new file mode 100644 index 000000000000..8bc236253a8b --- /dev/null +++ b/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost/README.md @@ -0,0 +1,312 @@ +# AEP-7862: CPU Startup Boost + + +- [AEP-7862: CPU Startup Boost](#aep-7862-cpu-startup-boost) + - [Summary](#summary) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Proposal](#proposal) + - [Design Details](#design-details) + - [Workflow](#workflow) + - [API Changes](#api-changes) + - [Priority of `StartupBoost`](#priority-of-startupboost) + - [Validation](#validation) + - [Static Validation](#static-validation) + - [Dynamic Validation](#dynamic-validation) + - [Mitigating Failed In-Place Downsizes](#mitigating-failed-in-place-downsizes) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster) + - [Kubernetes Version Compatibility](#kubernetes-version-compatibility) + - [Test Plan](#test-plan) + - [Examples](#examples) + - [CPU Boost Only](#cpu-boost-only) + - [CPU Boost and Vanilla VPA](#cpu-boost-and-vanilla-vpa) + - [Implementation History](#implementation-history) + + +## Summary + +Long application start time is a known problem for more traditional workloads +running in containerized applications, especially Java workloads. This delay can +negatively impact the user experience and overall application performance. One +potential solution is to provide additional CPU resources to pods during their +startup phase, but this can lead to waste if the extra CPU resources are not +set back to their original values after the pods have started up. 
+ +This proposal allows VPA to boost the CPU request and limit of containers during +pod startup and to scale the CPU resources back down when the pod is +`Ready` or after a certain time has elapsed, leveraging the +[in-place pod resize Kubernetes feature](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources). + +> [!NOTE] +> This feature depends on the new `InPlaceOrRecreate` VPA mode: +> [AEP-4016: Support for in place updates in VPA](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md) + +### Goals + +* Allow VPA to boost the CPU request and limit of a pod's containers during the +pod (re-)creation time. +* Allow VPA to scale pods down [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources) +to the existing VPA recommendation for that container, if any, or to the CPU +resources configured in the pod spec, as soon as their [`Ready`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) +condition is true and `StartupBoost.CPU.Duration` has elapsed. + +### Non-Goals + +* Allow VPA to boost CPU resources of pods outside of the pod (re-)creation +time. +* Allow VPA to boost memory resources. + * This is out of scope for now because the in-place pod resize feature + [does not support memory limit decrease yet.](https://github.com/kubernetes/enhancements/tree/758ea034908515a934af09d03a927b24186af04c/keps/sig-node/1287-in-place-update-pod-resources#memory-limit-decreases) + +## Proposal + +* To extend [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191) +with a new `StartupBoost` field to allow users to configure the CPU startup +boost. 
+ +* To extend [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236) +with a new `StartupBoostOnly` mode to allow users to enable only the startup +boost feature, without vanilla VPA. + +* To allow CPU startup boost if a `StartupBoost` config is specified in `Auto` +[`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236) +container policies. + +## Design Details + +### Workflow + +1. The user first configures the CPU startup boost on their VPA object. + +1. When a pod targeted by that VPA is created, the kube-apiserver invokes the +VPA Admission Controller. + +1. The VPA Admission Controller modifies the CPU requests and limits of the +pod's containers to align with its `StartupBoost` policy, if specified, during the pod +creation. + +1. The VPA Updater monitors pods targeted by the VPA object and when the pod +condition is `Ready` and `StartupBoost.CPU.Duration` has elapsed, it scales +down the CPU resources to the appropriate non-boosted value: +`existing VPA recommendation for that container` (if any) OR the +`CPU resources configured in the pod spec`. + * The scale down is applied [in-place](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources). + +### API Changes + +The new `StartupBoost` parameter will be added to the [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191) +and contain the following fields: + * `StartupBoost.CPU.Factor`: the factor by which to multiply the initial + resource request and limit of the containers targeted by the VPA object. 
+ * `StartupBoost.CPU.Value`: the target value of the CPU request or limit + during the startup boost phase. + * [Optional] `StartupBoost.CPU.Duration`: if specified, it indicates for how + long to keep the pod boosted **after** it goes to `Ready`. + +> [!IMPORTANT] +> The boosted CPU value will be capped by +> [`--container-recommendation-max-allowed-cpu`](https://github.com/kubernetes/autoscaler/blob/4d294562e505431d518a81e8833accc0ec99c9b8/vertical-pod-autoscaler/pkg/recommender/main.go#L122) +> flag value, if set. + +> [!IMPORTANT] +> Only one of `Factor` or `Value` may be specified per container policy. + + +> [!NOTE] +> To ensure that containers are unboosted only after their applications are +> started and ready, it is recommended to configure a +> [Readiness or a Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) +> for the containers that will be CPU boosted. Check the [Test Plan](#test-plan) +> section for more details on this feature's behavior for different combinations +> of probes + `StartupBoost.CPU.Duration`. + +We will also add a new mode to the [`ContainerScalingMode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L231-L236): + * **NEW**: `StartupBoostOnly`: new mode that will allow users to enable only + the startup boost feature for a container, without vanilla VPA. + * **MODIFIED**: `Auto`: we will modify the existing `Auto` mode to enable both + vanilla VPA and CPU Startup Boost (when `StartupBoost` parameter is + specified). + +#### Priority of `StartupBoost` + +The new `StartupBoost` field will take precedence over the rest of the container +resource policy configurations. 
It functions independently from all other fields +in [`ContainerResourcePolicy`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L191), +**except for**: + * [`ContainerName`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L192-L195) + * [`Mode`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L196-L198) + * [`ControlledValues`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L214-L217) + +This means that a container's CPU request/limit can be boosted during startup +beyond [`MaxAllowed`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L203-L206), +for example, or it can be boosted even if CPU is explicitly +excluded from [`ControlledResources`](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler-1.3.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L208-L212). + +### Validation + +#### Static Validation + +* We will check that the `startupBoost` configuration is valid when VPA objects +are created/updated: + * The VPA autoscaling mode must be `InPlaceOrRecreate` (since it does not + make sense to use this feature with disruptive modes of VPA). + * The boost factor must be >= 1 (via CRD validation rules). + * Only one of `StartupBoost.CPU.Factor` or `StartupBoost.CPU.Value` is + specified. + * The [feature enablement](#feature-enablement-and-rollback) flags must be on. + + +#### Dynamic Validation + +* `StartupBoost.CPU.Value` must be greater than the CPU request or limit of the + container during the boost phase; otherwise, we risk downscaling the container. 
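
The static and dynamic validation rules above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the proposed implementation: the `CPUBoost` type, the `ValidateStatic` and `BoostedMilli` functions, and the millicore representation are hypothetical names chosen for this example, and the cap argument merely stands in for the `--container-recommendation-max-allowed-cpu` flag value.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// CPUBoost is a hypothetical stand-in for the proposed StartupBoost.CPU config.
// Pointers are used so that "unset" can be told apart from a zero value.
type CPUBoost struct {
	Factor     *float64 // multiplier applied to the container's CPU request/limit
	ValueMilli *int64   // absolute boosted CPU, in millicores
}

// ValidateStatic mirrors the static rules: exactly one of Factor or Value
// is set, and Factor must be >= 1.
func ValidateStatic(b CPUBoost) error {
	if (b.Factor == nil) == (b.ValueMilli == nil) {
		return errors.New("exactly one of factor or value must be set")
	}
	if b.Factor != nil && *b.Factor < 1 {
		return errors.New("factor must be >= 1")
	}
	return nil
}

// BoostedMilli returns the boosted CPU for a container currently at
// currentMilli. capMilli <= 0 means "no cap configured" (an assumption of
// this sketch; how the cap interacts with other rules is not specified here).
func BoostedMilli(b CPUBoost, currentMilli, capMilli int64) (int64, error) {
	if err := ValidateStatic(b); err != nil {
		return 0, err
	}
	var boosted int64
	if b.Factor != nil {
		boosted = int64(math.Round(float64(currentMilli) * *b.Factor))
	} else {
		// Dynamic rule: an absolute value must exceed the current CPU,
		// otherwise the "boost" would downscale the container.
		if *b.ValueMilli <= currentMilli {
			return 0, fmt.Errorf("value %dm must be greater than current %dm",
				*b.ValueMilli, currentMilli)
		}
		boosted = *b.ValueMilli
	}
	if capMilli > 0 && boosted > capMilli {
		boosted = capMilli
	}
	return boosted, nil
}

func main() {
	factor := 2.0
	boosted, err := BoostedMilli(CPUBoost{Factor: &factor}, 500, 0)
	fmt.Println(boosted, err)
}
```

Keeping the static check separate from the per-container computation reflects the split described above: the first can run when the VPA object is created or updated, while the second needs the container's actual CPU request or limit.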
+ +### Mitigating Failed In-Place Downsizes + +The VPA Updater **will not** evict a pod if it attempted to scale the pod down +in place (to unboost its CPU resources) and the update failed (see the +[scenarios](https://github.com/kubernetes/autoscaler/blob/0a34bf5d3a71b486bdaa440f1af7f8d50dc8e391/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md?plain=1#L164-L169) where the VPA +updater will consider that the update failed). This is to avoid an eviction +loop: + +1. A pod is created and has its CPU resources boosted +1. The pod meets the conditions to be unboosted. VPA Updater tries to downscale +the pod in-place and it fails. +1. VPA Updater evicts the pod. Logic flow goes back to (1). + +### Feature Enablement and Rollback + +#### How can this feature be enabled / disabled in a live cluster? + +* Feature gate names: `CPUStartupBoost` and `InPlaceOrRecreate` (from +[AEP-4016](https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md#feature-enablement-and-rollback)) +* Components depending on the feature gates: + * admission-controller + * updater + +Enabling feature gates `CPUStartupBoost` AND `InPlaceOrRecreate` will cause +the following to happen: + * admission-controller to **accept** new VPA objects being created with +`StartupBoostOnly` configured. + * admission-controller to **boost** CPU resources. + * updater to **unboost** the CPU resources. + +Disabling feature gates `CPUStartupBoost` OR `InPlaceOrRecreate` will cause +the following to happen: + * admission-controller to **reject** new VPA objects being created with + `StartupBoostOnly` configured. + * A descriptive error message should be returned to the user letting them + know that they are using a feature-gated feature. 
+ * admission-controller **to not** boost CPU resources, should it encounter a + VPA configured with a `StartupBoost` config and `StartupBoostOnly` or `Auto` + `ContainerScalingMode`. + * updater **to not** unboost CPU resources when pods meet the scale down + requirements, should it encounter a VPA configured with a `StartupBoost` + config and `StartupBoostOnly` or `Auto` `ContainerScalingMode`. + +### Kubernetes Version Compatibility + +Similarly to [AEP-4016](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support#kubernetes-version-compatibility), +`StartupBoost` configuration and `StartupBoostOnly` mode are built assuming that +VPA will be running on Kubernetes 1.33+ with the beta version of +[KEP-1287: In-Place Update of Pod Resources](https://github.com/kubernetes/enhancements/issues/1287) +enabled. If this is not the case, VPA's attempt to unboost pods may fail and the +pods may remain boosted for their whole lifecycle. + +## Test Plan + +Other than comprehensive unit tests, we will also add the following scenarios to +our e2e tests: + +* CPU Startup Boost recommendation is applied to pod controlled by VPA until it +becomes `Ready` and `StartupBoost.CPU.Duration` has elapsed. Then, the pod is +scaled back down in-place. We'll also test the following sub-cases: + * Boost is applied to all containers of a pod. + * Boost is applied only to a subset of containers in a pod. + * Combinations of probes + `StartupBoost.CPU.Duration`: + * No probes and no `StartupBoost.CPU.Duration` specified: unboost will + likely happen immediately. + * No probes and a 60s `StartupBoost.CPU.Duration`: unboost will likely + happen after 60s. + * A readiness/startup probe and no `StartupBoost.CPU.Duration` specified: + unboost will likely happen as soon as the pod becomes `Ready`. + * A readiness/startup probe and a 60s `StartupBoost.CPU.Duration` + specified: unboost will likely happen 60s **after** the pod becomes `Ready`. 
+ +* Pod is not evicted if the in-place update fails when scaling the pod back +down. + +## Examples + +Here are some examples of the VPA CR incorporating CPU boosting for different +scenarios. + +### CPU Boost Only + +All containers under the `example` deployment will receive "regular" VPA updates, +**except for** `boosted-container-name`. `boosted-container-name` will only be +CPU boosted/unboosted, because it has a `StartupBoostOnly` container policy. + +```yaml +apiVersion: "autoscaling.k8s.io/v1" +kind: VerticalPodAutoscaler +metadata: + name: example-vpa +spec: + targetRef: + apiVersion: "apps/v1" + kind: Deployment + name: example + updatePolicy: + # VPA Update mode must be InPlaceOrRecreate + updateMode: "InPlaceOrRecreate" + resourcePolicy: + containerPolicies: + - containerName: "boosted-container-name" + mode: "StartupBoostOnly" + startupBoost: + cpu: + factor: 2.0 +``` + +### CPU Boost and Vanilla VPA + +All containers under the `example` deployment will receive "regular" VPA updates, +**including** `boosted-container-name`. Additionally, `boosted-container-name` +will be CPU boosted/unboosted, because it has a `StartupBoost` config in its +container policy and `Auto` container policy mode. + +```yaml +apiVersion: "autoscaling.k8s.io/v1" +kind: VerticalPodAutoscaler +metadata: + name: example-vpa +spec: + targetRef: + apiVersion: "apps/v1" + kind: Deployment + name: example + updatePolicy: + # VPA Update mode must be InPlaceOrRecreate + updateMode: "InPlaceOrRecreate" + resourcePolicy: + containerPolicies: + - containerName: "boosted-container-name" + mode: "Auto" # Vanilla VPA mode + Startup Boost + minAllowed: + cpu: "250m" + memory: "100Mi" + maxAllowed: + cpu: "500m" + memory: "600Mi" + # The CPU boosted resources can go beyond maxAllowed. + startupBoost: + cpu: + value: 4 +``` + +## Implementation History + +* 2025-03-20: Initial version. +