diff --git a/keps/4136-admission-fair-sharing/README.md b/keps/4136-admission-fair-sharing/README.md
new file mode 100644
index 00000000000..6435d0f87f6
--- /dev/null
+++ b/keps/4136-admission-fair-sharing/README.md
@@ -0,0 +1,255 @@
+# KEP-4136: Admission Fair Sharing
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit Tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+  - [Graduation Criteria](#graduation-criteria)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+
+## Summary
+
+This KEP describes the mechanism for fair admission of workloads coming from a group of
+different sources (like Cluster and Local Queues) based on each source's shared resource usage.
+Workloads from sources that use less are admitted before workloads coming from sources that use more.
+
+## Motivation
+
+Currently Kueue has a Fair Sharing mechanism that enforces fair sharing of unused resources
+via preemption. If one Cluster Queue is using much more of the resources than another
+that is in need, some workloads from the first one may be preempted to allow a more “fair” distribution.
+
+This model makes several assumptions:
+
+* Workloads can be preempted.
+* Users get a proper quota to do their regular business.
+* Users come from multiple teams/organizations and they would rather have a strict but
+fair policy than show understanding to those consuming all shared resources.
+
+However, these assumptions are not universal. Sometimes:
+
+* Workloads should not be preempted.
+* Users operate on a shared but bigger quota.
+* Fairness should not win over getting the workloads eventually completed.
+
+In those cases the existing model doesn’t work and a different one needs to be employed.
+
+### Goals
+
+* Establish a method for how shared resource usage is calculated and recorded and how users can fine-tune the mechanism.
+* Allow specifying a fair admission scope at either individual Cluster Queue or Cohort scope.
+* Allow specifying the relative importance of LocalQueues targeting the same ClusterQueue.
+* Amend the admission mechanism to work on admission scopes instead of only on ClusterQueues.
+* Select the appropriate admission candidates for each of the admission scopes and admit them according to the selected queueing policy.
+* Make the new mechanism complementary to the existing preemption-based fair sharing.
+
+### Non-Goals
+
+* Store time series data inside K8S.
+* Provide precise shared resource usage accounting or billing.
+
+## Proposal
+
+* Modify CQ’s FairSharing struct with
+
+```go
+type FairSharing struct {
+	// Weight denotes how important the given queue is when competing against other queues
+	// for unused shared resources. The exact impact of the weight in fair share calculations
+	// depends on the fair share algorithm used. Default = 1.
+	Weight *resource.Quantity `json:"weight,omitempty"`
+}
+```
+
+* Expand LQ spec with the same FairSharing struct (Cohort will be expanded with FairSharing
+as a part of the Preemptive FS implementation in hierarchical structure).
+
+* Modify FairSharingStatus struct with
+
+```go
+type FairSharingStatus struct {
+	// WeightedShare represents the usage above nominal quota, with the weight applied.
+	// The bigger the value is, the more shared resources have been allocated and the less
+	// entitled the queue is to more shared resources.
+	// The exact details and the interpretation of the value depend on
+	// the fair sharing algorithm used.
+	WeightedShare int64 `json:"weightedShare"`
+
+	// ConsumedResources represents the aggregated usage of resources over time,
+	// with a decaying function applied.
+	// The value is populated if usage consumption functionality is enabled in Kueue config.
+	ConsumedResources corev1.ResourceList `json:"consumedResources,omitempty"`
+
+	// LastUpdate is the time when share and consumed resources were updated.
+	LastUpdate metav1.Time `json:"lastUpdate,omitempty"`
+}
+```
+
+* Add FairSharingStatus to LocalQueue (and Cohort).
+
+* Create a new struct AdmissionScope and make it an optional field for CQ and Cohort Spec. If
+not provided, the CQ or Cohort is not considered an AdmissionScope and is not subject to the new
+admission logic. If there are two AdmissionScopes on the path from CQ/Cohort to the top of
+the hierarchy tree, the higher one is used.
+
+```go
+// AdmissionMode selects the admission-time fair sharing behavior.
+type AdmissionMode string
+
+const (
+	// FairSharing based on usage, with QueuingStrategy as defined in CQ.
+	UsageBasedAdmissionFairSharing AdmissionMode = "UsageBasedFairSharing"
+
+	// No admission-time fair sharing is applied.
+	NoAdmissionFairSharing AdmissionMode = "NoAdmissionFairSharing"
+)
+
+type AdmissionScope struct {
+	AdmissionMode AdmissionMode
+}
+```
+
+* When selecting candidates for admission, Kueue groups all workloads from the LQs up to the CQ or the
+topmost Cohort that is marked with AdmissionScope. It then sorts them using the following criteria:
+
+- Usage vector built from ConsumedResources from TopCohort to LQ
+- Priority
+- Timestamp
+
+The usage vector will be the sum of the `ConsumedResources` weighted according to `resourceWeights`
+(mentioned later in the KEP).
+
+Let’s look at a couple of scenarios:
+
+1. AdmissionScope at CQ, CQ queueing policy is FIFO, 3 LQs pointing to the CQ. Kueue considers
+all CQ resources and potentially borrowed resources as “shared” resources and fair sharing
+is applied to all workloads.
+Kueue sorts the workloads by their LQ usage (if mode is `UsageBasedFairSharing`), priority and
+timestamp and tries to admit the first one from the list.
+Other workloads are not attempted until the first one is admitted.
+
+2. AdmissionScope at CQ, CQ queueing policy is BestEffort, 3 LQs pointing to the CQ.
+Kueue considers all CQ resources and potentially borrowed resources as “shared” resources and
+fair sharing is applied to all workloads.
+
+Kueue sorts the workloads by their LQ usage (if mode is `UsageBasedFairSharing`), priority and
+timestamp and tries to admit the first one from the list. If that fails and the second, third,
+or a later workload can be admitted, then that workload is admitted, with the caveat that it might get preempted.
+
+3. AdmissionScope at Cohort level: Kueue operates in a mixed mode. Inside a CQ, workloads are
+selected according to the CQ's AdmissionMode (if specified). If a workload fits entirely into the
+nominal quota, it is admitted immediately; if not, it goes into cohort-level fair sharing.
+For the Cohort, Kueue selects all the “sticking out” workloads and sorts them by their CQ usage, priority
+and timestamp. Kueue attempts to admit the first workload from the list of sticking-out workloads
+(workload + current_usage > resources), just as if it were one big strict FIFO queue.
+For a multi-level hierarchy under one AdmissionScope, the underlying Cohorts would be treated as FIFO CQs.
+
+
+* Additionally, there will be a resource usage calculation loop. The frequency of the
+calculation will be controlled globally in Kueue’s config file. Accounting would be done
+using something like an exponentially weighted moving average:
+
+usage_sum = (1-A) * previous_usage_sum + A * current_usage.
+
+The value will be stored in FairSharingStatus for all LQs, CQs, and Cohorts. The value will not be zeroed
+after a Kueue restart or after a brief period of downtime. However, if the period is longer,
+the value should be automatically zeroed.
+
+
+The user will be able to configure the decaying factor A in Kueue’s config file by specifying
+the half-life decay time: the time after which the current shared usage decays to half of its original value.
+
+A = 1 - 0.5 ^ (sampling/half_life_decay)
+
+* Configuration will sit in the FairSharing struct in Kueue config. The following fields will be added:
+
+  - usageHalfLifeDecayTime - half-life decay time of usage, as described above.
+  - usageSamplingInterval - how often usage is calculated.
+  - resourceWeights - how important the consumption of individual resources is when comparing usage.
+  - resetInactivityPeriod - if Kueue has not updated the value for this period, the value should be zeroed.
+
+If the user doesn't want any preemptions with fair sharing, preemptionStrategies should be left empty.
+
+* If preemptionStrategies is non-empty, Kueue attempts to combine the two fair sharing mechanisms at the same time.
+For each of the Admission Scopes Kueue selects one workload to be attempted. Then these pre-selected
+workloads are sorted based on their preemption-based fair share value. If some of them don't fit,
+fair sharing preemption may be executed. So admission-based fair sharing only reshuffles workloads
+within an AdmissionScope and then other mechanisms are applied as usual.
+
+### User Stories (Optional)
+#### Story 1
+
+I have multiple users using the same ClusterQueue. Each has its own namespace and LocalQueue
+through which they submit workloads. I want to fairly admit their workloads so that one active
+user doesn’t block the cluster too much.
+
+#### Story 2
+
+I have multiple teams that may be sending workloads of various sizes. I want to give each team
+some guaranteed capacity and at the same time, allow them to fairly share some bigger pool of resources.
+
+### Risks and Mitigations
+
+* Having two fair sharing mechanisms may cause confusion between preemption-based fair sharing and admission-time fair sharing.
+
+* Increased complexity of the project.
+
+## Design Details
+
+Covered in Proposal.
+
+### Test Plan
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+#### Prerequisite testing updates
+
+#### Unit Tests
+The code will be thoroughly covered with unit tests.
+
+In particular:
+* CQ-level fair sharing between LQs for both strict FIFO and best effort.
+* Cohort with owned resources, and CQ with guaranteed quota.
+
+
+#### Integration tests
+
+Integration tests will mainly focus on larger-scope scheduling (involving multiple Cohorts/CQs) and
+interactions with preemptive fair sharing outside the admission scope.
+
+### Graduation Criteria
+
+The implementation will be split into 2 subfeatures:
+
+* CQ + LQ level support
+* Multi-level Cohort+CQ+LQ support
+
+The second depends on the first to some extent. The first, however, may reach
+Beta/GA without starting the second.
+
+The graduation criteria are quite standard:
+
+* Beta - positive feedback from Alpha, API seems reasonable.
+* GA - positive feedback, no bugs, no API changes needed.
+
+We hope to have CQ+LQ in alpha for the next Kueue release (0.12).
+
+## Drawbacks
+* Adds additional complexity to the system.
+* Creates yet another fair sharing mechanism.
+
+## Alternatives
+
+* Not having the feature.
+* Modifying/replacing the existing preemptive fair sharing algorithm.
+
diff --git a/keps/4136-admission-fair-sharing/kep.yaml b/keps/4136-admission-fair-sharing/kep.yaml
new file mode 100644
index 00000000000..a06a2e83120
--- /dev/null
+++ b/keps/4136-admission-fair-sharing/kep.yaml
@@ -0,0 +1,36 @@
+title: Admission fair sharing
+kep-number: 4136
+authors:
+  - "@mwielgus"
+status: draft
+creation-date: 2025-02-03
+reviewers:
+  - "@mimowo"
+  - "@gabesaba"
+  - "@pbundyra"
+approvers:
+  - "@mimowo"
+  - "@tenzen-y"
+see-also:
+  - "KEP-1714"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v0.12"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v0.12"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+disable-supported: true
+
+# The following PRR answers are required at beta release
+# metrics:
+#   - my_feature_metric
+